Bioinformatics Unit 1: Exercise 2


Unit 1: Databases & Queries
Exercise 2, Section A

Exercise 2: Molecular Databases and Search Tools
Section A


Note: Bring a disk to class to save your log and any other useful files.


Objectives for Part A:

1. Gain an understanding of the basics of bioinformatics.

  • Explore some key sites containing databases and application software.
  • Become familiar with some of the uses of bioinformatics.
  • Appreciate the dynamic nature of this rapidly growing field.

2. Become familiar with some uses of molecular databases.

  • Be able to query databases and search for molecular sequences.
  • Be able to find similar sequences to a known sequence using BLAST.
  • Know how to evaluate the quality of information obtained and analyze the results.
  • Develop strategies for refining searches and improving results in data retrieved.

3. Be able to evaluate and understand results.

  • Use a working knowledge of key molecular biology principles.
  • Assess results in terms of significance to the question being asked and the methods being used.

Introduction:

In this exercise, you will begin to explore several types of databases and interfaces used in bioinformatics and computational biology analyses. You will get a taste of searching for specific nucleic acid and protein sequences by using keywords. You will be able to search for sequences related to a query sequence using BLAST. You will learn how to analyze your search results and how to refine your searches. After playing with these tools for a while, you may wish to try your hand at asking some basic questions and looking for ways to answer them. For example, your questions may focus on conserved sequence patterns. Or perhaps you may want to explore the phylogeny of a specific gene or gene family. Curiosity and imagination are great tools to use along with what you'll learn here.

In addition to developing "search smarts" over the next week, you will have the opportunity to become familiar with NCBI and Biology Workbench. These are online interfaces which make it easy to access the molecular databases and to access a variety of software tools. Tutorials at these sites and elsewhere provide excellent support as you work with these databases and applications. The interfaces can greatly reduce navigation time to obtain what you need.

You will also be exploring several databases which we'll be using later on in the course. They are introduced here as part of a general survey of useful databases. Included are a database of ecological interest, genome databases, and sites linking sequence information and metabolic pathways. After this exercise, there is another short project. To help you gain insight into researching a problem using these tools, Dr. Sakanari will be presenting a research project and asking you to follow up with some inquiries of your own.

Time management: First of all, plan on giving yourself some short breaks. Breaks between small chunks will actually help the learning process and cut the frustration down to a reasonable level. The more you play with different sections, the more they begin to fit together. The more they fit together, the more understanding you will gain. Don't expect to "get it" after one round of search and analysis.

Reminder: Using a log can help you backtrack to things you'd like to check out later. Saving sequences, along with pertinent accession numbers and identifiers, will become extremely useful, especially when trying to use multiple analytical applications on the same data set.

There are summary questions at the end of this section. Read them through before you start browsing. You can answer them as you go, or answer them after browsing the following sites. Points = 6. [The other 4 points will come in the later parts of this exercise.] Due 9/4 midnight or 9/9.


Pre-Exercise:

Tutorial lessons: Browse the following two tutorial sites to see what they have to offer.

1. Explore the tutorials at NCBI. Besides the basic introductory lessons, browse the "Coffee Break" tutorials to see what is there.

[www.ncbi.nlm.nih.gov/Education]

Explore the following sub-sites:

a. To become familiar with the size and layout of NCBI, click on the site map. For future use, this site map will allow you to navigate to different sections quickly, or to find something that you saw at one point but can't seem to find again.

b. From the main Education page, you can get an overview of bioinformatics and some key terms.

c. Next, try the tutorial Nucleotides, then try BLAST. Even if you have used BLAST before, this is very helpful. There are also new features which can help in doing BLAST searches more effectively.

d. To better understand BLAST statistics and the results of a search, browse the statistics tutorial, available from the left-hand menu bar. [In the future, you can also access both this tutorial and the main BLAST tutorial from the BLAST page.]

e. An interesting way to learn about different aspects of bioinformatics and some specifics on applications is to go to Coffee Break. This is a series of short essays on a variety of topics with specific tutorials embedded. For example, the last posting, 5 June 02, discusses small RNAs involved in the regulation of developmental timing. One of the embedded tutorials is about NCBI's electronic access to reference books, which is becoming quite useful. Click on Archive on Coffee Break's main page to access the list of other topics. Try out a couple now. Come back for more later.


Exercise:

Part 1: Basic Search Strategies

A. Finding a sequence or two. 

1. Go to NCBI:

[www.ncbi.nlm.nih.gov/]

The default on the search bar is GenBank, allowing nucleotide searches. You can select Proteins, PubMed, Genomes, etc. from the pulldown menu to do other types of searches. To get you started, enter one of the keywords or phrases from the following list, or select something else you have an interest in:

  • green fluorescent protein
  • tubulin
  • poliovirus
  • MHC
  • ribulosebisphosphate carboxylase

You can refine your search if you get too many hits. For example, you can narrow down to a given species or to a single chain of a multi-subunit protein. You can add to your word string or go to "Limits" under your search window for advanced search options. Use the "All Fields" pulldown menu to specify a field. Boolean operators AND, OR, NOT must be in upper case. Help on using limits is available.
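As a rough illustration of how such a refined search string is put together, here is a minimal Python sketch. The helper names `field_term` and `combine` are invented for this example and are not part of any NCBI tool. The `[Organism]` field tag is one of the standard Entrez fields, and the organism chosen is only an example.

```python
def field_term(text, field=None):
    """Quote a multi-word phrase and optionally attach an Entrez field tag."""
    term = f'"{text}"' if " " in text else text
    return f"{term}[{field}]" if field else term


def combine(operator, *terms):
    """Join terms with an upper-case Boolean operator (AND, OR, NOT)."""
    op = operator.upper()  # Entrez expects Boolean operators in upper case
    if op not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {operator}")
    return "(" + f" {op} ".join(terms) + ")"


# Narrow a keyword search to a single species (illustrative choice).
query = combine(
    "AND",
    field_term("green fluorescent protein"),
    field_term("Aequorea victoria", "Organism"),
)
print(query)
# prints: ("green fluorescent protein" AND "Aequorea victoria"[Organism])
```

The resulting string can be pasted into the search bar as is. Building it in pieces simply makes it easier to record in your log exactly which limits you applied.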

2. To retrieve a single entry, click on its accession number highlighted in blue. To retrieve several sequences at once, select them by checking the boxes to the left. Then choose which type of display you want, such as GenBank, then click "Display". Review the type of information available. Try some of the other display formats & links. Use your log to save some examples of different kinds of reports, along with a label and explanatory comments as to source, Web page type [NCBI search], and anything else you deem useful for future use.

3. Try displaying your selected sequences in FASTA format* by selecting it in the display menu. For saving FASTA reports to your log, copy the entire report, beginning at ">gi|xxx ...." and ending at the end of the sequence. Paste this in your log as a backup and for future use. [What, you didn't open a log yet?? You can do it now.] After pasting, click back to your Web page. Congratulations! You now have a sequence or sequences, which you can use in the next section.

Note: You can also save search results and FASTA reports to the on-line clipboard. Beware, however, that there is a time limit for material remaining on the clipboard.

* FASTA format: This is a concise, no frills, means of storing nucleotide and protein sequences. There are several other format types, but FASTA can be read by nearly all application tools you might want to use. Although there are tools available for converting files from one format type to another, it's nice to start with one that is widely accepted. [Note: FASTA itself is an alignment tool for sequence comparisons. We will not be specifically using it in this course, but if you are interested in knowing more, you are welcome to dig a bit deeper.]

If you are using sequences of your own or ones acquired from a text source, it is fairly easy to format them yourself. One line of description is allowed, preceded by ">" as in ">gi|xxx" or ">sea urchin clone 122". If you are only working with one sequence, and therefore do not need a reference description, you may omit this header line. The sequence itself begins on the next line, using upper case standard IUB/IUPAC nucleotide [ATGC] or protein [single letter codes- see table] designations. In addition to these code designations, you may use an "X" to represent an unknown amino acid residue and an "N" to represent an unknown nucleotide, and an asterisk [*] can be used for either. Keep the text lines to 60 characters in length, using hard returns rather than line-wrap, and avoid inserting any spaces anywhere.
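If you would rather not format by hand, the rules above can be sketched in a few lines of Python. This is a minimal example, not an official tool: the function name `to_fasta` is made up for illustration, the sequence is an arbitrary short string, and the code does not check that every character is a valid IUB/IUPAC code.

```python
def to_fasta(sequence, description=None, width=60):
    """Return a FASTA record: optional '>' header, then upper-case lines of fixed width."""
    # Drop every space, tab, and line break, then switch to upper case.
    cleaned = "".join(sequence.split()).upper()
    # Break the sequence into hard-wrapped lines of at most `width` characters.
    lines = [cleaned[i:i + width] for i in range(0, len(cleaned), width)]
    if description is not None:
        # One description line is allowed, and it must start with '>'.
        lines.insert(0, ">" + description.strip())
    return "\n".join(lines) + "\n"


# A sequence typed in with stray spaces and a line break (arbitrary example).
raw = "atg gcc att gta\natg ggc cgc tga"
print(to_fasta(raw, "sea urchin clone 122"), end="")
```

Passing `description=None` leaves out the header line, which matches the single-sequence case described above.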

4. Go back up to the search bar and select Protein. Repeat steps 1-3 and compare what you see here to what you saw with nucleotide searching. Note differences in the number of retrieved sequences. Instead of a GenBank report, you can retrieve a GenPept report. Save a FASTA format sequence.

5. How do you retrieve a protein sequence matching a nucleotide sequence or vice versa? To the right, in blue, note that some entries show a link to nucleotide, if in Entrez-Protein. In Entrez-Nucleotide, there are similar links to protein. Check it out and compare the information you get back.

 

B. Finding a group of related sequences.

This can be done in a variety of ways, in different locations. A brief introduction to a few of these is given here.

Scan the right side of a page of results from your search in A above. Note that some entries have BLink, Domains, and/or Links in blue type. Click on one to see what you get back. Another way is to check if Protein Neighbors or Nucleotide Neighbors shows in the Display menu for your selected sequence(s). Selecting it will give you a list as well. Be sure to check the boxes before clicking on "Display".

 

The above strategies have their uses, but as is, they are not that interesting. You'll need to explore further to get to the fun stuff. For further analysis following any of these search strategies, sequences of interest need to be retrieved and saved in FASTA format. If you are interested in structure, be sure to save MMDB and PDB accession numbers, as well, for future use. [More on those in Unit 5.]

 

Here's a good point for a break.....


Part 2: BLAST Searches

A: Finding homologous nucleotide sequences by using BLAST.

1. The shuttle to return to NCBI departs here:

[www.ncbi.nlm.nih.gov/]

If you are at the NCBI home page, click on the BLAST button. If you are at the Entrez home page, click on BLAST on the left-hand menu panel. On some formats of results pages, BLAST can be accessed by clicking on it at the top of the page. Read the overview, and check out the "FAQ", "course", and "tutorial" links on the left-hand menu if you are interested in further explanations, especially of the statistical tools used and their levels of reliability, and in more about running BLAST searches. For help while in BLAST, just click on any linked label to get an explanation.

2. Select blastn on the first menu page. On the next page, use defaults for your first try. For the database window, leave it at "nr" [non-redundant], as this will give you the broadest search of linked databases without duplication of individual sequences. Go to your log or the clipboard and select the FASTA formatted sequence of a nucleotide sequence of interest and copy it. Go back to the BLAST page. Click the cursor inside the top window, then paste. [Your FASTA sequence should still be in the buffer. If not, go to your log and copy/paste it over.] Be sure you eliminate any spaces to the left of the ">gi|xxx" line and any spaces to the left of the sequence lines. Do not touch the right side of any lines.

3. Below the search window are other menu selections, such as one which says "pairwise" in the "Format" section next to "alignment view". For your first search, use the defaults. To retrieve your BLAST report, click "Format results". Browse the report. Note that the graph is active, meaning you can navigate the report by mousing-over the bars and clicking on them to get to the alignments. The window above the graph tells you the identity of each sequence retrieved. You can also scroll through the report. For details of interpreting the report, 1) click on the line above the graph, "Distribution of xx Blast Hits...", and/or 2) click on "FAQ". You can also toggle over to Darwin 2000, if you have kept a window open.

  • For long gene sequences, percent similarity scores will tend to stay high, even with a fair number of non-matching residues and gaps, since the calculation is based on the total number of aligned residues. For shorter gene sequences and fragments, similarity scores will often be lower, because each difference has a greater numerical impact on the calculation. Expect (E) values are also useful: the E value estimates how many hits with a score at least this good would be expected by chance in a database of this size. A value of 0.0 indicates extremely strong similarity or an exact match. With the default settings, a hit with an E value above 10 is treated as no better than random chance and goes unreported. Values in between give a sense of the relative significance of the match.
  • For a pairwise Blast-n, vertical lines and blanks are used between the query and subject lines. [In Blast-p, the middle line between query and subject gives a letter if an identical match, a + if similar in character, or a blank if no match.] In both cases, hyphens in the query or subject lines indicate gaps.
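To make the two points above concrete, here is a small Python sketch that builds the blastn-style middle line for an aligned pair and computes percent identity over the alignment length. The function names and the short aligned strings are invented for illustration. This is not BLAST's own code, only a hand calculation of the same kind of figure shown on the "Identities" line of a report.

```python
def midline(query, subject):
    """Build the blastn-style middle line: '|' for identical residues, blank otherwise."""
    if len(query) != len(subject):
        raise ValueError("aligned strings must have equal length")
    return "".join(
        "|" if q == s and q != "-" else " "  # a gap ('-') never counts as a match
        for q, s in zip(query, subject)
    )


def percent_identity(query, subject):
    """Identical positions as a percentage of the alignment length (gaps included)."""
    matches = midline(query, subject).count("|")
    return 100.0 * matches / len(query)


query = "ATGGCC-TTGA"    # hyphen marks a gap in the query
subject = "ATGGCCATTCA"
print(query)
print(midline(query, subject))
print(subject)
print(f"{percent_identity(query, subject):.1f}% identity")

# One mismatch weighs far more in a short alignment than in a long one.
print(percent_identity("A" * 9 + "C", "A" * 10))    # 90.0
print(percent_identity("A" * 99 + "C", "A" * 100))  # 99.0
```

The last two lines show why a single difference pulls the score of a short fragment down much further than it does for a long gene sequence.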

4. Go back to the BLAST query page. In the format section below, change from "pairwise" to "flat query-anchored" and browse the report. Compare the results to your first report.
 

B: Finding homologous protein sequences by using BLAST.

In Blast-p, the middle line between query and subject gives a letter if an identical match, a + if similar in character, or a blank if no match. See the amino acid table for the one-letter code and the groups based on chemical characteristics.

1. Go to your log or the clipboard and select the FASTA formatted sequence of a protein sequence of interest and copy it. If you don't have a sequence already, go to either the NCBI or the Entrez home page, select GenPept or Protein, and enter your choice of search terms. Retrieve your target sequence in FASTA format.

2. Go to BLAST. Select blastp and leave database "nr". Paste your sequence and run your choice of "pairwise" and "flat query-anchored" searches. Compare the similarities and differences found between running "blastp" and "blastn". Note the informational advantages of each.

C: After a breather, come back later for more types of BLAST searches.

Try BLAST-x to compare a nucleotide sequence translated in all possible reading frames against protein sequence databases. Explore the uses of BLAST 2 and PSI-BLAST. Find out what they do and how they can be used. A useful approach is to take sequences you have already run in basic BLAST and use them for other types of BLAST searches, then compare the results. You should see some differences in the sensitivity of the searches in the case of PSI-BLAST. Depending on future inquiries and projects, these may be of benefit.
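To see what "all possible reading frames" means in practice, the following Python sketch translates a nucleotide sequence in the three forward frames and the three frames of the reverse complement. It assumes the standard genetic code (NCBI translation table 1). The function names and the short test sequence are invented for illustration, and this is a simplified stand-in for what BLAST-x does before searching, not its actual implementation.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1); codons run TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; any codon not in the table (e.g. with N) becomes X."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Translate frames +1, +2, +3 on the given strand and -1, -2, -3 on its complement."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for name, protein in six_frames("ATGGCCATTGTAATGGGCCGCTGA").items():
    print(name, protein)
```

Frame +1 of this test sequence reads MAIVMGR*, where the asterisk marks a stop codon. The other five frames give different, mostly meaningless peptides, which is why every frame has to be checked when the true reading frame of a query is unknown.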


Part 3: Taxonomic Searches

A. Exploration.

1. Go to NCBI; then click on Taxonomy on the menu bar.

a. Click on Genetic Codes on the left-hand bar. Here you can discover that AUG does not always mean methionine, among other things. [Don't linger long; you can return later for a more thorough browse of the different coding tables.]

b. Return to Taxonomy; then select some taxonomic group which interests you. Note also that, from the Taxonomy homepage, you can directly access database information on some common model organisms. This is a great resource site for both study and research.

1) Page down through the taxon groups until you find something of interest. Make a selection. [Note: You can adjust the number of levels you want displayed as you browse.]

2) Once you arrive at a species level, you will find some basic information and information on how many different types of nucleotide, protein, and structure sequences are available for retrieval. To retrieve sequences, simply click on the type you want.

2. For a diversion, if you are unfamiliar with The Tree of Life, it is well worth a visit:

[http://tolweb.org/tree/phylogeny.html]

This resource is useful for many purposes, especially when designing an inquiry involving different species. It is an ongoing project, so it is worth checking back periodically as the tree grows.

B. Combining taxonomic and BLAST searches to focus on a desired group or groups.

This is easy to do. After entering your query sequence in a BLAST search window, scroll down to the options section. You can enter taxonomic limits in "Limit by Entrez query" or select a taxonomic group or model species in the pulldown menu. If you fail to get what you expected, go back to Taxonomy and browse the hierarchy to select an appropriate level to search. Try one of your sequences and play with setting taxonomic limits.


Summary Questions:

Try to limit your answers to 2-3 typed pages [12 pt font]. This length should be sufficient for your comments and any appropriate copy/pasted examples. [You need not retype or copy/paste the questions as part of your responses, but do number them please.]

1. Give an example of a successful basic search you did. Give the following information:

a. Word or word string used
b. Give brief examples from your log of some types of reports you retrieved
c. Give one example of a sequence in FASTA format

2. Summarize one of your BLAST search results. Give the following information:

a. What was your query sequence, which you entered in FASTA?

b. How many closely related sequences did you retrieve? What was the worst match you retrieved? Do you consider this match to still be significant or not? Why?

c. What was the range of "E" scores? What does this mean?

d. What information did you get from the query-anchored report, which you didn't see in the first report?

e. Which style of BLAST report did you prefer? Why?

f. Explain why six reading frames are checked in BLAST-x.

g. Briefly summarize the difference in sensitivity when using PSI-BLAST versus basic BLAST. Give an example where PSI-BLAST would be particularly useful.

h. Give a brief summary of your experiences with setting taxonomic limits in BLAST.

3. [Required question] On the whole at this point, what part of conducting searches is most comfortable for you? What part is still difficult?

Further exploration:

As you read and digest what you have just gone through, go back and browse some of the sites in more detail. This exercise is like one of those fast-paced package tours, where you probably want to get out and leave the trail to explore more independently. It also helps to do segments of browsing to help solidify where you have been and what you have been doing. The more time you can put into it, the more you'll get back from your efforts.


Updated 09/11/2003 by bsc@classroomtools.com, thatcher@sonoma.edu