Exercise 2: Molecular Databases and Search Tools
Section A
Note: Bring a disk to class to save your log and any
other useful files.
Objectives for Part A:
1. Gain an understanding of the basics of
bioinformatics.
- Explore some key sites containing databases and
application software.
- Become familiar with some of the uses of
bioinformatics.
- Appreciate the dynamic nature of this rapidly growing
field.
2. Become familiar with some uses of molecular
databases.
- Be able to query databases and search for molecular
sequences.
- Be able to find similar sequences to a known sequence
using BLAST.
- Know how to evaluate the quality of information
obtained and analyze the results.
- Develop strategies for refining searches and
improving results in data retrieved.
3. Be able to evaluate and understand results.
- Use a working knowledge of key molecular biology
principles.
- Assess results in terms of significance to the
question being asked and the methods being used.
Introduction:
In this exercise, you will begin to explore a variety of
types of databases and interfaces useful in various types of
bioinformatics and computational biology analyses. You will
get a taste of searching for specific nucleic acid and
protein sequences by using keywords. You will be able to
search for sequences related to a query sequence using
BLAST. You will learn how to analyze your search results and
how to refine your searches. After playing with these tools
some, you may wish to try your hand at asking some basic
questions and looking for ways to answer them. For example,
your questions may focus on conserved sequence patterns. Or
perhaps you may want to explore the phylogeny of a specific
gene or gene family. Curiosity and imagination are great
tools to use along with what you'll learn here.
In addition to developing "search smarts" over the next
week, you will have the opportunity to become familiar with
NCBI and Biology Workbench. These are online interfaces
which make it easy to access the molecular databases and to
access a variety of software tools. Tutorials at these sites
and elsewhere provide excellent support as you work with
these databases and applications. The interfaces can greatly
reduce navigation time to obtain what you need.
You will also be exploring several databases which we'll
be using later on in the course. They are introduced here as
part of a general survey of useful databases. Included are a
database of ecological interest, genome databases, and sites
linking sequence information and metabolic pathways. After
this exercise, there is another short project. To help you
gain insight into researching a problem using these tools,
Dr. Sakanari will be presenting a research project and
asking you to follow up with some inquiries of your own.
Time management: First of all,
plan on giving yourself some short breaks. Breaks between
small chunks will actually help the learning process and cut
the frustration down to a reasonable level. The more you
play with different sections, the more they begin to fit
together. The more they fit together, the more understanding
you will gain. Don't expect to "get it" after one round of
search and analysis.
Reminder: Using a log
can help you backtrack to things you'd like to check out
later. Saving sequences, along with pertinent accession
numbers and identifiers, will become extremely useful,
especially when trying to use multiple analytical
applications on the same data set.
There are summary questions at the end of this
section. Read them through before you start browsing. You
can answer them as you go, or answer them after browsing the
following sites. Points = 6. [The other 4 points will
come in the later parts of this exercise.] Due 9/4
midnight or 9/9.
Pre-Exercise:
Tutorial lessons: Browse the following two
tutorial sites to see what they have to offer.
1. Explore the tutorials at NCBI. Besides
the basic introductory lessons, browse the "Coffee
Break" tutorials to see what is there.
 [www.ncbi.nlm.nih.gov/Education]
Explore the following sub-sites:
a. To become familiar with the size and layout
of NCBI, click on the site map. For future use,
this site map will allow you to navigate to different
sections quickly, or to find something that you saw at
one point but can't seem to find again.
b. From the main Education page, you can
get an overview of bioinformatics and some key terms.
c. Next, try the tutorial
Nucleotides, then try BLAST.
Even if you have used BLAST before, this is very helpful.
There are also new features which can help in doing BLAST
searches more effectively.
d. To better understand BLAST statistics
and the results of a search, browse the statistics
tutorial, available from the left-hand menu bar.
[In the future, you can also access both this tutorial
and the main BLAST tutorial from the BLAST
page.]
e. An interesting way to learn about different
aspects of bioinformatics and some specifics on
applications is to go to Coffee
Break. This is a series of short essays on a
variety of topics with specific tutorials embedded. For
example, the last posting, 5 June 02, discusses small
RNAs involved in the regulation of developmental timing.
One of the embedded tutorials is about NCBI's electronic
access to reference books, which is becoming quite
useful. Click on Archive on Coffee
Break's main page to access the list of other topics.
Try out a couple now. Come back for more later.
Exercise:
Part 1: Basic Search Strategies
A. Finding a sequence or two.
1. Go to NCBI:
[www.ncbi.nlm.nih.gov/]
The default on the search bar is GenBank, allowing
nucleotide searches. You can select Proteins, PubMed,
Genomes, etc. from the pulldown menu to do other types of
searches. To get you started, enter one of the keywords or
phrases from the following list, or choose
something else you have an interest in:
- green fluorescent protein
- tubulin
- poliovirus
- MHC
- ribulose bisphosphate carboxylase
You can refine your search if you get too many hits. For
example, you can narrow down to a given species or to a
single chain of a multi-subunit protein. You can add to your word
string or go to "Limits" under your search window for
advanced search options. Use the "All Fields" pulldown menu
to specify a field. Boolean operators AND, OR, NOT
must be in upper case. Help on using limits is
available.
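As a rough illustration of the refinement advice above, the following Python sketch assembles a search string from keywords, an optional organism restriction, and an optional excluded term. The helper name `build_query` and its parameters are hypothetical, introduced only for this example; the `[Organism]` field tag is written the way the "All Fields" pulldown would add it, and the operators are kept in upper case as noted above.

```python
def build_query(terms, organism=None, exclude=None):
    """Join keywords with AND; optionally limit to an organism and exclude a term.

    Hypothetical helper for illustration only; it just builds the text you
    would otherwise type into the search bar.
    """
    # Quote multi-word phrases so they are searched as a unit.
    parts = ['"{}"'.format(t) if " " in t else t for t in terms]
    query = " AND ".join(parts)
    if organism:
        # Field tag in square brackets restricts the term to one field.
        query += ' AND "{}"[Organism]'.format(organism)
    if exclude:
        query += " NOT {}".format(exclude)
    return query


print(build_query(["green fluorescent protein"], organism="Aequorea victoria"))
print(build_query(["tubulin", "beta"], organism="Homo sapiens", exclude="pseudogene"))
```

Paste the printed string into the search bar and compare the number of hits with the unrestricted keyword search.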
2. To retrieve a single entry, click on its
accession number highlighted in blue. To retrieve
several sequences at once, select them by checking the boxes
to the left. Then choose which type of display you want,
such as GenBank, then click "Display". Review the
type of information available. Try some of the other display
formats and links. Use your log to save some examples of
different kinds of reports, along with a label and
explanatory comments as to source, Web page type [NCBI
search], and anything else you deem useful for future
use.
3. Try displaying your selected sequences in
FASTA format* by selecting it in the display menu.
For saving FASTA reports to your log, copy the entire
report, beginning at ">gi|xxx ...." and ending at the end
of the sequence. Paste this in your log as a backup and for
future use. [What, you didn't open a log yet??
You can do it now.] After pasting, click back to your
Web page. Congratulations! You now have a sequence or
sequences, which you can use in the next section.
Note: You can also save search results and FASTA
reports to the on-line clipboard. Beware, however, that
there is a time limit for material remaining on the
clipboard.
* FASTA format: This is a concise, no-frills
means of storing nucleotide and protein sequences. There are
several other format types, but FASTA can be read by nearly
all application tools you might want to use. Although there
are tools available for converting files from one format
type to another, it's nice to start with one that is widely
accepted. [Note: FASTA itself is an alignment tool for
sequence comparisons. We will not be specifically using it
in this course, but if you are interested in knowing more,
you are welcome to dig a bit deeper.]
If you are using sequences of your own or ones acquired from a
text source, it is fairly easy to format them yourself. One line
of description is allowed, preceded by ">" as in
">gi|xxx" or ">sea urchin clone 122". If you are only
working with one sequence, and therefore do not need a
reference description, you may omit this header line. The
sequence itself begins on the next line, using upper case
standard IUB/IUPAC nucleotide [ATGC] or protein
[single letter codes- see table]
designations. In addition to these code designations, you
may use an "X" to represent an unknown amino acid residue
and an "N" to represent an unknown nucleotide; an asterisk
[*] can be used for either. Keep the text lines to 60
characters in length, using hard returns rather than
line-wrap, and avoid inserting any spaces anywhere.
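The formatting rules above can be sketched in a few lines of Python. This is a minimal example, not a validated converter: the function name `to_fasta` is made up for illustration, and it only applies the rules stated here (one ">" description line, upper case, no spaces, 60 characters per line with hard returns).

```python
def to_fasta(description, sequence, width=60):
    """Return one FASTA record: a '>' header line, then fixed-width sequence lines."""
    # Strip every space, tab, and line break, then switch to upper case.
    cleaned = "".join(sequence.split()).upper()
    lines = [">" + description]
    # Break the sequence into hard-wrapped lines of at most `width` characters.
    for start in range(0, len(cleaned), width):
        lines.append(cleaned[start:start + width])
    return "\n".join(lines)


record = to_fasta("sea urchin clone 122", "atg gcc att gta atg ggc cgc tga")
print(record)
```

The printed record can be pasted straight into your log or into a search window.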
4. Go back up to the search bar and select
Protein. Repeat steps 1-3 and compare what you see
here to what you saw with nucleotide searching. Note
differences in the number of retrieved sequences. Instead of
a GenBank report, you can retrieve a GenPept report.
Save a FASTA format sequence.
5. How do you retrieve a protein sequence matching
a nucleotide sequence or vice versa? To the right, in blue,
note that some entries show a link to nucleotide, if in
Entrez-Protein. In Entrez-Nucleotide, there are similar
links to protein. Check it out and compare the information
you get back.
B. Finding a group of related
sequences.
This can be done in a variety of ways, in different
locations. A brief introduction to a few of these is given
here.
Scan the right side of a page of results from your search
in A above. Note that some entries have
BLink, Domains, and/or
Links in blue type. Click on one to see what
you get back. Another way is to check if Protein
Neighbors or Nucleotide Neighbors
shows in the Display menu for your selected sequence(s).
Selecting it will give you a list as well. Be sure to check
the boxes before clicking on "Display".
The above strategies have their uses, but on their own
they are not that interesting. You'll need to explore
further to get to the fun stuff. For further analysis
following any of these search strategies, sequences of
interest need to be retrieved and saved in FASTA format. If
you are interested in structure, be sure to save MMDB and
PDB accession numbers, as well, for future use. [More on
those in Unit 5.]
Here's a good point for a break.....
Part 2: BLAST Searches
A: Finding homologous nucleotide sequences by
using BLAST.
1. The shuttle to return to NCBI departs here:
[www.ncbi.nlm.nih.gov/]
If you are at the NCBI home page, click on the BLAST
button. If you are at the Entrez home page, click on BLAST
on the left-hand menu panel. On some formats of results
pages, BLAST can be accessed by clicking on it at the top of
the page. Read the overview, and check out the "FAQ",
"course", and "tutorial" links on the left-hand menu if you
are interested in further explanations, especially of the
statistical tools used and their levels of reliability, or
in more about running BLAST searches. For help while in
BLAST, just click on any linked label to get an
explanation.
2. Select blastn on the first menu
page. On the next page, use defaults for your first try. For
database window, leave it "nr" [non-redundant] as
this will give you the broadest search of linked databases
without duplication of individual sequences. Go to your log
or the clipboard and select the FASTA formatted
sequence of a nucleotide sequence of interest and
copy it. Go back to the BLAST page. Click the cursor
inside the top window, then paste. [Your FASTA
sequence should still be in the buffer. If not, go to your
log and copy/paste it over.] Be sure you eliminate
any spaces to the left of the ">gi|xxx" line and any
spaces to the left of the sequence lines. Do not touch the
right side of any lines.
3. Below the search window are other menu
selections, such as one which says "pairwise" in the
"Format" section next to "alignment view". For your first
search, use the defaults. To retrieve your BLAST
report, click "Format results". Browse the report. Note that
the graph is active, meaning you can navigate the
report by mousing-over the bars and clicking on them to get
to the alignments. The window above the graph tells you the
identity of each sequence retrieved. You can also scroll
through the report. For details of interpreting the report,
1) click on the line above the graph, "Distribution of xx
Blast Hits...", and/or 2) click on "FAQ". You can also
toggle over to Darwin 2000, if you have kept a window
open.
- For long gene sequences, percent identity
will be high, even with a reasonable number of mismatched
residues and gaps, since it is calculated over the
total number of aligned residues. For shorter gene sequences and
fragments, percent identity will often be lower, because
each difference will make a greater numerical impact on
the calculation. Expect, or E, values are more
useful: the E value estimates how many hits with a score
at least this good would be expected by chance in a
database of this size. A value of 0.0 indicates an
extremely strong or exact match. With default settings,
hits with E values above 10 are not reported, since
matches that weak are expected by chance alone. Values in
between give a sense of how significant a match is.
- For a pairwise Blast-n, vertical lines
and blanks are used between the query and subject lines.
[In Blast-p, the middle line between query and
subject gives a letter if an identical match, a + if
similar in character, or a blank if no match.] In
both cases, hyphens in the query or subject lines
indicate gaps.
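To make the two points above concrete, here is a small Python sketch that draws a blastn-style middle line for an already aligned pair and computes percent identity over the alignment length. The function names and the toy sequences are invented for this example, and the sketch assumes both aligned strings have the same length; it does not perform an alignment itself.

```python
def midline(query, subject):
    """Middle line for a nucleotide alignment: '|' for a match, blank otherwise."""
    return "".join(
        "|" if q == s and q != "-" else " " for q, s in zip(query, subject)
    )


def percent_identity(query, subject):
    """Identical positions divided by alignment length (gap columns included)."""
    matches = midline(query, subject).count("|")
    return 100.0 * matches / len(query)


# A short alignment: one mismatch and one gap out of 10 columns.
short_q, short_s = "ATGCCGTA-A", "ATGACGTAGA"
print(short_q)
print(midline(short_q, short_s))
print(short_s)
print(percent_identity(short_q, short_s))

# A long alignment with the same two differences out of 1000 columns.
long_q, long_s = "A" * 1000, "A" * 998 + "CC"
print(percent_identity(long_q, long_s))
```

The same two differences cost the short alignment 20 percentage points but the long one only 0.2, which is why percent identity alone is a poor guide for short fragments.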
4. Go back to the BLAST query page. Change from
pairwise in the format section below to
flat query-anchored and browse the report.
Compare the results to your first report.
B: Finding homologous protein sequences by using
BLAST.
In Blast-p, the middle line between query and
subject gives a letter if an identical match, a + if similar
in character, or a blank if no match. See the amino
acid table for the one-letter code and the groups based
on chemical characteristics.
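The protein middle line can be imitated with a short Python sketch. Note the assumption: the `GROUPS` table below is a simplified, illustrative grouping by side-chain chemistry, made up for this example. BLAST itself marks a "+" where the substitution matrix (BLOSUM62 by default) gives a positive score, so real reports will differ in detail.

```python
# Simplified, illustrative chemical groupings (an assumption for this sketch,
# not the scoring BLAST actually uses).
GROUPS = ["AVLIM", "FWY", "ST", "NQ", "DE", "KRH"]


def similar(a, b):
    """True if both residues fall in the same illustrative chemical group."""
    return any(a in group and b in group for group in GROUPS)


def protein_midline(query, subject):
    """Letter for an identical pair, '+' for a similar pair, blank otherwise."""
    out = []
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            out.append(" ")   # gap column
        elif q == s:
            out.append(q)     # identical residues
        elif similar(q, s):
            out.append("+")   # similar in character
        else:
            out.append(" ")   # no match
    return "".join(out)


query, subject = "MKVLDE-G", "MRVIDQSG"
print(query)
print(protein_midline(query, subject))
print(subject)
```

Compare the "+" positions you see in a real blastp report with the groups in the amino acid table to see where this simplification and the substitution matrix agree.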
1. Go to your log or the clipboard and select
the FASTA formatted sequence of a protein sequence of
interest and copy it. If you don't have a sequence already,
go to either the NCBI or the Entrez home page, select GenPept
or Protein, and enter your choice of search
terms. Retrieve your target sequence in FASTA format.
2. Go to BLAST. Select blastp and leave
database "nr". Paste your sequence and run your choice of
"pairwise" and "flat query-anchored" searches. Compare the
similarities and differences found between running "blastp"
and "blastn". Note the informational advantages of each.
C: After a breather, come back later for more types of
BLAST searches.
Try BLAST-x to compare a nucleotide sequence
translated in all possible reading frames against protein
sequence databases. Explore the uses of BLAST 2
and PSI-BLAST. Find out what they do and how they can
be used. A useful approach is to take sequences you have
already run in basic BLAST and use them for other types of
BLAST searches, then compare the results. You should see
some differences in the sensitivity of the searches in the
case of PSI-BLAST. Depending on future inquiries and
projects, these may be of benefit.
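"All possible reading frames" means three frames on the strand you entered plus three on its reverse complement. The Python sketch below translates a DNA string in all six, using the standard genetic code only. The function names and the test sequence are invented for illustration; unrecognized codons (for example those containing "N") are shown as "X".

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Complement each base, then reverse the string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; '*' marks a stop, 'X' an unknown codon."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return translations keyed '+1'..'+3' (given strand) and '-1'..'-3'."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[strand + str(offset + 1)] = translate(seq[offset:])
    return frames


for name, protein in six_frames("ATGGCCATTGTAATGGGCCGCTGA").items():
    print(name, protein)
```

Only one of the six translations usually looks like a real protein; the others tend to be interrupted by stop codons, which is worth noticing when you read a BLAST-x report.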
Part 3: Taxonomic Searches
A. Exploration.
1. Go to: ;
then click on Taxonomy on the menu bar.
a. Click on Genetic Codes on the
left-hand bar. Here you can discover that the standard
code is not universal (for example, UGA encodes tryptophan
rather than stop in vertebrate mitochondria), among other
things. [Don't linger long; you can return later for
a more thorough browse of the different coding
tables.]
b. Return to Taxonomy; then select some
taxonomic group which interests you. Note also that
from the Taxonomy homepage you can directly access database
information on some common model organisms.
This is a great resource site for both study and
research.
1) Page down through the taxon groups
until you find something of interest. Make a
selection. [Note: You can adjust the number of
levels you want displayed as you browse.]
2) Once you arrive at a species level, you
will find some basic information and counts of
the nucleotide, protein, and
structure records available for retrieval. To
retrieve sequences, simply click on the type you
want.
2. For a diversion, if you are unfamiliar with
The Tree of Life, it is well worth a visit:
[http://tolweb.org/tree/phylogeny.html]
This resource is useful for many purposes, especially
when designing an inquiry involving different species. It
is an ongoing project, so it is worth checking
back periodically as the tree grows.
B. Combining taxonomic and BLAST searches to focus on
a desired group or groups.
This is easy to do. After entering your query sequence in
a BLAST search window, scroll down to the options section.
You can enter taxonomic limits in "Limit by Entrez query" or
select a taxonomic group or model species in the pulldown
menu. If you fail to get what you expected, go back to
Taxonomy and browse the hierarchy to select an appropriate
level to search. Try one of your sequences and play with
setting taxonomic limits.
Summary Questions:
Try to limit your answers to 2-3 typed pages [12 pt
font]. This length should be sufficient for your
comments and any appropriate copy/pasted examples. [You
need not retype or copy/paste the questions as part of your
responses, but do number them please.]
1. Give an example of a successful basic search
you did. Give the following information:
a. Word or word string used
b. Give brief examples from your log of some types
of reports you retrieved
c. Give one example of a sequence in FASTA
format
2. Summarize one of your BLAST search results.
Give the following information:
a. What query sequence did you
enter in FASTA format?
b. How many closely related sequences did you
retrieve? What was the worst match you retrieved? Do you
consider this match to still be significant or not?
Why?
c. What was the range of "E" scores? What does
this mean?
d. What information did you get from the
query-anchored report, which you didn't see in the first
report?
e. Which style of BLAST report did you prefer?
Why?
f. Explain why six reading frames are checked
in BLAST-x.
g. Briefly summarize the difference in
sensitivity when using PSI-BLAST versus basic BLAST. Give
an example where PSI-BLAST would be particularly
useful.
h. Give a brief summary of your experiences
with setting taxonomic limits in BLAST.
3. [Required question] On the whole at
this point, what part of conducting searches is most
comfortable for you? What part is still difficult?
Further exploration:
As you read and digest what you have just gone through,
go back and browse some of the sites in more detail. This
exercise is like one of those fast-paced package tours,
where you probably want to get out and leave the trail to
explore more independently. It also helps to do segments of
browsing to help solidify where you have been and what you
have been doing. The more time you can put into it, the more
you'll get back from your efforts.