Bioinformatics Unit 4: Exercise

Unit 4: Phylogenetics
Exercise

Exercise: Sequence Alignments and Tree Building

Alignment tools and methods for building phylogenetic trees


Objectives:

1. Gain a basic appreciation of phylogenetics.

  • Know the difference between phylogeny and taxonomy
  • Explore phenetic and cladistic approaches
  • Examine criteria of data selection and analysis

2. Become familiar with on-line resources helpful in studying the phylogeny of organisms.

  • Explore and use sites, such as Phylodendron Web and PHYLIP servers
  • Examine phylogenetic literature on-line

3. Learn to use alignment and tree building tools, and to analyze the results.

  • BLAST2 and ALIGN for selected pairwise alignments
  • ClustalW and ClustalX for multiple sequence alignments
  • Build trees using DRAWTREE and DRAWGRAM

 

Introduction:

When comparing sequence homologues from different source species, there is recognition that they are related. The questions often asked are "how are they related" and "when did they diverge"? These are not new questions. Before people started making comparisons at the molecular level, they were using other types of data such as morphology and developmental processes to try to answer these questions. Phylogenetics attempts to reconstruct evolutionary history. In a phylogenetic analysis, one compares the results of evolutionary processes, be it shape and size of specific bones or patterns of DNA or protein sequences, in an attempt to determine how different groups or species may have been derived during evolution. Looking at the molecular sequences can give insight into the nature of the processes which lead to divergence, in addition to analyzing the relative degree of relatedness. The mechanisms of change include 1) random mutation, which can be seen as genetic drift in the absence of selective pressure, 2) sequence duplication, which may be duplication of small segments, genes, or even whole genomes, and 3) recombination, which includes transposons, translocations, and viral activity, to mix up sequences within an organism, to remove sequences, or to introduce sequences from another organism.

There is considerable debate regarding the best approach to take when analyzing sequence alignments for phylogenetic relationships. Accepted approaches include distance calculations, parsimony, and maximum likelihood. New on the scene is Bayesian analysis, which is expected to gain popularity as people become familiar with it. It is good to become familiar with the different methods of analysis. It helps in understanding the arguments being made, both in terms of how things should be done and of the resulting analyses. When examining the results of molecular phylogenetic analysis, care should be taken to compare the results to other independent means of analysis and/or to other data sets.

An excellent resource book for this unit is:

Hall, Barry, 2001. Phylogenetic Trees Made Easy: A How-To Manual for Molecular Biologists. Sinauer Associates, Inc. ISBN: 0878933115.

If you intend to use these methods in your work or research, this self-proclaimed "cookbook" is highly recommended to extend beyond the basic introduction in this exercise. It contains both basic information and guided tutorials in the use of a variety of tools.

There are summary questions at the end of this section. Read them through before you start browsing. You can answer them as you go, or answer them after browsing the following sites. Points = 10. Due 10/30.


Pre-Exercise:

1. Review basic background on evolution. Use either one of your own texts or go to Dr. Jasper's site, which has excellent notes and interesting links:

http://www.zo.utexas.edu/faculty/sjasper/bio304/syl304.html

2. Jargon, and the associated concepts, you need to know:

  • phylogenetics vs. taxonomy
  • cladistic vs. phenetic
  • clustering
  • parsimony vs. maximum likelihood

Exercise:

Part A: Multiple sequence alignments [MSA] and pairwise alignments

1. You have already tried using ClustalW to do multiple sequence alignments. It is now time to dust off the cobwebs ['tis the season!] and get back to it. Recall that you need to start with sequences in FASTA format, or have files in Biology Workbench. For this exercise, please use the following set of accession numbers to obtain some protein sequences of glutamate synthase. These were chosen because they have similar enzymatic function and are therefore conserved, while being drawn from five kingdoms.

Accession   Organism         Group
A38596      Maize            plant
CAC05496    Arabidopsis      plant
AAC08261    Porphyra         algae
CAA76602    Plasmodium       protozoa
AAF49409    Drosophila       animal
CAB92626    Neurospora       fungi
CAA61505    Saccharomyces    fungi
BAB05447    Bacillus         bacteria
AAA58014    E. coli          bacteria
AAK94787    Klebsiella       bacteria
CAB64595    Nostoc           bacteria
AAG44102    Staphylococcus   bacteria


These numbers can be used to obtain the protein sequences from NCBI. FASTA format is recommended for easy transfer into your log and uploading into Biology Workbench or elsewhere.
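If you would rather script the retrieval than click through NCBI, the following is a minimal sketch that builds a request to NCBI's E-utilities efetch service for protein records in FASTA format. The helper names efetch_url and fetch_fasta are made up for this illustration, and there is no guarantee that every accession in the table still resolves unchanged. The download function needs network access, so the demonstration only prints the URL.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the NCBI E-utilities efetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accessions):
    """Build a URL that asks for protein records as plain-text FASTA."""
    query = urlencode({
        "db": "protein",
        "id": ",".join(accessions),
        "rettype": "fasta",
        "retmode": "text",
    })
    return f"{EFETCH}?{query}"


def fetch_fasta(accessions):
    """Download the records as one FASTA string (requires network access)."""
    with urlopen(efetch_url(accessions)) as response:
        return response.read().decode("utf-8")


# Only build and show the URL here; call fetch_fasta(...) yourself when online.
url = efetch_url(["A38596", "CAC05496"])
print(url)
```

Paste the printed URL into a browser to check the result before fetching the whole set.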

2. In your log, remove any spaces that precede ">gi|xxx..." on the header line of each entry. [Don't disturb the sequence lines.] This is necessary when running Clustal, because any extra spaces will terminate the alignment for all entries beyond those spaces.
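For a long log, this cleanup can be done in a few lines of code. The sketch below is one way to do it, assuming your sequences are held in a single text string; the function name and the toy records are invented for the example. Only header lines are changed, and sequence lines are passed through untouched.

```python
def strip_header_spaces(fasta_text):
    """Remove leading whitespace from FASTA header lines only."""
    cleaned = []
    for line in fasta_text.splitlines():
        stripped = line.lstrip()
        # A header is any line whose first non-space character is '>'.
        cleaned.append(stripped if stripped.startswith(">") else line)
    return "\n".join(cleaned)


# Two made-up entries with stray spaces in front of the headers.
raw = "  >gi|123|demo one\nMKVL\n\n >gi|456|demo two\nMKIL"
clean = strip_header_spaces(raw)
print(clean)
```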


3. At this point, you have two options. Choose the one you like. [You can always return to try another option.]

a. You can use ClustalW at EBI. When you paste in your sequences, be careful to remove any spaces at the beginning of lines. You may leave a blank line between sequence entries.

http://www2.ebi.ac.uk/clustalw/ or http://www.ebi.ac.uk/clustalw/

Explore the site. You can read about the windows by clicking on them. [For the next steps, see 4a.] Once you have an alignment, you can save it, or transfer it to another application or to Biology Workbench.

b. You can upload your sequences directly into Biology Workbench before aligning them. Be sure to check each sequence entry for inadvertent spaces at the beginning of lines. If you find any, remove them. Within Workbench, you can use ClustalW in either Protein Tools or Nucleotide Tools [see 4b], and then use other applications in Alignment Tools for analysis.

4. Follow the directions according to the option you selected in 3 above.

a. Paste your grouped FASTA sequences into the text box. For your first run, use the defaults. The alignments will take a few minutes. You may want to enter your e-mail address to retrieve the report. If your run fails, first check that the FASTA format is correct and that there are no spaces at the left-hand side of lines. If a run seems to take too long, try "off-hours", keeping in mind that this is a European site, or try making your alignment request smaller. You can do this by selecting only 4-6 sequences. Alternatively, you may want to focus on just one region or domain of your sequences. In that case, you can select portions of the FASTA reports. After using the defaults, try changing some of the settings after reading about them in the support pages. For report viewing, see 5 below.

Note: When running Clustal on a set of sequences, you may need to edit your sequences before you get reasonable alignments. It doesn't hurt to try a test run first. As you work through the following, consider what might be some of the causes of misalignments. This will be discussed in class after you have some results in hand.

b. Select the sequences you want to compare by checking the boxes. Choose ClustalW. Initially accept the defaults. On repeated runs, try changing some of the settings. [Go to the EBI site for documentation support.] Try running subgroups of sequences and try changing the order of sequences. You can change order by selecting a sequence and choosing a menu item, then return. The selected sequence is at the top of the list. You can easily scramble your list by randomly selecting and copying different sequences. You can also create edited sequences to select a region or to remove nonstandard characters. To save alignments, select "Import alignments". Then you can use the Alignment Tools. [See 6 below. You should skip 5.]
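Both options suggest working with fewer sequences or with just one region of each sequence. If you prefer to prepare such a reduced set outside the web forms, here is a minimal sketch. The function names, the toy records, and the 1-based inclusive coordinates are choices made for this illustration, not features of ClustalW or Biology Workbench.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA text."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line[1:], ""])
        elif records:
            records[-1][1] += line
    return [(header, seq) for header, seq in records]


def trim(records, start, end, keep=None):
    """Keep residues start..end (1-based, inclusive) of the first `keep` records."""
    chosen = records if keep is None else records[:keep]
    return [(header, seq[start - 1:end]) for header, seq in chosen]


def to_fasta(records, width=60):
    """Write records back out, wrapping sequence lines at `width` residues."""
    lines = []
    for header, seq in records:
        lines.append(">" + header)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines)


# Three made-up records; keep residues 3-8 of the first two.
demo = ">seq1 toy\nMKVLAAGIV\nGSTP\n>seq2 toy\nMKILAAGLVGSSP\n>seq3 toy\nMRVLSAGIVGTTP\n"
subset = trim(parse_fasta(demo), start=3, end=8, keep=2)
print(to_fasta(subset))
```

Note that the region is cut by position in each unaligned sequence, so the same coordinates will not necessarily cover the same domain in every organism.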


5. For option a only. [Option b, skip to 6 below.] Once you have the report, browse to see what you have. Click on Jalview for a graphical display. Wait for the calculations and color assignment to be complete before trying to navigate. For your convenience, consensus notations and colors used in Jalview are assigned as follows:

Consensus line notations:
* = identical or conserved residues in all sequences in the alignment
: = indicates conserved substitutions
. = indicates semi-conserved substitutions.
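To see where the "*" marks come from, the sketch below rebuilds only that part of the consensus line from a toy alignment. It is an illustration, not Clustal's own code: the ":" and "." marks depend on Clustal's residue substitution groups, which are not reproduced here, so every non-identical column is simply left blank. The function name and the three short sequences are made up.

```python
def identity_line(aligned):
    """Return '*' under columns where every sequence has the same residue."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    marks = []
    for column in zip(*aligned):
        # A column of gaps only is not counted as conserved.
        same = len(set(column)) == 1 and column[0] != "-"
        marks.append("*" if same else " ")
    return "".join(marks)


rows = ["MKV-LAG", "MKI-LAG", "MRVALAG"]
for row in rows:
    print(row)
print(identity_line(rows))
```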

Color      Characteristics                 Amino acids
red        small & hydrophobic R groups    AVFPMILW
blue       acidic                          DE
magenta    basic                           RHK
green      hydroxyl + X                    STYHCNGQ
gray       other

Symbols for amino acids

Compare the results of your different runs. Which parameters did you change? What was the effect? Record for future reference. Upload some alignments into Biology Workbench.

6. At this point, everyone should have some aligned sequences in Biology Workbench. To check, select Alignment Tools after selecting the appropriate session. You should see blocks of sequences listed. If that is not true, go back and continue working on alignments and/or uploads until you do.

a. Use Boxshade and Textshade to easily view conserved and non-conserved regions. Note that these are similar, but not identical to Jalview. These can be saved and used as graphic inserts in reports and manuscripts. Use one of these to browse your alignments and to make comparisons between your different alignments.

b. Make note of conserved regions.
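If you want a quick list of conserved regions to put in your log, the following sketch reports runs of columns in which every sequence carries the same residue. It is only a rough stand-in for what you see in Boxshade or Textshade: the strict "all identical" rule, the minimum run length of 3, and the function name are all assumptions made for this example.

```python
def conserved_regions(aligned, min_length=3):
    """Return (start, end) column ranges, 1-based inclusive, where all rows agree."""
    regions = []
    run_start = None
    columns = list(zip(*aligned))
    for pos, column in enumerate(columns, start=1):
        identical = len(set(column)) == 1 and "-" not in column
        if identical and run_start is None:
            run_start = pos                      # a new conserved run begins
        elif not identical and run_start is not None:
            if pos - run_start >= min_length:    # run covered run_start..pos-1
                regions.append((run_start, pos - 1))
            run_start = None
    # Close a run that extends to the last column.
    if run_start is not None and len(columns) - run_start + 1 >= min_length:
        regions.append((run_start, len(columns)))
    return regions


rows = ["MKVLAAGIV-GSTP", "MKVLSAGIVAGSTP", "MKVLAAGIV-GSTP"]
print(conserved_regions(rows))
```

Positions are alignment columns, not residue numbers in any one sequence, because gaps shift the numbering.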


Part B: Tree building

1. Neighbor joining [NJ] is a clustering method that builds a tree from pairwise distances. It is the favored distance method because, unlike the arithmetic-average approach [UPGMA], it does not assume equal rates of evolution along all lineages. In Biology Workbench's Alignment Tools, try the following:

a. Use Clustaldist to obtain a set of distance calculations.

b. Choose either DNAdist or Protdist, depending on whether you have nucleotide or protein alignments. Run the same alignments again to obtain a second set of distance results. How do these results differ from the first set? Which application appears to be more sensitive to differences?

c. Sketch a tree based on distance calculations obtained from Clustaldist. Sketch another tree based on the calculation results obtained from DNAdist or Protdist.

d. Use Drawtree to produce a PHYLIP unrooted tree. Compare this tree to your sketches.
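To check your hand sketches, it can help to see the neighbor-joining steps spelled out. The sketch below is a bare-bones illustration under stated assumptions, not the code behind Clustaldist, Protdist, or PHYLIP. The distance used is the plain fraction of differing sites [p-distance], whereas DNAdist and Protdist apply their own substitution models, and the five-taxon distance table is a made-up textbook-style example. The tree is written in Newick format, the parenthesized notation that PHYLIP programs read.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Fraction of differing sites, skipping columns where either row has a gap."""
    pairs = [(x, y) for x, y in zip(seq_a, seq_b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_table(aligned):
    """aligned: dict of name -> aligned sequence. Returns {frozenset({a, b}): distance}."""
    return {frozenset((a, b)): p_distance(aligned[a], aligned[b])
            for a, b in combinations(aligned, 2)}


def neighbor_joining(names, dist):
    """Return an unrooted Newick tree from a table of pairwise distances."""
    if len(names) < 3:
        raise ValueError("need at least three taxa")
    nodes = list(names)
    d = dict(dist)

    def get(a, b):
        return d[frozenset((a, b))]

    while len(nodes) > 3:
        n = len(nodes)
        # Net divergence of each node from all the others.
        r = {i: sum(get(i, k) for k in nodes if k != i) for i in nodes}
        # Join the pair with the smallest Q value.
        i, j = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * get(*p) - r[p[0]] - r[p[1]])
        dij = get(i, j)
        li = 0.5 * dij + (r[i] - r[j]) / (2 * (n - 2))   # branch from i to new node
        lj = dij - li                                    # branch from j to new node
        new = f"({i}:{li:.4f},{j}:{lj:.4f})"
        for k in nodes:
            if k not in (i, j):
                d[frozenset((new, k))] = 0.5 * (get(i, k) + get(j, k) - dij)
        nodes = [k for k in nodes if k not in (i, j)] + [new]

    # Three nodes remain: connect them at a single central node.
    a, b, c = nodes
    la = 0.5 * (get(a, b) + get(a, c) - get(b, c))
    lb = 0.5 * (get(a, b) + get(b, c) - get(a, c))
    lc = 0.5 * (get(a, c) + get(b, c) - get(a, b))
    return f"({a}:{la:.4f},{b}:{lb:.4f},{c}:{lc:.4f});"


# Made-up distances among five taxa, given as a symmetric matrix.
names = ["a", "b", "c", "d", "e"]
rows = [[0, 5, 9, 9, 8],
        [5, 0, 10, 10, 9],
        [9, 10, 0, 8, 7],
        [9, 10, 8, 0, 3],
        [8, 9, 7, 3, 0]]
dist = {frozenset((names[i], names[j])): rows[i][j]
        for i, j in combinations(range(len(names)), 2)}
tree = neighbor_joining(names, dist)
print(tree)
```

To try it on your own data, pass a dictionary of aligned sequences to distance_table and hand the result to neighbor_joining. Branch lengths can occasionally come out negative with real data, which is a known quirk of the method rather than a bug in your alignment.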

2. Parsimony [also known as max pars, for maximum parsimony, and as MP] is a method which looks for the minimum number of changes which satisfy the data. It examines sequence comparisons rather than a numerical result, as in NJ.

Use DNAPars or ProtPars to generate a tree which maximizes parsimony. How many calculation steps were required to obtain the tree? Do different alignment runs affect the outcome of the final tree? If so, how?
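To make "minimum number of changes" concrete, the sketch below counts changes on one fixed tree using the Fitch counting rule. This is only an illustration of the scoring idea: DNAPars and ProtPars also search among many candidate trees, and their step counts may follow different rules, so do not expect identical numbers. The nested-tuple tree format, the function names, and the four toy sequences are assumptions made for the example.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, changes) for one alignment column.

    tree is a leaf name (str) or a pair (left_subtree, right_subtree).
    states maps each leaf name to its character in this column.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch_site(left, states)
    right_set, right_changes = fitch_site(right, states)
    shared = left_set & right_set
    if shared:
        # The two subtrees can agree on a state: no extra change needed.
        return shared, left_changes + right_changes
    # No shared state: one change must have happened on this branch pair.
    return left_set | right_set, left_changes + right_changes + 1


def parsimony_score(tree, alignment):
    """Sum the minimum number of changes over every column of the alignment."""
    length = len(next(iter(alignment.values())))
    total = 0
    for col in range(length):
        column = {name: seq[col] for name, seq in alignment.items()}
        total += fitch_site(tree, column)[1]
    return total


alignment = {"A": "ACGT", "B": "ACGA", "C": "GCGA", "D": "GCTA"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
score_1 = parsimony_score(tree_1, alignment)
score_2 = parsimony_score(tree_2, alignment)
print(score_1, score_2)
```

The tree with the lower score is the more parsimonious of the two for this toy alignment.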

3. Next try using Drawgram, a PHYLIP rooted tree tool. This allows you to build a variety of tree types from the same alignment. You can generate a phenogram, based on neighbor joining, which can then be compared to your Drawtree result. You can generate a cladogram, based on parsimony, which can then be compared to both the phenogram and to the tree obtained using DNAPars or ProtPars. If you are feeling adventurous, try out some of the other tree types.

4. OK, now you are at the point where computational intensity increases considerably. To try running maximum likelihood [ML] or Bayesian analysis on your alignments, it is recommended that you download suitable software, along with any server-stored alignments of interest, and run them on your PC. Downloading is required if you want to run these analyses on protein alignments.

For nucleotide alignments, you can use WebPHYLIP's DNAML to do maximum likelihood. This is a good site to explore for other programs within PHYLIP. Try it now, or come back while you are working on the project:

http://sdmc.krdl.org.sg:8080/~lxzhang/phylip/

For recommended downloads and other sites to explore, go to Further Exploration below.


Summary Questions:

Try to limit your answers to 2-3 typed pages [12 pt font]. This length should be sufficient for your comments and any appropriate copy/pasted examples. [You need not retype or copy/paste the questions as part of your responses.]

1. Summarize one of your MSA results. Give the following information:

a. Which option did you use? ClustalW at EBI or ClustalW in Workbench?

b. Were mutations evenly distributed or were there regions relatively free of them? What might be the reason for conserved regions? How could you test this?

2. Summarize the results obtained using distance calculations. How did your sketched trees compare to the computer-generated tree(s) in Drawtree?

3. Summarize the results obtained using parsimony. How did the maximum parsimony tree(s) compare with the NJ-based trees?

4. Summarize your explorations of Drawgram, especially between creating phenograms and cladograms. After exploring the other types of trees, which one did you like best overall? Why? In considering the available choices for a tree displayed in a publication aimed at a general scientific readership, which type would you choose? Why? [If you enjoyed playing with trees, try out Tree View, which you can download. See below.]

5. Optional: If you ran maximum likelihood [ML] on your alignments, summarize how the results compared to NJ and MP.


Further exploration:

1. More information on PHYLIP can be found at these related sites:

http://evolution.genetics.washington.edu/phylip.html

http://evolution.genetics.washington.edu/phylip/phylipweb.html

2. A wide variety of phylogenetic software can be downloaded from collections. The following is very easy to use:

http://evolution.genetics.washington.edu/phylip/software.html

The cross-reference list is useful, although you should check elsewhere for possible updates:

http://evolution.genetics.washington.edu/phylip/software.xref.html

3. PUZZLE is a good program to download for maximum likelihood analysis of nucleotide and protein sequence alignments. This program has several cool features worth exploring:

http://www.tree-puzzle.de/

Warning: While running ML, your computer will be dedicated to running this program. Forget playing games or checking your mail. Go for a run or take a nap if you are running a large set of sequences.

4. MrBayes is freeware to download for Bayesian analysis:

http://morphbank.ebc.uu.se/mrbayes/

Warning: While running a Bayesian analysis, your computer will be dedicated to running this program. Although not as bad as ML for large sets, it is slower than ML for fewer sequences [roughly, fewer than 40]. For exploration purposes, use a small set, and then go enjoy a leisurely cup of hot chocolate.

5. Tree View, software download for drawing quality trees using a variety of file formats:

http://taxonomy.zoology.gla.ac.uk/rod/treeview.html

6. Another tree drawing program is Phylodendron. It can be downloaded from the U. Washington site in 2 above, or it can be accessed from a server:

http://iubio.bio.indiana.edu/treeapp/treeprint-form.html

7. Additional access to lots of cool applications:

http://bioweb.pasteur.fr/intro-uk.html

Access to good documentation on many applications [see the index in the left-hand frame]:

http://www.molbiol.ox.ac.uk/


Updated 10/28/03 by thatcher@sonoma.edu