Bioinformatics Unit 1: Exercise 2



Unit 1: Databases & Queries
Exercise 2, Section C


Exercise 2: Multiple Sequence Alignment
Section C


Note: Bring a disk to class to save your log and any other useful files.


Objectives for Part C:

Learn to use alignment tools and to analyze the results.

  • BLAST2 and ALIGN for selected pairwise alignments
  • ClustalW for multiple sequence alignments

 

Introduction:

When we compare sequence homologues from different source species, we recognize that they are related. The questions often asked are "How are they related?" and "When did they diverge?" These are not new questions. Before people started making comparisons at the molecular level, they used other types of data, such as morphology and developmental processes, to try to answer them. Phylogenetics attempts to reconstruct evolutionary history. In a phylogenetic analysis at the molecular level, one needs to compare groups of selected sequences before beginning to tease out the details of evolutionary processes.

Looking at the alignment of multiple molecular sequences can give insight into the nature of the processes that lead to divergence, as well as into the relative degree of relatedness. The mechanisms of change include 1) random mutation, which can be seen as genetic drift in the absence of selective pressure, 2) sequence duplication, which may involve small segments, genes, or even whole genomes, and 3) recombination, which includes transposons, translocations, and viral activity, and which can mix up sequences within an organism, remove sequences, or introduce sequences from another organism. Various tools and algorithms can be applied to aligned sets of sequences to aid in the recognition of different events and to develop phylogenetic trees. [You will have the opportunity to explore some of these in Unit 4.]

When comparing similar protein sequences, one may also ask, for example, "Do they appear to have the same function?" or "Where are the possible functional sites?" For many types of structural analyses, one needs to start with a selected group of protein sequences and compare them using multiple sequence alignment [MSA]. Other tools can then be applied to these alignments to examine conservation of domains and motifs and to identify regions of structural variability. [You will have the opportunity to explore some of these in Unit 5.]

There is considerable debate regarding the best approach to take when making multiple sequence alignments and analyzing the resultant output. Here we introduce ClustalW, which scores residue matches with the same kinds of substitution matrices (for example, BLOSUM and PAM) that are used in BLAST alignments. In discussion, we will highlight the connections between making pairwise alignments and aligning sets of sequences.

There are summary questions at the end of this section. Read them through before you start browsing. You can answer them as you go, or answer them after browsing the following sites. Points = 2. Due 9/11 midnight or 9/16.


Pre-Exercise:

1. Optional: For a tutorial on multiple sequence alignment [MSA], go to:

http://www.rickhershberger.com/darwin2000/

Click on the module "Multiple sequence alignments" to learn about ClustalW, and work through the guided activity to learn what this can do for you. Warning: Some patience is required. Pay attention to details, or you may be frustrated by the lack of results.


Exercise:

Part A: Multiple sequence alignments [MSA] and pairwise alignments

1. By this time you should have one or more sets of related protein or nucleotide sequences stored in your log in FASTA format, or have a file or two in Biology Workbench. You are welcome to use these in this exercise. Be sure to have, or get, at least one sequence from an outgroup. Otherwise, you can use the following accession numbers to obtain a set of glutamate synthase protein sequences. These were chosen because they share a similar enzymatic function, and are therefore conserved, while being drawn from five kingdoms.

Accession    Organism          Group

A38596       Maize             plant
CAC05496     Arabidopsis       plant
AAC08261     Porphyra          algae
CAA76602     Plasmodium        protozoa
AAF49409     Drosophila        animal
CAB92626     Neurospora        fungi
CAA61505     Saccharomyces     fungi
BAB05447     Bacillus          bacteria
AAA58014     E. coli           bacteria
AAK94787     Klebsiella        bacteria
CAB64595     Nostoc            bacteria
AAG44102     Staphylococcus    bacteria


These numbers can be used to obtain the protein sequences from NCBI. FASTA format is recommended for easy transfer into your log and uploading into Biology Workbench or elsewhere.
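If you would rather request the records in one batch than copy them one at a time, the following is a minimal sketch of how that might be done. It is not part of the original exercise: it assumes NCBI's E-utilities `efetch` service with `db=protein` and `rettype=fasta`, and the names `build_efetch_url` and `fetch_fasta` are hypothetical helpers. Some of the older accessions listed above may have been superseded and may no longer resolve.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint: NCBI E-utilities efetch.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Accession numbers from the table above.
ACCESSIONS = [
    "A38596", "CAC05496", "AAC08261", "CAA76602", "AAF49409", "CAB92626",
    "CAA61505", "BAB05447", "AAA58014", "AAK94787", "CAB64595", "AAG44102",
]


def build_efetch_url(accessions):
    """Return an efetch URL asking for the given protein records as FASTA text."""
    params = {
        "db": "protein",
        "id": ",".join(accessions),
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


def fetch_fasta(accessions):
    """Download the records as FASTA text (requires network access)."""
    with urlopen(build_efetch_url(accessions)) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Print the URL only; call fetch_fasta(ACCESSIONS) to download.
    print(build_efetch_url(ACCESSIONS))
```

The text returned by `fetch_fasta` can be pasted into your log like any other FASTA report.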

2. In your log, remove any spaces that come before ">gi|xxx..." at the start of each entry's header line. [Don't disturb the sequence lines.] This is necessary when running Clustal, because any extra spaces will terminate the alignment for all entries beyond those spaces.
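For a long log, this cleanup can also be scripted. The sketch below is one possible approach, not a required step: `strip_header_spaces` and `count_entries` are hypothetical helper names, and the two short entries in `example` are made-up placeholders rather than real records. Only header lines are changed; sequence lines are passed through untouched.

```python
def strip_header_spaces(fasta_text):
    """Remove leading whitespace from FASTA header lines only."""
    fixed = []
    for line in fasta_text.splitlines():
        if line.lstrip().startswith(">"):
            fixed.append(line.lstrip())  # header: move '>' to column 1
        else:
            fixed.append(line)           # sequence line: left unchanged
    return "\n".join(fixed) + "\n"


def count_entries(fasta_text):
    """Count header lines whose '>' is in the first column."""
    return sum(1 for line in fasta_text.splitlines() if line.startswith(">"))


# Made-up example with stray spaces in front of each header.
example = "  >gi|111|demo one\nMKVL\nAGTW\n >gi|222|demo two\nMKIL\n"
cleaned = strip_header_spaces(example)

print("entries recognized before:", count_entries(example))
print("entries recognized after: ", count_entries(cleaned))
print(cleaned)
```

Comparing the entry count before and after is a quick check that every header now starts in the first column.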


3. At this point, you have two options. Choose the one you want to try. [You can always return to try another option.]

a. You can use a server-based version at EBI for ClustalW. You need to be careful when you paste in your sequences that you remove any spaces at the beginning of lines. You may leave a blank line between sequence entries.

http://www2.ebi.ac.uk/clustalw/ or http://www.ebi.ac.uk/clustalw/

Explore the site. Note the default settings and what you are able to change. You can read about the windows by clicking on them. [For the next steps, see 4a.] Once you have an alignment, you can save it, or transfer it to another application or to Biology Workbench.

b. You can upload your sequences directly into Biology Workbench before aligning them. Be sure to check each sequence entry for inadvertent spaces at the beginning of lines. If you find any, remove them. Within Workbench, you can use ClustalW in either Protein Tools or Nucleotide Tools [see 4b], and then use other applications in Alignment Tools for analysis.

4. Follow the directions according to the option you selected in 3 above.

Note: When running Clustal on a set of sequences, you may need to edit your sequences before you get reasonable alignments. It doesn't hurt to try a test run first. As you work through the following, consider what might be some of the causes of misalignments. This will be discussed in class after you have some results in hand.

a. Paste your grouped FASTA sequences into the text box. For your first run, use the defaults. The alignments will take a few minutes. You may want to enter your e-mail address to retrieve the report. If your run fails, first check that the FASTA format is correct and that there are no spaces at the left-hand side of any line. If a run seems to take too long, try "off-hours", keeping in mind that this is a European site, or try making your alignment request smaller. You can do this by selecting only 4-6 sequences. Alternatively, you may want to focus on just one region or domain of your sequences. In that case, you can select portions of the FASTA reports. After using the defaults, try changing some of the settings after reading about them in the support pages. For report viewing, see 5 below.

b. Select the sequences you want to compare by checking the boxes. Choose ClustalW. Initially accept the defaults. On repeated runs, try changing some of the settings. [Go to the EBI site for documentation support.] Try running subgroups of sequences. You can also create edited sequences to select a region or to remove nonstandard characters. To save alignments, select "Import alignments". Then you can use the Alignment Tools. [See 6 below. You should skip 5.]
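Making a smaller request, whether by choosing a subgroup of sequences or by cutting out one region, can be done by hand in your log or with a short script. The following is a sketch under stated assumptions: all function names are hypothetical, the three `demo` entries are invented placeholders, and the region coordinates are per sequence (1-based, inclusive), so you would need to look up the right coordinates for each sequence rather than reuse one pair as this toy example does.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines between entries are allowed
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def select_records(records, keywords):
    """Keep records whose header contains any of the keywords."""
    return [(h, s) for h, s in records if any(k in h for k in keywords)]


def trim_region(record, start, end):
    """Cut one record down to residues start..end (1-based, inclusive)."""
    header, seq = record
    return (f"{header} [{start}-{end}]", seq[start - 1:end])


def to_fasta(records, width=60):
    """Write records back out as FASTA text with fixed-width sequence lines."""
    lines = []
    for header, seq in records:
        lines.append(">" + header)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Invented placeholder entries, not real sequences.
demo = ">seqA demo\nMKVLAGTW\nQQRS\n>seqB demo\nMKILAGSW\n>seqC demo\nMRVLA\n"
records = parse_fasta(demo)
subset = select_records(records, ["seqA", "seqC"])
region = [trim_region(r, 2, 5) for r in subset]
print(to_fasta(region))
```

The output can be pasted into the ClustalW text box or uploaded into Biology Workbench like any other FASTA file.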


5. For option a above only. [Option b, skip to 6 below.] Once you have the report, browse to see what you have. Click on Jalview for a graphical display. Wait for the calculations and color assignment to be complete before trying to navigate. For your convenience, consensus notations and colors used in Jalview are assigned as follows:

Consensus line notations:
* = identical or conserved residues in all sequences in the alignment
: = conserved substitutions
. = semi-conserved substitutions

Characteristics:                         Amino acids:

red: small & hydrophobic R groups        AVFPMILW
blue: acidic                             DE
magenta: basic                           RHK
green: hydroxyl + X                      STYHCNGQ
gray: other

Symbols for amino acids
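To see how a consensus line like the one described above can be derived from the aligned rows, here is a minimal sketch. The "strong" and "weak" residue groups are the ones we understand the Clustal documentation to use for the ":" and "." symbols; treat them, the blank symbol for any column containing a gap, and the function names as assumptions rather than a description of ClustalW's actual code.

```python
# Residue groups assumed from the Clustal documentation.
STRONG = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND", "SNDEQK",
        "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def column_symbol(column):
    """Return '*', ':', '.' or ' ' for one alignment column."""
    residues = {r.upper() for r in column}
    if "-" in residues:
        return " "                      # assumption: gaps leave the column blank
    if len(residues) == 1:
        return "*"                      # identical in every sequence
    if any(residues <= set(group) for group in STRONG):
        return ":"                      # conserved substitution
    if any(residues <= set(group) for group in WEAK):
        return "."                      # semi-conserved substitution
    return " "


def consensus_line(aligned):
    """Build the consensus line for a list of equal-length aligned rows."""
    if len({len(row) for row in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    return "".join(column_symbol(col) for col in zip(*aligned))


# Toy alignment (invented rows) to show each symbol.
rows = ["MKVSAWG",
        "MRISCDG",
        "MKLTSP-"]
for row in rows:
    print(row)
print(consensus_line(rows))
```

Running the same function over your own alignment rows lets you compare its output with the consensus line in the ClustalW report.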

Compare the results of your different runs. Which parameters did you change? What was the effect? Record for future reference. Upload some alignments into Biology Workbench.

6. At this point, everyone should have some aligned sequences in Biology Workbench. To check, select Alignment Tools after selecting the appropriate session. You should see blocks of sequences listed. If that is not true, go back and continue working on alignments and/or uploads until you do.

a. Use Boxshade and Textshade to easily view conserved and non-conserved regions. Note that their displays are similar, but not identical, to Jalview's. The output can be saved and used as graphic inserts in reports and manuscripts. Use one of these tools to browse your alignments and to make comparisons between your different alignments.

b. Make note of conserved regions. Hypothesize the reason they are conserved. How can you test your hypothesis?
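Conserved regions can also be located numerically rather than by eye. The sketch below is one simple way to do it, assuming conservation is measured as the fraction of sequences sharing the most common residue in a column; the function names, the threshold, and the minimum block length are illustrative choices, and the three rows are invented.

```python
from collections import Counter


def column_conservation(aligned):
    """Fraction of sequences sharing the most common residue in each column."""
    scores = []
    for column in zip(*aligned):
        residues = [r for r in column if r != "-"]
        if not residues:
            scores.append(0.0)          # column made only of gaps
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


def conserved_blocks(aligned, threshold=1.0, min_length=3):
    """Return (start, end) column ranges, 1-based inclusive, in which every
    column scores at least `threshold` over at least `min_length` columns."""
    blocks, start = [], None
    scores = column_conservation(aligned)
    for i, score in enumerate(scores + [-1.0]):   # sentinel closes the last run
        if score >= threshold and start is None:
            start = i
        elif score < threshold and start is not None:
            if i - start >= min_length:
                blocks.append((start + 1, i))
            start = None
    return blocks


# Toy alignment (invented rows).
rows = ["MKVLAGTWQ",
        "MKVLSGTWQ",
        "MKVLA-TWQ"]
print(conserved_blocks(rows))                  # fully conserved blocks
print(conserved_blocks(rows, threshold=0.6))   # looser definition
```

Lowering the threshold merges blocks separated by only a few variable columns, which is worth keeping in mind when you hypothesize about why a region is conserved.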

7. So you tried using different combinations of subgroups of sequences as a way to cross-check your alignments. What else can you do? You can compare some select pairwise alignments to see how they compare to your MSAs. Use either BLAST2 [BL2SEQ in Biology Workbench] or ALIGN [in Workbench under Nucleotide Tools and Protein Tools] to compare pairs of sequences. Try pairs that appear to be closely related as well as pairs that are distantly related. Do your alignments agree?
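To make the idea of a pairwise alignment concrete, here is a minimal sketch of a global alignment by dynamic programming (the Needleman-Wunsch method). It is not the code behind BLAST2 or ALIGN: it assumes a simple match/mismatch score and a linear gap penalty instead of a substitution matrix, and the function names and the two short sequences are invented for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align strings a and b; return (aligned_a, aligned_b, score)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[-1][-1]


def percent_identity(row_a, row_b):
    """Percent of aligned columns (ignoring gap-gap columns) that are identical."""
    pairs = [(x, y) for x, y in zip(row_a, row_b) if not (x == "-" and y == "-")]
    same = sum(1 for x, y in pairs if x == y and x != "-")
    return 100.0 * same / len(pairs) if pairs else 0.0


aligned_a, aligned_b, total = needleman_wunsch("MKVLAGTW", "MKLAGTW")
print(aligned_a)
print(aligned_b)
print("score:", total, " identity:", percent_identity(aligned_a, aligned_b))
```

`percent_identity` also accepts two rows taken from an MSA, so the same pair of sequences can be compared in both settings.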

Summary Questions:

Try to limit your answers to 1-2 pages [12 pt font]. This length should be sufficient for your comments and any appropriate copy/pasted examples. [You need not retype or copy/paste the questions as part of your responses.]

Summarize one of your MSA results. Give the following information:

a. What group of protein or nucleotide sequences did you select?

b. Were mutations evenly distributed or were there regions relatively free of them? What might be the reason for conserved regions? How could you test this?

c. What differences did you notice when you changed the matrices selected for alignment?

d. Did running only selected subgroups of sequences affect the alignment results? If so, how?

e. How did pairwise alignments compare to the multiple alignment?



Updated 09/11/2003 by bsc@classroomtools.com, thatcher@sonoma.edu