Bioinformatics Unit 5: Exercise




Exercise 2: Protein Secondary Structure Prediction

By Michael Terribilini


Objectives:

1. Understand why we would want to predict protein secondary structure.

2. Understand the basics behind predicting secondary structure from amino acid sequence.

3. Learn how to use some online tools for predicting protein secondary structure and how to analyze the results.

Reading:

Gibas and Jambeck pgs 274-280


Why predict secondary structure?

As we know, the structure of a protein determines its function. If we could determine the structure adopted by any given amino acid sequence, we could at least guess at the function it performs. Predicting structure from sequence is therefore one of the most important goals of bioinformatics research. Unfortunately, it is a very difficult problem, so many research groups have focused on a simplified version of it: predicting secondary structure from sequence.

As a step towards 3D structure prediction, secondary structure predictions can be used to identify the fold of a protein or the general motif. If the secondary structure is known, the 3D structure problem becomes one of arranging the known secondary structure elements into the correct 3D structure. Some of the other uses of secondary structure prediction are fold recognition, genome annotation, and predicting regions of a protein that are likely to undergo structural changes.

 

History of secondary structure prediction:

There have been three "generations" of methods developed to predict secondary structure. The first generation used physicochemical principles, expert rules, and statistics. These methods considered only a single amino acid at a time. They were developed in the 1970s and predicted about 50% of residues correctly.

The second generation methods improved by looking at a window of adjacent amino acids. They used the local sequence information to predict the secondary structure state of the amino acid in the center of a sliding window that walked through the entire sequence. These methods were developed through the 1980s and into the 1990s and were accurate for about 60% of residues.
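The sliding window idea can be sketched in a few lines of Python. This is only an illustration of how the windows are formed, not the scoring scheme of any particular published method. The function name, the window size, and the "-" padding character are assumptions made for the example.

```python
def sliding_windows(sequence, window_size=13, pad="-"):
    """Yield (center_residue, window) for every position in the sequence.

    The ends of the sequence are padded so that the first and last
    residues can also sit in the center of a full-length window.
    """
    if window_size % 2 == 0:
        raise ValueError("window_size must be odd so there is a central residue")
    half = window_size // 2
    padded = pad * half + sequence + pad * half
    for i, residue in enumerate(sequence):
        yield residue, padded[i:i + window_size]


# Hypothetical toy sequence, used only to show the windows.
for center, window in sliding_windows("MKTAYIAK", window_size=5):
    print(center, window)
```

A second generation method would pass each window to a statistical model or other predictor and assign the resulting state to the central residue only.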

The third generation methods use multiple sequence alignments (MSAs) as input instead of a single amino acid sequence. They take advantage of the evolutionary information contained in the MSAs to increase the accuracy of predictions to about 75%. The latest advance in the third generation methods is to use PSI-BLAST to search for the sequences used to build the MSA.
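One common way to represent the evolutionary information in an MSA is a per-column residue frequency profile. The sketch below shows that idea on a made-up alignment. It is an assumption for illustration and should not be read as the exact input encoding of any specific prediction server.

```python
from collections import Counter


def column_profile(alignment):
    """Return one {residue: frequency} dict per alignment column.

    Gap characters ("-") are ignored when computing frequencies.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        total = len(residues)
        counts = Counter(residues)
        profile.append({aa: n / total for aa, n in counts.items()} if total else {})
    return profile


# Hypothetical three-sequence alignment, not taken from a real protein family.
toy_alignment = ["MKV-A", "MRVLA", "MKIL-"]
for position, frequencies in enumerate(column_profile(toy_alignment), start=1):
    print(position, frequencies)
```

A fully conserved column (such as the first one here) carries a different signal than a variable column, which is information a single sequence cannot provide.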


Issues with accuracy:

Now that we have mentioned the accuracy of prediction methods, we must explain what it means. All of the secondary structure prediction methods assign each amino acid of a protein sequence to one of three states: helix, strand, or loop. This is a simplification of the secondary structure states actually observed in protein structures, which include different types of helices and strands. For example, a 3-10 helix is considered correctly predicted if the method assigns it to the helix state; the methods ignore the distinction between a 3-10 helix and an alpha helix.
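The reduction to three states can be sketched as a simple lookup. The example assumes DSSP-style one-letter codes (H, G, I for alpha, 3-10, and pi helices; E and B for strand and isolated bridge; everything else treated as loop) and uses H, E, and L as the three output letters. Other reduction schemes are in use, so treat this mapping as one possible choice.

```python
# Assumed mapping from DSSP-style codes to three states; other schemes exist.
EIGHT_TO_THREE = {
    "H": "H",  # alpha helix -> helix
    "G": "H",  # 3-10 helix  -> helix
    "I": "H",  # pi helix    -> helix
    "E": "E",  # extended strand -> strand
    "B": "E",  # isolated bridge -> strand
}


def reduce_to_three_states(assignment):
    """Map a per-residue assignment string to helix (H), strand (E), or loop (L)."""
    return "".join(EIGHT_TO_THREE.get(code, "L") for code in assignment)


# Hypothetical assignment string; the leading GGG is a 3-10 helix.
print(reduce_to_three_states("GGGHHHHTTEEEBS-"))
```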

The most widely used measure of accuracy is called Q3. It is calculated by dividing the number of correctly predicted residues by the total number of residues. Most methods report their accuracy using Q3 exclusively. A newer measure is segment overlap (SOV), which attempts to quantify how accurately a method predicts each secondary structure element. It takes into account the starting and ending residues of each element as well as its length. Consider a sequence that contains two helices linked by a short loop, and a method that predicts the entire sequence as helix. Under Q3, the only penalty is for the wrongly predicted loop residues. SOV would also penalize the method for predicting one secondary structure element where there are two, and for missing the correct end of the first helix and the correct start of the second.
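The two-helix example can be worked through in Python. The sketch computes Q3 as defined above and lists the segments in each string. It does not compute the SOV score itself; the segment listing only shows the kind of information SOV takes into account. The state strings are made up for the example.

```python
from itertools import groupby


def q3(observed, predicted):
    """Fraction of residues whose predicted state matches the observed state."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if not observed:
        raise ValueError("cannot score an empty sequence")
    correct = sum(o == p for o, p in zip(observed, predicted))
    return correct / len(observed)


def segments(states):
    """Collapse a state string into (state, start, end) runs, 0-based inclusive."""
    runs = []
    start = 0
    for state, group in groupby(states):
        length = len(list(group))
        runs.append((state, start, start + length - 1))
        start += length
    return runs


# Two six-residue helices joined by a three-residue loop, predicted as all helix.
observed = "HHHHHHLLLHHHHHH"
predicted = "H" * len(observed)

print(q3(observed, predicted))  # 12 of 15 residues match
print(segments(observed))       # three segments
print(segments(predicted))      # one segment
```

Q3 is 0.8 here even though the prediction merges two helices into one, which is the weakness a segment-based measure is meant to expose.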

Of course, the accuracy of predictions can only be measured when the structure of the protein being predicted is known. One problem is that the methods used to assign secondary structure to known structures disagree with each other. Two of the most widely used assignment methods, DSSP and STRIDE, agree on the secondary structure state of 95% of residues. When a third method, DEFINE, is also considered, the three methods agree for only 71% of residues (Cuff and Barton, 1999). This shows that assigning a secondary structure state to an amino acid is not exact, whether the state is assigned in a prediction or from a known structure.
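Agreement between assignment methods can be measured the same way for two or more methods: count the positions where every method gives the same state. The sketch below uses three made-up assignment strings. They are hypothetical and are not real DSSP, STRIDE, or DEFINE output.

```python
def agreement(*assignments):
    """Fraction of positions at which all given assignment strings agree."""
    if not assignments or not assignments[0]:
        raise ValueError("need at least one non-empty assignment")
    length = len(assignments[0])
    if any(len(a) != length for a in assignments):
        raise ValueError("all assignments must have the same length")
    agree = sum(len(set(column)) == 1 for column in zip(*assignments))
    return agree / length


# Hypothetical three-state assignments for the same ten residues.
method_a = "HHHHLLEEEE"
method_b = "HHHLLLEEEE"
method_c = "HHHHLLLEEE"

print(agreement(method_a, method_b))            # pairwise agreement
print(agreement(method_a, method_b, method_c))  # three-way agreement
```

Adding a method can only keep the agreement the same or lower it, because a position counts only if every method assigns the same state.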

 

PHD: An example of a protein secondary structure prediction method.

Most of the current best prediction methods are based on neural networks. We will examine one of these, called PHD. The original paper describing the PHD algorithm is:

Burkhard Rost and Chris Sander. Prediction of Protein Secondary Structure at Better than 70% Accuracy. Journal of Molecular Biology 1993, 232:584-599

Exercise:

It is possible for the same short amino acid sequence to be in different secondary structure states in different proteins. These are known as "chameleon" sequences because they can adopt different conformations. We will test the accuracy of some secondary structure prediction methods on some of these chameleon sequences.

1. Obtain the following sequences:

1ak5
3cox
1gcu
1bgl
1efv
1p04

Note: These are PDB identifiers, so they should be easy to find.
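If you prefer to fetch the sequences with a script rather than through the PDB web site, something like the following may work. The URL pattern is an assumption about the current RCSB PDB FASTA service and may change, so check the PDB site if a request fails. The function names are made up for this sketch. An entry may return several records, one per distinct chain.

```python
from urllib.request import urlopen

# Assumed URL pattern for the RCSB PDB FASTA download; verify before relying on it.
FASTA_URL = "https://www.rcsb.org/fasta/entry/{pdb_id}"


def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def fetch_pdb_sequences(pdb_id):
    """Download and parse the FASTA records for one PDB identifier."""
    with urlopen(FASTA_URL.format(pdb_id=pdb_id.upper())) as response:
        return parse_fasta(response.read().decode("utf-8"))
```

For example, `fetch_pdb_sequences("1ak5")` would return the header and sequence of each chain record, provided the assumed URL is still valid.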

2. Go to the PHD server.

3. Click on the link to submit requests to META-PP.

4. Enter your email address, a description of the protein, and the sequence.

5. Click the checkboxes for each of the secondary structure servers listed. There should be seven of them listed.

6. Click on the Submit/Run Prediction button at the bottom of the screen.

7. Your results should arrive in your email shortly, usually about 5 to 10 minutes.

8. Now it is time to analyze the results. Start by locating the actual structure information for each of the six proteins in the PDB. Now you can compare the "real" secondary structure with what was returned by the prediction methods. The six proteins form three pairs, and each pair shares a short sequence that has a different secondary structure in the two proteins. We will check how well the methods did in predicting these regions that take on more than one secondary structure.

Locate the following short sequences:

RVPALV: 1ak5 starting at position 125; 3cox starting at position 7

LITTAHA: 1gcu starting at position 121; 1bgl starting at position 833

LLPRVA: 1efv starting at position 100; 1p04 starting at position 75

Record the actual secondary structure for these short sequences and the predicted secondary structure. How well were these regions predicted?
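The bookkeeping for this step can be sketched as follows. The code searches for the short sequence rather than trusting a fixed position, since the numbering in a PDB file does not always match the index of a residue in the sequence string you downloaded. The sequence and state strings in the example are hypothetical placeholders; substitute the real sequence, the observed states, and each server's prediction.

```python
def find_motif(sequence, motif):
    """Return every 1-based start position of motif in sequence."""
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start + 1)
        start = sequence.find(motif, start + 1)
    return positions


def compare_region(sequence, motif, observed, predicted):
    """For each motif occurrence, return (start, observed, predicted, fraction correct)."""
    results = []
    for start in find_motif(sequence, motif):
        begin, end = start - 1, start - 1 + len(motif)
        obs, pred = observed[begin:end], predicted[begin:end]
        matches = sum(o == p for o, p in zip(obs, pred))
        results.append((start, obs, pred, matches / len(motif)))
    return results


# Hypothetical data: a made-up sequence containing RVPALV and made-up states.
sequence = "MKTRVPALVGG"
observed = "LLLEEEEEELL"
predicted = "LLLHHHHEELL"

for start, obs, pred, fraction in compare_region(sequence, "RVPALV", observed, predicted):
    print(start, obs, pred, round(fraction, 2))
```

Running this once per protein and per prediction method gives a small table of observed versus predicted states for each chameleon region.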

 

Questions

1. List the results from the predictions of the chameleon sequences. How accurate were the predictions? Did all of the prediction methods you tried perform about the same or were some better than others?

2. Do you think a protein's sequence contains all of the information necessary to specify its structure? Which do you think has more influence on structure: the sequence, or the cellular processes that turn a sequence into a folded structure?

3. Using PSI-BLAST to generate the sequences used for the MSA input into the secondary structure prediction algorithm has been shown to increase the accuracy of the prediction. Why do you think using PSI-BLAST would be better than using BLAST?



Updated 11/14/02 by thatcher@sonoma.edu