Exercise 2: Protein Secondary Structure Prediction
By Michael Terribilini
Objectives:
1. Understand why we would want to predict protein
secondary structure.
2. Understand the basics behind predicting secondary
structure from amino acid sequence.
3. Learn how to use some online tools for predicting
protein secondary structure and how to analyze the
results.
Reading:
Gibas and Jambeck, pp. 274-280
Why predict secondary structure?
As we know, the structure of a protein determines its
function. If we could determine the structure adopted by any
given amino acid sequence, we could at least guess at the
function it performs. Predicting structure from sequence is
one of the most important goals of bioinformatics-based
research. Unfortunately, it is a very difficult problem.
Many research groups have decided to focus on a simplified
version of the problem: predicting secondary structure from
sequence.
As a step towards 3D structure prediction, secondary
structure predictions can be used to identify the fold of a
protein or the general motif. If the secondary structure is
known, the 3D structure problem becomes one of arranging the
known secondary structure elements into the correct 3D
structure. Some of the other uses of secondary structure
prediction are fold recognition, genome annotation, and
predicting regions of a protein that are likely to undergo
structural changes.
History of secondary structure prediction:
There have been three "generations" of methods developed
to predict secondary structure. The first generation used
physicochemical principles, expert rules, and statistics to
predict secondary structure. These methods, developed in the
1970s, considered only a single amino acid at a time and
predicted about 50% of residues correctly.
The second generation methods improved by looking at a
window of adjacent amino acids. They used the local sequence
information to predict the secondary structure state of the
amino acid in the center of a sliding window that walked
through the entire sequence. These methods were developed
through the 1980s and into the 1990s and were accurate for
about 60% of residues.
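The sliding-window idea can be sketched in a few lines of Python. This is a minimal illustration, not any published second generation method: the function names, the padding character, and the toy_classify rule (which calls the center residue helix when at least half of the window is A, E, L, or M) are all assumptions made up for this example. A real method would replace toy_classify with statistics or a trained model.

```python
def sliding_windows(sequence, width=7, pad="-"):
    """Yield (center_index, window) for every residue in the sequence.

    Both ends are padded so that each residue sits at the center of a
    full-width window. The width must be odd.
    """
    if width % 2 == 0:
        raise ValueError("width must be odd so the window has a center")
    half = width // 2
    padded = pad * half + sequence + pad * half
    for i in range(len(sequence)):
        yield i, padded[i:i + width]


def toy_classify(window):
    """Placeholder rule, NOT a real predictor (assumption for illustration).

    Returns "H" (helix) if at least half of the window positions are
    A, E, L or M, otherwise "L" (loop).
    """
    helix_like = sum(residue in "AELM" for residue in window)
    return "H" if helix_like * 2 >= len(window) else "L"


def predict(sequence, classify, width=7):
    """Predict one state per residue from the window centered on it."""
    return "".join(
        classify(window) for _, window in sliding_windows(sequence, width)
    )


if __name__ == "__main__":
    # Made-up sequence, used only to show the mechanics.
    print(predict("AELMAGGPGG", toy_classify, width=5))
```

Note that the state assigned to each residue depends on its neighbors, which is the key difference from the first generation methods.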
The third generation methods use multiple sequence
alignments (MSAs) as inputs instead of a single amino acid
sequence. They take advantage of the evolutionary
information contained in the MSAs to increase the accuracy
of predictions to about 75%. The latest advance in the third
generation methods is to use PSI-BLAST to search for the
sequences used to generate the MSA.
Issues with accuracy:
Now that we have mentioned the accuracy of prediction
methods, we must explain what it means. All of the secondary
structure prediction methods predict each amino acid of a
protein sequence to be in one of three states: helix,
strand, or loop. This is a simplification of the actual
secondary structure states observed in protein structures,
which include different types of helices and strands. As an
example, a 3-10 helix would be considered correctly
predicted if the method assigned it to the helix state.
The methods ignore the distinction between a 3-10 helix and
an alpha helix.
The most widely used measure of accuracy is called Q3. It
is calculated by summing the number of residues correctly
predicted and dividing by the total number of residues. Most
methods use Q3 exclusively to report their accuracy. A newer
measure of accuracy is called segment overlap (SOV).
It attempts to quantify how accurately a method predicts
each secondary structure element, taking into account the
starting and ending residues of each element and its
length. For example, suppose a method predicts all helix
for a sequence that actually contains two helices linked by
a short loop. Under Q3, the only penalty is for the wrongly
predicted loop residues. SOV would also penalize the method
for predicting only one secondary structure element when
there are two helices, and for missing the correct ending
location of the first helix and the correct starting
location of the second helix.
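The Q3 calculation and the two-helix example can be made concrete with a short Python sketch. The state letters (H for helix, E for strand, L for loop), the function names, and the 20-residue strings are assumptions chosen for illustration. The segments function is not an SOV implementation; it only lists the secondary structure elements so that the information Q3 ignores becomes visible.

```python
from itertools import groupby


def q3(observed, predicted):
    """Fraction of residues whose predicted state matches the observed one."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if not observed:
        raise ValueError("cannot score an empty sequence")
    correct = sum(o == p for o, p in zip(observed, predicted))
    return correct / len(observed)


def segments(states):
    """Return (state, start, end) for each run of identical states.

    Positions are 0-based and the end is inclusive.
    """
    result = []
    position = 0
    for state, run in groupby(states):
        length = len(list(run))
        result.append((state, position, position + length - 1))
        position += length
    return result


if __name__ == "__main__":
    # Made-up example: two helices joined by a three-residue loop,
    # with every residue predicted as helix.
    observed = "HHHHHHHH" + "LLL" + "HHHHHHHHH"
    predicted = "H" * len(observed)

    print("Q3:", q3(observed, predicted))
    print("observed elements: ", segments(observed))
    print("predicted elements:", segments(predicted))
```

Here only the 3 loop residues out of 20 are wrong, so Q3 is 17/20 = 0.85, even though the prediction reports one element where the structure has three.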
Of course, accuracy of predictions can only be measured
when the structure of the protein being predicted is known.
One problem with this is that the methods used to assign
secondary structure to known structures disagree. Two of the
most widely used methods for assigning secondary structure
to known structures, DSSP and STRIDE, agree on the secondary
structure state of 95% of residues. When a third method,
DEFINE, is also considered, the three methods agree on only
71% of residues (Cuff and Barton, 1999). This shows that
assigning secondary structure state to an amino acid is not
exact, whether you are assigning that state in a prediction
or from a known structure.
PHD: An example of a protein secondary structure
prediction method.
Most of the current best prediction methods are based on
neural networks. We will examine one of these called PHD.
The original paper on PHD describes the algorithm:
Burkhard Rost and Chris Sander. Prediction of
Protein Secondary Structure at Better than 70% Accuracy.
Journal of Molecular Biology 1993, 232:584-599.
Exercise:
It is possible for the same short amino acid sequence to
be in different secondary structure states in different
proteins. These are known as "chameleon" sequences because
they can adopt different conformations. We will test the
accuracy of some secondary structure prediction methods on
some of these chameleon sequences.
1. Obtain the following sequences:
1ak5
3cox
1gcu
1bgl
1efv
1p04
Note: These are PDB identifiers, so they should be easy
to find.
2. Go to the PHD server.
3. Click on the link to submit requests to
META-PP.
4. Enter your email address, a description of the
protein, and the sequence.
5. Click the checkboxes for each of the secondary
structure servers listed. There should be seven of them
listed.
6. Click on the Submit/Run Prediction button at
the bottom of the screen.
7. Your results should arrive in your email
shortly, usually about 5 to 10 minutes.
8. Now it is time to analyze the results. Start by
locating the actual structure information for each of the
six proteins in PDB. Now you can compare the "real"
secondary structure with what was returned by the
prediction methods. The six proteins form three pairs, and
each pair shares a short sequence that has a different
secondary structure in the two proteins. We will check how
well the methods did in predicting these regions that take
on more than one secondary structure.
Locate the following short sequences:
Sequence   First protein (start position)   Second protein (start position)
RVPALV     1ak5 (125)                       3cox (7)
LITTAHA    1gcu (121)                       1bgl (833)
LLPRVA     1efv (100)                       1p04 (75)
Record the actual secondary structure for these short
sequences and the predicted secondary structure. How well
were these regions predicted?
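If you would rather not count residues by hand, the comparison can be sketched in Python. Everything in this example is hypothetical: the function names, the short toy sequence, and the state strings are made up to show the mechanics and are not data from the six proteins. The sketch also assumes that the sequence, the observed states, and the predicted states are strings of equal length using the same numbering. Residue numbers in a PDB file do not always match positions in the sequence string, so check the position reported by find_motif against the start positions listed above.

```python
def find_motif(sequence, motif):
    """Return the 1-based start position of every occurrence of motif."""
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start + 1)
        start = sequence.find(motif, start + 1)
    return positions


def region(states, start, length):
    """Extract `length` states beginning at 1-based position `start`."""
    return states[start - 1:start - 1 + length]


if __name__ == "__main__":
    # Toy data for illustration only (H = helix, E = strand, L = loop).
    toy_sequence = "MKTRVPALVGG"
    toy_observed = "LLLHHHHHHLL"
    toy_predicted = "LLLEEEEEELL"
    motif = "RVPALV"

    for start in find_motif(toy_sequence, motif):
        print("motif starts at position", start)
        print("observed: ", region(toy_observed, start, len(motif)))
        print("predicted:", region(toy_predicted, start, len(motif)))
```

Paste in the real sequence, the secondary structure from PDB, and each server's prediction in place of the toy strings to tabulate your results.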
Questions
1. List the results from the predictions of the
chameleon sequences. How accurate were the predictions? Did
all of the prediction methods you tried perform about the
same or were some better than others?
2. Do you think a protein's sequence contains all
of the information necessary to specify structure? Which do
you think has more influence on structure: the sequence or
the cellular processes used to transform a sequence into a
folded structure?
3. Using PSI-BLAST to generate the sequences used
for the MSA input into the secondary structure prediction
algorithm has been shown to increase the accuracy of the
prediction. Why do you think using PSI-BLAST would be better
than using BLAST?