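Genome projects describe linear amino acid sequences, but a deeper understanding
of the biological role of a protein requires knowledge of its structure and
function. Although experimental methods can resolve high-resolution structures
for some proteins, computational structure prediction will remain very
important: it will provide valuable information for the structural study of most
proteins, especially those whose structures cannot be determined experimentally.
The first class of protein structure prediction methods, which includes
threading and comparative modeling, relies on structural knowledge and
information already present in databases. The second class consists of de novo
prediction methods, which compute structure directly from sequence. In this
review, the authors describe the key features of these methods, their accuracy,
and their application to structure prediction and functional studies for single
proteins and whole genomes. Finally, the authors discuss the importance of
protein structure prediction methods for structural genomics research worldwide.
The body of the article follows.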
By David Baker and Andrej Sali
Modeling of a sequence based on known
structures consists of four steps: finding known structures related to the
sequence to be modeled (i.e., templates), aligning the sequence with the
templates, building a model, and assessing the model (1). The templates for
modeling may be found by sequence comparison methods, such as PSI-BLAST (2), or
by sequence-structure threading methods (3) that can sometimes reveal more
distant relationships than purely sequence-based methods. In the latter case,
fold assignment and alignment are achieved by threading the sequence through
each of the structures in a library of all known folds. Each sequence-structure
alignment is assessed by the energy of a corresponding coarse model, not by
sequence similarity as in sequence comparison methods.
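The threading idea described above can be illustrated with a deliberately small sketch. Everything in it is an assumption made for illustration: the two-state buried/exposed environment strings, the hydrophobic residue set, the unit scores, and the names `threading_energy` and `best_fold` are hypothetical. Real threading programs use gapped alignments and much richer energy terms.

```python
# Toy gapless threading: score a sequence against every fold in a small library
# by a coarse "energy", not by sequence similarity. All values are hypothetical.
HYDROPHOBIC = set("AVILMFWC")  # assumed hydrophobic residue set


def threading_energy(sequence, environments):
    """Coarse energy of placing `sequence` onto a stretch of fold positions.

    `environments` is a string of 'B' (buried) or 'E' (exposed), one per position.
    A hydrophobic residue in a buried position, or a polar residue in an exposed
    position, scores -1; any other pairing scores +1. Lower is better.
    """
    if len(sequence) != len(environments):
        raise ValueError("sequence and environments must have equal length")
    energy = 0
    for residue, env in zip(sequence, environments):
        buried = env == "B"
        hydrophobic = residue in HYDROPHOBIC
        energy += -1 if buried == hydrophobic else 1
    return energy


def best_fold(sequence, library):
    """Thread `sequence` through every fold and offset; return the lowest energy hit.

    Returns a tuple (energy, fold_name, offset), or None if no fold is long enough.
    """
    hits = []
    for name, envs in library.items():
        for offset in range(len(envs) - len(sequence) + 1):
            window = envs[offset:offset + len(sequence)]
            hits.append((threading_energy(sequence, window), name, offset))
    return min(hits) if hits else None


if __name__ == "__main__":
    library = {"foldA": "BEBEBE", "foldB": "EEBBEE"}  # hypothetical fold library
    print(best_fold("LKVEIK", library))
```

In this toy case the alternating hydrophobic/polar sequence fits the alternating buried/exposed pattern of `foldA`, so that fold is assigned even though no sequence comparison was performed.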
Comparative structure prediction
produces an all-atom model of a sequence, based on its alignment to one or more
related protein structures. Comparative model building includes either
sequential or simultaneous modeling of the core of the protein, loops, and side
chains. In the original comparative approach, a model is constructed from a few
template core regions and from loops and side chains obtained from either
aligned or unrelated structures (4-6). Another family of comparative methods
relies on approximate positions of conserved atoms from the templates to
calculate the coordinates of other atoms (7). A third group of methods uses
either distance geometry or optimization techniques to satisfy spatial
restraints obtained from the sequence-template alignment (8-10). There are also
many methods that specialize in the modeling of loops (11) and side chains (12)
within the restrained environment provided by the rest of the structure.
De novo Structure Prediction
Although comparative modeling is
limited to protein families with at least one known structure, de novo structure
prediction has no such limitation. De novo methods start from the assumption
that the native state of a protein is at the global free energy minimum and
carry out a large-scale search of conformational space for protein tertiary
structures that are particularly low in free energy for the given amino acid
sequence. The two key components of such methods are the procedure for
efficiently carrying out the conformational search and the free energy function
used for evaluating possible conformations. To allow rapid and efficient
searching of conformational space, often only a subset of the atoms in the
protein chain is represented explicitly; the potential functions must then
include terms that reflect the averaged-out effects of the omitted atoms and
solvent molecules.
Recently, there have been a number of
promising advances in de novo structure prediction (13-16). A particularly
successful method, called Rosetta, is based on a picture of protein folding in
which short segments of the protein chain flicker between different local
structures consistent with their local sequence, and folding to the native state
occurs when these local segments are oriented such that low free energy
interactions are made throughout the protein (17). In simulating this process,
each short segment is allowed to sample the local structures adopted by the
sequence segment in known protein structures, and a search is carried out
through the combinations of these local structures for compact tertiary
structures that bury the hydrophobic residues and pair the β-strands. This
strategy resolves some of the problems with both the conformational search and
the free energy function: The search is greatly accelerated because switching
between different possible local structures can occur in a single step, and
fewer demands are placed on the free energy function because the use of
fragments of known structures ensures that the local interactions are close to
optimal.
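The fragment-insertion search described above can be sketched as a small Monte Carlo loop. This is not Rosetta's implementation: the torsion-angle representation, the Metropolis acceptance rule, the two-fragment library, and the energy function `toy_energy` are all assumptions chosen to keep the example short and runnable.

```python
import math
import random

EXTENDED = (-150.0, 150.0)  # assumed (phi, psi) of the extended starting chain


def toy_energy(torsions):
    """Hypothetical score: one penalty unit per residue whose phi is non-helical."""
    return float(sum(1 for phi, _ in torsions if not -90.0 <= phi <= -30.0))


def toy_library(n_residues, frag_len=3):
    """Hypothetical fragment library: a helical and a strand fragment per window."""
    helix = [(-60.0, -45.0)] * frag_len
    strand = [(-120.0, 130.0)] * frag_len
    return {start: [helix, strand] for start in range(n_residues - frag_len + 1)}


def fragment_assembly(n_residues, fragments, energy, steps=1000, kT=1.0, seed=0):
    """Monte Carlo fragment insertion; returns (best_torsions, best_energy).

    Each move replaces the torsions of one window with a fragment drawn from the
    library for that window, so a whole local structure switches in one step.
    """
    rng = random.Random(seed)
    current = [EXTENDED] * n_residues
    current_e = energy(current)
    best, best_e = list(current), current_e
    starts = sorted(fragments)
    for _ in range(steps):
        start = rng.choice(starts)
        frag = rng.choice(fragments[start])
        trial = current[:start] + list(frag) + current[start + len(frag):]
        trial_e = energy(trial)
        delta = trial_e - current_e
        # Metropolis criterion: always accept downhill, sometimes accept uphill.
        if delta <= 0 or rng.random() < math.exp(-delta / kT):
            current, current_e = trial, trial_e
            if current_e < best_e:
                best, best_e = list(current), current_e
    return best, best_e


if __name__ == "__main__":
    model, score = fragment_assembly(9, toy_library(9), toy_energy, steps=500)
    print(score, model)
```

Because every trial conformation is built from library fragments, the energy function never has to judge local geometry on its own, which mirrors the point made in the text about reduced demands on the free energy function.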
Accuracy and Applications of Models
The accuracy of a comparative model is
related to the percentage sequence identity on which it is based, correlating
with the relationship between the structural and sequence similarity of two
proteins (Fig. 1) (1, 18, 19). High-accuracy comparative models are based on
more than 50% sequence identity to their templates. They tend to have about 1 Å
root mean square (RMS) error for the main-chain atoms, which is comparable to
the accuracy of a medium-resolution nuclear magnetic resonance (NMR) structure
or a low-resolution x-ray structure. The errors are mostly mistakes in
side-chain packing, small shifts or distortions of the core main-chain regions,
and occasionally larger errors in loops. Medium-accuracy comparative models are
based on 30 to 50% sequence identity. They tend to have about 90% of the
main-chain modeled with 1.5 Å RMS error. There are more frequent side-chain
packing, core distortion, and loop modeling errors, and there are occasional
alignment mistakes (18). Finally, low-accuracy comparative models are based on
less than 30% sequence identity. The alignment errors increase rapidly below 30%
sequence identity and become the most substantial origin of errors in
comparative models. In addition, when a model is based on an almost
insignificant alignment to a known structure, it may also have an entirely
incorrect fold. Accuracies of the best model building methods are relatively
similar when used optimally (19, 20). Other factors such as template selection
and alignment accuracy usually have a larger impact on the model accuracy,
especially for models based on less than 40% sequence identity to the templates.
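The accuracy tiers above can be written down as a small helper. This is a sketch under stated assumptions: identity is computed over aligned, non-gap positions (one of several conventions in use), the boundary values 30% and 50% are assigned to the medium tier, and `rms_error` assumes the two coordinate sets are already optimally superposed. All function names are hypothetical.

```python
import math


def percent_identity(aligned_target, aligned_template):
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_target, aligned_template)
             if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def accuracy_class(identity_percent):
    """Map sequence identity to the tiers used in the text."""
    if identity_percent > 50.0:
        return "high"      # about 1 Å main-chain RMS error
    if identity_percent >= 30.0:
        return "medium"    # about 90% of main chain within 1.5 Å RMS error
    return "low"           # alignment errors dominate


def rms_error(model_coords, native_coords):
    """RMS deviation in Å between equal-length lists of (x, y, z) tuples.

    Assumes the structures were superposed beforehand; no fitting is done here.
    """
    if len(model_coords) != len(native_coords) or not model_coords:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = sum((a - b) ** 2
                for p, q in zip(model_coords, native_coords)
                for a, b in zip(p, q))
    return math.sqrt(total / len(model_coords))


if __name__ == "__main__":
    identity = percent_identity("ACDE-G", "ACDQFG")
    print(identity, accuracy_class(identity))
```

A model whose alignment gives 80% identity falls in the high-accuracy tier, while one at 25% identity should be treated with the caution the text describes for alignment errors.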
There is a wide range of applications
of protein structure models (Figs. 1 and 2). For example, high- and
medium-accuracy comparative models frequently are helpful in refining functional
predictions that have been based on a sequence match alone because ligand
binding is more directly determined by the structure of the binding site than by
its sequence. It is often possible to correctly predict features of the target
protein that do not occur in the template structure. The size of a ligand may be
predicted from the volume of the binding site cleft (Fig. 2A). For example, the
complex between docosahexaenoic fatty acid and brain lipid-binding protein was
modeled on the basis of its 62% sequence identity to the crystallographic
structure of adipocyte lipid-binding protein (PDB code 1ADL) (21). A number of
fatty acids were ranked for their affinity to brain lipid-binding protein
consistently with site-directed mutagenesis and affinity chromatography
experiments, even though the ligand specificity profile of this protein is
different from that of the template structure. Another example is prediction of
a binding site for a charged ligand based on a cluster of charged residues on
the protein, as was done for mouse mast cell protease 7 (Fig. 2B) (22). The
prediction of a proteoglycan binding patch was confirmed by site-directed
mutagenesis and heparin-affinity chromatography experiments. Fortunately, errors
in the functionally important regions in comparative models are often
relatively low because the functional regions, such as active sites, tend to be
more conserved in evolution than the rest of the fold. The utility of
low-accuracy comparative models can be illustrated by a molecular model of the
whole yeast ribosome, whose construction was facilitated by fitting comparative
models of many ribosomal proteins into the electron microscopy map of the
ribosomal particle (23). This example also suggests that structural genomics of
single proteins or their domains, combined with protein structure prediction,
may contribute substantially to efficient structural characterization of large
macromolecular assemblies.
The accuracy and reliability of models
produced by de novo methods are much lower than those of comparative models based
on alignments with more than 30% sequence identity, but the basic topology of a
protein or domain can in some cases be predicted reasonably well (Fig. 1, D and
E). For roughly 40% of proteins shorter than 150 amino acids that have been
examined, one of the five most commonly recurring models generated by Rosetta
has sufficient global similarity to the true structure to recognize it in a
search of the protein structure database. Reasonable models can in some cases be
produced for domains of even very large proteins by using multiple sequence
alignments to identify domain boundaries (Fig. 1D).
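One way to pick the "most commonly recurring" models from a large set of predicted conformations is to cluster them by structural distance and keep the largest clusters. The greedy procedure below is an assumption for illustration, not a description of how Rosetta selects models; the distance matrix, the cutoff, and the name `top_recurring_models` are hypothetical.

```python
def top_recurring_models(dist, cutoff, n_top=5):
    """Greedy clustering of models by pairwise distance.

    `dist` is a symmetric matrix (list of lists) of structural distances, for
    example RMS deviations in Å. Repeatedly pick the model with the most
    neighbors within `cutoff` among the models not yet assigned, record it as a
    cluster center, and remove its neighbors. Ties go to the lowest index.
    Returns a list of (center_index, cluster_size), largest clusters first.
    """
    remaining = set(range(len(dist)))
    centers = []
    while remaining and len(centers) < n_top:
        def neighbors(i):
            return {j for j in remaining if dist[i][j] <= cutoff}

        center = max(sorted(remaining), key=lambda i: len(neighbors(i)))
        members = neighbors(center)
        centers.append((center, len(members)))
        remaining -= members
    return centers


if __name__ == "__main__":
    # Hypothetical one-dimensional stand-ins for six predicted models.
    positions = [0.0, 0.1, 0.2, 5.0, 5.1, 9.0]
    dist = [[abs(a - b) for b in positions] for a in positions]
    print(top_recurring_models(dist, cutoff=0.25))
```

The design choice here is that recurrence, rather than the lowest single energy value, decides which models are reported, in keeping with the text's reference to commonly recurring models.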
The accuracy of de novo models is too
low for problems requiring high-resolution structure information. Instead, the
low-resolution models produced by these methods can reveal structural and
functional relationships between proteins not apparent from their amino acid
sequences and provide a framework for analyzing spatial relationships between
evolutionarily conserved residues or between residues shown experimentally to be
functionally important. These applications are illustrated by examples from the
recent CASP4 blind protein structure prediction experiment (24, 25). The
predicted structure of a protein involved in cell lysis (26) was found to be
structurally related to a protein with a similar function but no significant
sequence similarity (Fig. 2B). The predicted structure of a domain of the
mismatch repair protein MutS (27) (Fig. 1D) has structural similarity to
proteins with related functions (28). Functionally important residues of the
signaling protein Frizzled (29) were clustered in the predicted structure in a
surface patch likely to be involved in a key protein-protein interaction (Fig.
2C). Thus, in favorable cases de novo predictions can provide some of the most
important functional insights obtainable from experimentally determined
structures.
Modeling on a Genomic Scale
Threading and comparative modeling
methods have already been applied on a genomic scale (18, 30, 31). In total,
domains in 58% of all 600,000 known protein sequences were modeled with ModPipe
(18) and MODELLER (9) and deposited into a comprehensive database of comparative
models, ModBase (32-34). The Web interface to the database allows flexible
querying for fold assignments, sequence-structure alignments, models, and model
assessments of interest. An integrated sequence/structure viewer, ModView,
allows inspection and analysis of the query results. ModBase will be
increasingly interlinked with other applications and databases such that
structures and other types of information can be easily used for functional
annotation. Although the current number of modeled proteins may look impressive
given the early stage of structural genomics, usually only one domain per
protein is modeled (on the average, proteins have slightly more than two
domains), and two-thirds of the models are based on less than 30% sequence
identity to the closest template.
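The caveat above can be made concrete with rough arithmetic. The calculation below uses only the figures quoted in the text plus two simplifying assumptions that are not in the source: exactly one model per covered sequence and exactly two domains per protein. The function name `modbase_coverage` is hypothetical, and the resulting fractions are illustrative estimates only.

```python
def modbase_coverage(known_sequences=600_000, modeled_percent=58,
                     domains_per_protein=2):
    """Back-of-the-envelope coverage estimate from the figures in the text.

    Assumes one modeled domain per covered sequence and a fixed number of
    domains per protein; both are simplifications.
    """
    modeled_sequences = known_sequences * modeled_percent // 100
    low_identity_models = modeled_sequences * 2 // 3   # below 30% identity
    higher_identity_models = modeled_sequences - low_identity_models
    total_domains = known_sequences * domains_per_protein
    return {
        "modeled_sequences": modeled_sequences,
        "low_identity_models": low_identity_models,
        "higher_identity_models": higher_identity_models,
        "domain_fraction_modeled": modeled_sequences / total_domains,
        "domain_fraction_at_30_or_more": higher_identity_models / total_domains,
    }


if __name__ == "__main__":
    for key, value in modbase_coverage().items():
        print(key, value)
```

Under these assumptions, roughly 348,000 sequences have a model, but that corresponds to about 29% of all domains, and only about a tenth of all domains are modeled at 30% identity or better.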
Automation and large-scale modeling
with de novo methods have lagged behind those of comparative modeling methods,
because of the relatively poor quality of the models produced and the relatively
large amount of computer time required. However, inspired by the potential for
functional insights, large-scale modeling calculations have been initiated with
Rosetta. In the first such project, models for representatives of all PFAM
families with less than 150 amino acids and currently not linked to proteins of
known structure have been produced. Strong structural similarities of these
models to structures of previously determined proteins can indicate previously
unidentified relationships that may provide functional insights. It should soon
be possible to extend these large-scale calculations to cover most of the
domains not represented in ModBase.
The Role of Protein Structure Prediction in Structural Genomics
Structural genomics aims to
structurally characterize most protein sequences by an efficient combination of
experiment and prediction (35-37). This aim will be achieved by careful
selection of target proteins and their structure determination by x-ray
crystallography or NMR spectroscopy. There are a variety of target selection
schemes (38), ranging from focusing on only novel folds to selecting all
proteins in a model genome. A model-centric view requires that targets be
selected such that most of the remaining sequences can be modeled with useful
accuracy by comparative modeling. Even with structural genomics, the structure
of most of the proteins will be modeled, not determined by experiment. As
discussed above, the accuracy of comparative models and correspondingly the
variety of their applications decrease sharply below the 30% sequence identity
cutoff, mainly as a result of a rapid increase in alignment errors. Thus, we
will need to determine protein structures so that most of the remaining
sequences are related to at least one known structure at higher than 30%
sequence identity (36, 37). It was recently estimated that this cutoff requires
a minimum of 16,000 targets to cover 90% of all protein domain families,
including those of membrane proteins (36). These 16,000 structures will allow
the modeling of a much larger number of proteins. For example, the New York
Structural Genomics Research Consortium measured the impact of its structures by
documenting the number and quality of the corresponding models for detectably
related proteins in the nonredundant sequence database. For each new structure,
on average, ~100 protein sequences without any prior structural characterization
could be modeled at least at the fold level (39). This large leverage of
structure determination by protein structure modeling illustrates and justifies
the premise of structural genomics.
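The model-centric view of target selection can be sketched as a covering problem: choose structures to determine so that every remaining sequence has a template above the 30% identity cutoff. The greedy heuristic below is an assumption for illustration; the text does not say which selection algorithm is used, and the names `build_coverage` and `select_targets` and the identity values are hypothetical.

```python
def build_coverage(names, identity, cutoff=30.0):
    """Map each sequence to the set of sequences it could serve as a template for.

    `identity` maps (name_a, name_b) pairs to percent sequence identity.
    A sequence always covers itself; pairs above `cutoff` cover each other.
    """
    coverage = {name: {name} for name in names}
    for (a, b), percent in identity.items():
        if percent > cutoff:
            coverage[a].add(b)
            coverage[b].add(a)
    return coverage


def select_targets(coverage):
    """Greedy set cover: repeatedly pick the sequence covering the most
    still-uncovered sequences. Ties go to the alphabetically first name."""
    uncovered = set(coverage)
    targets = []
    while uncovered:
        best = max(sorted(coverage), key=lambda s: len(coverage[s] & uncovered))
        targets.append(best)
        uncovered -= coverage[best]
    return targets


if __name__ == "__main__":
    names = ["A", "B", "C", "D", "E"]
    identity = {("A", "B"): 45.0, ("A", "C"): 35.0, ("B", "C"): 25.0,
                ("D", "E"): 60.0, ("C", "D"): 20.0}
    print(select_targets(build_coverage(names, identity)))
```

In this toy data set two experimental structures suffice to put all five sequences within modeling distance, which is a small-scale version of the leverage the text describes.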
De novo structure prediction will
contribute to structural genomics in several ways. Large-scale de novo
prediction can guide target selection by focusing experimental structure
determination on proteins likely to adopt novel folds. De novo methods should
also be useful in complementing comparative modeling methods by building
portions of proteins not present in template structures. In addition, de novo
methods supplemented by restraints from cross linking or other experiments can
provide models for proteins not readily amenable to x-ray crystallographic or
NMR analysis. Finally, large-scale de novo modeling may allow coarse
structure-based insights into protein function of a large number of proteins
well in advance of experimentally determined structures.
Conclusions
Improvement in the accuracy of models
produced by both de novo and comparative modeling approaches will require
methods that finely sample protein conformational space using a free energy or
scoring function that has sufficient accuracy to distinguish the native
structure from the nonnative conformations. Despite many years of development of
molecular simulation methods, attempts to refine models that are already
relatively close to the native structure have met with relatively little
success. This failure is likely to be due to inaccuracies in the potential
functions used in the simulations, particularly in the treatment of
electrostatics and solvation effects. Improvements in sampling strategies may
also be necessary, given the relatively long time scale of protein folding
(milliseconds to seconds). Combination of physical chemistry with the vast
amount of information in known protein structures may provide a route to
development of improved potential functions. The refinement of de novo and
comparative models provides a good test and application of the molecular
dynamics methods widely used to simulate biological macromolecules (40).
Automated methods for deducing function
from structure will be critical to obtaining functional insights from both
predicted and experimentally determined structures. Considerable insight can be
gained from structural comparison of a given structure with all other known
protein structures using methods such as Dali (41), which can frequently detect
structural relationships with functional significance that are not evident from
sequence comparisons. Also promising are methods that match a structure against
a library of structural motifs associated with different functions (42-44). For
higher resolution models produced by comparative modeling methods, functional
sites on proteins can potentially be identified and characterized by explicit
ligand docking calculations. Finally, large-scale protein-protein docking
calculations in years to come may contribute to the identification and
characterization of protein interaction networks.