Computational predictions of the functional impact of genetic variation play a critical role in human genetics research. For nonsynonymous coding variants, most prediction algorithms make use of patterns of amino acid substitutions observed among homologous proteins at a given site. In particular, substitutions observed in orthologous proteins from other species are often assumed to be tolerated in the human protein as well. We examined this assumption by evaluating a panel of nonsynonymous mutants of a prototypical human enzyme, methylenetetrahydrofolate reductase (MTHFR), in a yeast cell-based functional assay. As expected, substitutions in human MTHFR at sites that are well-conserved across distant orthologs result in an impaired enzyme, while substitutions present in recently diverged sequences (including a 9-site mutant that “resurrects” the human-macaque ancestor) result in a functional enzyme. We also interrogated 30 sites with varying degrees of conservation by creating substitutions in the human enzyme that are accepted in at least one ortholog of MTHFR. Quite surprisingly, most of these substitutions were deleterious to the human enzyme. The results suggest that selective constraints vary between phylogenetic lineages such that inclusion of distant orthologs to infer selective pressures on the human enzyme may be misleading. We propose that homologous proteins are best used to reconstruct ancestral sequences and infer amino acid conservation among only direct lineal ancestors of a particular protein. We show that such an “ancestral site preservation” measure outperforms other prediction methods, not only in our selected set for MTHFR, but also in an exhaustive set of E. coli LacI mutants.
The rapid pace of technological advances in DNA sequencing methods is leading to the discovery of genetic variants at a remarkable rate. Indeed, it is conceivable that entire individual genomes will be sequenced routinely in the near future. While these platforms greatly increase our ability to catalog variation, they also create a downstream need to efficiently process and filter this information in order to identify the genetic causes underlying human disease. Because empirical evaluation of the biological effects of mutation is not practical at such a scale, computational methods that predict such effects are needed. In this paper, we describe a novel methodology to predict whether mutations that lead to amino acid substitutions in proteins will impact protein function and, therefore, may be more likely to have physiological consequences. Specifically, we use orthologous proteins to reconstruct the likely sequences of ancestral proteins in the human lineage. We find that the longer a position has been preserved among direct ancestors in the lineage leading to the human enzyme, the more likely it is that a mutation at that site will have a deleterious effect. Our results indicate that the method should be generally applicable to all proteins.
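The idea of scoring a site by how long it has been preserved among direct ancestors can be illustrated with a minimal sketch. It assumes the ancestral sequences have already been reconstructed, aligned to the human sequence, and ordered from the most recent ancestor to the most ancient. The function name `preservation_depth` and the toy sequences are hypothetical and invented for illustration. The measure actually used in this work may be defined differently, for example in units of evolutionary time rather than a count of ancestral nodes.

```python
from typing import List, Sequence


def preservation_depth(human: str, ancestors: Sequence[str]) -> List[int]:
    """For each aligned site, count how many consecutive direct ancestors
    (most recent first) carry the same residue as the human sequence.

    Assumption: `ancestors` holds reconstructed sequences on the direct
    lineage only, already aligned to `human` with no length differences.
    """
    for anc in ancestors:
        if len(anc) != len(human):
            raise ValueError("all sequences must be aligned to the same length")

    depths = []
    for i, residue in enumerate(human):
        depth = 0
        for anc in ancestors:
            if anc[i] != residue:
                break  # the residue changed here; older ancestors are ignored
            depth += 1
        depths.append(depth)
    return depths


# Toy data, invented for illustration (not real MTHFR sequences).
human = "MKVLA"
ancestors = [
    "MKVLA",  # most recent reconstructed ancestor
    "MKVIA",
    "MRVIA",  # most ancient reconstructed ancestor
]
print(preservation_depth(human, ancestors))  # [3, 2, 3, 1, 3]
```

Under this sketch, a higher depth means the human residue has been held fixed further back along the direct lineage, so a substitution at that site would be predicted as more likely to be deleterious. Sites with a low depth, such as the fourth position in the toy example, changed recently and would be predicted as more tolerant.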