Document Type

Technical Report

Publication Date


Technical Report Number



Evolutionary pressures on proteins to maintain structure and function have constrained their sequences over time and across species. The sequence record thus contains valuable information regarding the acceptable variation and covariation of amino acids in members of a protein family. When designing new members of a protein family, with an eye toward modified or improved stability or functionality, it is incumbent upon a protein engineer to uncover such constraints and design conforming sequences. This paper develops such an approach for protein design: we first mine an undirected probabilistic graphical model of a given protein family, and then use the model generatively to sample new sequences. While sampling from an undirected model is difficult in general, we present two complementary algorithms that effectively sample the sequence space constrained by our protein family model. One algorithm focuses on the high-likelihood regions of the space. Sequences are generated by sampling the cliques in a graphical model according to their likelihood while maintaining neighborhood consistency. The other algorithm designs a fixed number of high-likelihood sequences that are reflective of the amino acid composition of the given family. A set of shuffled sequences is iteratively improved so as to increase their mean likelihood under the model. Tests for two important protein families, WW domains and PDZ domains, show that both sampling methods converge quickly and generate diverse high-quality sets of sequences for further biological study.


Submitted to KDD 2007.