Document Type


Publication Date


Publication Title

BMC Bioinformatics


Geisel School of Medicine


Background: Trait heterogeneity, which exists when a trait has been defined with insufficient specificity such that it is actually two or more distinct traits, has been implicated as a confounding factor in traditional statistical genetics of complex human disease. In the absence of detailed phenotypic data collected consistently in combination with genetic data, unsupervised computational methodologies offer the potential for discovering underlying trait heterogeneity. The performance of three such methods appropriate for categorical data (Bayesian Classification, Hypergraph-Based Clustering, and Fuzzy k-Modes Clustering) was compared. Also tested was the ability of these methods to detect trait heterogeneity in the presence of locus heterogeneity and/or gene-gene interaction, which are two other complicating factors in discovering genetic models of complex human disease. To determine the efficacy of applying the Bayesian Classification method to real data, the reliability of its internal clustering metrics at finding good clusterings was evaluated using permutation testing.

Results: Bayesian Classification outperformed the other two methods, with the exception that Fuzzy k-Modes Clustering performed best on the most complex genetic model. Bayesian Classification achieved excellent recovery for 75% of the datasets simulated under the simplest genetic model, while it achieved moderate recovery for 56% of datasets with a sample size of 500 or more (across all simulated models) and for 86% of datasets with 10 or fewer nonfunctional loci (across all simulated models). Neither Hypergraph Clustering nor Fuzzy k-Modes Clustering achieved good or excellent cluster recovery for a majority of datasets even under a restricted set of conditions. When using the average log of class strength as the internal clustering metric, the false positive rate was controlled very well, at three percent or less for all three significance levels (0.01, 0.05, 0.10), and the false negative rate was acceptably low (18 percent) for the least stringent significance level of 0.10.

Conclusion: Bayesian Classification shows promise as an unsupervised computational method for dissecting trait heterogeneity in genotypic data. Its control of false positive and false negative rates lends confidence to the validity of its results. Further investigation of how different parameter settings may improve the performance of Bayesian Classification, especially under more complex genetic models, is ongoing.
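The permutation-testing idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure. The abstract does not specify the permutation scheme, so the sketch assumes the null distribution is built by shuffling each locus independently across individuals. It also substitutes a simple stand-in metric (`top_profile_share`) for the average log of class strength, which is specific to Bayesian Classification. All function names and the toy data are hypothetical.

```python
import random
from collections import Counter


def permute_columns(data, rng):
    """Shuffle each locus (column) independently across individuals (rows)."""
    columns = [list(col) for col in zip(*data)]
    for col in columns:
        rng.shuffle(col)
    return [list(row) for row in zip(*columns)]


def top_profile_share(data):
    """Stand-in metric (NOT the paper's): share of individuals carrying the
    most common multilocus genotype. Higher means more apparent structure."""
    counts = Counter(tuple(row) for row in data)
    return counts.most_common(1)[0][1] / len(data)


def permutation_p_value(data, metric, n_perm=200, seed=0):
    """Empirical p-value: how often permuted data score at least as high."""
    rng = random.Random(seed)
    observed = metric(data)
    null_scores = [metric(permute_columns(data, rng)) for _ in range(n_perm)]
    at_least = sum(1 for score in null_scores if score >= observed)
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least + 1) / (n_perm + 1)


# Toy data (assumed): two hidden subgroups with different genotypes at 4 loci.
genotypes = [[0, 0, 0, 0]] * 20 + [[2, 2, 2, 2]] * 20
p = permutation_p_value(genotypes, top_profile_share)
print(f"empirical p-value: {p:.4f}")
```

Under this scheme, a clustering would be declared significant when the empirical p-value falls below a chosen level such as 0.01, 0.05, or 0.10. Repeating the test on datasets simulated with and without true structure would give false negative and false positive rates of the kind reported above.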



Original Citation

Thornton-Wells TA, Moore JH, Haines JL. Dissecting trait heterogeneity: a comparison of three clustering methods applied to genotypic data. BMC Bioinformatics. 2006 Apr 12;7:204. doi: 10.1186/1471-2105-7-204. PMID: 16611359; PMCID: PMC1525209.