Model selection for mixtures of mutagenetic trees

Junming Yin, Niko Beerenwinkel, Jörg Rahnenführer, Thomas Lengauer

Research output: Contribution to journal › Article

8 Citations (Scopus)

Abstract

The evolution of drug resistance in HIV is characterized by the accumulation of resistance-associated mutations in the HIV genome. Mutagenetic trees, a family of restricted Bayesian tree models, have been applied to infer the order and rate of occurrence of these mutations. Understanding and predicting this evolutionary process is an important prerequisite for the rational design of antiretroviral therapies. In practice, mixture models of K mutagenetic trees provide more flexibility and are often more appropriate for modelling observed mutational patterns. Here, we investigate the model selection problem for K-mutagenetic trees mixture models. We evaluate several classical model selection criteria including cross-validation, the Bayesian Information Criterion (BIC), and the Akaike Information Criterion (AIC). We also use the empirical Bayes method by constructing a prior probability distribution for the parameters of a mutagenetic trees mixture model and deriving the posterior probability of the model. In addition to the model dimension, we consider the redundancy of a mixture model, which is measured by comparing the topologies of trees within a mixture model. Based on the redundancy, we propose a new model selection criterion, which is a modification of the BIC. Experimental results on simulated and on real HIV data show that the classical criteria tend to select models with far too many tree components. Only cross-validation and the modified BIC recover the correct number of trees and the tree topologies most of the time. At the same optimal performance, the runtime of the new BIC modification is about one order of magnitude lower. Thus, this model selection criterion can also be used for large data sets for which cross-validation becomes computationally infeasible.
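
As a point of reference, the two penalised-likelihood criteria named in the abstract have standard textbook forms: AIC = 2d - 2 ln L and BIC = d ln N - 2 ln L, where ln L is the maximised log-likelihood, d the number of free parameters, and N the number of samples. The following is a minimal sketch of choosing the number of tree components K by minimising either score. The function names, log-likelihoods, parameter counts, and sample size are all hypothetical and chosen only for illustration; the sketch does not implement the modified BIC proposed in the article, whose redundancy term is not specified in this abstract.

```python
import math


def aic(log_lik: float, n_params: int) -> float:
    """Akaike Information Criterion, 2d - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_lik


def bic(log_lik: float, n_params: int, n_samples: int) -> float:
    """Bayesian Information Criterion, d ln N - 2 ln L (lower is better)."""
    return n_params * math.log(n_samples) - 2 * log_lik


def select_k(candidates, n_samples, criterion="bic"):
    """Pick the number of components K with the smallest score.

    candidates maps K -> (maximised log-likelihood, number of free parameters).
    Returns the chosen K and the dictionary of all scores.
    """
    scores = {}
    for k, (log_lik, n_params) in candidates.items():
        if criterion == "bic":
            scores[k] = bic(log_lik, n_params, n_samples)
        else:
            scores[k] = aic(log_lik, n_params)
    best_k = min(scores, key=scores.get)
    return best_k, scores


# Hypothetical fits for K = 1..4 (made-up values, not taken from the article).
candidates = {
    1: (-520.0, 6),
    2: (-480.0, 13),
    3: (-472.0, 20),
    4: (-470.0, 27),
}
N = 200  # hypothetical number of observed mutational patterns

best_bic, _ = select_k(candidates, N, criterion="bic")
best_aic, _ = select_k(candidates, N, criterion="aic")
print("K chosen by BIC:", best_bic)
print("K chosen by AIC:", best_aic)
```

With these invented numbers BIC settles on fewer components than AIC, because its per-parameter penalty ln N exceeds AIC's constant 2 once N is larger than about 7. This illustrates only the generic behaviour of the two penalties and says nothing about how either criterion performs on the article's data.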

Original language: English (US)
Article number: 17
Journal: Statistical Applications in Genetics and Molecular Biology
Volume: 5
Issue number: 1
DOI: 10.2202/1544-6115.1164
State: Published - Jan 1 2006
Externally published: Yes

Fingerprint

Model Selection
Bayesian Information Criterion
Mixture Model
Model Selection Criteria
Patient Selection
Cross-validation
K-tree
HIV
Redundancy
Mutation
Empirical Bayes Method
Topology
Drug Resistance
Mutation Rate
Prior Probability
Pedigree
Akaike Information Criterion
Posterior Probability
Prior distribution
Large Data Sets

Keywords

  • BIC
  • Empirical bayes
  • Mixtures of mutagenetic trees
  • Model selection

ASJC Scopus subject areas

  • Statistics and Probability
  • Molecular Biology
  • Genetics
  • Computational Mathematics

Cite this

Model selection for mixtures of mutagenetic trees. / Yin, Junming; Beerenwinkel, Niko; Rahnenführer, Jörg; Lengauer, Thomas.

In: Statistical Applications in Genetics and Molecular Biology, Vol. 5, No. 1, 17, 01.01.2006.

@article{50f782ece4e54813ad05ba12de47109a,
title = "Model selection for mixtures of mutagenetic trees",
keywords = "BIC, Empirical bayes, Mixtures of mutagenetic trees, Model selection",
author = "Junming Yin and Niko Beerenwinkel and J{\"o}rg Rahnenf{\"u}hrer and Thomas Lengauer",
year = "2006",
month = "1",
day = "1",
doi = "10.2202/1544-6115.1164",
language = "English (US)",
volume = "5",
journal = "Statistical Applications in Genetics and Molecular Biology",
issn = "1544-6115",
publisher = "Berkeley Electronic Press",
number = "1",

}

TY - JOUR

T1 - Model selection for mixtures of mutagenetic trees

AU - Yin, Junming

AU - Beerenwinkel, Niko

AU - Rahnenführer, Jörg

AU - Lengauer, Thomas

PY - 2006/1/1

Y1 - 2006/1/1

KW - BIC

KW - Empirical bayes

KW - Mixtures of mutagenetic trees

KW - Model selection

UR - http://www.scopus.com/inward/record.url?scp=85045416681&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85045416681&partnerID=8YFLogxK

U2 - 10.2202/1544-6115.1164

DO - 10.2202/1544-6115.1164

M3 - Article

VL - 5

JO - Statistical Applications in Genetics and Molecular Biology

JF - Statistical Applications in Genetics and Molecular Biology

SN - 1544-6115

IS - 1

M1 - 17

ER -