Missing data and the design of phylogenetic analyses

Research output: Contribution to journal › Article

307 Citations (Scopus)

Abstract

Concerns about the deleterious effects of missing data may often determine which characters and taxa are included in phylogenetic analyses. For example, researchers may exclude taxa lacking data for some genes or exclude a gene lacking data in some taxa. Yet, there may be very little evidence to support these decisions. In this paper, I review the effects of missing data on phylogenetic analyses. Recent simulations suggest that highly incomplete taxa can be accurately placed in phylogenies, as long as many characters have been sampled overall. Furthermore, adding incomplete taxa can dramatically improve results in some cases by subdividing misleading long branches. Adding characters with missing data can also improve accuracy, although there is a risk of long-branch attraction in some cases. Consideration of how missing data does (or does not) affect phylogenetic analyses may allow researchers to design studies that can reconstruct large phylogenies quickly, economically, and accurately.

Original language: English (US)
Pages (from-to): 34-42
Number of pages: 9
Journal: Journal of Biomedical Informatics
Volume: 39
Issue number: 1 SPEC. ISS.
DOI: 10.1016/j.jbi.2005.04.001
State: Published - Feb 2006
Externally published: Yes

Fingerprint

Phylogeny
Genes
Research Personnel

Keywords

  • Accuracy
  • Bayesian analysis
  • Maximum likelihood
  • Missing data
  • Neighbor-joining
  • Parsimony
  • Phylogenetic method
  • Phylogeny
  • Systematics

ASJC Scopus subject areas

  • Computer Science Applications
  • Health Informatics
  • Computer Science (miscellaneous)
  • Catalysis

Cite this

Missing data and the design of phylogenetic analyses. / Wiens, John J.

In: Journal of Biomedical Informatics, Vol. 39, No. 1 SPEC. ISS., 02.2006, p. 34-42.

@article{b2760aa5ba3f4380968a4301a35f8be4,
title = "Missing data and the design of phylogenetic analyses",
abstract = "Concerns about the deleterious effects of missing data may often determine which characters and taxa are included in phylogenetic analyses. For example, researchers may exclude taxa lacking data for some genes or exclude a gene lacking data in some taxa. Yet, there may be very little evidence to support these decisions. In this paper, I review the effects of missing data on phylogenetic analyses. Recent simulations suggest that highly incomplete taxa can be accurately placed in phylogenies, as long as many characters have been sampled overall. Furthermore, adding incomplete taxa can dramatically improve results in some cases by subdividing misleading long branches. Adding characters with missing data can also improve accuracy, although there is a risk of long-branch attraction in some cases. Consideration of how missing data does (or does not) affect phylogenetic analyses may allow researchers to design studies that can reconstruct large phylogenies quickly, economically, and accurately.",
keywords = "Accuracy, Bayesian analysis, Maximum likelihood, Missing data, Neighbor-joining, Parsimony, Phylogenetic method, Phylogeny, Systematics",
author = "Wiens, {John J}",
year = "2006",
month = "2",
doi = "10.1016/j.jbi.2005.04.001",
language = "English (US)",
volume = "39",
pages = "34--42",
journal = "Journal of Biomedical Informatics",
issn = "1532-0464",
publisher = "Academic Press Inc.",
number = "1 SPEC. ISS.",
}

TY - JOUR
T1 - Missing data and the design of phylogenetic analyses
AU - Wiens, John J
PY - 2006/2
Y1 - 2006/2
N2 - Concerns about the deleterious effects of missing data may often determine which characters and taxa are included in phylogenetic analyses. For example, researchers may exclude taxa lacking data for some genes or exclude a gene lacking data in some taxa. Yet, there may be very little evidence to support these decisions. In this paper, I review the effects of missing data on phylogenetic analyses. Recent simulations suggest that highly incomplete taxa can be accurately placed in phylogenies, as long as many characters have been sampled overall. Furthermore, adding incomplete taxa can dramatically improve results in some cases by subdividing misleading long branches. Adding characters with missing data can also improve accuracy, although there is a risk of long-branch attraction in some cases. Consideration of how missing data does (or does not) affect phylogenetic analyses may allow researchers to design studies that can reconstruct large phylogenies quickly, economically, and accurately.

KW - Accuracy
KW - Bayesian analysis
KW - Maximum likelihood
KW - Missing data
KW - Neighbor-joining
KW - Parsimony
KW - Phylogenetic method
KW - Phylogeny
KW - Systematics
UR - http://www.scopus.com/inward/record.url?scp=31344471014&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=31344471014&partnerID=8YFLogxK
U2 - 10.1016/j.jbi.2005.04.001
DO - 10.1016/j.jbi.2005.04.001
M3 - Article
C2 - 15922672
AN - SCOPUS:31344471014
VL - 39
SP - 34
EP - 42
JO - Journal of Biomedical Informatics
JF - Journal of Biomedical Informatics
SN - 1532-0464
IS - 1 SPEC. ISS.
ER -