Scaling of accuracy in extremely large phylogenetic trees.

O. R. Bininda-Emonds, S. G. Brady, J. Kim, Michael Sanderson

Research output: Chapter in Book/Report/Conference proceeding > Chapter

47 Citations (Scopus)

Abstract

The accuracy of phylogenetic inference was examined in simulated data sets up to nearly 10,000 taxa, the size of the largest set of homologous genes in existing molecular sequence databases. Even with a simple search algorithm (maximum parsimony without branch swapping), the number of characters needed to estimate 80% of a tree correctly can scale remarkably well at optimal substitution rates (on the order of log N, where N is the number of taxa). In other words, the number of taxa in an analysis can be doubled and only an arithmetic increase in the number of characters is required to maintain the same level of accuracy. Even substitution rates that are much higher than normally used in phylogenetic studies did not affect the scaling too adversely. However, scaling is usually worse than log N for more stringent levels of accuracy. Moreover, errors are not distributed randomly throughout the tree. Shallow nodes are remarkably easy to reconstruct and display favourable log-linear scaling. The deepest nodes are extremely difficult to reconstruct accurately, even with branch swapping, and the scaling is poor. Therefore, the strategy of sequencing large numbers of homologous genes may not always provide global solutions to extreme phylogenetic problems and alternative strategies may be required.
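The doubling claim in the abstract follows directly from logarithmic scaling: if the number of characters needed grows as a + b log N, then going from N to 2N taxa adds the fixed amount b log 2, whatever N is. The sketch below illustrates only that arithmetic. The function name `chars_needed` and the coefficients `A` and `B` are hypothetical placeholders chosen for illustration; they are not values reported in the chapter.

```python
import math


def chars_needed(n_taxa: int, a: float, b: float) -> float:
    """Hypothetical log-scaling model: characters = a + b * ln(N).

    This is an illustrative model of "on the order of log N" scaling,
    not the fitted relationship from the chapter.
    """
    if n_taxa < 2:
        raise ValueError("need at least two taxa")
    return a + b * math.log(n_taxa)


# Illustrative coefficients only; NOT estimates from the paper.
A, B = 100.0, 50.0

for n in (1250, 2500, 5000, 10000):
    base = chars_needed(n, A, B)
    # Extra characters required when the number of taxa is doubled.
    extra = chars_needed(2 * n, A, B) - base
    print(f"N={n:>6}: chars={base:8.1f}, extra for doubling={extra:.2f}")
```

Under this assumed model the "extra for doubling" column is the same on every row (B times ln 2), which is what an arithmetic increase per doubling means. Scaling worse than log N, as the abstract reports for stricter accuracy levels and for deep nodes, would make that column grow with N instead.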

Original language: English (US)
Title of host publication: Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
Pages: 547-558
Number of pages: 12
State: Published - 2001
Externally published: Yes

Fingerprint

Chemical Databases
Genes
Datasets

Cite this

Bininda-Emonds, O. R., Brady, S. G., Kim, J., & Sanderson, M. (2001). Scaling of accuracy in extremely large phylogenetic trees. In Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing (pp. 547-558)

@inbook{0375c613e56f4525a48cff7d93a89ad8,
title = "Scaling of accuracy in extremely large phylogenetic trees.",
abstract = "The accuracy of phylogenetic inference was examined in simulated data sets up to nearly 10,000 taxa, the size of the largest set of homologous genes in existing molecular sequence databases. Even with a simple search algorithm (maximum parsimony without branch swapping), the number of characters needed to estimate 80{\%} of a tree correctly can scale remarkably well at optimal substitution rates (on the order of log N, where N is the number of taxa). In other words, the number of taxa in an analysis can be doubled and only an arithmetic increase in the number of characters is required to maintain the same level of accuracy. Even substitution rates that are much higher than normally used in phylogenetic studies did not affect the scaling too adversely. However, scaling is usually worse than log N for more stringent levels of accuracy. Moreover, errors are not distributed randomly throughout the tree. Shallow nodes are remarkably easy to reconstruct and display favourable log-linear scaling. The deepest nodes are extremely difficult to reconstruct accurately, even with branch swapping, and the scaling is poor. Therefore, the strategy of sequencing large numbers of homologous genes may not always provide global solutions to extreme phylogenetic problems and alternative strategies may be required.",
author = "Bininda-Emonds, {O. R.} and Brady, {S. G.} and J. Kim and Michael Sanderson",
year = "2001",
language = "English (US)",
pages = "547--558",
booktitle = "Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing",

}

TY - CHAP

T1 - Scaling of accuracy in extremely large phylogenetic trees.

AU - Bininda-Emonds, O. R.

AU - Brady, S. G.

AU - Kim, J.

AU - Sanderson, Michael

PY - 2001

Y1 - 2001

AB - The accuracy of phylogenetic inference was examined in simulated data sets up to nearly 10,000 taxa, the size of the largest set of homologous genes in existing molecular sequence databases. Even with a simple search algorithm (maximum parsimony without branch swapping), the number of characters needed to estimate 80% of a tree correctly can scale remarkably well at optimal substitution rates (on the order of log N, where N is the number of taxa). In other words, the number of taxa in an analysis can be doubled and only an arithmetic increase in the number of characters is required to maintain the same level of accuracy. Even substitution rates that are much higher than normally used in phylogenetic studies did not affect the scaling too adversely. However, scaling is usually worse than log N for more stringent levels of accuracy. Moreover, errors are not distributed randomly throughout the tree. Shallow nodes are remarkably easy to reconstruct and display favourable log-linear scaling. The deepest nodes are extremely difficult to reconstruct accurately, even with branch swapping, and the scaling is poor. Therefore, the strategy of sequencing large numbers of homologous genes may not always provide global solutions to extreme phylogenetic problems and alternative strategies may be required.

UR - http://www.scopus.com/inward/record.url?scp=0035222013&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0035222013&partnerID=8YFLogxK

M3 - Chapter

C2 - 11262972

AN - SCOPUS:0035222013

SP - 547

EP - 558

BT - Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing

ER -