STBase: One million species trees for comparative biology

Michelle M McMahon, Akshay Deepak, David Fernández-Baca, Darren Boss, Michael Sanderson

Research output: Contribution to journal (Article)

5 Citations (Scopus)

Abstract

Comprehensively sampled phylogenetic trees provide the most compelling foundations for strong inferences in comparative evolutionary biology. Mismatches are common, however, between the taxa for which comparative data are available and the taxa sampled by published phylogenetic analyses. Moreover, many published phylogenies are gene trees, which cannot always be adapted immediately for species level comparisons because of discordance, gene duplication, and other confounding biological processes. A new database, STBase, lets comparative biologists quickly retrieve species level phylogenetic hypotheses in response to a query list of species names. The database consists of 1 million single- and multi-locus data sets, each with a confidence set of 1000 putative species trees, computed from GenBank sequence data for 413,000 eukaryotic taxa. Two bodies of theoretical work are leveraged to aid in the assembly of multi-locus concatenated data sets for species tree construction. First, multiply labeled gene trees are pruned to conflict-free singly-labeled species-level trees that can be combined between loci. Second, impacts of missing data in multi-locus data sets are ameliorated by assembling only decisive data sets. Data sets overlapping with the user's query are ranked using a scheme that depends on user-provided weights for tree quality and for taxonomic overlap of the tree with the query. Retrieval times are independent of the size of the database, typically a few seconds. Tree quality is assessed by a real-time evaluation of bootstrap support on just the overlapping subtree. Associated sequence alignments, tree files and metadata can be downloaded for subsequent analysis. STBase provides a tool for comparative biologists interested in exploiting the most relevant sequence data available for the taxa of interest. It may also serve as a prototype for future species tree oriented databases and as a resource for assembly of larger species phylogenies from precomputed trees.
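
The abstract says that overlapping data sets are ranked by user-provided weights for tree quality and for taxonomic overlap, but it does not give the scoring formula. The sketch below is one possible reading, not STBase's actual method: it assumes a simple weighted sum of a support value in [0, 1] and the fraction of query taxa covered. The names `DataSet`, `overlap_fraction`, `rank_data_sets`, `w_quality` and `w_overlap` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSet:
    """Hypothetical record for one precomputed data set."""
    name: str
    taxa: frozenset   # species names sampled by this data set
    support: float    # assumed: mean bootstrap support in [0, 1] on the overlapping subtree


def overlap_fraction(query, taxa):
    """Fraction of the query's species that the data set samples."""
    query = set(query)
    if not query:
        return 0.0
    return len(query & taxa) / len(query)


def rank_data_sets(query, data_sets, w_quality=0.5, w_overlap=0.5):
    """Rank data sets by an assumed weighted sum of quality and overlap.

    Data sets that share no species with the query are skipped.
    Returns a list of (score, name) pairs, best first.
    """
    scored = []
    for ds in data_sets:
        overlap = overlap_fraction(query, ds.taxa)
        if overlap == 0.0:
            continue
        score = w_quality * ds.support + w_overlap * overlap
        scored.append((score, ds.name))
    # Sort by descending score; break ties by name for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored


query = ["A", "B", "C", "D"]
candidates = [
    DataSet("well_supported_small", frozenset({"A", "B"}), 0.9),
    DataSet("weakly_supported_full", frozenset({"A", "B", "C", "D"}), 0.5),
    DataSet("unrelated", frozenset({"X"}), 1.0),
]
balanced = rank_data_sets(query, candidates)
quality_only = rank_data_sets(query, candidates, w_quality=1.0, w_overlap=0.0)
```

Under these assumptions, shifting weight toward quality favours the small, well-supported data set, while balanced weights favour the one that covers the whole query.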

Original language: English (US)
Article number: e0117987
Journal: PLoS One
Volume: 10
Issue number: 2
DOI: 10.1371/journal.pone.0117987
State: Published - Feb 13 2015

Fingerprint

Biological Sciences
Genes
loci
Databases
phylogeny
Metadata
Phylogeny
biologists
Biological Phenomena
Gene Duplication
Sequence Alignment
Nucleic Acid Databases
sequence alignment
gene duplication
prototypes
Names
genes
Datasets
Weights and Measures

ASJC Scopus subject areas

  • Agricultural and Biological Sciences(all)
  • Biochemistry, Genetics and Molecular Biology(all)
  • Medicine(all)

Cite this

STBase: One million species trees for comparative biology. / McMahon, Michelle M; Deepak, Akshay; Fernández-Baca, David; Boss, Darren; Sanderson, Michael.

In: PLoS One, Vol. 10, No. 2, e0117987, 13.02.2015.

McMahon, Michelle M; Deepak, Akshay; Fernández-Baca, David; Boss, Darren; Sanderson, Michael. / STBase: One million species trees for comparative biology. In: PLoS One. 2015; Vol. 10, No. 2.
@article{4a0588dc3c1f4d46ab2388a502766db3,
title = "STBase: One million species trees for comparative biology",
abstract = "Comprehensively sampled phylogenetic trees provide the most compelling foundations for strong inferences in comparative evolutionary biology. Mismatches are common, however, between the taxa for which comparative data are available and the taxa sampled by published phylogenetic analyses. Moreover, many published phylogenies are gene trees, which cannot always be adapted immediately for species level comparisons because of discordance, gene duplication, and other confounding biological processes. A new database, STBase, lets comparative biologists quickly retrieve species level phylogenetic hypotheses in response to a query list of species names. The database consists of 1 million single- and multi-locus data sets, each with a confidence set of 1000 putative species trees, computed from GenBank sequence data for 413,000 eukaryotic taxa. Two bodies of theoretical work are leveraged to aid in the assembly of multi-locus concatenated data sets for species tree construction. First, multiply labeled gene trees are pruned to conflict-free singly-labeled species-level trees that can be combined between loci. Second, impacts of missing data in multi-locus data sets are ameliorated by assembling only decisive data sets. Data sets overlapping with the user's query are ranked using a scheme that depends on user-provided weights for tree quality and for taxonomic overlap of the tree with the query. Retrieval times are independent of the size of the database, typically a few seconds. Tree quality is assessed by a real-time evaluation of bootstrap support on just the overlapping subtree. Associated sequence alignments, tree files and metadata can be downloaded for subsequent analysis. STBase provides a tool for comparative biologists interested in exploiting the most relevant sequence data available for the taxa of interest. It may also serve as a prototype for future species tree oriented databases and as a resource for assembly of larger species phylogenies from precomputed trees.",
author = "McMahon, {Michelle M} and Akshay Deepak and David Fern{\'a}ndez-Baca and Darren Boss and Michael Sanderson",
year = "2015",
month = "2",
day = "13",
doi = "10.1371/journal.pone.0117987",
language = "English (US)",
volume = "10",
journal = "PLoS One",
issn = "1932-6203",
publisher = "Public Library of Science",
number = "2",

}

TY - JOUR

T1 - STBase

T2 - One million species trees for comparative biology

AU - McMahon, Michelle M

AU - Deepak, Akshay

AU - Fernández-Baca, David

AU - Boss, Darren

AU - Sanderson, Michael

PY - 2015/2/13

Y1 - 2015/2/13

N2 - Comprehensively sampled phylogenetic trees provide the most compelling foundations for strong inferences in comparative evolutionary biology. Mismatches are common, however, between the taxa for which comparative data are available and the taxa sampled by published phylogenetic analyses. Moreover, many published phylogenies are gene trees, which cannot always be adapted immediately for species level comparisons because of discordance, gene duplication, and other confounding biological processes. A new database, STBase, lets comparative biologists quickly retrieve species level phylogenetic hypotheses in response to a query list of species names. The database consists of 1 million single- and multi-locus data sets, each with a confidence set of 1000 putative species trees, computed from GenBank sequence data for 413,000 eukaryotic taxa. Two bodies of theoretical work are leveraged to aid in the assembly of multi-locus concatenated data sets for species tree construction. First, multiply labeled gene trees are pruned to conflict-free singly-labeled species-level trees that can be combined between loci. Second, impacts of missing data in multi-locus data sets are ameliorated by assembling only decisive data sets. Data sets overlapping with the user's query are ranked using a scheme that depends on user-provided weights for tree quality and for taxonomic overlap of the tree with the query. Retrieval times are independent of the size of the database, typically a few seconds. Tree quality is assessed by a real-time evaluation of bootstrap support on just the overlapping subtree. Associated sequence alignments, tree files and metadata can be downloaded for subsequent analysis. STBase provides a tool for comparative biologists interested in exploiting the most relevant sequence data available for the taxa of interest. It may also serve as a prototype for future species tree oriented databases and as a resource for assembly of larger species phylogenies from precomputed trees.

UR - http://www.scopus.com/inward/record.url?scp=84922986944&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84922986944&partnerID=8YFLogxK

U2 - 10.1371/journal.pone.0117987

DO - 10.1371/journal.pone.0117987

M3 - Article

C2 - 25679219

AN - SCOPUS:84922986944

VL - 10

JO - PLoS One

JF - PLoS One

SN - 1932-6203

IS - 2

M1 - e0117987

ER -