The PhyLoTA Browser: Processing GenBank for molecular phylogenetics research

Michael Sanderson, Darren Boss, Duhong Chen, Karen A. Cranston, Andre Wehe

Research output: Contribution to journal › Article

90 Citations (Scopus)

Abstract

As an archive of sequence data for over 165,000 species, GenBank is an indispensable resource for phylogenetic inference. Here we describe an informatics processing pipeline and online database, the PhyLoTA Browser (http://loco.biosci.arizona.edu/pb), which offers a view of GenBank tailored for molecular phylogenetics. The first release of the Browser is computed from 2.6 million sequences representing the taxonomically enriched subset of GenBank sequences for eukaryotes (excluding most genome survey sequences, ESTs, and other high-throughput data). In addition to summarizing sequence diversity and species diversity across nodes in the NCBI taxonomy, it reports 87,000 potentially phylogenetically informative clusters of homologous sequences, which can be viewed or downloaded, along with provisional alignments and coarse phylogenetic trees. At each node in the NCBI hierarchy, the user can display a "data availability matrix" of all available sequences for entries in a subtaxa-by-clusters matrix. This matrix provides a guidepost for subsequent assembly of multigene data sets or supertrees. The database allows for comparison of results from previous GenBank releases, highlighting recent additions of either sequences or taxa to GenBank and letting investigators track progress on data availability worldwide. Although the reported alignments and trees are extremely approximate, the database reports several statistics correlated with alignment quality to help users choose from alternative data sources.
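The "data availability matrix" described in the abstract can be pictured with a small sketch. The abstract does not describe the Browser's actual schema or code, so everything here is an assumption made for illustration: the function name `availability_matrix`, the taxon and cluster labels, and the premise that each sequence has already been assigned to one subtaxon and one cluster. Each cell holds the number of sequences available for that subtaxon in that cluster.

```python
from collections import defaultdict


def availability_matrix(records):
    """Build a subtaxa-by-clusters count matrix.

    `records` is an iterable of (subtaxon, cluster_id) pairs, one pair per
    sequence. This input shape is hypothetical, not the PhyLoTA schema.
    Returns (subtaxa, clusters, rows), where rows[i][j] is the number of
    sequences for subtaxa[i] in clusters[j].
    """
    counts = defaultdict(int)
    for subtaxon, cluster_id in records:
        counts[(subtaxon, cluster_id)] += 1

    subtaxa = sorted({taxon for taxon, _ in counts})
    clusters = sorted({cluster for _, cluster in counts})
    rows = [[counts.get((t, c), 0) for c in clusters] for t in subtaxa]
    return subtaxa, clusters, rows


# Hypothetical toy input: one (subtaxon, cluster) pair per sequence.
records = [
    ("Genus_a", "cluster_1"),
    ("Genus_a", "cluster_1"),
    ("Genus_a", "cluster_2"),
    ("Genus_b", "cluster_2"),
    ("Genus_c", "cluster_1"),
    ("Genus_c", "cluster_3"),
]

subtaxa, clusters, rows = availability_matrix(records)

# Print the matrix as a tab-separated table with a header row.
print("subtaxon", *clusters, sep="\t")
for taxon, row in zip(subtaxa, rows):
    print(taxon, *row, sep="\t")
```

Read this way, a column with few zeros is a cluster sampled across most subtaxa, and a row with few zeros is a subtaxon with broad marker coverage. That is the kind of information one would consult when choosing loci and taxa for a multigene data set or supertree.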

Original language: English (US)
Pages (from-to): 335-346
Number of pages: 12
Journal: Systematic Biology
Volume: 57
Issue number: 3
DOI: 10.1080/10635150802158688
State: Published - Jun 2008

Fingerprint

Nucleic Acid Databases
phylogenetics
phylogeny
Research
Databases
matrix
sequence homology
eukaryotic cells
Data Display
statistics
informatics
Informatics
Information Storage and Retrieval
taxonomy
Expressed Sequence Tags
species diversity
eukaryote
Sequence Homology
Eukaryota
genome

Keywords

  • GenBank
  • Phylogenetic database
  • Phylogenomics
  • Phyloinformatics

ASJC Scopus subject areas

  • Ecology, Evolution, Behavior and Systematics

Cite this

The PhyLoTA Browser: Processing GenBank for molecular phylogenetics research. / Sanderson, Michael; Boss, Darren; Chen, Duhong; Cranston, Karen A.; Wehe, Andre.

In: Systematic Biology, Vol. 57, No. 3, 06.2008, p. 335-346.

@article{72fbfe4f05254e99b20dc6308b8d4f77,
title = "The PhyLoTA Browser: Processing GenBank for molecular phylogenetics research",
abstract = "As an archive of sequence data for over 165,000 species, GenBank is an indispensable resource for phylogenetic inference. Here we describe an informatics processing pipeline and online database, the PhyLoTA Browser (http://loco.biosci.arizona.edu/pb), which offers a view of GenBank tailored for molecular phylogenetics. The first release of the Browser is computed from 2.6 million sequences representing the taxonomically enriched subset of GenBank sequences for eukaryotes (excluding most genome survey sequences, ESTs, and other high-throughput data). In addition to summarizing sequence diversity and species diversity across nodes in the NCBI taxonomy, it reports 87,000 potentially phylogenetically informative clusters of homologous sequences, which can be viewed or downloaded, along with provisional alignments and coarse phylogenetic trees. At each node in the NCBI hierarchy, the user can display a {"}data availability matrix{"} of all available sequences for entries in a subtaxa-by-clusters matrix. This matrix provides a guidepost for subsequent assembly of multigene data sets or supertrees. The database allows for comparison of results from previous GenBank releases, highlighting recent additions of either sequences or taxa to GenBank and letting investigators track progress on data availability worldwide. Although the reported alignments and trees are extremely approximate, the database reports several statistics correlated with alignment quality to help users choose from alternative data sources.",
keywords = "GenBank, Phylogenetic database, Phylogenomics, Phyloinformatics",
author = "Michael Sanderson and Darren Boss and Duhong Chen and Cranston, {Karen A.} and Andre Wehe",
year = "2008",
month = "6",
doi = "10.1080/10635150802158688",
language = "English (US)",
volume = "57",
pages = "335--346",
journal = "Systematic Biology",
issn = "1063-5157",
publisher = "Oxford University Press",
number = "3",

}

TY - JOUR
T1 - The PhyLoTA Browser
T2 - Processing GenBank for molecular phylogenetics research
AU - Sanderson, Michael
AU - Boss, Darren
AU - Chen, Duhong
AU - Cranston, Karen A.
AU - Wehe, Andre
PY - 2008/6
Y1 - 2008/6
N2 - As an archive of sequence data for over 165,000 species, GenBank is an indispensable resource for phylogenetic inference. Here we describe an informatics processing pipeline and online database, the PhyLoTA Browser (http://loco.biosci.arizona.edu/pb), which offers a view of GenBank tailored for molecular phylogenetics. The first release of the Browser is computed from 2.6 million sequences representing the taxonomically enriched subset of GenBank sequences for eukaryotes (excluding most genome survey sequences, ESTs, and other high-throughput data). In addition to summarizing sequence diversity and species diversity across nodes in the NCBI taxonomy, it reports 87,000 potentially phylogenetically informative clusters of homologous sequences, which can be viewed or downloaded, along with provisional alignments and coarse phylogenetic trees. At each node in the NCBI hierarchy, the user can display a "data availability matrix" of all available sequences for entries in a subtaxa-by-clusters matrix. This matrix provides a guidepost for subsequent assembly of multigene data sets or supertrees. The database allows for comparison of results from previous GenBank releases, highlighting recent additions of either sequences or taxa to GenBank and letting investigators track progress on data availability worldwide. Although the reported alignments and trees are extremely approximate, the database reports several statistics correlated with alignment quality to help users choose from alternative data sources.
KW - GenBank
KW - Phylogenetic database
KW - Phylogenomics
KW - Phyloinformatics
UR - http://www.scopus.com/inward/record.url?scp=45849099814&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=45849099814&partnerID=8YFLogxK
U2 - 10.1080/10635150802158688
DO - 10.1080/10635150802158688
M3 - Article
C2 - 18570030
AN - SCOPUS:45849099814
VL - 57
SP - 335
EP - 346
JO - Systematic Biology
JF - Systematic Biology
SN - 1063-5157
IS - 3
ER -