Libra: scalable k-mer-based tool for massive all-vs-all metagenome comparisons

Illyoung Choi, Alise J. Ponsero, Matthew Bomhoff, Ken Youens-Clark, John H Hartman, Bonnie L Hurwitz

Research output: Contribution to journal › Article

1 Citation (Scopus)

Abstract

Background: Shotgun metagenomics provides powerful insights into microbial community biodiversity and function. Yet, inferences from metagenomic studies are often limited by dataset size and complexity and are restricted by the availability and completeness of existing databases. De novo comparative metagenomics enables the comparison of metagenomes based on their total genetic content.

Results: We developed a tool called Libra that performs an all-vs-all comparison of metagenomes for precise clustering based on their k-mer content. Libra uses a scalable Hadoop framework for massive metagenome comparisons, Cosine Similarity for calculating the distance using sequence composition and abundance while normalizing for sequencing depth, and a web-based implementation in iMicrobe (http://imicrobe.us) that uses the CyVerse advanced cyberinfrastructure to promote broad use of the tool by the scientific community.

Conclusions: A comparison of Libra to equivalent tools using both simulated and real metagenomic datasets, ranging from 80 million to 4.2 billion reads, reveals that methods commonly implemented to reduce compute time for large datasets, such as data reduction, read count normalization, and presence/absence distance metrics, greatly diminish the resolution of large-scale comparative analyses. In contrast, Libra uses all of the reads to calculate k-mer abundance in a Hadoop architecture that can scale to any size dataset to enable global-scale analyses and link microbial signatures to biological processes.
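The distance described in the abstract can be illustrated with a minimal single-machine sketch. This is not Libra's Hadoop implementation: the function names (`kmer_counts`, `cosine_similarity`, `all_vs_all`), the toy reads, and the choice of k = 4 are assumptions made only for illustration. It counts k-mers per sample and compares the abundance vectors with cosine similarity, which depends on the direction of each vector and not its length, so sequencing the same community at greater depth leaves the distance unchanged.

```python
from collections import Counter
from math import sqrt


def kmer_counts(reads, k):
    """Count every k-mer (substring of length k) across a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse k-mer abundance vectors."""
    # Only k-mers present in both samples contribute to the dot product.
    dot = sum(count * b[kmer] for kmer, count in a.items() if kmer in b)
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty sample shares nothing with any other sample
    return dot / (norm_a * norm_b)


def all_vs_all(samples, k):
    """Return {(name_x, name_y): cosine distance} for every pair of samples."""
    profiles = {name: kmer_counts(reads, k) for name, reads in samples.items()}
    names = sorted(profiles)
    return {
        (x, y): 1.0 - cosine_similarity(profiles[x], profiles[y])
        for x in names
        for y in names
    }


# Hypothetical toy samples: "deep" is "shallow" sequenced at twice the depth.
samples = {
    "shallow": ["ACGTACGT"],
    "deep": ["ACGTACGT", "ACGTACGT"],
    "other": ["TTTTTTTT"],
}
distances = all_vs_all(samples, k=4)
```

In this toy input, `shallow` and `deep` end up at a distance of approximately zero despite the difference in read count, while `other`, which shares no 4-mers with them, is at distance 1. A presence/absence metric would discard the abundance information that this comparison uses.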

Original language: English (US)
Journal: GigaScience
Volume: 8
Issue number: 2
DOI: 10.1093/gigascience/giy165
State: Published - Feb 1 2019

Fingerprint

Metagenome
Metagenomics
Biodiversity
Biological Phenomena
Data reduction
Availability
Cluster Analysis
Databases
Chemical analysis
Datasets

ASJC Scopus subject areas

  • Health Informatics
  • Computer Science Applications

Cite this

Libra: scalable k-mer-based tool for massive all-vs-all metagenome comparisons. / Choi, Illyoung; Ponsero, Alise J.; Bomhoff, Matthew; Youens-Clark, Ken; Hartman, John H; Hurwitz, Bonnie L.

In: GigaScience, Vol. 8, No. 2, 01.02.2019.

Choi, Illyoung; Ponsero, Alise J.; Bomhoff, Matthew; Youens-Clark, Ken; Hartman, John H; Hurwitz, Bonnie L. / Libra: scalable k-mer-based tool for massive all-vs-all metagenome comparisons. In: GigaScience. 2019; Vol. 8, No. 2.
@article{8367827c86914a5c9f3b64b792b60182,
title = "Libra: scalable k-mer-based tool for massive all-vs-all metagenome comparisons",
author = "Illyoung Choi and Ponsero, {Alise J.} and Matthew Bomhoff and Ken Youens-Clark and Hartman, {John H} and Hurwitz, {Bonnie L}",
year = "2019",
month = "2",
day = "1",
doi = "10.1093/gigascience/giy165",
language = "English (US)",
volume = "8",
journal = "GigaScience",
issn = "2047-217X",
publisher = "Oxford University Press",
number = "2",

}

TY - JOUR

T1 - Libra

T2 - scalable k-mer-based tool for massive all-vs-all metagenome comparisons

AU - Choi, Illyoung

AU - Ponsero, Alise J.

AU - Bomhoff, Matthew

AU - Youens-Clark, Ken

AU - Hartman, John H

AU - Hurwitz, Bonnie L

PY - 2019/2/1

Y1 - 2019/2/1

UR - http://www.scopus.com/inward/record.url?scp=85061003161&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85061003161&partnerID=8YFLogxK

U2 - 10.1093/gigascience/giy165

DO - 10.1093/gigascience/giy165

M3 - Article

C2 - 30597002

AN - SCOPUS:85061003161

VL - 8

JO - GigaScience

JF - GigaScience

SN - 2047-217X

IS - 2

ER -