A two-stage statistical procedure for feature selection and comparison in functional analysis of metagenomes

Naruekamol Pookhao, Michael B. Sohn, Qike Li, Isaac Jenkins, Ruofei Du, Hongmei Jiang, Lingling An

Research output: Contribution to journal › Article

8 Citations (Scopus)

Abstract

Motivation: With the advance of new sequencing technologies that produce massive amounts of short-read data, metagenomics is growing rapidly, especially in environmental biology and medical science. Metagenomic data are not only high dimensional, with a large number of features and a limited number of samples, but also complex, with a large number of zeros and skewed distributions. Efficient computational and statistical tools are needed to deal with these characteristics of metagenomic sequencing data. A main objective of metagenomic studies is to assess whether and how multiple microbial communities differ under various environmental conditions.

Results: We propose a two-stage statistical procedure for selecting informative features and identifying differentially abundant features between two or more groups of microbial communities. In the functional analysis of metagenomes, the features may be pathways, subsystems, functional roles and so on. In the first stage of the proposed procedure, informative features are selected using the elastic net, which reduces the dimension of the metagenomic data. In the second stage, differentially abundant features are detected using generalized linear models with a negative binomial distribution. In comprehensive simulation studies, the proposed approach performs better than other available methods in most settings. The new method is also applied to two real metagenomic datasets related to human health, and the findings are consistent with those in previous reports.
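The two stages described in the abstract can be illustrated with a small, self-contained Python sketch. It is not the authors' implementation, and several choices in it are assumptions made only for illustration: the toy count matrix, the function names, the log1p transform, the use of the 0/1 group label as a numeric response with a squared-error loss in stage 1, the penalty settings (lam, alpha), the moment-based dispersion estimate with a floor, and the restriction to two groups without any sequencing-depth adjustment. Stage 1 is a coordinate-descent elastic net; stage 2 is a likelihood-ratio test under a negative binomial model, which for a single two-level factor and a fixed dispersion matches a log-link GLM whose fitted means are the group means.

```python
import math

def standardize(col):
    """Center a column and scale it to unit (population) variance."""
    n = len(col)
    mean = sum(col) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
    return [(v - mean) / sd if sd > 0 else 0.0 for v in col]

def soft_threshold(z, g):
    """Soft-thresholding operator used by the L1 part of the penalty."""
    if z > g:
        return z - g
    if z < -g:
        return z + g
    return 0.0

def elastic_net_select(counts, groups, lam=0.3, alpha=0.5, sweeps=200):
    """Stage 1 (sketch): elastic net by coordinate descent on log1p counts.
    counts: one list of counts per feature; groups: 0/1 label per sample.
    Assumption: the label is treated as a numeric response."""
    n = len(groups)
    cols = [standardize([math.log1p(c) for c in feat]) for feat in counts]
    ybar = sum(groups) / n
    resid = [g - ybar for g in groups]
    beta = [0.0] * len(cols)
    for _ in range(sweeps):
        for j, col in enumerate(cols):
            z = sum(x * r for x, r in zip(col, resid)) / n + beta[j]
            new = soft_threshold(z, lam * alpha) / (1.0 + lam * (1.0 - alpha))
            delta = new - beta[j]
            if delta != 0.0:
                resid = [r - delta * x for r, x in zip(resid, col)]
                beta[j] = new
    return [j for j, b in enumerate(beta) if b != 0.0]

def nb_loglik(ys, mu, size):
    """Negative binomial log-likelihood with mean mu and size = 1/dispersion."""
    mu = max(mu, 1e-8)  # guard against log(0) when a group is all zeros
    return sum(math.lgamma(y + size) - math.lgamma(size) - math.lgamma(y + 1)
               + size * math.log(size / (size + mu))
               + y * math.log(mu / (size + mu)) for y in ys)

def nb_lrt_pvalue(ys, groups, min_disp=1e-3):
    """Stage 2 (sketch): two-group likelihood-ratio test, fixed dispersion.
    Assumptions: exactly two groups, at least two samples per group."""
    by_group = {}
    for y, g in zip(ys, groups):
        by_group.setdefault(g, []).append(y)
    num = den = 0.0
    for vals in by_group.values():
        m = sum(vals) / len(vals)
        s2 = sum((v - m) ** 2 for v in vals) / (len(vals) - 1)
        num += s2 - m          # excess of variance over the Poisson variance
        den += m * m
    disp = max(num / den, min_disp) if den > 0 else min_disp
    size = 1.0 / disp
    ll_alt = sum(nb_loglik(v, sum(v) / len(v), size) for v in by_group.values())
    ll_null = nb_loglik(ys, sum(ys) / len(ys), size)
    stat = max(2.0 * (ll_alt - ll_null), 0.0)
    return math.erfc(math.sqrt(stat / 2.0))  # chi-square tail, 1 df

# Hypothetical toy data: 4 features (rows) by 8 samples, 4 samples per group.
groups = [0, 0, 0, 0, 1, 1, 1, 1]
counts = [
    [2, 3, 1, 2, 20, 25, 18, 22],     # more abundant in group 1
    [10, 12, 9, 11, 11, 9, 12, 10],   # no group difference
    [0, 3, 0, 1, 1, 0, 3, 0],         # sparse, no group difference
    [31, 29, 36, 32, 6, 5, 7, 6],     # more abundant in group 0
]
selected = elastic_net_select(counts, groups)
pvals = {j: nb_lrt_pvalue(counts[j], groups) for j in selected}
print(selected, pvals)
```

In this toy example only the two features that differ between groups survive the first stage, so only two tests are run in the second stage. In practice the p-values of the retained features would still need a multiple-testing adjustment, and a GLM routine from an established statistics package would be a more robust choice than the hand-written test above.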

Original language: English (US)
Pages (from-to): 158-165
Number of pages: 8
Journal: Bioinformatics
Volume: 31
Issue number: 2
DOI: 10.1093/bioinformatics/btu635
State: Published - Jan 15 2015

Fingerprint

Metagenome
Metagenomics
Functional analysis
Feature Selection
Feature extraction
Health
Sequencing
Elastic Net
Zero Distribution
Negative binomial distribution
Skewed Distribution
Generalized Linear Model
Binomial Distribution
Biology
Pathway
Subsystem
High-dimensional
Simulation Study
Linear Models

ASJC Scopus subject areas

  • Biochemistry
  • Molecular Biology
  • Computational Theory and Mathematics
  • Computer Science Applications
  • Computational Mathematics
  • Statistics and Probability

Cite this

A two-stage statistical procedure for feature selection and comparison in functional analysis of metagenomes. / Pookhao, Naruekamol; Sohn, Michael B.; Li, Qike; Jenkins, Isaac; Du, Ruofei; Jiang, Hongmei; An, Lingling.

In: Bioinformatics, Vol. 31, No. 2, 15.01.2015, p. 158-165.

Pookhao, Naruekamol ; Sohn, Michael B. ; Li, Qike ; Jenkins, Isaac ; Du, Ruofei ; Jiang, Hongmei ; An, Lingling. / A two-stage statistical procedure for feature selection and comparison in functional analysis of metagenomes. In: Bioinformatics. 2015 ; Vol. 31, No. 2. pp. 158-165.
@article{ca30c761e45a43f4be82b22ad31e6215,
title = "A two-stage statistical procedure for feature selection and comparison in functional analysis of metagenomes",
author = "Naruekamol Pookhao and Sohn, {Michael B.} and Qike Li and Isaac Jenkins and Ruofei Du and Hongmei Jiang and Lingling An",
year = "2015",
month = "1",
day = "15",
doi = "10.1093/bioinformatics/btu635",
language = "English (US)",
volume = "31",
pages = "158--165",
journal = "Bioinformatics",
issn = "1367-4803",
publisher = "Oxford University Press",
number = "2",

}

TY - JOUR

T1 - A two-stage statistical procedure for feature selection and comparison in functional analysis of metagenomes

AU - Pookhao, Naruekamol

AU - Sohn, Michael B.

AU - Li, Qike

AU - Jenkins, Isaac

AU - Du, Ruofei

AU - Jiang, Hongmei

AU - An, Lingling

PY - 2015/1/15

Y1 - 2015/1/15

UR - http://www.scopus.com/inward/record.url?scp=84928989683&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84928989683&partnerID=8YFLogxK

U2 - 10.1093/bioinformatics/btu635

DO - 10.1093/bioinformatics/btu635

M3 - Article

C2 - 25256572

AN - SCOPUS:84928989683

VL - 31

SP - 158

EP - 165

JO - Bioinformatics

JF - Bioinformatics

SN - 1367-4803

IS - 2

ER -