Information theory applied to the sparse gene ontology annotation network to predict novel gene function

Ying Tao, Lee Sam, Jianrong Li, Carol Friedman, Yves A Lussier

Research output: Contribution to journal › Article

116 Citations (Scopus)

Abstract

Motivation: Despite advances in the gene annotation process, the functions of a large portion of gene products remain insufficiently characterized. In addition, the in silico prediction of novel Gene Ontology (GO) annotations for partially characterized gene functions or processes is highly dependent on reverse genetic or functional genomic approaches. To our knowledge, no prediction method has been demonstrated to be highly accurate for sparsely annotated GO terms (those associated to fewer than 10 genes).

Results: We propose a novel approach, information theory-based semantic similarity (ITSS), to automatically predict molecular functions of genes based on existing GO annotations. Using a 10-fold cross-validation, we demonstrate that the ITSS algorithm obtains prediction accuracies (precision 97% recall 77%) comparable to other machine learning algorithms when compared in similar conditions over densely annotated portions of the GO datasets. This method is able to generate highly accurate predictions in sparsely annotated portions of GO, where previous algorithms have failed. As a result, our technique generates an order of magnitude more functional predictions than previous methods. A 10-fold cross validation demonstrated a precision of 90% at a recall of 36% for the algorithm over sparsely annotated networks of the recent GO annotations (about 1400 GO terms and 11000 genes in Homo sapiens). To our knowledge, this article presents the first historical rollback validation for the predicted GO annotations, which may represent more realistic conditions than more widely used cross-validation approaches. By manually assessing a random sample of 100 predictions conducted in a historical rollback evaluation, we estimate that a minimum precision of 51% (95% confidence interval: 43-58%) can be achieved for the human GO Annotation file dated 2003.

Original language: English (US)
Journal: Bioinformatics
Volume: 23
Issue number: 13
DOI: 10.1093/bioinformatics/btm195
State: Published - Jul 1 2007
Externally published: Yes

Fingerprint

Information Theory
Molecular Sequence Annotation
Gene Ontology
Information theory
Ontology
Annotation
Genes
Gene
Predict
Prediction
Cross-validation
Semantic Similarity
Semantics
Fold
Functional Genomics
Reverse Genetics
Term
Computer Simulation
Confidence interval
Reverse

ASJC Scopus subject areas

  • Clinical Biochemistry
  • Computer Science Applications
  • Computational Theory and Mathematics
  • Biochemistry
  • Molecular Biology
  • Computational Mathematics
  • Statistics and Probability

Cite this

Information theory applied to the sparse gene ontology annotation network to predict novel gene function. / Tao, Ying; Sam, Lee; Li, Jianrong; Friedman, Carol; Lussier, Yves A.

In: Bioinformatics, Vol. 23, No. 13, 01.07.2007.

@article{07f47f2f623044adb1fb4bc2b7d8d6bd,
title = "Information theory applied to the sparse gene ontology annotation network to predict novel gene function",
abstract = "Motivation: Despite advances in the gene annotation process, the functions of a large portion of gene products remain insufficiently characterized. In addition, the in silico prediction of novel Gene Ontology (GO) annotations for partially characterized gene functions or processes is highly dependent on reverse genetic or functional genomic approaches. To our knowledge, no prediction method has been demonstrated to be highly accurate for sparsely annotated GO terms (those associated to fewer than 10 genes). Results: We propose a novel approach, information theory-based semantic similarity (ITSS), to automatically predict molecular functions of genes based on existing GO annotations. Using a 10-fold cross-validation, we demonstrate that the ITSS algorithm obtains prediction accuracies (precision 97{\%} recall 77{\%}) comparable to other machine learning algorithms when compared in similar conditions over densely annotated portions of the GO datasets. This method is able to generate highly accurate predictions in sparsely annotated portions of GO, where previous algorithms have failed. As a result, our technique generates an order of magnitude more functional predictions than previous methods. A 10-fold cross validation demonstrated a precision of 90{\%} at a recall of 36{\%} for the algorithm over sparsely annotated networks of the recent GO annotations (about 1400 GO terms and 11000 genes in Homo sapiens). To our knowledge, this article presents the first historical rollback validation for the predicted GO annotations, which may represent more realistic conditions than more widely used cross-validation approaches. By manually assessing a random sample of 100 predictions conducted in a historical rollback evaluation, we estimate that a minimum precision of 51{\%} (95{\%} confidence interval: 43-58{\%}) can be achieved for the human GO Annotation file dated 2003.",
author = "Tao, Ying and Sam, Lee and Li, Jianrong and Friedman, Carol and Lussier, {Yves A}",
year = "2007",
month = "7",
day = "1",
doi = "10.1093/bioinformatics/btm195",
language = "English (US)",
volume = "23",
journal = "Bioinformatics",
issn = "1367-4803",
publisher = "Oxford University Press",
number = "13",
}

TY - JOUR

T1 - Information theory applied to the sparse gene ontology annotation network to predict novel gene function

AU - Tao, Ying

AU - Sam, Lee

AU - Li, Jianrong

AU - Friedman, Carol

AU - Lussier, Yves A

PY - 2007/7/1

Y1 - 2007/7/1

AB - Motivation: Despite advances in the gene annotation process, the functions of a large portion of gene products remain insufficiently characterized. In addition, the in silico prediction of novel Gene Ontology (GO) annotations for partially characterized gene functions or processes is highly dependent on reverse genetic or functional genomic approaches. To our knowledge, no prediction method has been demonstrated to be highly accurate for sparsely annotated GO terms (those associated to fewer than 10 genes). Results: We propose a novel approach, information theory-based semantic similarity (ITSS), to automatically predict molecular functions of genes based on existing GO annotations. Using a 10-fold cross-validation, we demonstrate that the ITSS algorithm obtains prediction accuracies (precision 97% recall 77%) comparable to other machine learning algorithms when compared in similar conditions over densely annotated portions of the GO datasets. This method is able to generate highly accurate predictions in sparsely annotated portions of GO, where previous algorithms have failed. As a result, our technique generates an order of magnitude more functional predictions than previous methods. A 10-fold cross validation demonstrated a precision of 90% at a recall of 36% for the algorithm over sparsely annotated networks of the recent GO annotations (about 1400 GO terms and 11000 genes in Homo sapiens). To our knowledge, this article presents the first historical rollback validation for the predicted GO annotations, which may represent more realistic conditions than more widely used cross-validation approaches. By manually assessing a random sample of 100 predictions conducted in a historical rollback evaluation, we estimate that a minimum precision of 51% (95% confidence interval: 43-58%) can be achieved for the human GO Annotation file dated 2003.

UR - http://www.scopus.com/inward/record.url?scp=34547840224&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=34547840224&partnerID=8YFLogxK

U2 - 10.1093/bioinformatics/btm195

DO - 10.1093/bioinformatics/btm195

M3 - Article

C2 - 17646340

AN - SCOPUS:34547840224

VL - 23

JO - Bioinformatics

JF - Bioinformatics

SN - 1367-4803

IS - 13

ER -