Evaluation of high-throughput functional categorization of human disease genes

James L. Chen, Yang Liu, Lee T. Sam, Jianrong Li, Yves A Lussier

Research output: Contribution to journal › Article

16 Citations (Scopus)

Abstract

Background: Biological data that are well-organized by an ontology, such as Gene Ontology, enable high-throughput availability of the semantic web. They can also be used to facilitate high-throughput classification of biomedical information. However, to our knowledge, no evaluation has been published on automating classifications of human disease genes using Gene Ontology. In this study, we evaluate automated classifications of well-defined human disease genes using their Gene Ontology annotations and compare them to a gold standard. This gold standard was independently conceived by Valle's research group and contains 923 human disease genes organized in 14 categories of protein function.

Results: Two automated methods were applied to investigate the classification of human disease genes into independently pre-defined categories of protein function. One method used the structure of Gene Ontology by pre-selecting 74 Gene Ontology terms assigned to 11 protein function categories. The second method was based on the similarity of human disease genes clustered according to the information-theoretic distance of their Gene Ontology annotations. Compared to the categorization of human disease genes found in the gold standard, our automated methods achieve an overall precision of 56% and 47% with a recall of 62% and 71%, respectively. However, approximately 15% of the studied human disease genes remain without GO annotations.

Conclusion: Automated methods can recapitulate a significant portion of the classification of human disease genes. The method using information-theoretic distance performs slightly better on precision with some loss in recall. For some protein function categories, such as 'hormone' and 'transcription factor', the automated methods perform particularly well, achieving precision and recall levels above 75%. In summary, this study demonstrates that for semantic webs, methods to automatically classify or analyze a majority of human disease genes require significant progress in both the Gene Ontology annotations and, particularly, the utilization of these annotations.
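
The precision and recall figures quoted in the abstract compare automated category assignments against the gold standard. The following is a minimal sketch of how such a comparison could be computed, assuming micro-averaging over (gene, category) pairs; the abstract does not state the exact averaging scheme the authors used, and the function name `precision_recall`, the gene labels, and the toy assignments are all hypothetical.

```python
def precision_recall(predicted, gold):
    """Micro-averaged precision and recall over (gene, category) pairs.

    predicted, gold: dicts mapping a gene label to a set of category names.
    A gene missing from `predicted` (for example, one without GO
    annotations) contributes no predicted pairs and so lowers recall only.
    """
    pred_pairs = {(g, c) for g, cats in predicted.items() for c in cats}
    gold_pairs = {(g, c) for g, cats in gold.items() for c in cats}
    true_pos = len(pred_pairs & gold_pairs)
    precision = true_pos / len(pred_pairs) if pred_pairs else 0.0
    recall = true_pos / len(gold_pairs) if gold_pairs else 0.0
    return precision, recall


# Toy data invented for illustration only; not taken from the study.
gold = {
    "GENE_A": {"hormone"},
    "GENE_B": {"transcription factor"},
    "GENE_C": {"enzyme"},
    "GENE_D": {"enzyme"},
}
predicted = {
    "GENE_A": {"hormone"},                         # correct
    "GENE_B": {"transcription factor", "enzyme"},  # one correct, one spurious
    "GENE_C": {"hormone"},                         # wrong category
    # GENE_D is absent: no annotation, so no prediction
}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this sketch an unannotated gene reduces recall without affecting precision, which is one way the roughly 15% of genes lacking GO annotations could weigh on the reported recall.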

Original language: English (US)
Article number: S7
Journal: BMC Bioinformatics
Volume: 8
Issue number: SUPPL. 3
DOI: 10.1186/1471-2105-8-S3-S7
State: Published - May 9 2007
Externally published: Yes

Fingerprint

Categorization
Gene Ontology
High Throughput
Genes
Throughput
Gene
Evaluation
Annotation
Ontology
Molecular Sequence Annotation
Gold
Protein
Semantics
Semantic Web
Proteins
Human
Hormones
Transcription Factor
Well-defined
Transcription Factors

ASJC Scopus subject areas

  • Medicine(all)
  • Structural Biology
  • Applied Mathematics

Cite this

Evaluation of high-throughput functional categorization of human disease genes. / Chen, James L.; Liu, Yang; Sam, Lee T.; Li, Jianrong; Lussier, Yves A.

In: BMC Bioinformatics, Vol. 8, No. SUPPL. 3, S7, 09.05.2007.

Chen, James L. ; Liu, Yang ; Sam, Lee T. ; Li, Jianrong ; Lussier, Yves A. / Evaluation of high-throughput functional categorization of human disease genes. In: BMC Bioinformatics. 2007 ; Vol. 8, No. SUPPL. 3.
@article{cff4d797f4b7402498a0e4294fd62f3e,
title = "Evaluation of high-throughput functional categorization of human disease genes",
author = "Chen, {James L.} and Yang Liu and Sam, {Lee T.} and Jianrong Li and Lussier, {Yves A}",
year = "2007",
month = "5",
day = "9",
doi = "10.1186/1471-2105-8-S3-S7",
language = "English (US)",
volume = "8",
journal = "BMC Bioinformatics",
issn = "1471-2105",
publisher = "BioMed Central",
number = "SUPPL. 3",

}

TY - JOUR

T1 - Evaluation of high-throughput functional categorization of human disease genes

AU - Chen, James L.

AU - Liu, Yang

AU - Sam, Lee T.

AU - Li, Jianrong

AU - Lussier, Yves A

PY - 2007/5/9

Y1 - 2007/5/9

UR - http://www.scopus.com/inward/record.url?scp=34249851359&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=34249851359&partnerID=8YFLogxK

U2 - 10.1186/1471-2105-8-S3-S7

DO - 10.1186/1471-2105-8-S3-S7

M3 - Article

C2 - 17493290

AN - SCOPUS:34249851359

VL - 8

JO - BMC Bioinformatics

JF - BMC Bioinformatics

SN - 1471-2105

IS - SUPPL. 3

M1 - S7

ER -