Bio-ontology and text: Bridging the modeling gap

Carol Friedman, Tara Borlawsky, Lyudmila Shagina, H. Rosie Xing, Yves A Lussier

Research output: Contribution to journal › Article

24 Citations (Scopus)

Abstract

Motivation: Natural language processing (NLP) techniques are increasingly being used in biology to automate the capture of new biological discoveries in text, which are being reported at a rapid rate. Yet, information represented in NLP data structures is classically very different from information organized with ontologies as found in model organisms or genetic databases. To facilitate the computational reuse and integration of information buried in unstructured text with that of genetic databases, we propose and evaluate a translational schema that represents a comprehensive set of phenotypic and genetic entities, as well as their closely related biomedical entities and relations as expressed in natural language. In addition, the schema connects different scales of biological information, and provides mappings from the textual information to existing ontologies, which are essential in biology for integration, organization, dissemination and knowledge management of heterogeneous phenotypic information. A common comprehensive representation for otherwise heterogeneous phenotypic and genetic datasets, such as the one proposed, is critical for advancing systems biology because it enables acquisition and reuse of unprecedented volumes of diverse types of knowledge and information from text.

Results: A novel representational schema, PGschema, was developed that enables translation of phenotypic, genetic and their closely related information found in textual narratives to a well-defined data structure comprising phenotypic and genetic concepts from established ontologies along with modifiers and relationships. Evaluation for coverage of a selected set of entities showed that 90% of the information could be represented (95% confidence interval: 86-93%; n = 268). Moreover, PGschema can be expressed automatically in an XML format using natural language techniques to process the text. To our knowledge, we are providing the first evaluation of a translational schema for NLP that contains declarative knowledge about genes and their associated biomedical data (e.g. phenotypes).
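As a rough check on the reported coverage figure, the sketch below recomputes a 95% confidence interval for a proportion near 90% with n = 268. Two things here are assumptions and not stated in the abstract: the count of 241 represented entities (chosen only because 241/268 rounds to 90%) and the use of the Wilson score interval (the abstract does not name the interval method).

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Assumption: 241 of the 268 evaluated items were representable.
# The abstract reports only "90%" and "n = 268", not the raw count.
covered = 241
total = 268

low, high = wilson_interval(covered, total)
print(f"coverage = {covered / total:.1%}, 95% CI = {low:.1%} to {high:.1%}")
```

Under these assumptions the bounds round to 86% and 93%, which is consistent with the interval quoted in the abstract. A simple normal-approximation (Wald) interval would give a slightly higher upper bound.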

Original language: English (US)
Pages (from-to): 2421-2429
Number of pages: 9
Journal: Bioinformatics
Volume: 22
Issue number: 19
DOI: https://doi.org/10.1093/bioinformatics/btl405
State: Published - Oct 2006
Externally published: Yes

Fingerprint

Natural Language Processing
Genetic Databases
Ontology
Data structures
Language
Natural Language
Processing
Modeling
Genetic Phenomena
Knowledge Management
Schema
Systems Biology
Protein Biosynthesis
XML
Genes
Confidence Intervals
Phenotype
Biology
Reuse

ASJC Scopus subject areas

  • Clinical Biochemistry
  • Computer Science Applications
  • Computational Theory and Mathematics

Cite this

Friedman, C., Borlawsky, T., Shagina, L., Xing, H. R., & Lussier, Y. A. (2006). Bio-ontology and text: Bridging the modeling gap. Bioinformatics, 22(19), 2421-2429. https://doi.org/10.1093/bioinformatics/btl405

@article{b6012bf7b5a4427c81fff569b3eb3162,
title = "Bio-ontology and text: Bridging the modeling gap",
author = "Carol Friedman and Tara Borlawsky and Lyudmila Shagina and Xing, {H. Rosie} and Lussier, {Yves A}",
year = "2006",
month = "10",
doi = "10.1093/bioinformatics/btl405",
language = "English (US)",
volume = "22",
pages = "2421--2429",
journal = "Bioinformatics",
issn = "1367-4803",
publisher = "Oxford University Press",
number = "19",

}

TY - JOUR

T1 - Bio-ontology and text

T2 - Bridging the modeling gap

AU - Friedman, Carol

AU - Borlawsky, Tara

AU - Shagina, Lyudmila

AU - Xing, H. Rosie

AU - Lussier, Yves A

PY - 2006/10

Y1 - 2006/10

UR - http://www.scopus.com/inward/record.url?scp=33750011894&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=33750011894&partnerID=8YFLogxK

U2 - 10.1093/bioinformatics/btl405

DO - 10.1093/bioinformatics/btl405

M3 - Article

C2 - 16870928

AN - SCOPUS:33750011894

VL - 22

SP - 2421

EP - 2429

JO - Bioinformatics

JF - Bioinformatics

SN - 1367-4803

IS - 19

ER -