Extraction of phenotypic traits from taxonomic descriptions for the tree of life using natural language processing

Lorena Endara, Hong Cui, J. Gordon Burleigh

Research output: Contribution to journal › Article

6 Citations (Scopus)

Abstract

Premise of the Study: Phenotypic data sets are necessary to elucidate the genealogy of life, but assembling phenotypic data for taxa across the tree of life can be technically challenging and prohibitively time consuming. We describe a semi-automated protocol to facilitate and expedite the assembly of phenotypic character matrices of plants from formal taxonomic descriptions. This pipeline uses new natural language processing (NLP) techniques and a glossary of over 9000 botanical terms. Methods and Results: Our protocol includes the Explorer of Taxon Concepts (ETC), an online application that assembles taxon-by-character matrices from taxonomic descriptions, and MatrixConverter, a Java application that enables users to evaluate and discretize the characters extracted by ETC. We demonstrate this protocol using descriptions from Araucariaceae. Conclusions: The NLP pipeline unlocks the phenotypic data found in taxonomic descriptions and makes them usable for evolutionary analyses.
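As a rough illustration of the two ideas named in the abstract, a taxon-by-character matrix and the discretization of extracted characters, the sketch below codes continuous measurements into discrete states. It is not the ETC or MatrixConverter implementation: the taxa, characters, values, breakpoint-based coding rule, and the function names `discretize_value` and `discretize_matrix` are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional

# A taxon-by-character matrix: taxon name -> character name -> measured value.
# None marks a character that a description does not report.
Matrix = Dict[str, Dict[str, Optional[float]]]


def discretize_value(value: Optional[float], breakpoints: List[float]) -> str:
    """Code a continuous value as a state index; missing values become '?'.

    The state is the number of breakpoints the value meets or exceeds
    (an assumed coding rule, not the one used by MatrixConverter).
    """
    if value is None:
        return "?"
    return str(sum(1 for b in sorted(breakpoints) if value >= b))


def discretize_matrix(
    matrix: Matrix, breakpoints: Dict[str, List[float]]
) -> Dict[str, Dict[str, str]]:
    """Apply the per-character breakpoints to every cell of the matrix."""
    return {
        taxon: {
            character: discretize_value(value, breakpoints[character])
            for character, value in characters.items()
        }
        for taxon, characters in matrix.items()
    }


# Hypothetical values, not data from the article.
example: Matrix = {
    "Taxon A": {"leaf length (cm)": 2.5, "cone diameter (cm)": 4.0},
    "Taxon B": {"leaf length (cm)": 11.0, "cone diameter (cm)": None},
}
cuts = {"leaf length (cm)": [5.0, 10.0], "cone diameter (cm)": [8.0]}

coded = discretize_matrix(example, cuts)
for taxon, states in coded.items():
    print(taxon, states)
```

In this sketch each character gets its own breakpoints, so a user could review and adjust the coding one character at a time before using the matrix in a downstream analysis.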

Original language: English (US)
Article number: e1035
Journal: Applications in Plant Sciences
Volume: 6
Issue number: 3
DOI: 10.1002/aps3.1035
State: Published - Mar 1 2018

Fingerprint

Araucariaceae
genealogy
matrix
methodology
protocol
glossary
method

Keywords

  • morphological matrices
  • natural language processing
  • phenotypic traits
  • taxonomic descriptions

ASJC Scopus subject areas

  • Ecology, Evolution, Behavior and Systematics
  • Plant Science

Cite this

Extraction of phenotypic traits from taxonomic descriptions for the tree of life using natural language processing. / Endara, Lorena; Cui, Hong; Burleigh, J. Gordon.

In: Applications in Plant Sciences, Vol. 6, No. 3, e1035, 01.03.2018.

@article{847749c43289430e8f2aa6fd7894b396,
title = "Extraction of phenotypic traits from taxonomic descriptions for the tree of life using natural language processing",
abstract = "Premise of the Study: Phenotypic data sets are necessary to elucidate the genealogy of life, but assembling phenotypic data for taxa across the tree of life can be technically challenging and prohibitively time consuming. We describe a semi-automated protocol to facilitate and expedite the assembly of phenotypic character matrices of plants from formal taxonomic descriptions. This pipeline uses new natural language processing (NLP) techniques and a glossary of over 9000 botanical terms. Methods and Results: Our protocol includes the Explorer of Taxon Concepts (ETC), an online application that assembles taxon-by-character matrices from taxonomic descriptions, and MatrixConverter, a Java application that enables users to evaluate and discretize the characters extracted by ETC. We demonstrate this protocol using descriptions from Araucariaceae. Conclusions: The NLP pipeline unlocks the phenotypic data found in taxonomic descriptions and makes them usable for evolutionary analyses.",
keywords = "morphological matrices, natural language processing, phenotypic traits, taxonomic descriptions",
author = "Lorena Endara and Hong Cui and Burleigh, {J. Gordon}",
year = "2018",
month = "3",
day = "1",
doi = "10.1002/aps3.1035",
language = "English (US)",
volume = "6",
journal = "Applications in Plant Sciences",
issn = "2168-0450",
publisher = "Botanical Society of America Inc.",
number = "3",
}

TY  - JOUR
T1  - Extraction of phenotypic traits from taxonomic descriptions for the tree of life using natural language processing
AU  - Endara, Lorena
AU  - Cui, Hong
AU  - Burleigh, J. Gordon
PY  - 2018/3/1
Y1  - 2018/3/1
N2  - Premise of the Study: Phenotypic data sets are necessary to elucidate the genealogy of life, but assembling phenotypic data for taxa across the tree of life can be technically challenging and prohibitively time consuming. We describe a semi-automated protocol to facilitate and expedite the assembly of phenotypic character matrices of plants from formal taxonomic descriptions. This pipeline uses new natural language processing (NLP) techniques and a glossary of over 9000 botanical terms. Methods and Results: Our protocol includes the Explorer of Taxon Concepts (ETC), an online application that assembles taxon-by-character matrices from taxonomic descriptions, and MatrixConverter, a Java application that enables users to evaluate and discretize the characters extracted by ETC. We demonstrate this protocol using descriptions from Araucariaceae. Conclusions: The NLP pipeline unlocks the phenotypic data found in taxonomic descriptions and makes them usable for evolutionary analyses.
AB  - Premise of the Study: Phenotypic data sets are necessary to elucidate the genealogy of life, but assembling phenotypic data for taxa across the tree of life can be technically challenging and prohibitively time consuming. We describe a semi-automated protocol to facilitate and expedite the assembly of phenotypic character matrices of plants from formal taxonomic descriptions. This pipeline uses new natural language processing (NLP) techniques and a glossary of over 9000 botanical terms. Methods and Results: Our protocol includes the Explorer of Taxon Concepts (ETC), an online application that assembles taxon-by-character matrices from taxonomic descriptions, and MatrixConverter, a Java application that enables users to evaluate and discretize the characters extracted by ETC. We demonstrate this protocol using descriptions from Araucariaceae. Conclusions: The NLP pipeline unlocks the phenotypic data found in taxonomic descriptions and makes them usable for evolutionary analyses.
KW  - morphological matrices
KW  - natural language processing
KW  - phenotypic traits
KW  - taxonomic descriptions
UR  - http://www.scopus.com/inward/record.url?scp=85044740050&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=85044740050&partnerID=8YFLogxK
U2  - 10.1002/aps3.1035
DO  - 10.1002/aps3.1035
M3  - Article
VL  - 6
JO  - Applications in Plant Sciences
JF  - Applications in Plant Sciences
SN  - 2168-0450
IS  - 3
M1  - e1035
ER  -