CharaParser for fine-grained semantic annotation of organism morphological descriptions

Research output: Contribution to journal (Article)

45 Scopus citations

Abstract

Biodiversity information organization is looking beyond the traditional document-level metadata approach and has started to look into factual content in textual documents to support more intelligent and semantic-based access. This article reports the development and evaluation of CharaParser, a software application for semantic annotation of morphological descriptions. CharaParser annotates semistructured morphological descriptions in such a detailed manner that all stated morphological characters of an organ are marked up in Extensible Markup Language format. Using an unsupervised machine learning algorithm and a general purpose syntactic parser as its key annotation tools, CharaParser requires minimal additional knowledge engineering work and seems to perform well across different description collections and/or taxon groups. The system has been formally evaluated on over 1,000 sentences randomly selected from Volume 19 of Flora of North America and Part H of Treatise on Invertebrate Paleontology. CharaParser reaches and exceeds 90% in sentence-wise recall and precision, exceeding other similar systems reported in the literature. It also significantly outperforms a heuristic rule-based system we developed earlier. Early evidence that enriching the lexicon of a syntactic parser with domain terms alone may be sufficient to adapt the parser for the biodiversity domain is also observed and may have significant implications.
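
As a rough illustration of what marking up the characters of an organ in Extensible Markup Language can look like, the sketch below builds one annotated statement with Python's standard library. The element and attribute names (statement, text, structure, character, name, value), the helper annotate, and the example sentence are all assumptions made for this illustration; they are not CharaParser's actual schema or output, and the characters are supplied by hand rather than produced by a parser or learning algorithm.

```python
import xml.etree.ElementTree as ET


def annotate(text, organ, characters):
    """Build an XML statement for one description sentence.

    Hypothetical schema, for illustration only. `characters` is a list of
    (character_name, value) pairs supplied by hand; no parsing or learning
    happens here.
    """
    statement = ET.Element("statement")
    # Keep the original sentence alongside its annotation.
    ET.SubElement(statement, "text").text = text
    # One <structure> element per organ, one <character> child per stated character.
    structure = ET.SubElement(statement, "structure", name=organ)
    for name, value in characters:
        ET.SubElement(structure, "character", name=name, value=value)
    return statement


# Invented example sentence and hand-assigned characters.
stmt = annotate(
    "Leaf blades ovate, green.",
    "leaf blade",
    [("shape", "ovate"), ("coloration", "green")],
)
xml_string = ET.tostring(stmt, encoding="unicode")
print(xml_string)
```

Nesting each character under its organ keeps the link between a value such as "ovate" and the structure it describes, which is the kind of fine-grained association the abstract refers to.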

Original language: English (US)
Pages (from-to): 738-754
Number of pages: 17
Journal: Journal of the American Society for Information Science and Technology
Volume: 63
Issue number: 4
Publication status: Published - Apr 2012

Keywords

  • text mining

ASJC Scopus subject areas

  • Software
  • Artificial Intelligence
  • Information Systems
  • Human-Computer Interaction
  • Computer Networks and Communications