The reusability of induced knowledge for the automatic semantic markup of taxonomic descriptions

Research output: Contribution to journal (Article)

10 Citations (Scopus)

Abstract

To automatically convert legacy data of taxonomic descriptions into extensible markup language (XML) format, the authors designed a machine-learning-based approach. In this project three corpora of taxonomic descriptions were selected to prove the hypothesis that domain knowledge and conventions automatically induced from some semistructured corpora (i.e., base corpora) are useful to improve the markup performance of other less-structured, quite different corpora (i.e., evaluation corpora). The "structuredness" of the three corpora was carefully measured. Basing on the structuredness measures, two of the corpora were used as the base corpora and one as the evaluation corpus. Three series of experiments were carried out with the MARTT (markuper of taxonomic treatments) system the authors developed to evaluate the effectiveness of different methods of using the n-gram semantic class association rules, the element relative position probabilities, and a combination of the two types of knowledge mined from the automatically marked-up base corpora. The experimental results showed that the induced knowledge from the base corpora was more reliable than that learned from the training examples alone, and that the n-gram semantic class association rules were effective in improving the markup performance, especially on the elements with sparse training examples. The authors also identify a number of challenges for any automatic markup system using taxonomic descriptions.

Original language: English (US)
Pages (from-to): 133-149
Number of pages: 17
Journal: Journal of the American Society for Information Science and Technology
Volume: 58
Issue number: 1
DOI: 10.1002/asi.20463
State: Published - Jan 2007
Externally published: Yes

Fingerprint

Association rules
Reusability
XML
Semantics
evaluation
performance
Learning systems
experiment
knowledge
learning
Experiments
Markup

ASJC Scopus subject areas

  • Information Systems
  • Library and Information Sciences

Cite this

@article{5d2886d69dee4dadbcca7b44a81c0027,
title = "The reusability of induced knowledge for the automatic semantic markup of taxonomic descriptions",
abstract = "To automatically convert legacy data of taxonomic descriptions into extensible markup language (XML) format, the authors designed a machine-learning-based approach. In this project three corpora of taxonomic descriptions were selected to prove the hypothesis that domain knowledge and conventions automatically induced from some semistructured corpora (i.e., base corpora) are useful to improve the markup performance of other less-structured, quite different corpora (i.e., evaluation corpora). The {"}structuredness{"} of the three corpora was carefully measured. Basing on the structuredness measures, two of the corpora were used as the base corpora and one as the evaluation corpus. Three series of experiments were carried out with the MARTT (markuper of taxonomic treatments) system the authors developed to evaluate the effectiveness of different methods of using the n-gram semantic class association rules, the element relative position probabilities, and a combination of the two types of knowledge mined from the automatically marked-up base corpora. The experimental results showed that the induced knowledge from the base corpora was more reliable than that learned from the training examples alone, and that the n-gram semantic class association rules were effective in improving the markup performance, especially on the elements with sparse training examples. The authors also identify a number of challenges for any automatic markup system using taxonomic descriptions.",
author = "Cui, Hong and Heidorn, {Patrick B}",
year = "2007",
month = "1",
doi = "10.1002/asi.20463",
language = "English (US)",
volume = "58",
pages = "133--149",
journal = "Journal of the Association for Information Science and Technology",
issn = "2330-1635",
publisher = "John Wiley and Sons Ltd",
number = "1",
}

TY - JOUR

T1 - The reusability of induced knowledge for the automatic semantic markup of taxonomic descriptions

AU - Cui, Hong

AU - Heidorn, Patrick B

PY - 2007/1

Y1 - 2007/1

N2 - To automatically convert legacy data of taxonomic descriptions into extensible markup language (XML) format, the authors designed a machine-learning-based approach. In this project three corpora of taxonomic descriptions were selected to prove the hypothesis that domain knowledge and conventions automatically induced from some semistructured corpora (i.e., base corpora) are useful to improve the markup performance of other less-structured, quite different corpora (i.e., evaluation corpora). The "structuredness" of the three corpora was carefully measured. Basing on the structuredness measures, two of the corpora were used as the base corpora and one as the evaluation corpus. Three series of experiments were carried out with the MARTT (markuper of taxonomic treatments) system the authors developed to evaluate the effectiveness of different methods of using the n-gram semantic class association rules, the element relative position probabilities, and a combination of the two types of knowledge mined from the automatically marked-up base corpora. The experimental results showed that the induced knowledge from the base corpora was more reliable than that learned from the training examples alone, and that the n-gram semantic class association rules were effective in improving the markup performance, especially on the elements with sparse training examples. The authors also identify a number of challenges for any automatic markup system using taxonomic descriptions.

AB - To automatically convert legacy data of taxonomic descriptions into extensible markup language (XML) format, the authors designed a machine-learning-based approach. In this project three corpora of taxonomic descriptions were selected to prove the hypothesis that domain knowledge and conventions automatically induced from some semistructured corpora (i.e., base corpora) are useful to improve the markup performance of other less-structured, quite different corpora (i.e., evaluation corpora). The "structuredness" of the three corpora was carefully measured. Basing on the structuredness measures, two of the corpora were used as the base corpora and one as the evaluation corpus. Three series of experiments were carried out with the MARTT (markuper of taxonomic treatments) system the authors developed to evaluate the effectiveness of different methods of using the n-gram semantic class association rules, the element relative position probabilities, and a combination of the two types of knowledge mined from the automatically marked-up base corpora. The experimental results showed that the induced knowledge from the base corpora was more reliable than that learned from the training examples alone, and that the n-gram semantic class association rules were effective in improving the markup performance, especially on the elements with sparse training examples. The authors also identify a number of challenges for any automatic markup system using taxonomic descriptions.

UR - http://www.scopus.com/inward/record.url?scp=33846012520&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=33846012520&partnerID=8YFLogxK

U2 - 10.1002/asi.20463

DO - 10.1002/asi.20463

M3 - Article

AN - SCOPUS:33846012520

VL - 58

SP - 133

EP - 149

JO - Journal of the Association for Information Science and Technology

JF - Journal of the Association for Information Science and Technology

SN - 2330-1635

IS - 1

ER -