Effects of information and machine learning algorithms on word sense disambiguation with small datasets

Gondy Leroy, Thomas C. Rindflesch

Research output: Contribution to journal (Article)

35 Scopus citations

Abstract

Current approaches to word sense disambiguation use (and often combine) various machine learning techniques. Most refer to characteristics of the ambiguity and its surrounding words and are based on thousands of examples. Unfortunately, developing large training sets is burdensome, and in response to this challenge, we investigate the use of symbolic knowledge for small datasets. A naïve Bayes classifier was trained for 15 words with 100 examples for each. Unified Medical Language System (UMLS) semantic types assigned to concepts found in the sentence and relationships between these semantic types form the knowledge base. The most frequent sense of a word served as the baseline. The effect of increasingly accurate symbolic knowledge was evaluated in nine experimental conditions. Performance was measured by accuracy based on 10-fold cross-validation. The best condition used only the semantic types of the words in the sentence. Accuracy was then on average 10% higher than the baseline; however, it varied from 8% deterioration to 29% improvement. To investigate this large variance, we performed several follow-up evaluations, testing additional algorithms (decision tree and neural network), and gold standards (per expert), but the results did not significantly differ. However, we noted a trend that the best disambiguation was found for words that were the least troublesome to the human evaluators. We conclude that neither algorithm nor individual human behavior causes these large differences, but that the structure of the UMLS Metathesaurus (used to represent senses of ambiguous words) contributes to inaccuracies in the gold standard, leading to varied performance of word sense disambiguation techniques.
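The evaluation setup described in the abstract (a naïve Bayes classifier per ambiguous word, a most-frequent-sense baseline, accuracy under 10-fold cross-validation) can be illustrated with a minimal sketch. This is not the authors' implementation: the add-one smoothing, the fold assignment, the choice to compute the baseline on the training folds, and the function names `train_nb`, `predict_nb`, and `cross_validate` are all assumptions made for illustration. The toy data is invented and deliberately separable; it only borrows UMLS semantic type names as feature strings.

```python
import math
import random
from collections import Counter, defaultdict


def train_nb(examples):
    """Count senses and per-sense features; each example is (set_of_semantic_types, sense)."""
    sense_counts = Counter(sense for _, sense in examples)
    feat_counts = defaultdict(Counter)
    vocab = set()
    for feats, sense in examples:
        for f in feats:
            feat_counts[sense][f] += 1
            vocab.add(f)
    return sense_counts, feat_counts, vocab


def predict_nb(model, feats):
    """Return the sense with the highest log-probability (add-one smoothing, an assumption)."""
    sense_counts, feat_counts, vocab = model
    total = sum(sense_counts.values())
    best_sense, best_score = None, -math.inf
    for sense, n in sense_counts.items():
        score = math.log(n / total)
        denom = sum(feat_counts[sense].values()) + len(vocab)
        for f in feats:
            if f in vocab:  # features unseen in training are ignored
                score += math.log((feat_counts[sense][f] + 1) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense


def cross_validate(examples, k=10, seed=0):
    """Return (naive Bayes accuracy, most-frequent-sense baseline accuracy) over k folds."""
    data = examples[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]
    nb_correct = base_correct = 0
    for i, test in enumerate(folds):
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train_nb(train)
        # Baseline: always predict the most frequent sense seen in the training folds.
        majority = Counter(s for _, s in train).most_common(1)[0][0]
        for feats, sense in test:
            nb_correct += predict_nb(model, feats) == sense
            base_correct += majority == sense
    return nb_correct / len(data), base_correct / len(data)


# Invented toy data: 100 examples of one hypothetical ambiguous word with two senses.
toy = ([({"Sign or Symptom", "Pharmacologic Substance"}, "illness")] * 60
       + [({"Natural Phenomenon or Process", "Spatial Concept"}, "temperature")] * 40)

nb_acc, base_acc = cross_validate(toy)
print(f"naive Bayes: {nb_acc:.2f}  baseline: {base_acc:.2f}")
```

On real annotated sentences the features would overlap across senses, so the gap between classifier and baseline would be much smaller and, as the abstract reports, can even be negative for some words.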

Original language: English (US)
Pages (from-to): 573-585
Number of pages: 13
Journal: International Journal of Medical Informatics
Volume: 74
Issue number: 7-8
State: Published - Aug 2005

Keywords

  • Decision tree
  • Machine learning
  • Naïve Bayes
  • Neural network
  • UMLS
  • Word sense disambiguation

ASJC Scopus subject areas

  • Health Informatics
