Using symbolic knowledge in the UMLS to disambiguate words in small datasets with a Naïve Bayes classifier

Gondy Augusta Leroy, Thomas C. Rindflesch

Research output: Chapter in Book/Report/Conference proceeding (Chapter)

4 Citations (Scopus)

Abstract

Current approaches to word sense disambiguation use and combine various machine-learning techniques. Most refer to characteristics of the ambiguous word and surrounding words and are based on hundreds of examples. Unfortunately, developing large training sets is time-consuming. We investigate the use of symbolic knowledge to augment machine-learning techniques for small datasets. UMLS semantic types assigned to concepts found in the sentence and relationships between these semantic types form the knowledge base. A naïve Bayes classifier was trained for 15 words with 100 examples for each. The most frequent sense of a word served as the baseline. The effect of increasingly accurate symbolic knowledge was evaluated in eight experimental conditions. Performance was measured by accuracy based on 10-fold cross-validation. The best condition used only the semantic types of the words in the sentence. Accuracy was then on average 10% higher than the baseline; however, it varied from 8% deterioration to 29% improvement. In a follow-up evaluation, we noted a trend that the best disambiguation was found for words that were the least troublesome to the human evaluators.
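The setup described in the abstract (a naïve Bayes classifier over the semantic types found in a sentence, a most-frequent-sense baseline, and accuracy from 10-fold cross-validation) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the toy examples, the sense labels, the add-one smoothing, the strided fold assignment, and all function names are hypothetical, and the eight experimental conditions are not reproduced.

```python
import math
from collections import Counter, defaultdict


def train_nb(examples):
    """Count senses and per-sense feature occurrences.

    examples: list of (features, sense), where features is a list of
    semantic-type names found in the sentence (assumed representation).
    """
    sense_counts = Counter(sense for _, sense in examples)
    feat_counts = defaultdict(Counter)
    vocab = set()
    for feats, sense in examples:
        for f in feats:
            feat_counts[sense][f] += 1
            vocab.add(f)
    return sense_counts, feat_counts, vocab


def predict_nb(model, feats):
    """Return the sense with the highest log posterior (add-one smoothing)."""
    sense_counts, feat_counts, vocab = model
    total = sum(sense_counts.values())
    best_sense, best_score = None, -math.inf
    for sense in sorted(sense_counts):
        score = math.log(sense_counts[sense] / total)
        denom = sum(feat_counts[sense].values()) + len(vocab)
        for f in feats:
            if f in vocab:  # ignore features never seen in training
                score += math.log((feat_counts[sense][f] + 1) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense


def most_frequent_sense(examples):
    """Baseline: always predict the most common sense in the training data."""
    return Counter(sense for _, sense in examples).most_common(1)[0][0]


def cross_validate(examples, k=10):
    """Return (naive Bayes accuracy, baseline accuracy) over k folds."""
    nb_correct = base_correct = 0
    for i in range(k):
        test = examples[i::k]  # every k-th example goes to fold i
        train = [ex for j, ex in enumerate(examples) if j % k != i]
        model = train_nb(train)
        baseline = most_frequent_sense(train)
        for feats, sense in test:
            nb_correct += predict_nb(model, feats) == sense
            base_correct += baseline == sense
    return nb_correct / len(examples), base_correct / len(examples)


# Invented toy data for the ambiguous word "cold"; not taken from the paper.
toy = (
    [(["Disease or Syndrome", "Sign or Symptom"], "common_cold")] * 12
    + [(["Natural Phenomenon or Process"], "low_temperature")] * 8
)

if __name__ == "__main__":
    nb_acc, base_acc = cross_validate(toy, k=10)
    print(f"naive Bayes: {nb_acc:.2f}  baseline: {base_acc:.2f}")
```

The toy data is deliberately separable, so the gap between classifier and baseline here says nothing about the results reported above; the point is only to show how the baseline and the cross-validated accuracy are computed from the same folds.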

Original language: English (US)
Title of host publication: Studies in Health Technology and Informatics
Pages: 381-385
Number of pages: 5
Volume: 107
DOI: 10.3233/978-1-60750-949-3-381
State: Published - 2004
Externally published: Yes

Keywords

  • Artificial intelligence
  • machine learning
  • naïve Bayes
  • small datasets
  • symbolic knowledge
  • UMLS
  • Unified Medical Language System
  • word sense disambiguation

ASJC Scopus subject areas

  • Biomedical Engineering
  • Health Informatics
  • Health Information Management

Cite this

Leroy, Gondy Augusta; Rindflesch, Thomas C. Using symbolic knowledge in the UMLS to disambiguate words in small datasets with a Naïve Bayes classifier. Studies in Health Technology and Informatics. Vol. 107. 2004. pp. 381-385.

@inbook{2798b53c74cb4ebea15a80ad59bcbd28,
title = "Using symbolic knowledge in the UMLS to disambiguate words in small datasets with a Na{\"i}ve Bayes classifier",
abstract = "Current approaches to word sense disambiguation use and combine various machine-learning techniques. Most refer to characteristics of the ambiguous word and surrounding words and are based on hundreds of examples. Unfortunately, developing large training sets is time-consuming. We investigate the use of symbolic knowledge to augment machine-learning techniques for small datasets. UMLS semantic types assigned to concepts found in the sentence and relationships between these semantic types form the knowledge base. A na{\"i}ve Bayes classifier was trained for 15 words with 100 examples for each. The most frequent sense of a word served as the baseline. The effect of increasingly accurate symbolic knowledge was evaluated in eight experimental conditions. Performance was measured by accuracy based on 10-fold cross-validation. The best condition used only the semantic types of the words in the sentence. Accuracy was then on average 10{\%} higher than the baseline; however, it varied from 8{\%} deterioration to 29{\%} improvement. In a follow-up evaluation, we noted a trend that the best disambiguation was found for words that were the least troublesome to the human evaluators.",
keywords = "Artificial intelligence, machine learning, na{\"i}ve Bayes, small datasets, symbolic knowledge, UMLS, Unified Medical Language System, word sense disambiguation",
author = "Leroy, {Gondy Augusta} and Rindflesch, {Thomas C.}",
year = "2004",
doi = "10.3233/978-1-60750-949-3-381",
language = "English (US)",
volume = "107",
pages = "381--385",
booktitle = "Studies in Health Technology and Informatics",

}

TY - CHAP

T1 - Using symbolic knowledge in the UMLS to disambiguate words in small datasets with a Naïve Bayes classifier

AU - Leroy, Gondy Augusta

AU - Rindflesch, Thomas C.

PY - 2004

Y1 - 2004

AB - Current approaches to word sense disambiguation use and combine various machine-learning techniques. Most refer to characteristics of the ambiguous word and surrounding words and are based on hundreds of examples. Unfortunately, developing large training sets is time-consuming. We investigate the use of symbolic knowledge to augment machine-learning techniques for small datasets. UMLS semantic types assigned to concepts found in the sentence and relationships between these semantic types form the knowledge base. A naïve Bayes classifier was trained for 15 words with 100 examples for each. The most frequent sense of a word served as the baseline. The effect of increasingly accurate symbolic knowledge was evaluated in eight experimental conditions. Performance was measured by accuracy based on 10-fold cross-validation. The best condition used only the semantic types of the words in the sentence. Accuracy was then on average 10% higher than the baseline; however, it varied from 8% deterioration to 29% improvement. In a follow-up evaluation, we noted a trend that the best disambiguation was found for words that were the least troublesome to the human evaluators.

KW - Artificial intelligence

KW - machine learning

KW - naïve Bayes

KW - small datasets

KW - symbolic knowledge

KW - UMLS

KW - Unified Medical Language System

KW - word sense disambiguation

UR - http://www.scopus.com/inward/record.url?scp=84887084077&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84887084077&partnerID=8YFLogxK

U2 - 10.3233/978-1-60750-949-3-381

DO - 10.3233/978-1-60750-949-3-381

M3 - Chapter

C2 - 15360839

AN - SCOPUS:84887084077

VL - 107

SP - 381

EP - 385

BT - Studies in Health Technology and Informatics

ER -