Using Lexical Chains to Identify Text Difficulty: A Corpus Statistics and Classification Study

Partha Mukherjee, Gondy Augusta Leroy, David Kauchak

Research output: Contribution to journal › Article


Abstract

Our goal is data-driven discovery of features for text simplification. In this work, we investigate three types of lexical chains: exact, synonymous, and semantic. A lexical chain links semantically related words in a document. We examine their potential with (1) a document-level corpus statistics study (914 texts) to estimate their overall capacity to differentiate between easy and difficult text and (2) a classification task (11,000 sentences) to determine the usefulness of the features at the sentence level for simplification. For the corpus statistics study, we tested five document-level features for each chain type: total number of chains, average chain length, average chain span, number of crossing chains, and the number of chains longer than half the document length. We found significant differences between easy and difficult text for average chain length and the average number of crossing chains. For the sentence classification study, we compared the lexical chain features to standard bag-of-words features on a range of classifiers: logistic regression, naive Bayes, decision trees, linear and RBF kernel SVM, and random forest. The lexical chain features performed significantly better than the bag-of-words baseline across all classifiers, with the best classifier achieving an accuracy of ~90% (compared to 78% for bag-of-words). Overall, we find that several lexical chain features provide specific information useful for identifying difficult sentences of text, beyond what is available from standard lexical features.
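As an illustration of the five document-level features listed in the abstract, the following is a minimal sketch for exact chains only. The abstract does not give the feature definitions, so everything here is an assumption made for the example: a chain is every repeated word form, chain length is the number of occurrences, chain span is the token distance between first and last occurrence, a chain counts as crossing when its span overlaps the span of another chain, and a chain is "longer than half the document" when its span exceeds half the token count. The function names `exact_chains` and `chain_features` are hypothetical and do not come from the paper.

```python
import re
from collections import defaultdict


def exact_chains(text, min_length=2):
    """Build 'exact' chains: each repeated word form maps to its token positions.

    Assumption: no stop-word filtering or lemmatization is applied.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    positions = defaultdict(list)
    for index, token in enumerate(tokens):
        positions[token].append(index)
    # Keep only words that occur at least min_length times.
    chains = {w: p for w, p in positions.items() if len(p) >= min_length}
    return tokens, chains


def chain_features(tokens, chains):
    """Compute five document-level features under the assumed definitions."""
    lengths = [len(p) for p in chains.values()]
    spans = [p[-1] - p[0] for p in chains.values()]
    intervals = [(p[0], p[-1]) for p in chains.values()]

    def overlaps(a, b):
        # Two closed intervals overlap if each starts before the other ends.
        return a[0] <= b[1] and b[0] <= a[1]

    # Assumed definition: a chain is "crossing" if its span overlaps
    # the span of at least one other chain.
    crossing = sum(
        1
        for i, a in enumerate(intervals)
        if any(overlaps(a, b) for j, b in enumerate(intervals) if j != i)
    )
    half_doc = len(tokens) / 2
    count = len(chains)
    return {
        "total_chains": count,
        "avg_chain_length": sum(lengths) / count if count else 0.0,
        "avg_chain_span": sum(spans) / count if count else 0.0,
        "crossing_chains": crossing,
        "long_chains": sum(1 for s in spans if s > half_doc),
    }


tokens, chains = exact_chains("The heart pumps blood. The heart rests.")
features = chain_features(tokens, chains)
print(features)
```

Synonymous and semantic chains would need a lexical resource to decide which words belong together; only the chain-building step would change, and the feature computation above could stay the same.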

Original language: English (US)
Journal: IEEE Journal of Biomedical and Health Informatics
State: Accepted/In press - Jan 1 2018

Keywords

  • classification
  • decision trees
  • Health informatics
  • logistic regression
  • naive Bayes
  • natural language processing
  • random forest
  • readability
  • SVM
  • text difficulty
  • text simplification

ASJC Scopus subject areas

  • Biotechnology
  • Computer Science Applications
  • Electrical and Electronic Engineering
  • Health Information Management
