Using Lexical Chains to Identify Text Difficulty: A Corpus Statistics and Classification Study

Partha Mukherjee, Gondy Augusta Leroy, David Kauchak

Research output: Contribution to journal › Article

Abstract

Our goal is data-driven discovery of features for text simplification. In this work, we investigate three types of lexical chains: exact, synonymous, and semantic. A lexical chain links semantically related words in a document. We examine their potential with 1) a document-level corpus statistics study (914 texts) to estimate their overall capacity to differentiate between easy and difficult text and 2) a classification task (11,000 sentences) to determine the usefulness of the features at the sentence level for simplification. For the corpus statistics study, we tested five document-level features for each chain type: total number of chains, average chain length, average chain span, number of crossing chains, and the number of chains longer than half the document length. We found significant differences between easy and difficult text for average chain length and the average number of crossing chains. For the sentence classification study, we compared the lexical chain features to standard bag-of-words features on a range of classifiers: logistic regression, naive Bayes, decision trees, linear and RBF kernel SVM, and random forest. The lexical chain features performed significantly better than the bag-of-words baseline across all classifiers, with the best classifier achieving an accuracy of ~90% (compared to 78% for bag-of-words). Overall, we find that several lexical chain features provide specific information useful for identifying difficult sentences of text, beyond what is available from standard lexical features.
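
To make the five document-level features concrete, here is a minimal sketch in Python for the simplest chain type, exact chains. It is an illustration under stated assumptions, not the authors' implementation, because the abstract does not define the features precisely. The function name `exact_chain_features`, the `min_length` parameter, and the following definitions are all hypothetical: an exact chain is the set of positions where the same lowercased token recurs; chain length is the number of occurrences; chain span is the token distance between the first and last occurrence; two chains cross when their spans overlap partially; and a chain is "long" when its span exceeds half the document's token count. Synonymous and semantic chains would need a lexical resource and are not covered here.

```python
from collections import defaultdict
from itertools import combinations


def exact_chain_features(tokens, min_length=2):
    """Compute five document-level features from exact (repeated-word) chains.

    All feature definitions here are assumptions made for illustration.
    """
    # Record every position at which each lowercased token occurs.
    positions = defaultdict(list)
    for index, token in enumerate(tokens):
        positions[token.lower()].append(index)

    # Assumption: a chain needs at least `min_length` occurrences of a word.
    chains = [p for p in positions.values() if len(p) >= min_length]
    if not chains:
        return {"num_chains": 0, "avg_length": 0.0, "avg_span": 0.0,
                "num_crossing": 0, "num_long": 0}

    # A span is the (first, last) position pair of a chain.
    spans = [(p[0], p[-1]) for p in chains]

    # Assumption: two chains cross when one starts inside the other's span
    # and ends after it (partial overlap, not nesting).
    num_crossing = sum(
        1
        for (s1, e1), (s2, e2) in combinations(spans, 2)
        if s1 < s2 < e1 < e2 or s2 < s1 < e2 < e1
    )

    half_document = len(tokens) / 2
    return {
        "num_chains": len(chains),
        "avg_length": sum(len(p) for p in chains) / len(chains),
        "avg_span": sum(end - start for start, end in spans) / len(chains),
        "num_crossing": num_crossing,
        # Assumption: "longer than half the document" refers to the span.
        "num_long": sum(1 for start, end in spans if end - start > half_document),
    }


features = exact_chain_features("heart drug heart dose drug dose".split())
print(features)
```

In practice, stop words would likely be filtered out before building chains, and the resulting feature dictionary could be fed to any of the classifiers named in the abstract.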

Original language: English (US)
Journal: IEEE Journal of Biomedical and Health Informatics
DOI: 10.1109/JBHI.2018.2885465
State: Accepted/In press - Jan 1 2018

Fingerprint

Classifiers
Statistics
Chain length
Decision Trees
Semantics
Logistic Models
Logistics
Forests

Keywords

  • classification
  • decision trees
  • Health informatics
  • logistic regression
  • naive Bayes
  • natural language processing
  • random forest
  • readability
  • SVM
  • text difficulty
  • text simplification

ASJC Scopus subject areas

  • Biotechnology
  • Computer Science Applications
  • Electrical and Electronic Engineering
  • Health Information Management

Cite this

@article{839d62ca3b9a4fc7a5d8c65b4627bfa5,
title = "Using Lexical Chains to Identify Text Difficulty: A Corpus Statistics and Classification Study",
abstract = "Our goal is data-driven discovery of features for text simplification. In this work, we investigate three types of lexical chains: exact, synonymous, and semantic. A lexical chain links semantically related words in a document. We examine their potential with 1) a document-level corpus statistics study (914 texts) to estimate their overall capacity to differentiate between easy and difficult text and 2) a classification task (11,000 sentences) to determine usefulness of features at sentence-level for simplification. For the corpus statistics study we tested five document-level features for each chain type: total number of chains, average chain length, average chain span, number of crossing chains, and the number of chains longer than half the document length. We found significant differences between easy and difficult text for average chain length and the average number of cross chains. For the sentence classification study, we compared the lexical chain features to standard bag-of-words features on a range of classifiers: logistic regression, naive Bayes, decision trees, linear and RBF kernel SVM, and random forest. The lexical chain features performed significantly better than the bag-of-words baseline across all classifiers with the best classifier achieving an accuracy of ~90{\%} (compared to 78{\%} for bag-of-words). Overall, we find several lexical chain features provide specific information useful for identifying difficult sentences of text, beyond what is available from standard lexical features.",
keywords = "classification, decision trees, Health informatics, logistic regression, naive Bayes, natural language processing, random forest, readability, SVM, text difficulty, text simplification",
author = "Partha Mukherjee and Leroy, {Gondy Augusta} and David Kauchak",
year = "2018",
month = "1",
day = "1",
doi = "10.1109/JBHI.2018.2885465",
language = "English (US)",
journal = "IEEE Journal of Biomedical and Health Informatics",
issn = "2168-2194",
publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - JOUR

T1 - Using Lexical Chains to Identify Text Difficulty

T2 - A Corpus Statistics and Classification Study

AU - Mukherjee, Partha

AU - Leroy, Gondy Augusta

AU - Kauchak, David

PY - 2018/1/1

Y1 - 2018/1/1

N2 - Our goal is data-driven discovery of features for text simplification. In this work, we investigate three types of lexical chains: exact, synonymous, and semantic. A lexical chain links semantically related words in a document. We examine their potential with 1) a document-level corpus statistics study (914 texts) to estimate their overall capacity to differentiate between easy and difficult text and 2) a classification task (11,000 sentences) to determine usefulness of features at sentence-level for simplification. For the corpus statistics study we tested five document-level features for each chain type: total number of chains, average chain length, average chain span, number of crossing chains, and the number of chains longer than half the document length. We found significant differences between easy and difficult text for average chain length and the average number of cross chains. For the sentence classification study, we compared the lexical chain features to standard bag-of-words features on a range of classifiers: logistic regression, naive Bayes, decision trees, linear and RBF kernel SVM, and random forest. The lexical chain features performed significantly better than the bag-of-words baseline across all classifiers with the best classifier achieving an accuracy of ~90% (compared to 78% for bag-of-words). Overall, we find several lexical chain features provide specific information useful for identifying difficult sentences of text, beyond what is available from standard lexical features.

KW - classification

KW - decision trees

KW - Health informatics

KW - logistic regression

KW - naive Bayes

KW - natural language processing

KW - random forest

KW - readability

KW - SVM

KW - text difficulty

KW - text simplification

UR - http://www.scopus.com/inward/record.url?scp=85058131390&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85058131390&partnerID=8YFLogxK

U2 - 10.1109/JBHI.2018.2885465

DO - 10.1109/JBHI.2018.2885465

M3 - Article

C2 - 30530380

AN - SCOPUS:85058131390

JO - IEEE Journal of Biomedical and Health Informatics

JF - IEEE Journal of Biomedical and Health Informatics

SN - 2168-2194

ER -