Text simplification tools: Using machine learning to discover features that identify difficult text

David Kauchak, Obay Mouradi, Christopher Pentoney, Gondy Augusta Leroy

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

7 Citations (Scopus)

Abstract

Although providing understandable information is a critical component in healthcare, few tools exist to help clinicians identify difficult sections in text. We systematically examine sixteen features for predicting the difficulty of health texts using six different machine learning algorithms. Three represent new features not previously examined: medical concept density; specificity (calculated using word-level depth in MeSH); and ambiguity (calculated using the number of UMLS Metathesaurus concepts associated with a word). We examine these features for a binary prediction task on 118,000 simple and difficult sentences from a sentence-aligned corpus. Using all features, random forests is the most accurate with 84% accuracy. Model analysis of the six models and a complementary ablation study shows that the specificity and ambiguity features are the strongest predictors (24% combined impact on accuracy). Notably, a training size study showed that even with a 1% sample (1,062 sentences) an accuracy of 80% can be achieved.
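
As a rough illustration of the three new features named in the abstract (medical concept density, specificity, and ambiguity), the following Python sketch computes sentence-level values from toy lookup tables. Everything concrete here is an assumption made for illustration: the dictionaries `MESH_DEPTH` and `UMLS_CONCEPT_COUNT` are hypothetical stand-ins for MeSH and the UMLS Metathesaurus, the tokenizer is simplistic, and averaging word-level values over the sentence is one plausible aggregation, not necessarily the one the authors used.

```python
from statistics import mean

# Hypothetical toy lookups. Real values would come from MeSH (tree depth of a
# term) and the UMLS Metathesaurus (number of concepts linked to a word).
MESH_DEPTH = {"disease": 1, "infection": 2, "pneumonia": 4, "antibiotic": 3}
UMLS_CONCEPT_COUNT = {"disease": 3, "infection": 2, "pneumonia": 1,
                      "antibiotic": 1, "cold": 6}


def tokenize(sentence):
    """Lowercase the sentence and split on whitespace, trimming punctuation."""
    tokens = (w.strip(".,;:()").lower() for w in sentence.split())
    return [t for t in tokens if t]


def sentence_features(sentence):
    """Return assumed sentence-level versions of the three features."""
    words = tokenize(sentence)
    # A word counts as "medical" here if either toy lookup knows it.
    medical = [w for w in words if w in MESH_DEPTH or w in UMLS_CONCEPT_COUNT]
    depths = [MESH_DEPTH[w] for w in words if w in MESH_DEPTH]
    counts = [UMLS_CONCEPT_COUNT[w] for w in words if w in UMLS_CONCEPT_COUNT]
    return {
        # Share of words that map to a medical concept.
        "concept_density": len(medical) / len(words) if words else 0.0,
        # Mean MeSH depth: deeper terms are treated as more specific.
        "specificity": mean(depths) if depths else 0.0,
        # Mean number of associated concepts: more concepts, more ambiguity.
        "ambiguity": mean(counts) if counts else 0.0,
    }


if __name__ == "__main__":
    print(sentence_features("Pneumonia is a lung infection."))
```

Feature dictionaries of this kind could then be fed, together with the other features, to a classifier such as a random forest for the binary simple/difficult prediction task the abstract describes.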

Original language: English (US)
Title of host publication: Proceedings of the Annual Hawaii International Conference on System Sciences
Publisher: IEEE Computer Society
Pages: 2616-2625
Number of pages: 10
ISBN (Print): 9781479925049
DOIs: https://doi.org/10.1109/HICSS.2014.330
State: Published - 2014
Event: 47th Hawaii International Conference on System Sciences, HICSS 2014 - Waikoloa, HI, United States
Duration: Jan 6 2014 to Jan 9 2014

Other

Other: 47th Hawaii International Conference on System Sciences, HICSS 2014
Country: United States
City: Waikoloa, HI
Period: 1/6/14 to 1/9/14

Fingerprint

Learning systems
Ablation
Learning algorithms
Health

ASJC Scopus subject areas

  • Engineering(all)

Cite this

Kauchak, D., Mouradi, O., Pentoney, C., & Leroy, G. A. (2014). Text simplification tools: Using machine learning to discover features that identify difficult text. In Proceedings of the Annual Hawaii International Conference on System Sciences (pp. 2616-2625). [6758930] IEEE Computer Society. https://doi.org/10.1109/HICSS.2014.330

Text simplification tools: Using machine learning to discover features that identify difficult text. / Kauchak, David; Mouradi, Obay; Pentoney, Christopher; Leroy, Gondy Augusta.

Proceedings of the Annual Hawaii International Conference on System Sciences. IEEE Computer Society, 2014. p. 2616-2625 6758930.

Kauchak, D, Mouradi, O, Pentoney, C & Leroy, GA 2014, Text simplification tools: Using machine learning to discover features that identify difficult text. in Proceedings of the Annual Hawaii International Conference on System Sciences., 6758930, IEEE Computer Society, pp. 2616-2625, 47th Hawaii International Conference on System Sciences, HICSS 2014, Waikoloa, HI, United States, 1/6/14. https://doi.org/10.1109/HICSS.2014.330
Kauchak D, Mouradi O, Pentoney C, Leroy GA. Text simplification tools: Using machine learning to discover features that identify difficult text. In Proceedings of the Annual Hawaii International Conference on System Sciences. IEEE Computer Society. 2014. p. 2616-2625. 6758930 https://doi.org/10.1109/HICSS.2014.330
Kauchak, David; Mouradi, Obay; Pentoney, Christopher; Leroy, Gondy Augusta. / Text simplification tools: Using machine learning to discover features that identify difficult text. Proceedings of the Annual Hawaii International Conference on System Sciences. IEEE Computer Society, 2014. pp. 2616-2625
@inproceedings{20d581f4dd9c4b728d26e1418919d573,
title = "Text simplification tools: Using machine learning to discover features that identify difficult text",
abstract = "Although providing understandable information is a critical component in healthcare, few tools exist to help clinicians identify difficult sections in text. We systematically examine sixteen features for predicting the difficulty of health texts using six different machine learning algorithms. Three represent new features not previously examined: medical concept density; specificity (calculated using word-level depth in MeSH); and ambiguity (calculated using the number of UMLS Metathesaurus concepts associated with a word). We examine these features for a binary prediction task on 118,000 simple and difficult sentences from a sentence-aligned corpus. Using all features, random forests is the most accurate with 84{\%} accuracy. Model analysis of the six models and a complementary ablation study shows that the specificity and ambiguity features are the strongest predictors (24{\%} combined impact on accuracy). Notably, a training size study showed that even with a 1{\%} sample (1,062 sentences) an accuracy of 80{\%} can be achieved.",
author = "Kauchak, David and Mouradi, Obay and Pentoney, Christopher and Leroy, {Gondy Augusta}",
year = "2014",
doi = "10.1109/HICSS.2014.330",
language = "English (US)",
isbn = "9781479925049",
pages = "2616--2625",
booktitle = "Proceedings of the Annual Hawaii International Conference on System Sciences",
publisher = "IEEE Computer Society",
}

TY - GEN
T1 - Text simplification tools
T2 - Using machine learning to discover features that identify difficult text
AU - Kauchak, David
AU - Mouradi, Obay
AU - Pentoney, Christopher
AU - Leroy, Gondy Augusta
PY - 2014
Y1 - 2014

N2 - Although providing understandable information is a critical component in healthcare, few tools exist to help clinicians identify difficult sections in text. We systematically examine sixteen features for predicting the difficulty of health texts using six different machine learning algorithms. Three represent new features not previously examined: medical concept density; specificity (calculated using word-level depth in MeSH); and ambiguity (calculated using the number of UMLS Metathesaurus concepts associated with a word). We examine these features for a binary prediction task on 118,000 simple and difficult sentences from a sentence-aligned corpus. Using all features, random forests is the most accurate with 84% accuracy. Model analysis of the six models and a complementary ablation study shows that the specificity and ambiguity features are the strongest predictors (24% combined impact on accuracy). Notably, a training size study showed that even with a 1% sample (1,062 sentences) an accuracy of 80% can be achieved.

UR - http://www.scopus.com/inward/record.url?scp=84902295430&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84902295430&partnerID=8YFLogxK
U2 - 10.1109/HICSS.2014.330
DO - 10.1109/HICSS.2014.330
M3 - Conference contribution
AN - SCOPUS:84902295430
SN - 9781479925049
SP - 2616
EP - 2625
BT - Proceedings of the Annual Hawaii International Conference on System Sciences
PB - IEEE Computer Society
ER -