Combining NLP with evidence-based methods to find text metrics related to perceived and actual text difficulty

Gondy Augusta Leroy, James E. Endicott

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

10 Citations (Scopus)

Abstract

Measuring text difficulty is prevalent in health informatics since it is useful for information personalization and optimization. Unfortunately, it is uncertain how best to compute difficulty so that it relates to reader understanding. We aim to create computational, evidence-based metrics of perceived and actual text difficulty. We start with a corpus analysis to identify candidate metrics, which are further tested in user studies. Our corpus contains blogs and journal articles (N=1,073) representing easy and difficult text. Using natural language processing, we calculated base grammatical and semantic metrics, constructed new composite metrics (noun phrase complexity and semantic familiarity), and measured the commonly used Flesch-Kincaid grade level. The metrics differed significantly between document types. Nouns were more prevalent but less familiar in difficult text; verbs and function words were more prevalent in easy text. Noun phrase complexity was lower, semantic familiarity was higher, and grade levels were lower in easy text. All metrics were then tested for their relation to perceived and actual difficulty using follow-up analyses of two earlier user studies. Base metrics and noun phrase complexity correlated significantly with perceived difficulty and could help explain actual difficulty.
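The abstract names measures that are straightforward to reproduce: the Flesch-Kincaid grade level and base grammatical metrics such as the proportions of nouns, verbs, and function words. As a point of reference, here is a minimal Python/NLTK sketch of such measures. It illustrates the general techniques, not the authors' pipeline; the vowel-group syllable heuristic and the set of Penn Treebank tags treated as function words are simplifying assumptions.

# Minimal sketch of two kinds of measures discussed in the abstract.
# Not the authors' implementation; the syllable counter and the
# function-word tag set below are simplifying assumptions.
# Setup: pip install nltk, then nltk.download('punkt') and
# nltk.download('averaged_perceptron_tagger').
import re
import nltk

def count_syllables(word):
    # Heuristic: one syllable per group of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    # Flesch-Kincaid grade level:
    # 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    sentences = nltk.sent_tokenize(text)
    words = [w for w in nltk.word_tokenize(text) if w.isalpha()]
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)

def pos_proportions(text):
    # Share of nouns, verbs, and (closed-class) function words, based on
    # Penn Treebank tags from NLTK's default part-of-speech tagger.
    words = [w for w in nltk.word_tokenize(text) if w.isalpha()]
    tags = [tag for _, tag in nltk.pos_tag(words)]
    function_tags = {"DT", "IN", "CC", "TO", "PRP", "PRP$", "WDT", "WP", "MD"}
    n = len(tags) or 1
    return {
        "noun": sum(t.startswith("NN") for t in tags) / n,
        "verb": sum(t.startswith("VB") for t in tags) / n,
        "function": sum(t in function_tags for t in tags) / n,
    }

if __name__ == "__main__":
    text = "The doctor looked at my arm. She said it would heal soon."
    print(flesch_kincaid_grade(text))   # low grade level for simple text
    print(pos_proportions(text))        # relatively many verbs/function words

Run on easy consumer-style text and on a journal-article paragraph, such a script makes the contrast the abstract reports (more verbs and function words, and lower grade levels, in easy text) directly observable.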

Original language: English (US)
Title of host publication: IHI'12 - Proceedings of the 2nd ACM SIGHIT International Health Informatics Symposium
Pages: 749-753
Number of pages: 5
ISBN (Print): 9781450307819
DOI: 10.1145/2110363.2110452
State: Published - 2012
Externally published: Yes
Event: 2nd ACM SIGHIT International Health Informatics Symposium, IHI'12 - Miami, FL, United States
Duration: Jan 28 2012 – Jan 30 2012

Other

Other: 2nd ACM SIGHIT International Health Informatics Symposium, IHI'12
Country: United States
City: Miami, FL
Period: 1/28/12 – 1/30/12

Fingerprint

  • Semantics
  • Blogging
  • Natural Language Processing
  • Informatics
  • Health
  • Recognition (Psychology)

Keywords

  • Actual difficulty
  • Health informatics
  • Natural language processing
  • Perceived difficulty
  • Readability

ASJC Scopus subject areas

  • Health Informatics
  • Health Information Management

Cite this

Leroy, G. A., & Endicott, J. E. (2012). Combining NLP with evidence-based methods to find text metrics related to perceived and actual text difficulty. In IHI'12 - Proceedings of the 2nd ACM SIGHIT International Health Informatics Symposium (pp. 749-753) https://doi.org/10.1145/2110363.2110452

