Optimizing Corpus Creation for Training Word Embedding in Low Resource Domains: A Case Study in Autism Spectrum Disorder (ASD)

Research output: Contribution to journal (Article)

1 Citation (Scopus)

Abstract

Automating the extraction of behavioral criteria indicative of Autism Spectrum Disorder (ASD) in electronic health records (EHRs) can contribute significantly to the effort to monitor the condition. Word embedding algorithms such as Word2Vec can encode semantic meanings of words in vectors and assist in automated vocabulary discovery from EHRs. However, text available for training word embeddings for ASD is minuscule compared to the billions of tokens typically used. We evaluate the importance of corpus specificity versus size and hypothesize that for specific domains small corpora can generate excellent word embeddings. We custom-built 6 ASD-themed corpora (N=4482), using ASD EHRs and abstracts from PubMed (N=39K) and PsychInfo (N=69K) and evaluated them. We were able to generate the most useful 200-dimension embeddings based on the small ASD EHR data. Due to diversity in their vocabulary, the abstract-based embeddings generated fewer related terms and saw minimal improvement when the size of the corpus increased.
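
The vocabulary-discovery step mentioned in the abstract can be illustrated with a minimal sketch: given trained word vectors, terms related to a seed word are commonly ranked by cosine similarity. This is not the authors' code. The function names `cosine` and `related_terms`, the example words, and the 3-dimensional toy vectors are all assumptions made for illustration; the study itself reports 200-dimension Word2Vec embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_u * norm_v)


def related_terms(seed, embeddings, top_n=3):
    """Rank every other vocabulary term by similarity to the seed term."""
    seed_vec = embeddings[seed]
    scored = [
        (term, cosine(seed_vec, vec))
        for term, vec in embeddings.items()
        if term != seed
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical toy vectors, invented for illustration only.
toy_embeddings = {
    "rocking": [0.9, 0.1, 0.0],
    "flapping": [0.8, 0.2, 0.1],
    "spinning": [0.7, 0.3, 0.0],
    "invoice": [0.0, 0.1, 0.9],
}

candidates = related_terms("rocking", toy_embeddings, top_n=2)
for term, score in candidates:
    print(f"{term}\t{score:.3f}")
```

With these toy vectors, the two highest-ranked candidates for "rocking" are "flapping" and "spinning", while the unrelated "invoice" scores near zero. In practice the vectors would come from a trained embedding model rather than a hand-written table.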

Original language: English (US)
Pages (from-to): 508-517
Number of pages: 10
Journal: AMIA ... Annual Symposium proceedings. AMIA Symposium
Volume: 2018
State: Published - Jan 1 2018

Fingerprint

Electronic Health Records
Vocabulary
Semantics
PubMed
Autism Spectrum Disorder

ASJC Scopus subject areas

  • Medicine(all)

Cite this

@article{dc6766dd05844264b1cbced34b50d6c2,
title = "Optimizing Corpus Creation for Training Word Embedding in Low Resource Domains: A Case Study in Autism Spectrum Disorder (ASD)",
abstract = "Automating the extraction of behavioral criteria indicative of Autism Spectrum Disorder (ASD) in electronic health records (EHRs) can contribute significantly to the effort to monitor the condition. Word embedding algorithms such as Word2Vec can encode semantic meanings of words in vectors and assist in automated vocabulary discovery from EHRs. However, text available for training word embeddings for ASD is minuscule compared to the billions of tokens typically used. We evaluate the importance of corpus specificity versus size and hypothesize that for specific domains small corpora can generate excellent word embeddings. We custom-built 6 ASD-themed corpora (N=4482), using ASD EHRs and abstracts from PubMed (N=39K) and PsychInfo (N=69K) and evaluated them. We were able to generate the most useful 200-dimension embeddings based on the small ASD EHR data. Due to diversity in their vocabulary, the abstract-based embeddings generated fewer related terms and saw minimal improvement when the size of the corpus increased.",
author = "Yang Gu and Leroy, {Gondy Augusta} and Pettygrove, {Sydney D} and Galindo, {Maureen Kelly} and Margaret Kurzius-Spencer",
year = "2018",
month = "1",
day = "1",
language = "English (US)",
volume = "2018",
pages = "508--517",
journal = "AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium",
issn = "1559-4076",
publisher = "American Medical Informatics Association",

}

TY - JOUR

T1 - Optimizing Corpus Creation for Training Word Embedding in Low Resource Domains

T2 - A Case Study in Autism Spectrum Disorder (ASD)

AU - Gu, Yang

AU - Leroy, Gondy Augusta

AU - Pettygrove, Sydney D

AU - Galindo, Maureen Kelly

AU - Kurzius-Spencer, Margaret

PY - 2018/1/1

Y1 - 2018/1/1

AB - Automating the extraction of behavioral criteria indicative of Autism Spectrum Disorder (ASD) in electronic health records (EHRs) can contribute significantly to the effort to monitor the condition. Word embedding algorithms such as Word2Vec can encode semantic meanings of words in vectors and assist in automated vocabulary discovery from EHRs. However, text available for training word embeddings for ASD is minuscule compared to the billions of tokens typically used. We evaluate the importance of corpus specificity versus size and hypothesize that for specific domains small corpora can generate excellent word embeddings. We custom-built 6 ASD-themed corpora (N=4482), using ASD EHRs and abstracts from PubMed (N=39K) and PsychInfo (N=69K) and evaluated them. We were able to generate the most useful 200-dimension embeddings based on the small ASD EHR data. Due to diversity in their vocabulary, the abstract-based embeddings generated fewer related terms and saw minimal improvement when the size of the corpus increased.

UR - http://www.scopus.com/inward/record.url?scp=85062376782&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85062376782&partnerID=8YFLogxK

M3 - Article

C2 - 30815091

AN - SCOPUS:85062376782

VL - 2018

SP - 508

EP - 517

JO - AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium

JF - AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium

SN - 1559-4076

ER -