Mining e-cigarette adverse events in social media using Bi-LSTM recurrent neural network with word embedding representation

Jiaheng Xie, Xiao Liu, Dajun Zeng

Research output: Contribution to journal › Article

11 Citations (Scopus)

Abstract

Objective: Recent years have seen increased worldwide popularity of e-cigarette use. However, the risks of e-cigarettes are underexamined. Most e-cigarette adverse event studies have achieved low detection rates due to limited subject sample sizes in the experiments and surveys. Social media provides a large data repository of consumers' e-cigarette feedback and experiences, which are useful for e-cigarette safety surveillance. However, it is difficult to automatically interpret the informal and nontechnical consumer vocabulary about e-cigarettes in social media. This issue hinders the use of social media content for e-cigarette safety surveillance. Recent developments in deep neural network methods have shown promise for named entity extraction from noisy text. Motivated by these observations, we aimed to design a deep neural network approach to extract e-cigarette safety information in social media.

Methods: Our deep neural language model utilizes word embedding as the representation of text input and recognizes named entity types with the state-of-the-art Bidirectional Long Short-Term Memory (Bi-LSTM) Recurrent Neural Network.

Results: Our Bi-LSTM model achieved the best performance compared to 3 baseline models, with a precision of 94.10%, a recall of 91.80%, and an F-measure of 92.94%. We identified 1591 unique adverse events and 9930 unique e-cigarette components (ie, chemicals, flavors, and devices) from our research testbed.

Conclusion: Although the conditional random field baseline model had slightly better precision than our approach, our Bi-LSTM model achieved much higher recall, resulting in the best F-measure. Our method can be generalized to extract medical concepts from social media for other medical applications.
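The reported F-measure can be checked against the reported precision and recall. The sketch below assumes that "F-measure" here means the balanced F1 score (the harmonic mean of precision and recall); the abstract does not state the weighting, and the function name `f_measure` is illustrative rather than taken from the paper.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): the harmonic mean of precision and recall.

    Assumption: the abstract's "F-measure" is the balanced F1 score.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract, in percent.
reported_precision = 94.10
reported_recall = 91.80

computed_f = f_measure(reported_precision, reported_recall)
print(f"computed F-measure: {computed_f:.2f}%")  # prints "computed F-measure: 92.94%"
```

Under that assumption, the computed value agrees with the 92.94% stated in the Results.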

Original language: English (US)
Article number: ocx045
Pages (from-to): 72-80
Number of pages: 9
Journal: Journal of the American Medical Informatics Association
Volume: 25
Issue number: 1
DOI: 10.1093/jamia/ocx045
State: Published - Jan 1 2018

Fingerprint

Social Media
Long-Term Memory
Short-Term Memory
Tobacco Products
Safety
Vocabulary
Sample Size
Language
Equipment and Supplies

Keywords

  • Bi-LSTM
  • Deep neural network
  • E-cigarette adverse event
  • Recurrent neural network
  • Word embedding

ASJC Scopus subject areas

  • Health Informatics

Cite this

Mining e-cigarette adverse events in social media using Bi-LSTM recurrent neural network with word embedding representation. / Xie, Jiaheng; Liu, Xiao; Zeng, Dajun.

In: Journal of the American Medical Informatics Association, Vol. 25, No. 1, ocx045, 01.01.2018, p. 72-80.

@article{044360a8a6b24da5a38ac5ced2c1f402,
title = "Mining e-cigarette adverse events in social media using Bi-LSTM recurrent neural network with word embedding representation",
abstract = "Objective: Recent years have seen increased worldwide popularity of e-cigarette use. However, the risks of e-cigarettes are underexamined. Most e-cigarette adverse event studies have achieved low detection rates due to limited subject sample sizes in the experiments and surveys. Social media provides a large data repository of consumers' e-cigarette feedback and experiences, which are useful for e-cigarette safety surveillance. However, it is difficult to automatically interpret the informal and nontechnical consumer vocabulary about e-cigarettes in social media. This issue hinders the use of social media content for e-cigarette safety surveillance. Recent developments in deep neural network methods have shown promise for named entity extraction from noisy text. Motivated by these observations, we aimed to design a deep neural network approach to extract e-cigarette safety information in social media. Methods: Our deep neural language model utilizes word embedding as the representation of text input and recognizes named entity types with the state-of-the-art Bidirectional Long Short-Term Memory (Bi-LSTM) Recurrent Neural Network. Results: Our Bi-LSTM model achieved the best performance compared to 3 baseline models, with a precision of 94.10{\%}, a recall of 91.80{\%}, and an F-measure of 92.94{\%}. We identified 1591 unique adverse events and 9930 unique e-cigarette components (ie, chemicals, flavors, and devices) from our research testbed. Conclusion: Although the conditional random field baseline model had slightly better precision than our approach, our Bi-LSTM model achieved much higher recall, resulting in the best F-measure. Our method can be generalized to extract medical concepts from social media for other medical applications.",
keywords = "Bi-LSTM, Deep neural network, E-cigarette adverse event, Recurrent neural network, Word embedding",
author = "Jiaheng Xie and Xiao Liu and Dajun Zeng",
year = "2018",
month = "1",
day = "1",
doi = "10.1093/jamia/ocx045",
language = "English (US)",
volume = "25",
pages = "72--80",
journal = "Journal of the American Medical Informatics Association : JAMIA",
issn = "1067-5027",
publisher = "Oxford University Press",
number = "1",

}

TY - JOUR

T1 - Mining e-cigarette adverse events in social media using Bi-LSTM recurrent neural network with word embedding representation

AU - Xie, Jiaheng

AU - Liu, Xiao

AU - Zeng, Dajun

PY - 2018/1/1

Y1 - 2018/1/1

AB - Objective: Recent years have seen increased worldwide popularity of e-cigarette use. However, the risks of e-cigarettes are underexamined. Most e-cigarette adverse event studies have achieved low detection rates due to limited subject sample sizes in the experiments and surveys. Social media provides a large data repository of consumers' e-cigarette feedback and experiences, which are useful for e-cigarette safety surveillance. However, it is difficult to automatically interpret the informal and nontechnical consumer vocabulary about e-cigarettes in social media. This issue hinders the use of social media content for e-cigarette safety surveillance. Recent developments in deep neural network methods have shown promise for named entity extraction from noisy text. Motivated by these observations, we aimed to design a deep neural network approach to extract e-cigarette safety information in social media. Methods: Our deep neural language model utilizes word embedding as the representation of text input and recognizes named entity types with the state-of-the-art Bidirectional Long Short-Term Memory (Bi-LSTM) Recurrent Neural Network. Results: Our Bi-LSTM model achieved the best performance compared to 3 baseline models, with a precision of 94.10%, a recall of 91.80%, and an F-measure of 92.94%. We identified 1591 unique adverse events and 9930 unique e-cigarette components (ie, chemicals, flavors, and devices) from our research testbed. Conclusion: Although the conditional random field baseline model had slightly better precision than our approach, our Bi-LSTM model achieved much higher recall, resulting in the best F-measure. Our method can be generalized to extract medical concepts from social media for other medical applications.

KW - Bi-LSTM

KW - Deep neural network

KW - E-cigarette adverse event

KW - Recurrent neural network

KW - Word embedding

UR - http://www.scopus.com/inward/record.url?scp=85040535426&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85040535426&partnerID=8YFLogxK

U2 - 10.1093/jamia/ocx045

DO - 10.1093/jamia/ocx045

M3 - Article

C2 - 28505280

AN - SCOPUS:85040535426

VL - 25

SP - 72

EP - 80

JO - Journal of the American Medical Informatics Association : JAMIA

JF - Journal of the American Medical Informatics Association : JAMIA

SN - 1067-5027

IS - 1

M1 - ocx045

ER -