Supervised Topic Modeling using Hierarchical Dirichlet Process-based Inverse Regression: Experiments on E-Commerce Applications

Research output: Contribution to journal, Article

1 Citation (Scopus)

Abstract

The proliferation of e-commerce calls for mining consumer preferences and opinions from user-generated texts. To this end, topic models have been widely adopted to discover the underlying semantic themes (i.e., topics). Supervised topic models have emerged to leverage discovered topics for predicting the response of interest (e.g., product quality and sales). However, supervised topic modeling remains a challenging problem because of the need to prespecify the number of topics, the lack of predictive information in topics, and limited scalability. In this paper, we propose a novel supervised topic model, Hierarchical Dirichlet Process-based Inverse Regression (HDP-IR). HDP-IR characterizes the corpus with a flexible number of topics, which prove to retain as much predictive information as the original corpus. Moreover, we develop an efficient inference algorithm capable of examining large-scale corpora (millions of documents or more). Three experiments were conducted to evaluate the predictive performance over major e-commerce benchmark testbeds of online reviews. HDP-IR significantly outperformed existing supervised topic models. Particularly, retaining sufficient predictive information improved predictive R-squared by over 17.6 percent; having topic structure flexibility contributed to predictive R-squared by at least 4.1 percent. HDP-IR provides an important step for future study on user-generated texts from a topic perspective.

Original language: English (US)
Journal: IEEE Transactions on Knowledge and Data Engineering
DOI: 10.1109/TKDE.2017.2786727
State: Accepted/In press - Dec 22 2017

Fingerprint

Electronic commerce
Experiments
Testbeds
Scalability
Sales
Semantics

Keywords

  • Approximation algorithms
  • Bayesian nonparametrics
  • hierarchical Dirichlet process
  • Inference algorithms
  • Measurement
  • Prediction algorithms
  • Predictive models
  • Semantics
  • sufficient dimension reduction
  • topic modeling
  • variational inference

ASJC Scopus subject areas

  • Information Systems
  • Computer Science Applications
  • Computational Theory and Mathematics

Cite this

@article{4dba7e05126d4accad8157f3e186df76,
title = "Supervised Topic Modeling using Hierarchical Dirichlet Process-based Inverse Regression: Experiments on E-Commerce Applications",
abstract = "The proliferation of e-commerce calls for mining consumer preferences and opinions from user-generated texts. To this end, topic models have been widely adopted to discover the underlying semantic themes (i.e., topics). Supervised topic models have emerged to leverage discovered topics for predicting the response of interest (e.g., product quality and sales). However, supervised topic modeling remains a challenging problem because of the need to prespecify the number of topics, the lack of predictive information in topics, and limited scalability. In this paper, we propose a novel supervised topic model, \textit{Hierarchical Dirichlet Process-based Inverse Regression} (HDP-IR). HDP-IR characterizes the corpus with a flexible number of topics, which prove to retain as much predictive information as the original corpus. Moreover, we develop an efficient inference algorithm capable of examining large-scale corpora (millions of documents or more). Three experiments were conducted to evaluate the predictive performance over major e-commerce benchmark testbeds of online reviews. HDP-IR significantly outperformed existing supervised topic models. Particularly, retaining sufficient predictive information improved predictive R-squared by over 17.6 percent; having topic structure flexibility contributed to predictive R-squared by at least 4.1 percent. HDP-IR provides an important step for future study on user-generated texts from a topic perspective.",
keywords = "Approximation algorithms, Bayesian nonparametrics, hierarchical Dirichlet process, Inference algorithms, Measurement, Prediction algorithms, Predictive models, Semantics, sufficient dimension reduction, topic modeling, variational inference",
author = "Weifeng Li and Junming Yin and Hsinchun Chen",
year = "2017",
month = "12",
day = "22",
doi = "10.1109/TKDE.2017.2786727",
language = "English (US)",
journal = "IEEE Transactions on Knowledge and Data Engineering",
issn = "1041-4347",
publisher = "IEEE Computer Society",

}

TY - JOUR

T1 - Supervised Topic Modeling using Hierarchical Dirichlet Process-based Inverse Regression

T2 - Experiments on E-Commerce Applications

AU - Li, Weifeng

AU - Yin, Junming

AU - Chen, Hsinchun

PY - 2017/12/22

Y1 - 2017/12/22

N2 - The proliferation of e-commerce calls for mining consumer preferences and opinions from user-generated texts. To this end, topic models have been widely adopted to discover the underlying semantic themes (i.e., topics). Supervised topic models have emerged to leverage discovered topics for predicting the response of interest (e.g., product quality and sales). However, supervised topic modeling remains a challenging problem because of the need to prespecify the number of topics, the lack of predictive information in topics, and limited scalability. In this paper, we propose a novel supervised topic model, \textit{Hierarchical Dirichlet Process-based Inverse Regression} (HDP-IR). HDP-IR characterizes the corpus with a flexible number of topics, which prove to retain as much predictive information as the original corpus. Moreover, we develop an efficient inference algorithm capable of examining large-scale corpora (millions of documents or more). Three experiments were conducted to evaluate the predictive performance over major e-commerce benchmark testbeds of online reviews. HDP-IR significantly outperformed existing supervised topic models. Particularly, retaining sufficient predictive information improved predictive R-squared by over 17.6 percent; having topic structure flexibility contributed to predictive R-squared by at least 4.1 percent. HDP-IR provides an important step for future study on user-generated texts from a topic perspective.

AB - The proliferation of e-commerce calls for mining consumer preferences and opinions from user-generated texts. To this end, topic models have been widely adopted to discover the underlying semantic themes (i.e., topics). Supervised topic models have emerged to leverage discovered topics for predicting the response of interest (e.g., product quality and sales). However, supervised topic modeling remains a challenging problem because of the need to prespecify the number of topics, the lack of predictive information in topics, and limited scalability. In this paper, we propose a novel supervised topic model, \textit{Hierarchical Dirichlet Process-based Inverse Regression} (HDP-IR). HDP-IR characterizes the corpus with a flexible number of topics, which prove to retain as much predictive information as the original corpus. Moreover, we develop an efficient inference algorithm capable of examining large-scale corpora (millions of documents or more). Three experiments were conducted to evaluate the predictive performance over major e-commerce benchmark testbeds of online reviews. HDP-IR significantly outperformed existing supervised topic models. Particularly, retaining sufficient predictive information improved predictive R-squared by over 17.6 percent; having topic structure flexibility contributed to predictive R-squared by at least 4.1 percent. HDP-IR provides an important step for future study on user-generated texts from a topic perspective.

KW - Approximation algorithms

KW - Bayesian nonparametrics

KW - hierarchical Dirichlet process

KW - Inference algorithms

KW - Measurement

KW - Prediction algorithms

KW - Predictive models

KW - Semantics

KW - sufficient dimension reduction

KW - topic modeling

KW - variational inference

UR - http://www.scopus.com/inward/record.url?scp=85039797753&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85039797753&partnerID=8YFLogxK

U2 - 10.1109/TKDE.2017.2786727

DO - 10.1109/TKDE.2017.2786727

M3 - Article

AN - SCOPUS:85039797753

JO - IEEE Transactions on Knowledge and Data Engineering

JF - IEEE Transactions on Knowledge and Data Engineering

SN - 1041-4347

ER -