A non-parametric topic model for short texts incorporating word coherence knowledge

Yuhao Zhang, Wenji Mao, Dajun Zeng

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Citation (Scopus)

Abstract

Mining topics in short texts (e.g., tweets and instant messages) helps people grasp essential information and understand key content, and is widely used in applications related to social media and text analysis. The sparsity and noise of short texts often limit the performance of traditional topic models such as LDA. The recently proposed Biterm Topic Model (BTM), which directly models word co-occurrence patterns, has proven effective for topic detection in short texts. However, BTM has two main drawbacks. First, the number of topics must be specified manually, which is difficult to determine accurately for a new corpus. Second, BTM assumes that the two words in a biterm belong to the same topic, an assumption that is often too strong because it does not differentiate between general words and topical words. To tackle these problems, in this paper we propose npCTM, a non-parametric topic model that makes this distinction. Our model incorporates the Chinese restaurant process (CRP) into BTM to determine the number of topics automatically. It also distinguishes general words from topical words by jointly considering the distribution over these two word types for each word, using word coherence information as prior knowledge. We carry out experiments on a real-world Twitter dataset, and the results demonstrate that our method discovers more coherent topics than the baseline methods.
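As an illustration only (not taken from the paper), the following minimal Python sketch shows the kind of Chinese-restaurant-process step the abstract refers to: when a biterm is assigned to a topic, each existing topic is weighted by its current biterm count times the likelihood of the biterm's two words, and a brand-new topic can be opened with weight proportional to a concentration parameter. All names and hyperparameter values (alpha, beta, V, topic_biterm_counts, word_likelihood) are assumptions made up for this sketch; the authors' actual npCTM sampler additionally distinguishes general from topical words using word coherence priors.

# Illustrative CRP-style topic assignment for a biterm (sketch, not the authors' code).
import random
from collections import defaultdict

alpha = 1.0  # CRP concentration: larger values open new topics more readily (assumed value)
beta = 0.01  # symmetric Dirichlet smoothing over the vocabulary (assumed value)
V = 10_000   # vocabulary size (assumed value)

topic_biterm_counts = defaultdict(int)  # topic id -> number of biterms assigned so far
topic_word_counts = defaultdict(lambda: defaultdict(int))  # topic id -> word -> count

def word_likelihood(topic, word):
    # Smoothed probability of `word` under `topic` (reduces to 1/V for an empty topic).
    n_w = topic_word_counts[topic][word]
    n_total = sum(topic_word_counts[topic].values())
    return (n_w + beta) / (n_total + V * beta)

def sample_topic(w1, w2):
    # CRP-style draw of a topic for the biterm (w1, w2), updating the counts.
    topics = list(topic_biterm_counts)
    new_topic = max(topics, default=-1) + 1
    weights = [topic_biterm_counts[k] * word_likelihood(k, w1) * word_likelihood(k, w2)
               for k in topics]
    weights.append(alpha * (1.0 / V) * (1.0 / V))  # weight for opening a new, empty topic
    choice = random.choices(topics + [new_topic], weights=weights)[0]
    topic_biterm_counts[choice] += 1
    topic_word_counts[choice][w1] += 1
    topic_word_counts[choice][w2] += 1
    return choice

Because the new-topic weight depends on alpha rather than on a fixed topic count, the number of topics grows with the data, which is the non-parametric behavior the abstract describes.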

Original language: English (US)
Title of host publication: CIKM 2016 - Proceedings of the 2016 ACM Conference on Information and Knowledge Management
Publisher: Association for Computing Machinery
Pages: 2017-2020
Number of pages: 4
Volume: 24-28-October-2016
ISBN (Electronic): 9781450340731
DOI: 10.1145/2983323.2983898
State: Published - Oct 24, 2016
Externally published: Yes
Event: 25th ACM International Conference on Information and Knowledge Management, CIKM 2016 - Indianapolis, United States
Duration: Oct 24, 2016 - Oct 28, 2016

Other

Other: 25th ACM International Conference on Information and Knowledge Management, CIKM 2016
Country: United States
City: Indianapolis
Period: 10/24/16 - 10/28/16

Keywords

  • Bayesian nonparametric model
  • Text mining
  • Topic model

ASJC Scopus subject areas

  • Business, Management and Accounting (all)
  • Decision Sciences (all)

Cite this

Zhang, Y., Mao, W., & Zeng, D. (2016). A non-parametric topic model for short texts incorporating word coherence knowledge. In CIKM 2016 - Proceedings of the 2016 ACM Conference on Information and Knowledge Management (Vol. 24-28-October-2016, pp. 2017-2020). Association for Computing Machinery. https://doi.org/10.1145/2983323.2983898
