A non-parametric topic model for short texts incorporating word coherence knowledge

Yuhao Zhang, Wenji Mao, Dajun Zeng

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

2 Scopus citations

Abstract

Mining topics in short texts (e.g., tweets, instant messages) helps people grasp essential information and understand key content, and is widely used in applications related to social media and text analysis. The sparsity and noise of short texts often restrict the performance of traditional topic models such as LDA. The recently proposed Biterm Topic Model (BTM), which models word co-occurrence patterns directly, has been shown to be effective for topic detection in short texts. However, BTM has two main drawbacks. First, the number of topics must be specified manually, which is difficult to determine accurately for a new corpus. Second, BTM assumes that the two words in the same biterm belong to the same topic, an assumption that is often too strong because it does not differentiate between two types of words (i.e., general words and topical words). To tackle these problems, we propose npCTM, a non-parametric topic model that makes this distinction. Our model incorporates the Chinese restaurant process (CRP) into BTM to determine the number of topics automatically. It also distinguishes general words from topical words by jointly considering, for each word, the distribution over these two word types and word coherence information as prior knowledge. We carry out experimental studies on a real-world Twitter dataset. The results demonstrate the effectiveness of our method in discovering coherent topics compared with baseline methods.
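The two building blocks named in the abstract, biterms and the Chinese restaurant process, can be illustrated with a minimal Python sketch. This is not the authors' npCTM implementation: the function names `extract_biterms`, `crp_probabilities`, and `crp_assign` and the concentration parameter `alpha` are assumptions made for illustration, and the sketch draws topic assignments from the CRP prior alone, ignoring the word likelihood and the general/topical word distinction that a full sampler would need.

```python
import random
from itertools import combinations


def extract_biterms(tokens):
    """Return every unordered word pair (biterm) from one short text."""
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def crp_probabilities(topic_counts, alpha):
    """CRP prior over topics for the next biterm.

    An existing topic k is chosen in proportion to the number of biterms
    already assigned to it; a new topic is opened in proportion to alpha.
    The last entry of the returned list is the new-topic probability.
    """
    total = sum(topic_counts) + alpha
    return [n / total for n in topic_counts] + [alpha / total]


def crp_assign(num_biterms, alpha, seed=0):
    """Sequentially assign biterms to topics using only the CRP prior."""
    rng = random.Random(seed)
    topic_counts = []
    assignments = []
    for _ in range(num_biterms):
        probs = crp_probabilities(topic_counts, alpha)
        k = rng.choices(range(len(probs)), weights=probs)[0]
        if k == len(topic_counts):
            # The "new table" was chosen, so the topic count grows by one.
            topic_counts.append(0)
        topic_counts[k] += 1
        assignments.append(k)
    return assignments, topic_counts


if __name__ == "__main__":
    biterms = extract_biterms(["topic", "model", "short", "text"])
    assignments, counts = crp_assign(len(biterms), alpha=1.0)
    print(biterms)
    print(assignments, counts)
```

Because the number of topics grows only when the new-topic option is drawn, it does not have to be fixed in advance, which is the property the abstract attributes to the CRP. Larger values of the assumed `alpha` make new topics more likely.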

Original language: English (US)
Title of host publication: CIKM 2016 - Proceedings of the 2016 ACM Conference on Information and Knowledge Management
Publisher: Association for Computing Machinery
Pages: 2017-2020
Number of pages: 4
Volume: 24-28-October-2016
ISBN (Electronic): 9781450340731
DOIs: https://doi.org/10.1145/2983323.2983898
State: Published - Oct 24 2016
Externally published: Yes
Event: 25th ACM International Conference on Information and Knowledge Management, CIKM 2016 - Indianapolis, United States
Duration: Oct 24 2016 to Oct 28 2016

Other

Other: 25th ACM International Conference on Information and Knowledge Management, CIKM 2016
Country: United States
City: Indianapolis
Period: 10/24/16 to 10/28/16

Keywords

  • Bayesian nonparametric model
  • Text mining
  • Topic model

ASJC Scopus subject areas

  • Business, Management and Accounting (all)
  • Decision Sciences (all)

  • Cite this

    Zhang, Y., Mao, W., & Zeng, D. (2016). A non-parametric topic model for short texts incorporating word coherence knowledge. In CIKM 2016 - Proceedings of the 2016 ACM Conference on Information and Knowledge Management (Vol. 24-28-October-2016, pp. 2017-2020). Association for Computing Machinery. https://doi.org/10.1145/2983323.2983898