TY - GEN
T1 - A non-parametric topic model for short texts incorporating word coherence knowledge
AU - Zhang, Yuhao
AU - Mao, Wenji
AU - Zeng, Daniel
PY - 2016/10/24
Y1 - 2016/10/24
N2 - Mining topics in short texts (e.g. tweets, instant messages) can help people grasp essential information and understand key contents, and is widely used in many applications related to social media and text analysis. The sparsity and noise of short texts often restrict the performance of traditional topic models like LDA. Recently proposed Biterm Topic Model (BTM) which models word co-occurrence patterns directly, is revealed effective for topic detection in short texts. However, BTM has two main drawbacks. It needs to manually specify topic number, which is difficult to accurately determine when facing new corpora. Besides, BTM assumes that two words in same term should belong to the same topic, which is often too strong as it does not differentiate two types of words (i.e. general words and topical words). To tackle these problems, in this paper, we propose a non-parametric topic model npCTM with the above distinction. Our model incorporates the Chinese restaurant process (CRP) into the BTM model to determine topic number automatically. Our model also distinguishes general words from topical words by jointly considering the distribution of these two word types for each word as well as word coherence information as prior knowledge. We carry out experimental studies on real-world twitter dataset. The results demonstrate the effectiveness of our method to discover coherent topics compared with the baseline methods.
AB - Mining topics in short texts (e.g. tweets, instant messages) can help people grasp essential information and understand key contents, and is widely used in many applications related to social media and text analysis. The sparsity and noise of short texts often restrict the performance of traditional topic models like LDA. Recently proposed Biterm Topic Model (BTM) which models word co-occurrence patterns directly, is revealed effective for topic detection in short texts. However, BTM has two main drawbacks. It needs to manually specify topic number, which is difficult to accurately determine when facing new corpora. Besides, BTM assumes that two words in same term should belong to the same topic, which is often too strong as it does not differentiate two types of words (i.e. general words and topical words). To tackle these problems, in this paper, we propose a non-parametric topic model npCTM with the above distinction. Our model incorporates the Chinese restaurant process (CRP) into the BTM model to determine topic number automatically. Our model also distinguishes general words from topical words by jointly considering the distribution of these two word types for each word as well as word coherence information as prior knowledge. We carry out experimental studies on real-world twitter dataset. The results demonstrate the effectiveness of our method to discover coherent topics compared with the baseline methods.
KW - Bayesian nonparametric model
KW - Text mining
KW - Topic model
UR - http://www.scopus.com/inward/record.url?scp=84996598809&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84996598809&partnerID=8YFLogxK
U2 - 10.1145/2983323.2983898
DO - 10.1145/2983323.2983898
M3 - Conference contribution
AN - SCOPUS:84996598809
T3 - International Conference on Information and Knowledge Management, Proceedings
SP - 2017
EP - 2020
BT - CIKM 2016 - Proceedings of the 2016 ACM Conference on Information and Knowledge Management
PB - Association for Computing Machinery
T2 - 25th ACM International Conference on Information and Knowledge Management, CIKM 2016
Y2 - 24 October 2016 through 28 October 2016
ER -