A hybrid unsupervised approach for document clustering

Mihai Surdeanu, Jordi Turmo, Alicia Ageno

Research output: Contribution to conference (Paper)

22 Scopus citations

Abstract

We propose a hybrid, unsupervised document clustering approach that combines a hierarchical clustering algorithm with Expectation Maximization. We developed several heuristics that automatically select a subset of the clusters generated by the first algorithm as the initial points of the second. Furthermore, our initialization algorithm generates not only an initial model for the iterative refinement algorithm but also an estimate of the model dimension, thus eliminating another important element of human supervision. We evaluated the proposed system on five real-world document collections. The results show that our approach generates clustering solutions of higher quality than either of its individual components.
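The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not spell out the selection heuristics, so the `select_seeds` rule below (keep clusters with at least `min_size` members) is a hypothetical stand-in, as are the centroid-linkage stopping threshold `max_merge_dist`, the fixed-variance spherical Gaussian mixture used for EM, and the toy 2-D points that stand in for document vectors. The number of selected seeds plays the role of the estimated model dimension.

```python
import math

def centroid(points):
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def agglomerate(points, max_merge_dist):
    """Centroid-linkage agglomerative clustering (assumed linkage).
    Stops once the closest pair of clusters is farther than max_merge_dist."""
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = sq_dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > max_merge_dist ** 2:
            break
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

def select_seeds(clusters, min_size):
    """Hypothetical heuristic: keep centroids of sufficiently large clusters.
    The number of seeds doubles as the estimate of the model dimension."""
    return [centroid(c) for c in clusters if len(c) >= min_size]

def em(points, seeds, sigma2=1.0, iters=20):
    """EM for a mixture of spherical Gaussians with fixed variance sigma2
    (an assumed model), initialized from the selected seeds."""
    means = [list(m) for m in seeds]
    k, n, dim = len(means), len(points), len(points[0])
    weights = [1.0 / k] * k
    resp = []
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each point.
        resp = []
        for p in points:
            logs = [math.log(weights[c]) - sq_dist(p, means[c]) / (2 * sigma2)
                    for c in range(k)]
            top = max(logs)
            exps = [math.exp(v - top) for v in logs]
            z = sum(exps)
            resp.append([e / z for e in exps])
        # M-step: re-estimate mixing weights and means.
        for c in range(k):
            total = sum(r[c] for r in resp)
            weights[c] = max(total / n, 1e-12)  # floor avoids log(0)
            if total > 1e-12:
                means[c] = [sum(r[c] * p[d] for r, p in zip(resp, points)) / total
                            for d in range(dim)]
    labels = [max(range(k), key=lambda c: r[c]) for r in resp]
    return means, weights, labels

# Toy data: two tight groups plus one outlier (stand-ins for document vectors).
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
          (5.0, 5.0), (5.1, 4.9), (4.8, 5.2),
          (10.0, 0.0)]
clusters = agglomerate(points, max_merge_dist=1.0)
seeds = select_seeds(clusters, min_size=2)   # singleton outlier cluster is dropped
means, weights, labels = em(points, seeds)
print("estimated number of clusters:", len(seeds))
print("labels:", labels)
```

In this sketch the hierarchical pass only supplies starting points and the cluster count; the EM pass then reassigns every point, including those that fell into clusters the heuristic discarded.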

Original language: English (US)
Pages: 685-690
Number of pages: 6
DOIs: https://doi.org/10.1145/1081870.1081957
State: Published - Dec 1 2005
Externally published: Yes
Event: KDD-2005: 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - Chicago, IL, United States
Duration: Aug 21 2005 to Aug 24 2005

Other

Other: KDD-2005: 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Country: United States
City: Chicago, IL
Period: 8/21/05 to 8/24/05

Keywords

  • EM initialization
  • Unsupervised clustering

ASJC Scopus subject areas

  • Software
  • Information Systems


  • Cite this

    Surdeanu, M., Turmo, J., & Ageno, A. (2005). A hybrid unsupervised approach for document clustering. 685-690. Paper presented at KDD-2005: 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Chicago, IL, United States. https://doi.org/10.1145/1081870.1081957