A hybrid unsupervised approach for document clustering

Mihai Surdeanu, Jordi Turmo, Alicia Ageno

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

22 Citations (Scopus)

Abstract

We propose a hybrid, unsupervised document clustering approach that combines a hierarchical clustering algorithm with Expectation Maximization. We developed several heuristics to automatically select a subset of the clusters generated by the first algorithm as the initial points of the second one. Furthermore, our initialization algorithm generates not only an initial model for the iterative refinement algorithm but also an estimate of the model dimension, thus eliminating another important element of human supervision. We have evaluated the proposed system on five real-world document collections. The results show that our approach generates clustering solutions of higher quality than both its individual components.
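The pipeline described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that do not come from the paper: toy 2-D points stand in for documents, centroid-linkage agglomeration stands in for the hierarchical algorithm, a size threshold (`min_size`, a hypothetical parameter) stands in for the authors' cluster-selection heuristics, and EM fits a mixture of spherical Gaussians with a fixed variance. The point is only the overall shape: hierarchical clustering yields candidate clusters, a heuristic keeps some of them, and the survivors supply both the initial EM model and the estimated number of clusters.

```python
import math

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def agglomerate(points, stop_dist):
    """Centroid-linkage agglomeration: merge the two closest clusters
    until the closest pair is farther apart than stop_dist."""
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dist2(mean(clusters[i]), mean(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > stop_dist ** 2:
            break
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

def select_seeds(clusters, min_size):
    """Hypothetical selection heuristic (not the paper's): keep only
    clusters with at least min_size members and return their centroids."""
    return [mean(c) for c in clusters if len(c) >= min_size]

def em(points, means, iters=20, var=1.0):
    """EM for a spherical Gaussian mixture with fixed, shared variance.
    The number of components is len(means), i.e. it comes from the seeds."""
    means = list(means)
    k = len(means)
    weights = [1.0 / k] * k
    resp = []
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for p in points:
            scores = [w * math.exp(-dist2(p, m) / (2 * var))
                      for w, m in zip(weights, means)]
            total = sum(scores) or 1e-300  # guard against underflow to zero
            resp.append([s / total for s in scores])
        # M-step: re-estimate mixing weights and means.
        for c in range(k):
            mass = sum(r[c] for r in resp)
            if mass == 0:
                continue
            weights[c] = mass / len(points)
            means[c] = tuple(
                sum(r[c] * p[d] for r, p in zip(resp, points)) / mass
                for d in range(len(points[0]))
            )
    labels = [max(range(k), key=lambda c: r[c]) for r in resp]
    return means, labels

# Toy data: two tight groups plus one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 30)]

clusters = agglomerate(points, stop_dist=3.0)
seeds = select_seeds(clusters, min_size=2)   # singleton cluster is dropped
final_means, labels = em(points, seeds)

print("estimated number of clusters:", len(seeds))
print("labels:", labels)
```

In this sketch the number of mixture components is never passed in by hand: it is simply the number of seeds that survive the selection step, which mirrors the abstract's claim that the initialization also estimates the model dimension.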

Original language: English (US)
Title of host publication: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Editors: R.L. Grossman, R. Bayardo, K. Bennett, J. Vaidya
Pages: 685-690
Number of pages: 6
DOI: https://doi.org/10.1145/1081870.1081957
State: Published - 2005
Externally published: Yes
Event: KDD-2005: 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - Chicago, IL, United States
Duration: Aug 21 2005 to Aug 24 2005

Other

Other: KDD-2005: 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Country: United States
City: Chicago, IL
Period: 8/21/05 to 8/24/05

Fingerprint

  • Set theory
  • Clustering algorithms

Keywords

  • EM initialization
  • Unsupervised clustering

ASJC Scopus subject areas

  • Information Systems

Cite this

Surdeanu, M., Turmo, J., & Ageno, A. (2005). A hybrid unsupervised approach for document clustering. In R. L. Grossman, R. Bayardo, K. Bennett, & J. Vaidya (Eds.), Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 685-690). https://doi.org/10.1145/1081870.1081957

@inproceedings{cb1492b1d87e4e3d983d0cecf7d6617b,
title = "A hybrid unsupervised approach for document clustering",
abstract = "We propose a hybrid, unsupervised document clustering approach that combines a hierarchical clustering algorithm with Expectation Maximization. We developed several heuristics to automatically select a subset of the clusters generated by the first algorithm as the initial points of the second one. Furthermore, our initialization algorithm generates not only an initial model for the iterative refinement algorithm but also an estimate of the model dimension, thus eliminating another important element of human supervision. We have evaluated the proposed system on five real-world document collections. The results show that our approach generates clustering solutions of higher quality than both its individual components.",
keywords = "EM initialization, Unsupervised clustering",
author = "Mihai Surdeanu and Jordi Turmo and Alicia Ageno",
year = "2005",
doi = "10.1145/1081870.1081957",
language = "English (US)",
pages = "685--690",
editor = "R.L. Grossman and R. Bayardo and K. Bennett and J. Vaidya",
booktitle = "Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining",

}

TY  - GEN
T1  - A hybrid unsupervised approach for document clustering
AU  - Surdeanu, Mihai
AU  - Turmo, Jordi
AU  - Ageno, Alicia
PY  - 2005
Y1  - 2005
N2  - We propose a hybrid, unsupervised document clustering approach that combines a hierarchical clustering algorithm with Expectation Maximization. We developed several heuristics to automatically select a subset of the clusters generated by the first algorithm as the initial points of the second one. Furthermore, our initialization algorithm generates not only an initial model for the iterative refinement algorithm but also an estimate of the model dimension, thus eliminating another important element of human supervision. We have evaluated the proposed system on five real-world document collections. The results show that our approach generates clustering solutions of higher quality than both its individual components.
AB  - We propose a hybrid, unsupervised document clustering approach that combines a hierarchical clustering algorithm with Expectation Maximization. We developed several heuristics to automatically select a subset of the clusters generated by the first algorithm as the initial points of the second one. Furthermore, our initialization algorithm generates not only an initial model for the iterative refinement algorithm but also an estimate of the model dimension, thus eliminating another important element of human supervision. We have evaluated the proposed system on five real-world document collections. The results show that our approach generates clustering solutions of higher quality than both its individual components.
KW  - EM initialization
KW  - Unsupervised clustering
UR  - http://www.scopus.com/inward/record.url?scp=32344444908&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=32344444908&partnerID=8YFLogxK
U2  - 10.1145/1081870.1081957
DO  - 10.1145/1081870.1081957
M3  - Conference contribution
AN  - SCOPUS:32344444908
SP  - 685
EP  - 690
BT  - Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
A2  - Grossman, R.L.
A2  - Bayardo, R.
A2  - Bennett, K.
A2  - Vaidya, J.
ER  -