Fully unsupervised word segmentation with BVE and MDL

Daniel Hewlett, Paul Cohen

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

10 Scopus citations

Abstract

Several results in the word segmentation literature suggest that description length provides a useful estimate of segmentation quality in fully unsupervised settings. However, since the space of potential segmentations grows exponentially with the length of the corpus, no tractable algorithm follows directly from the Minimum Description Length (MDL) principle. Therefore, it is necessary to generate a set of candidate segmentations and select between them according to the MDL principle. We evaluate several algorithms for generating these candidate segmentations on a range of natural language corpora, and show that the Bootstrapped Voting Experts algorithm consistently outperforms other methods when paired with MDL.
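The selection step described in the abstract (generate candidate segmentations, then keep the one with the shortest description length) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the coding scheme, so the two-part code below (a letter-by-letter lexicon cost plus a unigram cost for the word tokens) is an assumption, and the names `description_length` and `select_by_mdl` are hypothetical. The candidate generator (for example BVE) is also out of scope here; candidates are simply passed in as lists of words.

```python
import math
from collections import Counter

# Sentinel for the end-of-word symbol, so it cannot collide with a corpus character.
_END = object()


def description_length(segmentation):
    """Return an assumed two-part code length, in bits, for a list of word tokens.

    Assumed scheme (not taken from the paper):
      * lexicon cost: each distinct word is spelled out symbol by symbol,
        followed by an end-of-word marker, coded with the empirical symbol
        distribution of the lexicon;
      * corpus cost: each word token is coded with its empirical unigram
        probability.
    """
    word_counts = Counter(segmentation)
    total_tokens = sum(word_counts.values())

    # Cost of the token sequence given the lexicon.
    corpus_bits = -sum(
        count * math.log2(count / total_tokens)
        for count in word_counts.values()
    )

    # Cost of spelling out every distinct word once.
    symbol_counts = Counter()
    for word in word_counts:
        symbol_counts.update(word)
        symbol_counts[_END] += 1
    total_symbols = sum(symbol_counts.values())
    lexicon_bits = -sum(
        count * math.log2(count / total_symbols)
        for count in symbol_counts.values()
    )

    return lexicon_bits + corpus_bits


def select_by_mdl(candidates):
    """Pick the candidate segmentation with the smallest description length."""
    return min(candidates, key=description_length)


if __name__ == "__main__":
    text = "thedogthedog"
    candidates = [
        ["the", "dog", "the", "dog"],  # word-level segmentation
        [text],                        # no boundaries at all
        list(text),                    # a boundary after every character
    ]
    for candidate in candidates:
        print(round(description_length(candidate), 2), candidate)
    print("selected:", select_by_mdl(candidates))
```

Under this assumed scheme, both extremes are penalized: leaving the text unsegmented makes the lexicon expensive, while splitting at every character makes the token sequence expensive, so the word-level candidate has the shortest total code.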

Original language: English (US)
Title of host publication: ACL-HLT 2011 - Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics
Subtitle of host publication: Human Language Technologies
Pages: 540-545
Number of pages: 6
State: Published - Dec 1 2011
Event: 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, ACL-HLT 2011 - Portland, OR, United States
Duration: Jun 19 2011 to Jun 24 2011

Publication series

Name: ACL-HLT 2011 - Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies
Volume: 2

Other

Other: 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, ACL-HLT 2011
Country: United States
City: Portland, OR
Period: 6/19/11 to 6/24/11

ASJC Scopus subject areas

  • Language and Linguistics
  • Linguistics and Language

Cite this

Hewlett, D., & Cohen, P. (2011). Fully unsupervised word segmentation with BVE and MDL. In ACL-HLT 2011 - Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (pp. 540-545). (ACL-HLT 2011 - Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies; Vol. 2).