Fully unsupervised word segmentation with BVE and MDL

Daniel Hewlett, Paul R Cohen

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

10 Citations (Scopus)

Abstract

Several results in the word segmentation literature suggest that description length provides a useful estimate of segmentation quality in fully unsupervised settings. However, since the space of potential segmentations grows exponentially with the length of the corpus, no tractable algorithm follows directly from the Minimum Description Length (MDL) principle. Therefore, it is necessary to generate a set of candidate segmentations and select between them according to the MDL principle. We evaluate several algorithms for generating these candidate segmentations on a range of natural language corpora, and show that the Bootstrapped Voting Experts algorithm consistently outperforms other methods when paired with MDL.
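
To make the selection step concrete: a string of n characters has 2^(n-1) possible segmentations, so exhaustive search is out of reach, but scoring a small set of candidates is cheap. The sketch below is a minimal illustration, not the authors' implementation. The abstract does not give a description-length formula, so the two-part code used here (unigram cost of the corpus plus the cost of spelling out the lexicon) is an assumption, as are the names `description_length`, `select_by_mdl`, and `alphabet_size`. The candidates are written by hand; in the paper they would come from a segmentation algorithm such as Bootstrapped Voting Experts.

```python
import math
from collections import Counter


def description_length(segmentation, alphabet_size=27):
    """Two-part description length in bits (illustrative assumption).

    segmentation: list of word tokens covering the corpus.
    alphabet_size: hypothetical symbol inventory (26 letters + end-of-word).
    """
    counts = Counter(segmentation)
    total = len(segmentation)
    # Corpus cost: each token is coded with its empirical unigram code length.
    corpus_bits = -sum(c * math.log2(c / total) for c in counts.values())
    # Lexicon cost: spell each distinct word once, plus an end-of-word symbol.
    bits_per_char = math.log2(alphabet_size)
    lexicon_bits = sum((len(word) + 1) * bits_per_char for word in counts)
    return corpus_bits + lexicon_bits


def select_by_mdl(candidates):
    """Return the candidate segmentation with the smallest description length."""
    return min(candidates, key=description_length)


text = "thedogthecatthedog"
candidates = [
    list(text),                                   # every character is a word
    [text],                                       # the whole corpus is one word
    ["the", "dog", "the", "cat", "the", "dog"],   # word-like units
]
best = select_by_mdl(candidates)
print(best)
```

On this toy corpus the word-like candidate wins because reusing "the" and "dog" keeps the lexicon small without making the token sequence expensive to encode. The character-level candidate pays for a long token sequence, and the single-word candidate pays for spelling out the entire corpus in the lexicon.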

Original language: English (US)
Title of host publication: ACL-HLT 2011 - Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies
Pages: 540-545
Number of pages: 6
Volume: 2
State: Published - 2011
Event: 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, ACL-HLT 2011 - Portland, OR, United States
Duration: Jun 19 2011 to Jun 24 2011

Other

Other: 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, ACL-HLT 2011
Country: United States
City: Portland, OR
Period: 6/19/11 to 6/24/11

Fingerprint

candidacy
voting
segmentation
Word Segmentation
Length
expert
language
literature
Language Corpora
Natural Language

ASJC Scopus subject areas

  • Language and Linguistics
  • Linguistics and Language

Cite this

Hewlett, D., & Cohen, P. R. (2011). Fully unsupervised word segmentation with BVE and MDL. In ACL-HLT 2011 - Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (Vol. 2, pp. 540-545)

@inproceedings{2c68c9f3ae344d649b6fddb8750acecc,
title = "Fully unsupervised word segmentation with BVE and MDL",
abstract = "Several results in the word segmentation literature suggest that description length provides a useful estimate of segmentation quality in fully unsupervised settings. However, since the space of potential segmentations grows exponentially with the length of the corpus, no tractable algorithm follows directly from the Minimum Description Length (MDL) principle. Therefore, it is necessary to generate a set of candidate segmentations and select between them according to the MDL principle. We evaluate several algorithms for generating these candidate segmentations on a range of natural language corpora, and show that the Bootstrapped Voting Experts algorithm consistently outperforms other methods when paired with MDL.",
author = "Hewlett, Daniel and Cohen, Paul R",
year = "2011",
language = "English (US)",
isbn = "9781932432886",
volume = "2",
pages = "540--545",
booktitle = "ACL-HLT 2011 - Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies",

}

TY - GEN

T1 - Fully unsupervised word segmentation with BVE and MDL

AU - Hewlett, Daniel

AU - Cohen, Paul R

PY - 2011

Y1 - 2011

N2 - Several results in the word segmentation literature suggest that description length provides a useful estimate of segmentation quality in fully unsupervised settings. However, since the space of potential segmentations grows exponentially with the length of the corpus, no tractable algorithm follows directly from the Minimum Description Length (MDL) principle. Therefore, it is necessary to generate a set of candidate segmentations and select between them according to the MDL principle. We evaluate several algorithms for generating these candidate segmentations on a range of natural language corpora, and show that the Bootstrapped Voting Experts algorithm consistently outperforms other methods when paired with MDL.

UR - http://www.scopus.com/inward/record.url?scp=84859036614&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84859036614&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:84859036614

SN - 9781932432886

VL - 2

SP - 540

EP - 545

BT - ACL-HLT 2011 - Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

ER -