Chinese word segmentation for terrorism-related contents

Daniel Zeng, Donghua Wei, Michael Chau, Feiyue Wang

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

4 Scopus citations

Abstract

To analyze security- and terrorism-related content in Chinese, it is important to perform word segmentation on Chinese documents. Many previous studies have addressed Chinese word segmentation, and the two major approaches are statistics-based and dictionary-based. Pure statistical methods have lower precision, while pure dictionary-based methods cannot deal with new words and are restricted by the coverage of the dictionary. In this paper, we propose a hybrid method that avoids the limitations of both approaches. By combining a suffix tree and mutual information (MI) with the dictionary, our segmenter, called IASeg, achieves high accuracy in word segmentation when domain training is available. It can identify new words through MI-based token merging and dictionary update. In addition, with the Improved Bigram method, it can also process N-grams. To evaluate the performance of our segmenter, we compare it with the Hylanda segmenter and the ICTCLAS segmenter on a terrorism-related corpus. The experimental results show that IASeg outperforms both benchmarks in precision and recall.
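The abstract does not spell out how MI-based token merging is computed, so the following is only a minimal sketch under stated assumptions, not IASeg's actual implementation. It assumes pointwise mutual information, MI(x, y) = log2(P(x, y) / (P(x) P(y))), estimated from raw unigram and bigram counts, and a greedy left-to-right merge. The function names, the `threshold` and `min_count` parameters, and the toy corpus are all hypothetical.

```python
import math
from collections import Counter

def pmi_table(sentences):
    """Estimate pointwise MI for every adjacent token pair from raw counts."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    table = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        table[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return table, bigrams

def merge_tokens(tokens, table, bigrams, threshold=1.0, min_count=2):
    """Greedily merge adjacent tokens whose MI and frequency are high enough.

    threshold and min_count are illustrative values, not taken from the paper.
    """
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            pair = (tokens[i], tokens[i + 1])
            frequent = bigrams.get(pair, 0) >= min_count
            if frequent and table.get(pair, float("-inf")) >= threshold:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
                continue
        merged.append(tokens[i])
        i += 1
    return merged

# Toy corpus of character-level tokens (hypothetical data).
corpus = [
    ["北", "京", "大", "学"],
    ["北", "京", "很", "大"],
    ["我", "在", "北", "京"],
    ["大", "学", "很", "好"],
]
table, bigrams = pmi_table(corpus)

# Hypothetical dictionary update: add newly merged multi-character tokens.
dictionary = set()
result = merge_tokens(["北", "京", "大", "学"], table, bigrams)
dictionary.update(token for token in result if len(token) > 1)
print(result)  # ['北京', '大学']
```

The `min_count` filter is a design choice in this sketch: rare pairs can score a high MI by chance, so requiring a minimum frequency keeps one-off co-occurrences from being merged into spurious words.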

Original language: English (US)
Title of host publication: Intelligence and Security Informatics - IEEE ISI 2008 International Workshops
Subtitle of host publication: PAISI, PACCF, and SOCO 2008, Proceedings
Pages: 1-13
Number of pages: 13
DOI: https://doi.org/10.1007/978-3-540-69304-8_1
State: Published - 2008
Event: IEEE International Conference on Intelligence and Security Informatics, ISI 2008 Workshops: PAISI, PACCF, and SOCO 2008 - Taipei, Taiwan, Province of China
Duration: Jun 17 2008 to Jun 17 2008

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 5075 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: IEEE International Conference on Intelligence and Security Informatics, ISI 2008 Workshops: PAISI, PACCF, and SOCO 2008
Country: Taiwan, Province of China
City: Taipei
Period: 6/17/08 to 6/17/08

Keywords

  • Heuristic rules
  • Lidstone flatness
  • Mutual information
  • N-gram
  • Suffix tree
  • Ukkonen algorithm

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)

Cite this

Zeng, D., Wei, D., Chau, M., & Wang, F. (2008). Chinese word segmentation for terrorism-related contents. In Intelligence and Security Informatics - IEEE ISI 2008 International Workshops: PAISI, PACCF, and SOCO 2008, Proceedings (pp. 1-13). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 5075 LNCS). https://doi.org/10.1007/978-3-540-69304-8_1