Chinese word segmentation for terrorism-related contents

Dajun Zeng, Donghua Wei, Michael Chau, Feiyue Wang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


Abstract

To analyze security- and terrorism-related content in Chinese, documents must first be segmented into words. Chinese word segmentation has been studied extensively, and the two major approaches are statistical and dictionary-based. Purely statistical methods have lower precision, while purely dictionary-based methods cannot handle new words and are limited by the coverage of the dictionary. In this paper, we propose a hybrid method that avoids the limitations of both approaches. By combining a suffix tree and mutual information (MI) with a dictionary, our segmenter, called IASeg, achieves high segmentation accuracy when domain training is available. It can identify new words through MI-based token merging and dictionary updates. In addition, with the Improved Bigram method, it can also process N-grams. To evaluate our segmenter, we compare it with the Hylanda segmenter and the ICTCLAS segmenter on a terrorism-related corpus. The experimental results show that IASeg outperforms both benchmarks in precision and recall.
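The idea of MI-based token merging with a dictionary update can be illustrated with a minimal sketch. This is not IASeg itself: it assumes greedy forward maximum matching for the dictionary pass, plain counters instead of a suffix tree, and pointwise mutual information over adjacent tokens as the merge score. The function names, the toy corpus, and the `threshold` and `min_count` values are all hypothetical.

```python
import math
from collections import Counter


def max_match(text, dictionary, max_len=4):
    """Greedy forward maximum matching; unknown characters become 1-char tokens."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def pmi_table(token_lists):
    """Map each adjacent token pair to (pointwise MI in bits, pair count)."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in token_lists:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
    table = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a, p_b = unigrams[a] / n_uni, unigrams[b] / n_uni
        table[(a, b)] = (math.log2(p_ab / (p_a * p_b)), count)
    return table


def learn_new_words(corpus, dictionary, threshold=3.0, min_count=2):
    """Merge frequent, strongly associated adjacent tokens into new dictionary words."""
    token_lists = [max_match(sentence, dictionary) for sentence in corpus]
    new_words = {
        a + b
        for (a, b), (pmi, count) in pmi_table(token_lists).items()
        if pmi >= threshold and count >= min_count
    }
    return dictionary | new_words


# Toy data (hypothetical): "基地" is missing from the seed dictionary.
seed = {"恐怖", "袭击", "组织"}
corpus = ["基地组织策划袭击", "基地组织发动恐怖袭击", "恐怖组织袭击平民"]
updated = learn_new_words(corpus, seed)
print(max_match("基地组织策划袭击", updated))
```

On this toy corpus the pair 基 + 地 clears both the count and MI thresholds, so 基地 is added to the dictionary and the next dictionary pass keeps it as one token. In practice the thresholds would need tuning on domain training data.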

Original language: English (US)
Title of host publication: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Pages: 1-13
Number of pages: 13
Volume: 5075 LNCS
DOIs: https://doi.org/10.1007/978-3-540-69304-8_1
Publication status: Published - 2008
Event: IEEE International Conference on Intelligence and Security Informatics, ISI 2008 Workshops: PAISI, PACCF, and SOCO 2008 - Taipei, Taiwan, Province of China
Duration: Jun 17 2008 → Jun 17 2008

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 5075 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: IEEE International Conference on Intelligence and Security Informatics, ISI 2008 Workshops: PAISI, PACCF, and SOCO 2008
Country: Taiwan, Province of China
City: Taipei
Period: 6/17/08 → 6/17/08


Keywords

  • Heuristic rules
  • Lidstone flatness
  • Mutual information
  • N-gram
  • Suffix tree
  • Ukkonen algorithm

ASJC Scopus subject areas

  • Computer Science(all)
  • Biochemistry, Genetics and Molecular Biology(all)
  • Theoretical Computer Science

Cite this

Zeng, D., Wei, D., Chau, M., & Wang, F. (2008). Chinese word segmentation for terrorism-related contents. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 5075 LNCS, pp. 1-13). https://doi.org/10.1007/978-3-540-69304-8_1