Combining probability model and Web mining model: A framework for proper name transliteration

Yilu Zhou, Feng Huang, Hsinchun Chen

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

The rapid growth of the Internet has created a tremendous number of multilingual resources. There are Web pages in almost every popular language. However, language boundaries prevent information sharing and discovery across countries. Proper names play an important role in search queries and knowledge discovery, but foreign names often need to be translated phonetically, a process referred to as transliteration. Previous transliteration models can be categorized into three approaches: a rule-based approach, a machine learning approach, and a statistical approach. In this research we proposed a generic proper name transliteration framework that incorporates an enhanced Hidden Markov Model (HMM) and a Web mining model. We improved traditional statistical transliteration in three areas: 1) incorporated a simple phonetic transliteration knowledge base; 2) incorporated a bigram and a trigram HMM; 3) incorporated a Web mining model that uses word frequency-of-occurrence information from the Web. We evaluated the framework on two language pairs, English-Arabic and English-Chinese. Our experiments showed that, when using the HMM alone, a combination of the bigram and trigram HMMs performed best for English-Arabic transliteration. While the bigram model alone achieved fairly good performance, the trigram model alone did not. The Web mining approach boosted the performance by 46%. For English-Chinese transliteration, we found that the trigram model outperformed the bigram model. The Web mining approach again improved the performance, by 12%. Overall, our framework achieved a precision of 86%-93% when the 8 best transliterations were considered. Our results are encouraging and show promise for using transliteration techniques to improve multilingual Web retrieval.
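The abstract does not give the model's formulas, so the following is only a minimal sketch of the general idea it describes: score candidate transliterations with a mix of bigram and trigram statistics, then re-rank them using Web occurrence counts. Everything here is an assumption made for illustration: the function names, the add-one smoothing, the linear interpolation weight `lam`, the log-count re-ranking formula, and the use of a plain character n-gram model over target strings in place of the paper's full HMM.

```python
import math
from collections import Counter

def train_ngram_counts(names, n):
    """Count character n-grams in target-side names ('^' pads the start, '$' marks the end)."""
    counts, context_counts = Counter(), Counter()
    for name in names:
        padded = "^" * (n - 1) + name + "$"
        for i in range(n - 1, len(padded)):
            context = padded[i - n + 1:i]
            counts[(context, padded[i])] += 1
            context_counts[context] += 1
    return counts, context_counts

def ngram_prob(model, context, ch, vocab_size):
    """Add-one smoothed P(ch | context); vocab_size is an assumed alphabet size."""
    counts, context_counts = model
    return (counts[(context, ch)] + 1) / (context_counts[context] + vocab_size)

def combined_log_score(candidate, bigram, trigram, vocab_size, lam=0.5):
    """Log-probability of a candidate under an interpolated bigram/trigram model."""
    padded = "^^" + candidate + "$"
    score = 0.0
    for i in range(2, len(padded)):
        p2 = ngram_prob(bigram, padded[i - 1:i], padded[i], vocab_size)
        p3 = ngram_prob(trigram, padded[i - 2:i], padded[i], vocab_size)
        score += math.log(lam * p2 + (1 - lam) * p3)
    return score

def rerank_with_web_counts(scored, web_counts, weight=1.0):
    """Re-rank candidates by model score plus a weighted log of (hypothetical) Web hit counts."""
    def key(candidate):
        return scored[candidate] + weight * math.log(1 + web_counts.get(candidate, 0))
    return sorted(scored, key=key, reverse=True)

# Toy usage with made-up training names and made-up hit counts.
training = ["mohamed", "mohammed", "muhammad"]
bigram = train_ngram_counts(training, 2)
trigram = train_ngram_counts(training, 3)
candidates = ["mohamed", "muhamad", "mxhqmzd"]
scored = {c: combined_log_score(c, bigram, trigram, vocab_size=28) for c in candidates}
ranking = rerank_with_web_counts(scored, {"mohamed": 1200, "muhamad": 300})
```

Taking the first 8 entries of `ranking` would correspond to the "8 best transliterations" setting mentioned in the abstract, under the assumptions above.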

Original language: English (US)
Title of host publication: 15th Workshop on Information Technology and Systems, WITS 2005
Publisher: University of Arizona
Pages: 135-140
Number of pages: 6
State: Published - 2005
Externally published: Yes
Event: 15th Workshop on Information Technology and Systems, WITS 2005 - Las Vegas, NV, United States
Duration: Dec 10 2005 to Dec 11 2005

Fingerprint

Hidden Markov models
Speech analysis
Data mining
Learning systems
Websites
Internet
Experiments

ASJC Scopus subject areas

  • Information Systems
  • Control and Systems Engineering

Cite this

Zhou, Y., Huang, F., & Chen, H. (2005). Combining probability model and Web mining model: A framework for proper name transliteration. In 15th Workshop on Information Technology and Systems, WITS 2005 (pp. 135-140). University of Arizona.

Scopus record: http://www.scopus.com/inward/record.url?scp=84905721068&partnerID=8YFLogxK
Scopus cited-by: http://www.scopus.com/inward/citedby.url?scp=84905721068&partnerID=8YFLogxK