How specialized are specialized corpora? Behavioral evaluation of corpus representativeness for Maltese

Jerid Francom, Amy La Cross, Adam P Ussishkin

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

2 Citations (Scopus)

Abstract

In this paper we bring to light a novel intersection between corpus linguistics and behavioral data that can be employed as an evaluation metric for resources for low-density languages, drawing on well-established psycholinguistic factors. Using the low-density language Maltese as a test case, we highlight the challenges that face researchers developing resources for languages with sparsely available data and identify a key empirical link between corpus and psycholinguistic research as a tool to evaluate corpus resources. Specifically, we compare two robust variables identified in the psycholinguistic literature: word frequency (as measured in a corpus) and word familiarity (as measured in a rating task). We then use three statistical methods to evaluate these comparisons. This research provides a multidisciplinary approach to corpus development and evaluation, in particular for less-resourced languages that lack a wide access to diverse language data.
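The abstract does not name the three statistical methods used, so the following is only a minimal sketch of one plausible way to compare the two variables: a Spearman rank correlation between log-transformed corpus frequency and mean familiarity rating. The word identifiers (`w1` to `w5`), the counts, the ratings, and the choice of rank correlation are all illustrative assumptions, not data or methods taken from the paper.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data (NOT from the paper): raw corpus counts and
# mean familiarity ratings on an assumed 1-7 scale.
corpus_counts = {"w1": 950, "w2": 120, "w3": 15, "w4": 3, "w5": 40}
familiarity = {"w1": 6.8, "w2": 5.9, "w3": 3.1, "w4": 2.2, "w5": 4.9}

words = sorted(corpus_counts)
# Log-transform counts; the +1 guards against zero counts.
log_freq = [math.log10(corpus_counts[w] + 1) for w in words]
ratings = [familiarity[w] for w in words]

rho = spearman(log_freq, ratings)
print(f"Spearman rho between log frequency and familiarity: {rho:.3f}")
```

Under this reading, a corpus whose frequency ranking closely tracks speakers' familiarity ratings yields a coefficient near 1, while a highly specialized corpus would be expected to give a weaker correlation.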

Original language: English (US)
Title of host publication: Proceedings of the 7th International Conference on Language Resources and Evaluation, LREC 2010
Publisher: European Language Resources Association (ELRA)
Pages: 421-427
Number of pages: 7
ISBN (Electronic): 2951740867, 9782951740860
State: Published - Jan 1 2010
Event: 7th International Conference on Language Resources and Evaluation, LREC 2010 - Valletta, Malta
Duration: May 17 2010 to May 23 2010

Other

Other: 7th International Conference on Language Resources and Evaluation, LREC 2010
Country: Malta
City: Valletta
Period: 5/17/10 to 5/23/10

Fingerprint

Maltese
psycholinguistics
language
evaluation
resources
statistical method
rating
Specialized Corpora
Representativeness
linguistics
lack

ASJC Scopus subject areas

  • Education
  • Library and Information Sciences
  • Linguistics and Language
  • Language and Linguistics

Cite this

Francom, J., La Cross, A., & Ussishkin, A. P. (2010). How specialized are specialized corpora? Behavioral evaluation of corpus representativeness for Maltese. In Proceedings of the 7th International Conference on Language Resources and Evaluation, LREC 2010 (pp. 421-427). European Language Resources Association (ELRA).

@inproceedings{2fb7946769514fb8ba2dd2f03ee4d961,
title = "How specialized are specialized corpora? Behavioral evaluation of corpus representativeness for Maltese",
abstract = "In this paper we bring to light a novel intersection between corpus linguistics and behavioral data that can be employed as an evaluation metric for resources for low-density languages, drawing on well-established psycholinguistic factors. Using the low-density language Maltese as a test case, we highlight the challenges that face researchers developing resources for languages with sparsely available data and identify a key empirical link between corpus and psycholinguistic research as a tool to evaluate corpus resources. Specifically, we compare two robust variables identified in the psycholinguistic literature: word frequency (as measured in a corpus) and word familiarity (as measured in a rating task). We then use three statistical methods to evaluate these comparisons. This research provides a multidisciplinary approach to corpus development and evaluation, in particular for less-resourced languages that lack a wide access to diverse language data.",
author = "Francom, Jerid and {La Cross}, Amy and Ussishkin, {Adam P}",
year = "2010",
month = "1",
day = "1",
language = "English (US)",
pages = "421--427",
booktitle = "Proceedings of the 7th International Conference on Language Resources and Evaluation, LREC 2010",
publisher = "European Language Resources Association (ELRA)",
}

TY  - GEN
T1  - How specialized are specialized corpora? Behavioral evaluation of corpus representativeness for Maltese
AU  - Francom, Jerid
AU  - La Cross, Amy
AU  - Ussishkin, Adam P
PY  - 2010/1/1
Y1  - 2010/1/1
AB  - In this paper we bring to light a novel intersection between corpus linguistics and behavioral data that can be employed as an evaluation metric for resources for low-density languages, drawing on well-established psycholinguistic factors. Using the low-density language Maltese as a test case, we highlight the challenges that face researchers developing resources for languages with sparsely available data and identify a key empirical link between corpus and psycholinguistic research as a tool to evaluate corpus resources. Specifically, we compare two robust variables identified in the psycholinguistic literature: word frequency (as measured in a corpus) and word familiarity (as measured in a rating task). We then use three statistical methods to evaluate these comparisons. This research provides a multidisciplinary approach to corpus development and evaluation, in particular for less-resourced languages that lack a wide access to diverse language data.
UR  - http://www.scopus.com/inward/record.url?scp=84944679519&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=84944679519&partnerID=8YFLogxK
M3  - Conference contribution
SP  - 421
EP  - 427
BT  - Proceedings of the 7th International Conference on Language Resources and Evaluation, LREC 2010
PB  - European Language Resources Association (ELRA)
ER  -