Locating and reconfiguring records in unstructured multiple-record web documents

David W. Embley, Li Xu

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution

1 Scopus citations

Abstract

Record extraction from data-rich, unstructured, multiple-record Web documents works well [9], but only if the text for each record can be located and isolated. Although some multiple-record Web documents present records as contiguous, delineated chunks of text (which can thus be located and isolated [10]), many do not. When some values of textual records are factored out, are split unnaturally across boundaries, are joined unnaturally within boundaries, or are linked by off-page connectors, or when desired records are interspersed with records that are not of interest, it is difficult to automatically cull records and piece values together to form clean, delineated chunks of text that each represent a single record of interest. In this paper we address this problem and propose an algorithm to find and rearrange (if necessary) records in an HTML document. The essential idea is to attempt to maximize a record-recognition heuristic with respect to a given application ontology. Tests we conducted for two widely differing applications show that this technique properly locates and reconfigures records.

Original language: English (US)
Title of host publication: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Publisher: Springer Verlag
Pages: 256-274
Number of pages: 19
Volume: 1997
ISBN (Print): 3540418261
Publication status: Published - 2001
Externally published: Yes
Event: 3rd International Workshop on the Web and Databases, WebDB 2000 - Dallas, United States
Duration: May 18 2000 to May 19 2000

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 1997
ISSN (Print): 03029743
ISSN (Electronic): 16113349

Other

Other: 3rd International Workshop on the Web and Databases, WebDB 2000
Country: United States
City: Dallas
Period: 5/18/00 to 5/19/00

ASJC Scopus subject areas

  • Computer Science (all)
  • Theoretical Computer Science

Cite this

Embley, D. W., & Xu, L. (2001). Locating and reconfiguring records in unstructured multiple-record web documents. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 1997, pp. 256-274). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 1997). Springer Verlag.