Combining schema and instance information for integrating heterogeneous data sources

Huimin Zhao, Sudha Ram

Research output: Contribution to journal › Article

27 Citations (Scopus)

Abstract

Determining the correspondences among heterogeneous data sources, which is critical to integration of the data sources, is a complex and resource-consuming task that demands automated support. We propose an iterative procedure for detecting both schema-level and instance-level correspondences from heterogeneous data sources. Cluster analysis techniques are used first to identify similar schema elements (i.e., relations and attributes). Based on the identified schema-level correspondences, classification techniques are used to identify matching tuples. Statistical analysis techniques are then applied to a preliminary integrated data set to evaluate the relationships among schema elements more accurately. Improvement in schema-level correspondences triggers another iteration of an iterative procedure. We have performed empirical evaluation using real-world heterogeneous data sources and report in this paper some promising results (i.e., incremental improvement in identified correspondences) that demonstrate the utility of the proposed iterative procedure.
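The loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: a plain string-similarity score from Python's `difflib` stands in for the cluster analysis, classification, and statistical analysis steps, and every function name, threshold, and data value below is a hypothetical choice made only for illustration.

```python
from difflib import SequenceMatcher


def sim(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio()


def schema_candidates(attrs_a, attrs_b, threshold=0.5):
    """Step 1 stand-in: pair attributes whose names look alike."""
    return {(a, b) for a in attrs_a for b in attrs_b if sim(a, b) >= threshold}


def match_tuples(rows_a, rows_b, pairs, threshold=0.8):
    """Step 2 stand-in: match rows whose values agree on the paired attributes."""
    matches = []
    for i, ra in enumerate(rows_a):
        for j, rb in enumerate(rows_b):
            scores = [sim(ra[a], rb[b]) for a, b in pairs]
            if scores and sum(scores) / len(scores) >= threshold:
                matches.append((i, j))
    return matches


def refine_schema(rows_a, rows_b, matches, attrs_a, attrs_b, threshold=0.8):
    """Step 3 stand-in: keep attribute pairs whose values agree on matched rows."""
    refined = set()
    for a in attrs_a:
        for b in attrs_b:
            scores = [sim(rows_a[i][a], rows_b[j][b]) for i, j in matches]
            if scores and sum(scores) / len(scores) >= threshold:
                refined.add((a, b))
    return refined


def integrate(rows_a, rows_b, max_iter=5):
    """Alternate instance matching and schema refinement until nothing changes."""
    attrs_a, attrs_b = list(rows_a[0]), list(rows_b[0])
    pairs = schema_candidates(attrs_a, attrs_b)
    matches = []
    for _ in range(max_iter):
        matches = match_tuples(rows_a, rows_b, pairs)
        refined = refine_schema(rows_a, rows_b, matches, attrs_a, attrs_b)
        if not refined or refined == pairs:
            break  # no improvement in schema-level correspondences
        pairs = refined  # improvement triggers another iteration
    return pairs, matches


# Hypothetical toy sources describing the same two people.
rows_a = [{"name": "Ann Lee", "phone": "555-0101"},
          {"name": "Bob Ray", "phone": "555-0202"}]
rows_b = [{"full_name": "Ann Lee", "tel": "555-0101"},
          {"full_name": "Bob Ray", "tel": "555-0202"}]

pairs, matches = integrate(rows_a, rows_b)
print(sorted(pairs), matches)
```

In this toy run the attribute names alone link only `name` to `full_name`; the `phone` and `tel` correspondence is recovered from the matched rows, which is the kind of incremental improvement the abstract refers to.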

Original language: English (US)
Pages (from-to): 281-303
Number of pages: 23
Journal: Data and Knowledge Engineering
Volume: 61
Issue number: 2
DOI: 10.1016/j.datak.2006.06.004
State: Published - May 2007

Fingerprint

Cluster analysis
Statistical methods
Data sources

Keywords

  • Data integration
  • Heterogeneous databases
  • Semantic correspondence

ASJC Scopus subject areas

  • Artificial Intelligence

Cite this

Combining schema and instance information for integrating heterogeneous data sources. / Zhao, Huimin; Ram, Sudha.

In: Data and Knowledge Engineering, Vol. 61, No. 2, 05.2007, p. 281-303.

@article{dca52c939a434d5e84e63c38ef0bb2a6,
title = "Combining schema and instance information for integrating heterogeneous data sources",
abstract = "Determining the correspondences among heterogeneous data sources, which is critical to integration of the data sources, is a complex and resource-consuming task that demands automated support. We propose an iterative procedure for detecting both schema-level and instance-level correspondences from heterogeneous data sources. Cluster analysis techniques are used first to identify similar schema elements (i.e., relations and attributes). Based on the identified schema-level correspondences, classification techniques are used to identify matching tuples. Statistical analysis techniques are then applied to a preliminary integrated data set to evaluate the relationships among schema elements more accurately. Improvement in schema-level correspondences triggers another iteration of an iterative procedure. We have performed empirical evaluation using real-world heterogeneous data sources and report in this paper some promising results (i.e., incremental improvement in identified correspondences) that demonstrate the utility of the proposed iterative procedure.",
keywords = "Data integration, Heterogeneous databases, Semantic correspondence",
author = "Huimin Zhao and Sudha Ram",
year = "2007",
month = "5",
doi = "10.1016/j.datak.2006.06.004",
language = "English (US)",
volume = "61",
pages = "281--303",
journal = "Data and Knowledge Engineering",
issn = "0169-023X",
publisher = "Elsevier",
number = "2",
}

TY - JOUR
T1 - Combining schema and instance information for integrating heterogeneous data sources
AU - Zhao, Huimin
AU - Ram, Sudha
PY - 2007/5
Y1 - 2007/5
AB - Determining the correspondences among heterogeneous data sources, which is critical to integration of the data sources, is a complex and resource-consuming task that demands automated support. We propose an iterative procedure for detecting both schema-level and instance-level correspondences from heterogeneous data sources. Cluster analysis techniques are used first to identify similar schema elements (i.e., relations and attributes). Based on the identified schema-level correspondences, classification techniques are used to identify matching tuples. Statistical analysis techniques are then applied to a preliminary integrated data set to evaluate the relationships among schema elements more accurately. Improvement in schema-level correspondences triggers another iteration of an iterative procedure. We have performed empirical evaluation using real-world heterogeneous data sources and report in this paper some promising results (i.e., incremental improvement in identified correspondences) that demonstrate the utility of the proposed iterative procedure.
KW - Data integration
KW - Heterogeneous databases
KW - Semantic correspondence
UR - http://www.scopus.com/inward/record.url?scp=33947161876&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=33947161876&partnerID=8YFLogxK
U2 - 10.1016/j.datak.2006.06.004
DO - 10.1016/j.datak.2006.06.004
M3 - Article
VL - 61
SP - 281
EP - 303
JO - Data and Knowledge Engineering
JF - Data and Knowledge Engineering
SN - 0169-023X
IS - 2
ER -