Clustering schema elements for semantic integration of heterogeneous data sources

Huimin Zhao, Sudha Ram

Research output: Contribution to journal › Article

26 Citations (Scopus)

Abstract

Interschema relationship identification (IRI), that is, determining the relationships among schema elements in heterogeneous data sources, is an important step in integrating the data sources. This article proposes a cluster analysis based approach to semi-automating the IRI process, which is typically very time-consuming and requires extensive human interaction. The authors apply multiple clustering techniques, including K-means, hierarchical clustering, and self-organizing map (SOM) neural network, to identify similar schema elements from heterogeneous data sources, based on a combination of features such as naming similarity, document similarity, schema specification, data patterns, and usage patterns. An SOM prototype the authors have developed provides users with a visualization tool for display of clustering results as well as for incremental evaluation of candidate similar elements.
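
As a rough illustration of the kind of clustering the abstract describes, the sketch below groups schema element names by naming similarity alone, using single-link agglomerative (hierarchical) clustering over character-trigram Jaccard similarity. It is not the authors' implementation: the element names, the trigram feature, the 0.3 threshold, and the function names are all assumptions made for this example, and the article combines several further features (document similarity, schema specification, data patterns, usage patterns) that are omitted here.

```python
from itertools import combinations


def trigrams(name):
    """Return the set of character trigrams of a normalized element name."""
    s = name.lower().replace("_", "")
    # Names shorter than three characters fall back to the whole string.
    return {s[i:i + 3] for i in range(len(s) - 2)} or {s}


def jaccard(a, b):
    """Jaccard similarity between two non-empty sets."""
    return len(a & b) / len(a | b)


def cluster(names, threshold=0.3):
    """Single-link agglomerative clustering on naming similarity.

    Repeatedly merges the two most similar clusters and stops once the
    best remaining pair is less similar than `threshold` (an assumed
    cut-off chosen for this example).
    """
    grams = {n: trigrams(n) for n in names}
    clusters = [[n] for n in names]

    def link(c1, c2):
        # Single link: similarity of the closest pair across two clusters.
        return max(jaccard(grams[x], grams[y]) for x in c1 for y in c2)

    while len(clusters) > 1:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: link(clusters[p[0]], clusters[p[1]]),
        )
        if link(clusters[i], clusters[j]) < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return clusters


# Hypothetical element names drawn from two imaginary data sources.
elements = ["cust_name", "customer_name", "order_date", "orderdate", "zip"]
groups = cluster(elements)
```

With these hypothetical names, the candidate groups pair `cust_name` with `customer_name` and `order_date` with `orderdate`, and leave `zip` on its own. In a semi-automated setting such groups would only be candidates for a human analyst to confirm or reject.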

Original language: English (US)
Pages (from-to): 88-106
Number of pages: 19
Journal: Journal of Database Management
Volume: 15
Issue number: 4
State: Published - Oct 2004

Fingerprint

Self organizing maps
Semantics
Cluster analysis
Visualization
Display devices
Neural networks
Specifications
Data sources
Clustering
Self-organizing map

Keywords

  • Attribute correspondence
  • Cluster analysis
  • Heterogeneous database integration
  • Interschema relationship identification
  • Schema correspondence
  • Self-organizing map

ASJC Scopus subject areas

  • Computer Science (all)
  • Decision Sciences (all)

Cite this

Clustering schema elements for semantic integration of heterogeneous data sources. / Zhao, Huimin; Ram, Sudha.

In: Journal of Database Management, Vol. 15, No. 4, 10.2004, p. 88-106.


@article{c440342bbeff40eda44773c8b322cf51,
title = "Clustering schema elements for semantic integration of heterogeneous data sources",
abstract = "Interschema relationship identification (IRI), that is, determining the relationships among schema elements in heterogeneous data sources, is an important step in integrating the data sources. This article proposes a cluster analysis based approach to semi-automating the IRI process, which is typically very time-consuming and requires extensive human interaction. The authors apply multiple clustering techniques, including K-means, hierarchical clustering, and self-organizing map (SOM) neural network, to identify similar schema elements from heterogeneous data sources, based on a combination of features such as naming similarity, document similarity, schema specification, data patterns, and usage patterns. An SOM prototype the authors have developed provides users with a visualization tool for display of clustering results as well as for incremental evaluation of candidate similar elements.",
keywords = "Attribute correspondence, Cluster analysis, Heterogeneous database integration, Interschema relationship identification, Schema correspondence, Self-organizing map",
author = "Huimin Zhao and Sudha Ram",
year = "2004",
month = "10",
language = "English (US)",
volume = "15",
pages = "88--106",
journal = "Journal of Database Management",
issn = "1063-8016",
publisher = "IGI Publishing",
number = "4",
}

TY - JOUR

T1 - Clustering schema elements for semantic integration of heterogeneous data sources

AU - Zhao, Huimin

AU - Ram, Sudha

PY - 2004/10

Y1 - 2004/10

N2 - Interschema relationship identification (IRI), that is, determining the relationships among schema elements in heterogeneous data sources, is an important step in integrating the data sources. This article proposes a cluster analysis based approach to semi-automating the IRI process, which is typically very time-consuming and requires extensive human interaction. The authors apply multiple clustering techniques, including K-means, hierarchical clustering, and self-organizing map (SOM) neural network, to identify similar schema elements from heterogeneous data sources, based on a combination of features such as naming similarity, document similarity, schema specification, data patterns, and usage patterns. An SOM prototype the authors have developed provides users with a visualization tool for display of clustering results as well as for incremental evaluation of candidate similar elements.

KW - Attribute correspondence

KW - Cluster analysis

KW - Heterogeneous database integration

KW - Interschema relationship identification

KW - Schema correspondence

KW - Self-organizing map

UR - http://www.scopus.com/inward/record.url?scp=4444314464&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=4444314464&partnerID=8YFLogxK

M3 - Article

AN - SCOPUS:4444314464

VL - 15

SP - 88

EP - 106

JO - Journal of Database Management

JF - Journal of Database Management

SN - 1063-8016

IS - 4

ER -