A new dependency and correlation analysis for features

Guangzhi Qu, Salim A Hariri, Mazin Yousif

Research output: Contribution to journal › Article

97 Citations (Scopus)

Abstract

The quality of the data being analyzed is a critical factor that affects the accuracy of data mining algorithms. Two important aspects of data quality are relevance and redundancy. The inclusion of irrelevant and redundant features in a data mining model results in poor predictions and high computational overhead. This paper presents an efficient method that considers both the relevance of the features and the pairwise feature correlation in order to improve the prediction accuracy of our data mining algorithm. We introduce a new feature correlation metric Q_Y(X_i, X_j) and a feature subset merit measure e(S) to quantify the relevance and the correlation among features with respect to a desired data mining task (e.g., detection of abnormal behavior in a network service due to network attacks). Our approach takes into consideration not only the dependency among the features, but also their dependency with respect to a given data mining task. Our analysis shows that the correlation relationship among features depends on the decision task and, thus, the features display different behaviors as we change the decision task. We applied our data mining approach to network security and validated it using the DARPA KDD99 benchmark data set. Our results show that, using the new decision-dependent correlation metric, we can efficiently detect rare network attacks such as User to Root (U2R) and Remote to Local (R2L) attacks. The best reported detection rates for U2R and R2L on the KDD99 data sets were 13.2 percent and 8.4 percent, respectively, with a 0.5 percent false alarm rate. For U2R attacks, our approach can achieve a 92.5 percent detection rate with a false alarm rate of 0.7587 percent. For R2L attacks, our approach can achieve a 92.47 percent detection rate with a false alarm rate of 8.35 percent.
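The abstract does not define Q_Y(X_i, X_j) or e(S), so the following is only a generic illustration of the underlying idea: weighing a feature's relevance to the task against its redundancy with features already chosen. It swaps in plain mutual information and a greedy "relevance minus mean redundancy" score, neither of which is claimed to be the paper's metric. The function names and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for two discrete sequences."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def greedy_select(features, target, k):
    """Greedily pick up to k feature names.

    Each step picks the feature with the highest score, where
    score = relevance to the target - mean redundancy with selected features.
    This is a stand-in heuristic, NOT the paper's Q_Y or e(S).
    """
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:
        def score(name):
            relevance = mutual_information(features[name], target)
            if not selected:
                return relevance
            redundancy = sum(
                mutual_information(features[name], features[s]) for s in selected
            ) / len(selected)
            return relevance - redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (hypothetical): the target is determined by "a" and "b" together,
# and "a_copy" duplicates "a", so it is relevant but fully redundant.
features = {
    "a": [0, 0, 1, 1, 0, 0, 1, 1],
    "a_copy": [0, 0, 1, 1, 0, 0, 1, 1],
    "b": [0, 1, 0, 1, 0, 1, 0, 1],
}
target = [0, 1, 2, 3, 0, 1, 2, 3]

print(greedy_select(features, target, 2))
```

In this toy case all three features are equally relevant on their own, but once "a" is selected the duplicate "a_copy" adds no new information, so the redundancy penalty steers the second pick to "b".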

Original language: English (US)
Pages (from-to): 1199-1206
Number of pages: 8
Journal: IEEE Transactions on Knowledge and Data Engineering
Volume: 17
Issue number: 9
DOI: 10.1109/TKDE.2005.136
State: Published - Sep 2005

Fingerprint

Data mining
Network security
Redundancy

Keywords

  • Correlation measure
  • Feature extraction

ASJC Scopus subject areas

  • Control and Systems Engineering
  • Electrical and Electronic Engineering
  • Artificial Intelligence
  • Information Systems

Cite this

A new dependency and correlation analysis for features. / Qu, Guangzhi; Hariri, Salim A; Yousif, Mazin.

In: IEEE Transactions on Knowledge and Data Engineering, Vol. 17, No. 9, 09.2005, p. 1199-1206.

@article{34b2130599e141c2b5b8ff132edf089c,
title = "A new dependency and correlation analysis for features",
abstract = "The quality of the data being analyzed is a critical factor that affects the accuracy of data mining algorithms. Two important aspects of data quality are relevance and redundancy. The inclusion of irrelevant and redundant features in a data mining model results in poor predictions and high computational overhead. This paper presents an efficient method that considers both the relevance of the features and the pairwise feature correlation in order to improve the prediction accuracy of our data mining algorithm. We introduce a new feature correlation metric $Q_Y(X_i, X_j)$ and a feature subset merit measure $e(S)$ to quantify the relevance and the correlation among features with respect to a desired data mining task (e.g., detection of abnormal behavior in a network service due to network attacks). Our approach takes into consideration not only the dependency among the features, but also their dependency with respect to a given data mining task. Our analysis shows that the correlation relationship among features depends on the decision task and, thus, the features display different behaviors as we change the decision task. We applied our data mining approach to network security and validated it using the DARPA KDD99 benchmark data set. Our results show that, using the new decision-dependent correlation metric, we can efficiently detect rare network attacks such as User to Root (U2R) and Remote to Local (R2L) attacks. The best reported detection rates for U2R and R2L on the KDD99 data sets were 13.2 percent and 8.4 percent, respectively, with a 0.5 percent false alarm rate. For U2R attacks, our approach can achieve a 92.5 percent detection rate with a false alarm rate of 0.7587 percent. For R2L attacks, our approach can achieve a 92.47 percent detection rate with a false alarm rate of 8.35 percent.",
keywords = "Correlation measure, Feature extraction",
author = "Guangzhi Qu and Hariri, {Salim A} and Mazin Yousif",
year = "2005",
month = "9",
doi = "10.1109/TKDE.2005.136",
language = "English (US)",
volume = "17",
pages = "1199--1206",
journal = "IEEE Transactions on Knowledge and Data Engineering",
issn = "1041-4347",
publisher = "IEEE Computer Society",
number = "9",

}

TY - JOUR

T1 - A new dependency and correlation analysis for features

AU - Qu, Guangzhi

AU - Hariri, Salim A

AU - Yousif, Mazin

PY - 2005/9

Y1 - 2005/9

N2 - The quality of the data being analyzed is a critical factor that affects the accuracy of data mining algorithms. Two important aspects of data quality are relevance and redundancy. The inclusion of irrelevant and redundant features in a data mining model results in poor predictions and high computational overhead. This paper presents an efficient method that considers both the relevance of the features and the pairwise feature correlation in order to improve the prediction accuracy of our data mining algorithm. We introduce a new feature correlation metric Q_Y(X_i, X_j) and a feature subset merit measure e(S) to quantify the relevance and the correlation among features with respect to a desired data mining task (e.g., detection of abnormal behavior in a network service due to network attacks). Our approach takes into consideration not only the dependency among the features, but also their dependency with respect to a given data mining task. Our analysis shows that the correlation relationship among features depends on the decision task and, thus, the features display different behaviors as we change the decision task. We applied our data mining approach to network security and validated it using the DARPA KDD99 benchmark data set. Our results show that, using the new decision-dependent correlation metric, we can efficiently detect rare network attacks such as User to Root (U2R) and Remote to Local (R2L) attacks. The best reported detection rates for U2R and R2L on the KDD99 data sets were 13.2 percent and 8.4 percent, respectively, with a 0.5 percent false alarm rate. For U2R attacks, our approach can achieve a 92.5 percent detection rate with a false alarm rate of 0.7587 percent. For R2L attacks, our approach can achieve a 92.47 percent detection rate with a false alarm rate of 8.35 percent.

AB - The quality of the data being analyzed is a critical factor that affects the accuracy of data mining algorithms. Two important aspects of data quality are relevance and redundancy. The inclusion of irrelevant and redundant features in a data mining model results in poor predictions and high computational overhead. This paper presents an efficient method that considers both the relevance of the features and the pairwise feature correlation in order to improve the prediction accuracy of our data mining algorithm. We introduce a new feature correlation metric Q_Y(X_i, X_j) and a feature subset merit measure e(S) to quantify the relevance and the correlation among features with respect to a desired data mining task (e.g., detection of abnormal behavior in a network service due to network attacks). Our approach takes into consideration not only the dependency among the features, but also their dependency with respect to a given data mining task. Our analysis shows that the correlation relationship among features depends on the decision task and, thus, the features display different behaviors as we change the decision task. We applied our data mining approach to network security and validated it using the DARPA KDD99 benchmark data set. Our results show that, using the new decision-dependent correlation metric, we can efficiently detect rare network attacks such as User to Root (U2R) and Remote to Local (R2L) attacks. The best reported detection rates for U2R and R2L on the KDD99 data sets were 13.2 percent and 8.4 percent, respectively, with a 0.5 percent false alarm rate. For U2R attacks, our approach can achieve a 92.5 percent detection rate with a false alarm rate of 0.7587 percent. For R2L attacks, our approach can achieve a 92.47 percent detection rate with a false alarm rate of 8.35 percent.

KW - Correlation measure

KW - Feature extraction

UR - http://www.scopus.com/inward/record.url?scp=27644496932&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=27644496932&partnerID=8YFLogxK

U2 - 10.1109/TKDE.2005.136

DO - 10.1109/TKDE.2005.136

M3 - Article

AN - SCOPUS:27644496932

VL - 17

SP - 1199

EP - 1206

JO - IEEE Transactions on Knowledge and Data Engineering

JF - IEEE Transactions on Knowledge and Data Engineering

SN - 1041-4347

IS - 9

ER -