Scaling a Neyman-Pearson subset selection approach via heuristics for mining massive data

Gregory Ditzler, Matthew Austen, Gail Rosen, Robi Polikar

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Scopus citations

Abstract

Feature subset selection is an important step towards producing a classifier that relies only on relevant features, while keeping the computational complexity of the classifier low. Feature selection is also used to make inferences about the importance of attributes, even when classification is not the ultimate goal. For example, in bioinformatics and genomics, feature subset selection is used to infer which variables best describe multiple populations. Unfortunately, many feature selection algorithms require the subset size to be specified a priori, but knowing how many variables to select is typically a nontrivial task. Other approaches require a specific variable subset selection framework. In this work, we examine an approach to feature subset selection that works with a generic variable selection algorithm and provides statistical inference on the number of relevant features, which may be unknown to the generic variable selection algorithm. This work extends our previous implementation of a Neyman-Pearson feature selection (NPFS) hypothesis test, which acts as a meta-subset selection algorithm. Specifically, we examine the conservativeness of the NPFS approach by biasing the hypothesis test, and examine other heuristics for NPFS. We include results from carefully designed synthetic datasets. Furthermore, we demonstrate the ability of NPFS to perform on data of a massive scale.
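The abstract does not spell out the hypothesis test, so the following is only a minimal sketch of how a meta-subset selection test of this kind could look. It assumes (this is not stated above) that a base selector picks k features on each of n bootstrap samples, and that a feature's selection count is compared against a critical value of a Binomial(n, k/K) null distribution, where K is the total number of features. The base selector `top_k_by_mean_gap`, the data generator `make_synthetic`, and all parameter values are hypothetical and chosen only for illustration; the biasing of the test and the other heuristics mentioned in the abstract are not covered.

```python
import math
import random


def binomial_critical_value(n, p, alpha):
    """Return the smallest z with P(Z <= z) >= 1 - alpha for Z ~ Binomial(n, p)."""
    cdf = 0.0
    for z in range(n + 1):
        cdf += math.comb(n, z) * p**z * (1 - p) ** (n - z)
        if cdf >= 1 - alpha:
            return z
    return n


def top_k_by_mean_gap(X, y, k):
    """Hypothetical base selector: keep the k features with the largest
    absolute difference between the class-1 and class-0 means."""
    n_features = len(X[0])
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    gaps = []
    for j in range(n_features):
        mean_pos = sum(row[j] for row in pos) / len(pos)
        mean_neg = sum(row[j] for row in neg) / len(neg)
        gaps.append(abs(mean_pos - mean_neg))
    return sorted(range(n_features), key=lambda j: gaps[j], reverse=True)[:k]


def npfs(X, y, k, n_bootstraps=50, alpha=0.01, seed=0):
    """Assumed NPFS-style test: count how often each feature is selected
    across bootstraps and keep those whose count exceeds the binomial
    critical value under the null of random selection (p = k / K)."""
    rng = random.Random(seed)
    n_samples, n_features = len(X), len(X[0])
    counts = [0] * n_features
    for _ in range(n_bootstraps):
        # Redraw until the bootstrap sample contains both classes.
        while True:
            idx = [rng.randrange(n_samples) for _ in range(n_samples)]
            if len({y[i] for i in idx}) == 2:
                break
        X_boot = [X[i] for i in idx]
        y_boot = [y[i] for i in idx]
        for j in top_k_by_mean_gap(X_boot, y_boot, k):
            counts[j] += 1
    threshold = binomial_critical_value(n_bootstraps, k / n_features, alpha)
    return [j for j, count in enumerate(counts) if count > threshold]


def make_synthetic(n_samples=200, n_features=20, n_relevant=3, shift=2.0, seed=1):
    """Hypothetical data: the first n_relevant features are mean-shifted
    for class 1; the remaining features are pure Gaussian noise."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2
        row = [
            rng.gauss(shift * label if j < n_relevant else 0.0, 1.0)
            for j in range(n_features)
        ]
        X.append(row)
        y.append(label)
    return X, y


X, y = make_synthetic()
# k is deliberately larger than the number of truly relevant features.
print(npfs(X, y, k=8))
```

In this sketch k is over-specified on purpose: the point of such a test is that the base selector's subset size need not match the unknown number of relevant features, since only features selected consistently more often than chance survive the threshold.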

Original language: English (US)
Title of host publication: IEEE SSCI 2014 - 2014 IEEE Symposium Series on Computational Intelligence - CIDM 2014
Subtitle of host publication: 2014 IEEE Symposium on Computational Intelligence and Data Mining, Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 439-445
Number of pages: 7
ISBN (Electronic): 9781479945191
DOIs: https://doi.org/10.1109/CIDM.2014.7008701
State: Published - Jan 13 2015
Externally published: Yes
Event: 5th IEEE Symposium on Computational Intelligence and Data Mining, CIDM 2014 - Orlando, United States
Duration: Dec 9 2014 to Dec 12 2014

Publication series

Name: IEEE SSCI 2014 - 2014 IEEE Symposium Series on Computational Intelligence - CIDM 2014: 2014 IEEE Symposium on Computational Intelligence and Data Mining, Proceedings

Conference

Conference: 5th IEEE Symposium on Computational Intelligence and Data Mining, CIDM 2014
Country: United States
City: Orlando
Period: 12/9/14 to 12/12/14

Keywords

  • Neyman-Pearson
  • feature subset selection

ASJC Scopus subject areas

  • Artificial Intelligence
  • Information Systems
  • Signal Processing
  • Software


  • Cite this

    Ditzler, G., Austen, M., Rosen, G., & Polikar, R. (2015). Scaling a Neyman-Pearson subset selection approach via heuristics for mining massive data. In IEEE SSCI 2014 - 2014 IEEE Symposium Series on Computational Intelligence - CIDM 2014: 2014 IEEE Symposium on Computational Intelligence and Data Mining, Proceedings (pp. 439-445). [7008701] (IEEE SSCI 2014 - 2014 IEEE Symposium Series on Computational Intelligence - CIDM 2014: 2014 IEEE Symposium on Computational Intelligence and Data Mining, Proceedings). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/CIDM.2014.7008701