Gene selection using support vector machines with non-convex penalty

Hao Zhang, Jeongyoun Ahn, Xiaodong Lin, Cheolwoo Park

Research output: Contribution to journal (Article)

167 Citations (Scopus)

Abstract

Motivation: With the development of DNA microarray technology, scientists can now measure the expression levels of thousands of genes simultaneously in a single experiment. One current difficulty in interpreting microarray data comes from their innate 'high-dimensional, low sample size' nature. Therefore, robust and accurate gene selection methods are required to identify differentially expressed groups of genes across different samples, e.g. between cancerous and normal cells. Successful gene selection will help to classify different cancer types, lead to a better understanding of genetic signatures in cancers and improve treatment strategies. Although gene selection and cancer classification are two closely related problems, most existing approaches handle them separately by selecting genes prior to classification. We provide a unified procedure for simultaneous gene selection and cancer classification, achieving high accuracy in both aspects.

Results: In this paper we develop a novel type of regularization in support vector machines (SVMs) to identify important genes for cancer classification. A special non-convex penalty, called the smoothly clipped absolute deviation (SCAD) penalty, is imposed on the hinge loss function in the SVM. By systematically thresholding small estimates to zero, the new procedure eliminates redundant genes automatically and yields a compact and accurate classifier. A successive quadratic algorithm is proposed to convert the non-differentiable and non-convex optimization problem into easily solved systems of linear equations. The method is applied to two real datasets and has produced very promising results.
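The abstract names the penalty but does not write it out. As a rough illustration only, the sketch below evaluates the smoothly clipped absolute deviation penalty in its standard piecewise form and adds it to an averaged hinge loss. The function names, the shape parameter default a = 3.7, and the averaging of the hinge term are assumptions made for this sketch and are not taken from the abstract; the successive quadratic algorithm itself is not reproduced here.

```python
import numpy as np

def scad_penalty(w, lam, a=3.7):
    """Standard piecewise SCAD penalty, evaluated elementwise on |w|.

    lam is the tuning parameter; a > 2 is the shape parameter
    (a = 3.7 is an assumed default, not stated in the abstract).
    """
    t = np.abs(np.asarray(w, dtype=float))
    linear = lam * t                                   # |w| <= lam: behaves like an L1 penalty
    quadratic = -(t**2 - 2 * a * lam * t + lam**2) / (2 * (a - 1))  # lam < |w| <= a*lam
    constant = (a + 1) * lam**2 / 2                    # |w| > a*lam: penalty stops growing
    return np.where(t <= lam, linear,
                    np.where(t <= a * lam, quadratic, constant))

def scad_svm_objective(w, b, X, y, lam, a=3.7):
    """Averaged hinge loss plus summed SCAD penalty (illustrative form).

    X has shape (n_samples, n_genes); y holds labels in {-1, +1}.
    """
    margins = y * (X @ w + b)
    hinge = np.maximum(0.0, 1.0 - margins).mean()
    return hinge + scad_penalty(w, lam, a).sum()
```

Because the penalty is flat beyond a*lam, large coefficients are not shrunk further, while coefficients near zero are penalized like an L1 term, which is consistent with the thresholding behaviour described in the abstract.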

Original language: English (US)
Pages (from-to): 88-95
Number of pages: 8
Journal: Bioinformatics
Volume: 22
Issue number: 1
DOI: 10.1093/bioinformatics/bti736
State: Published - Jan 2006
Externally published: Yes

Fingerprint

Gene Selection
Support vector machines
Penalty
Support Vector Machine
Cancer Classification
Genes
Gene
Neoplasm Genes
Cancer
Nondifferentiable Optimization
DNA Microarray
Nonconvex Optimization
Microarrays
Nonconvex Problems
Thresholding
Loss Function
Microarray Data
Convert
Oligonucleotide Array Sequence Analysis
Linear equation

ASJC Scopus subject areas

  • Clinical Biochemistry
  • Computer Science Applications
  • Computational Theory and Mathematics

Cite this

Gene selection using support vector machines with non-convex penalty. / Zhang, Hao; Ahn, Jeongyoun; Lin, Xiaodong; Park, Cheolwoo.

In: Bioinformatics, Vol. 22, No. 1, 01.2006, p. 88-95.

Zhang, Hao ; Ahn, Jeongyoun ; Lin, Xiaodong ; Park, Cheolwoo. / Gene selection using support vector machines with non-convex penalty. In: Bioinformatics. 2006 ; Vol. 22, No. 1. pp. 88-95.
@article{7a7ce431a6ea4e7ca16d7266cf84c40d,
title = "Gene selection using support vector machines with non-convex penalty",
abstract = "Motivation: With the development of DNA microarray technology, scientists can now measure the expression levels of thousands of genes simultaneously in one single experiment. One current difficulty in interpreting microarray data comes from their innate nature of 'high-dimensional low sample size'. Therefore, robust and accurate gene selection methods are required to identify differentially expressed group of genes across different samples, e.g. between cancerous and normal cells. Successful gene selection will help to classify different cancer types, lead to a better understanding of genetic signatures in cancers and improve treatment strategies. Although gene selection and cancer classification are two closely related problems, most existing approaches handle them separately by selecting genes prior to classification. We provide a unified procedure for simultaneous gene selection and cancer classification, achieving high accuracy in both aspects. Results: In this paper we develop a novel type of regularization in support vector machines (SVMs) to identify important genes for cancer classification. A special nonconvex penalty, called the smoothly clipped absolute deviation penalty, is imposed on the hinge loss function in the SVM. By systematically thresholding small estimates to zeros, the new procedure eliminates redundant genes automatically and yields a compact and accurate classifier. A successive quadratic algorithm is proposed to convert the non-differentiable and non-convex optimization problem into easily solved linear equation systems. The method is applied to two real datasets and has produced very promising results.",
author = "Hao Zhang and Jeongyoun Ahn and Xiaodong Lin and Cheolwoo Park",
year = "2006",
month = "1",
doi = "10.1093/bioinformatics/bti736",
language = "English (US)",
volume = "22",
pages = "88--95",
journal = "Bioinformatics",
issn = "1367-4803",
publisher = "Oxford University Press",
number = "1"
}

TY  - JOUR
T1  - Gene selection using support vector machines with non-convex penalty
AU  - Zhang, Hao
AU  - Ahn, Jeongyoun
AU  - Lin, Xiaodong
AU  - Park, Cheolwoo
PY  - 2006/1
Y1  - 2006/1
N2  - Motivation: With the development of DNA microarray technology, scientists can now measure the expression levels of thousands of genes simultaneously in one single experiment. One current difficulty in interpreting microarray data comes from their innate nature of 'high-dimensional low sample size'. Therefore, robust and accurate gene selection methods are required to identify differentially expressed group of genes across different samples, e.g. between cancerous and normal cells. Successful gene selection will help to classify different cancer types, lead to a better understanding of genetic signatures in cancers and improve treatment strategies. Although gene selection and cancer classification are two closely related problems, most existing approaches handle them separately by selecting genes prior to classification. We provide a unified procedure for simultaneous gene selection and cancer classification, achieving high accuracy in both aspects. Results: In this paper we develop a novel type of regularization in support vector machines (SVMs) to identify important genes for cancer classification. A special nonconvex penalty, called the smoothly clipped absolute deviation penalty, is imposed on the hinge loss function in the SVM. By systematically thresholding small estimates to zeros, the new procedure eliminates redundant genes automatically and yields a compact and accurate classifier. A successive quadratic algorithm is proposed to convert the non-differentiable and non-convex optimization problem into easily solved linear equation systems. The method is applied to two real datasets and has produced very promising results.
UR  - http://www.scopus.com/inward/record.url?scp=30344438839&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=30344438839&partnerID=8YFLogxK
U2  - 10.1093/bioinformatics/bti736
DO  - 10.1093/bioinformatics/bti736
M3  - Article
C2  - 16249260
AN  - SCOPUS:30344438839
VL  - 22
SP  - 88
EP  - 95
JO  - Bioinformatics
JF  - Bioinformatics
SN  - 1367-4803
IS  - 1
ER  -