{"title": "Kernel Feature Selection via Conditional Covariance Minimization", "book": "Advances in Neural Information Processing Systems", "page_first": 6946, "page_last": 6955, "abstract": "We propose a method for feature selection that employs kernel-based measures of independence to find a subset of covariates that is maximally predictive of the response. Building on past work in kernel dimension reduction, we show how to perform feature selection via a constrained optimization problem involving the trace of the conditional covariance operator. We prove various consistency results for this procedure, and also demonstrate that our method compares favorably with other state-of-the-art algorithms on a variety of synthetic and real data sets.", "full_text": "Kernel Feature Selection via\n\nConditional Covariance Minimization\n\nJianbo Chen\u21e4\n\nUniversity of California, Berkeley\n\njianbochen@berkeley.edu\n\nMartin J. Wainwright\n\nUniversity of California, Berkeley\n\nwainwrig@berkeley.edu\n\nMitchell Stern\u21e4\n\nUniversity of California, Berkeley\n\nmitchell@berkeley.edu\n\nMichael I. Jordan\n\nUniversity of California, Berkeley\n\njordan@berkeley.edu\n\nAbstract\n\nWe propose a method for feature selection that employs kernel-based measures\nof independence to \ufb01nd a subset of covariates that is maximally predictive of the\nresponse. Building on past work in kernel dimension reduction, we show how to\nperform feature selection via a constrained optimization problem involving the\ntrace of the conditional covariance operator. 
We prove various consistency results\nfor this procedure, and also demonstrate that our method compares favorably with\nother state-of-the-art algorithms on a variety of synthetic and real data sets.\n\n1\n\nIntroduction\n\nFeature selection is an important issue in statistical machine learning, leading to both computational\nbene\ufb01ts (lower storage and faster computation) and statistical bene\ufb01ts, including increased model\ninterpretability. With large data sets becoming ever more prevalent, feature selection has seen\nwidespread usage across a variety of real-world tasks in recent years, including text classi\ufb01cation,\ngene selection from microarray data, and face recognition [3, 14, 17]. In this work, we consider\nthe supervised variant of feature selection, which entails \ufb01nding a subset of the input features that\nexplains the output well. This practice can reduce the computational expense of downstream learning\nby removing features that are redundant or noisy, while simultaneously providing insight into the\ndata through the features that remain.\nFeature selection algorithms can generally be divided into two groups: those which are agnostic\nto the choice of learning algorithm, and those which attempt to \ufb01nd features that optimize the\nperformance of a speci\ufb01c learning algorithm.1 Kernel methods have been successfully applied under\neach of these paradigms in recent work; for instance, see the papers [1, 8, 16, 19, 23, 25, 26, 29].\nKernel feature selection methods have the advantage of capturing nonlinear relationships between the\nfeatures and the labels. Many previous approaches are \ufb01lter methods based on the Hilbert-Schmidt\nIndependence Criterion (HSIC), as proposed by Gretton et al. [13] as a measure of dependence.\nFor instance, Song et al. [24, 25] proposed to optimize HSIC with greedy algorithms on features.\nMasaeli et al. 
[19] proposed Hilbert-Schmidt Feature Selection (HSFS), which optimizes HSIC with a continuous relaxation. In later work, Yamada et al. [29] proposed the HSIC-LASSO, in which the dual augmented Lagrangian can be used to find a global optimum. There are also wrapper methods and embedded methods using kernels. Most of these methods add weights to the features and optimize the original kernelized loss function together with a penalty on the weights [1, 5, 8, 11, 12, 27, 28]. For example, Cao et al. [5] proposed margin-based algorithms for SVMs to select features in the kernel space. Lastly, Allen [1] proposed an embedded method suitable for kernel SVMs and kernel ridge regression.

* Equal contribution.
1 Feature selection algorithms that operate independently of the choice of predictor are referred to as filter methods. Algorithms tailored to specific predictors can be further divided into wrapper methods, which use learning algorithms to evaluate features based on their predictive power, and embedded methods, which combine feature selection and learning into a single problem [14].

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

In this paper, we propose to use the trace of the conditional covariance operator as a criterion for feature selection. We offer theoretical motivation for this choice and show that our method can be interpreted both as a filter method and as a wrapper method for a certain class of learning algorithms. We also show that the empirical estimate of the criterion is consistent as the sample size increases. Finally, we conclude with an empirical demonstration that our algorithm is comparable to or better than several other popular feature selection algorithms on both synthetic and real-world tasks.

2 Formulating feature selection

Let X \subset R^d be the domain of covariates X, and let Y be the domain of responses Y. 
Given n independent and identically distributed (i.i.d.) samples {(x_i, y_i), i = 1, 2, ..., n} generated from an unknown joint distribution P_{X,Y} together with an integer m \le d, our goal is to select m of the d total features X_1, X_2, ..., X_d which best predict Y. Let S be the full set of features, and let T \subset S denote a subset of features. For ease of notation, we identify S = {X_1, ..., X_d} with [d] = {1, ..., d}, and also identify X_T with T. We formulate the problem of supervised feature selection from two perspectives below. The first perspective motivates our algorithm as a filter method. The second perspective offers an interpretation as a wrapper method.

2.1 From a dependence perspective

Viewing the problem from the perspective of dependence, we would ideally like to identify a subset of features T of size m such that the remaining features S \ T are conditionally independent of the responses given T. However, this may not be achievable when the number of allowable features m is small. We therefore quantify the extent of the remaining conditional dependence using some metric Q, and aim to minimize Q over all subsets T of the appropriate size. More formally, let Q : 2^{[d]} \to [0, \infty) be a function mapping subsets of [d] to the non-negative reals that satisfies the following properties:

• For a subset of features T, we have Q(T) = 0 if and only if X_{S \ T} and Y are conditionally independent given X_T.
• The function Q is non-increasing, meaning that Q(T) \ge Q(S) whenever T \subset S. Hence, the function Q achieves its minimum for the full feature set T = [d].

Given a fixed integer m, the problem of supervised feature selection can then be posed as

min_{T : |T| = m} Q(T).   (1)

This formulation can be taken as a filter method for feature selection.

2.2 From a prediction perspective

An alternative perspective aims at characterizing how well X_T can predict Y directly within the context of a specific learning problem. Formally, we define the error of prediction as

E_F(X) = inf_{f \in F} E_{X,Y} L(Y, f(X)),   (2)

where F is a class of functions from X to Y, and L is a loss function specified by the user. For example, in a univariate regression problem, the function class F might be the set of all linear functions, and the loss function might be the squared error L(Y, f(X)) = (Y - f(X))^2.
We then hope to solve the following problem:

min_{T : |T| \le m} E_{F_m}(X_T) = min_{T : |T| \le m} inf_{f \in F_m} E_{X,Y} L(Y, f(X_T)),

where F_m is a class of functions supported on R^m. That is, we aim to find the subset of m features that minimizes the prediction error. This formulation thus falls within the scope of wrapper methods for feature selection.

3 Conditional covariance operator

The conditional covariance operator provides a measure of conditional dependence for random variables. It was first proposed by Baker [2], and was further studied and used for sufficient dimension reduction by Fukumizu et al. [9, 10]. We provide a brief overview of this operator and some of its key properties here.
Let (H_X, k_X) and (H_Y, k_Y) denote reproducing kernel Hilbert spaces (RKHSs) of functions on spaces X and Y, respectively. Also let (X, Y) be a random vector on X \times Y with joint distribution P_{X,Y}. Assume the kernels k_X and k_Y are bounded in expectation:

E_X[k_X(X, X)] < \infty and E_Y[k_Y(Y, Y)] < \infty.   (3)

The cross-covariance operator associated with the pair (X, Y) is the mapping \Sigma_{YX} : H_X \to 
H_Y defined by the relation

\langle g, \Sigma_{YX} f \rangle_{H_Y} = E_{X,Y}[(f(X) - E_X[f(X)])(g(Y) - E_Y[g(Y)])] for all f \in H_X and g \in H_Y.   (4)

Here E_X[k_X(X, X)] < \infty and E_Y[k_Y(Y, Y)] < \infty by assumption (3), so the operator is well-defined. Baker [2] showed there exists a unique bounded operator V_{YX} such that

\Sigma_{YX} = \Sigma_{YY}^{1/2} V_{YX} \Sigma_{XX}^{1/2}.   (5a)

The conditional covariance operator is then defined as

\Sigma_{YY|X} = \Sigma_{YY} - \Sigma_{YY}^{1/2} V_{YX} V_{XY} \Sigma_{YY}^{1/2}.   (5b)

Among other results, Fukumizu et al. [9, 10] showed that the conditional covariance operator captures the conditional variance of Y given X. More precisely, if the sum H_X + R is dense in L^2(P_X), where L^2(P_X) is the space of all square-integrable functions on X, then for any g \in H_Y we have

\langle g, \Sigma_{YY|X} g \rangle_{H_Y} = E_X[var_{Y|X}[g(Y)|X]].   (6)

From Proposition 2 in the paper [10], we also know the residual error of g(Y) with g \in H_Y can be characterized by the conditional covariance operator. More formally, for any g \in H_Y, we have

\langle g, \Sigma_{YY|X} g \rangle_{H_Y} = inf_{f \in H_X} E_{X,Y} ((g(Y) - E_Y[g(Y)]) - (f(X) - E_X[f(X)]))^2.   (7)

4 Proposed method

In this section, we describe our method for feature selection, which we call conditional covariance minimization (CCM).
Let (H_1, k_1) denote an RKHS supported on X \subset R^d. Let T \subset [d] be a subset of features with cardinality m \le d, and for all x \in R^d, take x_T \in R^d to be the vector with components (x_T)_i = x_i if i \in T and 0 otherwise. We define the kernel k_1^T by k_1^T(x, \tilde{x}) = k_1(x_T, \tilde{x}_T) for all x, \tilde{x} \in X. Suppose further that the kernel k_1 is permutation-invariant. That is, for any x, \tilde{x} \in X and permutation \pi, denoting (x_{\pi(1)}, ..., x_{\pi(d)}) as x_\pi, we have k_1(x, \tilde{x}) = k_1(x_\pi, \tilde{x}_\pi). (Note that this property holds for many common kernels, including the linear, polynomial, Gaussian, and Laplacian kernels.) Then for every T of cardinality m, k_1^T generates the same RKHS supported on R^m.
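To make the masked kernel k_1^T concrete, here is a minimal sketch for a Gaussian choice of k_1, zeroing out the coordinates outside T before evaluating the kernel; the function name and default bandwidth are illustrative, not from the paper:

```python
import numpy as np

def masked_gaussian_kernel(x, x_tilde, T, sigma=1.0):
    """Evaluate k_1^T(x, x~) = k_1(x_T, x~_T), where x_T zeroes out
    every coordinate outside the feature subset T (Gaussian k_1)."""
    mask = np.zeros(len(x))
    mask[list(T)] = 1.0
    # Masking commutes with the difference, so this equals k_1(x_T, x~_T).
    diff = mask * (np.asarray(x) - np.asarray(x_tilde))
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))
```

Coordinates outside T contribute nothing to the squared distance, so the value depends only on the selected features, as in the definition of k_1^T above.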
We call this RKHS (\tilde{H}_1, \tilde{k}_1). We will show that the trace of the conditional covariance operator Tr(\Sigma_{YY|X}) can be interpreted as a dependence measure, as long as the RKHS H_1 is large enough.
We say that an RKHS (H, k) is characteristic if the map P \mapsto E_P[k(X, \cdot)] \in H is one-to-one. If k is bounded, this is equivalent to saying that H + R is dense in L^2(P) for any probability measure P [10]. We have the following lemma, whose proof is given in the appendix:

Lemma 1. If k_1 is bounded and characteristic, then \tilde{k}_1 is also characteristic.

Let (H_2, k_2) denote an RKHS supported on Y. Based on the above lemma, we have the following theorem, which is a parallel version of Theorem 4 in [10]:

Theorem 2. If (H_1, k_1) and (H_2, k_2) are characteristic, we have \Sigma_{YY|X} \preceq \Sigma_{YY|X_T}, with equality holding if and only if Y \perp\!\!\perp X \mid X_T.

The proof is postponed to the appendix.
With this generic result in place, we now narrow our focus to problems with univariate responses, including univariate regression, binary classification and multi-class classification. In the case of regression, we assume H_2 is supported on R, and we take k_2 to be the linear kernel:

k_2(y, \tilde{y}) = y \tilde{y}   (8)

for all y, \tilde{y} \in R. For binary or multi-class classification, we take k_2 to be the Kronecker delta function:

k_2(y, \tilde{y}) = \delta(y, \tilde{y}) = 1 if y = \tilde{y}, and 0 otherwise.   (9)

This can be equivalently interpreted as a linear kernel k(y, \tilde{y}) = \langle y, \tilde{y} \rangle assuming a one-hot encoding of Y, namely that Y = {y \in {0, 1}^k : \sum_i y_i = 1} \subset R^k, where k is the number of classes.
When Y is R or {y \in {0, 1}^k : \sum_i y_i = 1} \subset R^k, we obtain the following corollary of Theorem 2:

Corollary 3. If (H_1, k_1) is characteristic, Y is R or {y \in {0, 1}^k : \sum_i y_i = 1} \subset R^k, and (H_2, k_2) includes the identity function on Y, then we have Tr(\Sigma_{YY|X}) \le Tr(\Sigma_{YY|X_T}) for any subset T of features. Moreover, the equality Tr(\Sigma_{YY|X}) = Tr(\Sigma_{YY|X_T}) holds if and only if Y \perp\!\!\perp X \mid X_T.

Hence, in the univariate case, the problem of supervised feature selection reduces to minimizing the trace of the conditional covariance operator over subsets of features with controlled cardinality:

min_{T : |T| = m} Q(T) := Tr(\Sigma_{YY|X_T}).   (10)

In the regression setting, Equation (7) implies the residual error of regression can also be characterized by the trace of the conditional covariance operator when using the linear kernel on Y. More formally, we have the following observation:

Corollary 4. Let \Sigma_{YY|X_T} denote the conditional covariance operator of (X_T, Y) in (\tilde{H}_1, \tilde{k}_1). Define the space of functions F_m from R^m to Y as

F_m = \tilde{H}_1 + R := {f + c : f \in \tilde{H}_1, c \in R}.   (11)

Then we have

Tr(\Sigma_{YY|X_T}) = E_{F_m}(X_T) = inf_{f \in F_m} E_{X,Y} (Y - f(X_T))^2.   (12)

Given the fact that the trace of the conditional covariance operator can characterize the dependence and the prediction error in regression, we will use the empirical estimate of it as our objective. Given n samples {(x_1, y_1), ..., (x_n, y_n)}, the empirical estimate is given by [10]:

Tr(\hat{\Sigma}^{(n)}_{YY|X_T}) := Tr[\hat{\Sigma}^{(n)}_{YY} - \hat{\Sigma}^{(n)}_{Y X_T} (\hat{\Sigma}^{(n)}_{X_T X_T} + \varepsilon_n I)^{-1} \hat{\Sigma}^{(n)}_{X_T Y}] = \varepsilon_n Tr[G_Y (G_{X_T} + n \varepsilon_n I_n)^{-1}],

where \hat{\Sigma}^{(n)}_{Y X_T}, \hat{\Sigma}^{(n)}_{X_T X_T} and \hat{\Sigma}^{(n)}_{YY} are the covariance operators defined with respect to the empirical distribution, and G_{X_T} and G_Y are the centralized kernel matrices, respectively. Concretely, we define

G_{X_T} := (I_n - (1/n) 1_n 1_n^T) K_{X_T} (I_n - (1/n) 1_n 1_n^T) and G_Y := (I_n - (1/n) 1_n 1_n^T) K_Y (I_n - (1/n) 1_n 1_n^T).

The (i, j)th entry of the kernel matrix K_{X_T} is \tilde{k}_1(x_{i,T}, x_{j,T}), with x_{i,T} denoting the ith sample restricted to the features in T. As the kernel k_2 on the space of responses is linear, we have K_Y = Y Y^T, where Y is the n \times k matrix with each row being a sample response. Without loss of generality, we assume each column of Y is zero-mean, so that G_Y = K_Y = Y Y^T. Our objective then becomes:

Tr[G_Y (G_{X_T} + n \varepsilon_n I_n)^{-1}] = Tr[Y Y^T (G_{X_T} + n \varepsilon_n I_n)^{-1}] = Tr[Y^T (G_{X_T} + n \varepsilon_n I_n)^{-1} Y].   (13)

For simplicity, we only consider univariate regression and binary classification where k = 1, but our discussion carries over to the multi-class setting with minimal modification. The objective becomes

min_{|T| = m} \hat{Q}^{(n)}(T) := y^T (G_{X_T} + n \varepsilon_n I_n)^{-1} y,   (14)

where y = (y_1, ..., y_n)^T is an n-dimensional vector. We show the global optimum of problem (14) is consistent. More formally, we have the following theorem:

Theorem 5 (Feature Selection Consistency). Let A = argmin_{|T| \le m} Q(T) be the set of all optimal solutions to (10), and let \hat{T}^{(n)} \in argmin_{|T| \le m} \hat{Q}^{(n)}(T) be a global optimum of (14). If \varepsilon_n \to 0 and \varepsilon_n n \to \infty as n \to \infty, we have

P(\hat{T}^{(n)} \in A) \to 1.   (15)

Our proof is provided in the appendix. A comparable result is given in Fukumizu et al. [10] for the consistency of their dimension reduction estimator, but as our minimization takes place over a finite set, our proof is considerably simpler.

5 Optimization

Finding a global optimum for (14) is NP-hard for generic kernels [28], and exhaustive search is computationally intractable if the number of features is large. We therefore approximate the problem of interest via continuous relaxation, as has previously been done in past work on feature selection [4, 27, 28].

5.1 Initial relaxation

We begin by introducing a binary vector w \in {0, 1}^d to indicate which features are active.
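Before relaxing, note that the discrete criterion \hat{Q}^{(n)}(T) = y^T (G_{X_T} + n \varepsilon_n I_n)^{-1} y is cheap to evaluate for any single candidate subset T. The sketch below assumes a Gaussian kernel on the selected features and centers y; all names are illustrative rather than taken from the authors' released code:

```python
import numpy as np

def empirical_criterion(X, y, T, sigma=1.0, eps=1e-3):
    """Empirical estimate y^T (G_{X_T} + n*eps*I)^{-1} y of the
    conditional covariance trace criterion (up to scaling), with a
    Gaussian kernel on the selected features and a linear kernel on y."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    XT = X[:, sorted(T)]                       # keep only features in T
    sq = ((XT[:, None, :] - XT[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / (2.0 * sigma ** 2))       # Gram matrix K_{X_T}
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    G = H @ K @ H                              # centered Gram matrix G_{X_T}
    yc = y - y.mean()                          # center the responses
    return float(yc @ np.linalg.solve(G + n * eps * np.eye(n), yc))
```

Smaller values indicate that X_T explains more of the variation in Y; a brute-force version of (14) would evaluate this quantity over every subset of size m.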
This allows us to rephrase the optimization problem from (14) as

min_w y^T (G_{w \odot X} + n \varepsilon_n I_n)^{-1} y subject to w_i \in {0, 1}, i = 1, ..., d, and 1^T w = m,   (16)

where \odot denotes the Hadamard product between two vectors and G_{w \odot X} is the centralized version of the kernel matrix K_{w \odot X} with (K_{w \odot X})_{ij} = k_1(w \odot x_i, w \odot x_j).
We then approximate the problem (16) by relaxing the domain of w to the unit hypercube [0, 1]^d and replacing the equality constraint with an inequality constraint:

min_w y^T (G_{w \odot X} + n \varepsilon_n I_n)^{-1} y subject to 0 \le w_i \le 1, i = 1, ..., d, and 1^T w \le m.   (17)

This objective can be optimized using projected gradient descent, and represents our first tractable approximation. A solution to the relaxed problem is converted back into a solution for the original problem by setting the m largest values of w to 1 and the remaining values to 0. We initialize w to the uniform vector (m/d)[1, 1, ..., 1]^T in order to avoid the corners of the constraint set during the early stages of optimization.

5.2 Computational issues

The optimization problem can be approximated and manipulated in a number of ways so as to reduce computational complexity. We discuss a few such options below.

Removing the inequality constraint. The hard constraint 1^T w \le m requires a nontrivial projection step, such as the one detailed in Duchi et al. [6]. We can instead replace it with a soft constraint and move it to the objective. Letting \lambda_1 \ge 0 be a hyperparameter, this gives rise to the modified problem

min_w y^T (G_{w \odot X} + n \varepsilon_n I_n)^{-1} y + \lambda_1 (1^T w - m) subject to 0 \le w_i \le 1, i = 1, ..., d.   (18)

Removing the matrix inverse. The matrix inverse in the objective function is an expensive operation. In light of this, we first define an auxiliary variable \alpha \in R^n, add the equality constraint \alpha = (G_{w \odot X} + n \varepsilon_n I_n)^{-1} y, and rewrite the objective as \alpha^T y. 
We then note that we may multiply both sides of the constraint by the centered kernel matrix to obtain the relation (G_{w \odot X} + n \varepsilon_n I_n) \alpha = y. Letting \lambda_2 \ge 0 be a hyperparameter, we finally replace this relation by a soft \ell_2 constraint to obtain

min_{w, \alpha} \alpha^T y + \lambda_2 \|(G_{w \odot X} + n \varepsilon_n I_n) \alpha - y\|_2^2 subject to 0 \le w_i \le 1, i = 1, ..., d, and 1^T w \le m.   (19)

Using a kernel approximation. Rahimi and Recht [22] propose a method for approximating kernel evaluations by inner products of random feature vectors, so that k(x, \tilde{x}) \approx z(x)^T z(\tilde{x}) for a random map z depending on the choice of kernel k. Let K_w \approx U_w U_w^T be such a decomposition, where U_w \in R^{n \times D} for some D < n. Then, defining V_w = (I - 1 1^T / n) U_w, we similarly have that the centered kernel matrix can be written as G_w \approx V_w V_w^T. By the Woodbury matrix identity, we may write

(G_{w \odot X} + n \varepsilon_n I_n)^{-1} \approx (1 / (\varepsilon_n n)) I - (1 / (\varepsilon_n^2 n^2)) V_w (I_D + (1 / (\varepsilon_n n)) V_w^T V_w)^{-1} V_w^T = (1 / (\varepsilon_n n)) (I - V_w (V_w^T V_w + \varepsilon_n n I_D)^{-1} V_w^T).   (20)

Substituting this into our objective function, scaling by \varepsilon_n n, and removing the constant term y^T y resulting from the identity matrix gives a new approximate optimization problem. This modification reduces the complexity of each optimization step from O(n^2 d + n^3) to O(n^2 D + D^3 + n D d).

Choice of formulation. We remark that each of the three approximations beyond the initial relaxation may be independently used or omitted, allowing for a number of possible objectives and constraint sets. We explore some of these configurations in the experimental section below.

6 Experiments

In this section, we evaluate our approach (CCM) on both synthetic and real-world data sets. 
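As a concrete reference point for the method being evaluated, the penalized relaxation (18) of Section 5.2 can be optimized with projected gradient descent over the box [0, 1]^d; the sketch below uses finite-difference gradients for brevity, and all names and settings are illustrative rather than the authors' implementation:

```python
import numpy as np

def relaxed_objective(w, X, y, lam=0.01, eps=1e-3, sigma=1.0):
    """Penalized relaxation (18): y^T (G_{w.X} + n*eps*I)^{-1} y + lam * 1^T w."""
    n = X.shape[0]
    Xw = X * w                                    # Hadamard-weighted features
    sq = ((Xw[:, None, :] - Xw[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / (2.0 * sigma ** 2))          # Gaussian Gram matrix
    H = np.eye(n) - np.ones((n, n)) / n
    G = H @ K @ H                                 # centered Gram matrix
    yc = y - y.mean()
    return yc @ np.linalg.solve(G + n * eps * np.eye(n), yc) + lam * w.sum()

def ccm_select(X, y, m, iters=100, lr=0.5, h=1e-5):
    """Projected gradient descent on [0,1]^d with finite-difference
    gradients; returns the indices of the m largest relaxed weights."""
    n, d = X.shape
    w = np.full(d, m / d)                         # uniform initialization (Sec. 5.1)
    for _ in range(iters):
        grad = np.empty(d)
        for j in range(d):                        # finite-difference gradient
            e = np.zeros(d)
            e[j] = h
            grad[j] = (relaxed_objective(w + e, X, y)
                       - relaxed_objective(w - e, X, y)) / (2 * h)
        w = np.clip(w - lr * grad, 0.0, 1.0)      # project onto the box
    return set(np.argsort(-w)[:m])
```

With only box constraints, the projection step reduces to clipping; the term lam * w.sum() plays the role of the soft cardinality penalty \lambda_1 1^T w in (18).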
We compare with several strong existing algorithms, including recursive feature elimination (RFE) [15], Minimum Redundancy Maximum Relevance (mRMR) [21], BAHSIC [24, 25], and filter methods using mutual information (MI) and Pearson's correlation (PC). We use the authors' implementation for BAHSIC2 and use the Scikit-learn [20] and Scikit-feature [17] packages for the rest of the algorithms. The code for our approach is publicly available at https://github.com/Jianbo-Lab/CCM.

6.1 Synthetic data

We begin with experiments on the following synthetic data sets:

• Binary classification (Friedman et al. [7]). Given Y = -1, (X_1, ..., X_{10}) ~ N(0, I_{10}). Given Y = 1, X_1 through X_4 are standard normal conditioned on 9 \le \sum_{j=1}^4 X_j^2 \le 16, and (X_5, ..., X_{10}) ~ N(0, I_6).
• 3-dimensional XOR as 4-way classification. Consider the 8 corners of the 3-dimensional hypercube (v_1, v_2, v_3) \in {-1, 1}^3, and group them by the tuples (v_1 v_3, v_2 v_3), leaving 4 sets of vectors paired with their negations {v^{(i)}, -v^{(i)}}. Given a class i, a point is generated from the mixture distribution (1/2) N(v^{(i)}, 0.5 I_3) + (1/2) N(-v^{(i)}, 0.5 I_3). Each example additionally has 7 standard normal noise features for a total of 10 dimensions.
• Additive nonlinear regression: Y = -2 sin(2 X_1) + max(X_2, 0) + X_3 + exp(-X_4) + \varepsilon, where (X_1, ..., X_{10}) ~ N(0, I_{10}) and \varepsilon ~ N(0, 1).

2 http://www.cc.gatech.edu/~lsong/code.html

Figure 1: The above plots show the median rank (y-axis) of the true features as a function of sample size (x-axis) for the simulated data sets. Lower median ranks are better. The dotted line indicates the optimal median rank.

The first data set represents a standard nonlinear binary classification task. The second data set is a multi-class classification task where each feature is independent of Y by itself but a combination of three features has a joint effect on Y. 
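The 3-dimensional XOR construction above can be sketched as follows, interpreting 0.5 I_3 as the covariance of each mixture component; the function name and sample counts are illustrative:

```python
import numpy as np
from itertools import product

def make_xor_dataset(n_per_class=25, seed=0):
    """Generate the 3-dimensional XOR 4-way classification data set:
    cube corners grouped by (v1*v3, v2*v3), a Gaussian mixture over each
    corner pair {v, -v}, plus 7 standard-normal noise features."""
    rng = np.random.default_rng(seed)
    groups = {}                                  # (v1*v3, v2*v3) -> representative corner
    for v in product([-1.0, 1.0], repeat=3):
        key = (v[0] * v[2], v[1] * v[2])
        groups.setdefault(key, np.array(v))      # v and -v share the same key
    X, y = [], []
    for label, v in enumerate(groups.values()):
        for _ in range(n_per_class):
            center = v if rng.random() < 0.5 else -v      # mixture over {v, -v}
            signal = center + rng.normal(scale=np.sqrt(0.5), size=3)
            noise = rng.normal(size=7)                    # 7 irrelevant features
            X.append(np.concatenate([signal, noise]))
            y.append(label)
    return np.array(X), np.array(y)
```

Because v and -v always fall in the same class, each coordinate is marginally uninformative about the label; only the three signal coordinates jointly determine it, which is what defeats purely marginal filter methods.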
The third data set arises from an additive model for nonlinear regression.
Each data set has d = 10 dimensions in total, but only m = 3 or 4 true features. Since the identity of these features is known, we can evaluate the performance of a feature selection algorithm by computing the median rank it assigns to the real features, with lower median ranks indicating better performance. Given enough samples, we would expect this value to come close to the optimal lower bound of (m + 1)/2.
Our experimental setup is as follows. We generate 10 independent copies of each data set with sample sizes ranging from 10 to 100, and record the median ranks assigned to the true features by each algorithm. This process is repeated a total of 100 times, and the results are averaged across trials. For kernel-based methods, we use a Gaussian kernel k(x, \tilde{x}) = exp(-\|x - \tilde{x}\|^2 / (2 \sigma^2)) on X and a linear kernel k(y, \tilde{y}) = y^T \tilde{y} on Y. We take \sigma to be the median pairwise distance between samples scaled by 1/\sqrt{2}. Since the number of true features is known, we provide this as an input to algorithms that require it.
Our initial experiments use the basic version of our algorithm from Section 5.1. When the number of desired features m is fixed, only the regularization parameter \varepsilon needs to be chosen. We use \varepsilon = 0.001 for the classification tasks and \varepsilon = 0.1 for the regression task, selecting these values from {0.001, 0.01, 0.1} using cross-validation. Our results are shown in Figure 1.
On the binary and 4-way classification tasks, our method outperforms all other algorithms, succeeding in identifying the true features using fewer than 50 samples where others require close to 100 or even fail to converge. 
On the additive nonlinear model, several algorithms perform well, and our method is on par with the best of them across all sample sizes.
These experiments show that our algorithm is comparable to or better than several widely-used feature selection techniques on a selection of synthetic tasks, and is adept at capturing several kinds of nonlinear relationships between the covariates and the responses. When compared in particular to its closest relative BAHSIC, a backward-elimination algorithm based on the Hilbert-Schmidt independence criterion, we see that our algorithm often produces higher quality results with fewer samples, and even succeeds in the non-additive problem where BAHSIC fails to converge.
We also rerun these experiments separately for each of the first two approximations described in Section 5.2 above, selecting \lambda_1 from {0.001, 0.01, 0.1} and \lambda_2 from {1, 10, 100} using cross-validation. We find that comparable results can be attained with either approximate objective, but note that the algorithm is more robust to changes in \lambda_1 than \lambda_2.

6.2 Real-world data

In the previous section, we found that our method for feature selection excelled in identifying nonlinear relationships on a variety of synthetic data sets. We now turn our attention to a collection

Data set      Samples  Features  Classes
ALLAML        72       7,129     2
CLL-SUB-111   111      11,340    3
glass         214      10        6
ORL           400      1,024     40
orlraws10P    100      10,304    10
pixraw10P     100      10,000    10
TOX-171       171      5,784     4
vowel         990      10        11
warpAR10P     130      2,400     10
warpPIE10P    210      2,420     10
wine          178      13        3
Yale          165      1,024     15

Table 1: Summary of the benchmark data sets we use for our empirical evaluation.

Figure 2: The above plots show classification accuracy (y-axis) versus number of selected features (x-axis) for our real-world benchmark data sets. 
Higher accuracies are better.

of real-world tasks, studying the performance of our method and other nonlinear approaches when used in conjunction with a kernel SVM for downstream classification.
We carry out experiments on 12 standard benchmark tasks from the ASU feature selection website [17] and the UCI repository [18]. A summary of our data sets is provided in Table 1. The data sets are drawn from several domains including gene data, image data, and voice data, and span both the low-dimensional and high-dimensional regimes.
For every task, we run each algorithm being evaluated to obtain ranks for all features. Performance is then measured by training a kernel SVM on the top m features and computing the resulting accuracy as measured by 5-fold cross-validation. This is done for m \in {5, 10, ..., 100} if the total number of features d is larger than 100, or m \in {1, 2, ..., d} otherwise. In all cases we fix the regularization constant of the SVM to C = 1 and use a Gaussian kernel with \sigma set as in the previous section over the selected features. For our own algorithm, we fix \varepsilon = 0.001 across all experiments and set the number of desired features to m = 100 if d > 100 or \lceil d/5 \rceil otherwise. Our results are shown in Figure 2.
Compared with three other popular methods for nonlinear feature selection, i.e. mRMR, BAHSIC, and MI, we find that our method is the strongest performer in the large majority of cases, sometimes by a substantial margin as in the case of TOX-171. While our method is occasionally outperformed in the beginning when the number of selected features is small, it either ties or overtakes the leading method by the end in all but one instance. 
We remark that our method consistently improves upon\nthe performance of the related BAHSIC method, suggesting that our objective based on conditional\ncovariance may be more powerful than one based on the Hilbert-Schmidt independence criterion.\n\n8\n\n\f7 Conclusion\n\nIn this paper, we proposed an approach to feature selection based on minimizing the trace of the\nconditional covariance operator. The idea is to select the features that maximally account for the\ndependence of the response on the covariates. We do so by relaxing from an intractable discrete\nformulation of the problem to a continuous approximation suitable for gradient-based optimization.\nWe demonstrate the effectiveness of our approach on multiple synthetic and real-world experiments,\n\ufb01nding that it often outperforms other state-of-the-art approaches, including another competitive\nkernel feature selection method based on the Hilbert-Schmidt independence criterion.\n\nReferences\n[1] Genevera I Allen. Automatic feature selection via weighted kernels and regularization. Journal\n\nof Computational and Graphical Statistics, 22(2):284\u2013299, 2013.\n\n[2] Charles R Baker. Joint measures and cross-covariance operators. Transactions of the American\n\nMathematical Society, 186:273\u2013289, 1973.\n\n[3] Ver\u00f3nica Bol\u00f3n-Canedo, Noelia S\u00e1nchez-Maro\u00f1o, and Amparo Alonso-Betanzos. Recent\nadvances and emerging challenges of feature selection in the context of big data. Knowledge-\nBased Systems, 86:33\u201345, 2015.\n\n[4] Paul S Bradley and Olvi L Mangasarian. Feature selection via concave minimization and\nsupport vector machines. In Machine Learning Proceedings of the Fifteenth International\nConference(ICML \u201998, pages 82\u201390. Morgan Kaufmann, 1998.\n\n[5] Bin Cao, Dou Shen, Jian-Tao Sun, Qiang Yang, and Zheng Chen. Feature selection in a kernel\nspace. 
In Proceedings of the 24th international conference on Machine learning, pages 121\u2013128.\nACM, 2007.\n\n[6] John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Ef\ufb01cient projections onto\nthe l1-ball for learning in high dimensions. In Proceedings of the 25th International Conference\non Machine Learning, ICML \u201908, pages 272\u2013279, New York, NY, USA, 2008. ACM. ISBN\n978-1-60558-205-4. doi: 10.1145/1390156.1390191. URL http://doi.acm.org/10.1145/\n1390156.1390191.\n\n[7] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning,\n\nvolume 1. Springer series in statistics Springer, Berlin, 2001.\n\n[8] Kenji Fukumizu and Chenlei Leng. Gradient-based kernel method for feature extraction and\nvariable selection. In Advances in Neural Information Processing Systems, pages 2114\u20132122,\n2012.\n\n[9] Kenji Fukumizu, Francis R Bach, and Michael I Jordan. Dimensionality reduction for supervised\nlearning with reproducing kernel hilbert spaces. Journal of Machine Learning Research, 5(Jan):\n73\u201399, 2004.\n\n[10] Kenji Fukumizu, Francis R Bach, and Michael I Jordan. Kernel dimension reduction in\n\nregression. The Annals of Statistics, pages 1871\u20131905, 2009.\n\n[11] Ran Gilad-Bachrach, Amir Navot, and Naftali Tishby. Margin based feature selection-theory\nand algorithms. In Proceedings of the twenty-\ufb01rst international conference on Machine learning,\npage 43. ACM, 2004.\n\n[12] Yves Grandvalet and St\u00e9phane Canu. Adaptive scaling for feature selection in svms.\n\nIn\nProceedings of the 15th International Conference on Neural Information Processing Systems,\nNIPS\u201902, pages 569\u2013576, Cambridge, MA, USA, 2002. MIT Press. URL http://dl.acm.\norg/citation.cfm?id=2968618.2968689.\n\n[13] Arthur Gretton, Olivier Bousquet, Alex Smola, and Bernhard Sch\u00f6lkopf. Measuring statistical\ndependence with hilbert-schmidt norms. 
In International conference on algorithmic learning\ntheory, pages 63\u201377. Springer, 2005.\n\n9\n\n\f[14] Isabelle Guyon and Andr\u00e9 Elisseeff. An introduction to variable and feature selection. Journal\n\nof machine learning research, 3(Mar):1157\u20131182, 2003.\n\n[15] Isabelle Guyon, Jason Weston, Stephen Barnhill, and Vladimir Vapnik. Gene selection for\ncancer classi\ufb01cation using support vector machines. Machine learning, 46(1):389\u2013422, 2002.\n[16] P Jaganathan, N Rajkumar, and R Nagalakshmi. A kernel based feature selection method used in\nthe diagnosis of wisconsin breast cancer dataset. Advances in Computing and Communications,\npages 683\u2013690, 2011.\n\n[17] Jundong Li, Kewei Cheng, Suhang Wang, Fred Morstatter, Trevino Robert, Jiliang Tang, and\n\nHuan Liu. Feature selection: A data perspective. arXiv:1601.07996, 2016.\n\n[18] Moshe Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci.\n\nedu/ml.\n\n[19] Mahdokht Masaeli, Jennifer G Dy, and Glenn M Fung. From transformation-based dimension-\nality reduction to feature selection. In Proceedings of the 27th International Conference on\nMachine Learning (ICML-10), pages 751\u2013758, 2010.\n\n[20] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel,\nP. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher,\nM. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine\nLearning Research, 12:2825\u20132830, 2011.\n\n[21] Hanchuan Peng, Fuhui Long, and Chris Ding. Feature selection based on mutual information\ncriteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on pattern\nanalysis and machine intelligence, 27(8):1226\u20131238, 2005.\n\n[22] Ali Rahimi and Ben Recht. Random features for large-scale kernel machines. 
In Neural Information Processing Systems, 2007.

[23] Shaogang Ren, Shuai Huang, John Onofrey, Xenios Papademetris, and Xiaoning Qian. A scalable algorithm for structured kernel feature selection. In Guy Lebanon and S. V. N. Vishwanathan, editors, Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, volume 38 of Proceedings of Machine Learning Research, pages 781–789, San Diego, California, USA, 09–12 May 2015. PMLR.

[24] Le Song, Alex Smola, Arthur Gretton, Karsten M Borgwardt, and Justin Bedo. Supervised feature selection via dependence estimation. In Proceedings of the 24th International Conference on Machine Learning, pages 823–830. ACM, 2007.

[25] Le Song, Alex Smola, Arthur Gretton, Justin Bedo, and Karsten Borgwardt. Feature selection via dependence maximization. Journal of Machine Learning Research, 13(May):1393–1434, 2012.

[26] Shiquan Sun, Qinke Peng, and Adnan Shakoor. A kernel-based multivariate feature selection method for microarray data classification. PloS one, 9(7):e102541, 2014.

[27] Jason Weston, Sayan Mukherjee, Olivier Chapelle, Massimiliano Pontil, Tomaso Poggio, and Vladimir Vapnik. Feature selection for SVMs. In Proceedings of the 13th International Conference on Neural Information Processing Systems, pages 647–653. MIT Press, 2000.

[28] Jason Weston, André Elisseeff, Bernhard Schölkopf, and Mike Tipping. Use of the zero-norm with linear models and kernel methods. Journal of Machine Learning Research, 3(Mar):1439–1461, 2003.

[29] Makoto Yamada, Wittawat Jitkrittum, Leonid Sigal, Eric P Xing, and Masashi Sugiyama. High-dimensional feature selection by feature-wise kernelized lasso. 
Neural computation, 26(1):\n185\u2013207, 2014.\n\n10\n\n\f", "award": [], "sourceid": 3482, "authors": [{"given_name": "Jianbo", "family_name": "Chen", "institution": "University of California, Berkeley"}, {"given_name": "Mitchell", "family_name": "Stern", "institution": "UC Berkeley"}, {"given_name": "Martin", "family_name": "Wainwright", "institution": "UC Berkeley"}, {"given_name": "Michael", "family_name": "Jordan", "institution": "UC Berkeley"}]}