
Max-Planck-Institut für biologische Kybernetik, Arbeitsgruppe Bülthoff, Spemannstraße, Tübingen, Germany. Technical Report No. 44, December 1996.

Nonlinear Component Analysis as a Kernel Eigenvalue Problem

Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller

Abstract. We describe a new method for performing a nonlinear form of Principal Component Analysis. By the use of integral operator kernel functions, we can efficiently compute principal components in high-dimensional feature spaces, related to input space by some nonlinear map; for instance, the space of all possible 5-pixel products in 16×16 images. We give the derivation of the method, along with a discussion of other techniques which can be made nonlinear with the kernel approach, and present first experimental results on nonlinear feature extraction for pattern recognition.

AS and KRM are with GMD First (Forschungszentrum Informationstechnik), Rudower Chaussee 5, 12489 Berlin. AS and BS were supported by grants from the Studienstiftung des deutschen Volkes. BS thanks the GMD First for hospitality during two visits. AS and BS thank V. Vapnik for introducing them to kernel representations of dot products during joint work on Support Vector machines. This work profited from discussions with V. Blanz, L. Bottou, C. Burges, H. Bülthoff, K. Gegenfurtner, P. Haffner, N. Murata, P. Simard, S. Solla, V. Vapnik, and T. Vetter. We are grateful to V. Blanz, C. Burges, and S. Solla for reading a preliminary version of the manuscript. This document is available as /pub/mpi-memos/ via anonymous ftp or from the World Wide Web.

1 Introduction

Principal Component Analysis (PCA) is a powerful technique for extracting structure from possibly high-dimensional data sets. It is readily performed by solving an eigenvalue problem, or by using iterative algorithms which estimate principal components; for reviews of the existing literature, see Jolliffe (1986) and Diamantaras & Kung (1996).
PCA is an orthogonal transformation of the coordinate system in which we describe our data. The new coordinate values by which we represent our data are called principal components. It is often the case that a small number of principal components is sufficient to account for most of the structure in the data. These are sometimes called the factors or latent variables of the data.

The present work generalizes PCA to the case where we are not interested in principal components in input space, but rather in principal components of variables, or features, which are nonlinearly related to the input variables. Among these are, for instance, variables obtained by taking higher-order correlations between input variables. In the case of image analysis, this would amount to finding principal components in the space of products of input pixels. To this end, we use the method of expressing dot products in feature space in terms of kernel functions in input space. Given any algorithm which can be expressed solely in terms of dot products, i.e. without explicit usage of the variables themselves, this kernel method enables us to construct different nonlinear versions of it (Aizerman, Braverman, & Rozonoer, 1964; Boser, Guyon, & Vapnik, 1992). Even though this general fact was known (Burges, 1996), the machine learning community has made little use of it, the exception being Support Vector machines (Vapnik, 1995). In this paper, we give some examples of nonlinear methods constructed by this approach. For one example, the case of a nonlinear form of principal component analysis, we give details and experimental results (Sections 2 to 6); for some other cases, we briefly sketch the algorithms (Sec. 7).

In the next section, we will first review the standard PCA algorithm. In order to be able to generalize it to the nonlinear case, we shall then formulate it in a way which uses exclusively dot products. In Sec.
3, we shall discuss the kernel method for computing dot products in feature spaces. Together, these two sections form the basis for Sec. 4, which presents the proposed kernel-based algorithm for nonlinear PCA. Following that, Sec. 5 will discuss some differences between kernel-based PCA and other generalizations of PCA. In Sec. 6, we shall give some first experimental results on kernel-based feature extraction for pattern recognition. After a discussion of other applications of the kernel method (Sec. 7), we conclude with a discussion (Sec. 8). Finally, some technical material which is not essential for the main thread of the argument has been relegated to the appendix.

2 PCA in Feature Spaces

Given a set of M centered observations x_k, k = 1, ..., M, x_k ∈ R^N, Σ_{k=1}^M x_k = 0, PCA diagonalizes the covariance matrix

    C = (1/M) Σ_{j=1}^M x_j x_j^T.    (1)

To do this, one has to solve the eigenvalue equation

    λv = Cv    (2)

for eigenvalues λ ≥ 0 and v ∈ R^N \ {0}. As Cv = (1/M) Σ_{j=1}^M (x_j · v) x_j, all solutions v must lie in the span of x_1, ..., x_M; hence (2) is equivalent to

    λ(x_k · v) = (x_k · Cv)  for all k = 1, ..., M.    (3)

The remainder of this section is devoted to a straightforward translation to a nonlinear scenario, in order to prepare the ground for the method proposed in the present paper. We shall now describe this computation in another dot product space F, which is related to the input space by a possibly nonlinear map

    Φ: R^N → F,  x ↦ X.    (4)

Note that F, which we will refer to as the feature space, could have an arbitrarily large, possibly infinite, dimensionality. Here and in the following, upper case characters are used for elements of F, while lower case characters denote elements of R^N. Again, we make the assumption that we are dealing with centered data, i.e. Σ_{k=1}^M Φ(x_k) = 0; we shall return to this point later.
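As a point of reference before the nonlinear generalization, the standard PCA of Eqs. (1) to (3) can be sketched in a few lines of NumPy. This is our own illustration, not part of the report; the sample data and variable names are ours.

```python
import numpy as np

# Our own illustration of Eqs. (1)-(3): linear PCA as an eigenvalue
# problem of the covariance matrix.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))   # rows are the observations x_k, M = 20
X -= X.mean(axis=0)                # center the data: sum_k x_k = 0

M = X.shape[0]
C = X.T @ X / M                    # covariance matrix, Eq. (1)

# Solve lambda v = C v, Eq. (2); eigh returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(C)
v = eigvecs[:, -1]                 # eigenvector of the largest eigenvalue

components = X @ v                 # principal component scores of the data
```

Since C is symmetric positive semidefinite, all eigenvalues are nonnegative and `eigh` is the appropriate solver.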
Using the covariance matrix in F,

    C = (1/M) Σ_{j=1}^M Φ(x_j) Φ(x_j)^T    (5)

(more precisely, the covariance matrix is defined as the expectation of xx^T; for convenience, we shall use the same term to refer to the maximum likelihood estimate (5) of the covariance matrix from a finite sample; if F is infinite-dimensional, we think of Φ(x_j) Φ(x_j)^T as the linear operator which maps X ∈ F to Φ(x_j)(Φ(x_j) · X)), we now have to find eigenvalues λ and eigenvectors V ∈ F \ {0} satisfying

    λV = CV.    (6)

By the same argument as above, the solutions V lie in the span of Φ(x_1), ..., Φ(x_M). For us, this has two useful consequences: first, we can consider the equivalent equation

    λ(Φ(x_k) · V) = (Φ(x_k) · CV)  for all k = 1, ..., M,    (7)

and second, there exist coefficients α_i (i = 1, ..., M) such that

    V = Σ_{i=1}^M α_i Φ(x_i).    (8)

Combining (7) and (8), we get

    λ Σ_{i=1}^M α_i (Φ(x_k) · Φ(x_i)) = (1/M) Σ_{i=1}^M α_i Σ_{j=1}^M (Φ(x_k) · Φ(x_j))(Φ(x_j) · Φ(x_i))  for all k = 1, ..., M.    (9)

Defining an M × M matrix K by

    K_ij := (Φ(x_i) · Φ(x_j)),    (10)

this reads

    Mλ Kα = K²α,    (11)

where α denotes the column vector with entries α_1, ..., α_M. As K is symmetric, it has a set of eigenvectors which spans the whole space; thus

    Mλ α = Kα    (12)

gives us all solutions α of Eq. (11). Note that K is positive semidefinite, which can be seen by noticing that it equals

    (Φ(x_1), ..., Φ(x_M))^T (Φ(x_1), ..., Φ(x_M)),    (13)

which implies that for all α,

    (α · Kα) = ‖(Φ(x_1), ..., Φ(x_M)) α‖² ≥ 0.    (14)

Consequently, K's eigenvalues will be nonnegative, and will exactly give the solutions Mλ of Eq. (12). We therefore only need to diagonalize K. Let λ_1 ≤ λ_2 ≤ ... ≤ λ_M denote the eigenvalues, and α^1, ..., α^M the corresponding complete set of eigenvectors, with λ_p being the first nonzero eigenvalue.² We normalize α^p, ..., α^M by requiring that the corresponding vectors in F be normalized, i.e.
    (V^k · V^k) = 1  for all k = p, ..., M.    (15)

By virtue of (8) and (12), this translates into a normalization condition for α^p, ..., α^M:

    1 = Σ_{i,j=1}^M α_i^k α_j^k (Φ(x_i) · Φ(x_j)) = Σ_{i,j=1}^M α_i^k α_j^k K_ij = (α^k · Kα^k) = λ_k (α^k · α^k).    (16)

For the purpose of principal component extraction, we need to compute projections onto the eigenvectors V^k in F (k = p, ..., M). Let x be a test point, with an image Φ(x) in F; then

    (V^k · Φ(x)) = Σ_{i=1}^M α_i^k (Φ(x_i) · Φ(x))    (17)

may be called its nonlinear principal components corresponding to Φ.

In summary, the following steps were necessary to compute the principal components: first, compute the dot product matrix K defined by (10);³ second, compute its eigenvectors and normalize them in F; third, compute projections of a test point onto the eigenvectors by (17).

For the sake of simplicity, we have above made the assumption that the observations are centered. This is easy to achieve in input space, but more difficult in F, as we cannot explicitly compute the mean of the mapped observations in F. There is, however, a way to do it, and this leads to slightly modified equations for kernel-based PCA (see Appendix A).

² If we require that Φ should not map all observations to zero, then such a p will always exist.

³ Note that in our derivation we could have used the known result (e.g. Kirby & Sirovich, 1990) that PCA can be carried out on the dot product matrix (x_i · x_j)_ij instead of (1); however, for the sake of clarity and extendability (in Appendix A, we shall consider the case where the data must be centered in F), we gave a detailed derivation.

Before we proceed to the next section, which more closely investigates the role of the map Φ, the following observation is essential. The mapping Φ used in the matrix computation can be an arbitrary nonlinear map into the possibly high-dimensional space F, e.g. the space of all n-th order monomials in the entries of an input vector. In that case, we need to compute dot products of input vectors mapped by Φ, with a possibly prohibitive computational cost.
The solution to this problem, which will be described in the following section, builds on the fact that we exclusively need to compute dot products between mapped patterns (in (10) and (17)); we never need the mapped patterns explicitly.

3 Computing Dot Products in Feature Space

In order to compute dot products of the form (Φ(x) · Φ(y)), we use kernel representations of the form

    k(x, y) = (Φ(x) · Φ(y)),    (18)

which allow us to compute the value of the dot product in F without having to carry out the map Φ. This method was used by Boser, Guyon, & Vapnik (1992) to extend the "Generalized Portrait" hyperplane classifier of Vapnik & Chervonenkis (1974) to nonlinear Support Vector machines. To this end, they substitute a priori chosen kernel functions for all occurrences of dot products. This way, the powerful results of Vapnik & Chervonenkis (1974) for the Generalized Portrait carry over to the nonlinear case. Aizerman, Braverman, & Rozonoer (1964) call F the "linearization space", and use it in the context of the potential function classification method to express the dot product between elements of F in terms of elements of the input space.

If F is high-dimensional, we would like to be able to find a closed-form expression for k which can be efficiently computed. Aizerman et al. (1964) consider the possibility of choosing k a priori, without being directly concerned with the corresponding mapping Φ into F. A specific choice of k might then correspond to a dot product between patterns mapped with a suitable Φ. A particularly useful example, which is a direct generalization of a result proved by Poggio (1975, Lemma 2.1) in the context of polynomial approximation, is

    (x · y)^d = (C_d(x) · C_d(y)),    (19)

where C_d maps x to the vector C_d(x) whose entries are all possible d-th degree ordered products of the entries of x.
For instance (Vapnik, 1995), if x = (x_1, x_2), then C_2(x) = (x_1², x_2², x_1 x_2, x_2 x_1), or, yielding the same value of the dot product,

    c_2(x) = (x_1², x_2², √2 x_1 x_2).    (20)

For this example, it is easy to verify that

    ((x_1, x_2) · (y_1, y_2))² = ((x_1², x_2², √2 x_1 x_2) · (y_1², y_2², √2 y_1 y_2)) = (c_2(x) · c_2(y)).    (21)

In general, the function

    k(x, y) = (x · y)^d    (22)

corresponds to a dot product in the space of d-th order monomials of the input coordinates. If x represents an image with the entries being pixel values, we can thus easily work in the space spanned by products of any d pixels, provided that we are able to do our work solely in terms of dot products, without any explicit usage of a mapped pattern c_d(x). The latter lives in a possibly very high-dimensional space: even though we identify terms like x_1 x_2 and x_2 x_1 into one coordinate of F, as in (20), the dimensionality of F, the image of R^N under c_d, still is (N+d-1)! / (d! (N-1)!), and thus grows like N^d. For instance, 16×16 input images and a polynomial degree d = 5 yield a dimensionality of 10^10. Thus, using kernels of the form (22) is our only way to take into account higher-order statistics without a combinatorial explosion of time complexity.

The general question of which functions k correspond to a dot product in some space F has been discussed by Boser, Guyon, & Vapnik (1992) and Vapnik (1995): Mercer's theorem of functional analysis states that if k is a continuous kernel of a positive integral operator, we can construct a mapping into a space where k acts as a dot product (for details, see Appendix B).

The application of (18) to our problem is straightforward: we simply substitute an a priori chosen kernel function k(x, y) for all occurrences of (Φ(x) · Φ(y)). This was the reason why we had to formulate the problem in Sec. 2 in a way which only makes use of the values of dot products in F. The choice of k then implicitly determines the mapping Φ and the feature space F.
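The identity of Eq. (21) is easy to check numerically. The following is our own illustrative sketch (the helper `c2` implements the map of Eq. (20); the test vectors are ours):

```python
import numpy as np

# Our own numerical check of Eq. (21): for x, y in R^2, the polynomial
# kernel (x . y)^2 equals the dot product of the explicitly mapped vectors.
def c2(x):
    # feature map of Eq. (20): (x_1^2, x_2^2, sqrt(2) x_1 x_2)
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2.0) * x[0] * x[1]])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

lhs = np.dot(x, y) ** 2            # kernel evaluated in input space
rhs = np.dot(c2(x), c2(y))         # dot product in the feature space R^3
```

Here the dot product in the 3-dimensional feature space is obtained from a single dot product in the 2-dimensional input space.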
In Appendix B, we give some examples of kernels other than (22) which may be used.

4 Kernel PCA

4.1 The Algorithm

To perform kernel-based PCA (Fig. 1), from now on referred to as kernel PCA, the following steps have to be carried out: first, we compute the dot product matrix (cf. Eq. (10))

    K_ij = (k(x_i, x_j))_ij.    (23)

Next, we solve (12) by diagonalizing K, and normalize the eigenvector expansion coefficients α^n by requiring Eq. (16),

    1 = λ_n (α^n · α^n).

Figure 1: The basic idea of kernel PCA. In some high-dimensional feature space F (bottom right), we are performing linear PCA, just as a PCA in input space (top). Since F is nonlinearly related to input space (via Φ), the contour lines of constant projections onto the principal eigenvector (drawn as an arrow) become nonlinear in input space. Note that we cannot draw a pre-image of the eigenvector in input space, as it may not even exist. Crucial to kernel PCA is the fact that we do not actually perform the map into F, but instead perform all necessary computations by the use of a kernel function k in input space (here: R²).

To extract the principal components (corresponding to the kernel k) of a test point x, we then compute projections onto the eigenvectors by (cf. Eq. (17)),

    (kPC)_n(x) = (V^n · Φ(x)) = Σ_{i=1}^M α_i^n k(x_i, x).    (24)

If we use a kernel as described in Sec. 3, we know that this procedure exactly corresponds to standard PCA in some high-dimensional feature space, except that we do not need to perform expensive computations in that space.

4.2 Properties of (Kernel-)PCA

If we use a kernel which satisfies the conditions given in Sec. 3, we know that we are in fact doing a standard PCA in F. Consequently, all mathematical and statistical properties of PCA (see for instance Jolliffe, 1986) carry over to kernel-based PCA, with the modification that they become statements about a set of points Φ(x_i), i = 1, ..., M, in F rather than in R^N.
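The steps of Sec. 4.1 can be sketched as follows. This is our own minimal illustration, not the authors' code: the kernel choice, sample data, and variable names are ours, and the centering in F (Appendix A) is omitted for brevity.

```python
import numpy as np

# Our own sketch of the kernel PCA steps of Sec. 4.1.
def kernel(x, y):
    return np.dot(x, y) ** 2      # polynomial kernel, Eq. (22), d = 2

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 2))  # M = 30 observations in R^2

# Step 1: dot product matrix K_ij = k(x_i, x_j), Eq. (23).
K = np.array([[kernel(xi, xj) for xj in X] for xi in X])

# Step 2: diagonalize K (Eq. (12)); keep the nonzero eigenvalues and
# rescale each alpha^n so that lambda_n (alpha^n . alpha^n) = 1, Eq. (16).
eigvals, alphas = np.linalg.eigh(K)
nonzero = eigvals > 1e-8
eigvals, alphas = eigvals[nonzero], alphas[:, nonzero]
alphas = alphas / np.sqrt(eigvals)

# Only 3 nonzero eigenvalues appear here, since the degree-2 feature space
# for N = 2 is 3-dimensional; other kernels can give up to M (cf. Sec. 4.3).

# Step 3: nonlinear principal components of a test point, Eq. (24).
x_test = np.array([0.5, -0.5])
k_test = np.array([kernel(xi, x_test) for xi in X])
projections = alphas.T @ k_test
```

Note that no coordinate in the feature space is ever computed; every step goes through kernel evaluations on input-space vectors.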
In F, we can thus assert that PCA is the orthogonal basis transformation with the following properties (assuming that the eigenvectors are sorted in descending order of the eigenvalue size): the first q (q ∈ {1, ..., M}) principal components, i.e. projections on eigenvectors, carry more variance than any other q orthogonal directions; the mean squared approximation error in representing the observations by the first q principal components is minimal; the principal components are uncorrelated; the representation entropy is minimized; the first q principal components have maximal mutual information with respect to the inputs. For more details, see Diamantaras & Kung (1996).

To translate these properties of PCA in F into statements about the data in input space, they need to be investigated for specific choices of kernel. We shall not go into detail on that matter, but rather proceed in our discussion of kernel PCA.

4.3 Dimensionality Reduction and Feature Extraction

Unlike linear PCA, the proposed method allows the extraction of a number of principal components which can exceed the input dimensionality. Suppose that the number of observations M exceeds the input dimensionality N. Linear PCA, even when it is based on the M × M dot product matrix, can find at most N nonzero eigenvalues; they are identical to the nonzero eigenvalues of the N × N covariance matrix. In contrast, kernel PCA can find up to M nonzero eigenvalues⁴, a fact that illustrates that it is impossible to perform kernel PCA based on an N × N covariance matrix.

4.4 Computational Complexity

As mentioned in Sec. 3, a fifth-order polynomial kernel on a 256-dimensional input space yields a 10^10-dimensional feature space. It would seem that looking for principal components in this space should pose intractable computational problems. However, as we have explained above, this is not the case. First, as pointed out in Sec.
2, we do not need to look for eigenvectors in the full space F, but just in the subspace spanned by the images of our observations x_k in F. Second, we do not need to compute dot products explicitly between vectors in F, as we know that in our case this can be done directly in the input space, using kernel functions. The proposed kernel principal component analysis thus is computationally comparable to a linear PCA on ℓ observations with an ℓ × ℓ dot product matrix. The overall computational complexity is not changed by the following additional cost: we need to evaluate kernel functions rather than just dot products. If k is easy to compute, as for polynomial kernels, e.g., this is negligible.

⁴ If we use one kernel; of course, we could extract features with several kernels, to get even more.

Furthermore, in the case where we need to use a large number ℓ of observations, we may want to work with an algorithm for computing only the largest eigenvalues, as for instance the power method with deflation (for a discussion, see Diamantaras & Kung, 1996). In addition, we can consider using an estimate of the dot product matrix computed from a subset of M < ℓ examples, while still extracting principal components from all ℓ examples (this approach was chosen in some of our experiments described below).

The situation is different for principal component extraction. There, we have to evaluate the kernel function ℓ times for each extracted principal component (24), rather than just evaluating one dot product as for a linear PCA. In some cases, e.g. if we were to extract principal components as a preprocessing step for classification, this is a disadvantage. However, in that case, we can speed up the extraction by a method analogous to a technique proposed by Burges (1996) in the context of Support Vector machines. Specifically, we can try to approximate each eigenvector V = Σ_{i=1}^ℓ α_i Φ(x_i) (Eq.
(8)) by another vector

    Ṽ = Σ_{j=1}^m β_j Φ(z_j),    (25)

where m < ℓ is chosen a priori according to the desired speed-up, and z_j ∈ R^N, j = 1, ..., m. This is done by minimizing the squared difference

    ρ = ‖V - Ṽ‖².    (26)

The crucial point is that this also can be done without explicitly dealing with the possibly high-dimensional space F. As

    ρ = ‖V‖² + ‖Ṽ‖² - 2 Σ_{i=1}^ℓ Σ_{j=1}^m α_i β_j k(x_i, z_j),    (27)

the gradient of ρ with respect to the β_j and the z_j is readily expressed in terms of the kernel function; thus ρ can be minimized by standard gradient methods. In the case k(x, y) = (x · y)², it is possible to get an exact expansion with at most N vectors z_j (N being the dimension of the input space) by solving an eigenvalue problem (Burges, 1996). For the task of handwritten character recognition, this technique led to a speed-up by a factor of 5 at almost no loss in accuracy, yielding a state-of-the-art classifier (Burges & Schölkopf, 1996).

Finally, we add that although kernel principal component extraction is computationally more expensive than its linear counterpart, this additional investment can pay back afterwards. In experiments on classification based on the extracted principal components, we found that in the nonlinear case, it was
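The claim that ρ in Eq. (27) requires only kernel evaluations can be checked numerically for the degree-2 kernel, whose feature map c_2 (Eq. (20)) is known explicitly. The following is our own illustrative sketch; all data and names are ours.

```python
import numpy as np

# Our own check: rho computed from kernel evaluations alone, Eq. (27),
# agrees with rho = ||V - V~||^2 computed explicitly in F, Eq. (26).
def kernel(x, y):
    return np.dot(x, y) ** 2

def c2(x):
    # explicit feature map for the degree-2 kernel, Eq. (20)
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2.0) * x[0] * x[1]])

rng = np.random.default_rng(3)
X = rng.standard_normal((5, 2))    # the x_i defining V
alpha = rng.standard_normal(5)
Z = rng.standard_normal((2, 2))    # the z_j defining the approximation V~
beta = rng.standard_normal(2)

# rho via kernel evaluations only: ||V||^2 + ||V~||^2 - 2 (V . V~)
VV = sum(alpha[i] * alpha[j] * kernel(X[i], X[j])
         for i in range(5) for j in range(5))
WW = sum(beta[i] * beta[j] * kernel(Z[i], Z[j])
         for i in range(2) for j in range(2))
VW = sum(alpha[i] * beta[j] * kernel(X[i], Z[j])
         for i in range(5) for j in range(2))
rho_kernel = VV + WW - 2.0 * VW

# rho computed explicitly in the feature space F = R^3
V = sum(alpha[i] * c2(X[i]) for i in range(5))
V_tilde = sum(beta[j] * c2(Z[j]) for j in range(2))
rho_explicit = float(np.dot(V - V_tilde, V - V_tilde))
```

Since ρ and its gradient involve only such kernel sums, the minimization over β_j and z_j never touches coordinates of F.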