5 Reasons You Should Never Use PCA For Feature Selection

Fundamental limitations you need to be aware of before using Principal Components Analysis for feature selection.


Principal Component Analysis, or PCA, is one of the most consequential dimensionality reduction algorithms ever invented.

Unfortunately, like all popular tools, PCA is often used for unintended purposes, sometimes abusively. One such purpose is feature selection.

In this article, we give you 5 key reasons never to use PCA for feature selection. But first, let's briefly review the inner workings of PCA.

What Is PCA?

The Problem

Let us consider a vector of inputs \(x := (x_1, \dots, x_d) \in \mathbb{R}^d\), which we assume has mean 0 to simplify the argument (i.e. \(E(x)=0\)).

We are interested in reducing the dimension \(d\) of our vector, without losing too much information. Here, a proxy for the information content of \(x\) is its energy, defined as \[\mathcal{E}(x) := E(||x||^2).\]

The challenge is that the information content of \(x\) is usually unevenly spread out across its coordinates.

In particular, coordinates can be positively or negatively correlated, which makes it hard to gauge how much information we really lose when we remove a coordinate.

Let's take a concrete example. In the simplest bivariate case (\(d=2\)), because \(x\) has mean 0, the energy is simply

\(\mathcal{E}(x) = E(x_1^2) + E(x_2^2) = \text{Var}(x_1) + \text{Var}(x_2)\).

Let's assume that \(x_1\) has a higher variance than \(x_2\). Removing \(x_2\) reduces the energy by exactly \(\text{Var}(x_2)\), but the information genuinely lost is usually smaller. The part of \(x_2\) that can be recovered from \(x_1\) by linear regression has variance \(\rho(x_1, x_2)^2 \, \text{Var}(x_2)\), where \(\rho(x_1, x_2)\) is the correlation between the two coordinates; that part is redundant, and only the residual variance \[\left(1-\rho(x_1, x_2)^2\right)\text{Var}(x_2)\] is truly gone. In other words, the cost of dropping \(x_2\) does not just depend on \(x_2\); it also depends on its correlation with \(x_1\). When \(d > 2\), things get even more complicated: how much of a coordinate is redundant depends on its correlations with every coordinate we keep, and analyzing the true cost of removing any coordinate becomes a lot harder.
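As a quick numeric check of this redundancy argument, here is a minimal sketch with made-up numbers: the variance of the part of \(x_2\) that a linear regression on \(x_1\) cannot recover is indeed \(\left(1-\rho^2\right)\text{Var}(x_2)\).

```python
# Illustrative sketch with synthetic numbers: when x1 and x2 are correlated,
# only the residual (1 - rho^2) * Var(x2) of x2's energy is truly lost by dropping x2.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)             # Var(x2) = 1, rho(x1, x2) = 0.8

rho = np.corrcoef(x1, x2)[0, 1]
beta = np.cov(x1, x2)[0, 1] / np.cov(x1, x2)[0, 0]   # best linear predictor of x2 from x1
residual_var = np.var(x2 - beta * x1)                # what x1 cannot explain
print(residual_var, (1 - rho**2) * np.var(x2))       # both roughly 0.36
```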

The aim of PCA is to find a feature vector \(z := (z_1, \dots, z_d) \in \mathbb{R}^d\) obtained from \(x\) by a linear transformation, namely \(z = Wx,\) satisfying the following conditions:

  1. \(z\) has the same energy as \(x\): \(E(||x||^2) = E(||z||^2)\).
  2. \(z\) has decorrelated coordinates: \(\forall i \neq j, ~ \rho(z_i, z_j) = 0\).
  3. Coordinates of \(z\) have decreasing variances: \(\text{Var}(z_1) \geq \text{Var}(z_2) \geq \dots \geq \text{Var}(z_d)\).

When the 3 conditions above are met, we have \[\mathcal{E}(x) = \mathcal{E}(z)  =\sum_{i=1}^{d} \text{Var}(z_i).\]Thus, dimensionality reduction can be achieved by using features \(z ^{p} := (z_1, \dots, z_p)\) instead of the original features \(x := (x_1, \dots, x_d)\), where \(p < d\) is chosen so that the energy loss, namely \[\mathcal{E}(z)-\mathcal{E}(z^{p}) = \sum_{i=p+1}^{d}\text{Var}(z_i),\]is only a small fraction of the total energy \(\mathcal{E}(z)\):\[\frac{\sum_{i=p+1}^{d}\text{Var}(z_i)}{\sum_{i=1}^{d}\text{Var}(z_i)} \ll 1.\]
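In practice, \(p\) is usually chosen from the cumulative explained-variance ratios. Here is a minimal sketch on placeholder data, using scikit-learn's `PCA` as one common implementation:

```python
# Minimal sketch: pick the smallest p whose principal components retain >= 95% of the energy.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10)) @ rng.normal(size=(10, 10))  # placeholder correlated data

pca = PCA().fit(X)                                     # fits all d components
cum_energy = np.cumsum(pca.explained_variance_ratio_)  # cumulative share of the total energy
p = int(np.searchsorted(cum_energy, 0.95)) + 1         # smallest p with >= 95% of the energy
Z_p = PCA(n_components=p).fit_transform(X)             # the reduced features z^p
print(p, Z_p.shape)
```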

The Solution

The three conditions above pin down an essentially unique solution.

The conservation of energy condition requires \[E(||z||^2) = E\left( x^{T} W^{T}Wx\right) = E\left( x^{T} x\right)=E(||x||^2).\] A sufficient condition for this to hold is that \(W\) be an orthogonal matrix: \[W^{T}W = WW^{T} = I.\] In other words, its columns (resp. rows) form an orthonormal basis of \(\mathbb{R}^{d}\).

As for the second condition, it implies that the autocovariance matrix\[\text{Cov}(z) = WE(xx^T)W^T = W\text{Cov}(x)W^T \]should be diagonal.

Let us write \(\text{Cov}(x) = UDU^T\) for the eigenvalue decomposition (equivalently, the Singular Value Decomposition) of \(\text{Cov}(x)\), where the columns of the orthogonal matrix \(U\) are orthonormal eigenvectors of the (positive semidefinite) matrix \(\text{Cov}(x)\), sorted in decreasing order of their eigenvalues, and \(D\) is the diagonal matrix of the corresponding eigenvalues.

Plugging \(\text{Cov}(x) = UDU^T\) in the equation \(\text{Cov}(z) = W\text{Cov}(x)W^T\), we see that, to satisfy the second condition, it is sufficient that \(WU=I=U^{T}W^{T}\), which is equivalent to \(W=U^{-1} =U^{T}\).

Note that, because \(U\) is orthogonal, the choice \(W = U^{T}\) also satisfies the first condition.

Finally, given that the columns of \(U\) are sorted in decreasing order of their eigenvalues, the variances \(\text{Var}(z_i) = \text{Cov}(z)[i, i] = D[i, i]\) form a decreasing sequence, which satisfies the third condition.

Interestingly, it can be shown that any loading matrix \(W\) of a linear transformation that satisfies the three conditions above ought to be of the form \(W=U^T\) where columns of \(U\) are orthonormal eigenvectors of \(\text{Cov}(x)\) sorted in decreasing order of their eigenvalues.

Coordinates of \(z\) are called principal components, and the transformation \(x \to U^{T}x\) is the Principal Component Analysis.
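To make the construction concrete, here is a from-scratch sketch of the derivation above on placeholder data, assuming the rows of \(X\) are observations of the zero-mean vector \(x\):

```python
# From-scratch PCA: diagonalize Cov(x), sort eigenvectors by decreasing eigenvalue, project.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5)) @ rng.normal(size=(5, 5))  # placeholder correlated data
X = X - X.mean(axis=0)                                    # enforce the zero-mean assumption

cov = (X.T @ X) / (len(X) - 1)        # Cov(x)
eigvals, U = np.linalg.eigh(cov)      # Cov(x) = U D U^T; eigh returns eigenvalues in increasing order
order = np.argsort(eigvals)[::-1]     # re-sort in decreasing order
U, eigvals = U[:, order], eigvals[order]

Z = X @ U                             # principal components: z = U^T x, one row per observation
print(np.allclose(np.cov(Z, rowvar=False), np.diag(eigvals)))  # True: decorrelated, Var(z_i) = D[i, i]
```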

5 Reasons Not To Use PCA For Feature Selection

Now that we are on the same page about what PCA is, let me give you 5 reasons why it is not suitable for feature selection.

When PCA is used for feature selection, data scientists typically regard \(z^{p} := (z_1, \dots, z_p)\) as a feature vector that is more compact than the original input \(x\), yet richer for predicting a target \(y\).

Reason 1: Conservation of energy does not guarantee conservation of signal

The essence of PCA is to measure how lossy dimensionality reduction is by the amount of information content (energy, in this case) lost in the process. However, for feature selection, what we really want is to make sure that reducing dimensionality will not reduce predictive performance!

Unfortunately, maximizing the information content or energy of features \(z^p := (z_1, \dots, z_p)\) does not necessarily maximize their predictive power!

Think of the predictive power of \(z^p\) as the signal part of its overall energy or, equivalently, the fraction of its overall energy that is useful for predicting the target \(y\).

We may decompose an energy into signal and noise as \[ \mathcal{S}(z^p) + \mathcal{N}(z^p) = \mathcal{E}(z^p) \leq \mathcal{E}(x) =  \mathcal{S}(x) + \mathcal{N}(x),\] where \(\mathcal{N}(z^p) := E\left(||z^p - E(z^p \vert y)||^2\right)\) is the noise component (the energy that remains once the target is known), and \(\mathcal{S}(z^p) := E\left(||E(z^p \vert y)||^2\right)\) is the signal (the energy of the part of \(z^p\) explained by the target); the two add up to \(\mathcal{E}(z^p)\) by the law of total variance.

Clearly, while PCA ensures that \(\mathcal{E}(x) \approx \mathcal{E}(z^p)\), we may easily find ourselves in a situation where PCA has wiped out all the signal that was originally in \(x\) (i.e. \(\mathcal{S}(z^p) \approx 0\))! The lower the Signal-to-Noise Ratio (SNR) \(\frac{\mathcal{S}(x)}{\mathcal{N}(x)}\), the more likely this is to happen.

Fundamentally, for feature selection, what we want is conservation of signal, \(\mathcal{S}(x) \approx \mathcal{S}(z^p)\), not conservation of energy.

Note that if, instead of using the energy as the measure of information content, we used the entropy, the noise would be the conditional entropy \(h\left(z^p \vert y\right)\), and the signal would be the mutual information \(I(y; z^p)\).
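Here is a small synthetic sketch of this failure mode (the setup is entirely made up for illustration): all the signal sits in a low-variance coordinate, so the top principal component retains almost all of the energy and almost none of the signal.

```python
# Sketch: energy is conserved, signal is not.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n = 10_000
x_noise = 10.0 * rng.normal(size=n)    # high variance, unrelated to the target
x_signal = rng.normal(size=n)          # low variance, drives the target
y = x_signal + 0.1 * rng.normal(size=n)
X = np.column_stack([x_noise, x_signal])

pca = PCA(n_components=1).fit(X)
z1 = pca.transform(X)[:, 0]
print("share of energy kept:", pca.explained_variance_ratio_[0])  # ~0.99
print("corr(z1, y):        ", abs(np.corrcoef(z1, y)[0, 1]))      # ~0: the signal is gone
print("corr(x_signal, y):  ", np.corrcoef(x_signal, y)[0, 1])     # ~0.99
```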

Reason 2: Conservation of energy is antithetical to feature selection

Fundamentally, preserving the energy of the original feature vector conflicts with the objectives of feature selection.

Feature selection is most needed when the original feature vector \(x\) contains coordinates that are uninformative about the target \(y\), whether they are used by themselves, or in conjunction with other coordinates.

In such a case, removing the useless feature(s) is bound to reduce the energy of the feature vector. The more useless features there are, the more energy we will lose, and that's OK!

Let's take a concrete example in the bivariate case \(x=(x_1, x_2)\) to illustrate this. Let's assume \(x_2\) is uninformative about \(y\), and \(x_1\) is almost perfectly correlated with \(y\).

Saying that \(x_2\) is uninformative about the target \(y\) means that it ought to be independent of \(y\), both unconditionally (i.e. \(I(y; x_2)=0\)) and conditionally on \(x_1\) (i.e. \(I(y; x_2 \vert x_1) = 0\)).

This can occur, for instance, when \(x_2\) is completely random (i.e. independent of both \(y\) and \(x_1\)). In such a case, we absolutely need to remove \(x_2\), but doing so would inevitably reduce the energy by \(E(||x_2||^2)\).

Note that, when both \(x_1\) and \(x_2\) have been standardized, as is often the case before applying PCA, removing \(x_2\), which is the optimal thing to do from a feature selection standpoint, would result in 50% energy loss!

Even worse, in this example \(x_1\) and \(x_2\) happen to be principal components themselves (i.e. \(U=I\), since \(\text{Cov}(x) = I\) after standardization), associated with the exact same eigenvalue. Thus, PCA is unable to decide which one to keep, even though \(x_2\) is clearly useless and \(x_1\) is almost perfectly correlated with the target!
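The sketch below reproduces this situation on synthetic data (an illustration under the assumptions above, not a benchmark): after standardization, PCA assigns roughly half of the energy to each component and offers no guidance, even though only \(x_1\) matters.

```python
# Sketch of the bivariate example: x1 tracks the target, x2 is pure noise.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 10_000
y = rng.normal(size=n)
x1 = y + 0.05 * rng.normal(size=n)     # almost perfectly correlated with the target
x2 = rng.normal(size=n)                # independent of both y and x1
X = StandardScaler().fit_transform(np.column_stack([x1, x2]))

pca = PCA(n_components=2).fit(X)
print("explained variance ratios:", pca.explained_variance_ratio_)  # ~[0.5, 0.5]
print("corr(x1, y):", np.corrcoef(x1, y)[0, 1])                     # ~1
print("corr(x2, y):", np.corrcoef(x2, y)[0, 1])                     # ~0
```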

Reason 3: Decorrelation of features does not imply maximum complementarity

It is easy to think that, because two features are decorrelated, each must bring something new to the table. That is certainly true, but the 'new thing' a decorrelated feature brings is energy or information content, not necessarily signal!

Much of that new energy can be pure noise. In fact, a completely random feature is uncorrelated with every useful feature, yet it cannot possibly complement them for predicting the target \(y\); it is useless.
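A tiny synthetic sketch of this point (made-up data; linear regression is chosen purely for convenience): a pure-noise feature is essentially uncorrelated with a useful feature, yet adding it leaves out-of-sample performance unchanged.

```python
# Decorrelated does not mean complementary: the noise feature adds energy, not signal.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 5_000
x_useful = rng.normal(size=n)
x_noise = rng.normal(size=n)                     # independent of everything
y = 2.0 * x_useful + 0.1 * rng.normal(size=n)

print("corr(useful, noise):", np.corrcoef(x_useful, x_noise)[0, 1])  # ~0
r2_useful = cross_val_score(LinearRegression(), x_useful.reshape(-1, 1), y, cv=5).mean()
r2_both = cross_val_score(LinearRegression(), np.column_stack([x_useful, x_noise]), y, cv=5).mean()
print("R^2, useful only:   ", r2_useful)
print("R^2, useful + noise:", r2_both)           # essentially the same
```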

Reason 4: Learning patterns from principal components could be harder than from original features

When PCA is used for feature selection, new features are constructed.

In general, the primary goal of feature construction is to simplify the relationship between inputs and the target into one that the models in our toolbox can reliably learn.

By linearly combining previously constructed features, PCA creates new features that can be harder to interpret and that can have a more complex relationship with the target.

The questions you should be asking yourself before applying PCA are:

  • Does linearly combining my features make any sense?
  • Can I think of an explanation for why the linearly combined features could have as simple a relationship to the target as the original features?

If the answer to either question is no, then PCA features would likely be less useful than original features.

As an illustration, imagine we want to predict a person's income using, among other features, the GPS coordinates of their primary residence, their age, their number of children, and the number of hours they work per week.

While it is easy to see how a tree-based learner could exploit these features, linearly combining them would produce features that make little sense, and from which it would be much harder to learn anything meaningful using tree-based methods.
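The sketch below illustrates this on rough synthetic data (all names, numbers, and the income rule are invented): a target driven by an axis-aligned rule on age and weekly hours is easy for a shallow tree on the raw features, but becomes an oblique boundary once PCA mixes the features together, and the same tree typically scores noticeably lower.

```python
# Shallow decision tree on raw features vs. on PCA-rotated features.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 10_000
age = rng.uniform(20, 60, n)
hours = 20 + 0.5 * (age - 40) + rng.normal(0, 10, n)   # correlated with age
children = rng.poisson(1.5, n).astype(float)
lat, lon = rng.uniform(-90, 90, n), rng.uniform(-180, 180, n)
X = np.column_stack([age, hours, children, lat, lon])
y = ((age > 40) & (hours > 20)).astype(int)            # axis-aligned 'high income' rule

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
pca_tree = make_pipeline(StandardScaler(), PCA(n_components=5),
                         DecisionTreeClassifier(max_depth=3, random_state=0))

print("accuracy on raw features:", cross_val_score(tree, X, y, cv=5).mean())
print("accuracy on PCA features:", cross_val_score(pca_tree, X, y, cv=5).mean())
```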

Reason 5: Feature selection ought to be model-specific

Feature selection serves one primary goal: removing useless features from a set of candidates. As explained in this article, feature usefulness is a model-specific notion. A feature can very well be useful for one model, but not so much for another.

PCA, however, is model-agnostic. In fact, it does not even utilize any information about the target.

Conclusion

PCA is a great tool with many high-impact applications. Feature selection is just not one of them.