How To Fix PCA To Make It Work For Feature Selection

Introducing Principal Feature Selection (With Code)

In case you haven't read our previous blog post: you should never use PCA for feature selection.

In this blog post, we propose a drop-in replacement, which we refer to as Principal Feature Selection (PFS), that addresses the most critical limitation of using PCA for feature selection, namely the fact that conservation of energy does not guarantee conservation of signal.

In a nutshell, while PCA optimizes for conservation of energy (or \(L^2(\mathbb{P})\) norm), PFS optimizes for conservation of signal, which is what is really needed for feature selection.

But first, here's a gentle reminder of why PCA isn't suitable for feature selection.

PCA Is Not For Feature Selection

We recall that the aim of PCA is to find a feature vector \(z := (z_1, \dots, z_d) \in \mathbb{R}^d\) obtained from an original mean-zero input vector \(x\) by a linear transformation, namely \(z = Wx,\) satisfying the following conditions:

  1. \(z\) has the same energy as \(x\): \(E(||x||^2) = E(||z||^2)\).
  2. \(z\) has decorrelated coordinates: \(\forall i \neq j, ~ \rho(z_i, z_j) = 0\).
  3. Coordinates of \(z\) have decreasing variances: \(\text{Var}(z_1) \geq \text{Var}(z_2) \geq \dots \geq \text{Var}(z_d)\).

When used for feature selection, data scientists typically select \(p < d\) such that the loss of energy (or loss of information content) \(\sum_{i=p+1}^{d} \text{Var}(z_i)\) is negligible, and they regard \(z^{p} := (z_1, \dots, z_p)\) as a lower-dimensional yet richer representation of the original input \(x\) for predicting a target \(y\).
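To make this workflow concrete, here is a minimal numpy sketch of PCA-based feature selection as described above; it is an illustration only, and the 95% retained-energy threshold is an arbitrary choice, not a recommendation.

import numpy as np

# Toy data: n samples of a mean-zero, correlated d-dimensional input vector x.
np.random.seed(0)
n, d = 10000, 10
x = np.random.randn(n, d) @ np.random.randn(d, d)
x = x - x.mean(axis=0)

# PCA: eigendecomposition of the sample covariance matrix, sorted by decreasing variance.
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(x, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
eigenvalues, W = eigenvalues[order], eigenvectors[:, order].T

# Principal components z = Wx.
z = x @ W.T

# Typical 'feature selection': keep the smallest p retaining, say, 95% of the energy.
energy_fraction = np.cumsum(eigenvalues) / np.sum(eigenvalues)
p = int(np.searchsorted(energy_fraction, 0.95)) + 1
z_p = z[:, :p]
print('Kept %d of %d principal components' % (p, d))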

In so doing, data scientists utilize PCA as a tool for both feature construction and feature selection. Unfortunately, using PCA for either purpose is problematic.

Feature Construction: What is often considered the appeal of PCA for feature construction is the fact that it generates features that are decorrelated.

While linearly dependent features are clearly an issue both theoretically and in practice, decorrelated features are not necessarily desirable for predictive modeling. What we want is to construct features that complement each other the most for predicting the target \(y\).

Feature Selection: When reducing the number of features from \(d\) to \(p\) using PCA, the emphasis is solely placed on ensuring that the \(p\) principal components capture much of the information content or energy of the original input vector \(x\).

However, capturing much of the energy of \(x\) does not guarantee capturing much of the signal that is in \(x\) to help us predict \(y\), which is what feature selection ought to optimize for!

Principal Feature Selection

Like PCA, Principal Feature Selection (PFS) learns a linear transformation of the original input vector \(x \in \mathbb{R}^d\), namely \(z = Fx\), where \(F \in \mathbb{R}^{p \times d}\) is a matrix that has \(p \leq d\) normal rows.

However, PFS trades PCA's feature decorrelation requirement for maximum feature complementarity, and its energy conservation requirement for signal conservation.

PFS is best understood by taking a close look at what PCA does sequentially.

First Principal Feature

In PCA, the first principal component is obtained by looking for the direction in which the projection of the original input vector \(x\) has the highest energy.

Explanation

Indeed, if we denote by \(w\) a normal vector (i.e. \(||w||=1\)), then the projection of \(x\) on \(w\) reads \[P_{w}(x) = (x^Tw)\,w\] and its energy reads \[\begin{align}\mathcal{E}(P_{w}(x)) :&= E(||P_{w}(x)||^{2}) \\ &= E\left( w^{T} xx^{T}w \right) \\&=w^{T} \text{Cov}(x) w,\end{align}\]and we recover the fact that any direction \(w\) maximizing \(\mathcal{E}(P_{w}(x))\) must be an eigenvector associated with the largest eigenvalue of \(\text{Cov}(x)\) (see this to understand why), which happens to yield what we already know to be the first principal component!
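This identity is easy to check numerically. The sketch below (numpy only, illustrative) verifies that the sample energy of the projection matches \(w^{T} \text{Cov}(x) w\) for a random unit-norm direction, and that the top eigenvector of \(\text{Cov}(x)\) attains the largest possible energy, namely the largest eigenvalue.

import numpy as np

# Toy mean-zero data with correlated coordinates.
np.random.seed(0)
n, d = 100000, 5
x = np.random.randn(n, d) @ np.random.randn(d, d)
x = x - x.mean(axis=0)
cov = np.cov(x, rowvar=False)

def projection_energy(w):
    """Sample estimate of E(||P_w(x)||^2) for a unit-norm direction w."""
    return np.mean((x @ w)**2)

# The energy of the projection equals the quadratic form w^T Cov(x) w.
w = np.random.randn(d)
w /= np.linalg.norm(w)
print(projection_energy(w), w @ cov @ w)          # approximately equal

# The top eigenvector attains the largest energy, i.e. the largest eigenvalue.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
w_top = eigenvectors[:, -1]
print(projection_energy(w_top), eigenvalues[-1])  # approximately equal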

Instead, in PFS, the first feature, which we call the first principal feature, is obtained by looking for the direction in which the projection of the original input vector \(x\) contains the largest share of the signal that is in \(x\) (in relation to predicting \(y\)).

Formally, the first row \(f_1\) of \(F\) is obtained by maximizing the mutual information between the target \(y\) and \(x^T f\): \[f_1 = \underset{f}{\text{argmax}} ~I(y; x^Tf) ~~ \text{s.t.} ~~ ||f|| = 1,\]and the corresponding first principal feature is \[ z_1 = x^Tf_1.\]

The use of mutual information as a measure of signal or feature potential stems from the fact that the highest performance achievable when using a feature vector \(z\) to predict a target \(y\) is an increasing function of the mutual information \(I(y; z)\) for a variety of commonly used performance metrics. See this paper for more details.
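To give a feel for what this optimization looks like in practice, here is a deliberately crude sketch. It is not the kxy implementation: it assumes scikit-learn's mutual_info_regression as a rough mutual information estimator, and replaces the actual optimization with a naive random search over unit-norm directions.

import numpy as np
from sklearn.feature_selection import mutual_info_regression

# Toy data: y depends on x only through the single index x^T w.
np.random.seed(0)
n, d = 2000, 5
x = np.random.randn(n, d)
w = np.ones(d) / np.sqrt(d)
y = (x @ w) + 2.0 * (x @ w)**2

def estimated_mi(f):
    """Rough estimate of the mutual information I(y; x^T f)."""
    return mutual_info_regression((x @ f).reshape(-1, 1), y, random_state=0)[0]

# Naive random search over unit-norm directions (illustration only).
candidates = np.random.randn(200, d)
candidates /= np.linalg.norm(candidates, axis=1, keepdims=True)
f_1 = candidates[int(np.argmax([estimated_mi(f) for f in candidates]))]
z_1 = x @ f_1  # first principal feature, up to estimation and search error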

Subsequent Principal Features

More generally, in PCA, the (\(i+1\))-th feature is obtained by looking for the direction that is orthogonal to the \(i\) previous principal directions, and in which the projection of the original input has the highest energy.

Explanation

Indeed, if \(w\) is orthogonal to all directions \((w_1, \dots, w_i)\) (i.e. \(w \perp \text{span}(w_1, \dots, w_i)\)), then the projection of \(x\) on \(w\) is the same as the projection on \(w\) of what is left of \(x\) after projecting it on \(\text{span}(w_1, \dots, w_i)\), namely \[\bar{x}_i := x - W_i^T W_ix, \] where \(W_i\) is the \(i \times d\) matrix whose \(i\) rows are the orthonormal eigenvectors of \(\text{Cov}(x)\) corresponding to the \(i\) largest eigenvalues.

We know from the previous explanation that the optimal direction ought to be an eigenvector of \[\text{Cov}(\bar{x}_i) = [I- W_i^T W_i] \text{Cov}(x) [I- W_i^T W_i]\]associated to the largest eigenvalue.

Clearly, all eigenvectors of \(\text{Cov}(x)\) are also eigenvectors of \(\text{Cov}(\bar{x}_i)\). The top-\(i\) eigenvectors of \(\text{Cov}(x)\) have eigenvalue \(0\) relative to \(\text{Cov}(\bar{x}_i)\), and all other eigenvectors of \(\text{Cov}(x)\) have the same eigenvalue relative to \(\text{Cov}(\bar{x}_i)\).

Thus, \(w_{i+1}\), the eigenvector of \(\text{Cov}(x)\) corresponding to the (\(i+1\))-th largest eigenvalue, is the eigenvector of \(\text{Cov}(\bar{x}_i)\) corresponding to the largest eigenvalue.

We recover what we already knew to be the (\(i+1\))-th principal direction.
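A quick numerical sanity check of this argument (numpy only, illustrative): after deflating \(x\) by its top \(i\) principal directions, the leading eigenvector of \(\text{Cov}(\bar{x}_i)\) coincides, up to sign, with the (\(i+1\))-th eigenvector of \(\text{Cov}(x)\).

import numpy as np

np.random.seed(0)
n, d, i = 100000, 6, 2
x = np.random.randn(n, d) @ np.random.randn(d, d)
x = x - x.mean(axis=0)
cov = np.cov(x, rowvar=False)

# Eigenvectors of Cov(x), sorted by decreasing eigenvalue.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
V = eigenvectors[:, np.argsort(eigenvalues)[::-1]]

# Deflate x by its top-i principal directions: x_bar = x - W_i^T W_i x.
W_i = V[:, :i].T
x_bar = x - x @ W_i.T @ W_i
cov_bar = np.cov(x_bar, rowvar=False)

# Top eigenvector of Cov(x_bar) vs. (i+1)-th eigenvector of Cov(x), up to sign.
lam_bar, vec_bar = np.linalg.eigh(cov_bar)
w_top_bar = vec_bar[:, -1]
w_next = V[:, i]
print(min(np.linalg.norm(w_top_bar - w_next), np.linalg.norm(w_top_bar + w_next)))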

PCA requires the (\(i+1\))-th principal direction to be orthogonal to all \(i\) previous principal directions in order to ensure that all principal features are decorrelated. However, decorrelated features are not necessarily complementary as far as predicting \(y\) goes.

PFS addresses this limitation: it learns the (\(i+1\))-th principal feature by looking for the direction in which the projection of the original input vector \(x\) complements the \(i\) previously learned principal features the most.

Formally, the (\(i+1\))-th row \(f_{i+1}\) of \(F\) is obtained by maximizing the conditional mutual information between the target \(y\) and \(x^T f\) given all previously learned principal features: \[f_{i+1} = \underset{f}{\text{argmax}} ~I\left(y; x^Tf \vert x^{T}f_1, \dots, x^{T} f_i\right) ~~ \text{s.t.} ~~ ||f|| = 1,\]and the corresponding principal feature is \[z_{i+1} = x^Tf_{i+1}.\]

Recalling that \[I(y;x |z) := I(y; x, z)-I(y; z),\]it follows that maximizing \(I\left(y; x^Tf \vert x^{T}f_1, \dots, x^{T} f_i\right)\) (as a function of \(f\)) is equivalent to maximizing \(I\left(y; x^Tf, x^{T}f_1, \dots, x^{T} f_i\right)\) (as a function of \(f\)), since the subtracted term \(I\left(y; x^{T}f_1, \dots, x^{T} f_i\right)\) does not depend on \(f\).

Stopping Criterion

PFS naturally stops when the last conditional mutual information term evaluated is negligible \[I\left(y; x^Tf_{i+1} \vert x^{T}f_1, \dots, x^{T} f_i\right) \approx 0,\]in which case the corresponding principal feature \(z_{i+1} = x^{T}f_{i+1}\) is discarded.
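For concreteness, here is a rough sketch of the overall greedy procedure with this stopping rule. It is not the kxy implementation: the conditional mutual information estimator is left abstract (a user-supplied callable), the search over unit-norm directions is a naive random search, and the tolerance is an arbitrary illustrative choice.

import numpy as np

def pfs_sketch(x, y, conditional_mi, n_candidates=1000, tol=1e-3, seed=0):
    """Greedy Principal Feature Selection (illustrative sketch only).

    conditional_mi(y, z_new, z_existing) is assumed to return an estimate of
    I(y; z_new | z_existing), with z_existing equal to None for the first feature.
    Building a reliable estimator is the hard part and is left abstract here.
    """
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    directions, features = [], None
    for _ in range(d):
        # Naive random search over unit-norm candidate directions.
        candidates = rng.standard_normal((n_candidates, d))
        candidates /= np.linalg.norm(candidates, axis=1, keepdims=True)
        scores = [conditional_mi(y, x @ f, features) for f in candidates]
        best = int(np.argmax(scores))
        # Stopping criterion: the leftover conditional mutual information is negligible.
        if scores[best] <= tol:
            break
        directions.append(candidates[best])
        new_feature = (x @ candidates[best]).reshape(-1, 1)
        features = new_feature if features is None else np.column_stack([features, new_feature])
    return np.array(directions), features

Any conditional mutual information estimator you trust can be passed as conditional_mi; the quality of the selected directions is only as good as that estimator and the direction search.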

Properties

Maximum Complementarity

Note that, by construction, Principal Feature Selection cannot generate linearly dependent features.

Indeed, if the (\(i+1\))-th principal feature \(z_{i+1} = x^T f_{i+1}\) is linearly related to the first \(i\) principal features, then \(I\left(y; z_{i+1} \vert z_1, \dots, z_i\right)=0\), in which case PFS would stop (by construction), and \(z_{i+1}\) would be discarded.

More generally, by construction, the (\(i+1\))-th principal feature is always the (linear) feature that, when used in conjunction with existing principal features to predict the target \(y\), would yield the highest achievable performance.

Said differently, whether \(f_{i+1}\) is orthogonal to previous principal directions \((f_1, \dots, f_i)\) or not, we are guaranteed that no other feature of the form \(g(x^Tf)\), for any function \(g\), may complement existing principal features \((z_{1}, \dots, z_{i})\) more than \(z_{i+1}\) when it comes to predicting \(y\)! Indeed, by the data processing inequality, \(I\left(y; g(x^Tf) \vert z_1, \dots, z_i\right) \leq I\left(y; x^Tf \vert z_1, \dots, z_i\right) \leq I\left(y; z_{i+1} \vert z_1, \dots, z_i\right)\).

Dimensionality Reduction

A direct consequence of maximum complementarity is that PFS is bound to learn at most \(d\) features, as there can be at most \(d\) linearly independent vectors in \(\mathbb{R}^d\).

In practice, because each principal feature squeezes out as much of the leftover signal as it possibly can, PFS should be expected in most cases to considerably reduce the number of features needed.

Signal Conservation

If PFS ever happens to stop at \(p=d\), in other words, if \(I(y; z_1) > 0\) and no conditional mutual information term \(I\left(y; z_{i+1} \vert z_i, \dots, z_1\right)\) is negligible for \(i\) ranging from \(1\) to \((d-1)\), then \((z_1, \dots, z_d)\) are guaranteed to contain all the signal that is in \(x\) to predict \(y\): \[I(y; z_1, \dots, z_d) = I(y; Fx) = I(y; x),\]which is the least we should expect.

Explanation

Indeed, as previously discussed, the matrix of principal directions \(F\) is full rank. 

Thus, the feature transformation \(x \to z:=Fx\) is a 1-to-1 map, and mutual information is invariant under 1-to-1 maps.

When \(p < d\) or, equivalently, \(I(y; x^Tf \vert z_1, \dots, z_p) \approx 0\) for any \(f \in \mathbb{R}^{d}\), we may safely conclude that no feature of the form \(g(x^Tf)\), for any function \(g\), contains insights that are not already in \((z_1, \dots, z_p)\).

In particular, for each \(i\), \(I(y; x_i \vert z_1, \dots, z_p) \approx 0\), which implies that each original feature is always redundant relative to the \(p\) principal features learned by PFS.

While this is a strong indication that the \(p\) learned principal features \(z\) contain much of the signal that is in the \(d > p\) original features \(x\) to predict \(y\), we can't, however, conclude that they contain all the signal.

Explanation

Indeed, we may very well have \(I(y; x^Tu \vert z_1, \dots, z_p) = 0\) for all vectors \(u \in \mathbb{R}^{d}\), but \(I(y; x^Tu, x^Tv \vert z_1, \dots, z_p) > 0\) for some vectors \(u, v \in \mathbb{R}^d\). 

All it would take is to have either \(I(y; x^Tu \vert z_1, \dots, z_p, x^Tv) > 0\) or \(I(y; x^Tv \vert z_1, \dots, z_p, x^Tu) > 0\), which is not incompatible with \(I(y; x^Tu \vert z_1, \dots, z_p) = 0\).

What we can conclude, however, is that if we want to complement principal features \(z\) with a feature that is a function of a projection of the original features \(x\), then we need to project \(x\) onto a two- or higher-dimensional space, not just a line.

In other words, all features of the form \(g(x^Tu)\) are redundant given principal features \(z\), but some features of the form \(g(Gx)\), where \(G\) is a matrix with more than one row, might not be redundant given \(z\).

If no such feature would be beneficial, i.e. \[\underset{G \in \mathbb{R}^{(d-p) \times d}}{\sup} I\left(y; Gx \vert z_1, \dots, z_p\right) = 0,\]then we can safely conclude that the \(p\) learned principal features \(z\) contain all the signal in \(x\) to predict \(y\), no matter how small \(p\) is!

Proof

Add a \((d-p) \times d \) matrix block \(G\) to \(F\) to turn it into a \(d \times d\) full rank matrix \(H\). Note that \(H\) being full rank, \(I(y; Hx) = I(y; x)\).

Also note that we have \[I(y; Hx)=I\left(y; [Fx, Gx]\right) = I(y; Gx \vert Fx) + I(y; Fx).\]

Given that \(I(y; Gx \vert Fx)=0\), it follows that \(I(y; x) = I(y; Hx) = I(y; Fx)\).

When Not To Use PFS?

Like PCA, Principal Feature Selection constructs new features by linearly combining existing features \(x\).

Linearly combining existing features might result in features that make little sense (e.g. think about a linear combination of GPS coordinates, age, and one's number of children to predict one's income), and that will be harder for models to learn from.

Let me stress that, while PFS maximizes the signal or 'juice' in principal features (for predicting \(y\)), it does not necessarily make the said signal or 'juice' easier to extract using models in your toolbox!

A scenario in which you may confidently use PFS is when all original features (i.e. coordinates of \(x\)) are of the same nature, and linearly combining them would produce another feature that is of the same nature.

This is for instance the case in finance when \(x\) is a vector of asset returns. Here, \(x^Tf\) can be regarded as the return of a (possibly long-short) portfolio containing the assets whose returns are the coordinates of \(x\).
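As a toy illustration of this interpretation (hypothetical numbers):

import numpy as np

# Hypothetical one-period returns of three assets.
asset_returns = np.array([0.02, -0.01, 0.005])

# A unit-norm direction f plays the role of (possibly long-short) portfolio weights,
# so the projection x^T f is itself a return, i.e. a feature of the same nature as x.
f = np.array([0.5, -0.3, 0.2])
f = f / np.linalg.norm(f)
print('Portfolio return: %.4f' % (asset_returns @ f))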

In general, if linearly combining your original features does not produce features that make sense, then you should probably not consider using PFS, and certainly not PCA.

Python Implementation

We have implemented Principal Feature Selection in the kxy Python package; simply run pip install kxy -U.

As an illustration, we consider running PFS on a regression problem where the target \(y\) is generated as \[y = x^Tw + 2(x^{T}w)^2 + 0.5(x^{T}w)^3,\]where \(x \in \mathbb{R}^{100}\) is a vector of independent standard normal variables, and \(w = (1/100, \dots, 1/100)\).

import numpy as np
from kxy.pfs import PFS

# Generate the data
d = 100
w = np.ones(d)/d
x = np.random.randn(10000, d)
xTw = np.dot(x, w)
y = xTw + 2.*xTw**2 + 0.5*xTw**3

# Run PFS
selector = PFS()
selector.fit(x, y)

# Learned principal directions
F = selector.feature_directions

# Learned features
z = np.dot(x, F.T)

# Accuracy
true_f_1 = w/np.linalg.norm(w)
learned_f_1 = F[0, :]
e = np.linalg.norm(true_f_1-learned_f_1)
print('PFS Error Norm: %.4f' % e)

The code outputs PFS Error Norm: 0.0552, meaning that our implementation made an error of only about 5% in recovering the first principal direction.
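From there, the learned principal features can be fed to any downstream model. For instance, continuing with the z and y defined above and assuming scikit-learn is installed, one might do something along these lines:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Train a downstream model on the learned principal features z = xF^T.
z_train, z_test, y_train, y_test = train_test_split(z, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(z_train, y_train)
print('Out-of-sample R^2: %.4f' % r2_score(y_test, model.predict(z_test)))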

Let's compare this to the first principal component of PCA.

from kxy.pfs import PCA
# Run PCA
pca_selector = PCA(energy_loss_frac=0.05)
pca_selector.fit(x, None)

# Learned principal directions
W = pca_selector.feature_directions

# Learned principal components
z = np.dot(x, W.T)

# Accuracy
learned_w_1 = W[0, :]
e = np.linalg.norm(true_f_1-learned_w_1)
print('PCA Error Norm: %.4f' % e)

The PCA code outputs PCA Error Norm: 1.4134!