TL;DR

In this article, we explain why effective feature selection matters, why popular approaches such as RFE, Boruta, and SHAP values aren't good enough, and we propose an information-theoretic solution with code.

One of the first lessons machine learning engineers learn at the beginning of their careers is that, when it comes to features, sometimes less is more.

The Limits of Using Too Many Features

While it is tempting to generate as many features as possible and let the model do the rest, training a model with redundant or uninformative features can be both ineffective and inefficient.

Non-informative features, simply put, are features that cannot contribute to improving the performance of a predictive model. As an illustration, whether a customer is left-handed or right-handed is likely non-informative about whether or not they will churn from a banking product.

Formally, non-informative features are features that are statistically independent from the target (unconditionally), and also statistically independent from the target conditional on any other feature(s) in the set of candidate features.

Redundant features on the other hand are features that, while they might be informative about the target, cannot contribute to improving the performance of a predictive model when used in conjunction with some other features in the set of candidates.

For instance, in problems where geolocation matters, the postcode can potentially be informative about the target. However, when GPS coordinates are added to the set of features, the postcode becomes redundant, as it can be inferred from GPS coordinates.

Formally, redundant features are features for which there exists a subset of other features conditional on which they are statistically independent from the target.

Using non-informative or redundant features can cause several problems in the training phase of a machine learning project, as well as after a predictive model is deployed to production.

Before Production

Non-informative and redundant features make the training phase less effective (i.e. poorer performance) and less efficient (i.e. longer and costlier).

Increased risk of overfitting: By training a model with non-informative features, we run the risk that it gives a lot of importance to those non-informative features based on spurious patterns detected in the training dataset, which is a form of overfitting. By definition, a model shouldn't rely on non-informative features. Redundant features can also cause degeneracy while training certain models (e.g. linear regression), which could result in overfitting.

Poorer performance: Even when overfitting does not occur, an optimizer might still give small but non-negligible importance to non-informative features. This would result in poorer performance compared to not including said non-informative features.

Costlier to train: When training a model, runtime, memory usage, and CPU usage grow with the number of features. Because they are dispensable, redundant features and non-informative features make model training unnecessarily longer and costlier.

Harder to explain: The presence of redundant features makes it harder to explain a model, as it will be harder to properly attribute feature importance to redundant features.

In Production

Using non-informative or redundant features in production increases maintenance costs.

Increased outage surface: Different features could be served in production by different pipelines (e.g. the raw inputs might come from different databases or tables, processing could be done by different processes, etc.). In such a case, the more features a deployed model uses, the more feature delivery system outages the model will be exposed to, which will eventually increase downtime as any feature outage will likely take down prediction capabilities.

Faster data drift: Data distributions drift over time, requiring production models to be retrained. The higher the number of features, the sooner drift will arise, and the more often production models will need to be retrained.

Review of Existing Feature Selection Approaches & Their Limitations

Pairwise Mutual Information

A popular feature selection approach consists of quantifying how informative each feature is about the target and retaining features that have a high degree of statistical dependency on the target. This is often done using measures of association such as Pearson correlation, Spearman rank correlation, or (pairwise) mutual information.
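A minimal sketch of this pairwise approach, using scikit-learn's mutual information estimator (assumed available; the synthetic data and the cutoff value are illustrative):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 2000
# x0 drives the target; x1 is pure noise.
x0 = rng.normal(size=n)
x1 = rng.normal(size=n)
y = (x0 + 0.1 * rng.normal(size=n) > 0).astype(int)
X = np.column_stack([x0, x1])

# Estimate the pairwise mutual information between each feature and the target.
mi = mutual_info_classif(X, y, random_state=0)
# Keep features whose estimated MI exceeds a (here, arbitrary) cutoff.
selected = [i for i, m in enumerate(mi) if m > 0.01]
print(mi, selected)  # x0 should score far higher than x1
```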

There are two major problems with this approach. First, a feature need not be statistically dependent on the target to be useful! Because this is a common misconception, let's take a concrete illustrative example to clear this up.

Imagine we want to pin down the location of a person or an object (e.g. by predicting GPS coordinates) using elements of an address (i.e. street name, house number, city, zip code, and country). Intuitively, we would expect house numbers to be statistically independent from GPS coordinates, as most house numbers are found on every street, in every city, and in every country. Simply knowing a house number can hardly tell us anything about the associated GPS coordinates (on average).

Clearly, without house numbers, we will not be able to predict GPS coordinates with an error smaller than a few hundred meters (typical street lengths). That said, even though house numbers are expected to be statistically independent from GPS coordinates, using house numbers, we will be able to reduce the error from a few hundred meters to a few tens of meters (typical house dimensions); this is very significant!

Without the street name, the house number is indeed useless. The house number becomes more useful when the street name is known. But even then, because some street names are very common (e.g. First Street, Second Street, Third Street, etc. in the US), we need either the city or the zip code to remove any ambiguity. Said differently, house numbers can be greatly useful, but only if we also know the street name and city or the zip code. Pairwise feature selection approaches are unable to select features such as house numbers, as they never consider interactions between features.
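The failure mode can be reproduced in a few lines with the classic XOR pattern (a simplified stand-in for the house-number example; data and model choices are illustrative): each feature is pairwise independent of the target, yet together the two features determine it exactly.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n = 4000
x0 = rng.integers(0, 2, size=n)
x1 = rng.integers(0, 2, size=n)
y = x0 ^ x1  # the target is the XOR of the two features
X = np.column_stack([x0, x1])

# Pairwise MI of each feature with y is ~0: alone, each feature tells us nothing.
mi = mutual_info_classif(X, y, discrete_features=True, random_state=0)

# Yet a model using both features predicts y almost perfectly.
acc = cross_val_score(RandomForestClassifier(n_estimators=20, random_state=0),
                      X, y, cv=3).mean()
print(mi, acc)
```

A pairwise filter would discard both features; any model given both recovers the target.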

The second problem with pairwise approaches to feature selection is that they are inherently unable to detect and remove redundant features. To detect redundant features, an approach needs to consider more than one feature at a time.
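To make the redundancy blind spot concrete, here is a sketch (synthetic data, scikit-learn assumed) in which an exact duplicate of a feature scores just as high as the original, so a pairwise filter keeps both:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(2)
x = rng.normal(size=3000)
y = (x > 0).astype(int)
X = np.column_stack([x, x.copy()])  # the second column is an exact duplicate

mi = mutual_info_classif(X, y, random_state=0)
# Both columns score (near) identically; a pairwise filter retains both,
# even though either one is completely redundant given the other.
print(mi)
```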

Recursive Feature Elimination

Another popular feature selection algorithm is the Recursive Feature Elimination (RFE) algorithm. It proceeds as follows:

1. Train a classification or regression model.
2. Quantify the importance of each feature to the trained model.
3. Remove the least important feature.
4. Repeat steps 1-3 until we are left with the desired number of features.
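The steps above are what scikit-learn's RFE implements. A minimal sketch (the dataset and the number of features to keep are illustrative):

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
# Only the first three features actually drive the target.
y = (X[:, 0] + X[:, 1] - X[:, 2] > 0).astype(int)

# Repeatedly fit the model and drop the least important feature
# until only n_features_to_select remain.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X, y)
print(np.flatnonzero(rfe.support_))  # indices of the retained features
```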

RFE does a good job at removing features that are not useful to a model. However, RFE is unable to detect redundant features, and it can be very slow.

Intuitively, if we duplicate the most important feature and retrain our model, both said feature and its duplicate will likely still be deemed very important by the newly trained model. Thus, the duplicated feature will be among the last ones to be considered for removal by RFE. More generally, RFE cannot remove a feature on the basis that it is redundant, which an effective feature selection algorithm should be able to do.

As for the runtime and resource usage of RFE, they grow with the number of features to remove, not the number of features to keep, which is counter-intuitive. This is made worse by the fact that each elimination round requires fully retraining our model.

To take a concrete example, if we want to select 10 features out of thousands of candidates, we need to fully train our model thousands of times. It is bad enough to have to retrain our model thousands of times, but each time we also need to calculate feature importance scores.

While feature importance scores are a byproduct of model training for some classes of models (e.g. linear regression and decision trees), for many more classes of models they are very costly to evaluate, to the point of being impractical.

Boruta

Another fundamental limitation of RFE is that the choice of the number of features to keep is left to the user. The Boruta feature selection algorithm addresses this limitation by providing a statistical framework for testing whether a feature is non-informative.

The intuitive idea underpinning Boruta is that, by randomly shuffling the rows of a feature column, we effectively destroy any potential association between the feature and the target, as well as between the feature and other features, thereby rendering it useless.

By randomly shuffling all features, we generate an additional set of useless features referred to as 'shadow features', whose importance scores we may use as benchmarks to form a threshold above which an importance score can be deemed high enough (and below which it can be deemed typical of useless features). In the case of Boruta, the threshold is the highest importance score out of all shadow features.

In Boruta, a model is trained using a combination of real features and shadow features, and feature importance scores are calculated for real and shadow features. Any real feature whose importance score is higher than the highest importance score of shadow features is said to have triggered a 'hit'. Our expectation is that useful features will trigger hits, while useless ones won't.
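One Boruta round can be sketched as follows (a simplified illustration, not the reference implementation; data and model are made up for the example):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
y = (X[:, 0] - X[:, 1] > 0).astype(int)  # only features 0 and 1 matter

# Shadow features: shuffle each column independently to destroy associations.
shadows = rng.permuted(X, axis=0)

# Train on [real | shadow] and compare importance scores.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(np.hstack([X, shadows]), y)

imp = model.feature_importances_
real, shadow = imp[:5], imp[5:]
hits = real > shadow.max()  # one Boruta 'hit' per real feature
print(hits)
```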

Boruta models our lack of knowledge about whether a feature is useful or not using the (null) hypothesis $H_0$ that there is a 1/2 probability that the feature will trigger a hit. If we repeat the experiment $k$ times, then the number of hits $h_k$ of any real feature for which $H_0$ holds (i.e. when we don't know whether the feature is useful or not) ought to follow the binomial distribution $B(k, 1/2)$, of which we can evaluate confidence bounds $[hm_k, hM_k]$ at a level of our choosing (say 95%) analytically.

We conclude about the usefulness of a feature as follows:

• $h_k > hM_k$: we may reject the hypothesis $H_0$ in favor of the alternative $H_1$ that the feature is useful.
• $h_k < hm_k$: we may reject the hypothesis $H_0$ in favor of the alternative $H_2$ that the feature is useless.
• $h_k \in [hm_k, hM_k]$: we still do not know whether the feature is useful. We need to increase $k$ to conclude.
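The decision rule above can be sketched with scipy's binomial distribution (scipy assumed available; $k$ and the 95% level are illustrative):

```python
from scipy.stats import binom

k = 20                            # number of Boruta rounds (illustrative)
hm_k = binom.ppf(0.025, k, 0.5)   # lower bound of the central 95% region
hM_k = binom.ppf(0.975, k, 0.5)   # upper bound of the central 95% region

def decide(h_k):
    if h_k > hM_k:
        return "useful"       # reject H0 in favor of H1
    if h_k < hm_k:
        return "useless"      # reject H0 in favor of H2
    return "undecided"        # increase k to conclude

print(hm_k, hM_k, decide(19), decide(1), decide(10))
```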

Because models need to be trained with real and shadow features combined, an iteration of Boruta can be much more computationally demanding than an iteration of RFE. When the number of features is large, however, Boruta will typically need far fewer iterations to detect all useless features than RFE.

That said, like RFE, Boruta is unable to detect and eliminate redundant features, and for the same reasons.

RFE-SHAP and Boruta-SHAP

SHAP values offer a principled game-theoretic approach to feature importance scoring and, as such, can be combined with both RFE and Boruta.

In a nutshell, the SHAP value of a feature for a specific model prediction represents the part of the deviation of that prediction from the baseline prediction that can be attributed to the value of said feature.

By taking the absolute value and averaging across all decisions made, we obtain a score that quantifies the contribution of each feature in driving model decisions away from the baseline decision (i.e. the best decision we can make without using any feature): this is the SHAP feature importance score.

The appeal of SHAP values stems in large part from the fact that Shapley values, upon which they are based, satisfy a desirable uniqueness property. However, because SHAP values of identical features are the same, using RFE or Boruta with SHAP feature importance cannot detect and remove redundant features.
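To see why redundancy survives Shapley-based scoring, consider exact Shapley values for a toy cooperative game (a hypothetical stand-in for model performance, not the `shap` library): feature 'a' and its exact duplicate 'a_dup' carry the same information, worth 0.5, and 'b' contributes another 0.5.

```python
from itertools import combinations
from math import factorial

players = ["a", "a_dup", "b"]

def value(coalition):
    """Toy 'performance' of a coalition of features."""
    v = 0.0
    if "a" in coalition or "a_dup" in coalition:
        v += 0.5   # a and a_dup carry the exact same information
    if "b" in coalition:
        v += 0.5
    return v

def shapley(player):
    """Exact Shapley value: weighted average marginal contribution."""
    n = len(players)
    others = [p for p in players if p != player]
    total = 0.0
    for r in range(n):
        for S in combinations(others, r):
            w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
            total += w * (value(set(S) | {player}) - value(set(S)))
    return total

print({p: round(shapley(p), 3) for p in players})
# → {'a': 0.25, 'a_dup': 0.25, 'b': 0.5}
```

'a' and 'a_dup' receive identical, non-zero credit, so a Shapley-based importance score gives no signal that one of them could be dropped at no cost.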

Feature Importance Based Approaches Are Fundamentally Flawed

More generally, in addition to the fact that they cannot eliminate redundant features, feature importance-based approaches to feature selection are fundamentally flawed.

A feature importance score (SHAP value included) quantifies the contribution or 'influence' of a feature towards generating model decisions or, said differently, the extent to which model decisions are due to the feature. However, what we really need, which a feature importance score does not and cannot do, is to quantify the contribution of a feature to the overall performance of a model.

Features with high importance scores are features that drive model decisions the most. Useful features on the other hand are features that account for model performance the most! For the two notions to coincide, the trained model has to be very accurate, an assumption rarely met but which RFE and Boruta take for granted.

Whenever a model is overfitted, chances are that the features with the highest feature importance scores are to blame, and as such should be removed. However, these are precisely the features that a feature importance-based feature selection algorithm such as RFE and Boruta will keep. Let's take a concrete example.

Consider the following toy regression problem.

In black is a fitted linear model relating input $x$ to $y$, and in blue is a fitted polynomial model, which can be regarded as a linear model relating features $x, x^2, x^3, x^4, ...$ to $y$.

From the plot above, we can tell that the nonlinear model, which is obviously overfitted, relies heavily on at least one monomial term whose degree is higher than $1$. Thus, we expect the corresponding feature to have a high importance score, no matter the scoring method used. Hence, by applying a feature importance-based selection algorithm to the nonlinear model, we will never be able to remove all nonlinear monomials, which would be the right thing to do in this example.

Had we estimated the contribution of each feature to model performance instead, we would have been able to remove all nonlinear monomials on the basis that they are responsible for overfitting.
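The mechanics can be sketched with numpy (the sample size, noise level, and polynomial degree are illustrative, not the exact data behind the plot):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=15)
y = 2 * x + rng.normal(scale=0.5, size=15)   # linear ground truth + noise

lin = np.polyfit(x, y, 1)    # linear fit
poly = np.polyfit(x, y, 6)   # degree-6 polynomial fit

mse_lin = np.mean((np.polyval(lin, x) - y) ** 2)
mse_poly = np.mean((np.polyval(poly, x) - y) ** 2)

# Nested least squares: the polynomial can never fit the training data worse
# than the linear model, so its monomials look 'important' in-sample even
# though they only capture noise.
print(mse_lin, mse_poly)
```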

Effective Information-Theoretic Feature Selection With KXY

To be effective, a feature selection algorithm should do two things right: 1) discard redundant features, and 2) keep features that contribute the most to model performance.

Algorithm

Building on our previous blog post, we propose the following algorithm:

• Select as the first feature the feature $f_1$ that, when used by itself to predict the target $y$, has the highest achievable performance $a_1$, as described in this blog.
• For $i$ ranging from $1$ to $k-1$, where $k$ is the total number of candidate features, out of all $(k-i)$ features not yet selected, select as the $(i+1)$-th feature $f_{i+1}$ the feature that, when used in conjunction with all $i$ previously selected features, yields the highest increase in achievable performance $(a_{i+1}-a_{i})$. Said differently, $f_{i+1}$ is the feature that complements the previously selected features $(f_1, \dots, f_i)$ the most to predict the target $y$.
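The greedy scheme above can be sketched as follows. The KXY engine estimates the highest achievable performance analytically; here, a hypothetical `achievable_performance(features)` scorer stands in for that estimate, and the toy scorer below is made up for illustration ('x' and 'x_dup' are exact duplicates, 'noise' is uninformative):

```python
def greedy_feature_selection(candidates, achievable_performance):
    """Return the sequence [(f_1, a_1), ..., (f_k, a_k)] described above."""
    selected, sequence = [], []
    remaining = list(candidates)
    while remaining:
        # Select the feature that most increases achievable performance
        # when used alongside the features already selected.
        best = max(remaining,
                   key=lambda f: achievable_performance(selected + [f]))
        selected.append(best)
        remaining.remove(best)
        sequence.append((best, achievable_performance(selected)))
    return sequence

def toy_score(features):
    score = 0.0
    if "x" in features or "x_dup" in features:
        score += 0.5    # x and x_dup carry the same information
    if "z" in features:
        score += 0.25
    return score

print(greedy_feature_selection(["x", "x_dup", "z", "noise"], toy_score))
# → [('x', 0.5), ('z', 0.75), ('x_dup', 0.75), ('noise', 0.75)]
```

Note how the running achievable performance plateaus once the redundant duplicate and the useless feature are all that remain, which is exactly the signal used below to discard them.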

Note that this algorithm is model-free. It only considers the potential effect a feature can have on model performance.

At each step $i$, any feature that is redundant relative to the $i$ features already selected is among the least likely to be selected since, by definition of redundancy, it cannot increase the highest achievable performance.

Similarly, any feature that is non-informative cannot increase the highest performance achievable and, as such, is bound to be among the last features selected.

Formally, if we denote $j$ the smallest index such that $a_j = a_k$, then, if $j < k$, all features $\{f_{j+1}, \dots, f_k\}$ are either redundant or useless and, as such, can be safely discarded.

More generally, the sequence $\{ (f_1, a_1), \dots, (f_j, a_j)\}$ allows us to strike the right balance between the number of features used and the potential performance they can bring about. Even if a feature isn't completely useless or redundant, the performance boost it may bring can be so small that using it wouldn't be worth it.

Code Snippet

We use the UCI Bank Marketing dataset as an example.

But first, we generate some features. The code snippet below generates 103 features out of the 20 raw explanatory variables, and 1-hot encodes the target.

# 0. As a one-off, run 'pip install kxy', then 'kxy configure'
import kxy
# pip install kxy_datasets
from kxy_datasets.classifications import BankMarketing
# 1. Load the UCI Bank Marketing dataset
dataset = BankMarketing()
target_column = dataset.y_column
df = dataset.df
# 2. Generate candidate features
features_df = df.kxy.generate_features(entity=None, max_lag=None,\
entity_name='*', exclude=[target_column])
# The binary target was 1-hot encoded too; keep a single target column
del features_df['y_no']
target_column = 'y_yes'


Here is how simple it is to generate the sequence $\{ (f_1, a_1), \dots, (f_k, a_k)\}$ in Python.

# 3. Run model-free variable selection, simple!
features_df.kxy.variable_selection(target_column, \
problem_type='classification')

Selection Order    Variable    Running Achievable R-Squared    Running Achievable Accuracy
0 No Variable 0.0000 0.89
1 duration 0.0472 0.90
2 euribor3m 0.0743 0.91
3 age 0.0764 0.91
4 campaign 0.0764 0.91
5 job_services 0.24 0.94
6 housing_yes 0.24 0.94
8 pdays 0.25 0.95
9 previous 0.26 0.95
10 emp.var.rate 0.26 0.95
11 cons.price.idx 0.27 0.95
12 cons.conf.idx 0.27 0.95
13 nr.employed 0.27 0.95
14 age.ABS(* - MEAN(*)) 0.27 0.95
15 age.ABS(* - MEDIAN(*)) 0.27 0.95
16 age.ABS(* - Q25(*)) 0.27 0.95
17 age.ABS(* - Q75(*)) 0.27 0.95
18 duration.ABS(* - MEAN(*)) 0.27 0.95
19 duration.ABS(* - MEDIAN(*)) 0.28 0.95
20 duration.ABS(* - Q25(*)) 0.28 0.95
21 duration.ABS(* - Q75(*)) 0.28 0.95
22 campaign.ABS(* - MEAN(*)) 0.28 0.95
23 campaign.ABS(* - MEDIAN(*)) 0.28 0.95
24 campaign.ABS(* - Q25(*)) 0.28 0.95
25 campaign.ABS(* - Q75(*)) 0.28 0.95
26 pdays.ABS(* - MEAN(*)) 0.28 0.95
27 pdays.ABS(* - MEDIAN(*)) 0.28 0.95
28 pdays.ABS(* - Q25(*)) 0.28 0.95
29 pdays.ABS(* - Q75(*)) 0.28 0.95
30 previous.ABS(* - MEAN(*)) 0.28 0.95
31 previous.ABS(* - MEDIAN(*)) 0.28 0.95
32 previous.ABS(* - Q25(*)) 0.28 0.95
33 previous.ABS(* - Q75(*)) 0.28 0.95
34 emp.var.rate.ABS(* - MEAN(*)) 0.28 0.95
35 emp.var.rate.ABS(* - MEDIAN(*)) 0.28 0.95
36 emp.var.rate.ABS(* - Q25(*)) 0.28 0.95
37 emp.var.rate.ABS(* - Q75(*)) 0.28 0.95
38 cons.price.idx.ABS(* - MEAN(*)) 0.28 0.95
39 cons.price.idx.ABS(* - MEDIAN(*)) 0.28 0.95
40 cons.price.idx.ABS(* - Q25(*)) 0.28 0.95
41 cons.price.idx.ABS(* - Q75(*)) 0.28 0.95
42 cons.conf.idx.ABS(* - MEAN(*)) 0.28 0.95
43 cons.conf.idx.ABS(* - MEDIAN(*)) 0.28 0.95
44 cons.conf.idx.ABS(* - Q25(*)) 0.29 0.95
45 cons.conf.idx.ABS(* - Q75(*)) 0.29 0.95
46 euribor3m.ABS(* - MEAN(*)) 0.29 0.95
47 euribor3m.ABS(* - MEDIAN(*)) 0.29 0.95
48 euribor3m.ABS(* - Q25(*)) 0.29 0.95
49 euribor3m.ABS(* - Q75(*)) 0.29 0.96
50 nr.employed.ABS(* - MEAN(*)) 0.29 0.96

As we can see, the top-8 explanatory variables collectively increase the achievable classification accuracy by 6 percentage points above the baseline (from 0.89 to 0.95), while the remaining 90+ features can only add another 1 point of accuracy.