5 Reasons Why You Should Never Use Recursive Feature Elimination

Fundamental limitations you need to be aware of before using Recursive Feature Elimination (RFE) or any other feature selection algorithm based on feature importance.


When building a predictive model, an explanatory variable or feature is deemed unnecessary when it is either uninformative or redundant.

Feature selection, the act of detecting and removing unnecessary explanatory variables, is key to building an effective predictive model.

In effect, failure to remove unnecessary features while training a predictive model increases the likelihood of overfitting, and makes the model harder to explain and costlier to maintain.

An approach often used for feature selection is the Recursive Feature Elimination (RFE) algorithm.

RFE is applicable to models for which we may compute a feature importance score.

Simply put, a feature importance score is a score that quantifies how important a feature is for generating model decisions.

Example feature importance scores include Gini importance, split-based and gain-based importance scores, Mean Absolute SHAP values, LOCO, and permutation importance.

Starting with all \(d\) candidate features, RFE proceeds as follows:

1. Train a model using all eligible features.
2. Delete the eligible feature with the smallest feature importance score to obtain the new set of eligible features.
3. Repeat steps 1 and 2 until the set of eligible features reaches a size \(q < d\) specified by the operator.
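The loop above can be sketched with scikit-learn's RFE implementation. This is a minimal, illustrative example on synthetic data; all parameter choices here are arbitrary:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data: 10 candidate features, only 3 of which are informative.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3, random_state=0)

# Keep q = 3 features; at each pass RFE refits the model and drops the
# feature with the smallest importance score (here, |coefficient|).
selector = RFE(LinearRegression(), n_features_to_select=3, step=1).fit(X, y)
print(selector.support_)   # boolean mask over the 10 candidates
print(selector.ranking_)   # 1 = kept; higher rank = eliminated earlier
```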

While it may seem intuitive at first, RFE should be avoided at all costs.

Here are 5 reasons why.

Reason 1: Just because a feature is important does not make it useful!

That's right. Feature importance scores quantify the extent to which a model relies on a feature to make predictions. They do not (necessarily) quantify the contribution of a feature to the overall accuracy of a model (i.e. the feature's usefulness). This is a subtle but fundamental difference.

Take Mean Absolute SHAP value for instance. Its calculation does not utilize true labels at all. As such, there is no guarantee that a high Mean Absolute SHAP value will correspond to a high contribution to the overall accuracy of a model.

The same goes for Gini importance. The fact that splitting on a feature yielded a high mean (cumulative) impurity decrease while building a Random Forest does not guarantee that said feature will contribute much to the model's overall accuracy out-of-sample. The trained Random Forest could very well be overfitted. The same argument applies to split-based and gain-based importance scores, as well as to LOCO and permutation importance when scores are calculated on the training set.
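The training-set caveat is easy to reproduce. In this minimal, illustrative sketch (all data and parameters are made up), an unpruned decision tree is deliberately overfit on data where only the first feature carries signal, and permutation importance is computed both on the training set and on held-out data:

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
# Only feature 0 carries signal; features 1-4 are pure noise.
y = X[:, 0] + rng.normal(size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
# An unpruned tree memorizes the training set, splitting on noise along the way.
model = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)

imp_tr = permutation_importance(model, X_tr, y_tr, random_state=0).importances_mean
imp_te = permutation_importance(model, X_te, y_te, random_state=0).importances_mean

# The noise features look important on the training set,
# but their importance collapses on held-out data.
print(imp_tr[1:].mean(), imp_te[1:].mean())
```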

Even if LOCO and permutation importance scores are calculated using validation/held-out data, it still would not guarantee that important features are also useful.

In general, unless the trained model is already a good model, there is no guarantee that features with high LOCO or permutation importance scores will be useful. But feature selection is itself part of constructing a good model, so we cannot assume we already have one.

Feature selection should guard against us training bad models, not rely on us having trained good models.

In short, if the feature importance score you feed into RFE does not guarantee that an important feature is also useful (most off-the-shelf scores don't), you should not be using RFE.

Reason 2: RFE exacerbates overfitting.

The first pass of RFE trains the model on all candidate features. The presence of unnecessary features in this set (without which feature selection would not be needed in the first place) increases the likelihood that the first trained model will overfit.

When the first model is overfitted, the most important features will tend to be the most useless. After all, they are the ones that drive (inaccurate) model decisions the most!

Thus, in the first pass, RFE will tend to favor important but useless features over unimportant but useful ones. When the features responsible for overfitting are kept after a pass, the model trained in the next pass will likely overfit as well, and the features driving that overfitting (i.e. the features with the highest importance scores) will most likely be kept too.

By keeping important features instead of useful ones, RFE basically exacerbates overfitting.

Reason 3: RFE cannot properly detect, let alone eliminate, redundant features.

Among features that are the most important for a model, there could very well be features that are redundant.

As an illustration, consider duplicating the feature with the highest importance score and retraining your model. Intuitively, the original feature and its duplicate should receive the same score, and both should rank among the most important features of the second model. As such, the duplicate feature is unlikely to be removed by RFE, even though it is clearly unnecessary.
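This failure mode is easy to reproduce with Gini importance. In this minimal sketch (synthetic data, arbitrary parameters), the dominant feature is duplicated before training a Random Forest; subsampling features per split (`max_features=2`) ensures both copies get used, so both end up with high importance and neither would be an early elimination candidate:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=300)  # feature 0 dominates

X_dup = np.column_stack([X, X[:, 0]])  # append an exact copy of feature 0
model = RandomForestRegressor(max_features=2, random_state=0).fit(X_dup, y)

imp = model.feature_importances_
# The original (index 0) and its duplicate (index 3) share the credit,
# and both outrank the genuinely uninformative features (indices 1 and 2).
print(imp)
```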

In short, if the feature importance score you use in RFE does not guarantee that highly important features cannot be mutually redundant (most off-the-shelf scores don't), you should not be using RFE to remove redundant features.

Reason 4: RFE does not tell us how many features we can afford to remove.

For a given \(q\), RFE will help you choose which \((d-q)\) features to remove in your set of \(d\) candidate features. But what value of \(q\) should you use?

Too large a \(q\) and you remain exposed to the downsides of unnecessary features (a higher likelihood of overfitting, a model that is harder to explain and costlier to maintain). Too small a \(q\) and your model's performance might suffer.

Picking the right \(q\) is as important as choosing which \((d-q)\) features to remove.
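For reference, scikit-learn's RFECV tries to address this by cross-validating every candidate number of features and keeping the best-scoring one, instead of asking the operator to fix \(q\) up front. A minimal sketch on synthetic data (all parameter choices here are arbitrary):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=10, n_informative=3, random_state=0)

# RFECV runs RFE once per fold, scores each candidate feature-set size by
# cross-validation, and retains the size with the best mean score.
selector = RFECV(LinearRegression(), step=1, cv=5).fit(X, y)
print(selector.n_features_)  # the q chosen by cross-validation
```

Note that this only automates the choice of \(q\); it inherits every other limitation of RFE discussed here.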

Reason 5: RFE does not work for all models out-of-the-box.

As previously discussed, RFE requires a feature importance score. While model-agnostic feature importance scores exist in theory (Mean Absolute SHAP values, LOCO, and permutation importance, to name but a few), outside of specific model families they are usually computationally intractable beyond a few dozen features.

This is the case for SHAP values, which are only practical to compute when model-specific solutions or approximations are available (e.g. for linear regression and tree-based models).

As for LOCO, the number of models to train at each RFE pass scales linearly with the number of eligible features, which makes it impractical beyond a few dozen features to remove. As a simple illustration, eliminating 40 features out of 50 would require training the model more than 1,200 times!
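As a sanity check on the arithmetic behind that illustration, count one leave-one-covariate-out refit per eligible feature per pass (ignoring the baseline fit each pass also needs):

```python
# Passes run with k = 50, 49, ..., 11 eligible features; each pass refits
# the model once per eligible feature (leave-one-covariate-out).
fits = sum(range(11, 51))
print(fits)  # 1220
```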

Finally, in the case of both feature permutation and LOCO, as discussed in Reason 1 above, to be effective they require the trained model to be a good model.

To see how bad RFE really is in practice, check out this benchmark we made, where we compared RFE to LeanML feature selection and Boruta on 38 real-world classification and regression problems.

RFE underperformed both Boruta and LeanML by at least 0.40 in average \(R^2\) and AUC, and was 10x slower than LeanML!