Boruta(SHAP) Does Not Work For The Reason You Think It Does!

Everything you wish you knew about Boruta, and more.


Any feature selection method that relies on feature importance scores is inherently flawed. Nonetheless, in practice, Boruta seems to perform decently most of the time, unlike Recursive Feature Elimination (RFE), for instance.

The aim of this post is to elucidate this apparent mystery and shed some light on the surprising reason why Boruta is not quite as bad as RFE.

But first, let me stress why using feature importance scores for feature selection isn't such a great idea, and then explain how Boruta works (in case you didn't already know).

Feature Importance Scores Should Not Be Used For Feature Selection

Feature selection methods that rely on feature importance scores are inherently flawed!

The main reason why is that a feature importance score quantifies the extent to which model decisions are affected by values of the feature; it does not quantify the extent to which model performance is affected by values of the feature.

High feature importance score does not imply that the feature is useful

Take a random model of your choosing that uses a given set of features (e.g. a randomly constructed CART or neural net, a linear model with randomly chosen coefficients, etc.).

Every feature will have a score, and some features might actually have much larger scores than others.

However, no matter the type of feature importance metric used (SHAP, gain-based, split-based, permutation, etc.), a high feature importance score does not imply that the feature is insightful or informative about the target. All we can conclude is that the feature is important to the model!

If the model has poor performance, then chances are that features with the highest importance scores are to blame, and should be removed.

The idea of removing a feature because it has a low feature importance score makes the fundamental assumption that the trained model performs well and, in particular, is not overfitted.

It is only then that a feature with a high importance score will also have a high contribution towards model performance (i.e. a high feature usefulness).

In general, feature selection algorithms should remove features with low usefulness, not features with low feature importance scores.
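To make this concrete, here is a toy sketch (my own illustration, not code from any package): a linear "model" with random, untrained coefficients, applied to pure-noise features, still produces a wide spread of permutation importance scores, even though no feature carries any information about the target.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.standard_normal((n, d))   # pure-noise features
y = rng.standard_normal(n)        # target independent of every feature

# A "model" whose coefficients were drawn at random -- never trained.
w = rng.standard_normal(d)
predict = lambda X: X @ w

# Permutation importance: increase in MSE when one feature is shuffled.
base_mse = np.mean((predict(X) - y) ** 2)
importances = []
for j in range(d):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])
    importances.append(np.mean((predict(Xp) - y) ** 2) - base_mse)

# Scores are far from uniform: features with large random coefficients
# look "important" even though none is informative about y.
print(np.round(importances, 2))
```

The features with the largest random coefficients dominate the importance ranking; the scores tell us about the model, not about usefulness.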

Feature importance scores cannot detect redundant features

In addition to removing features that are not insightful, a good feature selection algorithm should also remove features that are redundant relative to others.

Most families of predictive models, when trained, treat identical features the same, as if they were interchangeable.

As a result, identical features will typically have the same feature importance score, even though one is a duplicate of the other.

Thus, a feature selection algorithm based on feature importance scores would never be able to select one feature and remove another which is identical to the first and therefore redundant.

The same reasoning holds when features are very similar but not necessarily identical. More generally, there is nothing in feature importance scores that would inherently ensure that redundant features would not have similar feature importance scores.
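A quick sketch makes the point (again my own toy example): duplicate an informative feature, fit a lightly regularized linear model, and both copies receive similar, sizeable importance scores, so nothing in the scores reveals that one copy is redundant.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
x1 = rng.standard_normal(n)
x2 = rng.standard_normal(n)          # genuinely useless feature
X = np.column_stack([x1, x1, x2])    # x1 appears twice
y = x1 + 0.1 * rng.standard_normal(n)

# Ridge regression with a tiny penalty (OLS is singular with exact duplicates).
lam = 1e-3
w = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Permutation importance of each column.
base_mse = np.mean((X @ w - y) ** 2)
imp = []
for j in range(3):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])
    imp.append(np.mean((Xp @ w - y) ** 2) - base_mse)

# The two copies of x1 split the weight (~0.5 each) and both receive
# near-identical, sizeable importance scores; nothing in the scores says
# that dropping one copy would cost nothing.
print(np.round(w, 2), np.round(imp, 2))
```

Both duplicates look equally "important", while only the useless `x2` gets a near-zero score; importance alone cannot distinguish redundancy from usefulness.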

The Boruta Exception

What Is Boruta(SHAP)?

Boruta is a feature selection algorithm whose core idea is to remove features whose feature importance scores are so low that they ought to be useless.

Boruta gives a statistical answer to the following question: what is the threshold below which feature importance scores should be deemed so low that associated features ought to be useless?

To answer this question, Boruta creates new features called 'shadow features', which we know cannot possibly be useful to solve the problem at hand.

Typically, shadow features are constructed by shuffling each existing feature, thereby destroying any association between feature and target.

Formally, thanks to random shuffling, shadow features have the same marginal distributions as original features, but they are independent of the target both unconditionally and conditional on original features.
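Concretely, the shadow construction can be sketched in a few lines (a sketch only; the function name and `shadow_` column prefix are my own choices, not any package's API):

```python
import numpy as np
import pandas as pd

def add_shadow_features(x_df, seed=None):
    """Append a shuffled ('shadow') copy of every column: each shadow has
    the same marginal distribution as its original, but shuffling destroys
    any association with the target."""
    rng = np.random.default_rng(seed)
    shadows = pd.DataFrame(
        {'shadow_%s' % c: rng.permutation(x_df[c].values)
         for c in x_df.columns},
        index=x_df.index)
    return pd.concat([x_df, shadows], axis=1)

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [10., 20., 30., 40.]})
aug = add_shadow_features(df, seed=0)
print(list(aug.columns))
```

Each shadow column contains exactly the original values, only in a random order.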

The notion of a hit

Shadow features are then added to original features, a model is trained using the combined set, and feature importance scores are calculated for each original and shadow feature.

Because shadow features are useless, Boruta considers that no useful feature should have a feature importance score lower than that of a shadow feature.

An original feature is considered to have received a hit when its feature importance score is higher than the scores of all shadow features.
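In code, the hit rule for a single run might look like this (an illustrative sketch; the names and the dict-based interface are mine):

```python
def boruta_hits(importances):
    """One Boruta run. `importances` maps feature name -> importance score,
    with shadow features prefixed by 'shadow_'. An original feature scores
    a hit when it beats every shadow feature."""
    shadow_max = max(score for name, score in importances.items()
                     if name.startswith('shadow_'))
    return {name: score > shadow_max
            for name, score in importances.items()
            if not name.startswith('shadow_')}

# Hypothetical scores from a single run:
run = {'x1': 0.40, 'x2': 0.05, 'shadow_x1': 0.10, 'shadow_x2': 0.02}
print(boruta_hits(run))
```

Here `x1` beats the best shadow score (0.10) and gets a hit, while `x2` does not.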

The Boruta statistical test

Given that shadow features are randomly generated, Boruta constructs a statistical hypothesis test to robustify decision making.

Specifically, the procedure described above is run \(k\) independent times, and we count the number of hits \(h_i\) of each feature \(x_i\).

The Null Hypothesis \(H_0\)

We represent the fact that we do not know whether feature \(x_i\) is useful or not before running the test by saying that, in each of the \(k\) runs, there is a 50% chance that feature \(x_i\) will receive a hit (i.e. a probability \(p=0.5\)).

Thus, in the absence of evidence pointing to whether \(x_i\) is useful or not, \(h_i\) should follow a Binomial distribution with parameters \(k\) and \(p\). This forms the null hypothesis \(H_0\) of the Boruta test.

The CDF of a Binomial distribution is well known and allows us to easily compute the two-sided symmetric confidence intervals \([m_q(k), M_q(k)]\) in which \(h_i\) should lie with probability \(q\) under \(H_0\).
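This interval takes only a few lines of standard-library Python to compute (a sketch; whether a hit count equal to a boundary triggers rejection varies between implementations, so thresholds reported elsewhere may differ by one):

```python
from math import comb

def binom_cdf(h, k, p=0.5):
    """P(X <= h) for X ~ Binomial(k, p)."""
    return sum(comb(k, i) * p**i * (1 - p)**(k - i) for i in range(h + 1))

def boruta_interval(k, q=0.95):
    """Symmetric interval [m, M] such that, under H0
    (hits ~ Binomial(k, 0.5)), the hit count falls below m or above M
    with probability at most 1 - q."""
    alpha = (1 - q) / 2.
    m = min(h for h in range(k + 1) if binom_cdf(h, k) >= alpha)
    M = min(h for h in range(k + 1) if binom_cdf(h, k) >= 1 - alpha)
    return m, M

print(boruta_interval(20))   # (6, 14) for k=20 runs at 95% confidence
```

For \(k=20\) and \(q=0.95\), hit counts above the interval lead to "useful", counts below it to "useless", and counts inside it leave the test inconclusive.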

The Alternative Hypothesis \(H_1\) That Feature \(x_i\) is Useful

When the number of hits \(h_i\) observed after \(k\) runs exceeds \(M_q(k)\), we reject the hypothesis \(H_0\) that we do not know whether feature \(x_i\) is useful or not, in favor of the alternative \(H_1\) that feature \(x_i\) is more likely to be useful than not.

The Alternative Hypothesis \(H_2\) That Feature \(x_i\) is Useless

When the number of hits \(h_i\) observed after \(k\) runs is lower than \(m_q(k)\), we reject the hypothesis \(H_0\) that we do not know whether feature \(x_i\) is useful or not, in favor of the alternative \(H_2\) that feature \(x_i\) is more likely to be useless than not.

Fig. 1: Visualization of the Boruta statistical test for k=20 and at q=95% confidence. A feature is deemed useful when it has at least 14 hits after 20 runs, and it is deemed useless when it has 5 hits or fewer after 20 runs.

BorutaSHAP is basically Boruta when the feature importance metric used is SHAP values.

Python Implementation

Boruta is implemented in the kxy package. Whenever the method you are calling has a feature_selection_method argument, simply specify 'boruta' as the value.

For the source code see here.

The Counterintuitive Reason Why Boruta (Often) Works

The fundamental flaw of Boruta is to assume that a feature is useful just because it has a relatively high feature importance score, and that a feature is useless just because it has a relatively low feature importance score.

This is clearly not the case. Our model could be overfitted to the point of giving useless original features a lot of importance and useful original features very little importance.

What saves Boruta is the fact that the feature importance threshold above which features are deemed useful is a function of the trained model.

In fact, if this threshold were deterministic then, no matter how high we set it, the resulting feature selection algorithm would very easily break down. All it would take is for our trained model to overfit!

Because Boruta defines its threshold as the highest shadow feature score, for Boruta to break down, our trained models would need to consistently give the wrong importance to an original feature (i.e. low importance to a useful feature, or high importance to a useless one), while at the same time consistently giving the right (i.e. little) importance to all shadow features. Fortunately, the larger \(k\), the less likely this is to happen.

To see why, let us consider two scenarios that would break a feature selection algorithm using a deterministic feature importance threshold.

Scenario I: Completely Random Models

Let's consider the scenario where our model training procedure is completely wrong/random. In this case, there is no reason why any original feature would have a score that is consistently higher or consistently lower than the highest shadow feature importance score.

As such, the number of hits of most original features would always be in \([m_q(k), M_q(k)]\), we would not be able to reject \(H_0\), and Boruta would always be inconclusive, not wrong.

Scenario II: Overfitted Models

Let's consider the scenario where our model training procedure always overfits. For Boruta to wrongly conclude that certain features are useful, our models need to consistently unduly overweight the said features, and give little importance to all shadow features. While this is easily conceivable for \(k=1\) or \(k=2\), the higher the number of runs \(k\), the less likely it is to happen.

On the other hand, if our model overfits in each run but does not necessarily overweight the same features, then, once more, most features will have a number of hits in \([m_q(k), M_q(k)]\), and Boruta will be inconclusive, not wrong.

A Simple Adversarial Attack

Fig. 2: Illustration of BorutaSHAP (k=50, q=0.95) on a regression problem using as candidate features monomial features up to order 7: \((x, x^2, \dots, x^7)\). The only selected feature is \(x^3\), even though a simple linear regression model provides a better fit.

As previously discussed, all it takes for Boruta to break down is for our model training procedure to consistently overfit, overweighting the same original feature(s), and never overweighting shadow features.

The illustration above, which was produced by the code below, provides such an example.

Even though a simple linear regression model fits better than any polynomial regression model in this case, Boruta is inconclusive as to whether \(x\) is useful or not: it deems \(x^3\) useful, and it deems all other features \(x^2, x^4, \dots, x^7\) useless.


import pylab as plt
import numpy as np
import pandas as pd

data = [
	[-5., -12.5], [-4., -17.], [-3., -14.],
	[-2., -9.], [-1., -6.], [0., -1.], [1., -1.],
	[2., 3.], [3., 5.], [4., 10.], [5., 9.]]
data = np.array(data)
x = data[:, 0][:, None]
y = data[:, 1][:, None]
z = np.concatenate([x, x**2, x**3, x**4, x**5, x**6, \
	x**7], axis=1)
x_df = pd.DataFrame(z, columns= \
	['x%d' % i for i in range(1, 8)])
y_df = pd.DataFrame(y, columns=['y'])

from kxy.learning import get_sklearn_learner
from kxy.misc import Boruta

# Linear regression as the base learner.
learner_func = get_sklearn_learner(\
	'sklearn.linear_model.LinearRegression')
selector = Boruta(learner_func)
# Run Boruta with k=50 runs at 95% confidence.
lr_model =, y_df, n_evaluations=50, \
	pval=0.95, max_duration=None)
selected_features = selector.selected_variables
ambiguous_features = selector.ambiguous_variables

print('Selected Features: %s' % \
	str(selected_features))
print('Ambiguous Features: %s' % \
	str(ambiguous_features))

# Baseline 1: simple linear regression on x only.
simple_lr_model = learner_func(), y)

# Baseline 2: linear regression on all monomial features.
mono_lr_model = learner_func(), y)

x_test = np.arange(-5., 5.1, 0.1)[:, None]
z_test = np.concatenate([x_test, x_test**2, \
	x_test**3, x_test**4, x_test**5, x_test**6, \
	x_test**7], axis=1)
x_test_df = pd.DataFrame(z_test, columns= \
	['x%d' % i for i in range(1, 8)])
# The Boruta-fitted model only uses the selected features.
y_test_pred = lr_model.predict(\
	x_test_df[selected_features])
simple_y_test_pred = simple_lr_model.predict(x_test)
mono_y_test_pred = mono_lr_model.predict(z_test)

fig, ax = plt.subplots(figsize=(10, 10))
plt.plot(x, y, '.k', label='Training Data')
plt.plot(x_test, y_test_pred, '-b', \
	label='Linear Regression (Monomial Features '\
	'Selected by Boruta)')
plt.plot(x_test, simple_y_test_pred, '-.r', \
	label='Simple Linear Regression')
plt.plot(x_test, mono_y_test_pred, '-c', \
	label='Linear Regression (All Monomial Features)')
plt.legend()
plt.savefig('overfitting.jpg', dpi=500)

For Recursive Feature Elimination (RFE) to break down, all it takes is for trained models to overfit. Boruta(SHAP) requires a little more to break down. Trained models need to overfit, overweighting the same original features, while never overweighting shadow features.

When trained models overfit but do not always overweight the same (original) features, Boruta(SHAP) becomes inconclusive about whether or not a feature is useful.

To use a classification analogy, the increased resilience of Boruta(SHAP) over RFE lies in the fact that it can increase its precision by lowering its recall, whereas RFE always has a recall of 100%.

This is desirable in practice because, when it comes to feature selection, we care more about the quality of selected features (i.e. precision) than selecting a large number of features (i.e. recall).

Make no mistake. Boruta(SHAP) does break down; just less often than RFE does. LeanML is a better alternative to both Boruta(SHAP) and RFE. For a benchmark, see this post.