Any feature selection method that relies on feature importance scores is inherently flawed. Nonetheless, in practice, Boruta seems to perform decently most of the times, unlike Recursive Feature Elimination (RFE) for instance.
The aim of this post is to elucidate this apparent mystery and shed some light on the surprising reason why Boruta is not quite as bad as RFE.
But first, let me stress why using feature importance scores for feature selection isn't such a great idea, and then explain how Boruta works (in case you didn't already know).
Feature Importance Scores Should Not Be Used For Feature Selection
Feature selection methods that rely on feature importance scores are inherently flawed!
The main reason why is that a feature importance score quantifies the extent to which model decisions are affected by values of the feature; it does not quantify the extent to which model performance is affected by values of the feature.
High feature importance score does not imply that the feature is useful
Take a random model of your choosing that uses a given set of features (e.g. a randomly constructed CART or neural net, a linear model with randomly selected coefficients etc.).
Every feature will have a score, and some features might actually have much larger scores than others.
However, no matter the type of feature importance metric used (SHAP, gain-based, split-based, permutation etc.), a high feature importance score simply would not imply that the feature is insightful or informative about the target. All we could conclude is that the feature is important to the model!
If the model has poor performance, then chances are that features with the highest importance scores are to blame, and should be removed.
The idea of removing a feature because it has a low feature importance score makes the fundamental assumption that the trained model performs great and, in particular is not overfitted.
It is only then that a feature with high importance score will also have a high contribution towards model performance (i.e. a high feature usefulness).
In general, feature selection algorithms should remove features with low usefulness, not features with low feature importance scores.
Feature importance scores cannot detect redundant features
In addition to removing features that are not insightful, a good feature selection algorithm should also remove features that are redundant relative to others.
Most families of predictive models, when trained, would treat identical features the same, as if they were interchangeable.
As a result, identical features will typically have the same feature importance score, even though one is a duplicate of the other.
Thus, a feature selection algorithm based on feature importance scores would never be able to select one feature and remove another which is identical to the first and therefore redundant.
The same reasoning holds when features are very similar but not necessarily identical. More generally, there is nothing in feature importance scores that would inherently ensure that redundant features would not have similar feature importance scores.
The Boruta Exception
What Is Boruta(SHAP)
Boruta is a feature selection algorithm whose core idea is to remove features whose feature importance scores are so low that they ought to be useless.
Boruta gives a statistical answer to the following question: what is the threshold below which feature importance scores should be deemed so low that associated features ought to be useless?
To answer this question, Boruta creates new features called 'shadow features', which we know cannot possibly be useful to solve the problem at hand.
Typically, shadow features are constructed by shuffling each existing feature, thereby destroying any association between feature and target.
Formally, thanks to random shuffling, shadow features have the same marginal distributions as original features, but they are independent of the target both unconditionally and conditional on original features.
The notion of a hit
Shadow features are then added to original features, a model is trained using the combined set, and feature importance scores are calculated for each original and shadow feature.
Because shadow features are useless, Boruta considers that no useful feature should have a feature importance score lower than that of a shadow feature.
An original feature is considered to have received a hit when its feature importance score is higher than the scores of all shadow features.
The Boruta statistical test
Given that shadow features are randomly generated, Boruta constructs a statistical hypothesis test to robustify decision making.
Specifically, the procedure described above is ran \(k\) independent times, and we count the number of hits \(h_i\) of each feature \(x_i\).
The Null Hypothesis \(H_0\)
We represent the fact that we do not know whether feature \(x_i\) is useful or not before running the test by saying that, in each of the \(k\) runs, there is a 50% chance that feature \(x_i\) will receive a hit (i.e. a probability \(p=0.5\)).
Thus, in the absence of evidence pointing to whether \(x_i\) is useful or not, \(h_i\) should follow a Binomial distribution with parameters \(k\) and \(p\). This forms the null hypothesis \(H_o\) of the Boruta test.
The CDF of a Binomial distribution is well known and allows us to easily compute the two-sided symmetric confidence intervals \([m_q(k), M_q(k)]\) in which \(h_i\) should lie with probably \(q\) under \(H_0\).
The Alternate Hypothesis \(H_1\) That Feature \(x_i\) is Useful
When the number of hits \(h_i\) observed after \(k\) runs exceeds \(M_q(k)\) we reject the hypothesis \(H_0\) that we do not know whether feature \(x_i\) is useful or not, in favor of the alternative \(H_1\) that feature \(x_i\) is more likely to be useful than not.
The Alternate Hypothesis \(H_2\) That Feature \(x_i\) is Useless
When the number of hits \(h_i\) observed after \(k\) runs is lower than \(m_q(k)\) we reject the hypothesis \(H_0\) that we do not know whether feature \(x_i\) is useful or not, in favor of the alternative \(H_2\) that feature \(x_i\) is more likely to be useless than not.
BorutaSHAP is basically Boruta when the feature importance metric used is SHAP values.
Boruta is implemented in the
kxy package. Whenever the method you are calling has a
feature_selection_method argument, simply specify
'boruta' as value.
For the source code see here.
The Counterintuitive Reason Why Boruta (Often) Works
The fundamental flaw of Boruta is to assume that a feature is useful just because it has a relatively high feature importance score, and that a feature is useless just because it has a relatively low feature importance score.
This is clearly not the case. Our model could be overfitted to the point of giving useless original features a lot of importance and useful original features very little importance.
What saves Boruta is the fact that the feature importance threshold above which features are deemed useful is a function of the trained model.
In fact, if this threshold was deterministic then, no matter how high we set it, the resulting feature selection algorithm would very easily break down. All it would take is for our trained model to overfit!
Because Boruta defines its threshold as the highest score of shadow features, for Boruta to break down, our trained models should consistently give the wrong importance to an original feature (i.e. low importance if it is a useful feature or high importance if it is a useless feature), while at the same time consistently giving the right (i.e little) importance to all shadow features. Fortunately, the larger \(k\), the less likely this is to happen.
To see why, let us consider two scenarios that would break a feature selection algorithm using a deterministic feature importance threshold.
Scenario I: Completely Random Models
Let's consider the scenario where our model training procedure is completely wrong/random. In this case, there is no reason why any original feature would have a score that is consistently higher or consistently lower than the highest shadow feature importance score.
As such, the number of hits of most original features would always be in \([m_q(k), M_q(k)]\), we would not be able to reject \(H_0\), and Boruta would always be inconclusive, not wrong.
Scenario II: Overfitted Models
Let's consider the scenario where our model training procedure always overfits. For Boruta to wrongly conclude that certain features are useful, our models need to consistently unduly overweight the said features, and give little importance to all shadow features. While this is easily conceivable for \(k=1\) or \(k=2\), the higher the number of runs \(k\), the less likely it is to happen.
On the other hand, if our model overfits in each run but does not necessarily overweight the same features, then, once more, most features will have a number of hits in \([m_q(k), M_q(k)]\), and Boruta will be inconclusive, not wrong.
A Simple Adversarial Attack
As previously discussed, all it takes for Boruta to break down is for our model training procedure to consistently overfit, overweighting the same original feature(s), and never overweighting shadow features.
The illustration above, which was produced by the code below, provides such an example.
Even though a simple linear regression model is better than any polynomial regression model in this case, Boruta is inconclusive as to whether \(x\) is useful or not, it deems \(x^3\) useful, and it deems all other features \(x^2, x^4, \dots, x^7\) useless.
import pylab as plt import numpy as np import pandas as pd # GENERATE THE DATA data = [ [-5., -12.5], [-4., -17.], [-3., -14.], [-2., -9.], [-1., -6.], [0., -1.], [1., -1.], [2., 3.], [3., 5.], [4., 10.], [5., 9.] ] data = np.array(data) x = data[:, 0][:, None] y = data[:, 1][:, None] z = np.concatenate([x, x**2, x**3, x**4, x**5, x**6, \ x**7], axis=1) x_df = pd.DataFrame(z, columns= \ ['x%d' % i for i in range(1, 8)]) y_df = pd.DataFrame(y, columns=['y']) # RUN BORUTA from kxy.learning import get_sklearn_learner from kxy.misc import Boruta learner_func = get_sklearn_learner(\ 'sklearn.linear_model.LinearRegression') selector = Boruta(learner_func) lr_model = selector.fit(x_df, y_df, n_evaluations=50, \ pval=0.95, max_duration=None) selected_features = selector.selected_variables ambiguous_features = selector.ambiguous_variables print('Selected Features: %s' % \ str(selected_features)) print('Ambiguous Features: %s' % \ str(ambiguous_features)) # TRAIN A SIMPLE LINEAR MODEL simple_lr_model = learner_func() simple_lr_model.fit(x, y) # TRAIN A LINEAR MODEL ON ALL MONOMIAL FEATURES mono_lr_model = learner_func() mono_lr_model.fit(z, y) # PLOT DATA x_test = np.arange(-5., 5.1, 0.1)[:, None] z_test = np.concatenate([x_test, x_test**2, \ x_test**3, x_test**4, x_test**5, x_test**6, \ x_test**7], axis=1) x_test_df = pd.DataFrame(z_test, columns= \ ['x%d' % i for i in range(1, 8)]) y_test_pred = lr_model.predict(\ x_test_df[selected_features].values) simple_y_test_pred = simple_lr_model.predict(x_test) mono_y_test_pred = mono_lr_model.predict(z_test) fig, ax = plt.subplots(figsize=(10, 10)) plt.plot(x, y, '.k', label='Training Data') plt.plot(x_test, y_test_pred, '-b', \ label='Linear Regression (Monomial Features '\ 'Selected by Boruta)') plt.plot(x_test, simple_y_test_pred, '-.r', \ label='Simple Linear Regression') plt.plot(x_test, mono_y_test_pred, '-c', \ label='Linear Regression (All Monomial Features)') plt.legend() ax.set_ylabel(r'''$y$''') ax.set_xlabel(r'''$x$''') plt.savefig('overfitting.jpg', dpi=500, \ bbox_inches='tight') plt.show()
For Recursive Feature Elimination (RFE) to break down, all it takes is for trained models to overfit. Boruta(SHAP) requires a little more to break down. Trained models need to overfit, overweighting the same original features, while never overweighting shadow features.
When trained models overfit but do not always overweight the same (original) features, Boruta(SHAP) becomes inconclusive about whether or not a feature is useful.
To use a classification analogy, the increased resilience of Boruta(SHAP) over RFE lies in the fact that, unlike RFE, it is able to increase its precision by lowering its recall, while RFE always has a recall of 100%.
This is desirable in practice because, when it comes to feature selection, we care more about the quality of selected features (i.e. precision) than selecting a large number of features (i.e. recall).
Make no mistake. Boruta(SHAP) does break down; just less often than RFE does. LeanML is a better alternative to both Boruta(SHAP) and RFE. For a benchmark, see this post.