How To Seamlessly Compress Any Tabular Model in Python

Train your favorite predictive models in Python (e.g. LightGBM, XGBoost, scikit-learn, TensorFlow, and PyTorch models) using at least 80% fewer features, at no performance cost.


TL;DR

In this article, we show you how to train your favorite predictive models in Python using more than 80% fewer features, at no performance cost.

Supported models include, but are not limited to, LightGBM, XGBoost, all tabular models in scikit-learn, deep neural classifiers and regressors implemented using TensorFlow or PyTorch, and AWS' AutoGluon tabular models.

We provide a detailed comparative analysis between Recursive Feature Elimination (RFE), Boruta, and LeanML feature selection, an approach we developed. Our benchmark is based on 38 representative real-world regression and classification problems.

We find that RFE should always be avoided. Boruta is 5 times slower than LeanML, achieves a lower compression rate than LeanML, and underperforms both LeanML and the full model. Finally, LeanML performs similarly to the full model, despite the drastic reduction in the number of features used.

Motivation

Effective predictive modeling requires effective feature construction and effective feature selection.

Feature Construction is concerned with turning raw inputs (i.e. business and customer data as persisted in databases) into representations or features whose relationships to the target variable are (simple enough and) consistent with patterns that Machine Learning models in our toolbox can reliably learn.

While we do not know beforehand which transformations will yield such patterns, we do know which feature transformations make which types of patterns learnable by models in our Machine Learning toolbox.

For instance, extracting seconds, minutes, hour of the day, day of the month, day of the week, and month as separate features from an epoch timestamp makes it easy for most Machine Learning models to learn seasonal patterns.
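For illustration, here is a minimal pandas sketch of such a decomposition (hypothetical helper code, not part of the kxy API):

# Sketch: expand an epoch (seconds) timestamp into calendar features so that
# seasonal patterns become easy for tabular models to pick up.
import pandas as pd

df = pd.DataFrame({'epoch': [1632960000, 1633046400, 1633132800]})
ts = pd.to_datetime(df['epoch'], unit='s')
df['second'] = ts.dt.second
df['minute'] = ts.dt.minute
df['hour'] = ts.dt.hour
df['day_of_month'] = ts.dt.day
df['day_of_week'] = ts.dt.dayofweek
df['month'] = ts.dt.month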

By contrast, it would be very hard for any model in a modern toolbox (e.g. LightGBM, XGBoost, Random Forest, Gaussian Processes and other Bayesian models) to reliably learn seasonal patterns (e.g. weather-related or intra-week effects) directly from epoch timestamps.

Because we might not know a priori what types of patterns our problem exhibits (e.g. whether or not seasonal effects play a key role), Feature Construction should focus on making as many potential types of patterns as possible learnable by models in our toolbox.

As a result, many more features will be generated by Feature Construction than needed. This is where Feature Selection comes in.

For more information about Feature Construction, check out this blog post.

Feature Selection is concerned with removing unnecessary features from a list of candidates.

A feature is deemed unnecessary when it is either uninformative or redundant.

Uninformative Features are features that are not useful to predict the target of interest, whether they are used by themselves, or in conjunction with other candidate features.

Redundant Features are features that may be useful to predict the target, but that add no incremental value when used in conjunction with some other candidate features.

For instance, for most models, distances in feet (resp. temperatures in Fahrenheit) are redundant when the same distances (resp. temperatures) in meters (resp. in Celsius) are also available as features.

Fundamentally, training a predictive model with unnecessary features increases the chance of overfitting, and makes the model harder to explain and costlier to maintain.

In a previous post, we addressed how to remove unnecessary features from a list of candidates. The algorithm we proposed goes as follows:

  • Select as the first feature the feature \(f_1\) that, when used by itself to predict the target \(y\), has the highest achievable performance \(a_1\), as described in this blog.
  • For \(i\) ranging from 1 to \(k-1\), where \(k\) is the total number of candidate features, out of all \((k-i)\) features not yet selected, select as \((i+1)\)-th feature \(f_{i+1}\) the feature that, when used in conjunction with all \(i\) previously selected features, yields the highest increase in the achievable performance \(|a_{i+1}-a_{i}|\). Said differently, \(f_{i+1}\) is the feature that may complement previously selected features \((f_1, \dots, f_i)\) the most to predict the target \(y\).
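The loop below is a schematic sketch of this greedy procedure (not the kxy implementation); achievable_performance is a hypothetical stand-in for the model-free estimate of the highest performance achievable with a given feature set.

# Schematic sketch of the greedy ranking above (not the kxy implementation).
def rank_features(candidate_features, y, achievable_performance):
    selected, performances = [], []
    remaining = list(candidate_features)
    while remaining:
        # Pick the feature that most increases the achievable performance
        # when added to the features already selected.
        best = max(remaining, \
            key=lambda f: achievable_performance(selected + [f], y))
        selected.append(best)
        remaining.remove(best)
        performances.append(achievable_performance(selected, y))
    return selected, performances  # (f_1, ..., f_k) and (a_1, ..., a_k)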

Formally, if we denote by \(j\) the smallest index such that \(a_j=a_k\) then, if \(j < k\), all features \(\{f_{j+1}, \dots, f_k\}\) are unnecessary, and can be safely removed.

More generally, for \(l>i, ~\epsilon_{il} := |a_l-a_i|\) represents the largest boost in performance that features \(\{f_{i+1}, \dots, f_l\}\) can bring about over only using features \(\{f_1, \dots, f_i\}\).

Even when features \(\{f_{i+1}, \dots, f_k\}\)  are not strictly unnecessary (i.e. when \(\epsilon_{ik} \neq 0\)), we still might not want to use features \(\{f_{i+1}, \dots, f_k\}\) if \(\epsilon_{ik} \) is too small.

Indeed, the downside in using \((k-i)\) more features (e.g. increased risk of overfitting, models costlier to train and maintain, models harder to explain, etc.) might very well outweigh the only benefit, namely a potential increase in performance by up to \(\epsilon_{ik}\).

In fact, while the potential performance boost \(\epsilon_{ik}\) is always non-negative, in practice, when \(\epsilon_{ik}\) is too small, training a model with all features might actually result in poorer performance than training the model with features \(\{f_1, \dots, f_i\}\).

To sum up, we should always use the first few features \(f_1, \dots, f_j\), enough to reach a certain fraction of the total achievable performance \(a_k\). Subsequent features should only be added if they actually boost the out-of-sample performance of the specific model by a high enough margin.

We refer to this approach as the LeanML feature selection algorithm. We will discuss where to draw the line and provide more implementation details in the explanation below.

For now, let's illustrate how to wrap this feature selection approach around any predictive model in Python using a concrete example.


Tutorial

For illustration purposes, we will use the UCI Bank Marketing classification dataset, and wrap LeanML feature selection around a LightGBM classifier.

Code

# 0. As a one-off, run 'pip install kxy', then 'kxy configure'
# This import is necessary to get all df.kxy.* methods
import kxy

# 1. Load your data
# pip install kxy_datasets
from kxy_datasets.classifications import BankMarketing
dataset = BankMarketing()
target_column = dataset.y_column
df = dataset.df

# 2. Generate candidate features
features_df = df.kxy.generate_features(entity=None, max_lag=None,\
    entity_name='*', exclude=[target_column])
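# Note: the binary target 'y' was one-hot encoded into 'y_yes'/'y_no' during
# feature generation, so we keep one of the two columns as the target and
# drop its redundant complement.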
features_df = features_df.drop('y_yes', axis=1)
target_column = 'y_no'

# 3. Training/Testing split
# pip install scikit-learn
from sklearn.model_selection import train_test_split
train_features_df, test_features_df = train_test_split(features_df, \
	test_size=0.2, random_state=0)
test_labels_df = test_features_df.loc[:, [target_column]]
test_features_df = test_features_df.drop(target_column, axis=1)

# 4. Create a LightGBM learner function.

# A learner function is a function that expects up to two optional 
# variables: n_vars and path. When called it returns an instance of 
# 'predictive model' expecting n_vars features. The path parameter, 
# when provided, allows the learner function to load a saved model 
# from disk. 

# A 'predictive model' here is any class with a fit(self, x, y) method 
# and predict(self, x) method. To use the path argument of the learner 
# function, the class should also define a save(self, path) method to 
# save a model to disk, and a load(cls, path) class method to load a 
# saved model from disk. 

# See kxy.learning.base_learners for helper functions that allow you 
# to create learner functions that return instances of popular predictive 
# models (e.g. lightgbm, xgboost, sklearn, tensorflow, pytorch models 
# etc.).

from kxy.learning import get_lightgbm_learner_learning_api
params = {
	'objective': 'binary',
	'metric': ['auc', 'binary_logloss'],
}
lightgbm_learner_func = get_lightgbm_learner_learning_api(params, \
	num_boost_round=10000, early_stopping_rounds=50, verbose_eval=50)
    
# 5. Fit a LightGBM classifier wrapped around LeanML feature selection
results = train_features_df.kxy.fit(target_column, \
	lightgbm_learner_func, problem_type='classification', \
	feature_selection_method='leanml')
predictor = results['predictor']

# 6. Make predictions from a dataframe of test features
test_predictions_df = predictor.predict(test_features_df)

# 7. Compute out-of-sample accuracy and AUC
from sklearn.metrics import accuracy_score, roc_auc_score
accuracy = accuracy_score(
    test_labels_df[target_column].values, \
    test_predictions_df[target_column].values, \
)
auc = roc_auc_score( \
    test_labels_df[target_column].values, \
    test_predictions_df[target_column].values, \
    multi_class='ovr'
)

print('LeanML -- Testing Accuracy: %.2f, Testing AUC: %.2f' % (accuracy, auc))
selected_features = predictor.selected_variables
print('LeanML -- Selected Variables:')
import pprint as pp
pp.pprint(selected_features)

# 8. (Optional) Save the trained model.
path = './lightgbm_uci_bank_marketing.sav'
predictor.save(path)

# 9. (Optional) Load the saved model.
from kxy.learning.leanml_predictor import LeanMLPredictor
loaded_predictor = LeanMLPredictor.load(path, lightgbm_learner_func)

Output

LeanML -- Testing Accuracy: 0.92, Testing AUC: 0.77
LeanML -- Selected Variables:
['duration',
 'euribor3m.ABS(* - MEAN(*))',
 'age',
 'campaign',
 'job_blue-collar',
 'housing_yes',
 'pdays',
 'poutcome_nonexistent',
 'nr.employed.ABS(* - MEDIAN(*))',
 'duration.ABS(* - Q75(*))',
 'loan_no',
 'nr.employed.ABS(* - Q25(*))']

Explanation

The key step in the code above is step 5. The method df.kxy.fit is available on any pandas DataFrame object df, so long as you import the package kxy.

df.kxy.fit trains the model of your choosing using df as training data while performing LeanML feature selection.

To be specific, it has the following signature:

kxy.fit(self, target_column, learner_func, problem_type=None, \
    snr='auto', train_frac=0.8, random_state=0, max_n_features=None, \
    min_n_features=None, start_n_features=None, anonymize=False, \
    benchmark_feature=None, missing_value_imputation=False, \
    score='auto', n_down_perf_before_stop=3, \
    regression_baseline='mean', additive_learning=False, \
    regression_error_type='additive', return_scores=False, \
    start_n_features_perf_frac=0.9, feature_selection_method='leanml',\
    rfe_n_features=None, boruta_pval=0.5, boruta_n_evaluations=20, \
    max_duration=None, val_performance_buffer=0.0)

Core Parameters

target_column should be the name of the column that contains target values. All other columns are considered candidate features/explanatory variables.

learner_func can be any function that returns an instance of the predictive model of your choosing. The predictive model can be any class that defines methods fit(self, X, y) and predict(self, X), where X and y are NumPy arrays. Most existing libraries can be leveraged to construct learner functions.

In fact, in the submodule kxy.learning.base_learners we provide a range of functions that allow you to construct valid learner_func functions that return models from popular libraries (e.g. sklearn, lightgbm, xgboost, tensorflow, pytorch etc.) with frozen hyper-parameters.

One such function is the function get_lightgbm_learner_learning_api we used above, which returns a learner_func instantiating LightGBM predictive models using the LightGBM training API.
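As a point of reference, a custom learner_func can be as simple as the hedged sketch below; scikit-learn estimators already expose fit(X, y) and predict(X), so they satisfy the 'predictive model' interface out of the box (the optional path handling and the save/load methods needed for model persistence are omitted here).

# Minimal sketch of a custom learner_func (hypothetical, for illustration).
# It ignores n_vars and path and simply returns a scikit-learn model.
from sklearn.ensemble import RandomForestClassifier

def random_forest_learner_func(n_vars=None, path=None):
    return RandomForestClassifier(n_estimators=100, random_state=0)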

The parameter problem_type should be either 'classification' or 'regression' depending on the nature of the predictive problem. When it is None (the default), it is inferred from the data.

The LeanML Feature Selection Algorithm

When df.kxy.fit is called with feature_selection_method 'leanml' (the default), a fraction train_frac of rows are used for training, while the remaining rows are used for validation.

We run the model-free variable selection algorithm of this blog post, which we recall is available through the method df.kxy.variable_selection. This gives us the aforementioned sequence of top features and the highest performances achievable using them, \(\{(f_1, a_1), \dots, (f_k, a_k)\}\), where each \(a_i\) is the highest \(R^2\) achievable using features \((f_1, \dots, f_i)\).

For a discussion on how the \(R^2\) is defined for classification problems, check out this paper.

Next, we find the smallest index \(j\) such that the associated highest \(R^2\) achievable, namely \(a_j\), is greater than the fraction start_n_features_perf_frac of the highest \(R^2\) achievable using all features, namely \(a_k\).

We then train the model of your choosing (as per learner_func) on the training set with the initial set of features \((f_1, \dots, f_j)\), and we record its performance (\(R^2\) for regression problems and classification accuracy for classification problems) on the validation set as the benchmark validation performance. We then consider adding the remaining features one at a time, in increasing order of index, starting with \(f_{j+1}\).

Specifically, for \(i\) ranging from \(j+1\) to \(k\), we train a new model with features \((f_1, \dots, f_i)\) and compute its validation performance. If the computed validation performance exceeds the benchmark validation performance (by val_performance_buffer or more), then we accept features \((f_1, \dots, f_i)\) as the new set of features, and we set the associated validation performance as the new benchmark validation performance to beat.

If the validation performance does not exceed the benchmark (by a margin of at least val_performance_buffer) n_down_perf_before_stop consecutive times (3 by default), then we stop and we keep the last accepted features set and its associated trained model.

We also stop adding features when a hard cap on the number of features is provided through the parameter max_n_features and \(i\) exceeds it.
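Putting the steps above together, here is a schematic sketch of the loop (not the kxy source code); train_model and validation_performance are hypothetical helpers that train learner_func on a given feature subset and score it on the validation set.

# Schematic sketch of the LeanML loop described above (not the kxy source).
def leanml_select(ranked_features, j, train_model, validation_performance, \
        val_performance_buffer=0.0, n_down_perf_before_stop=3, \
        max_n_features=None):
    # Start with the initial feature set (f_1, ..., f_j) and its benchmark.
    selected = list(ranked_features[:j])
    best_model = train_model(selected)
    benchmark = validation_performance(best_model)
    n_down = 0
    for i in range(j + 1, len(ranked_features) + 1):
        if max_n_features is not None and i > max_n_features:
            break
        candidate = list(ranked_features[:i])  # (f_1, ..., f_i)
        model = train_model(candidate)
        perf = validation_performance(model)
        if perf >= benchmark + val_performance_buffer:
            # Accept the larger feature set and raise the benchmark.
            selected, best_model, benchmark = candidate, model, perf
            n_down = 0
        else:
            n_down += 1
            if n_down >= n_down_perf_before_stop:
                break
    return selected, best_model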

For more implementation details, check out the source code.


Empirical Evaluation

The kxy package also allows you to seamlessly wrap Recursive Feature Elimination (RFE) and Boruta around any predictive model in Python.

This is done by simply changing the feature_selection_method argument from 'leanml' to 'boruta' or 'rfe'. You may also set feature_selection_method to 'none' or None to train a model without feature selection in a consistent fashion.

To use RFE as the feature selection method, the number of features to keep should be provided through the parameter rfe_n_features.

When using Boruta, boruta_n_evaluations is the parameter that controls the number of random trials/rounds to run.

boruta_pval is the confidence level used to determine the number of hits above which we reject the null hypothesis \(H_0\) that we do not know whether a feature is important, in favor of the alternative \(H_1\) that the feature is more likely to be important than not.

In other words, features that are selected by the Boruta method are features whose number of hits \(h\) is such that \(\mathbb{P}(n>h) \leq 1-\) boruta_pval, where \(n\) is a Binomial random variable with boruta_n_evaluations trials, each with success probability 1/2.
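To make this acceptance rule concrete, here is a small standalone illustration (not the kxy implementation) of the minimum number of hits it implies, using the tutorial settings boruta_n_evaluations=20 and boruta_pval=0.95:

# Smallest number of hits h such that P(n > h) <= 1 - boruta_pval,
# with n ~ Binomial(boruta_n_evaluations, 1/2).
from scipy.stats import binom

boruta_n_evaluations = 20
boruta_pval = 0.95
min_hits = next(h for h in range(boruta_n_evaluations + 1) \
    if binom.sf(h, boruta_n_evaluations, 0.5) <= 1 - boruta_pval)
print(min_hits)  # 14: a feature needs at least 14 hits out of 20 to be kept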

Here's an illustration of how to wrap Boruta and RFE around LightGBM using our previous Bank Marketing example.

Code

# 10.a Fit a LightGBM classifier wrapped around RFE feature selection
n_leanml_features = len(selected_features)
rfe_results = train_features_df.kxy.fit(target_column, \
	lightgbm_learner_func, problem_type='classification', \
	feature_selection_method='rfe', rfe_n_features=n_leanml_features)
rfe_predictor = rfe_results['predictor']

# 10.b Fit a LightGBM classifier wrapped around Boruta feature 
# selection.
boruta_results = train_features_df.kxy.fit(target_column, \
	lightgbm_learner_func, problem_type='classification', \
	feature_selection_method='boruta', boruta_n_evaluations= 20, \
    boruta_pval=0.95)
boruta_predictor = boruta_results['predictor']

# 10.c Fit a LightGBM classifier without any feature selection,
# for comparison.
none_results = train_features_df.kxy.fit(target_column, \
	lightgbm_learner_func, problem_type='classification', \
	feature_selection_method=None)
none_predictor = none_results['predictor']

# 11. Make predictions from a dataframe of test features
rfe_test_predictions_df = rfe_predictor.predict(test_features_df)
boruta_test_predictions_df = boruta_predictor.predict(test_features_df)
none_test_predictions_df = none_predictor.predict(test_features_df)

# 12. Compute out-of-sample accuracy and AUC
rfe_accuracy = accuracy_score(
    test_labels_df[target_column].values, \
    rfe_test_predictions_df[target_column].values, \
)
rfe_auc = roc_auc_score( \
    test_labels_df[target_column].values, \
    rfe_test_predictions_df[target_column].values, \
    multi_class='ovr'
)

boruta_accuracy = accuracy_score(
    test_labels_df[target_column].values, \
    boruta_test_predictions_df[target_column].values, \
)
boruta_auc = roc_auc_score( \
    test_labels_df[target_column].values, \
    boruta_test_predictions_df[target_column].values, \
    multi_class='ovr'
)

none_accuracy = accuracy_score(
    test_labels_df[target_column].values, \
    none_test_predictions_df[target_column].values, \
)
none_auc = roc_auc_score( \
    test_labels_df[target_column].values, \
    none_test_predictions_df[target_column].values, \
    multi_class='ovr'
)

print('RFE -- Accuracy: %.2f, AUC: %.2f' % (rfe_accuracy, rfe_auc))
rfe_selected_features = rfe_predictor.selected_variables
print('RFE -- Selected Variables:')
pp.pprint(rfe_selected_features)
print()

print('Boruta -- Accuracy: %.2f, AUC: %.2f' % (boruta_accuracy, \
	boruta_auc))
boruta_selected_features = boruta_predictor.selected_variables
print('Boruta -- Selected Variables:')
pp.pprint(boruta_selected_features)
print()

print('No Feature Selection -- Accuracy: %.2f, AUC: %.2f' % (none_accuracy, \
	none_auc))
all_features = none_predictor.selected_variables
print('No Feature Selection -- Selected Variables:')
pp.pprint(all_features)

Output

RFE -- Accuracy: 0.77, AUC: 0.51
RFE -- Selected Variables:
['euribor3m',
 'duration',
 'cons.price.idx.ABS(* - MEDIAN(*))',
 'euribor3m.ABS(* - MEAN(*))',
 'duration.ABS(* - MEAN(*))',
 'age.ABS(* - MEAN(*))',
 'age',
 'pdays',
 'euribor3m.ABS(* - Q25(*))',
 'duration.ABS(* - Q75(*))',
 'campaign',
 'duration.ABS(* - MEDIAN(*))']

Boruta -- Accuracy: 0.92, AUC: 0.77
Boruta -- Selected Variables:
['euribor3m.ABS(* - MEAN(*))',
 'euribor3m',
 'duration.ABS(* - MEDIAN(*))',
 'duration.ABS(* - MEAN(*))',
 'duration',
 'pdays',
 'duration.ABS(* - Q75(*))',
 'euribor3m.ABS(* - Q25(*))',
 'emp.var.rate',
 'cons.price.idx.ABS(* - MEDIAN(*))',
 'age.ABS(* - MEAN(*))']
 
No Feature Selection -- Accuracy: 0.92, AUC: 0.77
No Feature Selection -- Selected Variables:
['age',
 'duration',
 'campaign',
 'pdays',
 'previous',
 'emp.var.rate',
 'cons.price.idx',
 'cons.conf.idx',
 'euribor3m',
 'nr.employed',
 'age.ABS(* - MEAN(*))',
 'age.ABS(* - MEDIAN(*))',
 'age.ABS(* - Q25(*))',
 'age.ABS(* - Q75(*))',
 'duration.ABS(* - MEAN(*))',
 'duration.ABS(* - MEDIAN(*))',
 'duration.ABS(* - Q25(*))',
 'duration.ABS(* - Q75(*))',
 'campaign.ABS(* - MEAN(*))',
 'campaign.ABS(* - MEDIAN(*))',
 'campaign.ABS(* - Q25(*))',
 'campaign.ABS(* - Q75(*))',
 'pdays.ABS(* - MEAN(*))',
 'pdays.ABS(* - MEDIAN(*))',
 'pdays.ABS(* - Q25(*))',
 'pdays.ABS(* - Q75(*))',
 'previous.ABS(* - MEAN(*))',
 'previous.ABS(* - MEDIAN(*))',
 'previous.ABS(* - Q25(*))',
 'previous.ABS(* - Q75(*))',
 'emp.var.rate.ABS(* - MEAN(*))',
 'emp.var.rate.ABS(* - MEDIAN(*))',
 'emp.var.rate.ABS(* - Q25(*))',
 'emp.var.rate.ABS(* - Q75(*))',
 'cons.price.idx.ABS(* - MEAN(*))',
 'cons.price.idx.ABS(* - MEDIAN(*))',
 'cons.price.idx.ABS(* - Q25(*))',
 'cons.price.idx.ABS(* - Q75(*))',
 'cons.conf.idx.ABS(* - MEAN(*))',
 'cons.conf.idx.ABS(* - MEDIAN(*))',
 'cons.conf.idx.ABS(* - Q25(*))',
 'cons.conf.idx.ABS(* - Q75(*))',
 'euribor3m.ABS(* - MEAN(*))',
 'euribor3m.ABS(* - MEDIAN(*))',
 'euribor3m.ABS(* - Q25(*))',
 'euribor3m.ABS(* - Q75(*))',
 'nr.employed.ABS(* - MEAN(*))',
 'nr.employed.ABS(* - MEDIAN(*))',
 'nr.employed.ABS(* - Q25(*))',
 'nr.employed.ABS(* - Q75(*))',
 'job_admin.',
 'job_blue-collar',
 'job_entrepreneur',
 'job_housemaid',
 'job_management',
 'job_retired',
 'job_self-employed',
 'job_services',
 'job_student',
 'job_technician',
 'job_unemployed',
 'job_unknown',
 'marital_divorced',
 'marital_married',
 'marital_single',
 'marital_unknown',
 'education_basic.4y',
 'education_basic.6y',
 'education_basic.9y',
 'education_high.school',
 'education_illiterate',
 'education_professional.course',
 'education_university.degree',
 'education_unknown',
 'default_no',
 'default_unknown',
 'default_yes',
 'housing_no',
 'housing_unknown',
 'housing_yes',
 'loan_no',
 'loan_unknown',
 'loan_yes',
 'contact_cellular',
 'contact_telephone',
 'month_apr',
 'month_aug',
 'month_dec',
 'month_jul',
 'month_jun',
 'month_mar',
 'month_may',
 'month_nov',
 'month_oct',
 'month_sep',
 'day_of_week_fri',
 'day_of_week_mon',
 'day_of_week_thu',
 'day_of_week_tue',
 'day_of_week_wed',
 'poutcome_failure',
 'poutcome_nonexistent',
 'poutcome_success']

Benchmark Setup

We compared wrapping LeanML, Boruta and RFE around LightGBM to solve 38 real-world classification and regression problems from the UCI ML datasets repository and Kaggle.

We used the kxy package for feature construction. Overall, the number of candidate features ranges from 15 to 1925, and the number of rows ranges from 303 to 583,250.

The full list of datasets is provided below.

 # | Dataset | Number of Features | Number of Rows | Problem Type | Source
 0 | SkinSegmentation | 15 | 245057 | classification | UCI
 1 | BankNote | 20 | 1372 | classification | UCI
 2 | PowerPlant | 20 | 9568 | regression | UCI
 3 | AirFoil | 25 | 1503 | regression | UCI
 4 | YachtHydrodynamics | 30 | 308 | regression | UCI
 5 | RealEstate | 30 | 414 | regression | UCI
 6 | Abalone | 38 | 4177 | regression | UCI
 7 | Concrete | 40 | 1030 | regression | UCI
 8 | EnergyEfficiency | 45 | 768 | regression | UCI
 9 | WaterQuality | 45 | 3276 | classification | Kaggle
10 | Shuttle | 45 | 58000 | classification | UCI
11 | MagicGamma | 50 | 19020 | classification | UCI
12 | Avila | 50 | 20867 | classification | UCI
13 | WhiteWineQuality | 55 | 4898 | regression | UCI
14 | HeartAttack | 65 | 303 | classification | Kaggle
15 | HeartDisease | 65 | 303 | classification | Kaggle
16 | AirQuality | 70 | 8991 | regression | UCI
17 | EEGEyeState | 70 | 14980 | classification | UCI
18 | LetterRecognition | 80 | 20000 | classification | UCI
19 | NavalPropulsion | 85 | 11934 | regression | UCI
20 | BikeSharing | 90 | 17379 | regression | UCI
21 | DiabeticRetinopathy | 95 | 1151 | classification | UCI
22 | BankMarketing | 103 | 41188 | classification | UCI
23 | Parkinson | 105 | 5875 | regression | UCI
24 | CardDefault | 115 | 30000 | classification | UCI
25 | Landsat | 180 | 6435 | classification | UCI
26 | Adult | 202 | 48843 | classification | UCI
27 | SensorLessDrive | 240 | 58509 | classification | UCI
28 | FacebookComments | 265 | 209074 | regression | UCI
29 | OnlineNews | 290 | 39644 | regression | UCI
30 | SocialMediaBuzz | 385 | 583250 | regression | UCI
31 | Superconductivity | 405 | 21263 | regression | UCI
32 | HousePricesAdvanced | 432 | 1460 | regression | Kaggle
33 | YearPredictionMSD | 450 | 515345 | regression | UCI
34 | APSFailure | 850 | 76000 | classification | UCI
35 | BlogFeedback | 1400 | 60021 | regression | UCI
36 | Titanic | 1754 | 891 | classification | Kaggle
37 | CTSlices | 1925 | 53500 | regression | UCI

Each dataset is split in two: 80% is used for training and feature selection, and the remaining 20% is used for testing.

Benchmark Results

We compare feature selection methods from the perspective of model size, performance, and training duration. A good feature selection method should select as few features as possible, with little to no performance reduction, and without requiring excessive compute resources.

Model Size

We compare the number of features selected by each feature selection method. We also compare compression rates, defined as the fraction of candidate features a feature selection method did not select. For RFE, we use the same number of features as LeanML.
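For a single run, the compression rate can be computed directly from the tutorial objects above (a sketch; train_features_df still contains the target column, hence the minus one):

# Compression rate = fraction of candidate features not selected.
n_candidates = train_features_df.shape[1] - 1  # exclude the target column
compression_rate = 1.0 - len(selected_features) / n_candidates
print('Compression rate: %.2f' % compression_rate)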

Results are illustrated in Figures 1, 2, 3, and 4 below.

LeanML is able to reduce the number of candidate features by 82% on average, compared to 78% for Boruta. Additionally, the number of features selected by LeanML rarely exceeds 20 and never exceeds 45.

Boruta on the other hand selected as many as 140 features. More importantly, it can be seen in Fig. 4 that Boruta almost always selected more features than LeanML, and selected up to 100 more features at times.

Fig. 1: Boxplots of compression rates of the LeanML and Boruta feature selection methods wrapped around LightGBM on 38 classification and regression datasets.
Fig. 2: Histograms of compression rates of the LeanML and Boruta feature selection methods wrapped around LightGBM on 38 real-world classification and regression datasets.
Fig. 3: Histograms of the number of features selected by the LeanML and Boruta feature selection methods wrapped around LightGBM on 38 real-world classification and regression datasets.
Fig. 4: Histogram of the difference between the number of features selected by Boruta and the number of features selected by LeanML wrapped around LightGBM on 38 real-world classification and regression datasets. Positive means Boruta selected more features than LeanML.

Performance

As the performance metric, we use the \(R^2\) for regression problems and the AUC for classification problems. These are evaluated on the testing set. See Figures 5 and 6 for results.

It can be seen that LeanML feature selection outperforms both Boruta and RFE out-of-sample, despite using far fewer features. We also see that, on average, LeanML achieves the same performance as the model trained without feature selection while using 82% fewer features, and it often even improves performance!

Fig. 5: Boxplots of performances of the LeanML, Boruta, and RFE feature selection methods wrapped around LightGBM on 38 classification and regression datasets. For classification problems, the performance is the testing AUC, and for regression problems, the performance is the testing \(R^2\).
Fig. 6: Histograms of performances of the LeanML, Boruta, and RFE feature selection methods wrapped around LightGBM on 38 classification and regression datasets. For classification problems, the performance is the testing AUC, and for regression problems, the performance is the testing \(R^2\).

Training Duration

We compare the training duration (in seconds) of each method divided by the number of candidate features.

Results are illustrated in Figures 7 and 8, where it can be seen that, on average, training with LeanML feature selection took about 0.15 seconds per candidate feature, compared to 0.75 seconds for Boruta and 1.39 seconds for RFE.

Fig. 7: Boxplots of training durations of the LeanML, Boruta, and RFE feature selection methods wrapped around LightGBM on 38 classification and regression datasets. Durations are normalized by the number of candidate features.
Fig. 8: Histograms of training durations of the LeanML, Boruta, and RFE feature selection methods wrapped around LightGBM on 38 classification and regression datasets. Durations are normalized by the number of candidate features.

Beyond LightGBM

We ran the same experiments using XGBoost and Random Forest. Results are illustrated in the tables and figures below.

It can be seen that, out of the three feature selection methods, RFE is by far the worst! RFE is consistently the least accurate, and it can be up to 100 times slower than LeanML.

Boruta tends to be as accurate as LeanML on average, but it is 5 to 12 times slower and has a much lower compression rate, especially when wrapped around Random Forest. When wrapped around Random Forest, Boruta never selected fewer features than LeanML did, and in some instances it selected up to 700 more features than LeanML.

As for LeanML, on average it consistently performs as well as the model trained without feature selection, with 82% fewer features! When wrapped around Random Forest, LeanML is on average twice as fast as training without feature selection.

Table 1: Compression rate of LeanML, Boruta and RFE (mean ± std. error), on 38 classification and regression datasets using LightGBM, XGBoost and Random Forest. The compression rate is defined as the fraction of features removed by the feature selection method.
Model | No Feature Selection | LeanML | Boruta | RFE
LightGBM | 0.00 ± 0.00 | 0.82 ± 0.03 | 0.78 ± 0.04 | -
Random Forest | 0.00 ± 0.00 | 0.82 ± 0.03 | 0.51 ± 0.06 | -
XGBoost | 0.00 ± 0.00 | 0.82 ± 0.03 | 0.76 ± 0.04 | -

Table 2: Testing performance (mean ± std. error) obtained by wrapping LeanML, Boruta and RFE around LightGBM, XGBoost and Random Forest on 38 real-world classification and regression problems. The performance metric used is the AUC for classification problems and the \(R^2\) for regression problems.
Model | No Feature Selection | LeanML | Boruta | RFE
LightGBM | 0.82 ± 0.04 | 0.81 ± 0.04 | 0.78 ± 0.04 | 0.36 ± 0.05
Random Forest | 0.79 ± 0.04 | 0.79 ± 0.04 | 0.79 ± 0.04 | 0.35 ± 0.05
XGBoost | 0.83 ± 0.04 | 0.82 ± 0.04 | 0.81 ± 0.04 | 0.41 ± 0.06

Table 3: Training duration in seconds, divided by the number of candidate features, of LightGBM, XGBoost and Random Forest, around which we wrapped LeanML, Boruta, and RFE feature selection, on 38 real-world classification and regression problems.
Model | No Feature Selection | LeanML | Boruta | RFE
LightGBM | 0.03 ± 0.01 | 0.15 ± 0.03 | 0.75 ± 0.22 | 1.39 ± 0.35
Random Forest | 0.53 ± 0.29 | 0.26 ± 0.06 | 3.27 ± 0.82 | 2.81 ± 0.63
XGBoost | 0.07 ± 0.03 | 0.18 ± 0.05 | 1.52 ± 0.39 | 2.12 ± 0.51

Table 4: Number of additional features selected by Boruta and RFE over LeanML feature selection when wrapped around LightGBM, XGBoost and Random Forest, on 38 real-world classification and regression problems. The 'No Feature Selection' column corresponds to the number of features removed by LeanML.
Model | No Feature Selection | LeanML | Boruta | RFE
LightGBM | 255 ± 72 | 0 ± 0 | 6 ± 4 | -
Random Forest | 255 ± 73 | 0 ± 0 | 53 ± 20 | -
XGBoost | 254 ± 72 | 0 ± 0 | 8 ± 5 | -

Fig. 9: Illustration of the tradeoffs between average compression rate and average performance on the one hand and between average compression rate and training duration on the other, of LightGBM on 38 real-world classification and regression problems, without feature selection and with LeanML, Boruta, or RFE feature selection. The closer a point is to the top-right (resp. top-left) corner on the plot on the left (resp. right), the better. RFE was implemented using the same number of features as selected by LeanML.
Fig. 10: Illustration of the tradeoffs between average compression rate and average performance on the one hand and between average compression rate and training duration on the other, of XGBoost on 38 real-world classification and regression problems, without feature selection and with LeanML, Boruta, or RFE feature selection. The closer a point is to the top-right (resp. top-left) corner on the plot on the left (resp. right), the better. RFE was implemented using the same number of features as selected by LeanML.
Fig. 11: Illustration of the tradeoffs between average compression rate and average performance on the one hand and between average compression rate and training duration on the other, of Random Forest on 38 real-world classification and regression problems, without feature selection and with LeanML, Boruta, or RFE feature selection. The closer a point is to the top-right (resp. top-left) corner on the plot on the left (resp. right), the better. RFE was implemented using the same number of features as selected by LeanML.

Conclusion

We showed how to seamlessly wrap LeanML, Boruta, and RFE feature selection around popular predictive models in Python, including but not limited to lightgbm, xgboost, tensorflow, and sklearn models.

Using 38 real-world classification and regression problems, we showed that LeanML feature selection cuts the number of features used by 82% on average, at no performance cost.

By contrast, RFE can be up to 100 times slower than LeanML on average and performs significantly worse! RFE's notably poor performance is perhaps not surprising, as RFE tends to exacerbate overfitting. Indeed, when a model is overfitted, it is often because of [...] the most important features! Thus, removing the least important features and keeping the most important ones can only amplify the problem.

As for Boruta, while it performs similarly to LeanML on average, it can be up to 12 times slower than LeanML on average, and almost always selects more features than LeanML, often hundreds.

Note: The source code to reproduce all experiments can be found here.