The Productive Machine Learning Engineer (https://blog.kxy.ai/)

How I Got In The Top 1% of A Kaggle Competition With kxy And No Hyper-Parameter Tuning
https://blog.kxy.ai/how-i-got-in-the-top-1-of-a-kaggle-competition-with-kxy/ | Mon, 09 May 2022

The G-Research Crypto Forecasting Kaggle competition was my first Kaggle competition using the kxy package, and I managed to finish 26th out of 1946 teams with LightGBM, no hyper-parameter tuning, and only 2 submissions (one test and one real)!

In this post I share my solution and explain why the kxy package was key.

The Competition

The aim of the competition was to predict the price move (net of market trends) of some cryptocurrencies over the next 15 minutes, using historical trade data aggregated in minute bars up to now.

The Data

Specifically, each row  of the training data CSV file summarizes trades in a cryptocurrency that took place in a given minute.

Here's the full list of attributes of each row:

  • timestamp - A timestamp for the minute covered by the row.
  • Asset_ID - An ID code for the cryptocurrency.
  • Count - The number of trades that took place this minute.
  • Open - The USD price at the beginning of the minute.
  • High - The highest USD price during the minute.
  • Low - The lowest USD price during the minute.
  • Close - The USD price at the end of the minute.
  • Volume - The number of cryptocurrency units traded during the minute.
  • VWAP - The volume weighted average price for the minute.
  • Target - The residualized log-return over the 15 minutes following the current minute. For details of how the target is calculated see here.

Additional information provided about each cryptocurrency includes:

  • Asset_ID - An ID code for the cryptocurrency.
  • Name - The real name of the cryptocurrency associated to Asset_ID.
  • Weight - The weight that the cryptocurrency associated to Asset_ID receives in the evaluation metric.

The Evaluation Criteria

Teams were ranked based on the weighted average (across cryptocurrencies) of the Pearson correlation between their predictions of Target and the ground truth during the live testing phase. Pearson correlations were weighted using the Weight column above.
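To make the scoring concrete, here is a rough sketch of that logic in Python; it follows the description above (the official score is computed by Kaggle over the live test period), and the column names and function name are mine.

import numpy as np
import pandas as pd

def weighted_pearson_score(predictions_df, weights):
    # predictions_df columns: Asset_ID, Target (ground truth), Prediction
    # weights: dict mapping Asset_ID to its Weight in the evaluation metric
    corrs = predictions_df.groupby('Asset_ID').apply(
        lambda g: np.corrcoef(g['Target'], g['Prediction'])[0, 1])
    return np.average(corrs, weights=corrs.index.map(weights))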

My Solution

My solution relied on three essential building blocks: feature construction, training data resampling, and effective feature selection.

Feature Construction

The goal of feature construction is to turn raw data (i.e. Open, High, Low, Close, Volume, VWAP at a given time, as well as their past values) into representations that can potentially have much simpler relationships to the target; for instance, features that we reasonably believe can have a monotonic relationship to the target, its magnitude (i.e. absolute value), or its sign.

Atemporal Features

I began by constructing features from the last trading bar that are more likely than raw attributes such as Open, High, Low, Close and VWAP to reveal the direction and/or the magnitude of future market moves.

  • UPS = High-max(Close, Open) - Upper shadow; i.e. how high did the price get in the minute bar relative to where it started and closed. High values of this feature could indicate short-term reversions, low values could indicate short-term momentum/trend.
  • LOS = min(Close, Open)-Low - Lower shadow; i.e. how low did the price get in the minute bar relative to where it started and closed. High values of this feature could indicate short-term reversions, low values could indicate short-term momentum/trend.
  • RNG = (High-Low)/VWAP - High-low range; i.e. how did the price fluctuate in the minute bar. This feature could be perceived as a volatility indicator. High (resp. low) values could indicate that the target might be high (resp. low) in absolute value.
  • MOV = (Close-Open)/VWAP - Price move in the minute bar. This feature could be perceived as a momentum/trend indicator.  High (resp. low) values could indicate that the target might be high (resp. low).
  • LOGRETCO = log(Close/Open) - Log-return Open-to-Close. This feature could be perceived as a momentum/trend indicator.  High (resp. low) values could indicate that the target might be high (resp. low).
  • RANKLOGRETCO = LOGRETCO.groupby('timestamp').rank('dense') - How the Open-to-Close log-return of the cryptocurrency the row pertains to, ranks relative to the Open-to-Close log-returns of other cryptocurrencies in the same minute bar.
  • CLS = (Close-VWAP)/VWAP - The last trade price in the bar relative to the average trade price. This could be perceived as a momentum/trend indicator. High (resp. low) values could indicate that prices are trending up (resp. down).
  • LOGVOL = log(1+Volume) - Logarithm of the trade volume. Same as Volume but more robust to outliers. High volume could be an indication that returns will persist in the near future. Low volume and high returns could indicate a future market correction.
  • LOGCNT = log(1+Count) - Logarithm of the number of trades in the bar. Same as Count but more robust to outliers. High count could be an indication that returns will persist in the near future. Low count and high returns could indicate a future market correction.
  • VOL2CNT = Volume/(1+Count) - Average trade size in the bar. Can reveal whether there is a big player trading a large block, which would move the price further.

Temporal Features

Next, for each of the features above, I constructed the following 4 features to summarize values of the feature over the past 15 minutes. The idea was to capture short-term trends and extrema, and to gauge the short-term stability of the feature.

While atemporal features can be indicative of future market moves, these temporal aggregations can add to their forecasting power by providing useful temporal context. A pandas sketch of how both the atemporal and rolling features might be computed follows the list below.

  • *.GROUPBY(Asset_ID).LAST(15).MEAN() - Average of the feature over the past 15 minutes for a specific cryptocurrency.
  • *.GROUPBY(Asset_ID).LAST(15).MAX() - Maximum of the feature over the past 15 minutes for a specific cryptocurrency.
  • *.GROUPBY(Asset_ID).LAST(15).MIN() - Minimum of the feature over the past 15 minutes for a specific cryptocurrency.
  • *.GROUPBY(Asset_ID).LAST(15).MAX-MIN() - Difference between the maximum and the minimum of the feature over the past 15 minutes for a specific cryptocurrency.
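To make the construction concrete, here is a rough pandas sketch of how the atemporal, rolling, and seasonality features might be computed. The column names, the rolling-window convention, and the assumption that timestamp is in epoch seconds are illustrative choices, not necessarily those of my actual pipeline.

import numpy as np
import pandas as pd

def build_features(df):
    # df is assumed to have the raw columns described earlier:
    # timestamp, Asset_ID, Count, Open, High, Low, Close, Volume, VWAP.
    out = df.copy()

    # Atemporal features
    out['UPS'] = out['High'] - out[['Close', 'Open']].max(axis=1)
    out['LOS'] = out[['Close', 'Open']].min(axis=1) - out['Low']
    out['RNG'] = (out['High'] - out['Low']) / out['VWAP']
    out['MOV'] = (out['Close'] - out['Open']) / out['VWAP']
    out['LOGRETCO'] = np.log(out['Close'] / out['Open'])
    out['RANKLOGRETCO'] = out.groupby('timestamp')['LOGRETCO'].rank(method='dense')
    out['CLS'] = (out['Close'] - out['VWAP']) / out['VWAP']
    out['LOGVOL'] = np.log(1. + out['Volume'])
    out['LOGCNT'] = np.log(1. + out['Count'])
    out['VOL2CNT'] = out['Volume'] / (1. + out['Count'])

    # Rolling 15-minute aggregations, computed per asset
    base = ['UPS', 'LOS', 'RNG', 'MOV', 'LOGRETCO', 'RANKLOGRETCO',
            'CLS', 'LOGVOL', 'LOGCNT', 'VOL2CNT']
    grouped = out.sort_values('timestamp').groupby('Asset_ID')
    for col in base:
        rolling = grouped[col].rolling(15, min_periods=1)
        out[col + '.MEAN'] = rolling.mean().reset_index(level=0, drop=True)
        out[col + '.MAX'] = rolling.max().reset_index(level=0, drop=True)
        out[col + '.MIN'] = rolling.min().reset_index(level=0, drop=True)
        out[col + '.MAX-MIN'] = out[col + '.MAX'] - out[col + '.MIN']

    # Seasonality features (timestamp assumed to be in epoch seconds)
    ts = pd.to_datetime(out['timestamp'], unit='s')
    out['HOUR'] = ts.dt.hour
    out['DAYOFWEEK'] = ts.dt.dayofweek
    out['DAY'] = ts.dt.day
    out['MONTH'] = ts.dt.month
    return out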

Seasonality Features

I used the following features to help detect possible seasonal patterns.

  • timestamp.HOUR() - Hour in the day.
  • timestamp.DAYOFWEEK() - The day of the week with Monday=0, Sunday=6.
  • timestamp.DAY() - Day of the month.
  • timestamp.MONTH() - Month in the year.

Asset-Specific Features

I chose to build a single model to trade all cryptocurrencies as I expect the same market players to be active in all cryptocurrencies and, as a result, inefficiencies in all cryptocurrencies should be of similar types. However, I added Asset_ID as a feature to give the model the ability to learn idiosyncratic patterns. This brings the total number of candidate features to 56.

Feature Selection and Model Training

Financial markets, both traditional and decentralized, tend to undergo frequent and aggressive data distribution drifts due to, among other factors, geopolitical and other macroeconomic events.

As such, during training, data that are a few months old should not be given the same importance as recently observed data.

Additionally, care should be taken to avoid using too many features or, equivalently, to incorporate effective feature selection in the modeling pipeline. The more features are used in a model, the more the model is exposed to data distribution drifts.

Training Data

To construct the training data, I first computed all features on the full raw training data. I then proceeded one quarter at a time, in chronological order, starting from 2018Q1. The training data was initialized to the 2018Q1 feature data. For each new quarter from 2018Q2 to 2021Q2, I updated the training data by sampling 80% of the new quarter's data and 20% of the existing training data.

This autoregressive-style sampling scheme ensures that recent observations are overweighted, while the importance of past observations decays exponentially.
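Here is a minimal sketch of that resampling scheme; the function name, the assumption of a DatetimeIndex, and the sampling fractions exposed as arguments are illustrative.

import pandas as pd

def resample_training_data(features_df, start='2018Q1', end='2021Q2', new_frac=0.8, old_frac=0.2, seed=0):
    # features_df is assumed to be indexed by a DatetimeIndex so that it can
    # be sliced by calendar quarter.
    quarters = pd.period_range(start=start, end=end, freq='Q')
    training_df = features_df[features_df.index.to_period('Q') == quarters[0]]
    for quarter in quarters[1:]:
        new_df = features_df[features_df.index.to_period('Q') == quarter]
        training_df = pd.concat([
            new_df.sample(frac=new_frac, random_state=seed),
            training_df.sample(frac=old_frac, random_state=seed),
        ])
    return training_df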

Testing Data

To test the method on a rolling basis, each new quarter can be used as testing data, with the training data as it stood before that quarter's resampling used for training. Ultimately, I only constructed the training data as of the end of 2021Q2 for training and validation, and used 2021Q3 as testing data.

Model Choice

I used LightGBM  with default hyper-parameters.

from kxy.learning import get_lightgbm_learner_learning_api
params = {'objective': 'rmse'}
learner_func = get_lightgbm_learner_learning_api(params)

Feature Selection

I used the kxy package to wrap LeanML feature selection around LightGBM training.

import kxy

# Jointly train the model and select features using 
# 	LeanML feature selection.
target_column = 'Target'
results = training_features_df.kxy.fit(target_column, \
	learner_func, problem_type='regression')
    
# Retrieve the trained predictor
predictor = results['predictor']

# Make predictions
predictions = predictor.predict(testing_features_df)

# Selected features
selected_features = predictor.selected_variables

Out of the original 56 candidate features, here is the list of the 28 features selected by LeanML, in the order they were selected:

  • RANKLOGRETCO.GROUPBY(Asset_ID).LAST(15).MEAN()
  • LOGRETCO.GROUPBY(Asset_ID).LAST(15).MAX()
  • RNG.GROUPBY(Asset_ID).LAST(15).MIN()
  • LOGRETCO.GROUPBY(Asset_ID).LAST(15).MEAN()
  • Asset_ID
  • timestamp.HOUR()
  • timestamp.DAY()
  • VOL2CNT.GROUPBY(Asset_ID).LAST(15).MAX-MIN()
  • VOL2CNT
  • LOGVOL
  • LOGCNT.GROUPBY(Asset_ID).LAST(15).MIN()
  • VOL2CNT.GROUPBY(Asset_ID).LAST(15).MIN()
  • LOGVOL.GROUPBY(Asset_ID).LAST(15).MIN()
  • LOGVOL.GROUPBY(Asset_ID).LAST(15).MEAN()
  • LOGCNT.GROUPBY(Asset_ID).LAST(15).MAX()
  • LOS
  • UPS.GROUPBY(Asset_ID).LAST(15).MAX()
  • UPS.GROUPBY(Asset_ID).LAST(15).MAX-MIN()
  • LOS.GROUPBY(Asset_ID).LAST(15).MAX()
  • LOGCNT.GROUPBY(Asset_ID).LAST(15).MAX-MIN()
  • UPS.GROUPBY(Asset_ID).LAST(15).MIN()
  • VOL2CNT.GROUPBY(Asset_ID).LAST(15).MEAN()
  • LOGCNT.GROUPBY(Asset_ID).LAST(15).MEAN()
  • LOS.GROUPBY(Asset_ID).LAST(15).MAX-MIN()
  • CLS.GROUPBY(Asset_ID).LAST(15).MAX()
  • RNG.GROUPBY(Asset_ID).LAST(15).MAX-MIN()
  • CLS.GROUPBY(Asset_ID).LAST(15).MEAN()
  • MOV

Benefits of the kxy Package

While feature construction and the autoregressive-style resampling scheme contributed to this solution ranking in the top 1% with just one submission, what truly made a difference was the LeanML feature selection in the kxy package.

Throughout the live testing phase of the competition, my submission oscillated between the 92nd and the 163rd place on the public leaderboard, but I finished in 26th place, an indication that dozens of top submissions were penalized by data distribution drifts.

This is a testament to the ability of kxy's LeanML feature selection to mitigate data distribution drifts by drastically shrinking the number of features used, while preserving forecasting power.

See this blog post for an extensive comparison between LeanML and other feature selection methods.

Boruta(SHAP) Does Not Work For The Reason You Think It Does!
https://blog.kxy.ai/boruta-shap-is-as-bad-as-rfe/ | Tue, 03 May 2022

Any feature selection method that relies on feature importance scores is inherently flawed. Nonetheless, in practice, Boruta seems to perform decently most of the time, unlike Recursive Feature Elimination (RFE), for instance.

The aim of this post is to elucidate this apparent mystery and shed some light on the surprising reason why Boruta is not quite as bad as RFE.

But first, let me stress why using feature importance scores for feature selection isn't such a great idea, and then explain how Boruta works (in case you didn't already know).

Feature Importance Scores Should Not Be Used For Feature Selection

Feature selection methods that rely on feature importance scores are inherently flawed!

The main reason why is that a feature importance score quantifies the extent to which model decisions are affected by values of the feature; it does not quantify the extent to which model performance is affected by values of the feature.

High feature importance score does not imply that the feature is useful

Take a random model of your choosing that uses a given set of features (e.g. a randomly constructed CART or neural net, a linear model with randomly selected coefficients etc.).

Every feature will have a score, and some features might actually have much larger scores than others.

However, no matter the type of feature importance metric used (SHAP, gain-based, split-based, permutation etc.), a high feature importance score simply would not imply that the feature is insightful or informative about the target. All we could conclude is that the feature is important to the model!

If the model has poor performance, then chances are that features with the highest importance scores are to blame, and should be removed.

The idea of removing a feature because it has a low feature importance score makes the fundamental assumption that the trained model performs great and, in particular, is not overfitted.

It is only then that a feature with high importance score will also have a high contribution towards model performance (i.e. a high feature usefulness).

In general, feature selection algorithms should remove features with low usefulness, not features with low feature importance scores.

Feature importance scores cannot detect redundant features

In addition to removing features that are not insightful, a good feature selection algorithm should also remove features that are redundant relative to others.

Most families of predictive models, when trained, would treat identical features the same, as if they were interchangeable.

As a result, identical features will typically have the same feature importance score, even though one is a duplicate of the other.

Thus, a feature selection algorithm based on feature importance scores would never be able to select one feature and remove another which is identical to the first and therefore redundant.

The same reasoning holds when features are very similar but not necessarily identical. More generally, there is nothing in feature importance scores that would inherently ensure that redundant features would not have similar feature importance scores.

The Boruta Exception

What Is Boruta(SHAP)

Boruta is a feature selection algorithm whose core idea is to remove features whose feature importance scores are so low that they ought to be useless.

Boruta gives a statistical answer to the following question: what is the threshold below which feature importance scores should be deemed so low that associated features ought to be useless?

To answer this question, Boruta creates new features called 'shadow features', which we know cannot possibly be useful to solve the problem at hand.

Typically, shadow features are constructed by shuffling each existing feature, thereby destroying any association between feature and target.

Formally, thanks to random shuffling, shadow features have the same marginal distributions as original features, but they are independent of the target both unconditionally and conditional on original features.
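As an illustration, shadow features can be created with independent column-wise permutations; this is a minimal sketch, not the exact construction of any particular Boruta implementation.

import numpy as np
import pandas as pd

def add_shadow_features(x_df, random_state=None):
    # Shuffle each column independently: marginal distributions are preserved,
    # but any association with the target is destroyed.
    rng = np.random.default_rng(random_state)
    shadows = x_df.apply(lambda col: rng.permutation(col.values))
    shadows.columns = ['shadow_%s' % c for c in x_df.columns]
    return pd.concat([x_df, shadows], axis=1)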

The notion of a hit

Shadow features are then added to original features, a model is trained using the combined set, and feature importance scores are calculated for each original and shadow feature.

Because shadow features are useless, Boruta considers that no useful feature should have a feature importance score lower than that of a shadow feature.

An original feature is considered to have received a hit when its feature importance score is higher than the scores of all shadow features.

The Boruta statistical test

Given that shadow features are randomly generated, Boruta constructs a statistical hypothesis test to robustify decision making.

Specifically, the procedure described above is run \(k\) independent times, and we count the number of hits \(h_i\) of each feature \(x_i\).

The Null Hypothesis \(H_0\)

We represent the fact that we do not know whether feature \(x_i\) is useful or not before running the test by saying that, in each of the \(k\) runs, there is a 50% chance that feature \(x_i\) will receive a hit (i.e. a probability \(p=0.5\)).

Thus, in the absence of evidence pointing to whether \(x_i\) is useful or not, \(h_i\) should follow a Binomial distribution with parameters \(k\) and \(p\). This forms the null hypothesis \(H_0\) of the Boruta test.

The CDF of a Binomial distribution is well known and allows us to easily compute the two-sided symmetric confidence interval \([m_q(k), M_q(k)]\) in which \(h_i\) should lie with probability \(q\) under \(H_0\).

The Alternate Hypothesis \(H_1\) That Feature \(x_i\) is Useful

When the number of hits \(h_i\) observed after \(k\) runs exceeds \(M_q(k)\) we reject the hypothesis \(H_0\) that we do not know whether feature \(x_i\) is useful or not, in favor of the alternative \(H_1\) that feature \(x_i\) is more likely to be useful than not.

The Alternate Hypothesis \(H_2\) That Feature \(x_i\) is Useless

When the number of hits \(h_i\) observed after \(k\) runs is lower than \(m_q(k)\) we reject the hypothesis \(H_0\) that we do not know whether feature \(x_i\) is useful or not, in favor of the alternative \(H_2\) that feature \(x_i\) is more likely to be useless than not.

Fig. 1: Visualization of the Boruta statistical test for k=20 and at q=95% confidence. A feature is deemed useful when it has at least 14 hits after 20 runs, and it is deemed useless when it has 5 hits or fewer after 20 runs.
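As a sanity check of the numbers in Fig. 1, the thresholds \(m_q(k)\) and \(M_q(k)\) can be computed from the Binomial CDF. The sketch below uses scipy; the exact boundary conventions (strict vs. non-strict inequalities) may differ slightly from any given implementation.

from scipy.stats import binom

k, q, p = 20, 0.95, 0.5
m = binom.ppf((1. - q) / 2., k, p)  # lower threshold m_q(k)
M = binom.ppf((1. + q) / 2., k, p)  # upper threshold M_q(k)

def boruta_decision(h):
    # h: number of hits of a feature after k runs
    if h >= M:
        return 'useful'       # reject H0 in favor of H1
    if h < m:
        return 'useless'      # reject H0 in favor of H2
    return 'inconclusive'     # cannot reject H0

print(m, M)                   # 6.0 14.0 for k=20, q=0.95
print(boruta_decision(14))    # useful (at least 14 hits)
print(boruta_decision(5))     # useless (5 hits or fewer)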
BorutaSHAP

BorutaSHAP is basically Boruta when the feature importance metric used is SHAP values.

Python Implementation

Boruta is implemented in the kxy package. Whenever the method you are calling has a feature_selection_method argument, simply pass 'boruta' as its value.
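For instance, reusing the DataFrame accessor from the previous post, and assuming it exposes the feature_selection_method argument, a usage sketch might look like this (features_df is a hypothetical DataFrame containing the candidate features and the target column 'y'):

import kxy  # registers the .kxy accessor on DataFrames
from kxy.learning import get_sklearn_learner

learner_func = get_sklearn_learner('sklearn.ensemble.RandomForestRegressor')
results = features_df.kxy.fit('y', learner_func, problem_type='regression', \
	feature_selection_method='boruta')
predictor = results['predictor']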

For the source code see here.

The Counterintuitive Reason Why Boruta (Often) Works

The fundamental flaw of Boruta is to assume that a feature is useful just because it has a relatively high feature importance score, and that a feature is useless just because it has a relatively low feature importance score.

This is clearly not the case. Our model could be overfitted to the point of giving useless original features a lot of importance and useful original features very little importance.

What saves Boruta is the fact that the feature importance threshold above which features are deemed useful is a function of the trained model.

In fact, if this threshold were deterministic then, no matter how high we set it, the resulting feature selection algorithm would very easily break down. All it would take is for our trained model to overfit!

Because Boruta defines its threshold as the highest shadow feature score, for Boruta to break down our trained models would need to consistently give the wrong importance to an original feature (i.e. low importance if it is a useful feature, or high importance if it is a useless feature), while at the same time consistently giving the right (i.e. little) importance to all shadow features. Fortunately, the larger \(k\), the less likely this is to happen.

To see why, let us consider two scenarios that would break a feature selection algorithm using a deterministic feature importance threshold.

Scenario I: Completely Random Models

Let's consider the scenario where our model training procedure is completely wrong/random. In this case, there is no reason why any original feature would have a score that is consistently higher or consistently lower than the highest shadow feature importance score.

As such, the number of hits of most original features would always be in \([m_q(k), M_q(k)]\), we would not be able to reject \(H_0\), and Boruta would always be inconclusive, not wrong.

Scenario II: Overfitted Models

Let's consider the scenario where our model training procedure always overfits. For Boruta to wrongly conclude that certain features are useful, our models need to consistently unduly overweight the said features, and give little importance to all shadow features. While this is easily conceivable for \(k=1\) or \(k=2\), the higher the number of runs \(k\), the less likely it is to happen.

On the other hand, if our model overfits in each run but does not necessarily overweight the same features, then, once more, most features will have a number of hits in \([m_q(k), M_q(k)]\), and Boruta will be inconclusive, not wrong.

A Simple Adversarial Attack

Fig. 2: Illustration of BorutaSHAP (k=50, q=0.95) on a regression problem using as candidate features monomial features up to order 7: \((x, x^2, \dots, x^7)\). The only selected feature is \(x^3\), even though a simple linear regression model provides a better fit.

As previously discussed, all it takes for Boruta to break down is for our model training procedure to consistently overfit, overweighting the same original feature(s), and never overweighting shadow features.

The illustration above, which was produced by the code below, provides such an example.

Even though a simple linear regression model is better than any polynomial regression model in this case, Boruta is inconclusive as to whether \(x\) is useful or not; it deems \(x^3\) useful, and it deems all other features \(x^2, x^4, \dots, x^7\) useless.

Code

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# GENERATE THE DATA
data = [
	[-5., -12.5], [-4., -17.], [-3., -14.],
	[-2., -9.], [-1., -6.], [0., -1.], [1., -1.],
	[2., 3.], [3., 5.], [4., 10.], [5., 9.]
]
data = np.array(data)
x = data[:, 0][:, None]
y = data[:, 1][:, None]
z = np.concatenate([x, x**2, x**3, x**4, x**5, x**6, \
	x**7], axis=1)
x_df = pd.DataFrame(z, columns= \
	['x%d' % i for i in range(1, 8)])
y_df = pd.DataFrame(y, columns=['y'])

# RUN BORUTA
from kxy.learning import get_sklearn_learner
from kxy.misc import Boruta

learner_func = get_sklearn_learner(\
	'sklearn.linear_model.LinearRegression')
selector = Boruta(learner_func)
lr_model = selector.fit(x_df, y_df, n_evaluations=50, \
	pval=0.95, max_duration=None)
selected_features = selector.selected_variables
ambiguous_features = selector.ambiguous_variables

print('Selected Features: %s' % \
	str(selected_features))
print('Ambiguous Features: %s' % \
	str(ambiguous_features))


# TRAIN A SIMPLE LINEAR MODEL
simple_lr_model = learner_func()
simple_lr_model.fit(x, y)


# TRAIN A LINEAR MODEL ON ALL MONOMIAL FEATURES
mono_lr_model = learner_func()
mono_lr_model.fit(z, y)


# PLOT DATA
x_test = np.arange(-5., 5.1, 0.1)[:, None]
z_test = np.concatenate([x_test, x_test**2, \
	x_test**3, x_test**4, x_test**5, x_test**6, \
    x_test**7], axis=1)
x_test_df = pd.DataFrame(z_test, columns= \
	['x%d' % i for i in range(1, 8)])
y_test_pred = lr_model.predict(\
	x_test_df[selected_features].values)
simple_y_test_pred = simple_lr_model.predict(x_test)
mono_y_test_pred = mono_lr_model.predict(z_test)

fig, ax = plt.subplots(figsize=(10, 10))
plt.plot(x, y, '.k', label='Training Data')
plt.plot(x_test, y_test_pred, '-b', \
	label='Linear Regression (Monomial Features '\
    'Selected by Boruta)')
plt.plot(x_test, simple_y_test_pred, '-.r', \
	label='Simple Linear Regression')
plt.plot(x_test, mono_y_test_pred, '-c', \
	label='Linear Regression (All Monomial Features)')
plt.legend()
ax.set_ylabel(r'''$y$''')
ax.set_xlabel(r'''$x$''')
plt.savefig('overfitting.jpg', dpi=500, \
	bbox_inches='tight')
plt.show()

Conclusion

For Recursive Feature Elimination (RFE) to break down, all it takes is for trained models to overfit. Boruta(SHAP) requires a little more to break down. Trained models need to overfit, overweighting the same original features, while never overweighting shadow features.

When trained models overfit but do not always overweight the same (original) features, Boruta(SHAP) becomes inconclusive about whether or not a feature is useful.

To use a classification analogy, the increased resilience of Boruta(SHAP) over RFE lies in the fact that, unlike RFE, it is able to increase its precision by lowering its recall, while RFE always has a recall of 100%.

This is desirable in practice because, when it comes to feature selection, we care more about the quality of selected features (i.e. precision) than selecting a large number of features (i.e. recall).

Make no mistake. Boruta(SHAP) does break down; just less often than RFE does. LeanML is a better alternative to both Boruta(SHAP) and RFE. For a benchmark, see this post.

Autoencoders: What Are They, and Why You Should Never Use Them For Pre-Processing
https://blog.kxy.ai/why-you-should-never-use-autoencoders-for-pre-processing/ | Tue, 19 Apr 2022

Autoencoders are essential tools for compression and generative modeling in any modern Machine Learning toolbox.

Unfortunately, their success as an unsupervised pre-training step in vision problems has led some Machine Learning engineers to wrongly consider autoencoders great tools for feature engineering or dimensionality reduction.

In this article, we explain why using autoencoders (AEs) for feature engineering on tabular data can be problematic.

First, let's briefly recall what AEs are.

What Are AEs?

An autoencoder is a module consisting of an encoder and a decoder typically used for compression or generative modeling.

Conceptually, the encoder takes an input vector \(x \in \mathcal{X}\) and generates a lower-dimensional feature or 'code' vector \(c  \in \mathcal{C}\): \(x \to c\). The decoder operates on codes. It uses the code \(c \in \mathcal{C}\) to generate a reconstructed version \(x^\prime \in \mathcal{X}\) of the original input vector \(x\): \(c \to x^\prime \).

The quality of an autoencoder is gauged by a reconstruction error \(\mathcal{E}\) that quantifies how close \(x\) is to its reconstruction \(x^{\prime}\).

Achieving a low reconstruction error despite the code space \(\mathcal{C}\) being lower-dimensional than the input space \(\mathcal{X}\) is a sign that the encoder generates richer representations of the corresponding original inputs.

There are two main types of autoencoders: Traditional AEs and Variational AEs.

Traditional Autoencoders

Traditional AEs (TAEs) define the code and reconstructed input as \(c = f_\theta(x)\) and \(x^{\prime} = g_\theta(c)\), for some functions \(f_\theta, g_\theta\).

They use as reconstruction error \[\begin{align}\mathcal{E} :&= E\left( ||x-x^\prime||^2 \right) \\ &= E\left( ||x-g_\theta \circ f_\theta(x)||^2 \right),\end{align}\]or a regularized flavor thereof \[\mathcal{E} = E\left( ||x-g_\theta \circ f_\theta(x)||^2 \right) + \lambda \mathcal{R}(\theta),\]for \(\lambda \geq 0\).
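As a concrete (if minimal) sketch, here is what a small fully-connected TAE might look like in Keras; the layer sizes, activation, optimizer, and synthetic data are arbitrary choices for illustration.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

input_dim, code_dim = 20, 4

inputs = keras.Input(shape=(input_dim,))
code = layers.Dense(code_dim, activation='relu')(inputs)   # encoder f_theta
outputs = layers.Dense(input_dim)(code)                     # decoder g_theta

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer='adam', loss='mse')           # E(||x - x'||^2)

x = np.random.randn(1000, input_dim).astype('float32')
autoencoder.fit(x, x, epochs=10, batch_size=32, verbose=0)  # learn to reconstruct x

encoder = keras.Model(inputs, code)                         # x -> c
codes = encoder.predict(x, verbose=0)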

Variational Autoencoders

While the mathematical artillery required to construct Variational Autoencoders (VAEs) is a lot more complex, they are very similar to TAEs in spirit.

AIM

Like TAEs, VAEs aim at learning a mapping \[x \to c \to x^\prime ,\] where \(x \in \mathcal{X}\) is the original input vector, \(c \in \mathcal{C}\) is the code, and \(x^\prime \in \mathcal{X}\) is a reconstruction of the original input.

However, TAEs do not guarantee that the encoder will map the input space to the entire code space. Specifically, not every element of \(\mathcal{C}\) is a valid code of an element of \(\mathcal{X}\).

As a result, if we simply apply a TAE's decoder to any element of \(\mathcal{C}\), we may not necessarily generate an output that resembles the type of inputs that the TAE's encoder is meant to encode. For that we need to apply the decoder only to those elements of \(\mathcal{C}\) that were obtained by encoding an element of \(\mathcal{X}\).

However, because it is very hard to characterize this subset of \(\mathcal{C}\), it is virtually impossible to use a TAE's decoder to generate a draw from the distribution that its encoder encodes.

VAEs address this limitation. Specifically, VAEs require that the encoder always maps the distribution of original inputs \(\mathbb{P}_{x}\) on \(\mathcal{X}\) to the same code distribution \(\mathbb{P}_{c}\) on \(\mathcal{C}\). \(\mathbb{P}_{c}\) is typically chosen to be a multivariate standard normal\[\mathbb{P}_c = \mathcal{N}(0, I).\]This ensures that, not only is every element of \(\mathcal{C}\) a valid code (i.e. there exists an element of \(\mathcal{X}\) of which it is the code), but also that codes are cryptic in the sense that we cannot say anything about what distribution \(\mathbb{P}_{x}\) has been encoded simply from observing  \(\mathbb{P}_c\).

FEASIBILITY (COMPRESSION-FREE)

At this point, you might be wondering whether, for any \(\mathbb{P}_x\), we may actually always find a transformation that maps \(\mathbb{P}_x\)  to a multivariate standard normal  \(\mathbb{P}_c = \mathcal{N}(0, I)\).

The answer is yes, and the transformation is available analytically! The ideal encoder maps any \(x \in \mathcal{X}\) to code \[c= F_c^{-1} \circ F_x(x),\] and the ideal decoder maps any \(c \in \mathcal{C}\) to original input as \[x^\prime = F_x^{-1}\circ F_c(c) = x,\] where \(F_c\) (resp. \(F_x\)) is the vector-valued function whose i-th coordinate function is the CDF of the standard normal (resp. the CDF of the i-th coordinate distribution of \(\mathbb{P}_x\)).

Note that, under this ideal transformation, the conditional distributions \(\mathbb{P}_{x \vert c}=\mathbb{P}_{x^\prime \vert c}\) and \(\mathbb{P}_{c \vert x} = \mathbb{P}_{c \vert x^\prime}\) are point-masses. In other words, if we know the original input and the ideal encoder, then we know its code, and if we know a code and the ideal decoder, then we know what input it is the code of.
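For intuition, here is a one-dimensional illustration of this ideal (compression-free) encoder/decoder pair, using an exponential \(\mathbb{P}_x\) chosen purely for illustration.

import numpy as np
from scipy.stats import expon, norm

p_x = expon(scale=2.)                  # P_x (arbitrary choice); P_c is the standard normal
x = p_x.rvs(size=10000, random_state=0)

c = norm.ppf(p_x.cdf(x))               # ideal encoder: c = F_c^{-1}(F_x(x)), distributed N(0, 1)
x_prime = p_x.ppf(norm.cdf(c))         # ideal decoder: x' = F_x^{-1}(F_c(c)) = x

print(np.abs(x - x_prime).max())       # ~0: perfect reconstruction
print(c.mean(), c.std())               # ~0 and ~1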

TRACTABILITY

In practice, however, it is extremely hard to reliably learn \(F_x\) directly. Imagine having to learn the CDFs of pixels of images of dogs!

VAEs deal with this difficulty by acknowledging that we may never learn the ideal encoder/decoder, and leverage Bayesian statistics to cope with this uncertainty.

In an ideal world, given a code \(c\) that our encoder generated, there is only one choice for what original input \(x\) it is the code of. But because our encoder may not be ideal, VAEs assume that the conditional distribution of \(x\) given \(c\) is not a point mass but has a variance \(\sigma^2\) that accounts for the encoder's imperfection.

Specifically, VAEs assume that the code generated by the encoder indeed has distribution \[\mathbb{E}_{c} = \mathbb{P}_{c}=\mathcal{N}(0, I),\]but that the conditional distribution of \(x\) given \(c\) implied by the encoder reads \[\mathbb{E}_{x \vert c} = \mathcal{N}\left(f_\theta(c), \sigma I \right).\]

The ideal encoder corresponds to the case \(f_\theta(c) = F_x^{-1}\circ F_c(c)\) and \(\sigma = 0\). In general, however, \(\sigma > 0\) to reflect the fact that the encoder is not ideal. In such a case, the encoder is fuzzy in the sense that the code associated to input \(x\) is sampled from the distribution \(\mathbb{E}_{c \vert x}\) and is not unique.

Similarly, because our decoder might not be ideal, the conditional distribution of \(c\) given \(x\) implied by the decoder is not a point mass. Instead, VAEs assume that the conditional distribution of \(c\) given \(x^\prime\) implied by the decoder reads \[\mathbb{D}_{c \vert x^\prime} = \mathcal{N}\left(\mu_\theta(x^\prime), \sigma_\theta(x^\prime) \right),\]and that the distribution of reconstructed inputs is the same as that of original inputs\[\mathbb{D}_{x^\prime} = \mathbb{P}_{x}.\]

The ideal decoder corresponds to the case \(\mu_\theta(x^\prime) = F_c^{-1}\circ F_x(x^\prime)\) and \(\sigma_\theta(x^\prime) = 0\). In general, however, \(\sigma_\theta(x^\prime) > 0\) to reflect the fact that the decoder is not ideal. In such a case, the decoder is fuzzy in the sense that the reconstructed input associated to code \(c\) is sampled from the distribution \(\mathbb{D}_{x^\prime \vert c}\) and is not unique.

RECONSTRUCTION ERROR

Using  Bayes' theorem, we may get the joint distributions governing the encoder and the decoder as \[\mathbb{D}_{(x^\prime, c)} = \mathbb{D}_{x^\prime}  \mathbb{D}_{(c \vert x^\prime)} = \mathbb{P}_{x}  \mathbb{D}_{(c \vert x^\prime)},\]and\[\mathbb{E}_{(x, c)} = \mathbb{E}_{c}  \mathbb{E}_{(x \vert c)} = \mathbb{P}_{c}  \mathbb{E}_{(x \vert c)}.\]

Note that the encoder and the decoder are both ideal if and only if \(\mathbb{D}_{(x^\prime, c)}=\mathbb{E}_{(x, c)}\).

In effect, when both the encoder and the decoder are ideal the two distributions are the same, as previously discussed.

Conversely, when the two distributions are the same, the encoder maps \(\mathbb{E}_x\), which is equal to \(\mathbb{P}_x\) (because the two joint distributions are the same), to \(\mathbb{E}_c:=\mathbb{P}_c\), and therefore is ideal.

Similarly, when the two distributions are the same, the decoder maps \(\mathbb{D}_c\), which is equal to \(\mathbb{P}_c=\mathcal{N}(0, I)\) (because the two joint distributions are the same), to \(\mathbb{D}_{x^\prime}:=\mathbb{P}_x\), and therefore is ideal.

Thus, a natural reconstruction error is the KL-divergence between \(\mathbb{D}_{(x^\prime, c)}\) and \(\mathbb{E}_{(x, c)}\) which quantifies how different the decoder and encoder joint distributions are:\[\mathcal{E} := KL \left( \mathbb{D}_{(x^\prime, c)} || \mathbb{E}_{(x, c)}\right).\]

To arrive at the form of the objective function commonly used in the VAE literature, we may use Bayes' theorem and get \[\begin{align}\mathcal{E} =& E_{\mathbb{P}_x}\left( \log \mathbb{P}_x \right) - E_{\mathbb{P}_x}\left( \log \mathbb{E}_x \right) \\ & +E_{\mathbb{P}_x}\left[KL \left( \mathbb{D}_{c \vert x^\prime} || \mathbb{E}_{c \vert x}\right)\right] \end{align}\]where we have used the fact that \(\mathbb{D}_{x^\prime} := \mathbb{P}_x\).

Additionally, we may drop the term \(E_{\mathbb{P}_x}\left( \log \mathbb{P}_x \right)\) because it does not depend on any model parameter, and arrive at the usual formulation of VAEs' optimization problem \[ \begin{align}\min_{f_\theta, \sigma, \mu_\theta, \sigma_\theta} ~&E_{\mathbb{P}_x}\left[ KL \left( \mathbb{D}_{c \vert x^\prime} || \mathbb{E}_{c \vert x}\right)\right] \\ &-E_{\mathbb{P}_x}\left( \log \mathbb{E}_x \right).\end{align}\]

Let's verify that the unique solution to the foregoing optimization problem corresponds to the ideal encoder and decoder.

The negative-likelihood term \(-E_{\mathbb{P}_x}\left( \log \mathbb{E}_x \right)\) reflects the extent to which the encoder model \(\mathbb{E}_{(x, c)}\) is consistent with the true distribution of original inputs \(\mathbb{P}_{x}\). It is a well-known result that this term is minimized when \(\mathbb{E}_x = \mathbb{P}_x\), which indeed corresponds to the ideal encoder.

The KL-divergence term \(E_{\mathbb{P}_x}\left[ KL \left( \mathbb{D}_{c \vert x^\prime} || \mathbb{E}_{c \vert x}\right)\right]\) measures how consistent the real encoder is on average with the encoder implied by the decoder's joint distribution.

It is minimized when for every \(x \in \mathcal{X}, ~~ \mathbb{D}_{c \vert x^\prime} = \mathbb{E}_{c \vert x}\), which also implies \[\begin{align}\mathbb{D}_c :&= \int \mathbb{D}_{c \vert x^\prime} \mathbb{D}_{x^\prime} dx^\prime \\ &= \int \mathbb{D}_{c \vert x^\prime} \mathbb{P}_x dx^\prime \\&= \int \mathbb{E}_{c \vert x} \mathbb{E}_x dx \\&= \mathbb{E}_c \\&= \mathbb{P}_c,\end{align}\]in which case the decoder is also ideal.
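In practice, one more application of Bayes' theorem turns this objective into a KL-divergence between the Gaussian \(\mathbb{D}_{c \vert x^\prime}\) and the prior \(\mathbb{P}_c = \mathcal{N}(0, I)\), plus an expected reconstruction negative log-likelihood under \(\mathbb{E}_{x \vert c}\). Below is a minimal numpy sketch of those two terms in closed form; the helper names and toy values are mine, and in a real VAE the code \(c\) would be sampled from \(\mathbb{D}_{c \vert x^\prime}\) via the reparameterization trick, with the means and variances produced by neural networks.

import numpy as np

def kl_gaussian_to_standard_normal(mu, log_var):
    # KL( N(mu, diag(exp(log_var))) || N(0, I) ), per sample, in closed form.
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1. - log_var, axis=-1)

def gaussian_nll(x, x_mean, sigma):
    # -log N(x; x_mean, sigma^2 I), per sample, up to an additive constant.
    d = x.shape[-1]
    return 0.5 * np.sum((x - x_mean)**2, axis=-1) / sigma**2 + d * np.log(sigma)

# Toy stand-ins for encoder/decoder outputs:
x = np.random.randn(5, 10)
mu, log_var = 0.1 * x, np.zeros_like(x)   # stand-ins for mu_theta(x') and log sigma_theta(x')^2
x_mean, sigma = x + 0.05, 0.5             # stand-ins for f_theta(c) and sigma

per_sample_loss = gaussian_nll(x, x_mean, sigma) + kl_gaussian_to_standard_normal(mu, log_var)
print(per_sample_loss.mean())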

EXTENSION TO LOWER-DIMENSIONAL CODES

So far we have focused on the case where it is always possible to map the input distribution to a multivariate standard normal, no matter the input distribution \(\mathbb{P}_x\). In this case, the ideal encoder maps \(x \in \mathcal{X}\) to code \(c= F_c^{-1} \circ F_x(x)\), which ought to have the same dimension as \(x\), as this mapping is reversible.

In practice, we are often interested in lower-dimensional codes, which VAEs equally assume follow a multivariate standard normal distribution, but a lower-dimensional one.

The challenge now is to find a joint distribution \(\mathbb{Q}_{(x, c)}\) on \(\mathcal{X} \times \mathcal{C}\) such that its \(x\)-marginal is the true input distribution \(\mathbb{P}_x\), and its \(c\)-marginal is the lower-dimensional standard normal code distribution \(\mathbb{P}_c\).

This is clearly not always possible, mathematically speaking. The intuition giving us hope that we may find such a joint distribution in practice is that some physical applications (e.g. computer vision) are structured enough that the true input distribution \(\mathbb{P}_x\) is supported on a lower-dimensional manifold \(\mathcal{M} \subset \mathcal{X}\); they might not be supported on the whole of \(\mathcal{X}\).

For instance, because a real-world image typically contains objects which themselves have edges, textures, patterns, and obey the laws of physics, not every matrix of pixel values can be regarded as a real-world image.

If we can find such a distribution \(\mathbb{Q}_{(x, c)}\), then we can find the ideal (compressing) encoder that maps \(\mathbb{P}_x\) to \(\mathbb{P}_c\) by sampling from the conditional distribution \(\mathbb{Q}_{c \vert x}\). Similarly, we can find the ideal (expanding) decoder that maps \(\mathbb{P}_c\) to \(\mathbb{P}_x\) by sampling from the conditional distribution \(\mathbb{Q}_{x \vert c}\).

Note that both ideal encoder and ideal decoder are now fuzzy.

Using joint distributions for the encoder and decoder of the forms introduced earlier, if we can find parameters \(f_\theta, \sigma, \mu_\theta, \sigma_\theta\) such that \(KL \left( \mathbb{D}_{(x^\prime, c)} || \mathbb{E}_{(x, c)}\right)=0,\) then we are guaranteed that \(\mathbb{Q}_{(x, c)}\) exists, AND that \(\mathbb{D}_{(x^\prime, c)}\) and \(\mathbb{E}_{(x, c)}\) are the ideal decoder and encoder respectively!

Thus, whether the code is lower-dimensional or not, solving the optimization problem \[ \begin{align}\min_{f_\theta, \sigma, \mu_\theta, \sigma_\theta} ~&E_{\mathbb{P}_x}\left[ KL \left( \mathbb{D}_{c \vert x^\prime} || \mathbb{E}_{c \vert x}\right)\right] \\ &-E_{\mathbb{P}_x}\left( \log \mathbb{E}_x \right)\end{align}\]is the key to learning the ideal encoder and decoder, and this is what VAEs do.

Feature Engineering With AEs

Now that we are on the same page about what AEs are, let me tell you why you should be careful before using them for feature engineering.

When AEs are used for feature engineering, the idea is that, because the code vector \(c\) provides a richer and lower-dimensional representation of the original input vector \(x\), it may be used as a substitute feature vector to predict a target of interest \(y\). The hope is that we would get better performance by predicting \(y\) using \(c\) instead of \(x\).

Reason 1: Low reconstruction error does not guarantee low signal loss

Whether you use TAEs or VAEs, a low reconstruction error simply does not guarantee a low signal loss. After all, AEs simply do not know anything about \(y\)!

In TAEs, much of the reconstruction error \(\mathcal{E} := E\left( ||x-x^\prime||^2 \right)\) could very well be the signal that was in \(x\) to predict \(y\), irreversibly lost when the input was encoded.

Let us take a concrete example to illustrate this. Imagine we are dealing with a tabular classification problem, using categorical and continuous features \(x\), that we know can be perfectly solved using a single classification tree (CT).

A good proxy for the signal lost by a TAE is the error made by our CT using reconstructed inputs \(x^\prime\) to predict \(y\).

Imagine that our CT first splits on a binary feature \(b\) that accounts for only 10% of the total variance of \(x\). Suppose that our TAE reconstructs all coordinates of \(x\) perfectly except for \(b\) that it gets wrong 10% of the time. This means that our TAE achieves a reconstruction error of only about 1% of the total variance of \(x\).

However, because our CT first splits on values of feature \(b\), on average, 10% of reconstructed inputs \(x^\prime\) will fall in the wrong leaves of our CT. This could potentially result in a 10% classification error, despite a 1% reconstruction error! If the TAE got 20% of \(b\) wrong, then this would still lead to a very small 2% reconstruction error, but potentially a whopping 20% classification error, solely due to loss of signal by the AE!

One could think that making 20% error on a single feature might be too high, but we could arrive at the same conclusion if the TAE got a reconstruction error of 5% on 4 binary features, each accounting for 10% of the total variance. In such a case, the overall reconstruction error of the TAE would still be 2%, and the classification error could still be as high as 20%!

In VAEs, the very fact that the ideal encoder is fuzzy (i.e. \(\mathbb{Q}_{x \vert c}\) has a non-zero variance), which is always the case when the code is lower-dimensional, is sufficient to conclude that the encoder is lossy or, equivalently, a code \(c\) does not contain all the information, let alone all the signal, that was in \(x\).

In a sense, you can think of the conditional entropy \(h(x \vert c)\) as the amount of information about \(x\) irreversibly lost when encoding. Unfortunately, there is no guarantee that this lost information is pure noise; much of it could very well be signal that was in \(x\) to predict \(y\).

In fact, as illustrated in the CT example above, a 2% information loss about \(x\) can result in a reduction of classification accuracy by 20%.

Reason 2: Learning patterns from the code could be harder than from original features

When AEs are used for feature engineering, new features are constructed.

In general, the primary goal of feature construction is to simplify the relationship between inputs and the target into one that models our toolbox can reliably learn.

The questions you should be asking yourself before using AE's codes as features in a tabular predictive model are:

  • Do the codes make any sense?
  • Can I think of an explanation for why the codes could have as simple a relationship to the target as the original inputs?

If the answer to either question is no, then codes would likely be less useful than original features.

As an illustration, imagine we want to predict a person's income using, among other features, GPS coordinates of her primary residence, age, number of children, and number of hours worked per week.

While it is easy to see how a tree-based learner could exploit these features, combining them into a code vector would likely result in features that make little sense and are much harder to learn anything meaningful from using tree-based methods.

The Computer Vision Exception

Interestingly, unsupervised pre-training in computer vision does exactly what I am advising you against, and with great success!

So why are AEs effective as pre-training step for computer vision, but not for tabular data?

Reason 1: Low reconstruction error in computer vision usually guarantees low signal loss
Fig 1: Example 28x28 MNIST images (bottom) and the same images corrupted with Gaussian noise with standard deviation 0.5 (pixel values are rescaled between 0 and 1). Source: keras.

In computer vision, corrupting an image or a video does not make it much harder to solve the problem at hand.

As illustrated in the image above, adding Gaussian noise with a standard deviation of more than 50% of the pixel standard deviation does not make it much harder for a human being to tell which digit is in the corrupted image.

In other words, even if a TAE achieves a reconstruction error as high as 50% of the total variance of the image distribution, the corrupted reconstruction will still likely have all we need to recognize the digit in the image!

This implies that the code too contains all we need to recognize the digit in the image. Indeed, by the data processing inequality, \[I(y; x) \geq I(y; c) \geq I(y; x^\prime),\]meaning that the performance achievable using code \(c\) to predict image class \(y\) is at least as high as the performance achievable using corrupted reconstruction \(x^\prime\) to predict image class \(y\).

Fig 2: Example 28x28 MNIST images (top) and their 28x28 reconstruction (bottom) using a TAE with a 32-dimensional code. Source: keras.

In practice, State-Of-The-Art TAEs on MNIST and other computer vision problems achieve far smaller reconstruction errors, as evidenced by Fig. 2 above, and virtually no signal loss!

Reason 2: Features learned through convolutional layers make structure in images easier to learn from

Several studies have shown that convolutional layers are capable of learning higher-level abstractions in images such as edges, texture, parts, and objects.

When used as features, these abstractions make computer vision problems easier to solve than raw images do.

For a great resource on what types of features computer vision models learn, click here.

Conclusion

Autoencoders are great tools for data compression and generative modeling.

However, in order to use autoencoders as pre-processing step in a predictive modeling problem, one needs to ensure that:

  • A low reconstruction error in the problem at hand implies that the problem is as easy to solve using the reconstructed inputs as using the original inputs.
  • The code represents features/abstractions that make it easier to predict the target than using the original inputs.
How To Fix PCA To Make It Work For Feature Selection
https://blog.kxy.ai/introducing-principal-feature-selection-with-code/ | Fri, 08 Apr 2022

In case you haven't read our previous blog post, you should never use PCA for feature selection.

In this blog post, we propose a drop-in replacement, which we refer to as Principal Feature Selection (PFS), that addresses the most critical limitation of using PCA for feature selection, namely the fact that conservation of energy does not guarantee conservation of signal.

In a nutshell, while PCA optimizes for conservation of energy (or \(L^2(\mathbb{P})\) norm), PFS optimizes for conservation of signal, which is what is really needed for feature selection.

But first, here's a gentle reminder of why PCA isn't suitable for feature selection.

PCA Is Not For Feature Selection

We recall that the aim of PCA is to find a feature vector \(z := (z_1, \dots, z_d) \in \mathbb{R}^d\) obtained from an original mean-zero input vector \(x\) by a linear transformation, namely \(z = Wx,\) satisfying the following conditions:

  1. \(z\) has the same energy as \(x\): \(E(||x||^2) = E(||z||^2)\).
  2. \(z\) has decorrelated coordinates: \(\forall i \neq j, ~ \rho(z_i, z_j) = 0\).
  3. Coordinates of \(z\) have decreasing variances: \(\text{Var}(z_1) \geq \text{Var}(z_2) \geq \dots \geq \text{Var}(z_d)\).

When used for feature selection, data scientists typically select \(p < d\) such that the loss of energy (or loss of information content) \(\sum_{i=p+1}^{d} \text{Var}(z_i)\) is negligible, and they regard \(z^{p} := (z_1, \dots, z_p)\) as a feature vector that contains fewer and richer representations than the original input \(x\) for predicting a target \(y\).
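As a reminder of what this standard practice looks like, here is a short scikit-learn sketch; the 95% energy threshold and synthetic data are arbitrary illustrations.

import numpy as np
from sklearn.decomposition import PCA

# Synthetic correlated inputs, centered as PCA assumes.
rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 20)) @ rng.standard_normal((20, 20))
x = x - x.mean(axis=0)

pca = PCA().fit(x)
energy = np.cumsum(pca.explained_variance_ratio_)
p = int(np.searchsorted(energy, 0.95) + 1)    # smallest p capturing 95% of the energy
z_p = PCA(n_components=p).fit_transform(x)    # z^p, used as the new feature vector
print(p, z_p.shape)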

In so doing, data scientists utilize PCA as a tool for both feature construction and feature selection. Unfortunately, using PCA for either purpose is problematic.

Feature Construction: What is often considered the appeal of PCA for feature construction is the fact that it generates features that are decorrelated.

While linearly dependent features are clearly an issue both theoretically and in practice, decorrelated features are not necessarily desirable for predictive modeling. What we want is to construct features that complement each other the most for predicting the target \(y\).

Feature Selection: When reducing the number of features from \(d\) to \(p\) using PCA, the emphasis is solely placed on ensuring that the \(p\) principal components capture much of the information content or energy of the original inputs vector \(x\).

However, capturing much of the energy of \(x\) does not guarantee capturing much of the signal that is in \(x\) to help us predict \(y\), which is what feature selection ought to optimize for!

Principal Feature Selection

Like PCA, Principal Feature Selection (PFS) learns a linear transformation of the original input vector \(x \in \mathbb{R}^d\), namely \(z = Fx\), where \(F \in \mathbb{R}^{p \times d}\) is a matrix that has \(p \leq d\) normal rows.

However, PFS trades the feature decorrelation requirement of PCA for feature maximum complementarity, and trades the energy conservation requirement of PCA for signal conservation.

PFS is best understood by taking a close look at what PCA does sequentially.

First Principal Feature

In PCA, the first principal component is obtained by looking for the direction in which the projection of the original input vector \(x\) has the highest energy.

Explanation

Indeed, if we denote \(w\) a normal vector (i.e. \(||w||=1\)), then the projection of \(x\) on \(w\) reads \[P_{w}(x) = (x^Tw)\,w\] and its energy reads \[\begin{align}\mathcal{E}(P_{w}(x)) :&= E(||P_{w}(x)||^{2}) \\ &= E\left( w^{T} xx^{T}w \right) \\&=w^{T} \text{Cov}(x) w,\end{align}\]and we recover the fact that any direction \(w\) maximizing \(\mathcal{E}(P_{w}(x))\) ought to be an eigenvector associated to the largest eigenvalue of \(\text{Cov}(x)\) (see this to understand why), which happens to yield what we already know to be the first principal component!

Instead, in PFS, the first feature, which we call the first principal feature, is obtained by looking for the direction in which the projection of the original input vector \(x\) contains the largest share of the signal that is in \(x\) (in relation to predicting \(y\)).

Formally, the first row \(f_1\) of \(F\) is obtained by maximizing the mutual information between the target \(y\) and \(x^T f\): \[f_1 = \underset{f}{\text{argmax}} ~I(y; x^Tf) ~~ \text{s.t.} ~~ ||f|| = 1,\]and the corresponding first principal feature is \[ z_1 = x^Tf_1.\]

The use of mutual information as a measure of signal or feature potential stems from the fact that the highest performance achievable when using a feature vector \(z\) to predict a target \(y\) is an increasing function of the mutual information \(I(y; z)\) for a variety of commonly used performance metrics. See this paper for more details.
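To make the definition concrete, here is a crude sketch that approximates the first principal feature with a random search over unit directions, scoring each candidate with scikit-learn's mutual information estimator. It only illustrates the objective; it is not how the kxy package solves it.

import numpy as np
from sklearn.feature_selection import mutual_info_regression

def first_principal_feature(x, y, n_candidates=500, random_state=0):
    # Approximate f_1 = argmax_{||f||=1} I(y; x^T f) by scoring random unit directions.
    # x: (n_samples, d) array; y: (n_samples,) target array.
    rng = np.random.default_rng(random_state)
    best_f, best_mi = None, -np.inf
    for _ in range(n_candidates):
        f = rng.standard_normal(x.shape[1])
        f /= np.linalg.norm(f)
        mi = mutual_info_regression(x @ f[:, None], y, random_state=random_state)[0]
        if mi > best_mi:
            best_f, best_mi = f, mi
    return best_f, x @ best_f  # f_1 and the first principal feature z_1 = x^T f_1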

Subsequent Principal Features

More generally, in PCA, the (\(i+1\))-th feature is obtained by looking for the direction that is orthogonal to all previous i principal directions, and in which the projection of the original input has the highest energy.

Explanation

Indeed, if \(w\) is orthogonal to all directions \((w_1, \dots, w_i)\) (i.e. \(w \perp \text{span}(w_1, \dots, w_i)\)), then the projection of \(x\) on \(w\) is the same as the projection on \(w\) of what is left of \(x\) after projecting it on \(\text{span}(w_1, \dots, w_i)\), namely \[\bar{x}_i := x - W_i^T W_ix, \] where \(W_i\) is the \(i \times d\) matrix whose \(i\) rows are the orthonormal eigenvectors of \(\text{Cov}(x)\) corresponding to the \(i\) largest eigenvalues.

We know from the previous explanation that the optimal direction ought to be an eigenvector of \[\text{Cov}(\bar{x}_i) = [I- W_i^T W_i] \text{Cov}(x) [I- W_i^T W_i]\]associated to the largest eigenvalue.

Clearly, all eigenvectors of \(\text{Cov}(x)\) are also eigenvectors of \(\text{Cov}(\bar{x}_i)\). The top-\(i\) eigenvectors of \(\text{Cov}(x)\) have eigenvalue \(0\) relative to \(\text{Cov}(\bar{x}_i)\), and all other eigenvectors of \(\text{Cov}(x)\) have the same eigenvalue relative to \(\text{Cov}(\bar{x}_i)\).

Thus, \(w_{i+1}\), the eigenvector of \(\text{Cov}(x)\) corresponding to the (\(i+1\))-th largest eigenvalue, is the eigenvector of \(\text{Cov}(\bar{x}_i)\) corresponding to the largest eigenvalue.

We recover what we already knew to be the (\(i+1\))-th principal direction.

PCA requires the (\(i+1\))-th principal direction to be orthogonal to all \(i\) previous principal directions in order to ensure that all principal features are decorrelated. However, decorrelated features are not necessarily complementary as far as predicting \(y\) goes.

PFS addresses this limitation by learning the (\(i+1\))-th principal feature by looking for the direction in which the projection of the original input vector \(x\) complements all \(i\) previously learned principal features the most.

Formally, the (\(i+1\))-th row \(f_{i+1}\) of \(F\) is obtained by maximizing the conditional mutual information between the target \(y\) and \(x^T f\) given all previously learned principal features: \[f_{i+1} = \underset{f}{\text{argmax}} ~I\left(y; x^Tf \vert x^{T}f_1, \dots, x^{T} f_i\right) ~~ \text{s.t.} ~~ ||f|| = 1,\]and the corresponding principal feature is \[z_{i+1} = x^Tf_{i+1}.\]

Recalling that \[I(y;x |z) := I(y; x, z)-I(y; z),\]it follows that maximizing \(I\left(y; x^Tf \vert x^{T}f_1, \dots, x^{T} f_i\right)\) (as a function of \(f\)) is equivalent to maximizing \(I\left(y; x^Tf, x^{T}f_1, \dots, x^{T} f_i\right)\) (as a function of \(f\)).

Stopping Criteria

PFS naturally stops when the last conditional mutual information term evaluated is negligible \[I\left(y; x^Tf_{i+1} \vert x^{T}f_1, \dots, x^{T} f_i\right) \approx 0,\]in which case the corresponding principal feature \(z_{i+1} = x^{T}f_{i+1}\) is discarded.

Properties

Maximum Complementarity

Note that, by construction, Principal Feature Selection cannot generate linearly dependent features.

Indeed, if the (\(i+1\))-th principal feature \(z_{i+1} = x^T f_{i+1}\) is linearly related to the first \(i\) principal features, then \(I\left(y; z_{i+1} \vert z_1, \dots, z_i\right)=0\), in which case PFS would stop (by construction), and \(z_{i+1}\) would be discarded.

More generally, by construction, the (\(i+1\))-th principal feature is always the (linear) feature that, when used in conjunction with existing principal features to predict the target \(y\), would yield the highest achievable performance.

Said differently, whether \(f_{i+1}\) is orthogonal to previous principal directions \((f_1, \dots, f_i)\) or not, we are guaranteed that no other feature of the form \(g(x^Tf)\) for any function \(g\) may complement existing principal features \((z_{1}, \dots, z_{i})\) more than \(z_{i+1}\) when it comes to predicting \(y\)!

Dimensionality Reduction

A direct consequence of maximum complementarity is that PFS is bound to learn at most \(d\) features, as we cannot find more than \(d\) linearly independent vectors in \(\mathbb{R}^d\).

In practice, because each principal feature squeezes as much of the signal leftover as it possibly can, in most cases, PFS should be expected to considerably reduce the number of features needed.

Signal Conservation

If PFS ever happens to stop at \(p=d\), in other words, if \(I(y; z_1) > 0\) and no conditional mutual information term \(I\left(y; z_{i+1} \vert z_i, \dots, z_1\right)\) is negligible for \(i\) ranging from \(1\) to \((d-1)\), then \((z_1, \dots, z_d)\) are guaranteed to contain all the signal that is in \(x\) to predict \(y\): \[I(y; z_1, \dots, z_d) = I(y; Fx) = I(y; x),\]which is the least we should expect.

Explanation

Indeed, as previously discussed, the matrix of principal directions \(F\) is full rank. 

Thus, the feature transformation \(x \to z:=Fx\) is a 1-to-1 map, and mutual information is invariant by 1-to-1 maps.

When \(p < d\) or, equivalently, \(I(y; x^Tf \vert z_1, \dots, z_p) \approx 0\) for any \(f \in \mathbb{R}^{d}\), we may safely conclude that no feature of the form \(g(x^Tf)\), for any function \(g\), contains insights that are not already in \((z_1, \dots, z_p)\).

In particular, for each \(i\), \(I(y; x_i \vert z_1, \dots, z_p) \approx 0\), which implies that each original feature is always redundant relative to the \(p\) principal features learned by PFS.

While this is a strong indication that the \(p\) learned principal features \(z\) contain much of the signal that is in the \(d > p\) original features \(x\) to predict \(y\), we cannot, however, conclude that they contain all the signal.

Explanation

Indeed, we may very well have \(I(y; x^Tu \vert z_1, \dots, z_p) = 0\) for all vectors \(u \in \mathbb{R}^{d}\), but \(I(y; x^Tu, x^Tv \vert z_1, \dots, z_p) > 0\) for some vectors \(u, v \in \mathbb{R}^d\). 

All it would take is to have either \(I(y; x^Tu \vert z_1, \dots, z_p, x^Tv) > 0\) or \(I(y; x^Tv \vert z_1, \dots, z_p, x^Tu) > 0\), which is not incompatible with \(I(y; x^Tu \vert z_1, \dots, z_p) = 0\).

What we can conclude, however, is that if we want to complement principal features \(z\) with a feature that is a function of a projection of original features \(x\), then we need to project \(x\) onto a space of dimension 2 or higher, not just onto a line.

In other words, all features of the form \(g(x^Tu)\) are redundant given principal features \(z\), but some features of the form \(g(Gx)\), where \(G\) is a matrix with more than one row, might not be redundant given \(z\).

If no such feature would be beneficial, i.e. \[\underset{G \in \mathbb{R}^{(d-p) \times d}}{\sup} I\left(y; Gx \vert z_1, \dots, z_p\right) = 0,\]then we can safely conclude that the \(p\) learned principal features \(z\) contain all the signal in \(x\) to predict \(y\), no matter how small \(p\) is!

Proof

Add a \((d-p) \times d \) matrix block \(G\) to \(F\) to turn it into a \(d \times d\) full rank matrix \(H\). Note that \(H\) being full rank, \(I(y; Hx) = I(y; x)\).

Also note that we have \[I(y; Hx)=I\left(y; [Fx, Gx]\right) = I(y; Gx \vert Fx) + I(y; Fx).\]

Given that \(I(y; Gx \vert Fx)=0\), it follows that \(I(y; x) = I(y; Hx) = I(y; Fx)\).

When Not To Use PFS?

Like PCA, Principal Feature Selection constructs new features by linearly combining existing features \(x\).

Linearly combining existing features might result in features that make little sense (think of a linear combination of GPS coordinates, age, and number of children used to predict income), and that models will find harder to learn from.

Let me stress that, while PFS maximizes the signal or 'juice' in principal features (for predicting \(y\)), it does not necessarily make the said signal or 'juice' easier to extract using models in your toolbox!

A scenario in which you may confidently use PFS is when all original features (i.e. coordinates of \(x\)) are of the same nature, and linearly combining them would produce another feature that is of the same nature.

This is for instance the case in finance when \(x\) is a vector of returns. Here, \(x^Tf\) can be regarded as the return of a (possibly long-short) portfolio containing the assets whose returns are the coordinates of \(x\).
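
As a tiny numerical illustration of this portfolio interpretation (the numbers below are made up):

import numpy as np

# Toy example: coordinates of x are one-period returns of 3 assets.
x = np.array([0.02, -0.01, 0.005])

# A (possibly long-short) portfolio direction with unit norm, as in PFS.
f = np.array([0.8, -0.5, 0.33])
f = f/np.linalg.norm(f)

# x^T f is itself a return: that of the portfolio holding f_i units of asset i.
print('Portfolio return: %.4f' % np.dot(x, f))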

In general, if linearly combining your original features does not produce features that make sense, then you should probably not consider using PFS, and certainly not PCA.

Python Implementation

We have implemented Principal Feature Selection in the kxy Python package; simply run pip install kxy -U.

As an illustration, we consider running PFS on a regression problem where the target \(y\) is generated as \[y = x^Tw + 2(x^{T}w)^2 + 0.5(x^{T}w)^3,\]where \(x \in \mathbb{R}^{100}\) is a vector of independent standard normal variables, and \(w = (1/100, \dots, 1/100)\).

import numpy as np
from kxy.pfs import PFS

# Generate the data
d = 100
w = np.ones(d)/d
x = np.random.randn(10000, d)
xTw = np.dot(x, w)
y = xTw + 2.*xTw**2 + 0.5*xTw**3

# Run PFS
selector = PFS()
selector.fit(x, y)

# Learned principal directions
F = selector.feature_directions

# Learned features
z = np.dot(x, F.T)

# Accuracy
true_f_1 = w/np.linalg.norm(w)
learned_f_1 = F[0, :]
e = np.linalg.norm(true_f_1-learned_f_1)
print('PFS Error Norm: %.4f' % e)

The code outputs PFS Error Norm: 0.0552, meaning that our implementation recovered the first principal direction with only about a 5% error (in Euclidean norm between unit vectors).

Let's compare this to the first principal component of PCA.

from kxy.pfs import PCA
# Run PCA
pca_selector = PCA(energy_loss_frac=0.05)
pca_selector.fit(x, None)

# Learned principal directions
W = pca_selector.feature_directions

# Learned principal components
z = np.dot(x, W.T)

# Accuracy
learned_w_1 = W[0, :]
e = np.linalg.norm(true_f_1-learned_w_1)
print('PCA Error Norm: %.4f' % e)

The PCA code outputs PCA Error Norm: 1.4134!

]]>
<![CDATA[5 Reasons You Should Never Use PCA For Feature Selection]]>https://blog.kxy.ai/5-reasons-you-should-never-use-pca-for-feature-selection/6231b8a8a5277b004d066120Thu, 31 Mar 2022 16:00:00 GMT

Principal Component Analysis, or PCA, is one of the most consequential dimensionality reduction algorithms ever invented.

Unfortunately, like all popular tools, PCA is often used for unintended purposes, sometimes abusively. One such purpose is feature selection.

In this article, we give you 5 key reasons never to use PCA for feature selection. But first, let's briefly review the inner-workings of PCA.

What Is PCA?

The Problem

Let us assume we have a vector of inputs \(x := (x_1, \dots, x_d) \in \mathbb{R}^d\), which we assume has mean 0 to simplify the argument (i.e. \(E(x)=0\)).

We are interested in reducing the size \(d\) of our vector, without losing too much information. Here, a proxy for the information content of \(x\) is its energy defined as \[\mathcal{E}(x) := E(||x||^2).\]

The challenge is that the information content of \(x\) is usually unevenly spread out across its coordinates.

In particular, coordinates can be positively or negatively correlated, which makes it hard to gauge the effect of removing a coordinate on the overall energy.

Let's take a concrete example. In the simplest bivariate case (\(d=2\)),

\(\mathcal{E}(x) = \text{Var}(x_1) + \text{Var}(x_2) + 2\rho(x_1, x_2) \sqrt{\text{Var}(x_1)\text{Var}(x_2)}\),

where \(\rho(x_1, x_2)\) is the correlation between the two coordinates, and \(\text{Var}(x_i)\) is the variance of \(x_i\).

Let's assume that \(x_1\) has a higher variance than \(x_2\). Clearly, the effect of removing \(x_2\) on the energy, namely \[\mathcal{E}(x)-\text{Var}(x_1) =  \text{Var}(x_2) + 2\rho(x_1, x_2) \sqrt{\text{Var}(x_1)\text{Var}(x_2)},\] does not just depend on \(x_2\); it also depends on the correlation between \(x_1\) and \(x_2\), and on the variance/energy of \(x_1\)! When \(d > 2\), things get even more complicated. The energy now reads\[\mathcal{E}(x) =\sum_{i=1}^{d}\sum_{j=1}^{d} \rho(x_i, x_j) \sqrt{\text{Var}(x_i)\text{Var}(x_j)},\]and analyzing the effect on the energy of removing any coordinate becomes a lot more complicated.

The aim of PCA is to find a feature vector \(z := (z_1, \dots, z_d) \in \mathbb{R}^d\) obtained from \(x\) by a linear transformation, namely \(z = Wx,\) satisfying the following conditions:

  1. \(z\) has the same energy as \(x\): \(E(||x||^2) = E(||z||^2)\).
  2. \(z\) has decorrelated coordinates: \(\forall i \neq j, ~ \rho(z_i, z_j) = 0\).
  3. Coordinates of \(z\) have decreasing variances: \(\text{Var}(z_1) \geq \text{Var}(z_2) \geq \dots \geq \text{Var}(z_d)\).

When the 3 conditions above are met, we have \[\mathcal{E}(x) = \mathcal{E}(z)  =\sum_{i=1}^{d} \text{Var}(z_i).\]Thus, dimensionality reduction can be achieved by using features \(z ^{p} := (z_1, \dots, z_p)\) instead of the original features \(x := (x_1, \dots, x_d)\), where \(p < d\) is chosen so that the energy loss, namely \[\mathcal{E}(z)-\mathcal{E}(z^{p}) = \sum_{i=p+1}^{d}\text{Var}(z_i),\]is only a small fraction of the total energy \(\mathcal{E}(z)\):\[\frac{\sum_{i=p+1}^{d}\text{Var}(z_i)}{\sum_{i=1}^{d}\text{Var}(z_i)} \ll 1.\]

The Solution

The three conditions above induce a unique solution.

The conservation of energy condition requires: \[E(||z||^2) = E\left( x^{T} W^{T}Wx\right) = E\left( x^{T} x\right)=E(||x||^2).\] A sufficient condition for this to hold is that \(W\) be an orthogonal matrix: \[W^{T}W = WW^{T} = I.\] In other words, columns (resp. rows) of \(W\) form an orthonormal basis of \(\mathbb{R}^{d}\).

As for the second condition, it implies that the autocovariance matrix\[\text{Cov}(z) = WE(xx^T)W^T = W\text{Cov}(x)W^T \]should be diagonal.

Let us write \(\text{Cov}(x) = UDU^T\) for the Singular Value Decomposition of \(\text{Cov}(x)\), where columns of the orthogonal matrix \(U\) are orthonormal eigenvectors of the (positive semidefinite) matrix \(\text{Cov}(x)\), sorted in decreasing order of their eigenvalues, and \(D\) is the diagonal matrix of those eigenvalues.

Plugging \(\text{Cov}(x) = UDU^T\) in the equation \(\text{Cov}(z) = W\text{Cov}(x)W^T\), we see that, to satisfy the second condition, it is sufficient that \(WU=I=U^{T}W^{T}\), which is equivalent to \(W=U^{-1} =U^{T}\).

Note that, because \(U\) is orthogonal, the choice \(W = U^{T}\) also satisfies the first condition.

Finally, given that columns of \(U\) are sorted in decreasing order of their eigenvalues, the variances \(\text{Var}(z_i) = \text{Cov}(z)[i, i] = D[i, i]\) form a decreasing sequence, which satisfies the third condition.

Interestingly, it can be shown that any loading matrix \(W\) of a linear transformation that satisfies the three conditions above ought to be of the form \(W=U^T\) where columns of \(U\) are orthonormal eigenvectors of \(\text{Cov}(x)\) sorted in decreasing order of their eigenvalues.

Coordinates of \(z\) are called principal components, and the transformation \(x \to U^{T}x\) is the Principal Component Analysis.
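
To make the construction above concrete, here is a minimal NumPy sketch (the data and variable names are my own choices): it forms \(W=U^T\) from the eigendecomposition of \(\text{Cov}(x)\), checks decorrelation and conservation of energy, and picks \(p\) so that the energy loss is at most 5%.

import numpy as np

np.random.seed(0)

# Mean-zero data with correlated coordinates (n samples, d features).
n, d = 10000, 5
A = np.random.randn(d, d)
x = np.random.randn(n, d) @ A.T
x = x - x.mean(axis=0)

# Eigendecomposition of Cov(x); sort columns of U by decreasing eigenvalue.
cov_x = np.cov(x, rowvar=False)
eigenvalues, U = np.linalg.eigh(cov_x)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, U = eigenvalues[order], U[:, order]

# Principal components: z = U^T x (computed row-wise).
z = x @ U

# Condition 2: coordinates of z are decorrelated (Cov(z) is diagonal).
cov_z = np.cov(z, rowvar=False)
print('Max off-diagonal of Cov(z): %.2e' % np.abs(cov_z - np.diag(np.diag(cov_z))).max())

# Condition 1: energy is conserved.
print('Energy of x: %.4f, energy of z: %.4f' % ((x**2).sum(axis=1).mean(), (z**2).sum(axis=1).mean()))

# Dimensionality reduction: smallest p retaining at least 95% of the energy.
energy_fraction = np.cumsum(eigenvalues)/eigenvalues.sum()
p = int(np.searchsorted(energy_fraction, 0.95) + 1)
print('Keeping p = %d out of d = %d principal components' % (p, d))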

5 Reasons Not To Use PCA For Feature Selection

Now that we are on the same page about what PCA is, let me give you 5 reasons why it is not suitable for feature selection.

When used for feature selection, data scientists typically regard \(z^{p} := (z_1, \dots, z_p)\) as a feature vector that contains fewer and richer representations than the original input \(x\) for predicting a target \(y\).

Reason 1: Conservation of energy does not guarantee conservation of signal

The essence of PCA is that the extent to which dimensionality reduction is lossy is driven by the information content (energy in this case) that is lost in the process. However, for feature selection, what we really want is to make sure that reducing dimensionality will not reduce performance!

Unfortunately, maximizing the information content or energy of features \(z^p := (z_1, \dots, z_p)\) does not necessarily maximize their predictive power!

Think of the predictive power of \(z^p\) as the signal part of its overall energy or, equivalently, the fraction of its overall energy that is useful for predicting the target \(y\).

We may decompose an energy into signal and noise as \[ \mathcal{S}(z^p) + \mathcal{N}(z^p) = \mathcal{E}(z^p) \leq \mathcal{E}(x) =  \mathcal{S}(x) + \mathcal{N}(x) ,\]where \(\mathcal{N}(z^p) := E\left(||z^p||^2 \vert y\right)\) is the noise component, and \(\mathcal{S}(z^p) := E\left(||z^p||^2 \right)-E\left(||z^p||^2 \vert y\right)\) is the signal.

Clearly, while PCA ensures that \(\mathcal{E}(x) \approx \mathcal{E}(z^p)\), we may easily find ourselves in a situation where PCA has wiped out all the signal that was originally in \(x\) (i.e. \(\mathcal{S}(z^p) \approx 0\))! The lower the Signal-to-Noise Ratio (SNR) \(\frac{\mathcal{S}(x)}{\mathcal{N}(x)}\), the more likely this is to happen.

Fundamentally, for feature selection, what we want is conservation of signal, \(\mathcal{S}(x) \approx \mathcal{S}(z^p)\), not conservation of energy.

Note that, if instead of using the energy as the measure of information content we used the entropy, the noise would have been the conditional entropy \(h\left(z^p \vert y\right)\), and the signal would have been the mutual information \(I(y; z^p)\).
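
To make Reason 1 concrete, here is a minimal simulation (entirely made up for illustration) in which the first principal component retains virtually all the energy but none of the signal: the high-variance coordinate is pure noise, and all the signal sits in the low-variance coordinate.

import numpy as np

np.random.seed(0)
n = 10000

# x_1: high-variance pure noise; x_2: low-variance feature that fully drives y.
x1 = 10.0*np.random.randn(n)
x2 = 0.1*np.random.randn(n)
x = np.column_stack([x1, x2])
y = x2  # all the signal is in the low-energy coordinate

# PCA: keep only the top principal component.
cov_x = np.cov(x, rowvar=False)
eigenvalues, U = np.linalg.eigh(cov_x)
top_direction = U[:, np.argmax(eigenvalues)]
z1 = x @ top_direction

print('Energy retained by z_1: %.2f%%' % (100.*eigenvalues.max()/eigenvalues.sum()))
print('|corr(z_1, y)|: %.3f' % abs(np.corrcoef(z1, y)[0, 1]))
print('|corr(x_2, y)|: %.3f' % abs(np.corrcoef(x2, y)[0, 1]))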

Reason 2: Conservation of energy is antithetical to feature selection

Fundamentally, preserving the energy of the original feature vector conflicts with the objectives of feature selection.

Feature selection is most needed when the original feature vector \(x\) contains coordinates that are uninformative about the target \(y\), whether they are used by themselves, or in conjunction with other coordinates.

In such a case, removing the useless feature(s) is bound to reduce the energy of the feature vector. The more useless features there are, the more energy we will lose, and that's OK!

Let's take a concrete example in the bivariate case \(x=(x_1, x_2)\) to illustrate this. Let's assume \(x_2\) is uninformative about \(y\) and \(x_1\) is almost perfectly correlated to \(y\).

Saying that \(x_2\) is uninformative about the target \(y\) means that it ought to be independent from \(y\) both unconditionally (i.e. \(I(y; x_2)=0\)) and conditionally on \(x_1\) (i.e. \(I(y; x_2 \vert x_1) = 0\)).

This can occur for instance when \(x_2\) is completely random (i.e. independent from both \(y\) and \(x_1\)). In such a case, we absolutely need to remove \(x_2\), but doing so would inevitably reduce the energy by \(E(||x_2||^2)\).

Note that, when both \(x_1\) and \(x_2\) have been standardized, as is often the case before applying PCA, removing \(x_2\), which is the optimal thing to do from a feature selection standpoint, would result in 50% energy loss!

Even worse, in this example, \(x_1\) and \(x_2\) happen to be principal components (i.e. \(U=I\)) associated to the exact same eigenvalue. Thus, PCA is unable to decide which one to keep, even though \(x_2\) is clearly useless and \(x_1\) almost perfectly correlated to the target!

Reason 3: Decorrelation of features does not imply maximum complementarity

It is easy to think that because two features are decorrelated each must bring something new to the table. That is certainly true, but that 'new thing' which decorrelated features bring is energy or information content, not necessarily signal!

Much of that new energy can be pure noise. In fact, features that are completely random are decorrelated with useful features, yet they cannot possibly complement them for predicting the target \(y\); they are useless.

Reason 4: Learning patterns from principal components could be harder than from original features

When PCA is used for feature selection, new features are constructed.

In general, the primary goal of feature construction is to simplify the relationship between inputs and the target into one that models our toolbox can reliably learn.

By linearly combining previously constructed features, PCA creates new features that can be harder to interpret, and in a more complex relationship with the target.

The questions you should be asking yourself before applying PCA are:

  • Does linearly combining my features make any sense?
  • Can I think of an explanation for why the linearly combined features could have as simple a relationship to the target as the original features?

If the answer to either question is no, then PCA features would likely be less useful than original features.

As an illustration, imagine we want to predict a person's income using, among other features, GPS coordinates of her primary residence, age, number of children, and number of hours worked per week.

While it is easy to see how a tree-based learner could exploit these features, linearly combining them would result in features that make little sense and are much harder to learn anything meaningful from using tree-based methods.

Reason 5: Feature selection ought to be model-specific

Feature selection serves one primary goal: removing useless features from a set of candidates. As explained in this article, feature usefulness is a model-specific notion. A feature can very well be useful for a model, but not so much for another.

PCA, however, is model-agnostic. In fact, it does not even utilize any information about the target.

Conclusion

PCA is a great tool with many high-impact applications. Feature selection is just not one of them.

]]>
<![CDATA[Feature Engineering With Game Theory: Beyond SHAP values]]>https://blog.kxy.ai/feature-engineering-with-game-theory-beyond-shap/623511a06ca294003db8550dWed, 23 Mar 2022 16:00:00 GMT

There is no great predictive (a.k.a tabular) model without great features. In our quest for great features, we might find ourselves with hundreds of candidate features.

Such a large number of candidate features comes with several challenges, not the least of which are the need to limit the number of features used to train our model (i.e. feature selection) and the need to understand how each selected feature affects model decisions (i.e. model explanation) and model performance.

Unfortunately, the ease with which machine learning engineers can generate feature importance scores nowadays has resulted in feature importance scores being misunderstood, and abusively used for feature selection.

In this post, we discuss three fundamental properties of a feature every machine learning engineer needs to be aware of, namely its importance, its usefulness, and its potential, and we clarify when each one may be used.

We focus on game-theoretic implementations using Shapley values to stress the similarities and differences between the three properties.

Shapley Values

Shapley values provide a framework for fairly attributing payoffs to individual players that collaborate in a game that may be played individually or in coalitions.

As is customary, let us denote \(N\) the set of all players, \(n\) the total number of players, and \(v: 2^N \to \mathbb{R}\) the function such that \(v(S)\) is the payoff received by the coalition of players \(S\).

As per Shapley value, the fair amount that player \(i\) should get is defined as \[\phi_i(v) := \frac{1}{n}\sum_{S \subset N \backslash \{i\}} {n-1 \choose |S|}^{-1} \Delta v_i(S) , \]where \(\Delta v_i(S) := v\left( S \cup \{i\} \right) - v\left( S\right)\).

One way to understand \(\phi_i(v)\) is that, when player \(i\) joins a coalition \(S\), it is only fair that she be attributed her marginal contribution \(v\left( S \cup \{i\} \right) - v\left( S\right)\).

To avoid giving players an advantage or disadvantage for entering a coalition early or late, we average over all possible orders in which the grand coalition can be formed.
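
As a sanity check of the formula above, here is a brute-force implementation for a toy 3-player game (the payoff function is made up, and exhaustive enumeration is only feasible for a handful of players):

from itertools import combinations
from math import comb

def shapley_values(players, v):
    """Exact Shapley values of a payoff function v, which maps a frozenset of players to a payoff."""
    n = len(players)
    phis = {}
    for i in players:
        others = [p for p in players if p != i]
        phi = 0.0
        for size in range(n):
            for S in combinations(others, size):
                S = frozenset(S)
                # Weighted marginal contribution of player i to coalition S.
                phi += (v(S | {i}) - v(S))/(n*comb(n-1, size))
        phis[i] = phi
    return phis

# Toy game: any coalition containing player 1 gets 10, plus 2 per additional member.
def v(S):
    return 0.0 if 1 not in S else 10.0 + 2.0*(len(S)-1)

attributions = shapley_values([1, 2, 3], v)
print({p: round(phi, 6) for p, phi in attributions.items()})  # {1: 12.0, 2: 1.0, 3: 1.0}
print(round(sum(attributions.values()), 6))  # 14.0 = v(N) - v(empty set), as per the efficiency property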

Shapley value is the only payoff attribution rule that satisfies the following four properties:

  1. Efficiency: The sum of individual attributions is the payoff of the coalition made of all players minus the payoff of the empty coalition: \(\sum_{i=1}^n \phi_i(v) = v(N)-v(\emptyset) \).
  2. Symmetry: If two players \(i\) and \(j\) always have the same impact in any coalition they are added to and that contains neither, then they should be attributed the same payoff. That is, if \(v(S \cup \{i\}) = v(S \cup \{j\})\) for every \(S\) such that \(i \notin S\) and \(j \notin S\), then \(\phi_i(v)=\phi_j(v)\).
  3. Linearity: Running two attributions one after the other would produce the same result as running one attribution with the combined payoff: \(\forall v, w, i, ~~~~ \phi_i(v+w) = \phi_i(v) + \phi_i(w).\)
  4. Null player: A player that does not contribute when added to any coalition should not be attributed any payoff: if \(\forall S, v(S \cup \{i\}) = v(S)\) then \(\phi_i(v)=0\).

Shapley Values For Predictive Modeling

Shapley values provide a flexible framework for understanding the marginal contribution of a feature when building a predictive model.

Features are essentially players that collaborate in a game related to predictive modeling. Using multiple features in a model is tantamount to players forming a coalition to play the game. The specific rules of the game and the associated payoff function \(v\) will vary depending on what it is that we want to attribute back to a specific feature.

To avoid any ambiguity between features and samples, going forward we will use \(D\) instead of \(N\) to refer to the set of all candidate features and \(d\) instead of \(n\) to refer to its size.

Properties of A Feature

I. Feature Importance

A feature importance score is any score that quantifies the extent to which a specific feature drives decisions of a specific model.

A popular example is given by Mean Absolute SHAP (SHapley Additive exPlanations) values.

For a model using all features \(x\) to make a decision \(f(x)\), SHAP values quantify the contribution of each feature in generating the specific model decision.

For regression problems, \(f(x)\) is the model's prediction, whereas for classification problems it is typically the estimated predictive probability of an outcome.

SHAP attributes to the features coalition \(S\) represented by the feature vector \(x_S\) the payoff \[v_{SHAP}(S) := E(f(x) \vert x_S ) = \int f(x) d\mathbb{P}_{x_{-S}},\] where \(x_{-S}\) represents the vector of features not in the coalition.

In plain English, SHAP's payoff is the average model decision made when features in the coalition take specific values and other features vary as per the data generating distribution.

Because \(v_{SHAP}(D) = f(x)\) and \(v_{SHAP}(\emptyset) = E(f(x))\), the efficiency property above yields the following decomposition of model decisions \[f(x) = E(f(x)) + \sum_{i=1}^d \phi_i(v_{SHAP}).\]Thus, SHAP allows us to quantify the (possibly negative) contribution of a specific feature towards the deviation of the model decision from its average value \(E(f(x))\).

The absolute term \(\vert \phi_i(v_{SHAP}) \vert \) reflects the sensitivity of the specific model decision made to the value of the i-th feature in \(x\). Thus, its average, namely the Mean Absolute SHAP value \( \int \vert \phi_i(v_{SHAP}) \vert d\mathbb{P}_{x}\), quantifies the extent to which feature \(i\) drives or affects model decisions overall.
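
In practice, Mean Absolute SHAP values are typically computed with the shap Python package. Below is a minimal sketch on a toy regression problem (the dataset, model, and variable names are my own choices for illustration); it also checks the efficiency decomposition numerically.

import numpy as np
import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Toy regression problem and a model f.
X, y = make_regression(n_samples=1000, n_features=5, noise=0.1, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# SHAP values: one attribution phi_i per feature and per model decision f(x).
# Note: computing SHAP values uses only the model and its inputs, never the true labels y.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # shape (n_samples, n_features)

# Efficiency: f(x) = E(f(x)) + sum_i phi_i, up to numerical error.
reconstruction = explainer.expected_value + shap_values.sum(axis=1)
print('Max reconstruction error: %.2e' % np.abs(reconstruction - model.predict(X)).max())

# Mean Absolute SHAP values: feature importance scores, not usefulness scores.
print('Mean Absolute SHAP values:', np.round(np.abs(shap_values).mean(axis=0), 3))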

At this point, it is important to stress that, so far, we have not mentioned or considered true labels, let alone the performance of the model!

Feature importance scores, SHAP values or otherwise, simply cannot inform us whether a model performs well or what features drive model performance. Feature importance scores are great for model explanation, but should not be used beyond that.

II. Feature Usefulness

A feature usefulness score is any score that quantifies the extent to which a specific feature drives the performance of a specific model.

One such score can be obtained by using Shapley value \(\phi_i(v_{USEFUL})\) with payoff function the negative risk, namely \[v_{USEFUL}(S) := -E\left[L(v_{SHAP}(S), y)\right],\]where \(L\) is a loss function reflecting model performance and \(y\) is the true label associated to \(x\). Alternatively, we may use the (cross-validated) negative empirical risk as a proxy for the negative risk.

You can think of \(v_{USEFUL}(S)\) as the performance (negative loss) that we would obtain when only using features in the coalition \(S\).

Note that \(v_{USEFUL}(\emptyset) := P_\emptyset\) is the performance of the baseline strategy consisting of always making the same decision (i.e. \(E(f(x))\) ) regardless of feature values. Similarly, \(v_{USEFUL}(D) := P_D\) is the performance of the model when using all features.

The efficiency property of Shapley values yields the following performance decomposition: \[P_D = P_\emptyset + \sum_{i=1}^d \phi_i(v_{USEFUL}).\]This equation can be very useful for feature selection.

Features with small Shapley usefulness scores \(\phi_i(v_{USEFUL})\) are good candidates to be removed. In particular, it is worth stressing that \(\phi_i(v_{USEFUL})\) may very well be negative!

This would occur if a feature decreases model performance when added to a coalition, which could indicate that said feature causes the model to overfit.

Feature usefulness is a model-specific notion; a feature can be useful to a model, but not to another. A direct consequence of this is that using one model to select features that will then be used in another model can be perilous.
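
As a toy illustration of the decomposition above, the sketch below brute-forces Shapley usefulness scores using cross-validated negative mean squared error as the payoff. The data, model, and loss are my own choices, and exhaustive enumeration of coalitions is only feasible for a small number of candidate features.

from itertools import combinations
from math import comb

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Toy data: 4 candidate features, only 2 of which are informative.
X, y = make_regression(n_samples=500, n_features=4, n_informative=2, noise=5.0, random_state=0)
d = X.shape[1]

# v_USEFUL(empty set): performance of always predicting the average target.
baseline = -np.mean((y - y.mean())**2)

def v_useful(S):
    """Cross-validated negative MSE when only using features in coalition S."""
    if not S:
        return baseline
    return cross_val_score(LinearRegression(), X[:, sorted(S)], y,
                           scoring='neg_mean_squared_error', cv=5).mean()

# Shapley usefulness scores (brute force over all coalitions).
usefulness = []
for i in range(d):
    others = [j for j in range(d) if j != i]
    phi = 0.0
    for size in range(d):
        for S in combinations(others, size):
            phi += (v_useful(set(S) | {i}) - v_useful(set(S)))/(d*comb(d-1, size))
    usefulness.append(phi)

print('Shapley usefulness scores:', np.round(usefulness, 2))
# Efficiency: the scores sum to P_D - P_empty.
print('Sum of scores: %.2f vs P_D - P_empty: %.2f' % (sum(usefulness), v_useful(set(range(d))) - baseline))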

III. Feature Potential

The working assumption when training a predictive model is that there is a sufficient amount of predictive power or 'juice' in our set of candidate features to predict the target.

Often times this is true and yet some features only contribute marginally to the total amount of juice there is in the set of candidate features. In such a case, it is important to be able to detect and remove those features with low 'potential'.

Formally, a feature potential or feature potential usefulness score is any score that quantifies the contribution of a specific feature to the highest performance achievable using a set of candidate features.

One such score  can be obtained by using Shapley value \(\phi_i(v_{POTENTIAL})\) with payoff function the supremum of the negative risk across all possible models, namely \[v_{POTENTIAL}(S) := \sup_{f \in \mathcal{F}} -E\left[L(v_{SHAP}(S), y)\right],\]where \(\mathcal{F}\) is a flexible enough function space and \(v_{SHAP}(S) = E(f(x)\vert x_s)\).

\(\sup_{f \in \mathcal{F}} -E\left[L(v_{SHAP}(S), y)\right]\) represents the highest performance achievable using features in the coalition \(S\) to predict the target \(y\).

It was shown in this paper that, for most performance metrics, the highest performance achievable using \(x_S\) to predict \(y\) is an increasing function of the mutual information \(I(y; x_S)\). As a result, we may also use as a proxy the payoff function \[v_{MI}(S) := I(y; x_S).\]

Note that \(v_{MI}(\emptyset) := 0\) and \(v_{MI}(D) := I(y; x)\). This implies that the efficiency property of Shapley values yields the following mutual information decomposition:  \[I(y; x) = \sum_{i=1}^d \phi_i(v_{MI}).\]

When a feature has a high potential, we can conclude that there exists a model to which the feature can be useful for predicting the target. On the other hand, when the feature's potential is null, we can safely conclude that there does not exist any model to which the feature can be useful for predicting the target, either by itself or in conjunction with other features in the set of candidates.

Feature potential is set-specific. A feature might very well have no potential relative to a set but have some potential relative to another set.

In effect, by noting that \(I(y; x_S, x_i)-I(y; x_S) = I(y; x_i \vert x_S),\) we may rewrite \(\phi_i(v_{MI})\) as \[\phi_i(v_{MI}) = \frac{1}{d}\sum_{S \subset D \backslash \{i\}} {d-1 \choose |S|}^{-1} I(y; x_i \vert x_S).\]

Thus, for a feature to have potential, the feature and the target need to be statistically dependent, either unconditionally (when \(S=\emptyset\)) or conditional on some other features in the set of candidate features.

If a feature \(i\) has no potential in a set of candidate features \(D\), we can only conclude that it is statistically independent from the target, both unconditionally and conditional on any subset of features in \(D\). We cannot, however, rule out the possibility that there could exist another set of candidate features \(C\) containing features conditional on which our feature \(i\) and the target are statistically dependent!

Conclusion

Feature importance scores are good at one thing and one thing only: explaining decisions made by a specific model. For feature selection, feature usefulness and feature potential are more relevant properties to consider.

Of feature importance, feature usefulness, and feature potential, feature potential is the only property of a feature that allows us to draw conclusions that are not model-specific. In particular, because a feature is important or useful to a Random Forest or a Gradient Boosted Tree does not mean it will be important or useful to a linear model, a kernel-based method, or any other model!

Feature usefulness and feature potential are the only two properties that allow us to draw conclusions related to model performance. In particular, feature importance scores should not be used for feature selection.

Features should not be selected for their importance but for their (potential) usefulness.

]]>
<![CDATA[5 Reasons Why You Should Never Use Recursive Feature Elimination]]>https://blog.kxy.ai/why-you-should-stop-using-recursive-feature-elimination/61fd84c7b333be003bb5ba7eTue, 15 Mar 2022 14:16:27 GMT

When building a predictive model, an explanatory variable or feature is deemed unnecessary when it is either uninformative or redundant.

Feature selection, the act of detecting and removing unnecessary explanatory variables, is key to building an effective predictive model.

In effect, failure to remove unnecessary features while training a predictive model increases the likelihood of overfitting, and makes the model harder to explain and costlier to maintain.

An approach often used for feature selection is the Recursive Feature Elimination algorithm (RFE).

RFE is applicable to models for which we may compute a feature importance score.

Simply put, a feature importance score is a score that quantifies how important a feature is for generating model decisions.

Example feature importance scores include Mean Absolute SHAP values, Gini's importance and other split-based or gain-based scores, LOCO, and permutation importance, all of which are discussed below.

Starting with all \(d\) candidate features, RFE recursively trains a model using all eligible features, deletes the feature with the smallest feature importance score to obtain the new set of eligible features, and repeats the previous two steps until the set of eligible features is of a given size \(q < d\), specified by the operator.
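
For reference, this is the algorithm implemented by scikit-learn as sklearn.feature_selection.RFE. Here is a minimal usage sketch on a synthetic problem (shown for concreteness only, given the reasons below; the dataset and model are arbitrary choices):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Toy problem with d = 20 candidate features.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=0)

# Keep q = 5 features, dropping the least important feature at each pass.
selector = RFE(estimator=RandomForestClassifier(n_estimators=50, random_state=0),
               n_features_to_select=5, step=1)
selector.fit(X, y)

print('Selected feature indices:', [i for i, kept in enumerate(selector.support_) if kept])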

While it may seem intuitive at first, RFE should be avoided at all costs.

Here are 5 reasons why.

Reason 1: Because a feature is important does not make it useful!

That's right. Feature importance scores quantify the extent to which a model relies on a feature to make predictions. They do not (necessarily) quantify the contribution of a feature to the overall accuracy of a model (i.e. the feature's usefulness). This is a subtle but fundamental difference.

Take Mean Absolute SHAP value for instance. Its calculation does not utilize true labels at all. As such, there is no guarantee that a high Mean Absolute SHAP value will correspond to a high contribution to the overall accuracy of a model.

The same goes for Gini's importance. The fact that splitting on a feature yielded a high mean (cumulative) impurity decrease while building a Random Forest does not guarantee that said feature will have a high contribution towards the overall out-of-sample accuracy of the model. The trained Random Forest could very well be overfitted. The same argument can be applied to split-based and gain-based importance scores, as well as LOCO and permutation importance when scores are calculated using the training set.

Even if LOCO and permutation importance scores are calculated using validation/held-out data, it still would not guarantee that important features are also useful.

In general, unless the trained model is a good model, there is no guarantee that features with high LOCO or permutation importance scores will be useful. However, feature selection is part of constructing a good model.

Feature selection should guard against us training bad models, not rely on us having trained good models.

In short, if you are not sure that the feature importance score you are using in RFE guarantees that an important feature is also useful, which most off-the-shelf scores don't guarantee, you should not be using RFE.

Reason 2: RFE exacerbates overfitting.

The first pass of RFE uses all candidate features to train the model. The presence of unnecessary features in this set of candidates, without which feature selection would not be needed, increases the likelihood that the first trained model will be overfitted.

When the first model is overfitted, the most important features will tend to be the most useless. After all, they are the ones that drive (inaccurate) model decisions the most!

Thus, in the first pass, RFE will tend to favor important  but useless features over unimportant but useful ones. When features responsible for overfitting are kept in a pass, the model trained in the subsequent pass will also likely overfit, and features driving this overfitting (i.e. features with high importance scores) will most likely be kept as well.

By keeping important features instead of useful ones, RFE basically exacerbates overfitting.

Reason 3: RFE cannot properly detect, let alone eliminate, redundant features.

Among features that are the most important for a model, there could very well be features that are redundant.

As an illustration consider duplicating the feature with the highest importance score and retraining your model. Intuitively, the original feature and its duplicate should have the same score, and the two should be among features with the highest importance scores for the second model. As such, the duplicate feature is unlikely to be removed by RFE, even though it is clearly unnecessary.

In short, if you are not sure that the feature importance score you are using in RFE guarantees that high importance features cannot be mutually redundant, which most off-the-shelf scores don't guarantee, you should not be using RFE to remove redundant features.

Reason 4: RFE does not tell us how many features we can afford to remove.

For a given \(q\), RFE will help you choose which \((d-q)\) features to remove in your set of \(d\) candidate features. But what value of \(q\) should you use?

Too large a \(q\) and you might still be exposed to the downsides of using unnecessary features (e.g. higher likelihood of overfitting, model harder to explain and costlier to maintain, etc.). Too small a \(q\) and your model performance might suffer.

Picking the right \(q\) is as important as choosing which \((d-q)\) features to remove.

Reason 5: RFE does not work for all models out-of-the-box.

As previously discussed, RFE requires a feature importance score. While model-agnostic feature importance scores exist in theory (Mean Absolute SHAP values, LOCO and Permutation Importance to name but a few), they are usually computationally intractable beyond a few dozen features, except in the case of specific models.

This is the case for SHAP values, which are only practical to compute when model-based solutions or approximations are available (e.g. linear regression and tree-based models).

As for LOCO, because the number of models to train at each RFE pass scales linearly with the number of eligible features, it is impractical to use beyond a few dozen features to remove. As a simple illustration, eliminating 40 features out of 50 would require training the model more than 1,000 times!

Finally, in the case of both feature permutation and LOCO, as discussed in Reason 1 above, to be effective they require the trained model to be a good model.


To see how bad RFE really is in practice, check out this benchmark we made, where we compared RFE to LeanML feature selection and Boruta on 38 real-world classification and regression problems.

RFE underperformed both Boruta and LeanML by at least 0.40 on average (in \(R^2\) for regression problems and AUC for classification problems), and was 10x slower than LeanML!

]]>
<![CDATA[AutoML: How To Reduce Your Model Size by 95% While Improving Model Performance]]>https://blog.kxy.ai/autogluon-better-performance-with-95-percent-fewer-features/62259c783f73bc003dc29d96Tue, 08 Mar 2022 17:38:00 GMT

Motivation

AutoML, Large Feature Sets, and Overfitting

Automating algorithm selection and hyper-parameter tuning using an AutoML library such as AWS' AutoGluon can save machine learning engineers tremendously in development costs. However, with great modeling power comes an increased risk of overfitting.

To illustrate this, let us first consider a single algorithm trained with one set of hyper-parameters using k-fold cross-validation. While cross-validation might considerably reduce the probability that our trained model will perform poorly on unseen data, it can never reduce it down to 0. There will always be a small chance that our model was overfitted and will perform very poorly on unseen data. Let us denote \(p_o\) this probability of overfitting.

Now, let us consider independently training not just one, but multiple algorithm + hyper-parameters configurations, and let us assume that \(m\) of these configurations yielded satisfying held-out performances. The probability that at least one of the \(m\) successful configurations will not generalize well to unseen data despite cross-validation is \(p_{o_m} = 1-(1-p_o)^m \approx mp_o\) (for small \(mp_o\)). This is a huge jump from \(p_o\).

As an illustration, let us assume that there is a 1% chance that a model that did well after k-fold cross-validation will perform poorly on unseen data (i.e. \(p_o=0.01\)). If we found \(m=10\) satisfying algorithm + hyper-parameters configurations, then there is a \(p_{o_m}=0.096\) chance that at least one of them will not generalize to new data after cross-validation. This increases to \(p_{o_m}=0.634\) when \(m=100\) and, for \(m=1000\), it is almost certain that at least one configuration will not generalize to new data after cross-validation!

The foregoing analysis was made for one problem/dataset. When we have \(q\) problems to solve, the issue gets much worse. If for each problem we have the same number \(m_q\) of satisfying configurations after cross-validation, then the probability that at least one satisfying configuration for at least one problem will not generalize to new data after cross-validation becomes \(p_{o_{mq}} := 1-(1-p_o)^{qm_q} \approx qm_qp_o \).

In concrete terms, if your k-fold cross-validation is so good that it only has a 1% chance of letting an overfitted model slip through the cracks, and you have 10 predictive models to build, and you use an AutoML suite such as AWS' AutoGluon that finds 10 satisfying configurations algorithm + hyper-parameters you can rely on for each problem, then there is a whopping 63% chance that at least one configuration will perform poorly on unseen data, even though you (kind of) did everything right!
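
For concreteness, the arithmetic above only takes a few lines of Python (using the same illustrative numbers):

# Probability that cross-validation lets a single overfitted configuration slip through.
p_o = 0.01

# One problem, m satisfying configurations.
for m in [10, 100, 1000]:
    print('m = %4d configurations -> P(at least one overfits) = %.3f' % (m, 1. - (1. - p_o)**m))

# q = 10 problems, m_q = 10 satisfying configurations each.
q, m_q = 10, 10
print('q = %d, m_q = %d -> P(at least one overfits) = %.3f' % (q, m_q, 1. - (1. - p_o)**(q*m_q)))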

So, what can you do about it? Let's look at each variable affecting the problem and consider possible solutions.

\(\bf{q}\) — The larger the number of problems you have to solve, the more likely it is that at least one configuration will overfit. However, this variable reflects the needs of the business and can hardly be controlled.

\(\bf{m_q}\) — The more configurations your AutoML library tries out the more likely it is that at least one overfitted configuration will trick your k-fold cross-validation. However, by not trying enough algorithms or hyper-parameters, you are running the risk that your selected model or ensemble will be suboptimal. A good tradeoff is to not consider configurations that are too similar. But then again, if you use an existing AutoML library, you might not have enough control over this.

\(\bf{p_o}\) — This variable essentially reflects the quality of your single-configuration k-fold cross-validation. The most obvious way to increase it is to simply increase the number of folds \(k\). However, this can vastly increase your runtime and development costs more generally.

Another factor driving the probability of overfitting is the number of features a model uses. The more features are used during training, the more opportunity there is for a (flexible) model to discover spurious patterns. Reducing the number of features used in algorithm and hyper-parameter search will not only reduce the likelihood of overfitting but will also reduce runtime and overall development costs.

The challenge here is that, if you choose the wrong subset of features, or you simply don't use enough features,  then you will decrease model performance and hurt the bottom-line.

To sum up, because it runs a large number of experiments, an AutoML algorithm reduces the statistical power of k-fold cross-validation, and increases the likelihood that a model will perform poorly when deployed to production, despite performing superbly on held-out data during cross-validation. A cost-effective approach to addressing this issue is to reduce the number of features the AutoML algorithm learns from, while ensuring insightful features are not left out.

Large Feature Sets and Maintenance Cost

Beyond overfitting and, more generally, higher development costs, another peculiarity of machine learning models as pieces of software is that they are costlier to maintain than traditional pieces of software.

The cost of maintaining predictive machine learning models in production is exacerbated by several factors, among which are data pipeline outages and model performance decay resulting from data drift.

When a data pipeline goes down and predictive models stop receiving some of their inputs, those predictive models (usually) stop working. Such an outage, whose likelihood increases with the number of features that predictive models rely on, can severely handicap a product, and present a big opportunity cost. Intuitively, the fewer the number of features a predictive model uses, the less likely it is to go down.

As time goes by, predictive models often become less effective, to the point of needing to be retrained. The root cause of this problem is known as data drift.

The way we humans behave tends to change or 'drift' over time. It is, therefore, no surprise that distributions of data generated by human activities also change over time. In particular, the relationship between features a production model uses and the target it predicts will also change over time, thereby gradually rendering obsolete the specific relationship learned by the production model at the time of training, and upon which it relies to make predictions. The more features the production model uses, the more rapidly data will drift, and the more often the production model will need to be retrained.

While one should aim to keep the number of features a production model uses to a bare minimum, accidentally leaving out the wrong features can drastically reduce model performance, which would likely affect the bottom-line. Not to mention that the 'bare minimum' number of features one has to keep without affecting model performance is usually unknown to machine learning engineers, and varies from one problem to another.

In short, if your production model uses too many features, you will increase your maintenance cost (among other downsides). But if you choose the wrong subset of features, or you simply don't use enough features,  then you will decrease model performance, and the bottom-line with that.

This blog post shows you how to drastically reduce the number of features used by AWS' AutoGluon in Python while improving model performance.

What To Expect

Using the kxy Python package you don't have to choose between high maintenance cost and low bottom-line, or between overfitting because you passed too many features to your AutoML algorithm (AWS' AutoGluon in this case), and poor performance because you left out insightful features.

The kxy package allows you to drastically reduce the number of features used by AutoGluon, while improving model performance.

Indeed, in an experiment on 38 real-world classification and regression problems from the UCI Machine Learning Repository and Kaggle, using the kxy package, we were able to reduce the number of features used by 95% while improving performance.

The datasets used had between 15 and 1925 automatically generated candidate features, and between 303 and 583250 rows. We did a random 80/20 training/testing data split, and used as the evaluation metric the testing \(R^2\) for regression problems, and the testing AUC for classification problems.

Details and results for each problem are summarized in the table below.

Dataset Rows Candidate Features Features Selected Performance (Full Model) Performance (Compressed Model) Problem Type Source
SkinSegmentation 245057 15 4 1 1 classification UCI
BankNote 1372 20 6 0.99 1 classification UCI
PowerPlant 9568 20 6 0.97 0.97 regression UCI
AirFoil 1503 25 14 0.95 0.94 regression UCI
YachtHydrodynamics 308 30 1 1 0.99 regression UCI
RealEstate 414 30 9 0.75 0.76 regression UCI
Abalone 4177 38 5 0.58 0.58 regression UCI
Concrete 1030 40 11 0.93 0.92 regression UCI
EnergyEfficiency 768 45 7 1 1 regression UCI
WaterQuality 3276 45 30 0.59 0.6 classification Kaggle
Shuttle 58000 45 4 1 1 classification UCI
MagicGamma 19020 50 13 0.86 0.86 classification UCI
Avila 20867 50 31 1 1 classification UCI
WhiteWineQuality 4898 55 27 0.48 0.44 regression UCI
HeartAttack 303 65 8 0.83 0.81 classification Kaggle
HeartDisease 303 65 9 0.83 0.83 classification Kaggle
AirQuality 8991 70 2 1 1 regression UCI
EEGEyeState 14980 70 17 0.97 0.97 classification UCI
LetterRecognition 20000 80 22 0.99 0.99 classification UCI
NavalPropulsion 11934 85 6 1 1 regression UCI
BikeSharing 17379 90 3 1 1 regression UCI
DiabeticRetinopathy 1151 95 32 0.7 0.71 classification UCI
BankMarketing 41188 103 17 0.76 0.77 classification UCI
Parkinson 5875 105 2 1 1 regression UCI
CardDefault 30000 115 24 0.66 0.66 classification UCI
Landsat 6435 180 6 0.99 0.98 classification UCI
Adult 48843 202 8 0.79 0.78 classification UCI
SensorLessDrive 58509 240 19 1 1 classification UCI
FacebookComments 209074 265 13 -13.36 0.61 regression UCI
OnlineNews 39644 290 26 -0.71 0.04 regression UCI
SocialMediaBuzz 583250 385 6 0.94 0.93 regression UCI
Superconductivity 21263 405 19 0.92 0.91 regression UCI
HousePricesAdvanced 1460 432 9 0.88 0.87 regression Kaggle
YearPredictionMSD 515345 450 35 0.41 0.36 regression UCI
APSFailure 76000 850 13 0.86 0.69 classification UCI
BlogFeedback 60021 1400 17 0.6 0.59 regression UCI
Titanic 891 1754 28 0.82 0.79 classification Kaggle
CTSlices 53500 1925 31 1 1 regression UCI

Cumulatively, there were 10229 candidate features to select from across the 38 datasets, and the kxy package only selected 540 of them in total, which corresponds to a 95% reduction in the number of features used overall.

Crucially, the average performance (testing \(R^2\) for regression problems and testing AUC for classification problems) of the compressed model was 0.82, compared to only 0.45 for the full model; a drastic performance increase despite a 95% reduction in the number of features used!

Looking closely at the results in the table above, we see that AutoGluon yielded negative testing performances on FacebookComments and OnlineNews when using all features, but its compressed version did not! While these two datasets explain the big average performance difference between full AutoGluon and compressed AutoGluon, when they are excluded, full and compressed AutoGluon have the same average performance, despite compressed AutoGluon using only 5% of the features used by full AutoGluon!

Code

A Jupyter notebook to reproduce the experiments above is available here. In this post, we will focus on showing you what it will take to compress your own AutoGluon model in Python.

Setup

First, you will need to install the kxy Python package using your method of choice:

  • From PyPi: pip install -U kxy
  • From GitHub: git clone https://github.com/kxytechnologies/kxy-python.git && cd ./kxy-python && pip install .
  • From DockerHub: docker pull kxytechnologies/kxy. The image is shipped with kxy and all its dependencies pre-installed.

Next, simply import the kxy package in your code. The kxy package is well integrated with pandas, so while you are at it you might also want to import pandas.

import kxy
import pandas as pd

From this point on, any instance of a pandas DataFrame, say df, that you have in your code is automatically enriched with a set of kxy methods accessible as df.kxy.<method_name>.

Training

Training a compressed AutoGluon model can be done in a single line of code.

results = training_df.kxy.fit(target_column, learner_func, \
    problem_type=problem_type, feature_selection_method='leanml')

training_df is the pandas DataFrame containing training data. target_column is a variable containing the name of the target column. All other columns are considered candidate features/explanatory variables.

problem_type reflects the nature of the predictive problem to solve and should be either 'regression' or 'classification'.

feature_selection_method should be set to 'leanml'. If you want to know all possible values and why you should use 'leanml', read this blog post.

In general, learner_func is the function we will call to create new trainable instances of your model. It takes three optional parameters:

  • n_vars: The number of features the model should expect, in case it is required to instantiate the model (e.g. for neural networks).
  • path: Where to save or load the model from, if needed.
  • safe: A boolean that controls what to do when the model can't be loaded from path. The convention is that, if path is not None, learner_func should try to load the model from disk. If this fails then learner_func should create a new instance of your model if safe is set to True, and raise an exception if safe is False.

learner_func should return a model following the Scikit-Learn API. That is, at the very least, returned models should have fit(self, X, y) and predict(self, X) methods, where X and y are NumPy arrays. If you intend to save/load your compressed models, models returned by learner_func should also have a save(self, path) method to save a specific instance to disk.
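
For concreteness, here is a minimal sketch of a custom learner_func satisfying the convention above, wrapping a scikit-learn Random Forest. This is only an illustration of the stated contract, not the kxy package's own implementation, and the use of joblib for persistence is my own choice.

import joblib
from sklearn.ensemble import RandomForestRegressor

class SaveableRandomForest(RandomForestRegressor):
    """A scikit-learn model with the extra save(path) method expected by kxy."""
    def save(self, path):
        joblib.dump(self, path)

def learner_func(n_vars=None, path=None, safe=True):
    # When a path is provided, first try to load an existing model from disk.
    if path is not None:
        try:
            return joblib.load(path)
        except Exception:
            if not safe:
                raise
    # Otherwise (or if loading failed and safe is True), create a fresh instance.
    return SaveableRandomForest(n_estimators=100, random_state=0)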

For AutoGluon models specifically, we provide a utility function  (get_autogluon_learner) to generate a learner_func that creates instances of autogluon.tabular.TabularPredictor with set hyper-parameters.

Here is an illustration in the case of a regression problem.

from kxy.learning import get_autogluon_learner
kwargs = {}
fit_kwargs = {}
learner_func = get_autogluon_learner(problem_type='regression', \
	eval_metric=None, verbosity=2, sample_weight=None, \
	weight_evaluation=False, groups=None, fit_kwargs={}, **kwargs)

problem_type, eval_metric, verbosity, sample_weight, weight_evaluation, groups, and kwargs are all parameters you would pass to the constructor of autogluon.tabular.TabularPredictor. It is worth noting that problem_type here is not the same as the problem_type you would pass to df.kxy.fit. fit_kwargs is the dictionary of named arguments you would pass to the fit method of an instance of autogluon.tabular.TabularPredictor.

Prediction

Once you have fitted a model, you get a predictor back in the results dictionary.

predictor = results['predictor']

You can inspect selected variables from the predictor like so:

selected_variables = predictor.selected_variables

The following line shows you how to make predictions corresponding to a DataFrame of testing features testing_df.

predictions_df = predictor.predict(testing_df)

All that is required is for testing_df to have all columns contained in selected_variables. predictions_df is a pandas DataFrame with a single column whose name is the same as the target column in the training DataFrame training_df.

To access the low-level TabularPredictor model, run

autogluon_tabular_predictor = predictor.models[0]._model

If you choose to use the TabularPredictor directly, remember that testing inputs data_test should be generated like so:

X_test = testing_df[selected_variables].values
X_columns = predictor.models[0].x_columns
data_test = pd.DataFrame(X_test, columns=X_columns)

Saving/Loading

You can directly save the predictor to disk using

predictor.save(path)

To load a predictor from disk, run

from kxy.learning.leanml_predictor import LeanMLPredictor
predictor = LeanMLPredictor.load(path, learner_func)

Pricing

The kxy package is open-source. However, some of the heavy-duty optimization tasks (involved in LeanML feature selection) are run by our backend. For that, we charge a small per task fee.

That said, kxy is completely free for academic use. Simply sign up here with your university email address, and get your API key here.

Once you have your API key, simply run kxy configure <YOUR API KEY> in the terminal as a one-off, or set your API key as the value of the environment variable KXY_API_KEY, for instance by adding the two lines below to your Python code before importing the kxy package:

import os
os.environ['KXY_API_KEY'] = '<YOUR API KEY>'

Finally, you don't need to sign up to try out kxy! Your first few dozen tasks are on us; just install the kxy package and give it a go. If you love it, sign up and spread the word.

]]>
<![CDATA[Random Forest: How To Reduce Your Production Model Size by 95%]]>https://blog.kxy.ai/random-forest-same-performance-with-95-percent-fewer-features/622423203f73bc003dc29c58Tue, 08 Mar 2022 12:00:00 GMT

Motivation

Machine learning models are peculiar pieces of software in that they are costlier to maintain than traditional pieces of software.

The cost of maintaining predictive machine learning models in production is exacerbated by several factors, among which are data pipeline outages and model performance decay resulting from data drift.

When a data pipeline goes down and predictive models stop receiving some of their inputs, those predictive models (usually) stop working. Such an outage, whose likelihood increases with the number of features that predictive models rely on, can severely handicap a product, and present a big opportunity cost. Intuitively, the fewer the number of features a predictive model uses, the less likely it is to go down.

As time goes by, predictive models often become less effective, to the point of needing to be retrained. The root cause of this problem is known as data drift.

The way we humans behave tends to change or 'drift' over time. It is, therefore, no surprise that distributions of data generated by human activities also change over time. In particular, the relationship between features a production model uses and the target it predicts will also change over time, thereby gradually rendering obsolete the specific relationship learned by the production model at the time of training, and upon which it relies to make predictions. The more features the production model uses, the more rapidly data will drift, and the more often the production model will need to be retrained.

Using too large a production model can have many more downsides, among which high latency and poor explainability, to name but a couple.

While one should aim to keep the number of features a production model uses to a bare minimum, accidentally leaving out the wrong features can drastically reduce model performance, which would likely affect the bottom-line. Not to mention that the 'bare minimum' number of features one has to keep without affecting model performance is usually unknown to the machine learning engineers, and varies from one problem to another.

In short, if your production model uses too many features, you will increase your maintenance cost (among other downsides). But if you choose the wrong subset of features, or you simply don't use enough features,  then you will decrease model performance, and the bottom-line with that.

This blog post provides a solution to this dilemma in Python when you use Random Forest as your production model.

What To Expect

Using the kxy Python package you don't have to choose between high maintenance cost and low bottom line; you can drastically reduce the number of features your production model uses without losing model performance.

Indeed, in an experiment on 38 real-world classification and regression problems from the UCI Machine Learning Repository and Kaggle, using the kxy package, we were able to reduce the number of features used by 95% with virtually no effect on model performance.

The datasets used had between 15 and 1925 automatically generated candidate features, and between 303 and 583250 rows. We did a random 80/20 training/testing data split, and used as the evaluation metric the testing \(R^2\) for regression problems, and the testing AUC for classification problems.

Details and results for each problem are summarized in the table below.

Dataset Rows Candidate Features Selected Features Performance (Full Model) Performance (Compressed Model) Problem Type
SkinSegmentation 245057 15 4 1 1 classification
BankNote 1372 20 5 0.99 0.98 classification
PowerPlant 9568 20 8 0.97 0.97 regression
AirFoil 1503 25 12 0.92 0.93 regression
YachtHydrodynamics 308 30 2 1 0.99 regression
RealEstate 414 30 11 0.74 0.77 regression
Abalone 4177 38 7 0.57 0.54 regression
Concrete 1030 40 11 0.93 0.92 regression
EnergyEfficiency 768 45 12 1 1 regression
WaterQuality 3276 45 28 0.59 0.58 classification
Shuttle 58000 45 3 1 1 classification
MagicGamma 19020 50 20 0.86 0.85 classification
Avila 20867 50 30 1 1 classification
WhiteWineQuality 4898 55 25 0.43 0.38 regression
HeartAttack 303 65 9 0.82 0.8 classification
HeartDisease 303 65 9 0.82 0.85 classification
AirQuality 8991 70 6 1 1 regression
EEGEyeState 14980 70 18 0.92 0.93 classification
LetterRecognition 20000 80 19 0.98 0.98 classification
NavalPropulsion 11934 85 7 1 0.99 regression
BikeSharing 17379 90 3 1 1 regression
DiabeticRetinopathy 1151 95 34 0.65 0.72 classification
BankMarketing 41188 103 13 0.77 0.77 classification
Parkinson 5875 105 6 1 1 regression
CardDefault 30000 115 24 0.66 0.66 classification
Landsat 6435 180 6 0.98 0.98 classification
Adult 48843 202 4 0.79 0.76 classification
SensorLessDrive 58509 240 19 1 1 classification
FacebookComments 209074 265 11 0.71 0.59 regression
OnlineNews 39644 290 26 0.01 0.01 regression
SocialMediaBuzz 583250 385 6 0.93 0.91 regression
Superconductivity 21263 405 18 0.92 0.91 regression
HousePricesAdvanced 1460 432 8 0.83 0.81 regression
YearPredictionMSD 515345 450 41 0.34 0.33 regression
APSFailure 76000 850 3 0.87 0.58 classification
BlogFeedback 60021 1400 20 0.59 0.58 regression
Titanic 891 1754 26 0.84 0.82 classification
CTSlices 53500 1925 33 0.99 0.99 regression

Cumulatively, there were 10229 candidate features to select from across the 38 datasets, and the kxy package only selected 516 of them in total, which corresponds to a 95% reduction in the number of features used overall.

Crucially, the average performance (testing \(R^2\) for regression problems and testing AUC for classification problems) of the compressed model was 0.79, the same as the full model; no performance reduction for a 95% reduction in the number of features used!

Code

A Jupyter notebook to reproduce the experiment above is available here. In this post, we will focus on showing you what it will take to compress your own Random Forest model in Python.

Setup

First, you will need to install the kxy Python package using your method of choice:

  • From PyPI: pip install -U kxy
  • From GitHub: git clone https://github.com/kxytechnologies/kxy-python.git && cd ./kxy-python && pip install .
  • From DockerHub: docker pull kxytechnologies/kxy. The image ships with kxy and all its dependencies pre-installed.

Next, simply import the kxy package in your code. The kxy package is well integrated with pandas, so while you are at it you might also want to import pandas.

import kxy
import pandas as pd

From this point on, any instance of a pandas DataFrame, say df, that you have in your code is automatically enriched with a set of kxy methods accessible as df.kxy.<method_name>.

Training

Training a compressed Random Forest model can be done in a single line of code.

results = training_df.kxy.fit(target_column, learner_func, \
    problem_type=problem_type, feature_selection_method='leanml')

training_df is the pandas DataFrame containing training data. target_column is a variable containing the name of the target column. All other columns are considered candidate features/explanatory variables.

problem_type reflects the nature of the predictive problem to solve and should be either 'regression' or 'classification'.

feature_selection_method should be set to 'leanml'. If you want to know all possible values and why you should use 'leanml', read this blog post.

In general, learner_func is the function we will call to create new trainable instances of your model. It takes three optional parameters:

  • n_vars: The number of features the model should expect, in case it is required to instantiate the model (e.g. for neural networks).
  • path: Where to save or load the model from, if needed.
  • safe: A boolean that controls what to do when the model can't be loaded from path. The convention is that, if path is not None, learner_func should try to load the model from disk. If this fails then learner_func should create a new instance of your model if safe is set to True, and raise an exception if safe is False.

learner_func should return a model following the Scikit-Learn API. That is, at the very least, returned models should have fit(self, X, y) and predict(self, X) methods, where X and y are NumPy arrays. If you intend to save/load your compressed models, models returned by learner_func should also have a save(self, path) method to save a specific instance to disk.
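To make this contract concrete, here is a minimal, hypothetical sketch of a learner_func wrapping a Scikit-Learn Random Forest regressor. The wrapper class and the use of joblib for persistence are illustrative assumptions, not part of the kxy API; in practice you would typically use the utility function described next.

# Minimal, hypothetical learner_func illustrating the contract above.
# The wrapper class and joblib-based persistence are illustrative assumptions.
import joblib
from sklearn.ensemble import RandomForestRegressor

class RandomForestLearner:
    def __init__(self):
        self._model = RandomForestRegressor(n_estimators=100)

    def fit(self, X, y):
        # X and y are NumPy arrays, as required by the convention above.
        self._model.fit(X, y.ravel())
        return self

    def predict(self, X):
        return self._model.predict(X)

    def save(self, path):
        joblib.dump(self._model, path)

    @classmethod
    def load(cls, path):
        learner = cls()
        learner._model = joblib.load(path)
        return learner

def learner_func(n_vars=None, path=None, safe=True):
    # If a path is provided, first try to load a previously saved model.
    if path is not None:
        try:
            return RandomForestLearner.load(path)
        except Exception:
            if not safe:
                raise
    # Otherwise (or if loading failed and safe is True), return a new instance.
    return RandomForestLearner()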

For Random Forest models specifically, we provide a utility function (get_sklearn_learner) to generate a learner_func that creates instances of Random Forest (or any other Scikit-Learn) models with set hyper-parameters.

Here is an illustration in the case of a regression problem.

from kxy.learning import get_sklearn_learner
class_name = 'sklearn.ensemble.RandomForestRegressor'
args = ()
kwargs = {
	'min_samples_split': 0.01, 'max_samples': 0.5,\
	'n_estimators': 100\
}
learner_func = get_sklearn_learner(class_name, *args, **kwargs)

class_name should be either 'sklearn.ensemble.RandomForestRegressor' or 'sklearn.ensemble.RandomForestClassifier'.

args and kwargs are positional and named arguments you would pass to the constructor of sklearn.ensemble.RandomForestRegressor or sklearn.ensemble.RandomForestClassifier.
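For a classification problem, the same utility applies with the classifier class; the hyper-parameter values in the sketch below are illustrative assumptions, not recommendations from this post.

from kxy.learning import get_sklearn_learner
# Illustrative hyper-parameters; tune them for your own problem.
classifier_learner_func = get_sklearn_learner(
    'sklearn.ensemble.RandomForestClassifier',
    min_samples_split=0.01, max_samples=0.5, n_estimators=100)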

Prediction

Once you have fitted a model, you get a predictor back in the results dictionary.

predictor = results['predictor']

You can inspect selected variables from the predictor like so:

selected_variables = predictor.selected_variables

The following line shows you how to make predictions corresponding to a DataFrame of testing features testing_df.

predictions_df = predictor.predict(testing_df)

All that is required is for testing_df to have all columns contained in selected_variables. predictions_df is a pandas DataFrame with a single column whose name is the same as the target column in the training DataFrame training_df.

To access the low-level Scikit-Learn model, run

sklearn_model = predictor.models[0]._model

If you choose to use the Scikit-Learn model directly, remember that testing inputs should be generated like so:

X_test = testing_df[selected_variables].values
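Putting these pieces together, a minimal sketch of making predictions with the underlying Scikit-Learn model directly (assuming predictor, testing_df and selected_variables are defined as above) might look like this:

# Sketch: bypass the kxy predictor and call the underlying scikit-learn model.
# 'predictor', 'testing_df' and 'selected_variables' are assumed to be defined
# as above.
sklearn_model = predictor.models[0]._model
X_test = testing_df[selected_variables].values
raw_predictions = sklearn_model.predict(X_test)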

Saving/Loading

You can directly save the predictor to disk using

predictor.save(path)

To load a predictor from disk, run

from kxy.learning.leanml_predictor import LeanMLPredictor
predictor = LeanMLPredictor.load(path, learner_func)

Pricing

The kxy package is open-source. However, some of the heavy-duty optimization tasks (involved in LeanML feature selection) are run by our backend. For that, we charge a small per task fee.

That said, kxy is completely free for academic use. Simply sign up here with your university email address, and get your API key here.

Once you have your API key, simply run kxy configure <YOUR API KEY> in the terminal as a one-off, or set your API key as the value of the environment variable KXY_API_KEY, for instance by adding the two lines below to your Python code before importing the kxy package:

import os
os.environ['KXY_API_KEY'] = '<YOUR API KEY>'

Finally, you don't need to sign up to try out kxy! Your first few dozen tasks are on us; just install the kxy package and give it a go. If you love it, sign up and spread the word.

]]>
<![CDATA[XGBoost: How To Reduce Your Production Model Size by 95%]]>https://blog.kxy.ai/xgboost-same-performance-with-95-percent-fewer-features/62241bc63f73bc003dc29befMon, 07 Mar 2022 12:00:00 GMT

Motivation

Machine learning models are peculiar pieces of software in that they are costlier to maintain than traditional pieces of software.

The cost of maintaining predictive machine learning models in production is exacerbated by several factors, among which are data pipeline outages and model performance decay resulting from data drift.

When a data pipeline goes down and predictive models stop receiving some of their inputs, those predictive models (usually) stop working. Such an outage, whose likelihood increases with the number of features that predictive models rely on, can severely handicap a product and carry a large opportunity cost. Intuitively, the fewer features a predictive model uses, the less likely it is to go down.

As time goes by, predictive models often become less effective, to the point of needing to be retrained. The root cause of this problem is known as data drift.

The way we humans behave tends to change, or 'drift', over time. It is therefore no surprise that the distributions of data generated by human activities also change over time. In particular, the relationship between the features a production model uses and the target it predicts will change over time, gradually rendering obsolete the specific relationship the model learned at training time and on which it relies to make predictions. The more features the production model uses, the more exposed it is to data drift, and the more often it will need to be retrained.

Using too large a production model has other downsides as well, among them higher latency and poorer explainability.

While one should aim to keep the number of features a production model uses to a bare minimum, accidentally leaving out the wrong features can drastically reduce model performance, which would likely hurt the bottom line. Moreover, the 'bare minimum' number of features one can keep without affecting model performance is usually unknown to machine learning engineers, and varies from one problem to another.

In short, if your production model uses too many features, you will increase your maintenance cost (among other downsides). But if you choose the wrong subset of features, or you simply don't use enough features, you will decrease model performance, and hurt the bottom line along with it.

This blog post provides a solution to this dilemma in Python when you use XGBoost as your production model.

What To Expect

Using the kxy Python package you don't have to choose between high maintenance cost and low bottom line; you can drastically reduce the number of features your production model uses without losing model performance.

Indeed, in an experiment on 38 real-world classification and regression problems from the UCI Machine Learning Repository and Kaggle, using the kxy package, we were able to reduce the number of features used by 95% with virtually no effect on model performance.

The datasets used had between 15 and 1925 automatically generated candidate features, and between 303 and 583250 rows. We did a random 80/20 training/testing data split, and used as the evaluation metric the testing \(R^2\) for regression problems, and the testing AUC for classification problems.

Details and results for each problem are summarized in the table below.

Dataset Rows Candidate Features Selected Features Performance (Full Model) Performance (Compressed Model) Problem Type
SkinSegmentation 245057 15 7 1 1 classification
BankNote 1372 20 4 1 0.99 classification
PowerPlant 9568 20 8 0.97 0.97 regression
AirFoil 1503 25 9 0.93 0.92 regression
YachtHydrodynamics 308 30 1 1 0.99 regression
RealEstate 414 30 9 0.72 0.72 regression
Abalone 4177 38 8 0.52 0.53 regression
Concrete 1030 40 11 0.93 0.92 regression
EnergyEfficiency 768 45 11 1 1 regression
WaterQuality 3276 45 31 0.6 0.59 classification
Shuttle 58000 45 3 1 1 classification
MagicGamma 19020 50 15 0.86 0.86 classification
Avila 20867 50 30 1 1 classification
WhiteWineQuality 4898 55 29 0.44 0.37 regression
HeartAttack 303 65 9 0.86 0.84 classification
HeartDisease 303 65 9 0.86 0.84 classification
AirQuality 8991 70 2 1 1 regression
EEGEyeState 14980 70 17 0.91 0.92 classification
LetterRecognition 20000 80 22 0.98 0.98 classification
NavalPropulsion 11934 85 5 0.99 0.99 regression
BikeSharing 17379 90 4 1 1 regression
DiabeticRetinopathy 1151 95 34 0.65 0.7 classification
BankMarketing 41188 103 14 0.77 0.76 classification
Parkinson 5875 105 2 1 1 regression
CardDefault 30000 115 26 0.66 0.66 classification
Landsat 6435 180 5 0.98 0.98 classification
Adult 48843 202 19 0.79 0.78 classification
SensorLessDrive 58509 240 23 1 1 classification
FacebookComments 209074 265 13 0.72 0.58 regression
OnlineNews 39644 290 26 0 0 regression
SocialMediaBuzz 583250 385 6 0.95 0.94 regression
Superconductivity 21263 405 17 0.91 0.9 regression
HousePricesAdvanced 1460 432 8 0.83 0.88 regression
YearPredictionMSD 515345 450 36 0.32 0.31 regression
APSFailure 76000 850 9 0.91 0.73 classification
BlogFeedback 60021 1400 13 0.58 0.57 regression
Titanic 891 1754 26 0.82 0.81 classification
CTSlices 53500 1925 34 0.99 0.98 regression

Cumulatively, there were 10229 candidate features to select from across the 38 datasets, and the kxy package only selected 555 of them in total, which corresponds to a 95% reduction in the number of features used overall.

Crucially, the average performance (testing \(R^2\) for regression problems and testing AUC for classification problems) of the compressed model was 0.82, compared to 0.83 for the full model; a virtually null performance reduction for a 95% reduction in the number of features used!

Code

A Jupyter notebook to reproduce the experiment above is available here. In this post, we will focus on showing you what it will take to compress your own XGBoost model in Python.

Setup

First, you will need to install the kxy Python package using your method of choice:

  • From PyPI: pip install -U kxy
  • From GitHub: git clone https://github.com/kxytechnologies/kxy-python.git && cd ./kxy-python && pip install .
  • From DockerHub: docker pull kxytechnologies/kxy. The image ships with kxy and all its dependencies pre-installed.

Next, simply import the kxy package in your code. The kxy package is well integrated with pandas, so while you are at it you might also want to import pandas.

import kxy
import pandas as pd

From this point on, any instance of a pandas DataFrame, say df, that you have in your code is automatically enriched with a set of kxy methods accessible as df.kxy.<method_name>.

Training

Training a compressed XGBoost model can be done in a single line of code.

results = training_df.kxy.fit(target_column, learner_func, \
    problem_type=problem_type, feature_selection_method='leanml')

training_df is the pandas DataFrame containing training data. target_column is a variable containing the name of the target column. All other columns are considered candidate features/explanatory variables.

problem_type reflects the nature of the predictive problem to solve and should be either 'regression' or 'classification'.

feature_selection_method should be set to 'leanml'. If you want to know all possible values and why you should use 'leanml', read this blog post.

In general, learner_func is the function we will call to create new trainable instances of your model. It takes three optional parameters:

  • n_vars: The number of features the model should expect, in case it is required to instantiate the model (e.g. for neural networks).
  • path: Where to save or load the model from, if needed.
  • safe: A boolean that controls what to do when the model can't be loaded from path. The convention is that if path is not None, learner_func should try to load the model from disk. If this fails then learner_func should create a new instance of your model if safe is set to True, and raise an exception if safe is False.

learner_func should return a model following the Scikit-Learn API. That is, at the very least, returned models should have fit(self, X, y) and predict(self, X) methods, where X and y are NumPy arrays. If you intend to save/load your compressed models, models returned by learner_func should also have a save(self, path) method to save a specific instance to disk.

For XGBoost models specifically, we provide a utility function (get_xgboost_learner) to generate a learner_func that creates instances of XGBoost models with set hyper-parameters, using the XGBoost Scikit-Learn API.

Here is an illustration in the case of a regression problem.

from kxy.learning import get_xgboost_learner
class_name = 'xgboost.XGBRegressor' # or 'xgboost.XGBClassifier'
args = ()
kwargs = {}
learner_func = get_xgboost_learner(class_name, *args, \
	fit_kwargs={}, predict_kwargs={}, **kwargs)

args and kwargs are positional and named arguments you would pass to the constructor of xgboost.XGBRegressor or xgboost.XGBClassifier. fit_kwargs (resp. predict_kwargs) are named arguments you would pass to the fit (resp. predict) method of an instance of xgboost.XGBRegressor or xgboost.XGBClassifier.
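As a sketch, here is how you might pass explicit hyper-parameters through this interface; the specific values below are illustrative assumptions, not recommendations from this post.

# Illustrative hyper-parameters for xgboost.XGBRegressor; tune for your problem.
from kxy.learning import get_xgboost_learner
learner_func = get_xgboost_learner('xgboost.XGBRegressor', \
	fit_kwargs={'verbose': False}, predict_kwargs={}, \
	n_estimators=200, max_depth=6, learning_rate=0.1)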

Prediction

Once you have fitted a model, you get a predictor back in the results dictionary.

predictor = results['predictor']

You can inspect selected variables from the predictor like so:

selected_variables = predictor.selected_variables

The following line shows you how to make predictions corresponding to a DataFrame of testing features testing_df.

predictions_df = predictor.predict(testing_df)

All that is required is for testing_df to have all columns contained in selected_variables. predictions_df is a pandas DataFrame with a single column whose name is the same as the target column in the training DataFrame training_df.

To access the low-level xgboost.sklearn.XGBModel and booster, run

xgbmodel = predictor.models[0]._model
booster = xgbmodel.get_booster()

If you choose to use the booster or XGBModel directly, remember that testing inputs should be generated like so:

X_test = testing_df[selected_variables].values
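For instance, a minimal sketch of predicting with these low-level objects directly (assuming xgbmodel, booster and X_test are defined as above) is shown below; note that the raw booster expects a DMatrix rather than a NumPy array.

import xgboost as xgb

# Predictions via the sklearn wrapper (NumPy array in).
wrapper_predictions = xgbmodel.predict(X_test)

# Predictions via the low-level booster (DMatrix in).
booster_predictions = booster.predict(xgb.DMatrix(X_test))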

Saving/Loading

You can directly save the predictor to disk using

predictor.save(path)

To load a predictor from disk, run

from kxy.learning.leanml_predictor import LeanMLPredictor
predictor = LeanMLPredictor.load(path, learner_func)

Pricing

The kxy package is open-source. However, some of the heavy-duty optimization tasks (involved in LeanML feature selection) are run by our backend. For that, we charge a small per task fee.

That said, kxy is completely free for academic use. Simply sign up here with your university email address, and get your API key here.

Once you have your API key, simply run kxy configure <YOUR API KEY> in the terminal as a one-off, or set your API key as the value of the environment variable KXY_API_KEY, for instance by adding the two lines below to your Python code before importing the kxy package:

import os
os.environ['KXY_API_KEY'] = '<YOUR API KEY>'

Finally, you don't need to sign up to try out kxy! Your first few dozen tasks are on us; just install the kxy package and give it a go. If you love it, sign up and spread the word.

]]>
<![CDATA[LightGBM: How To Reduce Your Production Model Size by 95%]]>https://blog.kxy.ai/lightgbm-same-performance-with-95-percent-fewer-features/62201ae73f73bc003dc29672Sun, 06 Mar 2022 12:00:00 GMT

Motivation

Machine learning models are peculiar pieces of software in that they are costlier to maintain than traditional pieces of software.

The cost of maintaining predictive machine learning models in production is exacerbated by several factors, among which are data pipeline outages and model performance decay resulting from data drift.

When a data pipeline goes down and predictive models stop receiving some of their inputs, those predictive models (usually) stop working. Such an outage, whose likelihood increases with the number of features that predictive models rely on, can severely handicap a product and carry a large opportunity cost. Intuitively, the fewer features a predictive model uses, the less likely it is to go down.

As time goes by, predictive models often become less effective, to the point of needing to be retrained. The root cause of this problem is known as data drift.

The way we humans behave tends to change, or 'drift', over time. It is therefore no surprise that the distributions of data generated by human activities also change over time. In particular, the relationship between the features a production model uses and the target it predicts will change over time, gradually rendering obsolete the specific relationship the model learned at training time and on which it relies to make predictions. The more features the production model uses, the more exposed it is to data drift, and the more often it will need to be retrained.

Using too large a production model has other downsides as well, among them higher latency and poorer explainability.

While one should aim to keep the number of features a production model uses to a bare minimum, accidentally leaving out the wrong features can drastically reduce model performance, which would likely hurt the bottom line. Moreover, the 'bare minimum' number of features one can keep without affecting model performance is usually unknown to machine learning engineers, and varies from one problem to another.

In short, if your production model uses too many features, you will increase your maintenance cost (among other downsides). But if you choose the wrong subset of features, or you simply don't use enough features, you will decrease model performance, and hurt the bottom line along with it.

This blog post provides a solution to this dilemma in Python when you use LightGBM as your production model.

What To Expect

Using the kxy Python package you don't have to choose between high maintenance cost and low bottom line; you can drastically reduce the number of features your production model uses without losing model performance.

Indeed, in an experiment on 38 real-world classification and regression problems from the UCI Machine Learning Repository and Kaggle, using the kxy package, we were able to reduce the number of features used by 95% with virtually no effect on model performance.

The datasets used had between 15 and 1925 automatically generated candidate features, and between 303 and 583250 rows. We did a random 80/20 training/testing data split, and used as the evaluation metric the testing \(R^2\) for regression problems, and the testing AUC for classification problems.

Details and results for each problem are summarized in the table below.

Dataset Rows Candidate Features Selected Features Performance (Full Model) Performance (Compressed Model) Problem Type
SkinSegmentation 245057 15 4 1 1 classification
BankNote 1372 20 5 0.99 0.98 classification
PowerPlant 9568 20 8 0.97 0.97 regression
AirFoil 1503 25 12 0.92 0.93 regression
YachtHydrodynamics 308 30 2 1 0.99 regression
RealEstate 414 30 11 0.74 0.77 regression
Abalone 4177 38 7 0.57 0.54 regression
Concrete 1030 40 11 0.93 0.92 regression
EnergyEfficiency 768 45 12 1 1 regression
WaterQuality 3276 45 28 0.59 0.58 classification
Shuttle 58000 45 3 1 1 classification
MagicGamma 19020 50 20 0.86 0.85 classification
Avila 20867 50 30 1 1 classification
WhiteWineQuality 4898 55 25 0.43 0.38 regression
HeartAttack 303 65 9 0.82 0.8 classification
HeartDisease 303 65 9 0.82 0.85 classification
AirQuality 8991 70 6 1 1 regression
EEGEyeState 14980 70 18 0.92 0.93 classification
LetterRecognition 20000 80 19 0.98 0.98 classification
NavalPropulsion 11934 85 7 1 0.99 regression
BikeSharing 17379 90 3 1 1 regression
DiabeticRetinopathy 1151 95 34 0.65 0.72 classification
BankMarketing 41188 103 13 0.77 0.77 classification
Parkinson 5875 105 6 1 1 regression
CardDefault 30000 115 24 0.66 0.66 classification
Landsat 6435 180 6 0.98 0.98 classification
Adult 48843 202 4 0.79 0.76 classification
SensorLessDrive 58509 240 19 1 1 classification
FacebookComments 209074 265 11 0.71 0.59 regression
OnlineNews 39644 290 26 0.01 0.01 regression
SocialMediaBuzz 583250 385 6 0.93 0.91 regression
Superconductivity 21263 405 18 0.92 0.91 regression
HousePricesAdvanced 1460 432 8 0.83 0.81 regression
YearPredictionMSD 515345 450 41 0.34 0.33 regression
APSFailure 76000 850 3 0.87 0.58 classification
BlogFeedback 60021 1400 20 0.59 0.58 regression
Titanic 891 1754 26 0.84 0.82 classification
CTSlices 53500 1925 33 0.99 0.99 regression

Cumulatively, there were 10229 candidate features to select from across the 38 datasets, and the kxy package only selected 547 of them in total, which corresponds to a 95% reduction in the number of features used overall.

Crucially, the average performance (testing \(R^2\) for regression problems and testing AUC for classification problems) of the compressed model was 0.81, compared to 0.83 for the full model; a virtually null performance reduction for a 95% reduction in the number of features used!

Code

A Jupyter notebook to reproduce the experiment above is available here. In this post, we will focus on showing you what it will take to compress your own LightGBM model in Python.

Setup

First, you will need to install the kxy Python package using your method of choice:

  • From PyPI: pip install -U kxy
  • From GitHub: git clone https://github.com/kxytechnologies/kxy-python.git && cd ./kxy-python && pip install .
  • From DockerHub: docker pull kxytechnologies/kxy. The image ships with kxy and all its dependencies pre-installed.

Next, simply import the kxy package in your code. The kxy package is well integrated with pandas, so while you are at it you might also want to import pandas.

import kxy
import pandas as pd

From this point on, any instance of a pandas DataFrame, say df, that you have in your code is automatically enriched with a set of kxy methods accessible as df.kxy.<method_name>.

Training

Training a compressed LightGBM model can be done in a single line of code.

results = training_df.kxy.fit(target_column, learner_func, \
    problem_type=problem_type, feature_selection_method='leanml')

training_df is the pandas DataFrame containing training data. target_column is a variable containing the name of the target column. All other columns are considered candidate features/explanatory variables.

problem_type reflects the nature of the predictive problem to solve and should be either 'regression' or 'classification'.

feature_selection_method should be set to 'leanml'. If you want to know all possible values and why you should use 'leanml', read this blog post.

In general, learner_func is the function we will call to create new trainable instances of your model. It takes three optional parameters:

  • n_vars: The number of features the model should expect, in case it is required to instantiate the model (e.g. for neural networks).
  • path: Where to save or load the model from, if needed.
  • safe: A boolean that controls what to do when the model can't be loaded from path. The convention is that if path is not None, learner_func should try to load the model from disk. If this fails then learner_func should create a new instance of your model if safe is set to True, and raise an exception if safe is False.

learner_func should return a model following the Scikit-Learn API. That is, at the very least, returned models should have fit(self, X, y) and predict(self, X) methods, where X and y are NumPy arrays. If you intend to save/load your compressed models, models returned by learner_func should also have a save(self, path) method to save a specific instance to disk.

For LightGBM models specifically, we provide a utility function (get_lightgbm_learner_learning_api) to generate a learner_func that creates instances of LightGBM models with set hyper-parameters, using the LightGBM training API.

Here is an illustration for binary classification.

from kxy.learning import get_lightgbm_learner_learning_api
params = {
	'objective': 'binary',
	'metric': ['auc', 'binary_logloss'],
}
learner_func = get_lightgbm_learner_learning_api(params, \
	num_boost_round=100, fobj=None, feval=None, init_model=None, \
	feature_name='auto', categorical_feature='auto', \
	early_stopping_rounds=None, verbose_eval='warn', \
    learning_rates=None, keep_training_booster=False, \
    callbacks=None, split_random_seed=None, \
	weight_func=None, verbose=-1, importance_type='split')

Parameters of get_lightgbm_learner_learning_api are the same parameters you would pass to lightgbm.train.
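For a regression problem, only the LightGBM parameters need to change; the objective, metrics and num_boost_round in the sketch below are illustrative assumptions rather than tuned recommendations.

from kxy.learning import get_lightgbm_learner_learning_api
# Illustrative regression settings; tune them for your own problem.
regression_params = {
	'objective': 'regression',
	'metric': ['rmse', 'l1'],
}
regression_learner_func = get_lightgbm_learner_learning_api(regression_params, \
	num_boost_round=100)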

Prediction

Once you've fitted a model, you get a predictor back in the results dictionary.

predictor = results['predictor']

You can inspect selected variables from the predictor like so:

selected_variables = predictor.selected_variables

The following line shows you how to make predictions corresponding to a DataFrame of testing features testing_df.

predictions_df = predictor.predict(testing_df)

All that is required is for testing_df to have all columns contained in selected_variables. predictions_df is a pandas DataFrame with a single column whose name is the same as the target column in the training DataFrame training_df.

To access the low-level 'booster'/model returned by lightgbm.train, run

booster = predictor.models[0]._model

If you choose to use the booster directly, remember that testing inputs should be generated like so:

X_test = testing_df[selected_variables].values

Saving/Loading

You can directly save the predictor to disk using

predictor.save(path)

To load a predictor from disk, run

from kxy.learning.leanml_predictor import LeanMLPredictor
predictor = LeanMLPredictor.load(path, learner_func)

Pricing

The kxy package is open-source. However, some of the heavy-duty optimization tasks (involved in LeanML feature selection) are run by our backend. For that, we charge a small per task fee.

That said, kxy is completely free for academic use. Simply sign up here with your university email address, and get your API key here.

Once you have your API key, simply run kxy configure <YOUR API KEY> in the terminal as a one-off, or set your API key as the value of the environment variable KXY_API_KEY, for instance by adding the two lines below to your Python code before importing the kxy package:

import os
os.environ['KXY_API_KEY'] = '<YOUR API KEY>'

Finally, you don't need to sign up to try out kxy! Your first few dozen tasks are on us; just install the kxy package and give it a go. If you love it, sign up and spread the word.

]]>
<![CDATA[How To Seamlessly Compress Any Tabular Model in Python]]>https://blog.kxy.ai/adding-feature-selection-to-any-model-in-python/61ba488cfba1d7003b5152ceFri, 04 Feb 2022 19:01:06 GMT

TL;DR

In this article, we show you how to train your favorite predictive models in Python using more than 80% fewer features, at no performance cost.

Supported models include, but are not limited to, LightGBM, XGBoost, all tabular models in scikit-learn, deep neural classifiers and regressors implemented using TensorFlow or PyTorch, AWS' AutoGluon tabular models, etc.

We provide a detailed comparative analysis between Recursive Feature Elimination (RFE), Boruta, and LeanML feature selection, an approach we developed. Our benchmark is based on 38 representative real-world regression and classification problems.

We find that RFE should always be avoided. Boruta is 5 times slower than LeanML, achieves a lower compression rate than LeanML, and underperforms both LeanML and the full model. Finally, LeanML performs similarly to the full model, despite the drastic reduction in the number of features used.

Motivation

Effective predictive modeling requires effective feature construction and effective feature selection.

Feature Construction is concerned with turning raw inputs (i.e. business and customer data as persisted in databases) into representations or features whose relationships to the target variable are (simple enough and) consistent with patterns that Machine Learning models in our toolbox can reliably learn.

While we do not know beforehand what transformations will yield such patterns, we do on the other hand know what feature transformations would make what types of patterns learnable using models in our Machine Learning toolbox.

For instance, extracting seconds, minutes, hour of the day, day of the month, day of the week, and month as separate features from an epoch timestamp makes it easy for most Machine Learning models to learn seasonal patterns.

By contrast, it would be very hard for any model in a modern toolbox (e.g. LightGBM, XGBoost, Random Forest, Gaussian Processes and other Bayesian models) to reliably learn patterns such as weather-related or intra-week seasonality directly from epoch timestamps.
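As a concrete sketch of the timestamp example above (column names are hypothetical):

import pandas as pd

# Hypothetical raw data with epoch timestamps (in seconds).
df = pd.DataFrame({'timestamp': [1651766400, 1651852800, 1651939200]})
dt = pd.to_datetime(df['timestamp'], unit='s')

# Separate features that make seasonal patterns learnable by most models.
df['hour'] = dt.dt.hour
df['day_of_week'] = dt.dt.dayofweek
df['day_of_month'] = dt.dt.day
df['month'] = dt.dt.month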

Because we might not know a priori what types of patterns our problem exhibits  (e.g. whether or not seasonal effects play a key role), Feature Construction should focus on making as many potential types of patterns as possible learnable by models in our toolbox.

As a result, many more features will be generated by Feature Construction than needed. This is where Feature Selection comes in.

For more information about Feature Construction, check out this blog post.

Feature Selection is concerned with removing unnecessary features from a list of candidates.

A feature is deemed unnecessary when it is either uninformative or redundant.

Uninformative Features are features that are not useful to predict the target of interest, whether they are used by themselves, or in conjunction with other candidate features.

Redundant Features are features that may be useful to predict the target, but that add no incremental value when used in conjunction with some other candidate features.

For instance, for most models, distances in feet (resp. temperatures in Fahrenheit) are redundant when the same distances (resp. temperatures) in meters (resp. in Celsius) are also available as features.

Fundamentally, training a predictive model with unnecessary features increases the chance of overfitting, and makes the model harder to explain and costlier to maintain.

In a previous post, we addressed how to remove unnecessary features from a list of candidates. The algorithm we proposed goes as follows:

  • Select as the first feature the feature \(f_1\) that, when used by itself to predict the target \(y\), has the highest achievable performance \(a_1\), as described in this blog.
  • For \(i\) ranging from 1 to \(k-1\), where \(k\) is the total number of candidate features, out of all \((k-i)\) features not yet selected, select as the \((i+1)\)-th feature \(f_{i+1}\) the feature that, when used in conjunction with all \(i\) previously selected features, yields the highest increase in the achievable performance \(|a_{i+1}-a_{i}|\). Said differently, \(f_{i+1}\) is the feature that may complement the previously selected features \((f_1, \dots, f_i)\) the most to predict the target \(y\).

Formally, if we denote by \(j\) the smallest index such that \(a_j=a_k\) then, if \(j < k\), all features \(\{f_{j+1}, \dots, f_k\}\) ought to be unnecessary, and can be safely removed.

More generally, for \(l>i, ~\epsilon_{il} := |a_l-a_i|\) represents the largest boost in performance that features \(\{f_{i+1}, \dots, f_l\}\) can bring about over only using features \(\{f_1, \dots, f_i\}\).

Even when features \(\{f_{i+1}, \dots, f_k\}\)  are not strictly unnecessary (i.e. when \(\epsilon_{ik} \neq 0\)), we still might not want to use features \(\{f_{i+1}, \dots, f_k\}\) if \(\epsilon_{ik} \) is too small.

Indeed, the downside in using \((k-i)\) more features (e.g. increased risk of overfitting, models costlier to train and maintain, models harder to explain, etc.) might very well outweigh the only benefit, namely a potential increase in performance by up to \(\epsilon_{ik}\).

In fact, while the potential performance boost \(\epsilon_{ik}\) is always non-negative, in practice, when \(\epsilon_{ik}\) is too small, training a model with all features might actually result in poorer performance than training the model with features \(\{f_1, \dots, f_i\}\).

To sum up, we need to use the first few features \(f_i\), up to a certain fraction of the total achievable performance \(a_k\). However, subsequent features should only be added if they actually boost the performance of a specific model out-of-sample by a high enough margin.

We refer to this approach as the LeanML feature selection algorithm. We will discuss where to draw the line and provide more implementation details in the explanation below.

For now, let's illustrate how to wrap this feature selection approach around any predictive model in Python using a concrete example.


Tutorial

For illustration purposes, we will use the UCI Bank Marketing classification dataset, and wrap LeanML feature selection around a LightGBM classifier.

Code

# 0. As a one-off, run 'pip install kxy', then 'kxy configure'
# This import is necessary to get all df.kxy.* methods
import kxy

# 1. Load your data
# pip install kxy_datasets
from kxy_datasets.classifications import BankMarketing
dataset = BankMarketing()
target_column = dataset.y_column
df = dataset.df

# 2. Generate candidate features
features_df = df.kxy.generate_features(entity=None, max_lag=None,\
    entity_name='*', exclude=[target_column])
features_df = features_df.drop('y_yes', axis=1)
target_column = 'y_no'

# 3. Training/Testing split
# pip install scikit-learn
from sklearn.model_selection import train_test_split
train_features_df, test_features_df = train_test_split(features_df, \
	test_size=0.2, random_state=0)
test_labels_df = test_features_df.loc[:, [target_column]]
test_features_df = test_features_df.drop(target_column, axis=1)

# 4. Create a LightGBM learner function.

# A learner function is a function that expects up to two optional 
# variables: n_vars and path. When called it returns an instance of 
# 'predictive model' expecting n_vars features. The path parameter, 
# when provided, allows the learner function to load a saved model 
# from disk. 

# A 'predictive model' here is any class with a fit(self, x, y) method 
# and predict(self, x) method. To use the path argument of the learner 
# function, the class should also define a save(self, path) method to 
# save a model to disk, and a load(cls, path) class method to load a 
# saved model from disk. 

# See kxy.learning.base_learners for helper functions that allow you 
# create learner functions that return instances of popular predictive 
# models (e.g. lightgbm, xgboost, sklearn, tensorflow, pytorch models 
# etc.).

from kxy.learning import get_lightgbm_learner_learning_api
params = {
	'objective': 'binary',
	'metric': ['auc', 'binary_logloss'],
}
lightgbm_learner_func = get_lightgbm_learner_learning_api(params, \
	num_boost_round=10000, early_stopping_rounds=50, verbose_eval=50)
    
# 5. Fit a LightGBM classifier wrapped around LeanML feature selection
results = train_features_df.kxy.fit(target_column, \
	lightgbm_learner_func, problem_type='classification', \
	feature_selection_method='leanml')
predictor = results['predictor']

# 6. Make predictions from a dataframe of test features
test_predictions_df = predictor.predict(test_features_df)

# 7. Compute out-of-sample accuracy and AUC
from sklearn.metrics import accuracy_score, roc_auc_score
accuracy = accuracy_score(
    test_labels_df[target_column].values, \
    test_predictions_df[target_column].values, \
)
auc = roc_auc_score( \
    test_labels_df[target_column].values, \
    test_predictions_df[target_column].values, \
    multi_class='ovr'
)

print('LeanML -- Testing Accuracy: %.2f, AUC: %.2f' % (accuracy, auc))
selected_features = predictor.selected_variables
print('LeanML -- Selected Variables:')
import pprint as pp
pp.pprint(selected_features)

# 8. (Optional) Save the trained model.
path = './lightgbm_uci_bank_marketing.sav'
predictor.save(path)

# 9. (Optional) Load the saved model.
from kxy.learning.leanml_predictor import LeanMLPredictor
loaded_predictor = LeanMLPredictor.load(path, lightgbm_learner_func)

Output

LeanML -- Testing Accuracy: 0.92, Testing AUC: 0.77
LeanML -- Selected Variables:
['duration',
 'euribor3m.ABS(* - MEAN(*))',
 'age',
 'campaign',
 'job_blue-collar',
 'housing_yes',
 'pdays',
 'poutcome_nonexistent',
 'nr.employed.ABS(* - MEDIAN(*))',
 'duration.ABS(* - Q75(*))',
 'loan_no',
 'nr.employed.ABS(* - Q25(*))']

Explanation

The key step in the code above is step 5. The method df.kxy.fit is available on any pandas DataFrame object df, so long as you import the package kxy.

df.kxy.fit trains the model of your choosing using df as training data while performing LeanML feature selection.

To be specific, it has the following signature:

kxy.fit(self, target_column, learner_func, problem_type=None, \
    snr='auto', train_frac=0.8, random_state=0, max_n_features=None, \
    min_n_features=None, start_n_features=None, anonymize=False, \
    benchmark_feature=None, missing_value_imputation=False, \
    score='auto', n_down_perf_before_stop=3, \
    regression_baseline='mean', additive_learning=False, \
    regression_error_type='additive', return_scores=False, \
    start_n_features_perf_frac=0.9, feature_selection_method='leanml',\
    rfe_n_features=None, boruta_pval=0.5, boruta_n_evaluations=20, \
    max_duration=None, val_performance_buffer=0.0)

Core Parameters

target_column should be the name of the column that contains target values. All other columns are considered candidate features/explanatory variables.

learner_func can be any function that returns an instance of the predictive model of your choosing. The predictive model can be any class that defines methods fit(self, X, y) and predict(self, X), where X and y are NumPy arrays. Most existing libraries can be leveraged to construct learner functions.

In fact, in the submodule kxy.learning.base_learners we provide a range of functions that allow you to construct valid learner_func functions that return models from popular libraries (e.g. sklearn, lightgbm, xgboost, tensorflow, pytorch etc.) with frozen hyper-parameters.

One such function is the function get_lightgbm_learner_learning_api we used above, which returns a learner_func instantiating LightGBM predictive models using the LightGBM training API.

The parameter problem_type should be either 'classification' or 'regression' depending on the nature of the predictive problem. When it is None (the default), it is inferred from the data.

The LeanML Feature Selection Algorithm

When df.kxy.fit is called with feature_selection_method 'leanml' (the default), a fraction train_frac of rows are used for training, while the remaining rows are used for validation.

We run the model-free variable selection algorithm of this blog post, which we recall is available through the method df.kxy.variable_selection. This gives us the aforementioned sequence of top features and the highest performances achievable using them, \(\{(f_1, a_1), \dots, (f_k, a_k)\}\), where achievable performance is measured as the highest \(R^2\) achievable.

For a discussion on how the \(R^2\) is defined for classification problems, check out this paper.

Next, we find the smallest index \(j\) such that the associated highest \(R^2\) achievable, namely \(a_j\), is greater than the fraction start_n_features_perf_frac of the highest \(R^2\) achievable using all features, namely \(a_k\).

We then train the model of your choosing (as per learner_func) on the training set with the initial set of features \((f_1, \dots, f_j)\), we record its performance (\(R^2\) for regression problems and classification accuracy for classification problems) on the validation set as the benchmark validation performance, and we consider adding remaining features one at a time, in increasing order of index, starting with \(f_{j+1}\).

Specifically, for \(i\) ranging from \(j+1\) to \(k\) we train a new model with features \((f_1, \dots, f_i)\), and compute its validation performance. If  the computed validation performance exceeds the benchmark validation performance (by val_performance_buffer or more), then we accept features \((f_1, \dots, f_i)\) as the new set of features, and we set the associated validation performance as the new benchmark validation performance to beat.

If the validation performance does not exceed the benchmark (by a margin of at least val_performance_buffer) n_down_perf_before_stop consecutive times (3 by default), then we stop and we keep the last accepted features set and its associated trained model.

We also stop adding features if a valid hard cap on the number of features is provided through the parameter max_n_features, and when i>max_n_features.
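To make the procedure above concrete, here is a simplified, non-authoritative sketch of the LeanML validation loop. It assumes the ranked features and the starting index have already been obtained via df.kxy.variable_selection; the helpers make_model and evaluate are hypothetical stand-ins for learner_func and the validation metric, and the max_n_features cap is omitted for brevity.

# Simplified sketch of the LeanML loop described above (not kxy's actual code).
# 'ranked_features' plays the role of (f_1, ..., f_k), 'start_idx' the role of j,
# and 'train'/'val' are pandas DataFrames holding the training and validation rows.
def leanml_sketch(ranked_features, start_idx, train, val, target,
                  make_model, evaluate, buffer=0.0, n_down_before_stop=3):
    selected = list(ranked_features[:start_idx])
    best_model = make_model()
    best_model.fit(train[selected].values, train[target].values)
    best_perf = evaluate(best_model, val[selected].values, val[target].values)
    n_down = 0
    for f in ranked_features[start_idx:]:
        candidate = selected + [f]
        model = make_model()
        model.fit(train[candidate].values, train[target].values)
        perf = evaluate(model, val[candidate].values, val[target].values)
        if perf >= best_perf + buffer:
            # Accept the new feature set and raise the benchmark to beat.
            selected, best_model, best_perf, n_down = candidate, model, perf, 0
        else:
            n_down += 1
            if n_down >= n_down_before_stop:
                break
    return selected, best_model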

For more implementation details, check out the source code.


Empirical Evaluation

The kxy package also allows you to seamlessly wrap Recursive Feature Elimination (RFE) and Boruta around any predictive model in Python.

This is done by simply changing the feature_selection_method argument from 'leanml' to 'boruta' or 'rfe'. You may also set feature_selection_method to 'none' or None to train a model without feature selection in a consistent fashion.

To use RFE as the feature selection method, the number of features to keep should be provided through the parameter rfe_n_features.

When using Boruta, boruta_n_evaluations is the parameter that controls the number of random trials/rounds to run.

boruta_pval is the confidence level for the number of hits above which we will reject the null hypothesis \(H_0\) that we don't know whether a feature is important or not, in favor of the alternative \(H_1\) that the feature is more likely to be important than not.

In other words, features that are selected by the Boruta method are features whose number of hits h is such that \(\mathbb{P}(n>h) \leq \) (1-boruta_pval) where n is a Binomial random variable with boruta_n_evaluations trials, each with probability 1/2.
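As a small sketch of this acceptance rule, the hit threshold implied by boruta_n_evaluations and boruta_pval can be computed with scipy.stats; this is an illustration of the stated criterion, not kxy's internal code.

from scipy.stats import binom

boruta_n_evaluations = 20
boruta_pval = 0.95

# Smallest h such that P(n <= h) >= boruta_pval, i.e. P(n > h) <= 1 - boruta_pval,
# where n ~ Binomial(boruta_n_evaluations, 1/2).
hit_threshold = int(binom.ppf(boruta_pval, boruta_n_evaluations, 0.5))
print('A feature is selected if it gets at least %d hits out of %d rounds.' % \
    (hit_threshold, boruta_n_evaluations))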

Here's an illustration of how to wrap Boruta and RFE around LightGBM using our previous Bank Marketing example.

Code

# 10.a Fit a LightGBM classifier wrapped around RFE feature selection
n_leanml_features = len(selected_features)
rfe_results = train_features_df.kxy.fit(target_column, \
	lightgbm_learner_func, problem_type='classification', \
	feature_selection_method='rfe', rfe_n_features=n_leanml_features)
rfe_predictor = rfe_results['predictor']

# 10.b Fit a LightGBM classifier wrapped around Boruta feature 
# selection.
boruta_results = train_features_df.kxy.fit(target_column, \
	lightgbm_learner_func, problem_type='classification', \
	feature_selection_method='boruta', boruta_n_evaluations= 20, \
    boruta_pval=0.95)
boruta_predictor = boruta_results['predictor']

# 10.c Fit a LightGBM classifier without feature selection.
none_results = train_features_df.kxy.fit(target_column, \
	lightgbm_learner_func, problem_type='classification', \
	feature_selection_method=None)
none_predictor = none_results['predictor']

# 11. Make predictions from a dataframe of test features
rfe_test_predictions_df = rfe_predictor.predict(test_features_df)
boruta_test_predictions_df = boruta_predictor.predict(test_features_df)
none_test_predictions_df = none_predictor.predict(test_features_df)

# 12. Compute out-of-sample accuracy and AUC
rfe_accuracy = accuracy_score(
    test_labels_df[target_column].values, \
    rfe_test_predictions_df[target_column].values, \
)
rfe_auc = roc_auc_score( \
    test_labels_df[target_column].values, \
    rfe_test_predictions_df[target_column].values, \
    multi_class='ovr'
)

boruta_accuracy = accuracy_score(
    test_labels_df[target_column].values, \
    boruta_test_predictions_df[target_column].values, \
)
boruta_auc = roc_auc_score( \
    test_labels_df[target_column].values, \
    boruta_test_predictions_df[target_column].values, \
    multi_class='ovr'
)

none_accuracy = accuracy_score(
    test_labels_df[target_column].values, \
    none_test_predictions_df[target_column].values, \
)
none_auc = roc_auc_score( \
    test_labels_df[target_column].values, \
    none_test_predictions_df[target_column].values, \
    multi_class='ovr'
)

print('RFE -- Accuracy: %.2f, AUC: %.2f' % (rfe_accuracy, rfe_auc))
rfe_selected_features = rfe_predictor.selected_variables
print('RFE -- Selected Variables:')
pp.pprint(rfe_selected_features)
print()

print('Boruta -- Accuracy: %.2f, AUC: %.2f' % (boruta_accuracy, \
	boruta_auc))
boruta_selected_features = boruta_predictor.selected_variables
print('Boruta -- Selected Variables:')
pp.pprint(boruta_selected_features)
print()

print('No Feature Selection -- Accuracy: %.2f, AUC: %.2f' % (none_accuracy, \
	none_auc))
all_features = none_predictor.selected_variables
print('No Feature Selection -- Selected Variables:')
pp.pprint(all_features)

Output

RFE -- Accuracy: 0.77, AUC: 0.51
RFE -- Selected Variables:
['euribor3m',
 'duration',
 'cons.price.idx.ABS(* - MEDIAN(*))',
 'euribor3m.ABS(* - MEAN(*))',
 'duration.ABS(* - MEAN(*))',
 'age.ABS(* - MEAN(*))',
 'age',
 'pdays',
 'euribor3m.ABS(* - Q25(*))',
 'duration.ABS(* - Q75(*))',
 'campaign',
 'duration.ABS(* - MEDIAN(*))']

Boruta -- Accuracy: 0.92, AUC: 0.77
Boruta -- Selected Variables:
['euribor3m.ABS(* - MEAN(*))',
 'euribor3m',
 'duration.ABS(* - MEDIAN(*))',
 'duration.ABS(* - MEAN(*))',
 'duration',
 'pdays',
 'duration.ABS(* - Q75(*))',
 'euribor3m.ABS(* - Q25(*))',
 'emp.var.rate',
 'cons.price.idx.ABS(* - MEDIAN(*))',
 'age.ABS(* - MEAN(*))']
 
No Feature Selection -- Accuracy: 0.92, AUC: 0.77
No Feature Selection -- Selected Variables:
['age',
 'duration',
 'campaign',
 'pdays',
 'previous',
 'emp.var.rate',
 'cons.price.idx',
 'cons.conf.idx',
 'euribor3m',
 'nr.employed',
 'age.ABS(* - MEAN(*))',
 'age.ABS(* - MEDIAN(*))',
 'age.ABS(* - Q25(*))',
 'age.ABS(* - Q75(*))',
 'duration.ABS(* - MEAN(*))',
 'duration.ABS(* - MEDIAN(*))',
 'duration.ABS(* - Q25(*))',
 'duration.ABS(* - Q75(*))',
 'campaign.ABS(* - MEAN(*))',
 'campaign.ABS(* - MEDIAN(*))',
 'campaign.ABS(* - Q25(*))',
 'campaign.ABS(* - Q75(*))',
 'pdays.ABS(* - MEAN(*))',
 'pdays.ABS(* - MEDIAN(*))',
 'pdays.ABS(* - Q25(*))',
 'pdays.ABS(* - Q75(*))',
 'previous.ABS(* - MEAN(*))',
 'previous.ABS(* - MEDIAN(*))',
 'previous.ABS(* - Q25(*))',
 'previous.ABS(* - Q75(*))',
 'emp.var.rate.ABS(* - MEAN(*))',
 'emp.var.rate.ABS(* - MEDIAN(*))',
 'emp.var.rate.ABS(* - Q25(*))',
 'emp.var.rate.ABS(* - Q75(*))',
 'cons.price.idx.ABS(* - MEAN(*))',
 'cons.price.idx.ABS(* - MEDIAN(*))',
 'cons.price.idx.ABS(* - Q25(*))',
 'cons.price.idx.ABS(* - Q75(*))',
 'cons.conf.idx.ABS(* - MEAN(*))',
 'cons.conf.idx.ABS(* - MEDIAN(*))',
 'cons.conf.idx.ABS(* - Q25(*))',
 'cons.conf.idx.ABS(* - Q75(*))',
 'euribor3m.ABS(* - MEAN(*))',
 'euribor3m.ABS(* - MEDIAN(*))',
 'euribor3m.ABS(* - Q25(*))',
 'euribor3m.ABS(* - Q75(*))',
 'nr.employed.ABS(* - MEAN(*))',
 'nr.employed.ABS(* - MEDIAN(*))',
 'nr.employed.ABS(* - Q25(*))',
 'nr.employed.ABS(* - Q75(*))',
 'job_admin.',
 'job_blue-collar',
 'job_entrepreneur',
 'job_housemaid',
 'job_management',
 'job_retired',
 'job_self-employed',
 'job_services',
 'job_student',
 'job_technician',
 'job_unemployed',
 'job_unknown',
 'marital_divorced',
 'marital_married',
 'marital_single',
 'marital_unknown',
 'education_basic.4y',
 'education_basic.6y',
 'education_basic.9y',
 'education_high.school',
 'education_illiterate',
 'education_professional.course',
 'education_university.degree',
 'education_unknown',
 'default_no',
 'default_unknown',
 'default_yes',
 'housing_no',
 'housing_unknown',
 'housing_yes',
 'loan_no',
 'loan_unknown',
 'loan_yes',
 'contact_cellular',
 'contact_telephone',
 'month_apr',
 'month_aug',
 'month_dec',
 'month_jul',
 'month_jun',
 'month_mar',
 'month_may',
 'month_nov',
 'month_oct',
 'month_sep',
 'day_of_week_fri',
 'day_of_week_mon',
 'day_of_week_thu',
 'day_of_week_tue',
 'day_of_week_wed',
 'poutcome_failure',
 'poutcome_nonexistent',
 'poutcome_success']

Benchmark Setup

We compared wrapping LeanML, Boruta and RFE around LightGBM to solve 38 real-world classification and regression problems from the UCI ML datasets repository and Kaggle.

We used the kxy package for feature construction. Overall, the number of candidate features ranges from 15 to 1925, and the number of rows ranges from 303 to 583,250.

The full list of datasets is provided below.

Dataset Number of Features Number of Rows Problem Type Source
0 SkinSegmentation 15 245057 classification UCI
1 BankNote 20 1372 classification UCI
2 PowerPlant 20 9568 regression UCI
3 AirFoil 25 1503 regression UCI
4 YachtHydrodynamics 30 308 regression UCI
5 RealEstate 30 414 regression UCI
6 Abalone 38 4177 regression UCI
7 Concrete 40 1030 regression UCI
8 EnergyEfficiency 45 768 regression UCI
9 WaterQuality 45 3276 classification Kaggle
10 Shuttle 45 58000 classification UCI
11 MagicGamma 50 19020 classification UCI
12 Avila 50 20867 classification UCI
13 WhiteWineQuality 55 4898 regression UCI
14 HeartAttack 65 303 classification Kaggle
15 HeartDisease 65 303 classification Kaggle
16 AirQuality 70 8991 regression UCI
17 EEGEyeState 70 14980 classification UCI
18 LetterRecognition 80 20000 classification UCI
19 NavalPropulsion 85 11934 regression UCI
20 BikeSharing 90 17379 regression UCI
21 DiabeticRetinopathy 95 1151 classification UCI
22 BankMarketing 103 41188 classification UCI
23 Parkinson 105 5875 regression UCI
24 CardDefault 115 30000 classification UCI
25 Landsat 180 6435 classification UCI
26 Adult 202 48843 classification UCI
27 SensorLessDrive 240 58509 classification UCI
28 FacebookComments 265 209074 regression UCI
29 OnlineNews 290 39644 regression UCI
30 SocialMediaBuzz 385 583250 regression UCI
31 Superconductivity 405 21263 regression UCI
32 HousePricesAdvanced 432 1460 regression Kaggle
33 YearPredictionMSD 450 515345 regression UCI
34 APSFailure 850 76000 classification UCI
35 BlogFeedback 1400 60021 regression UCI
36 Titanic 1754 891 classification Kaggle
37 CTSlices 1925 53500 regression UCI

Each dataset is split in two: 80% is used for training and feature selection, and the remaining 20% is used for testing.

Benchmark Results

We compare feature selection methods from the perspective of model size,  performance, and training duration.  A good feature selection method should select as few features as possible, with little to no performance reduction, and without requiring too much compute resources.

Model Size

We compare the number of features selected by each feature selection method. We also compare compression rates, defined as the fraction of candidate features a feature selection method did not select. For RFE, we use the same number of features as LeanML.

Results are illustrated in Figures 1, 2, 3, and 4 below.

LeanML is able to reduce the number of candidate features by 82% on average, compared to 78% for Boruta. Additionally, the number of features selected by LeanML rarely exceeds 20 and never exceeds 45.

Boruta on the other hand selected as many as 140 features. More importantly, it can be seen in Fig. 4 that Boruta almost always selected more features than LeanML, and selected up to 100 more features at times.

Fig. 1: Boxplots of compression rates of the LeanML and Boruta feature selection methods wrapped around LightGBM on 38 classification and regression datasets.
Fig. 2: Histograms of compression rates of the LeanML and Boruta feature selection methods wrapped around LightGBM on 38 real-world classification and regression datasets.
Fig. 3: Histograms of the number of features selected by the LeanML and Boruta feature selection methods wrapped around LightGBM on 38 real-world classification and regression datasets.
Fig. 4: Histogram of the difference between the number of features selected by Boruta and the number of features selected by LeanML wrapped around LightGBM on 38 real-world classification and regression datasets. Positive means Boruta selected more features than LeanML.

Performance

As the performance metric, we use the \(R^2\) for regression problems and the AUC for classification problems. These are evaluated on the testing set. See Figures 5 and 6 for results.

It can be seen that LeanML feature selection outperforms both Boruta and RFE out-of-sample, despite using far fewer features. We also see that, on average, LeanML matches the performance obtained without feature selection while using 82% fewer features, and it often even improves performance.

Fig. 5: Boxplots of performances of the LeanML, Boruta, and RFE feature selection methods wrapped around LightGBM on 38 classification and regression datasets. For classification problems, the performance is the testing AUC, and for regression problems, the performance is the testing \(R^2\).
Fig. 6: Histograms of performances of the LeanML, Boruta, and RFE feature selection methods wrapped around LightGBM on 38 classification and regression datasets. For classification problems, the performance is the testing AUC, and for regression problems, the performance is the testing \(R^2\).

Training Duration

We compare the training duration (in seconds) of each method divided by the number of candidate features.

Results are illustrated in Figures 7 and 8, where it can be seen that, on average, training with LeanML feature selection took about 0.15 seconds per candidate feature, compared to 0.75 seconds for Boruta and 1.39 seconds for RFE.

Fig. 7: Boxplots of training durations of the LeanML, Boruta, and RFE feature selection methods wrapped around LightGBM on 38 classification and regression datasets. Durations are normalized by the number of candidate features.
Fig. 8: Histograms of training durations of the LeanML, Boruta, and RFE feature selection methods wrapped around LightGBM on 38 classification and regression datasets. Durations are normalized by the number of candidate features.

Beyond LightGBM

We ran the same experiments using XGBoost and Random Forest. Results are illustrated in the tables and figures below.

It can be seen that, out of the three feature selection methods, RFE is by far the worst: it is consistently the least accurate, and it can be up to 100x slower than LeanML.

Boruta tends to be as accurate as LeanML on average, but it is 5 to 12 times slower and has a much lower compression rate, especially when wrapped around Random Forest. When wrapped around Random Forest, Boruta never selected fewer features than LeanML did, and we found instances where it selected up to 700 more features than LeanML.

As for LeanML, on average it performs as well as training without feature selection, while using 82% fewer features! When wrapped around Random Forest, LeanML even makes training about twice as fast as using all candidate features.

Table 1: Compression rate of LeanML, Boruta and RFE (mean ± std. error), on 38 classification and regression datasets using LightGBM, XGBoost and Random Forest. The compression rate is defined as the fraction of features removed by the feature selection method.
Model         | No Feature Selection | LeanML      | Boruta      | RFE
LightGBM      | 0.00 ± 0.00          | 0.82 ± 0.03 | 0.78 ± 0.04 | -
Random Forest | 0.00 ± 0.00          | 0.82 ± 0.03 | 0.51 ± 0.06 | -
XGBoost       | 0.00 ± 0.00          | 0.82 ± 0.03 | 0.76 ± 0.04 | -

Table 2: Testing performance (mean ± std. error) obtained by wrapping LeanML, Boruta and RFE around LightGBM, XGBoost and Random Forest on 38 real-world classification and regression problems. The performance metric used is the AUC for classification problems and the \(R^2\) for regression problems.
Model         | No Feature Selection | LeanML      | Boruta      | RFE
LightGBM      | 0.82 ± 0.04          | 0.81 ± 0.04 | 0.78 ± 0.04 | 0.36 ± 0.05
Random Forest | 0.79 ± 0.04          | 0.79 ± 0.04 | 0.79 ± 0.04 | 0.35 ± 0.05
XGBoost       | 0.83 ± 0.04          | 0.82 ± 0.04 | 0.81 ± 0.04 | 0.41 ± 0.06

Table 3: Training duration (in seconds, divided by the number of candidate features) of LightGBM, XGBoost and Random Forest, around which we wrapped LeanML, Boruta, and RFE feature selection, on 38 real-world classification and regression problems.
Model         | No Feature Selection | LeanML      | Boruta      | RFE
LightGBM      | 0.03 ± 0.01          | 0.15 ± 0.03 | 0.75 ± 0.22 | 1.39 ± 0.35
Random Forest | 0.53 ± 0.29          | 0.26 ± 0.06 | 3.27 ± 0.82 | 2.81 ± 0.63
XGBoost       | 0.07 ± 0.03          | 0.18 ± 0.05 | 1.52 ± 0.39 | 2.12 ± 0.51

Table 4: Number of additional features selected by Boruta and RFE over LeanML feature selection when wrapped around LightGBM, XGBoost and Random Forest, on 38 real-world classification and regression problems. The 'No Feature Selection' column corresponds to the number of features deleted by LeanML.
Model         | No Feature Selection | LeanML | Boruta  | RFE
LightGBM      | 255 ± 72             | 0 ± 0  | 6 ± 4   | -
Random Forest | 255 ± 73             | 0 ± 0  | 53 ± 20 | -
XGBoost       | 254 ± 72             | 0 ± 0  | 8 ± 5   | -
Fig. 9: Illustration of the tradeoffs between average compression rate and average performance on the one hand and between average compression rate and training duration on the other, of LightGBM on 38 real-world classification and regression problems, without feature selection and with LeanML, Boruta, or RFE feature selection. The closer a point is to the top-right (resp. top-left) corner on the plot on the left (resp. right), the better. RFE was implemented using the same number of features as selected by LeanML.
Fig. 10: Illustration of the tradeoffs between average compression rate and average performance on the one hand and between average compression rate and training duration on the other, of XGBoost on 38 real-world classification and regression problems, without feature selection and with LeanML, Boruta, or RFE feature selection. The closer a point is to the top-right (resp. top-left) corner on the plot on the left (resp. right), the better. RFE was implemented using the same number of features as selected by LeanML.
Fig. 11: Illustration of the tradeoffs between average compression rate and average performance on the one hand and between average compression rate and training duration on the other, of Random Forest on 38 real-world classification and regression problems, without feature selection and with LeanML, Boruta, or RFE feature selection. The closer a point is to the top-right (resp. top-left) corner on the plot on the left (resp. right), the better. RFE was implemented using the same number of features as selected by LeanML.

Conclusion

We showed how to seamlessly wrap LeanML, Boruta, and RFE feature selection around popular predictive models in Python, including but not limited to lightgbm, xgboost, tensorflow, and sklearn models.

Using 38 real-world classification and regression problems, we showed that LeanML feature selection cuts the number of features used by 82% on average, at no performance cost.

By contrast, RFE can be up to 100 times slower than LeanML on average and performs significantly worse! RFE's notably poor performance is perhaps not surprising, as RFE tends to exacerbate overfitting. Indeed, when a model is overfitted, it is often because of [...] the most important features! Thus, removing the least important features and keeping the most important ones can only amplify the problem.

As for Boruta, while it performs similarly to LeanML on average, it can be up to 12 times slower, and it almost always selects more features than LeanML, often hundreds more.

Note: The source code to reproduce all experiments can be found here.

]]>
<![CDATA[Effective Feature Selection: Beyond Shapley Values, Recursive Feature Elimination (RFE) and Boruta]]>https://blog.kxy.ai/effective-feature-selection/61ba47f7fba1d7003b5152c4Fri, 21 Jan 2022 20:59:15 GMT

TL;DR

In this article, we explain why effective feature selection matters, why popular approaches such as RFE, Boruta, and SHAP values aren't good enough, and we propose an information-theoretical solution with code.

Table of Contents

  1. Why feature selection matters
  2. Limits of RFE, Boruta, and Shapley values
  3. An information-theoretical alternative, with code

One of the first lessons machine learning engineers learn early in their careers is that, when it comes to features, sometimes less is more.

The Limits of Using Too Many Features

While the temptation is high to generate as many features as possible and to let the model do the rest, training a model with redundant or uninformative features can be both ineffective and inefficient.

Non-informative features, simply put, are features that cannot contribute to improving the performance of a predictive model. As an illustration, whether a customer is left-handed or right-handed is likely non-informative about whether or not they will churn from a banking product.

Formally, non-informative features are features that are statistically independent from the target (unconditionally), and also statistically independent from the target conditional on any other feature(s) in the set of candidate features.

Redundant features on the other hand are features that, while they might be informative about the target, cannot contribute to improving the performance of a predictive model when used in conjunction with some other features in the set of candidates.

For instance, in problems where geolocation matters, the postcode can potentially be informative about the target. However, when GPS coordinates are included in the set of features, the postcode becomes redundant, as it can be inferred from GPS coordinates.

Formally, redundant features are features for which there exists a subset of other features conditional on/given which they are statistically independent from the target.

Using non-informative or redundant features can cause several problems in the training phase of a machine learning project, as well as after a predictive model is deployed to production.

Before Production

Non-informative and redundant features make the training phase less effective (i.e. poorer performance) and less efficient (i.e. longer and costlier).

Increased risk of overfitting: By training a model with non-informative features, we run the risk that it gives a lot of importance to those non-informative features based on spurious patterns detected in the training dataset, which is a form of overfitting. By definition, a model shouldn't rely on non-informative features. Redundant features can also cause degeneracy while training certain models (e.g. linear regression), which could result in overfitting.

Poorer performance: Even when overfitting does not occur, an optimizer might still give small but non-negligible importance to non-informative features. This would result in poorer performance compared to not including said non-informative features.

Costlier to train: When training a model, runtime, memory usage, and CPU usage grow with the number of features. Because they are dispensable, redundant features and non-informative features make model training unnecessarily longer and costlier.

Harder to explain: The presence of redundant features makes it harder to explain a model, as it will be harder to properly attribute feature importance to redundant features.

In Production

Using non-informative or redundant features in production increases maintenance costs.

Increased outage surface: Different features could be served in production by different pipelines (e.g. the raw inputs might come from different databases or tables, processing could be done by different processes, etc.). In such a case, the more features a deployed model uses, the more feature delivery system outages the model will be exposed to, which will eventually increase downtime as any feature outage will likely take down prediction capabilities.

Faster data drift: Data distributions drift over time, causing production models to be retrained. The higher the number of features, the quicker the drift will arise, and the more often production models will need to be retrained.


Review of Existing Feature Selection Approaches & Their Limitations

Pairwise Mutual Information

A popular feature selection approach consists of quantifying how informative each feature is about the target and retaining features that have a high degree of statistical dependency on the target. This is often done using measures of association such as Pearson correlation, Spearman rank correlation, or (pairwise) mutual information.

There are two major problems with this approach. First, a feature need not be statistically dependent on the target to be useful! Because this is a common misconception, let's take a concrete illustrative example to clear this up.

Imagine we want to pin down the location of a person or an object (e.g. by predicting GPS coordinates) using elements of an address (i.e. street name, house number, city, zip code, and country). Intuitively, we would expect house numbers to be statistically independent from GPS coordinates, as most house numbers are found on every street, in every city, and in every country. Simply knowing a house number can hardly tell us anything about the associated GPS coordinates (on average).

Clearly, without house numbers, we will not be able to predict GPS coordinates with an error smaller than a few hundred meters (typical street lengths). That said, even though house numbers are expected to be statistically independent from GPS coordinates, using house numbers, we will be able to reduce the error from a few hundred meters to a few tens of meters (typical house dimensions); this is very significant!

Without the street name, the house number is indeed useless. The house number becomes more useful when the street name is known. But even then, because some street names are very common (e.g. First Street, Second Street, Third Street, etc. in the US), we need either the city or the zip code to remove any ambiguity.  Said differently, house numbers can be greatly useful, but only if we also know the street name and city or the zip code. Pairwise feature selection approaches are unable to select such features as house numbers, as they never consider interactions between features.

The second problem with pairwise approaches to feature selection is that they are inherently unable to detect and remove redundant features. To detect redundant features, an approach needs to consider more than one feature at a time.

Recursive Feature Elimination

Another popular feature selection algorithm is the Recursive Feature Elimination (RFE) algorithm. It proceeds as follows:

  1. Train a classification or regression model.
  2. Quantify the importance of each feature to the trained model.
  3. Remove the least important feature.
  4. Repeat 1-3 until we are left with the desired number of features.
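
For concreteness, here is a minimal sketch of this loop using scikit-learn's RFE implementation; the synthetic dataset, estimator, number of features to keep, and step size are all arbitrary choices for illustration.

# pip install scikit-learn
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data: 50 candidate features, only 5 of which are informative.
X, y = make_regression(n_samples=500, n_features=50, n_informative=5, random_state=0)
# Keep the 10 'most important' features, eliminating one feature per round.
selector = RFE(estimator=LinearRegression(), n_features_to_select=10, step=1)
selector.fit(X, y)
print(selector.support_)   # boolean mask of the selected features
print(selector.ranking_)   # 1 for selected features; higher ranks were eliminated earlier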

RFE does a good job at removing features that are not useful to a model. However, RFE is unable to detect redundant features, and it can be very slow.

Intuitively, if we duplicate the most important feature and retrain our model, both said feature and its duplicate will likely still be deemed very important by the newly trained model. Thus, the duplicated feature will be among the last ones to be considered for removal by RFE. More generally, RFE cannot remove a feature on the basis that it is redundant, which an effective feature selection algorithm should be able to do.

As for RFE's runtime and resource usage, they grow with the number of features to remove, not the number of features to keep, which is counterintuitive. This is made worse by the fact that each elimination round requires fully re-training our model.

To take a concrete example, if we want to select 10 features out of thousands of candidates, we need to fully train our model thousands of times. It is bad enough to have to retrain our model thousands of times, but each time we also need to calculate feature importance scores.

While feature importance scores are a byproduct of model training for some classes of models (e.g. linear regression and decision trees), for many more classes of models they are very costly to evaluate, to the point of being impractical.

Boruta

Another fundamental limitation of RFE is that the choice of the number of features to keep is left to the user. The Boruta feature selection algorithm addresses this limitation by providing a statistical framework for testing whether a feature is non-informative.

The intuitive idea underpinning Boruta is that, by randomly shuffling rows of a feature column, we effectively destroy any potential association between the feature, the target, and other features, thereby rendering it useless.

By randomly shuffling all features, we generate an additional set of useless features referred to as 'shadow features', whose importance scores we may use as benchmarks to form a threshold above which an importance score can be deemed high enough (and below which it can be deemed typical of useless features). In the case of Boruta, the threshold is the highest importance score out of all shadow features.

In Boruta, a model is trained using a combination of real features and shadow features, and feature importance scores are calculated for real and shadow features. Any real feature whose importance score is higher than the highest importance score of shadow features is said to have triggered a 'hit'. Our expectation is that useful features will trigger hits, while useless ones won't.

Boruta models our lack of knowledge about whether a feature is useful or not using the (null) hypothesis \(H_0\) that there is a 1/2 probability that the feature will trigger a hit. If we repeat the experiment \(k\) times then, the number of hits \(h_k\) of any real feature for which \(H_0\) holds (i.e. when we don't know whether the feature is useful or not) ought to follow the binomial distribution \(B(k, 1/2)\), of which we can evaluate confidence bounds \([hm_k, hM_k]\) at a level of our choosing (say 95%) analytically.

We conclude about the usefulness of a feature as follows:

  • \(h_k > hM_k\): we may reject the hypothesis \(H_0\) in favor of the alternative \(H_1\) that the feature is useful.
  • \(h_k < hm_k\): we may reject the hypothesis \(H_0\) in favor of the alternative \(H_2\) that the feature is useless.
  • \(h_k \in [hm_k, hM_k]\): we still do not know whether the feature is useful. We need to increase \(k\) to conclude.
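
As a quick numerical sanity check of this test, the snippet below computes the 95% bounds \([hm_k, hM_k]\) under \(H_0\) for an assumed number of iterations \(k=20\); it only uses scipy and is purely illustrative.

# pip install scipy
from scipy.stats import binom

k = 20                            # number of Boruta iterations (assumed for illustration)
hm_k = binom.ppf(0.025, k, 0.5)   # lower 95% bound on the number of hits under H0
hM_k = binom.ppf(0.975, k, 0.5)   # upper 95% bound on the number of hits under H0
print(hm_k, hM_k)                 # roughly 6 and 14 hits for k = 20
# More than hM_k hits -> the feature is deemed useful; fewer than hm_k -> useless;
# anything in between -> run more iterations before concluding.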

Because models need to be trained with real and shadow features combined, an iteration of Boruta can be much more computationally demanding than an iteration of RFE. When the number of features is large, however, Boruta will typically need far fewer iterations to detect all useless features than RFE.

That said, like RFE, Boruta is unable to detect and eliminate redundant features, and for the same reasons.

RFE-SHAP and Boruta-SHAP

SHAP values offer a principled game-theoretic approach to feature importance scoring and, as such, can be combined with both RFE and Boruta.

In a nutshell, the SHAP value of a feature for a specific model prediction represents the portion of the deviation of that prediction from the baseline prediction that can be attributed to the value of said feature.

By taking the absolute value and averaging across all decisions made, we obtain a score that quantifies the contribution of each feature in driving model decisions away from the baseline decision (i.e. the best decision we can make without using any feature): this is the SHAP feature importance score.
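
As a minimal sketch (assuming the shap and lightgbm packages are installed, and using a synthetic dataset), SHAP feature importance scores can be computed as follows:

# pip install shap lightgbm scikit-learn
import numpy as np
import shap
from lightgbm import LGBMRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=500, n_features=20, random_state=0)
model = LGBMRegressor().fit(X, y)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)          # one row per prediction, one column per feature
importance = np.abs(shap_values).mean(axis=0)   # mean |SHAP value| per feature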

The appeal of SHAP values stems in large part from the fact that Shapley values, upon which they are based, satisfy a desirable uniqueness property. However, because SHAP values of identical features are the same, using RFE or Boruta with SHAP feature importance cannot detect and remove redundant features.

Feature Importance Based Approaches Are Fundamentally Flawed

More generally, in addition to the fact that they cannot eliminate redundant features, feature importance-based approaches to feature selection are fundamentally flawed.

A feature importance score (SHAP value included) quantifies the contribution or 'influence' of a feature towards generating model decisions or, said differently, the extent to which model decisions are due to the feature. However, what we really need, which a feature importance score does not and cannot do, is to quantify the contribution of a feature to the overall performance of a model.

Features with high importance scores are features that drive model decisions the most. Useful features on the other hand are features that account for model performance the most! For the two notions to coincide, the trained model has to be very accurate, an assumption rarely met but which RFE and Boruta take for granted.

Whenever a model is overfitted, chances are that the features with the highest feature importance scores are to blame, and as such should be removed. However, these are precisely the features that a feature importance-based feature selection algorithm such as RFE and Boruta will keep. Let's take a concrete example.

Consider the following toy regression problem.

[Figure: a toy regression dataset with a linear fit (black) and an overfitted polynomial fit (blue).]

In black is a fitted linear model relating input \(x\) to \(y\), and in blue is a fitted polynomial model, which can be regarded as a linear model relating features \(x, x^2, x^3, x^4, ...\) to \(y\).

From the plot above, we can tell that the nonlinear model, which is obviously overfitted, relies heavily on at least one monomial term whose degree is higher than \(1\). Thus, we expect the corresponding feature to have a high importance score, no matter the scoring method used. Hence, by applying a feature importance-based selection algorithm to the nonlinear model, we will never be able to remove all nonlinear monomials, which would be the right thing to do in this example.

Had we estimated the contribution of each feature to model performance instead, we would have been able to remove all nonlinear monomials on the basis that they are responsible for overfitting.


Effective Information-Theoretic Feature Selection With KXY

To be effective, a feature selection algorithm should do two things right: 1) discard redundant features, and 2) keep features that contribute the most to model performance.

Algorithm

Building on our previous blog post, we propose the following algorithm:

  • Select as the first feature the feature \(f_1\) that, when used by itself to predict the target \(y\), has the highest achievable performance \(a_1\), as described in this blog.
  • For \(i\) ranging from 1 to the total number \(k\) of candidate features, out of all \((k-i)\) features not yet selected, select as \((i+1)\)-th feature \(f_{i+1}\) the feature that, when used in conjunction with all \(i\) previously selected features, yields the highest increase in the achievable performance \((a_{i+1}-a_{i})\). Said differently, \(f_{i+1}\) is the feature that complements previously selected features \((f_1, \dots, f_i )\) the most to predict the target \(y\).

Note that this algorithm is model-free. It only considers the potential effect a feature can have on model performance.

At each step \(i\), any feature that is redundant relative to the \(i\) features already selected is the least likely to be selected as, by definition of it being redundant, it cannot increase the highest performance achievable.

Similarly, any feature that is non-informative cannot increase the highest performance achievable and, as such, is bound to be among the last features selected.

Formally, if we denote \(j\) the smallest index such that \(a_j=a_k\) then, if \(j < k\), all features \(\{f_{j+1}, \dots, f_k\}\) are either redundant or useless and, as such, can be safely discarded.

More generally, the sequence \(\{ (f_1, a_1), \dots, (f_j, a_j)\}\) allows us to strike the right balance between the number of features used and the potential performance they can bring about. Even if a feature isn't completely useless or redundant, the performance boost it may bring about can be so small that using the feature wouldn't be worth it.
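
Before turning to the kxy implementation, here is a minimal, model-free sketch of this greedy loop. The achievable_performance argument is a hypothetical stand-in for an estimator of the highest performance achievable with a given feature subset (which kxy estimates from mutual information); it is not part of any real library.

def greedy_feature_selection(candidates, target, achievable_performance, tol=1e-3):
    # Greedily pick, at each step, the feature that most increases the highest
    # performance achievable when added to the features already selected.
    selected, running_performances = [], []
    remaining, best_so_far = list(candidates), 0.0
    while remaining:
        gains = {f: achievable_performance(selected + [f], target) for f in remaining}
        best_feature = max(gains, key=gains.get)
        if gains[best_feature] - best_so_far < tol:
            break  # remaining features are redundant, uninformative, or not worth the cost
        best_so_far = gains[best_feature]
        selected.append(best_feature)
        running_performances.append(best_so_far)
        remaining.remove(best_feature)
    return list(zip(selected, running_performances))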

Code Snippet

We use the UCI Bank Marketing dataset as an example.

But first, we generate some features. The code snippet below generates 103 features out of the 20 raw explanatory variables, and 1-hot encodes the target.

# 0. As a one-off, run 'pip install kxy', then 'kxy configure'
import kxy
# 1. Load your data
# pip install kxy_datasets
from kxy_datasets.classifications import BankMarketing
dataset = BankMarketing()
target_column = dataset.y_column
df = dataset.df
# 2. Generate candidate features
features_df = df.kxy.generate_features(entity=None, max_lag=None,\
    entity_name='*', exclude=[target_column])
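# The target is 1-hot encoded into 'y_no' and 'y_yes'; drop one and keep 'y_yes' as the binary target.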
del features_df['y_no']
target_column = 'y_yes'

Here is how simple it is to generate the sequence \(\{ (f_1, a_1), \dots, (f_k, a_k)\}\) in Python.

# 3. Run model-free variable selection, simple!
features_df.kxy.variable_selection(target_column, \
	problem_type='classification')
Variable Running Achievable R-Squared Running Achievable Accuracy
Selection Order
0 No Variable 0.0000 0.89
1 duration 0.0472 0.90
2 euribor3m 0.0743 0.91
3 age 0.0764 0.91
4 campaign 0.0764 0.91
5 job_services 0.24 0.94
6 housing_yes 0.24 0.94
7 job_admin. 0.24 0.94
8 pdays 0.25 0.95
9 previous 0.26 0.95
10 emp.var.rate 0.26 0.95
11 cons.price.idx 0.27 0.95
12 cons.conf.idx 0.27 0.95
13 nr.employed 0.27 0.95
14 age.ABS(* - MEAN(*)) 0.27 0.95
15 age.ABS(* - MEDIAN(*)) 0.27 0.95
16 age.ABS(* - Q25(*)) 0.27 0.95
17 age.ABS(* - Q75(*)) 0.27 0.95
18 duration.ABS(* - MEAN(*)) 0.27 0.95
19 duration.ABS(* - MEDIAN(*)) 0.28 0.95
20 duration.ABS(* - Q25(*)) 0.28 0.95
21 duration.ABS(* - Q75(*)) 0.28 0.95
22 campaign.ABS(* - MEAN(*)) 0.28 0.95
23 campaign.ABS(* - MEDIAN(*)) 0.28 0.95
24 campaign.ABS(* - Q25(*)) 0.28 0.95
25 campaign.ABS(* - Q75(*)) 0.28 0.95
26 pdays.ABS(* - MEAN(*)) 0.28 0.95
27 pdays.ABS(* - MEDIAN(*)) 0.28 0.95
28 pdays.ABS(* - Q25(*)) 0.28 0.95
29 pdays.ABS(* - Q75(*)) 0.28 0.95
30 previous.ABS(* - MEAN(*)) 0.28 0.95
31 previous.ABS(* - MEDIAN(*)) 0.28 0.95
32 previous.ABS(* - Q25(*)) 0.28 0.95
33 previous.ABS(* - Q75(*)) 0.28 0.95
34 emp.var.rate.ABS(* - MEAN(*)) 0.28 0.95
35 emp.var.rate.ABS(* - MEDIAN(*)) 0.28 0.95
36 emp.var.rate.ABS(* - Q25(*)) 0.28 0.95
37 emp.var.rate.ABS(* - Q75(*)) 0.28 0.95
38 cons.price.idx.ABS(* - MEAN(*)) 0.28 0.95
39 cons.price.idx.ABS(* - MEDIAN(*)) 0.28 0.95
40 cons.price.idx.ABS(* - Q25(*)) 0.28 0.95
41 cons.price.idx.ABS(* - Q75(*)) 0.28 0.95
42 cons.conf.idx.ABS(* - MEAN(*)) 0.28 0.95
43 cons.conf.idx.ABS(* - MEDIAN(*)) 0.28 0.95
44 cons.conf.idx.ABS(* - Q25(*)) 0.29 0.95
45 cons.conf.idx.ABS(* - Q75(*)) 0.29 0.95
46 euribor3m.ABS(* - MEAN(*)) 0.29 0.95
47 euribor3m.ABS(* - MEDIAN(*)) 0.29 0.95
48 euribor3m.ABS(* - Q25(*)) 0.29 0.95
49 euribor3m.ABS(* - Q75(*)) 0.29 0.96
50 nr.employed.ABS(* - MEAN(*)) 0.29 0.96

As we can see, the top 8 explanatory variables collectively increase the achievable classification accuracy by 6 percentage points above the baseline (from 0.89 to 0.95), while the remaining 90+ features can only add about another 1%.

]]>
<![CDATA[Feature Engineering: Why It Matters, and How To Do It Right]]>https://blog.kxy.ai/feature-construction/61afaea33294c1003b2dd9c9Thu, 16 Dec 2021 07:11:28 GMT

Insightful data and effective feature engineering are the cornerstones of any successful Machine Learning project.

But why should you care about constructing features if you've already invested a great deal in capturing as much information as possible about your customers, business activity, or more generally the problem of interest?

Shouldn't the Machine Learning model handle the rest? Is it not counterintuitive that a great model would need a lot of feature engineering?

In this article, we will clarify this apparent paradox and arm you with principles and tools that will help you create better features.

The Main Objective: Increase Model-Representation Adequacy

To understand the relationship between features and models, let us consider a supervised learning problem (regression or classification) with raw categorical and/or continuous inputs \(x \in \mathcal{X}\) and target \(y \in \mathcal{Y}\).

The Oracle Predictor

When the problem is a regression problem with MSE loss, the theoretical-best regressor is known to be the model that maps \(x\) to the associated conditional expectation  \(x \to \mathbb{E}(y|x).\) When the loss function is MAE, the theoretical-best regressor maps \(x\) to the median of the conditional distribution of \(y\) given \(x\). When the problem is a classification problem and the loss function is the classification error, the theoretical-best classifier is the one that maps \(x\) to the mode of the (discrete) conditional distribution of \(y\) given \(x\).
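
In symbols, with \(\bar{f}\) denoting the theoretical-best model in each case:

\[ \bar{f}_{\text{MSE}}(x) = \mathbb{E}(y|x), \qquad \bar{f}_{\text{MAE}}(x) = \text{median}\left(y|x\right), \qquad \bar{f}_{\text{0-1}}(x) = \underset{c \in \mathcal{Y}}{\text{argmax}} ~ \mathbb{P}(y=c|x). \]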

Note that, as we would expect, the theoretical-best model does not need any feature engineering! It works directly from raw inputs \(x\), which could be of any type, and takes care of any needed feature engineering.

The Oracle Feature

Let's denote \(\bar{z}:= \bar{f}(x)\) the prediction made by the theoretical-best model. We can look at \(\bar{z}\) as a feature constructed from \(x\). To achieve the best performance possible from \(\bar{z}\), all we need is to use the identity function as the model: \(\bar{z} \to \bar{z}\).

Note that we do not need any model training! Feature engineering has already done all the work.

Real-Life Learners Are Imperfect

If we knew \(\bar{f}\), then we would not need to do any feature engineering. If we knew \(\bar{z}\), then we would not need to do any model training.

In practice, however, we never know \(\bar{f}\) or \(\bar{z}\). The best we can do is to approximate the oracle predictor with Machine Learning models in our toolbox (e.g. linear models, LightGBM, XGBoost, Random Forest, kernel-based learners, ensemble models, etc.).

Models in our toolbox are imperfect learners in that they only work well in the presence of specific types of patterns relating inputs to the target.

For instance, linear models (e.g. OLS, LASSO, Ridge, ElasticNet etc.) can only capture linear relationships between inputs and the target. Decision tree regressors work best in situations where a good global regressor can be learned by partitioning the input space into \(m\) disjoint subsets using simple decision rules and learning \(m\) very simple local regressors.

Kernel-Ridge regressors and Gaussian process regressors with smooth kernels assume that inputs that are close to each other (in Euclidean distance) will always have targets that are also close to each other (in Euclidean distance). The same assumption is made by nearest neighbors regressors.

Beyond smoothness, kernel-based regressors can also be used to learn global patterns in situations where inputs that are close to each other in a given norm, will have targets that are close to each other in the Euclidean norm. The norm can be chosen to learn periodic or seasonal patterns, with or without a trend, to name but a few.

Features Increase Model-Representation Adequacy

In real-life applications, the patterns relating raw inputs \(x\) to the target \(y\) are often intricate and inconsistent with what models in our toolbox expect. This is where features come in.

Instead of training our model directly on \(x\), we will construct a range of features/representations \(z_i\) that we would expect to have simpler relationships with the target \(y\), relationships that are easily learned by popular model classes without overfitting.

Example types of simple relationships include:

  • \(y\) tends to increase (resp. decrease) as a function of \(z_i\)
  • \(y\) tends to increase (resp. decrease) as \(z_i\) deviates from canonical values or range

To sum up, feature construction serves one primary role: simplify the relationship between inputs and target to make it consistent with what the model to train can reliably learn.

The Main Cost: The Data Processing Inequality or The Law of 'Juice-Dissipation'

On the face of it, it might seem as though feature construction increases the insightfulness/informativeness/juice in our raw inputs \(x\) to predict the target \(y\).

This is however not the case; on the contrary. Any transformation applied to our raw inputs \(x\) can only reduce/dissipate the juice that was in \(x\) to predict the target \(y\). The result is often referred to as the Data Processing Inequality.

Thus, effective feature engineering ought to operate a balancing act between simplifying the inputs-target relationship while preserving as much of the juice that was in raw inputs (to predict the target) as possible.

The solution we advocate for this is two-staged. First, you should construct as many features as reasonably possible, and then select the best ones based on the highest performance they may yield.

We cover the first stage in the section below and will cover feature selection in this blog post.

Below are a few popular feature construction techniques.

Ordinal Encoding

Most models expect ordinal inputs. Non-ordinal inputs such as strings should be ordinally encoded (e.g. using 1-hot encoding) before training the model.

Missing Value Imputation

Many models cannot cope with missing values. Thus,  missing values should be replaced with non-informative baselines before training the model.
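
Here is a minimal pandas sketch of these two preprocessing steps, ordinal encoding and missing value imputation (column names and baselines are purely illustrative):

# pip install pandas
import pandas as pd

df = pd.DataFrame({
    'job': ['admin', 'services', None, 'admin'],  # categorical input with a missing value
    'age': [34.0, None, 51.0, 42.0],              # continuous input with a missing value
})
# Missing value imputation: replace missing entries with non-informative baselines.
df['job'] = df['job'].fillna('missing')
df['age'] = df['age'].fillna(df['age'].median())
# Ordinal encoding: 1-hot encode the non-ordinal 'job' column.
df = pd.get_dummies(df, columns=['job'])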

Entity Aggregations

Models typically expect one \(x\) per \(y\). However, it can happen that, in the problem of interest, a collection of \(x\) is naturally associated with the same \(y\). When this happens, we need to aggregate the collection into a single features vector that will be associated with the target. Example aggregation operations include sum, mean, median, standard deviation, skewness, kurtosis, minimum, maximum, quantiles, etc. for continuous inputs, and top-k most/least frequent values and their frequencies, the number of unique values, etc. for categorical inputs.
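
For instance, with pandas (the entity and column names below are hypothetical):

import pandas as pd

trades = pd.DataFrame({
    'entity_id': [1, 1, 1, 2, 2],
    'amount':    [10.0, 12.5, 9.0, 3.0, 4.5],
    'side':      ['buy', 'sell', 'buy', 'buy', 'buy'],
})
# One feature vector per entity: aggregations of continuous and categorical inputs.
entity_features = trades.groupby('entity_id').agg(
    amount_mean=('amount', 'mean'),
    amount_std=('amount', 'std'),
    amount_max=('amount', 'max'),
    side_nunique=('side', 'nunique'),
)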

Deviation Features

These features express how raw inputs deviate from some canonical values (e.g. their mean, median, 25%, and 75% quantiles, etc.) in absolute value.
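
A minimal sketch with pandas, for a single hypothetical column:

import pandas as pd

x = pd.Series([1.0, 4.0, 2.0, 8.0, 5.0])
deviation_features = pd.DataFrame({
    'abs_dev_mean':   (x - x.mean()).abs(),
    'abs_dev_median': (x - x.median()).abs(),
    'abs_dev_q25':    (x - x.quantile(0.25)).abs(),
    'abs_dev_q75':    (x - x.quantile(0.75)).abs(),
})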

Seasonality Features

These are features that could reveal seasonal effects when observations are timestamped. Examples include periodic properties of the timestamp such as the second in the minute, the minute in the hour, the hour in the day, AM/PM, the day of the week, weekday vs weekend, the day of the month, the distance in days to the middle of the month, the month, etc.
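
With pandas, such features can be derived from a timestamp column as follows (a sketch; the timestamps are synthetic):

import pandas as pd

ts = pd.DataFrame({'timestamp': pd.date_range('2021-01-01', periods=1000, freq='H')})
ts['minute']       = ts['timestamp'].dt.minute
ts['hour']         = ts['timestamp'].dt.hour
ts['day_of_week']  = ts['timestamp'].dt.dayofweek
ts['is_weekend']   = ts['day_of_week'] >= 5
ts['day_of_month'] = ts['timestamp'].dt.day
ts['month']        = ts['timestamp'].dt.month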

Temporal Features

Most time series models (e.g. ARIMA, RNN, LSTM etc.) are conceptually made of two building blocks:

  • A set of temporal features that encode the full extent to which time matters (i.e. what temporal patterns the model will attempt to exploit).
  • A memoryless or tabular model that uses the temporal features above as inputs.

While some time series models construct their temporal features on the fly (e.g. LSTM, RNN), constructing temporal features in a model-agnostic fashion may yield greater flexibility and explainability.

The aim here is to turn past observations into features from which most tabular models can easily detect patterns such as short/medium/long term trends and trend reversals, that would otherwise be very difficult for any tabular model to learn directly from lagged values.

Temporal features are commonly constructed using rolling statistics (e.g. mean, min, max, standard deviation, skewness, kurtosis, quantiles, etc.) of each raw input with various window sizes.
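
For example, with pandas rolling windows (the window sizes and the input series are illustrative):

import pandas as pd

close = pd.Series(range(200), dtype=float)  # stand-in for a raw input, e.g. minute-bar Close prices
temporal_features = pd.DataFrame({
    f'close_mean_{w}': close.rolling(w).mean() for w in (5, 15, 60)
})
temporal_features['close_std_15'] = close.rolling(15).std()
temporal_features['close_max_60'] = close.rolling(60).max()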

Feature Construction Using The kxy Package

We have implemented a method in the kxy Python package to automate the construction of candidate features using the techniques described above.

Here's a self-contained example on the UCI Bank Marketing dataset.

Code

# pip install kxy
import kxy
# pip install kxy_datasets
from kxy_datasets.classifications import BankMarketing
dataset = BankMarketing()
target_column = dataset.y_column
df = dataset.df
features_df = df.kxy.generate_features(entity=None, max_lag=None,\
    entity_name='*', exclude=[target_column])

Raw Inputs

[Screenshot: the raw explanatory variables of the UCI Bank Marketing dataset.]

Features

[Screenshot: a sample of the candidate features generated by kxy's generate_features method.]

Summary

Training a Machine Learning model is equivalent to using data to find a good enough approximation of the oracle predictor.

Every Machine Learning model is restricted in the types of patterns it may reliably learn between inputs and the target.

However, in real-life applications, the relationship between raw inputs and the target is often much more intricate than what popular models may reliably learn.

The aim of feature construction is to generate representations of raw inputs that we intuitively expect to have a much simpler relationship to the target, one that models in our toolbox can reliably learn, while remaining as insightful about the target as possible.

]]>
<![CDATA[An API for Data Valuation]]>https://blog.kxy.ai/data-valuation/6094704b6b37fe003e0d8e9fTue, 18 May 2021 18:02:39 GMT

Today we announce the launch of our data valuation product, the first (and only) data valuation API. In this post, we distill what data valuation is, and why it should be the very first step in any (predictive) machine learning pipeline.

Photo by Albert Hyseni on Unsplash

Announcing The World's First Data Valuation API

1-in-4 commercial machine learning projects fail, and 9-in-10 trained machine learning models are not good enough to make it to production.

Today, we are launching the world's first data valuation product to increase the success rate of machine learning projects, and mitigate avoidable wastes of resources.

In a single function call, machine learning engineers can now estimate the highest performance they may achieve in a machine learning project,  prior to and without training any (predictive) machine learning model. The API doesn't require machine learning engineers to do any feature engineering, whether their data are continuous, categorical or a mix of the two. It just works!

This is the first step on our mission to make machine learning lean.

Here is how simple it is in Python.

# 0. As a one-off, run 'pip install kxy', then 'kxy configure'
import kxy
import pandas as pd
# 1. Load your data
df = pd.read_csv('s3://sagemaker-sample-files/datasets/tabular/synthetic/churn.txt')
# 2. Run data valuation, simple!
df.kxy.data_valuation('Churn?', problem_type='classification')
Achievable Accuracy | Achievable Log-Likelihood Per Sample | Achievable R-Squared
0.86                | -3.96e-01                            | 0.45

We charge a flat fee per API request, no matter how big your dataset is, and the platform is totally free for academic use. You can signup here and retrieve your API key here.

Read on to learn more about data valuation.


What Is Data Valuation?

It has been widely reported that data has overtaken oil as the world's most valuable asset. Yet, this analogy is somewhat misleading: two barrels of oil are hardly different, but not all data are created equal!

To understand data valuation, we need to construct a more accurate analogy by comparing the answers to four key questions: what is the valuable, where is the valuable extracted, how is the valuable extracted, and  how much valuable is there to extract.

What is the valuable?

Whereas the objective in oil production is to extract oil from the ground, in a (predictive) machine learning project, we want to reliably predict a business outcome from a set of so-called 'explanatory variables'.

For instance, a mobile operator might be interested in predicting whether a customer will ultimately end up churning using attributes they already know about the customer (e.g. area code, length of service, number of interactions with customer service, billing and usage attributes, and more). The mobile operator would have seen customers churn in the past, and the hope here is to find in customer attributes patterns that are shared between customers that churn and that differentiate them from customers that don't churn.

A real estate marketplace might be interested in predicting the price a property will sell for using publicly available attributes about the neighborhood (e.g. crime statistics, local school statistics and rankings etc.) and property-specific attributes (e.g. property type, year built, number of bedrooms, number of bathrooms, livable area etc.). The marketplace would have records of previous transactions with prices and property characteristics, and the hope here is to provide the buyer and the seller with an estimation of the 'fair' price of the property to ease negotiations and, ultimately, increase transaction volumes.

In statistical terms, predicting a business outcome from explanatory variables is equivalent to learning the true conditional distribution \(\mathbb{P}_{y|\mathbf{x}}\) of the business outcome \(y\) given the associated explanatory variables \(\mathbf{x}\). Occasionally, learning the mapping between explanatory variables and the best prediction (in the mean square sense), namely \(\mathbf{x} \to \mathbb{E}\left(y \vert \mathbf{x} \right)\), which is a property of the true predictive distribution \(\mathbb{P}_{y|\mathbf{x}}\), might be good enough.

Where do we extract the valuable?

In an oil production project, oil is extracted on the site where wells are drilled. The equivalent in a machine learning project are the explanatory variables from which we derive insights about the business outcome we want to predict.

Hence, data is more akin to an oil production site than to oil itself.

How do we extract the valuable?

Projects, processes, machines and other tools collectively define how oil is pumped from the ground. The machine learning equivalent are the families of models (e.g. neural networks, tree-based models, parametric and non-parametric Bayesian models etc.) and the algorithms used to explore these families of models to approximate the true predictive distribution \(\mathbb{P}_{y|\bf{x}}\), or properties thereof such as its mean or standard deviation.

How much valuable is there to extract?

Clearly, if there is no oil in the ground on the production site, no matter what tools, machines, or techniques are used in the project, no oil will be extracted. The machine learning phrase 'garbage in, garbage out' captures a similar idea, but the analogy stops there.

While an entire subfield of physics is devoted to estimating oil reserves, until recently, little was known in the machine learning literature about quantifying the intrinsic value a set of explanatory variables could bring about when used to predict a specific business outcome; i.e. how to 'weigh' the valuable \(\mathbb{P}_{y|\mathbf{x}}\) without first retrieving it.

[Figure: global map of proven oil reserves.]
Source: howmuch.net

The framework to answer the foregoing question turns out to be fairly intuitive. The same way the amount of oil in the ground can be defined as the largest amount of oil that may be extracted from the ground (no matter the tool used), the value of a dataset for predicting a specific business outcome is the highest performance that may be achieved by any (predictive) machine learning model, reliably and without overfitting. It also happens to be the performance we would achieve, if we knew the true predictive distribution \(\mathbb{P}_{y|\bf{x}}\).

The same way building a map of oil reserves does not require extracting all the oil from the ground and weighing it, estimating the highest performance achievable in a (predictive) machine learning project should not require first learning the oracle (perfect) predictive model \(\mathbb{P}_{y|\bf{x}}\) and then evaluating its performance.

To sum up, data valuation is the estimation of the highest performance that may be achieved in a (predictive) machine learning experiment, prior to and without learning any predictive model.


Why Data Valuation?

To grasp what data valuation can do for machine learning engineers, and why it is a big deal, it might be useful to revisit the oil analogy.

Look at the global map of oil reserves above. Now, imagine a world in which every country has the same number of oil wells per square mile. Big oil conglomerates might still be able to generate profits in this world if they build enough wells, but their capital would be very poorly allocated, and production costs would be considerably higher.

As an illustration, Africa is 33 times bigger than Venezuela, yet Africa only has 41% of Venezuela's oil reserves. Thus, a company digging wells across the African continent and in Venezuela without consulting oil reserve estimates would spend, on average, about 80 times more for the same amount of oil extracted than if it dug wells in proportion to oil reserves.

The price of gas and petroleum products would need to increase to reflect higher production costs, and this would in turn have cascading effects in virtually every aspect of our lives. The world as we know it today would be very different.

Machine learning without data valuation is the equivalent of that imaginary world where the only way to know if there is oil in the ground is to build a fully-functional well, try to pump oil out, and see whether any comes out! If you are lucky, you will find some oil, but more often than not, you won't, no matter how much resources you spent digging.

If an organization has a relatively small machine learning budget then, by training predictive models without first valuing its data, the organization would be taking an excessive business risk. An organization wouldn't allocate resources to a new project without first estimating its potential impact on its bottom line. Why should it treat a machine learning project differently?

Just because an organization has terabytes of data does not mean it is sitting on a gold mine. In fact, it is estimated that 9-in-10 trained machine learning models are not good enough to make it to production, and 1-in-4 machine learning projects ultimately fail. With data valuation, every organization can now accurately estimate the risk in investing in machine learning capabilities prior to deploying any resources.

Even if an organization is lucky enough to have a large machine learning budget, it could drastically slash its costs and runtime by systematically valuing its data first, and conditioning the training of predictive models on whether a high enough performance may be achieved.


What Makes Data Valuation Work?

In this section we propose an intuitive and high-level explanation of what makes data valuation work. For an in-depth technical explanation, see this paper we wrote.

To understand why it is possible to estimate the highest performance achievable in a (predictive) machine learning project without learning any predictive model, and to get a sense for what is necessary to make this work, it is helpful to compare data valuation to how oil companies discover oil at sea without drilling.

Illustration of oil exploration at sea (credit).

Oil exploration at sea requires two key ingredients: a scientific principle and the  technology that implements it.

The scientific principle relates something we can directly measure to what we cannot directly measure but we need to quantify nonetheless, namely the amount of oil that may be extracted.

Simply put, offshore oil surveying relies on the principle that, when certain types of waves (seismic waves) are emitted in the sea, the way they bounce back is indicative of the type of rock or soil layers they hit. Because the laws of physics allow us to differentiate how seismic waves would bounce back if they hit an oil layer from how they would bounce back if they hit other layers, it is possible, in theory, to find and map out oil layers at sea.

In practice, to put this principle to use, oil companies need the technology to emit seismic waves at sea in a controlled manner, and to observe how they bounce back.

Data valuation works much like oil survey at sea.

Effective data valuation requires a scientific principle that relates a quantity, one that is much easier to estimate from the data than finding the best predictive model, to the highest performance that may be achieved by any (predictive) machine learning model.

As it turns out, whether the performance of interest is the Root-Mean-Square-Error, the \(R^2\), or even the classification accuracy, the highest performance achievable when using the explanatory variables \(\bf{x}\) to predict the business outcome \(y\) can be expressed solely as a function of the mutual information \(I(y, \bf{x})\), and sometimes a measure of the variability of the business outcome \(y\) such as its standard deviation (in the case of the RMSE) and its Shannon entropy (in the case of classification accuracy).

As an illustration, the formula relating the highest \(R^2\) achievable when using \(\bf{x}\) to predict \(y \) is \(\bar{R}^2 = 1 - e^{-2I(y, \bf{x})}\), and this is true whether explanatory variables \(\bf{x}\) are continuous, categorical, or a mix!
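
As a quick numerical illustration of this formula (the mutual information value below is made up):

import numpy as np

mi = 0.3                                           # assumed mutual information I(y, x), in nats
achievable_r2 = 1.0 - np.exp(-2.0 * mi)            # ~0.45, the highest R-squared achievable
recovered_mi = -0.5 * np.log(1.0 - achievable_r2)  # inverting the formula recovers I(y, x)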

We wrote the first paper establishing these relations for a variety of performance metrics. You can find a copy here.

Additionally, effective data valuation requires an accurate and data-efficient way of estimating any mutual information from data (i.e. the technology), without first learning the true joint or predictive distribution.

If you struggle to see how we could reliably estimate a mutual information without first estimating the true predictive or joint distribution, then consider this. Although the mutual information fully characterizes the value in the data, it does not fully characterize the data itself; the joint distribution does! For instance, the mutual information does not depend on marginal distributions; the joint distribution does. Any feature transformation of explanatory variables that can be undone (i.e. from which we can recover the original explanatory variables) would not change the mutual information, but it would change the joint distribution! A direct consequence of this is that a good mutual information estimator should not be sensitive to, nor should it require feature engineering.

Put bluntly, there are considerably more ways learning a joint or predictive distribution can go wrong than ways estimating a mutual information can go wrong. This is great news for data valuation because it means that we can value a dataset much more easily, quickly, and cheaply than we can find the best predictive model, and therefore that machine learning engineers could slash costs and save time by always valuing a dataset prior to finding a predictive model.

If you want to dig deep into mutual information estimation, check out our AISTATS 2021 paper.

]]>