Motivation
AutoML, Large Feature Sets, and Overfitting
Automating algorithm selection and hyper-parameter tuning using an AutoML library such as AWS' AutoGluon can save machine learning engineers a tremendous amount in development costs. However, with great modeling power comes an increased risk of overfitting.
To illustrate this, let us first consider a single algorithm trained with one set of hyper-parameters using k-fold cross-validation. While cross-validation might considerably reduce the probability that our trained model will perform poorly on unseen data, it can never reduce it down to 0. There will always be a small chance that our model was overfitted and will perform very poorly on unseen data. Let us denote this probability of overfitting by \(p_o\).
Now, let us consider independently training not just one but multiple algorithm + hyper-parameter configurations, and let us assume that \(m\) of these configurations yielded satisfying held-out performances. The probability that at least one of the \(m\) successful configurations will not generalize well to unseen data despite cross-validation is \(p_{o_m} = 1-(1-p_o)^m \approx mp_o\) (the approximation holding when \(mp_o \ll 1\)). This is a huge jump from \(p_o\).
As an illustration, let us assume that there is a 1% chance that a model that did well after k-fold cross-validation will perform poorly on unseen data (i.e. \(p_o=0.01\)). If we found \(m=10\) satisfying algorithm + hyper-parameter configurations, then there is a \(p_{o_m}=0.096\) chance that at least one of them will not generalize to new data despite cross-validation. This increases to \(p_{o_m}=0.634\) when \(m=100\) and, for \(m=1000\), it is almost certain that at least one configuration will not generalize to new data despite cross-validation!
The foregoing analysis was made for one problem/dataset. When we have \(q\) problems to solve, the issue gets much worse. If for each problem we have the same number \(m_q\) of satisfying configurations after cross-validation, then the probability that at least one satisfying configuration for at least one problem will not generalize to new data after cross-validation becomes \(p_{o_{mq}} := 1-(1-p_o)^{qm_q} \approx qm_qp_o \).
In concrete terms, if your k-fold cross-validation is so good that it only has a 1% chance of letting an overfitted model slip through the cracks, you have 10 predictive models to build, and you use an AutoML suite such as AWS' AutoGluon that finds 10 satisfying algorithm + hyper-parameter configurations you can rely on for each problem, then there is a whopping 63% chance that at least one configuration will perform poorly on unseen data, even though you (kind of) did everything right!
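As a quick sanity check, these numbers can be reproduced with a few lines of Python (the values of \(p_o\), \(m\), and \(q\) below are purely illustrative):
# Sanity check of the overfitting probabilities above (illustrative values).
p_o = 0.01  # chance that cross-validation lets a single overfitted configuration slip through
for m in (10, 100, 1000):
    print(m, round(1 - (1 - p_o) ** m, 3))  # 0.096, 0.634, 1.0

q, m_q = 10, 10  # 10 problems, 10 satisfying configurations each
print(round(1 - (1 - p_o) ** (q * m_q), 3))  # 0.634, i.e. a ~63% chance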
So, what can you do about it? Let's look at each variable affecting the problem and consider possible solutions.
\(\bf{q}\) — The larger the number of problems you have to solve, the more likely it is that at least one configuration will overfit. However, this variable reflects the needs of the business and can hardly be controlled.
\(\bf{m_q}\) — The more configurations your AutoML library tries out, the more likely it is that at least one overfitted configuration will trick your k-fold cross-validation. However, by not trying enough algorithms or hyper-parameters, you run the risk that your selected model or ensemble will be suboptimal. A good tradeoff is to not consider configurations that are too similar. But then again, if you use an existing AutoML library, you might not have enough control over this.
\(\bf{p_o}\) — This variable essentially reflects the quality of your single-configuration k-fold cross-validation. The most obvious way to reduce \(p_o\) is to simply increase the number of folds \(k\). However, this can vastly increase your runtime and, more generally, your development costs.
Another factor driving the probability of overfitting is the number of features a model uses. The more features are used during training, the more opportunity there is for a (flexible) model to discover spurious patterns. Reducing the number of features used in algorithm and hyper-parameter search will not only reduce the likelihood of overfitting but will also reduce runtime and overall development costs.
The challenge here is that, if you choose the wrong subset of features, or you simply don't use enough features, then you will decrease model performance and hurt the bottom-line.
To sum up, because it runs a large number of experiments, an AutoML algorithm reduces the statistical power of k-fold cross-validation, and increases the likelihood that a model will perform poorly when deployed to production, despite performing superbly on held-out data during cross-validation. A cost-effective approach to addressing this issue is to reduce the number of features the AutoML algorithm learns from, while ensuring insightful features are not left out.
Large Feature Sets and Maintenance Cost
Beyond overfitting and, more generally, higher development costs, another peculiarity of machine learning models is that, as pieces of software, they are costlier to maintain than traditional software.
The cost of maintaining predictive machine learning models in production is exacerbated by several factors, among which are data pipeline outages and model performance decay resulting from data drift.
When a data pipeline goes down and predictive models stop receiving some of their inputs, those predictive models (usually) stop working. Such an outage, whose likelihood increases with the number of features that predictive models rely on, can severely handicap a product and present a big opportunity cost. Intuitively, the fewer features a predictive model uses, the less likely it is to go down.
As time goes by, predictive models often become less effective, to the point of needing to be retrained. The root cause of this problem is known as data drift.
The way we humans behave tends to change or 'drift' over time. It is, therefore, no surprise that distributions of data generated by human activities also change over time. In particular, the relationship between features a production model uses and the target it predicts will also change over time, thereby gradually rendering obsolete the specific relationship learned by the production model at the time of training, and upon which it relies to make predictions. The more features the production model uses, the more rapidly data will drift, and the more often the production model will need to be retrained.
While one should aim to keep the number of features a production model uses to a bare minimum, accidentally leaving out the wrong features can drastically reduce model performance, which would likely affect the bottom-line. Not to mention that the 'bare minimum' number of features one has to keep without affecting model performance is usually unknown to machine learning engineers, and varies from one problem to another.
In short, if your production model uses too many features, you will increase your maintenance cost (among other downsides). But if you choose the wrong subset of features, or you simply don't use enough features, then you will decrease model performance, and the bottom-line with that.
This blog post shows you how to drastically reduce the number of features used by AWS' AutoGluon in Python while improving model performance.
What To Expect
Using the kxy Python package you don't have to choose between high maintenance cost and low bottom-line, or between overfitting because you passed too many features to your AutoML algorithm (AWS' AutoGluon in this case), and poor performance because you left out insightful features.
The kxy package allows you to drastically reduce the number of features used by AutoGluon, while improving model performance.
Indeed, in an experiment on 38 real-world classification and regression problems from the UCI Machine Learning Repository and Kaggle, using the kxy package, we were able to reduce the number of features used by 95% while improving performance.
The datasets used had between 15 and 1925 automatically generated candidate features, and between 303 and 583250 rows. We did a random 80/20 training/testing data split, and used as the evaluation metric the testing \(R^2\) for regression problems, and the testing AUC for classification problems.
Details and results for each problem are summarized in the table below.
Dataset | Rows | Candidate Features | Features Selected | Performance (Full Model) | Performance (Compressed Model) | Problem Type | Source |
---|---|---|---|---|---|---|---|
SkinSegmentation | 245057 | 15 | 4 | 1 | 1 | classification | UCI |
BankNote | 1372 | 20 | 6 | 0.99 | 1 | classification | UCI |
PowerPlant | 9568 | 20 | 6 | 0.97 | 0.97 | regression | UCI |
AirFoil | 1503 | 25 | 14 | 0.95 | 0.94 | regression | UCI |
YachtHydrodynamics | 308 | 30 | 1 | 1 | 0.99 | regression | UCI |
RealEstate | 414 | 30 | 9 | 0.75 | 0.76 | regression | UCI |
Abalone | 4177 | 38 | 5 | 0.58 | 0.58 | regression | UCI |
Concrete | 1030 | 40 | 11 | 0.93 | 0.92 | regression | UCI |
EnergyEfficiency | 768 | 45 | 7 | 1 | 1 | regression | UCI |
WaterQuality | 3276 | 45 | 30 | 0.59 | 0.6 | classification | Kaggle |
Shuttle | 58000 | 45 | 4 | 1 | 1 | classification | UCI |
MagicGamma | 19020 | 50 | 13 | 0.86 | 0.86 | classification | UCI |
Avila | 20867 | 50 | 31 | 1 | 1 | classification | UCI |
WhiteWineQuality | 4898 | 55 | 27 | 0.48 | 0.44 | regression | UCI |
HeartAttack | 303 | 65 | 8 | 0.83 | 0.81 | classification | Kaggle |
HeartDisease | 303 | 65 | 9 | 0.83 | 0.83 | classification | Kaggle |
AirQuality | 8991 | 70 | 2 | 1 | 1 | regression | UCI |
EEGEyeState | 14980 | 70 | 17 | 0.97 | 0.97 | classification | UCI |
LetterRecognition | 20000 | 80 | 22 | 0.99 | 0.99 | classification | UCI |
NavalPropulsion | 11934 | 85 | 6 | 1 | 1 | regression | UCI |
BikeSharing | 17379 | 90 | 3 | 1 | 1 | regression | UCI |
DiabeticRetinopathy | 1151 | 95 | 32 | 0.7 | 0.71 | classification | UCI |
BankMarketing | 41188 | 103 | 17 | 0.76 | 0.77 | classification | UCI |
Parkinson | 5875 | 105 | 2 | 1 | 1 | regression | UCI |
CardDefault | 30000 | 115 | 24 | 0.66 | 0.66 | classification | UCI |
Landsat | 6435 | 180 | 6 | 0.99 | 0.98 | classification | UCI |
Adult | 48843 | 202 | 8 | 0.79 | 0.78 | classification | UCI |
SensorLessDrive | 58509 | 240 | 19 | 1 | 1 | classification | UCI |
FacebookComments | 209074 | 265 | 13 | -13.36 | 0.61 | regression | UCI |
OnlineNews | 39644 | 290 | 26 | -0.71 | 0.04 | regression | UCI |
SocialMediaBuzz | 583250 | 385 | 6 | 0.94 | 0.93 | regression | UCI |
Superconductivity | 21263 | 405 | 19 | 0.92 | 0.91 | regression | UCI |
HousePricesAdvanced | 1460 | 432 | 9 | 0.88 | 0.87 | regression | Kaggle |
YearPredictionMSD | 515345 | 450 | 35 | 0.41 | 0.36 | regression | UCI |
APSFailure | 76000 | 850 | 13 | 0.86 | 0.69 | classification | UCI |
BlogFeedback | 60021 | 1400 | 17 | 0.6 | 0.59 | regression | UCI |
Titanic | 891 | 1754 | 28 | 0.82 | 0.79 | classification | Kaggle |
CTSlices | 53500 | 1925 | 31 | 1 | 1 | regression | UCI |
Cumulatively, there were 10229 candidate features to select from across the 38 datasets, and the kxy package only selected 540 of them in total, which corresponds to a 95% reduction in the number of features used overall.
Crucially, the average performance (testing \(R^2\) for regression problems and testing AUC for classification problems) of the compressed model was 0.82, compared to only 0.45 for the full model; a drastic performance increase despite a 95% reduction in the number of features used!
Looking closely at the results in the table above, we see that AutoGluon yielded negative testing performances on FacebookComments and OnlineNews when using all features, but its compressed version did not! While these two datasets explain the big average performance difference between full AutoGluon and compressed AutoGluon, when they are excluded, full and compressed AutoGluon have the same average performance, despite compressed AutoGluon using only 5% of the features used by full AutoGluon!
Code
A Jupyter notebook to reproduce the experiments above is available here. In this post, we will focus on showing you what it will take to compress your own AutoGluon model in Python.
Setup
First, you will need to install the kxy Python package using your method of choice:
- From PyPi:
pip install -U kxy
- From GitHub:
git clone https://github.com/kxytechnologies/kxy-python.git && cd ./kxy-python && pip install .
- From DockerHub:
docker pull kxytechnologies/kxy
The image is shipped with kxy and all its dependencies pre-installed.
Next, simply import the kxy package in your code. The kxy package is well integrated with pandas, so while you are at it you might also want to import pandas.
import kxy
import pandas as pd
From this point on, any instance of a pandas DataFrame, say df, that you have in your code is automatically enriched with a set of kxy methods accessible as df.kxy.<method_name>.
Training
Training a compressed AutoGluon model can be done in a single line of code.
results = training_df.kxy.fit(target_column, learner_func, \
problem_type=problem_type, feature_selection_method='leanml')
training_df is the pandas DataFrame containing training data. target_column is a variable containing the name of the target column. All other columns are considered candidate features/explanatory variables.
problem_type reflects the nature of the predictive problem to solve and should be either 'regression' or 'classification'.
feature_selection_method should be set to 'leanml'. If you want to know all possible values and why you should use 'leanml', read this blog post.
In general, learner_func is the function we will call to create new trainable instances of your model. It takes three optional parameters:
- n_vars: The number of features the model should expect, in case it is required to instantiate the model (e.g. for neural networks).
- path: Where to save or load the model from, if needed.
- safe: A boolean that controls what to do when the model can't be loaded from path. The convention is that, if path is not None, learner_func should try to load the model from disk. If this fails, then learner_func should create a new instance of your model if safe is set to True, and raise an exception if safe is False.
learner_func should return a model following the Scikit-Learn API. That is, at the very least, returned models should have fit(self, X, y) and predict(self, X) methods, where X and y are NumPy arrays. If you intend to save/load your compressed models, models returned by learner_func should also have a save(self, path) method to save a specific instance to disk.
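For intuition, here is a minimal sketch of what such a learner_func could look like for a plain scikit-learn model. The wrapper class, model choice, and function name below are illustrative assumptions, not part of the kxy API.
# Illustrative sketch only: a hand-rolled learner_func wrapping a scikit-learn
# model. It follows the n_vars/path/safe conventions described above, but the
# wrapper class and model choice are NOT part of the kxy API.
import joblib
from sklearn.ensemble import RandomForestRegressor

class SklearnLearner:
    def __init__(self, model):
        self._model = model

    def fit(self, X, y):
        # X and y are NumPy arrays, as per the convention above.
        self._model.fit(X, y.ravel())

    def predict(self, X):
        return self._model.predict(X)

    def save(self, path):
        joblib.dump(self._model, path)

def my_learner_func(n_vars=None, path=None, safe=True):
    # n_vars is unused here: a random forest does not need the input
    # dimension at construction time (a neural network would).
    if path is not None:
        try:
            # Per the convention, try to load a previously saved model first.
            return SklearnLearner(joblib.load(path))
        except Exception:
            if not safe:
                raise
    # Otherwise (or if loading failed and safe is True), create a new instance.
    return SklearnLearner(RandomForestRegressor(n_estimators=100))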
For AutoGluon models specifically, we provide a utility function (get_autogluon_learner) to generate a learner_func that creates instances of autogluon.tabular.TabularPredictor with set hyper-parameters.
Here is an illustration in the case of a regression problem.
from kxy.learning import get_autogluon_learner
kwargs = {}
fit_kwargs = {}
learner_func = get_autogluon_learner(problem_type='regression', \
eval_metric=None, verbosity=2, sample_weight=None, \
weight_evaluation=False, groups=None, fit_kwargs=fit_kwargs, **kwargs)
problem_type, eval_metric, verbosity, sample_weight, weight_evaluation, groups, and kwargs are all parameters you would pass to the constructor of autogluon.tabular.TabularPredictor. It is worth noting that problem_type here is not the same as the problem_type you would pass to df.kxy.fit. fit_kwargs is the dictionary of named arguments you would pass to the fit method of an instance of autogluon.tabular.TabularPredictor.
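Putting the pieces together, training a compressed AutoGluon regressor end to end might look like the sketch below. The CSV file name and the target column name 'y' are placeholders for your own data; the get_autogluon_learner arguments mirror the values shown above.
# End-to-end sketch; 'training_data.csv' and the target column 'y' are placeholders.
import kxy  # adds the df.kxy accessor to pandas DataFrames
import pandas as pd
from kxy.learning import get_autogluon_learner

training_df = pd.read_csv('training_data.csv')
target_column = 'y'

learner_func = get_autogluon_learner(problem_type='regression', \
    eval_metric=None, verbosity=2, sample_weight=None, \
    weight_evaluation=False, groups=None, fit_kwargs={})
results = training_df.kxy.fit(target_column, learner_func, \
    problem_type='regression', feature_selection_method='leanml')
predictor = results['predictor']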
Prediction
Once you have fitted a model, you get a predictor back in the results dictionary.
predictor = results['predictor']
You can inspect selected variables from the predictor like so:
selected_variables = predictor.selected_variables
The following line shows you how to make predictions corresponding to a DataFrame of testing features testing_df.
predictions_df = predictor.predict(testing_df)
All that is required is for testing_df to have all columns contained in selected_variables. predictions_df is a pandas DataFrame with a single column whose name is the same as the target column in the training DataFrame training_df.
To access the low-level TabularPredictor model, run
autogluon_tabular_predictor = predictor.models[0]._model
If you choose to use the TabularPredictor directly, remember that testing inputs data_test should be generated like so:
X_test = testing_df[selected_variables].values
X_columns = predictor.models[0].x_columns
data_test = pd.DataFrame(X_test, columns=X_columns)
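From there, predictions can be obtained with AutoGluon's own predict method; the line below is a sketch that assumes autogluon_tabular_predictor and data_test were built as shown above.
# Sketch: querying the underlying AutoGluon TabularPredictor directly,
# assuming autogluon_tabular_predictor and data_test were built as above.
raw_predictions = autogluon_tabular_predictor.predict(data_test)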
Saving/Loading
You can directly save the predictor to disk using
predictor.save(path)
To load a predictor from disk, run
from kxy.learning.leanml_predictor import LeanMLPredictor
predictor = LeanMLPredictor.load(path, learner_func)
Pricing
The kxy package is open-source. However, some of the heavy-duty optimization tasks (involved in LeanML feature selection) are run by our backend. For that, we charge a small per-task fee.
That said, kxy is completely free for academic use. Simply sign up here with your university email address, and get your API key here.
Once you have your API key, simply run kxy configure <YOUR API KEY> in the terminal as a one-off, or set your API key as the value of the environment variable KXY_API_KEY, for instance by adding the two lines below to your Python code before importing the kxy package:
import os
os.environ['KXY_API_KEY'] = '<YOUR API KEY>'
Finally, you don't need to sign up to try out kxy! Your first few dozen tasks are on us; just install the kxy package and give it a go. If you love it, sign up and spread the word.