## Motivation

### AutoML, Large Feature Sets, and Overfitting

Automating algorithm selection and hyper-parameter tuning using an AutoML library such as AWS' AutoGluon can save machine learning engineers tremendously in development costs. However, with great modeling power comes an increased risk of overfitting.

To illustrate this, let us first consider a single algorithm trained with one set of hyper-parameters using k-fold cross-validation. While cross-validation might considerably reduce the probability that our trained model will perform poorly on unseen data, it can never reduce it to 0. There will always be a small chance that our model was overfitted and will perform very poorly on unseen data. Let us denote by $p_o$ this probability of overfitting.

Now, let us consider independently training not just one but multiple algorithm + hyper-parameters configurations, and let us assume that $m$ of these configurations yielded satisfying held-out performances. Assuming each of them overfits independently with the same probability $p_o$, the probability that at least one of the $m$ successful configurations will not generalize well to unseen data despite cross-validation is $p_{o_m} = 1-(1-p_o)^m$, which is approximately $mp_o$ when $mp_o$ is small. This is a huge jump from $p_o$.

As an illustration, let us assume that there is a 1% chance that a model that did well after k-fold cross-validation will perform poorly on unseen data (i.e. $p_o=0.01$). If we found $m=10$ satisfying algorithm + hyper-parameters configurations, then there is a $p_{o_m}=0.096$ chance that at least one of them will not generalize to new data after cross-validation. This increases to $p_{o_m}=0.634$ when $m=100$ and, for $m=1000$, it is almost certain that at least one configuration will not generalize to new data after cross-validation!
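These figures are easy to check numerically. The short sketch below assumes, as the analysis above does, that configurations overfit independently and with the same probability; the function name is ours and purely illustrative.

```python
def prob_at_least_one_overfit(p_o, m):
    """Probability that at least one of m independent configurations overfits."""
    return 1.0 - (1.0 - p_o) ** m


# Reproduce the three cases discussed above with p_o = 0.01.
for m in (10, 100, 1000):
    print(f"m={m}: {prob_at_least_one_overfit(0.01, m):.3f}")
```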

The foregoing analysis was made for one problem/dataset. When we have $q$ problems to solve, the issue gets much worse. If for each problem we have the same number $m_q$ of satisfying configurations after cross-validation, then the probability that at least one satisfying configuration for at least one problem will not generalize to new data after cross-validation becomes $p_{o_{mq}} := 1-(1-p_o)^{qm_q} \approx qm_qp_o$.

In concrete terms, if your k-fold cross-validation is so good that it only has a 1% chance of letting an overfitted model slip through the cracks, and you have 10 predictive models to build, and you use an AutoML suite such as AWS' AutoGluon that finds, for each problem, 10 satisfying algorithm + hyper-parameters configurations you can rely on, then there is a whopping 63% chance that at least one configuration will perform poorly on unseen data, even though you (kind of) did everything right!
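The sketch below extends the same calculation to $q$ problems and compares the exact probability with the linear approximation $qm_qp_o$. It relies on the same independence assumption, and the function names are illustrative. The approximation is an upper bound that is only accurate when $qm_qp_o$ is small; in the 10-problem example it gives 1.0, whereas the exact value is about 0.63.

```python
def prob_any_overfit(p_o, q, m_q):
    """Exact probability that at least one of the q * m_q configurations overfits."""
    return 1.0 - (1.0 - p_o) ** (q * m_q)


def linear_approximation(p_o, q, m_q):
    """First-order approximation q * m_q * p_o, accurate only when it is small."""
    return q * m_q * p_o


p_o = 0.01
for q, m_q in [(1, 1), (2, 5), (10, 10)]:
    exact = prob_any_overfit(p_o, q, m_q)
    approx = linear_approximation(p_o, q, m_q)
    print(f"q={q}, m_q={m_q}: exact={exact:.3f}, approx={approx:.3f}")
```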

So, what can you do about it? Let's look at each variable affecting the problem and consider possible solutions.

$\bf{q}$: The larger the number of problems you have to solve, the more likely it is that at least one configuration will overfit. However, this variable reflects the needs of the business and can hardly be controlled.

$\bf{m_q}$: The more configurations your AutoML library tries out, the more likely it is that at least one overfitted configuration will trick your k-fold cross-validation. However, by not trying enough algorithms or hyper-parameters, you are running the risk that your selected model or ensemble will be suboptimal. A good tradeoff is to not consider configurations that are too similar. But then again, if you use an existing AutoML library, you might not have enough control over this.

$\bf{p_o}$: This variable essentially reflects the quality of your single-configuration k-fold cross-validation. The most obvious way to reduce it is to simply increase the number of folds $k$. However, this can vastly increase your runtime and development costs more generally.

Another factor driving the probability of overfitting is the number of features a model uses. The more features are used during training, the more opportunity there is for a (flexible) model to discover spurious patterns. Reducing the number of features used in algorithm and hyper-parameter search will not only reduce the likelihood of overfitting but will also reduce runtime and overall development costs.

The challenge here is that, if you choose the wrong subset of features, or you simply don't use enough features, then you will decrease model performance and hurt the bottom line.

To sum up, because it runs a large number of experiments, an AutoML algorithm reduces the statistical power of k-fold cross-validation, and increases the likelihood that a model will perform poorly when deployed to production, despite performing superbly on held-out data during cross-validation. A cost-effective approach to addressing this issue is to reduce the number of features the AutoML algorithm learns from, while ensuring insightful features are not left out.

### Large Feature Sets and Maintenance Cost

Beyond overfitting and, more generally, higher development costs, another peculiarity of machine learning models as pieces of software is that they are costlier to maintain than traditional pieces of software.

The cost of maintaining predictive machine learning models in production is exacerbated by several factors, among which are data pipeline outages and model performance decay resulting from data drift.

When a data pipeline goes down and predictive models stop receiving some of their inputs, those predictive models (usually) stop working. Such an outage, whose likelihood increases with the number of features that predictive models rely on, can severely handicap a product, and present a big opportunity cost. Intuitively, the fewer features a predictive model uses, the less likely it is to go down.

As time goes by, predictive models often become less effective, to the point of needing to be retrained. The root cause of this problem is known as data drift.

The way we humans behave tends to change or 'drift' over time. It is, therefore, no surprise that distributions of data generated by human activities also change over time. In particular, the relationship between features a production model uses and the target it predicts will also change over time, thereby gradually rendering obsolete the specific relationship learned by the production model at the time of training, and upon which it relies to make predictions. The more features the production model uses, the more rapidly data will drift, and the more often the production model will need to be retrained.

While one should aim to keep the number of features a production model uses to a bare minimum, accidentally leaving out the wrong features can drastically reduce model performance, which would likely affect the bottom-line. Not to mention that the 'bare minimum' number of features one has to keep without affecting model performance is usually unknown to machine learning engineers, and varies from one problem to another.

In short, if your production model uses too many features, you will increase your maintenance cost (among other downsides). But if you choose the wrong subset of features, or you simply don't use enough features, then you will decrease model performance, and hurt the bottom line with it.

This blog post shows you how to drastically reduce the number of features used by AWS' AutoGluon in Python while improving model performance.

## What To Expect

Using the kxy Python package you don't have to choose between high maintenance cost and low bottom-line, or between overfitting because you passed too many features to your AutoML algorithm (AWS' AutoGluon in this case), and poor performance because you left out insightful features.

The kxy package allows you to drastically reduce the number of features used by AutoGluon, while improving model performance.

Indeed, in an experiment on 38 real-world classification and regression problems from the UCI Machine Learning Repository and Kaggle, using the kxy package, we were able to reduce the number of features used by 95% while improving performance.

The datasets used had between 15 and 1925 automatically generated candidate features, and between 303 and 583250 rows. We did a random 80/20 training/testing data split, and used as the evaluation metric the testing $R^2$ for regression problems, and the testing AUC for classification problems.

Details and results for each problem are summarized in the table below.

| Dataset | Rows | Candidate Features | Features Selected | Performance (Full Model) | Performance (Compressed Model) | Problem Type | Source |
|---|---|---|---|---|---|---|---|
| SkinSegmentation | 245057 | 15 | 4 | 1 | 1 | classification | UCI |
| BankNote | 1372 | 20 | 6 | 0.99 | 1 | classification | UCI |
| PowerPlant | 9568 | 20 | 6 | 0.97 | 0.97 | regression | UCI |
| AirFoil | 1503 | 25 | 14 | 0.95 | 0.94 | regression | UCI |
| YachtHydrodynamics | 308 | 30 | 1 | 1 | 0.99 | regression | UCI |
| RealEstate | 414 | 30 | 9 | 0.75 | 0.76 | regression | UCI |
| Abalone | 4177 | 38 | 5 | 0.58 | 0.58 | regression | UCI |
| Concrete | 1030 | 40 | 11 | 0.93 | 0.92 | regression | UCI |
| EnergyEfficiency | 768 | 45 | 7 | 1 | 1 | regression | UCI |
| WaterQuality | 3276 | 45 | 30 | 0.59 | 0.6 | classification | Kaggle |
| Shuttle | 58000 | 45 | 4 | 1 | 1 | classification | UCI |
| MagicGamma | 19020 | 50 | 13 | 0.86 | 0.86 | classification | UCI |
| Avila | 20867 | 50 | 31 | 1 | 1 | classification | UCI |
| WhiteWineQuality | 4898 | 55 | 27 | 0.48 | 0.44 | regression | UCI |
| HeartAttack | 303 | 65 | 8 | 0.83 | 0.81 | classification | Kaggle |
| HeartDisease | 303 | 65 | 9 | 0.83 | 0.83 | classification | Kaggle |
| AirQuality | 8991 | 70 | 2 | 1 | 1 | regression | UCI |
| EEGEyeState | 14980 | 70 | 17 | 0.97 | 0.97 | classification | UCI |
| LetterRecognition | 20000 | 80 | 22 | 0.99 | 0.99 | classification | UCI |
| NavalPropulsion | 11934 | 85 | 6 | 1 | 1 | regression | UCI |
| BikeSharing | 17379 | 90 | 3 | 1 | 1 | regression | UCI |
| DiabeticRetinopathy | 1151 | 95 | 32 | 0.7 | 0.71 | classification | UCI |
| BankMarketing | 41188 | 103 | 17 | 0.76 | 0.77 | classification | UCI |
| Parkinson | 5875 | 105 | 2 | 1 | 1 | regression | UCI |
| CardDefault | 30000 | 115 | 24 | 0.66 | 0.66 | classification | UCI |
| Landsat | 6435 | 180 | 6 | 0.99 | 0.98 | classification | UCI |
| Adult | 48843 | 202 | 8 | 0.79 | 0.78 | classification | UCI |
| SensorLessDrive | 58509 | 240 | 19 | 1 | 1 | classification | UCI |
| OnlineNews | 39644 | 290 | 26 | -0.71 | 0.04 | regression | UCI |
| SocialMediaBuzz | 583250 | 385 | 6 | 0.94 | 0.93 | regression | UCI |
| Superconductivity | 21263 | 405 | 19 | 0.92 | 0.91 | regression | UCI |
| HousePricesAdvanced | 1460 | 432 | 9 | 0.88 | 0.87 | regression | Kaggle |
| YearPredictionMSD | 515345 | 450 | 35 | 0.41 | 0.36 | regression | UCI |
| APSFailure | 76000 | 850 | 13 | 0.86 | 0.69 | classification | UCI |
| BlogFeedback | 60021 | 1400 | 17 | 0.6 | 0.59 | regression | UCI |
| Titanic | 891 | 1754 | 28 | 0.82 | 0.79 | classification | Kaggle |
| CTSlices | 53500 | 1925 | 31 | 1 | 1 | regression | UCI |

Cumulatively, there were 10229 candidate features to select from across the 38 datasets, and the kxy package only selected 540 of them in total, which corresponds to a 95% reduction in the number of features used overall.
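The overall reduction follows directly from the two totals quoted above, as this small check shows (the variable names are ours):

```python
total_candidates = 10229  # candidate features across all datasets (from the text)
total_selected = 540      # features selected by kxy in total (from the text)

# Fraction of candidate features that were dropped.
reduction = 1.0 - total_selected / total_candidates
print(f"Feature reduction: {reduction:.1%}")
```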

Crucially, the average performance (testing $R^2$ for regression problems and testing AUC for classification problems) of the compressed model was 0.82, compared to only 0.45 for the full model; a drastic performance increase despite a 95% reduction in the number of features used!

Looking closely at the results in the table above, we see that AutoGluon yielded negative testing performances on FacebookComments and OnlineNews when using all features, but its compressed version did not! While these two datasets explain the big average performance difference between full AutoGluon and compressed AutoGluon, when they are excluded, full and compressed AutoGluon have the same average performance, despite compressed AutoGluon using only 5% of the features used by full AutoGluon!
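A negative testing $R^2$ may look surprising at first. Assuming the usual definition of the coefficient of determination, $R^2 = 1 - SS_{res}/SS_{tot}$ (the exact implementation used in the experiments is an assumption on our part), a model scores 0 when it does no better than always predicting the mean of the testing targets, and goes negative when it does worse. The toy sketch below illustrates this on made-up numbers.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0]  # made-up targets, for illustration only
print(r_squared(y_true, [1.0, 2.0, 3.0]))  # perfect predictions
print(r_squared(y_true, [2.0, 2.0, 2.0]))  # always predicting the mean
print(r_squared(y_true, [3.0, 2.0, 1.0]))  # worse than predicting the mean
```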

## Code

A Jupyter notebook to reproduce the experiments above is available here. In this post, we will focus on showing you what it will take to compress your own AutoGluon model in Python.

### Setup

First, you will need to install the kxy Python package using your method of choice:

• From PyPI: pip install -U kxy
• From GitHub: git clone https://github.com/kxytechnologies/kxy-python.git && cd ./kxy-python && pip install .
• From DockerHub: docker pull kxytechnologies/kxy. The image is shipped with kxy and all its dependencies pre-installed.

Next, simply import the kxy package in your code. The kxy package is well integrated with pandas, so while you are at it you might also want to import pandas.

```python
import kxy
import pandas as pd
```


From this point on, any instance of a pandas DataFrame, say df, that you have in your code is automatically enriched with a set of kxy methods accessible as df.kxy.<method_name>.

### Training

Training a compressed AutoGluon model can be done in a single line of code.

```python
results = training_df.kxy.fit(
    target_column, learner_func,
    problem_type=problem_type, feature_selection_method='leanml')
```


training_df is the pandas DataFrame containing training data. target_column is a variable containing the name of the target column. All other columns are considered candidate features/explanatory variables.

problem_type reflects the nature of the predictive problem to solve and should be either 'regression' or 'classification'.

feature_selection_method should be set to 'leanml'. If you want to know all possible values and why you should use 'leanml', read this blog post.

In general, learner_func is the function we will call to create new trainable instances of your model. It takes three optional parameters:

• n_vars: The number of features the model should expect, in case it is required to instantiate the model (e.g. for neural networks).
• path: Where to save or load the model from, if needed.
• safe: A boolean that controls what to do when the model can't be loaded from path. The convention is that, if path is not None, learner_func should try to load the model from disk. If this fails then learner_func should create a new instance of your model if safe is set to True, and raise an exception if safe is False.

learner_func should return a model following the Scikit-Learn API. That is, at the very least, returned models should have fit(self, X, y) and predict(self, X) methods, where X and y are NumPy arrays. If you intend to save/load your compressed models, models returned by learner_func should also have a save(self, path) method to save a specific instance to disk.
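To make this contract concrete, here is a minimal sketch of a custom learner_func. Everything in it is hypothetical and chosen for illustration: MeanRegressor is a deliberately trivial model that always predicts the mean of the training targets, and saving to JSON is just one possible persistence choice. In practice you would wrap a real estimator in the same way.

```python
import json
import os
import tempfile
from statistics import fmean


class MeanRegressor:
    """Toy model (hypothetical): always predicts the mean of the training targets."""

    def __init__(self, mean=0.0):
        self.mean_ = mean

    def fit(self, X, y):
        self.mean_ = fmean([float(v) for v in y])
        return self

    def predict(self, X):
        return [self.mean_] * len(X)

    def save(self, path):
        with open(path, "w") as f:
            json.dump({"mean": self.mean_}, f)

    @classmethod
    def load(cls, path):
        with open(path) as f:
            return cls(mean=json.load(f)["mean"])


def learner_func(n_vars=None, path=None, safe=True):
    """Return a trainable model, loading it from path when one is given."""
    if path is not None:
        try:
            return MeanRegressor.load(path)
        except (OSError, ValueError, KeyError):
            if not safe:
                raise  # safe=False: loading failures are errors
    # No path given, or loading failed with safe=True: start from scratch.
    return MeanRegressor()


# Usage: fit, save, then reload through learner_func.
model = learner_func().fit([[0.0], [1.0]], [1.0, 3.0])
model_path = os.path.join(tempfile.mkdtemp(), "model.json")
model.save(model_path)
reloaded = learner_func(path=model_path)
print(reloaded.predict([[5.0], [6.0]]))
```

n_vars is unused here because this toy model does not need to know its input dimension; a neural network would use it to size its first layer.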

For AutoGluon models specifically, we provide a utility function (get_autogluon_learner) to generate a learner_func that creates instances of autogluon.tabular.TabularPredictor with set hyper-parameters.

Here is an illustration in the case of a regression problem.

```python
from kxy.learning import get_autogluon_learner

kwargs = {}      # extra named arguments for the TabularPredictor constructor
fit_kwargs = {}  # named arguments for TabularPredictor.fit
learner_func = get_autogluon_learner(
    problem_type='regression', eval_metric=None, verbosity=2,
    sample_weight=None, weight_evaluation=False, groups=None,
    fit_kwargs=fit_kwargs, **kwargs)
```

problem_type, eval_metric, verbosity, sample_weight, weight_evaluation, groups, and kwargs are all parameters you would pass to the constructor of autogluon.tabular.TabularPredictor. It is worth noting that problem_type here is not the same as the problem_type you would pass to df.kxy.fit. fit_kwargs is the dictionary of named arguments you would pass to the fit method of an instance of autogluon.tabular.TabularPredictor.
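As an illustration of how the two groups of arguments might be split, the sketch below builds both dictionaries. The specific values are assumptions chosen for the example (time_limit and presets are arguments of TabularPredictor.fit in recent AutoGluon versions, and root_mean_squared_error is one of its metric names); check them against the AutoGluon version you have installed.

```python
# Arguments forwarded to the TabularPredictor constructor.
constructor_kwargs = {
    "problem_type": "regression",
    "eval_metric": "root_mean_squared_error",
    "verbosity": 2,
}

# Arguments forwarded to TabularPredictor.fit (illustrative values).
fit_kwargs = {
    "time_limit": 600,            # training budget, in seconds
    "presets": "medium_quality",  # speed/quality tradeoff
}

# With kxy and AutoGluon installed, these would be passed as:
# learner_func = get_autogluon_learner(fit_kwargs=fit_kwargs, **constructor_kwargs)
```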

### Prediction

Once you have fitted a model, you get a predictor back in the results dictionary.

```python
predictor = results['predictor']
```

You can inspect selected variables from the predictor like so:

```python
selected_variables = predictor.selected_variables
```

The following line shows you how to make predictions corresponding to a DataFrame of testing features testing_df.

```python
predictions_df = predictor.predict(testing_df)
```

All that is required is for testing_df to have all columns contained in selected_variables. predictions_df is a pandas DataFrame with a single column whose name is the same as the target column in the training DataFrame training_df.

To access the low-level TabularPredictor model, run

```python
autogluon_tabular_predictor = predictor.models[0]._model
```

If you choose to use the TabularPredictor directly, remember that testing inputs data_test should be generated like so:

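```python
# Keep only the selected features, in the order used at training time.
X_test = testing_df[selected_variables].values
# Column names the underlying TabularPredictor was trained with.
X_columns = predictor.models[0].x_columns
data_test = pd.DataFrame(X_test, columns=X_columns)
```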

You can directly save the predictor to disk using

```python
predictor.save(path)
```

To load a predictor from disk, run

```python
from kxy.learning.leanml_predictor import LeanMLPredictor

predictor = LeanMLPredictor.load(path, learner_func)
```

## Pricing

The kxy package is open-source. However, some of the heavy-duty optimization tasks (involved in LeanML feature selection) are run by our backend. For that, we charge a small per-task fee.

That said, kxy is completely free for academic use. Simply sign up here with your university email address, and get your API key here.

Once you have your API key, simply run kxy configure <YOUR API KEY> in the terminal as a one-off, or set your API key as the value of the environment variable KXY_API_KEY, for instance by adding the two lines below to your Python code before importing the kxy package:

```python
import os

# Must be set before the kxy package is imported.
os.environ['KXY_API_KEY'] = '<YOUR API KEY>'
```

Finally, you don't need to sign up to try out kxy! Your first few dozen tasks are on us; just install the kxy package and give it a go. If you love it, sign up and spread the word.