## Motivation

Machine learning models are peculiar pieces of software in that they are costlier to maintain than traditional software.

The cost of maintaining predictive machine learning models in production is exacerbated by several factors, among which are data pipeline outages and model performance decay resulting from data drift.

When a data pipeline goes down and predictive models stop receiving some of their inputs, those models (usually) stop working. Such an outage, whose likelihood increases with the number of features a predictive model relies on, can severely handicap a product and represent a large opportunity cost. Intuitively, the fewer features a predictive model uses, the less likely it is to be taken down by a pipeline outage.

As time goes by, predictive models often become less effective, to the point of needing to be retrained. The root cause of this problem is known as data drift.

The way we humans behave tends to change or 'drift' over time. It is, therefore, no surprise that distributions of data generated by human activities also change over time. In particular, the relationship between features a production model uses and the target it predicts will also change over time, thereby gradually rendering obsolete the specific relationship learned by the production model at the time of training, and upon which it relies to make predictions. The more features the production model uses, the more rapidly data will drift, and the more often the production model will need to be retrained.

Using too large a production model has other downsides as well, among which high latency and poor explainability.

While one should aim to keep the number of features a production model uses to a bare minimum, accidentally leaving out the wrong features can drastically reduce model performance, which would likely affect the bottom line. Not to mention that the 'bare minimum' number of features one can keep without affecting model performance is usually unknown to machine learning engineers, and varies from one problem to another.

In short, if your production model uses too many features, you will increase your maintenance cost (among other downsides). But if you choose the wrong subset of features, or simply don't use enough of them, then you will decrease model performance, and your bottom line with it.

This blog post provides a solution to this dilemma in Python when you use XGBoost as your production model.

## What To Expect

Using the kxy Python package, you don't have to choose between a high maintenance cost and a low bottom line; you can drastically reduce the number of features your production model uses without losing model performance.

Indeed, in an experiment on 38 real-world classification and regression problems from the UCI Machine Learning Repository and Kaggle, using the kxy package, we were able to reduce the number of features used by 95% with virtually no effect on model performance.

The datasets used had between 15 and 1925 automatically generated candidate features, and between 303 and 583250 rows. We did a random 80/20 training/testing data split, and used as the evaluation metric the testing $R^2$ for regression problems, and the testing AUC for classification problems.
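As a concrete sketch of that protocol, a random 80/20 training/testing split can be done with plain pandas. The toy DataFrame below is a stand-in for illustration, not one of the actual datasets:

```python
import numpy as np
import pandas as pd

# Toy stand-in dataset: 100 rows, 3 candidate features (illustration only).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=['x1', 'x2', 'x3'])

# Random 80/20 training/testing split: shuffle, then slice.
shuffled = df.sample(frac=1.0, random_state=0)
n_train = int(0.8 * len(shuffled))
training_df = shuffled.iloc[:n_train]
testing_df = shuffled.iloc[n_train:]
```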

Details and results for each problem are summarized in the table below.

| Dataset | Rows | Candidate Features | Features Selected | Performance (Full Model) | Performance (Compressed Model) | Problem Type |
|---|---|---|---|---|---|---|
| SkinSegmentation | 245057 | 15 | 7 | 1 | 1 | classification |
| BankNote | 1372 | 20 | 4 | 1 | 0.99 | classification |
| PowerPlant | 9568 | 20 | 8 | 0.97 | 0.97 | regression |
| AirFoil | 1503 | 25 | 9 | 0.93 | 0.92 | regression |
| YachtHydrodynamics | 308 | 30 | 1 | 1 | 0.99 | regression |
| RealEstate | 414 | 30 | 9 | 0.72 | 0.72 | regression |
| Abalone | 4177 | 38 | 8 | 0.52 | 0.53 | regression |
| Concrete | 1030 | 40 | 11 | 0.93 | 0.92 | regression |
| EnergyEfficiency | 768 | 45 | 11 | 1 | 1 | regression |
| WaterQuality | 3276 | 45 | 31 | 0.6 | 0.59 | classification |
| Shuttle | 58000 | 45 | 3 | 1 | 1 | classification |
| MagicGamma | 19020 | 50 | 15 | 0.86 | 0.86 | classification |
| Avila | 20867 | 50 | 30 | 1 | 1 | classification |
| WhiteWineQuality | 4898 | 55 | 29 | 0.44 | 0.37 | regression |
| HeartAttack | 303 | 65 | 9 | 0.86 | 0.84 | classification |
| HeartDisease | 303 | 65 | 9 | 0.86 | 0.84 | classification |
| AirQuality | 8991 | 70 | 2 | 1 | 1 | regression |
| EEGEyeState | 14980 | 70 | 17 | 0.91 | 0.92 | classification |
| LetterRecognition | 20000 | 80 | 22 | 0.98 | 0.98 | classification |
| NavalPropulsion | 11934 | 85 | 5 | 0.99 | 0.99 | regression |
| BikeSharing | 17379 | 90 | 4 | 1 | 1 | regression |
| DiabeticRetinopathy | 1151 | 95 | 34 | 0.65 | 0.7 | classification |
| BankMarketing | 41188 | 103 | 14 | 0.77 | 0.76 | classification |
| Parkinson | 5875 | 105 | 2 | 1 | 1 | regression |
| CardDefault | 30000 | 115 | 26 | 0.66 | 0.66 | classification |
| Landsat | 6435 | 180 | 5 | 0.98 | 0.98 | classification |
| Adult | 48843 | 202 | 19 | 0.79 | 0.78 | classification |
| SensorLessDrive | 58509 | 240 | 23 | 1 | 1 | classification |
| OnlineNews | 39644 | 290 | 26 | 0 | 0 | regression |
| SocialMediaBuzz | 583250 | 385 | 6 | 0.95 | 0.94 | regression |
| Superconductivity | 21263 | 405 | 17 | 0.91 | 0.9 | regression |
| HousePricesAdvanced | 1460 | 432 | 8 | 0.83 | 0.88 | regression |
| YearPredictionMSD | 515345 | 450 | 36 | 0.32 | 0.31 | regression |
| APSFailure | 76000 | 850 | 9 | 0.91 | 0.73 | classification |
| BlogFeedback | 60021 | 1400 | 13 | 0.58 | 0.57 | regression |
| Titanic | 891 | 1754 | 26 | 0.82 | 0.81 | classification |
| CTSlices | 53500 | 1925 | 34 | 0.99 | 0.98 | regression |

Cumulatively, there were 10229 candidate features to select from across the 38 datasets, and the kxy package only selected 555 of them in total, which corresponds to a 95% reduction in the number of features used overall.

Crucially, the average performance (testing $R^2$ for regression problems and testing AUC for classification problems) of the compressed models was 0.82, compared to 0.83 for the full models; virtually no performance reduction for a 95% reduction in the number of features used!

## Code

A Jupyter notebook to reproduce the experiment above is available here. In this post, we will focus on showing you what it will take to compress your own XGBoost model in Python.

### Setup

First, you will need to install the kxy Python package using your method of choice:

- From PyPi: `pip install -U kxy`
- From GitHub: `git clone https://github.com/kxytechnologies/kxy-python.git && cd ./kxy-python && pip install .`
- From DockerHub: `docker pull kxytechnologies/kxy`. The image ships with kxy and all its dependencies pre-installed.

Next, simply import the kxy package in your code. The kxy package is well integrated with pandas, so while you are at it you might also want to import pandas.

```python
import kxy
import pandas as pd
```


From this point on, any instance of a pandas DataFrame, say df, that you have in your code is automatically enriched with a set of kxy methods accessible as df.kxy.<method_name>.
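Namespaces like `df.kxy` are typically implemented with pandas' accessor registration mechanism. Here is a minimal sketch of how such a namespace works; the `demo` accessor and its method below are hypothetical illustrations, not part of kxy:

```python
import pandas as pd

# Register a custom DataFrame accessor, so every DataFrame gains a
# df.demo namespace. This mirrors how packages expose df.<pkg>.<method>.
@pd.api.extensions.register_dataframe_accessor('demo')
class DemoAccessor:
    def __init__(self, df):
        self._df = df

    def n_candidate_features(self, target_column):
        # Every column other than the target is a candidate feature.
        return len([c for c in self._df.columns if c != target_column])

df = pd.DataFrame({'x1': [1, 2], 'x2': [3, 4], 'y': [0, 1]})
print(df.demo.n_candidate_features('y'))  # 2
```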

### Training

Training a compressed XGBoost model can be done in a single line of code.

```python
results = training_df.kxy.fit(target_column, learner_func,
    problem_type=problem_type, feature_selection_method='leanml')
```


`training_df` is the pandas DataFrame containing the training data. `target_column` is a variable containing the name of the target column. All other columns are considered candidate features/explanatory variables.

`problem_type` reflects the nature of the predictive problem to solve and should be either `'regression'` or `'classification'`.

`feature_selection_method` should be set to `'leanml'`. To learn about all possible values and why you should use `'leanml'`, read this blog post.

In general, `learner_func` is the function we will call to create new trainable instances of your model. It takes three optional parameters:

- `n_vars`: The number of features the model should expect, in case this is required to instantiate the model (e.g. for neural networks).
- `path`: Where to save the model to, or load it from, if needed.
- `safe`: A boolean controlling what to do when the model cannot be loaded from `path`. The convention is that if `path` is not `None`, `learner_func` should try to load the model from disk. If this fails, `learner_func` should create a new instance of your model when `safe` is `True`, and raise an exception when `safe` is `False`.

`learner_func` should return a model following the Scikit-Learn API. That is, at the very least, returned models should have `fit(self, X, y)` and `predict(self, X)` methods, where `X` and `y` are NumPy arrays. If you intend to save/load your compressed models, models returned by `learner_func` should also have a `save(self, path)` method to save a specific instance to disk.
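To make this contract concrete, here is a minimal, hypothetical `learner_func` built around a plain least-squares model. The class and file-handling details below are illustrative only, not part of kxy:

```python
import os
import pickle
import numpy as np

class TinyLinearModel:
    """Minimal model honoring the Scikit-Learn-style contract above."""
    def fit(self, X, y):
        # Least-squares fit with an intercept column appended to X.
        A = np.hstack([X, np.ones((X.shape[0], 1))])
        self.coef_, *_ = np.linalg.lstsq(A, y, rcond=None)
        return self

    def predict(self, X):
        A = np.hstack([X, np.ones((X.shape[0], 1))])
        return A @ self.coef_

    def save(self, path):
        # Persist this instance to disk.
        with open(path, 'wb') as f:
            pickle.dump(self, f)

def learner_func(n_vars=None, path=None, safe=True):
    # When a path is given, try to load the model from disk first.
    if path is not None and os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)
    # Loading failed: raise if safe is False, else create a new instance.
    if path is not None and not safe:
        raise FileNotFoundError(path)
    return TinyLinearModel()
```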

For XGBoost models specifically, we provide a utility function, `get_xgboost_learner`, to generate a `learner_func` that creates instances of XGBoost models with set hyper-parameters, using the XGBoost Scikit-Learn API.

Here is an illustration in the case of a regression problem.

```python
from kxy.learning import get_xgboost_learner

class_name = 'xgboost.XGBRegressor'  # or 'xgboost.XGBClassifier'
args = ()
kwargs = {}
learner_func = get_xgboost_learner(class_name, *args,
    fit_kwargs={}, predict_kwargs={}, **kwargs)
```

`args` and `kwargs` are positional and named arguments you would pass to the constructor of `xgboost.XGBRegressor` or `xgboost.XGBClassifier`. `fit_kwargs` (resp. `predict_kwargs`) are named arguments you would pass to the `fit` (resp. `predict`) method of an instance of `xgboost.XGBRegressor` or `xgboost.XGBClassifier`.

### Prediction

Once you have fitted a model, you get a predictor back in the `results` dictionary.

```python
predictor = results['predictor']
```

You can inspect the selected variables from the predictor like so:

```python
selected_variables = predictor.selected_variables
```

The following line shows how to make predictions corresponding to a DataFrame of testing features, `testing_df`.

```python
predictions_df = predictor.predict(testing_df)
```

All that is required is for `testing_df` to have all columns contained in `selected_variables`. `predictions_df` is a pandas DataFrame with a single column, whose name is the same as that of the target column in the training DataFrame `training_df`.

To access the low-level `xgboost.sklearn.XGBModel` and its booster, run:

```python
xgbmodel = predictor.models[0]._model
booster = xgbmodel.get_booster()
```

If you choose to use the booster or `XGBModel` directly, remember that testing inputs should be generated like so:

```python
X_test = testing_df[selected_variables].values
```

You can directly save the predictor to disk using:

```python
predictor.save(path)
```

To load a predictor from disk, run:

```python
from kxy.learning.leanml_predictor import LeanMLPredictor
predictor = LeanMLPredictor.load(path, learner_func)
```

## Pricing

The kxy package is open-source. However, some of the heavy-duty optimization tasks involved in LeanML feature selection are run by our backend, for which we charge a small per-task fee.

That said, kxy is completely free for academic use. Simply sign up here with your university email address, and get your API key here.

Once you have your API key, simply run `kxy configure <YOUR API KEY>` in the terminal as a one-off, or set your API key as the value of the environment variable `KXY_API_KEY`, for instance by adding the two lines below to your Python code before importing the kxy package:

```python
import os
os.environ['KXY_API_KEY'] = '<YOUR API KEY>'
```

Finally, you don't need to sign up to try out kxy! Your first few dozen tasks are on us; just install the kxy package and give it a go. If you love it, sign up and spread the word.