Motivation
Machine learning models are peculiar pieces of software: they are costlier to maintain than traditional software.
The cost of maintaining predictive machine learning models in production is exacerbated by several factors, among which are data pipeline outages and model performance decay resulting from data drift.
When a data pipeline goes down and predictive models stop receiving some of their inputs, those predictive models (usually) stop working. Such an outage, whose likelihood increases with the number of features predictive models rely on, can severely handicap a product and represent a significant opportunity cost. Intuitively, the fewer features a predictive model uses, the less likely it is to be taken down by a pipeline outage.
As time goes by, predictive models often become less effective, to the point of needing to be retrained. The root cause of this problem is known as data drift.
The way we humans behave tends to change or 'drift' over time. It is, therefore, no surprise that distributions of data generated by human activities also change over time. In particular, the relationship between features a production model uses and the target it predicts will also change over time, thereby gradually rendering obsolete the specific relationship learned by the production model at the time of training, and upon which it relies to make predictions. The more features the production model uses, the more rapidly data will drift, and the more often the production model will need to be retrained.
Using too large a production model has other downsides as well, among them high latency and poor explainability.
While one should aim to keep the number of features a production model uses to a bare minimum, accidentally leaving out the wrong features can drastically reduce model performance, which would likely hurt the bottom line. Not to mention that the 'bare minimum' number of features one can keep without affecting model performance is usually unknown to machine learning engineers, and varies from one problem to another.
In short, if your production model uses too many features, you will increase your maintenance cost (among other downsides). But if you choose the wrong subset of features, or simply don't use enough features, you will decrease model performance, and the bottom line along with it.
This blog post provides a solution to this dilemma in Python when you use XGBoost as your production model.
What To Expect
Using the kxy Python package, you don't have to choose between high maintenance cost and a low bottom line; you can drastically reduce the number of features your production model uses without losing model performance.
Indeed, in an experiment on 38 real-world classification and regression problems from the UCI Machine Learning Repository and Kaggle, using the kxy package, we were able to reduce the number of features used by 95% with virtually no effect on model performance.
The datasets used had between 15 and 1925 automatically generated candidate features, and between 303 and 583250 rows. We did a random 80/20 training/testing data split, and used the testing \(R^2\) as the evaluation metric for regression problems and the testing AUC for classification problems.
Details and results for each problem are summarized in the table below.
Dataset | Rows | Candidate Features | Features Selected | Performance (Full Model) | Performance (Compressed Model) | Problem Type |
---|---|---|---|---|---|---|
SkinSegmentation | 245057 | 15 | 7 | 1 | 1 | classification |
BankNote | 1372 | 20 | 4 | 1 | 0.99 | classification |
PowerPlant | 9568 | 20 | 8 | 0.97 | 0.97 | regression |
AirFoil | 1503 | 25 | 9 | 0.93 | 0.92 | regression |
YachtHydrodynamics | 308 | 30 | 1 | 1 | 0.99 | regression |
RealEstate | 414 | 30 | 9 | 0.72 | 0.72 | regression |
Abalone | 4177 | 38 | 8 | 0.52 | 0.53 | regression |
Concrete | 1030 | 40 | 11 | 0.93 | 0.92 | regression |
EnergyEfficiency | 768 | 45 | 11 | 1 | 1 | regression |
WaterQuality | 3276 | 45 | 31 | 0.6 | 0.59 | classification |
Shuttle | 58000 | 45 | 3 | 1 | 1 | classification |
MagicGamma | 19020 | 50 | 15 | 0.86 | 0.86 | classification |
Avila | 20867 | 50 | 30 | 1 | 1 | classification |
WhiteWineQuality | 4898 | 55 | 29 | 0.44 | 0.37 | regression |
HeartAttack | 303 | 65 | 9 | 0.86 | 0.84 | classification |
HeartDisease | 303 | 65 | 9 | 0.86 | 0.84 | classification |
AirQuality | 8991 | 70 | 2 | 1 | 1 | regression |
EEGEyeState | 14980 | 70 | 17 | 0.91 | 0.92 | classification |
LetterRecognition | 20000 | 80 | 22 | 0.98 | 0.98 | classification |
NavalPropulsion | 11934 | 85 | 5 | 0.99 | 0.99 | regression |
BikeSharing | 17379 | 90 | 4 | 1 | 1 | regression |
DiabeticRetinopathy | 1151 | 95 | 34 | 0.65 | 0.7 | classification |
BankMarketing | 41188 | 103 | 14 | 0.77 | 0.76 | classification |
Parkinson | 5875 | 105 | 2 | 1 | 1 | regression |
CardDefault | 30000 | 115 | 26 | 0.66 | 0.66 | classification |
Landsat | 6435 | 180 | 5 | 0.98 | 0.98 | classification |
Adult | 48843 | 202 | 19 | 0.79 | 0.78 | classification |
SensorLessDrive | 58509 | 240 | 23 | 1 | 1 | classification |
FacebookComments | 209074 | 265 | 13 | 0.72 | 0.58 | regression |
OnlineNews | 39644 | 290 | 26 | 0 | 0 | regression |
SocialMediaBuzz | 583250 | 385 | 6 | 0.95 | 0.94 | regression |
Superconductivity | 21263 | 405 | 17 | 0.91 | 0.9 | regression |
HousePricesAdvanced | 1460 | 432 | 8 | 0.83 | 0.88 | regression |
YearPredictionMSD | 515345 | 450 | 36 | 0.32 | 0.31 | regression |
APSFailure | 76000 | 850 | 9 | 0.91 | 0.73 | classification |
BlogFeedback | 60021 | 1400 | 13 | 0.58 | 0.57 | regression |
Titanic | 891 | 1754 | 26 | 0.82 | 0.81 | classification |
CTSlices | 53500 | 1925 | 34 | 0.99 | 0.98 | regression |
Cumulatively, there were 10229 candidate features to select from across the 38 datasets, and the kxy package only selected 555 of them in total, which corresponds to a 95% reduction in the number of features used overall.
Crucially, the average performance (testing \(R^2\) for regression problems and testing AUC for classification problems) of the compressed models was 0.82, compared to 0.83 for the full models; virtually no loss of performance for a 95% reduction in the number of features used!
Code
A Jupyter notebook to reproduce the experiment above is available here. In this post, we will focus on showing you what it will take to compress your own XGBoost model in Python.
Setup
First, you will need to install the kxy Python package using your method of choice:
- From PyPI: pip install -U kxy
- From GitHub: git clone https://github.com/kxytechnologies/kxy-python.git && cd ./kxy-python && pip install .
- From DockerHub: docker pull kxytechnologies/kxy. The image is shipped with kxy and all its dependencies pre-installed.
Next, simply import the kxy package in your code. The kxy package is well integrated with pandas, so while you are at it, you might also want to import pandas.
import kxy
import pandas as pd
From this point on, any instance of a pandas DataFrame, say df, that you have in your code is automatically enriched with a set of kxy methods accessible as df.kxy.<method_name>.
Training
Training a compressed XGBoost model can be done in a single line of code.
results = training_df.kxy.fit(target_column, learner_func, \
problem_type=problem_type, feature_selection_method='leanml')
training_df is the pandas DataFrame containing the training data. target_column is a variable containing the name of the target column. All other columns are considered candidate features/explanatory variables.
problem_type reflects the nature of the predictive problem to solve and should be either 'regression' or 'classification'.
feature_selection_method should be set to 'leanml'. If you want to know all possible values and why you should use 'leanml', read this blog post.
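As a concrete illustration, here is what the call might look like for a regression problem; the file name and target column below are hypothetical placeholders, and learner_func is constructed as described next.
# Hypothetical example: 'training_data.csv' and 'price' are placeholders.
training_df = pd.read_csv('training_data.csv')
results = training_df.kxy.fit('price', learner_func, \
    problem_type='regression', feature_selection_method='leanml')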
In general, learner_func is the function we will call to create new trainable instances of your model. It takes three optional parameters:
- n_vars: the number of features the model should expect, in case this is required to instantiate the model (e.g. for neural networks).
- path: where to save or load the model from, if needed.
- safe: a boolean that controls what to do when the model can't be loaded from path. The convention is that if path is not None, learner_func should try to load the model from disk. If this fails, learner_func should create a new instance of your model if safe is set to True, and raise an exception if safe is False.
learner_func should return a model following the Scikit-Learn API. That is, at the very least, returned models should have fit(self, X, y) and predict(self, X) methods, where X and y are NumPy arrays. If you intend to save/load your compressed models, models returned by learner_func should also have a save(self, path) method to save a specific instance to disk.
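To make this contract concrete, here is a minimal sketch, not taken from the kxy package, of what a hand-written learner_func wrapping xgboost.XGBRegressor could look like; the thin subclass and the hyper-parameter values are illustrative assumptions.
# Minimal sketch of a hand-written learner_func following the convention above.
# The subclass only adds a save(path) alias around xgboost's native save_model,
# and the hyper-parameter values are placeholders.
import xgboost as xgb

class XGBRegressorWithSave(xgb.XGBRegressor):
    def save(self, path):
        # Delegate to xgboost's own serialization.
        self.save_model(path)

def my_learner_func(n_vars=None, path=None, safe=True):
    if path is not None:
        try:
            model = XGBRegressorWithSave()
            model.load_model(path)  # try to restore a previously saved model
            return model
        except Exception:
            if not safe:
                raise
    # n_vars is not needed to instantiate an XGBoost model.
    return XGBRegressorWithSave(n_estimators=100, max_depth=6)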
For XGBoost models specifically, we provide a utility function (get_xgboost_learner) to generate a learner_func that creates instances of XGBoost models with set hyper-parameters, using the XGBoost Scikit-Learn API.
Here is an illustration in the case of a regression problem.
from kxy.learning import get_xgboost_learner
class_name = 'xgboost.XGBRegressor' # or 'xgboost.XGBClassifier'
args = ()    # positional arguments to pass to the constructor
kwargs = {}  # named arguments to pass to the constructor
learner_func = get_xgboost_learner(class_name, *args, \
fit_kwargs={}, predict_kwargs={}, **kwargs)
args and kwargs are the positional and named arguments you would pass to the constructor of xgboost.XGBRegressor or xgboost.XGBClassifier. fit_kwargs (resp. predict_kwargs) are the named arguments you would pass to the fit (resp. predict) method of an instance of xgboost.XGBRegressor or xgboost.XGBClassifier.
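For example, if you wanted to pin down a few hyper-parameters of the underlying XGBoost model (the values below are purely illustrative), you could write:
# Illustrative hyper-parameter values, passed through to xgboost.XGBRegressor.
learner_func = get_xgboost_learner('xgboost.XGBRegressor', \
    learning_rate=0.1, n_estimators=200, max_depth=6)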
Prediction
Once you have fitted a model, you get a predictor back in the results dictionary.
predictor = results['predictor']
You can inspect selected variables from the predictor like so:
selected_variables = predictor.selected_variables
The following line shows you how to make predictions corresponding to a DataFrame of testing features, testing_df.
predictions_df = predictor.predict(testing_df)
All that is required is for testing_df to have all the columns contained in selected_variables. predictions_df is a pandas DataFrame with a single column whose name is the same as that of the target column in the training DataFrame training_df.
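If testing_df also contains the target column, you can evaluate the compressed model with the same metrics we used in the experiment above; here is a sketch for a regression problem, assuming scikit-learn is installed.
# Hypothetical evaluation sketch for a regression problem; the testing R^2
# is the metric reported in the experiment above.
from sklearn.metrics import r2_score

y_true = testing_df[target_column].values
y_pred = predictions_df[target_column].values
print('Testing R^2: %.2f' % r2_score(y_true, y_pred))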
To access the low-level xgboost.sklearn.XGBModel and its booster, run
xgbmodel = predictor.models[0]._model
booster = xgbmodel.get_booster()
If you choose to use the booster or XGBModel directly, remember that testing inputs should be generated like so:
X_test = testing_df[selected_variables].values
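From there, you could for instance generate predictions with the Scikit-Learn wrapper directly, or inspect per-feature gains through the booster; a hypothetical illustration:
# Hypothetical direct use of the low-level objects, reusing X_test from above.
direct_predictions = xgbmodel.predict(X_test)      # Scikit-Learn style predictions
gains = booster.get_score(importance_type='gain')  # per-feature gain scores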
Saving/Loading
You can directly save the predictor to disk using
predictor.save(path)
To load a predictor from disk, run
from kxy.learning.leanml_predictor import LeanMLPredictor
predictor = LeanMLPredictor.load(path, learner_func)
Pricing
The kxy package is open-source. However, some of the heavy-duty optimization tasks (involved in LeanML feature selection) are run by our backend. For that, we charge a small per-task fee.
That said, kxy is completely free for academic use. Simply sign up here with your university email address, and get your API key here.
Once you have your API key, simply run kxy configure <YOUR API KEY> in the terminal as a one-off, or set your API key as the value of the environment variable KXY_API_KEY, for instance by adding the two lines below to your Python code before importing the kxy package:
import os
os.environ['KXY_API_KEY'] = '<YOUR API KEY>'
Finally, you don't need to sign up to try out kxy! Your first few dozen tasks are on us; just install the kxy package and give it a go. If you love it, sign up and spread the word.