Feature Engineering: Why It Matters, and How To Do It Right

We explain why every great model needs great features, and we arm you with tools and principles that will help you create better features.

Insightful data and effective feature engineering are the cornerstones of any successful Machine Learning project.

But why should you care about constructing features if you've already invested a great deal in capturing as much information as possible about your customers, business activity, or more generally the problem of interest?

Shouldn't the Machine Learning model handle the rest? Is it not counterintuitive that a great model would need a lot of feature engineering?

In this article, we will clarify this apparent paradox and arm you with principles and tools that will help you create better features.

The Main Objective: Increase Model-Representation Adequacy

To understand the relationship between features and models, let us consider a supervised learning problem (regression or classification) with raw categorical and/or continuous inputs \(x \in \mathcal{X}\) and target \(y \in \mathcal{Y}\).

The Oracle Predictor

When the problem is a regression problem with MSE loss, the theoretical-best regressor is known to be the model that maps \(x\) to the associated conditional expectation  \(x \to \mathbb{E}(y|x).\) When the loss function is MAE, the theoretical-best regressor maps \(x\) to the median of the conditional distribution of \(y\) given \(x\). When the problem is a classification problem and the loss function is the classification error, the theoretical-best classifier is the one that maps \(x\) to the mode of the (discrete) conditional distribution of \(y\) given \(x\).
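
In compact notation, these three oracle predictors read:

\[ \bar{f}_{\text{MSE}}(x) = \mathbb{E}(y|x), \qquad \bar{f}_{\text{MAE}}(x) = \text{median}(y|x), \qquad \bar{f}_{\text{0-1}}(x) = \underset{c \in \mathcal{Y}}{\arg\max}~\mathbb{P}(y=c|x). \]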

Note that, as we would expect, the theoretical-best model does not need any feature engineering! It works directly from raw inputs \(x\), which could be of any type, and takes care of any needed feature engineering.

The Oracle Feature

Let's denote by \(\bar{z} := \bar{f}(x)\) the prediction made by the theoretical-best model. We can look at \(\bar{z}\) as a feature constructed from \(x\). To achieve the best performance possible from \(\bar{z}\), all we need is to use the identity function as our model: \(\bar{z} \to \bar{z}\).

Note that we do not need any model training! Feature engineering has already done all the work.

Real-Life Learners Are Imperfect

If we knew \(\bar{f}\), then we would not need to do any feature engineering. If we knew \(\bar{z}\), then we would not need to do any model training.

In practice, however, we never know \(\bar{f}\) or \(\bar{z}\). The best we can do is to approximate the oracle predictor with Machine Learning models in our toolbox (e.g. linear models, LightGBM, XGBoost, Random Forest, kernel-based learners, ensemble models, etc.).

Models in our toolbox are imperfect learners in that they only work well in the presence of specific types of patterns relating inputs to the target.

For instance, linear models (e.g. OLS, LASSO, Ridge, ElasticNet etc.) can only capture linear relationships between inputs and the target. Decision tree regressors work best in situations where a good global regressor can be learned by partitioning the input space into \(m\) disjoint subsets using simple decision rules and learning \(m\) very simple local regressors.

Kernel-Ridge regressors and Gaussian process regressors with smooth kernels assume that inputs that are close to each other (in Euclidean distance) will always have targets that are also close to each other (in Euclidean distance). The same assumption is made by nearest neighbors regressors.

Beyond smoothness, kernel-based regressors can also be used to learn global patterns in situations where inputs that are close to each other in a given norm have targets that are close to each other in the Euclidean norm. The norm can be chosen so as to learn periodic or seasonal patterns, with or without a trend, to name but a few.
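
As a quick illustration, here is a minimal sketch of this idea using scikit-learn's Gaussian process regressor with a periodic (ExpSineSquared) kernel; the library choice and the synthetic data are assumptions made purely for illustration:

# Minimal sketch: a Gaussian process regressor with a periodic kernel
# (scikit-learn and synthetic data are assumed here for illustration only).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ExpSineSquared, WhiteKernel

# Synthetic seasonal data: a noisy sine wave with period 12 (e.g. monthly seasonality).
t = np.arange(120).reshape(-1, 1).astype(float)
y = np.sin(2 * np.pi * t.ravel() / 12.0) + 0.1 * np.random.randn(120)

# The ExpSineSquared kernel encodes the assumption that inputs one period apart
# have similar targets, i.e. closeness is measured in a periodic sense.
kernel = ExpSineSquared(length_scale=1.0, periodicity=12.0) + WhiteKernel(noise_level=0.01)
gpr = GaussianProcessRegressor(kernel=kernel).fit(t, y)

# Forecast one full period ahead.
t_future = np.arange(120, 132).reshape(-1, 1).astype(float)
y_pred = gpr.predict(t_future)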

Features Increase Model-Representation Adequacy

In real-life applications, the patterns relating raw inputs \(x\) to the target \(y\) are often intricate and inconsistent with what models in our toolbox expect. This is where features come in.

Instead of training our model directly on \(x\), we will construct a range of features/representations \(z_i\) that we would expect to have simpler relationships with the target \(y\), relationships that are easily learned by popular model classes without overfitting.

Example types of simple relationships include:

  • \(y\) tends to increase (resp. decrease) as a function of \(z_i\)
  • \(y\) tends to increase (resp. decrease) as \(z_i\) deviates from canonical values or range

To sum up, feature construction serves one primary role: to simplify the relationship between inputs and target so as to make it consistent with what the model to train can reliably learn.
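
Here is a toy illustration of this role (a synthetic sketch, not drawn from any real dataset): the target depends on the squared deviation of a raw input from a reference value, so a linear model on the raw input learns nothing, while the same linear model on a simple constructed feature fits very well.

# Toy sketch: a constructed feature turns a relationship a linear model cannot
# capture into one it can (synthetic data, for illustration only).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(1000, 1))
y = (x.ravel() ** 2) + 0.1 * rng.standard_normal(1000)  # y grows as x deviates from 0

raw_r2 = LinearRegression().fit(x, y).score(x, y)    # close to 0: no linear pattern in x
z = x ** 2                                           # constructed feature: squared deviation from 0
feat_r2 = LinearRegression().fit(z, y).score(z, y)   # close to 1: y is linear in z
print(raw_r2, feat_r2)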

The Main Cost: The Data Processing Inequality or The Law of 'Juice-Dissipation'

On the face of it, it might seem as though feature construction increases the insightfulness/informativeness/juice in our raw inputs \(x\) to predict the target \(y\).

This is, however, not the case; quite the opposite. Any transformation applied to our raw inputs \(x\) can only reduce or dissipate the juice that was in \(x\) to predict the target \(y\). This result is known as the Data Processing Inequality.

Thus, effective feature engineering ought to strike a balance between simplifying the inputs-target relationship and preserving as much of the juice that was in the raw inputs (to predict the target) as possible.

The solution we advocate for this is two-staged. First, you should construct as many features as reasonably possible, and then select the best ones based on the highest performance they may yield.

We cover the first stage in the section below and will cover feature selection in this blog post.

Below are a few popular feature construction techniques.

Ordinal Encoding

Most models expect ordinal inputs. Non-ordinal inputs such as strings should be ordinally encoded (e.g. using one-hot encoding) before training the model.
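
For instance, here is a minimal sketch with pandas (the toy 'job' column is made up for illustration):

# Illustrative sketch: encoding a string column.
import pandas as pd

df = pd.DataFrame({'job': ['admin.', 'technician', 'admin.', 'services']})

# One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(df['job'], prefix='job')

# Plain ordinal encoding: map each category to an integer code.
ordinal = df['job'].astype('category').cat.codes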

Missing Value Imputation

Many models cannot cope with missing values. Thus, missing values should be replaced with non-informative baselines (e.g. the median of a continuous input, or a dedicated 'missing' category for a categorical input) before training the model.
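
A minimal sketch with pandas (toy data for illustration):

# Illustrative sketch: replacing missing values with simple baselines.
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [25.0, np.nan, 40.0], 'job': ['admin.', None, 'services']})

# Continuous column: impute with the median (a non-informative baseline).
df['age'] = df['age'].fillna(df['age'].median())

# Categorical column: impute with a dedicated 'missing' category.
df['job'] = df['job'].fillna('missing')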

Entity Aggregations

Models typically expect one \(x\) per \(y\). However, it can happen that, in the problem of interest, a collection of \(x\) is naturally associated with the same \(y\). When this happens, we need to aggregate the collection into a single feature vector that will be associated with the target. Example aggregation operations include sum, mean, median, standard deviation, skewness, kurtosis, minimum, maximum, quantiles, etc. for continuous inputs, and top-k most/least frequent values and their frequencies, the number of unique values, etc. for categorical inputs.
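
A sketch with pandas, using a hypothetical transactions table in which several rows share the same customer_id:

# Sketch: aggregating a collection of rows per entity into one feature vector.
import pandas as pd

transactions = pd.DataFrame({
    'customer_id': [1, 1, 2, 2, 2],
    'amount':      [10.0, 25.0, 5.0, 7.5, 12.0],
    'channel':     ['web', 'store', 'web', 'web', 'store'],
})

agg = transactions.groupby('customer_id').agg(
    amount_mean=('amount', 'mean'),
    amount_std=('amount', 'std'),
    amount_max=('amount', 'max'),
    n_unique_channels=('channel', 'nunique'),
    top_channel=('channel', lambda s: s.mode().iloc[0]),
)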

Deviation Features

These features express how far raw inputs deviate, in absolute value, from some canonical values (e.g. their mean, median, 25th and 75th percentiles, etc.).
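
A sketch with pandas (the 'balance' column is a toy example):

# Sketch: absolute deviations of a continuous column from canonical values.
import pandas as pd

s = pd.Series([3.0, 7.0, 1.0, 9.0, 5.0], name='balance')

deviations = pd.DataFrame({
    'balance_abs_dev_mean':   (s - s.mean()).abs(),
    'balance_abs_dev_median': (s - s.median()).abs(),
    'balance_abs_dev_q25':    (s - s.quantile(0.25)).abs(),
    'balance_abs_dev_q75':    (s - s.quantile(0.75)).abs(),
})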

Seasonality Features

These are features that could reveal seasonal effects when observations are timestamped. Examples include periodic properties of the timestamp such as the second in the minute, the minute in the hour, the hour in the day, AM/PM, the day of the week, weekday vs weekend, the day of the month, the distance in days to the middle of the month, the month, etc.
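
A sketch with pandas, assuming the observations carry a timestamp column (toy timestamps for illustration):

# Sketch: periodic properties extracted from a timestamp column.
import pandas as pd

ts = pd.Series(pd.date_range('2022-01-01 08:30', periods=4, freq='7H'))

seasonal = pd.DataFrame({
    'hour':              ts.dt.hour,
    'is_pm':             (ts.dt.hour >= 12).astype(int),
    'day_of_week':       ts.dt.dayofweek,
    'is_weekend':        (ts.dt.dayofweek >= 5).astype(int),
    'day_of_month':      ts.dt.day,
    'days_to_mid_month': (ts.dt.day - 15).abs(),
    'month':             ts.dt.month,
})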

Temporal Features

Most time series models (e.g. ARIMA, RNN, LSTM etc.) are conceptually made of two building blocks:

  • A set of temporal features that encode the full extent to which time matters (i.e. what temporal patterns the model will attempt to exploit).
  • A memoryless or tabular model that uses the temporal features above as inputs.

While some time series models construct their temporal features on the fly (e.g. LSTM, RNN), constructing temporal features in a model-agnostic fashion may yield greater flexibility and explainability.

The aim here is to turn past observations into features from which most tabular models can easily detect patterns such as short/medium/long-term trends and trend reversals, patterns that would otherwise be very difficult for any tabular model to learn directly from lagged values.

Temporal features are commonly constructed using rolling statistics (e.g. mean, min, max, standard deviation, skewness, kurtosis, quantiles, etc.) of each raw input with various window sizes.
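
A sketch with pandas, assuming a single hypothetical 'sales' series (the window sizes are illustrative):

# Sketch: rolling statistics of a raw input with several window sizes.
import pandas as pd

s = pd.Series(range(100), name='sales', dtype=float)

temporal = pd.DataFrame()
for w in (7, 30, 90):  # short / medium / long term windows
    roll = s.rolling(window=w)
    temporal[f'sales_roll_mean_{w}'] = roll.mean()
    temporal[f'sales_roll_std_{w}'] = roll.std()
    temporal[f'sales_roll_min_{w}'] = roll.min()
    temporal[f'sales_roll_max_{w}'] = roll.max()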

Feature Construction Using The kxy Package

We have implemented a method in the kxy Python package to automate the construction of candidate features using the techniques described above.

Here's a self-contained example on the UCI Bank Marketing dataset.

Code

# pip install kxy
import kxy  # registers the kxy pandas accessor (df.kxy)
# pip install kxy_datasets
from kxy_datasets.classifications import BankMarketing

# Load the UCI Bank Marketing classification dataset.
dataset = BankMarketing()
target_column = dataset.y_column
df = dataset.df

# Generate candidate features from all raw inputs, excluding the target column.
features_df = df.kxy.generate_features(
    entity=None, max_lag=None, entity_name='*', exclude=[target_column])

Raw Inputs

(Table preview of the raw columns of df.)

Features

(Table preview of the candidate features in features_df generated by the call above.)

Summary

Training a Machine Learning model is equivalent to using data to find a good enough approximation of the oracle predictor.

Every Machine Learning model is restricted in the types of patterns it may reliably learn between inputs and the target.

However, in real-life applications, the relationship between raw inputs and the target is often much more intricate than what popular models may reliably learn.

The aim of feature construction is to generate representations of raw inputs that we intuitively expect to have a much simpler relationship to the target, one that models in our toolbox can reliably learn, while remaining as insightful about the target as possible.