Feature Engineering With Game Theory: Beyond SHAP values

Understanding the difference between feature importance, feature usefulness, and feature potential using Shapley values.


There is no great predictive (a.k.a. tabular) model without great features. In our quest for great features, we might find ourselves with hundreds of candidate features.

Such a large number of candidate features comes with several challenges, not the least of which are the need to limit the number of features used to train our model (i.e. feature selection) and the need to understand how each selected feature affects model decisions (i.e. model explanation) and model performance.

Unfortunately, the ease with which machine learning engineers can generate feature importance scores nowadays has resulted in these scores being misunderstood and misused for feature selection.

In this post, we discuss three fundamental properties of a feature every machine learning engineer needs to be aware of, namely its importance, its usefulness, and its potential, and we clarify when each one may be used.

We focus on game-theoretic implementations using Shapley values to stress the similarities and differences between the three properties.

Shapley Values

Shapley values provide a framework for fairly attributing payoffs to individual players that collaborate in a game that may be played individually or in coalitions.

As is customary, let us denote by \(N\) the set of all players, by \(n\) the total number of players, and by \(v: 2^N \to \mathbb{R}\) the function such that \(v(S)\) is the payoff received by the coalition of players \(S\).

The Shapley value defines the fair amount that player \(i\) should receive as \[\phi_i(v) := \frac{1}{n}\sum_{S \subset N \backslash \{i\}} {n-1 \choose |S|}^{-1} \Delta v_i(S), \] where \(\Delta v_i(S) := v\left( S \cup \{i\} \right) - v\left( S\right)\).

One way to understand \(\phi_i(v)\) is that, if player \(i\) joins a coalition \(S\), it is only fair that she be attributed her marginal contribution \(v\left( S \cup \{i\} \right) - v\left( S\right)\).

To avoid giving players an advantage or disadvantage for entering a coalition early or late, we take the average over all possible permutations in which the coalition can be formed.
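To make this concrete, here is a minimal Python sketch that computes \(\phi_i(v)\) by brute force for an arbitrary payoff function. The helper name `shapley_value` and the toy additive game are illustrative only, and the cost is exponential in the number of players.

```python
import math
from itertools import combinations

def shapley_value(i, players, payoff):
    """Shapley value of player i under the coalition payoff function `payoff`,
    computed directly from the formula above (cost is exponential in len(players))."""
    others = [p for p in players if p != i]
    n = len(players)
    total = 0.0
    for k in range(len(others) + 1):
        for S in combinations(others, k):
            weight = 1.0 / (n * math.comb(n - 1, k))          # (1/n) * C(n-1, |S|)^{-1}
            total += weight * (payoff(set(S) | {i}) - payoff(set(S)))
    return total

# Toy additive game: each player contributes exactly its own weight.
weights = {"a": 1.0, "b": 2.0, "c": 3.0}
v = lambda S: sum(weights[p] for p in S)
print({p: shapley_value(p, list(weights), v) for p in weights})  # {'a': 1.0, 'b': 2.0, 'c': 3.0}
```

In the additive game above, each player's Shapley value is simply its own weight, as one would expect from a fair attribution.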

Shapley value is the only payoff attribution rule that satisfies the following four properties:

  1. Efficiency: The sum of individual attributions is the payoff of the coalition made of all players minus the payoff of the empty coalition: \(\sum_{i=1}^n \phi_i(v) = v(N)-v(\emptyset) \).
  2. Symmetry: If two players \(i\) and \(j\) always have the same impact in any coalition they are added to and that contains neither, then they should be attributed the same payoff. That is, if \(v(S \cup \{i\}) = v(S \cup \{j\})\) for every \(S\) such that \(i \notin S\) and \(j \notin S\), then \(\phi_i(v)=\phi_j(v)\).
  3. Linearity: Attributing payoffs for the sum of two games gives the same result as attributing payoffs for each game separately and summing the attributions: \(\forall v, w, i, ~~~~ \phi_i(v+w) = \phi_i(v) + \phi_i(w).\)
  4. Null player: A player that does not contribute when added to any coalition should not be attributed any payoff: if \(\forall S, v(S \cup \{i\}) = v(S)\) then \(\phi_i(v)=0\).

Shapley Values For Predictive Modeling

Shapley values provide a flexible framework for understanding the marginal contribution of a feature when building a predictive model.

Features are essentially players that collaborate in a game related to predictive modeling. Using multiple features in a model is tantamount to players forming a coalition to play the game. The specific rules of the game and the associated payoff function \(v\) will vary depending on what it is that we want to attribute back to a specific feature.

To avoid any ambiguity between features and samples, going forward we will use \(D\) instead of \(N\) to refer to the set of all candidate features and \(d\) instead of \(n\) to refer to its size.

Properties of A Feature

I. Feature Importance

A feature importance score is any score that quantifies the extent to which a specific feature drives decisions of a specific model.

A popular example is given by Mean Absolute SHAP (SHapley Additive exPlanations) values.

For a model using all features \(x\) to make a decision \(f(x)\), SHAP values quantify the contribution of each feature in generating the specific model decision.

For regression problems, \(f(x)\) is the model's prediction, whereas for classification problems it is typically the estimated predictive probability of an outcome.

SHAP attributes to the features coalition \(S\) represented by the feature vector \(x_S\) the payoff \[v_{SHAP}(S) := E(f(x) \vert x_S ) = \int f(x) d\mathbb{P}_{x_{-S}},\] where \(x_{-S}\) represents the vector of features not in the coalition.

In plain English, SHAP's payoff is the average model decision made when features in the coalition take specific values and other features vary as per the data generating distribution.

Because \(v_{SHAP}(D) = f(x)\) and \(v_{SHAP}(\emptyset) = E(f(x))\), the efficiency property above yields the following decomposition of model decisions \[f(x) = E(f(x)) + \sum_{i=1}^d \phi_i(v_{SHAP}).\]Thus, SHAP allows us to quantify the (possibly negative) contribution of a specific feature towards the deviation of the model decision from its average value \(E(f(x))\).

The absolute value \(\vert \phi_i(v_{SHAP}) \vert \) reflects the sensitivity of the specific model decision to the value of the i-th feature in \(x\). Thus, its average, namely the Mean Absolute SHAP value \( \int \vert \phi_i(v_{SHAP}) \vert d\mathbb{P}_{x}\), quantifies the extent to which feature \(i\) drives or affects model decisions overall.
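As an illustration, here is a minimal sketch, using the open-source shap package with a placeholder dataset and model, of computing per-decision SHAP values and the resulting Mean Absolute SHAP importance scores:

```python
import numpy as np
import shap                                          # pip install shap
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Placeholder data and model; any tree ensemble works with TreeExplainer.
X, y = make_regression(n_samples=500, n_features=5, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Per-decision attributions phi_i(v_SHAP): one row per sample, one column per feature.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Efficiency check: each decision should equal the average decision plus its attributions.
print(np.allclose(model.predict(X), explainer.expected_value + shap_values.sum(axis=1)))

# Mean Absolute SHAP values: how strongly each feature drives model decisions overall.
importance = np.abs(shap_values).mean(axis=0)
print(importance)
```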

At this point, it is important to stress that, so far, we have not mentioned or considered true labels, let alone the performance of the model!

Feature importance scores, SHAP values or otherwise, simply cannot inform us whether a model performs well or what features drive model performance. Feature importance scores are great for model explanation, but should not be used beyond that.

II. Feature Usefulness

A feature usefulness score is any score that quantifies the extent to which a specific feature drives the performance of a specific model.

One such score can be obtained by using the Shapley value \(\phi_i(v_{USEFUL})\) with the negative risk as payoff function, namely \[v_{USEFUL}(S) := -E\left[L(v_{SHAP}(S), y)\right],\] where \(L\) is a loss function reflecting model performance and \(y\) is the true label associated to \(x\). Alternatively, we may use the (cross-validated) negative empirical risk as a proxy for the negative risk.

You can think of \(v_{USEFUL}(S)\) as the performance (negative loss) that we would obtain when only using features in the coalition \(S\).

Note that \(v_{USEFUL}(\emptyset) := P_\emptyset\) is the performance of the baseline strategy consisting of always making the same decision (i.e. \(E(f(x))\) ) regardless of feature values. Similarly, \(v_{USEFUL}(D) := P_D\) is the performance of the model when using all features.
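As a rough sketch, and accepting the exponential cost of retraining a model for every coalition, Shapley usefulness scores can be computed by plugging a cross-validated negative empirical risk into the shapley_value helper sketched earlier; the model, loss, and dataset below are placeholders.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=4, random_state=0)
features = list(range(X.shape[1]))

def v_useful(S):
    """Cross-validated negative empirical risk when only the features in S are used.
    The empty coalition falls back to always predicting the mean of y."""
    if not S:
        return -np.mean((y - y.mean()) ** 2)
    scores = cross_val_score(GradientBoostingRegressor(random_state=0),
                             X[:, sorted(S)], y,
                             scoring="neg_mean_squared_error", cv=5)
    return scores.mean()

# Shapley usefulness of each feature; shapley_value is the helper sketched earlier.
usefulness = {i: shapley_value(i, features, v_useful) for i in features}
print(usefulness)
```

Because the payoff requires retraining on every subset of features, this brute-force approach is only viable for a handful of candidates; beyond that, approximate Shapley estimators are needed.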

The efficiency property of Shapley values yields the following performance decomposition: \[P_D = P_\emptyset + \sum_{i=1}^d \phi_i(v_{USEFUL}).\]This equation can be very useful for feature selection.

Features with small Shapley usefulness scores \(\phi_i(v_{USEFUL})\) are good candidates to be removed. In particular, it is worth stressing that \(\phi_i(v_{USEFUL})\) may very well be negative!

This would occur if a feature decreases model performance when added to a coalition, which could indicate that said feature causes the model to overfit.

Feature usefulness is a model-specific notion; a feature can be useful to one model but not to another. A direct consequence is that using one model to select features that will then be used in another model can be perilous.

III. Feature Potential

The working assumption when training a predictive model is that there is a sufficient amount of predictive power or 'juice' in our set of candidate features to predict the target.

Oftentimes this is true, and yet some features only contribute marginally to the total amount of juice in the set of candidate features. In such cases, it is important to be able to detect and remove those features with low 'potential'.

Formally, a feature potential or feature potential usefulness score is any score that quantifies the contribution of a specific feature to the highest performance achievable using a set of candidate features.

One such score can be obtained by using the Shapley value \(\phi_i(v_{POTENTIAL})\) with payoff function the supremum of the negative risk over all possible models, namely \[v_{POTENTIAL}(S) := \sup_{f \in \mathcal{F}} -E\left[L(v_{SHAP}(S), y)\right],\] where \(\mathcal{F}\) is a flexible enough function space and \(v_{SHAP}(S) = E(f(x)\vert x_S)\).

\(\sup_{f \in \mathcal{F}} -E\left[L(v_{SHAP}(S), y)\right]\) represents the highest performance achievable using features in the coalition \(S\) to predict the target \(y\).

It was shown in this paper that, for most performance metrics, the highest performance achievable using \(x_S\) to predict \(y\) is an increasing function of the mutual information \(I(y; x_S)\). As a result, we may also use as a proxy the payoff function \[v_{MI}(S) := I(y; x_S).\]

Note that \(v_{MI}(\emptyset) = 0\) and \(v_{MI}(D) = I(y; x)\). This implies that the efficiency property of Shapley values yields the following mutual information decomposition: \[I(y; x) = \sum_{i=1}^d \phi_i(v_{MI}).\]
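For intuition, here is a minimal sketch of this decomposition using a naive plug-in estimate of \(I(y; x_S)\), which only makes sense for discrete features and a discrete target (continuous features would call for a dedicated mutual information estimator); everything below is illustrative.

```python
import numpy as np
from collections import Counter

def empirical_mi(y, X_S):
    """Naive plug-in estimate of I(y; x_S) in nats, for a discrete target y and
    discrete feature columns X_S (shape: n_samples x |S|)."""
    n = len(y)
    rows = [tuple(r) for r in X_S]
    p_xy, p_x, p_y = Counter(zip(rows, y)), Counter(rows), Counter(y)
    return sum(c / n * np.log(c * n / (p_x[x] * p_y[t]))
               for (x, t), c in p_xy.items())

def v_mi(S, X, y):
    """Mutual information payoff I(y; x_S); the empty coalition carries no information."""
    return 0.0 if not S else empirical_mi(y, X[:, sorted(S)])

# Toy example: the target only depends on features 0 and 1, so feature 2 has no potential.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(5000, 3))
y = (X[:, 0] + X[:, 1]) % 3
potential = {i: shapley_value(i, [0, 1, 2], lambda S: v_mi(S, X, y)) for i in [0, 1, 2]}
print(potential)  # feature 2's potential should be close to zero
```

Note that naive plug-in estimates of mutual information are biased upward on finite samples, so small positive potentials should be interpreted with care.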

When a feature has a high potential, we can conclude that there exists a model to which the feature can be useful for predicting the target. On the other hand, when the feature's potential is null, we can safely conclude that there does not exist any model to which the feature can be useful for predicting the target, either by itself or in conjunction with other features in the set of candidates.

Feature potential is set-specific. A feature might very well have no potential relative to a set but have some potential relative to another set.

Indeed, by noting that \(I(y; x_S, x_i)-I(y; x_S) = I(y; x_i \vert x_S),\) we may rewrite \(\phi_i(v_{MI})\) as \[\phi_i(v_{MI}) = \frac{1}{d}\sum_{S \subset D \backslash \{i\}} {d-1 \choose |S|}^{-1} I(y; x_i \vert x_S).\]

Thus, for a feature to have potential, the feature and the target need to be statistically dependent, either unconditionally (when \(S=\emptyset\)) or conditionally on some other features in the set of candidate features.

If a feature \(i\) has no potential in a set of candidate features \(D\), we can only conclude that it is statistically independent from the target, both unconditionally and conditionally on any subset of features in \(D\). We cannot, however, rule out the possibility that there exists another set of candidate features \(C\) containing features conditional on which feature \(i\) and the target are statistically dependent!

Conclusion

Feature importance scores are good at one thing and one thing only: explaining decisions made by a specific model. For feature selection, feature usefulness and feature potential are more relevant properties to consider.

Of feature importance, feature usefulness, and feature potential, feature potential is the only property that allows us to draw conclusions that are not model-specific. In particular, the fact that a feature is important or useful to a Random Forest or a Gradient Boosted Tree does not mean it will be important or useful to a linear model, a kernel-based method, or any other model!

Feature usefulness and feature potential are the only two properties that allow us to draw conclusions related to model performance. In particular, feature importance scores should not be used for feature selection.

Features should not be selected for their importance but for their (potential) usefulness.