Today we announce the launch of our **data valuation** product, the first (and only) data valuation API. In this post, we distill what **data valuation** is, and why it should be the very first step in any (predictive) machine learning pipeline.

## Announcing The World's First Data Valuation API

1-in-4 commercial machine learning projects fail, and 9-in-10 trained machine learning models are not good enough to make it to production.

Today, we are launching the world's first data valuation product to increase the success rate of machine learning projects and avoid preventable waste of resources.

In a single function call, machine learning engineers can now estimate the highest performance they may achieve in a machine learning project, prior to and without training any (predictive) machine learning model. The API doesn't require machine learning engineers to do any feature engineering, whether their data are continuous, categorical or a mix of the two. It just works!

This is the first step on our mission to make machine learning *lean*.

Here is how simple it is in Python.

```python
# 0. As a one-off, run 'pip install kxy', then 'kxy configure'
import kxy
import pandas as pd
# 1. Load your data
df = pd.read_csv('s3://sagemaker-sample-files/datasets/tabular/synthetic/churn.txt')
# 2. Run data valuation, simple!
df.kxy.data_valuation('Churn?', problem_type='classification')
```

| Achievable Accuracy | Achievable Log-Likelihood Per Sample | Achievable R-Squared |
|---|---|---|
| 0.86 | -3.96e-01 | 0.45 |

We charge a flat fee per API request, no matter how big your dataset is, and the platform is totally free for academic use. You can sign up here and retrieve your API key here.

Read on to learn more about data valuation.

## What Is Data Valuation?

It has been widely reported that data has overtaken oil as the world's most valuable asset. Yet, this analogy is somewhat misleading: two barrels of oil are hardly different, but not all data are created equal!

To understand data valuation, we need to construct a more accurate analogy by comparing the answers to four key questions: **what** is the valuable, **where** is the valuable extracted, **how** is the valuable extracted, and **how much** valuable is there to extract.

### What is the valuable?

Whereas the objective in oil production is to extract oil from the ground, in a (predictive) machine learning project, we want to reliably predict a business outcome from a set of so-called 'explanatory variables'.

For instance, a mobile operator might be interested in predicting whether a customer will ultimately end up churning, using attributes it already knows about the customer (e.g. area code, length of service, number of interactions with customer service, billing and usage attributes, and more). The mobile operator would have seen customers churn in the past, and the hope here is to find patterns in customer attributes that are shared by customers who churn and that differentiate them from customers who don't.

A real estate marketplace might be interested in predicting the price a property will sell for using publicly available attributes about the neighborhood (e.g. crime statistics, local school statistics and rankings etc.) and property-specific attributes (e.g. property type, year built, number of bedrooms, number of bathrooms, livable area etc.). The marketplace would have records of previous transactions with prices and property characteristics, and the hope here is to provide the buyer and the seller with an estimate of the 'fair' price of the property to ease negotiations and, ultimately, increase transaction volumes.

In statistical terms, predicting a business outcome from explanatory variables is equivalent to learning the true conditional distribution \(\mathbb{P}_{y|\mathbf{x}}\) of the business outcome \(y\) given the associated explanatory variables \(\mathbf{x}\). Occasionally, learning the mapping between explanatory variables and the best prediction (in the mean square sense), namely \(\mathbf{x} \to E\left(y \vert \mathbf{x} \right) \), which is a property of the true predictive distribution \(\mathbb{P}_{y|\mathbf{x}}\), might be good enough.
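To make the mapping \(\mathbf{x} \to E\left(y \vert \mathbf{x} \right)\) concrete, here is a minimal, self-contained sketch (ours, not part of the kxy API) on a toy data-generating process where the true conditional mean is known: a simple binned average recovers \(E\left(y \vert \mathbf{x} \right)\) from samples alone.

```python
import numpy as np

# Toy illustration: the data-generating process is y = sin(2*pi*x) + noise,
# so the best mean-square prediction E(y | x) is sin(2*pi*x).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 100_000)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, x.size)

# Estimate E(y | x) non-parametrically with a binned average.
bins = np.linspace(0.0, 1.0, 51)
idx = np.digitize(x, bins) - 1
cond_mean = np.array([y[idx == k].mean() for k in range(50)])
centers = (bins[:-1] + bins[1:]) / 2

# The empirical conditional mean tracks the true one closely.
max_err = np.abs(cond_mean - np.sin(2 * np.pi * centers)).max()
```

Any predictive model, however sophisticated, is ultimately trying to recover this same mapping (or the full conditional distribution) from data.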

### Where do we extract the valuable?

In an oil production project, oil is extracted on the site where wells are drilled. The equivalent in a machine learning project are the explanatory variables from which we derive insights about the business outcome we want to predict.

Hence, data is more akin to an oil production site than to oil itself.

### How do we extract the valuable?

Projects, processes, machines and other tools collectively define how oil is pumped from the ground. The machine learning equivalents are the families of models (e.g. neural networks, tree-based models, parametric and non-parametric Bayesian models etc.) and the algorithms used to explore these families of models to approximate the true predictive distribution \(\mathbb{P}_{y|\mathbf{x}}\), or properties thereof such as its mean or standard deviation.

### How much valuable is there to extract?

Clearly, if there is no oil in the ground on the production site, no matter what tools, machines, or techniques are used in the project, no oil will be extracted. The machine learning phrase '*garbage in, garbage out*' captures a similar idea, but the analogy stops there.

While an entire subfield of physics is devoted to estimating oil reserves, until recently, little was known in the machine learning literature about quantifying the intrinsic value a set of explanatory variables could bring when used to predict a specific business outcome; i.e. how to 'weigh' the valuable \(\mathbb{P}_{y|\mathbf{x}}\) without first retrieving it.

The framework to answer the foregoing question turns out to be fairly intuitive. The same way the amount of oil in the ground can be defined as the largest amount of oil that may be extracted from the ground (no matter the tool used), the value of a dataset for predicting a specific business outcome is the highest performance that may be achieved by any (predictive) machine learning model, reliably and without overfitting. It also happens to be the performance we would achieve if we knew the true predictive distribution \(\mathbb{P}_{y|\mathbf{x}}\).

The same way building a map of oil reserves does not require extracting all the oil from the ground and weighing it, estimating the highest performance achievable in a (predictive) machine learning project should not require first learning the oracle (perfect) predictive model \(\mathbb{P}_{y|\mathbf{x}}\) and then evaluating its performance.
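A toy example makes the 'oracle performance' idea tangible. In the sketch below (assumptions ours, unrelated to the kxy product), the true predictive distribution is known by construction, so we can simulate the oracle classifier and confirm that its accuracy matches the theoretical ceiling.

```python
import numpy as np

# Toy setup: x ~ Uniform(0, 1) and, by construction, P(y = 1 | x) = x.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 200_000)
y = (rng.uniform(0.0, 1.0, x.size) < x).astype(int)

# The oracle classifier predicts 1 whenever P(y = 1 | x) > 1/2.
oracle_acc = ((x > 0.5).astype(int) == y).mean()

# Its expected accuracy, E[max(x, 1 - x)] = 3/4, is this dataset's
# 'value' for the accuracy metric: no model can reliably beat it.
```

Data valuation estimates this ceiling directly from the data, without ever constructing the oracle.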

*To sum up, data valuation is the estimation of the highest performance that may be achieved in a (predictive) machine learning experiment, prior to and without learning any predictive model.*

## Why Data Valuation?

To grasp what data valuation can do for machine learning engineers, and why it is a big deal, it might be useful to revisit the oil analogy.

Look at the global map of oil reserves above. Now, imagine a world in which every country has the same number of oil wells per square mile. Big oil conglomerates might still be able to generate profits in this world if they build enough wells, but their capital would be very poorly allocated, and production costs would be considerably higher.

As an illustration, Africa is 33 times bigger than Venezuela, yet Africa only has 41% of Venezuela's oil reserves. Thus, a company wanting to dig wells on the African continent and in Venezuela without using oil reserve estimates would spend, on average, 80 times more for the same amount of oil extracted than if it dug wells proportionally to oil reserves.

The price of gas and petroleum products would need to increase to reflect higher production costs, and this would in turn have cascading effects in virtually every aspect of our lives. The world as we know it today would be very different.

Machine learning without data valuation is the equivalent of that imaginary world where the only way to know if there is oil in the ground is to build a fully-functional well, try to pump oil out, and see whether any comes out! If you are lucky, you will find some oil, but more often than not, you won't, no matter how many resources you spend digging.

If an organization has a relatively small machine learning budget, then by training predictive models without first valuing its data, it would be taking an excessive business risk. An organization wouldn't allocate resources to a new project without first estimating its potential impact on its bottom line. Why should it treat a machine learning project differently?

Just because an organization has terabytes of data does not mean it is sitting on a gold mine. In fact, it is estimated that 9-in-10 trained machine learning models are not good enough to make it to production, and 1-in-4 machine learning projects ultimately fail. With data valuation, every organization can now accurately estimate the risk in investing in machine learning capabilities prior to deploying any resources.

Even if an organization is lucky enough to have a large machine learning budget, it could drastically slash its costs and runtime by systematically valuing its data first, and conditioning the training of predictive models on whether a high enough performance may be achieved.
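The gating logic described above can be sketched in a few lines (the function name and threshold below are ours, purely illustrative, not part of the kxy API):

```python
def should_train(achievable_r2: float, threshold: float = 0.2) -> bool:
    # Commit training resources only if data valuation suggests the
    # dataset carries enough signal to clear a business-driven bar.
    return achievable_r2 >= threshold

# The churn example earlier reported an achievable R-squared of 0.45,
# comfortably above a hypothetical 0.2 bar, so training would proceed.
decision = should_train(0.45)
```

The interesting design question is where to set the bar: it should come from the business case (e.g. the accuracy at which a churn model pays for itself), not from the data.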

## What Makes Data Valuation Work?

In this section we propose an intuitive and high-level explanation of what makes data valuation work. For an in-depth technical explanation, see this paper we wrote.

To understand why it is possible to estimate the highest performance achievable in a (predictive) machine learning project without learning any predictive model, and to get a sense for what is necessary to make this work, it is helpful to compare data valuation to how oil companies discover oil at sea without drilling.

Oil exploration at sea requires two key ingredients: a **scientific principle** and the **technology** that implements it.

The **scientific principle** relates something we *can directly measure* to what we *cannot directly measure* but we need to quantify nonetheless, namely the amount of oil that may be extracted.

Simply put, offshore oil surveying relies on the principle that, when certain types of waves (seismic waves) are emitted in the sea, the way they bounce back is indicative of the type of rock or soil layers they hit. Because the laws of physics allow us to differentiate how seismic waves bounce back when they hit an oil layer from how they bounce back when they hit other layers, it is possible, in theory, to find and map out oil layers at sea.

In practice, to put this principle to use, oil companies need the **technology** to emit seismic waves at sea in a controlled manner, and to observe how they bounce back.

Data valuation works much like oil survey at sea.

Effective data valuation requires a **scientific principle** that relates a quantity we can estimate from the data far more easily than we can find the best predictive model, to the highest performance that may be achieved by any (predictive) machine learning model.

As it turns out, whether the performance of interest is the Root-Mean-Square-Error (RMSE), the \(R^2\), or even the classification accuracy, the highest performance achievable when using the explanatory variables \(\mathbf{x}\) to predict the business outcome \(y\) can be expressed solely as a function of the mutual information \(I(y, \mathbf{x})\) and, occasionally, a measure of the variability of \(y\), such as its standard deviation (in the case of the RMSE) or its Shannon entropy (in the case of the classification accuracy).

As an illustration, the highest \(R^2\) achievable when using \(\mathbf{x}\) to predict \(y\) is \(\bar{R}^2 = 1 - e^{-2I(y, \mathbf{x})}\), and this is true whether the explanatory variables \(\mathbf{x}\) are continuous, categorical, or a mix of the two!
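This formula can be sanity-checked in the one setting where the mutual information is known in closed form, the bivariate Gaussian (the check below is ours, not from the paper): with correlation \(\rho\), \(I(y, \mathbf{x}) = -\frac{1}{2}\log(1-\rho^2)\), so the formula predicts an achievable \(R^2\) of exactly \(\rho^2\), which is indeed what the oracle predictor attains.

```python
import numpy as np

# Closed-form check: bivariate Gaussian with correlation rho.
rho = 0.7
mi = -0.5 * np.log(1.0 - rho**2)        # I(y, x) in nats
best_r2 = 1.0 - np.exp(-2.0 * mi)       # formula gives rho**2 = 0.49

# Monte Carlo check: the oracle predictor E(y | x) = rho * x attains it.
rng = np.random.default_rng(2)
x = rng.normal(size=500_000)
y = rho * x + np.sqrt(1.0 - rho**2) * rng.normal(size=x.size)
r2 = 1.0 - np.mean((y - rho * x) ** 2) / np.var(y)
```

No trained model was needed to know that 0.49 is the ceiling; the mutual information alone determines it.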

We wrote the first paper establishing these relations for a variety of performance metrics. You can find a copy here.

Additionally, effective data valuation requires an accurate and data-efficient way of estimating any mutual information from data (i.e. the **technology**), *without* first learning the true joint or predictive distribution.

If you struggle to see how we could reliably estimate a mutual information without first estimating the true predictive or joint distribution, then consider this. Although the mutual information fully characterizes the value in the data, it does not fully characterize the data itself; the joint distribution does! For instance, the mutual information does not depend on marginal distributions; the joint distribution does. Any feature transformation of explanatory variables that can be undone (i.e. from which we can recover the original explanatory variables) leaves the mutual information unchanged, but it changes the joint distribution! A direct consequence is that a good mutual information estimator should neither be sensitive to, nor require, feature engineering.
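The invariance property is easy to demonstrate with a simple rank-based estimator (the estimator below is our own illustrative sketch, not the method kxy uses): because it only looks at ranks, its output is exactly unchanged by any strictly increasing transformation of a variable, even though such a transformation completely changes the joint distribution.

```python
import numpy as np

def mi_rank_binned(u, v, bins=20):
    """Crude mutual information estimate (in nats) from a joint
    histogram over equal-mass, rank-based bins of u and v."""
    def to_bins(w):
        ranks = w.argsort().argsort()   # 0 .. n-1
        return ranks * bins // w.size   # equal-mass bins, 0 .. bins-1
    joint = np.zeros((bins, bins))
    np.add.at(joint, (to_bins(u), to_bins(v)), 1.0)
    p = joint / joint.sum()
    px = p.sum(axis=1, keepdims=True)
    py = p.sum(axis=0, keepdims=True)
    nz = p > 0
    return float((p[nz] * np.log(p[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(3)
x = rng.normal(size=100_000)
y = 0.7 * x + rng.normal(size=x.size)

mi_raw = mi_rank_binned(x, y)
mi_warped = mi_rank_binned(np.exp(x), y)  # invertible transform of x
# mi_raw == mi_warped: the joint distribution changed under exp(x),
# the mutual information estimate did not.
```

This is exactly why a dataset can be valued without feature engineering: no invertible preprocessing of the explanatory variables can change the answer.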

Put bluntly, there are considerably more ways learning a joint or predictive distribution can go wrong than ways estimating a mutual information can go wrong. This is great news for data valuation: it means we can value a dataset *much more easily, quickly, and cheaply* than we can find the best predictive model, and therefore that machine learning engineers can slash costs and save time by always valuing a dataset before fitting predictive models.

If you want to dig deep into mutual information estimation, check out our AISTATS 2021 paper.