Autoencoders are essential tools for compression and generative modeling in any modern Machine Learning toolbox.

Unfortunately, their success as an unsupervised pre-training step in vision problems has led some Machine Learning engineers to wrongly consider autoencoders great tools for feature engineering or dimensionality reduction.

In this article, we explain why using autoencoders (AEs) for feature engineering on tabular data can be problematic.

First, let's briefly recall what AEs are.

What Are AEs?

An autoencoder is a module consisting of an encoder and a decoder typically used for compression or generative modeling.

Conceptually, the encoder takes an input vector $x \in \mathcal{X}$ and generates a lower-dimensional feature or 'code' vector $c \in \mathcal{C}$: $x \to c$. The decoder operates on codes. It uses the code $c \in \mathcal{C}$ to generate a reconstructed version $x^\prime \in \mathcal{X}$ of the original input vector $x$: $c \to x^\prime$.

The quality of an autoencoder is gauged by a reconstruction error $\mathcal{E}$ that quantifies how close $x$ is to its reconstruction $x^{\prime}$.

Achieving a low reconstruction error despite the code space $\mathcal{C}$ being lower-dimensional than the input space $\mathcal{X}$ is a sign that the encoder distills original inputs into compact yet informative representations.

There are two main types of autoencoders: Traditional AEs and Variational AEs.

Traditional Autoencoders

Traditional AEs (TAEs) define the code and reconstructed input as $c = f_\theta(x)$ and $x^{\prime} = g_\theta(c)$, for some functions $f_\theta, g_\theta$.

They use as reconstruction error \begin{align}\mathcal{E} :&= E\left( ||x-x^\prime||^2 \right) \\ &= E\left( ||x-g_\theta \circ f_\theta(x)||^2 \right),\end{align} or a regularized flavor thereof, $\mathcal{E} = E\left( ||x-g_\theta \circ f_\theta(x)||^2 \right) + \lambda \mathcal{R}(\theta),$ for some $\lambda \geq 0$.
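To make this concrete, here is a minimal sketch (not from the article) of a linear TAE on toy data: when $f_\theta$ and $g_\theta$ are linear and the loss is squared error, the optimal encoder/decoder pair is given by PCA, so we can fit it in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 500 points in R^5 that mostly live on a 2-D subspace.
z = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 5))
x = z @ mixing + 0.05 * rng.normal(size=(500, 5))
x = x - x.mean(axis=0)  # center, so the linear maps need no bias

# For a linear TAE under squared-error loss, the optimal f_theta and
# g_theta are given by the top principal components.
_, _, vt = np.linalg.svd(x, full_matrices=False)
encode = lambda v: v @ vt[:2].T   # f_theta: R^5 -> R^2 (the code c)
decode = lambda c: c @ vt[:2]     # g_theta: R^2 -> R^5 (reconstruction x')

x_prime = decode(encode(x))

# Reconstruction error E(||x - x'||^2), as a share of total variance.
rel_err = np.sum((x - x_prime) ** 2) / np.sum(x ** 2)
print(f"relative reconstruction error: {rel_err:.4f}")
```

In a real TAE, $f_\theta$ and $g_\theta$ would be nonlinear networks trained by gradient descent; PCA is just the closed-form linear special case.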

Variational Autoencoders

While the mathematical artillery required to construct Variational Autoencoders (VAEs) is considerably more complex, they are very similar to TAEs in spirit.

AIM

Like TAEs, VAEs aim at learning a mapping $x \to c \to x^\prime ,$ where $x \in \mathcal{X}$ is the original input vector, $c \in \mathcal{C}$ is the code, and $x^\prime \in \mathcal{X}$ is a reconstruction of the original input.

However, TAEs do not guarantee that the encoder will map the input space onto the entire code space: not every element of $\mathcal{C}$ need be a valid code of an element of $\mathcal{X}$.

As a result, if we simply apply a TAE's decoder to an arbitrary element of $\mathcal{C}$, we may not generate an output that resembles the type of inputs the TAE's encoder is meant to encode. For that, we need to apply the decoder only to those elements of $\mathcal{C}$ that were obtained by encoding an element of $\mathcal{X}$.

However, because it is very hard to characterize this subset of $\mathcal{C}$, it is virtually impossible to use a TAE's decoder to generate a draw from the distribution that its encoder encodes.

VAEs address this limitation. Specifically, VAEs require that the encoder always maps the distribution of original inputs $\mathbb{P}_{x}$ on $\mathcal{X}$ to the same code distribution $\mathbb{P}_{c}$ on $\mathcal{C}$. $\mathbb{P}_{c}$ is typically chosen to be a multivariate standard normal, $\mathbb{P}_c = \mathcal{N}(0, I)$. This ensures not only that every element of $\mathcal{C}$ is a valid code (i.e. there exists an element of $\mathcal{X}$ of which it is the code), but also that codes are cryptic in the sense that we cannot say anything about what distribution $\mathbb{P}_{x}$ has been encoded simply from observing $\mathbb{P}_c$.

FEASIBILITY (COMPRESSION-FREE)

At this point, you might be wondering whether, for any $\mathbb{P}_x$, we can always find a transformation that maps $\mathbb{P}_x$ to a multivariate standard normal $\mathbb{P}_c = \mathcal{N}(0, I)$.

The answer is yes, and the transformation is available analytically! The ideal encoder maps any $x \in \mathcal{X}$ to the code $c = F_c^{-1} \circ F_x(x)$, and the ideal decoder maps any $c \in \mathcal{C}$ back to the original input as $x^\prime = F_x^{-1} \circ F_c(c) = x$, where $F_c$ (resp. $F_x$) is the vector-valued function whose i-th coordinate function is the CDF of the standard normal (resp. the CDF of the i-th coordinate distribution of $\mathbb{P}_x$). Strictly speaking, this coordinate-wise construction maps each marginal of $\mathbb{P}_x$ to a standard normal; for the joint code distribution to be exactly $\mathcal{N}(0, I)$, the transformed coordinates must also be independent.
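As a 1-D sketch of this transformation (the exponential input distribution below is an arbitrary illustrative choice, not from the article), we can implement the ideal encoder and decoder directly from the CDFs:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)
phi = NormalDist()  # standard normal: .cdf is F_c, .inv_cdf is F_c^{-1}

# A decidedly non-Gaussian input distribution: exponential with scale 2,
# whose CDF is F_x(x) = 1 - exp(-x / 2).
x = rng.exponential(scale=2.0, size=10_000)
u = (1.0 - np.exp(-x / 2.0)).clip(1e-12, 1.0 - 1e-12)

# Ideal encoder: c = F_c^{-1}(F_x(x)); the codes are standard normal.
c = np.array([phi.inv_cdf(p) for p in u])

# Ideal decoder: x' = F_x^{-1}(F_c(c)) = -2 log(1 - F_c(c)) recovers x.
x_prime = -2.0 * np.log(1.0 - np.array([phi.cdf(v) for v in c]))

print(f"code mean {c.mean():.3f}, code std {c.std():.3f}, "
      f"max |x - x'| {np.abs(x - x_prime).max():.2e}")
```

The codes pass as draws from $\mathcal{N}(0, 1)$, and the decoder reconstructs every input essentially exactly.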

Note that, under this ideal transformation, the conditional distributions $\mathbb{P}_{x \vert c}=\mathbb{P}_{x^\prime \vert c}$ and $\mathbb{P}_{c \vert x} = \mathbb{P}_{c \vert x^\prime}$ are point-masses. In other words, if we know the original input and the ideal encoder, then we know its code, and if we know a code and the ideal decoder, then we know what input it is the code of.

TRACTABILITY

In practice, however, it is extremely hard to reliably learn $F_x$ directly. Imagine having to learn the CDFs of pixels of images of dogs!

VAEs deal with this difficulty by acknowledging that we may never learn the ideal encoder/decoder, and leverage Bayesian statistics to cope with this uncertainty.

In an ideal world, given a code $c$ that our encoder generated, there is only one choice for what original input $x$ it is the code of. But because our encoder may not be ideal, VAEs assume that the conditional distribution of $x$ given $c$ is not a point mass but has a variance $\sigma^2$ that accounts for the encoder's imperfection.

Specifically, VAEs assume that the code generated by the encoder indeed has distribution $\mathbb{E}_{c} = \mathbb{P}_{c} = \mathcal{N}(0, I)$, but that the conditional distribution of $x$ given $c$ implied by the encoder reads $\mathbb{E}_{x \vert c} = \mathcal{N}\left(f_\theta(c), \sigma^2 I \right).$

The ideal encoder corresponds to the case $f_\theta(c) = F_x^{-1}\circ F_c(c)$ and $\sigma = 0$. In general, however, $\sigma > 0$ to reflect the fact that the encoder is not ideal. In such a case, the encoder is fuzzy in the sense that the code associated to input $x$ is sampled from the distribution $\mathbb{E}_{c \vert x}$ and is not unique.

Similarly, because our decoder might not be ideal, the conditional distribution of $c$ given $x^\prime$ it implies is not a point mass. Instead, VAEs assume that the conditional distribution of $c$ given $x^\prime$ implied by the decoder reads $\mathbb{D}_{c \vert x^\prime} = \mathcal{N}\left(\mu_\theta(x^\prime), \sigma_\theta^2(x^\prime) I \right),$ and that the distribution of reconstructed inputs is the same as that of original inputs: $\mathbb{D}_{x^\prime} = \mathbb{P}_{x}.$

The ideal decoder corresponds to the case $\mu_\theta(x^\prime) = F_c^{-1}\circ F_x(x^\prime)$ and $\sigma_\theta(x^\prime) = 0$. In general, however, $\sigma_\theta(x^\prime) > 0$ to reflect the fact that the decoder is not ideal. In such a case, the decoder is fuzzy in the sense that the reconstructed input associated to code $c$ is sampled from the distribution $\mathbb{D}_{x^\prime \vert c}$ and is not unique.

RECONSTRUCTION ERROR

Using the product rule of probability, we may write the joint distributions governing the decoder and the encoder as $\mathbb{D}_{(x^\prime, c)} = \mathbb{D}_{x^\prime} \mathbb{D}_{(c \vert x^\prime)} = \mathbb{P}_{x} \mathbb{D}_{(c \vert x^\prime)},$ and $\mathbb{E}_{(x, c)} = \mathbb{E}_{c} \mathbb{E}_{(x \vert c)} = \mathbb{P}_{c} \mathbb{E}_{(x \vert c)}.$

Note that the encoder and the decoder are both ideal if and only if $\mathbb{D}_{(x^\prime, c)}=\mathbb{E}_{(x, c)}$.

In effect, when both the encoder and the decoder are ideal the two distributions are the same, as previously discussed.

Conversely, when the two distributions are the same, the encoder maps $\mathbb{E}_x$, which is equal to $\mathbb{P}_x$ (because the two joint distributions are the same), to $\mathbb{E}_c:=\mathbb{P}_c$, and therefore is ideal.

Similarly, when the two distributions are the same, the decoder maps $\mathbb{D}_c$, which is equal to $\mathbb{P}_c=\mathcal{N}(0, I)$ (because the two joint distributions are the same), to $\mathbb{D}_{x^\prime}:=\mathbb{P}_x$, and therefore is ideal.

Thus, a natural reconstruction error is the KL-divergence between $\mathbb{D}_{(x^\prime, c)}$ and $\mathbb{E}_{(x, c)}$ which quantifies how different the decoder and encoder joint distributions are:$\mathcal{E} := KL \left( \mathbb{D}_{(x^\prime, c)} || \mathbb{E}_{(x, c)}\right).$

To arrive at the form of the objective function commonly used in the VAE literature, we may use Bayes' theorem and get \begin{align}\mathcal{E} =& E_{\mathbb{P}_x}\left( \log \mathbb{P}_x \right) - E_{\mathbb{P}_x}\left( \log \mathbb{E}_x \right) \\ & +E_{\mathbb{P}_x}\left[KL \left( \mathbb{D}_{c \vert x^\prime} || \mathbb{E}_{c \vert x}\right)\right], \end{align} where we have used the fact that $\mathbb{D}_{x^\prime} := \mathbb{P}_x$.

Additionally, we may drop the term $E_{\mathbb{P}_x}\left( \log \mathbb{P}_x \right)$ because it does not depend on any model parameter, and arrive at the usual formulation of VAEs' optimization problem \begin{align}\min_{f_\theta, \sigma, \mu_\theta, \sigma_\theta} ~&E_{\mathbb{P}_x}\left[ KL \left( \mathbb{D}_{c \vert x^\prime} || \mathbb{E}_{c \vert x}\right)\right] \\ &-E_{\mathbb{P}_x}\left( \log \mathbb{E}_x \right).\end{align}
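In implementations, the two conditionals in the KL term are typically diagonal Gaussians, in which case the term has a closed form. Here is a small sketch (a hypothetical helper, not tied to any particular VAE library) of that closed-form KL:

```python
import numpy as np

def kl_diag_gaussians(mu1, var1, mu2, var2):
    """KL( N(mu1, diag(var1)) || N(mu2, diag(var2)) ), in nats."""
    mu1, var1 = np.asarray(mu1, float), np.asarray(var1, float)
    mu2, var2 = np.asarray(mu2, float), np.asarray(var2, float)
    return 0.5 * np.sum(
        np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0
    )

# KL between identical Gaussians is zero...
print(kl_diag_gaussians([0, 0], [1, 1], [0, 0], [1, 1]))  # 0.0
# ...and grows as the two conditionals disagree.
print(kl_diag_gaussians([1, 0], [1, 1], [0, 0], [1, 1]))  # 0.5
```

In a VAE, $\mu_1, \sigma_1^2$ would come from the decoder-implied conditional $\mathbb{D}_{c \vert x^\prime}$ and $\mu_2, \sigma_2^2$ from the encoder-implied conditional $\mathbb{E}_{c \vert x}$, so the KL term can be minimized by gradient descent without any sampling.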

Let's verify that the unique solution to the foregoing optimization problem corresponds to the ideal encoder and decoder.

The negative log-likelihood term $-E_{\mathbb{P}_x}\left( \log \mathbb{E}_x \right)$ reflects the extent to which the encoder model $\mathbb{E}_{(x, c)}$ is consistent with the true distribution of original inputs $\mathbb{P}_{x}$. It is a well-known result that this term is minimized when $\mathbb{E}_x = \mathbb{P}_x$, which indeed corresponds to the ideal encoder.

The KL-divergence term $E_{\mathbb{P}_x}\left[ KL \left( \mathbb{D}_{c \vert x^\prime} || \mathbb{E}_{c \vert x}\right)\right]$ measures how consistent the real encoder is on average with the encoder implied by the decoder's joint distribution.

It is minimized when for every $x \in \mathcal{X}, ~~ \mathbb{D}_{c \vert x^\prime} = \mathbb{E}_{c \vert x}$, which also implies \begin{align}\mathbb{D}_c :&= \int \mathbb{D}_{c \vert x^\prime} \mathbb{D}_{x^\prime} dx^\prime \\ &= \int \mathbb{D}_{c \vert x^\prime} \mathbb{P}_x dx^\prime \\&= \int \mathbb{E}_{c \vert x} \mathbb{E}_x dx \\&= \mathbb{E}_c \\&= \mathbb{P}_c,\end{align}in which case the decoder is also ideal.

EXTENSION TO LOWER-DIMENSIONAL CODES

So far we have focused on the case where it is always possible to map the input distribution to a multivariate standard normal, no matter the input distribution $\mathbb{P}_x$. In this case, the ideal encoder maps $x \in \mathcal{X}$ to the code $c = F_c^{-1} \circ F_x(x)$, which ought to have the same dimension as $x$, since the mapping is invertible.

In practice, we are often interested in lower-dimensional codes, which VAEs likewise assume follow a multivariate standard normal distribution, only a lower-dimensional one.

The challenge now is to find a joint distribution $\mathbb{Q}_{(x, c)}$ on $\mathcal{X} \times \mathcal{C}$ such that its $x$-marginal is the true input distribution $\mathbb{P}_x$, and its $c$-marginal is the lower-dimensional standard normal code distribution $\mathbb{P}_c$.

This is clearly not always possible, mathematically speaking. What gives us hope that such a joint distribution may be found in practice is that some physical applications (e.g. computer vision) are structured enough that the true input distribution $\mathbb{P}_x$ is supported on a lower-dimensional manifold $\mathcal{M} \subset \mathcal{X}$, rather than on the whole of $\mathcal{X}$.

For instance, because a real-world image typically contains objects which themselves have edges, textures, and patterns, and obey the laws of physics, not every matrix of pixel values can be regarded as a real-world image.

If we can find such a distribution $\mathbb{Q}_{(x, c)}$, then we can find the ideal (compressing) encoder that maps $\mathbb{P}_x$ to $\mathbb{P}_c$ by sampling from the conditional distribution $\mathbb{Q}_{c \vert x}$. Similarly, we can find the ideal (expanding) decoder that maps $\mathbb{P}_c$ to $\mathbb{P}_x$ by sampling from the conditional distribution $\mathbb{Q}_{x \vert c}$.

Note that both ideal encoder and ideal decoder are now fuzzy.

Using joint distributions for the encoder and decoder of the forms introduced earlier, if we can find parameters $f_\theta, \sigma, \mu_\theta, \sigma_\theta$ such that $KL \left( \mathbb{D}_{(x^\prime, c)} || \mathbb{E}_{(x, c)}\right)=0,$ then we are guaranteed that $\mathbb{Q}_{(x, c)}$ exists, AND that $\mathbb{D}_{(x^\prime, c)}$ and $\mathbb{E}_{(x, c)}$ are the ideal decoder and encoder respectively!

Thus, whether the code is lower-dimensional or not, solving the optimization problem \begin{align}\min_{f_\theta, \sigma, \mu_\theta, \sigma_\theta} ~&E_{\mathbb{P}_x}\left[ KL \left( \mathbb{D}_{c \vert x^\prime} || \mathbb{E}_{c \vert x}\right)\right] \\ &-E_{\mathbb{P}_x}\left( \log \mathbb{E}_x \right)\end{align}is the key to learning the ideal encoder and decoder, and this is what VAEs do.

Feature Engineering With AEs

Now that we are on the same page about what AEs are, let me tell you why you should be careful before using them for feature engineering.

When AEs are used for feature engineering, the idea is that, because the code vector $c$ provides a richer and lower-dimensional representation of the original input vector $x$, it may be used as a substitute feature vector to predict a target of interest $y$. The hope is that we would get better performance by predicting $y$ using $c$ instead of $x$.

Reason 1: Low reconstruction error does not guarantee low signal loss

Whether you use TAEs or VAEs, a low reconstruction error simply does not guarantee a low signal loss. After all, AEs simply do not know anything about $y$!

In TAEs, much of the reconstruction error $\mathcal{E} := E\left( ||x-x^\prime||^2 \right)$ could very well be the signal that was in $x$ to predict $y$, irreversibly lost when the input was encoded.

Let us take a concrete example to illustrate this. Imagine we are dealing with a tabular classification problem with categorical and continuous features $x$, one that we know can be perfectly solved using a single classification tree (CT).

A good proxy for the signal lost by a TAE is the error made by our CT using reconstructed inputs $x^\prime$ to predict $y$.

Imagine that our CT first splits on a binary feature $b$ that accounts for only 10% of the total variance of $x$. Suppose that our TAE reconstructs all coordinates of $x$ perfectly, except for $b$, which it gets wrong 10% of the time. This means that our TAE achieves a reconstruction error of only about 1% of the total variance of $x$.

However, because our CT first splits on the value of feature $b$, on average 10% of reconstructed inputs $x^\prime$ will fall in the wrong leaves of our CT. This could potentially result in a 10% classification error, despite a 1% reconstruction error! If the TAE got 20% of $b$ wrong, then this would still lead to a very small 2% reconstruction error, but potentially a whopping 20% classification error, solely due to loss of signal by the AE!

One could object that a 20% error on a single feature is too high, but we arrive at the same conclusion if the TAE makes a 5% reconstruction error on each of 4 binary features, each accounting for 10% of the total variance. In that case, the overall reconstruction error of the TAE would still be 2%, while the classification error could still be as high as 20%!
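A quick simulation makes the gap between reconstruction error and classification error concrete. The choices below (a 0/1 coding for $b$, a 10% flip rate, $b$ carrying 10% of the total variance) are illustrative, and the exact reconstruction-error share depends on them, so it comes out somewhat different from the back-of-envelope figures above; the point, a reconstruction error far smaller than the resulting classification error, survives:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# b is a balanced 0/1 feature that perfectly determines the label y.
b = rng.integers(0, 2, size=n).astype(float)
y = b

# Nine other features, scaled so that b carries ~10% of total variance:
# Var(b) = 0.25 and 9 * 0.25 = 2.25, so b's share is 0.25 / 2.5 = 10%.
other = rng.normal(scale=0.5, size=(n, 9))
x = np.column_stack([b, other])

# A TAE that reconstructs everything perfectly except b, which it
# flips 10% of the time.
flip = rng.random(n) < 0.10
x_prime = x.copy()
x_prime[flip, 0] = 1.0 - x_prime[flip, 0]

# Reconstruction error as a share of total variance (small)...
rec_err = np.mean(np.sum((x - x_prime) ** 2, axis=1)) / np.sum(x.var(axis=0))
# ...versus the error of the perfect tree (y = b) on reconstructed inputs.
clf_err = np.mean(x_prime[:, 0] != y)
print(f"reconstruction error share: {rec_err:.1%}, "
      f"classification error: {clf_err:.1%}")
```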

In VAEs, the very fact that the ideal encoder is fuzzy (i.e. $\mathbb{Q}_{x \vert c}$ has non-zero variance), which is always the case when the code is lower-dimensional, is enough to conclude that the encoder is lossy or, equivalently, that a code $c$ does not contain all the information, let alone all the signal, that was in $x$.

In a sense, you can think of the conditional entropy $h(x \vert c)$ as the amount of information about $x$ irreversibly lost when encoding. Unfortunately, there is no guarantee that this lost information is pure noise; much of it could very well be signal that was in $x$ to predict $y$.

In fact, as illustrated in the CT example above, a 2% information loss about $x$ can result in a reduction of classification accuracy by 20%.

Reason 2: Learning patterns from the code could be harder than from original features

When AEs are used for feature engineering, new features are constructed.

In general, the primary goal of feature construction is to simplify the relationship between inputs and the target into one that models in our toolbox can reliably learn.

The questions you should be asking yourself before using AE's codes as features in a tabular predictive model are:

• Do the codes make any sense?
• Can I think of an explanation for why the codes could have as simple a relationship to the target as the original inputs?

If the answer to either question is no, then the codes will likely be less useful than the original features.

As an illustration, imagine we want to predict a person's income using, among other features, the GPS coordinates of her primary residence, her age, her number of children, and the number of hours she works per week.

While it is easy to see how a tree-based learner could exploit these features directly, combining them into a code vector would likely produce features that make little sense, and from which tree-based methods would struggle to learn anything meaningful.

The Computer Vision Exception

Interestingly, unsupervised pre-training in computer vision does exactly what I am advising you against, and with great success!

So why are AEs effective as a pre-training step for computer vision, but not for tabular data?

Reason 1: Low reconstruction error in computer vision usually guarantees low signal loss

In computer vision, mildly corrupting an image or a video does not make it much harder to solve the problem at hand.

As illustrated in the image above, adding Gaussian noise with a standard deviation of more than 50% of the pixel standard deviation does not make it much harder for a human being to tell which digit is in the corrupted image.

In other words, even if a TAE achieves a reconstruction error as high as 50% of the total variance of the image distribution, the corrupted reconstruction will still likely have all we need to recognize the digit in the image!

This implies that the code too contains all we need to recognize the digit in the image. Indeed, by the data processing inequality, $I(y; x) \geq I(y; c) \geq I(y; x^\prime),$meaning that the performance achievable using code $c$ to predict image class $y$ is at least as high as the performance achievable using corrupted reconstruction $x^\prime$ to predict image class $y$.
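The data processing inequality is easy to check numerically on a small discrete example. The joint distribution $p(y, x)$ below is randomly generated, and the "encoder" is a hypothetical deterministic coarsening $c = \lfloor x / 2 \rfloor$, so that $y \to x \to c$ forms a Markov chain:

```python
import numpy as np

def mutual_info(joint):
    """I(A; B) in nats, computed from a joint probability table p(a, b)."""
    joint = np.asarray(joint, float)
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (pa @ pb)[mask])))

rng = np.random.default_rng(0)

# Random joint p(y, x) with y in {0, 1} and x in {0, ..., 7}.
p_yx = rng.random((2, 8))
p_yx /= p_yx.sum()

# Deterministic coarsening c = x // 2 gives the joint p(y, c) by summing
# the x-columns that map to the same code.
p_yc = p_yx.reshape(2, 4, 2).sum(axis=2)

print(mutual_info(p_yx), mutual_info(p_yc))
```

Because $c$ is a function of $x$, the data processing inequality guarantees $I(y; x) \geq I(y; c)$, which the printed values confirm.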

In practice, state-of-the-art TAEs on MNIST and other computer vision problems achieve far smaller reconstruction errors, as evidenced by Fig. 2 above, and virtually no signal loss!

Reason 2: Features learned by convolutional layers make it easier to learn structure in images

Several studies have shown that convolutional layers are capable of learning higher-level abstractions in images such as edges, texture, parts, and objects.

When used as features, these abstractions make computer vision problems easier to solve than raw images do.


Conclusion

Autoencoders are great tools for data compression and generative modeling.

However, in order to use autoencoders as a pre-processing step in a predictive modeling problem, one needs to ensure that:

• A low reconstruction error in the problem at hand implies that the problem is as easy to solve using the reconstructed inputs as using the original inputs.
• The code represents features/abstractions that make the target easier to predict than the original inputs do.
