Suppose you want to forecast the return of a particular stock using many different predictors (think: past returns, market cap, asset growth, etc…). One way to do this would be to use the LASSO. Alternatively, you could use a neural network to make your forecast. On the surface, these two approaches look very different. However, it turns out that it’s possible to recast the LASSO as a *really* simple neural network.
This post outlines how.
This connection suggests we can use penalized regressions, such as the LASSO, as microscopes for studying more complicated machine-learning models, like neural networks, which often exhibit surprising new behavior. For example, if you include more predictors than observations in an OLS regression, then you’ll be able to perfectly fit your training data but your out-of-sample performance will be terrible. By contrast, highly over-parameterized neural networks often have the best out-of-sample fit.
Because these models are so complicated, it’s often hard to understand why a pattern like this might emerge. Penalized regression models like the LASSO occupy a middle ground between OLS and complicated machine-learning models. Thus, if the LASSO can be viewed as a really simple neural net, then it might be possible to use this intermediate setup as a laboratory for understanding more complicated procedures. That’s the idea behind Hastie, Montanari, Rosset, and Tibshirani (2022). And Kelly, Malamud, and Zhou (2022) build on their logic.
General setup
Imagine that you’ve got historical data on the returns of $N$ different stocks, $r_1, \ldots, r_N$, and you want to make the best forecast possible for the future return of the $(N+1)$st stock, $r_{N+1}$. You have access to $H$ different return predictors. Let $x_{n,h}$ denote the value of the $h$th predictor for the $n$th stock. Assume that each predictor has been normalized to have mean zero and variance one in the cross-section. Without loss of generality, also assume that the cross-sectional average return is zero.
If there were only one predictor, $H = 1$, then it’d be possible to estimate the OLS regression below:

$$r_n = \beta \cdot x_n + \epsilon_n$$
In this case, the solution is given by $\hat{\beta} = \frac{1}{N} \sum_{n=1}^N x_n \cdot r_n$. If the predictor tends to be positive, $x_n > 0$, for stocks that subsequently realize positive returns, $r_n > 0$, then the OLS slope coefficient associated with it will be positive, $\hat{\beta} > 0$. It will also be profitable to trade on this predictor.
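To make this concrete, here’s a quick numerical check of that closed form. It’s a sketch with made-up simulated data (a true slope of 0.5 and 1,000 stocks), just to confirm that $\frac{1}{N} \sum_n x_n \cdot r_n$ matches a standard least-squares fit:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1_000                                 # number of stocks (illustrative)
x = rng.standard_normal(N)
x = (x - x.mean()) / x.std()              # normalize: mean zero, variance one
r = 0.5 * x + rng.standard_normal(N)      # returns with a made-up true slope of 0.5
r = r - r.mean()                          # demean returns, as in the setup

beta_formula = (x * r).mean()             # (1/N) * sum_n x_n * r_n
beta_lstsq, *_ = np.linalg.lstsq(x[:, None], r, rcond=None)

print(beta_formula, beta_lstsq[0])        # the two numbers agree
```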
You can also use an OLS regression to create a return forecast when you have more than one predictor

$$r_n = \sum_{h=1}^H \beta_h \cdot x_{n,h} + \epsilon_n$$

provided that you still have more observations than predictors, $N > H$. If you’ve got more stocks than predictors in your training data, then you’re in business. However, if your training data contains fewer stocks than predictors, $N < H$, then you’re SOL. You’ll have to use something other than an OLS regression.
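To see the problem concretely, here’s a toy simulation with made-up dimensions, $N = 50$ stocks and $H = 100$ predictors. With more predictors than stocks, OLS fits the training returns perfectly, and the fitted coefficients aren’t even unique:

```python
import numpy as np

rng = np.random.default_rng(0)
N, H = 50, 100                              # fewer stocks than predictors (illustrative)
X = rng.standard_normal((N, H))
r = 0.5 * X[:, 0] + rng.standard_normal(N)  # only the first predictor matters (made up)

# lstsq returns the minimum-norm solution; infinitely many others fit equally well
beta, _, rank, _ = np.linalg.lstsq(X, r, rcond=None)
print(rank)                                 # rank is N = 50 < H = 100: underdetermined
print(np.allclose(X @ beta, r))             # True: the training data is fit perfectly
```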
The LASSO
One popular approach is to fit a LASSO specification. This is essentially an OLS regression with an additional absolute-value penalty applied to each predictive coefficient:

$$\min_{\beta} \ \frac{1}{2N} \cdot \sum_{n=1}^N \Big( r_n - \sum_{h=1}^H \beta_h \cdot x_{n,h} \Big)^2 + \lambda \cdot \sum_{h=1}^H |\beta_h|$$
The pre-factor of $\lambda \geq 0$ in front of the penalty term is a tuning parameter, which can be optimally chosen via cross-validation. Notice that, when $\lambda = 0$, there is no penalty at all and the LASSO is equivalent to OLS. But when $\lambda > 0$, the LASSO’s coefficients will differ from the OLS estimates as shown in the interactive figure below.
To see what I mean, let’s return to the case where there’s only one predictor. Alternatively, you could think about a world with orthogonal predictors, $\frac{1}{N} \sum_{n=1}^N x_{n,h} \cdot x_{n,h'} = 0$ for all $h \neq h'$. In either case, we have:

$$\hat{\beta}_h^{\text{LASSO}} = \text{sign}\big( \hat{\beta}_h^{\text{OLS}} \big) \cdot \max\big\{ \, |\hat{\beta}_h^{\text{OLS}}| - \lambda, \ 0 \, \big\}$$
This expression tells us that the LASSO does two things. First, it shrinks large OLS coefficients, $|\hat{\beta}_h^{\text{OLS}}| > \lambda$, toward zero, $|\hat{\beta}_h^{\text{LASSO}}| = |\hat{\beta}_h^{\text{OLS}}| - \lambda$. Second, it forces all small OLS coefficients, $|\hat{\beta}_h^{\text{OLS}}| \leq \lambda$, to be exactly zero, $\hat{\beta}_h^{\text{LASSO}} = 0$.
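Here’s a sketch of this soft-thresholding result in Python. I’m simulating orthogonal, unit-variance predictors and comparing the formula above to scikit-learn’s `Lasso`, which minimizes the same $\frac{1}{2N} \cdot \text{RSS} + \lambda \cdot \sum_h |\beta_h|$ objective. The dimensions, true coefficients, and $\lambda$ value are all made-up choices:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
N, H = 200, 5                                     # dimensions are illustrative
Q, _ = np.linalg.qr(rng.standard_normal((N, H)))
X = np.sqrt(N) * Q                                # orthogonal predictors with (1/N) X'X = I
beta_true = np.array([1.0, -0.6, 0.3, 0.0, 0.0])  # made-up coefficients
r = X @ beta_true + 0.1 * rng.standard_normal(N)

beta_ols = (X.T @ r) / N                          # OLS coefficients under this normalization
lam = 0.25                                        # penalty strength (illustrative)

# soft thresholding: shrink the large coefficients by lam, zero out the small ones
beta_soft = np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)

beta_lasso = Lasso(alpha=lam, fit_intercept=False).fit(X, r).coef_

print(np.round(beta_soft, 3))
print(np.round(beta_lasso, 3))                    # matches, up to solver tolerance
```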
Neural network
The LASSO is still able to make forecasts in situations where there are more predictors than observations because it kills off the predictors with the smallest coefficients. Morally speaking, if only a handful of your predictors have any forecasting power, then you shouldn’t need $N > H$ observations to figure this out; far fewer data points should do just fine. An alternative approach to making a return forecast when $N < H$ would be to use a neural network. On the surface, this seems like a very different strategy. Instead of betting on sparsity, large neural networks often perform best when highly over-parameterized.
There are lots of kinds of neural networks. In this post, I’m going to mainly focus on neural networks with only one hidden layer that has the same number of nodes as predictors. e.g., with $H$ predictors, there will be $H$ hidden nodes. The diagram to the left shows what this would look like in a situation with a small number of predictors and hidden nodes, so that we can see what’s going on.
The value of each hidden node is determined by an activation function, $f(\cdot)$, that takes a linear combination of predictor values as its input:

$$z_{n,j} = f\Big( w_{j,0} + \sum_{h=1}^H w_{j,h} \cdot x_{n,h} \Big)$$
e.g., you could set $f(u) = \max\{u, \, 0\}$, $f(u) = \tanh(u)$, or something else entirely. $w_j = (w_{j,0}, \, w_{j,1}, \, \ldots, \, w_{j,H})$ contains the weights that go into the $j$th hidden node. It has $(H+1)$ elements due to the intercept term.
The return forecast generated by this neural network, $\hat{r}_{N+1} = \sum_{j=1}^H c_j \cdot z_{N+1,j}$, is then a weighted average of its hidden nodes, where the weights are chosen by solving the optimization problem below:

$$\min_{c, \, w} \ \frac{1}{2N} \cdot \sum_{n=1}^N \Big( r_n - \sum_{j=1}^H c_j \cdot z_{n,j} \Big)^2 + \frac{\lambda}{2} \cdot \Big( \sum_{j=1}^H c_j^2 + \sum_{j=1}^H \sum_{h=0}^H w_{j,h}^2 \Big)$$
This objective function includes a penalty term just like the LASSO, but the penalty is quadratic. It’s equivalent to the common practice of training a neural network via gradient descent with weight decay.
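To make that equivalence concrete, here’s a minimal PyTorch sketch of training this kind of one-hidden-layer network. Everything here (the dimensions, the ReLU activation, the learning rate, the weight-decay value) is an illustrative assumption rather than something pinned down by the setup above:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
N, H = 200, 10                                   # stocks and predictors (illustrative)
X = torch.randn(N, H)
r = 0.5 * X[:, 0:1] + 0.1 * torch.randn(N, 1)    # only one predictor matters (made up)

# one hidden layer with H nodes: x -> f(w'x + intercept) -> weighted average -> forecast
model = nn.Sequential(
    nn.Linear(H, H),              # the w weights (plus intercepts) feeding the hidden nodes
    nn.ReLU(),                    # the activation function f(.)
    nn.Linear(H, 1, bias=False),  # the c weights that average the hidden nodes
)

# SGD with weight decay = gradient descent on the MSE plus (lambda/2) * sum of squared weights
opt = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=0.1)
loss_fn = nn.MSELoss()            # note: MSELoss averages over N but omits the extra 1/2

for step in range(2_000):
    opt.zero_grad()
    loss = loss_fn(model(X), r)
    loss.backward()
    opt.step()

print(loss.item())                # in-sample fit after training
```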
Degrees of freedom
If our goal is to write down the LASSO as a special case of a neural network, then there are two apparent differences that need to be finessed. The first involves degrees of freedom. In the LASSO, there is one parameter that needs to be estimated for each predictor. In the neural network above, each predictor feeds into every one of the $H$ hidden nodes, so it is associated with many free parameters. In addition, you must also choose an activation function, $f(\cdot)$.
To represent the LASSO as a neural network, we’re going to have to shut down most of the degrees of freedom associated with each predictor. So, let’s start by looking at a neural network that’s “simply connected”—i.e., a network where $w_{j,h} = 0$ whenever $j \neq h$. Let’s also assume a linear activation function, $f(u) = u$, and restrict ourselves to the case where there’s no constant term, $w_{j,0} = 0$.
After making these assumptions, we are left with the neural network in the diagram above. There are now only two free parameters associated with each predictor: the weight $w_h \equiv w_{h,h}$ going into its hidden node and the weight $c_h$ coming out of it. To estimate all $2 \cdot H$ of these values, we must minimize the objective below:

$$\min_{c, \, w} \ \frac{1}{2N} \cdot \sum_{n=1}^N \Big( r_n - \sum_{h=1}^H c_h \cdot w_h \cdot x_{n,h} \Big)^2 + \frac{\lambda}{2} \cdot \sum_{h=1}^H \big( c_h^2 + w_h^2 \big)$$
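And here’s a sketch that checks the punchline numerically: plain gradient descent on this objective (a quadratic penalty on $c$ and $w$) ends up recovering the same coefficients as scikit-learn’s `Lasso` applied to the same data. The data-generating process, step size, and iteration count are all made-up choices:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
N, H = 200, 10                                  # illustrative dimensions
X = rng.standard_normal((N, H))
beta_true = np.zeros(H)
beta_true[:3] = [1.0, -0.5, 0.25]               # made-up sparse coefficients
r = X @ beta_true + 0.1 * rng.standard_normal(N)

lam = 0.1                                       # penalty strength (illustrative)

# gradient descent on the simply-connected network: forecast_n = sum_h c_h * w_h * x_{n,h}
c = 0.5 * rng.standard_normal(H)
w = 0.5 * rng.standard_normal(H)
lr = 0.02
for _ in range(50_000):
    resid = r - X @ (c * w)
    grad_beta = -(X.T @ resid) / N              # gradient of the (1/2N) * RSS term w.r.t. c*w
    c, w = c - lr * (w * grad_beta + lam * c), w - lr * (c * grad_beta + lam * w)

beta_net = c * w                                # the products play the role of the coefficients
beta_lasso = Lasso(alpha=lam, fit_intercept=False).fit(X, r).coef_

print(np.round(beta_net, 3))
print(np.round(beta_lasso, 3))                  # the two sets of coefficients should line up
```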
This looks almost like the LASSO objective function. But there’s still one glaring difference left…
Nature of the penalty
In the LASSO, we’ve got an absolute-value penalty, whereas the neural network has a quadratic penalty. This seems important! To see why, consider replacing the absolute-value penalty in the LASSO with a quadratic one, $\frac{\lambda}{2} \cdot \sum_{h=1}^H \beta_h^2$.
When you do this, you’re left with something called the Ridge regression.
Just like with the LASSO, we can characterize the Ridge estimates relative to OLS in the case where there’s only one predictor or all predictors are orthogonal to one another:

$$\hat{\beta}_h^{\text{Ridge}} = \frac{1}{1 + \lambda} \cdot \hat{\beta}_h^{\text{OLS}}$$
When you increase the value of $\lambda$ in the figure to the right, you’ll see that the slope of the line changes. The larger the $\lambda$, the less $\hat{\beta}_h^{\text{Ridge}}$ changes in response to a change in $\hat{\beta}_h^{\text{OLS}}$. Notice how this effect is qualitatively different from the effect of increasing $\lambda$ in a LASSO specification. There, $\lambda$ controlled the size of the inaction region. But, provided that $|\hat{\beta}_h^{\text{OLS}}| > \lambda$, the LASSO estimate always moved one-for-one with $\hat{\beta}_h^{\text{OLS}}$.
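The contrast is easy to see side by side. Here’s a tiny sketch that applies both closed-form maps to a grid of hypothetical OLS coefficients (the grid and the $\lambda = 0.5$ value are arbitrary):

```python
import numpy as np

beta_ols = np.linspace(-2, 2, 9)     # a grid of hypothetical OLS coefficients
lam = 0.5                            # penalty strength (illustrative)

beta_ridge = beta_ols / (1 + lam)                                        # proportional shrinkage
beta_lasso = np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0)   # soft thresholding

for b_ols, b_ridge, b_lasso in zip(beta_ols, beta_ridge, beta_lasso):
    print(f"OLS {b_ols:+.2f} -> Ridge {b_ridge:+.2f}, LASSO {b_lasso:+.2f}")
```

Ridge shrinks every coefficient by the same proportion and never sets one exactly to zero; the LASSO zeroes out everything inside the inaction region and shifts everything outside it by a fixed amount.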
However, this Ridge intuition is misleading. In the simply-connected neural-network structure that I outline above, we are not choosing a single coefficient $\beta_h$ and penalizing its square. Instead, because there is a hidden layer, we are choosing the product $\beta_h = c_h \cdot w_h$. And this makes all the difference. For any value of the product $\beta_h$, we have that

$$\min_{\{ \, c_h, \, w_h \ : \ c_h \cdot w_h = \beta_h \, \}} \ \frac{1}{2} \cdot \big( c_h^2 + w_h^2 \big) \ = \ |\beta_h|$$

where the minimum is attained at $|c_h| = |w_h| = \sqrt{|\beta_h|}$. This is just the inequality relating arithmetic and geometric averages. So penalizing $c_h$ and $w_h$ quadratically ends up penalizing the product $\beta_h$ through its absolute value. It’s what allows a single hidden layer to sneak in a threshold through the back door.
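Spelling this out with the notation from above: if, for each fixed product $\beta_h = c_h \cdot w_h$, you first minimize the quadratic penalty over the factorization, the simply-connected network’s training problem collapses into the LASSO problem from before:

$$\min_{c, \, w} \ \frac{1}{2N} \cdot \sum_{n=1}^N \Big( r_n - \sum_{h=1}^H c_h \cdot w_h \cdot x_{n,h} \Big)^2 + \frac{\lambda}{2} \cdot \sum_{h=1}^H \big( c_h^2 + w_h^2 \big) \ = \ \min_{\beta} \ \frac{1}{2N} \cdot \sum_{n=1}^N \Big( r_n - \sum_{h=1}^H \beta_h \cdot x_{n,h} \Big)^2 + \lambda \cdot \sum_{h=1}^H |\beta_h|$$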
Some extensions
We’ve just seen that you can think about the LASSO as a simply-connected two-layer neural network with a linear activation function and no bias terms, which was trained via gradient descent with weight decay. This is not my observation. I first saw it in Tibshirani (2021). The step where you reduce the degrees of freedom is obvious enough. But I had never made the connection with the arithmetic/geometric mean inequality. That second step struck me (and still strikes me) as really cool. It’s also a very concrete example of the flexibility inherent in neural networks. The hidden layer allows a neural network to do things you wouldn’t guess possible based only on the functional forms involved.
In addition to outlining the argument above, Tibshirani (2021) also gives a couple of other interesting extensions. e.g., the note shows how, by increasing the number of hidden layers in the neural network, you can reproduce the output of a LASSO-like specification in which the absolute-value penalty is replaced by $|\beta_h|^q$ for some exponent $0 < q < 1$:

$$\min_{\beta} \ \frac{1}{2N} \cdot \sum_{n=1}^N \Big( r_n - \sum_{h=1}^H \beta_h \cdot x_{n,h} \Big)^2 + \lambda \cdot \sum_{h=1}^H |\beta_h|^q$$

The more hidden layers you include, the smaller the exponent $q$ gets and the closer you get to best-subset selection, which penalizes the number of nonzero coefficients rather than their size. The note also shows that it’s possible to write the group-LASSO as a neural network that ain’t quite so simply connected.