1. Motivation
Imagine we’re traders in a market where the cross-section of returns is related to $H$ variables:

$$r_n = \mu^\star + \sum_{h=1}^H \beta_h^\star \cdot x_{n,h} + \epsilon_n$$

In the equation above, $\mu^\star$ denotes the mean return, and each $\beta_h^\star$ captures the relationship between returns and the $h$th right-hand-side variable. Some notation: I’m going to be using a “star” to denote true parameter values and a “hat” to denote in-sample estimates of these values. e.g., $\hat{\beta}_h$ denotes an in-sample estimate of $\beta_h^\star$. To make things simple, let’s assume that $\tfrac{1}{N} \sum_{n=1}^N x_{n,h} = 0$, $\tfrac{1}{N} \sum_{n=1}^N x_{n,h}^2 = 1$, $\tfrac{1}{N} \sum_{n=1}^N x_{n,h} \cdot x_{n,h'} = 0$ for $h \neq h'$, and $\epsilon_n \overset{\text{iid}}{\sim} \mathrm{N}(0, \sigma^2)$.
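To make the setup concrete, here’s a quick simulation of this data-generating process in Python. It’s just a sketch: the sample size, number of variables, and parameter scales below are illustrative choices, not numbers from the analysis.

```python
# A minimal simulation of the assumed data-generating process.
# N, H, sigma, mu_star, and the scale of the true betas are illustrative
# placeholders, not values taken from the post.
import numpy as np

rng = np.random.default_rng(42)

N, H = 1_000, 25                              # assets and right-hand-side variables
sigma = 0.10                                  # std. dev. of the idiosyncratic shocks
mu_star = 0.01                                # true mean return
beta_star = rng.normal(0.0, 0.02, size=H)     # true slope coefficients

X = rng.standard_normal((N, H))               # approx. mean-zero, unit-variance, uncorrelated
eps = rng.normal(0.0, sigma, size=N)
r = mu_star + X @ beta_star + eps             # the cross-section of returns
```

In a large simulated cross-section, the variables satisfy the mean-zero, unit-variance, and zero-correlation assumptions approximately rather than exactly, which is all we need for illustration.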
Notice that learning about the most likely parameter values,

$$\{ \hat{\mu}, \, \hat{\beta}_1, \, \ldots, \, \hat{\beta}_H \} = \arg \max_{\{ \mu, \, \beta_1, \, \ldots, \, \beta_H \}} \, \Pr[ \, r_1, \, \ldots, \, r_N \, | \, \mu, \, \beta_1, \, \ldots, \, \beta_H \, ],$$

is really easy in this setting because the $\epsilon_n$’s are normally distributed. These maximum-likelihood estimates are just the coefficients from a cross-sectional regression,

$$r_n = \hat{\mu} + \sum_{h=1}^H \hat{\beta}_h \cdot x_{n,h} + \hat{\epsilon}_n.$$
So, finding $\{ \hat{\mu}, \, \hat{\beta}_1, \, \ldots, \, \hat{\beta}_H \}$ is a homework question from Econometrics 101. Nothing could be simpler.
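In code, computing these estimates is a one-liner. Continuing the simulated example above (again, a sketch rather than anything from the original analysis):

```python
# Maximum-likelihood / OLS estimates from a cross-sectional regression,
# reusing the simulated X and r from the sketch above.
import numpy as np

X1 = np.column_stack([np.ones(N), X])              # prepend a constant for the mean
coef, *_ = np.linalg.lstsq(X1, r, rcond=None)      # minimizes the sum of squared residuals
mu_hat, beta_hat = coef[0], coef[1:]

print(mu_hat, beta_hat[:3])                        # should sit close to mu_star and beta_star[:3]
```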
But, what if we’re interested in predicting returns,

$$\hat{r}_{n'} = \hat{\mu} + \sum_{h=1}^H \hat{\beta}_h \cdot x_{n',h},$$

rather than learning about the most likely parameter values? It might seem like this is the same exact problem. And, if we’re only considering $H = 1$ right-hand-side variable, then it is the same exact problem. When $H = 1$, the best predictions come from using the most likely parameter values, $\{ \hat{\mu}, \, \hat{\beta}_1 \}$. But, it turns out that when there are $H \geq 2$ right-hand-side variables (and an unknown mean), this is no longer true. When $H \geq 2$, we can make better predictions with less likely parameters. When $H \geq 2$, there’s a tension between learning and predicting.
Why? That’s the topic of today’s post.
2. Maximum Likelihood
Finding the most likely (ML) parameter values is equivalent to minimizing the negative log likelihood. So, because we’re assuming that $\epsilon_n \overset{\text{iid}}{\sim} \mathrm{N}(0, \sigma^2)$, this is just

$$- \log \Pr[ \, r_1, \, \ldots, \, r_N \, | \, \mu, \, \beta_1, \, \ldots, \, \beta_H \, ] = \frac{1}{2 \cdot \sigma^2} \cdot \sum_{n=1}^N \Big( r_n - \mu - \sum_{h=1}^H \beta_h \cdot x_{n,h} \Big)^2 + \text{const},$$

where the “$\text{const}$” at the end denotes a bunch of terms that don’t include any of the parameters that we’re optimizing over. Optimizing over each parameter value then gives the first-order conditions below:

$$\begin{aligned} 0 &= \sum_{n=1}^N \Big( r_n - \hat{\mu} - \sum_{h=1}^H \hat{\beta}_h \cdot x_{n,h} \Big) \\ 0 &= \sum_{n=1}^N x_{n,h} \cdot \Big( r_n - \hat{\mu} - \sum_{h'=1}^H \hat{\beta}_{h'} \cdot x_{n,h'} \Big) \qquad \text{for each } h = 1, \, \ldots, \, H \end{aligned}$$
And, solving this system of $(H + 1)$ equations and $(H + 1)$ unknowns gives the most likely parameter values:

$$\hat{\mu} = \frac{1}{N} \cdot \sum_{n=1}^N r_n \qquad \text{and} \qquad \hat{\beta}_h = \frac{1}{N} \cdot \sum_{n=1}^N x_{n,h} \cdot ( r_n - \hat{\mu} ).$$
Clearly, the most likely parameter values are just the coefficients from a cross-sectional regression.
Now, for the sake of argument, let’s imagine there’s an oracle who knows the true parameter values, $\{ \mu^\star, \, \beta_1^\star, \, \ldots, \, \beta_H^\star \}$. With access to this oracle, we can compute our mean squared error when using the maximum-likelihood estimates to predict returns given any choice of true parameter values:

$$\mathrm{E}\big[ \, ( r_{n'} - \hat{r}_{n'} )^2 \, \big] = \sigma^2 + \frac{\sigma^2}{N} + H \cdot \frac{\sigma^2}{N}.$$

The first term, $\sigma^2$, captures the unavoidable error. i.e., even if we knew the true parameter values, we still wouldn’t be able to predict the realized shock, $\epsilon_{n'}$. And, the second and third terms, $\frac{\sigma^2}{N}$ and $H \cdot \frac{\sigma^2}{N}$, capture the error that comes from using the most likely parameter values rather than the true parameter values.
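We can sanity check this decomposition with a quick Monte Carlo experiment: simulate many cross-sections, fit the regression on each one, and measure the squared error when predicting a fresh return. This is a sketch under the same illustrative parameter choices as before, and the match to the formula is only approximate in finite samples.

```python
# Monte Carlo check of the maximum-likelihood prediction error.
# The average squared error should land close to sigma^2 + sigma^2/N + H*sigma^2/N.
import numpy as np

rng = np.random.default_rng(0)
N, H, sigma, mu_star = 500, 10, 0.10, 0.01
beta_star = rng.normal(0.0, 0.02, size=H)

sq_errors = []
for _ in range(2_000):
    X = rng.standard_normal((N, H))
    r = mu_star + X @ beta_star + rng.normal(0.0, sigma, size=N)
    coef, *_ = np.linalg.lstsq(np.column_stack([np.ones(N), X]), r, rcond=None)

    x_new = rng.standard_normal(H)                      # a fresh asset's characteristics
    r_new = mu_star + x_new @ beta_star + rng.normal(0.0, sigma)
    r_hat = coef[0] + x_new @ coef[1:]
    sq_errors.append((r_new - r_hat) ** 2)

print(np.mean(sq_errors), sigma**2 * (1 + (1 + H) / N))
```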
3. A Better Predictor
With this benchmark in mind, let’s take a look at a variant of the James-Stein (JS) estimator:

$$\hat{\beta}_h^{\text{JS}} \overset{\text{def}}{=} (1 - \lambda) \cdot \hat{\beta}_h.$$

In the definition above, $\lambda \in [0, \, 1]$ denotes a bias factor that shrinks the maximum-likelihood estimates towards zero whenever $\lambda > 0$. So, with access to an oracle, we can compute our mean squared error when using the James-Stein estimates to predict returns just like before:

$$\mathrm{E}\big[ \, ( r_{n'} - \hat{r}_{n'}^{\text{JS}} )^2 \, \big] = \sigma^2 + \frac{\sigma^2}{N} + \Big\{ (1 - \lambda)^2 \cdot H \cdot \frac{\sigma^2}{N} + \lambda^2 \cdot \sum_{h=1}^H (\beta_h^\star)^2 \Big\}.$$
Now the third term is more complicated. If we use a more biased estimator, $\lambda > 0$, then the noise in the most likely parameter values is going to cause less damage when we use them rather than the true parameter values to predict returns. But, bias is going to generate really bad predictions whenever the true parameter values are large, $\sum_{h=1}^H (\beta_h^\star)^2 \gg 0$. The $(1 - \lambda)^2 \cdot H \cdot \frac{\sigma^2}{N}$ and $\lambda^2 \cdot \sum_{h=1}^H (\beta_h^\star)^2$ terms capture these two opposing forces.
Comparing the maximum-likelihood and James-Stein prediction errors reveals that we should prefer the James-Stein estimator to the maximum-likelihood estimator if there’s a $\lambda > 0$ such that:

$$(1 - \lambda)^2 \cdot H \cdot \frac{\sigma^2}{N} + \lambda^2 \cdot \sum_{h=1}^H (\beta_h^\star)^2 < H \cdot \frac{\sigma^2}{N}.$$
But, here’s the thing: if we have access to an oracle, then there’s always going to be some $\lambda > 0$ that satisfies this inequality. This is easier to see if we rearrange things a bit:

$$\lambda^2 \cdot \Big\{ H \cdot \frac{\sigma^2}{N} + \sum_{h=1}^H (\beta_h^\star)^2 \Big\} < 2 \cdot \lambda \cdot H \cdot \frac{\sigma^2}{N} \qquad \Leftrightarrow \qquad 0 < \lambda < \frac{2 \cdot H \cdot \sigma^2 / N}{H \cdot \sigma^2 / N + \sum_{h=1}^H (\beta_h^\star)^2}.$$
So, there’s always a sufficiently small $\lambda > 0$ such that this inequality holds. Thus, for any choice of true parameter values, there’s some $\lambda > 0$ such that the James-Stein estimates give better predictions than the most likely parameter values.
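Here’s a quick numerical illustration of that threshold. The snippet below plugs illustrative parameter values into the two oracle prediction-error formulas and shows the James-Stein error dipping below the maximum-likelihood error for every $\lambda$ under the cutoff; none of the specific numbers come from the analysis itself.

```python
# Compare the oracle prediction-error formulas for maximum likelihood and
# James-Stein shrinkage across a grid of lambda values (illustrative numbers).
import numpy as np

N, H, sigma = 500, 10, 0.10
beta_star = np.full(H, 0.005)            # illustrative true slopes
A = H * sigma**2 / N                     # estimation noise in the ML slopes
B = np.sum(beta_star**2)                 # size of the true slopes

ml_error = sigma**2 + sigma**2 / N + A
for lam in np.linspace(0.0, 1.0, 6):
    js_error = sigma**2 + sigma**2 / N + (1 - lam)**2 * A + lam**2 * B
    print(f"lambda = {lam:.1f}   JS error = {js_error:.6f}   ML error = {ml_error:.6f}")

print("JS beats ML for every lambda below:", 2 * A / (A + B))
```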
4. Abandoning the Oracle
Perhaps this isn’t a useful comparison? In the real world, we can’t see the true parameter values when deciding which estimator to use. So, we can’t know ahead of time whether or not we’ve picked a small enough $\lambda$. It turns out that having to estimate $\lambda$ changes things, but only when there’s just $H = 1$ right-hand-side variable. When there are $H \geq 2$ variables, James-Stein with an estimated $\hat{\lambda}$ still gives better predictions.
To see where this distinction comes from, let’s first solve for the optimal choice of $\lambda$ when we still have access to the oracle. This will tell us what we have to estimate when we abandon the oracle. The optimal choice of $\lambda$ solves:

$$\lambda^\star = \arg \min_{\lambda} \, \Big\{ (1 - \lambda)^2 \cdot H \cdot \frac{\sigma^2}{N} + \lambda^2 \cdot \sum_{h=1}^H (\beta_h^\star)^2 \Big\}.$$

So, if we differentiate, then we can solve the first-order condition, $0 = - 2 \cdot (1 - \lambda^\star) \cdot H \cdot \frac{\sigma^2}{N} + 2 \cdot \lambda^\star \cdot \sum_{h=1}^H (\beta_h^\star)^2$, for $\lambda^\star$:

$$\lambda^\star = \frac{H \cdot \sigma^2 / N}{H \cdot \sigma^2 / N + \sum_{h=1}^H (\beta_h^\star)^2} = \frac{\sigma^2 / N}{\sigma^2 / N + \frac{1}{H} \cdot \sum_{h=1}^H (\beta_h^\star)^2}.$$
If the maximum-likelihood estimates are really noisy relative to the size of the true parameter values (i.e., if $\lambda^\star$ is close to $1$), then using the most likely parameter values rather than the true parameter values is going to increase our prediction error a lot. So, we should use more biased parameter estimates.
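To put a number on this, here’s what the oracle’s optimal shrinkage looks like under the illustrative parameters from the earlier sketches (again, just an example, not a calibration from the analysis):

```python
# Optimal oracle shrinkage under the illustrative parameters used above.
import numpy as np

N, H, sigma = 500, 10, 0.10
beta_star = np.full(H, 0.005)            # illustrative true slopes
A = H * sigma**2 / N                     # noise in the ML slope estimates
B = np.sum(beta_star**2)                 # size of the true slopes

lam_star = A / (A + B)
print(lam_star)                          # closer to 1 when the estimates are noisy relative to B
```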
But, notice what this formula is telling us. To estimate the right amount of bias, all we have to do is estimate the variance of the true parameter values, $\frac{1}{H} \cdot \sum_{h=1}^H (\beta_h^\star)^2$. We don’t have to estimate every single $\beta_h^\star$. So, we can estimate the variance of the true parameter values as follows,

$$\frac{1}{H - 1} \cdot \sum_{h=1}^H \hat{\beta}_h^2 = \widehat{\mathrm{Var}}[\beta^\star] + \frac{\sigma^2}{N},$$

where the factor of $\frac{1}{H - 1}$ on the left-hand side is a degrees-of-freedom correction. To see why we need this correction, recall that

$$\hat{\beta}_h = \frac{1}{N} \cdot \sum_{n=1}^N x_{n,h} \cdot ( r_n - \hat{\mu} ),$$

so not all maximum-likelihood estimates can move independently. There is an adding-up constraint: the slope estimates are computed relative to a mean that is itself estimated from the same cross-section.
Thus, if we have to estimate the right amount of bias to use, then we should choose:

$$\hat{\lambda} = \frac{\sigma^2 / N}{\sigma^2 / N + \widehat{\mathrm{Var}}[\beta^\star]} = \frac{(H - 1) \cdot \sigma^2 / N}{\sum_{h=1}^H \hat{\beta}_h^2}.$$
Notice that, when we have to estimate the right amount of bias, $\hat{\lambda} > 0$ only if $H \geq 2$. If $H = 1$, then the denominator is infinite and $\hat{\lambda} = 0$. After all, if there’s only $H = 1$ right-hand-side variable, then the equation to estimate the variance of the true parameter values has the same first-order condition as the equation to estimate $\beta_1^\star$ itself. With this estimated amount of bias, our prediction error becomes

$$\sigma^2 + \frac{\sigma^2}{N} + (1 - \hat{\lambda}) \cdot H \cdot \frac{\sigma^2}{N},$$

which is always less than the maximum-likelihood prediction error whenever $H \geq 2$.
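To see the feasible version in action, here’s a small Monte Carlo that estimates $\hat{\lambda}$ from the data using the plug-in rule above and compares out-of-sample squared prediction errors against plain maximum likelihood. It’s a sketch under the same illustrative assumptions as the earlier snippets: $\sigma$ is treated as known, the parameter scales are made up, and capping $\hat{\lambda}$ at one is a common positive-part convention rather than anything from the derivation.

```python
# Feasible James-Stein shrinkage with an estimated lambda-hat,
# compared to maximum likelihood on out-of-sample predictions.
import numpy as np

rng = np.random.default_rng(1)
N, H, sigma, mu_star = 100, 10, 0.10, 0.01

ml_sq, js_sq = [], []
for _ in range(2_000):
    beta_star = rng.normal(0.0, 0.005, size=H)        # true slopes, illustrative scale
    X = rng.standard_normal((N, H))
    r = mu_star + X @ beta_star + rng.normal(0.0, sigma, size=N)

    coef, *_ = np.linalg.lstsq(np.column_stack([np.ones(N), X]), r, rcond=None)
    mu_hat, beta_hat = coef[0], coef[1:]

    # Plug-in shrinkage: lambda-hat = (H - 1) * (sigma^2 / N) / sum(beta_hat^2),
    # capped at one so the slopes are never shrunk past zero (positive-part convention).
    lam_hat = min(1.0, (H - 1) * sigma**2 / N / np.sum(beta_hat**2))
    beta_js = (1.0 - lam_hat) * beta_hat

    # Out-of-sample: predict the return on a fresh asset.
    x_new = rng.standard_normal(H)
    r_new = mu_star + x_new @ beta_star + rng.normal(0.0, sigma)
    ml_sq.append((r_new - (mu_hat + x_new @ beta_hat)) ** 2)
    js_sq.append((r_new - (mu_hat + x_new @ beta_js)) ** 2)

print("ML :", np.mean(ml_sq))
print("JS :", np.mean(js_sq))    # smaller on average when H >= 2
```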
5. What This Means
My last post looked at one reason why it’s harder to predict the cross-section of returns when $H$ is large: Bayesian variable selection doesn’t scale. If we’re not sure which subset of variables actually predicts returns, then finding the subset of variables that’s the most likely to predict returns means solving a non-convex optimization problem. It turns out that solving this optimization problem means doing an exhaustive search over the powerset containing all $2^H$ possible subsets of variables. And, this just isn’t feasible when $H$ is large.
But, this scaling problem isn’t the only reason why it’s harder to predict the cross-section of returns when $H$ is large. And, this post points out another one of these reasons: even if you could solve this non-convex optimization problem and find the most likely parameter values, these parameter values wouldn’t give the best predictions. When $H \geq 2$, there’s a fundamental tension between making good predictions and learning about the most likely parameter values in the data-generating process for returns. So, when $H \geq 2$, traders are going to solve the prediction problem and live with the resulting biased beliefs about the underlying parameter values. What’s more, the $\hat{\lambda}$ with the best out-of-sample fit in the data is going to quantify how much the desire to make good predictions distorts traders’ beliefs.