Research Notebook

The Bayesian Information Criterion

January 3, 2017 by Alex

1. Motivation

Imagine that we’re trying to predict the cross-section of expected returns, and we’ve got a sneaking suspicion that x might be a good predictor. So, we regress today’s returns on x to see if our hunch is right,

    \begin{align*} r_{n,t}  = \hat{\mu}_{\text{OLS}} + \hat{\beta}_{\text{OLS}} \cdot x_{n,t-1}  +  \hat{\epsilon}_{n,t}. \end{align*}

The logic is straightforward. If x explains enough of the variation in today’s returns, then x must be a good predictor and we should include it in our model of tomorrow’s returns, \mathrm{E}_t(r_{n,t+1}) = \hat{\mu}_{\text{OLS}} + \hat{\beta}_{\text{OLS}} \cdot x_{n,t}.

But, how much variation is “enough variation”? After all, even if x doesn’t actually predict tomorrow’s returns, we’re still going to fit today’s returns better if we use an additional right-hand-side variable,

    \begin{align*} {\textstyle \frac{1}{N}} \cdot {\textstyle \sum_{n=1}^N} (r_{n,t} - \hat{\mu}_{\text{OLS}} - \hat{\beta}_{\text{OLS}} \cdot x_{n,t-1})^2  \leq  {\textstyle \frac{1}{N}} \cdot {\textstyle \sum_{n=1}^N} (r_{n,t} - \hat{\mu}_{\text{OLS}})^2. \end{align*}

The effect is mechanical. If we want to explain all of the variation in today’s returns, then all we have to do is include N right-hand-side variables in our OLS regression. With N linearly independent right-hand-side variables, we can always fit N stock returns perfectly in sample, no matter which variables we choose.
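To see this mechanical effect in action, here’s a quick simulation (a Python sketch with made-up data): regressing N returns on N randomly generated, linearly independent variables drives the in-sample residuals all the way to zero, even though the regressors are pure noise.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 25                        # number of stocks
r = rng.normal(size=N)        # today's cross-section of returns

# N randomly generated right-hand-side variables; they have no real
# predictive content, but they are linearly independent almost surely
X = rng.normal(size=(N, N))

# with N regressors and N observations, OLS fits the returns exactly
beta = np.linalg.solve(X, r)
residuals = r - X @ beta

print(np.max(np.abs(residuals)))   # zero up to floating-point error
```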

The Bayesian information criterion (BIC) tells us that we should include x as a right-hand-side variable only if doing so reduces the average squared residual by at least \sfrac{\log(N)}{N},

    \begin{align*} {\textstyle \frac{1}{N}} \cdot {\textstyle \sum_{n=1}^N} (r_{n,t} - \hat{\mu}_{\text{OLS}} - \hat{\beta}_{\text{OLS}} \cdot x_{n,t-1})^2  + {\textstyle \frac{\log(N)}{N}} &\leq {\textstyle \frac{1}{N}} \cdot {\textstyle \sum_{n=1}^N} (r_{n,t} - \hat{\mu}_{\text{OLS}})^2. \end{align*}

But, where does this \sfrac{\log(N)}{N} penalty come from? And, why is following this rule the Bayesian thing to do? Bayesian updating involves learning about a parameter value by combining prior beliefs with evidence from realized data. So, what parameter are we learning about when using the Bayesian information criterion? And, what are our prior beliefs? These questions are the topic of today’s post.
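Before diving in, here’s the decision rule itself as code (a Python sketch with simulated data; the helper name include_predictor is mine): include x only if it lowers the average squared residual by at least \log(N)/N.

```python
import numpy as np

def include_predictor(r, x):
    """BIC rule: include x iff the average squared residual falls
    by at least log(N)/N when x enters the regression."""
    N = len(r)
    mse_without = np.mean((r - np.mean(r)) ** 2)
    b = np.cov(r, x, bias=True)[0, 1] / np.var(x)     # OLS slope
    mu = np.mean(r) - b * np.mean(x)                  # OLS intercept
    mse_with = np.mean((r - mu - b * x) ** 2)
    return mse_with + np.log(N) / N <= mse_without

rng = np.random.default_rng(1)
N = 500
x_good = rng.normal(size=N)
r = 0.5 * x_good + rng.normal(size=N)   # x_good really does matter
x_noise = rng.normal(size=N)            # x_noise doesn't

print(include_predictor(r, x_good))     # the real predictor clears the bar
print(include_predictor(r, x_noise))    # pure noise almost never does
```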

2. Estimation

Instead of diving directly into our predictor-selection problem (should we include x in our model?), let’s pause for a second and solve our parameter-estimation problem (how should we estimate the coefficient on x?). Suppose the data-generating process for returns is

    \begin{align*} r_{n,t} =  \beta_\star \cdot x_{n,t-1} + \epsilon_{n,t} \end{align*}

where \beta_\star \sim \mathrm{N}(0, \, \sigma^2), \epsilon_{n,t} \overset{\scriptscriptstyle \text{iid}}{\sim} \mathrm{N}(0, \, 1), and x is normalized so that \frac{1}{N} \cdot \sum_{n=1}^N x_{n,t-1}^2 = \widehat{\mathrm{Var}}[x_{n,t-1}] = 1. For simplicity, let’s also assume that the true intercept is \mu_\star = 0 in the analysis below.

If we see N returns from this data-generating process, \mathbf{r}_t = \{ \, r_{1,t}, \, r_{2,t}, \, \ldots, \, r_{N,t} \, \}, then we can estimate \beta_\star by choosing the parameter value that maximizes the posterior probability given these realized returns:

    \begin{align*} \hat{\beta}_{\text{MAP}} &\overset{\scriptscriptstyle \text{def}}{=} \arg \max_{\beta} \{ \, \mathrm{Pr}( \mathbf{r}_t | \mathbf{x}_{t-1}, \, \beta ) \times \mathrm{Pr}(\beta) \, \} \\ &= \arg \min_{\beta} \{ \, - \, {\textstyle \frac{1}{N}} \cdot \log \mathrm{Pr}( \mathbf{r}_t | \mathbf{x}_{t-1}, \, \beta )  -  {\textstyle \frac{1}{N}} \cdot \log \mathrm{Pr}(\beta) \, \}. \end{align*}

This is known as maximum a posteriori (MAP) estimation, and the second equality in the expression above points out how we can either maximize the posterior probability or minimize -(\sfrac{1}{N}) \cdot \log(\cdot) of this function,

    \begin{align*} \mathrm{f}(\beta)  &\overset{\scriptscriptstyle \text{def}}{=}   - \, {\textstyle \frac{1}{N}} \cdot \log \mathrm{Pr}( \mathbf{r}_t | \mathbf{x}_{t-1}, \, \beta )  -  {\textstyle \frac{1}{N}} \cdot \log \mathrm{Pr}(\beta). \end{align*}

We can think about \mathrm{f}(\beta) as the average improbability of the realized returns given \beta_\star = \beta.

So, what is the answer? Because \beta_\star \sim \mathrm{N}(0, \, \sigma^2) and \epsilon_{n,t} \overset{\scriptscriptstyle \text{iid}}{\sim} \mathrm{N}(0, \, 1), we know that

    \begin{align*} \mathrm{f}(\beta)  &= {\textstyle \frac{1}{N}}  \cdot  \left\{ \, {\textstyle \sum_{n=1}^N} {\textstyle \frac{1}{2}} \cdot (r_{n,t} - \beta \cdot x_{n,t-1})^2 + N \cdot {\textstyle \frac{1}{2}} \cdot \log(2 \cdot \pi) \, \right\} \\ &\qquad \quad  + \, {\textstyle \frac{1}{N}}  \cdot  \left\{  \, {\textstyle \frac{1}{2 \cdot \sigma^2}} \cdot (\beta - 0)^2 + {\textstyle \frac{1}{2}} \cdot \log(2 \cdot \pi) + {\textstyle \frac{1}{2}} \cdot \log(\sigma^2) \, \right\} \end{align*}

where the first line is -(\sfrac{1}{N}) \cdot \log \mathrm{Pr}(\mathbf{r}_t|\mathbf{x}_{t-1}, \, \beta) and the second line is -(\sfrac{1}{N}) \cdot \log \mathrm{Pr}(\beta). What’s more, because we’re specifically choosing \hat{\beta}_{\text{MAP}} to minimize \mathrm{f}(\beta), we also know that

    \begin{align*} \mathrm{f}'(\hat{\beta}_{\text{MAP}}) &= 0 = - \, {\textstyle \frac{1}{N}}  \cdot  {\textstyle \sum_{n=1}^N} (r_{n,t} - \hat{\beta}_{\text{MAP}} \cdot x_{n,t-1}) \cdot x_{n,t-1} + {\textstyle \frac{1}{N}}  \cdot  {\textstyle \frac{1}{\sigma^2}} \cdot \hat{\beta}_{\text{MAP}}. \end{align*}

And, solving this first-order condition for \hat{\beta}_{\text{MAP}} tells us exactly how to estimate \beta_\star:

    \begin{align*} \hat{\beta}_{\text{MAP}} &= \frac{ N \cdot \left\{ \frac{1}{N} \cdot \sum_{n=1}^N r_{n,t} \cdot x_{n,t-1} \right\} }{ \frac{1}{\sigma^2} + N \cdot \left\{ \frac{1}{N} \cdot \sum_{n=1}^N x_{n,t-1}^2 \right\} } = \frac{ N \cdot \widehat{\mathrm{Cov}}[r_{n,t}, \, x_{n,t-1}] }{ \frac{1}{\sigma^2} + N \cdot \widehat{\mathrm{Var}}[x_{n,t-1}] }. \end{align*}
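In code, this closed-form solution is one line (a Python sketch with simulated data; beta_map is my name for the helper), and it makes the role of the prior variance \sigma^2 transparent:

```python
import numpy as np

def beta_map(r, x, sigma2):
    """Closed-form MAP estimate: N*Cov(r, x) / (1/sigma2 + N*Var(x)),
    assuming mu_star = 0 as in the text."""
    N = len(r)
    cov = np.mean(r * x)
    var = np.mean(x ** 2)
    return N * cov / (1.0 / sigma2 + N * var)

rng = np.random.default_rng(2)
N = 1000
x = rng.normal(size=N)
x = x / np.sqrt(np.mean(x ** 2))      # normalize so Var-hat[x] = 1
r = 0.3 * x + rng.normal(size=N)

beta_ols = np.mean(r * x) / np.mean(x ** 2)

print(beta_map(r, x, sigma2=1e-8))    # ~0: the prior dominates the data
print(beta_map(r, x, sigma2=1e8))     # ~beta_ols: the data dominate the prior
```

Shrinking \sigma^2 drags the estimate toward zero, while growing it recovers OLS, which is exactly the pair of limits examined in the next section.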

3. Selection

Now that we’ve seen the solution to our parameter-estimation problem, let’s get back to solving our predictor-selection problem. Should we include x in our predictive model of tomorrow’s returns? It turns out that answering this question means learning about the prior variance of \beta_\star. Is \beta_\star equally likely to take on any value, \sigma^2 \to \infty? Or, should we assume that \beta_\star = 0 regardless of the evidence, \sigma^2 \to 0?

To see where the first choice comes from, let’s think about the priors we’re implicitly adopting when we include x in our predictive model. Since \beta_\star \sim \mathrm{N}(0, \, \sigma^2), this means looking for a \sigma^2 such that \hat{\beta}_{\text{MAP}} = \hat{\beta}_{\text{OLS}}. Inspecting the solution to our parameter-estimation problem reveals that

    \begin{align*} \lim_{\sigma^2 \to \infty} \hat{\beta}_{\text{MAP}} &= \lim_{\sigma^2 \to \infty}  \left( \frac{ N \cdot \widehat{\mathrm{Cov}}[r_{n,t}, \, x_{n,t-1}] }{ \frac{1}{\sigma^2} + N \cdot \widehat{\mathrm{Var}}[x_{n,t-1}] } \right) = \frac{ \widehat{\mathrm{Cov}}[r_{n,t}, \, x_{n,t-1}] }{ \widehat{\mathrm{Var}}[x_{n,t-1}] } = \hat{\beta}_{\text{OLS}}. \end{align*}

So, by including x, we’re adopting an agnostic prior that \beta_\star is equally likely to be any value under the sun.

To see where the second choice comes from, let’s think about the priors we’re implicitly adopting when we exclude x from our predictive model. This means looking for a \sigma^2 such that \hat{\beta}_{\text{MAP}} = 0 regardless of the realized data. Again, inspecting the formula for \hat{\beta}_{\text{MAP}} reveals that

    \begin{align*} \lim_{\sigma^2 \to 0} \hat{\beta}_{\text{MAP}}  &=  \lim_{\sigma^2 \to 0}  \left( \frac{ N \cdot \widehat{\mathrm{Cov}}[r_{n,t}, \, x_{n,t-1}] }{ \frac{1}{\sigma^2} + N \cdot \widehat{\mathrm{Var}}[x_{n,t-1}] } \right) =0. \end{align*}

So, by excluding x, we’re adopting a religious prior that \beta_\star = 0 regardless of any new evidence.

Thus, when we decide whether to include x in our predictive model, what we’re really doing is learning about our priors. So, after seeing N returns, \mathbf{r}_t, we can decide whether to include x in our predictive model by choosing the prior variance, \sigma^2, that maximizes the posterior probability of realizing these returns,

    \begin{align*} \hat{\sigma}_{\text{MAP}}^2  &\overset{\scriptscriptstyle \text{def}}{=}  \arg \max_{\sigma^2 \in \{\infty, \, 0\}}  \left\{  \,  \mathrm{Pr}( \mathbf{r}_t | \mathbf{x}_{t-1}, \, \sigma^2 ) \times \mathrm{Pr}( \sigma^2 ) \,  \right\} \\ &=  \arg \min_{\sigma^2 \in \{ \infty, \, 0\}}  \left\{  \,  - {\textstyle \frac{1}{N}}  \cdot \log \mathrm{Pr}( \mathbf{r}_t | \mathbf{x}_{t-1}, \, \sigma^2 ) - {\textstyle \frac{1}{N}} \cdot \log \mathrm{Pr}( \sigma^2 ) \,  \right\}, \end{align*}

where the second equality in the expression above points out how we can either maximize the posterior probability or minimize -(\sfrac{1}{N}) \cdot \log(\cdot) of this function—i.e., its average improbability. Either way, if we estimate \hat{\sigma}_{\text{MAP}}^2 \to \infty, then we should include x; whereas, if we estimate \hat{\sigma}_{\text{MAP}}^2 \to 0, then we shouldn’t.

4. Why log(N)/N?

With a uniform prior over the two candidate values of \sigma^2, the \mathrm{Pr}(\sigma^2) term is just a constant, so what matters is the probability of the realized returns given each choice of priors,

    \begin{align*} \mathrm{Pr}( \mathbf{r}_t | \mathbf{x}_{t-1}, \, \sigma^2 ) &=  {\textstyle \int_{-\infty}^\infty} \mathrm{Pr}(\mathbf{r}_t | \mathbf{x}_{t-1}, \, \beta) \cdot \mathrm{Pr}(\beta|\sigma^2) \cdot  \mathrm{d}\beta  \\ &=  {\textstyle \int_{-\infty}^\infty} \, e^{-N \cdot \mathrm{f}(\beta)} \cdot  \mathrm{d}\beta. \end{align*}

In this section, we’re going to see how to evaluate this integral. And, in the process, we’re going to see precisely where that \sfrac{\log(N)}{N} penalty term in the Bayesian information criterion comes from.

Here’s the key insight in plain English. The realized returns are affected by noise shocks. By definition, excluding x from our predictive model means that we aren’t learning about \beta_\star from the realized returns, so there’s no way for these noise shocks to affect either our estimate of \hat{\beta}_{\text{MAP}} or our posterior-probability calculations. By contrast, if we include x in our predictive model, then we are learning about \beta_\star from the realized returns, so these noise shocks will distort both our estimate of \hat{\beta}_{\text{MAP}} and our posterior-probability calculations. The distortions caused by these noise shocks are going to be the source of the \sfrac{\log(N)}{N} penalty term in the Bayesian information criterion.

Now, here’s the same insight in Mathese. Take a look at the Taylor expansion of \mathrm{f}(\beta) around \hat{\beta}_{\text{MAP}},

    \begin{align*} \mathrm{f}(\beta)  &= \mathrm{f}(\hat{\beta}_{\text{MAP}})  + {\textstyle \frac{1}{2}} \cdot \mathrm{f}''(\hat{\beta}_{\text{MAP}}) \cdot (\beta - \hat{\beta}_{\text{MAP}})^2. \end{align*}

There’s no first-order term because \hat{\beta}_{\text{MAP}} is chosen to minimize \mathrm{f}(\beta), and there are no higher-order terms because both \beta_\star and \epsilon_{n,t} are normally distributed. From the formula for \mathrm{f}(\beta) we can calculate that

    \begin{align*} \mathrm{f}''(\hat{\beta}_{\text{MAP}}) &= {\textstyle \frac{1}{N}} \cdot {\textstyle \sum_{n=1}^N} x_{n,t-1}^2 + {\textstyle \frac{1}{N}} \cdot  {\textstyle \frac{1}{\sigma^2}}. \end{align*}

Recall that \mathrm{f}(\beta) measures the average improbability of realizing \mathbf{r}_t given that \beta_\star = \beta. So, if \mathrm{f}''(\hat{\beta}_{\text{MAP}}) \to \infty for a given choice of priors, then having any \beta_\star \neq \hat{\beta}_{\text{MAP}} is infinitely improbable under those priors. And, this is exactly what we find when we exclude x from our predictive model, \lim_{\sigma^2 \to 0} \mathrm{f}''(\hat{\beta}_{\text{MAP}}) = \infty. By contrast, if we include x in our predictive model, then \lim_{\sigma^2 \to \infty} \mathrm{f}''(\hat{\beta}_{\text{MAP}}) = 1, meaning that we are willing to entertain the idea that \beta_\star \neq \hat{\beta}_{\text{MAP}} due to distortions caused by the noise shocks.

To see why these distortions warrant a \sfrac{\log(N)}{N} penalty, all we have to do is evaluate the integral. First, let’s think about the case where we exclude x from our predictive model. We just saw that, if \sigma^2 \to 0, then we are unwilling to consider any parameter values besides \hat{\beta}_{\text{MAP}} = 0. So, the integral equation for our posteriors given that \sigma^2 \to 0 simplifies to

    \begin{align*} \lim_{\sigma^2 \to 0} \left\{ \, \mathrm{Pr}( \mathbf{r}_t | \mathbf{x}_{t-1}, \, \sigma^2 ) \, \right\} &= \mathrm{Pr}( \mathbf{r}_t | \mathbf{x}_{t-1}, \, \beta_\star = 0 ) \\ &= {\textstyle \big(\frac{1}{\sqrt{2 \cdot \pi}}\big)^N} \cdot e^{ - \, \sum_{n=1}^N \frac{1}{2} \cdot (r_{n,t} - 0)^2 }. \end{align*}

This means that the average improbability of realizing \mathbf{r}_t given the priors \sigma^2 \to 0 is given by

    \begin{align*} \lim_{\sigma^2 \to 0}  \left\{ \, - {\textstyle \frac{1}{N}} \cdot \log \mathrm{Pr}( \mathbf{r}_t | \mathbf{x}_{t-1}, \, \sigma^2 ) \, \right\} &= {\textstyle \frac{1}{2 \cdot N}} \cdot {\textstyle \sum_{n=1}^N} (r_{n,t} - 0)^2 + {\textstyle \frac{1}{2}} \cdot \log(2 \cdot \pi). \end{align*}

To calculate our posterior beliefs when we include x, let’s use this Taylor expansion around \hat{\beta}_{\text{MAP}} again,

    \begin{align*} \lim_{\sigma^2 \to \infty} \mathrm{Pr}( \mathbf{r}_t | \mathbf{x}_{t-1}, \, \sigma^2 ) &= \lim_{\sigma^2 \to \infty}  \left\{  \, {\textstyle \int_{-\infty}^\infty} \,  e^{-\, N \cdot \mathrm{f}(\beta)}  \cdot  \mathrm{d}\beta \,  \right\} \\ &= \lim_{\sigma^2 \to \infty} \left\{  \, {\textstyle \int_{-\infty}^\infty} \, e^{-\, N \cdot \left\{ \mathrm{f}(\hat{\beta}_{\text{MAP}}) + \frac{1}{2} \cdot \mathrm{f}''(\hat{\beta}_{\text{MAP}}) \cdot (\beta - \hat{\beta}_{\text{MAP}})^2 \right\}} \cdot \mathrm{d}\beta \,  \right\} \\ &= \left\{  \, e^{-\, N \cdot \mathrm{f}(\hat{\beta}_{\text{OLS}})}  \,  \right\} \times  \left\{  \, {\textstyle \int_{-\infty}^\infty} \, e^{ - \, \frac{N}{2} \cdot (\beta - \hat{\beta}_{\text{OLS}})^2} \cdot \mathrm{d}\beta \,  \right\}. \end{align*}

The first term is the probability of observing the realized returns assuming that \beta_\star = \lim_{\sigma^2 \to \infty} \hat{\beta}_{\text{MAP}} = \hat{\beta}_{\text{OLS}}. The second term is a penalty that accounts for the fact that \beta_\star might be different from the estimated \hat{\beta}_{\text{OLS}} in finite samples. Due to the central-limit theorem, this difference between \beta_\star and \hat{\beta}_{\text{OLS}} is going to shrink at a rate of \sqrt{(\sfrac{1}{N})}:

    \begin{align*} {\textstyle \int_{-\infty}^\infty} \,  e^{ - \, \frac{N}{2} \cdot (\beta - \hat{\beta}_{\text{OLS}})^2}  \cdot  \mathrm{d}\beta &= {\textstyle \int_{-\infty}^\infty} \,  e^{ - \, \frac{1}{2 \cdot (\sfrac{1}{N})} \cdot (\beta - \hat{\beta}_{\text{OLS}})^2}  \cdot  \mathrm{d}\beta \\ &= \sqrt{2 \cdot \pi \cdot (\sfrac{1}{N})}  \cdot \int_{-\infty}^\infty \,  {\textstyle \frac{1}{\sqrt{2 \cdot \pi \cdot (\sfrac{1}{N})}}} \cdot e^{ - \, \frac{1}{2 \cdot (\sfrac{1}{N})} \cdot (\beta - \hat{\beta}_{\text{OLS}})^2}  \cdot  \mathrm{d}\beta \\ &= \sqrt{2 \cdot \pi \cdot (\sfrac{1}{N})}. \end{align*}
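This Gaussian-integral step is easy to sanity-check numerically (a Python sketch; the value standing in for \hat{\beta}_{\text{OLS}} below is arbitrary):

```python
import numpy as np

N = 200
b_hat = 0.1                                    # arbitrary stand-in for beta_ols
beta = np.linspace(b_hat - 1.0, b_hat + 1.0, 200001)
dbeta = beta[1] - beta[0]

# brute-force Riemann sum of the integrand over a fine grid
numeric = np.sum(np.exp(-0.5 * N * (beta - b_hat) ** 2)) * dbeta
exact = np.sqrt(2.0 * np.pi / N)               # = sqrt(2*pi*(1/N))

print(numeric, exact)                          # agree to many decimal places
```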

So, the average improbability of realizing \mathbf{r}_t given the priors \sigma^2 \to \infty is given by

    \begin{align*} \lim_{\sigma^2 \to \infty} \left\{ \, - \, {\textstyle \frac{1}{N}} \cdot \log \mathrm{Pr}( \mathbf{r}_t | \mathbf{x}_{t-1}, \, \sigma^2 ) \, \right\} &= {\textstyle \frac{1}{2 \cdot N}} \cdot {\textstyle \sum_{n=1}^N} (r_{n,t} - \hat{\beta}_{\text{OLS}} \cdot x_{n,t-1})^2 + {\textstyle \frac{1}{2}} \cdot \log(2 \cdot \pi) + {\textstyle \frac{1}{2}} \cdot {\textstyle \frac{\log(N)}{N}}  + \mathrm{O}(\sfrac{1}{N}) \end{align*}

where \mathrm{O}(\sfrac{1}{N}) is big-“O” notation denoting terms that shrink at least as fast as \sfrac{1}{N} as N \to \infty.
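Putting the pieces together, we can verify this expression numerically: integrate the likelihood over \beta by brute force and compare -(\sfrac{1}{N}) \cdot \log of the result to the closed-form approximation. This is a Python sketch with simulated data; the two quantities should agree up to an O(\sfrac{1}{N}) term.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 200
x = rng.normal(size=N)
x = x / np.sqrt(np.mean(x ** 2))              # normalize so Var-hat[x] = 1
r = 0.3 * x + rng.normal(size=N)
beta_ols = np.mean(r * x) / np.mean(x ** 2)

# brute force: integrate Pr(r | x, beta) over a fine grid of beta values,
# using sufficient statistics to evaluate the log-likelihood cheaply
beta = np.linspace(beta_ols - 1.0, beta_ols + 1.0, 40001)
dbeta = beta[1] - beta[0]
Sr2, Srx, Sx2 = np.sum(r ** 2), np.sum(r * x), np.sum(x ** 2)
log_lik = -(0.5 * Sr2 - beta * Srx + 0.5 * beta ** 2 * Sx2) \
          - 0.5 * N * np.log(2.0 * np.pi)
shift = log_lik.max()                          # guard against underflow
brute = -(np.log(np.sum(np.exp(log_lik - shift)) * dbeta) + shift) / N

# closed form: SSR/(2N) + log(2*pi)/2 + log(N)/(2N)
approx = 0.5 * np.mean((r - beta_ols * x) ** 2) \
         + 0.5 * np.log(2.0 * np.pi) + 0.5 * np.log(N) / N

print(brute, approx)                           # equal up to an O(1/N) term
```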

5. Formatting

Bringing everything together, hopefully it’s now clear why we can decide whether to include x in our predictive model by checking whether

    \begin{align*} {\textstyle \frac{1}{N}} \cdot {\textstyle \sum_{n=1}^N} (r_{n,t} - \hat{\beta}_{\text{OLS}} \cdot x_{n,t-1})^2 + {\textstyle \frac{\log(N)}{N}}  + \mathrm{O}(\sfrac{1}{N}) \leq {\textstyle \frac{1}{N}} \cdot {\textstyle \sum_{n=1}^N} (r_{n,t} - 0)^2. \end{align*}

The \sfrac{\log(N)}{N} penalty term accounts for the fact that you’re going to be overfitting the data in sample when you include more right-hand-side variables. This criterion was first proposed in Schwarz (1978), who showed that the criterion becomes exact as N \to \infty. The Bayesian information criterion is often written as an optimization problem as well:

    \begin{align*} \hat{\beta}_{\text{MAP}} &=  \arg \min_{\beta} \left\{  \, {\textstyle \frac{1}{N}} \cdot {\textstyle \sum_{n=1}^N} (r_{n,t} - \beta \cdot x_{n,t-1})^2 + {\textstyle \frac{\log(N)}{N}} \cdot \mathrm{1}_{\{ \beta \neq 0 \}} \, \right\}. \end{align*}

Both ways of writing down the criterion are the same. They just look different due to formatting. There is one interesting idea that pops out of writing down the Bayesian information criterion as an optimization problem. Solving for \hat{\beta}_{\text{MAP}} suggests that you should completely ignore any predictors with sufficiently small OLS coefficients:

    \begin{align*} \hat{\beta}_{\text{MAP}} &=  \begin{cases} \hat{\beta}_{\text{OLS}} &\text{if } |\hat{\beta}_{\text{OLS}}| \geq \sqrt{{\textstyle \frac{\log(N)}{N}}}, \text{ and} \\ 0 &\text{otherwise.} \end{cases} \end{align*}
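In code, solving the penalized optimization problem is a one-line hard-thresholding rule (a Python sketch; bic_beta is my name for the helper, and it assumes \mu_\star = 0 and x normalized so that \widehat{\mathrm{Var}}[x_{n,t-1}] = 1, as in the text; the threshold follows from comparing the penalized objective at \hat{\beta}_{\text{OLS}} against its value at zero):

```python
import numpy as np

def bic_beta(r, x):
    """Solve the penalized least-squares problem in closed form:
    keep the OLS coefficient only if it clears the threshold."""
    N = len(r)
    beta_ols = np.mean(r * x) / np.mean(x ** 2)
    return beta_ols if abs(beta_ols) >= np.sqrt(np.log(N) / N) else 0.0

rng = np.random.default_rng(4)
N = 400
x = rng.normal(size=N)
x = x / np.sqrt(np.mean(x ** 2))
r = 0.5 * x + rng.normal(size=N)       # a genuinely useful predictor

print(bic_beta(r, x))                  # keeps beta_ols, roughly 0.5
print(bic_beta(rng.normal(size=N), x)) # pure-noise returns: usually 0.0
```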
