1. Motivation
Imagine that we're trying to predict the cross-section of expected returns, and we've got a sneaking suspicion that some variable $x_n$ might be a good predictor. So, we regress today's returns on $x_n$ to see if our hunch is right,

$$r_n = \hat{\beta} \cdot x_n + \hat{\epsilon}_n.$$

The logic is straightforward. If $x_n$ explains enough of the variation in today's returns, then $x_n$ must be a good predictor and we should include it in our model of tomorrow's returns, $\mathrm{E}[\tilde{r}_n] = \hat{\beta} \cdot x_n$.
But, how much variation is "enough variation"? After all, even if $x_n$ doesn't actually predict tomorrow's returns, we're still going to fit today's returns better if we use an additional right-hand-side variable,

$$R^2_{\text{with } x_n} \, \geq \, R^2_{\text{without } x_n}.$$

The effect is mechanical. If we want to explain all of the variation in today's returns, then all we have to do is include $N$ right-hand-side variables in our OLS regression. With $N$ linearly independent right-hand-side variables, we can always perfectly fit all $N$ stock returns in sample, no matter which variables we choose.
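If you want to see just how mechanical this is, here's a quick simulation sketch in Python (a toy example of my own, not part of the argument above): $N$ randomly chosen, linearly independent right-hand-side variables will fit $N$ returns of pure noise exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50

# Today's returns are pure noise: nothing here is actually predictable.
r = rng.standard_normal(N)

# N randomly chosen right-hand-side variables (linearly independent with probability one).
X = rng.standard_normal((N, N))

# OLS with N regressors and N observations drives the in-sample residuals to (numerically) zero.
beta_hat, *_ = np.linalg.lstsq(X, r, rcond=None)
print(np.max(np.abs(r - X @ beta_hat)))  # ~1e-13: a "perfect" in-sample fit of pure noise
```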
The Bayesian information criterion (BIC) tells us that we should include $x_n$ as a right-hand-side variable only if it explains at least $\log(N)/N$ of the residual variation,

$$\frac{1}{N} \sum_{n=1}^N r_n^2 \, - \, \frac{1}{N} \sum_{n=1}^N (r_n - \hat{\beta} \cdot x_n)^2 \, \geq \, \frac{\log(N)}{N}.$$

But, where does this $\log(N)/N$ penalty come from? And, why is following this rule the Bayesian thing to do? Bayesian updating involves learning about a parameter value by combining prior beliefs with evidence from realized data. So, what parameter are we learning about when using the Bayesian information criterion? And, what are our prior beliefs? These questions are the topic of today's post.
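Before deriving the rule, here's a minimal sketch of what the check looks like in practice (simulated data and variable names are my own choices): estimate the slope by OLS and ask whether the resulting drop in residual variance clears the $\log(N)/N$ hurdle.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1_000

# Candidate predictor, normalized so that (1/N) * sum(x**2) = 1.
x = rng.standard_normal(N)
x = x / np.sqrt(np.mean(x**2))

# Today's returns: a small amount of true signal plus unit-variance noise.
r = 0.10 * x + rng.standard_normal(N)

# Residual variance without x vs. with x (OLS slope, no intercept for simplicity).
beta_ols = np.mean(x * r)
rss_without = np.mean(r**2)
rss_with = np.mean((r - beta_ols * x)**2)

# The BIC rule: include x only if the drop in residual variance is at least log(N)/N.
hurdle = np.log(N) / N
print(rss_without - rss_with, hurdle, rss_without - rss_with >= hurdle)
```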
2. Estimation
Instead of diving directly into our predictor-selection problem (should we include $x_n$ in our model?), let's pause for a second and solve our parameter-estimation problem (how should we estimate the coefficient on $x_n$?). Suppose the data-generating process for returns is

$$r_n = \beta \cdot x_n + \epsilon_n,$$

where $\epsilon_n \overset{\mathrm{iid}}{\sim} \mathrm{N}(0, \sigma^2)$, $\beta \sim \mathrm{N}(0, \tau^2)$, and $x_n$ is normalized so that $\frac{1}{N} \sum_{n=1}^N x_n^2 = 1$. For simplicity, let's also assume that $\sigma = 1$ in the analysis below.
If we see $N$ returns from this data-generating process, $\mathbf{r} = (r_1, r_2, \ldots, r_N)$, then we can estimate $\beta$ by choosing the parameter value that would maximize the posterior probability of realizing these returns:

$$\hat{\beta} = \arg \max_{\beta} \, \Pr(\beta \,|\, \mathbf{r}) = \arg \min_{\beta} \, L(\beta).$$

This is known as maximum a posteriori (MAP) estimation, and the second equality in the expression above points out how we can either maximize the posterior probability or minimize the negative of its log, $L(\beta) = - \tfrac{1}{N} \log \Pr(\beta \,|\, \mathbf{r})$. We can think about $L(\beta)$ as the average improbability of the realized returns given a coefficient value $\beta$.
So, what is the answer? Because $\epsilon_n \sim \mathrm{N}(0, 1)$ and $\beta \sim \mathrm{N}(0, \tau^2)$, we know that

$$L(\beta) = \underbrace{\frac{1}{N} \sum_{n=1}^N \frac{(r_n - \beta \cdot x_n)^2}{2}}_{\text{likelihood}} \, + \, \underbrace{\frac{1}{N} \cdot \frac{\beta^2}{2 \tau^2}}_{\text{prior}} \, + \, \text{constant},$$

where the first term comes from the likelihood of the realized returns and the second term comes from our prior beliefs about $\beta$.
What's more, because we're specifically choosing $\hat{\beta}$ to minimize $L(\beta)$, we also know that

$$0 = L'(\hat{\beta}) = - \frac{1}{N} \sum_{n=1}^N x_n \cdot (r_n - \hat{\beta} \cdot x_n) \, + \, \frac{1}{N} \cdot \frac{\hat{\beta}}{\tau^2}.$$

And, solving this first-order condition for $\hat{\beta}$ tells us exactly how to estimate $\beta$:

$$\hat{\beta} = \left( \frac{\tau^2}{\tau^2 + 1/N} \right) \cdot \frac{1}{N} \sum_{n=1}^N x_n \cdot r_n.$$
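Here's a small sanity check on this formula (a sketch under the assumptions above, with simulation settings of my own choosing): minimizing the average improbability $L(\beta)$ numerically should give the same answer as the closed-form solution to the first-order condition.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
N, tau = 500, 0.25

# Simulate the data-generating process: r_n = beta * x_n + eps_n, with sigma = 1.
x = rng.standard_normal(N)
x = x / np.sqrt(np.mean(x**2))           # normalize so (1/N) * sum(x**2) = 1
beta = tau * rng.standard_normal()        # prior: beta ~ N(0, tau^2)
r = beta * x + rng.standard_normal(N)

# Average improbability L(beta): the negative log posterior over N, up to a constant.
def L(b):
    return np.mean((r - b * x)**2) / 2 + b**2 / (2 * N * tau**2)

# MAP estimate by direct minimization...
beta_map = minimize_scalar(L).x

# ...matches the closed-form solution of the first-order condition.
beta_closed = (tau**2 / (tau**2 + 1 / N)) * np.mean(x * r)
print(beta_map, beta_closed)
```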
3. Selection
Now that we've seen the solution to our parameter-estimation problem, let's get back to solving our predictor-selection problem. Should we include $x_n$ in our predictive model of tomorrow's returns? It turns out that answering this question means learning about the prior variance of $\beta$. Is $\beta$ equally likely to take on any value, $\tau = \infty$? Or, should we assume that $\beta = 0$ regardless of the evidence, $\tau = 0$?
To see where the first choice comes from, let's think about the priors we're implicitly adopting when we include $x_n$ in our predictive model. Since $\mathrm{E}[\tilde{r}_n] = \hat{\beta} \cdot x_n$, this means looking for a $\tau$ such that $\hat{\beta} = \hat{\beta}_{\mathrm{OLS}}$, where $\hat{\beta}_{\mathrm{OLS}} = \frac{1}{N} \sum_{n=1}^N x_n \cdot r_n$ is the OLS coefficient given our normalization of $x_n$. Inspecting the solution to our parameter-estimation problem reveals that

$$\lim_{\tau \to \infty} \hat{\beta} = \lim_{\tau \to \infty} \left( \frac{\tau^2}{\tau^2 + 1/N} \right) \cdot \frac{1}{N} \sum_{n=1}^N x_n \cdot r_n = \hat{\beta}_{\mathrm{OLS}}.$$

So, by including $x_n$, we're adopting an agnostic prior, $\tau = \infty$, that $\beta$ is equally likely to be any value under the sun.
To see where the second choice comes from, let's think about the priors we're implicitly adopting when we exclude $x_n$ from our predictive model. This means looking for a $\tau$ such that $\hat{\beta} = 0$ regardless of the realized data, $\mathbf{r}$. Again, inspecting the formula for $\hat{\beta}$ reveals that

$$\lim_{\tau \to 0} \hat{\beta} = \lim_{\tau \to 0} \left( \frac{\tau^2}{\tau^2 + 1/N} \right) \cdot \frac{1}{N} \sum_{n=1}^N x_n \cdot r_n = 0.$$

So, by excluding $x_n$, we're adopting a religious prior, $\tau = 0$, that $\beta = 0$ regardless of any new evidence.
Thus, when we decide whether to include $x_n$ in our predictive model, what we're really doing is learning about our priors. So, after seeing $N$ returns, $\mathbf{r}$, we can decide whether to include $x_n$ in our predictive model by choosing the prior variance, $\tau \in \{0, \infty\}$, that maximizes the posterior probability of realizing these returns,

$$\hat{\tau} = \arg \max_{\tau \in \{0, \infty\}} \, \Pr(\mathbf{r} \,|\, \tau) = \arg \min_{\tau \in \{0, \infty\}} \, \left\{ - \tfrac{1}{N} \log \Pr(\mathbf{r} \,|\, \tau) \right\},$$

where the second equality in the expression above points out how we can either maximize the posterior probability or minimize the negative of its log, i.e., its average improbability. Either way, if we estimate $\hat{\tau} = \infty$, then we should include $x_n$; whereas, if we estimate $\hat{\tau} = 0$, then we shouldn't.
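To see this selection rule in action, here's a sketch that compares the probability of the realized returns under the two priors. Since a literal $\tau = \infty$ is improper, the agnostic case is approximated below by a large but finite $\tau$; the simulation settings are my own choices, not part of the argument above.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(3)
N = 500

x = rng.standard_normal(N)
x = x / np.sqrt(np.mean(x**2))           # normalize so (1/N) * sum(x**2) = 1

# Returns generated with a real (if modest) amount of predictability.
r = 0.15 * x + rng.standard_normal(N)

# Under the model, r | tau ~ N(0, I + tau^2 * x x'), so the probability of the realized
# returns has a closed form for any prior variance tau^2.
def log_marginal(tau):
    cov = np.eye(N) + tau**2 * np.outer(x, x)
    return multivariate_normal(mean=np.zeros(N), cov=cov).logpdf(r)

# "Religious" prior (tau = 0, i.e. beta = 0) vs. a nearly agnostic prior (large tau).
print(log_marginal(0.0), log_marginal(10.0))
# Include x_n only if the (nearly) agnostic prior makes the realized returns more probable.
```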
4. Why log(N)/N?
The posterior probability of the realized returns given our choice of priors is given by

$$\Pr(\mathbf{r} \,|\, \tau) = \int_{-\infty}^{\infty} \Pr(\mathbf{r} \,|\, \beta) \cdot \Pr(\beta \,|\, \tau) \, \mathrm{d}\beta.$$

In this section, we're going to see how to evaluate this integral. And, in the process, we're going to see precisely where that $\log(N)/N$ penalty term in the Bayesian information criterion comes from.
Here's the key insight in plain English. The realized returns are affected by noise shocks, $\epsilon_n$. By definition, excluding $x_n$ from our predictive model means that we aren't learning about $\beta$ from the realized returns, so there's no way for these noise shocks to affect either our estimate of $\beta$ or our posterior-probability calculations. By contrast, if we include $x_n$ in our predictive model, then we are learning about $\beta$ from the realized returns, so these noise shocks will distort both our estimate of $\beta$ and our posterior-probability calculations. The distortion caused by these noise shocks is going to be the source of the $\log(N)/N$ penalty term in the Bayesian information criterion.
Now, here's the same insight in Mathese. Take a look at the Taylor expansion of $L(\beta)$ around $\hat{\beta}$,

$$L(\beta) = L(\hat{\beta}) \, + \, \tfrac{1}{2} \cdot L''(\hat{\beta}) \cdot (\beta - \hat{\beta})^2.$$

There's no first-order term because $\hat{\beta}$ is chosen to minimize $L(\beta)$, and there are no higher-order terms because both $\epsilon_n$ and $\beta$ are normally distributed. From the formula for $L(\beta)$ we can calculate that

$$L''(\hat{\beta}) = \frac{1}{N} \sum_{n=1}^N x_n^2 \, + \, \frac{1}{N} \cdot \frac{1}{\tau^2} = 1 \, + \, \frac{1}{N \cdot \tau^2}.$$
Recall that $L(\beta)$ measures the average improbability of realizing $\mathbf{r}$ given a coefficient value $\beta$. So, if $L''(\hat{\beta}) = \infty$ for a given choice of priors, then having any $\beta \neq \hat{\beta}$ is infinitely improbable under those priors. And, this is exactly what we find when we exclude $x_n$ from our predictive model, $\lim_{\tau \to 0} L''(\hat{\beta}) = \infty$. By contrast, if we include $x_n$ in our predictive model, then $\lim_{\tau \to \infty} L''(\hat{\beta}) = 1 < \infty$, meaning that we are willing to entertain the idea that $\beta \neq \hat{\beta}$ due to distortions caused by the noise shocks.
To see why these distortions warrant a penalty, all we have to do then is evaluate the integral. First, let's think about the case where we exclude $x_n$ from our predictive model. We just saw that, if $\tau = 0$, then we are unwilling to consider any parameter values besides $\beta = 0$. So, the integral equation for our posteriors given that $\tau = 0$ simplifies to

$$\Pr(\mathbf{r} \,|\, \tau = 0) = \int_{-\infty}^{\infty} \Pr(\mathbf{r} \,|\, \beta) \cdot \Pr(\beta \,|\, \tau = 0) \, \mathrm{d}\beta = \Pr(\mathbf{r} \,|\, \beta = 0).$$

This means that the average improbability of realizing $\mathbf{r}$ given the priors $\tau = 0$ is given by

$$- \tfrac{1}{N} \log \Pr(\mathbf{r} \,|\, \tau = 0) = \frac{1}{N} \sum_{n=1}^N \frac{r_n^2}{2} \, + \, \tfrac{1}{2} \log(2 \pi).$$
To calculate our posterior beliefs when we include $x_n$, let's use the Taylor expansion of $L(\beta)$ around $\hat{\beta}$ again,

$$\Pr(\mathbf{r} \,|\, \tau = \infty) = \int_{-\infty}^{\infty} \Pr(\mathbf{r} \,|\, \beta) \cdot \Pr(\beta \,|\, \tau = \infty) \, \mathrm{d}\beta \approx \Pr(\mathbf{r} \,|\, \hat{\beta}) \times \int_{-\infty}^{\infty} e^{- \frac{N}{2} \cdot L''(\hat{\beta}) \cdot (\beta - \hat{\beta})^2} \, \mathrm{d}\beta.$$

The first term is the probability of observing the realized returns assuming that $\beta = \hat{\beta}$. The second term is a penalty that accounts for the fact that the true $\beta$ might be different from the estimated $\hat{\beta}$ in finite samples. Due to the central-limit theorem, this difference between $\beta$ and $\hat{\beta}$ is going to shrink at a rate of $\sqrt{1/N}$:

$$\int_{-\infty}^{\infty} e^{- \frac{N}{2} \cdot L''(\hat{\beta}) \cdot (\beta - \hat{\beta})^2} \, \mathrm{d}\beta = \sqrt{\frac{2 \pi}{N \cdot L''(\hat{\beta})}} \, \propto \, \sqrt{1/N}.$$

So, the average improbability of realizing $\mathbf{r}$ given the priors $\tau = \infty$ is given by

$$- \tfrac{1}{N} \log \Pr(\mathbf{r} \,|\, \tau = \infty) = \frac{1}{N} \sum_{n=1}^N \frac{(r_n - \hat{\beta} \cdot x_n)^2}{2} \, + \, \tfrac{1}{2} \log(2 \pi) \, + \, \tfrac{1}{2} \cdot \frac{\log(N)}{N} \, + \, O(1/N),$$

where $O(1/N)$ is big-"O" notation denoting terms that shrink faster than $\log(N)/N$ as $N \to \infty$.
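Here's a numerical check of this approximation (a sketch of my own that treats the $\tau = \infty$ case as a flat prior with unit density, as in the integral above): the gap between the exact average improbability and the BIC approximation shrinks like $1/N$, which is indeed faster than $\log(N)/N$.

```python
import numpy as np

rng = np.random.default_rng(4)

def bic_gap(N):
    # Simulated returns; the tau = infinity case is treated as a flat prior with unit density.
    x = rng.standard_normal(N)
    x = x / np.sqrt(np.mean(x**2))                 # normalize so (1/N) * sum(x**2) = 1
    r = 0.1 * x + rng.standard_normal(N)

    beta_hat = np.mean(x * r)                      # OLS estimate = MAP under the flat prior
    L_hat = np.mean((r - beta_hat * x)**2) / 2 + np.log(2 * np.pi) / 2

    # "Exact" answer: -(1/N) * log of the integral of Pr(r | beta) over beta, computed on
    # a fine grid around beta_hat, with the peak factored out to avoid numerical underflow.
    grid = beta_hat + np.linspace(-20, 20, 4_001) / np.sqrt(N)
    L_grid = np.array([np.mean((r - b * x)**2) / 2 + np.log(2 * np.pi) / 2 for b in grid])
    integral = np.sum(np.exp(-N * (L_grid - L_hat))) * (grid[1] - grid[0])
    exact = L_hat - np.log(integral) / N

    # BIC approximation: the fit at the peak plus the log(N)/(2N) penalty.
    approx = L_hat + np.log(N) / (2 * N)
    return exact - approx

for N in (100, 1_000, 10_000):
    print(N, bic_gap(N))   # gap is roughly -log(2*pi)/(2N), shrinking faster than log(N)/N
```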
5. Formatting
Bringing everything together, hopefully it's now clear why we can decide whether to include $x_n$ in our predictive model by checking whether

$$\frac{1}{N} \sum_{n=1}^N r_n^2 \, - \, \frac{1}{N} \sum_{n=1}^N (r_n - \hat{\beta} \cdot x_n)^2 \, \geq \, \frac{\log(N)}{N}.$$

The $\log(N)/N$ penalty term accounts for the fact that you're going to be overfitting the data in sample when you include more right-hand-side variables. This criterion was first proposed in Schwarz (1978), who showed that the criterion becomes exact as $N \to \infty$.
The Bayesian information criterion is often written as an optimization problem as well:

$$\hat{\beta}_{\mathrm{BIC}} = \arg \min_{\beta} \, \left\{ \, \frac{1}{N} \sum_{n=1}^N (r_n - \beta \cdot x_n)^2 \, + \, \frac{\log(N)}{N} \cdot \mathbf{1}\{\beta \neq 0\} \, \right\}.$$

Both ways of writing down the criterion are the same. They just look different due to formatting. There is one interesting idea that pops out of writing down the Bayesian information criterion as an optimization problem. Solving for the optimal $\hat{\beta}_{\mathrm{BIC}}$ suggests that you should completely ignore any predictors with sufficiently small OLS coefficients:

$$\hat{\beta}_{\mathrm{BIC}} = \hat{\beta}_{\mathrm{OLS}} \cdot \mathbf{1}\!\left\{ \, |\hat{\beta}_{\mathrm{OLS}}| \geq \sqrt{\log(N)/N} \, \right\}.$$
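Here's what that hard-thresholding rule looks like as code (a toy sketch with simulation settings of my own choosing): the predictor is kept at its OLS value only if the coefficient clears the $\sqrt{\log(N)/N}$ threshold.

```python
import numpy as np

def bic_select(r, x):
    """Keep the OLS slope only if it clears the BIC threshold sqrt(log(N)/N)."""
    N = len(r)
    x = x / np.sqrt(np.mean(x**2))                 # normalize so (1/N) * sum(x**2) = 1
    beta_ols = np.mean(x * r)
    return beta_ols if abs(beta_ols) >= np.sqrt(np.log(N) / N) else 0.0

rng = np.random.default_rng(5)
N = 2_500
x = rng.standard_normal(N)

# A predictor with a tiny true coefficient usually gets thrown out entirely...
print(bic_select(0.01 * x + rng.standard_normal(N), x))   # usually 0.0
# ...while one with a larger coefficient is kept at (roughly) its OLS value.
print(bic_select(0.20 * x + rng.standard_normal(N), x))   # usually close to 0.20
```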