Imagine you’re an asset-pricing researcher. You’ve just thought up a new variable, $x$, that might predict the cross-section of returns. And you’ve regressed each asset’s return, $r_n$, on its value of the predictor, $x_n$, in a market environment $e$ of your choosing (i.e., using data on some specific time period, country, asset class, set of test assets, etc.):
$$r_n = \hat{\alpha}_e + \hat{\beta}_e \cdot x_n + \hat{\varepsilon}_n \tag{1}$$
If differences in $x$ predict differences in returns in your chosen market environment $e$, the estimated slope coefficient will be large, $|\hat{\beta}_e| \gg 0$. It would’ve been profitable to trade on the predictor in sample.
Suppose you find $\hat{\beta}_e > 0$. Assets with higher $x$ values today tend to have higher returns tomorrow. You now face a choice about whether to publish this finding. If you do, then other researchers will read your paper and try to replicate it in other market environments you haven’t yet looked at, $e' \neq e$. Let $\mathcal{E}$ denote the collection of all out-of-sample market environments that your colleagues might examine.
Obviously, you shouldn’t publish if $x$ isn’t a good cross-sectional predictor in most of these out-of-sample environments, i.e., you shouldn’t publish if $\mathrm{E}_{e' \in \mathcal{E}}[\hat{\beta}_{e'}] \leq 0$. But, even if $x$ is a good predictor on average, you still worry about worst-case scenarios. If there’s one market environment $e' \in \mathcal{E}$ where $\hat{\beta}_{e'} < 0$, then one of your colleagues will surely discover it and you’ll look utterly foolish when he tells the world. You only want to publish if $x$ robustly predicts returns out-of-sample:
$$\mathrm{E}_{e' \in \mathcal{E}}[\hat{\beta}_{e'}] + \lambda \cdot \min_{e' \in \mathcal{E}} \hat{\beta}_{e'} > 0 \tag{2}$$
$\lambda \geq 0$ captures the relative importance of these two considerations to your publication decision. The larger the $\lambda$, the more you care about saving face by not publishing any really bad predictions.
Importantly, let’s assume that all you care about when doing research is solving this robust out-of-sample prediction problem. You don’t care at all about whether investors actually price assets based on $x$. All that matters is whether $x$ reliably predicts returns out-of-sample. You’re completely drunk on Friedman’s “as if” Kool-Aid.
Before deciding whether to publish, you have a choice as to which market environment to examine. What sort of environment should you choose? What should your empirical strategy be?
The key insight in this post is that, even if all you care about is robust out-of-sample performance, causal inference still turns out to be a useful tool for achieving this goal. If investors always use the same model to price assets, then understanding this model will allow you to always make good predictions. Your empirical strategy should be to choose an empirical environment that identifies the causal effect of $x$ on returns.
Investors’ model
I begin by defining investors’ model. Suppose that, in every market environment, investors price each asset so that its returns are governed by the following linear structural model:
$$r_n = \beta \cdot x_n + \gamma \cdot z_n + \sigma \cdot \varepsilon_n \tag{3}$$
Moreover, assume that the parameters, $(\beta, \gamma, \sigma)$, are the same in every market environment. $x_n$ is the cross-sectional predictor that you’re working on, and $z_n$ is an omitted variable. This is a variable that investors might be using to price assets but researchers have yet to discover. If it’s 1981 and $x_n$ is firm size, then $z_n$ might be liquidity. $\varepsilon_n$ is a noise term and $\sigma$ captures its effect on returns.
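Crucially, either $x$ affects returns, $\beta \neq 0$, or $z$ affects returns, $\gamma \neq 0$, but not both in investors’ model. If $\beta \neq 0$, then $x$ reliably predicts the cross-section of returns since $\beta$ is the same in every environment, i.e., in every time period, country, etc. If $\beta = 0$, then any predictability associated with $x$ is spurious. Let
$$\Theta = \{(\beta, \gamma) : \beta \neq 0, \, \gamma = 0\} \cup \{(\beta, \gamma) : \beta = 0, \, \gamma \neq 0\} \tag{4}$$
denote the entire range of possible values that $\beta$ and $\gamma$ might take on.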
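To keep things simple, suppose that the realized values of $x_n$, $z_n$, and $\varepsilon_n$ for each asset in a given market environment $e$ are drawn IID normal:
$$\begin{pmatrix} x_n \\ z_n \\ \varepsilon_n \end{pmatrix} \overset{\text{IID}}{\sim} \mathrm{N}\!\left( \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \, \begin{pmatrix} 1 & \rho_e & 0 \\ \rho_e & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \right) \tag{5}$$
$x_n$, $z_n$, and $\varepsilon_n$ all have mean zero and unit variance. The noise term $\varepsilon_n$ is uncorrelated with both $x_n$ and $z_n$ in every market environment, $\mathrm{Corr}[x_n, \varepsilon_n] = \mathrm{Corr}[z_n, \varepsilon_n] = 0$. However, $x_n$ and $z_n$ may be correlated across stocks, $\mathrm{Corr}[x_n, z_n] = \rho_e \in [-1, 1]$. Moreover, this correlation can differ across market environments. In other words, $x_n$ and $z_n$ may be highly correlated in one time period/country/asset class/etc. but not in another.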
Note that Equations (3) and (5) imply asset returns are zero on average, $\mathrm{E}[r_n] = 0$, in every market environment. I’m making this assumption to keep the math simple. If it really bothers you, just think about $r_n$ as an asset’s residual return that is unexplained by other trading signals. Let’s also assume that $\mathrm{Var}[r_n] = 1$ in every market environment for the same reasons. Because only one of $\beta$ and $\gamma$ is non-zero, this assumption implies that $\sigma^2 = 1 - \beta^2 - \gamma^2$.
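To make the setup concrete, here’s a minimal simulation sketch of a single market environment. It assumes the structural model is linear in the predictor, the omitted variable, and the noise term, with loadings I’ll call `beta`, `gamma`, and `sigma`, and that the predictor and the omitted variable have correlation `rho`. The function names, parameter values, and sample size are hypothetical choices for illustration, not anything estimated from data.

```python
import numpy as np

def simulate_environment(beta, gamma, rho, n_assets=100_000, seed=0):
    """Draw the predictor x and returns r for one market environment."""
    if beta != 0 and gamma != 0:
        raise ValueError("only one of beta and gamma may be non-zero")
    rng = np.random.default_rng(seed)
    # x and z have mean zero, unit variance, and correlation rho.
    cov = np.array([[1.0, rho], [rho, 1.0]])
    x, z = rng.multivariate_normal(np.zeros(2), cov, size=n_assets).T
    eps = rng.standard_normal(n_assets)  # noise, independent of x and z
    # Unit-variance returns pin down the noise loading, since beta * gamma = 0.
    sigma = np.sqrt(1.0 - beta**2 - gamma**2)
    r = beta * x + gamma * z + sigma * eps
    return x, r

def slope(x, r):
    """Cross-sectional OLS slope from regressing r on x (with an intercept)."""
    return np.cov(x, r)[0, 1] / np.var(x, ddof=1)

# Hypothetical case 1: x itself matters for returns.
x_a, r_a = simulate_environment(beta=0.3, gamma=0.0, rho=0.5)
# Hypothetical case 2: only the omitted variable matters, but it is correlated with x.
x_b, r_b = simulate_environment(beta=0.0, gamma=0.6, rho=0.5)
print(slope(x_a, r_a), slope(x_b, r_b))
```

Both slopes should come out close to 0.3, even though the predictor only enters investors’ model in the first case. A single regression can’t tell the two cases apart.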
Two explanations
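When you regressed the cross-section of returns on $x$ in your chosen market environment $e$, you found that $\hat{\beta}_e > 0$. Given the structure of investors’ model, we know that either $x$ predicts the cross-section of returns or $z$ predicts the cross-section of returns but not both. So, setting aside estimation error, the slope coefficient has one of two sources:
$$\hat{\beta}_e = \beta + \gamma \cdot \rho_e = \begin{cases} \beta & \text{if } \beta \neq 0 \text{ and } \gamma = 0 \\ \gamma \cdot \rho_e & \text{if } \beta = 0 \text{ and } \gamma \neq 0 \end{cases} \tag{6}$$
It might be that you estimated $\hat{\beta}_e > 0$ in market environment $e$ because $\beta > 0$ in every environment. Or it might be that you estimated $\hat{\beta}_e > 0$ in market environment $e$ because $x$ happened to be correlated with an omitted variable in that environment, $\gamma \cdot \rho_e > 0$. These are the two possible explanations.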
Since you’re focused on robust out-of-sample performance in other market environments $e' \in \mathcal{E}$, the reason why $\hat{\beta}_e > 0$ in-sample is very important. If $\hat{\beta}_e > 0$ merely because $\gamma \cdot \rho_e > 0$, then $x$ will only be a good cross-sectional predictor in other market environments $e'$ where $x$ and $z$ are similarly correlated, $\rho_{e'} \approx \rho_e$. Under this explanation, it’s possible to imagine market environments where $x$ is an abysmal predictor. Just look for environments where $\gamma \cdot \rho_{e'} < 0$.
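Here’s a small sketch of how differently the two explanations travel out of sample. It leans on a standard fact about linear models with unit-variance regressors: the large-sample slope from regressing returns on the predictor equals the predictor’s own loading plus the omitted variable’s loading times their correlation. The names `beta`, `gamma`, and `rho`, and all of the numbers, are hypothetical.

```python
def population_slope(beta, gamma, rho):
    """Large-sample slope from regressing returns on x: Cov[r, x] / Var[x]."""
    return beta + gamma * rho

# Two hypothetical parameterizations that both produce a slope of 0.3 when rho = 0.5.
explanations = {
    "x matters (beta=0.3, gamma=0)": (0.3, 0.0),
    "z matters (beta=0, gamma=0.6)": (0.0, 0.6),
}

# Correlations between x and z that might show up in other market environments.
out_of_sample_rhos = [-1.0, -0.5, 0.0, 0.5, 1.0]

slopes = {
    label: [population_slope(beta, gamma, rho) for rho in out_of_sample_rhos]
    for label, (beta, gamma) in explanations.items()
}

for label, values in slopes.items():
    print(label, [round(v, 2) for v in values])
```

Under the first explanation the slope stays put no matter how the predictor and the omitted variable happen to be correlated. Under the second it shrinks, vanishes, and flips sign as the correlation moves.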
Causal inference
What needs to be true about market environment $e$ if you want to be able to distinguish between these two explanations? The answer boils down to an identifying assumption about the range of values that $\rho_e$ might take on:
$$\rho_e \in \mathrm{P}_e \subseteq [-1, 1] \tag{7}$$
A market environment, $e$, consists of a set of structural parameters, $(\beta, \gamma) \in \Theta$; a range of possible values for the correlation between $x$ and $z$, $\mathrm{P}_e$; and a particular choice for this value, $\rho_e \in \mathrm{P}_e$.
If market environments $e$ and $e'$ have the same structural parameters and the same correlation, $\rho_e = \rho_{e'}$, then the cross-sectional slope coefficient will be the same in both environments, $\hat{\beta}_e = \hat{\beta}_{e'}$. Yet, you will interpret the slope coefficient differently in each environment if $\mathrm{P}_e \neq \mathrm{P}_{e'}$. By analogy, medical researchers will draw different conclusions about a drug’s efficacy from an RCT than from an observational study even if the joint distribution of patient outcomes and observable characteristics is the same in both datasets. If $\mathrm{P}_e = \{0\}$, then $\hat{\beta}_e > 0$ identifies $\beta > 0$ as the correct explanation. There’s no way to have $\gamma \cdot \rho_e > 0$ in such an environment. By contrast, if $\mathrm{P}_e = [-1, 1]$, then $\hat{\beta}_e > 0$ could be explained either by $\beta > 0$ or by $\gamma \cdot \rho_e > 0$.
Note that it isn’t possible to choose a market environment where $\mathrm{P}_e$ consists of an arbitrarily small neighborhood around zero. The omitted variable $z$ can explain no more than $100\%$ of the variation in returns across assets. That would occur if $|\gamma| = 1$ since we are assuming $\mathrm{Var}[r_n] = 1$. Hence, if $\beta = 0$ and $\hat{\beta}_e > 0$ due to a spurious correlation, then this correlation must be bounded away from zero:
$$|\rho_e| = \frac{\hat{\beta}_e}{|\gamma|} \geq \hat{\beta}_e > 0 \tag{8}$$
This digital zero/non-zero distinction is why it’s possible to map out causal effects using path diagrams. A path between two variables must be contemplated whenever they could have a non-zero correlation.
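A quick numerical sketch of why the spurious correlation can’t be arbitrarily small. It assumes that, when the predictor has no effect of its own, the in-sample slope equals the omitted variable’s loading times its correlation with the predictor, and that this loading is at most one in absolute value because returns have unit variance. The helper name `implied_correlation` and the numbers are hypothetical.

```python
def implied_correlation(beta_hat, gamma):
    """Correlation between x and z needed for gamma * rho to equal beta_hat."""
    rho = beta_hat / gamma
    if abs(rho) > 1.0:
        raise ValueError("no valid correlation: |gamma| is smaller than |beta_hat|")
    return rho

beta_hat = 0.3  # hypothetical in-sample slope

# Candidate loadings on the omitted variable, capped at 1 by unit-variance returns.
candidate_gammas = [0.3, 0.5, 0.75, 1.0]
implied = {gamma: implied_correlation(beta_hat, gamma) for gamma in candidate_gammas}

# The weakest correlation that can rationalize the slope occurs at |gamma| = 1.
smallest_abs_rho = min(abs(rho) for rho in implied.values())
print(implied, smallest_abs_rho)
```

The smallest correlation that can generate the slope is the slope itself, reached when the omitted variable explains all of the variation in returns. Any weaker correlation would need a loading larger than one.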
Out-of-sample environments
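When you regressed the cross-section of returns on $x$ in market environment $e$, you found $\hat{\beta}_e > 0$. We can now give a precise definition for the set of all out-of-sample market environments that other researchers might try to replicate this finding in. Let
$$\Theta_e = \{(\beta, \gamma) \in \Theta : \beta + \gamma \cdot \rho = \hat{\beta}_e \text{ for some } \rho \in \mathrm{P}_e\} \tag{9}$$
denote the range of possible values for $\beta$ and $\gamma$ that are consistent with your initial estimate $\hat{\beta}_e > 0$ given $\mathrm{P}_e$. If $x$ is guaranteed to be uncorrelated with the omitted variable, $\mathrm{P}_e = \{0\}$, then $\Theta_e = \{(\hat{\beta}_e, 0)\}$ and we say that market environment $e$ identifies changes in $x$ as having a causal effect on the cross-section of returns.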
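Given the set of all $\beta$ and $\gamma$ values that are consistent with your initial result $\hat{\beta}_e > 0$, the range of potential out-of-sample market environments is defined as follows:
$$\mathcal{E} = \{ e' = (\beta, \gamma, \rho_{e'}) : (\beta, \gamma) \in \Theta_e, \, \rho_{e'} \in [-1, 1] \} \tag{10}$$
This collection of market environments consists of any environment which could be generated by some $\beta$ and $\gamma$ that’s consistent with your initial result combined with any possible value of $\rho_{e'}$.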
Research strategy
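If you chose a market environment $e$ that identified the causal effect of $x$ on the cross-section of returns for your initial test, then your estimate of $\hat{\beta}_e > 0$ would imply that $\hat{\beta}_{e'} = \beta = \hat{\beta}_e > 0$ in every out-of-sample market environment. The left-hand side of Equation (2) would reduce to:
$$\mathrm{E}_{e' \in \mathcal{E}}[\hat{\beta}_{e'}] + \lambda \cdot \min_{e' \in \mathcal{E}} \hat{\beta}_{e'} = \hat{\beta}_e + \lambda \cdot \hat{\beta}_e = (1 + \lambda) \cdot \hat{\beta}_e > 0 \tag{11}$$
The finding would be robust out-of-sample, and you should publish it.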
By contrast, if you chose a market environment $e$ that did not identify the causal effect of $x$ on the cross-section of returns, then your estimate of $\hat{\beta}_e > 0$ would be harder to interpret. It could be that $\beta > 0$ or it could be that $\gamma \cdot \rho_e > 0$. If the latter is true, then we would say that $\hat{\beta}_e > 0$ reflects a spurious correlation. And to make this spurious correlation look as bad as possible out-of-sample, other researchers should look for a market environment $e'$ where $\rho_{e'} = -\mathrm{sign}(\gamma)$:
$$\hat{\beta}_{e'} = \gamma \cdot \rho_{e'} = -|\gamma| \leq -\hat{\beta}_e < 0 \tag{12}$$
Absent identification, you have to entertain this possibility. So you may refuse to publish strong results with good average-case out-of-sample performance for fear of being embarrassed by worst-case predictions.
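Here’s a rough sketch of that publication decision. It assumes the criterion is the average out-of-sample slope plus a face-saving weight, `lam`, times the worst out-of-sample slope; that functional form, the function name `publication_score`, and the made-up mix of out-of-sample environments are my assumptions for illustration.

```python
import numpy as np

def publication_score(slopes, lam):
    """Average out-of-sample slope plus lam times the worst out-of-sample slope."""
    slopes = np.asarray(slopes, dtype=float)
    return slopes.mean() + lam * slopes.min()

beta_hat = 0.3  # hypothetical in-sample slope
gamma = 0.6     # hypothetical loading on the omitted variable if the result is spurious

# Hypothetical out-of-sample environments: x and z are usually correlated the way they
# were in sample (rho = 0.5), but the correlation flips to -1 in 5% of environments.
rho_out = np.array([0.5] * 95 + [-1.0] * 5)

identified_slopes = np.full(rho_out.shape, beta_hat)  # x matters: slope never moves
spurious_slopes = gamma * rho_out                     # z matters: slope is gamma * rho

for lam in (0.0, 1.0):
    print(
        lam,
        round(publication_score(identified_slopes, lam), 3),
        round(publication_score(spurious_slopes, lam), 3),
    )
```

With `lam = 0` both cases clear the bar, because the spurious predictor still works on average. Once the worst case gets any real weight, the spurious case fails while the identified one passes comfortably.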
Thus, as outlined at the beginning, even if all you care about as a researcher is publishing results that have robust out-of-sample performance, causal inference still turns out to be relevant. It’s a very useful tool for achieving this goal. If investors are always using the same model to price assets, then understanding this model will allow you to always make good predictions. So you should consider adopting a research strategy whereby you insist on testing each new predictor in an identified market environment, $\mathrm{P}_e = \{0\}$.
No free lunch
I recognize that identifying causal effects is hard. Running RCTs is hard. Finding valid instrumental variables is hard. It’s hard to find a market environment that identifies the causal effect of a change in $x$ on returns, i.e., to find a market environment $e$ where it’s reasonable to assume that $\mathrm{P}_e = \{0\}$.
So you might be thinking: “Can’t I just get around the problem by checking lots of different market environments before publishing? If $\hat{\beta}_{e_k} > 0$ in lots of different market environments $e_1, e_2, \ldots, e_K$, then shouldn’t I be more confident in $x$’s out-of-sample performance? After all, in real life, no researcher would (or could!) publish a result about cross-sectional predictability based on one regression.”
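It’s absolutely true that you do learn something about $x$’s out-of-sample performance when you verify that $\hat{\beta}_{e_k} > 0$ in many different market environments $e_1, e_2, \ldots, e_K$. Unfortunately, the something that you learn only applies to $x$’s average-case performance, $\mathrm{E}_{e' \in \mathcal{E}}[\hat{\beta}_{e'}]$. For example, if $\hat{\beta}_{e'} \geq \hat{\beta}_e$ in more than half of all possible out-of-sample environments, then there’s no way for $\mathrm{E}_{e' \in \mathcal{E}}[\hat{\beta}_{e'}] < 0$ unless the slope falls below $-\hat{\beta}_e$ in some of the remaining market environments, the worst-case scenario from Equation (12).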
Yet, until you check every imaginable out-of-sample environment, you can say nothing new about the worst-case outcome. No matter how many environments you check in $\mathcal{E}$, you can never rule out that $\hat{\beta}_{e'} < 0$ in one of the remaining environments in $\mathcal{E}$ that you haven’t checked. Thus, if you care about never publishing a result that makes an embarrassingly bad out-of-sample prediction in some situation, $\lambda > 0$, then simply doing lots of in-sample checks isn’t a viable research strategy on its own. It certainly doesn’t hurt. But it DOES NOT tell you anything about out-of-sample robustness in the setup thus far.
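A toy sketch of the problem with checking more and more environments. It assumes the result is spurious, so that the slope in each environment is the omitted variable’s loading times that environment’s correlation with the predictor. The 10,000 environments and the single bad one hiding at the end are entirely made up.

```python
import numpy as np

gamma = 0.6                     # hypothetical: the result is spurious (beta = 0)
rho_all = np.full(10_000, 0.5)  # correlation between x and z in each environment
rho_all[-1] = -1.0              # one environment where the correlation reverses
slopes_all = gamma * rho_all    # slope you would estimate in each environment

# Check progressively more environments, always missing the last one.
for n_checked in (10, 100, 1_000, 9_999):
    checked = slopes_all[:n_checked]
    print(n_checked, round(checked.mean(), 3), round(checked.min(), 3))

worst_checked = slopes_all[:9_999].min()
worst_overall = slopes_all.min()
print(worst_checked, worst_overall)
```

Every batch of checks reports the same reassuring average and the same reassuring minimum. The one environment that would embarrass you never shows up until you’ve looked at all of them.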
The thing that makes causal inference difficult is that it requires making a strong assumption about the joint distribution of $x$ and an unobserved/unknown/omitted variable $z$ in a particular market environment $e$. You have to assume that $\mathrm{P}_e = \{0\}$. Such identifying assumptions can be hard to stomach. However, the assumption that you would need to make for in-sample robustness to guarantee out-of-sample robustness is even less palatable. TANSTAAFL. Instead of requiring that $\rho_e = 0$ in some specific market environment, you would need to assume that $\gamma \cdot \rho_{e'} \geq 0$ in all out-of-sample environments $e' \in \mathcal{E}$. Such an assumption is tantamount to simply assuming the result you’re after, namely out-of-sample robustness.