Causal inference as a tool for publishing robust results

Imagine you’re an asset-pricing researcher. You’ve just thought up a new variable, $X$, that might predict the cross-section of returns. And you’ve regressed returns on $X$ in a market environment $e$ of your choosing (i.e., using data on some specific time period, country, asset class, set of test assets, etc.):

(1) $\begin{equation*} R(i) = \alpha_e + \beta_e \cdot X(i) + \epsilon_e(i) \qquad \text{for assets } i=1,\ldots,\,I \end{equation*}$

If differences in $X$ predict differences in returns in your chosen market environment $e$, the estimated slope coefficient will be large, $|\beta_e| \gg 0$. It would’ve been profitable to trade on the predictor in sample.
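To make Equation (1) concrete, here’s a minimal sketch of the regression in plain Python. The sample size (5,000 assets), the true slope (0.3), and the helper name `ols_slope` are all made up for illustration; nothing here comes from real data.

```python
import random


def ols_slope(x, y):
    """Slope from regressing y on x with an intercept, as in Equation (1)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    return cov_xy / var_x


# Hypothetical cross-section: returns load on X with slope 0.3 plus unit noise.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(5000)]
r = [0.3 * xi + rng.gauss(0, 1) for xi in x]

beta_hat = ols_slope(x, r)
print(f"estimated slope: {beta_hat:.3f}")
```

With this many assets the estimate should land close to the slope used to generate the data, which is all "large $|\beta_e|$" means in sample.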

Suppose you find $\beta_e \gg 0$ . Assets with higher $X$ values today tend to have higher returns tomorrow. You now face a choice about whether to publish this finding. If you do, then other researchers will read your paper and try to replicate it in other market environments you haven’t yet looked at, $e' \neq e$ . Let $\mathsf{OoS}_e$ denote the collection of all out-of-sample market environments that your colleagues might examine.

Obviously, you shouldn’t publish if $X$ isn’t a good cross-sectional predictor in most of these out-of-sample environments—i.e., you shouldn’t publish if $\text{average}_{e' \in \mathsf{OoS}_e} \, \beta_{e'} = \frac{1}{|\mathsf{OoS}_e|} \cdot \sum_{e' \in \mathsf{OoS}_e} \, \beta_{e'} < 0$ . But, even if $X$ is a good predictor on average, you still worry about worst-case scenarios. If there’s one market environment $e' \in \mathsf{OoS}_e$ where $\beta_{e'} \ll 0$ , then one of your colleagues will surely discover it and you’ll look utterly foolish when he tells the world. You only want to publish if $X$ robustly predicts returns out-of-sample:

(2) $\begin{equation*} (1-\lambda) \cdot \underset{e' \in \mathsf{OoS}_e}{\text{average}} \, \beta_{e'} + \lambda \cdot \underset{e' \in \mathsf{OoS}_e}{\text{minimum}\phantom{j}} \!\beta_{e'} \geq 0 \end{equation*}$

$\lambda \in (0, \, 1]$ captures the relative importance of these two considerations to your publication decision. The larger the $\lambda$ , the more you care about saving face by not publishing any really bad predictions.
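Here’s a small sketch of how the rule in Equation (2) behaves. It assumes, purely for illustration, that the out-of-sample environments can be listed as a finite, equally weighted set of slopes; the function names and the four slope values are hypothetical.

```python
def robust_score(betas, lam):
    """Left-hand side of Equation (2): a blend of average-case and worst-case slopes."""
    if not 0 < lam <= 1:
        raise ValueError("lam must lie in (0, 1]")
    average = sum(betas) / len(betas)
    return (1 - lam) * average + lam * min(betas)


def should_publish(betas, lam):
    """Publish only if the blended score is non-negative."""
    return robust_score(betas, lam) >= 0


# Hypothetical out-of-sample slopes: good on average, with one bad environment.
betas = [0.4, 0.5, 0.3, -0.6]

for lam in (0.1, 0.5):
    print(lam, should_publish(betas, lam))
```

The same set of slopes passes the test when $\lambda$ is small and fails it when $\lambda$ is large, because the single bad environment gets more weight.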

Importantly, let’s assume that all you care about when doing research is solving this robust out-of-sample prediction problem. You don’t care at all about whether investors actually price assets based on $X$ . All that matters is whether $X$ reliably predicts returns out-of-sample. You’re completely drunk on Friedman’s “as if” Kool-Aid. Before deciding whether to publish, you have a choice as to which market environment to examine. What sort of environment should you choose? What should your empirical strategy be?

The key insight in this post is that, even if all you care about is robust out-of-sample performance, causal inference still turns out to be a useful tool for achieving this goal. If investors always use the same model to price assets, then understanding this model will allow you to always make good predictions. Your empirical strategy should be to choose an empirical environment $e$ that identifies the causal effect of $X$ on returns.

Investors’ model

I begin by defining investors’ model. Suppose that, in every market environment, investors price each asset so that its returns are governed by the following linear structural model:

(3) $\begin{equation*} R \leftarrow \theta_{\star} \cdot X + \vartheta_{\star} \cdot U + \sigma_{\star} \cdot N \end{equation*}$

Moreover, assume that the parameters, $(\theta_{\star}, \, \vartheta_{\star}, \, \sigma_{\star})$, are the same in every market environment. $X$ is the cross-sectional predictor that you’re working on, and $U$ is an omitted variable. This is a variable that investors might be using to price assets but that researchers have yet to discover. If it’s 1981 and $X$ is firm size, then $U$ might be liquidity. $N$ is a noise term and $\sigma_{\star} > 0$ captures its effect on returns.

Crucially, either $X$ affects returns, $\theta_{\star} \neq 0$ , or $U$ affects returns, $\vartheta_{\star} \neq 0$ , but not both in investors’ model. If $\theta_{\star} \neq 0$ , then $X$ reliably predicts the cross-section of returns since $\theta_{\star}$ is the same in every environment—i.e., in every time period, country, etc. If $\vartheta_{\star} \neq 0$ , then any predictability associated with $X$ is spurious. Let

(4) $\begin{equation*} \mathsf{\Theta} = \{ \, (\theta, \, \vartheta) \in [-1, \, 1]^2 \, | \, 1_{\{\theta \neq 0\}} + 1_{\{ \vartheta \neq 0 \}} = 1 \, \} \end{equation*}$

denote the entire range of possible values that $\theta_{\star}$ and $\vartheta_{\star}$ might take on.

To keep things simple, suppose that the realized values of $X$ , $U$ , and $N$ for each asset in a given market environment are drawn IID normal:

(5) $\begin{equation*} \begin{pmatrix} X \\ U \\ N \end{pmatrix} \overset{\text{IID}}{\sim} \mathrm{Normal} \left( \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & \rho_e & 0 \\ \rho_e & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \right) \end{equation*}$

$X$ , $U$ , and $N$ all have mean zero and unit variance. The noise term $N$ is uncorrelated with both $X$ and $U$ in every market environment, $\Corr_e[X, \, N] = \Corr_e[U, \, N] = 0$ . However, $X$ and $U$ may be correlated across stocks, $\Corr_e[X, \, U] = \rho_e \neq 0$ . Moreover, this correlation can differ across market environments. In other words, $X$ and $U$ may be highly correlated in one time period/country/asset class/etc but not in another.

Note that Equations (3) and (5) imply asset returns are zero on average, $\Exp_e[R] = 0$, in every market environment. I’m making this assumption to keep the math simple. If it really bothers you, just think about $R$ as an asset’s residual return that is unexplained by other trading signals. Let’s also assume that $\Var_e[R] = 1$ in every market environment for the same reasons. This assumption implies that $\sigma_{\star} = \sqrt{1 - (\theta_{\star}^2 + \vartheta_{\star}^2)}$.
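For anyone who wants the intermediate step, the expression for $\sigma_{\star}$ can be checked directly from Equations (3) and (5). This is just a sketch of the algebra, using only the assumptions already stated:

$\begin{equation*} \Var_e[R] = \theta_{\star}^2 + \vartheta_{\star}^2 + 2 \cdot \theta_{\star} \cdot \vartheta_{\star} \cdot \rho_e + \sigma_{\star}^2 = \theta_{\star}^2 + \vartheta_{\star}^2 + \sigma_{\star}^2 \end{equation*}$

The cross term drops out because exactly one of $\theta_{\star}$ and $\vartheta_{\star}$ is non-zero, so $\theta_{\star} \cdot \vartheta_{\star} = 0$ no matter what $\rho_e$ is. Setting the right-hand side equal to one and solving for $\sigma_{\star}$ gives the expression above.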

Two explanations

When you regressed the cross-section of returns on $X$ in your chosen market environment $e$ , you found that $\beta_e \gg 0$ . Given the structure of investors’ model, we know that either $X$ predicts the cross-section of returns or $U$ predicts the cross-section of returns but not both:

(6) $\begin{equation*} \beta_e = \begin{cases} \theta_{\star} &\text{if } \theta_{\star} \neq 0 \\ \vartheta_{\star} \cdot \rho_e &\text{if } \theta_{\star} = 0 \end{cases} \end{equation*}$

It might be that you estimated $\beta_e \gg 0$ in market environment $e$ because $\theta_{\star} \gg 0$ in every environment. Or it might be that you estimated $\beta_e \gg 0$ in market environment $e$ because $X$ happened to be correlated with an omitted variable in that environment, $\vartheta_{\star} \cdot \rho_e \gg 0$ . These are the two possible explanations.
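Equation (6) follows from the usual formula for a univariate regression slope. Here’s a sketch of that step, where $\mathrm{Cov}_e$ is shorthand I’m introducing for the covariance in environment $e$:

$\begin{equation*} \begin{split} \beta_e = \frac{\mathrm{Cov}_e[R, \, X]}{\Var_e[X]} &= \theta_{\star} \cdot \Var_e[X] + \vartheta_{\star} \cdot \mathrm{Cov}_e[U, \, X] + \sigma_{\star} \cdot \mathrm{Cov}_e[N, \, X] \\ &= \theta_{\star} + \vartheta_{\star} \cdot \rho_e \end{split} \end{equation*}$

The first line uses $\Var_e[X] = 1$, and the second uses $\mathrm{Cov}_e[U, \, X] = \rho_e$ and $\mathrm{Cov}_e[N, \, X] = 0$ from Equation (5). Since only one of $\theta_{\star}$ and $\vartheta_{\star}$ is non-zero, only one of the two terms survives, which gives the two cases.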

Since you’re focused on robust out-of-sample performance in other market environments $e' \neq e$, the reason why $\beta_e \gg 0$ in-sample is very important. If $\beta_e \gg 0$ merely because $\vartheta_{\star} \cdot \rho_e \gg 0$, then $X$ will only be a good cross-sectional predictor in other market environments $e' \in \mathsf{OoS}_e$ where $X$ and $U$ are similarly correlated, $\mathbb{S}\mathrm{ign}[\rho_e] = \mathbb{S}\mathrm{ign}[\rho_{e'}]$. Under this explanation, it’s possible to imagine market environments where $X$ is an abysmal predictor. Just look for environments where $\mathbb{S}\mathrm{ign}[\rho_e] \neq \mathbb{S}\mathrm{ign}[\rho_{e'}]$.
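The sign flip is easy to see in a simulation. The sketch below draws data from Equations (3) and (5) and estimates the slope from Equation (1). The parameter values (0.4, 0.8, $\pm 0.5$), the sample size, the seeds, and the function name `estimated_beta` are all hypothetical choices for illustration.

```python
import math
import random


def estimated_beta(theta, vartheta, rho, n_assets=20000, seed=0):
    """Simulate Equations (3) and (5), then return the OLS slope of R on X."""
    sigma = math.sqrt(1 - (theta ** 2 + vartheta ** 2))  # keeps Var[R] = 1
    rng = random.Random(seed)
    xs, rs = [], []
    for _ in range(n_assets):
        x = rng.gauss(0, 1)
        # U has unit variance and correlation rho with X.
        u = rho * x + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        noise = rng.gauss(0, 1)
        xs.append(x)
        rs.append(theta * x + vartheta * u + sigma * noise)
    mean_x = sum(xs) / n_assets
    mean_r = sum(rs) / n_assets
    cov = sum((x - mean_x) * (r - mean_r) for x, r in zip(xs, rs))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Causal explanation: theta = 0.4, so the slope ignores rho.
causal_in = estimated_beta(theta=0.4, vartheta=0.0, rho=0.5)
causal_out = estimated_beta(theta=0.4, vartheta=0.0, rho=-0.5, seed=1)

# Spurious explanation: vartheta * rho = 0.8 * 0.5 = 0.4 in sample only.
spurious_in = estimated_beta(theta=0.0, vartheta=0.8, rho=0.5)
spurious_out = estimated_beta(theta=0.0, vartheta=0.8, rho=-0.5, seed=1)

print(f"causal:   in-sample {causal_in:.2f}, out-of-sample {causal_out:.2f}")
print(f"spurious: in-sample {spurious_in:.2f}, out-of-sample {spurious_out:.2f}")
```

Both explanations produce roughly the same in-sample slope, so the regression alone can’t tell them apart. Only the spurious one changes sign when the correlation between $X$ and $U$ does.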

Causal inference

What needs to be true about market environment $e$ if you want to be able to distinguish between these two explanations? The answer boils down to an identifying assumption about the range of values that $\Corr_e[X, \, U] = \rho_e$ might take on:

(7) $\begin{equation*} \mathsf{P}_e = \{ \, \rho \in [-1, \, 1] \, | \, \text{it could be that $\Corr_e[X, \, U] = \rho$ in market environment $e$} \, \} \end{equation*}$

A market environment, $e = ( \theta, \, \vartheta; \mathsf{P}, \, \rho )$, consists of a set of structural parameters, $(\theta, \, \vartheta)$; a range of possible values for the correlation between $X$ and $U$, $\mathsf{P}$; and a particular choice for this value, $\rho \in \mathsf{P}$.

If market environments $e$ and $e'$ have the same structural parameters, $(\theta,\,\vartheta,\,\rho_e) = (\theta,\,\vartheta,\,\rho_{e'})$ , then the cross-sectional slope coefficient will be the same in both environments, $\beta_e = \beta_{e'}$ . Yet, you will interpret the slope coefficient differently in each environment if $\mathsf{P}_e \neq \mathsf{P}_{e'}$ . By analogy, medical researchers will draw different conclusions about a drug’s efficacy from an RCT than from an observational study even if the joint distribution of patient outcomes and observable characteristics is the same in both datasets. If $\{ 0 \} = \mathsf{P}_e$ , then $\beta_e \gg 0$ identifies $\theta_{\star} \gg 0$ as the correct explanation. There’s no way to have $\beta_e = \vartheta_{\star} \cdot 0 \gg 0$ in such an environment. By contrast, if $\{ 0\} \subset \mathsf{P}_{e'}$ , then $\beta_{e'} \gg 0$ could be explained either by $\theta_{\star} \gg 0$ or by $\vartheta_{\star} \cdot \rho_{e'} \gg 0$ .

Note that it isn’t possible to choose a market environment where $\mathsf{P}_{e'}$ consists of an arbitrarily small neighborhood around zero. The omitted variable $U$ can explain no more than $100\%$ of the variation in returns across assets. That would occur if $|\vartheta_{\star}| = 1$ since we are assuming $\Var_{e'}[R] = 1$ . Hence, if $\theta_{\star} = 0$ and $\beta_{e'} = \vartheta_{\star} \cdot \rho_{e'} \gg 0$ due to a spurious correlation, then this correlation must be bounded away from zero:

(8) $\begin{equation*} \beta_{e'} = \vartheta_{\star} \cdot \rho_{e'} \leq 1 \cdot \rho_{e'} \qquad \Rightarrow \qquad (-\beta_{e'}, \, 0) \cup (0, \, \beta_{e'}) \not\subset \mathsf{P}_{e'} \end{equation*}$

This digital zero/non-zero distinction is why it’s possible to map out causal effects using path diagrams. A path between two variables must be contemplated whenever they could have a non-zero correlation.

Out-of-sample environments

When you regressed the cross-section of returns on $X$ in market environment $e$ , you found $\beta_e \gg 0$ . We can now give a precise definition for the set of all out-of-sample market environments that other researchers might try to replicate this finding in. Let

(9) $\begin{equation*} \mathsf{\Theta}_e = \{ \, (\theta, \, \vartheta) \in \mathsf{\Theta} \, | \, \text{$\beta_e = \theta + \vartheta \cdot \rho$ for some $\rho \in \mathsf{P}_e$} \, \} \end{equation*}$

denote the range of possible values for $\theta_{\star}$ and $\vartheta_{\star}$ that are consistent with your initial estimate $\beta_e \gg 0$ given $\mathsf{P}_e$. If $X$ is guaranteed to be uncorrelated with the omitted variable, $\{ 0 \} = \mathsf{P}_e$, then $\{ (\beta_e, \, 0) \} = \mathsf{\Theta}_e$ and we say that market environment $e$ identifies changes in $X$ as having a causal effect on the cross-section of returns.
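Here’s a rough sketch of how you might enumerate $\mathsf{\Theta}_e$ from Equation (9). It assumes $\mathsf{P}_e$ can be approximated by a finite, non-empty list of candidate correlations, which is my simplification and not part of the setup above. The function name and the example numbers are hypothetical.

```python
def consistent_parameters(beta_e, rho_candidates):
    """Grid approximation to Theta_e in Equation (9).

    Returns every (theta, vartheta) pair in [-1, 1]^2 with exactly one
    non-zero entry that reproduces beta_e for some candidate rho.
    """
    if not 0 < abs(beta_e) <= 1:
        raise ValueError("this sketch assumes a non-zero beta_e in [-1, 1]")
    # Causal explanation: beta_e = theta, whatever rho happens to be.
    pairs = [(beta_e, 0.0)]
    for rho in rho_candidates:
        if rho == 0:
            continue  # vartheta * 0 can never equal a non-zero beta_e
        vartheta = beta_e / rho
        if abs(vartheta) <= 1:
            pairs.append((0.0, vartheta))  # spurious explanation
    return pairs


# Identified environment: X is known to be uncorrelated with U.
identified = consistent_parameters(0.4, [0.0])

# Unidentified environment: several correlations are possible.
unidentified = consistent_parameters(0.4, [-1.0, -0.5, 0.0, 0.2, 0.5, 1.0])

print(identified)
print(unidentified)
```

In the identified case only the causal pair survives. In the unidentified case every candidate correlation with $|\rho| \geq |\beta_e|$ contributes a spurious pair, while the candidate $\rho = 0.2$ is dropped because it would need $|\vartheta| > 1$.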

Given the set of all $\theta_{\star}$ and $\vartheta_{\star}$ values that are consistent with your initial result $\beta_e \gg 0$ , the range of potential out-of-sample market environments is defined as follows:

(10) $\begin{equation*} \mathsf{OoS}_e = \{ \, e' = (\theta, \, \vartheta; \mathsf{P}, \, \rho ) \, | \, (\theta, \, \vartheta) \in \mathsf{\Theta}_e \text{ and } \rho \in [-1, \, 1] = \mathsf{P} \, \} \end{equation*}$

This collection of market environments consists of any environment which could be generated by some $\theta_{\star}$ and $\vartheta_{\star}$ that’s consistent with your initial result combined with any possible value of $\rho \in [-1, \, 1]$ .

Research strategy

If you chose a market environment $e$ that identified the causal effect of $X$ on the cross-section of returns for your initial test, then your estimate of $\beta_e \gg 0$ would imply that $\beta_{e'} = \theta_{\star} \gg 0$ in every out-of-sample market environment. The left-hand side of Equation (2) would reduce to:

(11) $\begin{equation*} \begin{split} (1-\lambda) \cdot \underset{e' \in \mathsf{OoS}_e}{\text{average}} \, \beta_{e'} + \lambda \cdot \underset{e' \in \mathsf{OoS}_e}{\text{minimum}\phantom{j}} \!\beta_{e'} &= (1 - \lambda) \cdot \beta_e + \lambda \cdot \beta_e \\ &= \beta_e \end{split} \end{equation*}$

The finding would be robust out-of-sample, and you should publish it.

By contrast, if you chose a market environment $e$ that did not identify the causal effect of $X$ on the cross-section of returns, then your estimate of $\beta_e \gg 0$ would be harder to interpret. It could be that $\beta_e = \theta_{\star} \gg 0$ or it could be that $\beta_e = \vartheta_{\star} \cdot \rho_e \gg 0$. If the latter is true, then we would say that $\beta_e$ reflects a spurious correlation. And to make this spurious correlation look as bad as possible out-of-sample, other researchers should look for a market environment $e' \in \mathsf{OoS}_e$ where $\rho_{e'} = - 1$:

(12) $\begin{equation*} \underset{e' \in \mathsf{OoS}_e}{\text{minimum}\phantom{j}} \!\beta_{e'} = -1 \end{equation*}$

Absent identification, you have to entertain this possibility. So you may refuse to publish strong results with good average-case out-of-sample performance for fear of being embarrassed by worst-case predictions.
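To compare the two cases side by side, here’s a sketch that evaluates the left-hand side of Equation (2) on a grid. The 21-point grid for $\rho$, the equal weighting of environments, $\lambda = 0.5$, and $\beta_e = 0.4$ are all assumptions for illustration. The unidentified case includes the pair $(0, \, 1)$, which is consistent with $\beta_e = 0.4$ whenever $0.4 \in \mathsf{P}_e$.

```python
def robust_score(betas, lam):
    """Left-hand side of Equation (2) for an equally weighted list of slopes."""
    average = sum(betas) / len(betas)
    return (1 - lam) * average + lam * min(betas)


def out_of_sample_betas(parameter_pairs, rho_grid):
    """Slope theta + vartheta * rho for every candidate pair and every rho."""
    return [
        theta + vartheta * rho
        for theta, vartheta in parameter_pairs
        for rho in rho_grid
    ]


rho_grid = [i / 10 for i in range(-10, 11)]  # 21 values from -1 to 1
beta_e = 0.4
lam = 0.5

# Identified: only the causal pair (beta_e, 0) is consistent with the estimate.
identified = out_of_sample_betas([(beta_e, 0.0)], rho_grid)

# Unidentified: the spurious pair (0, 1) is also on the table.
unidentified = out_of_sample_betas([(beta_e, 0.0), (0.0, 1.0)], rho_grid)

identified_score = robust_score(identified, lam)
unidentified_score = robust_score(unidentified, lam)

print(f"identified:   min {min(identified):.1f}, score {identified_score:.2f}")
print(f"unidentified: min {min(unidentified):.1f}, score {unidentified_score:.2f}")
```

In the identified case every out-of-sample slope equals $\beta_e$, so the score is just $\beta_e$ as in Equation (11). Once the spurious pair is allowed, the worst case drops to $-1$ as in Equation (12) and drags the score below zero at this $\lambda$.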

Thus, as outlined at the beginning, even if all you care about as a researcher is publishing results that have robust out-of-sample performance, causal inference still turns out to be relevant. It’s a very useful tool for achieving this goal. If investors are always using the same model to price assets, then understanding this model will allow you to always make good predictions. So you should consider adopting a research strategy whereby you insist on testing each new predictor $X$ in an identified market environment $e$.

No free lunch

I recognize that identifying causal effects is hard. Running RCTs is hard. Finding valid instrumental variables is hard. It’s hard to find a market environment that identifies the causal effect of a change in $X$ on returns—i.e., to find a market environment $e$ where it’s reasonable to assume that $\{ 0 \} = \mathsf{P}_e$ .

So you might be thinking: “Can’t I just get around the problem by checking lots of different market environments before publishing? If $\beta_{e'} = \beta_{e} \gg 0$ in lots of different market environments $e' \in \mathsf{OoS}_e$, then shouldn’t I be more confident in $X$’s out-of-sample performance? After all, in real life, no researcher would (or could!) publish a result about cross-sectional predictability based on one regression.”

It’s absolutely true that you do learn something about $X$’s out-of-sample performance when you verify that $\beta_{e'} = \beta_e \gg 0$ in many different market environments $e' \in \mathsf{OoS}_e$. Unfortunately, the something that you learn only applies to $X$’s average-case performance, $\text{average}_{e' \in \mathsf{OoS}_e} \, \beta_{e'}$. For example, if $\beta_{e'} = 0.99$ in at least $51\%$ of all possible out-of-sample environments, then there’s no way for $\text{average}_{e' \in \mathsf{OoS}_e} \, \beta_{e'} < 0$. We know that $\beta_{e'} \geq -1$ in every remaining market environment, as we saw in Equation (12), so the average is at least $0.51 \cdot 0.99 - 0.49 \cdot 1 > 0$.

Yet, until you check every imaginable out-of-sample environment, you can say nothing new about the worst-case outcome. No matter how many environments you check in $\mathsf{OoS}_e$, you can never rule out $\beta_{e'} = -1$ in one of the remaining environments in $\mathsf{OoS}_e$ that you haven’t checked. Thus, if you care about never publishing a result that makes an embarrassingly bad out-of-sample prediction in some situation, $\lambda > 0$, then simply doing lots of in-sample checks isn’t a viable research strategy on its own. It certainly doesn’t hurt. But it DOES NOT tell you anything about worst-case out-of-sample robustness in the setup thus far.

The thing that makes causal inference difficult is that it requires making a strong assumption about the joint distribution of $X$ and an unobserved/unknown/omitted variable $U$ in a particular market environment $e$. You have to assume that $\{ 0 \} = \mathsf{P}_e$. Such identifying assumptions can be hard to stomach. However, the assumption that you would need to make for in-sample robustness to guarantee out-of-sample robustness is even less palatable. TANSTAAFL. Instead of requiring that $\rho_e = 0$ in some specific market environment, you would need to assume that $\vartheta_{\star} \cdot \rho_{e'} \gg 0$ in all out-of-sample environments $e' \in \mathsf{OoS}_e$. Such an assumption is tantamount to simply assuming the result you’re after, namely out-of-sample robustness.