Research Notebook

A Tell-Tale Sign of Short-Run Trading

January 26, 2017 by Alex

Motivation

Trading has gotten a lot faster over the last two decades. The term “short-run trader” used to refer to people who traded multiple times a day. Now, it refers to algorithms that trade multiple times a second.

Some people are worried about this new breed of short-run trader making stock prices less informative about firm fundamentals by trading so often on information that’s unrelated to companies’ long-term prospects. But, this is a red herring. By the same logic, market-neutral strategies should be making market indexes like the Russell 3000 less informative about macro fundamentals. And, no one believes this.

Short-run traders aren’t ignoring fundamentals; they’re learning about fundamentals before everyone else by studying order flow. And, this post shows why they also make trading volume look asymmetric and lumpy as a result. The logic is simple. If short-run traders get additional information from order flow, then they’ll use this information to cluster their trading at times when everyone else is moving in the opposite direction.

Benchmark Model

Consider a market with a single company that’s going to pay a dividend, d_t, in future periods t = 1, \, 2, \, \ldots. And, suppose that there’s a unit mass of small agents, i \in (0, \, 1], who have noisy priors about these dividends,

    \begin{equation*} d_t \overset{\scriptscriptstyle \text{iid}}{\sim} \mathrm{N}( \mu_t^{(i)}, \, \sfrac{\sigma^2\!}{2} ) \quad \text{for some} \quad  \sigma^2 > 0, \end{equation*}

which are correct on average, d_t = \int_0^1 \mu_t^{(i)} \cdot \mathrm{d}i. This assumption means that, by aggregating agents’ demand, equilibrium prices can contain information about dividends that isn’t known by any individual agent.
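This averaging assumption is easy to sanity-check with a quick simulation. Here’s a sketch (with made-up values for d_t and \sigma^2) that approximates the unit mass of agents with a large finite sample:

```python
import numpy as np

rng = np.random.default_rng(0)

d_t = 1.7          # true dividend (hypothetical value)
sigma2 = 4.0       # prior-variance parameter, sigma^2 (hypothetical value)

# Approximate the unit mass of agents i in (0, 1] with a large finite sample.
n_agents = 1_000_000
mu_i = rng.normal(loc=d_t, scale=np.sqrt(sigma2 / 2), size=n_agents)

# "Correct on average": the cross-sectional average of agents' prior means
# recovers the dividend, even though no single agent knows it.
print(abs(mu_i.mean() - d_t))   # small, shrinking like 1/sqrt(n_agents)
```

So prices that aggregate demand can know more than any one trader does.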

This unit mass of agents is split into two different groups: long-term investors and short-run traders. At time t=0, each group of agents trades shares in its own separate fund, f \in \{L, \, H\}, that offers frequency-specific exposure to the company’s dividends at times t=1,\,2. The long-term investors, i \in (0, \, \sfrac{1}{2}], trade the low-frequency fund which has a payout:

    \begin{equation*} d_L \overset{\scriptscriptstyle \text{def}}{=} d_1 + d_2 \end{equation*}

And, the short-run traders, i \in (\sfrac{1}{2}, \, 1], trade the high-frequency fund which has a payout:

    \begin{equation*} d_H \overset{\scriptscriptstyle \text{def}}{=} d_1 - d_2 \end{equation*}
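These two payouts are just an invertible change of basis on the dividend stream, which a couple of lines of code make concrete (the dividend values below are made up):

```python
# The low- and high-frequency payouts are a change of basis on (d_1, d_2):
# d_L = d_1 + d_2 picks up the "level" and d_H = d_1 - d_2 the "change".
d_1, d_2 = 1.25, -0.5   # hypothetical dividend realizations

d_L = d_1 + d_2
d_H = d_1 - d_2

# The basis change is invertible, so holding both funds in equal measure
# reproduces the original period-by-period dividends.
assert (d_L + d_H) / 2 == d_1
assert (d_L - d_H) / 2 == d_2
```

The same decomposition shows up again on the supply side below, where z_L = z_1 + z_2 and z_H = z_1 - z_2.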

At time t=0, each agent observes the equilibrium price of a frequency-specific fund, p_f, and then chooses the number of shares to buy, x_f^{(i)}, in order to maximize his expected utility at the end of time t=2:

(1)   \begin{equation*} \max_{x_f^{(i)}} \mathrm{E}^{(i)}\left[ \, - e^{ \, - \alpha \cdot \{d_f - p_f\} \cdot x_f^{(i)}  \, } \, \middle| \, p_f \, \right] \quad \text{for some} \quad \alpha > 0 \end{equation*}

Above, \mathrm{E}^{(i)}[\cdot|p_f] denotes agent i’s conditional expectation, and \alpha denotes his risk-aversion parameter.

Let z_t denote the number of shares of the company’s stock that are available for purchase at time t. We say that “markets clear” at time t=0 if the dividend payout from each available share at times t=1,\,2 has been unambiguously assigned to exactly one trader via their fund holdings:

(2)   \begin{align*} {\textstyle \int_0^{\sfrac{1}{2}}} x_L^{(i)} \cdot \mathrm{d}i  + {\textstyle \int_{\sfrac{1}{2}}^1} x_H^{(i)} \cdot \mathrm{d}i &=  z_1 \\ \text{and} \quad {\textstyle \int_0^{\sfrac{1}{2}}} x_L^{(i)} \cdot \mathrm{d}i  - {\textstyle \int_{\sfrac{1}{2}}^1} x_H^{(i)} \cdot \mathrm{d}i &=  z_2 \end{align*}

Let z_L \overset{\scriptscriptstyle \text{def}}{=} z_1 + z_2 and z_H \overset{\scriptscriptstyle \text{def}}{=} z_1 - z_2 denote the number of available shares at each frequency.

An equilibrium then consists of a demand rule, \mathrm{X}(\mathrm{E}^{(i)}[d_f|p_f], \, p_f) = x_f^{(i)}, and a price function, \mathrm{P}(d_f, \, z_f) = p_f, such that 1) demand maximizes the expected utility of each agent given the price and 2) markets clear.

Because the equilibrium price of each fund only depends on its promised payout and the number of available shares, if agents knew the number of available shares, then they could reverse engineer a fund’s future payout at times t=1, \, 2 by studying its equilibrium price at time t=0. And, an equilibrium in such a market wouldn’t be well-defined. So, to make sure that equilibrium prices at time t=0 aren’t fully revealing, let’s assume that the number of available shares in each period is a random variable:

    \begin{equation*} z_t \overset{\scriptscriptstyle \text{iid}}{\sim} \mathrm{N}( 0, \, \sfrac{1}{2}) \end{equation*}

This means thinking about “available” shares as shares that haven’t already been purchased by noise traders.

The key fact about this benchmark model is that agents don’t use order-flow information at time t=1 to update their time t=0 beliefs. As a result, the model really isn’t about short-run traders in spite of how the variables are named. Crouzet/Dew-Becker/Nathanson shows that with some clever relabeling you could just as easily think about the low- and high-frequency funds as index and market-neutral funds, respectively.

Trading Volume

Although agents are only active at time t=0 in the benchmark model, each fund has to trade at times t=1,\,2 in order to deliver frequency-specific payouts. Let’s use x_L and x_H to denote aggregate demand:

    \begin{align*} x_L &\overset{\scriptscriptstyle \text{def}}{=} {\textstyle \int_0^{\sfrac{1}{2}}} x_L^{(i)} \cdot \mathrm{d}i \\ \text{and} \quad x_H &\overset{\scriptscriptstyle \text{def}}{=} {\textstyle \int_{\sfrac{1}{2}}^1} \, x_H^{(i)} \cdot \mathrm{d}i \end{align*}

To deliver d_L to every one of its shareholders, the low-frequency fund has to buy x_L shares of the company’s stock between times t=0 and t=1 and then liquidate this position between t=2 and t=3. And, to deliver d_H to every one of its shareholders, the high-frequency fund has to buy x_H shares between times t=0 and t=1, sell 2 \cdot x_H shares between t=1 and t=2, and then buy back x_H shares between times t=2 and t=3.

So, trading volume in the benchmark model is:

    \begin{align*} \mathit{vlm}_{0|1} &\overset{\scriptscriptstyle \text{def}}{=} |x_L| + |x_H| \\ \mathit{vlm}_{1|2} &\overset{\scriptscriptstyle \text{def}}{=} 2 \cdot |x_H| \\ \text{and} \quad \mathit{vlm}_{2|3} &\overset{\scriptscriptstyle \text{def}}{=} |x_L| + |x_H| \end{align*}

The key thing to notice is that trading volume is symmetric, \mathit{vlm}_{0|1} = \mathit{vlm}_{2|3}, because short-run traders don’t get any new information after time t=0.
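Here’s that rebalancing schedule as code (the aggregate demands below are made-up numbers):

```python
def volume_schedule(x_L, x_H):
    """Shares traded by the two funds in each interval, per the text:
    both funds build positions between t=0 and t=1, the high-frequency
    fund flips its position between t=1 and t=2, and both unwind at the end."""
    vlm_01 = abs(x_L) + abs(x_H)
    vlm_12 = 2 * abs(x_H)
    vlm_23 = abs(x_L) + abs(x_H)
    return vlm_01, vlm_12, vlm_23

# Hypothetical aggregate demands.
v01, v12, v23 = volume_schedule(x_L=0.6, x_H=-0.25)

# Symmetric end points: no one learns anything new after t=0.
assert v01 == v23
```

The asymmetry documented later comes entirely from breaking this no-new-information property.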

Model Solution

So, how many shares of each frequency-specific fund are agents going to demand in the benchmark model? To solve the model and answer this question, let’s first guess that the price function is linear:

    \begin{equation*} \mathrm{P}(d_f, \, z_f) = d_f - \sqrt{\sfrac{\sigma^2\!}{\mathit{SNR}}} \cdot z_f \quad \text{for some} \quad \mathit{SNR} > 0 \end{equation*}

This guess introduces a new parameter, \mathit{SNR}, which represents the signal-to-noise ratio of fund prices at time t=0. If this parameter is large, then the time t=0 prices of the low- and high-frequency funds will reveal a lot of the information about the company’s time t=1, \, 2 dividend payouts.

Here’s the upshot of this guess. It implies that each fund’s price is a normally-distributed signal about its future payout, d_f \sim \mathrm{N}(p_f, \, \sfrac{\sigma^2\!}{\mathit{SNR}}). And, with normally-distributed signals, we know how to compute agents’ posterior beliefs about d_f after seeing p_f:

    \begin{equation*} \mathrm{E}^{(i)}[d_f|p_f] =  {\textstyle \left\{ \frac{1}{1 + \mathit{SNR}}\right\}} \cdot \mu_f^{(i)} + {\textstyle \left\{ \frac{\mathit{SNR}}{1 + \mathit{SNR}}\right\}} \cdot p_f \end{equation*}

Above, \mu_f^{(i)} denotes agent i’s priors about a particular fund, either \mu_L^{(i)} \overset{\scriptscriptstyle \text{def}}{=} \mu_1^{(i)} + \mu_2^{(i)} or \mu_H^{(i)} \overset{\scriptscriptstyle \text{def}}{=} \mu_1^{(i)} - \mu_2^{(i)}. And, we can then use these posterior beliefs to compute agents’ equilibrium demand rule by solving the first-order condition of Equation (1) with respect to x_f^{(i)}:

(3)   \begin{equation*} \mathrm{X}(\mathrm{E}^{(i)}[d_f|p_f], \, p_f) = \{1 + \mathit{SNR}\} \cdot {\textstyle \big\{ \frac{\mathrm{E}^{(i)}[d_f|p_f] - p_f}{\sigma} \big\}} \cdot {\textstyle \big\{ \frac{1}{\alpha \cdot \sigma} \big\}} \end{equation*}
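As a sanity check, Equation (3) is just the standard CARA-normal demand, \{\mathrm{E}^{(i)}[d_f|p_f] - p_f\} divided by risk aversion times the posterior variance \sfrac{\sigma^2\!}{(1 + \mathit{SNR})}. A sketch with made-up parameter values:

```python
import numpy as np

def posterior_mean(mu_i, p_f, snr):
    # Precision-weighted average of the prior mean and the price signal:
    # prior d_f ~ N(mu_i, sigma^2), signal d_f ~ N(p_f, sigma^2 / SNR).
    return mu_i / (1 + snr) + p_f * snr / (1 + snr)

def demand(mu_i, p_f, snr, alpha, sigma):
    # Equation (3): posterior mean minus price, scaled up by (1 + SNR)
    # because learning from the price shrinks posterior variance.
    e_d = posterior_mean(mu_i, p_f, snr)
    return (1 + snr) * (e_d - p_f) / (alpha * sigma**2)

alpha, sigma, snr = 2.0, 0.5, 4.0       # hypothetical parameter values
x = demand(mu_i=1.0, p_f=0.8, snr=snr, alpha=alpha, sigma=sigma)

# Cross-check against the generic CARA-normal formula:
# x = (E[d|p] - p) / (alpha * Var[d|p]), with Var[d|p] = sigma^2 / (1 + SNR).
post_var = sigma**2 / (1 + snr)
x_check = (posterior_mean(1.0, 0.8, snr) - 0.8) / (alpha * post_var)
assert np.isclose(x, x_check)
```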

Finally, to verify that our original guess about a linear price function was correct, we can plug this equilibrium demand rule into the market-clearing conditions in Equation (2) and solve for p_f:

    \begin{equation*} \mathrm{P}(d_f, \, z_f) = d_f - \alpha \cdot \sigma^2 \cdot z_f \end{equation*}

The resulting price function is indeed linear, so our solution is internally consistent (…though not unique). And, by matching coefficients, we can solve for the equilibrium signal-to-noise ratio, \mathit{SNR} = \{ \alpha \cdot \sigma \}^{-2}. In other words, fund prices at time t=0 reveal more information about a company’s dividend at times t=1, \, 2 when agents are less risk averse (\alpha small) or when they have more precise priors (\sigma small).
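Here’s a quick numerical check of the coefficient matching (parameter values made up):

```python
import math

alpha, sigma = 1.5, 0.8          # hypothetical risk aversion and prior std dev
sigma2 = sigma**2

# Equilibrium signal-to-noise ratio from matching coefficients.
snr = (alpha * sigma) ** -2

# The conjectured price impact sqrt(sigma^2 / SNR) must equal the
# market-clearing impact alpha * sigma^2 for the linear guess to hold.
impact_guess = math.sqrt(sigma2 / snr)
impact_clearing = alpha * sigma2
assert math.isclose(impact_guess, impact_clearing)

# Comparative statics: less risk aversion or tighter priors => higher SNR.
assert (0.5 * sigma) ** -2 > snr          # smaller alpha raises SNR
assert (alpha * 0.4) ** -2 > snr          # smaller sigma raises SNR
```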

Order-Flow Info

How would this solution have to change if short-run traders could learn from time t=1 order flow?

For markets to clear at time t=1, the aggregate demand for the low-frequency fund plus the aggregate demand for the high-frequency fund has to equal the total number of available shares, x_L + x_H = z_1. And, from Equation (3), we know that the aggregate demand for the low-frequency fund is related to the company’s total dividend payout:

    \begin{equation*} x_L = \sqrt{\sfrac{\mathit{SNR}}{\sigma^2}} \cdot \{ d_L - p_L \} \end{equation*}

So, by looking at the time t=1 order flow, short-run traders can get a signal about the company’s time t=2 dividend since d_2 = d_L - d_1:

    \begin{equation*} d_2 \sim \mathrm{N}\left( \{p_L - d_1\} - \sqrt{\sfrac{\sigma^2\!}{\mathit{SNR}}} \cdot x_H, \, \sfrac{\sigma^2\!}{\mathit{SNR}} \right), \end{equation*}

And, this information about the time t=2 dividend is helpful since \mathrm{Cov}[d_2, \, d_H] = - \, \sfrac{\sigma^2\!}{2}.

With this additional signal, the short-run traders who previously invested in the high-frequency fund would now rather trade the company’s stock directly at times t=1, \, 2. Let \tilde{x}_H denote their demand at time t=2. When they observe a high price for the low-frequency asset at time t=0, p_L > 0, and a large dividend payout at time t=1, d_1 > 0, they know that d_2 is likely small. And, as a result, they’ll short more shares at time t=2, |\tilde{x}_H| > |x_H|. By contrast, when they observe a high price for the low-frequency asset at time t=0, p_L > 0, and a small dividend payout at time t=1, d_1 < 0, they know that d_2 is likely large. So, they’ll short fewer shares at time t=2, |\tilde{x}_H| < |x_{H}|.

Either way, trading volume is going to look asymmetric and lumpy as a result, with relatively more of the trading volume clustered at one of the end points. If p_L > 0 and d_1 > 0, then relatively more of the trading volume will occur between times t=2 and t=3 because \mathit{vlm}_{0|1} is unchanged and:

    \begin{equation*} \mathit{vlm}_{2|3} = |x_L| + |x_H| < |x_L| + |\tilde{x}_H| \overset{\scriptscriptstyle \text{def}}{=} \widetilde{\mathit{vlm}}_{2|3} \end{equation*}

Whereas, if p_L > 0 and d_1 < 0, then relatively more of the trading volume will occur between time t=0 and t=1 because \mathit{vlm}_{0|1} is unchanged and now \mathit{vlm}_{2|3} > \widetilde{\mathit{vlm}}_{2|3}.

What’s more, to long-term investors who can’t see short-run order flow, short-run traders are going to add execution risk. The price at which the low-frequency fund executes its time t=2 orders will now vary. And, this variation will be related to the magnitude (but not the sign) of their time t=0 demand.

Finally, note that this analysis shows why it’s easier to model indexers and stock pickers than long-term investors and short-run traders. An equilibrium in either model has to contain a demand rule and a price function (e.g., see the setup in the benchmark model). But, an equilibrium in a model with multi-frequency trade also has to contain a rule for how long-term investors think short-run traders will affect their order execution. And, this rule is the crux of any model with long-term investors and short-run traders.


The Tension Between Learning and Predicting

January 24, 2017 by Alex

1. Motivation

Imagine we’re traders in a market where the cross-section of returns is related to V \geq 1 variables:

    \begin{align*} r_s = \alpha^\star + {\textstyle \sum_v} \, \beta_v^{\star} \cdot x_{s,v} + \epsilon_s^{\star}. \end{align*}

In the equation above, \alpha^\star denotes the mean return, and each \beta_v^\star captures the relationship between returns and the vth right-hand-side variable. Some notation: I’m going to be using a “star” to denote true parameter values and a “hat” to denote in-sample estimates of these values. e.g., \hat{\beta}_v denotes an in-sample estimate of \beta_v^\star. To make things simple, let’s assume that \sum_s x_{s,v} = 0, \sum_s x_{s,v}^2 = S, \sum_s x_{s,v} \cdot x_{s,v'} = 0, and \epsilon_s^{\star} \overset{\scriptscriptstyle \text{iid}}{\sim} \mathrm{N}(0, \, \sigma^2).
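To play with this setup numerically, it helps to have a simulated cross-section that satisfies these normalizations exactly. Here’s one way to build such a design via a QR decomposition (the sample sizes and coefficient values below are made up):

```python
import numpy as np

def make_design(S, V, seed=0):
    """Columns are mean zero, mutually orthogonal, and scaled so sum(x**2) = S."""
    rng = np.random.default_rng(seed)
    A = np.column_stack([np.ones(S), rng.standard_normal((S, V))])
    Q, _ = np.linalg.qr(A)            # first column of Q spans the constant
    return Q[:, 1:] * np.sqrt(S)      # remaining columns: centered, orthogonal

S, V, sigma = 50, 4, 1.0
X = make_design(S, V)

assert np.allclose(X.sum(axis=0), 0)              # sum_s x_{s,v} = 0
assert np.allclose(X.T @ X, S * np.eye(V))        # sum_s x^2 = S, orthogonal columns

# Simulate returns from the model r_s = alpha* + sum_v beta*_v x_{s,v} + eps_s.
rng = np.random.default_rng(1)
alpha_star, beta_star = 0.1, np.array([0.5, -0.3, 0.0, 0.2])
r = alpha_star + X @ beta_star + rng.normal(0, sigma, S)
```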

Notice that learning about the most likely parameter values,

    \begin{align*} \{ \hat{\alpha}^{\text{ML}}, \, \hat{\beta}_1^{\text{ML}}, \, \ldots, \, \hat{\beta}_V^{\text{ML}} \} &= \arg \max_{\alpha, \, \beta_1, \, \ldots, \, \beta_V}  \left\{ \, \mathrm{Pr}(\mathbf{r} | \{\alpha, \, \beta_1, \, \ldots, \, \beta_V\}) \, \right\}, \end{align*}

is really easy in this setting because \epsilon_s^{\star} \overset{\scriptscriptstyle \text{iid}}{\sim} \mathrm{N}(0, \, \sigma^2). These maximum-likelihood estimates are just the coefficients from a cross-sectional regression,

    \begin{align*} r_s = \hat{\alpha}^{\text{ML}} + {\textstyle \sum_v} \, \hat{\beta}_v^{\text{ML}} \cdot x_{s,v} + \hat{\epsilon}_s^{\text{ML}}. \end{align*}

So, finding \{ \hat{\alpha}^{\text{ML}}, \, \hat{\beta}_1^{\text{ML}}, \, \ldots, \, \hat{\beta}_V^{\text{ML}} \} is a homework question from Econometrics 101. Nothing could be simpler.

But, what if we’re interested in predicting returns,

    \begin{align*} \min_{\alpha, \, \beta_1, \, \ldots, \, \beta_V}  \left\{ \, {\textstyle \frac{1}{S}} \cdot {\textstyle \sum_s} \, \left( r_s - [\alpha + {\textstyle \sum_v} \, \beta_v \cdot x_{s,v}] \right)^2 \, \right\}, \end{align*}

rather than learning about the most likely parameter values? It might seem like this is the same exact problem. And, if we’re only considering 1 right-hand-side variable, then it is the same exact problem. When V = 1, the best predictions come from using \{\hat{\alpha}^{\text{ML}}, \, \hat{\beta}_1^{\text{ML}}\}. But, it turns out that when there are 2 or more right-hand-side variables (and an unknown mean), this is no longer true. When V \geq 2, we can make better predictions with less likely parameters. When V \geq 2, there’s a tension between learning and predicting.

Why? That’s the topic of today’s post.

2. Maximum Likelihood

Finding the most likely (ML) parameter values is equivalent to minimizing the negative log likelihood. So, because we assume that \epsilon_s^\star \overset{\scriptscriptstyle \text{iid}}{\sim} \mathrm{N}(0, \, \sigma^2), this is just

    \begin{align*} - \, \log \mathrm{Pr}(\mathbf{r} | \{\alpha, \, \beta_1, \, \ldots, \, \beta_V\}) &= {\textstyle \frac{1}{2 \cdot (S \cdot \sigma^2)}} \cdot {\textstyle \sum_s} \left(r_s - [\alpha + {\textstyle \sum_v} \, \beta_v \cdot x_{s,v}] \right)^2 + \cdots \end{align*}

where the “+ \cdots” at the end denotes a bunch of terms that don’t include any of the parameters that we’re optimizing over. Differentiating with respect to each parameter then gives the first-order conditions below:

    \begin{align*} 0 &= {\textstyle \frac{1}{S}} \cdot {\textstyle \sum_s}  \left( r_s  -  [\hat{\alpha}^{\text{ML}} + {\textstyle \sum_v} \, \hat{\beta}_v^{\text{ML}} \cdot x_{s,v}]  \right)  \cdot  1 \\ \text{and} \quad 0 &= {\textstyle \frac{1}{S}} \cdot {\textstyle \sum_s} \left( r_s -  [\hat{\alpha}^{\text{ML}} + {\textstyle \sum_v} \, \hat{\beta}_v^{\text{ML}} \cdot x_{s,v}] \right) \cdot  x_{s,v} \quad \text{for each } v = 1, \, \ldots, \, V. \end{align*}

And, solving this system of (V+1) equations and (V+1) unknowns gives the most likely parameter values:

    \begin{align*} \hat{\alpha}^{\text{ML}} &= {\textstyle \frac{1}{S}} \cdot {\textstyle \sum_s} \, r_s \cdot 1 \\ \text{and} \quad \hat{\beta}_v^{\text{ML}}  &= {\textstyle \frac{1}{S}} \cdot {\textstyle \sum_s} \, (r_s - \hat{\alpha}^{\text{ML}}) \cdot x_{s,v} \quad \text{for each } v = 1, \, \ldots, \, V. \end{align*}

Clearly, the most likely parameter values are just the coefficients from a cross-sectional regression.
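Here’s a sketch confirming that these closed forms match an off-the-shelf regression on simulated data (all parameter values are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
S, V, sigma = 60, 3, 1.0

# Build a design with centered, orthogonal columns scaled so sum(x**2) = S.
A = np.column_stack([np.ones(S), rng.standard_normal((S, V))])
Q, _ = np.linalg.qr(A)
X = Q[:, 1:] * np.sqrt(S)

beta_star = np.array([0.4, 0.0, -0.7])
r = 0.2 + X @ beta_star + rng.normal(0, sigma, S)

# Closed forms from the first-order conditions.
alpha_ml = r.mean()
beta_ml = (X.T @ (r - alpha_ml)) / S

# Cross-check: coefficients from a cross-sectional regression with intercept.
coef, *_ = np.linalg.lstsq(np.column_stack([np.ones(S), X]), r, rcond=None)
assert np.isclose(alpha_ml, coef[0])
assert np.allclose(beta_ml, coef[1:])
```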

Now, for the sake of argument, let’s imagine there’s an oracle who knows the true parameter values, \{ \alpha^\star, \, \beta_1^\star, \, \ldots, \, \beta_V^\star \}. With access to this oracle, we can compute our mean squared error when using the maximum-likelihood estimates to predict returns given any choice of true parameter values:

    \begin{align*} \mathrm{E} \left[ \, \left(  r_s  -  [\hat{\alpha}^{\text{ML}} + {\textstyle \sum_v} \, \hat{\beta}_v^{\text{ML}} \cdot x_{s,v}] \right)^2 \, \middle| \, \{ \alpha^\star, \, \beta_1^\star, \, \ldots, \, \beta_V^\star \} \, \right] &= \sigma^2  +  1 \cdot (\sfrac{\sigma^2}{S})  +  V \cdot (\sfrac{\sigma^2}{S}). \end{align*}

The first term, \sigma^2, captures the unavoidable error. i.e., even if we knew the true parameter values, we still wouldn’t be able to predict \epsilon_s \overset{\scriptscriptstyle \text{iid}}{\sim} \mathrm{N}(0, \, \sigma^2). And, the second and third terms, 1 \cdot (\sfrac{\sigma^2}{S}) and V \cdot (\sfrac{\sigma^2}{S}), capture the error that comes from using the most likely parameter values rather than the true parameter values.
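This decomposition can be verified by Monte Carlo: re-estimate the regression on many simulated samples, score each fit on fresh returns drawn from the same design, and compare the average error to \sigma^2 \cdot (1 + \sfrac{(V+1)}{S}). A sketch with made-up parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
S, V, sigma, reps = 100, 4, 1.0, 2000

# Centered, orthogonal design with sum(x**2) = S, fixed across repetitions.
A = np.column_stack([np.ones(S), rng.standard_normal((S, V))])
Q, _ = np.linalg.qr(A)
X = Q[:, 1:] * np.sqrt(S)
beta_star = np.array([0.5, -0.2, 0.0, 0.3])

mse = []
for _ in range(reps):
    r_train = X @ beta_star + rng.normal(0, sigma, S)        # alpha* = 0 here
    alpha_ml = r_train.mean()
    beta_ml = (X.T @ (r_train - alpha_ml)) / S
    r_fresh = X @ beta_star + rng.normal(0, sigma, S)        # new shocks, same design
    fit = alpha_ml + X @ beta_ml
    mse.append(np.mean((r_fresh - fit) ** 2))

# Theory: sigma^2 + 1*(sigma^2/S) + V*(sigma^2/S), i.e. 1 + 5/100 for these values.
assert abs(np.mean(mse) - sigma**2 * (1 + (V + 1) / S)) < 0.03
```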

3. A Better Predictor

With this benchmark in mind, let’s take a look at a variant of the James-Stein (JS) estimator:

    \begin{align*} \hat{\beta}_v^{\text{JS}} &\overset{\scriptscriptstyle \text{def}}{=}  (1 - \lambda) \cdot \hat{\beta}_v^{\text{ML}} \quad \text{for each } v = 1, \, \ldots, \, V. \end{align*}

In the definition above, \lambda \in [0, \, 1] denotes a bias factor that shrinks the maximum-likelihood estimates towards zero whenever \lambda > 0. So, with access to an oracle, we can compute our mean squared error when using the James-Stein estimates to predict returns just like before:

    \begin{align*} \mathrm{E} \left[ \, \left(  r_s  -  [\hat{\alpha}^{\text{ML}} + {\textstyle \sum_v} \, \hat{\beta}_v^{\text{JS}} \cdot x_{s,v}] \right)^2 \, \middle| \, \{ \alpha^\star, \, \beta_1^\star, \, \ldots, \, \beta_V^\star \} \, \right] &= \sigma^2  +  1 \cdot (\sfrac{\sigma^2}{S})  \\ &\quad + V  \cdot \left\{  \, (1 - \lambda)^2 \cdot (\sfrac{\sigma^2}{S}) + \lambda^2  \cdot ({\textstyle \frac{1}{V}} \cdot {\textstyle \sum_v} \, |\beta_v^\star|^2) \, \right\}. \end{align*}

Now the third term is more complicated. If we use a more biased estimator, \lambda \to 1 and |\hat{\beta}_v^{\text{JS}}| \to 0, then using the most likely parameter values rather than the true parameter values to predict returns is going to cause less damage. But, bias is going to generate really bad predictions whenever the true parameter value is large |\beta_v^\star| \gg 0. The (1 - \lambda)^2 \cdot (\sfrac{\sigma^2}{S}) and \lambda^2 \cdot (\frac{1}{V} \cdot \sum_v \, |\beta_v^\star|^2) terms capture these two opposing forces.

Comparing the maximum-likelihood and James-Stein prediction errors reveals that we should prefer the James-Stein estimator to the maximum-likelihood estimator if there’s a \lambda \in (0, \, 1] such that:

    \begin{align*} (1 - \lambda)^2 \cdot (\sfrac{\sigma^2}{S}) + \lambda^2  \cdot ({\textstyle \frac{1}{V}} \cdot {\textstyle \sum_v} \, |\beta_v^\star|^2) < (\sfrac{\sigma^2}{S}). \end{align*}

But, here’s the thing: if we have access to an oracle, then there’s always going to be some \lambda > 0 that satisfies this inequality. This is easier to see if we rearrange things a bit:

    \begin{align*} {\textstyle  \frac{ \frac{1}{V} \cdot \sum_v \, |\beta_v^\star|^2 }{ (\sfrac{\sigma^2}{S}) } } < {\textstyle  \frac{ 2 - \lambda }{ \lambda } }. \end{align*}

So, there’s always a sufficiently small \lambda such that this inequality holds. Thus, for any \{ \alpha^\star, \, \beta_1^\star, \, \ldots, \, \beta_V^\star \}, there’s some \lambda such that the James-Stein estimates give better predictions than the most likely parameter values.
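Here’s a numerical version of this argument. The bracketed risk term is a parabola in \lambda with a strictly negative slope at \lambda = 0, so some shrinkage always helps, and the minimizer has a simple closed form. (The values of \sfrac{\sigma^2}{S} and \frac{1}{V} \cdot \sum_v |\beta_v^\star|^2 below are made up.)

```python
import numpy as np

a = 0.02    # sigma^2 / S, the variance of each ML estimate (hypothetical)
b = 0.50    # (1/V) * sum |beta*|^2, average squared true coefficient (hypothetical)

def js_risk_term(lam):
    # The bracketed term in the James-Stein prediction error.
    return (1 - lam) ** 2 * a + lam ** 2 * b

lams = np.linspace(0, 1, 100_001)
lam_best = lams[np.argmin(js_risk_term(lams))]

# The grid minimizer matches the closed form a / (a + b), and it strictly
# beats the maximum-likelihood choice lam = 0.
assert np.isclose(lam_best, a / (a + b), atol=1e-4)
assert js_risk_term(lam_best) < js_risk_term(0.0)

# Even when true coefficients dwarf the noise, a tiny bit of shrinkage helps.
b_big, lam_tiny = 50.0, 1e-6
assert (1 - lam_tiny) ** 2 * a + lam_tiny ** 2 * b_big < a
```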

4. Abandoning the Oracle

Perhaps this isn’t a useful comparison? In the real world, we can’t see the true parameter values when deciding which estimator to use. So, we can’t know ahead of time whether or not we’ve picked a small enough \lambda. It turns out that having to estimate \lambda changes things, but only when there’s just V=1 right-hand-side variable. When there are V \geq 2 variables, James-Stein with an estimated \lambda still gives better predictions.

To see where this distinction comes from, let’s first solve for the optimal choice of \lambda when we still have access to the oracle. This will tell us what we have to estimate when we abandon the oracle. The optimal choice of \lambda solves:

    \begin{align*} \lambda^\star &= \arg \min_{\lambda \in [0, \, 1]} \left\{ \, (1 - \lambda)^2 \cdot (\sfrac{\sigma^2}{S}) + \lambda^2  \cdot  \left( {\textstyle \frac{1}{V}} \cdot {\textstyle \sum_v} \, |\beta_v^\star|^2 \right) \, \right\} \end{align*}

So, if we differentiate, then we can solve the first-order condition for \lambda^\star:

    \begin{align*} \lambda^\star &= {\textstyle  \frac{ (\sfrac{\sigma^2}{S}) }{ \frac{1}{V} \cdot \sum_v \, |\beta_v^\star|^2 + (\sfrac{\sigma^2}{S}) } }. \end{align*}

If the maximum-likelihood estimates are really noisy relative to the size of the true parameter values (i.e., \sfrac{\sigma^2}{S} is large relative to \frac{1}{V} \cdot \sum_v \, |\beta_v^\star|^2), then using the most likely parameter values rather than the true parameter values is going to increase our prediction error a lot. So, we should use more biased parameter estimates.

But, notice what this formula is telling us. To estimate the right amount of bias, all we have to do is estimate the variance of the true parameter values. We don’t have to estimate every single one. So, we can estimate the variance of the true parameter values as follows,

    \begin{align*} {\textstyle \frac{1}{V-1}} \cdot {\textstyle \sum_v} \, |\hat{\beta}_v^{\text{ML}}|^2 &= (\sfrac{\sigma^2}{S}) + {\textstyle \frac{1}{V}} \cdot {\textstyle \sum_v} \, |\beta_v^\star|^2, \end{align*}

where the factor of (V - 1) on the left-hand side is a degrees-of-freedom correction. To see why we need this correction, recall that

    \begin{align*} \mathrm{E} \left[ \, \left(  r_s  -  [\hat{\alpha}^{\text{ML}} + {\textstyle \sum_v} \, \hat{\beta}_v^{\text{ML}} \cdot x_{s,v}] \right)^2 \, \right] &= \sigma^2  +  1 \cdot (\sfrac{\sigma^2}{S})  +  V \cdot (\sfrac{\sigma^2}{S}), \end{align*}

so not all V maximum-likelihood estimates can move independently. There is an adding-up constraint.

Thus, if we have to estimate the right amount of bias to use, then we should choose:

    \begin{align*} \hat{\lambda} &= {\textstyle \frac{ (\sfrac{\sigma^2}{S}) }{ \frac{1}{V-1} \cdot \sum_v \, |\hat{\beta}_v^{\text{ML}}|^2 } }. \end{align*}

Notice that, when we have to estimate the right amount of bias, \hat{\lambda} > 0 only if V \geq 2. If V = 1, then the denominator is infinite and \hat{\lambda} = 0. After all, if there’s only 1 right-hand-side variable, then the equation to estimate the variance of the true parameter values has the same first-order condition as the equation to estimate \hat{\beta}_1^{\text{ML}}. With this estimated amount of bias, our prediction error becomes

    \begin{align*} \mathrm{E} \left[ \, \left(  r_s  -  [\hat{\alpha}^{\text{ML}} + {\textstyle \sum_v} \, \hat{\beta}_v^{\text{JS}} \cdot x_{s,v}] \right)^2 \, \right] &= \sigma^2  +  1 \cdot (\sfrac{\sigma^2}{S})  +  V \cdot (1 - \lambda) \cdot (\sfrac{\sigma^2}{S}), \end{align*}

which is always less than the maximum-likelihood prediction error whenever V \geq 2.
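Here’s a simulation of the full recipe, using \hat{\lambda} exactly as defined above and comparing out-of-sample prediction errors (all sample sizes and coefficient values are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
S, V, sigma, reps = 100, 10, 1.0, 500

A = np.column_stack([np.ones(S), rng.standard_normal((S, V))])
Q, _ = np.linalg.qr(A)
X = Q[:, 1:] * np.sqrt(S)                 # centered, orthogonal, sum(x**2) = S
beta_star = np.full(V, 0.05)              # small true coefficients

mse_ml, mse_js = [], []
for _ in range(reps):
    r = X @ beta_star + rng.normal(0, sigma, S)
    alpha_ml = r.mean()
    beta_ml = (X.T @ (r - alpha_ml)) / S

    # Estimated bias factor from the text (sigma^2 treated as known),
    # clipped to keep lambda in [0, 1].
    lam_hat = (sigma**2 / S) / (np.sum(beta_ml**2) / (V - 1))
    lam_hat = min(lam_hat, 1.0)
    beta_js = (1 - lam_hat) * beta_ml

    r_fresh = X @ beta_star + rng.normal(0, sigma, S)
    mse_ml.append(np.mean((r_fresh - alpha_ml - X @ beta_ml) ** 2))
    mse_js.append(np.mean((r_fresh - alpha_ml - X @ beta_js) ** 2))

# With V >= 2, the shrunk (less likely!) coefficients predict better on average.
assert np.mean(mse_js) < np.mean(mse_ml)
```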

5. What This Means

My last post looked at one reason why it’s harder to predict the cross-section of returns when V \geq 2: Bayesian variable selection doesn’t scale. If we’re not sure which subset of variables actually predict returns, then finding the subset of variables that’s the most likely to predict returns means solving a non-convex optimization problem. It turns out that solving this optimization problem means doing an exhaustive search over the powerset containing all 2^V possible subsets of variables. And, this just isn’t feasible when V \gg 2.

But, this scaling problem isn’t the only reason why it’s harder to predict the cross-section of returns when V \geq 2. And, this post points out another one of these reasons: even if you could solve this non-convex optimization problem and find the most likely parameter values, these parameter values wouldn’t give the best predictions. When V \geq 2, there’s a fundamental tension between making good predictions and learning about the most likely parameter values in the data-generating process for returns. So, when V \geq 2 traders are going to solve the prediction problem and live with the resulting biased beliefs about the underlying parameter values. What’s more, the \hat{\lambda} with the best out-of-sample fit in the data is going to quantify how much the desire to make good predictions distorts traders’ beliefs.


Why Bayesian Variable Selection Doesn’t Scale

January 19, 2017 by Alex

1. Motivation

Traders are constantly looking for variables that predict returns. If x is the only candidate variable traders are considering, then it’s easy to use the Bayesian information criterion to check whether x predicts returns. Previously, I showed that using the univariate version of the Bayesian information criterion means solving

(●)   \begin{align*} \hat{\beta} &= \arg \min_{\beta}  \big\{  \,  \underset{\text{Prediction Error}}{{\textstyle \frac{1}{S}} \cdot {\textstyle \sum_s} (r_s - \beta \cdot x_s)^2} + \underset{\text{Penalty}}{ \lambda \cdot 1_{\{ \beta \neq 0 \}} } \,  \big\} \qquad \text{with} \qquad \lambda = {\textstyle \frac{1}{S}} \cdot \log(S) \end{align*}

after standardizing things so that \hat{\mu}_x, \, \hat{\mu}_r = 0 and \hat{\sigma}_x^2 = 1. If the solution is some \hat{\beta} \neq 0, then x predicts returns. Notation: Parameters with hats are in-sample estimates. e.g., if x_s \overset{\scriptscriptstyle \text{iid}}{\sim} \mathrm{N}(0, \, 1), then \frac{1}{S} \cdot \sum_s x_s = \hat{\mu}_x \sim \mathrm{N}(0, \, \sfrac{1}{S}).

But, what if there’s more than 1 variable? There’s an obvious multivariate extension of (●):

(⣿)   \begin{align*} \{\hat{\beta}_1, \, \ldots, \, \hat{\beta}_V \} &= \arg \min_{\beta_1, \, \ldots, \, \beta_V}  \left\{  \,  {\textstyle \frac{1}{S}} \cdot {\textstyle \sum_s} ( r_s - {\textstyle \sum_v} \, \beta_v \cdot x_{s,v} )^2 + \lambda \cdot {\textstyle \sum_v}  1_{\{ \beta_v \neq 0 \}} \,  \right\}. \end{align*}

So, you might guess that it’d be easy to check which subset of V \geq 1 variables predicts returns by evaluating (⣿). But, it’s not. To evaluate the multivariate version of the Bayesian information criterion, traders would have to check 2^V different parameter values. That’s a combinatorial nightmare when V \gg 1. Thus, traders can’t take a strictly Bayesian approach to variable selection when there are lots of variables to choose from.

Why is evaluating (⣿) so hard? That’s the topic of today’s post.

2. Non-Convex Problem

Let’s start by looking at what makes (●) so easy. The key insight is that you face the same-sized penalty no matter what \hat{\beta} \neq 0 you choose when solving the univariate version of the Bayesian information criterion:

    \begin{align*} \lambda \cdot 1_{\{ 0.01 \neq 0 \}} = \lambda \cdot 1_{\{ 100 \neq 0 \}} = \lambda \qquad \text{or, put differently} \qquad {\textstyle \frac{\mathrm{d}}{\mathrm{d}\beta}} \big[ \lambda \cdot 1_{\{ \beta \neq 0 \}} \big] = 0 \quad \text{whenever} \quad \beta \neq 0. \end{align*}

So, if you’re going to set \hat{\beta} \neq 0, then you might as well choose the value that minimizes your prediction error:

    \begin{align*} \arg \min_{\beta \neq 0}  \left\{  \,  {\textstyle \frac{1}{S}} \cdot {\textstyle \sum_s} (r_s - \beta \cdot x_s)^2 + \lambda \cdot 1_{\{ \beta \neq 0 \}} \,  \right\} &=  \arg \min_{\beta}  \left\{  \,  {\textstyle \frac{1}{S}} \cdot {\textstyle \sum_s} (r_s - \beta \cdot x_s)^2 + \lambda \,  \right\} \\ &= \arg \min_{\beta}  \left\{  \,  {\textstyle \frac{1}{S}} \cdot {\textstyle \sum_s} (r_s - \beta \cdot x_s)^2 \,  \right\} \\ &= \hat{\beta}^{\text{OLS}}. \end{align*}

Thus, to evaluate (●), all you have to do is check 2 parameter values, \beta = 0 and \beta = \hat{\beta}^{\text{OLS}}, and see which one gives a smaller result. Practically speaking, this means running an OLS regression, r_s = \hat{\beta}^{\text{OLS}} \cdot x_s + \hat{\epsilon}_s, and checking whether or not the penalized residual variance, \hat{\sigma}_\epsilon^2 + \lambda, is smaller than the raw return variance, \hat{\sigma}_r^2.
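Here’s that two-candidate check as code, on simulated data with a genuine predictor (the sample size and \beta^\star are made up), after the standardizations described above:

```python
import numpy as np

rng = np.random.default_rng(0)
S = 200
lam = np.log(S) / S

# Standardized predictor and returns with a genuine relationship (beta* = 0.5).
x = rng.standard_normal(S)
x = (x - x.mean()) / x.std()                  # mu_x = 0, sigma_x^2 = 1
r = 0.5 * x + rng.normal(0, 1, S)
r = r - r.mean()                              # mu_r = 0

beta_ols = (x @ r) / (x @ x)
penalized = np.mean((r - beta_ols * x) ** 2) + lam   # sigma_eps^2 + lambda
raw = np.mean(r**2)                                  # sigma_r^2

# Keep the variable only if the penalized residual variance beats doing nothing.
beta_hat = beta_ols if penalized < raw else 0.0
assert beta_hat != 0.0      # a true beta* = 0.5 easily survives the BIC penalty
```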

Most explanations for why (⣿) is hard to evaluate focus on the fact that (⣿) is a non-convex optimization problem (e.g., see here and here). But, the univariate version of the Bayesian information criterion is also a non-convex optimization problem. Just look at the region around \beta = 0 in the figure to the left, which shows the objective function from (●). So, non-convexity can only be part of the explanation for why (⣿) is hard to evaluate. Increasing the number of variables must add the missing ingredient.

3. The Missing Ingredient


If there are many variables to consider, then these variables can be correlated. Correlation. This is the missing ingredient that makes evaluating (⣿) hard. Let’s look at a numerical example to see why.

Suppose there are only S = 9 stocks and V = 3 variables. The diagram above summarizes the correlation structure between these variables and returns. The red bar is the total variation in returns. The blue bars represent the portion of this variation that’s related to each of the 3 variables. If you can draw a vertical line through a pair of bars (i.e., the bars overlap), then the associated variables are correlated. So, because the first 2 blue bars don’t overlap, x_1 and x_2 are perfectly uncorrelated in-sample:

    \begin{align*} \widehat{\mathrm{Cor}}[x_1, \, x_2] &= 0. \end{align*}

Whereas, because the first 2 blue bars both overlap the third, x_1 and x_2 are both correlated with x_3:

    \begin{align*} \widehat{\mathrm{Cor}}[x_1, \, x_3] = \widehat{\mathrm{Cor}}[x_2, \, x_3] &= 0.41. \end{align*}

Finally, longer overlaps denote larger correlations. So, x_3 is the single best predictor of returns since the third blue bar has the longest overlap with the top red bar:

    \begin{align*} \widehat{\mathrm{Cor}}[r, \, x_1] = \widehat{\mathrm{Cor}}[r, \, x_2] &= 0.62 \\ \widehat{\mathrm{Cor}}[r, \, x_3] &= 0.67. \end{align*}

And, this creates a problem. If you had to pick only 1 variable to predict returns, then you’d pick x_3:

    \begin{align*} {\textstyle \frac{1}{9}} \cdot {\textstyle \sum_s} ( r_s - \hat{\beta}_3^{\text{OLS}} \cdot x_{s,3} )^2 + \lambda = 0.80 < 0.86 &= {\textstyle \frac{1}{9}} \cdot {\textstyle \sum_s} ( r_s - \hat{\beta}_1^{\text{OLS}} \cdot x_{s,1} )^2 + \lambda \\ &= {\textstyle \frac{1}{9}} \cdot {\textstyle \sum_s} ( r_s - \hat{\beta}_2^{\text{OLS}} \cdot x_{s,2} )^2 + \lambda. \end{align*}

But, \{x_1, \, x_2 \} is actually the subset of variables that minimizes (⣿) in this example:

    \begin{align*} {\textstyle \frac{1}{9}} \cdot {\textstyle \sum_s} ( r_s - {\textstyle \sum_{v \in \{1,2\}}} \hat{\beta}_v^{\text{OLS}} \cdot x_{s,v} )^2 + 2 \cdot \lambda = 0.72. \end{align*}

In other words, the variable that best predicts returns on its own isn’t part of the subset of variables that collectively best predict returns. In other examples it might be, but there’s no quick way to figure out which kind of example we’re in because (⣿) is a non-convex optimization problem. Until you actually plug \{x_1, \, x_2 \} into (⣿), there’s absolutely no reason to suspect that either variable belongs to the subset that minimizes (⣿). Think about it. x_1 and x_2 are both worse choices than x_3 on their own. And, if you start with x_3 and add either variable, things get even worse:

    \begin{align*} {\textstyle \frac{1}{9}} \cdot {\textstyle \sum_s} ( r_s - \hat{\beta}_3^{\text{OLS}} \cdot x_{s,3} )^2 + \lambda = 0.80 < 0.90 &= {\textstyle \frac{1}{9}} \cdot {\textstyle \sum_s} ( r_s - {\textstyle \sum_{v \in \{3,1\}}} \, \hat{\beta}_v^{\text{OLS}} \cdot x_{s,v} )^2 + 2 \cdot \lambda \\ &= {\textstyle \frac{1}{9}} \cdot {\textstyle \sum_s} ( r_s - {\textstyle \sum_{v \in \{3,2\}}} \, \hat{\beta}_v^{\text{OLS}} \cdot x_{s,v} )^2 + 2 \cdot \lambda. \end{align*}

With many correlated variables, you can never tell how close you are to the subset of variables that best predicts returns. To evaluate (⣿), you’ve got to check all 2^V possible combinations. There are no shortcuts.
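Here’s a sketch of this kind of example with made-up numbers (orthogonal x_1 and x_2, an x_3 that mixes both plus noise, and a penalty of \lambda = 0.2 per variable), so the penalized fits differ from the values above, but the punchline is the same: the best single predictor isn’t in the best subset.

```python
import itertools
import numpy as np

# Build S = 9 observations of returns r and V = 3 candidate predictors.
# x1 and x2 are exactly orthogonal in-sample; x3 mixes both plus noise, so
# it has the highest single-variable correlation with r even though the
# pair {x1, x2} fits r perfectly.
rng = np.random.default_rng(0)
S, lam = 9, 0.2
Q, _ = np.linalg.qr(rng.standard_normal((S, 3)))  # orthonormal columns
x1, x2, e = (3.0 * Q[:, j] for j in range(3))     # scale so mean(x**2) = 1
r = x1 + x2
X = np.column_stack([x1, x2, x1 + x2 + e])

def penalized_fit(subset):
    """Mean squared OLS residual on the subset, plus lam per variable used."""
    beta, *_ = np.linalg.lstsq(X[:, subset], r, rcond=None)
    resid = r - X[:, subset] @ beta
    return np.mean(resid**2) + lam * len(subset)

subsets = [s for k in (1, 2, 3) for s in itertools.combinations(range(3), k)]
best_single = min([s for s in subsets if len(s) == 1], key=penalized_fit)
best_overall = min(subsets, key=penalized_fit)

print(best_single)   # (2,): x3 wins on its own...
print(best_overall)  # (0, 1): ...but isn't in the best subset
```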

4. Birthday-Paradox Math

If correlation makes it hard to evaluate (⣿), then shouldn’t we be able to fix the problem by only considering independent variables? Yes… but only in a fairytale world where there are an infinite number of stocks, S \to \infty. The problem is unavoidable in the real world where there are almost as many candidate variables as there are stocks because independent variables are still going to be correlated in finite samples.

Suppose there are V \geq 2 independent variables that might predict returns. Although these variables are independent, they won’t be perfectly uncorrelated in finite samples. So, let’s characterize the maximum in-sample correlation between any pair of variables. After standardizing each variable so that \hat{\mu}_{x,v} = 0 and \hat{\sigma}_{x,v}^2 = 1, the in-sample correlation between x_v and x_{v'} when S \gg 1 is roughly:

    \begin{align*} \hat{\rho}_{v,v'} \sim \mathrm{N}(0, \, \sfrac{1}{S}). \end{align*}

Although \mathrm{E}[(\hat{\rho}_{v,v'} - 0)^2] = \sfrac{1}{S} \to 0 as S \to \infty, our estimates won’t be exactly zero in finite samples.

CODE

Since the normal distribution is symmetric, the probability that x_v and x_{v'} have an in-sample correlation more extreme than c is:

    \begin{align*} \mathrm{Pr}[ |\hat{\rho}_{v,v'}| > c ] &= 2 \cdot \mathrm{Pr}[ \hat{\rho}_{v,v'} > c ] = 2 \cdot \mathrm{\Phi}(-c \cdot \!\sqrt{S}). \end{align*}
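Here’s a quick Monte Carlo sketch of this tail probability for two truly independent variables, compared with the large-S normal approximation \mathrm{Pr}[|\hat{\rho}|>c] = 2 \cdot \mathrm{\Phi}(-c \cdot \sqrt{S}), where \mathrm{\Phi} is the standard-normal CDF. The S, c, and trial-count values are made up:

```python
import math
import numpy as np

# Monte Carlo estimate of Pr[|rho_hat| > c] for two independent variables,
# against the large-S normal approximation 2 * Phi(-c * sqrt(S)).
rng = np.random.default_rng(1)
S, c, trials = 50, 0.20, 20_000

hits = 0
for _ in range(trials):
    x, y = rng.standard_normal(S), rng.standard_normal(S)
    hits += abs(np.corrcoef(x, y)[0, 1]) > c

Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # N(0,1) CDF
mc, approx = hits / trials, 2.0 * Phi(-c * math.sqrt(S))
print(round(float(mc), 3), round(approx, 3))  # both near 0.16
```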

So, since there are {V \choose 2} \geq \frac{1}{4} \cdot V^2 pairs of variables whenever V \geq 2, treating these pairs as roughly independent tells us that the probability that no pair has a correlation more extreme than c is at most:

    \begin{align*} \mathrm{Pr}[ \max |\hat{\rho}_{v,v'}| \leq c ] &\leq \big( \, 1 - 2 \cdot \mathrm{\Phi}(-c \cdot \!{\textstyle \sqrt{S}}) \, \big)^{\frac{1}{4} \cdot V^2}. \end{align*}

Here’s the punchline. Because V^2 shows up as an exponent in the equation above, the probability that all pairwise in-sample correlations happen to be really small is going to shrink exponentially fast as traders consider more and more variables. This means that finite-sample effects are always going to make evaluating (⣿) computationally intractable in real-world settings with many variables, even when the variables are truly uncorrelated as S \to \infty. For example, the in-sample correlation of \hat{\rho}_{1,3} = \hat{\rho}_{2,3} = 0.41 from the previous example might seem like an unreasonably high number for independent random variables, something that could only happen in a sample as tiny as S=9. But, the figure above shows that even when there are S = 50 stocks, there’s still a 50{\scriptstyle \%} chance of observing an in-sample correlation of at least 0.41 when considering V = 20 candidate variables.
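Here’s a simulation sketch of this birthday-paradox effect with S = 50 stocks and V = 20 candidate variables (the 500-trial Monte Carlo setup is made up):

```python
import numpy as np

# How big is the largest in-sample correlation among V independent
# variables observed for S stocks? With S = 50 and V = 20 there are
# 190 pairs, so correlations of 0.4+ are routine even under independence.
rng = np.random.default_rng(2)
S, V, trials = 50, 20, 500

max_cors = []
for _ in range(trials):
    X = rng.standard_normal((S, V))
    C = np.corrcoef(X, rowvar=False)           # V x V correlation matrix
    off_diag = np.abs(C[np.triu_indices(V, k=1)])
    max_cors.append(float(off_diag.max()))

share = float(np.mean([m >= 0.41 for m in max_cors]))
print(round(float(np.mean(max_cors)), 2))  # typically around 0.4
print(round(share, 2))                     # close to a coin flip
```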

5. What This Means

Market efficiency has been an “organizing principle for 30 years of empirical work” in academic finance. The principle is based on a negative feedback loop: predictable returns suggest profitable trading strategies, but implementing these strategies eliminates the initial predictability. So, if there are enough really smart traders, then they’re going to find the subset of variables that predicts returns and eliminate this predictability. That’s market efficiency.

Even for really smart traders, finding the subset of variables that predicts returns is hard. And, this problem gets harder when there are more candidate variables to choose from. But, while researchers have thought about this problem in the past, they’ve primarily focused on the dangers of p-hacking (e.g., see here and here). If traders regress returns on V = 20 different variables, then they should expect that 1 of these regressions is going to produce a statistically significant coefficient with a p\text{-value} \leq 0.05 even if none of the variables actually predicts returns. So, researchers have focused on correcting p-values.

But, this misses the point. Searching for the subset of variables that predicts returns isn’t hard because you have to adjust your p-values. It’s hard because it requires a brute-force search through the powerset of all possible subsets of predictors. It’s hard because any optimization problem with a hard-thresholding rule like (⣿) can be re-written as an integer programming problem,

    \begin{align*} \min_{{\boldsymbol \delta} \in \{0, \, 1\}^V}  \left\{  \,  {\textstyle \frac{1}{S}}  \cdot {\textstyle \sum_s} (r_s - {\textstyle \sum_v} [\hat{\beta}_v^{\text{OLS}} \cdot x_{s,v}] \cdot \delta_v)^2 \quad \text{subject to} \quad k \geq {\textstyle \sum_v}  \delta_v \, \right\}, \end{align*}

which means that it’s NP-hard (e.g., see here, here, or here). It’s hard because, in the diagram above, finding the subset of variables that predicts returns is equivalent to finding the cheapest way to cover the red bar with blue bars at a cost of \lambda per blue bar, which means solving a weighted set-cover problem.

So, the fact that Bayesian variable selection doesn’t scale is a really central problem. It means that, even if there are lots and lots of really smart traders, they may not be able to find the best subset of variables for predicting returns. You’re probably on a secure wifi network right now. This network is considered “secure” because cracking its 128-bit passcode would involve a brute-force search over 2^{128} possible keys, which would take 1 billion billion years. So, if there are over V = 313 predictors documented in top academic journals, why shouldn’t we consider the subset of variables that actually predicts returns “secure”, too? After all, finding it would involve a brute-force search over 2^{313} possible subsets. We might be able to approximate it. But, the exact collection of variables may as well be in a vault somewhere.

Filed Under: Uncategorized

The Bayesian Information Criterion

January 3, 2017 by Alex

1. Motivation

Imagine that we’re trying to predict the cross-section of expected returns, and we’ve got a sneaking suspicion that x might be a good predictor. So, we regress today’s returns on x to see if our hunch is right,

    \begin{align*} r_{n,t}  = \hat{\mu}_{\text{OLS}} + \hat{\beta}_{\text{OLS}} \cdot x_{n,t-1}  +  \hat{\epsilon}_{n,t}. \end{align*}

The logic is straightforward. If x explains enough of the variation in today’s returns, then x must be a good predictor and we should include it in our model of tomorrow’s returns, \mathrm{E}_t(r_{n,t+1}) = \hat{\mu}_{\text{OLS}} + \hat{\beta}_{\text{OLS}} \cdot x_{n,t}.

But, how much variation is “enough variation”? After all, even if x doesn’t actually predict tomorrow’s returns, we’re still going to fit today’s returns better if we use an additional right-hand-side variable,

    \begin{align*} {\textstyle \frac{1}{N}} \cdot {\textstyle \sum_{n=1}^N} (r_{n,t} - \hat{\mu}_{\text{OLS}} - \hat{\beta}_{\text{OLS}} \cdot x_{n,t-1})^2  \leq  {\textstyle \frac{1}{N}} \cdot {\textstyle \sum_{n=1}^N} (r_{n,t} - \hat{\mu}_{\text{OLS}})^2. \end{align*}

The effect is mechanical. If we want to explain all of the variation in today’s returns, then all we have to do is include N right-hand-side variables in our OLS regression. With N linearly independent right-hand-side variables we can always fit N stock returns perfectly in sample, no matter what variables we choose.
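Here’s a two-line check of this mechanical effect, using N = 25 made-up returns and regressors:

```python
import numpy as np

# With N linearly independent right-hand-side variables, OLS fits N returns
# exactly, whatever the variables are: the design matrix is invertible.
rng = np.random.default_rng(7)
N = 25

r = rng.standard_normal(N)              # "returns"
X = rng.standard_normal((N, N))         # N arbitrary (generic) regressors

beta = np.linalg.solve(X, r)            # OLS with a square, full-rank design
resid = r - X @ beta
print(float(np.max(np.abs(resid))) < 1e-8)  # True: zero in-sample error
```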

The Bayesian information criterion (BIC) tells us that we should include x as a right-hand-side variable if it explains at least \sfrac{\log(N)}{N} of the residual variation,

    \begin{align*} {\textstyle \frac{1}{N}} \cdot {\textstyle \sum_{n=1}^N} (r_{n,t} - \hat{\mu}_{\text{OLS}} - \hat{\beta}_{\text{OLS}} \cdot x_{n,t-1})^2  + {\textstyle \frac{\log(N)}{N}} &\leq {\textstyle \frac{1}{N}} \cdot {\textstyle \sum_{n=1}^N} (r_{n,t} - \hat{\mu}_{\text{OLS}})^2. \end{align*}

But, where does this \sfrac{\log(N)}{N} penalty come from? And, why is following this rule the Bayesian thing to do? Bayesian updating involves learning about a parameter value by combining prior beliefs with evidence from realized data. So, what parameter are we learning about when using the Bayesian information criterion? And, what are our prior beliefs? These questions are the topic of today’s post.

2. Estimation

Instead of diving directly into our predictor-selection problem (should we include x in our model?), let’s pause for a second and solve our parameter-estimation problem (how should we estimate the coefficient on x?). Suppose the data-generating process for returns is

    \begin{align*} r_{n,t} =  \beta_\star \cdot x_{n,t-1} + \epsilon_{n,t} \end{align*}

where \beta_\star \sim \mathrm{N}(0, \, \sigma^2), \epsilon_{n,t} \overset{\scriptscriptstyle \text{iid}}{\sim} \mathrm{N}(0, \, 1), and x is normalized so that \frac{1}{N} \cdot \sum_{n=1}^N x_n^2 = \widehat{\mathrm{Var}}[x_n] = 1. For simplicity, let’s also assume that the intercept \mu_\star = 0 in the analysis below.

If we see N returns from this data-generating process, \mathbf{r}_t = \{ \, r_{1,t}, \, r_{2,t}, \, \ldots, \, r_{N,t} \, \}, then we can estimate \beta_\star by choosing the parameter value that would maximize the posterior probability of realizing these returns:

    \begin{align*} \hat{\beta}_{\text{MAP}} &\overset{\scriptscriptstyle \text{def}}{=} \arg \max_{\beta} \{ \, \mathrm{Pr}( \mathbf{r}_t | \mathbf{x}_{t-1}, \, \beta ) \times \mathrm{Pr}(\beta) \, \} \\ &= \arg \min_{\beta} \{ \, - \, {\textstyle \frac{1}{N}} \cdot \log \mathrm{Pr}( \mathbf{r}_t | \mathbf{x}_{t-1}, \, \beta )  -  {\textstyle \frac{1}{N}} \cdot \log \mathrm{Pr}(\beta) \, \}. \end{align*}

This is known as maximum a posteriori (MAP) estimation, and the second equality in the expression above points out how we can either maximize the posterior probability or minimize -(\sfrac{1}{N}) \cdot \log(\cdot) of this function,

    \begin{align*} \mathrm{f}(\beta)  &\overset{\scriptscriptstyle \text{def}}{=}   - \, {\textstyle \frac{1}{N}} \cdot \log \mathrm{Pr}( \mathbf{r}_t | \mathbf{x}_{t-1}, \, \beta )  -  {\textstyle \frac{1}{N}} \cdot \log \mathrm{Pr}(\beta). \end{align*}

We can think about \mathrm{f}(\beta) as the average improbability of the realized returns given \beta_\star = \beta.

So, what’s the answer? Because \beta_\star \sim \mathrm{N}(0, \, \sigma^2) and \epsilon_{n,t} \overset{\scriptscriptstyle \text{iid}}{\sim} \mathrm{N}(0, \, 1), we know that

    \begin{align*} \mathrm{f}(\beta)  &= {\textstyle \frac{1}{N}}  \cdot  \left\{ \, {\textstyle \sum_{n=1}^N} {\textstyle \frac{1}{2}} \cdot (r_{n,t} - \beta \cdot x_{n,t-1})^2 + N \cdot {\textstyle \frac{1}{2}} \cdot \log(2 \cdot \pi) \, \right\} \\ &\qquad \quad  + \, {\textstyle \frac{1}{N}}  \cdot  \left\{  \, {\textstyle \frac{1}{2 \cdot \sigma^2}} \cdot (\beta - 0)^2 + {\textstyle \frac{1}{2}} \cdot \log(2 \cdot \pi) + {\textstyle \frac{1}{2}} \cdot \log(\sigma^2) \, \right\} \end{align*}

where the first line is -(\sfrac{1}{N}) \cdot \log \mathrm{Pr}(\mathbf{r}_t|\mathbf{x}_{t-1}, \, \beta) and the second line is -(\sfrac{1}{N}) \cdot \log \mathrm{Pr}(\beta). What’s more, because we’re specifically choosing \hat{\beta}_{\text{MAP}} to minimize \mathrm{f}(\beta), we also know that

    \begin{align*} \mathrm{f}'(\hat{\beta}_{\text{MAP}}) &= 0 = - \, {\textstyle \frac{1}{N}}  \cdot  {\textstyle \sum_{n=1}^N} (r_{n,t} - \hat{\beta}_{\text{MAP}} \cdot x_{n,t-1}) \cdot x_{n,t-1} + {\textstyle \frac{1}{N}}  \cdot  {\textstyle \frac{1}{\sigma^2}} \cdot \hat{\beta}_{\text{MAP}}. \end{align*}

And, solving this first-order condition for \hat{\beta}_{\text{MAP}} tells us exactly how to estimate \beta_\star:

    \begin{align*} \hat{\beta}_{\text{MAP}} &= \frac{ N \cdot \left\{ \frac{1}{N} \cdot \sum_{n=1}^N r_{n,t} \cdot x_{n,t-1} \right\} }{ \frac{1}{\sigma^2} + N \cdot \left\{ \frac{1}{N} \cdot \sum_{n=1}^N x_{n,t-1}^2 \right\} } = \frac{ N \cdot \widehat{\mathrm{Cov}}[r_{n,t}, \, x_{n,t-1}] }{ \frac{1}{\sigma^2} + N \cdot \widehat{\mathrm{Var}}[x_{n,t-1}] }. \end{align*}
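As a sanity check on this formula, here’s a sketch that minimizes \mathrm{f}(\beta) by brute force over a grid and compares the answer with the closed form (the N, \sigma^2, and \beta_\star values are made up):

```python
import numpy as np

# Check the closed-form beta_MAP against a brute-force grid minimization of
# f(beta), the average improbability defined above.
rng = np.random.default_rng(3)
N, sigma2, beta_star = 200, 0.5, 0.3

x = rng.standard_normal(N)
x = (x - x.mean()) / x.std()                 # normalize so mean(x**2) = 1
r = beta_star * x + rng.standard_normal(N)   # epsilon ~ N(0, 1)

def f(beta):
    """-(1/N) log Pr(r|x,beta) - (1/N) log Pr(beta), constants dropped."""
    nll = 0.5 * np.sum((r - beta * x) ** 2)  # negative log-likelihood
    nlp = beta**2 / (2.0 * sigma2)           # negative log-prior
    return (nll + nlp) / N

grid = np.linspace(-2.0, 2.0, 40_001)        # spacing 1e-4
beta_grid = grid[int(np.argmin([f(b) for b in grid]))]

beta_map = (N * np.mean(r * x)) / (1.0 / sigma2 + N * np.mean(x**2))
print(abs(beta_map - beta_grid) < 1e-3)      # True: they agree
```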

3. Selection

Now that we’ve seen the solution to our parameter-estimation problem, let’s get back to solving our predictor-selection problem. Should we include x in our predictive model of tomorrow’s returns? It turns out that answering this question means learning about the prior variance of \beta_\star. Is \beta_\star equally likely to take on any value, \sigma^2 \to \infty? Or, should we assume that \beta_\star = 0 regardless of the evidence, \sigma^2 \to 0?

To see where the first choice comes from, let’s think about the priors we’re implicitly adopting when we include x in our predictive model. Since \beta_\star \sim \mathrm{N}(0, \, \sigma^2), this means looking for a \sigma^2 such that \hat{\beta}_{\text{MAP}} = \hat{\beta}_{\text{OLS}}. Inspecting the solution to our parameter-estimation problem reveals that

    \begin{align*} \lim_{\sigma^2 \to \infty} \hat{\beta}_{\text{MAP}} &= \lim_{\sigma^2 \to \infty}  \left( \frac{ N \cdot \widehat{\mathrm{Cov}}[r_{n,t}, \, x_{n,t-1}] }{ \frac{1}{\sigma^2} + N \cdot \widehat{\mathrm{Var}}[x_{n,t-1}] } \right) = \frac{ \widehat{\mathrm{Cov}}[r_{n,t}, \, x_{n,t-1}] }{ \widehat{\mathrm{Var}}[x_{n,t-1}] } = \hat{\beta}_{\text{OLS}}. \end{align*}

So, by including x, we’re adopting an agnostic prior that \beta_\star is equally likely to be any value under the sun.

To see where the second choice comes from, let’s think about the priors we’re implicitly adopting when we exclude x from our predictive model. This means looking for a \sigma^2 such that \hat{\beta}_{\text{MAP}} = 0 regardless of the realized data, \mathbf{x}_{t-1}. Again, inspecting the formula for \hat{\beta}_{\text{MAP}} reveals that

    \begin{align*} \lim_{\sigma^2 \to 0} \hat{\beta}_{\text{MAP}}  &=  \lim_{\sigma^2 \to 0}  \left( \frac{ N \cdot \widehat{\mathrm{Cov}}[r_{n,t}, \, x_{n,t-1}] }{ \frac{1}{\sigma^2} + N \cdot \widehat{\mathrm{Var}}[x_{n,t-1}] } \right) =0. \end{align*}

So, by excluding x, we’re adopting a religious prior that \beta_\star = 0 regardless of any new evidence.

Thus, when we decide whether to include x in our predictive model, what we’re really doing is learning about our priors. So, after seeing N returns, \mathbf{r}_t, we can decide whether to include x in our predictive model by choosing the prior variance, \sigma^2, that maximizes the posterior probability of realizing these returns,

    \begin{align*} \hat{\sigma}_{\text{MAP}}^2  &\overset{\scriptscriptstyle \text{def}}{=}  \arg \max_{\sigma^2 \in \{\infty, \, 0\}}  \left\{  \,  \mathrm{Pr}( \mathbf{r}_t | \mathbf{x}_{t-1}, \, \sigma^2 ) \times \mathrm{Pr}( \sigma^2 ) \,  \right\} \\ &=  \arg \min_{\sigma^2 \in \{ \infty, \, 0\}}  \left\{  \,  - {\textstyle \frac{1}{N}}  \cdot \log \mathrm{Pr}( \mathbf{r}_t | \mathbf{x}_{t-1}, \, \sigma^2 ) - {\textstyle \frac{1}{N}} \cdot \log \mathrm{Pr}( \sigma^2 ) \,  \right\}, \end{align*}

where the second equality in the expression above points out how we can either maximize the posterior probability or minimize -(\sfrac{1}{N}) \cdot \log(\cdot) of this function—i.e., its average improbability. Either way, if we estimate \hat{\sigma}_{\text{MAP}}^2 \to \infty, then we should include x; whereas, if we estimate \hat{\sigma}_{\text{MAP}}^2 \to 0, then we shouldn’t.

4. Why log(N)/N?

Assuming that both choices of prior are equally likely ex ante, so that \mathrm{Pr}(\sigma^2) drops out of the comparison, the probability of the realized returns given our choice of priors is given by

    \begin{align*} \mathrm{Pr}( \mathbf{r}_t | \mathbf{x}_{t-1}, \, \sigma^2 ) &=  {\textstyle \int_{-\infty}^\infty} \mathrm{Pr}(\mathbf{r}_t | \mathbf{x}_{t-1}, \, \beta) \cdot \mathrm{Pr}(\beta|\sigma^2) \cdot  \mathrm{d}\beta  \\ &=  {\textstyle \int_{-\infty}^\infty} \, e^{-N \cdot \mathrm{f}(\beta)} \cdot  \mathrm{d}\beta. \end{align*}

In this section, we’re going to see how to evaluate this integral. And, in the process, we’re going to see precisely where that \sfrac{\log(N)}{N} penalty term in the Bayesian information criterion comes from.

Here’s the key insight in plain English. The realized returns are affected by noise shocks. By definition, excluding x from our predictive model means that we aren’t learning about \beta_\star from the realized returns, so there’s no way for these noise shocks to affect either our estimate of \hat{\beta}_{\text{MAP}} or our posterior-probability calculations. By contrast, if we include x in our predictive model, then we are learning about \beta_\star from the realized returns, so these noise shocks will distort both our estimate of \hat{\beta}_{\text{MAP}} and our posterior-probability calculations. The distortions caused by these noise shocks are going to be the source of the \sfrac{\log(N)}{N} penalty term in the Bayesian information criterion.

Now, here’s the same insight in Mathese. Take a look at the Taylor expansion of \mathrm{f}(\beta) around \hat{\beta}_{\text{MAP}},

    \begin{align*} \mathrm{f}(\beta)  &= \mathrm{f}(\hat{\beta}_{\text{MAP}})  + {\textstyle \frac{1}{2}} \cdot \mathrm{f}''(\hat{\beta}_{\text{MAP}}) \cdot (\beta - \hat{\beta}_{\text{MAP}})^2. \end{align*}

There’s no first-order term because \hat{\beta}_{\text{MAP}} is chosen to minimize \mathrm{f}(\beta), and there are no higher-order terms because both \beta_\star and \epsilon_{n,t} are normally distributed. From the formula for \mathrm{f}(\beta) we can calculate that

    \begin{align*} \mathrm{f}''(\hat{\beta}_{\text{MAP}}) &= {\textstyle \frac{1}{N}} \cdot {\textstyle \sum_{n=1}^N} x_{n,t-1}^2 + {\textstyle \frac{1}{N}} \cdot  {\textstyle \frac{1}{\sigma^2}}. \end{align*}

Recall that \mathrm{f}(\beta) measures the average improbability of realizing \mathbf{r}_t given that \beta_\star = \beta. So, if \mathrm{f}''(\hat{\beta}_{\text{MAP}}) \to \infty for a given choice of priors, then having any \beta_\star \neq \hat{\beta}_{\text{MAP}} is infinitely improbable under those priors. And, this is exactly what we find when we exclude x from our predictive model, \lim_{\sigma \to 0} \mathrm{f}''(\hat{\beta}_{\text{MAP}}) = \infty. By contrast, if we include x in our predictive model, then \lim_{\sigma \to \infty} \mathrm{f}''(\hat{\beta}_{\text{MAP}}) = 1, meaning that we are willing to entertain the idea that \beta_\star \neq \hat{\beta}_{\text{MAP}} due to distortions caused by the noise shocks.

To see why these distortions warrant a \sfrac{\log(N)}{N} penalty, all we have to do then is evaluate the integral. First, let’s think about the case where we exclude x from our predictive model. We just saw that, if \sigma \to 0, then we are unwilling to consider any parameter values besides \hat{\beta}_{\text{MAP}} = 0. So, the integral equation for our posteriors given that \sigma^2 \to 0 simplifies to

    \begin{align*} \lim_{\sigma^2 \to 0} \left\{ \, \mathrm{Pr}( \mathbf{r}_t | \mathbf{x}_{t-1}, \, \sigma^2 ) \, \right\} &= \mathrm{Pr}( \mathbf{r}_t | \mathbf{x}_{t-1}, \, \beta_\star = 0 ) \\ &= {\textstyle \big(\frac{1}{\sqrt{2 \cdot \pi}}\big)^N} \cdot e^{ - \, \sum_{n=1}^N \frac{1}{2} \cdot (r_{n,t} - 0)^2 }. \end{align*}

This means that the average improbability of realizing \mathbf{r}_t given the priors \sigma^2 \to 0 is given by

    \begin{align*} \lim_{\sigma \to 0}  \left\{ \, - {\textstyle \frac{1}{N}} \cdot \log \mathrm{Pr}( \mathbf{r}_t | \mathbf{x}_{t-1}, \, \sigma ) \, \right\} &= {\textstyle \frac{1}{2 \cdot N}} \cdot {\textstyle \sum_{n=1}^N} (r_{n,t} - 0)^2 + {\textstyle \frac{1}{2}} \cdot \log(2 \cdot \pi). \end{align*}

To calculate our posterior beliefs when we include x, let’s use this Taylor expansion around \hat{\beta}_{\text{MAP}} again,

    \begin{align*} \lim_{\sigma \to \infty} \mathrm{Pr}( \mathbf{r}_t | \mathbf{x}_{t-1}, \, \sigma ) &= \lim_{\sigma \to \infty}  \left\{  \, {\textstyle \int_{-\infty}^\infty} \,  e^{-\, N \cdot \mathrm{f}(\beta)}  \cdot  \mathrm{d}\beta \,  \right\} \\ &= \lim_{\sigma \to \infty} \left\{  \, {\textstyle \int_{-\infty}^\infty} \, e^{-\, N \cdot \left\{ \mathrm{f}(\hat{\beta}_{\text{MAP}}) + \frac{1}{2} \cdot \mathrm{f}''(\hat{\beta}_{\text{MAP}}) \cdot (\beta - \hat{\beta}_{\text{MAP}})^2 \right\}} \cdot \mathrm{d}\beta \,  \right\} \\ &= \left\{  \, e^{-\, N \cdot \mathrm{f}(\hat{\beta}_{\text{OLS}})}  \,  \right\} \times  \left\{  \, {\textstyle \int_{-\infty}^\infty} \, e^{ - \, \frac{N}{2} \cdot (\beta - \hat{\beta}_{\text{OLS}})^2} \cdot \mathrm{d}\beta \,  \right\}. \end{align*}

The first term is the probability of observing the realized returns assuming that \beta_\star = \lim_{\sigma^2 \to \infty} \hat{\beta}_{\text{MAP}} = \hat{\beta}_{\text{OLS}}. The second term is a penalty that accounts for the fact that \beta_\star might be different from the estimated \hat{\beta}_{\text{OLS}} in finite samples. Due to the central-limit theorem, this difference between \beta_\star and \hat{\beta}_{\text{OLS}} is going to shrink at a rate of \sqrt{(\sfrac{1}{N})}:

    \begin{align*} {\textstyle \int_{-\infty}^\infty} \,  e^{ - \, \frac{N}{2} \cdot (\beta - \hat{\beta}_{\text{OLS}})^2}  \cdot  \mathrm{d}\beta &= {\textstyle \int_{-\infty}^\infty} \,  e^{ - \, \frac{1}{2 \cdot (\sfrac{1}{N})} \cdot (\beta - \hat{\beta}_{\text{OLS}})^2}  \cdot  \mathrm{d}\beta \\ &= \sqrt{2 \cdot \pi \cdot (\sfrac{1}{N})}  \cdot \int_{-\infty}^\infty \,  {\textstyle \frac{1}{\sqrt{2 \cdot \pi \cdot (\sfrac{1}{N})}}} \cdot e^{ - \, \frac{1}{2 \cdot (\sfrac{1}{N})} \cdot (\beta - \hat{\beta}_{\text{OLS}})^2}  \cdot  \mathrm{d}\beta \\ &= \sqrt{2 \cdot \pi \cdot (\sfrac{1}{N})}. \end{align*}

So, the average improbability of realizing \mathbf{r}_t given the priors \sigma^2 \to \infty is given by

    \begin{align*} \lim_{\sigma \to \infty} \left\{ \, - \, {\textstyle \frac{1}{N}} \cdot \log \mathrm{Pr}( \mathbf{r}_t | \mathbf{x}_{t-1}, \, \sigma ) \, \right\} &= {\textstyle \frac{1}{2 \cdot N}} \cdot {\textstyle \sum_{n=1}^N} (r_{n,t} - \hat{\beta}_{\text{OLS}} \cdot x_{n,t-1})^2 + {\textstyle \frac{1}{2}} \cdot \log(2 \cdot \pi) + {\textstyle \frac{1}{2}} \cdot {\textstyle \frac{\log(N)}{N}}  + \mathrm{O}(\sfrac{1}{N}) \end{align*}

where \mathrm{O}(\sfrac{1}{N}) is big-“O” notation denoting terms that shrink at least as fast as \sfrac{1}{N} as N \to \infty.
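As a quick numerical sanity check of the \sqrt{2 \cdot \pi \cdot (\sfrac{1}{N})} integral above, here’s a sketch with made-up values N = 200 and \hat{\beta}_{\text{OLS}} = 0.3:

```python
import numpy as np

# Numerical check that the integral of exp(-(N/2) * (b - b_hat)**2) over b
# equals sqrt(2 * pi / N).
N, b_hat = 200, 0.3

b = np.linspace(b_hat - 1.0, b_hat + 1.0, 200_001)  # +/- 14 std devs
h = b[1] - b[0]
numeric = float(np.sum(np.exp(-0.5 * N * (b - b_hat) ** 2)) * h)
closed = float(np.sqrt(2.0 * np.pi / N))
print(abs(numeric - closed) < 1e-6)  # True
```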

5. Formatting

Bringing everything together, hopefully it’s now clear why we can decide whether to include x in our predictive model by checking whether

    \begin{align*} {\textstyle \frac{1}{N}} \cdot {\textstyle \sum_{n=1}^N} (r_{n,t} - \hat{\beta}_{\text{OLS}} \cdot x_{n,t-1})^2 + {\textstyle \frac{\log(N)}{N}}  + \mathrm{O}(\sfrac{1}{N}) \leq {\textstyle \frac{1}{N}} \cdot {\textstyle \sum_{n=1}^N} (r_{n,t} - 0)^2. \end{align*}

The \sfrac{\log(N)}{N} penalty term accounts for the fact that you’re going to be overfitting the data in sample when you include more right-hand-side variables. This criterion was first proposed in Schwarz (1978), who showed that the criterion becomes exact as N \to \infty. The Bayesian information criterion is often written as an optimization problem as well:

    \begin{align*} \hat{\beta}_{\text{MAP}} &=  \arg \min_{\beta} \left\{  \, {\textstyle \frac{1}{N}} \cdot {\textstyle \sum_{n=1}^N} (r_{n,t} - \beta \cdot x_{n,t-1})^2 + {\textstyle \frac{\log(N)}{N}} \cdot \mathrm{1}_{\{ \beta \neq 0 \}} \, \right\}. \end{align*}

Both ways of writing down the criterion are the same. They just look different due to formatting. There is one interesting idea that pops out of writing down the Bayesian information criterion as an optimization problem. Solving for \hat{\beta}_{\text{MAP}} suggests that you should completely ignore any predictors with sufficiently small OLS coefficients:

    \begin{align*} \hat{\beta}_{\text{MAP}} &=  \begin{cases} \hat{\beta}_{\text{OLS}} &\text{if } |\hat{\beta}_{\text{OLS}}| \geq \sqrt{{\textstyle \frac{\log(N)}{N}}}, \text{ and} \\ 0 &\text{otherwise.} \end{cases} \end{align*}
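The include-or-exclude comparison collapses to a hard threshold on |\hat{\beta}_{\text{OLS}}| once x is normalized so that \frac{1}{N} \cdot \sum_n x_n^2 = 1. Here’s a sketch checking that the two rules agree on simulated data (the N and \beta_\star values are made up):

```python
import numpy as np

# Check that comparing the two penalized fits agrees with a hard threshold
# on |beta_OLS| when x is normalized so that mean(x**2) = 1.
rng = np.random.default_rng(4)
N = 500

agree = []
for beta_star in (0.0, 0.05, 0.2):
    x = rng.standard_normal(N)
    x = (x - x.mean()) / x.std()
    r = beta_star * x + rng.standard_normal(N)

    beta_ols = np.mean(r * x) / np.mean(x**2)
    with_x = np.mean((r - beta_ols * x) ** 2) + np.log(N) / N
    without_x = np.mean(r**2)

    include = with_x <= without_x                     # penalized comparison
    passes = abs(beta_ols) >= np.sqrt(np.log(N) / N)  # threshold rule
    agree.append(include == passes)

print(agree)  # [True, True, True]
```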


A Model of Rebalancing Cascades

October 23, 2016 by Alex

1. Motivating Examples

Trading strategies can interact with one another to amplify small initial shocks to fundamentals:

  • Quant Crisis, Aug 2007: “During the week of August 6, 2007, a number of [quantitative hedge funds] experienced unprecedented losses… Initial losses [were] due to the forced liquidation of one or more large equity market-neutral portfolios… and the subsequent price impact… caused other similarly constructed portfolios to experience losses. These losses, in turn, caused other funds to deleverage their portfolios [and] led to further losses [and] more deleveraging and so on.”
  • Flash Crash, May 2010: “The Dow Jones industrial average plunged more than 600 points in a matter of minutes that day and then recovered in a blink… [it] began with the sale by Waddell & Reed of 75,000 E-Mini S&P 500 futures contracts… late in the trading day… [with] many of the contracts bought by… computerized traders who… [then] traded contracts back and forth [like a] ‘hot potato’.”
  • Drop in Oil Prices, May 2011: “Never before had crude oil plummeted so deeply during the course of a day… prices were off by nearly $13 a barrel… [and] market players were unable to identify any single bank or fund orchestrating a massive sale to liquidate positions… [rather] computerized trading just kicked in when key price levels were reached.”
  • End-of-Day Volume, Oct 2011: “In the last 18 minutes of trading, the S&P 500-stock index jumped more than 10 points with no news to account for the rally. If you were left scratching your head, you were not alone… [and, the] culprit behind the late-day market swings: exchange-traded funds or ETFs.”
  • Sterling after Brexit, Oct 2016: “If a country’s exchange rate represents international investors’ confidence in its government’s policies, the markets have given Britain the thumbs-down… The most likely explanation for the plunge lies in the action of algorithmic trades… These sales can be contagious, with one program’s trades setting off the sell signals of other algorithms.”

I refer to these sorts of events as rebalancing cascades. Stock 1’s fundamentals change, so a trading strategy sells stock 1 and replaces it with stock 2. This purchase of stock 2 forces another trading strategy to buy stock 2 and sell stock 3. And, this sale forces…

The examples above seem to suggest that you don’t need a very big initial change in stock 1’s fundamentals to trigger a cascade. For instance, when oil prices suddenly dropped in May 2011, traders were “unable to identify any single bank or fund orchestrating a massive sale”. They were left “scratching [their] heads”, to use the language of the next example. So, in this post, I write down a random-networks model à la Watts (2002) to understand when we should expect small changes in fundamentals to trigger these sorts of long rebalancing cascades.

2. Market Structure

Consider a market with S stocks, s = 1, \, 2, \, \ldots, \, S, where S is a really big number. If a change in the fundamentals of stock s' will force some trading strategy to rebalance and buy stock s instead, then let’s say that stock s' and stock s are neighbors. Suppose that two randomly selected stocks are neighbors with probability \sfrac{\lambda}{S}. This uniform-random-matching assumption means that the number of neighbors that each stock has, N_s, is Poisson distributed with mean \lambda \overset{\scriptscriptstyle \mathrm{def}}{=} \mathrm{E}[N_s],

(1)   \begin{equation*} N_s \sim \mathrm{Pois}(\lambda). \end{equation*}

The fact that each stock is only neighbors with a fraction of the market captures the idea that different trading strategies rebalance in different ways. For technical reasons, let’s assume that \lambda = \mathrm{O}[\log(S)]. The figure below gives examples of this sort of random network when S=100.

[Figure: examples of random rebalancing networks, S = 100]
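Here’s a minimal sketch of sampling this random network and checking that degrees look \mathrm{Pois}(\lambda); the S and \lambda values are made up:

```python
import numpy as np

# Sample the random rebalancing network: each pair of stocks is linked
# with probability lam / S, so degrees are approximately Pois(lam).
rng = np.random.default_rng(5)
S, lam = 2_000, 5.0

A = np.triu(rng.random((S, S)) < lam / S, k=1)  # each pair linked once
A = A | A.T                                     # neighbors are mutual
degrees = A.sum(axis=1)

print(round(float(degrees.mean()), 1))  # close to lam = 5
print(round(float(degrees.var()), 1))   # Poisson signature: variance ~ mean
```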

Now, suppose that \Delta_s \in \{ 0, \, 1\} is an indicator variable for whether or not stock s’s fundamentals have changed. If a bunch of strategies start trading stock s because of changes in its neighboring stocks’ fundamentals, then this additional trading activity can affect stock s’s fundamentals. For example, if lots of funds buy a stock and push it into the S&P 500, then the stock will have a higher market beta. Let’s model neighboring stocks’ effect on stock s’s fundamentals as follows,

(2)   \begin{equation*} \Delta_s = \begin{cases} 1 &\text{if } (\sfrac{1}{N_{s}}) \cdot {\textstyle \sum_{s' \in \mathcal{N}_s}} \Delta_{s'} \geq \phi \\ 0 &\text{else} \end{cases}, \end{equation*}

where \phi \in (0, \, 1] captures the vulnerability of a stock’s fundamentals to rebalancing. If there are lots of different strategies trading stock s in lots of different ways, N_s > N^\star, then no single rebalancing decision will be important enough to change stock s’s fundamentals. But, if stock s has at most N^\star \overset{\scriptscriptstyle \mathrm{def}}{=} \lfloor \sfrac{1}{\phi} \rfloor neighbors, then a change in the fundamentals of a single neighbor will generate enough rebalancing to cause stock s’s fundamentals to change too. Let’s say that stock s has V_s \overset{\scriptscriptstyle \mathrm{def}}{=} \sum_{s' \in \mathcal{N}_s} 1_{\{ \, N_{s'} \leq N^\star \}} such vulnerable neighbors.

[Diagram: vulnerable stocks]

Here’s the exercise I have in mind. Imagine that we select a stock at random, s, and exogenously change its fundamentals, \Delta_s = 1. If s happens to have a vulnerable neighbor, s', then the rebalancing caused by our initial shock will change the fundamentals of a second stock, \Delta_{s'} = 1. And, if s' happens to have an additional vulnerable neighbor of its own, s'', then the second wave of rebalancing caused by our initial shock to stock s will change the fundamentals of a third stock as well, \Delta_{s''} = 1. If stock s'' doesn’t have any additional vulnerable neighbors, then we will have triggered a rebalancing cascade of length 3 with our initial shock to a single stock’s fundamentals. I want to characterize the distribution of cascade lengths for a randomly selected initial stock s,

(3)   \begin{equation*} C_s  =  \Delta_s  +  1_{\{ \mathcal{V}_s \neq \emptyset \}}  \cdot  \left\{  \,  {\textstyle \sum_{s' \in \mathcal{V}_s}}  \left( \, \Delta_{s'}  +  1_{\{ \mathcal{V}_{s'} \neq \emptyset \}}  \cdot  \left\{  \,  {\textstyle \sum_{s'' \in \mathcal{V}_{s'}}} \left( \,  \Delta_{s''}  +  \cdots  \,  \right) \,  \right\} \,  \right) \,  \right\}, \end{equation*}

as a function of the market’s average connectivity, \lambda, and vulnerability threshold, \phi.
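Here’s a sketch of this exercise on a simulated network. Following (3), it only propagates single-neighbor flips through vulnerable stocks, ignoring the possibility that several changed neighbors jointly flip a non-vulnerable stock; the S, \lambda, and \phi values are made up:

```python
import numpy as np

# Shock one random stock and propagate changes through vulnerable
# neighbors, i.e., stocks with N_s <= floor(1/phi) neighbors, which flip
# when a single neighbor flips.
rng = np.random.default_rng(6)
S, lam, phi = 1_000, 3.0, 0.25            # N_star = floor(1/phi) = 4

A = np.triu(rng.random((S, S)) < lam / S, k=1)
A = A | A.T                                # symmetric adjacency matrix
vulnerable = A.sum(axis=1) <= int(1.0 / phi)

def cascade_length(seed):
    changed, frontier = {seed}, [seed]
    while frontier:                        # breadth-first propagation
        nxt = []
        for s in frontier:
            for s2 in np.flatnonzero(A[s]):
                s2 = int(s2)
                if vulnerable[s2] and s2 not in changed:
                    changed.add(s2)
                    nxt.append(s2)
        frontier = nxt
    return len(changed)

lengths = [cascade_length(int(rng.integers(S))) for _ in range(100)]
print(min(lengths), round(float(np.mean(lengths)), 1), max(lengths))
```

Cascade lengths vary widely with \lambda and \phi, which is exactly the comparative static the model is after.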

3. Generating Functions

Generating functions make it possible to compute the distribution of cascade lengths. Here’s the basic idea. Take a look at Graham, Knuth, and Patashnik (1994, Ch 7) for more details. Suppose we’re flipping coins and counting the number of heads. The distribution of the number of heads, h, after one flip is given by:

(4)   \begin{equation*} \text{\# of heads}|\text{1 flip} =  \begin{cases} 1 &\text{w/ prob $q$, and} \\ 0 &\text{w/ prob $(1-q)$.} \end{cases} \end{equation*}

If q = \sfrac{1}{2}, then the coin is fair. The generating function for this same distribution is:

(5)   \begin{align*} \mathrm{G}(x|\text{1 flip}) &= {\textstyle \sum_{h=0}^1} \, p_h \cdot x^h. \end{align*}

Each term in the series is associated with one possible outcome for the total number of heads. p_0 = (1 - q) is the probability of realizing h=0 heads. p_1 = q is the probability of realizing h=1 heads. We say that \mathrm{G}(x|\text{1 flip}) generates the distribution because we can compute all its moments by evaluating the derivatives of the generating function at x=1. For example, the 0th-order derivative, \left. \mathrm{G}(x|\text{1 flip})\right|_{x=1} = 1, says that the coin never lands on its edge or winks out of existence. And, if we want to compute the expected number of heads, then we can use the 1st-order derivative:

(6)   \begin{align*} \mathrm{E}[h|\text{1 flip}] &= \left. x \cdot \mathrm{G}'(x|\text{1 flip}) \right|_{x=1}  \\ &= \left. x \cdot {\textstyle \sum_{h=0}^1} \, p_h \cdot h \cdot x^{h-1} \right|_{x=1} \\ &= \left. {\textstyle \sum_{h=0}^1} \, p_h \cdot h \cdot x^h \right|_{x=1} \\ &= p_0 \cdot 0 \cdot 1^0 + p_1 \cdot 1 \cdot 1^1  \\ &= q. \end{align*}

The fact that derivatives of the generating function give the moments of the associated distribution will be useful below. Let’s call this Property 1: derivatives are moments.
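To make Property 1 concrete, a generating function over a finite outcome space is just a list of coefficients, and its first derivative at x=1 is the mean. A quick Python sketch (the helper name gf_mean is mine):

```python
def gf_mean(coeffs):
    """E[h] = G'(1) = sum_h h * p_h for G(x) = sum_h p_h * x^h."""
    return sum(h * p for h, p in enumerate(coeffs))

q = 0.5
G1 = [1 - q, q]                    # G(x | 1 flip) = (1-q)*x^0 + q*x^1
assert abs(sum(G1) - 1.0) < 1e-12  # G(1) = 1: the coin never disappears
print(gf_mean(G1))                 # E[h | 1 flip] = q = 0.5
```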

Here are two additional properties to keep in mind. Property 2: multiple samples. If we raise the generating function for the number of heads in one flip to the nth power, then we get the generating function for the total number of heads in n flips. To illustrate, look at what happens if we square the generating function for the number of heads in one flip:

(7)   \begin{align*} \mathrm{G}(x|\text{1 flip})^2  &=  \left\{ \, p_0 \cdot x^0 + p_1 \cdot x^1 \, \right\} \times \left\{ \, p_0 \cdot x^0 + p_1 \cdot x^1 \, \right\} \\ &= p_0^2 \cdot x^0 + 2 \cdot p_0 \cdot p_1 \cdot x^1 + p_1^2 \cdot x^2 \\ &= (1 - q)^2 \cdot x^0 + 2 \cdot (1 - q) \cdot q \cdot x^1 + q^2 \cdot x^2. \end{align*}

The result is just the generating function for the number of heads in two flips, \mathrm{G}(x|\text{2 flips}).
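In code, Property 2 is just polynomial multiplication, i.e., convolving coefficient lists. A short Python sketch (gf_mul is a made-up helper name):

```python
def gf_mul(a, b):
    """Product of two generating functions = convolution of coefficients."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

q = 0.5
G1 = [1 - q, q]          # heads in 1 flip
G2 = gf_mul(G1, G1)      # heads in 2 flips, as in Eq. (7)
print(G2)                # [0.25, 0.5, 0.25]
```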

Property 3: partial information. If we multiply the generating function for the total number of heads in (n-1) flips by x^1, then we have the generating function for the total number of heads in all n flips conditional on having already seen heads on the first flip. To illustrate, notice what happens when we multiply the generating function for the number of heads in one flip by x^1:

(8)   \begin{align*} \mathrm{G}(x|\text{2 flips}, \, \text{heads on 1st flip}) &= x^1 \cdot \mathrm{G}(x|\text{1 flip})  \\ &= p_0 \cdot x^1 + p_1 \cdot x^2 \\ &= (1 - q) \cdot x^1 + q \cdot x^2. \end{align*}

If we’ve already seen heads on the first flip, then there’s no way to realize h=0 heads. The lowest we can go is h=1 heads now. So, the first term is now h=1. And, this outcome occurs if we see tails on the second flip, which happens with probability (1-q).
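In coefficient form, Property 3 is just a shift: multiplying by x^1 prepends a zero to the coefficient list. A one-line Python sketch (gf_shift is a made-up name):

```python
def gf_shift(coeffs):
    """Multiply G(x) by x^1: every outcome count moves up by one."""
    return [0.0] + list(coeffs)

q = 0.5
G_cond = gf_shift([1 - q, q])  # 2 flips given heads on the 1st, Eq. (8)
print(G_cond)                  # [0.0, 0.5, 0.5] -- no way to see h = 0
```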

4. Cascade Lengths

Now, let’s return to our original problem. Let \mathrm{G}_c(x) be the generating function for the distribution of the cascade lengths that would be set off by an exogenous shock to stock s’s fundamentals:

(9)   \begin{equation*} \mathrm{G}_c(x) \overset{\scriptscriptstyle \mathrm{def}}{=} {\textstyle \sum_{c=1}^S} \, q_c \cdot x^c. \end{equation*}

The coefficient q_c gives the probability that a shock to stock s’s fundamentals would set off a cascade of length C_s = c. If stock s doesn’t have any vulnerable neighbors, then a shock to stock s’s fundamentals can only affect stock s itself, so c=1. If, instead, a shock to stock s’s fundamentals would set off a cascade affecting every other stock in the market, then c=S. Next, let \mathrm{G}_v(x) be the generating function for the number of vulnerable neighbors that stock s has:

(10)   \begin{equation*} \mathrm{G}_v(x) \overset{\scriptscriptstyle \mathrm{def}}{=} {\textstyle \sum_{v=0}^{S-1}} \, p_v \cdot x^v. \end{equation*}

So, the coefficient p_v is the probability that stock s has v vulnerable neighbors.

Notice how these two generating functions are linked. If stock s has v=1 vulnerable neighbor, s', then an initial shock to stock s’s fundamentals will set off a cascade of length C_s = c if a shock to its one vulnerable neighbor will set off a cascade of length C_{s'} = c - 1 excluding stock s. If stock s has v=2 vulnerable neighbors, s' and s'', then an initial shock to stock s will set off a cascade of length C_s = c if shocks to its two vulnerable neighbors will set off cascades of combined length C_{s'} + C_{s''} = c - 1 excluding stock s. And, if stock s has v=3 vulnerable neighbors, then an initial shock to stock s will set off a cascade of length C_s = c if shocks to its three vulnerable neighbors will set off cascades of combined length C_{s'} + C_{s''} + C_{s'''} = c - 1.

[Diagram: cascade-length generating function]

We know from the previous section (Property 2: multiple samples) that \mathrm{G}_c(x)^v is the generating function for the probability that v different shocks set off cascades of combined length c. And, we also know from the previous section (Property 3: partial information) that we have to multiply through by x^1 if we want the generating function for the probability that v different shocks set off cascades of combined length (c-1). So, the generating function for the distribution of cascade lengths has to satisfy the following internal-consistency condition as the number of stocks gets large, S \to \infty:

(11)   \begin{align*} \mathrm{G}_c(x)  &= p_0 \cdot x + p_1 \cdot x \cdot \mathrm{G}_c(x) + p_2 \cdot x \cdot \mathrm{G}_c(x)^2 + p_3 \cdot x \cdot \mathrm{G}_c(x)^3 + \cdots \\ &= x \cdot \left\{ \, p_0 \cdot \mathrm{G}_c(x)^0 + p_1 \cdot \mathrm{G}_c(x)^1 + p_2 \cdot \mathrm{G}_c(x)^2 + p_3 \cdot \mathrm{G}_c(x)^3 + \cdots \, \right\} \\ &= x \cdot \mathrm{G}_v\left(\mathrm{G}_c(x)\right). \end{align*}

The outer function, \mathrm{G}_v(\cdot), gives the probability that the initial stock s has v vulnerable neighbors. The inner function, \mathrm{G}_c(x), gives the probability that shocks to these vulnerable neighbors would set off cascades of combined length c. And, the multiplication by x accounts for the fact that we want to compute the probability that shocks to these vulnerable neighbors would set off cascades of combined length (c-1), not c.

With this equation in hand, we can now compute the expected length of the rebalancing cascade that would follow from an initial shock to a randomly selected stock s:

(12)   \begin{align*} \mathrm{E}[C_s] = \left. x \cdot \mathrm{G}_c'(x) \right|_{x=1}  &= 1 + \mathrm{G}_v'(1) \cdot \mathrm{G}_c'(1) \\ &= 1 + \mathrm{G}_v'(1) \cdot \mathrm{E}[C_s]. \end{align*}

Rearranging yields an expression for the expected cascade length:

(13)   \begin{align*} \mathrm{E}[C_s] = \frac{1}{1 - \mathrm{G}_v'(1)}. \end{align*}

And, in exactly the same way that \mathrm{G}_c'(1) = \mathrm{E}[C_s] (Property 1: derivatives are moments), the expected number of vulnerable neighbors that each stock has is given by \mathrm{G}_v'(1) = \mathrm{E}[V_s]. So, as long as stocks have fewer than one vulnerable neighbor on average, \mathrm{E}[V_s] < 1, we have an expression for the average rebalancing-cascade length as a function of the market’s connectivity, \lambda, and vulnerability threshold, \phi.
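Equation (13) is easy to check by simulation. If each affected stock hands the shock to a Poisson-distributed number of vulnerable neighbors with mean \mathrm{E}[V_s] < 1, the cascade is a subcritical branching process, and its average total size should come out to \sfrac{1}{(1 - \mathrm{E}[V_s])}. A Python sketch under that Poisson assumption (names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
mean_v = 0.5  # E[V_s] < 1, so Eq. (13) predicts E[C_s] = 1/(1 - 0.5) = 2

def cascade_length():
    """Total stocks hit when each affected stock passes the shock on to
    a Poisson(mean_v) number of vulnerable neighbors."""
    total, frontier = 1, 1
    while frontier:
        frontier = int(rng.poisson(mean_v, size=frontier).sum())
        total += frontier
    return total

sims = [cascade_length() for _ in range(20000)]
print(np.mean(sims))  # should land close to 2
```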

[Figure: expected cascade length]

The figure above plots the average length of the rebalancing cascade that would emerge if we selected an initial stock s at random and shocked its fundamentals. It’s got a really interesting shape, and a little math shows exactly why. We should expect short rebalancing cascades whenever stocks don’t have that many vulnerable neighbors. Here’s the expression for the average number of vulnerable neighbors that each stock has:

(14)   \begin{align*}  \mathrm{E}[V_s] = \mathrm{G}_v'(1)  &= \lambda \times \mathrm{Pr}[N_s \leq N^\star] \\ &= \lambda \times \mathrm{Pr}[N_s < (\lfloor \sfrac{1}{\phi} \rfloor +1)] \\ &=\lambda \times \left\{ \, \sum_{n < (\lfloor \sfrac{1}{\phi} \rfloor+1)} e^{-\lambda} \cdot \frac{\lambda^n}{n!} \, \right\}. \end{align*}

Notice that stocks can have fewer than one vulnerable neighbor on average for either of two reasons. First, they could have very few neighbors; that is, \lambda could be less than 1. Think about this as a fragmented market where very few people trade. This is the region at the bottom of the figure. Second, even if there are lots of people trading, stocks could have fundamentals that aren’t very vulnerable to the effects of rebalancing; that is, \phi is large. This is the region in the upper right of the figure. But, if the market isn’t too fragmented and stocks’ fundamentals are at least a little vulnerable to the effects of rebalancing, then long rebalancing cascades can emerge. In fact, they can be infinitely long…
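Both regions are easy to see numerically from Equation (14). A Python sketch of the Poisson calculation (expected_vulnerable is my own helper name):

```python
from math import exp, factorial, floor

def expected_vulnerable(lam, phi):
    """E[V_s] = lambda * Pr[Poisson(lam) <= floor(1/phi)], as in Eq. (14)."""
    n_star = floor(1 / phi)
    cdf = sum(exp(-lam) * lam**n / factorial(n) for n in range(n_star + 1))
    return lam * cdf

# fragmented market: few neighbors, so E[V_s] < 1 regardless of phi
print(expected_vulnerable(0.5, 0.3) < 1.0)   # True
# robust fundamentals: large phi, so E[V_s] < 1 even with many neighbors
print(expected_vulnerable(8.0, 0.9) < 1.0)   # True
# in between, stocks average more than one vulnerable neighbor
print(expected_vulnerable(2.0, 0.3) > 1.0)   # True
```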

5. Infinite Cascades

…but what does that even mean? It’s actually much more reasonable than it first sounds. In a large market, S \to \infty, an infinitely long rebalancing cascade is just a cascade that affects a non-infinitesimal fraction of all stocks. If we now specify that \mathrm{G}_c(x) is the generating function for the distribution of finite-length rebalancing cascades, then we can define \theta as the fraction of all stocks affected by an infinitely long rebalancing cascade,

(15)   \begin{align*} \mathrm{G}_c(1) \overset{\scriptscriptstyle \mathrm{def}}{=} 1 - \theta. \end{align*}

Think back to the coin-flipping example where we said that \left. \mathrm{G}(x|\text{1 flip})\right|_{x=1} = 1 because the coin never landed on its edge or magically winked out of existence. If the coin didn’t obey the laws of physics and disappeared 20{\scriptstyle \%} of the time, then we would have said that \left. \mathrm{G}(x|\text{1 flip})\right|_{x=1} = 0.80. So, if \mathrm{G}_c(x) is the generating function for the distribution of finite-length rebalancing cascades, then realizing an infinitely long cascade is like realizing a magical event that’s not characterized by \mathrm{G}_c(x). And, this way of framing the problem, \mathrm{G}_c(1) = 1 - \theta = \mathrm{G}_v(\mathrm{G}_c(1)), gives us a way to solve for the fraction of the market that’s typically affected by an infinitely long rebalancing cascade. With Poisson-distributed vulnerable neighbors, \mathrm{G}_v(x) = e^{\mathrm{E}[V_s] \cdot (x - 1)}, the condition reads 1 - \theta = e^{-\mathrm{E}[V_s] \cdot \theta}, or equivalently \theta = 1 - e^{-\mathrm{E}[V_s] \cdot \theta}.
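This fixed point is easy to compute by iteration. A Python sketch assuming Poisson-distributed vulnerable neighbors, so that \mathrm{G}_v(x) = e^{\mathrm{E}[V_s] \cdot (x-1)} and the condition is \theta = 1 - e^{-\mathrm{E}[V_s] \cdot \theta} (function names are mine):

```python
from math import exp

def giant_cascade_fraction(mean_v, iters=200):
    """Fixed point of theta = 1 - exp(-mean_v * theta), mean_v = E[V_s].
    theta > 0 only in the supercritical region, E[V_s] > 1."""
    theta = 1.0
    for _ in range(iters):
        theta = 1.0 - exp(-mean_v * theta)
    return theta

print(round(giant_cascade_fraction(0.8), 4))   # subcritical: 0.0
print(round(giant_cascade_fraction(1.5), 4))   # supercritical: ~0.58
```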

[Figure: sharp phase transition, markets look similar]

Finally, notice how sharp the phase transition is. Tiny changes in the market’s connectivity, \lambda, and vulnerability, \phi, can make all the difference between expecting infinitely long rebalancing cascades and expecting 16-stock-long rebalancing cascades. Take a look at the figure above. Each panel has S = 1000 stocks (the dots) and represents a single realization of trading-strategy rebalancing rules (the lines) in a market where each stock has \lambda = 6.40 neighbors on average (left) or \lambda = 6.41 neighbors on average (right). These two markets are observationally equivalent. But, as shown in the figure below, when \phi = 0.30 an initial shock to stock s’s fundamentals will yield a huge rebalancing cascade when \lambda = 6.40 (left) but not when \lambda = 6.41 (right). Small changes that push a market over the transition point where \mathrm{E}[V_s] = 1 can have huge effects on the cascade-length distribution. What’s more, right at this transition point where \mathrm{E}[V_s] = 1, the sizes of the rebalancing cascades follow a power-law distribution,

(16)   \begin{align*} \mathrm{Pr}[C_s = c] \sim c^{-\sfrac{3}{2}}, \end{align*}

as shown in Newman et al. (2002). Slight differences in how the market happens to be wired up today can affect whether or not a stock on the other side of the market will be affected by an initial shock to stock s’s fundamentals.
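The power law at the transition point can be eyeballed by simulation, too. Right at criticality, the cascade is a branching process with offspring mean exactly 1, so if \mathrm{Pr}[C_s = c] \sim c^{-\sfrac{3}{2}}, then the survival probability should fall off like \mathrm{Pr}[C_s \geq c] \sim c^{-\sfrac{1}{2}}. A Python sketch (the cap and sample sizes are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(1)

def critical_cascade(cap=10_000):
    """Cascade size with Poisson(1) offspring: the knife-edge case E[V_s] = 1."""
    total, frontier = 1, 1
    while frontier and total < cap:
        frontier = int(rng.poisson(1.0, size=frontier).sum())
        total += frontier
    return total

lengths = np.array([critical_cascade() for _ in range(4000)])
# If Pr[C_s >= c] ~ c^(-1/2), the survival probability at c = 10 should be
# roughly sqrt(100/10) ~ 3.2 times the survival probability at c = 100.
ratio = (lengths >= 10).mean() / (lengths >= 100).mean()
print(ratio)
```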

[Figure: sharp phase transition, cascades are different]
