Research Notebook

A Model of Rebalancing Cascades

October 23, 2016 by Alex

1. Motivating Examples

Trading strategies can interact with one another to amplify small initial shocks to fundamentals:

  • Quant Crisis, Aug 2007: “During the week of August 6, 2007, a number of [quantitative hedge funds] experienced unprecedented losses… Initial losses [were] due to the forced liquidation of one or more large equity market-neutral portfolios… and the subsequent price impact… caused other similarly constructed portfolios to experience losses. These losses, in turn, caused other funds to deleverage their portfolios [and] led to further losses [and] more deleveraging and so on.”
  • Flash Crash, May 2010: “The Dow Jones industrial average plunged more than 600 points in a matter of minutes that day and then recovered in a blink… [it] began with the sale by Waddell & Reed of 75,000 E-Mini S&P 500 futures contracts… late in the trading day… [with] many of the contracts bought by… computerized traders who… [then] traded contracts back and forth [like a] ‘hot potato’.”
  • Drop in Oil Prices, May 2011: “Never before had crude oil plummeted so deeply during the course of a day… prices were off by nearly $13 a barrel… [and] market players were unable to identify any single bank or fund orchestrating a massive sale to liquidate positions… [rather] computerized trading just kicked in when key price levels were reached.”
  • End-of-Day Volume, Oct 2011: “In the last 18 minutes of trading, the S&P 500-stock index jumped more than 10 points with no news to account for the rally. If you were left scratching your head, you were not alone… [and, the] culprit behind the late-day market swings: exchange-traded funds or ETFs.”
  • Sterling after Brexit, Oct 2016: “If a country’s exchange rate represents international investors’ confidence in its government’s policies, the markets have given Britain the thumbs-down… The most likely explanation for the plunge lies in the action of algorithmic trades… These sales can be contagious, with one program’s trades setting off the sell signals of other algorithms.”

I refer to these sorts of events as rebalancing cascades. Stock 1’s fundamentals change, so a trading strategy sells stock 1 and replaces it with stock 2. This purchase of stock 2 forces another trading strategy to buy stock 2 and sell stock 3. And, this sale forces…

The examples above seem to suggest that you don’t need a very big initial change in stock 1’s fundamentals to trigger a cascade. For instance, when oil prices suddenly dropped in May 2011, traders were “unable to identify any single bank or fund orchestrating a massive sale”. They were left “scratching [their] heads”, to use the language of the next example. So, in this post, I write down a random-networks model à la Watts (2002) to understand when we should expect small changes in fundamentals to trigger these sorts of long rebalancing cascades.

2. Market Structure

Consider a market with S stocks, s = 1, \, 2, \, \ldots, \, S, where S is a really big number. If a change in the fundamentals of stock s' will force some trading strategy to rebalance and buy stock s instead, then let’s say that stock s' and stock s are neighbors. Suppose that two randomly selected stocks are neighbors with probability \sfrac{\lambda}{S}. This uniform-random-matching assumption means that the number of neighbors that each stock has, N_s, is Poisson distributed with mean \lambda \overset{\scriptscriptstyle \mathrm{def}}{=} \mathrm{E}[N_s],

(1)   \begin{equation*} N_s \sim \mathrm{Pois}(\lambda). \end{equation*}

The fact that each stock is only neighbors with a fraction of the market captures the idea that different trading strategies rebalance in different ways. For technical reasons, let’s assume that \lambda = \mathrm{O}[\log(S)]. The figure below gives examples of this sort of random network when S=100.

[Figure: example random rebalancing networks with S = 100]
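The uniform-random-matching assumption is easy to simulate. Here’s a minimal Python sketch (all the names are mine): connect each pair of stocks independently with probability \sfrac{\lambda}{S} and check that the average number of neighbors lands near \lambda.

```python
import random

def make_network(S, lam, seed=0):
    """Connect each pair of stocks independently with probability lam/S."""
    rng = random.Random(seed)
    p = lam / S
    neighbors = {s: set() for s in range(S)}
    for s in range(S):
        for t in range(s + 1, S):
            if rng.random() < p:
                neighbors[s].add(t)
                neighbors[t].add(s)
    return neighbors

S, lam = 1000, 5.0
net = make_network(S, lam)
# With S large, each N_s is approximately Pois(lam), so the sample
# average degree should sit close to lam.
avg_degree = sum(len(v) for v in net.values()) / S
```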

Now, suppose that \Delta_s \in \{ 0, \, 1\} is an indicator variable for whether or not stock s’s fundamentals have changed. If a bunch of strategies start trading stock s because of changes in its neighboring stocks’ fundamentals, then this additional trading activity can affect stock s’s fundamentals. For example, if lots of funds buy a stock and push it into the S&P 500, then the stock will have a higher market beta. Let’s model neighboring stocks’ effect on stock s’s fundamentals as follows,

(2)   \begin{equation*} \Delta_s = \begin{cases} 1 &\text{if } (\sfrac{1}{N_{s}}) \cdot {\textstyle \sum_{s' \in \mathcal{N}_s}} \Delta_{s'} \geq \phi \\ 0 &\text{else} \end{cases}, \end{equation*}

where \phi \in (0, \, 1] captures the vulnerability of a stock’s fundamentals to rebalancing. If there are lots of different strategies trading stock s in lots of different ways, N_s > N^\star \overset{\scriptscriptstyle \mathrm{def}}{=} \lfloor \sfrac{1}{\phi} \rfloor, then no single rebalancing decision will be important enough to change stock s’s fundamentals. But, if stock s has at most N^\star neighbors, then a change in the fundamentals of a single neighbor will generate enough rebalancing to cause stock s’s fundamentals to change too. Let’s say that stock s has V_s \overset{\scriptscriptstyle \mathrm{def}}{=} \sum_{s' \in \mathcal{N}_s} 1_{\{ \, N_{s'} \leq N^\star \}} such vulnerable neighbors.
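Here’s a small Python sketch of the vulnerability count V_s on a tiny hand-built network (the helper names and the toy three-stock network are mine, purely for illustration):

```python
import math

def vulnerable_neighbors(neighbors, phi):
    """V_s = number of neighbors s' of s with N_{s'} <= N* = floor(1/phi)."""
    n_star = math.floor(1 / phi)
    degree = {s: len(nbrs) for s, nbrs in neighbors.items()}
    return {s: sum(1 for t in nbrs if degree[t] <= n_star)
            for s, nbrs in neighbors.items()}

# Toy network: stocks 0, 1, 2 all neighbors of one another.
toy_net = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
# phi = 0.5 gives N* = 2, and every stock has exactly 2 neighbors,
# so every stock is vulnerable and every stock has V_s = 2.
V = vulnerable_neighbors(toy_net, phi=0.5)
```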

[Figure: vulnerable stocks]

Here’s the exercise I have in mind. Imagine that we select a stock at random, s, and exogenously change its fundamentals, \Delta_s = 1. If s happens to have a vulnerable neighbor, s', then the rebalancing caused by our initial shock will change the fundamentals of a second stock, \Delta_{s'} = 1. And, if s' happens to have an additional vulnerable neighbor of its own, s'', then the second wave of rebalancing caused by our initial shock to stock s will change the fundamentals of a third stock as well, \Delta_{s''} = 1. If stock s'' doesn’t have any additional vulnerable neighbors, then we will have triggered a rebalancing cascade of length 3 with our initial shock to a single stock’s fundamentals. I want to characterize the distribution of cascade lengths for a randomly selected initial stock s,

(3)   \begin{equation*} C_s  =  \Delta_s  +  1_{\{ \mathcal{V}_s \neq \emptyset \}}  \cdot  \left\{  \,  {\textstyle \sum_{s' \in \mathcal{V}_s}}  \left( \, \Delta_{s'}  +  1_{\{ \mathcal{V}_{s'} \neq \emptyset \}}  \cdot  \left\{  \,  {\textstyle \sum_{s'' \in \mathcal{V}_{s'}}} \left( \,  \Delta_{s''}  +  \cdots  \,  \right) \,  \right\} \,  \right) \,  \right\}, \end{equation*}

as a function of the market’s average connectivity, \lambda, and vulnerability threshold, \phi.

3. Generating Functions

Generating functions make it possible to compute the distribution of cascade lengths. Here’s the basic idea; take a look at Graham, Knuth, and Patashnik (1994, Ch. 7) for more details. Suppose we’re flipping coins and counting the number of heads. The distribution of the number of heads, h, after one flip is given by:

(4)   \begin{equation*} \text{\# of heads}|\text{1 flip} =  \begin{cases} 1 &\text{w/ prob $q$, and} \\ 0 &\text{w/ prob $(1-q)$.} \end{cases} \end{equation*}

If q = \sfrac{1}{2}, then the coin is fair. The generating function for this same distribution is:

(5)   \begin{align*} \mathrm{G}(x|\text{1 flip}) &= {\textstyle \sum_{h=0}^1} \, p_h \cdot x^h. \end{align*}

Each term in the series is associated with one possible outcome for the total number of heads. p_0 = (1 - q) is the probability of realizing h=0 heads. p_1 = q is the probability of realizing h=1 heads. We say that \mathrm{G}(x|\text{1 flip}) generates the distribution because we can compute all its moments by evaluating the derivatives of the generating function at x=1. For example, the 0th-order derivative, \left. \mathrm{G}(x|\text{1 flip})\right|_{x=1} = 1, says that the coin never lands on its edge or winks out of existence. And, if we want to compute the expected number of heads, then we can use the 1st-order derivative:

(6)   \begin{align*} \mathrm{E}[h|\text{1 flip}] &= \left. x \cdot \mathrm{G}'(x|\text{1 flip}) \right|_{x=1}  \\ &= \left. x \cdot {\textstyle \sum_{h=0}^1} \, p_h \cdot h \cdot x^{h-1} \right|_{x=1} \\ &= \left. {\textstyle \sum_{h=0}^1} \, p_h \cdot h \cdot x^h \right|_{x=1} \\ &= p_0 \cdot 0 \cdot 1^0 + p_1 \cdot 1 \cdot 1^1  \\ &= q. \end{align*}

The fact that derivatives of the generating function give the moments of the associated distribution will be useful below. Let’s call this Property 1: derivatives are moments.
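Property 1 is easy to sanity-check numerically. The sketch below (names mine) encodes the one-flip generating function and recovers \mathrm{E}[h] = q from a finite-difference approximation to \mathrm{G}'(1):

```python
def G(x, q):
    """Generating function for the number of heads in one flip of a q-coin."""
    return (1 - q) * x**0 + q * x**1

def mean_via_derivative(q, h=1e-6):
    """E[heads | 1 flip] = G'(1), approximated with a central difference."""
    return (G(1 + h, q) - G(1 - h, q)) / (2 * h)

m = mean_via_derivative(0.5)  # should come back as q = 0.5
```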

Here are two additional properties to keep in mind. Property 2: multiple samples. If we raise the generating function for the number of heads in one flip to the nth power, then we get the generating function for the total number of heads in n flips. To illustrate, look at what happens if we square the generating function for the number of heads in one flip:

(7)   \begin{align*} \mathrm{G}(x|\text{1 flip})^2  &=  \left\{ \, p_0 \cdot x^0 + p_1 \cdot x^1 \, \right\} \times \left\{ \, p_0 \cdot x^0 + p_1 \cdot x^1 \, \right\} \\ &= p_0^2 \cdot x^0 + 2 \cdot p_0 \cdot p_1 \cdot x^1 + p_1^2 \cdot x^2 \\ &= (1 - q)^2 \cdot x^0 + 2 \cdot (1 - q) \cdot q \cdot x^1 + q^2 \cdot x^2. \end{align*}

The result is just the generating function for the number of heads in two flips, \mathrm{G}(x|\text{2 flips}).

Property 3: partial information. If we multiply the generating function for the total number of heads in (n-1) flips by x^1, then we have the generating function for the total number of heads in all n flips conditional on having already seen heads on the first flip. To illustrate, notice what happens when we multiply the generating function for the number of heads in one flip by x^1:

(8)   \begin{align*} \mathrm{G}(x|\text{2 flips}, \, \text{heads on 1st flip}) &= x^1 \cdot \mathrm{G}(x|\text{1 flip})  \\ &= p_0 \cdot x^1 + p_1 \cdot x^2 \\ &= (1 - q) \cdot x^1 + q \cdot x^2. \end{align*}

If we’ve already seen heads on the first flip, then there’s no way to realize h=0 heads. The lowest we can go is h=1 heads now. So, the first term is now h=1. And, this outcome occurs if we see tails on the second flip, which happens with probability (1-q).
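Both properties are mechanical if we store a generating function as a list of coefficients: squaring it is just polynomial multiplication, and multiplying by x^1 just shifts every coefficient up one slot. A quick sketch (helper names mine):

```python
def poly_mul(a, b):
    """Multiply two generating functions stored as coefficient lists."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

q = 0.5
g1 = [1 - q, q]          # G(x | 1 flip)
g2 = poly_mul(g1, g1)    # Property 2: G(x | 2 flips)
g2_heads1 = [0.0] + g1   # Property 3: x^1 * G(x | 1 flip) shifts coefficients
```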

4. Cascade Lengths

Now, let’s return to our original problem. Let \mathrm{G}_c(x) be the generating function for the distribution of cascade lengths that would follow from an exogenous shock to stock s’s fundamentals:

(9)   \begin{equation*} \mathrm{G}_c(x) \overset{\scriptscriptstyle \mathrm{def}}{=} {\textstyle \sum_{c=1}^S} \, q_c \cdot x^c. \end{equation*}

The coefficient q_c gives the probability that a shock to stock s’s fundamentals would set off a cascade of length C_s = c. If stock s doesn’t have any vulnerable neighbors, then a shock to stock s’s fundamentals can only affect stock s, c=1. Whereas, if a shock to stock s’s fundamentals would set off a cascade affecting every other stock in the market, then c=S. Next, let \mathrm{G}_v(x) be the generating function for the number of vulnerable neighbors that stock s has:

(10)   \begin{equation*} \mathrm{G}_v(x) \overset{\scriptscriptstyle \mathrm{def}}{=} {\textstyle \sum_{v=0}^{S-1}} \, p_v \cdot x^v. \end{equation*}

So, the coefficient p_v is the probability that stock s has v vulnerable neighbors.

Notice how these two generating functions are linked. If stock s has v=1 vulnerable neighbor, s', then an initial shock to stock s‘s fundamentals will set off a cascade of length C_s = c if a shock to its one vulnerable neighbor will set off a cascade of length C_{s'} = c - 1 excluding stock s. If stock s has v=2 vulnerable neighbors, s' and s'', then an initial shock to stock s will set off a cascade of length C_s = c if shocks to its two vulnerable neighbors will set off cascades of combined length C_{s'} + C_{s''} = c - 1 excluding stock s. And, if stock s has v=3 vulnerable neighbors, then an initial shock to stock s will set off a cascade of length C_s = c if shocks to its three vulnerable neighbors will set off cascades of combined length C_{s'} + C_{s''} + C_{s'''} = c - 1.

[Figure: how the cascade-length generating function links to vulnerable neighbors]

We know from the previous section (property 2: multiple samples) that \mathrm{G}_c(x)^v is the generating function for the probability that v different shocks set off cascades of combined length c. And, we also know from the previous section (property 3: partial information) that we have to multiply through by x^1 if we want the generating function for the probability that v different shocks set off cascades of combined length (c-1). So, the generating function for the distribution of cascade lengths has to satisfy the following internal-consistency condition as the number of stocks gets large, S \to \infty:

(11)   \begin{align*} \mathrm{G}_c(x)  &= p_0 \cdot x + p_1 \cdot x \cdot \mathrm{G}_c(x) + p_2 \cdot x \cdot \mathrm{G}_c(x)^2 + p_3 \cdot x \cdot \mathrm{G}_c(x)^3 + \cdots \\ &= x \cdot \left\{ \, p_0 \cdot \mathrm{G}_c(x)^0 + p_1 \cdot \mathrm{G}_c(x)^1 + p_2 \cdot \mathrm{G}_c(x)^2 + p_3 \cdot \mathrm{G}_c(x)^3 + \cdots \, \right\} \\ &= x \cdot \mathrm{G}_v\left(\mathrm{G}_c(x)\right). \end{align*}

The outer function, \mathrm{G}_v(\cdot), gives the probability that the initial stock s has v vulnerable neighbors. The inner function, \mathrm{G}_c(x), gives the probability that shocks to these vulnerable neighbors would set off cascades of combined length c. And, the multiplication by x accounts for the fact that we want to compute the probability that shocks to these vulnerable neighbors would set off cascades of combined length (c-1), not c.

With this equation in hand, we can now compute the expected length of the rebalancing cascade that would follow from an initial shock to randomly selected stock s:

(12)   \begin{align*} \mathrm{E}[C_s] = \left. x \cdot \mathrm{G}_c'(x) \right|_{x=1}  &= 1 + \mathrm{G}_v'(1) \cdot \mathrm{G}_c'(1) \\ &= 1 + \mathrm{G}_v'(1) \cdot \mathrm{E}[C_s]. \end{align*}

Rearranging yields an expression for the expected cascade length:

(13)   \begin{align*} \mathrm{E}[C_s] = \frac{1}{1 - \mathrm{G}_v'(1)}. \end{align*}

And, in the exact same way that \mathrm{G}_c'(1) = \mathrm{E}[C_s] (property 1: derivatives are moments), the expected number of vulnerable neighbors that each stock has is given by \mathrm{G}_v'(1) = \mathrm{E}[V_s]. So, when stocks typically have fewer than one vulnerable neighbor, \mathrm{E}[V_s] < 1, we have an expression for the average rebalancing-cascade length as a function of the market’s connectivity, \lambda, and vulnerability threshold, \phi.

[Figure: expected cascade length as a function of \lambda and \phi]

The figure above plots the average length of the rebalancing-cascade that would emerge if we selected an initial stock s at random and shocked its fundamentals. It’s got a really interesting shape. A little math shows exactly why. We should expect short rebalancing cascades whenever stocks don’t have that many vulnerable neighbors. Here’s the expression for the average number of vulnerable neighbors that each stock has:

(14)   \begin{align*}  \mathrm{E}[V_s] = \mathrm{G}_v'(1)  &= \lambda \times \mathrm{Pr}[N_s \leq N^\star] \\ &= \lambda \times \mathrm{Pr}[N_s < (\lfloor \sfrac{1}{\phi} \rfloor +1)] \\ &=\lambda \times \left\{ \, \sum_{n < (\lfloor \sfrac{1}{\phi} \rfloor+1)} e^{-\lambda} \cdot \frac{\lambda^n}{n!} \, \right\}. \end{align*}

Notice that stocks can have fewer than one vulnerable neighbor on average for either of two reasons. First, they could have very few neighbors—that is, \lambda could be less than 1. Think about this as a fragmented market where very few people trade. This is the region on the bottom of the figure. Second, even if there are lots of people trading, stocks could have fundamentals that aren’t very vulnerable to the effects of rebalancing—that is, \phi is large. This is the region in the upper right of the figure. But, if the market isn’t too fragmented and stocks’ fundamentals are a little vulnerable to the effects of rebalancing, then long rebalancing cascades can emerge. In fact, they can be infinitely long…
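Equations (13) and (14) take only a few lines to code up. Here’s a sketch (function names mine) that builds \mathrm{E}[V_s] from the Poisson CDF and then returns the implied average cascade length whenever \mathrm{E}[V_s] < 1:

```python
import math

def expected_vulnerable(lam, phi):
    """E[V_s] = lam * Pr[N_s <= N*], with N* = floor(1/phi), N_s ~ Pois(lam)."""
    n_star = math.floor(1 / phi)
    cdf = sum(math.exp(-lam) * lam**n / math.factorial(n)
              for n in range(n_star + 1))
    return lam * cdf

def expected_cascade_length(lam, phi):
    """E[C_s] = 1 / (1 - E[V_s]); diverges once E[V_s] reaches 1."""
    ev = expected_vulnerable(lam, phi)
    return 1.0 / (1.0 - ev) if ev < 1 else float('inf')

length = expected_cascade_length(0.5, 0.3)  # a fragmented market: short cascades
```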

5. Infinite Cascades

…but what does that even mean? It’s actually much more reasonable than it first sounds. In a large market, S \to \infty, an infinitely long rebalancing cascade is just a cascade that affects a non-infinitesimal fraction of all stocks. If we now specify that \mathrm{G}_c(x) is the generating function for the distribution of finite-length rebalancing cascades, then we can define \theta as the fraction of all stocks affected by an infinitely long rebalancing cascade,

(15)   \begin{align*} \mathrm{G}_c(1) \overset{\scriptscriptstyle \mathrm{def}}{=} 1 - \theta. \end{align*}

Think back to the coin-flipping example where we said that \left. \mathrm{G}(x|\text{1 flip})\right|_{x=1} = 1 because the coin never landed on its edge or magically winked out of existence. If the coin didn’t obey the laws of physics and disappeared 20{\scriptstyle \%} of the time, then we would have said that \left. \mathrm{G}(x|\text{1 flip})\right|_{x=1} = 0.80. So, if \mathrm{G}_c(x) is the generating function for the distribution of finite-length rebalancing cascades, then realizing an infinitely long cascade is like realizing a magical event that’s not characterized by \mathrm{G}_c(x). And, this way of framing the problem, \mathrm{G}_c(1) = 1 - \theta = \mathrm{G}_v(\mathrm{G}_c(1)), gives us a way to solve for the fraction of the market that’s typically affected by an infinitely long rebalancing cascade. Since V_s is approximately Poisson distributed with mean \mathrm{E}[V_s], so that \mathrm{G}_v(x) = e^{\mathrm{E}[V_s] \cdot (x-1)}, the condition becomes \theta = 1 - e^{-\mathrm{E}[V_s] \cdot \theta}.
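Because the vulnerable-neighbor count is approximately Poisson, the consistency condition reduces to a one-dimensional fixed point, \theta = 1 - e^{-\nu \cdot \theta} with \nu = \mathrm{E}[V_s]. Here’s a sketch (names mine) that solves it by simple iteration; below the transition the only root is \theta = 0, and above it a positive root appears:

```python
import math

def cascade_fraction(nu, iters=200):
    """Solve theta = 1 - exp(-nu * theta) by fixed-point iteration.

    nu is the mean number of vulnerable neighbors per stock. For nu <= 1
    the iteration collapses to theta = 0; for nu > 1 it finds the
    positive root.
    """
    theta = 0.5  # start in the interior so we can find the positive root
    for _ in range(iters):
        theta = 1.0 - math.exp(-nu * theta)
    return theta

below = cascade_fraction(0.8)  # below the transition: no giant cascade
above = cascade_fraction(2.0)  # above the transition: a macroscopic cascade
```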

[Figure: two markets near the phase transition that look similar]

Finally, notice how sharp the phase transition is. Tiny changes in the market’s connectivity, \lambda, and vulnerability, \phi, can make all the difference between expecting infinitely long rebalancing cascades and expecting 16-stock-long rebalancing cascades. Take a look at the figure above. Each panel has S = 1000 stocks (the dots) and represents a single realization of trading-strategy rebalancing rules (the lines) in markets where each stock has \lambda = 6.40 neighbors (left) and \lambda = 6.41 neighbors (right) respectively. These two markets are essentially observationally equivalent. But, as shown in the figure below, when \phi=0.30 an initial shock to stock s’s fundamentals will yield a huge rebalancing cascade when \lambda = 6.40 (left) but not when \lambda = 6.41 (right). Small changes that push a market over the transition point where \mathrm{E}[V_s] = 1 can have huge effects on the cascade-length distribution. What’s more, right at this transition point where \mathrm{E}[V_s] = 1, the sizes of the rebalancing cascades follow a power-law distribution,

(16)   \begin{align*} \mathrm{Pr}[C_s = c] \sim c^{-\sfrac{3}{2}}, \end{align*}

as shown in Newman et al. (2002). Slight differences in how the market happens to be wired up today can affect whether or not a stock on the other side of the market will be affected by an initial shock to stock s’s fundamentals.
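To see the transition in simulation, here’s a rough Monte Carlo sketch (all names mine). It only propagates shocks through vulnerable stocks, à la Watts’s vulnerable cluster, so it ignores stocks that would need two or more shocked neighbors to flip:

```python
import math
import random

def simulate_cascade(S, lam, phi, seed=0):
    """Shock one random stock, then propagate through vulnerable neighbors."""
    rng = random.Random(seed)
    p = lam / S
    nbrs = {s: set() for s in range(S)}
    for s in range(S):
        for t in range(s + 1, S):
            if rng.random() < p:
                nbrs[s].add(t)
                nbrs[t].add(s)
    n_star = math.floor(1 / phi)
    start = rng.randrange(S)  # the exogenously shocked stock
    hit, frontier = {start}, [start]
    while frontier:
        s = frontier.pop()
        for t in nbrs[s]:
            # t flips only if it is vulnerable: N_t <= N* neighbors
            if t not in hit and len(nbrs[t]) <= n_star:
                hit.add(t)
                frontier.append(t)
    return len(hit)  # the cascade length C_s

c = simulate_cascade(S=400, lam=3.0, phi=0.5)
```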

[Figure: cascade sizes in the two markets are different]


Intuition Behind the Bayesian LASSO

September 24, 2016 by Alex

1. Motivating Question

Imagine you’ve just seen Apple’s most recent return, r, which is Apple’s long-run expected return, \mu^\star, plus some random noise, \epsilon \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0, \, 1), where returns are measured in percent so that \sigma_{\epsilon} = 1.0{\scriptstyle \%}:

(1)   \begin{align*} r &= \mu^\star + \epsilon. \end{align*}

You want to use this realized return, r, to estimate Apple’s long-run expected return, \mu^\star. The LASSO is a popular way to solve this problem. The LASSO estimates Apple’s long-run expected return, \mu^\star, by choosing a \hat{\mu} that’s as close as possible to the realized r while taking into account an absolute-value penalty,

(2)   \begin{align*} \hat{\mu}(r) =  \arg \min_{\mu \in \mathrm{R}}  \left\{  \, {\textstyle \frac{1}{2}} \cdot (r - \mu)^2 + \lambda \cdot  |\mu| \, \right\}, \end{align*}

where \lambda \geq 0 is the strength of this penalty. If you use the LASSO, then you’ll estimate:

(3)   \begin{align*} \hat{\mu}(r) = \begin{cases} \mathrm{Sign}(r) \cdot (|r| - \lambda) &\text{if } |r| > \lambda, \text{ and} \\ 0 &\text{if } |r| \leq \lambda. \end{cases} \end{align*}

Suppose that you chose \lambda = 1.0{\scriptstyle \%}. If Apple’s most recent stock return was r = 0.3{\scriptstyle \%}, then the LASSO will pick \hat{\mu} = 0{\scriptstyle \%}. And, if Apple’s most recent stock return was r = -0.7{\scriptstyle \%}, then the LASSO will still pick \hat{\mu} = 0{\scriptstyle \%}. But, if Apple’s most recent stock return was r = 1.2{\scriptstyle \%}, then the LASSO will give an estimate of \hat{\mu} = 0.2{\scriptstyle \%}.
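Equation (3) is just the soft-thresholding operator. A minimal sketch (function name mine) reproducing the three examples above:

```python
def lasso_estimate(r, lam):
    """Soft-threshold the realized return r at level lam (equation (3))."""
    if abs(r) <= lam:
        return 0.0
    return (1.0 if r > 0 else -1.0) * (abs(r) - lam)

# The three examples from the text, with lam = 1.0 (i.e., 1.0%):
examples = [lasso_estimate(r, 1.0) for r in (0.3, -0.7, 1.2)]
```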

[Figure: LASSO coefficient estimates]

The LASSO seems like it’s throwing away lots of information. In the example above, you didn’t adjust your estimate of Apple’s long-run expected return at all when you saw returns of 0.3{\scriptstyle \%} and -0.7{\scriptstyle \%}. So, it’s surprising that, if Apple’s long-run expected return, \mu^\star, was drawn from a Laplace distribution,

(4)   \begin{align*} \mathrm{Pr}( \mu^\star = \mu ) = {\textstyle \frac{\lambda}{2}} \cdot e^{- \lambda \cdot |\mu|}, \end{align*}

then using the LASSO to estimate \mu^\star would be the Bayesian thing to do (Park and Casella, 2008). If \mu^\star \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{Laplace}(\lambda = 1.0{\scriptstyle \%}), then it’s correct to just ignore any return smaller than 1.0{\scriptstyle \%} when estimating \mu^\star.

Why is this? If you cross your eyes and squint, you can sort of see why the Laplace distribution might be linked to the LASSO. Both use the Greek letter \lambda and involve |\mu|. But, lots of distributions use the absolute-value operator (e.g., the Wishart distribution). And, there are lots of Greek letters. That’s how letters work. I could just as easily have called the scale parameter in the Laplace distribution \alpha, \beta, or \gamma instead of \lambda. So, what’s special about the Laplace distribution? What is it about the Laplace distribution that makes using the LASSO correct? How can it ever be Bayesian to throw information away?

2. Simpler Problem

To answer these questions, let’s start by looking at a simpler inference problem. Suppose that Apple’s long-run expected return is drawn from a Normal distribution, \mu^\star \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0, \, \sigma_\mu^2):

(5)   \begin{align*} \mathrm{Pr}(\mu^\star = \mu) &= {\textstyle \frac{1}{\sigma_\mu \cdot \sqrt{2 \cdot \pi}}} \cdot e^{- \frac{1}{2 \cdot \sigma_\mu^2} \cdot (\mu - 0)^2}. \end{align*}

If \mu^\star is drawn from a Normal distribution, then you definitely don’t want to use the LASSO.

Bayes’ rule tells you that:

(6)   \begin{align*} \mathrm{Pr}(\mu^\star = \mu|r) &\propto \mathrm{Pr}(r|\mu) \times \mathrm{Pr}(\mu) \\ &=  \left\{ \, {\textstyle \frac{1}{\sqrt{2 \cdot \pi}}} \cdot e^{- \frac{1}{2} \cdot (r - \mu)^2} \, \right\} \times  \left\{ \, {\textstyle \frac{1}{\sigma_\mu \cdot \sqrt{2 \cdot \pi}}} \cdot e^{- \frac{1}{2 \cdot \sigma_\mu^2} \cdot (\mu - 0)^2} \, \right\}. \end{align*}

\mathrm{Pr}(\mu^\star = \mu|r) is the posterior likelihood that Apple’s long-run expected return is \mu given that you’ve just seen a realized return of r. \mathrm{Pr}(r|\mu) is the probability that Apple realizes a return of r if its long-run expected return is \mu. And, \mathrm{Pr}(\mu) is the probability that Apple’s long-run expected return is \mu^\star = \mu in the first place.

You want to choose the \hat{\mu} that maximizes this posterior likelihood \mathrm{Pr}(\mu^\star = \mu|r), or equivalently, that minimizes the negative of the log of this posterior likelihood:

(7)   \begin{align*} \hat{\mu}(r) = \arg \min_{\mu \in \mathrm{R}} \left\{ \, (r - \mu)^2 +  (\sfrac{1}{\sigma_\mu^2}) \cdot (\mu - 0)^2 \, \right\}. \end{align*}

When Apple’s long-run expected return is drawn from a Normal distribution, you want to choose a \hat{\mu} that’s as close as possible to r while taking into account a quadratic penalty, not an absolute-value penalty. So, you’re never going to ignore small realized returns.

On one hand, you could pick a \hat{\mu} that’s really close to Apple’s recent return to make (r - \hat{\mu})^2 small. On the other hand, you could pick a \hat{\mu} close to 0 to make (\sfrac{1}{\sigma_\mu^2}) \cdot (\hat{\mu} - 0)^2 small. Your priors determine what you do:

(8)   \begin{align*} \hat{\mu}(r) = \left(  {\textstyle \frac{\sigma_\mu^2}{1.0{\scriptstyle \%}^2 + \sigma_\mu^2}} \right) \cdot  r. \end{align*}

If you don’t have very strong priors about Apple’s long-run expected return (\sigma_\mu \gg 1.0{\scriptstyle \%}), then you’re going to pick \hat{\mu} \approx r since \sfrac{\sigma_\mu^2}{(1.0{\scriptstyle \%}^2 + \sigma_\mu^2)} \approx 1. By contrast, if you have very strong priors (\sigma_\mu \ll 1.0{\scriptstyle \%}), then you’re going to pick \hat{\mu} \approx 0{\scriptstyle \%} since \sfrac{\sigma_\mu^2}{(1.0{\scriptstyle \%}^2 + \sigma_\mu^2)} \approx 0. To illustrate, suppose that you’re really sure that Apple’s long-run expected return is close to 0{\scriptstyle \%} with \sigma_{\mu} = 0.1{\scriptstyle \%}. Then, if you see Apple realize a return of r = 6.0{\scriptstyle \%}, you’re going to think that this realization was probably due to a positive random shock, \epsilon = 5.94{\scriptstyle \%}, and only pick \hat{\mu} = 0.06{\scriptstyle \%}.
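Here’s the shrinkage rule from equation (8) in code (names mine), including the strong-prior example at the end of the paragraph, with returns measured in percent:

```python
def normal_posterior_mean(r, sigma_mu, sigma_eps=1.0):
    """MAP estimate under a N(0, sigma_mu^2) prior: shrink r toward zero."""
    w = sigma_mu**2 / (sigma_eps**2 + sigma_mu**2)
    return w * r

# Strong prior (sigma_mu = 0.1%): a 6.0% return is mostly attributed to noise.
mu_hat = normal_posterior_mean(6.0, sigma_mu=0.1)
```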

3. Mixture Model

Now, let’s tweak the setup slightly. Suppose that, instead of being constant, the standard deviation of Apple’s long-run expected return can be either high or low,

(9)   \begin{align*} \overline{\sigma}_{\mu} \gg \sigma_{\epsilon} = 1.0{\scriptstyle \%} \gg \underline{\sigma}_{\mu}, \end{align*}

with the high value much larger than \sigma_{\epsilon} = 1.0{\scriptstyle \%} and the low value much smaller than \sigma_{\epsilon} = 1.0{\scriptstyle \%}. Each case is equally likely: \mathrm{Pr}(\sigma_\mu = \overline{\sigma}_{\mu} ) = \mathrm{Pr}( \sigma_\mu = \underline{\sigma}_{\mu} ) = \sfrac{1}{2}. It turns out that you’re going to behave a lot like someone using the LASSO when you estimate Apple’s long-run expected return in this mixture model.

Regardless of the model, if you want to estimate Apple’s long-run expected return, then you have to use Bayes’ rule. And, just like before, Bayes’ rule tells you that:

(10)   \begin{align*} \mathrm{Pr}(\mu^\star = \mu|r)  \propto  \mathrm{Pr}(r|\mu)  \times  \mathrm{Pr}(\mu). \end{align*}

But, now there’s an extra layer to the problem. The standard deviation of Apple’s long-run expected return can either be high or low,

(11)   \begin{align*} \mathrm{Pr}(\mu)  = {\textstyle \frac{1}{2}} \cdot \mathrm{Pr}(\mu|\sigma_\mu = \overline{\sigma}_\mu) +  {\textstyle \frac{1}{2}} \cdot \mathrm{Pr}(\mu|\sigma_\mu = \underline{\sigma}_\mu). \end{align*}

You don’t know which it is. But, if you knew that \sigma_{\mu} = \overline{\sigma}_{\mu} = 10{\scriptstyle \%} \gg 1.0{\scriptstyle \%}, then you’d pick \hat{\mu} = (\sfrac{100}{101}) \cdot r. Whereas, if you knew that \sigma_{\mu} = \underline{\sigma}_{\mu} = 0.10{\scriptstyle \%} \ll 1.0{\scriptstyle \%}, then you’d pick \hat{\mu} = (\sfrac{1}{101}) \cdot r. Your estimate when \sigma_\mu = \overline{\sigma}_{\mu} is going to be really different from your estimate when \sigma_\mu = \underline{\sigma}_{\mu}.

Let’s flesh out what this means. You want to estimate Apple’s long-run expected return, \mu^\star, by choosing the \hat{\mu} that maximizes the posterior likelihood \mathrm{Pr}(\mu^\star = \mu|r),

(12)   \begin{align*} \hat{\mu}(r) =  \arg \max_{\mu \in \mathrm{R}} \left\{  \,  {\textstyle \frac{1}{\sqrt{2 \cdot \pi}}} \cdot e^{- \frac{1}{2} \cdot (r - \mu)^2}  \,  \right\} \times  \left\{  \,  {\textstyle \frac{1}{2}} \cdot {\textstyle \frac{1}{\overline{\sigma}_\mu \cdot \sqrt{2 \cdot \pi}}} \cdot e^{- \frac{1}{2 \cdot \overline{\sigma}_\mu^2} \cdot (\mu - 0)^2} + {\textstyle \frac{1}{2}} \cdot {\textstyle \frac{1}{\underline{\sigma}_\mu \cdot \sqrt{2 \cdot \pi}}} \cdot e^{- \frac{1}{2 \cdot \underline{\sigma}_\mu^2} \cdot (\mu - 0)^2}  \,  \right\}. \end{align*}

It’s hard to solve for \hat{\mu}(r) analytically when \overline{\sigma}_{\mu} and \underline{\sigma}_{\mu} can take on arbitrary values, but the assumption that \overline{\sigma}_{\mu} \gg 1.0{\scriptstyle \%} \gg \underline{\sigma}_{\mu} simplifies things nicely. And, the resulting analysis reveals why you’re going to do something LASSO-esque when learning about Apple’s long-run expected return in this mixture model.

There are 2 cases. First, consider the case where Apple realizes a really big return, |r| \gg 1.0{\scriptstyle \%}. This really big return would be really unlikely if \sigma_\mu = \underline{\sigma}_\mu because \underline{\sigma}_\mu \ll 1.0{\scriptstyle \%} is really small. So, you can safely assume that \sigma_\mu = \overline{\sigma}_{\mu} and just solve the optimization problem from Section 2:

(13)   \begin{align*} \hat{\mu}(r) = \arg \min_{\mu \in \mathrm{R}} \left\{ \, (r - \mu)^2 +  (\sfrac{1}{\overline{\sigma}_\mu^2}) \cdot (\mu - 0)^2 \, \right\}. \end{align*}

But, as we saw in Section 2, if your priors are really weak (\overline{\sigma}_\mu \gg 1.0{\scriptstyle \%}), then you should ignore them since \sfrac{\overline{\sigma}_\mu^2}{(1.0{\scriptstyle \%}^2 + \overline{\sigma}_\mu^2)} \approx 1. So, you’re going to set \hat{\mu}(r) \approx r whenever |r| \gg 1.0{\scriptstyle \%}, just like someone using the LASSO.

Now, consider the other case where Apple realizes a really small return, |r| \ll 1.0{\scriptstyle \%}. Again, this really small return would be really unlikely if \sigma_\mu = \overline{\sigma}_\mu because \overline{\sigma}_\mu \gg 1.0{\scriptstyle \%} is really big. So, you can assume that \sigma_\mu = \underline{\sigma}_{\mu} and just solve the optimization problem:

(14)   \begin{align*} \hat{\mu}(r) = \arg \min_{\mu \in \mathrm{R}} \left\{ \, (r - \mu)^2 +  (\sfrac{1}{\underline{\sigma}_\mu^2}) \cdot (\mu - 0)^2 \, \right\}. \end{align*}

But, now the opposite logic holds. If your priors are really strong (\underline{\sigma}_\mu \ll 1.0{\scriptstyle \%}), then you should ignore r since \sfrac{\underline{\sigma}_\mu^2}{(1.0{\scriptstyle \%}^2 + \underline{\sigma}_\mu^2)} \approx 0. So, you’re going to set \hat{\mu}(r) \approx 0 whenever |r| \ll 1.0{\scriptstyle \%}. This is the LASSO’s dead zone!
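We can check both cases at once by brute-force maximizing the posterior in equation (12) on a grid. This is a rough numerical sketch (the names and the specific values \overline{\sigma}_\mu = 10{\scriptstyle \%}, \underline{\sigma}_\mu = 0.1{\scriptstyle \%}, \sigma_\epsilon = 1.0{\scriptstyle \%} are mine, with returns measured in percent):

```python
import math

def mixture_map(r, s_hi=10.0, s_lo=0.1, s_eps=1.0):
    """Grid-search the MAP estimate of mu under the two-point mixture prior."""
    def log_post(mu):
        like = -0.5 * ((r - mu) / s_eps) ** 2
        prior = (0.5 / s_hi) * math.exp(-0.5 * (mu / s_hi) ** 2) \
              + (0.5 / s_lo) * math.exp(-0.5 * (mu / s_lo) ** 2)
        return like + math.log(prior)
    best_mu, best_val = 0.0, log_post(0.0)
    n = 200000
    for i in range(n + 1):
        mu = -20.0 + 40.0 * i / n
        val = log_post(mu)
        if val > best_val:
            best_val, best_mu = val, mu
    return best_mu

big = mixture_map(6.0)    # |r| >> sigma_eps: keep nearly all of r
small = mixture_map(0.3)  # |r| << sigma_eps: shrink nearly to zero
```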

The figure below shows that, as the high and low standard deviations get more extreme, you’re going to behave more and more like someone using the LASSO when learning about Apple’s long-run expected return in this mixture model. But, the insight is more general than that. You’re going to behave like someone using the LASSO any time a small realized return, r, tells you that you should be using stronger priors about Apple’s long-run expected return, \mu^\star.

[Figure: Bayesian LASSO intuition in the mixture model]

4. Laplace Distribution

If Apple’s long-run expected return is drawn from a Laplace distribution, then you face an estimation problem just like the one in the mixture model above. Andrews and Mallows (1974) shows that a Laplace distribution can be re-written as the weighted average of Normal distributions with different standard deviations,

(15)   \begin{align*} {\textstyle \frac{\lambda}{2}}  \cdot  e^{- \lambda \cdot |\mu|} = \int_0^\infty \, \left\{ \, {\textstyle \frac{1}{\sigma_\mu \cdot \sqrt{2 \cdot \pi}}} \cdot e^{- \frac{1}{2 \cdot \sigma_{\mu}^2} \cdot (\mu - 0)^2} \, \right\} \times \left\{ \, {\textstyle \frac{\lambda^2}{2}} \cdot e^{- \frac{\lambda^2}{2} \cdot \sigma_{\mu}^2} \, \right\} \times \mathrm{d}(\sigma_\mu^2), \end{align*}

where the weights on the variance follow an Exponential distribution. The Exponential distribution has a really fat tail. If the variance of Apple’s long-run expected return is distributed \sigma_{\mu}^2 \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{Exponential}(\sfrac{\lambda^2}{2}), then the resulting standard deviations could be either really large or really small. We just saw that this is exactly what needs to happen for a LASSO-like estimation strategy to be optimal. There are lots of distributions for \sigma_{\mu} that have this property; we just saw another one above. But, if you use \sigma_{\mu}^2 \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{Exponential}(\sfrac{\lambda^2}{2}), then the probabilities of realizing large and small values of \sigma_\mu line up in such a way that it’s precisely optimal to use the LASSO.
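Here’s a numerical check of the scale-mixture identity (a midpoint-rule sketch; the names and grid sizes are mine), mixing Normal densities over the variance against an Exponential rate of \sfrac{\lambda^2}{2}:

```python
import math

def laplace_pdf(mu, lam):
    """Laplace density (lambda/2) * exp(-lambda * |mu|)."""
    return 0.5 * lam * math.exp(-lam * abs(mu))

def normal_mixture_pdf(mu, lam, n=100000, s_max=60.0):
    """Integrate N(mu; 0, s) against an Exponential(lam^2/2) density on the
    variance s, using a simple midpoint rule on (0, s_max]."""
    ds = s_max / n
    total = 0.0
    for i in range(n):
        s = (i + 0.5) * ds
        normal = math.exp(-mu * mu / (2 * s)) / math.sqrt(2 * math.pi * s)
        weight = 0.5 * lam * lam * math.exp(-0.5 * lam * lam * s)
        total += normal * weight * ds
    return total

# The mixture should reproduce the Laplace density.
gap = abs(normal_mixture_pdf(1.0, 1.0) - laplace_pdf(1.0, 1.0))
```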

In the original paper, there are a ton of extra hyper-parameters. For example, \sigma_{\epsilon} is a random variable. This clearly isn’t necessary. You just need the standard deviation of Apple’s long-run expected return to fluctuate wildly around \sigma_{\epsilon}. You can get a situation where the LASSO is really close to being optimal with just \overline{\sigma}_{\mu} \gg \sigma_{\epsilon} \gg \underline{\sigma}_{\mu}.

Also, in the original paper, there’s a lengthy discussion about properly “conditioning on \sigma_{\epsilon}.” The authors include a bizarre example of how the posterior distribution of \hat{\mu}(r) might not be unimodal if you don’t condition on \sigma_{\epsilon}, an example that, for me anyway, always seems to come out of left field. And, textbooks typically brush this point under the rug, calling it a technical condition. But, the analysis above shows that it’s not just a technical condition. It’s actually really important!

To see why, consider estimating Apple’s long-run expected return in a mixture model with

(16)   \begin{align*} \overline{\sigma}_{\mu} = 10{\scriptstyle \%} \gg \sigma_{\epsilon} = \underline{\sigma}_{\mu} = 0.10{\scriptstyle \%}. \end{align*}

The only difference from before is that \sigma_{\epsilon} = 0.10{\scriptstyle \%} instead of \sigma_{\epsilon} = 1.0{\scriptstyle \%}. If \sigma_{\epsilon} isn’t sufficiently large relative to \underline{\sigma}_{\mu}, then you’re never going to ignore Apple’s realized return when |r| is small. With these new numbers, \hat{\mu}(0.50{\scriptstyle \%}) = \sfrac{0.10{\scriptstyle \%}^2}{(0.10{\scriptstyle \%}^2 + 0.10{\scriptstyle \%}^2)} \cdot 0.50{\scriptstyle \%} = 0.25{\scriptstyle \%} rather than 0.005{\scriptstyle \%}. When choosing a distribution for \sigma_\mu, you’ve got to make sure that the high standard-deviation outcomes are big enough and the low standard-deviation outcomes are small enough relative to \sigma_{\epsilon}. Otherwise, a LASSO-like estimation strategy can’t be optimal.
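The shrinkage arithmetic above is easy to verify numerically. Here’s a minimal sketch (the `shrinkage` helper is my own, not from the post’s code):

```python
def shrinkage(r, sig_mu, sig_eps):
    # Normal-Normal posterior mean: mu ~ N(0, sig_mu^2) and r | mu ~ N(mu, sig_eps^2)
    return sig_mu**2 / (sig_mu**2 + sig_eps**2) * r

# sig_eps = 0.10% is not large relative to the low sigma_mu = 0.10%,
# so a small realized return barely gets shrunk at all:
print(shrinkage(0.0050, 0.0010, 0.0010))   # 0.0025, i.e., 0.25%

# sig_eps = 1.0% as in the earlier calibration: the same small return
# gets shrunk nearly all the way to zero:
print(shrinkage(0.0050, 0.0010, 0.0100))   # ~0.00005, i.e., ~0.005%
```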

FYI: Here’s the code to create the figures.

Filed Under: Uncategorized

Inferring Trader Horizons from Trading Volume

July 13, 2016 by Alex

1. Motivating Example

This post shows that, if traders face convex transaction costs (i.e., it costs them more per share to buy 2 shares of stock than to buy 1 share of stock), then it is possible to infer traders’ investment horizons from trading-volume data. To see why, imagine you are a trader and you’ve just learned some positive news about Apple’s next earnings announcement, which is due out overnight. To take advantage of this revelation, you will need to buy shares of Apple stock at some point today. In order to minimize your transaction costs, you will want to space out your demand for Apple shares as much as possible. So, all else equal, the average demand for Apple’s shares will be slightly higher today than it was yesterday because of your earnings revelation. This same logic applies to information at other horizons. Thus, if a larger fraction of the variation in Apple’s trading volume comes from day-to-day differences, then more of Apple’s traders must be operating at the daily horizon. Whereas, if a larger fraction of the variation in Apple’s trading volume comes from week-to-week differences, then more of Apple’s traders must be operating at the weekly horizon.

In the past, when researchers have studied traders’ investment horizons, they have used data on portfolio positions rather than trading volume. Some have measured trading activity at a couple of investment horizons for a small number of stocks. e.g., Brogaard et al. (2014) use NASDAQ data on a randomly selected sample of 60 stocks that assigns each trader a typical investment horizon. Others have measured horizon-specific trading activity for a large number of stocks but only at a single horizon. e.g., Cella et al. (2013) sample the portfolio positions of institutional traders at the quarterly frequency using 13F filings. But, collecting data on traders’ portfolio positions is hard. While this approach works, it tends to restrict the analysis to only a handful of stocks (e.g., 60 randomly selected NASDAQ stocks) or to a single horizon (e.g., the quarterly horizon). Because we can use trading-volume data to infer traders’ investment horizons, we no longer face these data-collection restrictions since trading-volume data is publicly available. Broad cross-sectional studies of traders’ investment horizons are now possible.

2. Traders’ Problem

Let’s begin by outlining the data-generating process and describing the problem faced by traders with an H-period horizon. Suppose that returns are generated by a simple 1-factor model,

(1)   \begin{align*} r_{t+1} &= \phantom{-} \beta \cdot f_t + \epsilon_{t+1}, \end{align*}

where \epsilon_t \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\, \sigma_{\epsilon}^2) and the level of the factor evolves according to an \mathrm{AR}(1) model,

(2)   \begin{align*} \Delta f_{t+1} &= - \gamma \cdot f_t + \xi_{t+1}, \end{align*}

with \xi_t \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\, \sigma_{\xi}^2). Note that this is the exact same factor structure used in Garleanu and Pedersen (2013). The H-period returns in this model are:

(3)   \begin{align*} r_{t+H} &= \beta \cdot (1 - \gamma)^{H-1} \cdot f_t + \beta \cdot {\textstyle \sum_{h=1}^{H-1}} (1 - \gamma)^{(H-1) - h} \cdot \xi_{t+h} + \epsilon_{t+H}. \end{align*}

And, conditional on knowing the current level of the factor, f_t, it’s possible to compute the conditional mean and variance of these H-period-ahead returns, r_{t+H} \mid f_t \sim \mathrm{N}(\mu_{H,t}, \, \sigma_H^2):

(4)   \begin{align*} \mu_{H,t} &= \beta \cdot (1 - \gamma)^{H-1} \cdot f_t \\ \text{and} \quad \sigma_H^2 &= \sigma_{\epsilon}^2 + 1_{\{ H \geq 2 \}} \times \left\{ \, \beta^2 \cdot {\textstyle \sum_{h=1}^{H-1}} (1 - \gamma)^{2 \cdot \{(H-1) - h\}} \cdot \sigma_{\xi}^2 \, \right\}. \end{align*}
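A quick way to sanity-check Equations (3) and (4) is to roll the factor model forward and compare the sample moments of r_{t+H} against the closed-form ones. A sketch (the parameter values here are illustrative choices of mine, not the post’s calibration):

```python
import numpy as np

# Illustrative parameters (my choices, not the post's calibration)
beta, gamma, sig_eps, sig_xi = 0.90, 0.75, 1.0, 1.0
H, f0 = 4, 2.0

# Closed-form conditional moments from Equation (4)
mu_H = beta * (1 - gamma) ** (H - 1) * f0
var_H = sig_eps**2 + beta**2 * sum(
    (1 - gamma) ** (2 * ((H - 1) - h)) for h in range(1, H)
)

# Monte Carlo: roll the AR(1) factor forward (H-1) periods, then draw r_{t+H}
rng = np.random.default_rng(42)
n = 500_000
f = np.full(n, f0)
for _ in range(H - 1):
    f = (1 - gamma) * f + rng.normal(0.0, sig_xi, n)
r_H = beta * f + rng.normal(0.0, sig_eps, n)

print(mu_H, r_H.mean())   # both around 0.028
print(var_H, r_H.var())   # both around 1.86
```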

Traders at time t observe the current level of the factor, f_t, and choose how many shares to buy over the course of the next H periods in order to maximize their mean-variance utility. Let \Delta_H[x_t] = \sum_{h=0}^{H-1} \Delta_1[x_{t+h}] denote the change in a trader’s portfolio position over the period from t to (t+H), and let \gamma denote traders’ risk-aversion parameter (not to be confused with the factor’s rate of mean reversion in Equation (2)). Then, we can write the baseline utility function of a trader with no transaction costs as of time t as:

(5)   \begin{align*} v_t(\Delta_H[x_t]) &= \mu_{H,t} \cdot \Delta_H[x_t] - {\textstyle \frac{\gamma}{2}} \cdot \sigma_H^2 \cdot \Delta_H[x_t]^2. \end{align*}

The t subscript comes from the fact that traders can observe the level of the factor at time t, f_t, prior to making their investment decision. If we look at the trader’s decision at a different time, then he will have a different utility and will choose a different portfolio because the level of the factor will be different.

3. Convex Transaction Costs

The key assumption is that, on top of this baseline utility function, traders also face convex transaction costs. i.e., when a trader with an H-period horizon changes his position over the course of H periods, he pays a transaction cost,

(6)   \begin{align*} \mathit{tc}(\Delta_H[x_t]) &= \min_{\{\Delta_1[x_{t+h}] \}_{h=0}^{H-1}} \, \left\{ \, {\textstyle \frac{\kappa}{2} \cdot \sum_{h=0}^{H-1}} \Delta_1[x_{t+h}]^2 \, \middle| \, \Delta_H[x_t] \, \right\}, \end{align*}

where \kappa > 0 is a positive constant that captures the severity of these transaction costs. Thus, traders with an H-period investment horizon maximize the following objective function:

(7)   \begin{align*} \max_{\Delta_H[x_t]} \, \left\{ \, v_t(\Delta_H[x_t]) - \mathit{tc}(\Delta_H[x_t]) \, \right\}. \end{align*}

They choose the H-period change in portfolio positions that maximizes their mean-variance utility and they implement this change in a way that minimizes their transactions costs.

The convex transaction costs imply that traders smooth out their trading across periods over their H-period horizon. It’s easiest to see why this is the case when H = 2, since \Delta_1[x_{t+1}] = \Delta_2[x_t] - \Delta_1[x_t]. Suppose that traders know the optimal final position, \Delta_2[x_t]. Then, traders’ optimization problem from Equation (7) becomes:

(8)   \begin{align*} \max_{\Delta_1[x_t]} \left\{ \, \mu_{2,t} \cdot \Delta_2[x_t] - {\textstyle \frac{\gamma}{2}} \cdot \sigma_2^2 \cdot \Delta_2[x_t]^2 - \sfrac{\kappa}{2} \cdot \left\{ \, \Delta_1[x_t]^2 + (\Delta_2 [x_t] - \Delta_1 [x_t])^2 \, \right\} \, \right\}. \end{align*}

Taking the first-order condition with respect to \Delta_1[x_t],

(9)   \begin{align*} 0 &= - \kappa \cdot \left[ \, \Delta_1 [x_t] - (\Delta_2[x_t] - \Delta_1[x_t]) \, \right], \end{align*}

then implies that \Delta_1[x_t] = \sfrac{\Delta_2[x_t]}{2}. This simple exercise verifies that, when there are convex transaction costs, traders will want to split their orders up evenly across their investment horizon. In general, if traders have an H-period horizon, then they will choose \Delta_1[x_{t+h}] = \sfrac{\Delta_H[x_t]}{H}. Traders with an H-period investment horizon will have trading volume that is characterized by smooth, H-period-long intervals, like the ones described in the figure below.
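The even-split result generalizes beyond H = 2: under quadratic costs, any feasible schedule of one-period trades that adds up to the same total costs at least as much as the equal split. A quick numerical spot-check (my own sketch, with \kappa = 1):

```python
import numpy as np

rng = np.random.default_rng(7)
H, total = 4, 1.0
even = np.full(H, total / H)                 # the equal split, Delta_H[x_t] / H
even_cost = 0.5 * np.sum(even**2)            # transaction cost with kappa = 1

for _ in range(10_000):
    w = rng.normal(size=H)
    trades = even + (w - w.mean())           # random perturbation preserving the total
    assert np.isclose(trades.sum(), total)
    assert 0.5 * np.sum(trades**2) >= even_cost
print("equal split is cheapest:", even_cost)
```

Adding a mean-zero perturbation keeps the H-period position change fixed, so the loop checks every candidate against the constraint in Equation (6).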

[Figure: smooth-trading-at-different-horizons]

4. Fluctuations in Volume

If traders at horizon H are characterized by trading that’s smoothed over H periods, then we should be able to use inference tools like the wavelet-variance estimator to infer traders’ investment horizons. In a nutshell, this estimator computes the fraction of variation in a time series that comes from comparing successive H-period long intervals. See Percival and Walden (2000) for more information.
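Here’s a rough stand-in for that estimator, comparing the means of successive blocks at each dyadic scale rather than running a proper MODWT (the function name and implementation details are my own, not Percival and Walden’s):

```python
import numpy as np

def scale_variance_fractions(x, levels=6):
    """At each scale H = 2**j, compare the means of successive,
    non-overlapping H-period blocks of x. Return the share of the
    total across scales attributable to each scale."""
    x = np.asarray(x, dtype=float)
    variances = []
    for j in range(1, levels + 1):
        H = 2 ** j
        n_blocks = len(x) // H
        block_means = x[: n_blocks * H].reshape(n_blocks, H).mean(axis=1)
        diffs = np.diff(block_means)             # successive H-period comparisons
        variances.append(0.5 * np.mean(diffs ** 2))
    variances = np.array(variances)
    return variances / variances.sum()
```

For white noise (trading with no horizon structure), this puts the largest share of variance at the finest scale; smoothing trades over H periods should shift the mass toward \log_2(H).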

I run a pair of numerical experiments to show that this intuition is correct (code). First, using a data-generating process with \beta = 0.90, \gamma = 0.75, and \sigma_\epsilon = \sigma_\xi = 1, I simulate a long return time series. From Equation (7), it’s possible to compute the optimal portfolio position of a trader with horizon H,

(10)   \begin{align*} x_{H,t} &= \frac{\mu_{H,t}}{\gamma \cdot \sigma_H^2 + \sfrac{\kappa}{H}}. \end{align*}

For 5 different horizons, H^\star \in \{ \, 1, \, 4, \, 16, \, 64, \, 256 \,  \}, I then simulate the trading-volume time series that would occur if all traders operated at horizon H^\star. The figure below shows the fraction of the trading-volume variance occurring at each horizon according to the wavelet-variance estimator. Just as you’d expect, there is a spike in the fraction of trading-volume variance at the true horizon, H^\star… whatever that H^\star happens to be. i.e., there is a spike in the dashed green line, which corresponds to the trading-volume data where traders have a horizon of H^\star = 16, precisely at \log_2(H) = 4.

[Figure: plot--all-trading-at-single-horizon--12jul2016]

In addition to this single-horizon experiment, I also run a numerical experiment with traders operating at 2 different horizons. Specifically, using the same baseline parameters, I simulate a trading-volume time series where half of the volume comes from traders operating at the H^\star = 4-period horizon and half of the volume comes from traders operating at the H^\star = 64-period horizon. The figure below shows the fraction of the trading-volume variance occurring at each horizon according to the wavelet-variance estimator. Again, just as you’d expect, there is a spike in the fraction of trading-volume variance at both the H^\star = 4- and H^\star = 64-period horizons.

[Figure: plot--trading-at-two-different-horizons--12jul2016]


Investor Holdings, Naïve Beliefs, and Artificial Supply Constraints

June 24, 2016 by Alex

1. Motivation

In the standard model of house-price dynamics, there are two kinds of cities: supply constrained and supply unconstrained. In supply-constrained cities (e.g., New York, Boston, or San Francisco), it’s difficult and costly to build new houses because of geographic and regulatory hurdles. In supply-unconstrained cities (e.g., Las Vegas, Phoenix, or San Bernardino), these hurdles are much lower, and it’s much easier to build new houses. To get a sense of just how unconstrained places like Las Vegas are, take a look at this time-lapse video of Las Vegas from space. The number of houses balloons as housing demand in Las Vegas grows.

Now, suppose more people suddenly want to live in a particular city. If the city is supply constrained like New York, then people will have to outbid existing residents to move into that city, which will drive up the prices on existing houses in that city. More people want the same number of homes, so prices have to go up. By contrast, if the city is supply unconstrained like Las Vegas, then people who want to move into that city can just build new houses. In supply-unconstrained cities, the supply of housing adjusts to accommodate the additional demand. Why outbid an existing resident when you can just build the same house right next door? Thus, in the standard model, supply elasticity is a key determinant of house-price growth (Saiz, 2010).

But, during the mid-2000s, things got weird. Supply-unconstrained cities like Las Vegas realized huge spikes in transaction volumes and house prices (Chinco and Mayer, 2016). What changed? Why did Las Vegas suddenly look like a supply-constrained city? Is there some economic mechanism that might make supply-unconstrained cities behave like supply-constrained cities when there is a lot of trading activity? This post outlines how the combination of trading volume by house flippers (i.e., people who buy and then quickly resell houses without living in them) and naïve beliefs can generate artificial supply constraints in housing markets with lots of trading volume.

2. Everyday Example

To see how this mechanism works, it’s helpful to start with an example from everyday life. I love bagels. Imagine you’re at a bagel shop and you want to buy an everything bagel. The shop has lots of different kinds of bagels displayed in bins that contain 20 bagels each. So, at the start of each morning, there’s a bin of 20 plain bagels, a bin of 20 poppyseed bagels, a bin of 20 everything bagels, and so on… There’s a line, and it takes each person several seconds to order. Each time someone orders a bagel, the clerk takes it from one of the bins. Whenever one of the bins runs out, a second clerk takes it to the back of the shop and refills it with 20 more bagels, a task that takes 1 minute to complete.

[Figure: bagel-store-selection]

Without any sort of naïvety, supply and demand work exactly like you’d expect in this setup. If there is only 1 everything bagel left, then you might be willing to pay more than the price listed on the menu for that last bagel. But, if there were lots of bagels left, then you would never do this sort of thing. You’d just wait your turn in line and pay the listed price on the menu when you got to the counter. If a bin happened to run out when you were at the front of the line, then you’d recognize that it would be full in a minute and just wait until the second clerk got back.

Without any sort of naïvety, sales volume doesn’t have any effect on this equilibrium. To be sure, if there are lots of people in line and bagels are selling really quickly, then you’re more likely to find the everything-bagels bin empty when you get to the front of the line. It always takes the second clerk 1 minute to fill an empty bin. So, if there is a big line and more people pass by the front of the line per minute, then bins are more likely to run dry and more people arrive at the register when the second clerk is back in the kitchen. But, if bagel buyers are fully rational, then they’ll realize that each bin will be replenished in a minute and just wait till the fresh bagels come out before buying.

Introducing naïve beliefs changes things. If you don’t recognize that empty bins will be replenished in a minute, then you might be willing to outbid the guy in front of you for the 20th everything bagel in a bin—or, at the very least, try to talk him into a different order. And, when you get to the register during a busy time of morning, it’s going to look like the whole bagel shop is running low on supply since each individual bin is more likely to be in the process of being filled. If you could linger in the bagel shop for a while, then this naïvety wouldn’t matter. Any empty bins would get replenished while you were standing around making your decision. But, when there is a line out the door and you have to make a quick decision, your naïve beliefs make it look like there is an artificially low number of bagels available.

3. Simple Model

Investors play the role of the clerk that takes 1 minute to replace a bin of bagels. They take houses off the market for a short period of time. If trading volume is low or home buyers are fully rational, then they shouldn’t affect equilibrium house prices too much. But, if trading volume is very high and home buyers don’t realize that investor homes will come back on the market in 6 months to a year, then home buyers might get the impression that the supply of houses is getting low. I now outline a simple model to make these ideas more concrete.

How many different houses can a home buyer see on the market if he looks for h months? Let \textit{houses}_t denote the total number of houses, \textit{owner}_t denote the number of owner-occupied houses, \textit{investor}_t denote the number of investor-owned houses, and \textit{forSale}_t denote the number of houses that are currently for sale in month t:

(1)   \begin{align*} \textit{houses}_t = \textit{owner}_t + \textit{investor}_t + \textit{forSale}_t. \end{align*}

In any given month, home buyers can only visit houses that are currently for sale. Owner-occupied and investor-owned houses are off the market. A house might be owner occupied one month, for sale the next month, and owned by an investor several months later. I write the probability of transitioning from one state to another in matrix form:

(2)   \begin{align*} \begin{pmatrix} \textit{owner}_{t+1} \\ \textit{investor}_{t+1} \\ \textit{forSale}_{t+1} \end{pmatrix} = \begin{bmatrix} \gamma_{o \to o} & 0 & \gamma_{\textit{fs} \to o} \\  0 & \gamma_{i \to i} & \gamma_{\textit{fs} \to i} \\  \gamma_{o \to \textit{fs}} & \gamma_{i \to \textit{fs}} & \gamma_{\textit{fs} \to \textit{fs}} \end{bmatrix} \begin{pmatrix} \textit{owner}_t \\ \textit{investor}_t \\ \textit{forSale}_t \end{pmatrix}. \end{align*}

Each entry in this matrix represents the probability that a house transitions from one state to another. For example, \gamma_{o \to \textit{fs}} represents the probability a house goes from being owner occupied one month to for sale the next. And, \gamma_{i \to i} represents the probability that a house is investor owned in month (t+1) given that it was investor owned in month t. The columns of this matrix sum to 1 and \gamma_{o \to i} = \gamma_{i \to o} = 0 since a house has to be for sale before it can pass from one owner to the next. The diagram below gives an alternative way of representing these transition probabilities that doesn’t use matrix notation.

[Figure: state-diagram]

The number of houses that are always owner occupied after h months of looking is \gamma_{o \to o}^h \cdot \textit{owner}_t. So, when there aren’t any investors, the number of houses that a home buyer can choose from after looking for h months is \widetilde{\textit{supply}}_h = 1 - \gamma_{o \to o}^h \cdot \textit{owner}_t, where I’ve normalized \textit{houses}_t = 1. The number of houses that are always investor owned after h months is \gamma_{i \to i}^h \cdot \textit{investor}_t. So, when there are investors, the number of houses that a home buyer can choose from after looking for h months is given by:

(3)   \begin{align*} \textit{supply}_h = 1 - \gamma_{o \to o}^h \cdot \textit{owner}_t - \gamma_{i \to i}^h \cdot \textit{investor}_t. \end{align*}

Thus, the supply constraint imposed by investors on the number of houses that a home buyer can view in h months is given by:

(4)   \begin{align*} \textit{constraint}_h = \frac{ \gamma_{i \to i}^h \cdot \textit{investor}_t }{ 1 - \gamma_{o \to o}^h \cdot \textit{owner}_t }. \end{align*}

This term is just the change in the observed housing supply after h months due to the presence of investors, (1 - \textit{constraint}_h) \times \widetilde{\textit{supply}}_h = \textit{supply}_h. If \textit{constraint}_6 = 0.05, then investors decrease the number of houses for sale over the course of 6 months by 5{\scriptstyle \%}. If there would have been 100 houses for sale over the course of 6 months without investors, there are only 95 houses for sale with investors.

4. Plugging in Numbers

This model is nice because it’s easy to plug in numbers to see how investor holdings can affect the perceived housing supply for naïve home buyers. We can go back and forth between holding-period lengths and transition probabilities by using the geometric distribution. Investors typically hold onto their houses for 6 months, implying that \gamma_{i \to i} = \sfrac{6}{7}. Owners typically hold onto their houses for 10 years, implying that \gamma_{o \to o} = \sfrac{120}{121}. Owners and investors are equally likely to buy houses, \gamma_{\textit{fs} \to o} = \gamma_{\textit{fs} \to i}. Suppose that the typical house stays on the market for 1 year, implying that \gamma_{\textit{fs} \to \textit{fs}} = \sfrac{12}{13}.

[Figure: plot--investors-and-supply-constraints--11jun2016]

The figure above shows the fraction decrease in the housing supply perceived by naïve home buyers as a function of their search duration when 10{\scriptstyle \%} of the housing stock is for sale in any given month (code). e.g., if a home buyer would have seen 100 houses in 3 months in the absence of investors, then he sees only 91 houses in 3 months when 2{\scriptstyle \%} of houses are initially owned by investors. The dashed green line says that 2{\scriptstyle \%} investor holdings can lead to a 9{\scriptstyle \%} drop in the housing supply as perceived by naïve home buyers. As the number of months that a naïve home buyer searches drops, the impact of investor holdings rises sharply. When search durations are really short, like they were in Las Vegas during the mid 2000s, tiny amounts of investor ownership can have enormous impacts on the perceived housing supply.
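The 9{\scriptstyle \%} number behind the dashed green line is easy to reproduce from Equation (4). A minimal sketch (variable names are mine):

```python
g_ii = 6 / 7       # investors hold for ~6 months
g_oo = 120 / 121   # owner-occupants hold for ~10 years

def constraint(h, owner, investor):
    # Equation (4): share of the h-month perceived supply removed by investors
    return (g_ii ** h * investor) / (1 - g_oo ** h * owner)

# 10% of the stock for sale, 2% investor-owned, 3-month search:
print(round(100 * constraint(3, owner=0.88, investor=0.02), 1))  # ~8.9%
```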


Asset-Pricing Implications of Dimensional Analysis

May 14, 2016 by Alex

1. Motivation

I have been trying to use dimensional analysis to understand asset-pricing problems. In many hard physical problems, it is possible to gain some insight about the functional form of the solution by examining the dimensions of the relevant input variables. In the canonical example of this brand of analysis, G.I. Taylor was able to tell the yield of the Trinity Test nuclear explosion from a few photographs via dimensional analysis (see Barenblatt 2003 and earlier post). So, maybe it is possible to better understand, say, the price impact of informed trading by studying the dimensions of this problem?

However, none of the asset-pricing problems I have looked at via dimensional analysis have yielded pretty solutions. It could be that the fundamental asset-pricing equations aren’t dimensionally consistent. Such equations do exist and they can be very helpful. For instance, Dolbear’s Law says that you can tell the outdoor temperature on a summer evening by counting the frequency of cricket chirps,

(1)   \begin{align*} \text{temperature in degrees fahrenheit} = 40^{\circ} + \{ \, \text{cricket chirps every 15 seconds} \, \}. \end{align*}

But, I don’t think this is what’s going on. Instead, my sense is that, because asset-pricing models are built by researchers trying to convey economic intuition rather than dictated by the physical constraints of a particular real-world problem, there aren’t any interesting unexplored symmetries hiding in the asset-pricing models for dimensional analysis to uncover. A good economist never includes superfluous variables when constructing a model, but there is often unexpected redundancy in our initial formulations of hard physical problems that we find in nature. This post explains my (perhaps wrong) intuition in more detail.

2. Period of Pendulum

Let’s start by looking at a physical problem where dimensional analysis actually helps. Consider the problem of modeling the period of a pendulum with length \ell and mass m. Suppose that in order to get the pendulum swinging, I initially pull it a distance of a centimeters off to the side. In this setup, we can write the period of the pendulum as some function of these variables,

(2)   \begin{align*} p &= \mathrm{f}(\ell,m,a,g), \end{align*}

together with the acceleration due to gravity, g.

The key insight in dimensional analysis is that the pendulum shouldn’t behave differently if we measure its length in inches rather than centimeters. The marks on our ruler don’t matter. The period of the pendulum has dimensions of time, \mathrm{dim}[p] = T. The length of the pendulum and the amplitude of its swing have dimensions of length, \mathrm{dim}[\ell] = \mathrm{dim}[a] = L. The mass of the pendulum has dimensions of (wait for it…) mass, \mathrm{dim}[m] = M. And, the force of gravity has dimensions, \mathrm{dim}[g] = L \cdot T^{-2}. Suppose that we define new units of mass, length, and time so that 1 new unit of mass is equal to \mu old units of mass, 1 \overset{M}{\longrightarrow} \mu, 1 new unit of length is equal to \lambda old units of length, 1 \overset{L}{\longrightarrow} \lambda, and 1 new unit of time is equal to \tau old units of time, 1 \overset{T}{\longrightarrow} \tau. If our choice of units doesn’t affect the pendulum’s behavior, then we should be able to rewrite our old formula in these new units,

(3)   \begin{align*} {\textstyle \frac{p}{\tau}} &= \mathrm{f}\left( \, {\textstyle \frac{\ell}{\lambda}}, \, {\textstyle \frac{m}{\mu}}, \, {\textstyle \frac{a}{\lambda}}, \, {\textstyle \frac{g}{\sfrac{\lambda}{\tau^2}}} \, \right). \end{align*}

Now comes the trick. Notice that these new units can be anything we want. So, let’s get clever and pick \mu = m, \lambda = \ell, and \tau = \sqrt{\sfrac{\ell}{g}}. With these values the formula for the period of the pendulum becomes

(4)   \begin{align*} {\textstyle \frac{p}{\sqrt{\sfrac{\ell}{g}}}} &= \mathrm{f}(1,1,\sfrac{a}{\ell},1) = \mathrm{f}^\star(\sfrac{a}{\ell}), \end{align*}

where \mathrm{f}^\star(\cdot) is a new function of a dimensionless ratio, \sfrac{a}{\ell}. Thus, we know that the period of a pendulum is

(5)   \begin{align*} p &= \sqrt{\sfrac{\ell}{g}} \times \mathrm{f}^{\star}(\sfrac{a}{\ell}). \end{align*}

Without knowing anything except for the units that each variable is being measured in, we can see that 1) the period is unrelated to the mass and 2) the period of the pendulum is inversely proportional to the square root of the acceleration due to gravity, \sqrt{g}. Functional forms without physics! We now know how to compute the period of the same pendulum on Mars.
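For a concrete sense of the Mars computation: in the small-amplitude limit, \mathrm{f}^{\star}(\sfrac{a}{\ell}) \to 2 \cdot \pi (a standard physics result, not something dimensional analysis hands you), so the period is one line of arithmetic:

```python
import math

def period(length, g):
    # Small-amplitude pendulum period: p = sqrt(l/g) * f*(a/l), with f* -> 2*pi
    return 2 * math.pi * math.sqrt(length / g)

p_earth = period(1.0, 9.81)   # ~2.0 seconds
p_mars = period(1.0, 3.71)    # ~3.3 seconds

# The ratio depends only on gravity, exactly as Equation (5) predicts:
print(p_mars / p_earth, math.sqrt(9.81 / 3.71))
```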

3. Price Impact

What happens if we try to use this same trick to understand price impact in the stock market? That is, how much does the price of a stock move if traders demand an extra 100 shares on a particular day? Let’s use the standard terminology from information-based asset-pricing models and define price impact as a function of 3 variables,

(6)   \begin{align*} \lambda &= \mathrm{f}(\sigma_v, \sigma_z, \gamma), \end{align*}

where \mathrm{dim}[\lambda] = D \cdot S^{-1} with D denoting dollars and S denoting shares. Suppose that informed traders know the fundamental value of the stock, v, but uninformed traders don’t. Let \sigma_v denote the volatility of the stock’s value from the perspective of uninformed traders, \mathrm{dim}[\sigma_v] = D \cdot S^{-1}, and let \sigma_z denote the volatility of asset-supply noise, \mathrm{dim}[\sigma_z] = S. This is the noise term that keeps the asset’s price from being perfectly revealing. Finally, let \gamma denote the risk aversion of the informed traders, \mathrm{dim}[\gamma] = D^{-1}.

By the logic of dimensional analysis, it shouldn’t matter whether we measure a stock’s value in dollars or euros and it shouldn’t matter whether we measure changes in demand in shares or tens of shares. So, suppose that 1 new unit of value is equal to \delta old units of value, 1 \overset{D}{\longrightarrow} \delta and that 1 new unit of quantity is equal to \psi old units of quantity, 1 \overset{S}{\longrightarrow} \psi. If our choice of units doesn’t affect market behavior, then we should be able to rewrite our old formula for price impact in these new units,

(7)   \begin{align*} \frac{\lambda}{\sfrac{\delta}{\psi}} &= \mathrm{f}\left(\frac{\sigma_v}{\sfrac{\delta}{\psi}}, \frac{\sigma_z}{\psi}, \frac{\gamma}{\sfrac{1}{\delta}}\right), \end{align*}

just like before.

Now comes the trouble. If we get clever and choose our units to create a function of a single dimensionless variable, \psi = \sigma_z and \delta = \sfrac{1}{\gamma}, we find that:

(8)   \begin{align*} \lambda &= {\textstyle \frac{1}{\gamma \cdot \sigma_z}} \cdot \mathrm{f}^\star (\gamma \cdot \sigma_z \cdot \sigma_v). \end{align*}

In the pendulum problem above, the single dimensionless quantity only involved some of the relevant variables; however, in the price-impact problem the dimensionless quantity involves all 3 of the relevant variables. There is no progress. Before applying dimensional analysis we had an unknown function of 3 variables. After applying dimensional analysis we still have an unknown function of 3 variables. Dimensional analysis doesn’t provide any new insight about the functional form of the link between the quantity of interest (i.e., the price impact, \lambda) and any of the input parameters (i.e., the values \sigma_v, \sigma_z, or \gamma). I always seem to find this sort of non-result when applying dimensional analysis to asset-pricing problems.
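As a sanity check on Equation (8), note that it can still accommodate the risk-neutral benchmark. In the single-auction Kyle (1985) model, price impact is \lambda = \sfrac{\sigma_v}{(2 \cdot \sigma_z)}, which corresponds to \mathrm{f}^{\star}(x) = \sfrac{x}{2}; the risk-aversion parameter cancels, exactly as it must in a risk-neutral model:

```latex
\begin{align*}
\lambda
  &= \frac{1}{\gamma \cdot \sigma_z} \cdot \mathrm{f}^{\star}(\gamma \cdot \sigma_z \cdot \sigma_v)
   = \frac{1}{\gamma \cdot \sigma_z} \cdot \frac{\gamma \cdot \sigma_z \cdot \sigma_v}{2}
   = \frac{\sigma_v}{2 \cdot \sigma_z}.
\end{align*}
```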

4. Main Intuition

I think this particular non-result in the price-impact problem is suggestive of why dimensional analysis doesn’t help that much when trying to understand asset-pricing models more generally. What makes the canonical information-based asset-pricing papers great is that they pack a lot of economic intuition into a relatively simple model. There isn’t a lot of superfluous structure hanging around. When you look at the original formulation of the pendulum problem, there was a bunch of redundancy involved. The mass of the pendulum turned out to be irrelevant, and two of the variables, length and amplitude, turned out to have the exact same units. There is no such redundancy in the price-impact problem. As defined, the parameters \sigma_v, \sigma_z, and \gamma are all needed to define the dimensionless quantity. The elegance of models like Kyle (1985) makes them unsuited to dimensional analysis.

To illustrate, consider changing the original price-impact problem slightly. Suppose that we, as econometricians, could directly observe the inverse of the dollar-demand volatility from noise traders, \sfrac{1}{\sigma_y}, which has units of inverse dollars, \mathrm{dim}[\sfrac{1}{\sigma_y}] = D^{-1}, instead of the demand volatility, which has units of shares, \mathrm{dim}[\sigma_z] = S. This is a less elegant model because it is needlessly complex. Demand volatility in shares now depends on both the equilibrium price and the shares demanded by noise traders. But, let’s go with it. In this new setup, the price impact is still an unknown function of 3 variables,

(9)   \begin{align*} \lambda &= \mathrm{f}(\sigma_v, \sfrac{1}{\sigma_y}, \gamma), \end{align*}

but now, because there is redundancy, we can make progress via dimensional analysis.

Again, suppose that 1 new unit of value is equal to \delta old units of value, 1 \overset{D}{\longrightarrow} \delta and that 1 new unit of quantity is equal to \psi old units of quantity, 1 \overset{S}{\longrightarrow} \psi. If our choice of units doesn’t affect market behavior, then we should be able to rewrite our old formula for price impact in these new units:

(10)   \begin{align*} \frac{\lambda}{\sfrac{\delta}{\psi}} &= \mathrm{f}\left(\frac{\sigma_v}{\sfrac{\delta}{\psi}}, \frac{\sfrac{1}{\sigma_y}}{\sfrac{1}{\delta}}, \frac{\gamma}{\sfrac{1}{\delta}}\right). \end{align*}

If we choose our units to create a function of a single dimensionless variable, \delta = \sigma_y and \psi = \sfrac{\sigma_y}{\sigma_v}, we find that:

(11)   \begin{align*} \lambda &= \sigma_v \cdot \mathrm{f}^\star(\sigma_y \cdot \gamma). \end{align*}

Before we had an unknown function of 3 variables and now we have an unknown function of only 2 variables. Progress! If we found situations where the product of informed traders’ risk aversion and dollar-demand noise was constant, \sigma_y \cdot \gamma = \text{Const.}, then we could actually test this relationship with a regression,

(12)   \begin{align*} \log(\lambda) &= \alpha + \beta \times \log(\sigma_v) + \epsilon, \end{align*}

and check whether or not \beta = 1. But, the only reason that we could make progress in this alternative setting was that the model was written down clumsily in the first place.

