Alex – Page 6 – Research Notebook

Intuition Behind the Bayesian LASSO

September 24, 2016 by Alex

1. Motivating Question

Imagine you’ve just seen Apple’s most recent return, $r$ , which is Apple’s long-run expected return, $\mu^\star$ , plus some random noise, $\epsilon \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0, \, 1)$ :

(1) $\begin{align*} r &= \mu^\star + \epsilon. \end{align*}$

You want to use this realized return, $r$ , to estimate Apple’s long-run expected return, $\mu^\star$ . The LASSO is a popular way to solve this problem. The LASSO estimates Apple’s long-run expected return, $\mu^\star$ , by choosing a $\hat{\mu}$ that’s as close as possible to the realized $r$ while taking into account an absolute-value penalty,

(2) $\begin{align*} \hat{\mu}(r) = \arg \min_{\mu \in \mathrm{R}} \left\{ \, {\textstyle \frac{1}{2}} \cdot (r - \mu)^2 + \lambda \cdot |\mu| \, \right\}, \end{align*}$

where $\lambda \geq 0$ is the strength of this penalty. If you use the LASSO, then you’ll estimate:

(3) $\begin{align*} \hat{\mu}(r) = \begin{cases} \mathrm{Sign}(r) \cdot (|r| - \lambda) &\text{if } |r| > \lambda, \text{ and} \\ 0 &\text{if } |r| \leq \lambda. \end{cases} \end{align*}$

Suppose that you chose $\lambda = 1.0{\scriptstyle \%}$ . If Apple’s most recent stock return was $r = 0.3{\scriptstyle \%}$ , then the LASSO will pick $\hat{\mu} = 0{\scriptstyle \%}$ . And, if Apple’s most recent stock return was $r = -0.7{\scriptstyle \%}$ , then the LASSO will still pick $\hat{\mu} = 0{\scriptstyle \%}$ . But, if Apple’s most recent stock return was $r = 1.2{\scriptstyle \%}$ , then the LASSO will give an estimate of $\hat{\mu} = 0.2{\scriptstyle \%}$ .

The LASSO seems like it’s throwing away lots of information. In the example above, you didn’t adjust your estimate of Apple’s long-run expected return at all when you saw returns of $0.3{\scriptstyle \%}$ and $-0.7{\scriptstyle \%}$ . So, it’s surprising that, if Apple’s long-run expected return, $\mu^\star$ , was drawn from a Laplace distribution,

(4) $\begin{align*} \mathrm{Pr}( \mu^\star = \mu ) = {\textstyle \frac{\lambda}{2}} \cdot e^{- \lambda \cdot |\mu|}, \end{align*}$

then using the LASSO to estimate $\mu^\star$ would be the Bayesian thing to do (Park and Casella, 2008). If $\mu^\star \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{Laplace}(\lambda = 1.0{\scriptstyle \%})$ , then it’s correct to just ignore any return smaller than $1.0{\scriptstyle \%}$ when estimating $\mu^\star$ .

Why is this? If you cross your eyes and squint, you can sort of see why the Laplace distribution might be linked to the LASSO. Both use the Greek-letter $\lambda$ and involve $|\mu|$ . But, lot’s of distributions use the absolute-value operator (e.g., the Wishart distribution). And, there are lots of Greek letters. That’s how letters work. I could just as easily have called the scale parameter in the Laplace distribution $\alpha$ , $\beta$ , or $\gamma$ instead of $\lambda$ . So, what’s special about the Laplace distribution? What is it about the Laplace distribution that makes using the LASSO correct? How can it ever be Bayesian to throw information away?

2. Simpler Problem

To answer these questions, let’s start by looking at a simpler inference problem. Suppose that Apple’s long-run expected return is drawn from a Normal distribution, $\mu^\star \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0, \, \sigma_\mu^2)$ :

(5) $\begin{align*} \mathrm{Pr}(\mu^\star = \mu) &= {\textstyle \frac{1}{\sigma_\mu \cdot \sqrt{2 \cdot \pi}}} \cdot e^{- \frac{1}{2 \cdot \sigma_\mu^2} \cdot (\mu - 0)^2}. \end{align*}$

If $\mu^\star$ is drawn from a Normal distribution, then you definitely don’t want to use the LASSO.

Bayes’ rule tells you that:

(6) $\begin{align*} \mathrm{Pr}(\mu^\star = \mu|r) &\propto \mathrm{Pr}(r|\mu) \times \mathrm{Pr}(\mu) \\ &= \left\{ \, {\textstyle \frac{1}{\sqrt{2 \cdot \pi}}} \cdot e^{- \frac{1}{2} \cdot (r - \mu)^2} \, \right\} \times \left\{ \, {\textstyle \frac{1}{\sigma_\mu \cdot \sqrt{2 \cdot \pi}}} \cdot e^{- \frac{1}{2 \cdot \sigma_\mu^2} \cdot (\mu - 0)^2} \, \right\}. \end{align*}$

$\mathrm{Pr}(\mu^\star = \mu|r)$ is the posterior likelihood that Apple’s long-run expected return is $\mu$ given that you’ve just seen a realized return of $r$ . $\mathrm{Pr}(r|\mu)$ is the probability that Apple realizes a return of $r$ if its long-run expected return is $\mu$ . And, $\mathrm{Pr}(\mu)$ is the probability that Apple’s long-run expected return is $\mu^\star = \mu$ in the first place.

You want to choose the $\hat{\mu}$ that maximizes this posterior likelihood $\mathrm{Pr}(\mu^\star = \mu|r)$ , or equivalently, that minimizes the negative of the log of this posterior likelihood:

(7) $\begin{align*} \hat{\mu}(r) = \arg \min_{\mu \in \mathrm{R}} \left\{ \, (r - \mu)^2 + (\sfrac{1}{\sigma_\mu^2}) \cdot (\mu - 0)^2 \, \right\}. \end{align*}$

When Apple’s long-run expected return is drawn from a Normal distribution, you want to choose a $\hat{\mu}$ that’s as close as possible to $r$ while taking into account a quadratic penalty not an absolute-value penalty. When $\mu^\star$ is drawn from a Normal distribution, you’re never going to ignore small realized returns.

On one hand, you could pick a $\hat{\mu}$ that’s really close to Apple’s recent return to make $(r - \hat{\mu})^2$ small. On the other hand, you could pick a $\hat{\mu}$ close to $0$ to make $(\sfrac{1}{\sigma_\mu^2}) \cdot (\hat{\mu} - 0)^2$ small. Your priors determine what you do:

(8) $\begin{align*} \hat{\mu}(r) = \left( {\textstyle \frac{\sigma_\mu^2}{1.0{\scriptstyle \%}^2 + \sigma_\mu^2}} \right) \cdot r. \end{align*}$

If you don’t have very strong priors about Apple’s long-run expected return ( $\sigma_\mu \gg 1.0{\scriptstyle \%}$ ), then you’re going to pick $\hat{\mu} \approx r$ since $\sfrac{\sigma_\mu^2}{(1.0{\scriptstyle \%}^2 + \sigma_\mu^2)} \approx 1$ . By contrast, if you have very strong priors ( $\sigma_\mu \ll 1.0{\scriptstyle \%}$ ), then you’re going to pick $\hat{\mu} \approx 0{\scriptstyle \%}$ since $\sfrac{\sigma_\mu^2}{(1.0{\scriptstyle \%}^2 + \sigma_\mu^2)} \approx 0$ . To illustrate, suppose that you’re really sure that Apple’s long-run expected return is close to $0{\scriptstyle \%}$ with $\sigma_{\mu} = 0.1{\scriptstyle \%}$ . Then, if you see Apple realize a return of $r = 6.0{\scriptstyle \%}$ , you’re going to think that this realization was probably due to a positive random shock, $\epsilon = 5.94{\scriptstyle \%}$ , and only pick $\hat{\mu} = 0.06{\scriptstyle \%}$ .

3. Mixture Model

Now, let’s tweak the setup slightly. Suppose that, instead of being constant, the standard deviation of Apple’s long-run expected return can be either high or low,

(9) $\begin{align*} \overline{\sigma}_{\mu} \gg \sigma_{\epsilon} = 1.0{\scriptstyle \%} \gg \underline{\sigma}_{\mu}, \end{align*}$

with the high value much larger than $\sigma_{\epsilon} = 1.0{\scriptstyle \%}$ and the low value much smaller than $\sigma_{\epsilon} = 1.0{\scriptstyle \%}$ . Each case equally likely: $\mathrm{Pr}(\sigma_\mu = \overline{\sigma}_{\mu} ) = \mathrm{Pr}( \sigma_\mu = \underline{\sigma}_{\mu} ) = \sfrac{1}{2}$ . It turns out that you’re going to behave a lot like someone using the LASSO when you estimate Apple’s long-run expected return in this mixture model.

Regardless of the model, if you want to estimate Apple’s long-run expected return, then you have to use Bayes’ rule. And, just like before, Bayes’ rule tells you that:

(10) $\begin{align*} \mathrm{Pr}(\mu^\star = \mu|r) \propto \mathrm{Pr}(r|\mu) \times \mathrm{Pr}(\mu). \end{align*}$

But, now there’s an extra layer to the problem. The standard deviation of Apple’s long-run expected return can either be high or low,

(11) $\begin{align*} \mathrm{Pr}(\mu) = {\textstyle \frac{1}{2}} \cdot \mathrm{Pr}(\mu|\sigma_\mu = \overline{\sigma}_\mu) + {\textstyle \frac{1}{2}} \cdot \mathrm{Pr}(\mu|\sigma_\mu = \underline{\sigma}_\mu). \end{align*}$

You don’t know which it is. But, if you knew that $\sigma_{\mu} = \overline{\sigma}_{\mu} = 10{\scriptstyle \%} \gg 1.0{\scriptstyle \%}$ , then you’d pick $\hat{\mu} = (\sfrac{100}{101}) \cdot r$ . Whereas, if you knew that $\sigma_{\mu} = \overline{\sigma}_{\mu} = 0.10{\scriptstyle \%} \ll 1.0{\scriptstyle \%}$ , then you’d pick $\hat{\mu} = (\sfrac{1}{101}) \cdot r$ . Your estimate when $\sigma_\mu = \overline{\sigma}_{\mu}$ is going to really different from your estimate when $\sigma_\mu = \underline{\sigma}_{\mu}$ .

Let’s flesh out what this means. You want to estimate Apple’s long-run expected return, $\mu^\star$ , by choosing the $\hat{\mu}$ that maximizes the posterior likelihood $\mathrm{Pr}(\mu^\star = \mu|r)$ ,

(12) $\begin{align*} \hat{\mu}(r) = \arg \max_{\mu \in \mathrm{R}} \left\{ \, {\textstyle \frac{1}{\sqrt{2 \cdot \pi}}} \cdot e^{- \frac{1}{2} \cdot (r - \mu)^2} \, \right\} \times \left\{ \, {\textstyle \frac{1}{2}} \cdot {\textstyle \frac{1}{\overline{\sigma}_\mu \cdot \sqrt{2 \cdot \pi}}} \cdot e^{- \frac{1}{2 \cdot \overline{\sigma}_\mu^2} \cdot (\mu - 0)^2} + {\textstyle \frac{1}{2}} \cdot {\textstyle \frac{1}{\underline{\sigma}_\mu \cdot \sqrt{2 \cdot \pi}}} \cdot e^{- \frac{1}{2 \cdot \underline{\sigma}_\mu^2} \cdot (\mu - 0)^2} \, \right\}. \end{align*}$

It’s hard to solve for $\hat{\mu}(r)$ analytically when $\overline{\sigma}_{\mu}$ and $\underline{\sigma}_{\mu}$ can take on arbitrary values, but the assumption that $\overline{\sigma}_{\mu} \gg 1.0{\scriptstyle \%} \gg \underline{\sigma}_{\mu}$ simplifies things nicely. And, the resulting analysis reveals why you’re going to do something LASSO-esque when learning about Apple’s long-run expected return in this mixture model.

There are $2$ cases. First, consider the case where Apple realizes a really big return, $|r| \gg 1.0{\scriptstyle \%}$ . This really big return would be really unlikely if $\sigma_\mu = \underline{\sigma}_\mu$ because $\underline{\sigma}_\mu \ll 1.0{\scriptstyle \%}$ is really small. So, you can safely assume that $\sigma_\mu = \overline{\sigma}_{\mu}$ and just solve the optimization problem from Section 2:

(13) $\begin{align*} \hat{\mu}(r) = \arg \min_{\mu \in \mathrm{R}} \left\{ \, (r - \mu)^2 + (\sfrac{1}{\overline{\sigma}_\mu^2}) \cdot (\mu - 0)^2 \, \right\}. \end{align*}$

But, as we saw in Section 2 that, if your priors are really weak ( $\overline{\sigma}_\mu \gg 1.0{\scriptstyle \%}$ ), then you should ignore them since $\sfrac{\overline{\sigma}_\mu^2}{(1.0{\scriptstyle \%}^2 + \overline{\sigma}_\mu^2)} \approx 1$ . So, you’re going to set $\hat{\mu}(r) \approx r$ whenever $|r| \gg 1.0{\scriptstyle \%}$ , just like someone using the LASSO.

Now, consider the other case where Apple realizes a really small return, $|r| \ll 1.0{\scriptstyle \%}$ . Again, this really small return would be really unlikely if $\sigma_\mu = \overline{\sigma}_\mu$ because $\overline{\sigma}_\mu \gg 1.0{\scriptstyle \%}$ is really big. So, you can assume that $\sigma_\mu = \underline{\sigma}_{\mu}$ and just solve the optimization problem:

(14) $\begin{align*} \hat{\mu}(r) = \arg \min_{\mu \in \mathrm{R}} \left\{ \, (r - \mu)^2 + (\sfrac{1}{\underline{\sigma}_\mu^2}) \cdot (\mu - 0)^2 \, \right\}. \end{align*}$

But, now the opposite logic holds. If your priors are really strong ( $\underline{\sigma}_\mu \ll 1.0{\scriptstyle \%}$ ), then you should ignore $r$ since $\sfrac{\underline{\sigma}_\mu^2}{(1.0{\scriptstyle \%}^2 + \underline{\sigma}_\mu^2)} \approx 0$ . So, you’re going to set $\hat{\mu}(r) \approx 0$ whenever $|r| \ll 1.0{\scriptstyle \%}$ . This is the LASSO’s dead zone!

The figure below shows that, as the high and low standard deviations get more extreme, you’re going to behave more and more like someone using the LASSO when learning about Apple’s long-run expected return in this mixture model. But, the insight is more general than that. You’re going to behave like someone using the LASSO any time a small realized return, $r$ , tells you that you should be using stronger priors about Apple’s long-run expected return, $\mu^\star$ .

4. Laplace Distribution

If Apple’s long-run expected return is drawn from a Laplace distribution, then you face an estimation problem just like the one in the mixture model above. Andrews and Mallows (1974) shows that a Laplace distribution can be re-written as the weighted average of Normal distributions with different standard deviations,

(15) $\begin{align*} {\textstyle \frac{\lambda}{2}} \cdot e^{- \lambda \cdot |\mu|} = \int_0^\infty \, \left\{ \, {\textstyle \frac{1}{\sigma_\mu \cdot \sqrt{2 \cdot \pi}}} \cdot e^{- \frac{1}{2 \cdot \sigma_{\mu}^2} \cdot (\mu - 0)^2} \, \right\} \times \left\{ \, {\textstyle \frac{\lambda^2}{2}} \cdot e^{- \frac{\lambda^2}{2} \cdot \sigma_{\mu}^2} \, \right\} \times \mathrm{d}\sigma_\mu, \end{align*}$

where the weights follow an Exponential distribution. The Exponential distribution has a really fat tail. If the standard deviation of Apple’s long-run expected return is distributed $\sigma_{\mu} \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{Exponential}(\lambda^2)$ , then these standard deviations could be either really large or really small. We just saw that this is exactly what needs to happen for a LASSO-like estimation strategy to be optimal. There are lots of distributions for $\sigma_{\mu}$ that have this property—we just saw another one above. But, if you use $\sigma_{\mu} \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{Exponential}(\lambda^2)$ , then the probabilities of realizing large and small values of $\sigma_\mu$ line up in such a way that it’s precisely optimal to use the LASSO.

In the original paper, there are a ton of extra hyper-parameters. For example, $\sigma_{\epsilon}$ is a random variable. This clearly isn’t necessary. You just need the standard deviation of Apple’s long-run expected return to fluctuate wildly around $\sigma_{\epsilon}$ . You can get a situation where the LASSO is really close to being optimal with just $\overline{\sigma}_{\mu} \gg \sigma_{\epsilon} \gg \underline{\sigma}_{\mu}$ .

Also, in the original paper, there’s a lengthy discussion about properly “conditioning on $\sigma_{\epsilon}$ .” The authors include this bizarre example of how the posterior distribution of $\hat{\mu}(r)$ might not be unimodal if you don’t condition on $\sigma_{\epsilon}$ that, for me anyways, always seems to come out of left field. And, textbooks typically brush this point under the rug, calling it a technical conditions. But, the analysis above shows that it’s not just a technical condition. It’s actually really important!

To see why, consider estimating Apple’s long-run expected return in a mixture model with

(16) $\begin{align*} \overline{\sigma}_{\mu} = 10{\scriptstyle \%} \gg \sigma_{\epsilon} = \underline{\sigma}_{\mu} = 0.10{\scriptstyle \%}. \end{align*}$

The only difference from before is that $\sigma_{\epsilon} = 0.10{\scriptstyle \%}$ instead of $\sigma_{\epsilon} = 1.0{\scriptstyle \%}$ . If $\sigma_{\epsilon}$ isn’t sufficiently large relative to $\underline{\sigma}_{\mu}$ , then you’re never going to ignore the Apple’s realized return when $|r|$ is small. With these new numbers, $\hat{\mu}(0.50{\scriptstyle \%}) = \sfrac{0.10{\scriptstyle \%}^2}{(0.10{\scriptstyle \%}^2 + 0.10{\scriptstyle \%}^2)} \cdot 0.50{\scriptstyle \%} = 0.25{\scriptstyle \%}$ rather than $0.005{\scriptstyle \%}$ . When choosing a distribution for $\sigma_\mu$ , you’ve got to make sure that the high standard-deviation outcomes are big enough and the low standard-deviation outcomes are small enough relative to $\sigma_{\epsilon}$ . Otherwise, a LASSO-like estimation strategy can’t be optimal.

FYI: Here’s the code to create the figures.

Inferring Trader Horizons from Trading Volume

July 13, 2016 by Alex

1. Motivating Example

This post shows that, if traders face convex transaction costs (i.e., it costs them more per share to buy $2$ shares of stock than to buy $1$ share of stock), then it is possible to infer traders’ investment horizons from trading-volume data. To see why, imagine you are a trader and you’ve just learned some positive news about Apple’s next earnings announcement, which is due out overnight. To take advantage of this revelation, you will need to buy shares of Apple stock at some point today. In order to minimize your transaction costs, you will want to space out your demand for Apple shares as much as possible. So, all else equal, the average demand for Apple’s shares will be slightly higher today than it was yesterday because of your earnings revelation. This same logic applies to information at other horizons. Thus, if a larger fraction of the variation in Apple’s trading volume comes from day-to-day differences, then more of Apple’s traders must be operating at the daily horizon. Whereas, if a larger fraction of the variation in Apple’s trading volume comes from week-to-week differences, then more of Apple’s traders must be operating at the weekly horizon.

In the past, when researchers have studied traders’ investment horizons, they have used data on portfolio positions rather than trading volume. Some have measured trading activity at a couple of investment horizons for a small number of stocks. e.g., Brogaard et al. (2014) use NASDAQ data on a randomly selected sample of $60$ stocks that assigns each trader a typical investment horizon. Others have measured horizon-specific trading activity for a large number of stocks but only at a single horizon. e.g., Cella et al. (2013) sample the portfolio positions of institutional traders at the quarterly frequency using 13F filings. But, collecting data on traders’ portfolio positions is hard. While this approach works, it tends to restrict the analysis to only a handful of stocks (e.g., $60$ randomly selected NASDAQ stocks) or to a single horizon (e.g., the quarterly horizon). Because we can use trading-volume data to infer traders’ investment horizons, we no longer face these data-collection restrictions since trading-volume data is publicly available. Broad cross-sectional studies of traders’ investment horizons are now possible.

2. Traders’ Problem

Let’s begin by outlining the data-generating process and describing the problem faced by traders with an $H$ -period horizon. Suppose that returns are generated by a simple $1$ -factor model,

(1) $\begin{align*} r_{t+1} &= \phantom{-} \beta \cdot f_t + \epsilon_{t+1}, \end{align*}$

where $\epsilon_t \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\, \sigma_{\epsilon}^2)$ and the level of the factor evolves according to an $\mathrm{AR}(1)$ model,

(2) $\begin{align*} \Delta f_{t+1} &= - \gamma \cdot f_t + \xi_{t+1}, \end{align*}$

with $\xi_t \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\, \sigma_{\xi}^2)$ . Note that this is the exact same factor structure used in Garleanu and Pedersen (2013). The $H$ -period returns in this model are are:

(3) $\begin{align*} r_{t+H} &= \beta \cdot (1 - \gamma)^{H-1} \cdot f_t + \beta \cdot {\textstyle \sum_{h=1}^{H-1}} (1 - \gamma)^{(H-1) - h} \cdot \xi_{t+h} + \epsilon_{t+H}. \end{align*}$

And, conditional on knowing the current level of the factor, $f_t$ , it’s possible to compute the conditional mean and variance of these $H$ -period-ahead returns, $r_{t+H} \mid f_t \sim \mathrm{N}(\mu_{H,t}, \, \sigma_H^2)$ :

(4) $\begin{align*} \mu_{H,t} &= \beta \cdot (1 - \gamma)^{H-1} \cdot f_t \\ \text{and} \quad \sigma_H^2 &= \sigma_{\epsilon}^2 + 1_{\{ H \geq 2 \}} \times \left\{ \, \beta^2 \cdot {\textstyle \sum_{h=1}^{H-1}} (1 - \gamma)^{2 \cdot \{(H-1) - h\}} \cdot \sigma_{\xi}^2 \, \right\}. \end{align*}$

Traders at time $t$ observe the current level of the factor, $f_t$ , and choose how many shares to buy over the course of the next $H$ periods in order to maximize their mean-variance utility. Let $\Delta_H[x_t] = \sum_{h=0}^{H-1} \Delta_1[x_{t-h}]$ denote the change in a trader’s portfolio position over the period from $t$ to $(t+H)$ , and let $\gamma$ denote traders’ risk-aversion parameter. Then, we can write the baseline utility function of a trader with no transaction costs as of time $t$ as:

(5) $\begin{align*} v_t(\Delta_H[x_t]) &= \mu_{H,t} \cdot \Delta_H[x_t] - {\textstyle \frac{\gamma}{2}} \cdot \sigma_H^2 \cdot \Delta_H[x_t]^2. \end{align*}$

The $t$ subscript comes from the fact that traders can observe the level of the factor at time $t$ , $f_t$ , prior to making their investment decision. If we look at the trader’s decision at a different time, $t$ , then he will have a different utility and choose a different portfolio because the level of the factor, $f_t$ , will be different.

3. Convex Transaction Costs

The key assumption is that, on top of this baseline utility function, traders also face convex transaction costs. i.e., when a trader with an $H$ -period horizon changes his position over $H$ the course of periods, he pays a transaction cost,

(6) $\begin{align*} \mathit{tc}(\Delta_H[x_t]) &= \min_{\{\Delta_1[x_{t-h}] \}_{h=0}^{H-1}} \, \left\{ \, {\textstyle \frac{\kappa}{2} \cdot \sum_{h=0}^{H-1}} \Delta_1[x_{t-h}]^2 \, \middle| \, \Delta_H[x_t] \, \right\}, \end{align*}$

where $\kappa > 0$ is a positive constant that captures the severity of these transaction costs. Thus, traders with an $H$ -period investment horizon maximize the following objective function:

(7) $\begin{align*} \max_{\Delta_H[x_t]} \, \left\{ \, v_t(\Delta_H[x_t]) - \mathit{tc}(\Delta_H[x_t]) \, \right\}. \end{align*}$

They choose the $H$ -period change in portfolio positions that maximizes their mean-variance utility and they implement this change in a way that minimizes their transactions costs.

The convex transaction costs imply that traders smooth out their trading across periods over their $H$ -period horizon. It’s easiest to see why this is the case when $H = 2$ , since $\Delta_1[x_{t+1}] = \Delta_2[x_t] - \Delta_1[x_t]$ . Suppose that traders know optimal final position $\Delta_2[x_t]$ . Then, traders’ optimization problem from Equation (7) becomes:

(8) $\begin{align*} \max_{\Delta_1[x_t]} \left\{ \, \mu_{2,t} \cdot \Delta_2[x_t] - {\textstyle \frac{\gamma}{2}} \cdot \sigma_H^2 \cdot \Delta_2[x_t] - \sfrac{\kappa}{2} \cdot \left\{ \, \Delta_1[x_t]^2 + (\Delta_2 [x_t] - \Delta_1 [x_t])^2 \, \right\} \, \right\}. \end{align*}$

Taking the first-order condition with respect to $\Delta_1[x_t]$ ,

(9) $\begin{align*} 0 &= - \kappa \cdot \left[ \, \Delta_1 [x_t] - (\Delta_2[x_t] - \Delta_1[x_t]) \, \right], \end{align*}$

then implies that $\Delta_1[x_t] = \sfrac{\Delta_2[x_t]}{2}$ . This simple exercise verifies that, when there are convex transaction costs, traders will want to split their orders up evenly across their investment horizon. In general, if traders have an $H$ -period horizon, then traders will choose $\Delta_1[x_{t+h}] = \sfrac{\Delta_H[x_t]}{H}$ . Traders with an $H$ -period investment horizon will have trading volume that is characterized by smooth $H$ -period long intervals, like the ones described in the figure below.

4. Fluctuations in Volume

If traders at horizon $H$ are characterized by trading that’s smoothed over $H$ periods, then we should be able to use inference tools like the wavelet-variance estimator to infer traders’ investment horizons. In a nutshell, this estimator computes the fraction of variation in a time series that comes from comparing successive $H$ -period long intervals. See Percival and Walden (2000) for more information.

I run a pair of numerical experiments to show that this intuition is correct (code). First, using a data-generating process with $\beta = 0.90$ , $\gamma = 0.75$ , and $\sigma_\epsilon = \sigma_\xi = 1$ , I simulate a long return time series. From Equation (7), it’s possible to compute the optimal portfolio position of trader with horizon $H$ ,

(10) $\begin{align*} x_{H,t} &= \frac{\mu_{H,t}}{\gamma \cdot \sigma_H^2 + \sfrac{\kappa}{H}}. \end{align*}$

For $5$ different horizons, $H^\star \in \{ \, 1, \, 4, \, 16, \, 64, \, 256 \, \}$ , I then simulate the trading-volume time series that would occur if all traders operated at horizon $H^\star$ . The figure below shows the fraction of the trading-volume variance occurring at each horizon according to the wavelet-variance estimator. Just as you’d expect, there is a spike the fraction of trading-volume variance at the true horizon, $H^\star$ … whatever that $H^\star$ happens to be. i.e., there is a spike in the dashed green line, which corresponds to the trading-volume data where traders have a horizon of $H^\star = 16$ , precisely at $\log_2(H) = 4$ .

In addition to this single-horizon experiment, I also run a numerical experiment with traders operating at $2$ different horizons. Specifically, using the same baseline parameters, I simulate a trading-volume time series where half of the volume comes from traders operating at the $H^\star = 4$ -period horizon and half of the volume comes from traders operating at the $H^\star = 64$ -period horizon. The figure below shows the fraction of the trading-volume variance occurring at each horizon according to the wavelet-variance estimator. Again, just as you’d expect, there’s is a spike the fraction of trading-volume variance at both the $H^\star = 4$ – and $H^\star = 64$ -period horizons.

Investor Holdings, Naïve Beliefs, and Artificial Supply Constraints

June 24, 2016 by Alex

1. Motivation

In the standard model of house-price dynamics, there are two kinds of cities: supply constrained and supply unconstrained. In supply-constrained cities (e.g., New York, Boston, or San Francisco), it’s difficult and costly to build new houses because of geographic and regulatory hurdles. In supply-unconstrained cities (e.g., Las Vegas, Phoenix, or San Bernadino), these hurdles are much lower, and it’s much easier to build new houses. To get a sense of just how unconstrained places like Las Vegas are, take a look at this time-lapse video of Las Vegas from space. The number of houses balloons as housing demand in Las Vegas grows.

Now, suppose more people suddenly want to live in a particular city. If the city is supply constrained like New York, then people will have to outbid existing residents to move into that city, which will drive up the prices on existing houses in that city. More people want the same number of homes, so prices have to go up. By contrast, if the city is supply unconstrained like Las Vegas, then people who want to move into that city can just build new houses. In supply-unconstrained cities, the supply of housing adjusts to accommodate the additional demand. Why outbid an existing resident when you can just build the same house right next door? Thus, in the standard model, supply elasticity is a key determinant of house-price growth (Saiz, 2010).

But, during the mid 2000s things got weird. Supply-unconstrained cities like Las Vegas realized huge spikes in transaction volumes and house prices (Chinco and Mayer, 2016). What changed? Why did Las Vegas suddenly look like a supply-constrained city? Is there some economic mechanism that might make supply-unconstrained cities behave like supply-constrained cities when there is a lot of trading activity? This post outlines how the combination of trading volume by house flippers (i.e., people who buy and then quickly resell houses without living in them) and naïve beliefs can generate artificial supply constraints in housing markets with lots of trading volume.

2. Everyday Example

To see how this mechanism works, it’s helpful to start with an example from everyday life. I love bagels. Imagine you’re at a bagel shop and you want to buy an everything bagel. The shop has lots of different kinds of bagels displayed in bins that contain $20$ bagels each. So, at the start of each morning, there’s a bin of $20$ plain bagels, a bin of $20$ poppyseed bagels, a bin of $20$ everything bagels, and so on… There’s a line, and it takes each person several seconds to order. Each time someone orders a bagel, the clerk takes it from one of the bins. Whenever one of the bins runs out, a second clerk takes it to the back of the shop and refills it with $20$ more bagels, a task that takes $1$ minute to complete.

Without any sort of naïvety, supply and demand work exactly like you’d expect in this setup. If there is only $1$ everything bagel left, then you might be willing to pay more than the price listed on the menu for that last bagel. But, if there were lots of bagels left, then you would never do this sort of thing. You’d just wait your turn in line and pay the listed price on the menu when you got to the counter. If a bin happened to run out when you were at the front of the line, then you’d recognize that it’s going to be full in a minute and just wait until the second clerk got back.

Without any sort of naïvety, sales volume doesn’t have any affect on this equilibrium. To be sure, if there are lots of people in line and bagels are selling really quickly, then you’re more likely to find the everything-bagels bin empty when you get to the front of the line. It always takes the second clerk $1$ minute to fill an empty bin. So, if there is a big line and more people pass by the front of the line per minute, then bins are more likely to run dry and more people arrive at the register when the second clerk is back in the kitchen. But, if bagel buyers are fully rational, then they’ll realize that each bin will be replenished in a minute and just wait till the fresh bagels come out before buying.

Introducing naïve beliefs changes things. If you don’t recognize that empty bins will be replenished in a minute, then you might be willing to outbid the guy in front of you for the $20$ th everything bagel in a bin—or, at the very least try to talk him into a different order. And, when you get to the register during a busy time of morning, it’s going to look like the whole bagel shop is running low on supply since each individual bin is more likely to be in the process of being filled. If you could linger in the bagel shop for a while, then this naïvety wouldn’t matter. Any empty bins would get replenished while you were standing around making your decision. But, when there is a line out the door and you have to make a quick decision, your naïve beliefs make it look like there is an artificially low number of bagels available.

3. Simple Model

Investors play the role of the clerk that takes $1$ minute to replace a bin of bagels. They take houses off the market for a short period of time. If trading volume is low or home buyers are fully rational, then they shouldn’t affect equilibrium house prices too much. But, if trading volume is very high and home buyers don’t realize that investor homes will come back on the market in $6$ months to a year, then home buyers might get the impression that the supply of houses is getting low. I now outline a simple model to make these ideas more concrete.

How many different houses can a home buyer see on the market if he looked for $h$ months? Let $\textit{houses}_t$ denote the total number of houses, $\textit{owner}_t$ denote the number of owner-occupied houses, $\textit{investor}_t$ denote the number of investor-owned houses, and $\textit{forSale}_t$ denote the number of houses that are currently for sale in month $t$ :

(1) $\begin{align*} \textit{houses}_t = \textit{owner}_t + \textit{investor}_t + \textit{forSale}_t. \end{align*}$

In any given month, home buyers can only visit houses that are currently for sale. Owner-occupied and investor-owned houses are off the market. A house might be owner occupied one month, for sale the next month, and owned by an investor several months later. I write the probability of transitioning from one state to another in matrix form:

(2) $\begin{align*} \begin{pmatrix} \textit{owner}_{t+1} \\ \textit{investor}_{t+1} \\ \textit{forSale}_{t+1} \end{pmatrix} = \begin{bmatrix} \gamma_{o \to o} & 0 & \gamma_{\textit{fs} \to o} \\ 0 & \gamma_{i \to i} & \gamma_{\textit{fs} \to i} \\ \gamma_{o \to \textit{fs}} & \gamma_{i \to \textit{fs}} & \gamma_{\textit{fs} \to \textit{fs}} \end{bmatrix} \begin{pmatrix} \textit{owner}_t \\ \textit{investor}_t \\ \textit{forSale}_t \end{pmatrix}. \end{align*}$

Each entry in this matrix represents the probability that a house transitions from one state to another. For example, $\gamma_{o \to \textit{fs}}$ represents the probability a house goes from being owner occupied one month to for sale the next. And, $\gamma_{i \to i}$ represents the probability that a house is investor owned in month $(t+1)$ given that it was investor owned in month $t$ . The columns of this matrix sum to $1$ and $\gamma_{o \to i} = \gamma_{i \to o} = 0$ since a house has to be for sale before it can pass from one owner to the next. The diagram below gives an alternative way of representing these transition probabilities that doesn’t use matrix notation.

The number of houses that are always owner occupied after $h$ months of looking is $\gamma_{o \to o}^h \cdot \textit{owner}_t$ . So, when there aren’t any investors, the number of houses that a home buyer can choose from after looking for $h$ months is $\widetilde{\textit{supply}}_h = 1 - \gamma_{o \to o}^h \cdot \textit{owner}_t$ . The number of houses that are always investor owned after $h$ months is $\gamma_{i \to i}^h \cdot \textit{investor}_t$ . So, when there are investors, the number of houses that a home buyer can choose from after looking for $h$ months is given by:

(3) $\begin{align*} \textit{supply}_h = 1 - \gamma_{o \to o}^h \cdot \textit{owner}_t - \gamma_{i \to i}^h \cdot \textit{investor}_t. \end{align*}$

Thus, the supply constraint imposed by investors on the number of houses that a home buyer can view in $h$ months is given by:

(4) $\begin{align*} \textit{constraint}_h = \frac{ \gamma_{i \to i}^h \cdot \textit{investor}_t }{ 1 - \gamma_{o \to o}^h \cdot \textit{owner}_t }. \end{align*}$

This term is just the change in the observed housing supply after $h$ months due to the presence of investors, $(1 - \textit{constraint}_h) \times \widetilde{\textit{supply}}_h = \textit{supply}_h$ . If $\textit{constraint}_6 = 0.05$ , then investors decrease the number of houses for sale over the course of $6$ months by $5{\scriptstyle \%}$ . If there would have been $100$ houses for sale over the course of $6$ months without investors, there are only $95$ houses for sale with investors.

4. Plugging in Numbers

This model is nice because it’s easy to plug in numbers to see how investor holdings can affect the perceived housing supply for naïve home buyers. We can go back and forth between holding-period lengths and transition probabilities by using the negative binomial distribution. Investors typically hold onto their houses for $6$ months, implying that $\gamma_{i \to i} = \sfrac{6}{7}$ . Owners typically hold onto their houses for $10$ years, implying that $\gamma_{o \to o} = \sfrac{120}{121}$ . Owners and investors are equally likely to buy houses, $\gamma_{\textit{fs} \to o} = \gamma_{\textit{fs} \to i}$ . Suppose that the typical house stays on the market for $1$ year, implying that $\gamma_{\textit{fs} \to \textit{fs}} = \sfrac{12}{13}$ .

The figure above shows the fraction decrease in the housing supply perceived by naïve home buyers as a function of their search duration when $10{\scriptstyle \%}$ of the housing stock is for sale in any given month (code). e.g., if a home buyer would have seen $100$ houses in $3$ months in the absence of investors, then he sees only $91$ houses in $3$ months when $2{\scriptstyle \%}$ of houses are initially owned by investors. The dashed green line says that $2{\scriptstyle \%}$ investor holdings can lead to a $9{\scriptstyle \%}$ drop in the housing supply as perceived by naïve home buyers. As the number of months that a naïve home buyer searches drops, the impact of investor holdings rises sharply. When search durations are really short, like they were in Las Vegas during the mid 2000s, tiny amounts of investor ownership can have enormous impacts on the perceived housing supply.

Asset-Pricing Implications of Dimensional Analysis

May 14, 2016 by Alex

1. Motivation

I have been trying to use dimensional analysis to understand asset-pricing problems. In many hard physical problems, it is possible to gain some insight about the functional form of the solution by examining the dimensions of the relevant input variables. In the canonical example of this brand of analysis, G.I. Taylor was able to tell the yield of the Trinity Test nuclear explosion from a few photographs via dimensional analysis (see Barenblatt 2003 and earlier post). So, maybe it is possible to better understand, say, the price impact of informed trading by studying the dimensions of this problem?

However, none of the asset-pricing problems I have looked at via dimensional analysis have yielded pretty solutions. It could be that the fundamental asset-pricing equations aren’t dimensionally consistent. Such equations do exist and they can be very helpful. For instance, Dolbear’s Law says that you can tell the outdoor temperature on a summer evening by counting the frequency of cricket chirps,

(1) $\begin{align*} \text{temperature in degrees celsius} = 37^{\circ} + \{ \, \text{cricket chirps every 15 seconds} \, \}. \end{align*}$

But, I don’t think this is what’s going on. Instead, my sense is that, because asset-pricing models are built by researchers trying to convey economic intuition rather than dictated by the physical constraints of a particular real-world problem, there aren’t any interesting unexplored symmetries hiding in the asset-pricing models for dimensional analysis to uncover. A good economist never includes superfluous variables when constructing a model, but there is often unexpected redundancy in our initial formulations of hard physical problems that we find in nature. This post explains my (perhaps wrong) intuition in more detail.

2. Period of Pendulum

Let’s start by looking at a physical problem where dimensional analysis actually helps. Consider the problem of modeling the period of a pendulum with length $\ell$ and mass $m$ . Suppose that in order to get the pendulum swinging, I initially pull it a distance of $a$ centimeters off to the side. In this setup, we can write the period of the pendulum as some function of these variables,

(2) $\begin{align*} p &= \mathrm{f}(\ell,m,a,g), \end{align*}$

together with the acceleration due to gravity, $g$ .

The key insight in dimensional analysis is that the pendulum shouldn’t behave differently if we measure its length in inches rather than centimeters. The marks on our ruler don’t matter. The period of the pendulum has dimensions of time, $\mathrm{dim}[p] = T$ . The length of the pendulum and the amplitude of its swing have dimensions of length, $\mathrm{dim}[\ell] = \mathrm{dim}[a] = L$ . The mass of the pendulum has dimensions of (wait for it…) mass, $\mathrm{dim}[m] = M$ . And, the force of gravity has dimensions, $\mathrm{dim}[g] = L \cdot T^{-2}$ . Suppose that we define new units of mass, length, and time so that $1$ new unit of mass is equal to $\mu$ old units of mass, $1 \overset{M}{\longrightarrow} \mu$ , $1$ new unit of length is equal to $\lambda$ old units of length, $1 \overset{L}{\longrightarrow} \lambda$ , and $1$ new unit of time is equal to $\tau$ old units of time, $1 \overset{T}{\longrightarrow} \tau$ . If our choice of units doesn’t affect the pendulum’s behavior, then we should be able to rewrite our old formula in these new units,

(3) $\begin{align*} {\textstyle \frac{p}{\tau}} &= \mathrm{f}\left( \, {\textstyle \frac{\ell}{\lambda}}, \, {\textstyle \frac{m}{\mu}}, \, {\textstyle \frac{a}{\lambda}}, \, {\textstyle \frac{g}{\sfrac{\lambda}{\tau^2}}} \, \right). \end{align*}$

Now comes the trick. Notice that these new units can be anything we want. So, let’s get clever and pick $\mu = m$ , $\lambda = \ell$ , and $\tau = \sqrt{\sfrac{\ell}{g}}$ . With these values the formula for the period of the pendulum becomes

(4) $\begin{align*} {\textstyle \frac{p}{\sqrt{\sfrac{\ell}{g}}}} &= \mathrm{f}(1,1,\sfrac{a}{\ell},1) = \mathrm{f}^\star(\sfrac{a}{\ell}), \end{align*}$

where $\mathrm{f}^\star(\cdot)$ is a new function of a dimensionless ratio, $\sfrac{a}{\ell}$ . Thus, we know that the period of a pendulum is

(5) $\begin{align*} p &= \sqrt{\sfrac{\ell}{g}} \times \mathrm{f}^{\star}(\sfrac{a}{\ell}). \end{align*}$

Without knowing anything except for the units that each variable is being measured in, we can see that 1) the period is unrelated to the mass and 2) the period of the pendulum is inversely proportional to the square-root of the force of gravity, $\sqrt{g}$ . Functional forms without physics! We now know how to compute the period of the same pendulum on Mars.

3. Price Impact

What happens if we try to use these same trick to understand price impact in the stock market? That is, how much does the price of a stock move if traders demand an extra $100$ shares on a particular day? Let’s use the standard terminology from information-based asset-pricing models and define price impact as a function of $3$ variables,

(6) $\begin{align*} \lambda &= \mathrm{f}(\sigma_v, \sigma_z, \gamma), \end{align*}$

where $\mathrm{dim}[\lambda] = D \cdot S^{-1}$ with $D$ denoting dollars and $S$ denoting shares. Suppose that informed traders know the fundamental value of the stock, $v$ , but uninformed traders don’t. Let $\sigma_v$ denote the volatility of the stock’s value from the perspective of uninformed traders, $\mathrm{dim}[\sigma_v] = D \cdot S^{-1}$ , and let $\sigma_z$ denote the volatility of asset-supply noise, $\mathrm{dim}[\sigma_z] = S$ . This is the noise term that keeps the asset’s price from being perfectly revealing. Finally, let $\gamma$ denote the risk aversion of the informed traders, $\mathrm{dim}[\gamma] = D^{-1}$ .

By the logic of dimensional analysis, it shouldn’t matter whether we measure a stock’s value in dollars or euros and it shouldn’t matter whether we measure changes in demand in shares or tens of shares. So, suppose that $1$ new unit of value is equal to $\delta$ old units of value, $1 \overset{D}{\longrightarrow} \delta$ and that $1$ new unit of quantity is equal to $\psi$ old units of quantity, $1 \overset{S}{\longrightarrow} \psi$ . If our choice of units doesn’t affect market behavior, then we should be able to rewrite our old formula for price impact in these new units,

(7) $\begin{align*} \frac{\lambda}{\sfrac{\delta}{\psi}} &= \mathrm{f}\left(\frac{\sigma_v}{\sfrac{\delta}{\psi}}, \frac{\sigma_z}{\psi}, \frac{\gamma}{\sfrac{1}{\delta}}\right), \end{align*}$

just like before.

Now comes the trouble. If we get clever and choose our units to create a function of a single dimensionless variables, $\psi = \sigma_z$ and $\delta = \sfrac{1}{\gamma}$ , we find that:

(8) $\begin{align*} \lambda &= {\textstyle \frac{1}{\gamma \cdot \sigma_z}} \cdot \mathrm{f}^\star (\gamma \cdot \sigma_z \cdot \sigma_v). \end{align*}$

In the pendulum problem above, the single dimensionless quantity only involved some of the relevant variables; however, in the price-impact problem the dimensionless quantity involves all $3$ of the relevant variables. There is no progress. Before applying dimensional analysis we had an unknown function of $3$ variables. After applying dimensional analysis we still have an unknown function of $3$ variables. Dimensional analysis doesn’t provide any new insight about the functional form of the link between the quantity of interest (i.e., the price impact, $\lambda$ ) and any of the input parameters (i.e., the values $\sigma_v$ , $\sigma_z$ , or $\gamma$ ). I always seem to find this sort of non-result when applying dimensional analysis to asset-pricing problems.

4. Main Intuition

I think this particular non-result in the price-impact problem is suggestive of why dimensional analysis doesn’t help that much when trying to understand asset-pricing models more generally. What makes the canonical information-based asset-pricing papers great is that they pack a lot of economic intuition into a relatively simple model. There isn’t a lot of superfluous structure hanging around. When you look at the original formulation of the pendulum problem, there was a bunch of redundancy involved. The mass of the pendulum turned out to be irrelevant, and two of the variables, length and amplitude, turned out to have the exact same units. There is no such redundancy in the price-impact problem. As defined, the parameters $\sigma_v$ , $\sigma_z$ , and $\gamma$ are all needed to define the dimensionless quantity. The elegance of models like Kyle (1985) makes them unsuited to dimensional analysis.

To illustrate, consider changing the original price-impact problem slightly. Suppose that we, as econometricians, could directly observe the inverse of the dollar-demand volatility from noise traders, $\sfrac{1}{\sigma_y}$ , which has units of dollars $\mathrm{dim}[\sfrac{1}{\sigma_y}] = D^{-1}$ , instead of the demand volatility, which has units of shares $\mathrm{dim}[\sigma_z] = S$ . This is a less elegant model because it is needlessly complex. Demand volatility in shares now depends on both the equilibrium price and the shares demanded by noise traders. But, let’s go with it. In this new setup, the price impact is still an unknown function of $3$ variables,

(9) $\begin{align*} \lambda &= \mathrm{f}(\sigma_v, \sfrac{1}{\sigma_y}, \gamma), \end{align*}$

but now, because there is redundancy, we can make progress via dimensional analysis.

Again, suppose that $1$ new unit of value is equal to $\delta$ old units of value, $1 \overset{D}{\longrightarrow} \delta$ and that $1$ new unit of quantity is equal to $\psi$ old units of quantity, $1 \overset{S}{\longrightarrow} \psi$ . If our choice of units doesn’t affect market behavior, then we should be able to rewrite our old formula for price impact in these new units:

(10) $\begin{align*} \frac{\lambda}{\sfrac{\delta}{\psi}} &= \mathrm{f}\left(\frac{\sigma_v}{\sfrac{\delta}{\psi}}, \frac{\sfrac{1}{\sigma_y}}{\sfrac{1}{\delta}}, \frac{\gamma}{\sfrac{1}{\delta}}\right). \end{align*}$

If we choose our units to create a function of a single dimensionless variables, $\delta = \sigma_y$ and $\psi = \sfrac{\sigma_y}{\sigma_v}$ , we find that:

(11) $\begin{align*} \lambda &= \sigma_v \cdot \mathrm{f}^\star(\sigma_y \cdot \gamma). \end{align*}$

Before we had an unknown function with $3$ variables and now we have an unknown function of only $2$ variables. Progress! If we found situations where the product of informed traders’ risk aversion and dollar-demand noise was constant, $\sigma_y \cdot \gamma = \text{Const.}$ , then we could actually test this relationship with a regression,

(12) $\begin{align*} \log(\lambda) &= \alpha + \beta \times \log(\sigma_v) + \epsilon, \end{align*}$

and check whether or not $\beta = 1$ . But, the only reason that we could make progress in this alternative setting was that the model was written down clumsily in the first place.

ETF-Rebalancing Cascades

April 6, 2016 by Alex

1. Motivation

This post looks at the consequences of ETF rebalancing. These funds follow pre-announced rules that involve discrete thresholds. The well-known SPDR tracks the S&P 500, but there are over 1400 different ETFs tracking a wide variety of different underlying indexes. When any of these underlying indexes change, the corresponding ETFs have to change their holdings. These thresholding rules mean that, in an extreme example, if Verisk Analytics gets $\mathdollar 1$ larger and moves from being the $501$ st largest stock to being the $500$ th largest stock (actually happened), then the ETFs tracking the S&P 500 are going to suddenly have to build large positions in Verisk over a relatively short period of time. See here for more examples.

When there are many different ETFs tracking many different thresholds, these rebalancing rules can interact with one another and amplify small initial demand shocks. For example, when Verisk increases in size and gets bought by ETFs tracking the S&P 500, it will be slightly more correlated with the market (Barberis, Shleifer, and Wurgler, 2005). As a result, ETFs like SPHB that track large-cap high-beta stocks might have to buy Verisk as well, which in turn can have further consequences down the line. This is what journalists have in mind when they worry about ETFs “turning the market close into a buying or selling frenzy.” To model this phenomenon, I use an approach based on branching processes à la Nirei (2006).

The idea that traders might herd (Devenow and Welch, 1996) or amplify shocks (Veldkamp, 2006) is not new. But, these rebalancing rules can also transmit shocks to completely unrelated corners of the market. This idea is new. Rebalancing involves buying one stock and selling another. It sounds obvious, but it matters that we’re looking at “rebalancing” rules and not just “purchasing” rules. For instance, when Verisk got added to the S&P 500, it replaced Joy Global, a mining-tools manufacturing company. So, all of the ETFs tracking the S&P 500 had to sell their positions in Joy Global, pushing Joy’s price down and making it more likely that ETFs that need to hold a position in mining companies choose one of Joy’s competitors like ABB Limited. And, this additional demand for ABB can cause subsequent rebalancing, which is the result of an initial change in the value of Verisk, a totally unrelated stock.

No trader would ever be able to guess that the reason another Swiss company like Novartis is realizing selling pressure is that Verisk was Joy Global in the S&P 500, which caused ETFs tracking the S&P 500 to sell Joy, which caused other industrial ETFs to replace Joy with ABB Limited, which caused still other ETFs tracking the largest stocks in each European country to replace their positions in Novartis with a position in ABB. So, even though each ETF’s rebalancing rules are completely predictable, their aggregate behavior generates noise. By analogy, the population of France at any given instant is a definite fact. It is not random. No one is timing births by coin flips, dice rolls, radioactive decay, etc… But, as John Maynard Keynes pointed out, whether this number is even or odd at each instant is effectively random. Were $1,324$ or $1,325$ people born in Paris during the time it took you to finish counting the population of Nice?

2. Market Structure

Suppose there’s a single stock that can be held by $F$ different ETFs, and each fund’s demand for the stock, $x_f$ , is the sum of $3$ different components:

(1) $\begin{align*} x_f &= \alpha_f + \mathrm{B}(\mathbf{x}) - \lambda \cdot z_f \qquad \text{where} \qquad \mathbf{x} = \begin{bmatrix} x_1 & x_2 & \cdots & x_F \end{bmatrix}^{\top}. \end{align*}$

When writing down models of the ETF market, people usually only think about the first component, $\alpha_f$ . This is just the intrinsic demand that an ETF would have for the stock if it were the only fund in the market, like the SPDR was in the early 1990s. If there are no large-cap high-beta ETFs, then having the SPDR purchase additional shares of Verisk wouldn’t have any knock-on effects in the example above. Think about shocks to each fund’s $\alpha_f$ as shocks to whether or not the stock is included in the fund’s benchmark index.

The second component, $\mathrm{B}(\mathbf{x})$ , captures how an ETF’s demand is affected by the decisions of other ETFs. This is the effect of S&P 500-tracking ETFs on the holdings of ETFs that track a large-cap high-beta index. Because buying by one ETF leads to additional buying by other ETFs, $\mathrm{B}(\mathbf{x})$ is increasing in ETF demand:

(2) $\begin{align*} {\textstyle \frac{\partial}{\partial x_f}}[\mathrm{B}(\mathbf{x})] &> 0. \end{align*}$

Because each ETF’s demand only has a subtle effect on other ETFs’ demand, we have that:

(3) $\begin{align*} {\textstyle \frac{\partial}{\partial x_f}}[\mathrm{B}(\mathbf{x})] &= \mathrm{O}(\sfrac{1}{F}), \end{align*}$

where $\mathrm{O}(\cdot)$ is Landau notation. So, when there are more ETFs that might hold a stock, each ETF’s decisions have a smaller effect on the decisions of its peers. In the theoretical analysis below, I’m going to study the function $\mathrm{B}(\mathbf{x}) = \frac{\beta}{F} \cdot \sum_{f=1}^F x_f$ , which satisfies this property.

Finally, the third component, $-\lambda \cdot z_f$ , reflects the fact that ETFs use thresholding rules. Even if Verisk has a market capitalization that is $\mathdollar 1$ smaller than the $500$ th largest stock, Verisk won’t be held by the SPDR, which tracks the S&P 500. The moment Verisk becomes the $500$ th largest stock, the SPDR is going to have to buy a large block of shares. This threshold-adjustment component is defined using modular arithmetic,

(4) $\begin{align*} z_f &= \lambda^{-1} \cdot \mathrm{mod}( y_f , \, \lambda), \end{align*}$

where $y_f = \alpha_f + \mathrm{B}(\mathbf{x})$ and $\mathrm{mod}( y_f , \, \lambda )$ is the remainder that’s left over after dividing $y_f$ by $\lambda$ . So, if $\lambda = 3$ shares and an ETF would have bought $y_f = 5$ shares, then $z_f = \sfrac{2}{3}$ and its resulting demand is $x_f = 3$ shares. A fund is always demanding some multiple of $\lambda$ shares. Think about $\lambda$ as the size of a typical ETF’s adjustment once a stock is added to its benchmark.

3. Equilibrium Concept

What does it mean for this market to be in equilibrium, and how does the market transition between equilibria? Via Tarski’s fixed-point theorem, we know that for any choice of ${\boldsymbol \alpha}$ , if

both $\alpha_f$ and $\lambda$ are bounded, and
there are scalars $\underline{x} = \mathrm{B}(\underline{x},\,\underline{x},\,\ldots,\,\underline{x}) + \underline{\alpha} - \overline{\lambda}$ and $\overline{x} = \mathrm{B}(\overline{x},\,\overline{x},\,\ldots,\,\overline{x}) + \overline{\alpha} - \underline{\lambda}$ ,

then there exists a solution, $\mathbf{x}^\star$ , to Equation (1) for all $F$ funds. This solution is the equilibrium associated with ${\boldsymbol \alpha}$ . Here, $\overline{\alpha}$ and $\underline{\alpha}$ denote the upper and lower bounds on $\alpha_f$ —i.e., $\alpha_f \in [\underline{\alpha}, \, \overline{\alpha}]$ ; and, $\overline{\lambda}$ and $\underline{\lambda}$ denote the upper and lower bounds on $\lambda$ . To make things concrete, note that, if $\alpha_f \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{Unif}(0,\,\lambda)$ , then $\mathbf{x}^\star = 0$ .

At the start of each trading day, the market is in an equilibrium at $\mathbf{x}^\star$ associated with the intrinsic demand ${\boldsymbol \alpha}$ . We can normalize this value to $0$ . Then, over the course of the day, the stock’s characteristics change. It gets added to some ETFs’ benchmarks. I model this as a shock to each ETF’s intrinsic demand,

(5) $\begin{align*} \hat{\alpha}_f = \alpha_f + \sfrac{\epsilon_f}{F} \end{align*}$

where $\epsilon_f$ is positive with mean $\mu_{\epsilon} > 0$ and distributed i.i.d. across funds. This shock is divided through by $F$ so that its impact is the same magnitude as the feedback effect from one ETF’s demand to another. Given this shock, how will ETFs’ demand evolve over the course of the final $10$ to $15$ minutes of trading?

To answer this question, we need to define how ETFs update their demand. I follow the approach used in Cooper (1994) where ETFs adjust their positions by applying the best response function iteratively. In the first round, ETFs adjust their holdings by $\lambda$ if they realize a sufficiently large shock to their intrinsic demand:

(6) $\begin{align*} x_{f,1} &= \begin{cases} x_{f,0} + \lambda &\text{if } \sfrac{\epsilon_f}{F} \geq \lambda \cdot (1 - z_{f,0}) \\ x_{f,0} &\text{else} \end{cases} \\ \text{and} \quad z_{f,1} &= z_{f,0} + \lambda^{-1} \cdot ( \, \sfrac{\epsilon_f}{F} - \{ x_{f,1} - x_{f,0} \} \, ). \end{align*}$

Then, in each subsequent period, ETFs adjust their holdings by $\lambda$ if the demand pressure from other ETFs is sufficiently large:

(7) $\begin{align*} x_{f,t+1} &= \begin{cases} x_{f,t} + \lambda &\text{if } \{ \mathrm{B}(\mathbf{x}_t) - \mathrm{B}(\mathbf{x}_{t-1})\} \geq \lambda \cdot (1- z_{f,t}) \\ x_{f,t} &\text{else} \end{cases} \\ \text{and} \quad z_{f,t+1} &= z_{f,t} + \lambda^{-1} \cdot \left( \, \{ \mathrm{B}(\mathbf{x}_t) - \mathrm{B}(\mathbf{x}_{t-1})\} - \{ x_{f,t+1} - x_{f,t} \} \, \right). \end{align*}$

This process continues until no more changes are required at $T = \min\{ \, t \mid \mathbf{x}_{t+1} = \mathbf{x}_t \, \}$ . Initially, the shock is exogenous, $\sfrac{\epsilon_f}{F}$ ; then, in all later rounds the shock comes from other ETFs, $\mathrm{B}(\mathbf{x}_t) - \mathrm{B}(\mathbf{x}_{t-1})$ . Importantly, ETFs in this model don’t proactively adjust their positions to account for future changes in demand that they see coming down the line in future iterations. After all, ETFs are constrained to mimic their benchmarks.

4. Cascade Length

It is now possible to show that small initial shocks to each ETF’s demand can lead to long rebalancing cascades. Let $\ell_t$ denote the number of ETFs that buy $\lambda$ additional shares of the target stock in period $t$ :

(8) $\begin{align*} \ell_t &= \lambda^{-1} \cdot {\textstyle \sum_{f=1}^F} (x_{f,t} - x_{f,t-1}). \end{align*}$

Similarly, let $L$ denote the total number of ETF position changes from time $t=1$ to time $t=T$ :

(9) $\begin{align*} L &= {\textstyle \sum_{t=1}^T} \ell_t = \lambda^{-1} \cdot {\textstyle \sum_{f=1}^F} (x_{f,T} - x_{f,0}). \end{align*}$

So, when $L \gg 1$ , then a small initial shock will cause a large number of ETFs to change their positions later on. This is what I mean by the length of a rebalancing cascade.

In order to characterize the distribution of $L$ analytically, I need to make a couple of functional form assumptions. First, I assume that the interaction of ETF-demand schedules is governed by the rule:

(10) $\begin{align*} \mathrm{B}(\mathbf{x}) &= {\textstyle \frac{\beta}{F}} \cdot {\textstyle \sum_{f=1}^F} x_f, \end{align*}$

with parameter $0 \leq \beta \leq \mathrm{min}(\lambda^{-1},\,1)$ . As $\beta$ gets larger and larger, each ETF’s demand has a larger and larger effect on other ETFs’ demand schedules. Finally, suppose that the distribution of ETF intrinsic demands,

(11) $\begin{align*} \alpha_f \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{Unif}(0,\,\lambda), \end{align*}$

is uniformly distributed on the interval from $0$ to $\lambda$ .

Using results from Harris (1963) it’s possible to show that, as the number of funds gets large, $F \to \infty$ , the sequence of adjustments $\{ \ell_t \}$ converges to a Poisson-distributed Galton-Watson process. The probability-density function for the total cascade length is given by:

(12) $\begin{align*} \mathrm{pdf}(L) &= (\beta \cdot L + \sfrac{\mu_{\epsilon}}{\lambda})^{L-1} \cdot \frac{\sfrac{\mu_{\epsilon}}{\lambda} \cdot e^{-(\beta \cdot L - \sfrac{\mu_{\epsilon}}{\lambda})}}{L!}. \end{align*}$

The figure above plots this function when $\beta = 0.10$ and $\mu_{\epsilon} = 0.10$ .

A Galton-Watson process, $\{\ell_t\}$ , is stochastic process that starts out with formula $\ell_0=1$ and then evolves according to the rule

(13) $\begin{align*} \ell_{t+1} &= {\textstyle \sum_{\ell=1}^{\ell_t}} \xi_{\ell}^{(t)}, \end{align*}$

where $\{ \, \xi_{\ell}^{(t)} \mid t , \, \ell \in \mathbb{N} \, \}$ is a set of i.i.d. Poisson-distributed random variables. To illustrate by example, the figure below displays a single realization from a Galton-Watson process. In this example, there are $L=7$ total ETFs that rebalance their holdings in response to an initial demand shock. In this example, the rebalancing cascade lasts $T = 3$ rounds. There are $\ell_2 = 3$ ETFs that rebalance in the second round, and the rebalancing demand from the third of these ETFs causes two more ETFs to adjust their positions, $\xi_{3}^{(2)} = 2$ .

Given the probability-density function described above, it’s easy to show that the mean and variance of the number of ETFs that rebalance is:

(14) $\begin{align*} \mathrm{E}[L] &= {\textstyle \frac{\mu_{\epsilon}}{\lambda} \cdot \frac{1}{1-\beta}} \quad \text{and} \quad \mathrm{Var}[L] = {\textstyle \frac{\mu_{\epsilon}}{\lambda} \cdot \frac{1}{(1-\beta)^3}}. \end{align*}$

So, even when each individual ETF realizes a shock to its intrinsic demand over the course of the trading day that is quite small on average, $\mu_{\epsilon} \approx 0$ , the stock can still realize large demand shocks when each ETF’s demand has a large spillover effect, $\beta \approx 1$ , and when they don’t have to make very granular position changes, $\lambda \approx 0$ . This is the amplification result I mentioned above. The figures below verifies these calculations using simulations with $F = 10^3$ ETFS. In particular, the left panel shows that the initial shock of only $\mu_{\epsilon} = 0.10$ leads to more than one rebalancing cascade on average when $\lambda = 0.10$ .

« Previous Page