Research Notebook

ETF-Rebalancing Cascades

April 6, 2016 by Alex

1. Motivation

This post looks at the consequences of ETF rebalancing. These funds follow pre-announced rules that involve discrete thresholds. The well-known SPDR tracks the S&P 500, but there are over 1400 different ETFs tracking a wide variety of different underlying indexes. When any of these underlying indexes change, the corresponding ETFs have to change their holdings. These thresholding rules mean that, in an extreme example, if Verisk Analytics gets \mathdollar 1 larger and moves from being the 501st largest stock to being the 500th largest stock (actually happened), then the ETFs tracking the S&P 500 are going to suddenly have to build large positions in Verisk over a relatively short period of time. See here for more examples.

When there are many different ETFs tracking many different thresholds, these rebalancing rules can interact with one another and amplify small initial demand shocks. For example, when Verisk increases in size and gets bought by ETFs tracking the S&P 500, it will be slightly more correlated with the market (Barberis, Shleifer, and Wurgler, 2005). As a result, ETFs like SPHB that track large-cap high-beta stocks might have to buy Verisk as well, which in turn can have further consequences down the line. This is what journalists have in mind when they worry about ETFs “turning the market close into a buying or selling frenzy.” To model this phenomenon, I use an approach based on branching processes à la Nirei (2006).

The idea that traders might herd (Devenow and Welch, 1996) or amplify shocks (Veldkamp, 2006) is not new. But, these rebalancing rules can also transmit shocks to completely unrelated corners of the market. This idea is new. Rebalancing involves buying one stock and selling another. It sounds obvious, but it matters that we’re looking at “rebalancing” rules and not just “purchasing” rules. For instance, when Verisk got added to the S&P 500, it replaced Joy Global, a mining-tools manufacturing company. So, all of the ETFs tracking the S&P 500 had to sell their positions in Joy Global, pushing Joy’s price down and making it more likely that ETFs that need to hold a position in mining companies choose one of Joy’s competitors like ABB Limited. And, this additional demand for ABB can cause subsequent rebalancing, which is the result of an initial change in the value of Verisk, a totally unrelated stock.

No trader would ever be able to guess that the reason another Swiss company like Novartis is experiencing selling pressure is that Verisk replaced Joy Global in the S&P 500, which caused ETFs tracking the S&P 500 to sell Joy, which caused other industrial ETFs to replace Joy with ABB Limited, which caused still other ETFs tracking the largest stocks in each European country to replace their positions in Novartis with a position in ABB. So, even though each ETF’s rebalancing rules are completely predictable, their aggregate behavior generates noise. By analogy, the population of France at any given instant is a definite fact. It is not random. No one is timing births by coin flips, dice rolls, radioactive decay, etc… But, as John Maynard Keynes pointed out, whether this number is even or odd at each instant is effectively random. Were 1,324 or 1,325 people born in Paris during the time it took you to finish counting the population of Nice?

2. Market Structure

Suppose there’s a single stock that can be held by F different ETFs, and each fund’s demand for the stock, x_f, is the sum of 3 different components:

(1)   \begin{align*} x_f &= \alpha_f + \mathrm{B}(\mathbf{x}) - \lambda \cdot z_f  \qquad \text{where} \qquad  \mathbf{x} = \begin{bmatrix} x_1 & x_2 & \cdots & x_F \end{bmatrix}^{\top}. \end{align*}

When writing down models of the ETF market, people usually only think about the first component, \alpha_f. This is just the intrinsic demand that an ETF would have for the stock if it were the only fund in the market, like the SPDR was in the early 1990s. If there are no large-cap high-beta ETFs, then having the SPDR purchase additional shares of Verisk wouldn’t have any knock-on effects in the example above. Think about shocks to each fund’s \alpha_f as shocks to whether or not the stock is included in the fund’s benchmark index.

The second component, \mathrm{B}(\mathbf{x}), captures how an ETF’s demand is affected by the decisions of other ETFs. This is the effect of S&P 500-tracking ETFs on the holdings of ETFs that track a large-cap high-beta index. Because buying by one ETF leads to additional buying by other ETFs, \mathrm{B}(\mathbf{x}) is increasing in ETF demand:

(2)   \begin{align*} {\textstyle \frac{\partial}{\partial x_f}}[\mathrm{B}(\mathbf{x})] &> 0.  \end{align*}

Because each ETF’s demand only has a subtle effect on other ETFs’ demand, we have that:

(3)   \begin{align*} {\textstyle \frac{\partial}{\partial x_f}}[\mathrm{B}(\mathbf{x})] &= \mathrm{O}(\sfrac{1}{F}), \end{align*}

where \mathrm{O}(\cdot) is Landau notation. So, when there are more ETFs that might hold a stock, each ETF’s decisions have a smaller effect on the decisions of its peers. In the theoretical analysis below, I’m going to study the function \mathrm{B}(\mathbf{x}) = \frac{\beta}{F} \cdot \sum_{f=1}^F x_f, which satisfies this property.

Finally, the third component, -\lambda \cdot z_f, reflects the fact that ETFs use thresholding rules. Even if Verisk has a market capitalization that is \mathdollar 1 smaller than the 500th largest stock, Verisk won’t be held by the SPDR, which tracks the S&P 500. The moment Verisk becomes the 500th largest stock, the SPDR is going to have to buy a large block of shares. This threshold-adjustment component is defined using modular arithmetic,

(4)   \begin{align*} z_f &= \lambda^{-1} \cdot \mathrm{mod}( y_f , \, \lambda), \end{align*}

where y_f = \alpha_f + \mathrm{B}(\mathbf{x}) and \mathrm{mod}( y_f , \, \lambda ) is the remainder that’s left over after dividing y_f by \lambda. So, if \lambda = 3 shares and an ETF would have bought y_f = 5 shares, then z_f = \sfrac{2}{3} and its resulting demand is x_f = 3 shares. A fund is always demanding some multiple of \lambda shares. Think about \lambda as the size of a typical ETF’s adjustment once a stock is added to its benchmark.
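
Here is a minimal numerical sketch of this threshold rule in Python (the language I’ll use for all of the code snippets in these notes; the variable names are mine):

import numpy as np

lam = 3.0                     # size of a fund's discrete adjustment, lambda
y_f = 5.0                     # shares the fund would buy absent the threshold rule
z_f = np.mod(y_f, lam) / lam  # leftover fraction, Equation (4): 2/3
x_f = y_f - lam * z_f         # realized demand is a multiple of lambda, Equation (1): 3 shares
print(z_f, x_f)               # 0.666..., 3.0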

3. Equilibrium Concept

What does it mean for this market to be in equilibrium, and how does the market transition between equilibria? Via Tarski’s fixed-point theorem, we know that for any choice of {\boldsymbol \alpha}, if

  1. both \alpha_f and \lambda are bounded, and
  2. there are scalars \underline{x} = \mathrm{B}(\underline{x},\,\underline{x},\,\ldots,\,\underline{x}) + \underline{\alpha} - \overline{\lambda} and \overline{x} = \mathrm{B}(\overline{x},\,\overline{x},\,\ldots,\,\overline{x}) + \overline{\alpha} - \underline{\lambda},

then there exists a solution, \mathbf{x}^\star, to Equation (1) for all F funds. This solution is the equilibrium associated with {\boldsymbol \alpha}. Here, \overline{\alpha} and \underline{\alpha} denote the upper and lower bounds on \alpha_f—i.e., \alpha_f \in [\underline{\alpha}, \, \overline{\alpha}]; and, \overline{\lambda} and \underline{\lambda} denote the upper and lower bounds on \lambda. To make things concrete, note that, if \alpha_f \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{Unif}(0,\,\lambda), then \mathbf{x}^\star = 0.

At the start of each trading day, the market is in an equilibrium at \mathbf{x}^\star associated with the intrinsic demand {\boldsymbol \alpha}. We can normalize this value to 0. Then, over the course of the day, the stock’s characteristics change. It gets added to some ETFs’ benchmarks. I model this as a shock to each ETF’s intrinsic demand,

(5)   \begin{align*} \hat{\alpha}_f = \alpha_f + \sfrac{\epsilon_f}{F} \end{align*}

where \epsilon_f is positive with mean \mu_{\epsilon} > 0 and distributed i.i.d. across funds. This shock is divided through by F so that its impact is the same magnitude as the feedback effect from one ETF’s demand to another. Given this shock, how will ETFs’ demand evolve over the course of the final 10 to 15 minutes of trading?

To answer this question, we need to define how ETFs update their demand. I follow the approach used in Cooper (1994) where ETFs adjust their positions by applying the best response function iteratively. In the first round, ETFs adjust their holdings by \lambda if they realize a sufficiently large shock to their intrinsic demand:

(6)   \begin{align*} x_{f,1} &=  \begin{cases} x_{f,0} + \lambda &\text{if } \sfrac{\epsilon_f}{F} \geq \lambda \cdot (1 - z_{f,0}) \\ x_{f,0} &\text{else} \end{cases} \\ \text{and} \quad z_{f,1} &= z_{f,0} + \lambda^{-1} \cdot ( \, \sfrac{\epsilon_f}{F} - \{ x_{f,1} - x_{f,0} \} \, ). \end{align*}

Then, in each subsequent period, ETFs adjust their holdings by \lambda if the demand pressure from other ETFs is sufficiently large:

(7)   \begin{align*} x_{f,t+1} &=  \begin{cases} x_{f,t} + \lambda &\text{if } \{ \mathrm{B}(\mathbf{x}_t) - \mathrm{B}(\mathbf{x}_{t-1})\} \geq \lambda \cdot (1- z_{f,t}) \\ x_{f,t} &\text{else} \end{cases} \\ \text{and} \quad z_{f,t+1} &= z_{f,t} + \lambda^{-1} \cdot \left( \, \{ \mathrm{B}(\mathbf{x}_t) - \mathrm{B}(\mathbf{x}_{t-1})\} - \{ x_{f,t+1} - x_{f,t} \}  \, \right). \end{align*}

This process continues until no more changes are required at T = \min\{ \, t \mid \mathbf{x}_{t+1} = \mathbf{x}_t \, \}. Initially, the shock is exogenous, \sfrac{\epsilon_f}{F}; then, in all later rounds the shock comes from other ETFs, \mathrm{B}(\mathbf{x}_t) - \mathrm{B}(\mathbf{x}_{t-1}). Importantly, ETFs in this model don’t proactively adjust their positions to account for future changes in demand that they see coming down the line in future iterations. After all, ETFs are constrained to mimic their benchmarks.
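
To see how Equations (6) and (7) play out, here is a small simulation sketch. It starts from the zero equilibrium implied by \alpha_f \sim \mathrm{Unif}(0,\,\lambda), hits each fund with a positive shock \sfrac{\epsilon_f}{F}, and iterates the best-response rule until no fund adjusts. The exponential distribution for \epsilon_f is my choice; the model only requires a positive mean.

import numpy as np

rng = np.random.default_rng(0)
F, lam, beta, mu_eps = 1000, 0.10, 0.10, 0.10

alpha = rng.uniform(0.0, lam, F)   # intrinsic demands; the initial equilibrium is x* = 0
eps = rng.exponential(mu_eps, F)   # positive shocks with mean mu_eps (distribution is my choice)
x = np.zeros(F)                    # holdings relative to the initial equilibrium
z = alpha / lam                    # leftover fractions z_{f,0} at the start of the day

# Round 1, Equation (6): react to the exogenous shocks eps_f / F
shock = eps / F
buy = shock >= lam * (1.0 - z)
x_new = x + lam * buy
z = z + (shock - (x_new - x)) / lam
x_prev, x = x, x_new

# Later rounds, Equation (7): react to the demand pressure from other ETFs
while True:
    pressure = (beta / F) * (x.sum() - x_prev.sum())   # B(x_t) - B(x_{t-1})
    buy = pressure >= lam * (1.0 - z)
    if not buy.any():
        break
    x_new = x + lam * buy
    z = z + (pressure - (x_new - x)) / lam
    x_prev, x = x, x_new

print("cascade length L =", int(round(x.sum() / lam)))

Averaging the printed L over many runs of this sketch should land close to the analytical mean \mathrm{E}[L] = \frac{\mu_{\epsilon}}{\lambda} \cdot \frac{1}{1-\beta} derived in the next section.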

4. Cascade Length

It is now possible to show that small initial shocks to each ETF’s demand can lead to long rebalancing cascades. Let \ell_t denote the number of ETFs that buy \lambda additional shares of the target stock in period t:

(8)   \begin{align*} \ell_t &= \lambda^{-1} \cdot {\textstyle \sum_{f=1}^F} (x_{f,t} - x_{f,t-1}). \end{align*}

Similarly, let L denote the total number of ETF position changes from time t=1 to time t=T:

(9)   \begin{align*} L &= {\textstyle \sum_{t=1}^T} \ell_t = \lambda^{-1} \cdot {\textstyle \sum_{f=1}^F} (x_{f,T} - x_{f,0}). \end{align*}

So, when L \gg 1, then a small initial shock will cause a large number of ETFs to change their positions later on. This is what I mean by the length of a rebalancing cascade.

In order to characterize the distribution of L analytically, I need to make a couple of functional form assumptions. First, I assume that the interaction of ETF-demand schedules is governed by the rule:

(10)   \begin{align*} \mathrm{B}(\mathbf{x}) &= {\textstyle \frac{\beta}{F}} \cdot {\textstyle \sum_{f=1}^F} x_f, \end{align*}

with parameter 0 \leq \beta \leq \mathrm{min}(\lambda^{-1},\,1). As \beta gets larger and larger, each ETF’s demand has a larger and larger effect on other ETFs’ demand schedules. Finally, suppose that the distribution of ETF intrinsic demands,

(11)   \begin{align*} \alpha_f \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{Unif}(0,\,\lambda), \end{align*}

is uniformly distributed on the interval from 0 to \lambda.

[Figure: simulated distribution of ETF-rebalancing cascade lengths]

Using results from Harris (1963) it’s possible to show that, as the number of funds gets large, F \to \infty, the sequence of adjustments \{ \ell_t \} converges to a Poisson-distributed Galton-Watson process. The probability-density function for the total cascade length is given by:

(12)   \begin{align*} \mathrm{pdf}(L) &= (\beta \cdot L + \sfrac{\mu_{\epsilon}}{\lambda})^{L-1} \cdot \frac{\sfrac{\mu_{\epsilon}}{\lambda} \cdot e^{-(\beta \cdot L + \sfrac{\mu_{\epsilon}}{\lambda})}}{L!}. \end{align*}

The figure above plots this function when \beta = 0.10 and \mu_{\epsilon} = 0.10.
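
As a quick sanity check on this formula, the snippet below evaluates Equation (12) on a grid of cascade lengths and confirms that the probabilities sum to roughly one; scipy’s gammaln function keeps the factorial term numerically stable:

import numpy as np
from scipy.special import gammaln

beta, mu = 0.10, 0.10 / 0.10   # beta and mu_eps / lambda, as in the figure above
L = np.arange(0, 200)
log_pdf = (L - 1) * np.log(beta * L + mu) + np.log(mu) - (beta * L + mu) - gammaln(L + 1)
pdf = np.exp(log_pdf)
print(pdf[:4])    # Pr(L = 0), Pr(L = 1), ... = 0.368, 0.333, 0.181, 0.077
print(pdf.sum())  # approximately 1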

A Galton-Watson process, \{\ell_t\}, is a stochastic process that starts out with \ell_0=1 and then evolves according to the rule

(13)   \begin{align*} \ell_{t+1} &= {\textstyle \sum_{\ell=1}^{\ell_t}} \xi_{\ell}^{(t)}, \end{align*}

where \{ \, \xi_{\ell}^{(t)} \mid t , \, \ell  \in \mathbb{N} \, \} is a set of i.i.d. Poisson-distributed random variables. To illustrate, the figure below displays a single realization of a Galton-Watson process. In this realization, L=7 ETFs in total rebalance their holdings in response to an initial demand shock, and the rebalancing cascade lasts T = 3 rounds. There are \ell_2 = 3 ETFs that rebalance in the second round, and the rebalancing demand from the third of these ETFs causes two more ETFs to adjust their positions, \xi_{3}^{(2)} = 2.

[Figure: a single realization of a Galton-Watson process]

Given the probability-density function described above, it’s easy to show that the mean and variance of the number of ETFs that rebalance are:

(14)   \begin{align*} \mathrm{E}[L] &= {\textstyle \frac{\mu_{\epsilon}}{\lambda} \cdot \frac{1}{1-\beta}} \quad \text{and} \quad \mathrm{Var}[L] = {\textstyle \frac{\mu_{\epsilon}}{\lambda} \cdot \frac{1}{(1-\beta)^3}}. \end{align*}

So, even when each individual ETF realizes a shock to its intrinsic demand over the course of the trading day that is quite small on average, \mu_{\epsilon} \approx 0, the stock can still realize large demand shocks when each ETF’s demand has a large spillover effect, \beta \approx 1, and when funds don’t have to make very granular position changes, \lambda \approx 0. This is the amplification result I mentioned above. The figure below verifies these calculations using simulations with F = 10^3 ETFs. In particular, the left panel shows that an initial shock of only \mu_{\epsilon} = 0.10 leads to more than one rebalancing on average when \lambda = 0.10.
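
Before looking at the figure, Equation (14) is also easy to verify by simulating the limiting branching process directly. The sketch below draws a Poisson(\sfrac{\mu_{\epsilon}}{\lambda}) number of first-round rebalancings and then a Poisson(\beta) number of follow-on rebalancings for each one; this is the Galton-Watson representation rather than the fund-level simulation used to generate the figure:

import numpy as np

rng = np.random.default_rng(0)
beta, mu_eps, lam = 0.10, 0.10, 0.10

def cascade_length():
    active = rng.poisson(mu_eps / lam)                  # number of first-round rebalancings
    total = active
    while active > 0:
        active = rng.poisson(beta, size=active).sum()   # follow-on rebalancings
        total += active
    return total

draws = np.array([cascade_length() for _ in range(100_000)])
print(draws.mean(), (mu_eps / lam) / (1 - beta))        # both around 1.11
print(draws.var(),  (mu_eps / lam) / (1 - beta) ** 3)   # both around 1.37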

[Figure: ETF-rebalancing simulation, predicted vs. actual moments]


The Characteristic Scale of House-Price Variation

January 24, 2016 by Alex

1. Introduction

There are many reasons why two houses might have different prices. To start with, one house might just be larger or have a better layout than the other. Let’s call these sorts of house-to-house differences “fine-grained”. But, prices can also vary for reasons that have nothing to do with the houses themselves. Even if two houses are physically identical, one house might sit in a more attractive neighborhood or belong to a better school district. Let’s call such differences over larger scales “coarse-grained”.

Different scales dominate in different places. In some counties, most of the price variation comes from fine-grained, house-to-house differences. Think about Los Angeles, CA where there is a lot of heterogeneity in the age and quality of the housing stock, even for houses that are right next door to one another. But, there are also coarse-grained counties where most of the house-price variation occurs over much larger scales. Think about Orange County, CA where the typical house is part of a subdivision. While there are lots of differences between subdivisions in Orange County, all the houses within each subdivision are typically built in a similar style by a single company at the exact same time.

[Figure: Los Angeles, CA vs. Orange County, CA]

This post shows how to estimate the characteristic scale of house-price variation in a county. That is, it shows how to tell if a county is fine-grained like Los Angeles, coarse-grained like Orange County, or somewhere in between. To do this, I introduce a new scale-specific variance estimator based on the Allan variance. This estimator decomposes the cross-sectional house-price variation in a given county into scale-specific components such as the amount of variation that arises from comparing randomly selected houses or the amount of variation that arises from comparing randomly selected neighborhoods.

Why not just compare the variance of the individual house prices to the variance of the neighborhood-level averages? The answer is simple: those calculations aren’t independent. Fine-grained counties with lots of house-to-house variation will mechanically have more variance in their average neighborhood-level prices. Just imagine the extreme case where all of the variation comes from house-to-house differences and each house’s price is independently drawn from the same normal distribution, p_h \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathsf{N}(\bar{p}, \, \sigma_H^2). In a world where the nth neighborhood has L houses, the variance of the neighborhood-level average price, \mathsf{E}(p_h|h \in n) = \hat{\mu}_n, would be increasing in the amount of house-to-house variation, \mathsf{Var}(\hat{\mu}_n)=\sfrac{\sigma_H^2}{L}. I use this new scale-specific estimator because I don’t want to confuse these sorts of emergent neighborhood-level fluctuations with the honest-to-goodness neighborhood-level differences.

2. Data-Generating Process

Consider a county with H houses and N neighborhoods where there are L \geq 2 houses in each neighborhood so that (L \cdot N) = H. Suppose that house prices are the sum of a neighborhood-level value and a house-level value,

(1)   \begin{align*} p_h &= {\textstyle \sum_n} \mu_n \cdot 1_{\{ h \in n \}} + \theta_h, \end{align*}

with \mu_n \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathsf{N}(0, \, \sigma_N^2) and \theta_h \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathsf{N}(0, \, \sigma_H^2). Think about neighborhood-level values as the quality of the local school district or the attractiveness of the nearby restaurant scene. This value is coarse-grained. You can have a mansion or a hovel in a nice school district. The house-level value, by contrast, relates to the characteristics of each particular house. This value is fine-grained.

Given this data-generating process, we know that a fraction

(2)   \begin{align*} \lambda = {\textstyle \frac{\sigma_N^2}{\sigma_N^2 + \sigma_H^2}} \end{align*}

of the variation in house prices comes from neighborhood-level differences. This is the object of interest in the current post. If \lambda is close to 1, then the county is dominated by coarse-grained variation in house prices and looks like Orange County, CA. If \lambda is close to 0, then the county is dominated by fine-grained variation in house prices and looks more like Los Angeles, CA. In the analysis below, I’m going to show how to estimate \lambda in simulated data.

3. Naïve Estimate of λ

One way to estimate \lambda would be to look at a variance ratio. You might first compute the average price in each neighborhood,

(3)   \begin{align*} \hat{\mu}_n = \mathsf{E}(p_h|h \in n) = {\textstyle \frac{1}{L} \cdot \sum_{h \in n}} p_h, \end{align*}

and then look at the ratio of the variance of the neighborhood-level average prices to the total house-price variance:

(4)   \begin{align*} \lambda^{\scriptscriptstyle\text{Na\"{i}ve}}  = {\textstyle  \frac{ \mathsf{Var}(\hat{\mu}_n) }{ \mathsf{Var}(p_h) } }. \end{align*}

The thought process behind this calculation is really simple. If different neighborhoods have very different prices, then there should be a lot of variation in the average neighborhood-level price. In fact, it turns out to be too simple. While it is true that this naïve calculation will generate a higher \lambda in counties with lots of coarse-grained neighborhood-level variation, it will also be high in counties with lots of fine-grained house-to-house variation.

To see why, let’s look at the variance of the average price in each county:

(5)   \begin{align*} \mathsf{Var}(\hat{\mu}_n) &= {\textstyle \frac{1}{N} \cdot \sum_n} (\hat{\mu}_n - \hat{\mu} )^2 \\ &=  \underbrace{{\textstyle \frac{1}{N} \cdot \sum_n} (\mu_n - 0)^2}_{\sigma_N^2} + \underbrace{{\textstyle \frac{1}{N} \cdot \sum_n} (\hat{\mu}_n - \mu_n)^2}_{\sfrac{\sigma_H^2}{L}}  - \underbrace{{\textstyle \frac{1}{N} \cdot \sum_n}(\hat{\mu} - 0)^2}_{\sfrac{\sigma_N^2}{N} + \sfrac{\sigma_H^2}{H}}. \end{align*}

This variance consists of 3 parts: the true neighborhood-level variance, \sigma_N^2; differences in neighborhood-level prices from fine-grained variation, \sfrac{\sigma_H^2}{L}; and, a correction for the unknown sample mean, -\left( \sfrac{\sigma_N^2}{N} + \sfrac{\sigma_H^2}{H}\right). If we solve for the true amount of neighborhood-level variation,

(6)   \begin{align*} \sigma_N^2 &=  {\textstyle \frac{N}{N - 1}} \cdot \left( \, \mathsf{Var}(\hat{\mu}_n) - {\textstyle \frac{N - 1}{H}} \cdot \sigma_H^2 \, \right), \end{align*}

we see that it’s going to be smaller than the variation in neighborhood-level average prices. Counties with lots of fine-grained, house-to-house differences look like they have too much neighborhood-level house-price variation. What’s more, in the simulations plotted below [code] where the county has H = 256 houses, you can see that the nature of this bias is going to vary in a non-trivial way as the number of neighborhoods and the amount of coarse-grained, neighborhood-level variation change.
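
Here is a stripped-down version of that simulation. It generates prices from Equation (1) with H = 256 houses and computes the naïve estimator in Equation (4); the number of neighborhoods and the variance parameters are illustrative choices of mine, set so that the true \lambda is 0.50:

import numpy as np

rng = np.random.default_rng(0)
H, N = 256, 64                       # houses and neighborhoods (L = 4 houses each)
L = H // N
sigma_N, sigma_H = 1.0, 1.0          # true lambda = 0.50

lam_naive = []
for _ in range(5_000):
    mu = rng.normal(0.0, sigma_N, N)              # neighborhood-level values
    theta = rng.normal(0.0, sigma_H, H)           # house-level values
    p = np.repeat(mu, L) + theta                  # Equation (1)
    mu_hat = p.reshape(N, L).mean(axis=1)         # neighborhood-average prices
    lam_naive.append(mu_hat.var() / p.var())      # Equation (4)

print(np.mean(lam_naive))   # noticeably above the true lambda of 0.50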

[Figure: naïve λ estimates in simulated data]

4. Corrected Estimate of λ

In order to fix the problem, we need a way of simultaneously estimating neighborhood-level and house-level price variation and then using the second estimate to correct the bias in the first. It turns out that you can do this by running a simple cross-sectional regression,

(7)   \begin{align*} p_h &=  \hat{\alpha}  +  {\textstyle \sum_n} \left\{ \, {\textstyle \sum_{j = 1}^{N/2}} \hat{\beta}_j \cdot x_j(n) \, \right\} \cdot 1_{\{ h \in n \}} +  {\textstyle \sum_{k=1}^{H/2}} \hat{\gamma}_k \cdot y_k(h)  +  \epsilon_h, \end{align*}

where \{ x_j(n) \}_{j=1}^{N/2} and \{ y_k(h) \}_{k=1}^{H/2} denote a collection of cleverly-chosen right-hand-side variables that I define below. The key insight is that, if you define these variables correctly, then you can read off both the neighborhood-level and house-level variation from the coefficients.

Here’s how. First, to create the variables that define the neighborhood-level variation, \{ x_j(n) \}_{j=1}^{N/2}, randomly pair-off each neighborhood within the county so that there are \sfrac{N}{2} neighborhood pairs. Then create the variables:

(8)   \begin{align*} \mathbf{x}_1 &=  {\textstyle \sqrt{\frac{1}{2} \cdot \left(\frac{H - 1}{2 \cdot (\sfrac{H}{N})}\right)}} \times  \left[ \begin{array}{ccc:ccc:ccc:ccc:c}  1 & \cdots & \phantom{-}1 & -1 & \cdots & -1 &  0 & \cdots & \phantom{-}0 &  \phantom{-}0 & \cdots & \phantom{-}0 &  \cdots \end{array} \right]^{\top} \\ \mathbf{x}_2 &=  {\textstyle \sqrt{\frac{1}{2} \cdot \left(\frac{H - 1}{2 \cdot (\sfrac{H}{N})}\right)}} \times  \left[ \begin{array}{ccc:ccc:ccc:ccc:c}  0 & \cdots & \phantom{-}0 &  \phantom{-}0 & \cdots & \phantom{-}0 &  1 & \cdots & \phantom{-}1 & -1 & \cdots & -1 &  \cdots \end{array} \right]^{\top} \\ &\vdots \end{align*}

So, the first of these variables, \mathbf{x}_1, compares the average price in the first neighborhood to the average price in the second neighborhood, meaning that the variable is mean zero. The scaling by \sqrt{\frac{1}{2} \cdot \left(\frac{H - 1}{2 \cdot (\sfrac{H}{N})}\right)} then ensures that \sfrac{1}{2} = \mathsf{Var}(\mathbf{x}_j) for all j = 1, 2, \ldots, \sfrac{N}{2}.

Next, to create the variables that define the house-level variation, \{ y_k(h) \}_{k=1}^{H/2}, randomly pair-off each house within a neighborhood so that there are \sfrac{L}{2} house pairs within each neighborhood and \sfrac{H}{2} pairs in total. Then, use these house pairs to create the variables:

(9)   \begin{align*} \mathbf{y}_1 &=  {\textstyle \sqrt{\frac{1}{2} \cdot \left( \frac{H-1}{2 \cdot 1} \right)}} \times  \left[ \begin{array}{ccccccc:c}  1 & -1 &  0 & \phantom{-}0 &  \cdots &  0 & \phantom{-}0 & \cdots \end{array} \right]^{\top} \\ \mathbf{y}_2 &=  {\textstyle \sqrt{\frac{1}{2} \cdot \left(\frac{H-1}{2 \cdot 1}\right)}} \times  \left[ \begin{array}{ccccccc:c}  0 & \phantom{-}0 &  1 & -1 &  \cdots &  0 & \phantom{-}0 & \cdots \end{array} \right]^{\top} \\ &\vdots \end{align*}

So, the first of these variables, \mathbf{y}_1, compares the price of the first house to the price of the second house, meaning that the variable is mean zero. The scaling by \sqrt{\frac{1}{2} \cdot \left(\frac{H - 1}{2 \cdot 1}\right)} ensures that \sfrac{1}{2} = \mathsf{Var}(\mathbf{y}_k) for all k = 1, 2, \ldots, \sfrac{H}{2}.

Because these right-hand-side variables are orthogonal to one another,

(10)   \begin{align*} 0  =  \mathsf{Cov}(\mathbf{x}_j, \mathbf{x}_{j'})  =  \mathsf{Cov}(\mathbf{y}_k, \mathbf{y}_{k'}) =  \mathsf{Cov}(\mathbf{x}_j, \mathbf{y}_{k}), \end{align*}

we can then use their coefficients to estimate

(11)   \begin{align*} \tilde{\sigma}_H^2 &= {\textstyle \sum_{k=1}^{H/2}}\hat{\gamma}_k^2 \\ \text{and} \quad \tilde{\sigma}_N^2 &= {\textstyle \sum_{j=1}^{N/2}} \hat{\beta}_j^2 - {\textstyle \frac{\tilde{\sigma}_H^2}{\sfrac{H}{N}}}, \end{align*}

and thus generate an unbiased estimate of \lambda:

(12)   \begin{align*} \lambda^{\scriptscriptstyle \text{2-Sample}} &= {\textstyle \frac{\tilde{\sigma}_N^2}{\tilde{\sigma}_N^2 + \tilde{\sigma}_H^2}}. \end{align*}

The simulations below [code] use the exact same parameter values as above but calculate \lambda using this correction. All of the earlier bias disappears.
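
Here is a self-contained sketch of that correction, applied to the same simulated data-generating process as in the naïve example above. I pair off consecutive neighborhoods and consecutive houses rather than pairing at random, which makes no difference in the simulation because the draws are exchangeable; everything else follows Equations (7) through (12):

import numpy as np

rng = np.random.default_rng(0)
H, N = 256, 64
L = H // N
sigma_N, sigma_H = 1.0, 1.0                    # true lambda = 0.50

# neighborhood-pair contrast variables, Equation (8)
X = np.zeros((H, N // 2))
cx = np.sqrt(0.5 * (H - 1) / (2 * L))
for j in range(N // 2):
    X[(2 * j) * L:(2 * j + 1) * L, j] = cx
    X[(2 * j + 1) * L:(2 * j + 2) * L, j] = -cx

# house-pair contrast variables within neighborhoods, Equation (9)
Y = np.zeros((H, H // 2))
cy = np.sqrt(0.5 * (H - 1) / 2)
for k in range(H // 2):
    Y[2 * k, k], Y[2 * k + 1, k] = cy, -cy

D = np.column_stack([np.ones(H), X, Y])        # design matrix for regression (7)

estimates = []
for _ in range(2_000):
    p = np.repeat(rng.normal(0, sigma_N, N), L) + rng.normal(0, sigma_H, H)
    coef = np.linalg.lstsq(D, p, rcond=None)[0]
    beta_hat, gamma_hat = coef[1:1 + N // 2], coef[1 + N // 2:]
    sig2_H = np.sum(gamma_hat ** 2)                       # Equation (11)
    sig2_N = np.sum(beta_hat ** 2) - sig2_H / L
    estimates.append(sig2_N / (sig2_N + sig2_H))          # Equation (12)

print(np.mean(estimates))                      # close to the true lambda of 0.50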

[Figure: corrected λ estimates in simulated data]


Using the LASSO to Forecast Returns

December 5, 2015 by Alex

1. Motivating Example

A Popular Goal. Financial economists have been looking for variables that predict stock returns for as long as there have been financial economists. For some recent examples, think about Jegadeesh and Titman (1993), which shows that a stock’s current returns are predicted by the stock’s returns over the previous 12 months, Hou (2007), which shows that the current returns of the smallest stocks in an industry are predicted by the lagged returns of the largest stocks in the industry, and Cohen and Frazzini (2008), which shows that a stock’s current returns are predicted by the lagged returns of its major customers.

Two-Step Process. When you think about it, finding these sorts of variables actually consists of two separate problems, identification and estimation. First, you have to use your intuition to identify a new predictor, x_t, and then you have to use statistics to estimate this new predictor’s quality,

(1)   \begin{align*}   r_{n,t} &= \hat{\theta}_0 + \hat{\theta}_1 \cdot x_{t-1} + \epsilon_{n,t}, \end{align*}

where \hat{\theta}_0 and \hat{\theta}_1 are estimated coefficients, r_{n,t} is the return on the nth stock, and \epsilon_{n,t} is the regression residual. If knowing x_{t-1} reveals a lot of information about what a stock’s future returns will be, then |\hat{\theta}_1| and the associated R^2 will be large.

Can’t Always Use Intuition. But, modern financial markets are big, fast, and dense. Predictability doesn’t always occur at scales that are easy for people to intuit, making the standard approach to tackling the first problem problematic. For instance, the lagged returns of the Federal Signal Corporation were a significant predictor for more than 70{\scriptstyle \%} of all NYSE-listed telecom stocks during a 34-minute stretch on October 5th, 2010. Can you really fish this particular variable out from the sea of spurious predictors using intuition alone? And, how exactly are you supposed to do this in under 34 minutes?

Using Statistics Instead. In a recent working paper (link), Mao Ye, Adam Clark-Joseph, and I show how to replace this intuition step with statistics and use the least absolute shrinkage and selection operator (LASSO) to identify rare, short-lived, “sparse” signals in the cross-section of returns. This post uses simulations to show how the LASSO can be used to forecast returns.

2. Using the LASSO

LASSO Definition. The LASSO is a penalized-regression technique that was introduced in Tibshirani (1996). It simultaneously identifies and estimates the most important coefficients using a far shorter sample period by betting on sparsity—that is, by assuming that only a handful of variables actually matter at any point in time. Formally, using the LASSO means solving the problem below,

(2)   \begin{align*}   \hat{\boldsymbol \vartheta}   &=   \underset{{\boldsymbol \vartheta} \in \mathbf{R}^Q}{\mathrm{arg}\,\mathrm{min}}   \,   \left\{     \,     \frac{1}{2 \cdot T}     \cdot      \sum_{t=1}^T \left(r_t - \vartheta_0 - {\textstyle \sum_{q=1}^Q} \vartheta_q \cdot x_{q,t-1}\right)^2     +      \lambda \cdot \sum_{q=1}^Q \left|\vartheta_q\right|     \,   \right\}, \end{align*}

where r_t is a stock’s return at time t, \hat{\boldsymbol \vartheta} is a (Q \times 1)-dimensional vector of estimated coefficients, x_{q,t-1} is the value of qth predictor at time (t-1), T is the number of time periods in the sample, and \lambda is a penalty parameter. Equation (2) looks complicated at first, but it’s not. It’s a simple extension of an OLS regression. In fact, if you ignore the right-most term—the penalty function, \lambda \cdot \sum_q \left|\vartheta_q\right|—then this optimization problem would simply be an OLS regression.

Penalty Function. But, it’s this penalty function that’s the secret to the LASSO’s success, allowing the estimator to give preferential treatment to the largest coefficients and completely ignore the smaller ones. To better understand how the LASSO does this, consider the solution to Equation (2) when the right-hand-side variables are uncorrelated and have unit variance:

(3)   \begin{align*}   \hat{\vartheta}_q &= \mathrm{sgn}[\hat{\theta}_q] \cdot (|\hat{\theta}_q| - \lambda)_+. \end{align*}

Here, \hat{\theta}_q represents what the standard OLS coefficient would have been if we had an infinite amount of data, \mathrm{sgn}[x] = \sfrac{x}{|x|}, and (x)_+ = \max\{0,\,x\}. On one hand, this solution means that, if OLS would have estimated a large coefficient, |\hat{\theta}_q| \gg \lambda, then the LASSO is going to deliver a similar estimate, \hat{\vartheta}_q \approx \hat{\theta}_q. On the other hand, the solution implies that, if OLS would have estimated a sufficiently small coefficient, |\hat{\theta}_q| < \lambda, then the LASSO is going to pick \hat{\vartheta}_q = 0. Because the LASSO can set all but a handful of coefficients to zero, it can be used to identify the most important predictors even when the sample length is much shorter than the number of possible predictors, T \ll Q. Morally speaking, if only K \ll Q of the predictors are non-zero, then you should only need a few more than K observations to select and then estimate the size of these few important coefficients.
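
Here is Equation (3) written as a two-line function (the function name is mine):

import numpy as np

def soft_threshold(theta_ols, lam):
    # Equation (3): keep the sign, shrink the magnitude by lam, and truncate at zero
    return np.sign(theta_ols) * np.maximum(np.abs(theta_ols) - lam, 0.0)

print(soft_threshold(np.array([0.80, 0.03, -0.40, -0.01]), 0.05))
# [ 0.75  0.   -0.35  0.  ]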

3. Simulation Analysis

I run 1,000 simulations to show how to use the LASSO to forecast future returns. You can find all of the relevant code here.

Data Simulation. Each simulation involves generating returns for Q = 100 stocks for T = 1,150 periods. Each period, the returns of all Q=100 stocks are governed by the returns of a subset of K=5 stocks, \mathcal{K}_t, together with an idiosyncratic shock,

(4)   \begin{align*} r_{q,t} &= 0.15 \cdot \sum_{q' \in \mathcal{K}_t} r_{q',t-1} + 0.001 \cdot \epsilon_{q,t}, \end{align*}

where \epsilon_{q,t} \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,1). This cast of K = 5 sparse signals changes over time, leading to the time subscript on \mathcal{K}_t. Specifically, I assume that there is a 1{\scriptstyle \%} chance that each signal changes every period, so each signal lasts \sfrac{(1 - 0.01)}{0.01} = 99 periods on average.

Fitting Models to the Data. For each period from t=151 to t=1,150, I estimate the LASSO on the first stock, q=1, as defined in Equation (2) using the previous T=50 periods of data where the Q possible predictors are the Q=100 stocks. This means using T=50 time periods to estimate a model with Q=100 potential right-hand-side variables. As useful benchmarks, I also estimate the autoregressive model from Equation (1) and an oracle regression. In the oracle specification, I estimate an OLS regression with the K=5 true predictors as the right-hand-side variables. Obviously, in the real world you don’t know what the true predictors are, but this specification gives an estimate of the best fit you could achieve. After fitting each model to the previous 50 periods of data, I then make an out-of-sample forecast in the 51st period.
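
The sketch below mirrors this design: it simulates the returns in Equation (4) and fits a rolling 50-period LASSO for stock q=1 using scikit-learn, whose objective uses the same \frac{1}{2 \cdot T} scaling as Equation (2). Rather than tuning the penalty out of sample as described in Section 4 below, it simply sets the penalty to a fixed fraction of the largest marginal covariance in each window; that shortcut, and the correlation-based R^2 proxy at the end, are simplifications of mine:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
Q, T, K = 100, 1150, 5

# simulate the sparse-signal returns in Equation (4)
r = np.zeros((T, Q))
signals = rng.choice(Q, K, replace=False)             # the current signal set K_t
for t in range(1, T):
    switch = rng.random(K) < 0.01                     # each signal has a 1% chance of changing
    signals[switch] = rng.integers(0, Q, switch.sum())
    r[t] = 0.15 * r[t - 1, signals].sum() + 0.001 * rng.standard_normal(Q)

# rolling 50-period LASSO forecasts for stock q = 1, evaluated out of sample
window = 50
forecasts, realized = [], []
for t in range(150, T - 1):
    X = r[t - window:t, :]                            # lagged returns of all Q stocks
    y = r[t - window + 1:t + 1, 0]                    # next-period returns of stock 1
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    alpha = 0.1 * np.max(np.abs(Xc.T @ yc)) / len(y)  # penalty as a fraction of the largest covariance
    fit = Lasso(alpha=alpha).fit(X, y)
    forecasts.append(fit.predict(r[t:t + 1, :])[0])
    realized.append(r[t + 1, 0])

corr = np.corrcoef(forecasts, realized)[0, 1]
print("out-of-sample forecasting R^2 (rough proxy):", corr ** 2)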

Forecasting Regressions. I then check how closely these forecasts line up with the realized returns of the first asset by analyzing the adjusted R^2 statistics from a bunch of forecasting regressions. For example, I take the LASSO’s return forecast in periods t=151 to t=1,150 and estimate the regression below,

(5)   \begin{align*}   r_{1,t+1} &= \alpha + \beta \times \left( \frac{f_{1,t}^{\scriptscriptstyle \mathrm{LASSO}} - \mu^{\scriptscriptstyle \mathrm{LASSO}}}{\sigma^{\scriptscriptstyle \mathrm{LASSO}}} \right) + \varepsilon_{1,t+1}, \end{align*}

where \alpha and \beta are estimated coefficients, r_{1,t+1} denotes the first stock’s realized return in period (t+1), f_{1,t}^{\scriptscriptstyle \mathrm{LASSO}} denotes the LASSO’s forecast of the first stock’s return in period (t+1), \mu^{\scriptscriptstyle \mathrm{LASSO}} and \sigma^{\scriptscriptstyle \mathrm{LASSO}} represent the mean and standard deviation of this out-of-sample forecast from period t=151 to t=1,150, and \varepsilon_{1,t+1} is the regression residual. The figure below shows that the average adjusted-R^2 statistic from 1,000 simulations is 4.40{\scriptstyle \%} for the LASSO, whereas this statistic is only 1.29{\scriptstyle \%} when making your return forecasts using an autoregressive model,

(6)   \begin{align*}   r_{1,t+1} &= \alpha + \beta \times \left( \frac{f_{1,t}^{\scriptscriptstyle \mathrm{OLS}} - \mu^{\scriptscriptstyle \mathrm{OLS}}}{\sigma^{\scriptscriptstyle \mathrm{OLS}}} \right) + \varepsilon_{1,t+1}. \end{align*}

[Figure: adjusted R^2 distribution, simulated data with sparse shocks]

4. Tuning Parameter

Penalty Parameter Choice. Fitting the LASSO to the data involves selecting a penalty parameter, \lambda. I do this by selecting the penalty parameter that has the highest out-of-sample forecasting R^2 during the first 100 periods of the data. This is why the forecasting regressions above only use data starting at t=151 instead of t=51. The figure below shows the distribution of penalty parameter choices across the 1,000 simulations. The discrete 0.0005 jumps come from the discrete grid of possible \lambdas that I considered when running the code.

[Figure: penalty-parameter distribution, simulated data with sparse shocks]

Number of Predictors. Finally, if you look at the panel labeled “Oracle” in the adjusted R^2 figure, you’ll notice that the LASSO’s out-of-sample forecasting power is about a third of the true model’s forecasting power, \sfrac{4.40}{12.84} = 0.34. This is because the LASSO doesn’t do a perfect job of picking out the K=5 sparse signals. The right panel of the figure below shows that the LASSO usually only picks out the most important of these K=5 signals. What’s more, the left panel shows that the LASSO also locks onto lots of spurious signals. This result suggests that you might be able to improve the LASSO’s forecasting power by choosing a higher penalty parameter, \lambda.

[Figure: selected-predictor distribution, simulated data with sparse shocks]

5. When Does It Fail?

Placebo Tests. I conclude this post by looking at two alternative simulations where the LASSO shouldn’t add any forecasting power. In the first alternative setting, there are no shocks. That is, the returns for the Q=100 stocks are simulated using the model below,

(7)   \begin{align*} r_{q,t} &= 0.00 \cdot \sum_{q' \in \mathcal{K}_t} r_{q',t-1} + \sigma \cdot \epsilon_{q,t}. \end{align*}

In the second setting, there are too many shocks: K =75. The figures below show that, in both these settings, the LASSO doesn’t add any forecasting power. Thus, running these simulations offers a pair of nice placebo tests showing that the LASSO really is picking up sparse signals in the cross-section of returns.

[Figure: adjusted R^2 distribution, simulated data with no shocks]

[Figure: adjusted R^2 distribution, simulated data with dense shocks]


Notes on Kyle (1989)

November 13, 2015 by Alex

1. Motivating Example

In several earlier posts (e.g., here and here) I’ve talked about the two well-known information-based asset-pricing models, Grossman and Stiglitz (1980) and Kyle (1985). But, there are lots of situations that don’t really fit with either of these two models. For one thing, uninformed speculators often recognize that they’re going to have a price impact, so it’s at odds with Grossman and Stiglitz (1980). For another thing, uninformed speculators typically use limit orders, so it’s at odds with Kyle (1985).

This post outlines the Kyle (1989) model which studies speculators that place limit orders and recognize their own price impact.

2. Market Structure

Assets. There is a single trading period and a single risky asset with a price of p. This risky asset’s liquidation value is v \sim \mathrm{N}(0, \, \tau_v^{-1}). For example, you might think about the asset as a stock that’ll have a value of v after some important news announcement tomorrow. It’s just that, right now, you don’t know which direction the news will go.

Traders. There are 3 kinds of traders: noise traders, informed speculators, and uninformed speculators. Noise traders demand -z \sim \mathrm{N}(0, \, \tau_z^{-1}) shares of the risky asset. There are N informed speculators and M uninformed speculators. Both informed and uninformed traders have an initial endowment of \mathdollar 0 (this is just a normalization) and exponential utility with risk-aversion parameter \rho > 0,

(1)   \begin{align*} - \, \exp\left\{ \, - \, \rho \cdot (v - p) \cdot x \, \right\}, \end{align*}

where x denotes the number of shares demanded by a speculator.

Information. Prior to trading, each informed speculator gets a private signal s_n and has a demand schedule \mathrm{X}_{I,n}(p,\,s_n). That is, he has in mind a function which tells him how many shares to demand at each possible price, p, given his private signal, s_n. Assume that the informed speculators’ signals can be written as

(2)   \begin{align*} s_n = v + \epsilon_n \end{align*}

where \epsilon_n \sim \mathrm{N}(0, \, \tau_{\epsilon}^{-1}). Each uninformed speculator has a demand schedule \mathrm{X}_{U,m}(p).

3. Equilibrium Concept

Definition. An equilibrium is a set of demand schedules, X_{I,n}(p,\,s_n) for the n=1,\,2,\,\ldots,\,N informed speculators and X_{U,m}(p) for the m=1,\,2,\,\ldots,\,M uninformed speculators, and a price function P(v,\,z) such that (a) markets clear,

(3)   \begin{align*} z &= {\textstyle \sum_{n=1}^N} X_{I,n}(p,\,s_n) + {\textstyle \sum_{m=1}^M} X_{U,m}(p), \end{align*}

and (b) both informed and uninformed speculators optimize,

(4)   \begin{align*} X_{I,n}(p,\,s_n) &\in \arg \max_x \left\{ \, \mathrm{E}[ \, - \, \exp\left\{ \, - \, \rho \cdot (v - p) \cdot x \, \right\} \, | \, p, \, s_n \, ] \, \right\} \text{ for all } n = 1, \, 2, \, \ldots, \, N \\ \text{and} \qquad X_{U,m}(p) &\in \arg \max_x \left\{ \, \mathrm{E}[ \, - \, \exp\left\{ \, - \, \rho \cdot (v - p) \cdot x \, \right\} \, | \, p \, ] \, \right\} \text{ for all } m = 1, \, 2, \, \ldots, \, M. \end{align*}

Both informed and uninformed speculators understand the relationship between prices and the random variables v and z. Prices will not be fully revealing due to the presence of noise-trader demand, z.

Refinements. I make a pair of additional restrictions on the set of equilibria. Namely, I look only for linear, symmetric equilibria where the informed speculators’ demand schedules can be written as

(5)   \begin{align*} \mathrm{X}_I(p,\, s_n) &= \alpha_I - \beta_I \cdot p + \gamma_I \cdot s_n, \end{align*}

the uninformed-speculators’ demand schedules can be written as

(6)   \begin{align*} \mathrm{X}_U(p) &= \alpha_U - \beta_U \cdot p, \end{align*}

and the price can be written as

(7)   \begin{align*} \mathrm{P}(v,\,z) &= \theta_0 + \theta_v \cdot v - \theta_z \cdot z. \end{align*}

4. Information Updating

Price Impact. If we substitute the linear demand schedules for the informed and uninformed speculators into the market-clearing condition, then we get a formula for the price,

(8)   \begin{align*} p &= \lambda \times \left( \, \left\{ \, N \cdot \alpha_I + M \cdot \alpha_U \, \right\} + \gamma_I \cdot {\textstyle \sum_{n=1}^N} s_n - z \, \right) \quad \text{where} \quad \lambda = {\textstyle \frac{1}{N \cdot \beta_I + M \cdot \beta_U}}. \end{align*}

Thus, if noise traders supply one additional share, then the price drops by \mathdollar \lambda. Next, I define the same object for informed and uninformed speculators. That is, taking the demand schedules of the other speculators as given, how much will the price change if the nth informed speculator or mth uninformed speculator increases his demand by 1 share? This question defines the residual supply curves,

(9)   \begin{align*} p &= \hat{p}_{I,n} + \lambda_I \times X_I(p,\,s_n) \\ \text{and} \qquad p &= \hat{p}_{U} + \lambda_U \times X_U(p). \end{align*}

Imperfect competition is present because each trader recognizes that if he submits a different schedule, the resulting equilibrium price may change.

Forecast Precision. Each informed speculator’s forecast precision is given via Bayesian updating as:

(10)   \begin{align*} \tau_I = \left( \, \mathrm{Var}[v|p,\, s_n] \, \right)^{-1} = \tau_v + \tau_{\epsilon} + \varphi_I \times (N-1) \cdot \tau_{\epsilon} \quad \text{where} \quad \varphi_I = {\textstyle \frac{(N - 1) \cdot \gamma_I^2 \cdot \tau_z}{(N - 1) \cdot \gamma_I^2 \cdot \tau_z + \tau_{\epsilon}}}. \end{align*}

\varphi_I represents the fraction of the precision from the other (N-1) informed speculators revealed to the nth informed speculator by the price. The corresponding forecast precision for an uninformed speculator is:

(11)   \begin{align*} \tau_U = \left( \, \mathrm{Var}[v|p] \, \right)^{-1} = \tau_v + \varphi_U \times N \cdot \tau_{\epsilon} \quad \text{where} \quad \varphi_U = {\textstyle \frac{N \cdot \gamma_I^2 \cdot \tau_z}{N \cdot \gamma_I^2 \cdot \tau_z + \tau_{\epsilon}}}. \end{align*}

\varphi_U represents the fraction of the precision of the N informed speculators revealed to the mth uninformed speculator by the price. Clearly, prices become perfectly revealing as \varphi_I, \, \varphi_U \to 1.

Posterior Beliefs. The nth informed speculator’s posterior belief about the risky asset’s liquidation value is a weighted average of the public price signal and his private signal,

(12)   \begin{align*} \mathrm{E}[v|p, \, s_n]  &=  \left( {\textstyle \frac{\varphi_I }{\tau_I} \cdot \frac{\tau_{\epsilon}}{\gamma_I}} \right) \times \left( \, \lambda^{-1} \cdot p - \{N \cdot \alpha_I + M \cdot \alpha_U\} \, \right) +  \left( {\textstyle \frac{(1 - \varphi_I) \cdot \gamma_I}{\tau_I} \cdot \frac{\tau_{\epsilon}}{\gamma_I}} \right) \times s_n  \end{align*}

Uninformed speculators form their posterior beliefs based solely on the public price signal,

(13)   \begin{align*} \mathrm{E}[v|p]  &=  \left( {\textstyle \frac{\varphi_U}{\tau_U} \cdot \frac{\tau_{\epsilon}}{\gamma_I}} \right) \times \left( \, \lambda^{-1} \cdot p - \{N \cdot \alpha_I + M \cdot \alpha_U\} \, \right). \end{align*}

5. Optimal Demand

Informed Demand. The nth informed speculator observes his private signal, s_n, and the price implied by the demand of the other traders, \hat{p}_{I,n}. He then solves the optimization problem below,

(14)   \begin{align*} \max_x \left\{ \, (\mathrm{E}[v|\hat{p}_{I,n}, \, s_n] - \hat{p}_{I,n}) \cdot x - (\lambda_I + {\textstyle \frac{\rho}{2}} \cdot \mathrm{Var}[v|\hat{p}_{I,n}, \, s_n]) \cdot x^2 \, \right\} \quad \Rightarrow \quad x = {\textstyle \frac{\mathrm{E}[v|\hat{p}_{I,n}, \, s_n] - \hat{p}_{I,n}}{2 \cdot \lambda_I + \rho \cdot \mathrm{Var}[v|\hat{p}_{I,n}, \, s_n]}}. \end{align*}

But, the residual demand curve, \hat{p}_{I,n}, is related to the actual price, \hat{p}_{I,n} = p - \lambda_I \cdot X_I(p,\,s_n). So, after a little bit of rearranging, we can write the informed speculators’ optimal demand schedules as

(15)   \begin{align*} X_I(p,\,s_n) &= {\textstyle \frac{\mathrm{E}[v|p, \, s_n] - p}{\lambda_I + \sfrac{\rho}{\tau_I}}}. \end{align*}

Uninformed Demand. The mth uninformed speculator observes only the price implied by the demand of the other traders, \hat{p}_U. He then solves the optimization problem below,

(16)   \begin{align*} \max_x \left\{ \, (\mathrm{E}[v|\hat{p}_U] - \hat{p}_U) \cdot x - (\lambda_U + {\textstyle \frac{\rho}{2}} \cdot \mathrm{Var}[v|\hat{p}_U]) \cdot x^2 \, \right\} \quad \Rightarrow \quad x = {\textstyle \frac{\mathrm{E}[v|\hat{p}_U] - \hat{p}_U}{2 \cdot \lambda_U + \rho \cdot \mathrm{Var}[v|\hat{p}_U]}}. \end{align*}

Using the exact same tricks, we can write the uninformed speculators’ optimal demand schedule as

(17)   \begin{align*} X_U(p) &= {\textstyle \frac{\mathrm{E}[v|p] - p}{\lambda_U + \sfrac{\rho}{\tau_U}}}. \end{align*}

6. Endogenous Parameters

Next, it’s useful to define a couple of additional parameters.

Information Incidence. First, define the information incidence as

(18)   \begin{align*} \zeta &= \left( {\textstyle \frac{\tau_\epsilon}{\tau_I}} \right)^{-1} \times (\lambda \cdot \gamma_I). \end{align*}

This new parameter represents the increase in the equilibrium price when the nth informed speculator’s valuation of the risky asset goes up by \mathdollar 1 as a result of a higher signal realization, s_n. For trader n‘s valuation to rise by \mathdollar 1, his private signal must rise by a factor of (\frac{\tau_\epsilon}{\tau_I})^{-1}, and prices move by a factor of (\lambda \cdot \gamma_I) for every \mathdollar 1 increase in the nth informed speculator’s private signal, s_n. In equilibrium, it turns out that \zeta \leq \sfrac{1}{2}.

Marginal Market Share. Next, define two parameters capturing the marginal market share of the informed and uninformed speculators,

(19)   \begin{align*} \xi_I = \beta_I \cdot \lambda \quad \text{and} \quad \xi_U = \beta_U \cdot \lambda. \end{align*}

Here’s how you interpret \xi_I: if noise traders demand 1 additional share, then the quantity traded by each informed speculator increases by \xi_I shares. Likewise, \xi_U captures the amount of additional trading that each uninformed speculator does in response to a 1 share increase in noise-trader demand.

7. Model Solution

Kyle (1989) shows that there exists a unique symmetric linear equilibrium if N \geq 2, M \geq 1, \tau_z^{-1} > 0, and \tau_{\epsilon} > 0. This equilibrium is characterized by a system of 4 equations and 4 unknowns, \{ \, \gamma_I, \, \zeta, \, \xi_I, \, \xi_U \, \}, subject to the constraints that \gamma_I > 0, 0 < \zeta \leq \sfrac{1}{2}, \sfrac{\varphi_U}{N} < \beta_I \cdot \lambda < \sfrac{1}{N}, and 0 < \beta_U \cdot \lambda < \sfrac{(1 - \varphi_U)}{M}.

Equation 1. The first equation is the easiest. If noise traders demand an additional share, then someone has to sell it to them. Informed speculators tend to adjust their demand by \xi_I shares, and uninformed speculators tend to adjust their demand by \xi_U shares. Thus, because there are N informed speculators and M uninformed speculators, we have the market-clearing condition below:

(20)   \begin{align*} 1 &= N \cdot \xi_I + M \cdot \xi_U. \end{align*}

Equation 2. Next, we turn to the second equation, which characterizes the informed speculator’s demand response to price changes, \beta_I, via the endogenous parameter \xi_I,

(21)   \begin{align*} (1 - \zeta) &= (1 - \varphi_I) \times (1 - \xi_I). \end{align*}

This equation links how unresponsive prices are to a \mathdollar 1 increase in an informed speculator’s private signal, (1 - \zeta), to the product of how uninformative prices are about other informed speculators’ signals, (1 - \varphi_I), and how little each informed speculator has to trade in response to a 1 share increase in noise-trader demand, (1 - \xi_I). After all, if informed speculators don’t have to trade that often—i.e., (1 - \xi_I) \approx 1—and prices don’t really reveal much of their private signal to other informed speculators when they do—i.e., (1 - \varphi_I) \approx 1, then prices shouldn’t be moving that much in response to private shocks—i.e., (1 - \zeta) \approx 1.

Equation 3. The third equation is much more directly an equilibrium characterization of \gamma_I,

(22)   \begin{align*} \gamma_I  &= \tau_{\epsilon} \times \left( {\textstyle \frac{1}{\rho}} \right) \times (1 - \varphi_I) \times \left( {\textstyle \frac{1 - 2 \cdot \zeta}{1 - \zeta}} \right). \end{align*}

Informed speculators are going to trade more aggressively in response to a \mathdollar 1 increase in their private signal when their private signal is more precise (i.e., \tau_{\epsilon} is big), when they are closer to risk neutral (i.e., \rho is small), when prices don’t reveal much about their private signal to other informed speculators (i.e., (1 - \varphi_I) \approx 1 because \varphi_I \approx 0), or when prices don’t move much when informed speculators trade on their private information (i.e., \frac{1 - 2 \cdot \zeta}{1 - \zeta} \approx 1 because \zeta \approx 0). Notice that this last effect is second order when \zeta is small.

Equation 4. Finally, let’s have a look at the fourth equation, which characterizes the uninformed speculator’s demand response to price changes, \beta_U, via the endogenous parameter \xi_U,

(23)   \begin{align*} \zeta \cdot \tau_U - \varphi_U \cdot \tau_I  &=  \xi_U  \times  \underset{> 0}{\left( {\textstyle \frac{\zeta \cdot \tau_U}{1 - \xi_U}} +  {\textstyle \frac{\rho \cdot \gamma_I \cdot \tau_I}{\tau_{\epsilon}}} \right)}. \end{align*}

I don’t have any clean way to analyze the right-hand side of this equation, but it is possible to show that the right-hand side will only be 0 if \xi_U = 0—that is, if there are lots of small uninformed speculators. What’s more, we know from Equation (13) that prices will only be an unbiased estimate of the uninformed speculators’ beliefs if:

(24)   \begin{align*} 1 &= \left( {\textstyle \frac{\varphi_U \cdot \tau_{\epsilon}}{\gamma_I \cdot \tau_U}} \right) \times \lambda^{-1}. \end{align*}

If we rearrange the left-hand side of the equation a bit,

(25)   \begin{align*} 0 &= \zeta \cdot \tau_U - \varphi_U \cdot \tau_I \\ \varphi_U \cdot \tau_I &= \zeta \cdot \tau_U  \\ \varphi_U \cdot \tau_I &= \left( {\textstyle \frac{\tau_\epsilon}{\tau_I}} \right)^{-1} \cdot (\lambda \cdot \gamma_I) \cdot \tau_U  \\ \left( {\textstyle \frac{\varphi_U \cdot \tau_{\epsilon}}{\gamma_I \cdot \tau_U}} \right) \times \lambda^{-1} &= 1, \end{align*}

we see that prices can only be unbiased if there are lots of small uninformed speculators, just like in Grossman and Stiglitz (1980). Otherwise, prices overreact—that is, \theta = \left( {\textstyle \frac{\varphi_U \cdot \tau_{\epsilon}}{\gamma_I \cdot \tau_U}} \right) \times \lambda^{-1} < 1.

8. Numerical Analysis

To make sure that I’ve understood how to solve the model correctly, I solved for the equilibrium parameters numerically when \rho=1, N=2, M=1, \tau_{\epsilon}=2, and \tau_v=1 as the precision of noise-trader demand ranges from \tau_z = 0 to \tau_z = 2. You can find the code here. I’ve plotted some of the results below.
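
Here is a minimal sketch of how that numerical solution can be set up, using the parameter values above with \tau_z fixed at 1 for illustration. It substitutes Equations (22), (21), and (20) into Equation (23) to get a single residual in \gamma_I and then finds the root; the bracket passed to brentq is chosen by hand for these parameter values, and the full code linked above sweeps over \tau_z:

import numpy as np
from scipy.optimize import brentq

rho, N, M, tau_e, tau_v, tau_z = 1.0, 2, 1, 2.0, 1.0, 1.0   # tau_z fixed at one value

def system(gamma_I):
    # how much information prices reveal, Equations (10) and (11)
    phi_I = (N - 1) * gamma_I**2 * tau_z / ((N - 1) * gamma_I**2 * tau_z + tau_e)
    phi_U = N * gamma_I**2 * tau_z / (N * gamma_I**2 * tau_z + tau_e)
    tau_I = tau_v + tau_e + phi_I * (N - 1) * tau_e
    tau_U = tau_v + phi_U * N * tau_e
    # Equation (22) pins down zeta given gamma_I, via x = (1 - 2*zeta)/(1 - zeta)
    x = rho * gamma_I / (tau_e * (1 - phi_I))
    zeta = (1 - x) / (2 - x)
    # Equations (21) and (20) then pin down xi_I and xi_U
    xi_I = 1 - (1 - zeta) / (1 - phi_I)
    xi_U = (1 - N * xi_I) / M
    # what is left over is the residual of Equation (23)
    res = zeta * tau_U - phi_U * tau_I \
        - xi_U * (zeta * tau_U / (1 - xi_U) + rho * gamma_I * tau_I / tau_e)
    return zeta, xi_I, xi_U, res

gamma_I = brentq(lambda g: system(g)[-1], 1e-3, 0.55)   # bracket chosen by hand
zeta, xi_I, xi_U, _ = system(gamma_I)
print("gamma_I, zeta, xi_I, xi_U =", gamma_I, zeta, xi_I, xi_U)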

[Figure: Kyle (1989) model solution, endogenous parameters]

[Figure: Kyle (1989) model solution, price-impact coefficients]

[Figure: Kyle (1989) model solution, information-efficiency parameters]


Screening Using False-Discovery Rates

November 5, 2015 by Alex

1. Motivating Example

Jegadeesh and Titman (1993) show that, if you rank stocks according to their returns over the previous 12 months, then the past winners will outperform the past losers by 1.5{\scriptstyle \%} per month over the next 3 months. But, the authors don’t just test this particular strategy. They also test strategies that rank stocks over the previous 3, 6, and 9 months and strategies which hold stocks for the next 6, 9, and 12 months, too. Clearly, if they test enough hypotheses, then some of these tests are going to appear statistically significant by pure chance. To address this concern in the original paper, the authors use the Bonferroni method.

This post shows how to use an alternative method—namely, controlling the false-discovery rate—to identify statistically significant results when testing multiple hypotheses.

2. Bonferroni Method

First, here’s the logic behind the Bonferroni method. Suppose you want to run N \gg 1 different hypothesis tests, \{ \, h_1, \, h_2, \, \ldots, \, h_N \, \}. Let h_n = 0 if the nth null hypothesis is true and h_n = 1 if the nth null hypothesis is false (i.e., should be rejected). Let p_n denote the p-value associated with some test statistic for the nth hypothesis. If there were just one test, then we should simply reject the null whenever p_1 \leq 0.05. But, if there are many tests, then this no longer works. If you look at a lot of hypotheses, then 5{\scriptstyle \%} of the p-values should be less than 0.05 even if h_n = 0 for all of them. The Bonferroni method suggests correcting this problem by lowering the p-value associated with the 5{\scriptstyle \%} significance level and only rejecting the null hypothesis when

(1)   \begin{align*} p_n \leq {\textstyle \frac{1}{N}} \cdot 0.05. \end{align*}

i.e., if there are N = 10 hypothesis tests, then only reject the null at the 5{\scriptstyle \%} significance level when the p-value is less than 0.005 rather than 0.05.

This is a nice start, but it turns out that the Bonferroni method is way too strict. Imagine drawing samples of 10 observations from N different normal distributions. All samples have the same standard deviation, \sigma = 1, but not all of the samples have the same mean. 20{\scriptstyle \%} have a mean of \mu = 1, and the rest have a mean of \mu = 0. The figure below shows that, if we use the Bonferroni method to identify which of the N = 100 samples have a non-zero mean, then we’re only going to choose 2 samples. But, by construction, we know that 20 samples had a non-zero mean! We should be rejecting the null 10 times more often!
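
Here is a sketch of that simulation. I use a two-sided one-sample t-test for each sample, which is an assumption on my part since the example doesn’t commit to a particular test statistic:

import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(0)
N, obs = 100, 10
true_means = np.where(np.arange(N) < 20, 1.0, 0.0)    # 20% of the samples have mu = 1

data = rng.normal(true_means[:, None], 1.0, size=(N, obs))
pvals = ttest_1samp(data, 0.0, axis=1).pvalue

print("naive (p <= 0.05):         ", np.sum(pvals <= 0.05))
print("Bonferroni (p <= 0.05 / N):", np.sum(pvals <= 0.05 / N))   # typically only a handful of the 20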

[Figure: the Bonferroni method is too strict]

3. False-Discovery Rate

Now, let’s talk about false-discovery rates. Define R(0.05) = \sum_{n=1}^N 1_{\{p_n < 0.05\}} as the total number of hypotheses that you reject at the 5{\scriptstyle \%} significance level. Similarly, define R_f(0.05) = \sum_{n=1}^N 1_{\{p_n < 0.05\}} \times 1_{\{ h_n = 0 \}} as the number of hypotheses that you reject at the 5{\scriptstyle \%} significance level where the null was actually true—i.e., these are false rejections. The false-discovery rate is then

(2)   \begin{align*} \mathit{FDR} = \mathrm{E}\left[ \, \sfrac{R_f}{R} \, \right]. \end{align*}

Let’s return to the numerical example above to clarify this definition. Suppose we had a test that identified all 20 cases where the sample mean was \mu = 1, R = 20. If we wanted a false-discovery rate \mathit{FDR} \leq 0.10, then this test could produce at most 2 false rejections, R_f \leq 2. If the test identified only half of the 20 cases where the sample mean was \mu = 1, R = 10, then this test could produce at most one false rejection, R_f \leq 1.

Benjamini and Hochberg (1995) first introduced the idea that you could use the false-discovery rate to adjust statistical-significance tests when exploring multiple hypotheses. Here’s their recipe. First, run all of your tests and order the resulting p-values,

(3)   \begin{align*} p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(n)} \leq \cdots \leq p_{(N)}. \end{align*}

Then, for a given false-discovery rate, x, define n_x as

(4)   \begin{align*} n_x = \max \, \left\{ \, n \in \{1, \, \ldots, \, N\} \, : \, p_{(n)} \leq {\textstyle \frac{n}{N} \cdot x} \, \right\}. \end{align*}

Benjamini and Hochberg (1995) showed that, if you reject any null hypothesis where

(5)   \begin{align*} p_n \leq p_{(n_x)}, \end{align*}

then \mathit{FDR} \leq x, guaranteed. If we apply the false-discovery-rate procedure to the same numerical example from above using the x = 0.05 threshold, then we see that as the number of hypotheses gets large, N \to \infty, the fraction of rejected null hypotheses hovers around 0.06. Improvement! It’s no longer shrinking to 0. Notice that neither method allows us to pick out the full 20{\scriptstyle \%} of null hypotheses that should be rejected in the simulation.
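
Here is a sketch of the Benjamini and Hochberg (1995) recipe applied to simulated data where we know which nulls are false, so we can also compute the realized false-discovery proportion. Using N = 1,000 hypotheses and one-sample t-tests are both choices of mine:

import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(0)
N, obs, x = 1000, 10, 0.05
h = (np.arange(N) < N // 5).astype(int)                 # h_n = 1 for the 20% of false nulls
data = rng.normal(h[:, None] * 1.0, 1.0, size=(N, obs))
pvals = ttest_1samp(data, 0.0, axis=1).pvalue

# Benjamini-Hochberg: largest n with p_(n) <= (n / N) * x, as in Equation (4)
sorted_p = np.sort(pvals)
below = np.nonzero(sorted_p <= (np.arange(1, N + 1) / N) * x)[0]
cutoff = sorted_p[below.max()] if below.size > 0 else 0.0
reject = pvals <= cutoff

print("fraction of nulls rejected:", reject.mean())     # close to the 0.06 mentioned above
print("realized false-discovery proportion:",
      (reject & (h == 0)).sum() / max(reject.sum(), 1)) # typically below x = 0.05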

[Figure: the false-discovery-rate procedure]

4. Why’s This So?

It’s pretty clear how the Bonferroni method works. If there are lots of hypotheses and you are worried about rejecting the null by pure chance, then just make it harder to reject the null. i.e., lower the threshold for significant p-values. It’s much less clear, though, how the false-discovery-rate screening process works. All you get is a recipe and a guarantee. If you do the list of things prescribed by Benjamini and Hochberg (1995), then you’ll falsely reject no more than x{\scriptstyle \%} of your null hypotheses. Let’s explore the proof of this result from Storey, Taylor and Siegmund (2003) to better understand where this guarantee comes from.

What does it mean to have a useful test? Well, if h_n = 0 (the null is true), then p_n is drawn randomly from a uniform distribution, \mathrm{U}(0,\,1). However, if you have a useful test statistic and h_n = 1 (the null is false), then p_n is drawn from some other distribution, \mathrm{A}(0,\,1), that is more concentrated around 0. The distribution of p-values is then given by

(6)   \begin{align*} \mathrm{G} = \pi \cdot \mathrm{U} + (1 - \pi) \cdot \mathrm{A}. \end{align*}

If we reject all p-value less than x, then with a little bit of algebra we can see that

(7)   \begin{align*} \mathit{FDR}(x) =  \mathrm{E}\left[ \,  \frac{ {\textstyle \sum_{n=1}^N} 1_{\{ p_n \leq x\}} \times 1_{\{ h_n = 0 \}} }{ {\textstyle \sum_{n=1}^N} 1_{\{ p_n \leq x \}} }  \, \right] &= \frac{ \mathrm{E}\left[ \,  {\textstyle \sum_{n=1}^N} 1_{\{ p_n \leq x\}} \times 1_{\{ h_n = 0 \}} \, \right] }{ \mathrm{E}\left[ \,  {\textstyle \sum_{n=1}^N} 1_{\{ p_n \leq x \}} \, \right] }  + \mathrm{O}\left( {\textstyle \frac{1}{\sqrt{N}}} \right) \\ &= \frac{ N \cdot \mathrm{Pr}( p_n \leq x | h_n = 0 ) \cdot \mathrm{Pr}(h_n = 0) }{ N \cdot \mathrm{G}(x) }  + \mathrm{O}\left( {\textstyle \frac{1}{\sqrt{N}}} \right) \end{align*}

where the last step uses the fact that R(x) = {\textstyle \sum_{n=1}^N} 1_{\{ p_n \leq x \}} = N \cdot \mathrm{G}(x) and \mathrm{O}( \sfrac{1}{\sqrt{N}}) denotes “big-O” notation. Since p-values are drawn from a uniform distribution when the null hypothesis is true, we know that x = \mathrm{Pr}( p_n \leq x | h_n = 0 ). Thus, we can simplify even further:

(8)   \begin{align*} \mathit{FDR}(x) &= \frac{ x \cdot \mathrm{Pr}(h_n = 0) }{ \mathrm{G}(x) } + \mathrm{O}\left( {\textstyle \frac{1}{\sqrt{N}}} \right). \end{align*}

Now comes the trick. x can be anything we want between 0 and 1. So, let’s choose x as one of the ordered p-values, p_{(n)}. If we do this, then \mathrm{G}(p_{(n)}) = \sfrac{n}{N} and

(9)   \begin{align*} \mathit{FDR}(p_{(n)}) \approx \frac{p_{(n)} \cdot \mathrm{Pr}(h_n = 0) \cdot N}{n} \leq \frac{p_{(n)} \cdot N}{n}. \end{align*}

If we set the right-hand side equal to the false-discovery-rate tolerance, 0.05, and solve for p_{(n)}, then we get the threshold value for p_{(n)} in Benjamini and Hochberg (1995),

(10)   \begin{align*} p_{(n)} = {\textstyle \frac{n}{N}} \times 0.05. \end{align*}

If we only reject hypotheses where p_n \leq {\textstyle \frac{n}{N}} \times 0.05, then our false-discovery rate is capped at 5{\scriptstyle \%}.

