Research Notebook

Randomized Market Trials

July 16, 2014 by Alex

1. Motivation

How much can traders learn from past price signals? It depends on which kinds of assets have been selling. Suppose that returns are (in part) a function of K = \Vert {\boldsymbol \alpha} \Vert_{\ell_0} different feature-specific shocks:

(1)   \begin{align*} r_n &= \sum_{q=1}^Q \alpha_q \cdot x_{n,q} + \epsilon_n \qquad \text{with} \qquad \epsilon_n \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma_{\epsilon}^2) \end{align*}

If {\boldsymbol \alpha} is identifiable, then different values of {\boldsymbol \alpha} have to produce different values of r_n. This is only the case if assets are sufficiently different from one another. e.g., consider the analogy to randomized control trials. In an RCT, randomizing which subjects get thrown in the treatment and control groups makes it exceptionally unlikely that, say, all the people in the treatment group will by chance happen to share some other common trait that actually explains their outcomes. Similarly, randomizing which assets get sold makes it exceptionally unlikely that 2 different choices, {\boldsymbol \alpha} and {\boldsymbol \alpha}', can both explain the observed returns.

This post sketches a quick model relating this problem to housing prices. To illustrate, imagine N = 4 houses have sold at a discount in a neighborhood that looks like this:

[Figure: tract housing]

The shock might reflect a structural change in the vacation home market whereby there is less disposable income to buy high end units—i.e., a permanent shift. Alternatively, the shock might have been due to a couple of out-of-town second house buyers needing to sell quickly—i.e., a transient effect. The houses in the picture above are all vacation homes of a similar quality with owners living in LA. Since there is so little variation across units, both these explanations are observationally equivalent. Thus, the asset composition affects how informative prices are in an important way. The main empirical prediction is that in places with less variation in housing amenities, there should be more price momentum since it’s harder to distinguish between noise and amenity-specific value shocks.

2. Toy Model

Suppose you’ve seen N sales in the area. Most of the prices looked just about right, but some of the houses sold for a bit more than you would have expected and some sold for a bit less than you would have expected. You’re trying to decide whether or not to buy the (N+1)th house if the transaction costs are \mathdollar c today:

(2)   \begin{align*} U &= \max_{\{\text{Buy},\text{Don't}\}} \left\{ \, \mathrm{E}\left[ r_{N+1} \right] - \frac{\gamma}{2} \cdot \mathrm{Var}\left[ r_{N+1} \right] - c, \, 0 \, \right\} \end{align*}

You will buy the house if your risk adjusted expectation of its future returns exceeds the transaction costs, \mathrm{E}[r_{N+1}] - \sfrac{\gamma}{2} \cdot \mathrm{Var}[r_{N+1}] \geq c.

This problem hinges on your ability to estimate {\boldsymbol \alpha}. What’s the best you could ever hope to do? Well, suppose you knew which K features mattered ahead of time and the elements of \mathbf{X} were given by x_{n,q} \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sfrac{1}{K}). In this setting, your average estimation error per relevant feature is given by:

(3)   \begin{align*} \Omega^\star = \mathrm{E}\left[ \, \frac{1}{K} \cdot \sum_{q=1}^Q \left( \widehat{\alpha}_q - \alpha_q \right)^2 \, \right] &= \frac{K \cdot \sigma_{\epsilon}^2}{N} \end{align*}

i.e., it’s as if you ran an OLS regression of the N price changes on the K relevant columns of \mathbf{X}. You will buy the house if:

(4)   \begin{align*} \mathbf{x}_{N+1}^{\top} \widehat{\boldsymbol \alpha} - \frac{\gamma}{2} \cdot \left( \frac{K + N}{N}  \right) \cdot \sigma_{\epsilon}^2 &\geq c \end{align*}
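
As a quick numerical check on Equations (3) and (4), here is a minimal simulation sketch in Python (assuming numpy; the values of N, K, and \sigma_{\epsilon} are illustrative choices of mine rather than numbers from the post). It regresses simulated returns on the K known relevant columns and compares the average squared estimation error to K \cdot \sigma_{\epsilon}^2/N.

    import numpy as np

    rng = np.random.default_rng(0)
    N, K, sigma_eps = 500, 5, 0.05                            # illustrative values
    n_sims = 2000

    errors = []
    for _ in range(n_sims):
        X = rng.normal(0.0, np.sqrt(1.0 / K), size=(N, K))    # x_{n,q} ~ N(0, 1/K), known support
        alpha = rng.normal(size=K)                            # true feature-specific shocks
        r = X @ alpha + rng.normal(0.0, sigma_eps, size=N)    # returns from Equation (1)
        alpha_hat, *_ = np.linalg.lstsq(X, r, rcond=None)     # OLS on the K relevant columns
        errors.append(np.mean((alpha_hat - alpha) ** 2))      # error per relevant feature

    print("simulated error  :", np.mean(errors))
    print("K * sigma^2 / N  :", K * sigma_eps**2 / N)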

In the real world, however, you generally don’t know which K features are important ahead of time, and each house’s amenities are not iid draws. Instead, you must solve an \ell_1-type inference problem:

(5)   \begin{align*} \widehat{\boldsymbol \alpha} &= \arg \min_{\boldsymbol \alpha} \sum_{n=1}^N \left( r_n - \mathbf{x}_n^{\top} {\boldsymbol \alpha} \right)^2 \qquad \text{s.t.} \qquad \left\Vert {\boldsymbol \alpha} \right\Vert_{\ell_1} \leq \lambda \cdot \sigma_{\epsilon} \end{align*}

with a correlated measurement matrix, \mathbf{X}, using something like LASSO. In this setting, you face feature selection risk. i.e., you might focus on the wrong causal explanation for the past price movements. If \Omega^{\perp} denotes your estimation error when each of the elements x_{n,q} is drawn independently and \Omega denotes your estimation error in the general case when \rho(x_{n,q},x_{n',q}) \neq 0, then:

(6)   \begin{align*} \Omega^{\star} \leq \Omega^{\perp} \leq \Omega \end{align*}
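
The ordering in Equation (6) can be illustrated with a short simulation. The sketch below (numpy and scikit-learn; the dimensions, penalty level, and correlation structure are illustrative assumptions of mine, not anything calibrated in the post) fits an \ell_1-penalized regression on an independent-column design and on a design whose columns share a common component, and compares the resulting estimation errors.

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    N, Q, K, sigma_eps = 100, 400, 5, 0.05                     # illustrative values

    def estimation_error(rho, n_sims=200):
        """Average ||alpha_hat - alpha||^2 when any 2 columns of X have correlation rho."""
        errs = []
        for _ in range(n_sims):
            common = rng.normal(size=(N, 1))                   # shared component across features
            X = np.sqrt(rho) * common + np.sqrt(1 - rho) * rng.normal(size=(N, Q))
            alpha = np.zeros(Q)
            alpha[rng.choice(Q, size=K, replace=False)] = 0.1  # K-sparse truth
            r = X @ alpha + rng.normal(0.0, sigma_eps, size=N)
            fit = Lasso(alpha=0.01, fit_intercept=False).fit(X, r)
            errs.append(np.sum((fit.coef_ - alpha) ** 2))
        return np.mean(errs)

    print("independent columns:", estimation_error(rho=0.0))
    print("correlated columns :", estimation_error(rho=0.8))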

Since your estimate \widehat{\boldsymbol \alpha} is unbiased, feature selection risk simply increases \mathrm{Var}[r_{N+1}], making it less likely that you will buy the house in this stylized model:

(7)   \begin{align*} \mathbf{x}_{N+1}^{\top} \widehat{\boldsymbol \alpha} - \frac{\gamma}{2} \cdot \left( K \cdot \Omega + \sigma_{\epsilon}^2 \right) &\geq c \end{align*}

More generally, it will make prices slower to respond to shocks and allow for momentum.

3. Matrix Coherence

Feature selection risk is worst when assets all have really correlated features. Let \mathbf{X} denote the (N \times Q)-dimensional measurement matrix containing all the features of the N houses that have already sold in the market:

(8)   \begin{align*} \mathbf{X} &= \begin{bmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,Q} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,Q} \\ \vdots  & \vdots  & \ddots & \vdots  \\ x_{N,1} & x_{N,2} & \cdots & x_{N,Q} \\ \end{bmatrix} \end{align*}

Each row represents all of the features of the nth house, and each column represents the level to which the N assets display a single feature. Let \widetilde{\mathbf{x}}_q denote a unit-normed column from this measurement matrix:

(9)   \begin{align*} \widetilde{\mathbf{x}}_q &= \frac{\mathbf{x}_q}{\sqrt{\sum_{n=1}^N x_{n,q}^2}} \end{align*}

I use a measure of the coherence of \mathbf{X} to quantify the extent to which all of the assets in a market have similar features.

(10)   \begin{align*} \mu(\mathbf{X}) &= \max_{q \neq q'} \left\vert \left\langle \widetilde{\mathbf{x}}_q, \widetilde{\mathbf{x}}_{q'} \right\rangle \right\vert \end{align*}

e.g., the coherence of a matrix with x_{n,q} \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sfrac{1}{N}) is roughly \sqrt{2 \cdot \log(Q)/N} corresponding to the red line in the figure below. As the correlation between elements in the same column increases, the coherence increases since different terms in the above cross-product are less likely to cancel out.
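
Here is a minimal sketch (numpy; the dimensions are illustrative) of the coherence calculation in Equation (10), compared against the \sqrt{2 \cdot \log(Q)/N} benchmark for an iid Gaussian matrix.

    import numpy as np

    rng = np.random.default_rng(0)
    N, Q = 100, 500                                       # illustrative dimensions

    def mutual_coherence(X):
        """mu(X) = max over q != q' of |<x_q, x_q'>| using unit-normed columns."""
        Xn = X / np.linalg.norm(X, axis=0)                # unit-norm each column
        G = np.abs(Xn.T @ Xn)                             # all pairwise inner products
        np.fill_diagonal(G, 0.0)                          # drop the q = q' terms
        return G.max()

    X = rng.normal(0.0, np.sqrt(1.0 / N), size=(N, Q))    # x_{n,q} ~ N(0, 1/N)
    print("mu(X)           :", mutual_coherence(X))
    print("sqrt(2*log(Q)/N):", np.sqrt(2 * np.log(Q) / N))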

[Figure: mutual coherence of a Gaussian measurement matrix]

4. Selection Risk

There is a tight link between the severity of the selection risk and how correlated asset features are. Specifically, Ben-Haim, Eldar, and Elad (2010) show that if

(11)   \begin{align*} \alpha_{\min} \cdot \left( 1 - \{2 \cdot K - 1\} \cdot \mu(\mathbf{X}) \right) &\geq 2 \cdot \sigma_{\epsilon} \cdot \sqrt{2 \cdot (1 + \xi) \cdot \log(Q)} \end{align*}

for some \xi > 0, then:

(12)   \begin{align*} \sum_{q=1}^Q \left( \widehat{\alpha}_q - \alpha_q \right)^2 &\leq \frac{2 \cdot (1 + \xi)}{(1 - (K-1)\cdot \mu(\mathbf{X}))^2} \times K \cdot \sigma_{\epsilon}^2 \cdot \log(Q) = \Omega \end{align*}

with probability at least:

(13)   \begin{align*} 1 - Q^{-\xi} \cdot \left( \, \pi \cdot (1 + \xi) \cdot \log(Q) \, \right)^{-\sfrac{1}{2}} \end{align*}

where \alpha_{\min} = \min_{q \in \mathcal{K}} \vert \alpha_q \vert denotes the smallest non-zero coefficient in absolute value. Let’s plug in some numbers. If \alpha_{\min} = 0.10 and \sigma_{\epsilon} = 0.05, then the result means that \Vert \widehat{\boldsymbol \alpha} - {\boldsymbol \alpha} \Vert_{\ell_2}^2 is less than 0.185 \times K \cdot \log(Q) with probability \sfrac{3}{4}.

There are a couple of things worth pointing out here. First, the recovery bounds only hold when \mathbf{X} is sufficiently incoherent:

(14)   \begin{align*} \mu(\mathbf{X}) < \frac{1}{2 \cdot K - 1} \end{align*}

i.e., when the assets are too similar, we can’t learn anything concrete about which amenity-specific shocks are driving the returns. Second, the free parameter \xi > 0 links the probability of seeing an error rate outside the bounds, p, to the number of amenities that houses have:

(15)   \begin{align*} \xi &\approx \frac{\log(\sfrac{1}{p}) - \frac{1}{2} \cdot \log\left[ \pi \cdot \log Q \right]}{\sfrac{1}{2} + \log(Q)} \end{align*}

If you want to lower this probability, you need either a larger \xi or fewer amenities. For \xi large enough, we can effectively regard the error bound \Omega as the variance term in Equation (7). Importantly, this quantity is increasing in the coherence of the measurement matrix. i.e., when assets are more similar, I am less sure that I am drawing the correct conclusion from past returns.
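
The relationship between p, \xi, and Q is easy to evaluate numerically. The sketch below (numpy; the target probability p and the values of Q are illustrative choices of mine) computes the approximate \xi from Equation (15) and plugs it back into the exact probability expression in Equation (13) as a consistency check.

    import numpy as np

    def xi_approx(p, Q):
        """Approximate xi from Equation (15)."""
        return (np.log(1 / p) - 0.5 * np.log(np.pi * np.log(Q))) / (0.5 + np.log(Q))

    def prob_within_bounds(xi, Q):
        """Exact probability from Equation (13): 1 - Q^(-xi) * (pi*(1+xi)*log(Q))^(-1/2)."""
        return 1 - Q ** (-xi) * (np.pi * (1 + xi) * np.log(Q)) ** (-0.5)

    for Q in (10**2, 10**4, 10**6):                       # illustrative numbers of amenities
        xi = xi_approx(p=0.05, Q=Q)                       # target a 5% chance of being outside the bounds
        print(Q, round(xi, 3), round(prob_within_bounds(xi, Q), 3))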

5. Empirical Predictions

The main empirical prediction is that in places with less variation in housing amenities, there should be more price momentum since it’s harder to distinguish between noise and amenity-specific value shocks. e.g., imagine studying the price paths of 2 neighborhoods, A and B, which have houses of the exact same value, \mathdollar v. In neighborhood A, each of the houses has a very different collection of amenities whose values sum to \mathdollar v; whereas, in neighborhood B, each of the houses has the exact same amenities whose values sum to \mathdollar v. e.g., you can think about neighborhood A as pre-war and neighborhood B as tract housing. The theory says that the price of houses in neighborhood B should respond slower to amenity-specific value shocks because houses have more correlated amenities—i.e., \Omega is larger. As a result, home prices in neighborhood B should also display more momentum… though this momentum channel is not captured by the toy model above.


Notes: Ang, Hodrick, Xing, and Zhang (2006)

May 15, 2014 by Alex

1. Introduction

In this post I work through the main results in Ang, Hodrick, Xing, and Zhang (2006), which shows not only that i) stocks with more exposure to changes in aggregate volatility have lower average excess returns, but also that ii) stocks with more idiosyncratic volatility relative to the Fama and French (1993) 3-factor model have lower excess returns. The first result is consistent with existing asset pricing theories; whereas, the second result is at odds with almost any mainstream asset pricing theory you might write down. Idiosyncratic risk should not be priced. This paper, together with Campbell, Lettau, Malkiel, and Xu (2001) (see my earlier post), set off an investigation into the role of idiosyncratic risk in determining asset prices. One possibility is that idiosyncratic risk is just a proxy for exposure to aggregate risk. i.e., perhaps it’s the firms with the highest exposure to aggregate return volatility that also have the highest idiosyncratic volatility. Interestingly, Ang, Hodrick, Xing, and Zhang (2006) show that this is not the case via a double sort on both aggregate and idiosyncratic volatility exposure, giving evidence that these are 2 separate risk factors. The code I use to replicate the results in Ang, Hodrick, Xing, and Zhang (2006) and create the figures can be found here.

2. Theoretical Motivation

The discount factor view of asset pricing says that:

(1)   \begin{align*} 0 = \mathrm{E}[m \cdot r_n] \quad \text{for all } n=1,2,\ldots,N \end{align*}

where \mathrm{E}(\cdot) denotes the expectation operator, m denotes the stochastic discount factor, and r_n denotes asset n’s excess return. Equation (1) reads: “In the absence of margin requirements and transactions costs, it costs you \mathdollar 0 today to borrow at the riskless rate, buy a stock, and hold the position for 1 period.” Asset pricing theories explain why average excess returns, \mathrm{E}[r_n], vary across assets even though they all have the same price today by construction (see my earlier post).

Suppose each asset’s excess returns are a function of a risk factor x, \mathrm{R}_n(x), and noise, z_n \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma_z^2):

(2)   \begin{align*} r_n  &= \mathrm{R}_n(x) + z_n \\ &= \mathrm{R}_n(\mu_x) + \mathrm{R}_n'(\mu_x) \cdot (x - \mu_x) + \frac{1}{2} \cdot \mathrm{R}_n''(\mu_x) \cdot (x - \mu_x)^2 + \text{``h.o.t.''} + z_n \\ &\approx \alpha_n + \beta_n \cdot (x - \mu_x) + \frac{\gamma_n}{2} \cdot (x - \mu_x)^2 + z_n \end{align*}

where I assume for simplicity that the only risk factor is the value-weighted excess return on the market so that \mu_x \approx 6{\scriptstyle \%/\mathrm{yr}} and \sigma_x \approx 16{\scriptstyle \%/\mathrm{yr}}. I use a second-order Taylor expansion to approximate the function \mathrm{R}_n(x) around the point x = \mu_x and assume \mathrm{O}(x - \mu_x)^3 terms are negligible so \mathrm{E}[r_n] = \alpha_n + \sfrac{\gamma_n}{2} \cdot \sigma_x^2 and \mathrm{Var}[r_n] = \beta_n^2 \cdot \sigma_x^2 + \sigma_z^2. This means that if the excess return on the market is \sfrac{16{\scriptstyle \%}}{\sqrt{252}} \approx 1{\scriptstyle \%/\mathrm{day}} larger than expected, then asset n’s realized excess return will be roughly \beta_n{\scriptstyle \%} larger.

Any asset pricing theory says that each asset’s expected excess return should be proportional to how much the asset comoves with the risk factor, x:

(3)   \begin{align*} \mathrm{E}[r_n]  = \alpha_n + \frac{\gamma_n}{2} \cdot \sigma_x^2 =  \underbrace{\text{Constant} \times \beta_n}_{\text{Predicted}} \end{align*}

where the constant of proportionality, \text{Constant} = c \cdot (\sfrac{\mathrm{Var}[m]}{\mathrm{E}[m]}), depends on the exact asset pricing model. Equation (3) says that if you ran a regression of each stock’s excess returns on the aggregate risk factor:

(4)   \begin{align*} r_{n,t} = \widehat{\alpha}_n + \widehat{\beta}_n \cdot x_t + \mathit{Error}_{n,t} \end{align*}

then the estimated intercept for each stock should be:

(5)   \begin{align*} \widehat{\alpha}_n = \frac{\gamma_n}{2} \cdot \sigma_x^2 - \beta_n \cdot \mu_x \end{align*}

Thus, each stock’s average excess returns may well be related to its exposure to aggregate volatility since \sigma_x shows up in the expression for \widehat{\alpha}_n; however, idiosyncratic volatility, \sigma_z, better not be priced since it shows up nowhere above.

3. Aggregate Volatility

Ang, Hodrick, Xing, and Zhang (2006) show that stocks with more exposure to aggregate volatility have lower average excess returns. i.e., that the coefficient \gamma_n < 0. The authors actually look at each stock’s exposure to changes in aggregate volatility. To see how this changes the math, consider rewriting the intercept above as:

(6)   \begin{align*} \widehat{\alpha}_n = \mathrm{A}_n(\Delta \sigma_x) &= \alpha_n + \frac{\gamma_n}{2} \cdot \left(\langle \sigma_x \rangle + \Delta \sigma_x \right)^2 \end{align*}

Using this formulation, we can look at how perturbing \mathrm{A}_n(\Delta \sigma_x) around its mean with some small \Delta \sigma_x will impact the estimated intercept:

(7)   \begin{align*} \mathrm{A}_n(\Delta \sigma_x) &= \mathrm{A}_n(0) + \mathrm{A}_n'(0) \cdot \Delta \sigma_x + \cdots \\ &\approx \left[ \alpha_n + \frac{\gamma_n}{2} \cdot \langle\sigma_x\rangle^2 \right] + \gamma_n \cdot \langle\sigma_x\rangle \cdot \Delta \sigma_x \end{align*}

Since \langle \sigma_x \rangle > 0 by definition, (\sfrac{\gamma_n}{2}) \cdot \langle \sigma_x \rangle^2 and \gamma_n \cdot \langle \sigma_x \rangle will have the same sign. Thus, testing for whether exposure to changes in aggregate volatility is priced is tantamount to testing for whether exposure to aggregate volatility is priced.

The authors proceed in 5 steps. First, they construct the time series of changes in aggregate volatility using daily changes in options-implied volatility:

(8)   \begin{align*} \Delta \sigma_{x,d+1} = \mathit{VXO}_{d+1} - \mathit{VXO}_d \qquad \text{with} \qquad  \mathrm{E}[\Delta \sigma_{x,d+1}] = 0.01{\scriptstyle \%}, \, \mathrm{StD}[\Delta \sigma_{x,d+1}] = 2.65{\scriptstyle \%} \end{align*}

If the VXO is 4.33{\scriptstyle \%}, then options markets expect the S&P 100 to move up or down 4.33{\scriptstyle \%} over the next 30 calendar days. The authors use the VXO contract price rather than the VIX contract price because it has a longer time series dating back to 1986. The only difference between the 2 contracts is that the VXO quotes the options implied volatility on the S&P 100; whereas, the VIX quotes the options implied volatility on the S&P 500. Daily changes in the 2 contract prices have a correlation of 0.81 over the sample period from January 1986 to December 2012 as shown in the figure below.

[Figure: daily changes in the VIX vs. the VXO]

Second, the authors compute each stock’s exposure to changes in aggregate volatility by running a regression for each stock n \in \{1,2,\ldots,N\} using the daily data in month (m-1):

(9)   \begin{align*} r_{n,d} = \widehat{\alpha}_n + \widehat{\beta}_{n} \cdot x_d + \widehat{\gamma}_{n} \cdot \Delta \sigma_{x,d} + \mathit{Error}_{n,d} \end{align*}

Estimated coefficients are related to underlying deep parameters by:

(10)   \begin{align*} \widehat{\alpha}_n &= \frac{\gamma_n}{2} \cdot \langle \sigma_x \rangle^2 - \beta_n \cdot \mu_x \\ \widehat{\beta}_n &= \beta_n \\ \widehat{\gamma}_n &= \gamma_n \cdot \langle \sigma_x \rangle \end{align*}

The daily market excess return, x_d, is the excess return on the CRSP value-weighted market index. I include AMEX, NYSE, and NASDAQ stocks with \geq 17 daily observations in month (m-1) in my universe of N stocks.
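
Here is a minimal sketch of the stock-level estimation step in Equation (9), assuming you already have aligned arrays of daily excess returns for one stock, the market, and the daily change in the VXO within month (m-1); the function name, variable names, and toy data below are mine, not the authors’.

    import numpy as np

    def volatility_exposure(r, x, dvxo):
        """Estimate Equation (9): regress one stock's daily excess returns in month (m-1)
        on the market excess return and the change in the VXO. Returns [alpha, beta, gamma]."""
        Z = np.column_stack([np.ones_like(x), x, dvxo])    # intercept, x_d, Delta sigma_{x,d}
        coef, *_ = np.linalg.lstsq(Z, r, rcond=None)
        return coef

    # toy month of daily data with >= 17 observations, just to show the call
    rng = np.random.default_rng(0)
    x = rng.normal(0.0004, 0.01, size=21)                  # daily market excess returns
    dvxo = rng.normal(0.0, 0.0265, size=21)                # daily changes in the VXO
    r = 0.0002 + 1.2 * x - 0.5 * dvxo + rng.normal(0, 0.02, size=21)
    print(volatility_exposure(r, x, dvxo))                 # stocks get sorted on the 3rd coefficient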

[Figure: cumulative returns on the aggregate-volatility-exposure portfolios]

Third, the authors sort the N stocks satisfying the data constraints in month (m-1) into 5 value-weighted portfolios based on their estimated \widehat{\gamma}_{n}. Note that because the factor \langle \sigma_x \rangle is common to all stocks in month (m-1), this sort effectively organizes stocks by their true exposure to aggregate volatility, \gamma_n. For each portfolio j \in \{\text{L},2,3,4,\text{H}\} with j = \text{L} denoting the stocks with the lowest aggregate volatility exposure and j = \text{H} denoting the stocks with the highest aggregate volatility exposure, the authors then calculate the daily portfolio returns in month m. The figure above shows the cumulative returns to each of these 5 portfolios. It reads that if you invested \mathdollar 1 in the low aggregate volatility exposure portfolio in January 1986, then you would have over \mathdollar 200 more by December 2012 than if you had invested that same \mathdollar 1 in the high aggregate volatility exposure portfolio. What’s more, each portfolio’s exposure to the excess return on the market does not explain its performance. The figure below reports the estimated intercepts for each j \in \{\text{L},2,3,4,\text{H}\} from the regression:

(11)   \begin{align*} r_{j,m} = \widehat{\alpha}_j + \widehat{\beta}_j \cdot x_m + \mathit{Error}_{j,m} \end{align*}

and indicates that abnormal returns are decreasing in the portfolio’s exposure to aggregate volatility.

[Figure: CAPM alphas by aggregate-volatility-exposure portfolio (AHXZ (2006), Table 1)]

Fourth, in order to test whether the spread in portfolio abnormal returns is actually explained by contemporaneous exposure to aggregate volatility, the authors then create an aggregate volatility factor mimicking portfolio. They estimate the regression below using the daily excess returns on each of the 5 aggregate volatility exposure portfolios in each month m:

(12)   \begin{align*} \Delta \sigma_{x,d} = \widehat{\kappa} + \sum_{j=\text{L}}^{\text{H}} \widehat{\lambda}_{j} \cdot r_{j,d} + \mathit{Error}_d \end{align*}

and store the parameter estimates for \begin{bmatrix} \widehat{\lambda}_1 & \widehat{\lambda}_2 & \widehat{\lambda}_3 & \widehat{\lambda}_4 & \widehat{\lambda}_5 \end{bmatrix}^{\top}. They then define the factor mimicking portfolio return at the daily horizon in month m as:

(13)   \begin{align*}  f_d = \sum_{j=\text{L}}^{\text{H}} \widehat{\lambda}_{j} \cdot r_{j,d} \end{align*}
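
Here is a minimal sketch (numpy; the toy inputs are mine) of the factor mimicking construction in Equations (12) and (13): regress the daily changes in aggregate volatility on the 5 portfolio returns, and then use the fitted slopes as portfolio weights.

    import numpy as np

    def mimicking_portfolio(dvxo, R):
        """dvxo: (D,) daily changes in aggregate volatility; R: (D, 5) daily excess returns on
        the 5 volatility-exposure portfolios. Returns (lambda_hat, f_d) from Equations (12)-(13)."""
        Z = np.column_stack([np.ones(len(dvxo)), R])       # kappa plus the 5 slopes
        coef, *_ = np.linalg.lstsq(Z, dvxo, rcond=None)
        lam = coef[1:]                                     # lambda_hat_L, ..., lambda_hat_H
        f = R @ lam                                        # Equation (13): factor mimicking return
        return lam, f

    # toy data just to show the shapes involved
    rng = np.random.default_rng(0)
    R = rng.normal(0.0, 0.01, size=(21, 5))
    dvxo = R @ np.array([-0.2, -0.1, 0.0, 0.1, 0.2]) + rng.normal(0, 0.02, size=21)
    lam, f = mimicking_portfolio(dvxo, R)
    print(lam.round(2), f[:3].round(4))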

The figure below plots the factor mimicking portfolio returns against the underlying changes in aggregate volatility at the monthly level. The 2 data series line up relatively closely; however, the factor mimicking portfolio is much too volatile during crises such as Black Monday in 1987.

[Figure: aggregate volatility factor-mimicking portfolio vs. changes in aggregate volatility]

Fifth and finally, the authors check whether or not the returns on each of the 5 aggregate volatility exposure portfolios are positively correlated with contemporaneous movements in the aggregate volatility factor mimicking portfolio at the monthly horizon. To do this, they cumulate daily excess returns on the factor mimicking portfolio and the aggregate volatility exposure sorted portfolios to get monthly returns:

(14)   \begin{align*} f_m &= \sum_{d=1}^{22} f_d \\ r_{j,m} &= \sum_{d=1}^{22} r_{j,d} \quad \text{for all } j \in \{\text{L},2,3,4,\text{H}\} \end{align*}

Then, they run the regression below at a monthly horizon over the full sample:

(15)   \begin{align*} r_{j,m} = \widehat{\zeta}_j + \widehat{\eta}_j \cdot x_m  + \widehat{\theta}_j \cdot f_m + \mathit{Error}_{j,m} \end{align*}

I report the estimated \widehat{\theta}_j coefficients in the figure below. Consistent with the idea that exposure to aggregate volatility is driving the disparate excess returns of the 5 test portfolios, I find that each portfolio loads positively on monthly movements in the factor mimicking portfolio.

[Figure: factor-mimicking portfolio loadings by aggregate-volatility-exposure portfolio (AHXZ (2006), Table 1)]

4. Idiosyncratic Volatility

Ang, Hodrick, Xing, and Zhang (2006) also show that stocks with more idiosyncratic volatility have lower average excess returns. This should not be true under the standard theory outlined in Section 2 above. To measure idiosyncratic volatility, the authors run the regression below at the daily level in month (m-1) for each stock n = 1,2,\ldots,N:

(16)   \begin{align*} r_{n,d} = \widehat{\alpha}_n + \widehat{\boldsymbol \beta}_n^{\top} \cdot \mathbf{x}_d + \mathit{Error}_{n,d} \end{align*}

where the risk factors are the excess return on the value weighted market portfolio, the excess return on a size portfolio, and the excess return on a value portfolio as dictated by Fama and French (1993):

(17)   \begin{align*} \mathbf{x}_d^{\top} = \begin{bmatrix} r_{\mathrm{Mkt},d} & r_{\mathrm{SmB},d} & r_{\mathrm{HmL},d} \end{bmatrix} \end{align*}

For each stock listed on the AMEX, NYSE, or NASDAQ stock exchange with \geq 17 daily observations in month (m-1), the authors then calculate the measure of idiosyncratic volatility below:

(18)   \begin{align*} \sigma_{z,n} &= \mathrm{StD}[\mathit{Error}_{n,d}] \end{align*}
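
A minimal sketch of Equations (16), (17), and (18), assuming you have daily excess returns for one stock and the 3 Fama and French (1993) factors within month (m-1); the variable names and toy numbers are mine.

    import numpy as np

    def idiosyncratic_vol(r, mkt, smb, hml):
        """Residual standard deviation from a daily FF3 regression in month (m-1), Equation (18)."""
        Z = np.column_stack([np.ones_like(r), mkt, smb, hml])
        coef, *_ = np.linalg.lstsq(Z, r, rcond=None)
        resid = r - Z @ coef
        return resid.std(ddof=Z.shape[1])                  # sigma_{z,n}

    # toy month of daily data, just to show the call
    rng = np.random.default_rng(0)
    mkt, smb, hml = rng.normal(0, 0.01, size=(3, 21))
    r = 1.1 * mkt + 0.3 * smb - 0.2 * hml + rng.normal(0, 0.02, size=21)
    print(idiosyncratic_vol(r, mkt, smb, hml))             # stocks get sorted on this value each month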

[Figure: cumulative returns on the idiosyncratic-volatility portfolios]

The authors sort the N stocks satisfying the data constraints in month (m-1) into 5 value-weighted portfolios based on their estimated \sigma_{z,n} values. The figure above reports the cumulative returns to these 5 test portfolios. The figure reads that if you invested \mathdollar 1 in the low idiosyncratic volatility portfolio in January 1963, then you would have over \mathdollar 100 more in December 2012 than if you had invested in the high idiosyncratic volatility portfolio. The figure below reports the estimated abnormal returns, \widehat{\alpha}_j, for each of the idiosyncratic volatility portfolios over the full sample and confirms that the poor performance of the high idiosyncratic volatility portfolio cannot be explained by exposure to common risk factors.

[Figure: CAPM alphas by idiosyncratic-volatility portfolio (AHXZ (2006), Table 6)]

5. Are They Related?

I conclude by discussing the obvious follow-up question: “Are these 2 phenomena related?” After all, it could be the case that the firms with the highest exposure to aggregate return volatility also have the highest idiosyncratic volatility and vice versa. Ang, Hodrick, Xing, and Zhang (2006) show that this is not the case via a double sort. i.e., they show that within each aggregate volatility exposure portfolio, the stocks with the lowest idiosyncratic volatility outperform the stocks with the highest idiosyncratic volatility. Similarly, they show that within each idiosyncratic volatility portfolio, the stocks with the lowest aggregate volatility exposure outperform the stocks with the highest aggregate volatility exposure. Thus, the motivation driving investors to pay a premium for stocks with high aggregate volatility exposure is different from the motivation driving investors to pay a premium for stocks with high idiosyncratic volatility.

[Figure: CAPM alphas by R^2-sorted portfolio]

Indeed, you can pretty much guess this fact from the cumulative return plots in Sections 3 and 4 where the red lines denoting the low exposure portfolios behave in completely different ways. e.g., the low aggregate volatility exposure portfolio returns behave more or less like the high aggregate volatility exposure portfolio returns but with a higher mean. By contrast, the low idiosyncratic volatility portfolio returns are a much different time series with dramatically less volatility. Interestingly, if the authors sort on total volatility in month (m-1) rather than idiosyncratic volatility, then the results are identical; however, the results do not carry through if you sort on R^2 in month (m-1). e.g., suppose you ran the same regression at the daily level in month (m-1) for each stock n = 1,2,\ldots,N:

(19)   \begin{align*} r_{n,d} = \widehat{\alpha}_n + \widehat{\boldsymbol \beta}_n^{\top} \cdot \mathbf{x}_d + \mathit{Error}_{n,d} \end{align*}

where the risk factors are the excess return on the value weighted market portfolio, the excess return on a size portfolio, and the excess return on a value portfolio as dictated by Fama and French (1993):

(20)   \begin{align*} \mathbf{x}_d^{\top} = \begin{bmatrix} r_{\mathrm{Mkt},d} & r_{\mathrm{SmB},d} & r_{\mathrm{HmL},d} \end{bmatrix} \end{align*}

Then, for each stock you computed the R^2 statistic measuring the fraction of the total variation in each stock’s excess returns that is explained by movements in the risk factors:

(21)   \begin{align*} R^2 &= 1 - \frac{\sum_{d=1}^{22}(r_{n,d} - \{\widehat{\alpha}_n + \widehat{\boldsymbol \beta}_n^{\top} \mathbf{x}_d\})^2}{\sum_{d=1}^{22}(r_{n,d} - \langle r_{n,d} \rangle)^2} \end{align*}

If you group stocks into 5 portfolios based on their R^2 over the previous month, the figure above shows that there is no monotonic spread in the abnormal returns. Thus, the idiosyncratic volatility results seem to be more about volatility and less about the explanatory power of the Fama and French (1993) factors.


Using the Cross-Section of Returns

May 12, 2014 by Alex

1. Introduction

The empirical content of the discount factor view of asset pricing can all be derived from the equation below:

(1)   \begin{align*} 0 = \mathrm{E}[m \cdot r_n] \quad \text{for all } n=1,2,\ldots,N \end{align*}

where m denotes the prevailing stochastic discount factor and r_n denotes an asset’s excess return. Equation (1) reads: “In the absence of margin requirements and transactions costs, it costs you \mathdollar 0 today to borrow at the riskless rate, buy a stock, and hold the position for 1 period.” The question is then why average excess returns, \mathrm{E}[r_n], vary across the N assets even though they all have the same price today by construction.

The answer hinges on the behavior of the stochastic discount factor, m, in Equation (1). What is this thing? Everyone knows that it is better to have \mathdollar 1 today than \mathdollar 1 tomorrow, and the present value of an asset that pays out \mathdollar 1 tomorrow is called the discount factor. Sometimes important stuff will happen in the next 24 hours that changes how awesome it is to have an additional \mathdollar 1 tomorrow. As a result, the realized discount factor is a random variable each period (i.e., follows a stochastic process). e.g., if agents have utility, \mathrm{U}_0 = \mathrm{E}_0 \sum_{t \geq 0} e^{-\rho \cdot t} \cdot c_t^{1-\theta}, then the stochastic discount factor is m = e^{-\rho - \theta \cdot \Delta \log c} and the stuff (i.e., risk factor) is changes in log consumption.

[Figure: asset pricing theory schematic]

An asset pricing model is a machine which takes as inputs a) each agent’s preferences, b) each agent’s information, and c) a list of the relevant risk factors affecting how agents discount the future and produces a stochastic discount factor as its output. In this post, I show how to test an asset pricing model using the cross-section of asset returns. i.e., by linking how average excess returns vary across assets to each asset’s exposure to the risk factors governing the behavior of the stochastic discount factor.

2. Theoretical Predictions

The key to massaging Equation (1) into a form that can be taken to the data is noticing that for any 2 random variables u and v, the following identity holds:

(2)   \begin{equation*}  \mathrm{E}[u\cdot v] = \mathrm{Cov}[u,v] + \mathrm{E}[u] \cdot \mathrm{E}[v] \end{equation*}

Thus, if I let u denote the stochastic discount factor and v denote any of the N excess returns, I can link the expected excess return from holding an asset to its covariance with the stochastic discount factor:

(3)   \begin{align*} \mathrm{E}[r_n] &= \frac{\mathrm{Cov}[m, r_n]}{\mathrm{Var}[m]} \cdot \left( - \frac{\mathrm{Var}[m]}{\mathrm{E}[m]} \right) \end{align*}

The first term is dimensionless and represents the amount of exposure asset n has to the risk factor x. The second term has dimension \sfrac{1}{\Delta t}, is common across all assets, and represents the price of exposure to the risk factor x since it has the same units as the expected return \mathrm{E}[r_n]. Asset pricing theories say that each asset’s expected return should be proportional to the market-wide price of risk where the constant of proportionality is the asset’s “exposure” to that risk factor.

What does “exposure” mean here? To answer this question I need to put a bit more structure on the stochastic discount factor, m, and the excess return, r_n. I remain agnostic about which asset pricing model actually governs returns and which risk factors affect discount rates, but to avoid writing out lots of messy matrices I do assume that there is only a single factor, x, with \mathrm{E}[x] = \mu_x and \mathrm{Var}[x] = \sigma_x^2. I then write the stochastic discount factor as the sum of a function of x, \mathrm{M}(x), and some noise, y \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma_y^2):

(4)   \begin{align*} m &= \mathrm{M}(x) + y \\ &= \mathrm{M}(\mu_x) + \mathrm{M}'(\mu_x) \cdot (x - \mu_x) + \frac{1}{2} \cdot \mathrm{M}''(\mu_x) \cdot (x - \mu_x)^2 + \text{``h.o.t.''} + y \\ &\approx \phi + \chi \cdot (x - \mu_x) + \frac{\psi}{2} \cdot (x - \mu_x)^2 + y \end{align*}

where I use a second-order Taylor expansion to approximate the function \mathrm{M}(x) around the point x = \mu_x and assume terms of order \mathrm{O}(x - \mu_x)^3 are negligible so that \mathrm{E}[m] = \phi + \sfrac{\psi}{2} \cdot \sigma_x^2 and \mathrm{Var}[m] = \chi^2 \cdot \sigma_x^2 + \sigma_y^2. This means that if the risk factor is \sigma_x larger than expected, (x - \mu_x) = \sigma_x, then agents value having an additional \mathdollar 1 tomorrow \chi \cdot \sigma_x more than usual. Similarly, suppose each excess return is the sum of an asset-specific function of x, \mathrm{R}_n(x), and some asset-specific noise, z_n \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma_z^2):

(5)   \begin{align*} r_n  &= \mathrm{R}_n(x) + z_n \\ &= \mathrm{R}_n(\mu_x) + \mathrm{R}_n'(\mu_x) \cdot (x - \mu_x) + \frac{1}{2} \cdot \mathrm{R}_n''(\mu_x) \cdot (x - \mu_x)^2 + \text{``h.o.t.''} + z_n \\ &\approx \alpha_n + \beta_n \cdot (x - \mu_x) + \frac{\gamma_n}{2} \cdot (x - \mu_x)^2 + z_n \end{align*}

where I use a second-order Taylor expansion to approximate the function \mathrm{R}_n(x) around the point x = \mu_x and assume \mathrm{O}(x - \mu_x)^3 terms are negligible so that \mathrm{E}[r_n] = \alpha_n + \sfrac{\gamma_n}{2} \cdot \sigma_x^2 and \mathrm{Var}[r_n] = \beta_n^2 \cdot \sigma_x^2 + \sigma_z^2. This means that if the risk factor is \sigma_x larger than expected, (x - \mu_x) = \sigma_x, then asset n’s realized excess returns will be \beta_n \cdot \sigma_x larger than average.

Plugging Equations (4) and (5) into Equation (3) then shows exactly what “exposure” to the risk factor means:

(6)   \begin{equation*} \begin{split} \mathrm{E}[r_n] &= \frac{\mathrm{Cov}[m,r_n]}{\mathrm{Var}[m]} \cdot \left( - \, \frac{\mathrm{Var}[m]}{\mathrm{E}[m]} \right) \\ &= \frac{\chi \cdot \beta_n \cdot \sigma_x^2}{\chi^2 \cdot \sigma_x^2 + \sigma_y^2} \cdot \left( - \, \frac{\chi^2 \cdot \sigma_x^2 + \sigma_y^2}{\phi + \frac{\psi}{2} \cdot \sigma_x^2} \right) \\ &= - \, \left( \frac{\chi \cdot \sigma_x^2}{\phi + \frac{\psi}{2} \cdot \sigma_x^2} \right) \cdot \beta_n \\ &= \text{Constant} \times \beta_n \end{split} \end{equation*}

Each asset’s exposure to the risk factor x is summarized by the coefficient \beta_n. When \chi > 0, so that the stochastic discount factor is high exactly when the risk factor is high, assets which have higher realized returns in those states (have a large \beta_n) will have lower average returns (high prices) since these assets are good hedges against the risk factor. i.e., these assets look like insurance. Equation (1)’s empirical content is then that an asset’s average excess return, \langle r_n \rangle, is proportional to its exposure to the risk factor, \beta_n, where the constant of proportionality is the same for all assets:

(7)   \begin{align*} \mathrm{E}[r_n]  = \underbrace{\alpha_n + \frac{\gamma_n}{2} \cdot \sigma_x^2}_{\text{Realized } \langle r_n \rangle} =  \underbrace{- \, \left( \frac{\chi \cdot \sigma_x^2}{\phi + \frac{\psi}{2} \cdot \sigma_x^2} \right) \cdot \beta_n}_{\text{Predicted}} \end{align*}

By letting y,z_n \searrow 0 we can interpret this relationship as a realization of the first Hansen-Jagannathan bound:

(8)   \begin{align*} \frac{\mathrm{StD}[m_{t+1}]}{\mathrm{E}[m_{t+1}]} = \frac{\chi \cdot \sigma_x}{\phi + \frac{\psi}{2} \cdot \sigma_x^2} = \frac{\alpha_n + \frac{\gamma_n}{2} \cdot \sigma_x^2}{\beta_n \cdot \sigma_x} = \left| \frac{\mathrm{E}[r_{n,t+1}]}{\mathrm{StD}[r_{n,t+1}]} \right| \end{align*}

3. Empirical Strategy

To test Equation (7), an econometrician has to estimate (2 \cdot N + 2) unknown parameters:

(9)   \begin{align*} \widehat{\boldsymbol \theta} = \begin{bmatrix} \widehat{\mu}_x & \widehat{\alpha}_1 & \cdots & \widehat{\alpha}_N & \widehat{\beta}_1 & \cdots & \widehat{\beta}_N & \widehat{\lambda} \end{bmatrix}^{\top} \end{align*}

using T periods of observations. i.e., 2 parameters for each asset (its average excess returns and its factor exposure) as well as 2 market-wide parameters (the risk factor mean and the market price of risk). There are (3 \cdot N + 1) equations to estimate these parameters with via GMM so that the system is over-identified whenever there are N > 1 assets:

(10)   \begin{align*} \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \\ 0 \\ \vdots \\ 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}  &= \mathrm{E}[\mathrm{G}(\widehat{\boldsymbol \theta};\mathbf{r}_t,x_t)] = \mathrm{E} \begin{bmatrix} x_t - \widehat{\mu}_x \\ r_{1,t} - \left\{ \widehat{\alpha}_1 + \widehat{\beta}_1 \cdot (x_t - \widehat{\mu}_x) \right\} \\ \vdots \\ r_{N,t} - \left\{ \widehat{\alpha}_N + \widehat{\beta}_N \cdot (x_t - \widehat{\mu}_x) \right\} \\ \left( r_{1,t} - \left\{ \widehat{\alpha}_1 + \widehat{\beta}_1 \cdot (x_t - \widehat{\mu}_x) \right\} \right) \cdot (x_t - \widehat{\mu}_x) \\ \vdots \\ \left( r_{N,t} - \left\{ \widehat{\alpha}_N + \widehat{\beta}_N \cdot (x_t - \widehat{\mu}_x) \right\} \right) \cdot (x_t - \widehat{\mu}_x) \\ r_{1,t} - \widehat{\beta}_1 \cdot \widehat{\lambda} \\ \vdots \\ r_{N,t} - \widehat{\beta}_N \cdot \widehat{\lambda} \end{bmatrix} \end{align*}

The first equation pins down the mean of the factor x. The following (2 \cdot N) equations identify the \{\widehat{\alpha}_n,\widehat{\beta}_n\}_{n \in N} parameters governing the relationship between the risk factor and each asset’s excess returns. The final N equations pin down the market price of risk, \widehat{\lambda}, for exposure to the risk factor x. A risk is “priced” if \widehat{\lambda} \neq 0.
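
The moment conditions in Equation (10) are straightforward to stack in code. Below is a minimal sketch (numpy and scipy; a 2-asset toy example of my own construction) that forms the sample analogue of the (3 \cdot N + 1) moments and minimizes its squared norm, i.e., first-stage GMM with an identity weighting matrix.

    import numpy as np
    from scipy.optimize import least_squares

    rng = np.random.default_rng(0)
    N, T = 2, 600
    x = rng.normal(0.05, 0.16, size=T)                          # the single risk factor
    beta_true, lam_true = np.array([0.8, 1.3]), 0.04
    r = np.outer(x - 0.05, beta_true) + beta_true * lam_true + rng.normal(0, 0.10, size=(T, N))

    def sample_moments(theta):
        """Sample analogue of E[G(theta; r_t, x_t)] from Equation (10)."""
        mu_x, alpha, beta, lam = theta[0], theta[1:1+N], theta[1+N:1+2*N], theta[-1]
        e = r - (alpha + beta * (x[:, None] - mu_x))            # time-series residuals
        return np.concatenate([
            [np.mean(x - mu_x)],                                # pins down mu_x
            e.mean(axis=0),                                     # N intercept moments
            (e * (x[:, None] - mu_x)).mean(axis=0),             # N slope moments
            (r - beta * lam).mean(axis=0),                      # N pricing moments: pin down lambda
        ])

    theta0 = np.concatenate([[0.0], np.zeros(N), np.ones(N), [0.0]])
    fit = least_squares(sample_moments, theta0)                 # identity weighting matrix
    print(fit.x.round(3))                                       # [mu_x, alpha_1..N, beta_1..N, lambda]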

Note that this empirical strategy doesn’t pin down every single one of the parameters governing the relationship between the stochastic discount factor and each asset’s excess returns. e.g., the parameter estimates \widehat{\alpha}_n and \widehat{\lambda} are composites of several deep parameters:

(11)   \begin{align*} \widehat{\alpha}_n &= \alpha_n + \frac{\gamma_n}{2} \cdot \sigma_x^2 \\ \widehat{\lambda} &= - \, \left( \frac{\chi \cdot \sigma_x^2}{\phi + \frac{\psi}{2} \cdot \sigma_x^2} \right) \end{align*}

The underlying parameters \alpha_n and \gamma_n as well as \phi, \chi, and \psi are not identifiable from this approach since they satisfy conservation laws which leave the estimates for \widehat{\alpha}_n and \widehat{\lambda} unchanged:

(12)   \begin{align*} \frac{\partial \widehat{\alpha}_n}{\partial \alpha_n} \cdot \Delta \alpha_n + \frac{\partial \widehat{\alpha}_n}{\partial \gamma_n} \cdot \Delta \gamma_n = 0 &= \Delta \alpha_n + \frac{\sigma_x^2}{2} \cdot \Delta \gamma_n \\ \frac{\partial \widehat{\lambda}}{\partial \phi} \cdot \Delta \phi + \frac{\partial \widehat{\lambda}}{\partial \chi} \cdot \Delta \chi + \frac{\partial \widehat{\lambda}}{\partial \psi} \cdot \Delta \psi = 0 &= \left( \frac{\chi}{\phi + \frac{\psi}{2} \cdot \sigma_x^2} \right) \cdot \{\Delta \phi + \frac{\sigma_x^2}{2} \cdot \Delta \psi\} - \Delta \chi \end{align*}

e.g., if you increase \alpha_n by \epsilon \approx 0^+ and decrease \gamma_n by \frac{2}{\sigma_x^2} \cdot \epsilon, then the estimate of \widehat{\alpha}_n remains unchanged.

4. Time Scale Considerations

There is a hidden assumption floating around behind the empirical strategy outlined in Section 3 above. Namely, that each asset’s factor exposure is constant and the market price of risk is constant. In practice, this is surely not the case as is documented in Jagannathan and Wang (1996) and Lewellen and Nagel (2006). OK… so constant factor exposures and prices of risk are an approximation. Fine. How good/bad an approximation is it? e.g., Fama and MacBeth (1973) use rolling T = 60 month windows to estimate each asset’s \widehat{\beta}_n. Is this too long a window relative to how much factor exposures vary over time? Alternatively, should we be using a longer window to more accurately pin down these parameters? It turns out that the estimation strategy gives some guidance about the relationship between the optimal estimation window and parameter persistence, which I discuss below.

First, I model the evolution of the true parameters. To test an asset pricing model using the cross-section of excess returns, we are interested in knowing whether or not \widehat{\lambda} = 0. Suppose the true market price of risk, \lambda, follows a random walk:

(13)   \begin{align*} \lambda_T = \lambda + \sum_{t=1}^T l_t \end{align*}

where l_t \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma_l^2) so that the final \lambda_T is a random variable with distribution:

(14)   \begin{align*}  \lambda_T \sim \mathrm{N}(\lambda, T \cdot \sigma_l^2) \end{align*}

Second, I note that the estimation strategy outlined in Section 3 above gives a signal, \widehat{\lambda}, about the average market price of risk with distribution:

(15)   \begin{align*} \widehat{\lambda} \sim \mathrm{N}\left(\lambda, \sfrac{\sigma_s^2}{T}\right) \end{align*}

where s_t \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma_s^2) denotes estimation error from the GMM procedure. There is an additional complication to consider. Namely, if the true market price of risk is floating around during the estimation period, it will add additional noise to the parameter estimates and increase \sigma_s^2. To keep things simple, suppose that nature sets the market price of risk to \lambda at the beginning of the estimation sample and it remains constant during estimation period. Then, \lambda_T is revealed at the end of time T and prevails afterwards. This will mean that the derivations below will be inequalities due to the underestimate of \sigma_s^2.

What I really care about is the distance between the true \lambda_T at the end of the sample which governs the market going forward and the GMM estimate of \widehat{\lambda}. Thus, I should choose the sample period length, T, to minimize:

(16)   \begin{align*} T  = \arg \min_{T \geq 0} \mathrm{E}\left[ (\lambda_T - \widehat{\lambda})^2 \right] = \arg \min_{T \geq 0} \mathrm{E}\left[ (\lambda_T - \lambda)^2 + (\lambda - \widehat{\lambda})^2 \right] \end{align*}

As a result, to find the optimal T I take the first order condition:

(17)   \begin{align*} 0 = \frac{d}{dT} \left[ T \cdot \sigma_l^2 + \left(\frac{1}{\sigma_{\lambda}^2} + \frac{T}{\sigma_s^2} \right)^{-1} \right] \end{align*}

where \sigma_{\lambda}^2 denotes the variance of my priors about the market price of risk governing the estimation sample \lambda. The solution to this equation defines the window length, T, which optimally trades off the benefit of getting a more precise estimate of \lambda with the cost of decreasing the relevance of this estimate due to the evolution of \lambda_T.

GMM maps \sigma_s^2 onto a parameter of the underlying model. To keep things simple, suppose there is only 1 asset and 4 unknown parameters:

(18)   \begin{align*} \widehat{\boldsymbol \theta} = \begin{bmatrix} \widehat{\mu}_x & \widehat{\alpha} & \widehat{\beta} & \widehat{\lambda} \end{bmatrix}^{\top} \end{align*}

so that the system of estimation equations reduces to:

(19)   \begin{align*} \begin{pmatrix} 0 \\ 0 \\ 0 \\ 0 \end{pmatrix}  &= \mathrm{E}[\mathrm{G}(\widehat{\boldsymbol \theta};r_t,x_t)] = \mathrm{E} \begin{bmatrix} x_t - \widehat{\mu}_x \\ r_t - \left\{ \widehat{\alpha} + \widehat{\beta} \cdot (x_t - \widehat{\mu}_x) \right\} \\ \left( r_t - \left\{ \widehat{\alpha} + \widehat{\beta} \cdot (x_t - \widehat{\mu}_x) \right\} \right) \cdot (x_t - \widehat{\mu}_x) \\ r_t - \widehat{\beta} \cdot \widehat{\lambda} \end{bmatrix} \end{align*}

This assumption means that I don’t have to consider how learning about one asset affects my beliefs about another asset. In this world, if x_t \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(\mu_x,\sigma_x^2), then GMM reduces to OLS and \sigma_s^2 = \sfrac{\sigma_z^2}{\beta_n^2} since:

(20)   \begin{align*} r_{n,t} = \beta_n \cdot \lambda  + \beta_n \cdot (x_t - \mu_x) + z_{n,t} \end{align*}

Evaluating the first order condition then gives:

(21)   \begin{align*} 0 = \sigma_l^2 - \left(\frac{1}{\sigma_{\lambda}^2} + \frac{T}{\sfrac{\sigma_z^2}{\beta_n^2}} \right)^{-2} \cdot \frac{1}{\sfrac{\sigma_z^2}{\beta_n^2}} \end{align*}

Solving for T yields:

(22)   \begin{align*} T &\geq \max\left\{ \, 0, \, \frac{\sigma_z}{\beta_n \cdot \sigma_l} - \frac{\sigma_z^2}{\beta_n^2 \cdot \sigma_{\lambda}^2} \, \right\} \end{align*}

Let’s plug in some values to make sure this formula makes sense. First, notice that if the market price of risk is constant, \lambda_T = \lambda, then \sigma_l = 0 and you should pick T = \infty or as large as possible. Second, notice that if you already know the true \lambda, then \sigma_{\lambda}^2 = 0 and you should pick T = 0. Finally, notice that if the test asset has no exposure to the risk factor, \beta_n = 0, then the equation is undefined since any window length gives you the same amount of information—i.e., none.
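
These comparative statics are easy to check directly. Below is a minimal sketch (numpy; the parameter values are mine) of the window-length formula in Equation (22), with the corner solution at T = 0 made explicit.

    import numpy as np

    def optimal_window(sigma_z, beta, sigma_l, sigma_lambda):
        """Window length from Equation (22), imposing T >= 0."""
        if beta == 0:
            return np.nan                       # no factor exposure: no window length is informative
        if sigma_l == 0:
            return np.inf                       # constant price of risk: use the longest window possible
        interior = sigma_z / (beta * sigma_l) - sigma_z**2 / (beta**2 * sigma_lambda**2)
        return max(0.0, interior)

    print(optimal_window(sigma_z=0.10, beta=1.0, sigma_l=0.005, sigma_lambda=0.05))   # interior solution
    print(optimal_window(sigma_z=0.10, beta=1.0, sigma_l=0.0,   sigma_lambda=0.05))   # T = infinity
    print(optimal_window(sigma_z=0.10, beta=1.0, sigma_l=0.005, sigma_lambda=1e-9))   # T = 0: lambda already known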


Phase Change in High-Dimensional Inference

April 2, 2014 by Alex

1. Introduction

In my paper Feature Selection Risk (2014), I study a problem where assets have Q \gg 1 different attributes and traders try to identify which K \ll Q of these attributes matter via price changes:

(1)   \begin{align*} \Delta p_n &= p_n - \mathrm{E}[p_n] = \sum_{q=1}^Q \beta_q \cdot x_{n,q} + \epsilon_n \qquad \text{where} \qquad K = \Vert {\boldsymbol \beta} \Vert_{\ell_0} = \sum_{q=1}^Q 1_{\{ \beta_q \neq 0 \}} \notag \end{align*}

with each asset’s exposure to a given attribute given by x_{n,q} \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,1) and the noise given by \epsilon_n \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma^2). In the limit as K,N,Q \to \infty, \sfrac{K}{Q} \to 0, (N - K) \cdot \beta \to \infty, and \beta = \sfrac{1}{\sqrt{K}}, there exist both a signal opacity bound, N_O, and a signal recovery bound, N_R:

(2)   \begin{align*} N_O \sim K \cdot \log \left( \frac{Q}{N_O} \right) \qquad \text{and} \qquad N_R \sim K \cdot \log \left( \frac{Q}{K} \right) \notag \end{align*}

with N_O \leq N_R in units of transactions. I explain what I mean by “\sim” in Section 4 below. These 2 thresholds separate the regions where traders are arbitrarily bad at identifying the shocked attributes (i.e., N < N_O) from the regions where traders can almost surely identify the shocked attributes (i.e., N > N_R). i.e., if traders have seen fewer than N_O transactions, then they have no idea which shocks took place; whereas, if traders have seen more than N_R transactions, then they can pinpoint exactly which shocks took place.

In this post, I show that the signal opacity and recovery bounds become arbitrarily close in a large market. The analysis in this post primarily builds on work done in Donoho and Tanner (2009) and Wainwright (2009).

2. Motivating Example

This sort of inference problem pops up all the time in financial settings. Suppose you moved away from Chicago a year ago, and now you’re moving back and looking for a house. When studying a list of recent sales prices, you find yourself a bit surprised. People seem to have changed their preferences for 1 of 7 different amenities: ^{(1)}a 2 car garage, ^{(2)}a third bedroom, ^{(3)}a half-circle driveway, ^{(4)}granite countertops, ^{(5)}energy efficient appliances, ^{(6)}central A/C, or ^{(7)}a walk-in closet? The mystery amenity is raising the sale price of some houses by \beta > 0 dollars. How many sales do you need to see in order to figure out which of the 7 amenities realized the shock?

The answer is 3. How did I arrive at this number? Suppose you found one house with amenities \{1,3,5,7\}, a second house with amenities \{2, 3, 6, 7\}, and a third house with amenities \{4, 5, 6,7\}. The combination of the price changes for these 3 houses reveals exactly which amenity has been shocked. i.e., if only the first house’s price was too high, \Delta p_1 = p_1 - \mathrm{E}[p_1] = \beta, then Chicagoans must have changed their preferences for 2 car garages:

(3)   \begin{equation*}     \begin{bmatrix} \Delta p_1 \\ \Delta p_2 \\ \Delta p_3 \end{bmatrix}      =     \begin{bmatrix} \beta \\ 0 \\ 0 \end{bmatrix}      =      \begin{bmatrix}        1 & 0 & 1 & 0 & 1 & 0 & 1        \\        0 & 1 & 1 & 0 & 0 & 1 & 1        \\        0 & 0 & 0 & 1 & 1 & 1 & 1      \end{bmatrix}     \begin{bmatrix}        \beta \\ 0 \\ \vdots \\ 0      \end{bmatrix} \end{equation*}

By contrast, if \Delta p_1 = \Delta p_2 = \Delta p_3 = \beta, then people must value walk-in closets more than they did a year ago.

Here’s the key point. The problem changes character at N_R = 3 observations. 3 sales give just enough information to distinguish among the 7 candidate amenities plus the possibility of no change at all: 7 = 2^3 - 1. N = 4 sales simply narrows your error bars around the exact value of \beta. N = 2 sales only allows you to distinguish between subsets of amenities. e.g., seeing just the first and second houses with unexpectedly high prices only tells you that people like either half-circle driveways or walk-in closets more… not which one.
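
Here is a minimal sketch (numpy) of the decoding logic behind Equation (3): each column of the 3 \times 7 design is a distinct non-zero binary pattern, so the pattern of which price changes are non-zero identifies the shocked amenity, or the absence of any shock.

    import numpy as np

    # rows: the 3 observed houses; columns: the 7 amenities, as in Equation (3)
    X = np.array([[1, 0, 1, 0, 1, 0, 1],
                  [0, 1, 1, 0, 0, 1, 1],
                  [0, 0, 0, 1, 1, 1, 1]])
    beta = 1.0                                              # size of the preference shock

    def which_amenity(dp):
        """Match the pattern of non-zero price changes to the unique column that produces it."""
        pattern = (np.abs(dp) > 1e-12).astype(int)
        matches = np.flatnonzero((X == pattern[:, None]).all(axis=0))
        return None if len(matches) == 0 else matches[0] + 1    # amenity number, or None for "no shock"

    shocked = 4                                             # suppose granite countertops were shocked
    dp = beta * X[:, shocked - 1]                           # implied price changes for the 3 houses
    print(which_amenity(dp))                                # recovers 4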

Yet, the dimensionality in this toy example can be confusing. There is obviously something different about the problem at N_R = 3 observations, but there is still some information contained in the first N = 2 observations. e.g., even though you can’t tell exactly which attribute realized a shock, you can narrow down the list of possibilities to 2 attributes out of 7. If you just flipped a coin and guessed after seeing N = 2 transactions, you would have an error rate of 50{\scriptstyle \%}. This is no longer true in higher dimensions. i.e., even in the absence of any noise, seeing any fraction (1 - \alpha) \cdot N_R of the required observations for \alpha \in (0,1) will leave you with an error rate that is within a tiny neighborhood of 100{\scriptstyle \%} as the number of attributes gets large.

3. Non-Random Analysis

I start by exploring how the required number of observations, N_R, moves around as I increase the number of attributes in the setting where there is only K = 1 shock and the data matrix is non-random. Specifically, I look at the case where K = 1 and Q = 15. My goal is to build some intuition about what I should expect in the more complicated setting where the data \mathbf{X} is a random matrix. Here, in this simple setting, the ideal data matrix would be (4 \times 15)-dimensional and look like:

(4)   \begin{equation*} \underset{4 \times 15}{\mathbf{X}} =  \left[ \begin{matrix}  1 & 0 & 1 & 0 & 1 & 0 & 1  \\ 0 & 1 & 1 & 0 & 0 & 1 & 1 \\  0 & 0 & 0 & 1 & 1 & 1 & 1 \\  0 & 0 & 0 & 0 & 0 & 0 & 0 \end{matrix} \ \ \ \begin{matrix} 0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 & 0 & 0 & 1 & 1  \\ 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1  \end{matrix} \right] \end{equation*}

where each column of the data matrix corresponds to a number q=1,2,\ldots,15 in binary.

Let S(N) be a function that eats N observed price changes and spits out the set of possible preference changes that might explain the observed price changes. e.g., if traders only see the 1st transaction, then they can only place the shock in 1 of 2 sets containing 8 attributes each:

(5)   \begin{align*} S(1) &=  \begin{cases} \{ 1, 3, 5, 7, 9, 11, 13, 15 \}         &\text{if } \Delta p_1 = \beta \\ \{ \emptyset, 2, 4, 6, 8, 10, 12, 14 \} &\text{if } \Delta p_1 = 0 \end{cases} \end{align*}

The 2nd transaction then allows traders to split each of these 2 larger sets into 2 smaller ones and place the shock in a set of 4 possibilities:

(6)   \begin{align*} S(2) =  \begin{cases} \{ \emptyset, 4, 8, 12 \} &\text{if } \Delta \mathbf{p} = \begin{bmatrix} 0 & 0 \end{bmatrix}^{\top} \\ \{ 1, 5, 9, 13 \} &\text{if } \Delta \mathbf{p} = \begin{bmatrix} \beta & 0 \end{bmatrix}^{\top} \\ \{ 2, 6, 10, 14 \} &\text{if } \Delta \mathbf{p} = \begin{bmatrix} 0 & \beta \end{bmatrix}^{\top}  \\ \{ 3, 7, 11, 15 \} &\text{if } \Delta \mathbf{p} = \begin{bmatrix} \beta & \beta \end{bmatrix}^{\top}  \end{cases} \end{align*}

With the 3rd transaction, traders can tell that the actual shock is either of 2 possibilities:

(7)   \begin{align*} S(3) =  \begin{cases} \{ \emptyset, 8 \} &\text{if } \Delta \mathbf{p} = \begin{bmatrix} 0 & 0 & 0 \end{bmatrix}^{\top} \\ \{ 1, 9 \} &\text{if }         \Delta \mathbf{p} = \begin{bmatrix} \beta & 0 & 0\end{bmatrix}^{\top} \\ \{ 2, 10 \} &\text{if }         \Delta \mathbf{p} = \begin{bmatrix} 0 & \beta & 0 \end{bmatrix}^{\top}  \\ \{ 3, 11 \} &\text{if }         \Delta \mathbf{p} = \begin{bmatrix} \beta & \beta & 0 \end{bmatrix}^{\top}  \\ \{ 4, 12 \} &\text{if }         \Delta \mathbf{p} = \begin{bmatrix} 0 & 0 & \beta \end{bmatrix}^{\top} \\ \{ 5, 13 \} &\text{if }         \Delta \mathbf{p} = \begin{bmatrix} \beta & 0 & \beta \end{bmatrix}^{\top} \\ \{ 6, 14 \} &\text{if }         \Delta \mathbf{p} = \begin{bmatrix} 0 & \beta & \beta \end{bmatrix}^{\top}  \\ \{ 7, 15 \} &\text{if }         \Delta \mathbf{p} = \begin{bmatrix} \beta & \beta & \beta \end{bmatrix}^{\top}  \end{cases} \end{align*}

The N_R = 4th observation then closes the case against the offending attribute.

Here’s the key observation. Only the difference (N_R - N) matters when computing the size of the output of S(N). If traders have seen N = (N_R - 1) transactions, then they can tell which subset of 2 = 2^1 attributes has realized a shock. If traders have seen N = (N_R - 2) transactions, then they can tell which subset of 4 = 2^2 attributes has realized a shock. If traders have seen N = (N_R - 3) observations, then they can tell which subset of 8 = 2^3 attributes has realized a shock. Thus, after seeing any number of observations N \leq N_R, traders can place the shock in a set of size 2^{N_R - N}. i.e., a trader has the same amount of information about which attribute has realized a shock in (i) a situation where N_R = 100 and he’s seen N = 99 transactions as in (ii) a situation where N_R = 3 and he’s seen N = 2 transactions.

The probability that traders select the correct attribute after seeing only N \leq N_R observations is given by 2^{-(N_R - N)} assuming uniform priors. Natural numbers are hard to work with analytically, so let’s suppose that traders observe some fraction of the required number of observations N_R. i.e., for some \alpha \in (0,1) traders see N = (1 - \alpha) \cdot N_R observations. We can then perform a change of variables 2^{- (N_R - N)} = e^{- \alpha \cdot \log(2) \cdot N_R} and answer the question: “How much does getting 1 additional observation improve traders’ error rate?”

(8)   \begin{align*} \frac{1}{N_R} \cdot \frac{d}{d\alpha}\left[ \, 2^{- (N_R - N)} \, \right] = - \log(2) \cdot e^{- \alpha \cdot \log(2) \cdot N_R} \end{align*}

I plot this statistic for N_R ranging from 100 to 800 below. When N_R = 100, a trader’s predictive power doesn’t start to improve until he sees 95 transactions (i.e., 95{\scriptstyle \%} of N_R); by contrast, when N_R= 800 a trader’s predictive power doesn’t start to improve until he’s seen N = 792 transactions (i.e., 99{\scriptstyle \%} of N_R). Here’s the punchline. As I scale up the original toy example from 7 attributes to 7 million attributes, traders effectively get 0 useful information about which attributes realized a shock until they come within a hair’s breadth of the signal recovery bound N_R. The opacity and recovery bounds are right on top of one another.
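
For completeness, here is a minimal sketch (numpy) of the statistic in Equation (8) evaluated on a grid of \alpha for several values of N_R; this is the calculation behind the figure below.

    import numpy as np

    def marginal_improvement(alpha, N_R):
        """(1/N_R) * d/d(alpha) of 2^{-(N_R - N)} with N = (1 - alpha) * N_R, from Equation (8)."""
        return -np.log(2) * np.exp(-alpha * np.log(2) * N_R)

    alphas = np.linspace(0.0, 0.10, 11)                     # fraction of N_R still unseen
    for N_R in (100, 200, 400, 800):
        vals = marginal_improvement(alphas, N_R)
        # improvement is essentially 0 unless alpha is within a few multiples of 1/N_R of zero
        print(N_R, np.round(vals[:4], 4))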

[Figure: marginal improvement in prediction error per additional observation]

4. Introducing Randomness

Previously, the matrix of attributes was strategically chosen so that the set of N observations that traders see would be as informative as possible. Now, I want to relax this assumption and allow the data matrix to be random with elements x_{n,q} \overset{\scriptscriptstyle \text{iid}}{\sim} \mathrm{N}(0,1):

(9)   \begin{align*} \Delta p_n = \sum_{q=1}^Q \beta_q \cdot x_{n,q} + \epsilon_n \qquad \text{for each }  n = 1,2,\ldots,N \end{align*}

where \epsilon_n \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma_{\epsilon}^2) denotes idiosyncratic shocks affecting asset n in units of dollars. For a given triplet of integers (K,N,Q) with 0 < K < N < Q, I want to know whether solving the linear program:

(10)   \begin{align*} \widehat{\boldsymbol \beta} = \min \left\Vert {\boldsymbol \beta} \right\Vert_{\ell_1} \qquad \text{subject to} \qquad \mathbf{X} {\boldsymbol \beta} = \Delta \mathbf{p} \end{align*}

recovers the true \boldsymbol \beta when it is K-sparse. i.e., when \boldsymbol \beta has only K non-zero entries K = \sum_{q=1}^Q 1_{\{ \beta_q \neq 0 \}}. Since N < Q the linear system is underdetermined; however, if the level of sparsity is sufficiently high (i.e., K is sufficiently small), then there will be a unique solution with high probability.
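
Here is a minimal sketch (numpy and scipy; the dimensions are illustrative) of the noiseless version of the linear program in Equation (10), written as a standard LP by splitting {\boldsymbol \beta} into its positive and negative parts.

    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(0)
    K, N, Q = 5, 60, 200                                    # illustrative sizes with K < N < Q

    X = rng.normal(size=(N, Q))                             # x_{n,q} ~ N(0, 1)
    beta = np.zeros(Q)
    beta[rng.choice(Q, size=K, replace=False)] = 1 / np.sqrt(K)
    dp = X @ beta                                           # noiseless price changes

    # min ||b||_1  s.t.  X b = dp, with b = b_plus - b_minus and b_plus, b_minus >= 0
    c = np.ones(2 * Q)                                      # objective: sum of b_plus + b_minus
    A_eq = np.hstack([X, -X])
    res = linprog(c, A_eq=A_eq, b_eq=dp, bounds=(0, None))
    beta_hat = res.x[:Q] - res.x[Q:]
    print(np.allclose(beta_hat, beta, atol=1e-4))           # True when N is comfortably above N_R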

First, I study the case where there is no noise (i.e., where \sigma_\epsilon^2 \searrow 0), and I ask: “What is the minimum number of observations needed to identify the true \boldsymbol \beta with probability (1 - \eta) for \eta \in (0,1) using the linear program in Equation (10)?” I remove the noise to make the inference problem as easy as possible for traders. Thus, the proposition below which characterizes this minimum number of observations gives a lower bound. I refer to this number of observations as the signal opacity bound and write it as N_O. The proposition shows that, whenever traders have seen N < N_O observations, I can make traders’ error rate arbitrarily bad (i.e., \eta \nearrow 1) by increasing the number of attributes (i.e., Q \nearrow \infty).

Proposition (Donoho and Tanner, 2009): Suppose \sfrac{K}{N} = \rho, \sfrac{N}{Q} = \delta, and N \geq N_0 with \rho, \delta \in (0,1). The linear program in Equation (10) will recover {\boldsymbol \beta} a fraction (1 - \eta) of the time whenever:

(11)   \begin{align*} N > 2 \cdot K \cdot \log\left( \frac{Q}{N} \right) \cdot \left( 1 - R(\eta;N,Q) \right)^{-1} \end{align*}

where R(\eta;N,Q) = 2 \cdot \sqrt{\sfrac{1}{N} \cdot \log\left( 4 \cdot \sfrac{(Q + 2)^6}{\eta} \right)}.

Next, I turn to the case where there is noise (i.e., where \sigma_\epsilon^2 > 0), and I ask: “How many observations do traders need to see in order to identify the true \boldsymbol \beta with probability 1 in an asymptotically large market using the linear program in Equation (10)?” Define traders’ error rate after seeing N observations as:

(12)   \begin{align*}   \mathrm{Err}[N] &= \frac{1}{{Q \choose K}} \cdot \sum_{\substack{\mathcal{K} \subseteq \mathcal{Q} \\ |\mathcal{K}| = K}} \mathrm{Pr}(\widehat{\boldsymbol \beta} \neq {\boldsymbol \beta}) \end{align*}

where \mathcal{Q} = \{1,2,\ldots,Q\} denotes the set of all attributes and \mathrm{Pr}(\widehat{\boldsymbol \beta} \neq {\boldsymbol \beta}) denotes the probability that the linear program in Equation (10) chooses the wrong subset of attributes (i.e., makes an error) given the true support \mathcal{K}, averaging over both the measurement noise, {\boldsymbol \epsilon}, and the choice of the Gaussian attribute exposure matrix, \mathbf{X}. Traders’ error rate is the average of these probabilities over every possible shock set of size K. Traders identify the true {\boldsymbol \beta} with probability 1 in an asymptotically large market if:

(13)   \begin{align*}   \lim_{\substack{K,N,Q \to \infty \\ \sfrac{K}{Q} \to 0}} \mathrm{Err}[N] &= 0 \end{align*}

Thus, the proposition below, which characterizes this number of observations, gives an upper bound of sorts. I refer to this number of observations as the signal recovery bound and write it as N_R. i.e., the proposition shows that, whenever traders have seen N > N_R observations, they will be able to recover \boldsymbol \beta almost surely no matter how large I make the market.

Proposition (Wainwright, 2009): Suppose K,N,Q \to \infty, \sfrac{K}{Q} \to 0, (N - K) \cdot \beta \to \infty, and \beta = \sfrac{1}{\sqrt{K}}. Then traders can identify the true {\boldsymbol \beta} with probability 1 in an asymptotically large market if, for some constant a > 0:

(14)   \begin{align*} N &> a \cdot K \cdot \log (\sfrac{Q}{K}) \end{align*}

The only cognitive constraint that traders face is that their selection rule must be computationally tractable. Under minimal assumptions, a convex optimization program such as the \ell_1-minimization in Equation (10) is computationally tractable in the sense that the computational effort required to solve it to a given accuracy grows moderately with the dimensions of the problem; by contrast, Natarajan (1995) shows that the analogous \ell_0-constrained problem (i.e., best subset selection) is NP-hard. This cognitive constraint is really weak in the sense that any selection rule you might look up in an econometrics or statistics textbook (e.g., forward stepwise regression or the LASSO) is computationally tractable. After all, it has to be executed on a computer.
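The LASSO is also the estimator that Wainwright’s result actually concerns, so here is a hedged sketch of the noisy problem using scikit-learn. The constant a = 4, the noise level, and the penalty value are my own illustrative assumptions (note that sklearn’s alpha multiplies the \ell_1 penalty against a residual sum of squares scaled by \sfrac{1}{2N}), and the final line simply reports how well the support was recovered rather than asserting a result.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
K, Q, sigma = 10, 2_000, 0.25
N = int(4 * K * np.log(Q / K))                    # N > a * K * log(Q/K) with a = 4 (assumed)
X = rng.standard_normal((N, Q))
beta = np.zeros(Q)
support = rng.choice(Q, size=K, replace=False)
beta[support] = 1.0 / np.sqrt(K)                  # minimum signal size from the proposition
dp = X @ beta + sigma * rng.standard_normal(N)    # noisy price changes

penalty = 2.0 * sigma * np.sqrt(np.log(Q) / N)    # rough universal-threshold-style choice (assumed)
fit = Lasso(alpha=penalty, fit_intercept=False, max_iter=10_000).fit(X, dp)
est = set(np.flatnonzero(np.abs(fit.coef_) > 1e-8))
print("missed shocks:", len(set(support) - est), "| false positives:", len(est - set(support)))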

5. Discussion

plot--opacity-vs-recovery-bound-absolute-gap

What is really interesting is that the signal opacity bound, N_O, and the signal recovery bound, N_R, basically sit right on top of one another as the market gets large, just as you would expect from the analysis in Section 3. The figure above plots each bound on a log-log scale for varying levels of sparsity, and it’s clear from the figure that the bounds are quite close. The figure below plots the relative gap between these 2 bounds:

(15)   \begin{align*} \frac{N_R - N_O}{N_R} \end{align*}

i.e., it plots how big the gap is relative to the size of the signal recovery bound N_R. For each level of sparsity, the gap shrinks as I add more and more attributes. This is the same result as in the figure from Section 3: as the size of the market increases, traders learn next to nothing from each successive observation until they get within an inch of the signal recovery bound. The only difference here is that now there are an arbitrary number of shocks and the data matrix is random.

plot--opacity-vs-recovery-bound-relative-gap


Intra-Industry Lead-Lag Effect

March 19, 2014 by Alex

1. Introduction

Hou (2007) documents a really interesting phenomenon in asset markets. Namely, if the largest securities in an industry as measured by market capitalization perform really well in the current week, then the smallest securities in that industry tend to do well in the subsequent 2 weeks. However, the reverse relationship does not hold. i.e., if the smallest securities in an industry do well in the current week, this tells you next to nothing about how the largest securities in that industry will do in the subsequent weeks. This effect has a characteristic time scale of 1 to 2 weeks and varies substantially across industries.

In this post, I replicate the main finding, provide some robustness checks, and then relate the result to the analysis in my paper Feature Selection Risk (2014).

2. Data Description

I use monthly and daily CRSP data from June 1963 to December 2001 to recreate Hou (2007), Table 1. I also replicate the same results using a different industry classification system over the period from January 2000 to December 2013. I look at securities traded on the NYSE, AMEX, and NASDAQ stock exchanges. I restrict the sample to include only securities with share codes 10 or 11. i.e., I exclude things like ADRs, closed-end funds, and REITs. I calculate weekly returns by compounding daily returns from the Wednesday of week (t-1) through the Tuesday of week t:

(1)   \begin{align*} \tilde{r}_{n,t} &= (1 + \tilde{r}_{n,\text{W}_{t-1}}) \cdot (1 + \tilde{r}_{n,\text{Th}_{t-1}}) \cdot (1 + \tilde{r}_{n,\text{F}_{t-1}}) \cdot (1 + \tilde{r}_{n,\text{M}_t}) \cdot (1 + \tilde{r}_{n,\text{Tu}_t}) - 1 \end{align*}
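Here is a minimal pandas sketch of this compounding step, assuming a daily CRSP extract with (hypothetical) columns permno, date, and ret; the W-TUE frequency groups each Wednesday-through-Tuesday run of trading days into the week that ends on the Tuesday.

import pandas as pd

crsp = pd.read_csv("crsp_daily.csv", parse_dates=["date"])    # hypothetical file name
crsp["week"] = crsp["date"].dt.to_period("W-TUE")             # weekly periods ending on Tuesday
weekly = (
    crsp.groupby(["permno", "week"])["ret"]
        .apply(lambda r: (1.0 + r).prod() - 1.0)              # compound daily returns, then net out
        .rename("ret_weekly")
        .reset_index()
)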

I classify firms into industries in 2 different ways. In order to replicate Hou (2007), Table 1 I use the 12 industry classification system from Ken French’s website. This classification system is nice in the sense that it is built on SIC codes and can thus be extended back to the 1920s. However, the classification system that everyone in the financial industry actually uses is GICS. As a result, I also assign each firm to 1 of the 24 GICS industry groups.

I assign each firm to either an SIC or GICS industry based on its reported code in the monthly CRSP data as of the previous June. e.g., if I was looking at Apple, Inc. in September 2005, then I would assign Apple to its industry as of June 2005; whereas, if I was looking at Apple in May 2005, then I would assign Apple to its industry as of June 2004. I use N_{i,y} to denote the number of securities in industry i in year y. In each of the figures below, I report the average number of firms in each industry on an annual basis over the sample period:

(2)   \begin{align*} \langle N_{i,y} \rangle &= \left\lfloor \frac{1}{Y} \cdot \sum_{y=1}^Y N_{i,y} \right\rfloor \end{align*}

e.g., when replicating the results in Hou (2007), Table 1 I compute the average number of firms using Y = 64 June observations.

Each June I also sort the securities in each industry i by their market cap. After sorting, I then construct an equally weighted portfolio of the largest 30{\scriptstyle \%} of stocks in each industry and the smallest 30{\scriptstyle \%} of stocks in each industry:

(3)   \begin{align*} \tilde{r}_{i,t}^B &= \frac{1}{N_{i,y_t}^{30\%}} \cdot \sum_{n=1}^{N_{i,y_t}^{30\%}} \tilde{r}_{n,t} \qquad \text{and} \qquad \tilde{r}_{i,t}^S = \frac{1}{N_{i,y_t} - N_{i,y_t}^{70\%}} \cdot \sum_{n=N_{i,y_t}^{70\%} + 1}^{N_{i,y_t}} \tilde{r}_{n,t} \end{align*}

In the analysis below, I look at the relationship between the weekly returns of these 2 portfolios over the subsequent year. Note that these are within-industry sorts. e.g., stocks in the “big” portfolio of the consumer durables industry might be smaller than stocks in the “small” portfolio of the telecommunications industry.
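Here is a rough sketch of the within-industry size sort (column names are hypothetical, not taken from the gist below): each June it ranks firms by market cap inside their industry and tags the top and bottom 30{\scriptstyle \%}; the weekly portfolio return is then just an equal-weighted mean within each (industry, week, bucket) cell.

import pandas as pd

def tag_size_buckets(june_caps: pd.DataFrame) -> pd.DataFrame:
    """june_caps: one row per (permno, industry) as of June, with a mktcap column."""
    out = june_caps.copy()
    pct = out.groupby("industry")["mktcap"].rank(pct=True)
    out["bucket"] = pd.cut(pct, bins=[0.0, 0.3, 0.7, 1.0], labels=["S", "M", "B"])
    return out

# After merging `bucket` onto the weekly return panel from above:
# ports = weekly.groupby(["industry", "week", "bucket"])["ret_weekly"].mean().unstack("bucket")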

Here is the code I use to pull the data from WRDS and create the figure in Section 3 below: gist.

3. Hou (2007), Table 1

Table 1 in Hou (2007) reports the cross-autocorrelations of the big and small intra-industry portfolios defined above. To estimate these statistics, I first normalize the big and small portfolio weekly returns so that they each have a mean of 0 and a standard deviation of 1:

(4)   \begin{align*} \mu_B &= \mathrm{E}[\tilde{r}_{i,t}^B] \qquad \text{and} \qquad \sigma_B = \mathrm{StDev}[\tilde{r}_{i,t}^B] \\ \mu_S &= \mathrm{E}[\tilde{r}_{i,t}^S] \qquad \text{and} \qquad \sigma_S = \mathrm{StDev}[\tilde{r}_{i,t}^S] \\ r_{i,t}^B &= \frac{\tilde{r}_{i,t}^B - \mu_B}{\sigma_B} \qquad \text{and} \qquad r_{i,t}^S = \frac{\tilde{r}_{i,t}^S - \mu_S}{\sigma_S} \end{align*}

Then, to estimate the correlation between the returns of the big portfolio in week t and the subsequent returns of the small portfolio in week (t + l) I run the regression:

(5)   \begin{align*} r_{i,t+l}^S &= \beta(l) \cdot r_{i,t}^B + \epsilon_{i,t+l} \qquad \text{for } l = 0,1,2,\ldots,6 \end{align*}

and estimate \beta(l) = \mathrm{Cor}[r_{i,t}^B,r_{i,t+l}^S]. Similarly, to estimate the correlation between the returns of the small portfolio in week t and the subsequent returns of the big portfolio in week (t + l) I run the regression:

(6)   \begin{align*} r_{i,t+l}^B &= \gamma(l) \cdot r_{i,t}^S + \epsilon_{i,t+l} \qquad \text{for } l = 0,1,2,\ldots,6 \end{align*}

to estimate \gamma(l) = \mathrm{Cor}[r_{i,t}^S,r_{i,t+l}^B]. The advantage of this approach over estimating a simple correlation matrix is that you can read off the standard errors from the regression results rather than rely on asymptotic results.
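As a sketch, the lag-by-lag regressions in Equations (5) and (6) can be run with a small OLS helper like the one below (statsmodels, no intercept since both series are already standardized; the panel layout and column names are my own assumptions).

import statsmodels.api as sm

def cross_autocorr(panel, x_col, y_col, max_lag=6):
    """panel: DataFrame sorted by (industry, week) with columns 'industry', x_col, y_col (standardized)."""
    results = {}
    for l in range(max_lag + 1):
        y = panel.groupby("industry")[y_col].shift(-l)        # e.g., r^S_{i,t+l}
        x = panel[x_col]                                      # e.g., r^B_{i,t}
        keep = y.notna()
        fit = sm.OLS(y[keep], x[keep]).fit()
        results[l] = (fit.params.iloc[0], fit.bse.iloc[0])    # point estimate and its standard error
    return results

# beta_hats = cross_autocorr(panel, x_col="r_B", y_col="r_S")    # Equation (5)
# gamma_hats = cross_autocorr(panel, x_col="r_S", y_col="r_B")   # Equation (6)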

hou-2007-table-1

The figure above gives the results of these regressions using data from January 1963 to December 2001. The solid blue and red lines give the point estimates for \beta(l) and \gamma(l) respectively at lags of l=0,1,2,\ldots,6 weeks. The shaded regions around the solid lines are the 95{\scriptstyle \%} confidence intervals around these point estimates. e.g., the panel in the upper left-hand corner reports that when the largest securities in the consumer non-durables industry realize a return that is 1 standard deviation above its mean in week t, the smallest securities in the consumer non-durables industry realize a return that is roughly 0.30 standard deviations above their mean in week (t+1). By contrast, the smallest consumer non-durables securities have no predictive power over the future returns of their larger cousins.

4. Robustness Checks

The above results are quite interesting, but no one really uses the Ken French industry classification system when trading. The industry standard is GICS. The figure below replicates these same results over the period from January 2000 to December 2013 using the GICS codes. The results are similar, but slightly less pronounced. This replication suggests that shocks to the largest securities in an industry take roughly 2 weeks to fully propagate out to the smallest securities in the same industry.

hou-2007-table-1--gics

An obvious follow-up question is: “Is there something special about the largest firms in an industry? Or, is this cross-autocorrelation a statistical effect?” One way to shed light on this question is to look at the predictive power of the largest 10{\scriptstyle \%} of securities in each industry as opposed to the largest 30{\scriptstyle \%}:

(7)   \begin{align*} \tilde{r}_{i,t}^B &= \frac{1}{N_{i,y_t}^{10\%}} \cdot \sum_{n=1}^{N_{i,y_t}^{10\%}} \tilde{r}_{n,t} \qquad \text{and} \qquad \tilde{r}_{i,t}^S = \frac{1}{N_{i,y_t} - N_{i,y_t}^{70\%}} \cdot \sum_{n=N_{i,y_t}^{70\%} + 1}^{N_{i,y_t}} \tilde{r}_{n,t} \end{align*}

If there is something fundamental about size, we should expect to see an even more pronounced disparity between the predictive power of the big and small portfolios. However, the figure below shows that looking at the predictive power of the really large firms (if anything) weakens the effect. It’s definitely not more pronounced.

hou-2007-table-1--gics-10pct

5. Conclusion

What’s going on here? If size isn’t the root explanation, what is? In my paper Feature Selection Risk, I propose that the true culprit is not size, but rather the number of plausible shocks that might explain a firm’s returns. e.g., Apple might have really low stock returns in the current week for all sorts of reasons: bad product release, news about factory conditions, raw materials price shock, etc… Only some of these shocks will be relevant for other firms in the industry. It takes a while to parse Apple’s bad returns and figure out how you should extrapolate to other firms. By contrast, there are many fewer ways for a small firm’s returns to go very badly in the space of a few days, and often the reason is firm-specific.

