Research Notebook

Intra-Industry Lead-Lag Effect

March 19, 2014 by Alex

1. Introduction

Hou (2007) documents a really interesting phenomenon in asset markets. Namely, if the largest securities in an industry as measured by market capitalization perform really well in the current week, then the smallest securities in that industry tend to do well in the subsequent 2 weeks. However, the reverse relationship does not hold. i.e., if the smallest securities in an industry do well in the current week, this tells you next to nothing about how the largest securities in that industry will do in the subsequent weeks. This effect has a characteristic time scale of 1 to 2 weeks, and varies substantially across industries.

In this post, I replicate the main finding, provide some robustness checks, and then relate the result to the analysis in my paper Feature Selection Risk (2014).

2. Data Description

I use monthly and daily CRSP data from June 1963 to December 2001 to recreate Hou (2007), Table 1. I also replicate the same results using a different industry classification system over the period from January 2000 to December 2013. I look at securities traded on the NYSE, AMEX, and NASDAQ stock exchanges. I restrict the sample to include only securities with share codes 10 or 11. i.e., I exclude things like ADRs, closed-end funds, and REITs. I calculate weekly returns by compounding daily returns between adjacent Wednesdays:

(1)   \begin{align*} 1 + \tilde{r}_{n,t} &= (1 + \tilde{r}_{n,\text{W}_{t-1}}) \cdot (1 + \tilde{r}_{n,\text{Th}_{t-1}}) \cdot (1 + \tilde{r}_{n,\text{F}_{t-1}}) \cdot (1 + \tilde{r}_{n,\text{M}_t}) \cdot (1 + \tilde{r}_{n,\text{Tu}_t}) \end{align*}
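
In code, the Wednesday-to-Wednesday compounding is just a product of gross daily returns; here is a minimal sketch (the function name and inputs are my own, not from the gist linked below):

```python
def weekly_return(daily_returns):
    """Compound 5 daily simple returns (prior Wed, Thu, Fri, plus
    this week's Mon, Tue) into one weekly simple return."""
    gross = 1.0
    for r in daily_returns:
        gross *= 1.0 + r
    return gross - 1.0

# five daily returns of 1% each compound to about 5.1% for the week
print(round(weekly_return([0.01] * 5), 5))  # 0.05101
```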

I classify firms into industries in 2 different ways. In order to replicate Hou (2007), Table 1, I use the 12-industry classification system from Ken French’s website. This classification system is nice in the sense that it is based on SIC codes and can thus be extended back to the 1920s. However, the industry classification system that everyone in the financial industry uses is GICS codes. As a result, I also assign each firm to 1 of 24 different GICS industry subgroups.

I assign each firm to either an SIC or GICS industry based on its reported code in the monthly CRSP data as of the previous June. e.g., if I was looking at Apple, Inc. in September 2005, then I would assign Apple to its industry as of June 2005; whereas, if I was looking at Apple in May 2005, then I would assign Apple to its industry as of June 2004. I use N_{i,y} to denote the number of securities in industry i in year y. In each of the figures below, I report the average number of firms in each industry on an annual basis over the sample period:

(2)   \begin{align*} \langle N_{i,y} \rangle &= \left\lfloor \frac{1}{Y} \cdot \sum_{y=1}^Y N_{i,y} \right\rfloor \end{align*}

e.g., when replicating the results in Hou (2007), Table 1 I compute the average number of firms using Y = 39 June observations.
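
The “as of the previous June” convention is easy to get wrong, so here is a tiny helper pinning it down (a sketch; the function name is mine, and I treat June itself as using the previous year’s code):

```python
def assignment_june(year, month):
    """Year of the June industry code that applies to (year, month):
    July-December use that year's June; January-June use the
    previous year's June."""
    return year if month > 6 else year - 1

print(assignment_june(2005, 9))  # Apple in September 2005 -> June 2005
print(assignment_june(2005, 5))  # Apple in May 2005 -> June 2004
```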

Each June I also sort the securities in each industry i by their market cap. After sorting, I then construct an equally weighted portfolio of the largest 30{\scriptstyle \%} of stocks in each industry and the smallest 30{\scriptstyle \%} of stocks in each industry:

(3)   \begin{align*} \tilde{r}_{i,t}^B &= \frac{1}{N_{i,y_t}^{30\%}} \cdot \sum_{n=1}^{N_{i,y_t}^{30\%}} \tilde{r}_{n,t} \qquad \text{and} \qquad \tilde{r}_{i,t}^S = \frac{1}{N_{i,y_t} - N_{i,y_t}^{70\%}} \cdot \sum_{n=N_{i,y_t}^{70\%} + 1}^{N_{i,y_t}} \tilde{r}_{n,t} \end{align*}

In the analysis below, I look at the relationship between the weekly returns of these 2 portfolios over the subsequent year. Note that these are within-industry sorts. e.g., a stock in the “big” portfolio of the consumer durables industry might be small enough that it would land in the “small” portfolio if it were a telecommunications firm.
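
The within-industry sort can be sketched in a few lines (variable and function names are mine, not from the gist linked below; ties and the exact breakpoint convention are glossed over):

```python
import numpy as np

def big_small_portfolios(returns, market_caps, pct=0.30):
    """Equal-weighted returns of the biggest and smallest `pct` of
    stocks in a single industry, sorted by market cap."""
    order = np.argsort(market_caps)        # smallest to largest
    k = int(np.floor(pct * len(order)))    # stocks in each portfolio
    r_small = returns[order[:k]].mean()
    r_big = returns[order[-k:]].mean()
    return r_big, r_small

# toy industry: 10 stocks whose returns happen to scale with size
caps = np.arange(1.0, 11.0)
rets = caps / 100.0
r_big, r_small = big_small_portfolios(rets, caps)
print(round(float(r_big), 2), round(float(r_small), 2))  # 0.09 0.02
```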

Here is the code I use to pull the data from WRDS and create the figure in Section 3 below: gist.

3. Hou (2007), Table 1

Table 1 in Hou (2007) reports the cross-autocorrelation of the big and small intra-industry portfolios defined above. To estimate these statistics, I first normalize the big and small portfolio weekly returns so that they each have a mean of 0 and a standard deviation of 1:

(4)   \begin{align*} \mu_B &= \mathrm{E}[\tilde{r}_{i,t}^B] \qquad \text{and} \qquad \sigma_B = \mathrm{StDev}[\tilde{r}_{i,t}^B] \\ \mu_S &= \mathrm{E}[\tilde{r}_{i,t}^S] \qquad \text{and} \qquad \sigma_S = \mathrm{StDev}[\tilde{r}_{i,t}^S] \\ r_{i,t}^B &= \frac{\tilde{r}_{i,t}^B - \mu_B}{\sigma_B} \qquad \text{and} \qquad r_{i,t}^S = \frac{\tilde{r}_{i,t}^S - \mu_S}{\sigma_S} \end{align*}

Then, to estimate the correlation between the returns of the big portfolio in week t and the subsequent returns of the small portfolio in week (t + l) I run the regression:

(5)   \begin{align*} r_{i,t+l}^S &= \beta(l) \cdot r_{i,t}^B + \epsilon_{i,t+l} \qquad \text{for } l = 0,1,2,\ldots,6 \end{align*}

and estimate \beta(l) = \mathrm{Cor}[r_{i,t}^B,r_{i,t+l}^S]. Similarly, to estimate the correlation between the returns of the small portfolio in week t and the subsequent returns of the big portfolio in week (t + l) I run the regression:

(6)   \begin{align*} r_{i,t+l}^B &= \gamma(l) \cdot r_{i,t}^S + \epsilon_{i,t+l} \qquad \text{for } l = 0,1,2,\ldots,6 \end{align*}

to estimate \gamma(l) = \mathrm{Cor}[r_{i,t}^S,r_{i,t+l}^B]. The advantage of this approach over estimating a simple correlation matrix is that you can read off the standard errors from the regression results rather than rely on asymptotic results.
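
To see that the no-intercept slope on standardized series is the cross-correlation, here is a small simulation (entirely made-up data; the 0.3 loading echoes the consumer non-durables estimate but is my choice):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 5000
r_big = rng.standard_normal(T)
# small-portfolio returns load on the big portfolio's lag-1 returns
r_small = 0.3 * np.concatenate(([0.0], r_big[:-1])) + rng.standard_normal(T)

def standardize(x):
    return (x - x.mean()) / x.std()

def beta_hat(lag):
    """No-intercept OLS slope of r^S_{t+l} on r^B_t, both
    standardized; this equals the sample cross-correlation."""
    x = standardize(r_big)[:T - lag]
    y = standardize(r_small)[lag:]
    return float((x * y).sum() / (x * x).sum())

# the slope at lag 1 is close to the true correlation 0.3/sqrt(1.09)
print(round(beta_hat(1), 2))
```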

hou-2007-table-1

The figure above gives the results of these regressions using data from June 1963 to December 2001. The solid blue and red lines give the point estimates for \beta(l) and \gamma(l) respectively at lags of l=0,1,2,\ldots,6 weeks. The shaded regions around the solid lines are the 95{\scriptstyle \%} confidence intervals around these point estimates. e.g., the panel in the upper left-hand corner reports that when the largest securities in the consumer non-durables industry realize a return that is 1 standard deviation above its mean in week t, the smallest securities in the consumer non-durables industry realize a return that is roughly 0.30 standard deviations above their mean in week (t+1). By contrast, the smallest consumer non-durables securities have no predictive power over the future returns of their larger cousins.

4. Robustness Checks

The above results are quite interesting, but no one really uses the Ken French industry classification system when trading. The industry standard is GICS. The figure below replicates these same results over the period from January 2000 to December 2013 using the GICS codes. The results are similar, but slightly less pronounced. This replication suggests that shocks to the largest securities in an industry take roughly 2 weeks to fully propagate out to the smallest securities in the same industry.

hou-2007-table-1--gics

An obvious follow-up question is: “Is there something special about the largest firms in an industry? Or, is this cross-autocorrelation a statistical effect?” One way to shed light on this question is to look at the predictive power of the largest 10{\scriptstyle \%} of securities in each industry as opposed to the largest 30{\scriptstyle \%}:

(7)   \begin{align*} \tilde{r}_{i,t}^B &= \frac{1}{N_{i,y_t}^{10\%}} \cdot \sum_{n=1}^{N_{i,y_t}^{10\%}} \tilde{r}_{n,t} \qquad \text{and} \qquad \tilde{r}_{i,t}^S = \frac{1}{N_{i,y_t} - N_{i,y_t}^{70\%}} \cdot \sum_{n=N_{i,y_t}^{70\%} + 1}^{N_{i,y_t}} \tilde{r}_{n,t} \end{align*}

If there is something fundamental about size, we should expect to see an even more pronounced disparity between the predictive power of the big and small portfolios. However, the figure below shows that looking at the predictive power of the really large firms (if anything) weakens the effect. It’s definitely not more pronounced.

hou-2007-table-1--gics-10pct

5. Conclusion

What’s going on here? If size isn’t the root explanation, what is? In my paper Feature Selection Risk, I propose that the true culprit is not size, but rather the number of plausible shocks that might explain a firm’s returns. e.g., Apple might have really low stock returns in the current week for all sorts of reasons: bad product release, news about factory conditions, raw materials price shock, etc… Only some of these shocks will be relevant for other firms in the industry. It takes a while to parse Apple’s bad returns and figure out how you should extrapolate to other firms. By contrast, there are many fewer ways for a small firm’s returns to go very badly in the space of a few days, and often the reason is firm-specific.

Filed Under: Uncategorized

Investigation Bandwidth

March 3, 2014 by Alex

1. Motivation

Time is dimensionless in modern asset pricing theory. e.g., the canonical Euler equation:

(1)   \begin{align*} P_t &= \widetilde{\mathrm{E}}_t[ \, P_{t+1} + D_{t+1} \, ] \end{align*}

says that the price of an asset at time t (i.e., P_t) is equal to the risk-adjusted expectation at time t (i.e., \widetilde{E}_t[\cdot]) of the price of the asset at time t+1 plus the risk-adjusted expectation of any dividends paid out by the asset at time t+1 (i.e., P_{t+1} + D_{t+1}). Yet, the theory never answers the question: “Plus 1 what?” Should we be thinking about seconds? Hours? Days? Years? Centuries? Millennia?

Why does this matter? An algorithmic trader adjusting his position each second worries about different risks than Warren Buffett who has a median holding period of decades. e.g., Buffett studies cash flows, dividends, and business plans. By contrast, the probability that a firm paying out a quarterly dividend happens to pay its dividend during any randomly chosen 1 second time interval is \sfrac{1}{1814400}. i.e., roughly the odds of picking a year at random since the time that the human and chimpanzee evolutionary lines diverged. Thus, if an algorithmic trader and Warren Buffett both looked at the exact same stock at the exact same time, then they would have to use different risk-adjusted expectations operators:

(2)   \begin{align*} P_t &= \begin{cases}  \widetilde{\mathrm{E}}^{\text{Alg}}_t[ \, P_{t+1{\scriptscriptstyle \mathrm{sec}}} \, ] &\text{from algorithmic trader's p.o.v.} \\ \widetilde{\mathrm{E}}^{\text{WB}}_t[ \, P_{t+1{\scriptscriptstyle \mathrm{qtr}}} + D_{t+1{\scriptscriptstyle \mathrm{qtr}}} \, ] &\text{from Warren Buffett's p.o.v.} \end{cases} \end{align*}

This note gives a simple economic model in which traders endogenously specialize in looking for information at a particular time scale and ignore predictability at vastly different time scales.

2. Simulation

I start with a simple numerical simulation that illustrates why traders at the daily horizon will ignore price patterns at vastly different frequencies. Suppose that Cisco’s stock returns are composed of a constant growth rate \mu = \sfrac{0.04}{(480 \cdot 252)}, a daily wobble \beta \cdot \sin(2 \cdot \pi \cdot t) with \beta = \sfrac{1}{(480 \cdot 252)}, and a white noise term \epsilon_t \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma^2) with \sigma= \sfrac{0.12}{\sqrt{480 \cdot 252}}:

(3)   \begin{align*} R_t &= \mu + \beta \cdot \sin(2 \cdot \pi \cdot t) + \epsilon_t, \quad \text{for} \quad t = \sfrac{1}{480}, \sfrac{2}{480}, \ldots, \sfrac{10079}{480}, \sfrac{10080}{480} \end{align*}

I consider a world where the clock ticks forward in 1 minute increments so that each tick represents \sfrac{1}{480}th of a trading day. The figure below shows a single sample path of Cisco’s return process over the course of a month.

plot--daily-wobble-plus-noise
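
A sample path like the one above takes only a few lines to generate (a sketch; the seed and random generator are my own choices):

```python
import numpy as np

TICKS, DAYS = 480, 252                    # ticks per day, trading days per year
mu = 0.04 / (TICKS * DAYS)                # 4% per year growth rate
beta = 1.0 / (TICKS * DAYS)               # daily wobble amplitude
sigma = 0.12 / np.sqrt(TICKS * DAYS)      # 12%/sqrt(yr) noise volatility

# one month of minute-by-minute returns: t = 1/480, ..., 10080/480
t = np.arange(1, 480 * 21 + 1) / TICKS
rng = np.random.default_rng(42)
R = mu + beta * np.sin(2 * np.pi * t) + sigma * rng.standard_normal(t.size)
```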

What are the properties of this return process? First, the constant growth rate, \mu = \sfrac{0.04}{(480 \cdot 252)}, implies that Cisco has a 4{\scriptstyle \%} per year return on average. Second, the volatility of the noise component, \sigma= \sfrac{0.12}{\sqrt{480 \cdot 252}}, implies that the annualized volatility of Cisco’s returns is 12{\scriptstyle \%/\sqrt{\mathrm{Yr}}}. Finally, since:

(4)   \begin{align*} \frac{1}{2 \cdot \pi} \cdot \int_0^{2 \cdot \pi} [\sin(x)]^2 \cdot dx &= 1 \end{align*}

the choice of \beta = \sfrac{1}{(480 \cdot 252)} means that (in a world with a 0{\scriptstyle \%} riskless rate) a trading strategy which is long Cisco stock in the morning and short Cisco stock in the afternoon will generate a 100{\scriptstyle \%} return over the course of 1 year. i.e., this is a big daily wobble! If you start with \mathdollar 1 on the morning of January 1st, you end up with \mathdollar 2 on the evening of December 31st on average by following this trading strategy. The figure below confirms this math by simulating 100 year-long realizations of this trading strategy’s returns.

plot--cum-trading-strategy-returns

3. Trader’s Problem

Suppose you didn’t know the exact frequency of the wobble in Cisco’s returns. The wobble is equally likely to have a frequency of anywhere from \sfrac{1}{252} cycles per day to 480 cycles per day. Using the last month’s worth of data, suppose you estimated the regressions specified below:

(5)   \begin{align*} R_t &= \hat{\mu} + \hat{\beta} \cdot \sin(2 \cdot \pi \cdot f \cdot t) + \hat{\gamma} \cdot \cos(2 \cdot \pi \cdot f \cdot t) + \hat{\epsilon}_t \quad \text{for each} \quad \sfrac{1}{252} < f < 480 \end{align*}

and identified the frequency, f_{\min}, which best fit the data:

(6)   \begin{align*} f_{\min} &= \arg \min_{\sfrac{1}{252} < f < 480} \left\{ \, \hat{\sigma}(f) \, \right\} \end{align*}

The figure below shows the empirical distribution of these best in-sample fit frequencies when the true frequency is a daily wobble. The figure reads: “A month’s worth of Cisco’s minute-by-minute returns best fits a factor with a frequency of \sfrac{1}{1.01{\scriptstyle \mathrm{days}}} about 2{\scriptstyle \%} of the time when the true frequency is 1 cycle a day.”

plot--best-in-sample-fit-freq
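
Under the hood, f_{\min} comes from running the sin/cos regression in (5) on a grid of candidate frequencies and keeping the one with the smallest residual standard deviation. A minimal sketch (the toy grid and amplitudes here are my own, not the full simulation behind the figure):

```python
import numpy as np

def best_fit_frequency(R, t, freqs):
    """Return the candidate frequency whose sin/cos regression
    leaves the smallest residual standard deviation (f_min)."""
    best_f, best_sigma = None, np.inf
    for f in freqs:
        X = np.column_stack([np.ones_like(t),
                             np.sin(2 * np.pi * f * t),
                             np.cos(2 * np.pi * f * t)])
        coef, *_ = np.linalg.lstsq(X, R, rcond=None)
        sigma_hat = float((R - X @ coef).std())
        if sigma_hat < best_sigma:
            best_f, best_sigma = f, sigma_hat
    return best_f

# a month of minute-by-minute data with a true 1-cycle-per-day wobble
t = np.arange(1, 480 * 21 + 1) / 480
rng = np.random.default_rng(7)
R = 1e-5 * np.sin(2 * np.pi * t) + 1e-6 * rng.standard_normal(t.size)
print(best_fit_frequency(R, t, [0.5, 1.0, 2.0, 4.0]))  # 1.0
```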

Suppose that you notice that a wobble with a frequency of \sfrac{1}{1.01{\scriptstyle \mathrm{days}}} fits Cisco’s returns over the last month really well, but you also know that this is a noisy in-sample estimate. The true wobble could have a different frequency. If you can expend some cognitive effort to investigate alternate frequencies, how wide a bandwidth of frequencies should you investigate? Here’s where things get interesting. The figure above essentially says that you should never investigate frequencies outside of f_{\min} \pm 0.5 \cdot \sfrac{1}{21}—i.e., plus or minus half the width of the bell. The probability that a pattern in returns with a frequency outside this range is actually driving the results is nil!

4. Costs and Benefits

Again, suppose you’re a trader who’s noticed that there is a daily wobble in Cisco’s returns over the past month. i.e., using the past month’s data, you’ve estimated f_{\min} = \sfrac{1}{1{\scriptstyle \mathrm{day}}}. Just as before, it’s a big wobble. Implemented at the right time scale, f_\star, you know that this strategy of buying early and selling late will generate a R(f_\star) = 100{\scriptstyle \%/\mathrm{yr}} = 8.33{\scriptstyle \%/\mathrm{mon}} return. Nevertheless, you also know that f_{\min} isn’t necessarily the right frequency to invest in just because it had the lowest in-sample error over the last month. You don’t want to go to your MD and pitch a strategy only to have to adjust it a month later due to poor performance. Let’s say that it costs you \kappa dollars to investigate a range of \delta frequencies. If you investigate a particular range and f_\star is there, then you will discover f_\star with probability 1.

The question is then: “Which frequency buckets should you investigate?” First, are we losing anything by only searching \delta-sized increments? Well, we can tile the entire frequency range with tiny \delta increments as follows:

(7)   \begin{align*} 1 - \Delta(x,N) &= \sum_{n=0}^{N-1} \mathrm{Pr}\left[ \, x + n \cdot \delta \leq f_\star < x + (n + 1) \cdot \delta \, \middle| \, f_{\min} \, \right]  \end{align*}

i.e., starting at frequency x we can iteratively add N different increments of size \delta. If we start at a small enough frequency, x, and add enough increments, N, then we can tile as much of the entire domain as we like so that \Delta(x,N) is as small as we like.

Next, what are the benefits of discovering the correct time scale to invest in? If R(f_{\star}) denotes the returns to investing in a trading strategy at the correct time scale over the course of the next month, let:

(8)   \begin{align*} \mathrm{Corr}[R(f_{\star}),R(f_{\min})] &= C(f_{\star},f_{\min}) \end{align*}

denote the correlation between the returns of the strategy at the true frequency and the strategy at the best in-sample fit frequency. We know that C(f_{\star},f_{\star}) = 1 and that:

(9)   \begin{align*} \frac{dC(f_{\star},f_{\min})}{d|\log f_{\star} - \log f_{\min}|} < 0 \qquad \text{with} \qquad \lim_{|\log f_{\star} - \log f_{\min}| \to \infty} R(f_{\min}) = 0 \end{align*}

i.e., as f_{\min} gets farther and farther away from f_{\star}, your realized returns over the next month from a trading strategy implemented at horizon f_{\min} will become less and less correlated with the returns of the strategy implemented at f_{\star} and as a consequence shrink to 0. Thus, the benefit to discovering that the true frequency was not f_{\min} is given by (1 - C(f_\star,f_{\min})) \cdot R(f_{\star}).

Putting the pieces together, it’s clear that you should investigate a particular range of frequencies for a confounding explanation if the expected probability of finding f_{\star} there given the realized f_{\min} times the benefit of discovering the true f_{\star} in that range exceeds the search cost \kappa:

(10)   \begin{align*} \kappa &\leq \underbrace{\mathrm{Pr}\left[ \, x + n \cdot \delta \leq f_\star < x + (n + 1) \cdot \delta \, \middle| \, f_{\min} \, \right]}_{\substack{\text{Probability of finding $f_\star$ in a } \\ \text{particular range given observed $f_{\min}$.}}} \cdot \overbrace{(1 - C(f_\star,f_{\min})) \cdot R(f_{\star})}^{\substack{\text{Benefit of} \\ \text{discovery}}} \end{align*}

i.e., you’ll have a donut-shaped search pattern around f_{\min}. You won’t investigate frequencies that are really different from f_{\min} since the probability of finding f_{\star} there will be too low to justify the search costs. By contrast, you won’t investigate frequencies that are too similar to f_{\min} since the benefits to discovering this minuscule error don’t justify the costs even though such tiny errors may be quite likely.

5. Wrapping Up

I started with the question: “How can it be that an algorithmic trader and Warren Buffett worry about different patterns in the same price path?” In the analysis above I give one possible answer. If you see a tradable anomaly at a particular time scale (e.g., 1 wobble per day) over the past month, then the probability that this anomaly was caused by a data generating process with a much shorter or much longer frequency is essentially 0. I used only sine wave plus noise processes above, but it seems like this assumption can be easily relaxed via results from, say, Freidlin and Wentzell.

Filed Under: Uncategorized

The Secrets N Prices Keep

December 30, 2013 by Alex

1. Introduction

Prices are signals about shocks to fundamentals. In a world where there are many stocks and lots of different kinds of shocks to fundamentals, traders are often more concerned with identifying exactly which shocks took place than the value of any particular asset. e.g., imagine you are a day trader. While you certainly care about changes in the fundamental value of Apple stock, you care much more about the size and location of the underlying shocks since you can profit from this information elsewhere. On one hand, if all firms based in California were hit with a positive shock, you might want to buy shares of Apple, Banana Republic, Costco, …, and Zero Skateboards stock. On the other hand, if all electronic equipment companies were hit with a positive shock, you might want to buy up Apple, Bose, Cisco Systems, …, and Zenith shares instead.

It turns out that there is a sharp phase change in traders’ ability to draw inferences about attribute-specific shocks from prices. i.e., when there have been fewer than N^\star transactions, you can’t tell exactly which shocks affected Apple’s fundamental value. Even if you knew that Apple had been hit by some shock, with fewer than N^\star observations you couldn’t tell whether it was a California-specific event or an electronic equipment-specific event. By contrast, when there have been more than N^\star transactions, you can figure out exactly which shocks have occurred. The additional (N - N^\star) transactions simply allow you to fine tune your beliefs about exactly how large the shocks were. The surprising result is that N^\star is a) independent of traders’ cognitive abilities and b) easily calculable via tools from the compressed sensing literature. See my earlier post for details.

This signal recovery bound is thus a novel constraint on the amount of information that real-world traders can extract from prices. Moreover, the bound gives a concrete meaning to the term “local knowledge”. e.g., shocks that haven’t yet manifested themselves in N^\star transactions are local in the sense that no one can spot them through prices. Anyone who knows of their existence must have found out via some other channel. To build intuition, this post gives 3 examples of this constraint in action.

2. Out-of-Town House Buyer

First I show where this signal recovery bound comes from. People spend lots of time looking for houses in different cities. e.g., see Trulia or my paper. Suppose you moved away from Chicago a year ago, and now you’re moving back and looking for a house. When studying a list of recent sale prices, you find yourself a bit surprised. People must have changed their preferences for 1 of 7 different amenities: ^{(1)}a 2 car garage, ^{(2)}a 3rd bedroom, ^{(3)}a half-circle driveway, ^{(4)}granite countertops, ^{(5)}energy efficient appliances, ^{(6)}central A/C, or ^{(7)}a walk-in closet. Having the mystery amenity raises the sale price by \beta > 0 dollars. You would know how preferences had evolved if you had lived in Chicago the whole time; however, in the absence of this local knowledge, how many sales would you need to see in order to figure out which of the 7 amenities mattered?

The answer is 3. How did I come up with this number? For ease of explanation, let’s normalize expected house prices to \mathrm{E}_{t-1}[p_{n,t}] = 0. Suppose you found one house with amenities \{1,3,5,7\}, a second house with amenities \{2, 3, 6, 7\}, and a third house with amenities \{4, 5, 6,7\}. The combination of prices for these 3 houses would reveal exactly which amenity had been shocked. i.e., if only the first house’s price was higher than expected, p_{1,t} \approx \beta, then Chicagoans must have changed their preferences for having a 2 car garage:

(1)   \begin{equation*} {\small  \begin{bmatrix} p_{1,t} \\ p_{2,t} \\ p_{3,t} \end{bmatrix}  = \begin{bmatrix} \beta \\ 0 \\ 0 \end{bmatrix}  =  \begin{bmatrix}  1 & 0 & 1 & 0 & 1 & 0 & 1  \\  0 & 1 & 1 & 0 & 0 & 1 & 1  \\  0 & 0 & 0 & 1 & 1 & 1 & 1  \end{bmatrix} \begin{bmatrix}  \beta \\ 0 \\ \vdots \\ 0  \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \end{bmatrix}  } \quad \text{with} \quad \epsilon_n \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma^2), \, \beta \gg \sigma \end{equation*}

By contrast, if it was the case that p_{1,t} \approx \beta, p_{2,t} \approx \beta, and p_{3,t} \approx \beta, then you would know that people now value walk-in closets much more than they did a year ago.

Here is the key point. 3 sales is just enough information to answer 7 yes or no questions and rule out the possibility of no change:

(2)   \begin{align*}   7 = 2^3 - 1 \end{align*}

N = 4 sales simply narrows your error bars around the exact value of \beta. N = 2 sales only allows you to distinguish between subsets of amenities. e.g., seeing just the 1st and 2nd houses with unexpectedly high prices only tells you that people like either half-circle driveways or walk-in closets more. It doesn’t tell you which one. The problem changes character at N = N^\star(7,1) = 3. When you have seen fewer than N^\star = 3 sales, information about how preferences have changed is purely local knowledge. Prices can’t publicize this information. You must live and work in Chicago to learn it.
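
The decoding logic is worth making explicit: each amenity’s column in (1) is the 3-bit binary encoding of its index, so the pattern of surprisingly-high prices across the 3 houses spells out which amenity was shocked. A quick sketch (thresholding at \beta/2, which is safe since \beta \gg \sigma):

```python
import numpy as np

# rows = houses, columns = amenities 1-7; entry is 1 if the house has it
X = np.array([[1, 0, 1, 0, 1, 0, 1],   # house 1: amenities {1,3,5,7}
              [0, 1, 1, 0, 0, 1, 1],   # house 2: amenities {2,3,6,7}
              [0, 0, 0, 1, 1, 1, 1]])  # house 3: amenities {4,5,6,7}

def shocked_amenity(prices, beta):
    """Read the 3 price surprises as bits: house n's bit has weight
    2**(n-1), so the bit pattern is the shocked amenity's index."""
    bits = (np.asarray(prices) > beta / 2).astype(int)
    return int(bits[0] + 2 * bits[1] + 4 * bits[2])

beta = 10_000.0
print(shocked_amenity([beta, 0.0, 0.0], beta))    # 1: the 2 car garage
print(shocked_amenity([beta, beta, beta], beta))  # 7: the walk-in closet
```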

3. Industry Analyst’s Advantage

Next, I illustrate how this signal recovery bound acts like a cognitive constraint for would-be arbitrageurs. Suppose you’re a petroleum industry analyst. Through long, hard, caffeine-fueled nights of research you’ve discovered that oil companies such as Schlumberger, Halliburton, and Baker Hughes who’ve invested in hydraulic fracturing (a.k.a., “fracking”) are due for a big unexpected payout. This is really valuable information affecting only a few of the major oil companies. Many companies haven’t really invested in this technology, and they won’t be affected by the shock. How aggressively should you trade Schlumberger, Halliburton, and Baker Hughes? On one hand, you want to build up a large position in these stocks to take advantage of the future price increases that you know are going to happen. On the other hand, you don’t want to allow news of this shock to spill out to the rest of the market.

In the canonical Grossman and Stiglitz (1980)-type setup, the reason that would-be arbitrageurs can’t immediately infer your hard-earned information from prices is the existence of noise traders. They can’t be completely sure whether a sudden price movement is due to a) your informed trading or b) random noise trader demand. Here, I propose a new confound: the existence of many plausible shocks. e.g., suppose you start aggressively buying up shares of Schlumberger, Halliburton, and Baker Hughes stock. As an arbitrageur I see the resulting gradual price increases in these 3 stocks, and ask: “What should my next trade be?” Here’s where things get interesting. When there have been fewer than N^\star transactions in the petroleum industry, I can’t tell whether you are trading on a Houston, TX-specific shock or a fracking-specific shock since all 3 of these companies share both of these attributes. I need to see at least N^\star observations in order to recognize the pattern you’re trading on.

petroleumindustry-search-subjects

The figure above gives a sense of the number of different kinds of shocks that affect the petroleum industry. It reads: “If you select a Wall Street Journal article on the petroleum industry over the period from 2011 to 2013 there is a 19{\scriptstyle \%} chance that ‘Oil sands’ is a listed descriptor and a 7{\scriptstyle \%} chance that ‘LNG’ (i.e., liquid natural gas) is a listed descriptor.” Thus, oil stock price changes might be due to Q \gg 1 different shocks:

(3)   \begin{align*} \hat{p}_{n,t} &= p_{n,t} - \mathrm{E}_{t-1}[p_{n,t}] = \sum_{q=1}^Q \beta_{q,t} \cdot x_{n,q} + \epsilon_{n,t} \qquad \text{with} \qquad \epsilon_{n,t} \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma^2) \end{align*}

where x_{n,q} denotes stock n’s exposure to the qth attribute. e.g., in this example x_{n,q} = 1 if the company invested in fracking (i.e., like Schlumberger, Halliburton, and Baker Hughes) and x_{n,q}=0 if the company didn’t. What’s more, very few of the Q possible attributes matter each month. e.g., the plot below reads: “Only around 10{\scriptstyle \%} of all the descriptors in the Wall Street Journal articles about the petroleum industry over the period from January 2011 to December 2013 are used each month.” Thus, only K of the possible Q attributes appear to realize shocks each period:

(4)   \begin{align*} K &= \Vert {\boldsymbol \beta} \Vert_{\ell_0} = \sum_{q=1}^Q 1_{\{\beta_q \neq 0\}} \qquad \text{with} \qquad K \ll Q \end{align*}

Note that this calculation includes terms like ‘Crude oil prices’, which occur in roughly half the articles, so the actual rate is likely much lower; ‘Crude oil prices’ is effectively a synonym for the industry itself.

petroleumindustry--fraction-of-search-subjects-mentioned-each-month

For simplicity, suppose that 10 attributes out of a possible 100 realized a shock in the previous period, and you discovered 1 of them. How long does your informational monopoly last? Using tools from Wainwright (2009) it’s easy to show that uninformed traders need at least:

(5)   \begin{align*} N^\star(100,10) \approx 10 \cdot \log(100 - 10) = 45  \end{align*}

observations to identify which 10 of the 100 possible payout-relevant attributes in the petroleum industry have realized a shock. If it takes you (…and other industry specialists like you) around 1 hour to materially increase your position, then you have roughly 5.6 = \sfrac{45}{8} days (i.e., around 1 trading week) to build up a position before the rest of the market catches on, assuming an 8-hour trading day.
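
The bound itself is one line of arithmetic, K \cdot \log(Q - K) with a natural log (a sketch of the leading-order scaling from Wainwright (2009), ignoring constants):

```python
import math

def n_star(Q, K):
    """Leading-order number of observations needed to identify which
    K of Q attributes realized shocks: K * log(Q - K)."""
    return K * math.log(Q - K)

print(round(n_star(100, 10)))         # 45 observations
print(round(n_star(100, 10) / 8, 1))  # ~5.6 trading days at 8 hrs/day
```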

4. Asset Management Expertise

Finally, I show how there can be situations where you might not bother trying to learn from prices because there are too many plausible explanations to check out. In this world everyone specializes in acquiring local knowledge. Suppose you’re a wealthy investor, and I’m a broke asset manager with a trading strategy. I walk into your office, and I try to convince you to finance my strategy that has abnormal returns of r_t per month:

(6)   \begin{align*}   r_t &= \mu + \epsilon_t   \qquad \text{with} \qquad    \epsilon_t \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma_{\epsilon}^2) \end{align*}

where \sigma_{\epsilon}^2 = 1{\scriptstyle \%} per month to make the algebra neat. For simplicity, suppose that there is no debate that \mu > 0. In return for running the trading strategy, I ask for fees amounting to a fraction f of the gross returns. Of course, I have to tell you a little bit about how the trading strategy works, so you can deduce that I’m taking on a position that is to some extent a currency carry trade and to some extent a short-volatility strategy. This narrows down the list a bit, but it still leaves a lot of possibilities. In the end, you know that I am using some combination of K = 2 out of Q = 100 possible strategies.

You have 2 options. On one hand, if you accept the terms of this offer and finance my strategy, you realize returns net of fees equal to:

(7)   \begin{align*}   (1 - f) \cdot \mu \cdot T + \sum_{t=1}^T \epsilon_t \end{align*}

This approach would net you an annualized Sharpe ratio of \text{SR}_{\text{mgr}} = \sqrt{12} \cdot (1 - f) \cdot \sfrac{\mu}{\sigma_{\epsilon}}. e.g., if I asked for a fee of f = 20{\scriptstyle \%}, and my strategy yielded a return of 2{\scriptstyle \%} per month, then your annualized Sharpe ratio net of my fees would be \text{SR}_{\text{mgr}} = 0.55.

On the other hand, you could always refuse my offer and try to back out which strategies I was following using the information you gained from our meeting. i.e., you know that my strategy involves using some combination of K=2 factors out of a universe of Q = 100 possibilities:

(8)   \begin{align*}   \mu &= \sum_{q=1}^{100} \beta_q \cdot x_{q,t}   \qquad \text{with} \qquad    \Vert {\boldsymbol \beta} \Vert_{\ell_0} = 2 \end{align*}

In order to deduce which strategies I was using as quickly as possible, you’d have to trade random portfolio combinations of these 100 different factors for:

(9)   \begin{align*}   T^\star(100,2) \approx 2 \cdot \log(100 - 2) = 9.17 \, {\scriptstyle \mathrm{months}} \end{align*}

Your Sharpe ratio during this period would be \text{SR}_{\text{w/o mgr}|\text{pre}} = 0, and afterwards you would earn the same Sharpe ratio as before without having to pay any fees to me:

(10)   \begin{align*}   \text{SR}_{\text{w/o mgr}|\text{post}} &= \sqrt{12} \cdot \left( \frac{0.02}{0.10} \right) = 0.69 \end{align*}

However, if you have to show your investors reports every year, it may not be worth it for you to reverse engineer my trading strategy. Your average Sharpe ratio during this period would be:

(11)   \begin{align*}   \text{SR}_{\text{w/o mgr}} &= \sfrac{9.17}{12} \cdot 0 + \sfrac{(12 - 9.17)}{12} \cdot 0.69 = 0.16 \end{align*}

which is well below the Sharpe ratio on the market portfolio. Thus, you may just want to pay my fees. Even though you could in principle back out which strategies I was using, it would take too long. Your investors would withdraw due to poor performance before you could capitalize on your newfound knowledge.
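The arithmetic in Equations (7)-(11) is simple enough to check directly. Here is a minimal sketch using the numbers from the example above (f = 20{\scriptstyle \%}, \mu = 2{\scriptstyle \%} per month, \sigma = 10{\scriptstyle \%}):

```python
import math

# Parameters from the example: fee, mean return, and volatility per month.
f, mu, sigma = 0.20, 0.02, 0.10
Q, K = 100, 2  # universe of candidate strategies, number actually used

# Annualized Sharpe ratio net of the manager's fees.
sr_mgr = math.sqrt(12) * (1 - f) * mu / sigma

# Months needed to reverse engineer the strategy, Equation (9).
t_star = K * math.log(Q - K)

# Gross annualized Sharpe ratio once the strategy is known, Equation (10).
sr_post = math.sqrt(12) * mu / sigma

# Average Sharpe ratio over a 12-month reporting period, Equation (11).
sr_avg = (t_star / 12) * 0.0 + ((12 - t_star) / 12) * sr_post

print(round(sr_mgr, 2), round(t_star, 2), round(sr_post, 2), round(sr_avg, 2))
```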

5. Discussion

To cement ideas, let’s think about what this result implies for a financial econometrician. We’ve known since the 1970s that there is a strong relationship between oil shocks and the rest of the economy. e.g., see Hamilton (1983), Lamont (1997), and Hamilton (2003). Imagine you’re now an econometrician, and you go back and pinpoint the exact hour when each fracking news shock occurred over the last 40 years. Using this information, you then run an event study which finds that petroleum stocks affected by each news shock display a positive cumulative abnormal return over the course of the following week. Would this be evidence of a market inefficiency? Are traders still under-reacting to oil shocks? No. Ex post event studies assume that traders know exactly what is and what isn’t important in real time. Non-petroleum industry specialists who didn’t lose sleep researching hydraulic fracturing have to parse out which shocks are relevant from prices alone. This takes time. In the interim, this knowledge is local.


How Quickly Can We Decipher Price Signals?

December 23, 2013 by Alex

1. Introduction

There are many different attribute-specific shocks that might affect an asset’s fundamental value in any given period. e.g., the prices of all stocks held in model-driven long/short equity funds might suddenly plummet as happened in the Quant Meltdown of August 2007. Alternatively, new city parking regulations might raise the value of homes with a half-circle driveway. Innovations in asset prices are signals containing 2 different kinds of information: a) which of these Q different shocks has taken place and b) how big each of them was.

It’s often a challenge for traders to answer question (a) in real time. e.g., Daniel (2009) notes that during the Quant Meltdown “markets appeared calm to non-quantitative investors… you could not tell that anything was happening without quant goggles.” This post asks the question: How many transactions do traders need to see in order to identify shocked attributes? The surprising result is that there is a well-defined and calculable answer to this question that is independent of traders’ cognitive abilities. Local knowledge is an unavoidable consequence of this location recovery bound.

2. Motivating Example

It’s easiest to see where this location recovery bound comes from via a short example. Suppose you moved away from Chicago a year ago, and now you’re moving back and looking for a house. When looking at a list of recent sales prices, you find yourself surprised. People must have changed their preferences for 1 of 7 different amenities: ^{(1)}a 2 car garage, ^{(2)}a 3rd bedroom, ^{(3)}a half-circle driveway, ^{(4)}granite countertops, ^{(5)}energy efficient appliances, ^{(6)}central A/C, or ^{(7)}a walk-in closet. Having the mystery amenity raises the sale price by \beta > 0 dollars. To be sure, you would know how preferences had evolved if you had lived in Chicago the whole time; however, in the absence of this local knowledge, how many sales would you need to see in order to figure out which of the 7 amenities mattered?

The answer is 3. Where does this number come from? For ease of explanation, let’s normalize the expected house prices to \mathrm{E}_{t-1}[p_{n,t}] = 0 for every house n. Suppose you found one house with amenities \{1,3,5,7\}, a second house with amenities \{2, 3, 6, 7\}, and a third house with amenities \{4, 5, 6, 7\}. The combination of prices for these 3 houses would reveal exactly which amenity had been shocked. i.e., if only the first house’s price was higher than expected, p_{1,t} \approx \beta, then Chicagoans must have changed their preferences for having a 2 car garage:

(1)   \begin{equation*} {\small  \begin{bmatrix} p_{1,t} \\ p_{2,t} \\ p_{3,t} \end{bmatrix}  = \begin{bmatrix} \beta \\ 0 \\ 0 \end{bmatrix}  =  \begin{bmatrix}  1 & 0 & 1 & 0 & 1 & 0 & 1  \\  0 & 1 & 1 & 0 & 0 & 1 & 1  \\  0 & 0 & 0 & 1 & 1 & 1 & 1  \end{bmatrix} \begin{bmatrix}  \beta \\ 0 \\ \vdots \\ 0  \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \end{bmatrix}  } \quad \text{with} \quad \epsilon_n \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma^2), \, \beta \gg \sigma \end{equation*}

By contrast, if it was the case that p_{1,t} \approx \beta, p_{2,t} \approx \beta, and p_{3,t} \approx \beta, then you would know that people now value walk-in closets much more than they did a year ago.

Here is the key point. 3 sales is just enough information to answer 7 yes or no questions and rule out the possibility of no change:

(2)   \begin{align*}   7 = 2^3 - 1 \end{align*}

N = 4 sales simply narrows your error bars around the exact value of \beta. N = 2 sales only allows you to distinguish between subsets of amenities. e.g., seeing just the 1st and 2nd houses with unexpectedly high prices only tells you that people like either half-circle driveways or walk-in closets more. It doesn’t tell you which one. The problem changes character at N = N^\star(7,1) = 3… i.e., the location recovery bound.
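The counting argument above can be verified with a tiny sketch: each amenity’s pattern of appearances across the 3 houses is a distinct nonzero 3-bit code, so 3 sale prices are enough to single out any 1 of the 7 amenities.

```python
# Rows = houses, columns = amenities 1..7 (column 1 = the 2-car garage, etc.),
# matching the design matrix in Equation (1).
houses = [
    [1, 0, 1, 0, 1, 0, 1],  # house 1 has amenities {1, 3, 5, 7}
    [0, 1, 1, 0, 0, 1, 1],  # house 2 has amenities {2, 3, 6, 7}
    [0, 0, 0, 1, 1, 1, 1],  # house 3 has amenities {4, 5, 6, 7}
]

# Read off each amenity's code: which of the 3 sale prices it moves.
codes = [tuple(row[q] for row in houses) for q in range(7)]

# The 7 codes are distinct and none is (0, 0, 0), so 3 prices answer
# 7 yes-or-no questions and rule out "no change": 7 = 2**3 - 1.
assert len(set(codes)) == 7
assert (0, 0, 0) not in codes
print(codes)
```

In fact, amenity q’s code is just the number q written in binary, which is exactly why 3 observations saturate the 2^3 - 1 bound.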

3. Main Results

This section formalizes the intuition from the example above. Think about innovations in the price of asset n as the sum of a meaningful signal, f_n, and some noise, \epsilon_n:

(3)   \begin{align*} p_{n,t} - \mathrm{E}_{t-1}[p_{n,t}] &= f_n + \epsilon_n = \sum_{q=1}^Q \beta_q \cdot x_{n,q} + \epsilon_n \quad \text{with} \quad \epsilon_n \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma^2) \end{align*}

where the signal can be decomposed into Q different attribute-specific shocks. In Equation (3) above, \beta_q \neq 0 denotes a shock of size |\beta_q| to the qth attribute and x_{n,q} \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sfrac{1}{N}) denotes the extent to which asset n displays the qth attribute. Each of the data columns is normalized so that \mathrm{E} \, \sum_{n=1}^N \mathrm{Var}[x_{n,q}] = 1.

In general, when there are more attributes than shocks, K < Q, picking out exactly which K attributes have realized a shock is a combinatorially hard problem as discussed in Natarajan (1995). However, suppose you had an oracle which could bypass this hurdle and tell you exactly which attributes had realized a shock:

(4)   \begin{align*} \Vert \mathbf{f} - \hat{\mathbf{f}}^{\text{Oracle}} \Vert_{\ell_2} &= \inf_{\{\hat{\boldsymbol \beta} : \#[\beta_q \neq 0] \leq K\}} \, \Vert \mathbf{f} - \mathbf{X}\hat{\boldsymbol \beta} \Vert_{\ell_2} \end{align*}

In this world, your mean squared prediction error, \mathrm{MSE} = \frac{1}{N} \cdot \Vert \mathbf{p} - \hat{\mathbf{p}} \Vert_{\ell_2}^2, is given by:

(5)   \begin{align*} \mathrm{MSE}^{\text{Oracle}} & = \min_{0 \leq K \leq Q} \, \left\{ \, \frac{1}{N^{\text{Oracle}}} \cdot \Vert \mathbf{p} - \hat{\mathbf{p}}^{\text{Oracle}} \Vert_{\ell_2}^2 \, \right\} = \min_{0 \leq K \leq Q} \, \left\{ \, \frac{1}{N^{\text{Oracle}}} \cdot \Vert \mathbf{f} - \hat{\mathbf{f}}^{\text{Oracle}} \Vert_{\ell_2}^2 + \sigma^2 \, \right\} \end{align*}

where N^{\text{Oracle}} = N^{\text{Oracle}}(Q,K) = K denotes the number of observations necessary for your oracle. e.g., if each \beta_q \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{Bernoulli}(\kappa), then \mathrm{MSE}^{\text{Oracle}} = \sigma^2 since there is only variation in the location of the shocks and not in their size.

It turns out that if each asset isn’t too redundant relative to the number of shocked attributes, then you can achieve a mean squared error that is within a log factor of the oracle’s mean squared error using many fewer observations than there are attributes, N \ll Q. e.g., suppose that you used a lasso estimator:

(6)   \begin{align*} \hat{\boldsymbol \beta}^{\text{Lasso}} &= \arg\min_{\hat{\boldsymbol \beta}} \, \left\{ \, \frac{1}{2} \cdot \Vert \mathbf{p} - \mathbf{X} \hat{\boldsymbol \beta} \Vert_{\ell_2}^2 + \lambda_{\ell_1} \cdot \sigma \cdot \Vert \hat{\boldsymbol \beta} \Vert_{\ell_1} \, \right\} \end{align*}

with \lambda_{\ell_1} = 2 \cdot \sqrt{2 \cdot \log Q}. Then, Candes and Davenport (2011) show that:

(7)   \begin{align*} \mathrm{MSE}^{\text{Lasso}} &\leq \gamma \cdot \inf_{0 \leq K \leq Q} \, \left\{ \, \frac{1}{K} \cdot \Vert \mathbf{f} - \hat{\mathbf{f}}^{\text{Oracle}} \Vert_{\ell_2}^2 + \log Q \cdot \sigma^2 \, \right\} \end{align*}

with probability 1 - 6 \cdot Q^{-2 \cdot \log 2} - Q^{-1} \cdot (2 \cdot \pi \cdot \log Q)^{-\sfrac{1}{2}} where \gamma > 0 is a small numerical constant. However, this paragraph is quite loose. i.e., what exactly does the condition that “each asset isn’t too redundant relative to the number of shocked attributes” mean? Exactly how many observations would you need to see if each asset’s attribute exposure is drawn as x_{n,q} \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sfrac{1}{N})?

Here’s where things get really interesting. Wainwright (2009) shows that there is a sharp bound on the number of observations, N^\star = N^\star(Q,K), that you need to observe in order for \ell_1-type estimators like Lasso to succeed when attribute exposure is drawn iid Gaussian:

(8)   \begin{align*} N^\star(Q,K) &= \mathcal{O}\left[K \cdot \log(Q - K)\right] \end{align*}

with Q \to \infty, K \to \infty, and \sfrac{K}{Q} \to \kappa for some \kappa > 0. When traders observe N < N^\star(Q,K) observations, picking out which attributes have realized a shock is an NP-hard problem; whereas, when they observe N \geq N^\star(Q,K), there exist efficient convex optimization algorithms that solve this problem. This result shows how the N^\star = 3 location recovery bound from the motivating example generalizes to arbitrary numbers of attributes, Q, and shocks, K.
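To see this bound at work, here is a small simulation sketch. The dimensions, seed, penalty level, and shocked attributes are my own assumptions, and I solve the lasso with plain iterative soft-thresholding rather than the exact estimator in Equation (6). With K = 2 shocks hidden among Q = 100 attributes, N = 50 \ll Q observations, comfortably above K \cdot \log(Q - K) \approx 9, suffice to pick out the shocked attributes:

```python
import numpy as np

rng = np.random.default_rng(0)
Q, K, N = 100, 2, 50            # attributes, shocks, observations (N << Q)
support = [3, 47]               # hypothetical shocked attributes

# Attribute exposures drawn iid N(0, 1/N), as in Equation (3).
X = rng.normal(0.0, 1.0 / np.sqrt(N), size=(N, Q))
beta = np.zeros(Q)
beta[support] = 1.0
p = X @ beta + rng.normal(0.0, 0.01, size=N)    # price innovations

# Lasso via iterative soft-thresholding (ISTA), a standard l1 solver.
lam = 0.05
step = 1.0 / np.linalg.norm(X, 2) ** 2          # 1 / largest eigenvalue of X'X
b = np.zeros(Q)
for _ in range(2000):
    g = b - step * X.T @ (X @ b - p)                           # gradient step
    b = np.sign(g) * np.maximum(np.abs(g) - step * lam, 0.0)   # shrinkage step

# The two largest coefficients should sit on the true support.
recovered = sorted(np.argsort(np.abs(b))[-K:].tolist())
print(recovered)
```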

4. Just Identified

I conclude this post by discussing the non-sparse case. i.e., \boldsymbol \beta usually isn’t sparse in econometric textbooks à la Hayashi, Wooldridge, or Angrist and Pischke. When every one of the Q attributes matters, it’s easy to decide which attributes to pay attention to—i.e., all of them. In this situation the mean squared error for an oracle is the same as the mean squared error for mere mortals:

(9)   \begin{align*} \mathrm{MSE}^{\text{Oracle}} & = \frac{1}{Q} \cdot \Vert \mathbf{f} - \hat{\mathbf{f}}^{\text{Oracle}} \Vert_{\ell_2}^2 + \sigma^2 = \frac{1}{Q} \cdot \Vert \mathbf{f} - \hat{\mathbf{f}}^{\text{Mortal}} \Vert_{\ell_2}^2 + \sigma^2 = \mathrm{MSE}^{\text{Mortal}} \end{align*}

Does the location recovery bound disappear in this setting?

No. The location recovery bound corresponds to the usual N \geq Q requirement for identification. To see why, let’s return to the motivating example in Section 2, and consider the case where any of the 7 attributes could have realized a shock. This leaves us with 128 different shock combinations:

(10)   \begin{align*} 128 &= {7 \choose 0} + {7 \choose 1} + {7 \choose 2} + {7 \choose 3} + {7 \choose 4} + {7 \choose 5} + {7 \choose 6} + {7 \choose 7} \\ &= 1 + 7 + 21 + 35 + 35 + 21 + 7 + 1 \\ &= 2^7 \end{align*}

so that N^\star = 7 observations give just enough information to identify which combination of shocks was realized. More generally, we have that for any number of attributes, Q:

(11)   \begin{align*} 2^Q &= \sum_{k=0}^Q {Q \choose k} \end{align*}

This gives an interesting information theoretic interpretation to the meaning of “just identified” that has nothing to do with linear algebra or the invertibility of a matrix.
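The combinatorial identity behind Equations (10) and (11) is easy to confirm:

```python
from math import comb

# Equation (10): with Q = 7 attributes, any subset can be shocked,
# so there are 2^7 = 128 possible shock combinations.
assert sum(comb(7, k) for k in range(8)) == 2**7 == 128

# Equation (11): the identity holds for any number of attributes Q.
for Q in range(1, 16):
    assert sum(comb(Q, k) for k in range(Q + 1)) == 2**Q
print("identity checks out")
```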


Constraining Effort Not Bandwidth

October 3, 2013 by Alex

1. Introduction

Imagine trying to fill up a 5 gallon bucket using a hand-powered water pump. You might end up with a half-full bucket after 1 hour of work for either of 2 reasons. First, the spigot might be too narrow. i.e., even though you are doing enough work to pull 5 gallons of water out of the ground each hour, only 2.5 gallons can actually flow through the spigot during the allotted time. This is a bandwidth constraint. Alternatively, the pump handle might be too short. i.e., you have to crank the handle twice as many times to pull each gallon of water out of the ground. This is an effort constraint.

[Figure: a hand-powered water pump]

Existing information-based asset pricing models a la Grossman and Stiglitz (1980) restrict the bandwidth of arbitrageurs’ flow of information. The market just produces too much information per trading period. i.e., people’s minds have narrow spigots. However, traders also face restrictions on how much work they can do each period. Sometimes it’s hard to crank the pump handle often enough to produce the relevant information. e.g., think of a binary signal such as knowing if a cancer drug has completed a phase of testing as in Huberman and Regev (2001). It doesn’t take much bandwidth to convey this news. After all, people immediately recognized its significance when the New York Times wrote about it. Yet, arbitrageurs must have faced a restriction on the number of scientific articles they could read each year since Nature reported this exact same news 5 months earlier and no one batted an eye! These traders left money on the table because they anticipated having to pump the handle too many times in order to uncover a really simple signal.

This post proposes algorithmic flop counts as a way of quantifying how much effort traders have to expend in order to uncover profitable information. I illustrate the idea via a short example and some computations.

2. Effort via Flop Counts

One way of quantifying how much effort it takes to discover a piece of information is to count the number of floating-point operations (i.e., flops) that a computer has to do to estimate the number. I take my discussion of flop counts primarily from Boyd and Vandenberghe and define a flop as an addition, subtraction, multiplication, or division of 2 floating-point numbers. I use flops as a unit of effort so that:

(1)   \begin{align*} \mathrm{Effort}\left[ \ 2.374 \times 17.392 \ \right] &= 1{\scriptstyle \mathrm{flop}} \end{align*}

in the same way that the cost of a Snickers bar might be \mathdollar 2. I then count the total number of flops needed to calculate a number as a proxy for the effort needed to find out the associated piece of information. e.g., if it took N^2 flops to compute the average return of all technology stocks but N^3 flops to arrive at the median return on assets for all value stocks, then I would say that it is easier (i.e., took less effort) to know the mean return. The key thing here is that this measure is independent of the amount of entropy that either of these 2 calculations resolves.

I write flop counts as a polynomial function of the dimensions of the matrices and vectors involved. Moreover, I always simplify the expression by removing all but the highest order terms. e.g., suppose that an algorithm required:

(2)   \begin{align*} \left\{ M^7 + 7 \cdot M^4 \cdot N + M^2 \cdot N + 2 \cdot M \cdot N^6 + 5 \cdot M \cdot N^2 \right\}{\scriptstyle \mathrm{flops}} \end{align*}

In this case, I would write the flop count as:

(3)   \begin{align*} \left\{M^7 + 2 \cdot M \cdot N^6\right\}{\scriptstyle \mathrm{flops}} \end{align*}

since both these terms are of order 7. Finally, if I also know that N \gg M, I might further simplify to 2 \cdot M \cdot N^6 flops. Below, I am going to be thinking about high-dimensional matrices and vectors (i.e., where M and N are big), so these simplifications are sensible.

Let’s look at a couple of examples to fix ideas. First, consider the task of matrix-vector multiplication. i.e., suppose that there is a matrix \mathbf{X} \in \mathrm{R}^{M \times N} and we want to calculate:

(4)   \begin{align*} \mathbf{y} &= \mathbf{X}{\boldsymbol \beta} \end{align*}

where we know both \mathbf{X} and {\boldsymbol \beta} and want to figure out \mathbf{y}. This task takes an effort of roughly 2 \cdot M \cdot N flops. There are M elements in the vector \mathbf{y}, and computing each one of these elements requires N multiplications and (N-1) additions:

(5)   \begin{align*} y_m &= \sum_n x_{m,n} \cdot \beta_n \end{align*}

This setup is analogous to \mathbf{X} being a dataset with M observations on N different variables where each variable has a linear effect \beta_n on the outcome variable y.
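The 2 \cdot M \cdot N figure is easy to make concrete by instrumenting a naive matrix-vector multiply with a flop counter (a toy sketch with made-up numbers; counting the initial addition into the running total gives exactly 2 \cdot M \cdot N):

```python
def matvec_with_flop_count(X, beta):
    """Compute y = X @ beta naively, counting every multiply and add."""
    flops = 0
    y = []
    for row in X:
        total = 0.0
        for x_mn, b_n in zip(row, beta):
            total += x_mn * b_n  # 1 multiply + 1 add
            flops += 2
        y.append(total)
    return y, flops

# A small M = 3 by N = 4 example: the count is 2 * 3 * 4 = 24 flops.
X = [[1.0, 2.0, 0.0, 1.0],
     [0.0, 1.0, 3.0, 1.0],
     [2.0, 0.0, 1.0, 1.0]]
beta = [1.0, 0.5, 2.0, 1.0]
y, flops = matvec_with_flop_count(X, beta)
print(y, flops)  # → [3.0, 7.5, 5.0] 24
```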

Next, let’s turn the tables and look at the case when we know the outcome variable \mathbf{y} and want to solve for {\boldsymbol \beta} when \mathbf{X} \in \mathrm{R}^{N \times N}. A standard approach here would be to use the factor-solve method whereby we first factor the data matrix into the product of K components, \mathbf{X} = \mathbf{X}_1 \mathbf{X}_2 \cdots \mathbf{X}_K, and then use these components to iteratively compute {\boldsymbol \beta} = \mathbf{X}^{-1}\mathbf{y} as:

(6)   \begin{align*} {\boldsymbol \beta}_1 &= \mathbf{X}_1^{-1}\mathbf{y} \\ {\boldsymbol \beta}_2 &= \mathbf{X}_2^{-1}{\boldsymbol \beta}_1 = \mathbf{X}_2^{-1}\mathbf{X}_1^{-1}\mathbf{y} \\ &\vdots \\ {\boldsymbol \beta} = {\boldsymbol \beta}_K &= \mathbf{X}_K^{-1}{\boldsymbol \beta}_{K-1} = \mathbf{X}_K^{-1}\mathbf{X}_{K-1}^{-1} \cdots \mathbf{X}_1^{-1}\mathbf{y} \end{align*}

We call the process of computing the factors \{\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_K\} the factorization step and the process of solving the equations {\boldsymbol \beta}_{k} = \mathbf{X}_{k}^{-1}{\boldsymbol \beta}_{k-1} the solve step. The total flop count of a solution strategy is then the sum of the flop counts for each of these steps. In many cases the cost of the factorization step is the leading order term.

e.g., consider the Cholesky factorization method that is commonly used in statistical software. We know that for every symmetric positive definite \mathbf{X} \in \mathrm{R}^{N \times N} there exists a factorization:

(7)   \begin{align*} \mathbf{X} = \mathbf{L} \mathbf{L}^{\top} \end{align*}

where \mathbf{L} is lower triangular and non-singular with positive diagonal elements. The cost of computing these Cholesky factors is (\sfrac{1}{3}) \cdot N^3 flops. By contrast, the resulting solve steps of \mathbf{L}{\boldsymbol \beta}_1 = \mathbf{y} and \mathbf{L}^{\top}{\boldsymbol \beta} = {\boldsymbol \beta}_1 each have flop counts of N^2 flops, bringing the total flop count to (\sfrac{1}{3}) \cdot N^3 + 2 \cdot N^2 \sim (\sfrac{1}{3}) \cdot N^3 flops. In the general case, the effort involved in solving a linear system of equations \mathbf{y} = \mathbf{X} {\boldsymbol \beta} for {\boldsymbol \beta} when \mathbf{X} \in \mathrm{R}^{N \times N} grows with \mathrm{O}(N^3). Boyd and Vandenberghe argue that “for N more than a thousand or so, generic methods… become less practical,” and financial markets definitely have more than “a thousand or so” trading opportunities to check!
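The factor-solve recipe can be sketched in a few lines of numpy (the matrix here is a random symmetric positive definite example, an assumption for illustration): the Cholesky factorization is the cubic-cost step, and each substitution pass is only quadratic.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 5
A = rng.normal(size=(N, N))
X = A @ A.T + N * np.eye(N)   # symmetric positive definite by construction
y = rng.normal(size=N)

# Factorization step: X = L L^T, roughly (1/3) N^3 flops.
L = np.linalg.cholesky(X)

# Solve steps, N^2 flops each: forward substitution for L b1 = y ...
b1 = np.zeros(N)
for i in range(N):
    b1[i] = (y[i] - L[i, :i] @ b1[:i]) / L[i, i]

# ... then back substitution for L^T beta = b1.
beta = np.zeros(N)
for i in reversed(range(N)):
    beta[i] = (b1[i] - L[i + 1:, i] @ beta[i + 1:]) / L[i, i]

# The factor-solve answer matches a generic dense solve.
assert np.allclose(beta, np.linalg.solve(X, y))
print(beta)
```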

3. Asset Structure

Consider constraining traders’ cognitive effort in an information-based asset pricing model a la Kyle (1985) but with many assets and attribute-specific shocks. Specifically, suppose that there are N stocks that each have H different payout-relevant characteristics. Every characteristic can take on I distinct levels. I call a (characteristic, level) pairing an ‘attribute’ and use the indicator variable a_n(h,i) to denote whether or not a stock has an attribute. Think about attributes as sitting in a (H \times I)-dimensional matrix, \mathbf{A}, as illustrated in Equation (8) below:

(8)   \begin{equation*}   \mathbf{A}^{\top} = \bordermatrix{     ~      & 1                              & 2                         & \cdots & H                            \cr     1      & \text{Agriculture}             & \text{Albuquerque}        & \cdots & \text{Alcoa Inc}             \cr     2      & \text{Apparel}                 & \textbf{\color{red}Boise} & \cdots & \text{ConocoPhillips}        \cr     3      & \textbf{\color{red}Disk Drive} & \text{Chicago}            & \cdots & \text{Dell Inc} \cr     \vdots & \vdots                         & \vdots                    & \ddots & \vdots \cr     I      & \text{Wholesale}               & \text{Vancouver}          & \cdots & \textbf{\color{red}Xerox Corp} \cr} \end{equation*}

I’ve highlighted the attributes for Micron Technology. e.g., we have that a_{\text{Mcrn}}(\text{City},\text{Boise}) = 1 while a_{\text{WDig}}(\text{City},\text{Boise}) = 0 since Micron Technology is based in Boise, ID while Western Digital is based in SoCal.

Further, suppose that each stock’s value is then the sum of a collection of attribute-specific shocks:

(9)   \begin{align*} v_n &= \sum_{h,i} x(h,i) \cdot a_n(h,i) \end{align*}

where the shocks are distributed according to the rule:

(10)   \begin{align*} x(h,i) &= x^+(h,i) + x^-(h,i) \quad \text{with each} \quad x^\pm(h,i) \overset{\scriptscriptstyle \mathrm{iid}}{\sim}  \begin{cases} \pm \sfrac{\delta}{\sqrt{H}} &\text{ w/ prob } \pi \\ \ \: \, 0 &\text{ w/ prob } (1 - \pi) \end{cases} \end{align*}

Each of the x(h,i) indicates whether or not the attribute (h,i) happened to realize a shock. The \sfrac{\delta}{\sqrt{H}} > 0 term represents the amplitude of all shocks in units of dollars per share, and the \pi term represents the probability of either a positive or negative shock to attribute (h,i) each period.

If value investors learn asset-specific information and Kyle (1985)-type market makers price each individual stock using only their own order flow in a dynamic setting, then each individual stock will be priced correctly:

(11)   \begin{align*} \mathrm{E} \left[ \ p_n - v_n \ \middle| \ y_n \ \right] &= 0 \end{align*}

where y_n denotes the aggregate order flow for stock n. Yet, the high dimensionality of the market would mean that there still could be groups of mispriced stocks:

(12)   \begin{align*} \mathrm{E} \left[ \ \langle p_n \rangle_{h,i} - \langle v_n \rangle_{h,i} \ \middle| \ x(h,i) = \sfrac{\delta}{\sqrt{H}} \ \right] &< 0 \; \text{and} \; \mathrm{E} \left[ \ \langle p_n \rangle_{h,i} - \langle v_n \rangle_{h,i} \ \middle| \ x(h,i) = - \sfrac{\delta}{\sqrt{H}} \ \right] > 0 \end{align*}

where \langle p_n \rangle_{h,i} = \sfrac{I}{N} \cdot \sum_n p_n \cdot a_n(h,i) denotes the sample average price for stocks with a particular attribute, (h,i). This is a case of more is different. If an oracle told you that x(h,i) = \sfrac{\delta}{\sqrt{H}} for some attribute (h,i), then you would know that the average price of stocks with attribute (h,i) would be:

(13)   \begin{align*} \langle p_n \rangle_{h,i} &= \lambda_n \cdot \beta_n \cdot \frac{\delta}{\sqrt{H}} + \mathrm{O}(\sfrac{N}{I})^{-1/2} \end{align*}

since \lambda_n \cdot \beta_n < 1 in a dynamic Kyle (1985) model where informed traders have an incentive to trade less aggressively today (i.e., decrease \beta_n and thus \lambda_n) in order to act on their information again tomorrow. In this setting, \langle p_{n,1} \rangle_{h,i} will be less than its fundamental value \langle v_n \rangle_{h,i} = \sfrac{\delta}{\sqrt{H}} even though it will be easy to see that \langle p_{n,1} \rangle_{h,i} \neq 0 as \sfrac{I}{N} \to 0.

4. Arbitrageurs’ Inference Problem

So how much effort does it take to discover the set of shocked attributes, \mathbf{A}^\star \subseteq \mathbf{A}:

(14)   \begin{align*} \mathbf{A}^\star &= \left\{ \ (h,i) \in H \times I \ \middle| \ x(h,i) \neq 0 \ \right\} \quad \text{where} \quad A^\star = \#\left\{ \ (h,i) \ \middle| \ (h,i) \in \mathbf{A}^\star \ \right\} \end{align*}

given their price impact? What’s stopping arbitrageurs from trading away these attribute-specific pricing errors? Well, the problem of finding the attributes in \mathbf{A}^\star boils down to solving:

(15)   \begin{align*} \underset{\mathbf{y}}{\begin{pmatrix} p_{1,1} \\ p_{2,1} \\ \vdots \\ p_{n,1} \\ \vdots \\ p_{N,1} \end{pmatrix}} &= \underset{\mathbf{X}}{\begin{bmatrix} a_1(1,1) & a_1(1,2) & \cdots & a_1(h,i) & \cdots & a_1(H,I) \\  a_2(1,1) & a_2(1,2) & \cdots & a_2(h,i) & \cdots & a_2(H,I) \\  \vdots   & \vdots   & \ddots & \vdots   & \ddots & \vdots   \\  a_n(1,1) & a_n(1,2) & \cdots & a_n(h,i) & \cdots & a_n(H,I) \\  \vdots   & \vdots   & \ddots & \vdots   & \ddots & \vdots   \\  a_N(1,1) & a_N(1,2) & \cdots & a_N(h,i) & \cdots & a_N(H,I) \end{bmatrix}} \underset{\boldsymbol \beta}{\begin{pmatrix} \langle p_{n,1} \rangle_{1,1} \\ \langle p_{n,1} \rangle_{1,2} \\ \vdots \\ \langle p_{n,1} \rangle_{h,i} \\ \vdots \\ \langle p_{n,1} \rangle_{H,I} \end{pmatrix}} \end{align*}

for {\boldsymbol \beta} where \mathbf{X} \in \mathrm{R}^{N \times A}, A \gg N \geq A^\star, and \mathrm{Rank}[\mathbf{X}] = N. i.e., this is a similar problem to the linear solve in Section 3 above, but with 2 additional complications. First, the system is underdetermined in the sense that there are many more payout-relevant attributes than stocks, A \gg N \geq A^\star. Second, arbitrageurs don’t know exactly how many attributes are in \mathbf{A}^\star. They know that on average, \mathrm{E}[A^\star] = 2 \cdot \pi \cdot (1 - \pi) \cdot A; however, A^\star itself is a random variable.

It’s easy enough to extend the solution strategy in Section 3 to the case of an underdetermined system where a solution {\boldsymbol \beta} is a member of the set:

(16)   \begin{align*} \left\{ \ {\boldsymbol \beta} \ \middle| \ \mathbf{X}{\boldsymbol \beta} = \mathbf{y} \ \right\} &= \left\{ \  \begin{bmatrix} {\boldsymbol \beta}_1 \\ {\boldsymbol \beta}_2 \end{bmatrix} = \begin{bmatrix} {\boldsymbol \beta}_1 \\ 0 \end{bmatrix} + \mathbf{F} \mathbf{b} \ \middle| \ \mathbf{b} \in \mathrm{R}^{A - A^\star} \ \right\} \end{align*}

where \mathbf{F} is a matrix whose column vectors are a basis for the null space of \mathbf{X}. Suppose that \mathbf{X}_1 \subseteq \mathbf{X} is (A^\star \times A^\star)-dimensional and non-singular, then:

(17)   \begin{align*} \mathbf{X} {\boldsymbol \beta} &= \begin{bmatrix} \mathbf{X}_1 & \mathbf{X}_2 \end{bmatrix} \begin{bmatrix} {\boldsymbol \beta}_1 \\ {\boldsymbol \beta}_2 \end{bmatrix} = \mathbf{y} \quad \text{and} \quad {\boldsymbol \beta}_1 = \mathbf{X}_1^{-1} \left( \mathbf{y} - \mathbf{X}_2 {\boldsymbol \beta}_2 \right) \end{align*}

Obviously, setting {\boldsymbol \beta}_2 = 0 is one solution. The full set of solutions defining the null space \mathbf{F} is given by:

(18)   \begin{align*} {\boldsymbol \beta} &= \begin{bmatrix} {\boldsymbol \beta}_1 \\ {\boldsymbol \beta}_2 \end{bmatrix} = \begin{bmatrix} \mathbf{X}_1^{-1} \mathbf{y} \\ 0 \end{bmatrix} + \underbrace{\begin{bmatrix} - \mathbf{X}_1^{-1} \mathbf{X}_2 \\ \mathbf{I} \end{bmatrix}}_{\mathbf{F}} \mathbf{b} \end{align*}

Thus, if it takes f flops to factor \mathbf{X} into \begin{bmatrix} \mathbf{X}_1 & \mathbf{X}_2 \end{bmatrix} and s flops to solve each linear system of the form \mathbf{X}_1 {\boldsymbol \beta}_1 = \mathbf{y}, then the total cost of parameterizing all the solutions is:

(19)   \begin{align*} \{ f + s \cdot (A - A^\star + 1)\}{\scriptstyle \mathrm{flops}} \end{align*}
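Equations (16)-(18) are straightforward to verify numerically. Here is a small sketch with assumed dimensions (4 stocks and 10 attributes, taking the leading square block of \mathbf{X} as the non-singular \mathbf{X}_1):

```python
import numpy as np

rng = np.random.default_rng(2)
n, a = 4, 10                       # observations and attributes (n << a)
X = rng.normal(size=(n, a))        # full row rank with probability 1
y = rng.normal(size=n)

X1, X2 = X[:, :n], X[:, n:]        # X1 square and (generically) non-singular

# Particular solution from Equation (18): beta = [X1^{-1} y; 0] ...
beta0 = np.concatenate([np.linalg.solve(X1, y), np.zeros(a - n)])

# ... plus the null-space matrix F = [[-X1^{-1} X2], [I]].
F = np.vstack([-np.linalg.solve(X1, X2), np.eye(a - n)])

# Every beta0 + F b solves X beta = y, for any b in R^{a - n}.
for _ in range(3):
    b = rng.normal(size=a - n)
    assert np.allclose(X @ (beta0 + F @ b), y)
print("parameterization verified")
```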

[Figure: Deep Thought vs. a trading floor]

Via the LU factorization method, I know that the factorization step will cost roughly:

(20)   \begin{align*} f &= \left\{ (\sfrac{2}{3}) \cdot (A^\star)^3 + (A^\star)^2 \cdot (A - A^\star) \right\}{\scriptstyle \mathrm{flops}} \end{align*}

Moreover, we know from Section 3 that the cost of the solve step will be on the order of s = \mathrm{O}((A^\star)^3). However, there is still one detail left to consider. Namely, arbitrageurs don’t know A^\star. Thus, they have to solve for both {\boldsymbol \beta} and A^\star by starting at A^\star_0 = \mathrm{E}[A^\star] and iterating on the above process until the columns of \mathbf{F} actually represent a basis for the null space of \mathbf{X}. Thus, the total effort needed is:

(21)   \begin{align*} \left\{ (\sfrac{1}{3}) \cdot (A^\star)^3 \cdot (A - A^\star) \right\}^{\sfrac{1}{\gamma}} {\scriptstyle \mathrm{flops}} \end{align*}

where \gamma \in (0,1) is the convergence rate and the calculation is dominated by the effort spent searching through the null space to be sure that A^\star is correct. More broadly, this step is just one way of capturing the deeper idea that knowing where to look is hard. e.g., Warren Buffett says that he “can say no in 10 seconds or so to 90{\scriptstyle \%} or more of all the [investment opportunities] that come along.” This is great… until you consider how many investment opportunities Buffett runs across every single day. Saying no in 10 seconds flat then turns out to be quite a chore! Alternatively, as the figure above highlights, this is why traders use personalized multi-monitor computing setups that make it easy to spot patterns instead of a shared super computer with minimal output.

5. Clock Time Interpretation

Is \mathrm{O}((A^\star)^3 \cdot (A - A^\star))^{\sfrac{1}{\gamma}} flops a big number? Is it a small number? Flop counts were originally used when floating-point operations were the main computing bottleneck. Now, things relating to how data are stored, such as cache boundaries and reference locality, have first-order effects on computation time as well. Nevertheless, flop counts can still give a good back-of-the-envelope estimate of the relative amount of time it would take to execute a procedure, and such a calculation would be helpful in trying to interpret the unit of measurement “flops” on a human scale. e.g., on one hand, arbitrageur effort would be a silly constraint to worry about if the time it took to execute real-world calculations was infinitesimally small. On the other hand, flops might be a poor unit of measure for arbitrageurs’ effort if the time it took to carry out reasonable calculations was on the order of millennia since arbitrageurs clearly don’t wait this long to act! Actually doing a quick computation can allay these fears.

Suppose that computers can execute roughly 1{\scriptstyle \mathrm{mil}} operations per second. Millions of instructions per second (i.e., \mathrm{MIPS}) is a standard unit of computational speed. I can then calculate the amount of time it would take to execute a given number of flops at a speed of 1{\scriptstyle \mathrm{MIPS}} as:

(22)   \begin{align*} \mathrm{Time} &= \left((A^\star)^3 \cdot (A - A^\star) \right)^{\sfrac{1}{\gamma}} \times \left(\frac{1 {\scriptstyle \mathrm{sec}}}{10^6 {\scriptstyle \mathrm{flops}}}\right) \times \left(\frac{1 {\scriptstyle \mathrm{day}}}{86400 {\scriptstyle \mathrm{sec}}}\right) \end{align*}

Thus, if there are roughly 5000 characteristics that can take on 50 different levels and 1 out of every 1000 attributes realizes a shock each period, then even if arbitrageurs guess exactly right on the number of shocked attributes (i.e., so that \gamma = 1) a brute force search would take 45 days to complete. Clearly, a brute force search strategy just isn’t feasible. There just isn’t enough time to physically do all of the calculations.
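The 45-day figure comes from plugging the numbers above into Equation (22) with \gamma = 1 and 1{\scriptstyle \mathrm{MIPS}}:

```python
H, I = 5000, 50          # characteristics and levels per characteristic
A = H * I                # total number of attributes: 250,000
A_star = A // 1000       # 1 in 1,000 attributes realizes a shock: 250

flops = A_star**3 * (A - A_star)   # Equation (22) with gamma = 1
days = flops / 1e6 / 86400         # 1e6 flops per second, 86,400 seconds per day
print(round(days))  # → 45
```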

[Figure: time per search vs. time per period]

6. A Persistent Problem

I conclude by addressing a common question. You might ask: “Won’t really fast computers make cognitive control irrelevant?” No. Progress in computer storage has actually outstripped progress in processing speeds by a wide margin. This is known as Kryder’s Law. Over 10 years, the cost of processing has dropped by a factor of roughly 32 (i.e., Moore’s Law). By contrast, the cost of storage has dropped by a factor of 1000 over the same period. e.g., take a look at the figure below, made using data from www.mkomo.com, which shows that the cost of disk space decreases by 58{\scriptstyle \%} each year. What does this mean in practice? Well, as late as 1980 a 26{\scriptstyle \mathrm{MB}} hard drive cost \mathdollar 5 \times 10^3, implying that a 1{\scriptstyle \mathrm{TB}} hard drive would have cost upwards of \mathdollar 2 \times 10^8. These days you can pick up a 1{\scriptstyle \mathrm{TB}} drive for about \mathdollar 50! We have so much storage that finding things is now an important challenge. This is why we find Google so helpful. Instead of being eliminated by computational power, cognitive control turns out to be a distinctly modern problem.

[Figure: Kryder’s Law, cost per GB of disk space over time]

