Research Notebook

Investigation Bandwidth

March 3, 2014 by Alex

1. Motivation

Time is dimensionless in modern asset pricing theory. e.g., the canonical Euler equation:

(1)   \begin{align*} P_t &= \widetilde{\mathrm{E}}_t[ \, P_{t+1} + D_{t+1} \, ] \end{align*}

says that the price of an asset at time t (i.e., P_t) is equal to the risk-adjusted expectation at time t (i.e., \widetilde{E}_t[\cdot]) of the price of the asset at time t+1 plus the risk-adjusted expectation of any dividends paid out by the asset at time t+1 (i.e., P_{t+1} + D_{t+1}). Yet, the theory never answers the question: “Plus 1 what?” Should we be thinking about seconds? Hours? Days? Years? Centuries? Millennia?

Why does this matter? An algorithmic trader adjusting his position each second worries about different risks than Warren Buffett, who has a median holding period of decades. e.g., Buffett studies cash flows, dividends, and business plans. By contrast, the probability that a firm paying out a quarterly dividend happens to pay its dividend during any randomly chosen 1-second time interval is \sfrac{1}{1814400}. i.e., roughly the odds of picking 1 particular year at random out of the last 1.8 million years. Thus, if an algorithmic trader and Warren Buffett both looked at the exact same stock at the exact same time, then they would have to use different risk-adjusted expectations operators:

(2)   \begin{align*} P_t &= \begin{cases}  \widetilde{\mathrm{E}}^{\text{Alg}}_t[ \, P_{t+1{\scriptscriptstyle \mathrm{sec}}} \, ] &\text{from algorithmic trader's p.o.v.} \\ \widetilde{\mathrm{E}}^{\text{WB}}_t[ \, P_{t+1{\scriptscriptstyle \mathrm{qtr}}} + D_{t+1{\scriptscriptstyle \mathrm{qtr}}} \, ] &\text{from Warren Buffett's p.o.v.} \end{cases} \end{align*}

This note gives a simple economic model in which traders endogenously specialize in looking for information at a particular time scale and ignore predictability at vastly different time scales.

2. Simulation

I start with a simple numerical simulation that illustrates why traders at the daily horizon will ignore price patterns at vastly different frequencies. Suppose that Cisco’s stock returns are composed of a constant growth rate \mu = \sfrac{0.04}{(480 \cdot 252)}, a daily wobble \beta \cdot \sin(2 \cdot \pi \cdot t) with \beta = \sfrac{1}{(480 \cdot 252)}, and a white noise term \epsilon_t \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma^2) with \sigma= \sfrac{0.12}{\sqrt{480 \cdot 252}}:

(3)   \begin{align*} R_t &= \mu + \beta \cdot \sin(2 \cdot \pi \cdot t) + \epsilon_t, \quad \text{for} \quad t = \sfrac{1}{480}, \sfrac{2}{480}, \ldots, \sfrac{10079}{480}, \sfrac{10080}{480} \end{align*}

I consider a world where the clock ticks forward in 1 minute increments so that each tick represents \sfrac{1}{480}th of a trading day. The figure below shows a single sample path of Cisco’s return process over the course of a month.

plot--daily-wobble-plus-noise

What are the properties of this return process? First, the constant growth rate, \mu = \sfrac{0.04}{(480 \cdot 252)}, implies that Cisco has a 4{\scriptstyle \%} per year return on average. Second, the volatility of the noise component, \sigma= \sfrac{0.12}{\sqrt{480 \cdot 252}}, implies that the annualized volatility of Cisco’s returns is 12{\scriptstyle \%/\sqrt{\mathrm{Yr}}}. Finally, since:

(4)   \begin{align*} \frac{1}{2 \cdot \pi} \cdot \int_0^{2 \cdot \pi} [\sin(x)]^2 \cdot dx &= \frac{1}{2} \end{align*}

the choice of \beta = \sfrac{1}{(480 \cdot 252)} means that (in a world with a 0{\scriptstyle \%} riskless rate) a trading strategy which is long Cisco stock in the morning and short Cisco stock in the afternoon will generate a return on the order of 100{\scriptstyle \%} over the course of 1 year. i.e., this is a big daily wobble! If you start with \mathdollar 1 on the morning of January 1st, you end up with roughly \mathdollar 2 on the evening of December 31st on average by following this trading strategy. The figure below confirms this math by simulating 100 year-long realizations of this trading strategy’s returns.

plot--cum-trading-strategy-returns
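
For concreteness, here is a minimal Python sketch of this simulation and of the morning-long/afternoon-short timing strategy (the use of numpy, the random seed, and the helper name simulate_returns are my own choices; the parameter values are the ones given above):

import numpy as np

# 480 one-minute ticks per trading day and 252 trading days per year.
TICKS_PER_DAY, DAYS_PER_YEAR = 480, 252
mu = 0.04 / (TICKS_PER_DAY * DAYS_PER_YEAR)             # 4% per year drift, per tick
beta = 1.0 / (TICKS_PER_DAY * DAYS_PER_YEAR)            # daily wobble amplitude, per tick
sigma = 0.12 / np.sqrt(TICKS_PER_DAY * DAYS_PER_YEAR)   # 12%/sqrt(yr) noise, per tick

rng = np.random.default_rng(0)

def simulate_returns(n_days):
    """One sample path of R_t = mu + beta*sin(2*pi*t) + eps_t at t = 1/480, 2/480, ..."""
    t = np.arange(1, n_days * TICKS_PER_DAY + 1) / TICKS_PER_DAY
    eps = rng.normal(0.0, sigma, size=t.size)
    return t, mu + beta * np.sin(2 * np.pi * t) + eps

# Long when the wobble is positive (morning), short when it is negative (afternoon).
t, r = simulate_returns(DAYS_PER_YEAR)
position = np.sign(np.sin(2 * np.pi * t))
print(np.sum(position * r))   # cumulative 1-year return of the timing strategy (no compounding)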

3. Trader’s Problem

Suppose you didn’t know the exact frequency of the wobble in Cisco’s returns. The wobble is equally likely to have a frequency of anywhere from \sfrac{1}{252} cycles per day to 480 cycles per day. Using the last month’s worth of data, suppose you estimated the regressions specified below:

(5)   \begin{align*} R_t &= \hat{\mu} + \hat{\beta} \cdot \sin(2 \cdot \pi \cdot f \cdot t) + \hat{\gamma} \cdot \cos(2 \cdot \pi \cdot f \cdot t) + \hat{\epsilon}_t \quad \text{for each} \quad \sfrac{1}{252} < f < 480 \end{align*}

and identified the frequency, f_{\min}, which best fit the data:

(6)   \begin{align*} f_{\min} &= \arg \min_{\sfrac{1}{252} < f < 480} \left\{ \, \hat{\sigma}(f) \, \right\} \end{align*}
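
Here is a minimal Python sketch of this grid search (my own implementation; the 2,000-point log-spaced frequency grid and the helper name best_fit_frequency are arbitrary choices, and t and r are assumed to hold a month of time stamps and minute-by-minute returns like the ones simulated above):

import numpy as np

def best_fit_frequency(t, r, freqs):
    """Regress r on [1, sin(2*pi*f*t), cos(2*pi*f*t)] for each candidate f and
    return the frequency with the smallest residual standard deviation."""
    resid_sd = []
    for f in freqs:
        X = np.column_stack([np.ones_like(t),
                             np.sin(2 * np.pi * f * t),
                             np.cos(2 * np.pi * f * t)])
        coef, *_ = np.linalg.lstsq(X, r, rcond=None)
        resid_sd.append(np.std(r - X @ coef))
    return freqs[int(np.argmin(resid_sd))]

# Candidate frequencies from 1 cycle per 252 days up to 480 cycles per day.
freqs = np.exp(np.linspace(np.log(1 / 252), np.log(480), 2000))
# f_min = best_fit_frequency(t, r, freqs)   # t, r: one month of simulated data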

The figure below shows the empirical distribution of these best in-sample fit frequencies when the true frequency is a daily wobble. The figure reads: “A month’s worth of Cisco’s minute-by-minute returns is best fit by a factor with a frequency of \sfrac{1}{1.01{\scriptstyle \mathrm{days}}} about 2{\scriptstyle \%} of the time when the true frequency is 1 cycle a day.”

plot--best-in-sample-fit-freq

Suppose that you notice a wobble with a frequency of \sfrac{1}{1.01{\scriptstyle \mathrm{days}}} fit Cisco’s returns over the last month really well, but you also know that this is a noisy in-sample estimate. The true wobble could have a different frequency. If you can expend some cognitive effort to investigate alternate frequencies, how wide a bandwidth of frequencies should you investigate? Here’s where things get interesting. The figure above essentially says that you should never investigate frequencies outside of f_{\min} \pm 0.5 \cdot \sfrac{1}{21}—i.e., plus or minus half the width of the bell. The probability that a pattern in returns with a frequency outside this range is actually driving the results is nil!

4. Costs and Benefits

Again, suppose you’re a trader who’s noticed that there is a daily wobble in Cisco’s returns over the past month. i.e., using the past month’s data, you’ve estimated f_{\min} = \sfrac{1}{1{\scriptstyle \mathrm{day}}}. Just as before, it’s a big wobble. Implemented at the right time scale, f_\star, you know that this strategy of buying early and selling late will generate an R(f_\star) = 100{\scriptstyle \%/\mathrm{yr}} = 8.33{\scriptstyle \%/\mathrm{mon}} return. Nevertheless, you also know that f_{\min} isn’t necessarily the right frequency to invest in just because it had the lowest in-sample error over the last month. You don’t want to go to your MD and pitch a strategy only to have to adjust it a month later due to poor performance. Let’s say that it costs you \kappa dollars to investigate a range of \delta frequencies. If you investigate a particular range and f_\star is there, then you will discover f_\star with probability 1.

The question is then: “Which frequency buckets should you investigate?” First, are we losing anything by only searching in \delta-sized increments? Well, we can tile the entire frequency range with tiny \delta-sized increments as follows:

(7)   \begin{align*} 1 - \Delta(x,N) &= \sum_{n=0}^{N-1} \mathrm{Pr}\left[ \, x + n \cdot \delta \leq f_\star < x + (n + 1) \cdot \delta \, \middle| \, f_{\min} \, \right]  \end{align*}

i.e., starting at frequency x we can iteratively add N different increments of size \delta. If we start at a small enough frequency, x, and add enough increments, N, then we can tile as much of the entire domain as we like so that \Delta(x,N) is as small as we like.

Next, what are the benefits of discovering the correct time scale to invest in? If R(f_{\star}) denotes the returns to investing in a trading strategy at the correct time scale over the course of the next month, let:

(8)   \begin{align*} \mathrm{Corr}[R(f_{\star}),R(f_{\min})] &= C(f_{\star},f_{\min}) \end{align*}

denote the correlation between the returns of the strategy at the true frequency and the strategy at the best in-sample fit frequency. We know that C(f_{\star},f_{\star}) = 1 and that:

(9)   \begin{align*} \frac{dC(f_{\star},f_{\min})}{d|\log f_{\star} - \log f_{\min}|} < 0 \qquad \text{with} \qquad \lim_{|\log f_{\star} - \log f_{\min}| \to \infty} R(f_{\min}) = 0 \end{align*}

i.e., as f_{\min} gets farther and farther away from f_{\star}, your realized returns over the next month from a trading strategy implemented at horizon f_{\min} will become less and less correlated with the returns of the strategy implemented at f_{\star} and as a consequence shrink to 0. Thus, the benefit to discovering that the true frequency was not f_{\min} is given by (1 - C(f_\star,f_{\min})) \cdot R(f_{\star}).

Putting the pieces together, it’s clear that you should investigate a particular range of frequencies for a confounding explanation if the probability of finding f_{\star} there given the realized f_{\min}, times the benefit of discovering the true f_{\star} in that range, exceeds the search cost \kappa:

(10)   \begin{align*} \kappa &\leq \underbrace{\mathrm{Pr}\left[ \, x + n \cdot \delta \leq f_\star < x + (n + 1) \cdot \delta \, \middle| \, f_{\min} \, \right]}_{\substack{\text{Probability of finding $f_\star$ in a } \\ \text{particular range given observed $f_{\min}$.}}} \cdot \overbrace{(1 - C(f_\star,f_{\min})) \cdot R(f_{\star})}^{\substack{\text{Benefit of} \\ \text{discovery}}} \end{align*}

i.e., you’ll have a donut-shaped search pattern around f_{\min}. You won’t investigate frequencies that are really different from f_{\min} since the probability of finding f_{\star} there will be too low to justify the search costs. By contrast, you won’t investigate frequencies that are too similar to f_{\min} since the benefits of discovering this minuscule error don’t justify the costs even though such tiny errors may be quite likely.
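
To see how Equation (10) produces this donut, here is a tiny numerical sketch (the bucket probabilities, correlations, and the search cost \kappa are made-up illustrative values; only R(f_\star) = 8.33{\scriptstyle \%/\mathrm{mon}} comes from the text):

def investigate(bucket_prob, corr_with_fmin, R_star=0.0833, kappa=0.001):
    """Equation (10): search a delta-sized bucket iff the probability of finding
    f_star there times the benefit of discovery covers the search cost kappa."""
    return bucket_prob * (1.0 - corr_with_fmin) * R_star >= kappa

# Far from f_min: the probability term kills the product.
print(investigate(bucket_prob=1e-6, corr_with_fmin=0.1))    # False
# Right next to f_min: the (1 - correlation) term kills the product.
print(investigate(bucket_prob=0.30, corr_with_fmin=0.999))  # False
# Intermediate buckets: both terms are moderate, so you search there.
print(investigate(bucket_prob=0.10, corr_with_fmin=0.5))    # True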

5. Wrapping Up

I started with the question: “How can it be that an algorithmic trader and Warren Buffett worry about different patterns in the same price path?” In the analysis above I give one possible answer. If you see a tradable anomaly at a particular time scale (e.g., 1 wobble per day) over the past month, then the probability that this anomaly was caused by a data generating process with a much shorter or much longer frequency is essentially 0. I used only sine wave plus noise processes above, but it seems like this assumption can be easily relaxed via results from, say, Freidlin and Wentzell.


The Secrets N Prices Keep

December 30, 2013 by Alex

1. Introduction

Prices are signals about shocks to fundamentals. In a world where there are many stocks and lots of different kinds of shocks to fundamentals, traders are often more concerned with identifying exactly which shocks took place than the value of any particular asset. e.g., imagine you are a day trader. While you certainly care about changes in the fundamental value of Apple stock, you care much more about the size and location of the underlying shocks since you can profit from this information elsewhere. On one hand, if all firms based in California were hit with a positive shock, you might want to buy shares of Apple, Banana Republic, Costco, …, and Zero Skateboards stock. On the other hand, if all electronic equipment companies were hit with a positive shock, you might want to buy up Apple, Bose, Cisco Systems, …, and Zenith shares instead.

It turns out that there is a sharp phase change in traders’ ability to draw inferences about attribute-specific shocks from prices. i.e., when there have been fewer than N^\star transactions, you can’t tell exactly which shocks affected Apple’s fundamental value. Even if you knew that Apple had been hit by some shock, with fewer than N^\star observations you couldn’t tell whether it was a California-specific event or an electronic equipment-specific event. By contrast, when there have been more than N^\star transactions, you can figure out exactly which shocks have occurred. The additional (N - N^\star) transactions simply allow you to fine tune your beliefs about exactly how large the shocks were. The surprising result is that N^\star is a) independent of traders’ cognitive abilities and b) easily calculable via tools from the compressed sensing literature. See my earlier post for details.

This signal recovery bound is thus a novel constraint on the amount of information that real world traders can extract from prices. Moreover, the bound gives a concrete meaning to the term “local knowledge”. e.g., shocks that haven’t yet manifested themselves in N^\star transactions are local in the sense that no one can spot them through prices. Anyone who knows of their existence must have found out via some other channel. To build intuition, this post gives 3 examples of this constraint in action.

2. Out-of-Town House Buyer

First I show where this signal recovery bound comes from. People spend lots of time looking for houses in different cities. e.g., see Trulia or my paper. Suppose you moved away from Chicago a year ago, and now you’re moving back and looking for a house. When studying a list of recent sales prices, you find yourself a bit surprised. People must have changed their preferences for 1 of 7 different amenities: ^{(1)}a 2 car garage, ^{(2)}a 3rd bedroom, ^{(3)}a half-circle driveway, ^{(4)}granite countertops, ^{(5)}energy efficient appliances, ^{(6)}central A/C, or ^{(7)}a walk-in closet. Having the mystery amenity raises the sale price by \beta > 0 dollars. You would know how preferences had evolved if you had lived in Chicago the whole time; however, in the absence of this local knowledge, how many sales would you need to see in order to figure out which of the 7 amenities mattered?

The answer is 3. How did I come up with this number? For ease of explanation, let’s normalize expected house prices to \mathrm{E}_{t-1}[p_{n,t}] = 0. Suppose you found one house with amenities \{1,3,5,7\}, a second house with amenities \{2, 3, 6, 7\}, and a third house with amenities \{4, 5, 6,7\}. The combination of prices for these 3 houses would reveal exactly which amenity had been shocked. i.e., if only the first house’s price was higher than expected, p_{1,t} \approx \beta, then Chicagoans must have changed their preferences for having a 2 car garage:

(1)   \begin{equation*} {\small  \begin{bmatrix} p_{1,t} \\ p_{2,t} \\ p_{3,t} \end{bmatrix}  = \begin{bmatrix} \beta \\ 0 \\ 0 \end{bmatrix}  =  \begin{bmatrix}  1 & 0 & 1 & 0 & 1 & 0 & 1  \\  0 & 1 & 1 & 0 & 0 & 1 & 1  \\  0 & 0 & 0 & 1 & 1 & 1 & 1  \end{bmatrix} \begin{bmatrix}  \beta \\ 0 \\ \vdots \\ 0  \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \end{bmatrix}  } \quad \text{with} \quad \epsilon_n \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma^2), \, \beta \gg \sigma \end{equation*}

By contrast, if it was the case that p_{1,t} \approx \beta, p_{2,t} \approx \beta, and p_{3,t} \approx \beta, then you would know that people now value walk-in closets much more than they did a year ago.

Here is the key point. 3 sales is just enough information to answer 7 yes or no questions and rule out the possibility of no change:

(2)   \begin{align*}   7 = 2^3 - 1 \end{align*}

N = 4 sales simply narrows your error bars around the exact value of \beta. N = 2 sales only allows you to distinguish between subsets of amenities. e.g., seeing just the 1st and 2nd houses with unexpectedly high prices only tells you that people like either half-circle driveways or walk-in closets more. It doesn’t tell you which one. The problem changes character at N = N^\star(7,1) = 3. When you have seen fewer than N^\star = 3 sales, information about how preferences have changed is purely local knowledge. Prices can’t publicize this information. You must live and work in Chicago to learn it.
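
Here is a small Python sketch of this counting argument (my own illustration; the \mathdollar 100 shock size is an arbitrary placeholder):

import numpy as np

# Rows: the 3 observed houses; columns: the 7 amenities.
# House 1 has amenities {1,3,5,7}, house 2 has {2,3,6,7}, house 3 has {4,5,6,7},
# so column q is just the binary expansion of q -- all 7 columns are distinct.
A = np.array([[1, 0, 1, 0, 1, 0, 1],
              [0, 1, 1, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1, 1]])

beta = 100.0  # dollar value of the preference shock (illustrative)

for q in range(7):
    shock = np.zeros(7)
    shock[q] = beta
    print("amenity", q + 1, "-> price surprises", A @ shock)
# Every amenity generates a different pattern across the 3 houses,
# so 3 sales are enough to pin down which single amenity was shocked.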

3. Industry Analyst’s Advantage

Next, I illustrate how this signal recovery bound acts like a cognitive constraint for would-be arbitrageurs. Suppose you’re a petroleum industry analyst. Through long, hard, caffeine-fueled nights of research you’ve discovered that oil companies such as Schlumberger, Halliburton, and Baker Hughes who’ve invested in hydraulic fracturing (a.k.a., “fracking”) are due for a big unexpected payout. This is really valuable information affecting only a few of the major oil companies. Many companies haven’t really invested in this technology, and they won’t be affected by the shock. How aggressively should you trade Schlumberger, Halliburton, and Baker Hughes? On one hand, you want to build up a large position in these stocks to take advantage of the future price increases that you know are going to happen. On the other hand, you don’t want to allow news of this shock to spill out to the rest of the market.

In the canonical Grossman and Stiglitz (1980)-type setup, the reason that would-be arbitrageurs can’t immediately infer your hard-earned information from prices is the existence of noise traders. They can’t be completely sure whether a sudden price movement is due to a) your informed trading or b) random noise trader demand. Here, I propose a new confound: the existence of many plausible shocks. e.g., suppose you start aggressively buying up shares of Schlumberger, Halliburton, and Baker Hughes stock. As an arbitrageur I see the resulting gradual price increases in these 3 stocks, and ask: “What should my next trade be?” Here’s where things get interesting. When there have been fewer than N^\star transactions in the petroleum industry, I can’t tell whether you are trading on a Houston, TX-specific shock or a fracking-specific shock since all 3 of these companies share both these attributes. I need to see at least N^\star observations in order to recognize the pattern you’re trading on.

petroleumindustry-search-subjects

The figure above gives a sense of the number of different kinds of shocks that affect the petroleum industry. It reads: “If you select a Wall Street Journal article on the petroleum industry over the period from 2011 to 2013 there is a 19{\scriptstyle \%} chance that ‘Oil sands’ is a listed descriptor and a 7{\scriptstyle \%} chance that ‘LNG’ (i.e., liquefied natural gas) is a listed descriptor.” Thus, oil stock price changes might be due to Q \gg 1 different shocks:

(3)   \begin{align*} \hat{p}_{n,t} &= p_{n,t} - \mathrm{E}_{t-1}[p_{n,t}] = \sum_{q=1}^Q \beta_{q,t} \cdot x_{n,q} + \epsilon_{n,t} \qquad \text{with} \qquad \epsilon_{n,t} \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma^2) \end{align*}

where x_{n,q} denotes stock n’s exposure to the qth attribute. e.g., in this example x_{n,q} = 1 if the company invested in fracking (i.e., like Schlumberger, Halliburton, and Baker Hughes) and x_{n,q}=0 if the company didn’t. What’s more, very few of the Q possible attributes matter each month. e.g., the plot below reads: “Only around 10{\scriptstyle \%} of all the descriptors in the Wall Street Journal articles about the petroleum industry over the period from January 2011 to December 2013 are used each month.” Thus, only K of the possible Q attributes appear to realize shocks each period:

(4)   \begin{align*} K &= \Vert {\boldsymbol \beta} \Vert_{\ell_0} = \sum_{q=1}^Q 1_{\{\beta_q \neq 0\}} \qquad \text{with} \qquad K \ll Q \end{align*}

Note that this calculation includes terms like ‘Crude oil prices’, which occurs in roughly half the articles and is effectively just a synonym for the industry, so the actual rate is likely even lower.

petroleumindustry--fraction-of-search-subjects-mentioned-each-month

For simplicity, suppose that 10 attributes out of a possible 100 realized a shock in the previous period, and you discovered 1 of them. How long does your informational monopoly last? Using tools from Wainwright (2009) it’s easy to show that uninformed traders need at least:

(5)   \begin{align*} N^\star(100,10) \approx 10 \cdot \log(100 - 10) = 45  \end{align*}

observations to identify which 10 of the 100 possible payout-relevant attributes in the petroleum industry have realized a shock. If it takes you (…and other industry specialists like you) around 1 hour to materially increase your position, then, assuming an 8-hour trading day, you have roughly 5.6 = \sfrac{45}{8} days (i.e., around 1 trading week) to build up a position before the rest of the market catches on.
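
Here is a quick back-of-the-envelope check of these numbers in Python (my own arithmetic; the 1-hour-per-transaction pace and the 8-hour trading day are the assumptions stated above, and the constant in front of K \cdot \log(Q - K) is set to 1):

import numpy as np

def n_star(Q, K):
    """Wainwright (2009)-style sample requirement, up to a constant: K * log(Q - K)."""
    return K * np.log(Q - K)

obs_needed = n_star(Q=100, K=10)            # ~45 observations
hours_per_observation = 1.0                 # ~1 hour per informed transaction
days = obs_needed * hours_per_observation / 8
print("%.0f observations, %.1f trading days of head start" % (obs_needed, days))
# -> 45 observations, 5.6 trading days of head start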

4. Asset Management Expertise

Finally, I show how there can be situations where you might not bother trying to learn from prices because there are too many plausible explanations to check out. In this world everyone specializes in acquiring local knowledge. Suppose you’re a wealthy investor, and I’m a broke asset manager with a trading strategy. I walk into your office, and I try to convince you to finance my strategy that has abnormal returns of r_t per month:

(6)   \begin{align*}   r_t &= \mu + \epsilon_t   \qquad \text{with} \qquad    \epsilon_t \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma_{\epsilon}^2) \end{align*}

where \sigma_{\epsilon}^2 = 1{\scriptstyle \%} per month to make the algebra neat. For simplicity, suppose that there is no debate that \mu > 0. In return for running the trading strategy, I ask for fees amounting to a fraction f of the gross returns. Of course, I have to tell you a little bit about how the trading strategy works, so you can deduce that I’m taking on a position that is to some extent a currency carry trade and to some extent a short-volatility strategy. This narrows down the list a bit, but it still leaves a lot of possibilities. In the end, you know that I am using some combination of K = 2 out of Q = 100 possible strategies.

You have 2 options. On one hand, if you accept the terms of this offer and finance my strategy, you realize returns net of fees equal to:

(7)   \begin{align*}   (1 - f) \cdot \mu \cdot T + \sum_{t=1}^T \epsilon_t \end{align*}

This approach would net you an annualized Sharpe ratio of \text{SR}_{\text{mgr}} = \sqrt{12} \cdot (1 - f) \cdot \sfrac{\mu}{\sigma}. e.g., if I asked for a fee of f = 20{\scriptstyle \%}, and my strategy yielded a return of 2{\scriptstyle \%} per month, then your annualized Sharpe ratio net of my fees would be \text{SR}_{\text{mgr}} = 0.55.

On the other hand, you could always refuse my offer and try to back out which strategies I was following using the information you gained from our meeting. i.e., you know that my strategy involves using some combination of K=2 factors out of a universe of Q = 100 possibilities:

(8)   \begin{align*}   \mu &= \sum_{q=1}^{100} \beta_q \cdot x_{q,t}   \qquad \text{with} \qquad    \Vert {\boldsymbol \beta} \Vert_{\ell_0} = 2 \end{align*}

In order to deduce which strategies I was using as quickly as possible, you’d have to trade random portfolio combinations of these 100 different factors for:

(9)   \begin{align*}   T^\star(100,2) \approx 2 \cdot \log(100 - 2) = 9.17 \, {\scriptstyle \mathrm{months}} \end{align*}

Your Sharpe ratio during this period would be \text{SR}_{\text{w/o mgr}|\text{pre}} = 0, and afterwards you would earn the same Sharpe ratio as before without having to pay any fees to me:

(10)   \begin{align*}   \text{SR}_{\text{w/o mgr}|\text{post}} &= \sqrt{12} \cdot \left( \frac{0.02}{0.10} \right) = 0.69 \end{align*}

However, if you have to show your investors reports every year, it may not be worth it for you to reverse engineer my trading strategy. Your average Sharpe ratio during this period would be:

(11)   \begin{align*}   \text{SR}_{\text{w/o mgr}} &= \sfrac{9.17}{12} \cdot 0 + \sfrac{(12 - 9.17)}{12} \cdot 0.69 = 0.16 \end{align*}

which is well below the Sharpe ratio on the market portfolio. Thus, you may just want to pay my fees. Even though you could in principle back out which strategies I was using, it would take too long. Your investors would withdraw due to poor performance before you could capitalize on your newfound knowledge.
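
Here is a short Python check of these Sharpe ratio numbers (my own arithmetic; it simply replays the figures in the text):

import numpy as np

mu, sigma, fee = 0.02, 0.10, 0.20      # 2%/month return, 10%/month vol, 20% fee
T_star = 2 * np.log(100 - 2)           # ~9.17 months spent reverse engineering

sr_with_mgr = np.sqrt(12) * (1 - fee) * mu / sigma                      # ~0.55
sr_post = np.sqrt(12) * mu / sigma                                      # ~0.69 after you crack it
sr_without_mgr = (T_star / 12) * 0.0 + ((12 - T_star) / 12) * sr_post   # ~0.16 in year 1

print(round(sr_with_mgr, 2), round(sr_post, 2), round(sr_without_mgr, 2))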

5. Discussion

To cement ideas, let’s think about what this result implies for a financial econometrician. We’ve known since the 1970s that there is a strong relationship between oil shocks and the rest of the economy. e.g., see Hamilton (1983), Lamont (1997), and Hamilton (2003). Imagine you’re now an econometrician, and you go back and pinpoint the exact hour when each fracking news shock occurred over the last 40 years. Using this information, you then run an event study which finds that petroleum stocks affected by each news shock display a positive cumulative abnormal return over the course of the following week. Would this be evidence of a market inefficiency? Are traders still under-reacting to oil shocks? No. Ex post event studies assume that traders know exactly what is and what isn’t important in real time. Non-petroleum industry specialists who didn’t lose sleep researching hydraulic fracturing have to parse out which shocks are relevant only from prices. This takes time. In the interim, this knowledge is local.


How Quickly Can We Decipher Price Signals?

December 23, 2013 by Alex

1. Introduction

There are many different attribute-specific shocks that might affect an asset’s fundamental value in any given period. e.g., the prices of all stocks held in model-driven long/short equity funds might suddenly plummet as happened in the Quant Meltdown of August 2007. Alternatively, new city parking regulations might raise the value of homes with a half circle driveway. Innovations in asset prices are signals containing 2 different kinds of information: a) which of these Q different shocks has taken place and b) how big each of them was.

It’s often a challenge for traders to answer question (a) in real time. e.g., Daniel (2009) notes that during the Quant Meltdown “markets appeared calm to non-quantitative investors… you could not tell that anything was happening without quant goggles.” This post asks the question: How many transactions do traders need to see in order to identify shocked attributes? The surprising result is that there is a well-defined and calculable answer to this question that is independent of traders’ cognitive abilities. Local knowledge is an unavoidable consequence of this location recovery bound.

2. Motivating Example

It’s easiest to see where this location recovery bound comes from via a short example. Suppose you moved away from Chicago a year ago, and now you’re moving back and looking for a house. When looking at a list of recent sales prices, you find yourself surprised. People must have changed their preferences for 1 of 7 different amenities: ^{(1)}a 2 car garage, ^{(2)}a 3rd bedroom, ^{(3)}a half-circle driveway, ^{(4)}granite countertops, ^{(5)}energy efficient appliances, ^{(6)}central A/C, or ^{(7)}a walk-in closet. Having the mystery amenity raises the sale price by \beta > 0 dollars. To be sure, you would know how preferences had evolved if you had lived in Chicago the whole time; however, in the absence of this local knowledge, how many sales would you need to see in order to figure out which of the 7 amenities mattered?

The answer is 3. Where does this number come from? For ease of explanation, let’s normalize the expected house prices to \mathrm{E}_{t-1}[p_{n,t}] = 0. Suppose you found one house with amenities \{1,3,5,7\}, a second house with amenities \{2, 3, 6, 7\}, and a third house with amenities \{4, 5, 6,7\}. The combination of prices for these 3 houses would reveal exactly which amenity had been shocked. i.e., if only the first house’s price was higher than expected, p_{1,t} \approx \beta, then Chicagoans must have changed their preferences for having a 2 car garage:

(1)   \begin{equation*} {\small  \begin{bmatrix} p_{1,t} \\ p_{2,t} \\ p_{3,t} \end{bmatrix}  = \begin{bmatrix} \beta \\ 0 \\ 0 \end{bmatrix}  =  \begin{bmatrix}  1 & 0 & 1 & 0 & 1 & 0 & 1  \\  0 & 1 & 1 & 0 & 0 & 1 & 1  \\  0 & 0 & 0 & 1 & 1 & 1 & 1  \end{bmatrix} \begin{bmatrix}  \beta \\ 0 \\ \vdots \\ 0  \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \end{bmatrix}  } \quad \text{with} \quad \epsilon_n \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma^2), \, \beta \gg \sigma \end{equation*}

By contrast, if it was the case that p_{1,t} \approx \beta, p_{2,t} \approx \beta, and p_{3,t} \approx \beta, then you would know that people now value walk-in closets much more than they did a year ago.

Here is the key point. 3 sales is just enough information to answer 7 yes or no questions and rule out the possibility of no change:

(2)   \begin{align*}   7 = 2^3 - 1 \end{align*}

N = 4 sales simply narrows your error bars around the exact value of \beta. N = 2 sales only allows you to distinguish between subsets of amenities. e.g., seeing just the 1st and 2nd houses with unexpectedly high prices only tells you that people like either half-circle driveways or walk-in closets more. It doesn’t tell you which one. The problem changes character at N = N^\star(7,1) = 3… i.e., the location recovery bound.

3. Main Results

This section formalizes the intuition from the example above. Think about innovations in the price of asset n as the sum of a meaningful signal, f_n, and some noise, \epsilon_n:

(3)   \begin{align*} p_{n,t} - \mathrm{E}_{t-1}[p_{n,t}] &= f_n + \epsilon_n = \sum_{q=1}^Q \beta_q \cdot x_{n,q} + \epsilon_n \quad \text{with} \quad \epsilon_n \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma^2) \end{align*}

where the signal can be decomposed into Q different attribute-specific shocks. In Equation (3) above, \beta_q \neq 0 denotes a shock of size |\beta_q| to the qth attribute and x_{n,q} \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sfrac{1}{N}) denotes the extent to which asset n displays the qth attribute. Each of the data columns is normalized so that \mathrm{E} \, \sum_{n=1}^N \mathrm{Var}[x_{n,q}] = 1.

In general, when there are more attributes than shocks, K < Q, picking out exactly which K attributes have realized a shock is a combinatorially hard problem as discussed in Natarajan (1995). However, suppose you had an oracle which could bypass this hurdle and tell you exactly which attributes had realized a shock:

(4)   \begin{align*} \Vert \mathbf{f} - \hat{\mathbf{f}}^{\text{Oracle}} \Vert_{\ell_2} &= \inf_{\{\hat{\boldsymbol \beta} : \#[\beta_q \neq 0] \leq K\}} \, \Vert \mathbf{f} - \mathbf{X}\hat{\boldsymbol \beta} \Vert_{\ell_2} \end{align*}

In this world, your mean squared prediction error, \mathrm{MSE} = \frac{1}{N} \cdot \Vert \mathbf{p} - \hat{\mathbf{p}} \Vert_{\ell_2}^2, is given by:

(5)   \begin{align*} \mathrm{MSE}^{\text{Oracle}} & = \min_{0 \leq K \leq Q} \, \left\{ \, \frac{1}{N^{\text{Oracle}}} \cdot \Vert \mathbf{p} - \hat{\mathbf{p}}^{\text{Oracle}} \Vert_{\ell_2}^2 \, \right\} = \min_{0 \leq K \leq Q} \, \left\{ \, \frac{1}{N^{\text{Oracle}}} \cdot \Vert \mathbf{f} - \hat{\mathbf{f}}^{\text{Oracle}} \Vert_{\ell_2}^2 + \sigma^2 \, \right\} \end{align*}

where N^{\text{Oracle}} = N^{\text{Oracle}}(Q,K) = K denotes the number of observations necessary for your oracle. e.g., if each \beta_q \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{Bin}(\kappa), then \mathrm{MSE}^{\text{Oracle}} = \sigma^2 since there is only variation in the location of the shocks and not the size of the shocks.

It turns out that if each asset isn’t too redundant relative to the number of shocked attributes, then you can achieve a mean squared error that is within a log factor of the oracle’s mean squared error using many fewer observations than there are attributes, N \ll Q. e.g., suppose that you used a lasso estimator:

(6)   \begin{align*} \hat{\boldsymbol \beta}^{\text{Lasso}} &= \arg\min_{\hat{\boldsymbol \beta}} \, \left\{ \, \frac{1}{2} \cdot \Vert \mathbf{p} - \mathbf{X} \hat{\boldsymbol \beta} \Vert_{\ell_2}^2 + \lambda_{\ell_1} \cdot \sigma \cdot \Vert \hat{\boldsymbol \beta} \Vert_{\ell_1} \, \right\} \end{align*}

with \lambda_{\ell_1} = 2 \cdot \sqrt{2 \cdot \log Q}. Then, Candes and Davenport (2011) show that:

(7)   \begin{align*} \mathrm{MSE}^{\text{Lasso}} &\leq \gamma \cdot \inf_{0 \leq K \leq Q} \, \left\{ \, \frac{1}{K} \cdot \Vert \mathbf{f} - \hat{\mathbf{f}}^{\text{Oracle}} \Vert_{\ell_2}^2 + \log Q \cdot \sigma^2 \, \right\} \end{align*}

with probability 1 - 6 \cdot Q^{-2 \cdot \log 2} - Q^{-1} \cdot (2 \cdot \pi \cdot \log Q)^{-\sfrac{1}{2}} where \gamma > 0 is a small numerical constant. However, this paragraph is quite loose. i.e., what exactly does the condition that “each asset isn’t too redundant relative to the number of shocked attributes” mean? Exactly how many observations would you need to see if each asset’s attribute exposure is drawn as x_{n,q} \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sfrac{1}{N})?

Here’s where things get really interesting. Wainwright (2009) shows that there is a sharp bound on the number of observations, N^\star = N^\star(Q,K), that you need to observe in order for \ell_1-type estimators like Lasso to succeed when attribute exposure is drawn iid Gaussian:

(8)   \begin{align*} N^\star(Q,K) &= \mathcal{O}\left[K \cdot \log(Q - K)\right] \end{align*}

with Q \to \infty, K \to \infty, and \sfrac{K}{Q} \to \kappa for some \kappa > 0. When traders observe N < N^\star(Q,K) observations, picking out which attributes have realized a shock is an NP-hard problem; whereas, when they observe N \geq N^\star(Q,K), there exist efficient convex optimization algorithms that solve this problem. This result says how the N^\star = 3 location recovery bound from the motivating example generalizes to arbitrary numbers of attributes, Q, and shocks, K.
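
Here is a minimal simulation sketch of this threshold using scikit-learn’s Lasso (my own; the choices K = 5 and \sigma = 0.05, the regularization strength, and the 0.1 support-recovery cutoff are all ad hoc, and the constant in front of K \cdot \log(Q - K) is set to 1):

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
Q, K, sigma = 100, 5, 0.05

def support_recovered(N):
    """Draw x_{n,q} ~ N(0, 1/N) and a K-sparse beta, then check whether Lasso
    recovers the correct support from N noisy price innovations."""
    X = rng.normal(0.0, 1.0 / np.sqrt(N), size=(N, Q))
    beta = np.zeros(Q)
    support = rng.choice(Q, size=K, replace=False)
    beta[support] = 1.0
    p = X @ beta + rng.normal(0.0, sigma, size=N)
    fit = Lasso(alpha=0.01, fit_intercept=False, max_iter=10000).fit(X, p)
    return set(np.flatnonzero(np.abs(fit.coef_) > 0.1)) == set(support)

n_star = int(K * np.log(Q - K))   # ~22 for Q = 100, K = 5
for N in (n_star // 2, n_star, 2 * n_star):
    rate = np.mean([support_recovered(N) for _ in range(50)])
    print("N = %3d: exact support recovery rate = %.2f" % (N, rate))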

4. Just Identified

I conclude this post by discussing the non-sparse case. i.e., \boldsymbol \beta usually isn’t sparse in econometric textbooks a la Hayashi, Wooldridge, or Angrist and Pischke. When every one of the Q attributes matters, it’s easy to decide which attributes to pay attention to—i.e., all of them. In this situation the mean squared error for an oracle is the same as the mean squared error for mere mortals:

(9)   \begin{align*} \mathrm{MSE}^{\text{Oracle}} & = \frac{1}{Q} \cdot \Vert \mathbf{f} - \hat{\mathbf{f}}^{\text{Oracle}} \Vert_{\ell_2}^2 + \sigma^2 = \frac{1}{Q} \cdot \Vert \mathbf{f} - \hat{\mathbf{f}}^{\text{Mortal}} \Vert_{\ell_2}^2 + \sigma^2 = \mathrm{MSE}^{\text{Mortal}} \end{align*}

Does the location recovery bound disappear in this setting?

No. The attribute selection bound simply corresponds to the usual N \geq Q requirement for identification. To see why, let’s return to the motivating example in Section 2 and consider the case where any combination of the 7 attributes could have realized a shock. This leaves us with 128 different shock combinations:

(10)   \begin{align*} 128 &= {7 \choose 0} + {7 \choose 1} + {7 \choose 2} + {7 \choose 3} + {7 \choose 4} + {7 \choose 5} + {7 \choose 6} + {7 \choose 7} \\ &= 1 + 7 + 21 + 35 + 35 + 21 + 7 + 1 \\ &= 2^7 \end{align*}

so that N^\star = 7 gives just enough differences to identify which combination of shocks was realized. More generally, we have that for any number of attributes, Q:

(11)   \begin{align*} 2^Q &= \sum_{k=0}^Q {Q \choose k} \end{align*}

This gives an interesting information theoretic interpretation to the meaning of “just identified” that has nothing to do with linear algebra or the invertibility of a matrix.


Constraining Effort Not Bandwidth

October 3, 2013 by Alex

1. Introduction

Imagine trying to fill up a 5 gallon bucket using a hand-powered water pump. You might end up with a half-full bucket after 1 hour of work for either of 2 reasons. First, the spigot might be too narrow. i.e., even though you are doing enough work to pull 5 gallons of water out of the ground each hour, only 2.5 gallons can actually flow through the spigot during the allotted time. This is a bandwidth constraint. Alternatively, the pump handle might be too short. i.e., you have to crank the handle twice as many times to pull each gallon of water out of the ground. This is an effort constraint.

water-pump

Existing information-based asset pricing models a la Grossman and Stiglitz (1980) restrict the bandwidth of arbitrageurs’ flow of information. The market just produces too much information per trading period. i.e., people’s minds have narrow spigots. However, traders also face restrictions on how much work they can do each period. Sometimes it’s hard to crank the pump handle often enough to produce the relevant information. e.g., think of a binary signal such as knowing if a cancer drug has completed a phase of testing as in Huberman and Regev (2001). It doesn’t take much bandwidth to convey this news. After all, people immediately recognized its significance when the New York Times wrote about it. Yet, arbitrageurs must have faced a restriction on the number of scientific articles they could read each year since Nature reported this exact same news 5 months earlier and no one batted an eye! These traders left money on the table because they anticipated having to pump the handle too many times in order to uncover a really simple signal.

This post proposes algorithmic flop counts as a way of quantifying how much effort traders have to exert in order to uncover profitable information. I illustrate the idea via a short example and some computations.

2. Effort via Flop Counts

One way of quantifying how much effort it takes to discover a piece of information is to count the number of floating-point operations (i.e., flops) that a computer has to do to estimate the number. I take my discussion of flop counts primarily from Boyd and Vandenberghe and define a flop as an addition, subtraction, multiplication, or division of 2 floating-point numbers. e.g., I use flops as a unit of effort so that:

(1)   \begin{align*} \mathrm{Effort}\left[ \ 2.374 \times 17.392 \ \right] &= 1{\scriptstyle \mathrm{flop}} \end{align*}

in the same way that the cost of a Snickers bar might be \mathdollar 2. I then count the total number of flops needed to calculate a number as a proxy for the effort needed to find out the associated piece of information. e.g., if it took N^2 flops to compute the average return of all technology stocks but N^3 flops to arrive at the median return on assets for all value stocks, then I would say that it is easier (i.e., took less effort) to know the mean return. The key thing here is that this measure is independent of the amount of entropy that either of these 2 calculations resolves.

I write flop counts as a polynomial function of the dimensions of the matrices and vectors involved. Moreover, I always simplify the expression by removing all but the highest order terms. e.g., suppose that an algorithm required:

(2)   \begin{align*} \left\{ M^7 + 7 \cdot M^4 \cdot N + M^2 \cdot N + 2 \cdot M \cdot N^6 + 5 \cdot M \cdot N^2 \right\}{\scriptstyle \mathrm{flops}} \end{align*}

In this case, I would write the flop count as:

(3)   \begin{align*} \left\{M^7 + 2 \cdot M \cdot N^6\right\}{\scriptstyle \mathrm{flops}} \end{align*}

since both these terms are of order 7. Finally, if I also know that N \gg M, I might further simplify to 2 \cdot M \cdot N^6 flops. Below, I am going to be thinking about high-dimensional matrices and vectors (i.e., where M and N are big), so these simplifications are sensible.

Let’s look at a couple of examples to fix ideas. First, consider the task of matrix-to-vector multiplication. i.e., suppose that there is a matrix \mathbf{X} \in \mathrm{R}^{M \times N} and we want to calculate:

(4)   \begin{align*} \mathbf{y} &= \mathbf{X}{\boldsymbol \beta} \end{align*}

where we know both \mathbf{X} and {\boldsymbol \beta} and want to figure out \mathbf{y}. This task takes an effort of 2 \cdot M \cdot N flops. There are M elements in the vector \mathbf{y}, and to compute each one of these elements, we have to multiply 2 numbers together N times as:

(5)   \begin{align*} y_m &= \sum_n x_{m,n} \cdot \beta_n \end{align*}

This setup is analogous to \mathbf{X} being a dataset with M observations on N different variables where each variable has a linear effect \beta_n on the outcome variable y.

Next, let’s turn the tables and look at the case when we know the outcome variable \mathbf{y} and want to solve for {\boldsymbol \beta} when \mathbf{X} \in \mathrm{R}^{N \times N}. A standard approach here would be to use the factor-solve method whereby we first factor the data matrix into the product of K components, \mathbf{X} = \mathbf{X}_1 \mathbf{X}_2 \cdots \mathbf{X}_K, and then use these components to iteratively compute {\boldsymbol \beta} = \mathbf{X}^{-1}\mathbf{y} as:

(6)   \begin{align*} {\boldsymbol \beta}_1 &= \mathbf{X}_1^{-1}\mathbf{y} \\ {\boldsymbol \beta}_2 &= \mathbf{X}_2^{-1}{\boldsymbol \beta}_1 = \mathbf{X}_2^{-1}\mathbf{X}_1^{-1}\mathbf{y} \\ &\vdots \\ {\boldsymbol \beta} = {\boldsymbol \beta}_K &= \mathbf{X}_K^{-1}{\boldsymbol \beta}_{K-1} = \mathbf{X}_K^{-1}\mathbf{X}_{K-1}^{-1} \cdots \mathbf{X}_1^{-1}\mathbf{y} \end{align*}

We call the process of computing the factors \{\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_K\} the factorization step and the process of solving the equations {\boldsymbol \beta}_{k} = \mathbf{X}_{k}^{-1}{\boldsymbol \beta}_{k-1} the solve step. The total flop count of a solution strategy is then the sum of the flop counts for each of these steps. In many cases the cost of the factorization step is the leading order term.

e.g., consider the Cholesky Factorization method that is commonly used in statistical software. We know that for every \mathbf{X} \in \mathrm{R}^{N \times N} there exists a factorization:

(7)   \begin{align*} \mathbf{X} = \mathbf{L} \mathbf{L}^{\top} \end{align*}

where \mathbf{L} is lower triangular and non-singular with positive diagonal elements. The cost of computing these Cholesky factors is (\sfrac{1}{3}) \cdot N^3 flops. By contrast, the resulting solve steps of \mathbf{L}{\boldsymbol \beta}_1 = \mathbf{y} and \mathbf{L}^{\top}{\boldsymbol \beta} = {\boldsymbol \beta}_1 each have flop counts of N^2 flops, bringing the total flop count to (\sfrac{1}{3}) \cdot N^3 + 2 \cdot N^2 \sim (\sfrac{1}{3}) \cdot N^3 flops. In the general case, the effort involved in solving a linear system of equations \mathbf{y} = \mathbf{X} {\boldsymbol \beta} for {\boldsymbol \beta} when \mathbf{X} \in \mathrm{R}^{N \times N} grows with \mathrm{O}(N^3). Boyd and Vandenberghe argue that “for N more than a thousand or so, generic methods… become less practical,” and financial markets definitely have more than “a thousand or so” trading opportunities to check!
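
To make the factor-solve recipe concrete, here is a short Python sketch using scipy’s Cholesky routines (my own example; the matrix is a random symmetric positive definite one, which is what the Cholesky route requires):

import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(0)
N = 2000
G = rng.normal(size=(N, N))
X = G @ G.T + N * np.eye(N)       # a symmetric positive definite N x N matrix
y = rng.normal(size=N)

# Factorization step: X = L L', roughly (1/3) * N^3 flops.
c, low = cho_factor(X)
# Solve step: two triangular solves, roughly 2 * N^2 flops.
beta = cho_solve((c, low), y)

print(np.allclose(X @ beta, y))   # True: beta solves X beta = y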

3. Asset Structure

Consider constraining traders’ cognitive effort in an information-based asset pricing model a la Kyle (1985) but with many assets and attribute-specific shocks. Specifically, suppose that there are N stocks that each have H different payout-relevant characteristics. Every characteristic can take on I distinct levels. I call a (characteristic, level) pairing an ‘attribute’ and use the indicator variable a_n(h,i) to denote whether or not a stock has an attribute. Think about attributes as sitting in a (H \times I)-dimensional matrix, \mathbf{A}, as illustrated in Equation (8) below:

(8)   \begin{equation*}   \mathbf{A}^{\top} = \bordermatrix{     ~      & 1                              & 2                         & \cdots & H                            \cr     1      & \text{Agriculture}             & \text{Albuquerque}        & \cdots & \text{Alcoa Inc}             \cr     2      & \text{Apparel}                 & \textbf{\color{red}Boise} & \cdots & \text{ConocoPhillips}        \cr     3      & \textbf{\color{red}Disk Drive} & \text{Chicago}            & \cdots & \text{Dell Inc} \cr     \vdots & \vdots                         & \vdots                    & \ddots & \vdots \cr     I      & \text{Wholesale}               & \text{Vancouver}          & \cdots & \textbf{\color{red}Xerox Corp} \cr} \end{equation*}

I’ve highlighted the attributes for Micron Technology. e.g., we have that a_{\text{Mcrn}}(\text{City},\text{Boise}) = 1 while a_{\text{WDig}}(\text{City},\text{Boise}) = 0 since Micron Technology is based in Boise, ID while Western Digital is based in SoCal.

Further, suppose that each stock’s value is then the sum of a collection of attribute-specific shocks:

(9)   \begin{align*} v_n &= \sum_{h,i} x(h,i) \cdot a_n(h,i) \end{align*}

where the shocks are distributed according to the rule:

(10)   \begin{align*} x(h,i) &= x^+(h,i) + x^-(h,i) \quad \text{with each} \quad x^\pm(h,i) \overset{\scriptscriptstyle \mathrm{iid}}{\sim}  \begin{cases} \pm \sfrac{\delta}{\sqrt{H}} &\text{ w/ prob } \pi \\ \ \: \, 0 &\text{ w/ prob } (1 - \pi) \end{cases} \end{align*}

Each of the x(h,i) indicates whether or not the attribute (h,i) happened to realize a shock. The \sfrac{\delta}{\sqrt{H}} > 0 term represents the amplitude of all shocks in units of dollars per share, and the \pi term represents the probability of either a positive or negative shock to attribute (h,i) each period.

If value investors learn asset-specific information and Kyle (1985)-type market makers price each individual stock using only their own order flow in a dynamic setting, then each individual stock will be priced correctly:

(11)   \begin{align*} \mathrm{E} \left[ \ p_n - v_n \ \middle| \ y_n \ \right] &= 0 \end{align*}

where y_n denotes the aggregate order flow for stock n. Yet, the high dimensionality of the market would mean that there could still be groups of mispriced stocks:

(12)   \begin{align*} \mathrm{E} \left[ \ \langle p_n \rangle_{h,i} - \langle v_n \rangle_{h,i} \ \middle| \ x(h,i) = \sfrac{\delta}{\sqrt{H}} \ \right] &< 0 \; \text{and} \; \mathrm{E} \left[ \ \langle p_n \rangle_{h,i} - \langle v_n \rangle_{h,i} \ \middle| \ x(h,i) = - \sfrac{\delta}{\sqrt{H}} \ \right] > 0 \end{align*}

where \langle p_n \rangle_{h,i} = \sfrac{I}{N} \cdot \sum_n p_n \cdot a_n(h,i) denotes the sample average price for stocks with a particular attribute, (h,i). This is a case of more is different. If an oracle told you that x(h,i) = \sfrac{\delta}{\sqrt{H}} for some attribute (h,i), then you would know that the average price of stocks with attribute (h,i) would be:

(13)   \begin{align*} \langle p_n \rangle_{h,i} &= \lambda_n \cdot \beta_n \cdot \frac{\delta}{\sqrt{H}} + \mathrm{O}(\sfrac{N}{I})^{-1/2} \end{align*}

since \lambda_n \cdot \beta_n < 1 in a dynamic Kyle (1985) model where informed traders have an incentive to trade less aggressively today (i.e., decrease \beta_n and thus \lambda_n) in order to act on their information again tomorrow. In this setting, \langle p_{n,1} \rangle_{h,i} will be less than its fundamental value \langle v_n \rangle_{h,i} = \sfrac{\delta}{\sqrt{H}} even though it will be easy to see that \langle p_{n,1} \rangle_{h,i} \neq 0 as \sfrac{I}{N} \to 0.

4. Arbitrageurs’ Inference Problem

So how much effort do arbitrageurs have to exert to discover the set of shocked attributes, \mathbf{A}^\star \subseteq \mathbf{A}:

(14)   \begin{align*} \mathbf{A}^\star &= \left\{ \ (h,i) \in H \times I \ \middle| \ x(h,i) \neq 0 \ \right\} \quad \text{where} \quad A^\star = \#\left\{ \ (h,i) \ \middle| \ (h,i) \in \mathbf{A}^\star \ \right\} \end{align*}

given their price impact? What’s stopping arbitrageurs from trading away these attribute-specific pricing errors? Well, the problem of finding the attributes in \mathbf{A}^\star boils down to solving:

(15)   \begin{align*} \underset{\mathbf{y}}{\begin{pmatrix} p_{1,1} \\ p_{2,1} \\ \vdots \\ p_{n,1} \\ \vdots \\ p_{N,1} \end{pmatrix}} &= \underset{\mathbf{X}}{\begin{bmatrix} a_1(1,1) & a_1(1,2) & \cdots & a_1(h,i) & \cdots & a_1(H,I) \\  a_2(1,1) & a_2(1,2) & \cdots & a_2(h,i) & \cdots & a_2(H,I) \\  \vdots   & \vdots   & \ddots & \vdots   & \ddots & \vdots   \\  a_n(1,1) & a_n(1,2) & \cdots & a_n(h,i) & \cdots & a_n(H,I) \\  \vdots   & \vdots   & \ddots & \vdots   & \ddots & \vdots   \\  a_N(1,1) & a_N(1,2) & \cdots & a_N(h,i) & \cdots & a_N(H,I) \end{bmatrix}} \underset{\boldsymbol \beta}{\begin{pmatrix} \langle p_{n,1} \rangle_{1,1} \\ \langle p_{n,1} \rangle_{1,2} \\ \vdots \\ \langle p_{n,1} \rangle_{h,i} \\ \vdots \\ \langle p_{n,1} \rangle_{H,I} \end{pmatrix}} \end{align*}

for {\boldsymbol \beta} where \mathbf{X} \in \mathrm{R}^{N \times A}, A \gg N \geq A^\star, and \mathrm{Rank}[\mathbf{X}] = N. i.e., this is a similar problem to the linear solve in Section 2 above, but with 2 additional complications. First, the system is underdetermined in the sense that there are many more payout-relevant attributes than stocks, A \gg N \geq A^\star. Second, arbitrageurs don’t know exactly how many attributes are in \mathbf{A}^\star. They know that on average, \mathrm{E}[A^\star] = 2 \cdot \pi \cdot (1 - \pi) \cdot A; however, A^\star itself is a random variable.

It’s easy enough to extend the solution strategy in Section 2 to the case of an underdetermined system where a solution {\boldsymbol \beta} is a member of the set:

(16)   \begin{align*} \left\{ \ {\boldsymbol \beta} \ \middle| \ \mathbf{X}{\boldsymbol \beta} = \mathbf{y} \ \right\} &= \left\{ \  \begin{bmatrix} {\boldsymbol \beta}_1 \\ {\boldsymbol \beta}_2 \end{bmatrix} = \begin{bmatrix} {\boldsymbol \beta}_1 \\ 0 \end{bmatrix} + \mathbf{F} \mathbf{b} \ \middle| \ \mathbf{b} \in \mathrm{R}^{A - A^\star} \ \right\} \end{align*}

where \mathbf{F} is a matrix whose column vectors are a basis for the null space of \mathbf{X}. Suppose that \mathbf{X}_1 \subseteq \mathbf{X} is (A^\star \times A^\star)-dimensional and non-singular, then:

(17)   \begin{align*} \mathbf{X} {\boldsymbol \beta} &= \begin{bmatrix} \mathbf{X}_1 & \mathbf{X}_2 \end{bmatrix} \begin{bmatrix} {\boldsymbol \beta}_1 \\ {\boldsymbol \beta}_2 \end{bmatrix} = \mathbf{y} \quad \text{and} \quad {\boldsymbol \beta}_1 = \mathbf{X}_1^{-1} \left( \mathbf{y} - \mathbf{X}_2 {\boldsymbol \beta}_2 \right) \end{align*}

Obviously, setting {\boldsymbol \beta}_2 = 0 is one solution. The full set of solutions defining the null space \mathbf{F} is given by:

(18)   \begin{align*} {\boldsymbol \beta} &= \begin{bmatrix} {\boldsymbol \beta}_1 \\ {\boldsymbol \beta}_2 \end{bmatrix} = \begin{bmatrix} \mathbf{X}_1^{-1} \mathbf{y} \\ 0 \end{bmatrix} + \underbrace{\begin{bmatrix} - \mathbf{X}_1^{-1} \mathbf{X}_2 \\ \mathbf{I} \end{bmatrix}}_{\mathbf{F}} \mathbf{b} \end{align*}

Thus, if it takes f flops to factor \mathbf{X} into \begin{bmatrix} \mathbf{X}_1 & \mathbf{X}_2 \end{bmatrix} and s flops to solve each linear system of the form \mathbf{X}_1 {\boldsymbol \beta}_1 = \mathbf{y}, then the total cost of parameterizing all the solutions is:

(19)   \begin{align*} \{ f + s \cdot (A - A^\star + 1)\}{\scriptstyle \mathrm{flops}} \end{align*}

deep-thought-vs-trading-floor

Via the LU factorization method, I know that the factorization step will cost roughly:

(20)   \begin{align*} f &= \left\{ (\sfrac{2}{3}) \cdot (A^\star)^3 + (A^\star)^2 \cdot (A - A^\star) \right\}{\scriptstyle \mathrm{flops}} \end{align*}

Moreover, we know from Section 3 that the cost of the solve step will be on the order s = \mathrm{O}(A^\star)^3. However, there is one detail left to consider still. Namely that arbitrageurs don’t know A^\star. Thus, they have to solve for both {\boldsymbol \beta} and A^\star by starting at A^\star_0 = \mathrm{E}[A^\star] and iterating on the above process until the columns of \mathbf{F} actually represents a basis for the null space of \mathbf{X}_1. Thus, the total effort needed is:

(21)   \begin{align*} \left\{ (\sfrac{1}{3}) \cdot (A^\star)^3 \cdot (A - A^\star) \right\}^{\sfrac{1}{\gamma}} {\scriptstyle \mathrm{flops}} \end{align*}

where \gamma \in (0,1] is the convergence rate and the calculation is dominated by the effort spent searching through the null space to be sure that A^\star is correct. More broadly, this step is just one way of capturing the deeper idea that knowing where to look is hard. e.g., Warren Buffett says that he “can say no in 10 seconds or so to 90{\scriptstyle \%} or more of all the [investment opportunities] that come along.” This is great… until you consider how many investment opportunities Buffett runs across every single day. Saying no in 10 seconds flat then turns out to be quite a chore! Alternatively, as the figure above highlights, this is why traders use personalized multi-monitor computing setups that make it easy to spot patterns instead of a shared super computer with minimal output.

5. Clock Time Interpretation

Is \mathrm{O}((A^\star)^3 \cdot (A - A^\star))^{\sfrac{1}{\gamma}} flops a big number? Is it a small number? Flop counts were originally used when floating-point operations were the main computing bottleneck. Now things relating to how data are stored, such as cache boundaries and reference locality, have first-order effects on computation time as well. Nevertheless, flop counts can still give a good back-of-the-envelope estimate of the relative amount of time it would take to execute a procedure, and such a calculation would be helpful in trying to interpret the unit of measurement “flops” on a human scale. e.g., on one hand, arbitrageur effort would be a silly constraint to worry about if the time it took to execute real world calculations was infinitesimally small. On the other hand, flops might be a poor unit of measure for arbitrageurs’ effort if the time it took to carry out reasonable calculations was on the order of millennia since arbitrageurs clearly don’t wait this long to act! Actually doing a quick computation can allay these fears.

Suppose that computers can execute roughly 1{\scriptstyle \mathrm{mil}} operations per second. Millions of instructions per second (i.e., \mathrm{MIPS}) is a standard unit of computational speed. I can then calculate the amount of time it would take to execute a given number of flops at a speed of 1{\scriptstyle \mathrm{MIPS}} as:

(22)   \begin{align*} \mathrm{Time} &= \left((A^\star)^3 \cdot (A - A^\star) \right)^{\sfrac{1}{\gamma}} \times \left(\frac{1 {\scriptstyle \mathrm{sec}}}{10^6 {\scriptstyle \mathrm{flops}}}\right) \times \left(\frac{1 {\scriptstyle \mathrm{day}}}{86400 {\scriptstyle \mathrm{sec}}}\right) \end{align*}

Thus, if there are roughly 5000 characteristics that can take on 50 different levels and 1 out of every 1000 attributes realizes a shock each period, then even if arbitrageurs guess exactly right on the number of shocked attributes (i.e., so that \gamma = 1) a brute-force search would take 45 days to complete. Clearly, a brute-force search strategy isn’t feasible. There just isn’t enough time to physically do all of the calculations.

time-per-search-vs-time-per-period--web
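
Here is a quick Python check of this back-of-the-envelope calculation (my own arithmetic; it simply replays the numbers in the paragraph above):

H, I = 5000, 50            # characteristics and levels per characteristic
A = H * I                  # 250,000 attributes
A_star = A // 1000         # 1 in 1,000 attributes shocked -> 250
gamma = 1.0                # arbitrageurs guess the number of shocked attributes exactly

flops = (A_star ** 3 * (A - A_star)) ** (1.0 / gamma)
days = flops / 1e6 / 86400  # 1 MIPS and 86,400 seconds per day, as in Equation (22)
print(round(days, 1))       # -> ~45.2 days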

6. A Persistent Problem

I conclude by addressing a common question. You might ask: “Won’t really fast computers make cognitive control irrelevant?” No. Progress in computer storage has actually outstripped progress in processing speeds by a wide margin. This is known as Kryder’s Law. Over 10 years the cost of processing has dropped by a factor of roughly 32 (i.e., Moore’s Law). By contrast, the cost of storage has dropped by a factor of 1000 over the same period. e.g., take a look at the figure below made using data from www.mkomo.com which shows that the cost of disk space decreases by 58{\scriptstyle \%} each year. What does this mean in practice? Well, as late as 1980 a 26{\scriptstyle \mathrm{MB}} hard drive cost \mathdollar 5 \times 10^3, implying that a 1{\scriptstyle \mathrm{TB}} hard drive would have cost upwards of \mathdollar 2 \times 10^8. These days you can pick up a 1{\scriptstyle \mathrm{TB}} drive for about \mathdollar 50! We have so much storage that finding things is now an important challenge. This is why we find Google so helpful. Instead of being eliminated by computational power, cognitive control turns out to be a distinctly modern problem.

kryders-law-website


Many Assets with Attribute-Specific Shocks

October 2, 2013 by Alex

1. Motivation and Outline

Asset pricing models tend to focus on a single stock that realizes a normally distributed value shock of undefined origins. e.g., think of Kyle (1985) as a representative example. This is a great starting point; however, massive size and dense interconnectedness are key features of financial markets. Studying a financial market without these features is like studying dry water. In this post I suggest a simple way to modify the standard payout structure to allow for many assets and attribute-specific shocks.

What do I mean by attribute-specific shocks? To illustrate, have a look at the figure below which shows the most common 25{\scriptstyle \%} of topics that came into play when journalists from the Wall Street Journal wrote about Micron Technology from 2001 to 2012. The figure reads: “If you select a Wall Street Journal article that mentioned Micron Technology in the abstract at random, then there is a 9{\scriptstyle \%} chance that ‘Antitrust’ is a listed subject.” Here’s the key point. When news about Micron Technology emerged, it was never just about Micron Technology. Journalists wrote about a particular SEC investigation, or a technology shock affecting all hard disk drive makers, or the firms currently active in the mergers and acquisitions market, etc… Value shocks are physical. They are rooted in particular events affecting subsets of stocks.

micron-search-subjects

A big market with attribute-specific shocks means perspective matters. Consider a real world example. e.g., Khandani and Lo (2007) wrote about the ‘Quant Meltdown’ of 2007 that “the most remarkable aspect of these hedge-fund losses was the fact that they were confined almost exclusively to funds using quantitative strategies. With laser-like precision, model-driven long/short equity funds were hit hard on Tue Aug 7th and Wed Aug 8th, despite relatively little movement in [the average level of] fixed-income and equity markets during those 2 days and no major losses reported in any other hedge-fund sectors.” Every individual stock was priced correctly, yet there was still a huge multi-stock price movement in a particular subset of stocks. Here’s the kicker: You would never have noticed this shock unless you knew exactly where to look!

2. Payout Structure

In Kyle (1985) there is a single stock with a fundamental value distributed as v \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma_v^2). Suppose that, instead, there are actually N stocks that each have H different payout-relevant characteristics. Every characteristic can take on I distinct levels. I call a (characteristic, level) pairing an ‘attribute’ and use the indicator variable a_n(h,i) to denote whether or not stock n has attribute (h,i). Think about attributes as sitting in an (H \times I)-dimensional matrix, \mathbf{A}, as illustrated in Equation (1) below:

(1)   \begin{equation*}   \mathbf{A}^{\top} = \bordermatrix{     ~      & 1                              & 2                         & \cdots & H                            \cr     1      & \text{Agriculture}             & \text{Albuquerque}        & \cdots & \text{Alcoa Inc}             \cr     2      & \text{Apparel}                 & \textbf{\color{red}Boise} & \cdots & \text{ConocoPhillips}        \cr     3      & \textbf{\color{red}Disk Drive} & \text{Chicago}            & \cdots & \text{Dell Inc} \cr     \vdots & \vdots                         & \vdots                    & \ddots & \vdots \cr     I      & \text{Wholesale}               & \text{Vancouver}          & \cdots & \textbf{\color{red}Xerox Corp} \cr} \end{equation*}

I’ve highlighted the attributes for Micron Technology. e.g., we have that a_{\text{Mcrn}}(\text{City},\text{Boise}) = 1 while a_{\text{WDig}}(\text{City},\text{Boise}) = 0 since Micron Technology is based in Boise, ID while Western Digital is based in SoCal.

Further, suppose that each stock’s value is then the sum of a collection of attribute-specific shocks:

(2)   \begin{align*} v_n &= \sum_{h,i} x(h,i) \cdot a_n(h,i) \end{align*}

where the shocks are distributed according to the rule:

(3)   \begin{align*} x(h,i) &= x^+(h,i) + x^-(h,i) \quad \text{with each} \quad x^\pm(h,i) \overset{\scriptscriptstyle \mathrm{iid}}{\sim}  \begin{cases} \pm \sfrac{\delta}{\sqrt{H}} &\text{ w/ prob } \pi \\ \ \: \, 0 &\text{ w/ prob } (1 - \pi) \end{cases} \end{align*}

Each of the x(h,i) terms records whether or not attribute (h,i) happened to realize a shock and, if so, in which direction. The \sfrac{\delta}{\sqrt{H}} > 0 term represents the amplitude of every shock in units of dollars per share, and the \pi term represents the probability that attribute (h,i) realizes a positive shock each period (and, independently, the probability that it realizes a negative one).
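As a quick sanity check on Equations (2) and (3), here is a minimal simulation sketch (my own, with made-up parameters). Since each stock owns exactly one level per characteristic and the shocks are i.i.d. across levels, a stock’s value is just the sum of H independent draws of x(h,i), so its simulated variance should come out close to 2 \cdot \delta^2 \cdot \pi \cdot (1 - \pi).

import numpy as np

rng = np.random.default_rng(0)
H = 100                                     # payout-relevant characteristics per stock
delta, pi = 1.0, 0.05                       # shock amplitude ($/sh) and shock probability
n_sims = 50_000                             # independent draws of one stock's value

# Each owned attribute's shock is x = x_plus + x_minus, where x_plus equals
# +delta/sqrt(H) with probability pi (else 0) and x_minus equals -delta/sqrt(H)
# with probability pi (else 0), independently across characteristics and draws.
x_plus = (rng.random((n_sims, H)) < pi) * (delta / np.sqrt(H))
x_minus = (rng.random((n_sims, H)) < pi) * (-delta / np.sqrt(H))
v = (x_plus + x_minus).sum(axis=1)          # Equation (2) for a single stock

print(v.mean(), v.var())                    # mean ~ 0, variance ~ 0.095
print(2 * delta**2 * pi * (1 - pi))         # 2 * delta^2 * pi * (1 - pi) = 0.095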

You could also add the usual factor exposure and firm-specific shocks to the model:

(4)   \begin{align*} v_n &= \cancelto{0}{{\boldsymbol \theta}_n^{\top} \mathbf{f}} + \sum_{h,i} x(h,i) \cdot a_n(h,i) + \cancelto{0}{\epsilon_n} \end{align*}

I’ve excluded these terms for clarity since they are not new. You might be wondering: “Aren’t these attribute-specific shocks captured by a covariance matrix, though?” No. The covariance between any 2 assets in this setup is:

(5)   \begin{align*} \mathrm{Cov}\left[v_n,v_{n'}\right] = H \cdot \left(\sfrac{1}{I}\right) \cdot 2 \cdot \pi \cdot (1-\pi) \cdot \left( \sfrac{\delta}{\sqrt{H}} \right)^2 = 2 \cdot \pi \cdot (1 - \pi) \cdot \left( \sfrac{\delta^2}{I} \right) \end{align*}

where the first H corresponds to the number of characteristics, the \sfrac{1}{I} term denotes the probability that both stocks have the same level for a particular characteristic, the 2 \cdot \pi \cdot (1 - \pi) term denotes the probability that the shared attribute realizes a shock, and the (\sfrac{\delta}{\sqrt{H}})^2 term denotes the squared attribute-specific shock. The takeaway from this calculation is that the covariance matrix is completely flat (i.e., it doesn’t matter which n and n' you compare) and arbitrarily small.
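Here is a small Monte Carlo sketch (again mine, with illustrative parameters) that backs up this calculation: two stocks draw their levels independently but face the same attribute-specific shocks, and the simulated covariance lines up with 2 \cdot \pi \cdot (1 - \pi) \cdot \sfrac{\delta^2}{I}.

import numpy as np

rng = np.random.default_rng(1)
H, I = 50, 10                      # characteristics and levels per characteristic
delta, pi = 1.0, 0.05              # shock amplitude ($/sh) and shock probability
n_sims = 50_000
rows = np.arange(H)

pairs = np.empty((n_sims, 2))
for s in range(n_sims):
    # shared attribute-specific shocks x(h,i) = x_plus(h,i) + x_minus(h,i)
    x = ((rng.random((H, I)) < pi).astype(float)
         - (rng.random((H, I)) < pi).astype(float)) * (delta / np.sqrt(H))
    # two stocks whose levels are drawn independently, one per characteristic
    lev_a = rng.integers(0, I, size=H)
    lev_b = rng.integers(0, I, size=H)
    pairs[s] = x[rows, lev_a].sum(), x[rows, lev_b].sum()

print(np.cov(pairs, rowvar=False)[0, 1])    # Monte Carlo covariance estimate
print(2 * pi * (1 - pi) * delta**2 / I)     # predicted covariance = 0.0095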

plot-maximum-industry-specific-volatility

Lots of things that you might think of as explained by constant covariance aren’t. e.g., the figure above shows the maximum industry-specific contribution to daily return variance from January 1976 to December 2011 using the methodology in Campbell, Lettau, Malkiel, and Xu (2001). The vertical text at the bottom gives the name of the industry with the largest industry-specific contribution to daily return variance each month any time it changes from the previous month. The figure reads that: “While traders can usually expect to understand no more than 5{\scriptstyle \%/\mathrm{yr}} of a typical firm’s 20{\scriptstyle \%/\mathrm{yr}} variation in daily returns, there are times such as in 1987 when this 5{\scriptstyle \%/\mathrm{yr}} figure suddenly jumps to over 25{\scriptstyle \%/\mathrm{yr}}. What’s more, the density of the text along the base of the figure shows that the important (i.e., extremal) industry regularly changes from month to month.”

3. Approximation Error

One of the nice features of this reformulation of the usual normal value shocks is that, although it changes the interpretation of where each firm’s value comes from, it doesn’t alter any of the Gaussian structure of the problem. i.e., the normal approximation to the binomial distribution says that:

(6)   \begin{align*} \sum_{h,i} x(h,i) \cdot a_n(h,i) \overset{\scriptscriptstyle \text{``ish''}}{\sim} \mathrm{N}(0, \sigma_v^2) \quad \text{where} \quad \sigma_v^2 = 2 \cdot \delta^2 \cdot \pi \cdot (1 - \pi) \end{align*}

where the “ish” means that there is a small and easy-to-compute approximation error. e.g., consider the collection of attribute-specific shocks for asset n, \{x_1,x_2,\ldots,x_H\}, with \mathrm{E}[x_h] = 0, \mathrm{E}[x_h^2] = \sigma_v^2 > 0, and \mathrm{E}[|x_h|^3] = \rho_v < \infty, and define the normalized sum X(H) = \sfrac{1}{(\sigma_v \cdot \sqrt{H})} \cdot \sum_h x_h with the cumulative distribution function F_H(x) = \mathrm{Pr}[X(H) \leq x]. Then, we know via the central limit theorem that F_H(x) \to \Phi(x) as H \to \infty where \Phi(\cdot) is the standard normal cumulative distribution function.

normal-approx-to-binom

Moreover, the Berry-Esseen Theorem says that:

(7)   \begin{align*} \max_{x \in \mathrm{R}}\left\{ \ \left| F_H(x) - \Phi(x) \right| \ \right\} &\leq \frac{0.50 \cdot \rho_v}{\sigma_v^3 \cdot \sqrt{H}} = \frac{0.50}{\sqrt{2 \cdot H \cdot \pi \cdot (1 - \pi)}} \end{align*}

where the second equals sign applies only in the special case of the sum of 2 binomially distributed random variables. The figure above shows how well this approximation holds as the number of payout-relevant characteristics, H, increases from 100 to 10000 in a world where \pi = \sfrac{1}{100}. I compute the x-axis on a grid with spacing \Delta x = \sfrac{1}{100}. If there are 100 firms with values that typically vary over a range of \sigma_v^2 = \mathdollar 100{\scriptstyle /\mathrm{sh}}, then in a world with H = 1000 payout-relevant characteristics only 6 stocks will be misvalued by a mere \Delta x \cdot \sigma_v \cdot \sqrt{H} \approx \mathdollar 3{\scriptstyle /\mathrm{sh}} if you use the normal approximation to the binomial distribution rather than the true distribution. Thus, the approximation leaves less than 1 dollar in every 500 unaccounted for:

(8)   \begin{align*}  0.0018 &= \frac{6{\scriptstyle \mathrm{stocks}} \cdot \mathdollar 3{\scriptstyle /\mathrm{sh}}}{100{\scriptstyle \mathrm{stocks}} \cdot \sigma_v^2}  \end{align*}
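To put numbers on the bound in Equation (7), here is a quick sketch evaluating its right-hand side at \pi = \sfrac{1}{100} for the range of H used in the figure above; the worst-case CDF error shrinks like \sfrac{1}{\sqrt{H}}.

import numpy as np

pi = 1 / 100
for H in (100, 1_000, 10_000):
    bound = 0.50 / np.sqrt(2 * H * pi * (1 - pi))
    print(H, round(bound, 3))      # 0.355, 0.112, 0.036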

By contrast, the figure below shows the 12{\scriptstyle \mathrm{mo}} moving average of the percent of the variance in firm-level daily returns explained by market and industry factors over the time period from January 1976 to December 2011 using the methodology from Campbell, Lettau, Malkiel, and Xu (2001). This figure reads that: “For a randomly selected stock in 1999, market and industry considerations only account for around 30{\scriptstyle \%} of its daily return variation.” In other words, the usual factor models typically account for less than half of the fluctuations in firm value. i.e., they are 2 orders of magnitude less precise than the approximation error!

plot-market-and-industry-r2-series

4. Whose Perspective?

You might ask: “Why bother adding this extra structure?” In a big market with attribute-specific shocks, perspective matters. This is the punchline. Asset values and attribute-specific shocks essentially carry the same information since:

(9)   \begin{align*} v_n &= \sum_{h,i} x(h,i) \cdot a_n(h,i) + \mathrm{O}(H)^{-\sfrac{1}{2}} \\ x(h,i) &= \frac{1}{\sfrac{N}{I}} \cdot \sum_n v_n \cdot a_n(h,i) + \mathrm{O}(\sfrac{N}{I})^{-\sfrac{1}{2}} \end{align*}

However, knowing the value of an asset tells you very little about whether any particular one of its attributes has realized a shock. Similarly, knowing whether an attribute has realized a shock is a really noisy signal about the value of any particular stock with that attribute.
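Here is a minimal sketch of the second line of Equation (9) (my own parameters, chosen so that \sfrac{N}{I} is large). Averaging values over the stocks that own a shocked attribute roughly recovers the attribute-specific shock, even though any single owner’s value is a far noisier signal of it.

import numpy as np

rng = np.random.default_rng(2)
N, H, I = 50_000, 50, 100          # many stocks per attribute: N/I = 500
delta, pi = 1.0, 0.05              # shock amplitude ($/sh) and shock probability

# attribute-specific shocks, plus each stock's one level per characteristic
x = ((rng.random((H, I)) < pi).astype(float)
     - (rng.random((H, I)) < pi).astype(float)) * (delta / np.sqrt(H))
levels = rng.integers(0, I, size=(N, H))
v = x[np.arange(H), levels].sum(axis=1)        # v_n = sum_h x(h, level_n(h))

# pick an attribute that realized a shock and average v_n over its owners
h, i = np.unravel_index(np.abs(x).argmax(), x.shape)
owners = levels[:, h] == i
x_hat = (I / N) * v[owners].sum()              # second line of Equation (9)

print(round(float(x[h, i]), 3))                # the true shock, +/- delta/sqrt(H) ~ 0.141
print(round(float(x_hat), 3))                  # roughly recovers it, up to noise
print(round(float(v[owners].std()), 3))        # ~0.3: any single owner's value is far noisier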

To see how this duality might affect asset prices, consider a simple example. Suppose that we are in a multi-period Kyle (1985)-type world where value investors know the fundamental value of a particular stock and place orders with a market maker who processes only the order flow for that particular stock. It could well be the case that market makers price each stock correctly on average:

(10)   \begin{align*} \mathrm{E} \left[ \ p_{n,t} - v_n \ \middle| \ y_{n,t} \ \right] &= 0 \end{align*}

Yet, the high dimensionality of the market would mean that there could still be groups of mispriced stocks:

(11)   \begin{align*} \mathrm{E} \left[ \ \langle p_{n,t} \rangle_{h,i} - \langle v_n \rangle_{h,i} \ \middle| \ x(h,i) = \sfrac{\delta}{\sqrt{H}} \ \right] &< 0 \; \text{and} \; \mathrm{E} \left[ \ \langle p_{n,t} \rangle_{h,i} - \langle v_n \rangle_{h,i} \ \middle| \ x(h,i) = - \sfrac{\delta}{\sqrt{H}} \ \right] > 0 \end{align*}

where \langle p_{n,t} \rangle_{h,i} = \sfrac{I}{N} \cdot \sum_n p_{n,t} \cdot a_n(h,i) denotes the sample average time t price for stocks with a particular attribute, (h,i). This is a case of “more is different.” If an oracle told you that x(h,i) = \sfrac{\delta}{\sqrt{H}} for some attribute (h,i), then you would know that the average price of stocks with attribute (h,i) would be:

(12)   \begin{align*} \langle p_{n,t} \rangle_{h,i} &= \langle \lambda_{n,t} \cdot \beta_{n,t} \rangle_{h,i}  \cdot \frac{\delta}{\sqrt{H}} + \mathrm{O}(\sfrac{N}{I})^{-1/2} \end{align*}

where \langle \lambda_{n,t} \cdot \beta_{n,t}\rangle_{h,i}  < 1 since value investors would have an incentive to delay trading in a dynamic model. i.e., \langle p_{n,1} \rangle_{h,i} will be less than its fundamental value \langle v_n \rangle_{h,i} = \sfrac{\delta}{\sqrt{H}} even though it will be easy to see that \langle p_{n,1} \rangle_{h,i} \neq 0 as \sfrac{I}{N} \to 0.

There are way more payout-relevant attributes than anyone could ever investigate in a single period. This is why Charlie Munger explains that it’s his job “to find a few intelligent things to do, not to keep up with every damn thing in the world.” If we think about each stock as a location in a “spatial” domain and the attribute-specific shocks as particular points in a “frequency” domain, this result takes on the flavor of a generalized uncertainty principle. i.e., it’s really hard to simultaneously estimate the price of a portfolio at both very fine scales (i.e., containing a single asset) and very low frequencies (i.e., affecting every stock with an attribute).

