Alex – Page 13 – Research Notebook

Constraining Effort Not Bandwidth

October 3, 2013 by Alex

1. Introduction

Imagine trying to fill up a $5$ gallon bucket using a hand-powered water pump. You might end up with a half-full bucket after $1$ hour of work for either of $2$ reasons. First, the spigot might be too narrow. i.e., even though you are doing enough work to pull $5$ gallons of water out of the ground each hour, only $2.5$ gallons can actually flow through the spigot during the allotted time. This is a bandwidth constraint. Alternatively, the pump handle might be too short. i.e., you have to crank the handle twice as many times to pull each gallon of water out of the ground. This is an effort constraint.

Existing information-based asset pricing models a la Grossman and Stiglitz (1980) restrict the bandwidth of arbitrageurs’ flow of information. The market just produces too much information per trading period. i.e., people’s minds have narrow spigots. However, traders also face restrictions on how much work they can do each period. Sometimes it’s hard to crank the pump handle often enough to produce the relevant information. e.g., think of a binary signal such as knowing if a cancer drug has completed a phase of testing as in Huberman and Regev (2001). It doesn’t take much bandwidth to convey this news. After all, people immediately recognized its significance when the New York Times wrote about it. Yet, arbitrageurs must have faced a restriction on the number of scientific articles they could read each year since Nature reported this exact same news $5$ months earlier and no one batted an eye! These traders left money on the table because they anticipated having to pump the handle too many times in order to uncover a really simple signal.

This post proposes algorithmic flop counts as a way of quantifying how much effort traders have to do in order to uncover profitable information. I illustrate the idea via a short example and some computations.

2. Effort via Flop Counts

One way of quantifying how much effort it takes to discover a piece of information is to count the number of floating-point operations (i.e., flops) that a computer has to do to estimate the number. I take my discussion of flop counts primarily from Boyd and Vandenberghe and define a ﬂop as an addition, subtraction, multiplication, or division of $2$ floating-point numbers. e.g., I use flops as a unit of effort so that:

(1) $\begin{align*} \mathrm{Effort}\left[ \ 2.374 \times 17.392 \ \right] &= 2{\scriptstyle \mathrm{flops}} \end{align*}$

in the same way that the cost of a Snickers bar might be $\mathdollar 2$ . I then count the total number of ﬂops needed to calculate a number as a proxy for the effort needed to find out the associated piece of information. e.g., if it took $N^2$ flops to compute the average return of all technology stocks but $N^3$ flops to arrive at the median return on assets for all value stocks, then I would say that it is easier (i.e., took less effort) to know the mean return. The key thing here is that this measure is independent of the amount of entropy that either of these $2$ calculations resolves.

I write flop counts as a polynomial function of the dimensions of the matrices and vectors involved. Moreover, I always simplify the expression by removing all but the highest order terms. e.g., suppose that an algorithm required:

(2) $\begin{align*} \left\{ M^7 + 7 \cdot M^4 \cdot N + M^2 \cdot N + 2 \cdot M \cdot N^6 + 5 \cdot M \cdot N^2 \right\}{\scriptstyle \mathrm{flops}} \end{align*}$

In this case, I would write the flop count as:

(3) $\begin{align*} \left\{M^7 + 2 \cdot M \cdot N^6\right\}{\scriptstyle \mathrm{flops}} \end{align*}$

since both these terms are of order $7$ . Finally, if I also know that $N \gg M$ , I might further simplify to $2 \cdot M \cdot N^6$ flops. Below, I am going to be thinking about high-dimensional matrices and vectors (i.e., where $M$ and $N$ are big), so these simplifications are sensible.

Let’s look at a couple of examples to fix ideas. First, consider the task of matrix-to-vector multiplication. i.e., suppose that there is a matrix $\mathbf{X} \in \mathrm{R}^{M \times N}$ and we want to calculate:

(4) $\begin{align*} \mathbf{y} &= \mathbf{X}{\boldsymbol \beta} \end{align*}$

where we know both $\mathbf{X}$ and ${\boldsymbol \beta}$ and want to figure out $\mathbf{y}$ . This task takes an effort of $2 \cdot M \cdot N$ flops. There are $M$ elements in the vector $\mathbf{y}$ , and to compute each one of these elements, we have to multiply $2$ numbers together $N$ times as:

(5) $\begin{align*} y_m &= \sum_n x_{m,n} \cdot \beta_n \end{align*}$

This setup is analogous to $\mathbf{X}$ being a dataset with $M$ observations on $N$ different variables where each variable has a linear affect $\beta_n$ on the outcome variable $y$ .

Next, let’s turn the tables and look at the case when we know the outcome variable $\mathbf{y}$ and want to solve for ${\boldsymbol \beta}$ when $\mathbf{X} \in \mathrm{R}^{N \times N}$ . A standard approach here would be to use the factor-solve method whereby we first factor the data matrix into the product of $K$ components, $\mathbf{X} = \mathbf{X}_1 \mathbf{X}_2 \cdots \mathbf{X}_K$ , and then use these components to iteratively compute ${\boldsymbol \beta} = \mathbf{X}^{-1}\mathbf{y}$ as:

(6) $\begin{align*} {\boldsymbol \beta}_1 &= \mathbf{X}_1^{-1}\mathbf{y} \\ {\boldsymbol \beta}_2 &= \mathbf{X}_2^{-1}{\boldsymbol \beta}_1 = \mathbf{X}_2^{-1}\mathbf{X}_1^{-1}\mathbf{y} \\ &\vdots \\ {\boldsymbol \beta} = {\boldsymbol \beta}_K &= \mathbf{X}_K^{-1}{\boldsymbol \beta}_{K-1} = \mathbf{X}_K^{-1}\mathbf{X}_{K-1}^{-1} \cdots \mathbf{X}_1^{-1}\mathbf{y} \end{align*}$

We call the process of computing the factors $\{\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_K\}$ the factorization step and the process of solving the equations ${\boldsymbol \beta}_{k} = \mathbf{X}_{k}^{-1}{\boldsymbol \beta}_{k-1}$ the solve step. The total flop count of a solution strategy is then the sum of the flop counts for each of these steps. In many cases the cost of the factorization step is the leading order term.

e.g., consider the Cholesky Factorization method that is commonly used in statistical software. We know that for every $\mathbf{X} \in \mathrm{R}^{N \times N}$ there exists a factorization:

(7) $\begin{align*} \mathbf{X} = \mathbf{L} \mathbf{L}^{\top} \end{align*}$

where $\mathbf{L}$ is lower triangular and non-singular with positive diagonal elements. Cost of computing these Cholesky factors is $(\sfrac{1}{3}) \cdot N^3$ flops. By contrast, the resulting solve steps of $\mathbf{L}{\boldsymbol \beta}_1 = \mathbf{y}$ and $\mathbf{L}^{\top}{\boldsymbol \beta} = {\boldsymbol \beta}_1$ each have flop counts of $N^2$ flops bringing the total flop count to $(\sfrac{1}{3}) \cdot N^3 + 2 \cdot N^2 \sim (\sfrac{1}{3}) \cdot N^3$ flops. In the general case, the effort involved in solving a linear system of equations $\mathbf{y} = \mathbf{X} {\boldsymbol \beta}$ for ${\boldsymbol \beta}$ when $\mathbf{X} \in \mathrm{R}^{N \times N}$ grows with $\mathrm{O}(N^3)$ . Boyd and Vandenberghe argue that “for $N$ more than a thousand or so, generic methods… become less practical,” and financial markets definitely have more than “a thousand or so” trading opportunities to check!

3. Asset Structure

Consider constraining traders’ cognitive effort in an information-based asset pricing model a la Kyle (1985) but with many assets and attribute-specific shocks. Specifically, suppose that there are $N$ stocks that each have $H$ different payout-relevant characteristics. Every characteristic can take on $I$ distinct levels. I call a (characteristic, level) pairing an ‘attribute’ and use the indicator variable $a_n(h,i)$ to denote whether or not a stock has an attribute. Think about attributes as sitting in a $(H \times I)$ -dimensional matrix, $\mathbf{A}$ , as illustrated in Equation (8) below:

(8) $\begin{equation*} \mathbf{A}^{\top} = \bordermatrix{ ~ & 1 & 2 & \cdots & H \cr 1 & \text{Agriculture} & \text{Albuquerque} & \cdots & \text{Alcoa Inc} \cr 2 & \text{Apparel} & \textbf{\color{red}Boise} & \cdots & \text{ConocoPhillips} \cr 3 & \textbf{\color{red}Disk Drive} & \text{Chicago} & \cdots & \text{Dell Inc} \cr \vdots & \vdots & \vdots & \ddots & \vdots \cr I & \text{Wholesale} & \text{Vancouver} & \cdots & \textbf{\color{red}Xerox Corp} \cr} \end{equation*}$

I’ve highlighted the attributes for Micron Technology. e.g., we have that $a_{\text{Mcrn}}(\text{City},\text{Boise}) = 1$ while $a_{\text{WDig}}(\text{City},\text{Boise}) = 0$ since Micron Technology is based in Boise, ID while Western Digital is based in SoCal.

Further, suppose that each stock’s value is then the sum of a collection of attribute-specific shocks:

(9) $\begin{align*} v_n &= \sum_{h,i} x(h,i) \cdot a_n(h,i) \end{align*}$

where the shocks are distributed according to the rule:

(10) $\begin{align*} x(h,i) &= x^+(h,i) + x^-(h,i) \quad \text{with each} \quad x^\pm(h,i) \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \begin{cases} \pm \sfrac{\delta}{\sqrt{H}} &\text{ w/ prob } \pi \\ \ \: \, 0 &\text{ w/ prob } (1 - \pi) \end{cases} \end{align*}$

Each of the $x(h,i)$ indicates whether or not the attribute $(h,i)$ happened to realize a shock. The $\sfrac{\delta}{\sqrt{H}} > 0$ term represents the amplitude of all shocks in units of dollars per share, and the $\pi$ term represents the probability of either a positive or negative shock to attribute $(h,i)$ each period.

If value investors learn asset-specific information and Kyle (1985)-type market makers price each individual stock using only their own order flow in a dynamic setting, then each individual stock will be priced correctly:

(11) $\begin{align*} \mathrm{E} \left[ \ p_n - v_n \ \middle| \ y_n \ \right] &= 0 \end{align*}$

where $y_n$ denotes the aggregate order flow for stock $n$ . Yet, the high-dimensionality of market would mean that there still could be groups of mispriced stocks:

(12) $\begin{align*} \mathrm{E} \left[ \ \langle p_n \rangle_{h,i} - \langle v_n \rangle_{h,i} \ \middle| \ x(h,i) = \sfrac{\delta}{\sqrt{H}} \ \right] &< 0 \; \text{and} \; \mathrm{E} \left[ \ \langle p_n \rangle_{h,i} - \langle v_n \rangle_{h,i} \ \middle| \ x(h,i) = - \sfrac{\delta}{\sqrt{H}} \ \right] > 0 \end{align*}$

where $\langle p_n \rangle_{h,i} = \sfrac{I}{N} \cdot \sum_n p_n \cdot a_n(h,i)$ denotes the sample average price for stocks with a particular attribute, $(h,i)$ . This is a case of more is different. If an oracle told you that $x(h,i) = \sfrac{\delta}{\sqrt{H}}$ for some attribute $(h,i)$ , then you would know that the average price of stocks with attribute $(h,i)$ would be:

(13) $\begin{align*} \langle p_n \rangle_{h,i} &= \lambda_n \cdot \beta_n \cdot \frac{\delta}{\sqrt{H}} + \mathrm{O}(\sfrac{N}{I})^{-1/2} \end{align*}$

since $\lambda_n \cdot \beta_n < 1$ in a dynamic Kyle (1985) model where informed traders have an incentive to trade less aggressively today (i.e., decrease $\beta_n$ and thus $\lambda_n$ ) in order to act on their information again tomorrow. In this setting, $\langle p_{n,1} \rangle_{h,i}$ will be less than its fundamental value $\langle v_n \rangle_{h,i} = \sfrac{\delta}{\sqrt{H}}$ even though it will be easy to see that $\langle p_{n,1} \rangle_{h,i} \neq 0$ as $\sfrac{I}{N} \to 0$ .

4. Arbitrageurs’ Inference Problem

So how much effort does it take to discover the set of shocked attributes, $\mathbf{A}^\star \subseteq \mathbf{A}$ :

(14) $\begin{align*} \mathbf{A}^\star &= \left\{ \ (h,i) \in H \times I \ \middle| \ x(h,i) \neq 0 \ \right\} \quad \text{where} \quad A^\star = \#\left\{ \ (h,i) \ \middle| \ (h,i) \in \mathbf{A}^\star \ \right\} \end{align*}$

given their price impact? What’s stopping arbitrageurs from trading away these attribute-specific pricing errors? Well, the problem of finding the attributes in $\mathbf{A}^\star$ boils down to solving:

(15) $\begin{align*} \underset{\mathbf{y}}{\begin{pmatrix} p_{1,1} \\ p_{2,1} \\ \vdots \\ p_{n,1} \\ \vdots \\ p_{N,1} \end{pmatrix}} &= \underset{\mathbf{X}}{\begin{bmatrix} a_1(1,1) & a_1(1,2) & \cdots & a_1(h,i) & \cdots & a_1(H,I) \\ a_2(1,1) & a_2(1,2) & \cdots & a_2(h,i) & \cdots & a_2(H,I) \\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ a_n(1,1) & a_n(1,2) & \cdots & a_n(h,i) & \cdots & a_n(H,I) \\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ a_N(1,1) & a_N(1,2) & \cdots & a_N(h,i) & \cdots & a_N(H,I) \end{bmatrix}} \underset{\boldsymbol \beta}{\begin{pmatrix} \langle p_{n,1} \rangle_{1,1} \\ \langle p_{n,1} \rangle_{1,2} \\ \vdots \\ \langle p_{n,1} \rangle_{h,i} \\ \vdots \\ \langle p_{n,1} \rangle_{H,I} \end{pmatrix}} \end{align*}$

for ${\boldsymbol \beta}$ where $\mathbf{X} \in \mathrm{R}^{N \times A}$ , $A \gg N \geq A^\star$ , and $\mathrm{Rank}[\mathbf{X}] = N$ . i.e., this is a similar problem as the linear solve in Section 3 above, but with $2$ additional complications. First, the system is underdetermined in the sense that there are many more payout-relevant attributes than stocks, $A \gg N \geq A^\star$ . Second, arbitrageurs don’t know exactly how many attributes are in $\mathbf{A}^\star$ . They know that on average, $\mathrm{E}[A^\star] = 2 \cdot \pi \cdot (1 - \pi) \cdot A$ ; however, $A^\star$ itself is a random variable.

It’s easy enough to extend the solution strategy in Section 3 to the case of an underdetermined system where a solution ${\boldsymbol \beta}$ is a member of the set:

(16) $\begin{align*} \left\{ \ {\boldsymbol \beta} \ \middle| \ \mathbf{X}{\boldsymbol \beta} = \mathbf{y} \ \right\} &= \left\{ \ \begin{bmatrix} {\boldsymbol \beta}_1 \\ {\boldsymbol \beta}_2 \end{bmatrix} = \begin{bmatrix} {\boldsymbol \beta}_1 \\ 0 \end{bmatrix} + \mathbf{F} \mathbf{b} \ \middle| \ \mathbf{b} \in \mathrm{R}^{A - A^\star} \ \right\} \end{align*}$

where $\mathbf{F}$ is a matrix whose column vectors are a basis for the null space of $\mathbf{X}$ . Suppose that $\mathbf{X}_1 \subseteq \mathbf{X}$ is $(A^\star \times A^\star)$ -dimensional and non-singular, then:

(17) $\begin{align*} \mathbf{X} {\boldsymbol \beta} &= \begin{bmatrix} \mathbf{X}_1 & \mathbf{X}_2 \end{bmatrix} \begin{bmatrix} {\boldsymbol \beta}_1 \\ {\boldsymbol \beta}_2 \end{bmatrix} = \mathbf{y} \quad \text{and} \quad {\boldsymbol \beta}_1 = \mathbf{X}_1^{-1} \left( \mathbf{y} - \mathbf{X}_2 {\boldsymbol \beta}_2 \right) \end{align*}$

Obviously, setting ${\boldsymbol \beta}_2 = 0$ is one solution. The full set of solutions defining the null space $\mathbf{F}$ is given by:

(18) $\begin{align*} {\boldsymbol \beta} &= \begin{bmatrix} {\boldsymbol \beta}_1 \\ {\boldsymbol \beta}_2 \end{bmatrix} = \begin{bmatrix} \mathbf{X}_1^{-1} \mathbf{y} \\ 0 \end{bmatrix} + \underbrace{\begin{bmatrix} - \mathbf{X}_1^{-1} \mathbf{X}_2 \\ \mathbf{I} \end{bmatrix}}_{\mathbf{F}} \mathbf{b} \end{align*}$

Thus, if it takes $f$ flops to factor $\mathbf{X}$ into $\begin{bmatrix} \mathbf{X}_1 & \mathbf{X}_2 \end{bmatrix}$ and $s$ flops to solve each linear system of the form $\mathbf{X}_1 {\boldsymbol \beta}_1 = \mathbf{y}$ , then the total cost of parameterizing all the solutions is:

(19) $\begin{align*} \{ f + s \cdot (A - A^\star + 1)\}{\scriptstyle \mathrm{flops}} \end{align*}$

Via the LU factorization method, I know that the factorization step will cost roughly:

(20) $\begin{align*} f &= \left\{ (\sfrac{2}{3}) \cdot (A^\star)^3 + (A^\star)^2 \cdot (A - A^\star) \right\}{\scriptstyle \mathrm{flops}} \end{align*}$

Moreover, we know from Section 3 that the cost of the solve step will be on the order $s = \mathrm{O}(A^\star)^3$ . However, there is one detail left to consider still. Namely that arbitrageurs don’t know $A^\star$ . Thus, they have to solve for both ${\boldsymbol \beta}$ and $A^\star$ by starting at $A^\star_0 = \mathrm{E}[A^\star]$ and iterating on the above process until the columns of $\mathbf{F}$ actually represents a basis for the null space of $\mathbf{X}_1$ . Thus, the total effort needed is:

(21) $\begin{align*} \left\{ (\sfrac{1}{3}) \cdot (A^\star)^3 \cdot (A - A^\star) \right\}^{\sfrac{1}{\gamma}} {\scriptstyle \mathrm{flops}} \end{align*}$

where $\gamma \in (0,1)$ is the convergence rate and the calculation is dominated by the effort spent searching through the null space to be sure that $A^\star$ is correct. More broadly, this step is just one way of capturing the deeper idea that knowing where to look is hard. e.g., Warren Buffett says that he “can say no in $10$ seconds or so to $90{\scriptstyle \%}$ or more of all the [investment opportunities] that come along.” This is great… until you consider how many investment opportunities Buffett runs across every single day. Saying no in $10$ second flat then turns out to be quite a chore! Alternatively, as the figure above highlights, this is why traders use personalized multi-monitor computing setups that make it easy to spot patterns instead of a shared super computer with minimal output.

5. Clock Time Interpretation

Is $\mathrm{O}((A^\star)^3 \cdot (A - A^\star))^{\sfrac{1}{\gamma}}$ flops a big number? Is it a small number? Flop counts were originally used when floating-point operations were the main computing bottleneck. Now things relating to how data are stored such as cache boundaries and reference locality have first order affects on computation time as well. Nevertheless, ﬂop counts can still give a good back of the envelope estimate of the relative amount of time it would take to execute a procedure, and such a calculation would be helpful in trying to interpret the unit of measurement “flops” on a human scale. e.g., on one hand, arbitrageur effort would be a silly constraint to worry about if the time it took to execute real world calculations was infinitesimally small. On the other hand, flops might be a poor unit of measure for arbitrageurs’ effort if the time it took to carry out reasonable calculations was on the order of millennia since arbitrageurs clearly don’t wait this long to act! Actually doing a quick computation can allay these fears.

Suppose that computers can execute roughly $1{\scriptstyle \mathrm{mil}}$ operations per second. Millions of instructions per second (i.e., $\mathrm{MIPS}$ ) is a standard unit of computational speed. I can then calculate the amount of time it would take to execute a given number of flops at a speed of $1{\scriptstyle \mathrm{MIPS}}$ as:

(22) $\begin{align*} \mathrm{Time} &= \left((A^\star)^3 \cdot (A - A^\star) \right)^{\sfrac{1}{\gamma}} \times \left(\frac{1 {\scriptstyle \mathrm{sec}}}{10^6 {\scriptstyle \mathrm{flops}}}\right) \times \left(\frac{1 {\scriptstyle \mathrm{day}}}{86400 {\scriptstyle \mathrm{sec}}}\right) \end{align*}$

Thus, if there are roughly $5000$ characteristics that can take on $50$ different levels and $1$ out of every $1000$ attributes realizes a shock each period, then even if arbitrageurs guess exactly right on the number of shocked attributes (i.e., so that $\gamma = 1$ ) a brute force search would take $45$ days to complete. Clearly, a brute force search strategy just isn’t feasible. There just isn’t enough time to physically do all of the calculations.

6. A Persistent Problem

I conclude by addressing a common question. You might ask: “Won’t really fast computers make cognitive control irrelevant?” No. Progress in computer storage has actually outstripped progress in processing speeds by a wide margin. This is known as Kryder’s Law. Over $10$ years the cost of processing has dropped by a factor of roughly $32$ (i.e., Moore’s Law). By contrast, the cost of storage has dropped by a factor of $1000$ over the same period. e.g., take a look at the figure below made using data from www.mkomo.com which shows that the cost of disk space decreases by $58{\scriptstyle \%}$ each year. What does this mean in practice? Well, as late as 1980 a $26{\scriptstyle \mathrm{MB}}$ hard drive cost $\mathdollar 5 \times 10^3$ , implying that a $1{\scriptstyle \mathrm{TB}}$ hard drive would have cost upwards of $\mathdollar 2 \times 10^5$ . These days you can pick up a $1{\scriptstyle \mathrm{TB}}$ drive for about $\mathdollar 50$ ! We have so much storage that finding things is now an important challenge. This is why we find Google so helpful. Instead of being eliminated by computational power, cognitive control turns out to be a distinctly modern problem.

Many Assets with Attribute-Specific Shocks

October 2, 2013 by Alex

1. Motivation and Outline

Asset pricing models tend to focus on a single stock that realizes a normally distributed value shock of undefined origins. e.g., think of Kyle (1985) as a representative example. This is a great starting point; however, massive size and dense interconnectedness are key features of financial markets. Studying a financial market without these features is like studying dry water. In this post I suggest a simple way to modify the standard payout structure to allow for many assets and attribute-specific shocks.

What do I mean by attribute-specific shocks? To illustrate, have a look at the figure below which shows the most common $25{\scriptstyle \%}$ of topics that came into play when journalists from the Wall Street Journal wrote about Micron Technology from 2001 to 2012. The figure reads that: “If you select a Wall Street Journal article that mentioned Micron Technology in the abstract at random, then there is a $9{\scriptstyle \%}$ chance that ‘Antitrust’ is a listed subject.” Here’s the key point. When news about Micron Technology emerged, it was never just about Micron Technology. Journalists wrote about a particular SEC investigation, or a technology shock affecting all hard disk drive makers, or the firms currently active in the mergers and acquisitions market, etc… Value shocks are physical. They are rooted in particular events affecting subsets of stocks.

A big market with attribute-specific shocks means perspective matters. Consider a real world example. e.g., Khandani and Lo (2007) wrote about the ‘Quant Meltdown’ of 2007 that “the most remarkable aspect of these hedge-fund losses was the fact that they were confined almost exclusively to funds using quantitative strategies. With laser-like precision, model-driven long/short equity funds were hit hard on Tue Aug $7$ th and Wed Aug $8$ th, despite relatively little movement in [the average level of] fixed-income and equity markets during those $2$ days and no major losses reported in any other hedge-fund sectors.” Every individual stock was priced correctly, yet there was still a huge multi-stock price movement in a particular subset of stocks. Here’s the kicker: You would never have noticed this shock unless you knew exactly where to look!

2. Payout Structure

In Kyle (1985) there is a single stock with a fundamental value distributed as $v \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma_v^2)$ . Suppose that, instead, there are actually $N$ stocks that each have $H$ different payout-relevant characteristics. Every characteristic can take on $I$ distinct levels. I call a (characteristic, level) pairing an ‘attribute’ and use the indicator variable $a_n(h,i)$ to denote whether or not a stock has an attribute. Think about attributes as sitting in a $(H \times I)$ -dimensional matrix, $\mathbf{A}$ , as illustrated in Equation (1) below:

(1) $\begin{equation*} \mathbf{A}^{\top} = \bordermatrix{ ~ & 1 & 2 & \cdots & H \cr 1 & \text{Agriculture} & \text{Albuquerque} & \cdots & \text{Alcoa Inc} \cr 2 & \text{Apparel} & \textbf{\color{red}Boise} & \cdots & \text{ConocoPhillips} \cr 3 & \textbf{\color{red}Disk Drive} & \text{Chicago} & \cdots & \text{Dell Inc} \cr \vdots & \vdots & \vdots & \ddots & \vdots \cr I & \text{Wholesale} & \text{Vancouver} & \cdots & \textbf{\color{red}Xerox Corp} \cr} \end{equation*}$

Further, suppose that each stock’s value is then the sum of a collection of attribute-specific shocks:

(2) $\begin{align*} v_n &= \sum_{h,i} x(h,i) \cdot a_n(h,i) \end{align*}$

where the shocks are distributed according to the rule:

(3) $\begin{align*} x(h,i) &= x^+(h,i) + x^-(h,i) \quad \text{with each} \quad x^\pm(h,i) \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \begin{cases} \pm \sfrac{\delta}{\sqrt{H}} &\text{ w/ prob } \pi \\ \ \: \, 0 &\text{ w/ prob } (1 - \pi) \end{cases} \end{align*}$

You could also add the usual factor exposure and firm-specific shocks to the model:

(4) $\begin{align*} v_n &= \cancelto{0}{{\boldsymbol \theta}_n^{\top} \mathbf{f}} + \sum_{h,i} x(h,i) \cdot a_n(h,i) + \cancelto{0}{\epsilon_n} \end{align*}$

I’ve excluded these terms for clarity since they are not new. You might be wondering: “Aren’t these attribute-specific shocks captured by a covariance matrix, though?” No. The covariance between any $2$ assets in this setup is:

(5) $\begin{align*} \mathrm{Cov}\left[v_n,v_{n'}\right] = H \cdot \left(\sfrac{1}{I}\right)^2 \cdot 2 \cdot \pi \cdot (1-\pi) \cdot \left( \sfrac{\delta}{\sqrt{H}} \right)^2 = 2 \cdot \pi \cdot (1 - \pi) \cdot \left( \sfrac{\delta}{I} \right)^2 \end{align*}$

where the first $H$ corresponds to the number of characteristics, the $(\sfrac{1}{I})^2$ term denotes the probability that both stocks have the same level for a particular characteristic, the $2 \cdot \pi \cdot (1 - \pi)$ term denotes the probability that the attribute realizes a shock, and the $(\sfrac{\delta}{\sqrt{H}})^2$ term denotes the squared attributes-specific shock. The takeaway from this calculation is that the covariance matrix is completely flat (i.e., it doesn’t matter which $n$ and $n'$ you compare) and arbitrarily small.

Lots of things that you might think of as explained by constant covariance aren’t. e.g., the figure above shows the maximum industry-specific contribution to daily return variance from January 1976 to December 2011 using the methodology in Campbell, Lettau, Malkiel, and Xu (2001). The vertical text at the bottom gives the name of the industry with the largest industry-specific contribution to daily return variance each month any time it changes from the previous month. The figure reads that: “While traders can usually expect to understand no more than $5{\scriptstyle \%/\mathrm{yr}}$ of a typical firm’s $20{\scriptstyle \%/\mathrm{yr}}$ variation in daily returns, there are times such as in $1987$ when this $5{\scriptstyle \%/\mathrm{yr}}$ figure suddenly jumps to over $25{\scriptstyle \%/\mathrm{yr}}$ . What’s more, the density of the text along the base of the figure shows that the important (i.e., extremal) industry regularly changes from month to month.”

3. Approximation Error

One of the nice features of this reformulation of the usual normal value shocks is that, although it changes the interpretation of where each firm’s value comes from, it doesn’t alter any of the Guassian structure of the problem. i.e., the normal approximation to the binomial distribution says that:

(6) $\begin{align*} \sum_{h,i} x(h,i) \cdot a_n(h,i) \overset{\scriptscriptstyle \text{``ish''}}{\sim} \mathrm{N}(0, \sigma_v^2) \quad \text{where} \quad \sigma_v^2 = 2 \cdot \delta^2 \cdot \pi \cdot (1 - \pi) \end{align*}$

where the “ish” means that there is a small and easy to compute approximation error. e.g., consider the collection of attribute-specific shocks for asset $n$ , $\{x_1,x_2,\ldots,x_H\}$ , with $\mathrm{E}[x_h] = 0$ , $\mathrm{E}[x_h^2] = \sigma_v^2 > 0$ , and $\mathrm{E}[|x_h|^3] = \rho_v < \infty$ and define the normalized sum $X(H) = \sfrac{1}{(\sigma_v \cdot \sqrt{H})} \cdot \sum_h x_h$ with the cumulative density function $F_H(x) = \mathrm{Pr}[X(H) \leq x]$ . Then, we know via the central limit theorem that $F_H(x) \to \Phi(x)$ as $H \to \infty$ where $\Phi(\cdot)$ is the standard normal distribution.

Moreover, the Berry-Esseen Theorem says that:

(7) $\begin{align*} \max_{x \in \mathrm{R}}\left\{ \ \left| F_H(x) - \Phi(x) \right| \ \right\} &\leq \frac{0.50 \cdot \rho_v}{\sigma_v^3 \cdot \sqrt{H}} = \frac{0.50}{\sqrt{2 \cdot H \cdot \pi \cdot (1 - \pi)}} \end{align*}$

where the second equals sign applies only in the special case of the sum of $2$ binomially distributed random variables. The figure above shows how well this approximation holds as the number of payout-relevant characteristics, $H$ , increases from $100$ to $10000$ in a world where $\pi = \sfrac{1}{100}$ . I compute the $x$ -axis on a grid of unit length $\Delta x = \sfrac{1}{100}$ . If there are $100$ firms with values that typically range over an area of $\sigma_v^2 = \mathdollar 100{\scriptstyle /\mathrm{sh}}$ , then in a world with $H = 1000$ payout-relevant characteristics only $6$ stocks will be misvalued by a mere $\Delta x \cdot \sigma_v \cdot \sqrt{H} \approx \mathdollar 3{\scriptstyle /\mathrm{sh}}$ if you use the normal approximation to the binomial distribution rather than the true distribution. Thus, less than $1$ dollar in $500$ isn’t accounted for by the approximation:

(8) $\begin{align*} 0.0018 &= \frac{6{\scriptstyle \mathrm{stocks}} \cdot \mathdollar 3{\scriptstyle /\mathrm{sh}}}{100{\scriptstyle \mathrm{stocks}} \cdot \sigma_v^2} \end{align*}$

By contrast, the figure below shows the $12{\scriptstyle \mathrm{mo}}$ moving average of the percent of the variance in firm-level daily returns explained by market and industry factors over the time period from January 1976 to December 2011 using the methodology from Campbell, Lettau, Malkiel, and Xu (2001). This figure reads that: “For a randomly selected stock in 1999, market and industry considerations only account for around $30{\scriptstyle \%}$ of its daily return variation.” In other words, the usual factor models typically account for less than half of the fluctuations in firm value. i.e., they are $2$ orders of magnitude less precise than the approximation error!

4. Whose Perspective?

You might ask: “Why bother adding this extra structure?” In a big market with attribute-specific shocks, perspective matters. This is the punchline. Asset values and attribute-specific shocks essentially carry the same information since:

(9) $\begin{align*} v_n &= \sum_{h,i} x(h,i) \cdot a_n(h,i) + \mathrm{O}(H)^{-\sfrac{1}{2}} \\ x(h,i) &= \frac{1}{\sfrac{N}{I}} \cdot \sum_n v_n \cdot a_n(h,i) + \mathrm{O}(\sfrac{N}{I})^{-\sfrac{1}{2}} \end{align*}$

However, knowing the value of an asset tells you very little about whether any particular one of its attributes has realized a shock. Similarly, knowing whether an attribute has realized a shock is a really noisy signal about the value of any particular stock with that attribute.

To see how this duality might affect asset prices, consider a simple example. e.g., suppose that we are in a multi-period Kyle (1985)-type world where value investors know the fundamental value of a particular stock, and they place orders with a market maker who processes only the order flow for that particular stock. It could well be the case that market makers price each stock correctly on average:

(10) $\begin{align*} \mathrm{E} \left[ \ p_{n,t} - v_n \ \middle| \ y_{n,t} \ \right] &= 0 \end{align*}$

Yet, the high-dimensionality of market would mean that there still could be groups of mispriced stocks:

(11) $\begin{align*} \mathrm{E} \left[ \ \langle p_{n,t} \rangle_{h,i} - \langle v_n \rangle_{h,i} \ \middle| \ x(h,i) = \sfrac{\delta}{\sqrt{H}} \ \right] &< 0 \; \text{and} \; \mathrm{E} \left[ \ \langle p_{n,t} \rangle_{h,i} - \langle v_n \rangle_{h,i} \ \middle| \ x(h,i) = - \sfrac{\delta}{\sqrt{H}} \ \right] > 0 \end{align*}$

where $\langle p_{n,1} \rangle_{h,i} = \sfrac{I}{N} \cdot \sum_n p_{n,1} \cdot a_n(h,i)$ denotes the sample average time $t$ price for stocks with a particular attribute, $(h,i)$ . This is a case of more is different. If an oracle told you that $x(h,i) = \sfrac{\delta}{\sqrt{H}}$ for some attribute $(h,i)$ , then you would know that the average price of stocks with attribute $(h,i)$ would be:

(12) $\begin{align*} \langle p_{n,t} \rangle_{h,i} &= \langle \lambda_{n,t} \cdot \beta_{n,t} \rangle_{h,i} \cdot \frac{\delta}{\sqrt{H}} + \mathrm{O}(\sfrac{N}{I})^{-1/2} \end{align*}$

where $\langle \lambda_{n,t} \cdot \beta_{n,t}\rangle_{h,i} < 1$ since value investors would have an incentive to delay trading in a dynamic model. i.e., $\langle p_{n,1} \rangle_{h,i}$ will be less than its fundamental value $\langle v_n \rangle_{h,i} = \sfrac{\delta}{\sqrt{H}}$ even though it will be easy to see that $\langle p_{n,1} \rangle_{h,i} \neq 0$ as $\sfrac{I}{N} \to 0$ .

There are way more payout-relevant attributes than anyone could ever investigate in a single period. This is why Charlie Munger explains that it’s his job “to find a few intelligent things to do, not to keep up with every damn thing in the world.” If we think about each stock as a location in a “spatial” domain and the attribute-specific shocks as particular points in a “frequency” domain, this result takes on the flavor of a generalized uncertainty principle. i.e., it’s really hard to simultaneously estimate the price of a portfolio at both very fine scales (i.e., containing a single asset) and very low frequencies (i.e., affecting every stock with an attribute).

Scaling Up “Iffy” Decisions

September 3, 2013 by Alex

1. Introduction

Imagine you are an algorithmic trader, and you have to set up a trading platform. How many signals should you try to process? How many assets should you trade? If you are like most people, your answer will be something like: “As many as I possibly can given my technological constraints.” Most people have the intuition that the only thing holding back algorithmic trading is computational speed. They think that left unchecked, computers will eventually uncover every trading opportunity no matter how small or obscure. e.g., as Scott Patterson writes in Dark Pools, these computerized trading platforms are just “tricked-out artificial intelligence systems designed to scope out hidden pockets in the market where they can ply their trades.”

This post highlights another concern. As you use computers to discover more and more “hidden pockets in the market”, the scale of these trades might grow faster than the usefulness of your information. Weak and diffuse signals might turn out to be really risky to trade on. How might this work? e.g., suppose that in order for your computers to recognize the impact of China’s GDP on US stock returns you need to feed them data for $500$ assets; however, in order for your computers to recognize the impact of recent copper discoveries in Guatemala on electronics manufacturing company returns, you need to feed your machines data on $5000$ assets. Even if the precision of the signal that your computers spit out is the same, its magnitude will be smaller. Copper discoveries in Guatemala just don’t matter as much. Thus, you would take on more risk by aggressively trading the second signal because it’s weaker and you would have to take on a position in $10$ times as many assets! Thus, this post suggests that even if you might want to maximize the number of inputs your machines get, you might want to limit how broadly you apply them.

First, in Section 2 I illustrate how the scale of a trading opportunity might increase faster than the precision on your signal via an an example based on the well known Hodges’ Estimator. You should definitely check out Larry Wasserman‘s excellent post on the this topic for an introduction to the ideas. In Section 3 I then outline the basic decision problem I have in mind. In Section 4 I show how the risk associated with trading weak signals might explode as described above. Finally, in Section 4 I conclude by suggesting an empirical application of these ideas.

2. Hodges’ Estimator

Think about the following problem. Suppose that $m_1, m_2, \ldots, m_Q \overset{\scriptscriptstyle \mathrm{iid}}{\sim} N(\mu,1)$ are random variables denoting (say) innovations in the quarterly dividends of a group of $Q$ stocks:

(1) $\begin{align*} m_q &= d_{q,t+1} - \mathrm{E}_t[d_{q,t+1}] \end{align*}$

You want to know if the mean of the dividend changes is non-zero. i.e., has this group of stocks realized a shock? Define the estimator $\mathrm{Hdg}[Q]$ as follows:

(2) $\begin{align*} \mathrm{Hdg}[Q] &= \begin{cases} \mathrm{Avg}[Q] &\text{if } \mathrm{Avg}[Q] \geq Q^{-1/4} \\ 0 &\text{else } \end{cases} \end{align*}$

where $\mathrm{Avg}[Q] = \frac{1}{Q} \cdot \sum_{q=1}^Q m_q$ . This estimator says: “If the average of the dividend changes is sufficiently big, I’m going to assume there has been a shock of size $\mathrm{Avg}[Q]$ ; otherwise, I’ll assume that there’s been no shock.” This is an example of a Hodges-type estimator. $\mathrm{Hdg}[Q]$ is a consistent estimator of $\mu$ in the sense that:

(3) $\begin{align*} \sqrt{Q} \cdot (\mathrm{Hdg}[Q] - \mu) &\overset{\scriptscriptstyle \mathrm{Dist}}{\to} \begin{cases} N(0,1) &\text{if } \mu \neq 0 \\ 0 &\text{if } \mu = 0 \end{cases} \end{align*}$

Thus, as you examine more and more stocks from this group, you are guaranteed to discover the true mean. However, the worst case expected loss of the estimator, $\mathrm{Hdg}[Q]$ , is infinite!

(4) $\begin{align*} \sup_\mu \mathrm{E}\left[ Q \cdot (\mathrm{Hdg}[Q] - \mu)^2 \right] &\to \infty \end{align*}$

This is true even though the worst case loss for the sample mean, $\mathrm{Avg}[Q]$ , is flat:

(5) $\begin{align*} \sup_\mu \mathrm{E}\left[ Q \cdot (\mathrm{Avg}[Q] - \mu)^2 \right] &\to 1 \end{align*}$

I plot the risk associated with Hodges’ estimator in the figure below. What’s going on here? Well, as $Q$ gets bigger and bigger, there remains a region around $\mu = 0$ where you are quite certain that the mean is $0$ . However, at the edge of this region, there is a band of values of $\mu$ where if you are wrong and $\mu \neq 0$ , then your prediction error summed across every one of the $Q$ stocks you examined turns out to be quite big.

3. Decision Problem

Now, think about a decision problem that generates a really similar decision rule—namely, try to figure out how many shares, $a$ , to purchase where the stock’s payout is determined by $N$ different attributes via the coefficients ${\boldsymbol \mu}$ :

(6) $\begin{align*} \max_a V(a;{\boldsymbol \mu},\mathbf{x}) &= \max_a V(a) = \min_a \frac{\gamma}{2} \cdot \left( a - \sum_{n=1}^N \mu_n \cdot x_n \right)^2 \end{align*}$

Here, you are trying to maximize your value, $V$ , by choosing the right number of shares to hold—i.e., the risk action. Ideally you would take the action which is exactly equal to the ideal action $a = \sum_{n=1}^N \mu_n \cdot x_n$ ; however, it’s hard to figure out the exact loadings, ${\boldsymbol \mu}$ , for every single one of the $N$ relevant dimensions. As a result, you take an action $a$ which isn’t exactly perfect:

(7) $\begin{align*} a &= A(\mathbf{m};\mathbf{x}) = A(\mathbf{m}) = \sum_{n=1}^N m_n \cdot x_n \end{align*}$

I use the function $L(\cdot)$ to denote the loss in value you suffer by choosing a suboptimal asset holding:

(8) $\begin{align*} L(\mathbf{m};{\boldsymbol \mu}) = L(\mathbf{m}) &= \mathrm{E}\left[ \ V(A({\boldsymbol \mu}); {\boldsymbol \mu}, \mathbf{x}) - V(A(\mathbf{m}); {\boldsymbol \mu}, \mathbf{x}) \ \right] \\ &= - \frac{\gamma}{2} \cdot \mathrm{E}\left[ \left( \sum_{n=1}^N (m_n - \mu_n) \cdot x_n \right)^2 \right] \\ &= \frac{\gamma}{2} \cdot \mathrm{E}\left[ \sum_{n=1}^N (m_n - \mu_n)^2 \right] \end{align*}$

where the $3$ rd line follows from assuming that $x_n \overset{\scriptscriptstyle \mathrm{iid}}{\sim} N(0,1)$ .

So which of the $N$ different details should you pay attention to? Which of them should you ignore? One way to frame this problem would be to look for a sparse solution and only ask your computers to trade on the signals that are sufficiently important. Gabaix (2012) shows how to do this using an $\ell_1$ -program, I outline the main ideas in an earlier post. Basically, you would try to minimize your loss from taking a suboptimal action subject to an $\ell_1$ -penalty:

(9) $\begin{align*} L(\mathbf{m}) &= \min_{\mathbf{m} \in \mathrm{R}^N} \left\{ \frac{\gamma}{2} \cdot \sum_{n=1}^N (m_n - \mu_n)^2 + \kappa \cdot \sum_{n=1}^N |m_n| \right\} \end{align*}$

so that your optimal choice of actions is given by:

(10) $\begin{align*} m_n &= \begin{cases} \mu_n &\text{if } |\mu_n| \geq \kappa \\ 0 &\text{if } |\mu_n| < \kappa \end{cases} \end{align*}$

That’s the decision problem. So far everything is old hat.

4. Maximum Risk

The key insight in this post is that these $\mu_n$ terms don’t come down from on high. As a trader, you don’t just know what these terms are from the outset. They weren’t stitched onto the forehead of your favorite teddy bear. Instead, you have to use data on lots of stocks to estimate them as you go. In this section, I think about a world where you feed data on $Q$ different stocks to your machines, and each of these assets has the appropriate action $\sum_{n=1}^N \mu_{n,q} \cdot x_{n,q}$ . I then investigate what happens when you use the estimator:

(11) $\begin{align*} \widetilde{m}_n &= \begin{cases} \mathrm{Avg}_n[Q] &\text{if } |\mathrm{Avg}_n[Q]| \geq \kappa \\ 0 &\text{if } |\mathrm{Avg}_n[Q]| < \kappa \end{cases} \end{align*}$

instead of the estimator in Equation (10) where $\mathrm{Avg}_n[Q] = \frac{1}{Q} \cdot \sum_{q=1}^Q \mu_{n,q}$ and $\kappa$ isn’t growing that fast as $Q$ gets larger and larger. i.e., we have that:

(12) $\begin{align*} \widetilde{m}_n &= \mathrm{Avg}_n[Q] \cdot 1_{\{ |\mathrm{Avg}_n[Q]| \geq \kappa \}} \quad \text{with} \quad \kappa = \mathrm{O}(\sqrt{2 \log Q}) \end{align*}$

In some senses, $\widetilde{\mathbf{m}}$ is still a really good estimator of ${\boldsymbol \mu}$ . To see this, let $\mu_n$ denote the true effect for a particular attribute. Then, clearly for $|\mu_n| \geq \kappa$ , we have that:

(13) $\begin{align*} \sqrt{Q} \cdot (\widetilde{m}_n - \mu_n) &\overset{\scriptscriptstyle \mathrm{Dist}}{\to} \mathrm{N}(0,1) \end{align*}$

Similarly, for $\mu_n = 0$ , we have that:

(14) $\begin{align*} \sqrt{Q} \cdot (\widetilde{m}_n - \mu_n) &\overset{\scriptscriptstyle \mathrm{Dist}}{\to} 0 \end{align*}$

Thus, the estimate $\widetilde{m}_n$ is a strictly better estimator of $\mu_n$ than the sample average. Pretty cool.

However, in another sense, it’s a terrible estimator since the maximum risk associated with trading on $\widetilde{\mathbf{m}}$ is unbounded:

(15) $\begin{align*} \sup_{\mu_n} R_n(\widetilde{\mathbf{m}}) &\to \infty \quad \text{where} \quad R_n(\widetilde{\mathbf{m}}) = \mathrm{E}\left[ Q \cdot (\widetilde{m}_n - \mu_n)^2 \right] \end{align*}$

This result means that if you use this estimator to trade on, there are parameter values for $\mu_n$ which lead your computer to take on positions that are infinitely risky and make you really unhappy! How is this possible? Well, take a look at the following decomposition of the risk associated with the estimator $\widetilde{\mathbf{m}}$ :

(16) $\begin{align*} R_n(\widetilde{\mathbf{m}}) &= \mathrm{E}\left[ Q \cdot (\widetilde{m}_n - \mu_n)^2 \right] \\ &= \mathrm{E}\left[ Q \cdot (\mathrm{Avg}_n[Q] \cdot 1_{\{ |\mathrm{Avg}_n[Q]| \geq \kappa \}} - \mu_n)^2 \right] \\ &= \mathrm{Pr}\left[ |\mathrm{Avg}_n[Q]| \geq \kappa \right] \cdot \mathrm{E}\left[ Q \cdot (\mathrm{Avg}_n[Q] - \mu_n)^2 \right] + \mathrm{Pr}\left[ |\mathrm{Avg}_n[Q]| < \kappa \right] \cdot \mathrm{E}\left[ Q \cdot \mu_n^2 \right] \end{align*}$

It turns out that there are choices of $\mu_n$ for which $\mathrm{Pr}\left[ |\mathrm{Avg}_n[Q]| < \kappa \right] \to 1$ as $Q \to \infty$ , but at the same time $\mathrm{E}\left[ Q \cdot \mu_n^2 \right] \to \infty$ for the same process. In words, this means that there are choices of $\mu_n$ which make dimension $n$ always appear to your machines as an unimportant detail no matter how many stocks you look at, but this conclusion is still risky enough that if you applied to every one of these $Q$ stocks you’d get an infinitely risky portfolio.

e.g., consider the case where:

(17) $\begin{align*} \mu_n &= \hbar \cdot Q^{-1/4} \quad \text{with} \quad 0 < \hbar < 1 \end{align*}$

Then, it’s easy to see that:

(18) $\begin{align*} \mathrm{Pr}_{\mu_n} \left[ \ |\mathrm{Avg}_n[Q]| < Q^{-1/4} \ \right] &= \mathrm{Pr}_{\mu_n} \left[ \ Q^{-1/4} < \mathrm{Avg}_n[Q] < Q^{-1/4} \ \right] \\ &= \mathrm{Pr}_{\mu_n} \left[ \ \sqrt{Q} \cdot \left(- Q^{-1/4} - \mu_n \right) < z < \sqrt{Q} \cdot \left(- Q^{-1/4} - \mu_n \right) \ \right] \\ &= \mathrm{Pr}_{\mu_n} \left[ \ - Q^{1/4} \cdot ( 1 + \hbar ) < z < Q^{1/4} \cdot ( 1 - \hbar ) \ \right] \end{align*}$

where $z \overset{\scriptscriptstyle \mathrm{iid}}{\sim} N(0,1)$ . But, this means that as $Q \to \infty$ , we have that:

(19) $\begin{align*} \mathrm{Pr}_{\mu_n} \left[ \mathrm{Avg}_n[Q] < \kappa \right] &\to 1 \quad \text{while} \quad \mathrm{E}\left[ Q \cdot \mu_n^2 \right] \to \infty \end{align*}$

Thus, we have a proof by explicit construction. The punchline is that unleashing your machines on the market might lead them into situations where the scale of the position grows faster than the precision of the signal.

5. Empirical Prediction

What does this result mean in the real world? Well, the argument in the section above relied on the fact that when you used $Q$ different assets to derive a signal, you also then traded on all $Q$ assets. i.e., this is why the risk function has a factor of $Q$ in it. Thus, one way to get around this problem would be to commit to trading a smaller number of asset $Q'$ where:

(20) $\begin{align*} Q \gg Q' \end{align*}$

even though you had to use $Q$ assets to recover the signal. i.e., even if you have to grab signals from the $4$ corners of the market to create your trading strategy, you might nevertheless specialize in trading only a few assets so that in contrast to Equation (19) you would always have:

(21) $\begin{align*} \mathrm{Pr}_{\mu_n} \left[ \mathrm{Avg}_n[Q] < \kappa \right] &\to 1 \quad \text{while} \quad \mathrm{E}\left[ Q' \cdot \mu_n^2 \right] \to \mathrm{const} < \infty \end{align*}$

If traders are actually using signals from lots of assets, but only trading a few of them to avoid the infinite risk problem, then this theory would give a new motivation for excess comovement. i.e., you and I might feed the same data into our machines, get the same output, and choose to trade on entirely different subsets of stocks.

Identifying Relevant Asset Pricing Time Scales

August 21, 2013 by Alex

1. Introduction

Take a look at the figure below which displays the price level and trading volume of the S&P 500 SPDR over trading year from July 2012 to July 2013. The solid black line in the top panel shows the price process for the ETF at a daily frequency. Look at how much within day variation in the trading volume there is. You can see that the number of shares traded per day ranges over several orders of magnitude from roughly $1 \times 10^6$ to $300 \times 10^6$ . What’s more, the red vertical lines in the top figure show the intraday range for the traded price, and these bands stretch $\mathdollar 10$ per share in some cases. People are trading this ETF at all sorts of different investment horizons. Any model fit to daily data will ignore really interesting economics operating at these shorter investment horizons. Vice versa, any model fit to higher frequency minute-by-minute data will miss out on some longer buy-and-hold decisions.

In this post, I ask a pair of questions: (a) “Is it possible to recover the most ‘important’ investment horizons from the time series of SPDR prices?” and (b) “What statistical techniques might you use to do this?”

I work in reverse order. After outlining a toy stochastic process with a clear time scale pattern in Section 2, I start the real work in Sections 3 and 4 by discussing a pair different statistical tools you might use to uncover important time scales in this asset market. Here, when you read the word ‘important’ you should think ‘time scales where people are actually making decisions’. In Section 3, I outline the standard approach in asset pricing of using a time series regression with multiple lags. Then, in Section 4, I show how this technique has an equivalent waveform representation. After reading Sections 2 and 3 it may seem like it’s always possible to recover the relevant investment horizons from a financial time series. In Sections 5 and 6, I end on a down note by giving a counter example. The offending stochastic process is a workhorse in financial economics—namely, the Ornstein-Uhlenbeck process. Thus, the answer to question (a) seems to be: No.

You can find all of the code to create the figures below here.

2. Toy Stochastic Process

The next $2$ sections discuss different ways of recovering the relevant time scales from a financial data series. In particular, I am interested in the time series of log prices as I don’t want to have to worry about the series going negative. I define returns, $r_{t + \Delta t}$ , as:

(1) $\begin{align*} r_{t + \Delta t} &= \log p_{t+\Delta t} - \log p_t \end{align*}$

and assume that both log prices and returns are wide-sense stationary so that:

(2) $\begin{align*} \mathrm{E}[x_t] &= 0 \quad \text{and} \quad \mathrm{E}[x_t \cdot x_{t - h \cdot \Delta t}] = \mathrm{C}(h) \cdot \sigma^2 \end{align*}$

for $x_t \in \{ \log p_t, r_t \}$ where $\mathrm{C}(h)$ denotes the $h$ -period ahead autocorrelation function. I use $+\Delta t$ instead of the usual $+1$ in the time subscripts above because I want to emphasize the fact that the log price and return time series are scale dependent. In the analysis below, I’m going to think about running the analysis at the daily horizon so that $\Delta t = 1{\scriptstyle \mathrm{day}}$ .

To make this problem concrete, I use a particular numerical example:

(3) $\begin{align*} \log p_t &= \frac{95}{100} \cdot \left( \frac{1}{7} \cdot \sum_{h=1}^7 \log p_{(t - h){\scriptscriptstyle \mathrm{days}}} \right) - \frac{95}{100} \cdot \left( \frac{1}{30} \cdot \sum_{h=1}^{30} \log p_{(t - h){\scriptscriptstyle \mathrm{days}}} \right) + \frac{1}{10} \cdot \varepsilon_t \end{align*}$

I plot a year’s worth of daily data from this process (err… I am using calendar time rather than market time… so think $365$ not $252$ ) in the plot above. This process says that the log price today will go up by $95$ cents whenever the average log price over the last week was $1$ unit higher, but it will go down by $95$ cents whenever the average log price over the last month ago was $1$ unit higher. It’s a nice example to work with because there is an obvious pattern in log prices with a period of just south of $2$ months.

3. Autoregressive Representation

The most common way of accounting for time series predictability in asset pricing is to use an autoregression. e.g., you might run a regression of the log price level today on the log price level yesterday, on the log price level the day before, on the log price level the day before that, and so on…

(4) $\begin{align*} \log p_t &= \sum_{h=1}^H \beta_h \cdot \log p_{t-h} + \xi_{h,t} \end{align*}$

Note that because of the wide-sense stationarity of the log price process the regression coefficients simplify to just the horizon-specific autocorrelation:

(5) $\begin{align*} \beta_h &= \frac{\mathrm{C}(h) \cdot \mathrm{StD}[\log p_t] \cdot \mathrm{StD}[\log p_{t-h}]}{\mathrm{Var}[\log p_{t-h}]} \quad \text{and} \quad \mathrm{StD}[\log p_t] = \mathrm{StD}[\log p_{t-h}] \end{align*}$

Why might this approach make sense? First, for the process described in Equation (3), it’s obvious that the log price time series has an autoregressive representation since I constructed it that way. Second and more generally, this approach will hold due to Wold’s Theorem which states that every covariance stationary time series $x_t$ can be written as the sum of $2$ time series with the first time series completely deterministic and the second completely random:

(6) $\begin{align*} x_t &= \eta_t + \sum_{h=0}^{\infty} \mathrm{C}(h) \cdot \epsilon_{t-h}, \quad \sum_{h=1}^{\infty} |\mathrm{C}(h)|^2 < \infty \text{ and } \mathrm{C}(0) = 1 \end{align*}$

Here, $\eta_t$ is the completely deterministic time series and $\epsilon_t$ is the completely random white noise time series. The figure below shows the coefficient estimates, $\widehat{\mathrm{C}(h)}$ , from projecting the log price time series onto its past realizations for lags of anywhere from $h = 1{\scriptstyle \mathrm{day}}$ to $h = 3{\scriptstyle \mathrm{months}}$ .

4. Waveform Representation

Fun fact: There is also a waveform representation of the same autocorrelation function:

(7) $\begin{align*} \mathrm{C}(h) &= \int_{f \geq 0} \mathrm{S}(f) \cdot e^{i \cdot f \cdot h \cdot \Delta t} \cdot df \end{align*}$

This representation will always exist whenever the data have translational symmetry. i.e., put yourself in the role a trader thinking about buying a share of the S&P 500 SPDR again. If you had to make a prediction about tomorrow’s price level as a function of the log price level today, its values $1$ week ago, and its value $1$ month ago, you wouldn’t really care whether the current year was 1967, 1984, 1999, or 2013. This is just another way of saying that the autocorrelation coefficients only depend on the time gap.

Where does this alternative representation come from? Why translational symmetry? Plane waves turn out to be the eigenfunctions of the translation operator, $\mathrm{T}_\theta[\cdot]$ :

(8) $\begin{align*} \mathrm{T}_\theta[\mathrm{C}(h)] &= \mathrm{C}(h - \theta) \end{align*}$

In the context of this note, the translation operator eats autocorrelation functions and returns the value $\theta$ time periods to the right. i.e., if $\mathrm{C}(4{\scriptstyle \mathrm{days}})$ gave you the autocorrelation between the log price at any two points in time that are $4$ days apart, then $\mathrm{T}_{1{\scriptscriptstyle \mathrm{day}}}[\mathrm{C}(4{\scriptstyle \mathrm{days}})]$ would give you the autocorrelation between the log price at any two points in time that are $3$ days apart. Note that the translation operator is linear since translating the sum of functions is the same as the sum of translated functions. Thus, just as if $\mathrm{T}_\theta[\cdot]$ was a matrix, we can ask for the eigenfunctions of $\mathrm{T}_\theta[\cdot]$ written as $\mathrm{C}_{f}$ :

(9) $\begin{align*} \mathrm{T}_\theta[\mathrm{C}_f(h)] &= \mathrm{C}_f(h - \theta) = \lambda_{f,\theta} \cdot \mathrm{C}_f(h) \end{align*}$

Such a process is obviously given by the complex plane waves with $\mathrm{C}_f(h) = e^{i \cdot f \cdot h \cdot \Delta t}$ and $\lambda_{f,\theta} = e^{-i \cdot f \cdot \theta \cdot \Delta t}$ .

As a result, we can think about recovering all the information in the autocorrelation function at horizon $h$ by projecting it onto the eigenfunctions $\{ \mathrm{C}_f(h) \}_{f \geq 0}$ as depicted in the figure above known as a spectral density plot. This figure shows the results of $100$ regressions at frequencies in the range $[1/100{\scriptstyle \mathrm{days}},1/3{\scriptstyle \mathrm{days}}]$ :

(10) $\begin{align*} \log p_t &= \hat{a}_f \cdot \sin(f \cdot t) + \hat{b}_f \cdot \cos(f \cdot t) + \xi_{f,t} \end{align*}$

Roughly speaking, the coefficients $\hat{a}_f$ and $\hat{b}_f$ capture how predictive fluctuations at the frequency $f$ in units of $1/\mathrm{days}$ are of future log price movements. Thus, the summary statistic $(\hat{a}_f^2 + \hat{b}_f^2)/2$ captures how much of the variation in log prices is explained by historical movements at the frequency $f$ . This statistic is known as the power of the log price series at a particular frequency.

The Wiener-Khintchine Theorem formally links these two different ways of looking at the same autocorrelation information:

(11) $\begin{align*} \mathrm{C}(h) &= \sum_{f \geq 0} \mathrm{S}(f) \cdot e^{i \cdot f \cdot h \cdot \Delta t} \cdot \Delta f \quad \text{and} \quad \mathrm{S}(f) = \sum_{h \geq} \mathrm{C}(h) \cdot e^{- i \cdot f \cdot h \cdot \Delta t} \cdot \Delta h \end{align*}$

Using Euler’s formula that $e^{i \cdot x} = \cos(x) + i \cdot \sin(x)$ and keeping only the real component yields the following mapping from frequency space to autocorrelation space:

(12) $\begin{align*} \mathrm{C}_{\mathrm{WK}}(h) &= \sum_{f=0}^F \left( \frac{\widehat{\mathrm{S}(f)}}{\sum_{f'=0}^F \widehat{S(f')}} \right) \cdot \cos(f \cdot h \cdot \Delta t) \end{align*}$

where I assume that the range $[0,F]$ covers a sufficient amount of the relevant frequency spectrum. The figure below verifies the mathematics by showing the close empirical fit between the two calculations.

5. A Counter Example

After giving some tools to mine relevant time scales from financial time series in the previous sections, I conclude by giving an example of a simple stochastic process which thumbs its nose at these tools. Before actually looking at the example, it’s worthwhile to stop for a moment to think about the sort of process which might be hard to handle. You can see glimpses of it in the analysis above. Specifically, note how even though I created the time series in Equation (3) using a $7$ day moving average and a $30$ day moving average, there is no evidence of these $2$ time horizons in the sample autocorrelation coefficients. It’s not as if the figure shows a coefficient of:

(13) $\begin{align*} \widehat{\mathrm{C}(h)} &= \frac{95}{100} \cdot \left( \frac{1}{7} - \frac{1}{30} \right) \end{align*}$

for all lags $h \leq 7{\scriptstyle \mathrm{days}}$ . Likewise, the spectral density of the process shows a peak at somewhere between $1/7{\scriptstyle \mathrm{days}}$ and $1/30{\scriptstyle \mathrm{days}}$ . Thus, the time scale we see in the raw data is an emergent feature of the interaction of both the weekly and monthly effects. Intuitively, it would be very hard to identify the economically relevant time scale from a stochastic process where interesting features emerge at all time scales.

An Ornstein and Uhlenbeck gave an example of just such a stochastic process. Take a look at the figure above which plots the following Ornstein-Uhlenbeck (OU) process:

(14) $\begin{align*} d \log p_t &= \theta \cdot \left( \mu - \log p_t \right) \cdot dt + \sigma \cdot d\xi_t, \quad \text{with} \quad \mu = 0, \ \theta = 1 - e^{- \log 2 / 15}, \ \sigma = 1/10 \end{align*}$

With $dt = 1$ day, the equation above reads: “Daily changes in the log price are $0$ on average. However, the log price realizes daily kicks on the order of $1/10$ th of a percent, and these kicks have a half life of $15$ days.” Thus, it’s natural to think about this OU process as having a relevant time scale on the order of $1$ month, and you can see this time scale in the sample log price path. The peaks and troughs in the green line all last somewhere around $1$ month.

Here’s the punchline. Even though the process was explicitly constructed to have a relevant monthly time scale, there is no obvious bump at the monthly horizon in either the autoregressive representation or the waveform representation. In fact, OU processes are well known to produce $1/f$ noise—i.e., noise which follows a power law decay pattern as shown in the figure below. Kicks which have a half life on the order of $30$ days lead to emergent behavior at all time scales!

6. Uniqueness of Approximations

Of course, there is a mapping between the precise rate of decay in the figure below at the relevant time scale, but this is besides the point. You would have to know the exact stochastic process to know to reverse engineer the mapping. What’s more, this problem isn’t an issue that will be solved with more advanced filtering techniques such as wavelets. It’s not that the filtering technology is too coarse to capture the real structure. It’s that the real time scale structure created by the OU process itself is incredibly smooth. If you see a price process whose power spectrum mirrors that of an OU process with $1/f$ decay, you can’t be sure if its an OU process with a monthly time scale as above or a process economic decisions being made at each horizon.

This result has to do with the fact that even very well behaved approximations are only unique in a very narrow sense. What do I mean by this? Well, consider asymptotic approximations where the approximation error is smaller than the last term at each level of approximation. i.e., the approximation:

(15) $\begin{align*} f(\epsilon) &\sim \sum_{n=0}^N a_n \cdot f_n(\epsilon) \end{align*}$

is asymptotic to $f(\epsilon)$ as $\epsilon \to 0$ if for each $M \leq N$ :

(16) $\begin{align*} \frac{f(\epsilon) - \sum_{n=1}^N f_n(\epsilon)}{f_M(\epsilon)} &\to 0 \quad \text{as} \quad \epsilon \to 0 \end{align*}$

Asymptotic approximations are well behaved in the sense that you can naively add, subtract, multiply, divide, etc\ldots them just like they were numbers. What’s more, for a given choice of $\{ f_n \}_{n \geq 0}$ , all of the coefficients $\{ a_n \}_{n \geq 0}$ are unique.

At first this uniqueness result looks really promising! However, on closer inspection it’s clear that the result is rather finicky. e.g., the same function can have different asymptotic approximations:

(17) $\begin{align*} \text{as } \epsilon \to 0, \quad \tan(\epsilon) &\sim \epsilon + \frac{1}{3} \cdot \epsilon^3 + \frac{2}{15} \cdot \epsilon^5 \\ &\sim \sin(\epsilon) + \frac{1}{2} \cdot \sin(\epsilon)^3 + \frac{3}{8} \cdot \sin(\epsilon)^5 \\ &\sim \epsilon \cdot \cosh\left( \epsilon \cdot \sqrt{2/3} \right) + \frac{31}{270} \cdot \left( \epsilon \cdot \cosh\left( \epsilon \cdot \sqrt{2/3} \right) \right)^5 \end{align*}$

What’s more, different functions can have the same asymptotic approximations:

(18) $\begin{align*} e^{\epsilon} &\sim \sum_{n=0}^{\infty} \frac{\epsilon^n}{n!} \quad \text{as} \quad \epsilon \to \infty \\ e^{\epsilon} + e^{-1/\epsilon} &\sim \sum_{n=0}^{\infty} \frac{\epsilon^n}{n!} \quad \text{as} \quad \epsilon \searrow \infty \end{align*}$

What’s really interesting about this last example is that these $2$ functions have asymptotic approximations that share an infinite number of terms!

To close the loop, consider these approximation results in the context of the econometric analysis above. What I was doing in these exercises was picking a collection of $\{f_n\}_{n \geq 0}$ and then empirically estimating $\{a_n\}_{n \geq 0}$ . For each choice of approximations, I got a unique set of coefficients out. However, the counter example above in Section 5 shows that data generating functions with very different time scales can have very similar approximations. The analysis in this section shows that perhaps this result is not too surprising. A different way of putting this idea is that by choosing an approximation to data generating process, $f(\epsilon)$ , you are factoring the economic content of the series into $2$ different component: $\{a_n\}_{n \geq 0}$ and $\{f_n\}_{n \geq 0}$ . If you take a stand on the $\{f_n\}_{n \geq 0}$ terms, the corresponding $\{a_n\}_{n \geq 0}$ will certainly be unique; however, there is no guarantee that these coefficients carry all of the economic information that you want to recover from the data. e.g., the relevant time scale information might be buried in the $\{f_n\}_{n \geq 0}$ series rather than the coefficients $\{a_n\}_{n \geq 0}$ .

Sacrificing Noise Traders

July 1, 2013 by Alex

1. Introduction

One way to look at the stock market is as an information aggregation technology. For instance, imagine that you are the CEO of a pencil making company, and have to decide whether or not to stick with making old-fashioned wood pencils or to switch over to making mechanical pencils. If equity shares in lumber companies are publicly traded, you can pop open your laptop and look at their valuation online. Suppose you see that all lumber companies have low valuations and few other customers. In this world, you should really consider updating your product line. Note that it would be much harder to make this inference if all lumber company equity was privately held. No private equity shop is going to answer the phone and tell you that one of their investments is in the toilet. What’s more, (1) more analysts and (2) better informed analysts will study each lumber company’s business operations if there is publicly traded equity and no one has to know who these better informed analysts are ahead of time either. As I think Kevin Costner once said: “If there’s profits, they will come.”

The big question, though, is: Where do these profits come from? What entices informed traders to enter the market and push their information into prices? Asset pricing models such as Grossman and Stiglitz (1980) and Kyle (1985) give us the answer. Informed traders’ profits come directly from the stupidity of noise traders. These profits are transfer payments from the pockets of one group of citizens to the pockets of another. For a social planner, having prices that tell people about the fundamental values of important companies is a good thing. However, noise traders are people too, and sacrificing too many of them to get accurate prices is bad.

In this post, I use a simple, one period, Kyle (1985)-type model to ask the question: How many noise traders do you need to throw to the dogs in order to get accurate prices? Specifically, I think about a world with an asset that pays out $v \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma_v^2)$ and has price $p$ . If you are the social planner, you then try to maximize the benefits of having informative prices, $\mathrm{Cov}[p,v]$ , minus the costs of wasting lots of noise traders who could be doing other productive things, $c(\text{noise traders})$ , subject to the constraint that it has to be worth it for informed traders to enter into the market, $\mathrm{E}[\text{informed trader profit}] \geq \bar{\pi}$ :

(1) $\begin{align*} \max_{\text{noise traders} \geq 0} \left\{ \mathrm{Cov}[p,v] - c(\text{noise traders}) \right\} \quad &\text{subject to} \quad \mathrm{E}[\text{informed trader profit}] \geq \bar{\pi} \end{align*}$

The crazy thing about setting the problem up this way is that the number of noise traders in the market doesn’t affect how informative the prices are:

(2) $\begin{align*} \mathrm{Cov}[p,v] &= \text{constant} \times \sigma_v^2 \end{align*}$

Put differently, as the social planner, you need to sacrifice enough noise traders so that informed traders actually like being traders and won’t change careers. If there aren’t enough noise traders, informed traders won’t make their reservation wage, $\bar{\pi}$ , and will switch over to being butchers or bakers. However, pumping more and more noise traders into the market won’t make prices any more informative.

2. Economic Model

How does the model work? Imagine that Alice decides to be an informed trader rather than a butcher. For all of her time studying the markets, she get rewarded with a signal, $s$ , about the fundamental value of a lumber company, Logs Inc:

(3) $\begin{align*} s &= v + \epsilon \quad \text{where} \quad \epsilon \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma_\epsilon^2) \end{align*}$

Since everything in the model is nice and normally distributed, her posterior beliefs about the fundamental value of Logs Inc will be normally distributed with variance $\mathrm{Var}[v|s] = (1/\sigma_v^2 + 1/\sigma_\epsilon^2)^{-1}$ and mean $\mathrm{E}[v|s] = \mathrm{SNR} \cdot s$ where $\mathrm{SNR}$ denotes Alice’s signal to noise ratio:

(4) $\begin{align*} \mathrm{SNR} &= \frac{\sigma_v^2}{\sigma_v^2 + \sigma_\epsilon^2} \end{align*}$

For instance, if Alice just saw the fundamental value, $v$ , directly then $\sigma_\epsilon = 0$ and here signal to noise ratio would be $\mathrm{SNR} = 1$ . Conversely, as $\sigma_\epsilon \to \infty$ her signal becomes meaningless and her signal to noise ratio tends to $\mathrm{SNR} \to 0$ .

There is a competitive market maker for Logs Inc stock, Bob, who observes aggregate demand, $y$ , and sets the price equal to his conditional expectation of its fundamental value:

(5) $\begin{align*} y &= x + z \quad \text{where} \quad z \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma_z^2) \\ p &= \mathrm{E}[v|y] \end{align*}$

Here $x$ denotes Alice’s informed demand for Logs Inc stock in units of shares and $z$ denotes noise trader demand for Logs Inc stock. When I say that Bob only observes aggregate demand, I mean that if Bob sees a buy order of $10$ shares, he has no idea whether (a) Alice wants to buy $20$ shares and the noise traders want to sell $10$ shares or (b) Alice wants to sell $10$ shares and the noise traders want to buy $20$ shares. The assumption of perfect competition for Bob means that he has to set the price equal to his conditional expectation. If he tried to hedge his bets and deviate from $p = \mathrm{E}[v|y]$ , someone else would step in and scoop his business.

If Alice is trying to maximize her expected profit, $\pi_x$ :

(6) $\begin{align*} \pi_x &= (v - p) \cdot x \end{align*}$

then an equilibrium would be a choice of demand for Alice, $x = \beta \cdot s$ , and a pricing rule for Bob, $p = \lambda \cdot y$ , such that:

Given Bob’s pricing rule, $p = \lambda \cdot y$ , Alices demand maximizes her expected profit.
Given Alice’s demand rule, $x = \beta \cdot s$ , Bob’s price equals his conditional expectation of Logs Inc’s fundamental value.

Taking Bob’s pricing rule as given and maximizing Alice’s expected profit, $\mathrm{E}[\pi_x|v]$ , yields an equation for her optimal demand given her realized signal, $s$ :

(7) $\begin{align*} x &= \left( \frac{\mathrm{SNR}}{2 \cdot \lambda} \right) \cdot s \end{align*}$

Substituting this demand rule into Bob’s conditional expectation then characterizes the equilibrium parameters $\beta$ and $\lambda$ that govern Alice and Bob’s demand and pricing rules:

(8) $\begin{align*} \lambda = \frac{\mathrm{Cov}[v,y]}{\mathrm{Var}[y]} &= \frac{\sqrt{\mathrm{SNR}}}{2} \cdot \frac{\sigma_v}{\sigma_z} \\ \beta = \frac{\mathrm{SNR}}{2 \cdot \lambda} &= \sqrt{\mathrm{SNR}} \cdot \frac{\sigma_z}{\sigma_v} \end{align*}$

In words, $\lambda \propto \sigma_v/\sigma_z$ means that the price of Logs Inc stock will be more responsive to aggregate demand shocks when there is more information to be revealed or when there are few noise traders to mask the information. Conversely, Alice’s demand will be more responsive to a strong private signal when there is more noise trading for her to hid behind or when she wasn’t expecting to discover much in the first place.

3. Unconditional Moments

With the model in place, we can now get to the interesting part of the analysis. Namely, the number of noise traders in the market doesn’t affect how informative prices are. To see this, note that prices have the following functional form:

(9) $\begin{align*} p &= \frac{\mathrm{SNR}}{2} \cdot (v+\epsilon) + \frac{\sqrt{\mathrm{SNR}}}{2} \cdot \frac{\sigma_v}{\sigma_z} \cdot z \end{align*}$

The key piece of this equation is that the coefficient on the fundamental value of Logs Inc, $\mathrm{SNR}/2$ , doesn’t have any dependence on the number of noise traders in the market. i.e., if the fundamental value of Logs Inc goes up by $\mathdollar 1$ , then the price of Logs Inc will go up by $\mathdollar \mathrm{SNR}/2$ on average, and this relationship will be true whether the volatility of noise trader demand is $1{\scriptstyle \mathrm{mil}}$ shares per period or $1$ share per period. More precisely, we have that the covariance of Logs Inc’s price with its fundamental value is:

(10) $\begin{align*} \mathrm{Cov}[p,v] &= \frac{\mathrm{SNR}}{2} \cdot \sigma_v^2 \end{align*}$

No matter how much noise trader demand there is, the price is always equally informative about the fundamental value. This is a very strong prediction!

4. Planner’s Problem

OK. So, we’ve written down a really simple model, and this model says that the number of noise traders doesn’t impact how informative prices are about asset fundamentals. What does this say about the original question? How should you, the social planner, decide the number of noise traders to sacrifice so that everyone in the economy can use the resulting price signals?

Well, the first thing to see is that all of Alice’s profits from being an informed trader rather than a butcher come at the expense of noise traders:

(11) $\begin{align*} \mathrm{E}[\pi_x] &= - \mathrm{E}[\pi_z] = \frac{\sqrt{\mathrm{SNR}}}{2} \cdot \sigma_z \cdot \sigma_v \end{align*}$

Essentially, the rest of the economy is subsidizing the financial market by an amount $\sqrt{\mathrm{SNR}} \cdot \sigma_z \cdot \sigma_v/2$ . It might be worth it if it’s really helpful for everyone in the economy to see an accurate valuation of Logs Inc. There are lots of transfer payments which people are happy to make (e.g., welfare, social security, etc…), but the key observation is that it’s a transfer payment. What’s more, since adding more noise traders doesn’t affect price informativeness, you are going to want to sacrifice the the minimum number of noise trader required to make sure that Alice is a trader and not a butcher:

(12) $\begin{align*} \frac{\sqrt{\mathrm{SNR}}}{2} \cdot \sigma_z \cdot \sigma_v &\geq \bar{\pi} \end{align*}$

To get an answer in numbers of noise traders, suppose that each noise trader that you anoint contributes demand variance, $\hbar$ , so that total noise trader demand variance is given by $\sigma_z^2 = N \cdot \hbar$ .

In this world, you need to sacrifice $N$ noise traders to make sure Alice becomes an informed trader:

(13) $\begin{align*} N &= 4 \cdot \frac{\bar{\pi}^2}{\hbar} \cdot \left( \frac{\sigma_v^2 + \sigma_\epsilon^2}{\sigma_v^4} \right) \end{align*}$

This equation says that you need to sacrifice (a) more noise traders when Alice can make more money being a butcher, (b) fewer noise traders when each of them is willing to trade more wildly, (c) fewer noise traders when there is more information about Logs Inc to be discovered, and (d) more noise traders when Alice’s signal about Logs Inc is more noisy.

5. Conclusion

Stock prices are useful signals that we pay for with noise trader demand. This post then used a Kyle (1985)-type model to answer a simple question: As a social planner, how many noise traders should you sacrifice? The interesting fact that pops out of this model is that noise trader demand volatility doesn’t affect price informativeness. It only affects informed trader profits. So as a social planner, you want to have just enough noise trader demand volatility in the market to get Alice to figure out the value of Logs Inc.

A natural question to conclude with is: How general is this result? Surely there are times when noise trader demand shocks affect price informativeness in the real world. In the model, these $2$ aspects of the economy are completely divorced due to the fact that the equilibrium price impact coefficient, $\lambda$ , and the equilibrium demand coefficient, $\beta$ , in some sense undo one another:

(14) $\begin{align*} \lambda \times \beta &= \text{constant} \end{align*}$

Put differently, and increase in noise trader demand will make Alice trade more aggressively since it will be harder for Bob to figure out whether or not changes in aggregate demand are due to Alice or the noise traders. However, as a result, Bob will respond by moving the price around less in response to equal sized changes in aggregate demand.

To see how delicate this canceling out actually is, imagine the Bob has beliefs about the volatility of the underlying asset that are off by $(100 \times \eta)\%$ . e.g., if $\eta = 0.05$ then he would believe that fundamental volatility was $\mathdollar 1.05$ when it was in fact $\mathdollar 1.00$ . When Alice gets a really strong signal abou the fundamental value, $\mathrm{SNR} = 1$ , this canceling out seems to be quite robust:

(15) $\begin{align*} \mathrm{Cov}[p,v] &= \frac{1}{2} \cdot \sigma_v^2 \cdot \left\{ 1 + \eta + \eta^2 + \cdots \right\} \end{align*}$

Small errors in Bob’s beliefs decay pretty quickly. However, when $\mathrm{SNR} \searrow 0$ , problems can occur and the delicate balance between $\lambda$ and $\beta$ can break down:

(16) $\begin{align*} \mathrm{Cov}[p,v] &= \frac{\mathrm{SNR}}{2} \cdot \sigma_v^2 \cdot \left\{ 1 + \eta \cdot (2 - \mathrm{SNR}) + \cdots \right\} \end{align*}$

In percentage terms, small errors in Bob’s beliefs could really add up in situations where the $\mathrm{SNR}$ is low. Since $\mathrm{SNR} \leq 1$ , the factor multiplying $\eta$ is greater than unity. Thus, when Alice gets really weak signals about the fundamental value of Logs Inc, minor errors in Bob’s understanding of the market can lead to wildly incorrect pricing.

« Previous Page