Research Notebook

Constraining Effort Not Bandwidth

October 3, 2013 by Alex

1. Introduction

Imagine trying to fill up a 5-gallon bucket using a hand-powered water pump. You might end up with a half-full bucket after 1 hour of work for either of 2 reasons. First, the spigot might be too narrow. i.e., even though you are doing enough work to pull 5 gallons of water out of the ground each hour, only 2.5 gallons can actually flow through the spigot during the allotted time. This is a bandwidth constraint. Alternatively, the pump handle might be too short. i.e., you have to crank the handle twice as many times to pull each gallon of water out of the ground. This is an effort constraint.

[Figure: a hand-powered water pump]

Existing information-based asset pricing models à la Grossman and Stiglitz (1980) restrict the bandwidth of arbitrageurs’ flow of information. The market just produces too much information per trading period. i.e., people’s minds have narrow spigots. However, traders also face restrictions on how much work they can do each period. Sometimes it’s hard to crank the pump handle often enough to produce the relevant information. e.g., think of a binary signal such as knowing whether a cancer drug has completed a phase of testing, as in Huberman and Regev (2001). It doesn’t take much bandwidth to convey this news. After all, people immediately recognized its significance when the New York Times wrote about it. Yet, arbitrageurs must have faced a restriction on the number of scientific articles they could read each year since Nature reported this exact same news 5 months earlier and no one batted an eye! These traders left money on the table because they anticipated having to pump the handle too many times in order to uncover a really simple signal.

This post proposes algorithmic flop counts as a way of quantifying how much effort traders have to exert in order to uncover profitable information. I illustrate the idea via a short example and some computations.

2. Effort via Flop Counts

One way of quantifying how much effort it takes to discover a piece of information is to count the number of floating-point operations (i.e., flops) that a computer has to perform to compute the associated number. I take my discussion of flop counts primarily from Boyd and Vandenberghe and define a flop as an addition, subtraction, multiplication, or division of 2 floating-point numbers. e.g., I use flops as a unit of effort so that:

(1)   \begin{align*} \mathrm{Effort}\left[ \ 2.374 \times 17.392 \ \right] &= 1{\scriptstyle \mathrm{flop}} \end{align*}

in the same way that the cost of a Snickers bar might be \mathdollar 2. I then count the total number of flops needed to calculate a number as a proxy for the effort needed to find out the associated piece of information. e.g., if it took N^2 flops to compute the average return of all technology stocks but N^3 flops to arrive at the median return on assets for all value stocks, then I would say that it is easier (i.e., took less effort) to know the mean return. The key thing here is that this measure is independent of the amount of entropy that either of these 2 calculations resolves.

I write flop counts as a polynomial function of the dimensions of the matrices and vectors involved. Moreover, I always simplify the expression by removing all but the highest order terms. e.g., suppose that an algorithm required:

(2)   \begin{align*} \left\{ M^7 + 7 \cdot M^4 \cdot N + M^2 \cdot N + 2 \cdot M \cdot N^6 + 5 \cdot M \cdot N^2 \right\}{\scriptstyle \mathrm{flops}} \end{align*}

In this case, I would write the flop count as:

(3)   \begin{align*} \left\{M^7 + 2 \cdot M \cdot N^6\right\}{\scriptstyle \mathrm{flops}} \end{align*}

since both these terms are of order 7. Finally, if I also know that N \gg M, I might further simplify to 2 \cdot M \cdot N^6 flops. Below, I am going to be thinking about high-dimensional matrices and vectors (i.e., where M and N are big), so these simplifications are sensible.
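To see the simplification mechanically, here is a minimal Python sketch (using sympy, which is my choice of tool rather than anything mentioned above) that keeps only the highest-order terms of the flop-count polynomial in Equation (2):

```python
import sympy as sp

M, N = sp.symbols("M N", positive=True)
flops = M**7 + 7*M**4*N + M**2*N + 2*M*N**6 + 5*M*N**2

# Keep only the terms of highest total degree (here, order 7).
poly = sp.Poly(flops, M, N)
max_deg = max(sum(m) for m in poly.monoms())
leading = sum(c * M**a * N**b
              for (a, b), c in zip(poly.monoms(), poly.coeffs())
              if a + b == max_deg)

print(leading)  # M**7 + 2*M*N**6, matching Equation (3)
```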

Let’s look at a couple of examples to fix ideas. First, consider the task of matrix-vector multiplication. i.e., suppose that there is a matrix \mathbf{X} \in \mathrm{R}^{M \times N} and we want to calculate:

(4)   \begin{align*} \mathbf{y} &= \mathbf{X}{\boldsymbol \beta} \end{align*}

where we know both \mathbf{X} and {\boldsymbol \beta} and want to figure out \mathbf{y}. This task takes an effort of 2 \cdot M \cdot N flops. There are M elements in the vector \mathbf{y}, and computing each one of these elements requires N multiplications and roughly N additions:

(5)   \begin{align*} y_m &= \sum_n x_{m,n} \cdot \beta_n \end{align*}

This setup is analogous to \mathbf{X} being a dataset with M observations on N different variables where each variable has a linear effect \beta_n on the outcome variable y.
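To make the 2 \cdot M \cdot N count concrete, here is a minimal Python sketch of a naive matrix-vector multiply (not how a production library would implement it) that tallies one multiplication and one addition per term:

```python
import numpy as np

def matvec_with_flops(X, beta):
    """Naive y = X @ beta that also reports a ~2*M*N flop count."""
    M, N = X.shape
    y = np.zeros(M)
    flops = 0
    for m in range(M):
        for n in range(N):
            y[m] += X[m, n] * beta[n]  # one multiply plus one add
            flops += 2
    return y, flops

X = np.arange(12, dtype=float).reshape(3, 4)
beta = np.ones(4)
y, flops = matvec_with_flops(X, beta)

print(flops)                     # 2 * 3 * 4 = 24
print(np.allclose(y, X @ beta))  # True
```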

Next, let’s turn the tables and look at the case when we know the outcome variable \mathbf{y} and want to solve for {\boldsymbol \beta} when \mathbf{X} \in \mathrm{R}^{N \times N}. A standard approach here would be to use the factor-solve method whereby we first factor the data matrix into the product of K components, \mathbf{X} = \mathbf{X}_1 \mathbf{X}_2 \cdots \mathbf{X}_K, and then use these components to iteratively compute {\boldsymbol \beta} = \mathbf{X}^{-1}\mathbf{y} as:

(6)   \begin{align*} {\boldsymbol \beta}_1 &= \mathbf{X}_1^{-1}\mathbf{y} \\ {\boldsymbol \beta}_2 &= \mathbf{X}_2^{-1}{\boldsymbol \beta}_1 = \mathbf{X}_2^{-1}\mathbf{X}_1^{-1}\mathbf{y} \\ &\vdots \\ {\boldsymbol \beta} = {\boldsymbol \beta}_K &= \mathbf{X}_K^{-1}{\boldsymbol \beta}_{K-1} = \mathbf{X}_K^{-1}\mathbf{X}_{K-1}^{-1} \cdots \mathbf{X}_1^{-1}\mathbf{y} \end{align*}

We call the process of computing the factors \{\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_K\} the factorization step and the process of solving the equations {\boldsymbol \beta}_{k} = \mathbf{X}_{k}^{-1}{\boldsymbol \beta}_{k-1} the solve step. The total flop count of a solution strategy is then the sum of the flop counts for each of these steps. In many cases the cost of the factorization step is the leading order term.
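The economics of the factor-solve split are easy to see in code: you pay the expensive factorization cost once and then reuse the factors across right-hand sides. A minimal sketch using scipy’s LU routines (my choice of factorization; the text does not commit to a particular one):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
N = 500
X = rng.standard_normal((N, N))

# Factorization step: pay the O(N^3) cost once.
lu, piv = lu_factor(X)

# Solve step: each right-hand side reuses the factors at O(N^2) cost.
for _ in range(5):
    y = rng.standard_normal(N)
    beta = lu_solve((lu, piv), y)
    assert np.allclose(X @ beta, y)
```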

e.g., consider the Cholesky factorization method that is commonly used in statistical software. We know that for every symmetric, positive-definite \mathbf{X} \in \mathrm{R}^{N \times N} there exists a factorization:

(7)   \begin{align*} \mathbf{X} = \mathbf{L} \mathbf{L}^{\top} \end{align*}

where \mathbf{L} is lower triangular and non-singular with positive diagonal elements. The cost of computing these Cholesky factors is (\sfrac{1}{3}) \cdot N^3 flops. By contrast, the resulting solve steps of \mathbf{L}{\boldsymbol \beta}_1 = \mathbf{y} and \mathbf{L}^{\top}{\boldsymbol \beta} = {\boldsymbol \beta}_1 each cost N^2 flops, bringing the total flop count to (\sfrac{1}{3}) \cdot N^3 + 2 \cdot N^2 \sim (\sfrac{1}{3}) \cdot N^3 flops. In the general case, the effort involved in solving a linear system of equations \mathbf{y} = \mathbf{X} {\boldsymbol \beta} for {\boldsymbol \beta} when \mathbf{X} \in \mathrm{R}^{N \times N} grows like \mathrm{O}(N^3). Boyd and Vandenberghe argue that “for N more than a thousand or so, generic methods… become less practical,” and financial markets definitely have more than “a thousand or so” trading opportunities to check!
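Here is a minimal Python sketch of the Cholesky factor-solve recipe, assuming a symmetric, positive-definite \mathbf{X} built purely for illustration; the two triangular solves are the N^2-flop steps:

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(0)
N = 500
G = rng.standard_normal((N, N))
X = G @ G.T + N * np.eye(N)   # symmetric positive definite by construction
y = rng.standard_normal(N)

# Factorization step: X = L L^T, roughly (1/3) * N^3 flops.
L = np.linalg.cholesky(X)

# Two solve steps, roughly N^2 flops each.
beta_1 = solve_triangular(L, y, lower=True)        # L beta_1 = y
beta = solve_triangular(L.T, beta_1, lower=False)  # L^T beta = beta_1

print(np.allclose(X @ beta, y))  # True
```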

3. Asset Structure

Consider constraining traders’ cognitive effort in an information-based asset pricing model à la Kyle (1985) but with many assets and attribute-specific shocks. Specifically, suppose that there are N stocks that each have H different payout-relevant characteristics. Every characteristic can take on I distinct levels. I call a (characteristic, level) pairing an ‘attribute’ and use the indicator variable a_n(h,i) to denote whether or not stock n has attribute (h,i). Think about attributes as sitting in an (H \times I)-dimensional matrix, \mathbf{A}, as illustrated in Equation (8) below:

(8)   \begin{equation*}   \mathbf{A}^{\top} = \bordermatrix{     ~      & 1                              & 2                         & \cdots & H                            \cr     1      & \text{Agriculture}             & \text{Albuquerque}        & \cdots & \text{Alcoa Inc}             \cr     2      & \text{Apparel}                 & \textbf{\color{red}Boise} & \cdots & \text{ConocoPhillips}        \cr     3      & \textbf{\color{red}Disk Drive} & \text{Chicago}            & \cdots & \text{Dell Inc} \cr     \vdots & \vdots                         & \vdots                    & \ddots & \vdots \cr     I      & \text{Wholesale}               & \text{Vancouver}          & \cdots & \textbf{\color{red}Xerox Corp} \cr} \end{equation*}

I’ve highlighted the attributes for Micron Technology. e.g., we have that a_{\text{Mcrn}}(\text{City},\text{Boise}) = 1 while a_{\text{WDig}}(\text{City},\text{Boise}) = 0 since Micron Technology is based in Boise, ID while Western Digital is based in SoCal.

Further, suppose that each stock’s value is then the sum of a collection of attribute-specific shocks:

(9)   \begin{align*} v_n &= \sum_{h,i} x(h,i) \cdot a_n(h,i) \end{align*}

where the shocks are distributed according to the rule:

(10)   \begin{align*} x(h,i) &= x^+(h,i) + x^-(h,i) \quad \text{with each} \quad x^\pm(h,i) \overset{\scriptscriptstyle \mathrm{iid}}{\sim}  \begin{cases} \pm \sfrac{\delta}{\sqrt{H}} &\text{ w/ prob } \pi \\ \ \: \, 0 &\text{ w/ prob } (1 - \pi) \end{cases} \end{align*}

Each x(h,i) records whether attribute (h,i) realized a positive shock, a negative shock, or no shock at all. The \sfrac{\delta}{\sqrt{H}} > 0 term represents the amplitude of every shock in units of dollars per share, and the \pi term represents the per-period probability of a positive (or negative) shock to attribute (h,i).
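A quick way to see the payoff structure in Equations (9) and (10) is to simulate it. The sketch below uses toy dimensions and parameter values of my own choosing (H, I, N, \delta, and \pi are all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
H, I, N = 4, 5, 1000          # toy dimensions (hypothetical)
A = H * I                     # total number of attributes
delta, pi = 1.0, 0.05         # toy shock amplitude and probability

# Attribute-specific shocks per Equation (10): x = x^+ + x^-
x_pos = np.where(rng.random(A) < pi,  delta / np.sqrt(H), 0.0)
x_neg = np.where(rng.random(A) < pi, -delta / np.sqrt(H), 0.0)
x = x_pos + x_neg

# Each stock draws one level per characteristic; a_n(h,i) is a 0/1 indicator.
levels = rng.integers(0, I, size=(N, H))
a = np.zeros((N, A))
for h in range(H):
    a[np.arange(N), h * I + levels[:, h]] = 1.0

# Stock values per Equation (9): v_n = sum_{h,i} x(h,i) * a_n(h,i)
v = a @ x
print(v[:5])
```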

If value investors learn asset-specific information and Kyle (1985)-type market makers price each individual stock using only that stock’s own order flow in a dynamic setting, then each individual stock will be priced correctly:

(11)   \begin{align*} \mathrm{E} \left[ \ p_n - v_n \ \middle| \ y_n \ \right] &= 0 \end{align*}

where y_n denotes the aggregate order flow for stock n. Yet, the high dimensionality of the market means that there can still be groups of mispriced stocks:

(12)   \begin{align*} \mathrm{E} \left[ \ \langle p_n \rangle_{h,i} - \langle v_n \rangle_{h,i} \ \middle| \ x(h,i) = \sfrac{\delta}{\sqrt{H}} \ \right] &< 0 \; \text{and} \; \mathrm{E} \left[ \ \langle p_n \rangle_{h,i} - \langle v_n \rangle_{h,i} \ \middle| \ x(h,i) = - \sfrac{\delta}{\sqrt{H}} \ \right] > 0 \end{align*}

where \langle p_n \rangle_{h,i} = \sfrac{I}{N} \cdot \sum_n p_n \cdot a_n(h,i) denotes the sample average price for stocks with a particular attribute, (h,i). This is a case of more is different. If an oracle told you that x(h,i) = \sfrac{\delta}{\sqrt{H}} for some attribute (h,i), then you would know that the average price of stocks with attribute (h,i) would be:

(13)   \begin{align*} \langle p_n \rangle_{h,i} &= \lambda_n \cdot \beta_n \cdot \frac{\delta}{\sqrt{H}} + \mathrm{O}(\sfrac{N}{I})^{-1/2} \end{align*}

since \lambda_n \cdot \beta_n < 1 in a dynamic Kyle (1985) model where informed traders have an incentive to trade less aggressively today (i.e., decrease \beta_n and thus \lambda_n) in order to act on their information again tomorrow. In this setting, \langle p_{n,1} \rangle_{h,i} will be less than its fundamental value \langle v_n \rangle_{h,i} = \sfrac{\delta}{\sqrt{H}} even though it will be easy to see that \langle p_{n,1} \rangle_{h,i} \neq 0 as \sfrac{I}{N} \to 0.
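Here is a companion sketch of the attribute-level average \langle p_n \rangle_{h,i} = \sfrac{I}{N} \cdot \sum_n p_n \cdot a_n(h,i), again with toy dimensions and stand-in prices rather than model-generated ones:

```python
import numpy as np

rng = np.random.default_rng(0)
H, I, N = 4, 5, 1000          # toy dimensions (hypothetical)

# Each stock draws one level per characteristic; a_n(h,i) is a 0/1 indicator.
levels = rng.integers(0, I, size=(N, H))
a = np.zeros((N, H * I))
for h in range(H):
    a[np.arange(N), h * I + levels[:, h]] = 1.0

p = rng.standard_normal(N)    # stand-in prices

# <p>_{h,i} = (I/N) * sum_n p_n * a_n(h,i), one average per attribute (h,i)
p_bar = (I / N) * (a.T @ p)
print(p_bar.shape)            # (H * I,)
```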

4. Arbitrageurs’ Inference Problem

So how much effort does it take arbitrageurs to discover the set of shocked attributes, \mathbf{A}^\star \subseteq \mathbf{A}:

(14)   \begin{align*} \mathbf{A}^\star &= \left\{ \ (h,i) \in H \times I \ \middle| \ x(h,i) \neq 0 \ \right\} \quad \text{where} \quad A^\star = \#\left\{ \ (h,i) \ \middle| \ (h,i) \in \mathbf{A}^\star \ \right\} \end{align*}

given their price impact? What’s stopping arbitrageurs from trading away these attribute-specific pricing errors? Well, the problem of finding the attributes in \mathbf{A}^\star boils down to solving:

(15)   \begin{align*} \underset{\mathbf{y}}{\begin{pmatrix} p_{1,1} \\ p_{2,1} \\ \vdots \\ p_{n,1} \\ \vdots \\ p_{N,1} \end{pmatrix}} &= \underset{\mathbf{X}}{\begin{bmatrix} a_1(1,1) & a_1(1,2) & \cdots & a_1(h,i) & \cdots & a_1(H,I) \\  a_2(1,1) & a_2(1,2) & \cdots & a_2(h,i) & \cdots & a_2(H,I) \\  \vdots   & \vdots   & \ddots & \vdots   & \ddots & \vdots   \\  a_n(1,1) & a_n(1,2) & \cdots & a_n(h,i) & \cdots & a_n(H,I) \\  \vdots   & \vdots   & \ddots & \vdots   & \ddots & \vdots   \\  a_N(1,1) & a_N(1,2) & \cdots & a_N(h,i) & \cdots & a_N(H,I) \end{bmatrix}} \underset{\boldsymbol \beta}{\begin{pmatrix} \langle p_{n,1} \rangle_{1,1} \\ \langle p_{n,1} \rangle_{1,2} \\ \vdots \\ \langle p_{n,1} \rangle_{h,i} \\ \vdots \\ \langle p_{n,1} \rangle_{H,I} \end{pmatrix}} \end{align*}

for {\boldsymbol \beta} where \mathbf{X} \in \mathrm{R}^{N \times A}, A \gg N \geq A^\star, and \mathrm{Rank}[\mathbf{X}] = N. i.e., this is a similar problem to the linear solve in Section 2 above, but with 2 additional complications. First, the system is underdetermined in the sense that there are many more payout-relevant attributes than stocks, A \gg N \geq A^\star. Second, arbitrageurs don’t know exactly how many attributes are in \mathbf{A}^\star. They know that on average, \mathrm{E}[A^\star] = 2 \cdot \pi \cdot (1 - \pi) \cdot A; however, A^\star itself is a random variable.

It’s easy enough to extend the solution strategy in Section 2 to the case of an underdetermined system where a solution {\boldsymbol \beta} is a member of the set:

(16)   \begin{align*} \left\{ \ {\boldsymbol \beta} \ \middle| \ \mathbf{X}{\boldsymbol \beta} = \mathbf{y} \ \right\} &= \left\{ \  \begin{bmatrix} {\boldsymbol \beta}_1 \\ {\boldsymbol \beta}_2 \end{bmatrix} = \begin{bmatrix} {\boldsymbol \beta}_1 \\ 0 \end{bmatrix} + \mathbf{F} \mathbf{b} \ \middle| \ \mathbf{b} \in \mathrm{R}^{A - A^\star} \ \right\} \end{align*}

where \mathbf{F} is a matrix whose column vectors are a basis for the null space of \mathbf{X}. Suppose that \mathbf{X}_1 \subseteq \mathbf{X} is an (A^\star \times A^\star)-dimensional, non-singular submatrix. Then:

(17)   \begin{align*} \mathbf{X} {\boldsymbol \beta} &= \begin{bmatrix} \mathbf{X}_1 & \mathbf{X}_2 \end{bmatrix} \begin{bmatrix} {\boldsymbol \beta}_1 \\ {\boldsymbol \beta}_2 \end{bmatrix} = \mathbf{y} \quad \text{and} \quad {\boldsymbol \beta}_1 = \mathbf{X}_1^{-1} \left( \mathbf{y} - \mathbf{X}_2 {\boldsymbol \beta}_2 \right) \end{align*}

Obviously, setting {\boldsymbol \beta}_2 = 0 gives one solution. The full set of solutions, with \mathbf{F} spanning the null space of \mathbf{X}, is given by:

(18)   \begin{align*} {\boldsymbol \beta} &= \begin{bmatrix} {\boldsymbol \beta}_1 \\ {\boldsymbol \beta}_2 \end{bmatrix} = \begin{bmatrix} \mathbf{X}_1^{-1} \mathbf{y} \\ 0 \end{bmatrix} + \underbrace{\begin{bmatrix} - \mathbf{X}_1^{-1} \mathbf{X}_2 \\ \mathbf{I} \end{bmatrix}}_{\mathbf{F}} \mathbf{b} \end{align*}
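Equations (16)-(18) translate directly into a few lines of numpy. In the sketch below I use an N \times N invertible block \mathbf{X}_1 for concreteness (a simplification relative to the (A^\star \times A^\star) block in the text) and check that every member of the parameterized family solves the system:

```python
import numpy as np

rng = np.random.default_rng(0)
N, A = 4, 7                      # more unknowns than equations
X = rng.standard_normal((N, A))  # assume the leading N x N block is invertible
y = rng.standard_normal(N)

X1, X2 = X[:, :N], X[:, N:]

# Particular solution: beta_2 = 0 and beta_1 = X1^{-1} y
beta_particular = np.concatenate([np.linalg.solve(X1, y), np.zeros(A - N)])

# Null-space basis F = [-X1^{-1} X2; I], so X (beta_particular + F b) = y for any b
F = np.vstack([-np.linalg.solve(X1, X2), np.eye(A - N)])

b = rng.standard_normal(A - N)
print(np.allclose(X @ (beta_particular + F @ b), y))  # True
```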

Thus, if it takes f flops to factor \mathbf{X} into \begin{bmatrix} \mathbf{X}_1 & \mathbf{X}_2 \end{bmatrix} and s flops to solve each linear system of the form \mathbf{X}_1 {\boldsymbol \beta}_1 = \mathbf{y}, then the total cost of parameterizing all the solutions is:

(19)   \begin{align*} \{ f + s \cdot (A - A^\star + 1)\}{\scriptstyle \mathrm{flops}} \end{align*}

[Figure: Deep Thought vs. a trading floor]

Via the LU factorization method, I know that the factorization step will cost roughly:

(20)   \begin{align*} f &= \left\{ (\sfrac{2}{3}) \cdot (A^\star)^3 + (A^\star)^2 \cdot (A - A^\star) \right\}{\scriptstyle \mathrm{flops}} \end{align*}

Moreover, we know from Section 2 that the cost of the solve step will be on the order of s = \mathrm{O}((A^\star)^3). However, there is still one detail left to consider: arbitrageurs don’t know A^\star. They have to solve for both {\boldsymbol \beta} and A^\star by starting at A^\star_0 = \mathrm{E}[A^\star] and iterating on the above process until the columns of \mathbf{F} actually represent a basis for the null space of \mathbf{X}. Thus, the total effort needed is:

(21)   \begin{align*} \left\{ (\sfrac{1}{3}) \cdot (A^\star)^3 \cdot (A - A^\star) \right\}^{\sfrac{1}{\gamma}} {\scriptstyle \mathrm{flops}} \end{align*}

where \gamma \in (0,1) is the convergence rate and the calculation is dominated by the effort spent searching through the null space to be sure that A^\star is correct. More broadly, this step is just one way of capturing the deeper idea that knowing where to look is hard. e.g., Warren Buffett says that he “can say no in 10 seconds or so to 90{\scriptstyle \%} or more of all the [investment opportunities] that come along.” This is great… until you consider how many investment opportunities Buffett runs across every single day. Saying no in 10 seconds flat then turns out to be quite a chore! As the figure above highlights, this is also why traders use personalized multi-monitor computing setups that make it easy to spot patterns rather than a shared supercomputer with minimal output.

5. Clock Time Interpretation

Is \mathrm{O}((A^\star)^3 \cdot (A - A^\star))^{\sfrac{1}{\gamma}} flops a big number? Is it a small number? Flop counts were originally used when floating-point operations were the main computing bottleneck. Now factors relating to how data are stored, such as cache boundaries and locality of reference, have first-order effects on computation time as well. Nevertheless, flop counts can still give a good back-of-the-envelope estimate of the relative amount of time it would take to execute a procedure, and such a calculation helps interpret the unit of measurement “flops” on a human scale. e.g., on one hand, arbitrageur effort would be a silly constraint to worry about if the time it took to execute real-world calculations were infinitesimally small. On the other hand, flops might be a poor unit of measure for arbitrageurs’ effort if the time it took to carry out reasonable calculations were on the order of millennia since arbitrageurs clearly don’t wait this long to act! Doing a quick computation can allay both of these fears.

Suppose that computers can execute roughly 1{\scriptstyle \mathrm{mil}} operations per second. Millions of instructions per second (i.e., \mathrm{MIPS}) is a standard unit of computational speed. I can then calculate the amount of time it would take to execute a given number of flops at a speed of 1{\scriptstyle \mathrm{MIPS}} as:

(22)   \begin{align*} \mathrm{Time} &= \left((A^\star)^3 \cdot (A - A^\star) \right)^{\sfrac{1}{\gamma}} \times \left(\frac{1 {\scriptstyle \mathrm{sec}}}{10^6 {\scriptstyle \mathrm{flops}}}\right) \times \left(\frac{1 {\scriptstyle \mathrm{day}}}{86400 {\scriptstyle \mathrm{sec}}}\right) \end{align*}

Thus, if there are roughly 5000 characteristics that can each take on 50 different levels and 1 out of every 1000 attributes realizes a shock each period, then even if arbitrageurs guess exactly right about the number of shocked attributes (i.e., so that \gamma = 1), a brute-force search would take roughly 45 days to complete. Clearly, a brute-force search strategy just isn’t feasible. There isn’t enough time to physically do all of the calculations.
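For what it’s worth, the 45-day figure is easy to reproduce. A minimal sketch using the numbers from the paragraph above (H = 5000, I = 50, a 1-in-1000 shock rate, \gamma = 1, and a 1 MIPS machine):

```python
H, I = 5000, 50
A = H * I                          # total number of attributes
A_star = A // 1000                 # 1 out of every 1000 attributes is shocked

flops = A_star**3 * (A - A_star)   # gamma = 1: arbitrageurs guess A_star exactly
seconds = flops / 1e6              # at 1 MIPS
print(f"{seconds / 86400:.0f} days")  # roughly 45 days
```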

[Figure: time per search vs. time per trading period]

6. A Persistent Problem

I conclude by addressing a common question. You might ask: “Won’t really fast computers make cognitive control irrelevant?” No. Progress in computer storage (i.e., Kryder’s Law) has actually outstripped progress in processing speed (i.e., Moore’s Law) by a wide margin. Over 10 years the cost of processing has dropped by a factor of roughly 32. By contrast, the cost of storage has dropped by a factor of roughly 1000 over the same period. e.g., take a look at the figure below, made using data from www.mkomo.com, which shows that the cost of disk space has decreased by 58{\scriptstyle \%} each year. What does this mean in practice? Well, as late as 1980 a 26{\scriptstyle \mathrm{MB}} hard drive cost \mathdollar 5 \times 10^3, implying that a 1{\scriptstyle \mathrm{TB}} hard drive would have cost upwards of \mathdollar 2 \times 10^8. These days you can pick up a 1{\scriptstyle \mathrm{TB}} drive for about \mathdollar 50! We have so much storage that finding things in it is now an important challenge. This is why we find Google so helpful. Instead of being eliminated by computational power, cognitive control turns out to be a distinctly modern problem.

[Figure: Kryder’s Law, the cost of hard-drive space over time]
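A quick sanity check on the implied 1 TB price, taking the \mathdollar 5 \times 10^3-for-26 MB figure above at face value:

```python
# Implied 1980s cost of 1 TB of disk space: $5,000 buys 26 MB.
cost_per_mb = 5e3 / 26                # dollars per MB
print(f"${cost_per_mb * 1e6:,.0f}")   # roughly $190 million, i.e., about $2 x 10^8
```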
