Research Notebook

The Continuous Limit of Kyle (1985) Ends With A Bang

July 31, 2018 by Alex

Suppose you uncover good news about a stock’s fundamental value. When you start trading on this information, market makers will notice a spike in aggregate demand, infer that someone must have discovered good news, and adjust the stock’s price upward accordingly. How much will this price impact cut into your profits? Well, you could measure the damage by estimating the change in the stock’s price per 100 shares you buy. If the number is small, the asset’s liquid; if it’s big, it ain’t. Price impact = 1/liquidity.

Of course, if the stock you’re interested in is very illiquid, then maybe you should just make yourself less conspicuous by spreading out your order flow. You can always sacrifice immediacy to gain liquidity. It’s a simple enough idea. But, figuring out precisely how you should spread out your order flow is a hard problem because it introduces a feedback loop. When you spread out your order flow, market makers are less likely to notice a spike in aggregate demand, so your equilibrium price impact will be smaller. But, if your price impact is smaller, then you have less of an incentive to spread out your order flow in the first place.

Where does this feedback loop end? How does strategic trading affect market liquidity in equilibrium? Kyle (1985) provides a nice model that answers these questions. In the model, there’s a single informed agent who places market orders for a risky asset over the course of N \geq 1 auctions. And, a risk-neutral market maker sets the market-clearing price in each auction after observing a noisy version of the informed agent’s demand. Solving the model reveals precisely how the informed agent should optimally adjust his order flow from auction to auction, thereby maximizing his expected profits given the equilibrium price impact.

One of the things I really like about the original paper is its analysis of the continuous-time limit. It turns out that, as the number of auctions during a fixed unit of time tends to infinity, N \to \infty, the informed agent strategically adjusts his demand from auction to auction in a way that keeps the equilibrium price impact constant. This is surprising. In general, you’d expect your price impact at 2:00pm to depend on what’s happened during the previous hour because market makers learn from aggregate order flow. But, if you can trade infinitely often, then the model indicates that you will strategically trade in a way that offsets this learning. Obviously, we don’t think that high-frequency trading actually makes liquidity constant. This is just a really clever way of showing how important equilibrium feedback loops are.

While I really like this analysis, I also think it’s created a misconception about how the informed agent behaves in the model. A constant price impact does not imply that the market is stable… quite the opposite actually. A Kyle (1985) model with a large number of auctions describes a market that ends with a lightning-fast flurry of intense trading activity. When there are many auctions, N \gg 1, the informed agent trades smoothly and steadily until the final few auctions. Then, all hell breaks loose. This post illustrates why.

Structure

In the model, there’s a single informed agent who places market orders for a risky asset in N \geq 1 auctions that take place during a fixed time interval. Let \Delta t = \sfrac{1}{N} denote the time between auctions as a fraction of this interval. For concreteness, I’m going to think about this time interval as 1:00-2:00pm. So, if N = 2, then \Delta t corresponds to 30 minutes and there are auctions at 1:30 and 2:00pm; whereas, if N = 4, then \Delta t corresponds to 15 minutes and auctions occur at 1:15, 1:30, 1:45, and 2:00pm. This risky asset has fundamental value v \sim \mathrm{Normal}(0, \, 1), and the informed agent knows v prior to the start of the first auction. Based on this information, he submits market orders of size \Delta x_n in each auction. So, x_n = \sum_{\check{n}=1}^n \Delta x_{\check{n}} represents his cumulative demand during the first n auctions.

The equilibrium price in each auction, p_n, is determined by a market maker after observing the aggregate order flow, which is the sum of both the informed agent’s demand and a random demand shock:

    \begin{equation*} \Delta a_n = \Delta x_n + \Delta u_n \qquad \text{where} \qquad \Delta u_n \overset{\scriptscriptstyle \text{iid}}{\sim} \mathrm{Normal}(0,\! \sqrt{\Delta t}) \end{equation*}

What’s important here is that the market maker only observes \Delta a_n. He can’t separately observe \Delta x_n or \Delta u_n on its own. As a result, when there’s a lot of demand, he might suspect that the informed agent has uncovered good news, but he can’t be certain. The order flow could always just be due to random noise. Given the market maker’s pricing rule, let \pi_n = \sum_{\hat{n}=n}^N (v - p_{\hat{n}}) \cdot \Delta x_{\hat{n}} denote the informed agent’s future profit on all positions acquired at auctions \{n, \, \ldots, \, N\}.

An equilibrium consists of two sequences of functions, \mathsf{X} = \langle \mathrm{X}_1(\cdot), \, \ldots, \, \mathrm{X}_N(\cdot) \rangle and \mathsf{P} = \langle \mathrm{P}_1(\cdot), \, \ldots, \, \mathrm{P}_N(\cdot) \rangle. The first sequence defines the informed agent’s trading strategy, x_n = \mathrm{X}_n(p_1, \, \ldots, \, p_{n-1}, \, v). The second sequence defines the market maker’s pricing rule, p_n = \mathrm{P}_n(\Delta a_1, \, \ldots, \, \Delta a_n). And, we say that (\mathsf{X}^\star\!, \, \mathsf{P}^\star) is an equilibrium if the following two conditions hold. First, the informed agent’s trading strategy has to be profit maximizing. i.e., if you were to pick any other alternative trading strategy, \mathsf{X}^{\text{alt}}, that’s identical to \mathsf{X}^\star in the first (n-1) auctions, then it must be the case that:

    \begin{equation*} \mathrm{E}[ \, \pi_n(\mathsf{X}^\star\!,\,\mathsf{P}^\star) \mid p_1^\star, \, \ldots, \, p_{n-1}^\star, \, v \, ] \geq \mathrm{E}[ \, \pi_n(\mathsf{X}^\text{alt},\,\mathsf{P}^\star) \mid p_1^\star, \, \ldots, \, p_{n-1}^\star, \, v \, ] \end{equation*}

Second, the market maker’s pricing rule has to be efficient:

    \begin{equation*} p_n^\star = \mathrm{E}[ \, v \mid \Delta a_1^\star, \, \ldots, \, \Delta a_n^\star \, ] \qquad \text{where} \qquad \Delta a_n^\star = \Delta x_n^\star+ \Delta u_n^{\phantom{\star}} \end{equation*}

i.e., price must be equal to the conditional expectation of fundamental value given aggregate order flow.

Solution

Theorem 2 of Kyle (1985) gives the solution to this model. If you restrict yourself to only considering pricing rules of the form

    \begin{equation*} p_n = p_{n-1} + \lambda_n \cdot \Delta a_n \qquad \text{and} \qquad p_0 = \mathrm{E}[v] = 0 \end{equation*}

then there’s a unique equilibrium (\mathsf{X}^\star\!,\,\mathsf{P}^\star) given by

    \begin{align*} \Delta x_n^\star &= \beta_n^{\phantom{\star}}\! \cdot (v - p_{n-1}^\star) \cdot \Delta t \\ \Delta p_n^\star &= \lambda_n^{\phantom{\star}}\! \cdot (\Delta x_n^\star + \Delta u_n^{\phantom{\star}}) \\ \sigma_n^2 &= \mathrm{Var}[ \, v \mid \Delta x_1^\star + \Delta u_1^{\phantom{\star}}, \, \ldots, \, \Delta x_n^\star + \Delta u_n^{\phantom{\star}} \,\! ] \\ \mathrm{E}[ \, \pi_{n+1}^{\phantom{\star}} \mid p_1^\star, \, \ldots, \, p_n^\star, \, v \, ]  &= \alpha_n^{\phantom{\star}}\! \cdot (v - p_n^\star)^2 + \delta_n^{\phantom{\star}} \end{align*}

where \beta_n, \sigma_n^2, \lambda_n, \alpha_n, and \delta_n are the solution to the following system of difference equations

    \begin{align*} \beta_n \cdot \Delta t &= {\textstyle \frac{1}{2 \cdot \lambda_n} \times \left(\frac{1 - 2 \cdot \alpha_n \cdot \lambda_n}{1 - \alpha_n \cdot \lambda_n} \right)} \\ \sigma_n^2             &= \sigma_{n-1}^2 \cdot (1 -  \lambda_n \cdot [\beta_n \cdot \Delta t]) \\ \lambda_n              &= \sigma_n^2 \cdot \beta_n \\ \alpha_{n-1}           &= {\textstyle \frac{1}{4 \cdot \lambda_n} \times \left( \frac{1}{1 - \alpha_n \cdot \lambda_n}\right)} \\ \delta_{n-1}           &= \delta_n + \alpha_n \cdot \lambda_n^2 \cdot \Delta t \end{align*}

under the conditions p_0 = \alpha_N = \delta_N = 0, \sigma_0^2 = 1, and \lambda_n \cdot (1 - \alpha_n \cdot \lambda_n) > 0.
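If you want to compute these constants yourself, here is a minimal numerical sketch (my own code, not from the paper) that solves the system by backward shooting: guess the terminal posterior variance \sigma_N^2, run the difference equations from n = N down to n = 1, and bisect on the guess until the implied prior variance equals \sigma_0^2 = 1. At each auction, \lambda_n is pinned down by a small cubic obtained from the first and third equations, subject to the second-order condition.

    import numpy as np

    def solve_kyle(N, sigma0_sq=1.0):
        """Backward-shooting sketch for the Kyle (1985) system with N auctions.

        Guess the terminal posterior variance sigma_N^2, run the difference
        equations from n = N down to n = 1, and bisect on the guess until the
        implied prior variance matches sigma0_sq.
        """
        dt = 1.0 / N

        def backward(sig_N_sq):
            lam, beta = np.zeros(N + 1), np.zeros(N + 1)
            sig2, alpha, delta = np.zeros(N + 1), np.zeros(N + 1), np.zeros(N + 1)
            sig2[N] = sig_N_sq                          # alpha_N = delta_N = 0 already
            for n in range(N, 0, -1):
                # Combining beta_n*dt = (1 - 2al)/(2l(1 - al)) with l = sig2*beta gives
                # the cubic  -2*dt*a*l^3 + 2*dt*l^2 + 2*sig2*a*l - sig2 = 0  in l.
                roots = np.roots([-2 * dt * alpha[n], 2 * dt,
                                  2 * sig2[n] * alpha[n], -sig2[n]])
                roots = roots[np.isreal(roots)].real
                # keep the root satisfying the second-order condition l*(1 - a*l) > 0
                lam[n] = next(r for r in roots if r > 0 and r * (1 - alpha[n] * r) > 0)
                beta[n] = lam[n] / sig2[n]
                sig2[n - 1] = sig2[n] / (1 - lam[n] * beta[n] * dt)
                alpha[n - 1] = 1 / (4 * lam[n] * (1 - alpha[n] * lam[n]))
                delta[n - 1] = delta[n] + alpha[n] * lam[n] ** 2 * dt
            return lam, beta, sig2, alpha, delta

        lo, hi = 1e-10, sigma0_sq            # information gets revealed, so sig_N^2 < sig_0^2
        for _ in range(100):                 # bisect on the terminal guess
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if backward(mid)[2][0] < sigma0_sq else (lo, mid)
        return backward(0.5 * (lo + hi))

    # e.g., compare lam[1:] and sig2[1:] for N = 4 to the squares in Figure 1 of Kyle (1985)
    lam, beta, sig2, alpha, delta = solve_kyle(4)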

On the right, I’ve plotted these constants for different Ns. You can see how the equilibrium values change as the number of auctions increases by moving the slider at the top. To verify consistency, note that when there are N=4 auctions, the values of \lambda_n and \sigma_n^2 to the right are the same as those depicted by the four square markers in Figure 1 of Kyle (1985).

N→∞ Limit

What’s more, by pushing the slider all the way to the right so that N=60 and auctions occur once a minute, you get something close to the continuous-limit result I described in the introduction. As N \to \infty, the price-impact coefficient \lambda_n \to 1 due to the way that the informed agent spreads out his order flow across auctions. But, does this mean that informed demand is smooth?

No.

You can see as much in the animation to the right. The top panel shows realizations of the informed agent’s trading volume, |\Delta x_n|. As you push the slider towards N = 60 in this figure, you can watch the informed agent’s demand fluctuate more and more as he approaches the final auction at 2:00pm. He’s spreading out his demand, but his demand’s clearly not smooth. When N is large, the market ends with a short intense burst of trading activity. In fact, you might even think about this short intense burst of activity as a tell-tale sign of short-horizon trading.

Here’s another way of looking at what’s happening. The noise shocks are the same regardless of which auction they arrive in, \Delta u_n \overset{\scriptscriptstyle \text{iid}}{\sim} \mathrm{Normal}(0,\! \sqrt{\Delta t}). Noise doesn’t care whether it’s auction n=1 or auction n=60. So, the typical noise shock will always be on the order of \sqrt{\Delta t}. As a result, the typical change in the equilibrium price in auction (n-1) due to demand noise will be \lambda_{n-1} \cdot \! \sqrt{\Delta t}. And, this random price change will, in turn, change the informed agent’s demand in auction n since \Delta x_n = \beta_n \cdot (v - p_{n-1}) \cdot \Delta t:

    \begin{equation*} \mu_n = \beta_n \cdot (\lambda_{n-1} \cdot \! \sqrt{\Delta t}) \cdot \Delta t \end{equation*}

By plugging in the equilibrium conditions from the previous section, we can rewrite the typical change in the informed agent’s demand due to random fluctuations in the previous auction as follows:

    \begin{align*} \mu_n &=  {\textstyle \frac{\lambda_{n-1}}{2 \cdot \lambda_n} \cdot \left(\frac{1 - 2 \cdot \alpha_n \cdot \lambda_n}{1 - \alpha_n \cdot \lambda_n} \right)} \times \! \sqrt{\Delta t} \\ &=  2 \cdot \alpha_{n-1} \cdot \lambda_{n-1} \cdot (1 - 2 \cdot \alpha_n \cdot \lambda_n) \times \! \sqrt{\Delta t} \end{align*}

The bottom panel in the figure to the right plots this quantity. When N = 60, the informed agent responds more and more aggressively to noise shocks in the previous auction as n \to 60. This behavior corresponds to the hyperbolic growth in \beta_n as n \to 60 plotted above. Constant price impact does not imply smooth trading.
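You can also check both halves of this claim directly using the hypothetical solve_kyle function sketched in the Solution section above:

    lam, beta, sig2, alpha, delta = solve_kyle(60)

    print(lam[1:4], lam[-3:])     # price impact stays on the same order across the hour
    print(beta[1:4], beta[-3:])   # trading intensity explodes in the last few auctions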


Empirical Bayes and Price Signals

April 27, 2018 by Alex

Asset-pricing models are built upon the idea that traders learn from price signals. For example, suppose there are N \geq 1 actively managed mutual funds. And, imagine a trader that observes the entire cross-section of these mutual funds’ returns in month t:

    \begin{equation*} R_{n,t} = \alpha_n + \epsilon_{n,t} \qquad \text{where} \qquad \epsilon_{n,t} \overset{\scriptscriptstyle \text{iid}}{\sim} \mathrm{N}(0, \, \sigma_{\epsilon}^2) \end{equation*}

In the equation above, \alpha_n is the average performance of the nth active mutual-fund manager while \epsilon_{n,t} is measurement error in price signals. A skilled mutual-fund manager has an \alpha_n > 0. So, after observing the cross-section of mutual funds’ returns in a given month, a trader can use Bayes’ rule

    \begin{equation*} \mathrm{Pr}[\alpha_n > 0|R_{n,t}] = {\textstyle \left( \frac{\mathrm{Pr}[R_{n,t}|\alpha_n > 0]}{\mathrm{Pr}[R_{n,t}]} \right)} \times \mathrm{Pr}[\alpha_n > 0] \end{equation*}

to update his beliefs about whether or not the nth active mutual-fund manager is skilled.

At first glance, the logic above seems trivial. But, on closer inspection, there’s something a bit paradoxical about framing a trader’s inference problem this way. If a trader wants to use Bayes’ rule to learn about a particular fund manager’s skill level from realized returns, then it seems like the trader needs to know how skill is distributed across the population of fund managers. After all, the last term in the equation above is just one minus the cumulative-distribution function at zero, \mathrm{Pr}[\alpha_n > 0] = 1 - \mathrm{CDF}_{\alpha}(0). And, financial economists disagree about basic properties of this distribution. For instance, there is debate about whether any active mutual-fund managers are skilled—i.e., about whether \mathrm{Pr}[\alpha_n > 0] = 0 or \mathrm{Pr}[\alpha_n > 0] > 0. If we can’t agree about these sorts of basic facts, then how are traders supposed to apply Bayes’ rule?

This is where the empirical-Bayes method makes an appearance. It turns out that, in the example above, a trader can learn about a particular mutual-fund manager’s skill from realized returns without having any ex ante knowledge about how skill is distributed across the population of fund managers. He can just estimate this prior distribution from the data. And, this post illustrates how, using a trick known as Tweedie’s formula.

Simple Example

Let’s start with a simple example to illustrate how the empirical-Bayes method works. Suppose that fund-manager skill is normally distributed across the population:

    \begin{equation*} \alpha_n \overset{\scriptscriptstyle \text{iid}}{\sim} \mathrm{N}(\mu_{\alpha}, \, \sigma_{\alpha}^2) \end{equation*}

In this setup, it’s easy to compute a trader’s posterior beliefs about the skill of the nth mutual-fund manager after observing this manager’s month-t returns:

(1)   \begin{equation*} \mathrm{E}[\alpha_n|R_{n,t}] = {\textstyle \left( \frac{\sigma_{\alpha}^2}{\sigma_{\alpha}^2 + \sigma_{\epsilon}^2}\right)} \times R_{n,t} + {\textstyle \left( \frac{\sigma_{\epsilon}^2}{\sigma_{\alpha}^2 + \sigma_{\epsilon}^2} \right)} \times \mu_\alpha \end{equation*}

This is a completely standard Gaussian-learning problem (e.g., see here, here, here, etc…).

The formula in Equation (1) seems to require knowledge of the mean and variance of the skill distribution, \mu_{\alpha} and \sigma_{\alpha}^2. But, not so. Notice that when both skill and error are normally distributed, realized returns are also normally distributed, R_{n,t} \overset{\scriptscriptstyle \text{iid}}{\sim} \mathrm{N}(\mu_R, \, \sigma_R^2), with

    \begin{equation*} \begin{split} \mu_R &= \mu_{\alpha} \\ \sigma_R^2 &= \sigma_{\alpha}^2 + \sigma_{\epsilon}^2 \end{split} \end{equation*}

So, a trader could form the correct ex-post beliefs about the skill of the nth mutual-fund manager by simply estimating the cross-sectional mean and variance of the realized returns

(2)   \begin{equation*} \mathrm{E}[\alpha_n|R_{n,t}] = {\textstyle \left( \frac{\sigma_R^2 - \sigma_{\epsilon}^2}{\sigma_R^2} \right)} \times R_{n,t} + {\textstyle \left( \frac{\sigma_{\epsilon}^2}{\sigma_R^2} \right)} \times \mu_R \end{equation*}

since \sfrac{\sigma_{\alpha}^2}{(\sigma_{\alpha}^2 + \sigma_{\epsilon}^2)} = \sfrac{(\sigma_R^2 - \sigma_{\epsilon}^2)}{\sigma_R^2} and \sfrac{\sigma_{\epsilon}^2}{(\sigma_{\alpha}^2 + \sigma_{\epsilon}^2)} \times \mu_{\alpha} = \sfrac{\sigma_{\epsilon}^2}{\sigma_R^2} \times \mu_R.

This is the essence of the empirical-Bayes method. If you’re interested in learning from a specific observation and you don’t know which priors to use, then just use the remaining data to estimate these priors. i.e., replace \mu_R and \sigma_R^2 with \hat{\mu}_R = \frac{1}{N-1} \cdot \sum_{n' \neq n} R_{n',t} and \hat{\sigma}_R^2 = \frac{1}{N-2} \cdot \sum_{n' \neq n} (R_{n',t} - \hat{\mu}_R)^2 in Equation (2).
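Here is a minimal simulation sketch of this plug-in scheme (my own illustration; the parameter values below are made up). It draws a cross-section of fund returns, estimates \mu_R and \sigma_R^2 from the other funds, and compares the empirical-Bayes posterior mean from Equation (2) to the oracle posterior mean from Equation (1).

    import numpy as np

    rng = np.random.default_rng(0)

    N, mu_alpha, sig_alpha, sig_eps = 5000, 0.0, 1.0, 2.0   # hypothetical parameter values
    alpha = rng.normal(mu_alpha, sig_alpha, N)              # true (unobserved) manager skill
    R = alpha + rng.normal(0.0, sig_eps, N)                 # observed month-t returns

    n = 0                                                   # the fund we want to learn about
    others = np.delete(R, n)
    mu_R_hat, sig2_R_hat = others.mean(), others.var(ddof=1)

    # Empirical-Bayes posterior mean from Equation (2), using the plug-in estimates.
    eb = ((sig2_R_hat - sig_eps**2) / sig2_R_hat) * R[n] + (sig_eps**2 / sig2_R_hat) * mu_R_hat

    # Oracle posterior mean from Equation (1), using the true prior parameters.
    w = sig_alpha**2 / (sig_alpha**2 + sig_eps**2)
    oracle = w * R[n] + (1 - w) * mu_alpha

    print(eb, oracle)                                       # nearly identical when N is large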

The figure above illustrates how this scheme works. The three left panels show the cross-sectional distributions of measurement error, manager skill, and realized returns under the assumption of normality. The right panel then shows a trader’s posterior beliefs about the skill of the nth mutual-fund manager (y-axis) after observing this fund’s realized returns in month t (x-axis). The purple line shows \mathrm{E}[\alpha_n|R_n] calculated using knowledge of both \mu_{\alpha} and \sigma_{\alpha}^2. The dashed black line shows \mathrm{E}[\alpha_n|R_n] calculated via the empirical-Bayes method using estimates of \hat{\mu}_R and \hat{\sigma}_R^2 from the cross-sectional distribution of returns.

Tweedie’s Formula

Tweedie’s formula is a natural extension of this approach that doesn’t require manager skill (or whatever it is that traders are trying to learn about) to be normally distributed. Here’s the idea. Notice that in the normally-distributed case, R_{n,t} \overset{\scriptscriptstyle \text{iid}}{\sim} \mathrm{N}(\mu_R, \, \sigma_R^2), the probability-density function (PDF) of realized returns is given by:

    \begin{equation*} \mathrm{f}(R) = {\textstyle \frac{1}{\sqrt{2 \cdot \pi \cdot \sigma_R^2}}} \cdot \exp {\textstyle \left\{- \, \frac{1}{2 \cdot \sigma_R^2} \cdot (R-\mu_R)^2 \right\}} \end{equation*}

And, if we define the log of this PDF, \ell(R) = \log \mathrm{f}(R), then \ell'(R) = - \, (\sfrac{1\!}{\sigma_R^2}) \cdot (R - \mu_R). So, we can write the formula for a trader’s posterior beliefs in Equation (2) as follows:

(3)   \begin{equation*} \mathrm{E}[\alpha_n|R_n]  =  R_n + \sigma_{\epsilon}^2 \cdot \ell'(R_n) \end{equation*}

This is Tweedie’s formula. And, in keeping with Stigler’s law, it was Robbins (1956) who showed that Tweedie’s formula holds approximately for any prior distribution on \alpha_n that satisfies standard regularity conditions, such as being smooth and having a single peak. The formula in Equation (3) is interesting because it means that, if a trader can approximate the cross-sectional distribution of mutual-fund returns (i.e., estimate \hat{\mathrm{f}}(R)), then he can appropriately update his beliefs about any particular manager’s skill level (i.e., compute \hat{\mathrm{E}}[\alpha|R_{n,t}]). There’s no need for him to take a hard-line dogmatic stance about what the cross-sectional distribution of mutual-fund manager skill looks like.
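To make this concrete, here is a rough sketch (entirely my own; the Laplace skill distribution and all parameter values are hypothetical) of how a trader could implement Equation (3): estimate \ell(R) = \log \mathrm{f}(R) from the return cross-section with a low-order polynomial, differentiate it, and plug the derivative into Tweedie’s formula.

    import numpy as np

    rng = np.random.default_rng(1)

    N, sig_eps = 50_000, 1.0
    alpha = rng.laplace(0.0, 1.0, N)            # hypothetical non-normal skill distribution
    R = alpha + rng.normal(0.0, sig_eps, N)     # observed cross-section of returns

    # Estimate log f(R) with a 4th-order polynomial fit to histogram log-counts
    # (a crude stand-in for the density estimate described in CASI, Ch. 15).
    counts, edges = np.histogram(R, bins=60)
    mids = 0.5 * (edges[:-1] + edges[1:])
    keep = counts > 0
    coef = np.polyfit(mids[keep], np.log(counts[keep]), deg=4)
    ell_prime = np.poly1d(coef).deriv()         # d/dR log f(R); additive constants drop out

    # Tweedie's formula: posterior mean of skill given this month's return.
    grid = np.linspace(-4, 4, 9)
    print(np.round(grid + sig_eps**2 * ell_prime(grid), 2))   # roughly soft-thresholded near zero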

To see this point in action, check out the figure above. First, click on the “Normal” button. This version of the figure replicates the earlier result by estimating \mathrm{f}(R) with a 4th-order polynomial rather than by directly estimating \hat{\mu}_R and \hat{\sigma}_R^2. The interesting part, however, is that this result also holds when mutual-fund manager skill is not normally distributed. For example, suppose that skill obeys a Laplace distribution:

    \begin{equation*} \alpha_n \overset{\scriptscriptstyle \text{iid}}{\sim} \mathrm{Lap}(\lambda_{\alpha}/\sigma_{\epsilon}) \qquad \text{where} \qquad \mathrm{Lap}(\theta) = (\sfrac{\theta\!}{2}) \cdot e^{- \theta \cdot |x|} \end{equation*}

Under this assumption, the correct way for a trader to update his prior beliefs about the nth manager’s skill after observing the manager’s realized return in month t is to use a threshold rule. When the manager’s return is sufficiently small, |R_{n,t}| \leq \sigma_{\epsilon} \cdot \lambda_{\alpha}, a trader should not update his beliefs at all:

    \begin{equation*} \mathrm{E}[\alpha_n|R_n]  =  \begin{cases} \mathrm{Sgn}(R_n) \cdot (|R_n| - \sigma_{\epsilon} \cdot \lambda_{\alpha}) &\text{if } |R_n| > \sigma_{\epsilon} \cdot \lambda_{\alpha} \\ 0 &\text{otherwise} \end{cases} \end{equation*}

This is the Bayesian LASSO. And, by clicking on the “Laplace” button in the figure above, you can see how Tweedie’s formula captures this non-responsiveness without having to directly assume that traders are using an \ell_1 penalty. Something resembling a soft-thresholding rule just emerges from the data.

Actual Data

Finally, click on the “???????” button. In the lower left-hand corner, you should now see the distribution of returns for all actively managed equity mutual funds in May 2012 (normalized to be on the same scale as the data from the earlier simulations). The data in this version of the figure comes from WRDS. I just picked one month at random. The dashed line is the estimated PDF for this return distribution, which I again computed using a 4th-order polynomial (see CASI, Ch. 15). There is nothing in the middle box because I don’t know the skill level of each mutual-fund manager. In the upper-left box, I’ve plotted the PDF of the measurement error. But, I didn’t plot a histogram of realized errors because, again, I can’t tell skill from luck. I can only see the cross-section of returns.

There are two interesting things about the plot of the posterior beliefs on the right. The first is that, if you apply Tweedie’s formula to actual mutual-fund returns, then you get a picture that looks a lot like the picture that emerged using Laplace priors. In other words, it looks a lot like a world where the distribution of active mutual-fund manager skill has fat tails—i.e., a world where there are a couple of very skilled managers and a couple of utterly incompetent managers and everyone else is just sort of ‘meh’. The second interesting thing about this picture is that making prices less informative (i.e., increasing \sigma_{\epsilon}^2) affects traders’ posterior beliefs in a highly non-linear way. Put differently, when prices are less informative, traders don’t just react less to all price signals. They stop reacting to small price signals entirely. This is not a result that you could get in an information-based asset-pricing model with strictly normal shocks.


How Bad Are False Positives, Really?

January 9, 2018 by Alex

Imagine you’re looking for variables that predict the cross-section of expected returns. No search process is perfect. As you work, you will inevitably uncover both tradable anomalies as well as spurious correlations. To figure out which are which, you regress returns on each variable that you come across:

    \begin{equation*}  r_n = \hat{\alpha} + \hat{\beta} \cdot x_n + \hat{\epsilon}_n \end{equation*}

This is helpful because “any predictive regression can be expressed as a portfolio sort” and vice versa. So, a statistically significant \hat{\beta} suggests a profitable stock-picking strategy.

But, what qualifies as a “statistically significant” test result? If a variable doesn’t actually predict returns, then the probability that its \hat{\beta} will have a t-stat greater than 1.96 in absolute value is 5\% by definition:

    \begin{equation*} \mathrm{Pr}\big( \, |\mathrm{t}(\hat{\beta})| > 1.96 \, \big| \, \text{variable is spurious} \, \big) = 0.05 \end{equation*}

But, what if you didn’t just consider one variable on its own? Instead, suppose that you ran K \gg 1 separate regressions. If all K variables were just spurious correlations, the probability that at least one of these \hat{\beta}s has a t-stat greater than 1.96 in absolute value would be much larger than 5\%:

    \begin{equation*} \mathrm{Pr}\big( \, \max\{ |\mathrm{t}(\hat{\beta}_1)|, \, |\mathrm{t}(\hat{\beta}_2)|, \, \ldots, \, |\mathrm{t}(\hat{\beta}_K)| \} > 1.96 \, \big| \, \text{all $K$ variables are spurious} \, \big) = 1 - (1 - 0.05)^K > 0.05 \end{equation*}

Finding a |\mathrm{t}(\hat{\beta}_k)| > 1.96 becomes meaningless as you run more and more regressions. e.g., if K = 10, then you should expect to see a t-stat larger than 1.96 in absolute value more than 1 - (1 - 0.05)^{10} \approx 40\% of the time.

When people in academic finance talk about the problem of “data mining”, this is what they’re referring to. It seems patently obvious that having such a high false-positive rate is a bad thing. And, at first glance, it seems like there’s an easy way to fix the problem: just use a larger cutoff for statistical significance. e.g., researchers have suggested using a t-stat greater than 3.00 rather than 1.96 to account for the fact that we’ve proposed thousands of candidate variables. But, is our obsession with minimizing the false-positive rate really the right approach? Do we always want to choose our statistical tests so that they have the lowest possible false-positive rate? Not necessarily. And, this post describes two reasons why.

Reason #1

We don’t care about the false-positive rate for its own sake. What we really want to know is: “Conditional on observing a significant test result, how likely is it that we’ve found an honest-to-goodness anomaly?” Using Bayes’ theorem, we can write this conditional probability as

    \begin{equation*} \mathrm{Pr}(\text{anomaly} \, | \, \text{signif}) = \frac{ \mathrm{Pr}( \text{signif} \, | \, \text{anomaly} ) \cdot \mathrm{Pr}(\text{anomaly}) }{ \mathrm{Pr}( \text{signif} \, | \, \text{anomaly} ) \cdot \mathrm{Pr}(\text{anomaly}) + \mathrm{Pr}( \text{signif} \, | \, \text{spurious} ) \cdot \mathrm{Pr}(\text{spurious}) } \end{equation*}

Clearly, if we underestimate the false-positive rate, \mathrm{Pr}(\text{signif} \, | \, \text{spurious}), then we’re going to overestimate this conditional probability because we’re going to be dividing by a smaller number on the right-hand side.

But, \mathrm{Pr}( \text{signif} \, | \, \text{spurious} ) isn’t the only term on the right-hand side of the equation! We care about more than just the false-positive rate when updating our priors. e.g., if we knew there were never any anomalies, then we could guarantee that every single significant result was a false positive. So, we would always conclude that \mathrm{Pr}(\text{anomaly} \, | \, \text{signif}) = 0 regardless of how much we underestimated the false-positive rate.

Let’s sharpen this insight with a little algebra. First, suppose that the unconditional probability of finding a tradable anomaly is \mu \in [0, \, \sfrac{1}{2}):

    \begin{equation*} \mathrm{Pr}(\text{anomaly}) = 1 - \mathrm{Pr}(\text{spurious}) = \mu \end{equation*}

Next, suppose that the probability of observing a significant test result for a spurious correlation is (5\% + \vartheta) for \vartheta \geq 0 while the probability of observing a significant test result for a tradable anomaly is 100\%:

    \begin{equation*} \begin{split} \mathrm{Pr}( \text{signif} \, | \, \text{spurious} ) &= 0.05 + \vartheta \\ \mathrm{Pr}( \text{signif} \, | \, \text{anomaly} ) &= 1 \end{split} \end{equation*}

Thus, \mu represents the base rate of observing anomalies, and \vartheta represents the additional false-positive rate introduced by the data-mining problem described above. Roughly speaking, a larger \mu means that anomalies are more common. And, a large \vartheta means that regressions are easier to run.

We can express our posterior beliefs that a particular variable represents a tradable anomaly given a significant test result as

    \begin{equation*} \underset{=\mathrm{Lik}(\mu, \, \vartheta)}{\mathrm{Pr}( \text{anomaly} \, | \, \text{signif} )} = \frac{ \mu }{ \mu + (0.05 + \vartheta) \cdot (1 - \mu) } \end{equation*}

Thus, ignoring the additional false positives introduced by data mining biases our posterior beliefs by

    \begin{equation*} \mathrm{Bias}_{\vartheta}(\mu) = \mathrm{Lik}(\mu, \, 0) - \mathrm{Lik}(\mu, \, \vartheta) \end{equation*}

The figure to the right plots this bias (y-axis: \mathrm{Bias}_{\vartheta}(\mu)) as a function of the excess false-positive rate (x-axis: \vartheta). The line is always sloping upward because the more you underestimate a test’s false-positive rate the more confident you will be that you’ve found an anomaly. However, the shape of the line dramatically changes as you play around with the slider, which adjusts \mu. When 0.05 < \mu < 0.12, the plot has more or less the same upward-sloping concave shape. But, when \mu < 0.01, the shape of the plot flattens out dramatically. In other words, if the base rate is sufficiently small, then underestimating the false-positive rate doesn’t affect our posterior beliefs very much.
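Here is a small sketch that reproduces this comparative static (my own code; the grid of \vartheta values and the three base rates are arbitrary choices).

    import numpy as np

    def lik(mu, theta):
        """Pr(anomaly | signif) when the true false-positive rate is 0.05 + theta."""
        return mu / (mu + (0.05 + theta) * (1 - mu))

    def bias(mu, theta):
        """How much ignoring the extra false positives inflates our posterior beliefs."""
        return lik(mu, 0.0) - lik(mu, theta)

    thetas = np.linspace(0.0, 0.45, 10)
    for mu in (0.10, 0.01, 0.001):
        print(mu, np.round(bias(mu, thetas), 3))
    # The bias curve is steep when mu is around 0.10 but nearly flat when mu is tiny.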

This observation implies something a little counterintuitive: a paper can’t simultaneously argue that A) almost all documented anomalies are in fact spurious correlations, and that B) it’s super important for other researchers to use a test procedure that they’ve proposed which minimizes the false-positive rate. Although they get made one after the other (e.g., here), these two claims aren’t internally consistent. It’s one or the other. Minimizing the false-positive rate can only matter if there’s a non-negligible chance of finding a tradable anomaly.

Reason #2

If you’re headed to the doctor’s office for a pregnancy test, then false positives matter. Finding out you’re pregnant is a big deal. Later discovering that it was a mistake would be traumatic. But, testing for Coeliac disease is different. If you have the disease, then you really want to know. But, treating the disease only involves changing your diet. There’s no need to undergo risky surgery or take expensive medication. So, when testing for Coeliac disease, false positives aren’t such a big deal (unless you looooove bread). And, if given the choice, your doctor should choose a test for the disease that has the lowest false-negative rate, even if that means asking a few perfectly healthy patients to cut gluten out of their diets.

The same sort of logic applies to testing for tradable anomalies. If you can trade on a statistically significant anomaly using liquid actively-traded stocks, then why spend time worrying about the false-positive rate? If you find out you’re wrong, then you can quickly and painlessly exit the position. If this is the sort of world you’re operating in, then you might actually want to set up your statistical tests to minimize your false-negative rate. This is one way to interpret pithy trader sayings like “invest first, investigate later”.

The medical literature also gives a nice way of formalizing this idea using something called the number needed to treat (e.g., see here). Suppose I came to you with a bunch of variables that each seemed to predict the cross-section of expected returns. They each delivered significant excess returns in backtesting. If you choose S \geq 1 of these variables

    \begin{equation*} S = \frac{1}{\mathrm{Pr}(\text{anomaly} \, | \, \text{signif}) - \mathrm{Pr}(\text{anomaly})} \end{equation*}

then you should expect your selections to contain one more tradable anomaly than if you had just picked S variables at random—i.e., regardless of whether they had delivered significant excess returns in backtesting.

Now, think about a portfolio that invests the same amount of money in strategies based on each of the S candidate variables you choose. This portfolio’s expected return will depend on both the profitability of your one extra tradable anomaly and the losses of your (S-1) other spurious predictors:

    \begin{equation*} {\textstyle \left(\frac{1}{S}\right)} \cdot \mathrm{E}[\sfrac{\text{profit}}{\text{$\mathdollar 1$ invested in anomaly}}]  -  {\textstyle \left(\frac{S-1}{S}\right)} \cdot \mathrm{E}[\sfrac{\text{loss}}{\text{$\mathdollar 1$ invested in spurious correlation}}] \end{equation*}

Clearly, a higher false-positive rate means a larger S. But, the formulation above illustrates why this might not be such a bad thing. If trading on your one genuine anomaly is really profitable and you can quickly identify/exit your remaining (S-1) spurious positions, then who cares if S is large?
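As a quick numerical illustration of this trade-off (all numbers below are hypothetical, not estimates), you can compute S from the formula above and then check whether the equal-weighted portfolio still earns a positive expected return.

    # Hypothetical inputs: base rate, posterior after a significant backtest, and
    # per-dollar payoffs on genuine vs. spurious strategies (made-up numbers).
    pr_anomaly = 0.05
    pr_anomaly_given_signif = 0.30
    profit_per_dollar, loss_per_dollar = 0.10, 0.01

    S = 1.0 / (pr_anomaly_given_signif - pr_anomaly)   # number needed to "treat"
    expected_return = (1 / S) * profit_per_dollar - ((S - 1) / S) * loss_per_dollar

    print(S, expected_return)   # S = 4.0; positive so long as the one real anomaly pays for the rest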


How Many Assets Are Needed To Test a K-Factor Model?

November 23, 2017 by Alex

1. Motivation

Imagine you’re a financial economist who thinks that some risk factor, f_t, explains the cross-section of expected returns. And, you decide to test your hunch. First, you regress the realized returns of N different assets on f_t to estimate each asset’s exposure to the risk factor, \tilde{b}_n:

    \begin{equation*} r_{n,t} = \tilde{a}_n + \tilde{b}_n \cdot f_t + \tilde{e}_{n,t} \qquad t = 1, \, \ldots, \, T \, \, \text{for each } n \end{equation*}

Then, you regress these same assets’ average returns on their exposures to the risk factor, \tilde{b}_n:

    \begin{equation*} \mathrm{E} [ r_n ] = \hat{\alpha} + \hat{\lambda} \cdot \tilde{b}_n + \hat{\epsilon}_n \qquad n = 1, \, \ldots, \, N \end{equation*}

If f_t is a priced factor, then the slope coefficient, \hat{\lambda}, should be large and the intercept, \hat{\alpha}, should be zero.

This two-stage methodology dates back to Fama and MacBeth (1973). And, in keeping with the original paper, many financial economists still use a small set of portfolios as the N assets in their empirical analysis. For example, Fama and French (1993) use N = 25 portfolios created by sorting stocks based on size and book-to-market. In theory, this is fine. If you’ve found a priced risk factor, then exposure to f_t should affect the average returns of all assets. There is no theoretical guidance about which assets to use.

But, here’s the thing: econometrically, the number of assets has to matter. We live in a world with lots of factors to choose from. This is the “anomaly zoo” coined in Cochrane (2011). If you consider a model with at least as many risk factors as assets, K \geq N, then of course you can perfectly fit the data:

    \begin{equation*} \mathrm{E} [ r_n ] = 0 + {\textstyle \sum_{f \in \mathcal{K}}} \, \hat{\lambda}_f \cdot \tilde{b}_{n,f} + 0 \qquad n = 1, \, \ldots, \, N \leq K \end{equation*}

After all, a system of N linear equations with K unknowns is guaranteed to have a solution if K \geq N.

Clearly, you need at least K assets to test a K-factor model. This is obvious. But, you can elaborate on this idea and say something useful in the related setting where K \leq N \leq F. This post answers the question: how many assets do you need when testing a 4-factor model chosen in a world with F = 97 candidate factors?

2. Problem Formulation

Suppose we live in a world with N assets and F candidate factors. Think about F = 97 (McLean and Pontiff; 2016), F = 333 (Harvey, Liu, and Zhu; 2016), or F = 447 (Hou, Xue, and Zhang; 2017). We are looking for the asset-pricing model with the fewest factors that perfectly explains the cross-section of expected returns:

(★)   \begin{equation*} \min_{\mathcal{K} \subseteq \mathcal{F}}  \left\{  \, {\textstyle \sum_{f \in \mathcal{K}}} \, 1_{\{\hat{\lambda}_f \neq 0\}} \quad \text{s.t.} \quad \mathrm{E}[ r_n ] = {\textstyle \sum_{f \in \mathcal{K}}} \, \hat{\lambda}_f \cdot \tilde{b}_{n,f} \, \right\} \end{equation*}

This is just a mathematical way of applying Occam’s razor to our model-selection problem.

Here’s what I want to know: if we find a solution to this problem, \hat{\mathcal{K}}, how likely is it that \hat{\mathcal{K}} is the only solution? In other words, if we find a \hat{K}-factor model that solves Problem (★) and perfectly explains the cross-section of expected returns, should we celebrate or slow clap? Have we found the simplest model of the world? Or, is this just one of many such \hat{K}-factor models that we were bound to uncover?

3. Similar Exposures

It turns out that the answer to this question is going to critically depend on the similarity of the N assets’ exposures to the F risk factors. To see why, just think about a world where \mathcal{F} contains two copies of some risk factor in \hat{\mathcal{K}}. Clearly, \hat{\mathcal{K}} can’t be a unique solution to Problem (★) because you could just switch out the risk factor for its twin and have another \hat{K}-factor model explaining the cross-section of expected returns.

Donoho and Huo (2001) give a nice way of generalizing this notion of similarity to situations where \mathcal{F} doesn’t contain multiple copies of the same factor. Specifically, they define the idea of mutual coherence:

    \begin{equation*} \rho_{\max}  \overset{\scriptscriptstyle \text{def}}{=}  \max_{1 \leq f < f' \leq F}  \left\{  \, \frac{ \left| \sum_{n=1}^N \tilde{b}_{n,f} \cdot \tilde{b}_{n,f'} \right| }{ \sqrt{\sum_{n=1}^N \tilde{b}_{n,f}^2} \cdot \sqrt{\sum_{n=1}^N \tilde{b}_{n,f'}^2} } \, \right\} \end{equation*}

Roughly speaking, you should think about mutual coherence as measuring the maximum correlation between the N assets’ exposures to any pair of risk factors in \mathcal{F}. A large value of \rho_{\max} means that the N assets have really similar exposures to some pair of risk factors in \mathcal{F}. And, in the extreme case where \mathcal{F} contains two exact copies of the same risk factor, we have \rho_{\max} = 1. By contrast, \rho_{\max} = 0 is only possible when there are at least as many assets as risk factors, so that every pair of exposure vectors can be orthogonal.
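Mutual coherence is easy to compute directly from a matrix of estimated factor exposures. Here is a minimal sketch (my own; B is just a placeholder N \times F matrix of random exposures):

    import numpy as np

    rng = np.random.default_rng(0)
    N, F = 25, 97
    B = rng.normal(size=(N, F))            # placeholder: N assets' exposures to F factors

    Bn = B / np.linalg.norm(B, axis=0)     # give each factor's exposure vector unit length
    G = np.abs(Bn.T @ Bn)                  # |normalized inner product| for every factor pair
    np.fill_diagonal(G, 0.0)

    rho_max = G.max()
    print(rho_max)                         # well above zero even though the columns are independent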

But, since we are thinking about a world where there are more candidate risk factors than assets, F > N, we know that \rho_{\max} > 0. Even if the F factors really are independent, some of them are going to have to look correlated in such a small sample. And, Welch (1974) gives a lower bound on just how correlated they must look:

(W)   \begin{equation*} \rho_{\max} \geq \sqrt{\frac{F - N}{N \cdot (F - 1)}} \qquad \text{given } F > N \end{equation*}

4. Theoretical Minimum

Now, we can answer our original question: when can we be sure that a solution to Problem (★) is unique? i.e., if we find a K-factor model that perfectly explains the cross-section of returns for N assets, should we be surprised? Donoho and Elad (2003) show that, if \hat{\mathcal{K}} is a solution to Problem (★) and

(DE)   \begin{equation*} \hat{K} < \frac{1}{2} \cdot \left( 1 + \frac{1}{\rho_{\max}} \right), \end{equation*}

then \hat{\mathcal{K}} is a unique solution. There are no other factor models that explain the cross-section of expected returns using \hat{K} or fewer factors. Inserting the bound from Equation (W) into the bound from Equation (DE) and then solving for N yields:

    \begin{equation*} N_{\min} \overset{\scriptscriptstyle \text{def}}{=} \left\lceil \frac{F \cdot (2 \cdot K - 1)^2}{(F - 1) + (2 \cdot K - 1)^2} \right\rceil \end{equation*}

What’s this equation telling us? Suppose we find a \hat{K}-factor model that perfectly explains the cross-section of expected returns—i.e., a factor model that solves Problem (★). If our empirical analysis used at least N_{\min} assets, then we can be sure that we’ve found the simplest possible model. Whereas, if we only used (N_{\min} - 1) assets, then there might be another asset-pricing model with the same number of factors that perfectly explains the cross-section of expected returns.
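Here is a tiny helper (my own) that computes N_{\min} from the formula above; plugging in F = 97 reproduces the numbers discussed in the next section.

    import math

    def n_min(K, F):
        """Minimum number of assets for a unique K-factor model among F candidates."""
        return math.ceil(F * (2 * K - 1) ** 2 / ((F - 1) + (2 * K - 1) ** 2))

    print([n_min(K, 97) for K in (2, 3, 4)])   # [9, 21, 33]
    print(n_min(33, 6000))                     # roughly 2500 assets for a 33-factor model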

5. Plugging in Numbers

Let’s plug in some numbers to see if our formula for N_{\min} makes any sense. Here’s the first exercise. Suppose that there really are only F=97 candidate risk factors like in McLean and Pontiff (2016). The figure to the right plots the minimum number of assets we’d need to include in our empirical analysis (y-axis) to identify a model with K factors (x-axis). We need to study at least N_{\min} = 9 assets to be sure that we’ve found a unique 2-factor model; we need to study at least N_{\min} = 21 assets to be sure that we’ve found a unique 3-factor model; and, we need to study at least N_{\min} = 33 assets to be sure that we’ve found a unique 4-factor model. This is an interesting exercise because it says that we shouldn’t be surprised if someone found a 4-factor model that perfectly explained the cross-section of expected returns for the 25 size- and value-sorted portfolios used in Fama and French (1993). In other words, since Gene and Ken already claimed the first 3 factors, we can’t do cross-sectional asset pricing tests with just the Fama and French (1993) portfolios any more. Even if we found a 4-factor model that perfectly explained the cross-section of expected returns, there’d be no guarantee that it’d be unique.

Now, let’s consider a second exercise. In a recent NBER working paper, Kozak, Nagel, and Santosh (2017) used a machine-learning rule to identify a cross-sectional asset-pricing model with \hat{K} = 33 factors. For the sake of argument, let’s imagine that this 33-factor model perfectly explained the cross-section of expected returns. The figure to the left plots the minimum number of assets they’d need to include in their empirical analysis (y-axis) to be sure that they’d found the only such 33-factor model in a world with F candidate factors (x-axis). There are roughly 2500 stocks at the moment. So, if their 33-factor model had perfectly explained the cross-section of expected returns, then you should be really excited by this result if you think there are fewer than F \approx 6000 candidate factors.

Let’s do one last exercise before we call it quits. Notice that, as the number of candidate factors gets large, the minimum number of assets we’d need to identify a K-factor model only depends on K:

(1)   \begin{equation*} \lim_{F \to \infty} N_{\min} = (2 \cdot K - 1)^2 \end{equation*}

This observation is cool because, by setting N_{\min} = 2500 and solving for K, we can compute the size of the most complicated factor model that we could hope to uniquely identify with 2500 stocks: K = \lfloor \sfrac{1}{2} \cdot( \sqrt{2500} + 1) \rfloor = 25. No matter how many candidate factors there are, if we find a 25-factor model that perfectly explains the cross-section of expected returns, then we know it is unique if we use the universe of all stocks in our empirical analysis.


Neglecting The Madness Of Crowds

September 3, 2017 by Alex

Motivation

This post is motivated by two stylized facts about bubbles and crashes. The first is that these events are often attributed to the madness of crowds. In popular accounts, they occur when a large number of inexperienced traders floods into the market and mob psychology takes over. For some examples, just think about day traders during the DotCom bubble, out-of-town buyers during the housing bubble, or first-time investors during the Chinese warrant bubble.

The second stylized fact is that, even though bubbles and crashes have a large impact on the market, traders seem to ignore the risk posed by the madness of crowds during normal times. Gripped by “new-era thinking”, they often insist on justifying market events with fundamentals until some sudden price movement forces them to reckon with the madness of crowds. This phenomenon is referred to as “neglected risk” in the asset-pricing literature.

With these two stylized facts in mind, this post investigates how hard it is for traders to learn about aggregate noise-trader demand when the number of noise traders can vary over several orders of magnitude—i.e., when there’s a possibility that the crowd’s gone mad. I find something surprising: it makes sense for existing traders to neglect the madness of crowds during normal times. Here’s the logic. Noise traders push prices away from fundamentals. So, if you don’t see a large unexpected price movement away from fundamentals, then there must not be very many noise traders in the market. And, if there aren’t very many noise traders, then they can’t affect the equilibrium price very much. But, this means that there’s no way for you to learn about aggregate noise-trader demand from the equilibrium price, which means that there’s no reason for you to revise your beliefs about aggregate noise-trader demand away from zero.

To illustrate this point, I’m going to make use of a happy mathematical coincidence. It turns out that, if you assume changes in the number of noise traders are governed by a stochastic version of the logistic growth model (see here, here, or here for examples), then the stationary distribution for the number of noise traders will be Exponential. And, the right way to learn about the mean of a Gaussian random variable whose variance is drawn from an Exponential distribution is to use the LASSO, a penalized regression which delivers point estimates that are precisely zero whenever the unpenalized estimate is sufficiently small.

Inference Problem

Here’s the inference problem I’m going to study. Suppose there’s a stock with price P, and there are N > 0 noise traders present in this market. And, assume that this price is a linear function of three variables:

(1)   \begin{equation*} P = \alpha + \beta \cdot F + \gamma \cdot \{C - S\} \end{equation*}

Above, F denotes the stock’s fundamental value, C \sim \mathrm{Normal}(0, \, N) denotes noise due to the madness of crowds, and S \sim \mathrm{Normal}(0, \, \sigma^2) denotes noise due to random supply shocks. The negative sign on supply noise comes from the fact that more supply means lower prices. You can think about the supply noise as the result of hedging demand or rebalancing cascades. The source doesn’t matter. The key point is that this noise source has constant variance.

Crowd noise is different, though. Its variance is equal to the number of noise traders in the market, N, and this population can change. Suppose there are n = 1, \, \ldots, \, N noise traders, and each individual trader in this crowd has demand that’s iid normal:

(2)   \begin{equation*} c_n \overset{\scriptscriptstyle \text{iid}}{\sim} \mathrm{Normal}(0, \, 1) \end{equation*}

Then, the aggregate demand of the entire crowd of noise traders has distribution:

(3)   \begin{equation*} C  \overset{\scriptscriptstyle \text{def}}{=}  {\textstyle \sum_{n=1}^N} \, c_n   \sim  \mathrm{Normal}(0, \, N) \end{equation*}

In this setting, the rescaled pricing error, \tilde{P} \overset{\scriptscriptstyle \text{def}}{=} \sfrac{1}{\gamma} \cdot \{P - \alpha - \beta \cdot F\}, is a normally distributed signal about the aggregate demand coming from the crowd of noise traders:

(4)   \begin{equation*} \tilde{P} = C - S \end{equation*}

When the price is above its fundamental value, \{P - \alpha - \beta \cdot F\} > 0, it must be because either the crowd of noise traders has high demand, C > 0, or there is unexpectedly low supply, S < 0. The question I want to answer below is: how hard is it for traders to figure out which noise source is responsible?

Bayes’ rule tells us how to learn about the crowd’s aggregate demand from the equilibrium pricing error:

(5)   \begin{equation*} \mathrm{Pr}(C|\tilde{P}) \propto \mathrm{Pr}(\tilde{P}|C) \times {\textstyle \int_0^{\infty}} \, \mathrm{Pr}(C|N) \cdot \mathrm{Pr}(N) \cdot \mathrm{d}N \end{equation*}

Supply noise is normally distributed. This means we know how to calculate \mathrm{Pr}(\tilde{P}|C). So, if we knew the distribution of the number of noise traders in the market, then we could evaluate the remaining integral and solve for the most likely value for the crowd’s aggregate demand given the observed pricing error:

(6)   \begin{equation*} \hat{C}(\tilde{P}) \overset{\scriptscriptstyle \text{def}}{=} \underset{C \in \mathrm{R}}{\arg\min} \left\{ \, - \log \mathrm{Pr}(\tilde{P}|C) \, - \, \log {\textstyle \int_0^{\infty}} \, \mathrm{Pr}(C|N) \cdot \mathrm{Pr}(N) \cdot \mathrm{d}N \, \right\} \end{equation*}

Population Size

There are many ways that you could model the size of the noise-trader crowd. One way to go would be to use a population-dynamics model from the ecology literature, such as the stochastic version of the logistic growth model. This model is specifically designed to explain the unexpected booms and busts that we see in wildlife populations. If we take this approach, then the number of noise traders, N(t), is governed by the following stochastic differential equation:

(7)   \begin{equation*} \mathrm{d}N = \theta \cdot \{ \mu - N \} \cdot N \cdot \mathrm{d}t  -  \delta \cdot N \cdot \mathrm{d}t + \varsigma \cdot N \cdot \mathrm{d}W \end{equation*}

In the equation above, \theta \cdot \{\mu - N\} denotes the rate at which the crowd of noise traders grows, \delta > 0 denotes the rate at which noise traders lose interest, \mathrm{d}W is a Wiener process capturing random fluctuations in the number of noise traders in the crowd, and \varsigma > 0 denotes the volatility of these random fluctuations.

The key property of the logistic growth model is that it’s nonlinear. Population growth, \theta \cdot \{ \mu - N \} \cdot N, is a quadratic function of the number of noise traders as shown in the figure below. This nonlinearity is what allows the model to generate population booms and busts. This nonlinearity will occur if existing noise traders try to recruit their friends to enter the market as well (see here, here, here, or here). \mu > 0 denotes the typical number of noise traders that could potentially be persuaded to join the crowd. And, \theta > 0 captures the intensity with which existing noise traders persuade their remaining \{\mu - N\} friends to join.

Thus, when there are only a handful of noise traders, the crowd grows slowly because there aren’t many traders to do the recruiting. As the crowd gets larger, population growth increases. But, this growth eventually slows down again because, when there are already lots of noise traders in the market, it is hard to increase the size of the crowd because there aren’t many traders left to be recruited, \{\mu - N\} \approx 0.

Because the logistic growth model has been studied for so long, the population-size distribution that it generates is well known. There are two regimes: \theta \cdot \mu < \delta and \theta \cdot \mu > \delta. If \theta \cdot \mu < \delta, then the population of noise traders will eventually die out, \lim_{t\to\infty}N = 0. To see why, think about how the system behaves as the crowd size gets small. When N = \epsilon \approx 0, the crowd grows at an almost linear rate, \theta \cdot \mu \cdot \epsilon +\mathcal{O}(\epsilon^2). So, \theta \cdot \mu < \delta means that, when the crowd gets small enough, existing noise traders lose interest faster than they can recruit their friends, which leads to the end of the crowd, N=0. By contrast, if the growth rate is larger than the rate of decay when N is small, \theta \cdot \mu > \delta, then the population of noise traders will never get all the way to N=0. And, if we rescale the units so that \sfrac{\varsigma^2}{2} = \theta \cdot \mu - \delta, then we will find that the number of noise traders in the crowd is governed by an Exponential distribution with rate parameter \sfrac{\lambda^2}{2}:

(8)   \begin{equation*} N \sim \mathrm{Exponential}(\sfrac{\lambda^2}{2}) \qquad \text{where} \qquad \lambda = \sqrt{{\textstyle \frac{2}{\mu} \cdot \left\{ \frac{\theta \cdot \mu}{\theta \cdot \mu - \delta} \right\}}} \end{equation*}

The figure above shows the probability-density function for an Exponential distribution. It illustrates how, when the rate parameter is larger, the size of the noise-trader crowd tends to be smaller. And, the functional form for \lambda reveals that the rate parameter will be largest as \theta \cdot \mu \searrow \delta—i.e., when the growth rate is barely larger than the decay rate when the crowd size is small. Makes sense.

Neglected Risk

We now have our distribution for the number of noise traders in the market. So, we can return to our original inference problem and try to solve for the most likely value of the crowd’s aggregate demand given the observed pricing error, \hat{C}(\tilde{P}). It turns out that, if the variance of the crowd’s aggregate demand is drawn from an Exponential distribution, then it’s easy to solve the integral in Equation (6). Andrews and Mallows (1974) show that:

(9)   \begin{equation*} \frac{\lambda}{2} \cdot e^{- \, \lambda \cdot |C|} = \int_0^\infty  \, \underset{C|N \sim \mathrm{Normal}(0, \, N)}{ \left\{ \frac{1}{\sqrt{2 \cdot \pi \cdot N}} \cdot e^{-\, \frac{\{C-0\}^2}{2 \cdot N}} \right\} } \cdot \underset{N \sim \mathrm{Exponential}(\sfrac{\lambda^2}{2})}{ \left\{ \frac{\lambda^2}{2} \cdot e^{-\,\frac{\lambda^2}{2} \cdot N} \right\} } \cdot  \mathrm{d}N \end{equation*}
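If you want to convince yourself of this identity, a quick numerical check (my own sketch, with an arbitrary rate parameter) integrates the right-hand side and compares it to the Laplace density on the left.

    import numpy as np
    from scipy.integrate import quad

    lam = 1.5                                           # arbitrary rate parameter

    def mixture_density(C):
        """Integrate Normal(C; 0, N) against the Exponential(lam^2/2) density for N."""
        integrand = lambda N: (np.exp(-C**2 / (2 * N)) / np.sqrt(2 * np.pi * N)) \
                              * (lam**2 / 2) * np.exp(-(lam**2 / 2) * N)
        return quad(integrand, 0, np.inf)[0]

    for C in (0.3, 0.7, 2.0):
        print(mixture_density(C), (lam / 2) * np.exp(-lam * abs(C)))   # the two columns agree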

So, if we set \sfrac{\lambda}{2} \cdot e^{-\lambda \cdot |C|} = {\textstyle \int_0^{\infty}} \, \mathrm{Pr}(C|N) \cdot \mathrm{Pr}(N) \cdot \mathrm{d}N, then the optimization problem in Equation (6) simplifies:

(10)   \begin{equation*} \hat{C}(\tilde{P}) = \underset{C \in \mathrm{R}}{\arg\min} \left\{ \, \frac{1}{2 \cdot \sigma^2} \cdot \{\tilde{P} - C\}^2  +  \lambda \cdot | C | \, \right\} \end{equation*}

This simplification is really cool because the optimization problem above is just the optimization problem for the LASSO with a penalty parameter of \sigma^2 \cdot \lambda (see Park and Casella).

There’s something weird about this result, though. Using the LASSO implies that:

(11)   \begin{equation*} \hat{C}(\tilde{P}) = \mathrm{Sign}(\tilde{P}) \cdot  \left\{ \, |\tilde{P}|  -  \sigma^2 \cdot \lambda \, \right\}_+ \end{equation*}

Above, \{ x \}_+ = x if x > 0 and 0 otherwise. So, if the observed pricing error is relatively small, |\tilde{P}| < \sigma^2 \cdot \lambda, then a fully-rational trader will walk away from the market believing that \hat{C}(\tilde{P}) = 0. He will completely neglect the risk coming from the crowd of noise traders.
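To see Equation (11) in action, here is a short sketch (my own; the parameter values are arbitrary) that implements the soft-threshold rule and checks it against a brute-force minimization of the penalized objective in Equation (10).

    import numpy as np
    from scipy.optimize import minimize_scalar

    sigma, lam = 1.0, 2.0   # hypothetical supply-noise and crowd-size parameters

    def c_hat(p_tilde):
        """Soft-threshold estimate of crowd demand from Equation (11)."""
        return np.sign(p_tilde) * max(abs(p_tilde) - sigma**2 * lam, 0.0)

    def c_hat_numeric(p_tilde):
        """Directly minimize the penalized objective in Equation (10)."""
        obj = lambda c: (p_tilde - c) ** 2 / (2 * sigma**2) + lam * abs(c)
        return minimize_scalar(obj, bounds=(-10, 10), method="bounded").x

    for p in (-3.0, -1.0, 0.5, 1.5, 4.0):
        print(p, c_hat(p), round(c_hat_numeric(p), 3))
    # Pricing errors smaller than sigma^2 * lam in absolute value map to exactly zero.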

Here’s the logic behind this neglect. If the pricing error is small, then there must not be very many noise traders in the market. If there aren’t very many noise traders, then they can’t affect the equilibrium price very much. And, this means that there’s no way for traders to learn about aggregate noise-trader demand from the equilibrium price, which means that there’s no reason for them to revise their beliefs about noise-trader demand away from zero after seeing the price.

What’s more, this line of reasoning is consistent with the functional form for the LASSO’s penalty parameter, \sigma^2 \cdot \lambda. This expression says that traders will ignore the madness of crowds in the face of more extreme pricing errors (larger values of |\tilde{P}|) when either the crowd of noise traders tends to be smaller (\lambda is larger) or it’s easier to explain pricing errors with supply noise (\sigma^2 is larger). And, this basic intuition should carry over to other situations where the size of the crowd of noise traders has some other fat-tailed distribution rather than an Exponential distribution.

