Research Notebook

Neglecting The Madness Of Crowds

September 3, 2017 by Alex

Motivation

This post is motivated by two stylized facts about bubbles and crashes. The first is that these events are often attributed to the madness of crowds. In popular accounts, they occur when a large number of inexperienced traders floods into the market and mob psychology takes over. For some examples, just think about day traders during the DotCom bubble, out-of-town buyers during the housing bubble, or first-time investors during the Chinese warrant bubble.

The second stylized fact is that, even though bubbles and crashes have a large impact on the market, traders seem to ignore the risk posed by the madness of crowds during normal times. Gripped by “new-era thinking”, they often insist on justifying market events with fundamentals until some sudden price movement forces them to reckon with the madness of crowds. This phenomenon is referred to as “neglected risk” in the asset-pricing literature.

With these two stylized facts in mind, this post investigates how hard it is for traders to learn about aggregate noise-trader demand when the number of noise traders can vary over several orders of magnitude—i.e., when there’s a possibility that the crowd’s gone mad. I find something surprising: it makes sense for existing traders to neglect the madness of crowds during normal times. Here’s the logic. Noise traders push prices away from fundamentals. So, if you don’t see a large unexpected price movement away from fundamentals, then there must not be very many noise traders in the market. And, if there aren’t very many noise traders, then they can’t affect the equilibrium price very much. But, this means that there’s no way for you to learn about aggregate noise-trader demand from the equilibrium price, which means that there’s no reason for you to revise your beliefs about aggregate noise-trader demand away from zero.

To illustrate this point, I’m going to make use of a happy mathematical coincidence. It turns out that, if you assume changes in the number of noise traders are governed by a stochastic version of the logistic growth model (see here, here, or here for examples), then the stationary distribution for the number of noise traders will be Exponential. And, the right way to learn about the mean of a Gaussian random variable whose variance is drawn from an Exponential distribution is to use the LASSO, a penalized regression which delivers point estimates that are precisely zero whenever the unpenalized estimate is sufficiently small.

Inference Problem

Here’s the inference problem I’m going to study. Suppose there’s a stock with price P, and there are N > 0 noise traders present in this market. And, assume that this price is a linear function of three variables:

(1)   \begin{equation*} P = \alpha + \beta \cdot F + \gamma \cdot \{C - S\} \end{equation*}

Above, F denotes the stock’s fundamental value, C \sim \mathrm{Normal}(0, \, N) denotes noise due to the madness of crowds, and S \sim \mathrm{Normal}(0, \, \sigma^2) denotes noise due to random supply shocks. The negative sign on supply noise comes from the fact that more supply means lower prices. You can think about the supply noise as the result of hedging demand or rebalancing cascades. The source doesn’t matter. The key point is that this noise source has constant variance.

Crowd noise is different, though. Its variance is equal to the number of noise traders in the market, N, and this population can change. Suppose there are n = 1, \, \ldots, \, N noise traders, and each individual trader in this crowd has demand that’s iid normal:

(2)   \begin{equation*} c_n \overset{\scriptscriptstyle \text{iid}}{\sim} \mathrm{Normal}(0, \, 1) \end{equation*}

Then, the aggregate demand of the entire crowd of noise traders has distribution:

(3)   \begin{equation*} C  \overset{\scriptscriptstyle \text{def}}{=}  {\textstyle \sum_{n=1}^N} \, c_n   \sim  \mathrm{Normal}(0, \, N) \end{equation*}

In this setting, the rescaled pricing error, \tilde{P} \overset{\scriptscriptstyle \text{def}}{=} \sfrac{1}{\gamma} \cdot \{P - \alpha - \beta \cdot F\}, is a normally distributed signal about the aggregate demand coming from the crowd of noise traders:

(4)   \begin{equation*} \tilde{P} = C - S \end{equation*}

When the price is above its fundamental value, \{P - \alpha - \beta \cdot F\} > 0, it must be because either the crowd of noise traders has high demand, C > 0, or there is unexpectedly low supply, S < 0. The question I want to answer below is: how hard is it for traders to figure out which noise source is responsible?

Bayes rule tells us how to learn about the crowd’s aggregate demand from the equilibrium pricing error:

(5)   \begin{equation*} \mathrm{Pr}(C|\tilde{P}) \propto \mathrm{Pr}(\tilde{P}|C) \times {\textstyle \int_0^{\infty}} \, \mathrm{Pr}(C|N) \cdot \mathrm{Pr}(N) \cdot \mathrm{d}N \end{equation*}

Supply noise is normally distributed. This means we know how to calculate \mathrm{Pr}(\tilde{P}|C). So, if we knew the distribution of the number of noise traders in the market, then we could evaluate the remaining integral and solve for the most likely value for the crowd’s aggregate demand given the observed pricing error:

(6)   \begin{equation*} \hat{C}(\tilde{P}) \overset{\scriptscriptstyle \text{def}}{=} \underset{C \in \mathrm{R}}{\arg\min} \left\{ \, - \log \mathrm{Pr}(\tilde{P}|C) - \log {\textstyle \int_0^{\infty}} \, \mathrm{Pr}(C|N) \cdot \mathrm{Pr}(N) \cdot \mathrm{d}N \, \right\} \end{equation*}

Population Size

There are many ways that you could model the size of the noise-trader crowd. One way to go would be to use a population-dynamics model from the ecology literature, such as the stochastic version of the logistic growth model. This model is specifically designed to explain the unexpected booms and busts that we see in wildlife populations. If we take this approach, then the number of noise traders, N(t), is governed by the following stochastic differential equation:

(7)   \begin{equation*} \mathrm{d}N = \theta \cdot \{ \mu - N \} \cdot N \cdot \mathrm{d}t  -  \delta \cdot N \cdot \mathrm{d}t + \varsigma \cdot N \cdot \mathrm{d}W \end{equation*}

In the equation above, \theta \cdot \{\mu - N\} denotes the rate at which the crowd of noise traders grows, \delta > 0 denotes the rate at which noise traders lose interest, \mathrm{d}W is a Wiener process capturing random fluctuations in the number of noise traders in the crowd, and \varsigma > 0 denotes the volatility of these random fluctuations.

The key property of the logistic growth model is that it’s nonlinear. Population growth, \theta \cdot \{ \mu - N \} \cdot N, is a quadratic function of the number of noise traders as shown in the figure below. This nonlinearity is what allows the model to generate population booms and busts, and it arises naturally if existing noise traders try to recruit their friends to enter the market as well (see here, here, here, or here). \mu > 0 denotes the typical number of noise traders that could potentially be persuaded to join the crowd. And, \theta > 0 captures the intensity with which existing noise traders persuade their remaining \{\mu - N\} friends to join.

Thus, when there are only a handful of noise traders, the crowd grows slowly because there aren’t many traders to do the recruiting. As the crowd gets larger, population growth increases. But, this growth eventually slows down again because, when there are already lots of noise traders in the market, it is hard to increase the size of the crowd because there aren’t many traders left to be recruited, \{\mu - N\} \approx 0.

Because the logistic growth model has been studied for so long, the population-size distribution that it generates is well known. There are two regimes: \theta \cdot \mu < \delta and \theta \cdot \mu > \delta. If \theta \cdot \mu < \delta, then the population of noise traders will eventually die out, \lim_{t\to\infty}N = 0. To see why, think about how the system behaves as the crowd size gets small. When N = \epsilon \approx 0, the crowd grows at an almost linear rate, \theta \cdot \mu \cdot \epsilon +\mathcal{O}(\epsilon^2). So, \theta \cdot \mu < \delta means that, when the crowd gets small enough, existing noise traders lose interest faster than they can recruit their friends, which leads to the end of the crowd, N=0. By contrast, if the growth rate is larger than the rate of decay when N is small, \theta \cdot \mu > \delta, then the population of noise traders will never get all the way to N=0. And, if we rescale the units so that \sfrac{\varsigma^2}{2} = \theta \cdot \mu - \delta, then we will find that the number of noise traders in the crowd will be governed by an Exponential distribution with rate parameter \sfrac{\lambda^2}{2}:

(8)   \begin{equation*} N \sim \mathrm{Exponential}(\sfrac{\lambda^2}{2}) \qquad \text{where} \qquad \lambda = \sqrt{{\textstyle \frac{2}{\mu} \cdot \left\{ \frac{\theta \cdot \mu}{\theta \cdot \mu - \delta} \right\}}} \end{equation*}

The figure above shows the probability-density function for an Exponential distribution. It illustrates how, when the rate parameter is larger, the size of the noise-trader crowd tends to be smaller. And, the functional form for \lambda reveals that the rate parameter will be largest as \theta \cdot \mu \searrow \delta—i.e., when the growth rate is barely larger than the decay rate when the crowd size is small. Makes sense.
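
To make this concrete, here’s a minimal numerical sketch in Python (the parameter values are made up for illustration) showing how the rate parameter from Equation (8), and hence the typical crowd size, behaves as \theta \cdot \mu approaches \delta:

    import numpy as np

    def exp_rate(theta, mu, delta):
        """Rate parameter lambda^2/2 implied by Equation (8)."""
        lam = np.sqrt((2 / mu) * (theta * mu / (theta * mu - delta)))
        return lam**2 / 2

    # Illustrative values: hold mu and delta fixed, let theta*mu fall toward delta.
    mu, delta = 10.0, 1.0
    for theta in [0.50, 0.20, 0.12, 0.105]:
        rate = exp_rate(theta, mu, delta)
        mean_crowd = 1 / rate            # mean of an Exponential(rate) random variable
        print(f"theta*mu = {theta * mu:5.2f}   rate = {rate:5.3f}   mean crowd size = {mean_crowd:6.2f}")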

Neglected Risk

We now have our distribution for the number of noise traders in the market. So, we can return to our original inference problem and try to solve for the most likely value of the crowd’s aggregate demand given the observed pricing error, \hat{C}(\tilde{P}). It turns out that, if the variance of the crowd’s aggregate demand is drawn from an Exponential distribution, then it’s easy to solve the integral in Equation (6). Andrews and Mallows show that:

(9)   \begin{equation*} \frac{\lambda}{2} \cdot e^{- \, \lambda \cdot |C|} = \int_0^\infty  \, \underset{C|N \sim \mathrm{Normal}(0, \, N)}{ \left\{ \frac{1}{\sqrt{2 \cdot \pi \cdot N}} \cdot e^{-\, \frac{\{C-0\}^2}{2 \cdot N}} \right\} } \cdot \underset{N \sim \mathrm{Exponential}(\sfrac{\lambda^2}{2})}{ \left\{ \frac{\lambda^2}{2} \cdot e^{-\,\frac{\lambda^2}{2} \cdot N} \right\} } \cdot  \mathrm{d}N \end{equation*}
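
Here’s a small sketch in Python (the value of \lambda is arbitrary) that checks this scale-mixture identity by numerical integration:

    import numpy as np
    from scipy.integrate import quad

    def mixture_density(C, lam):
        """Right-hand side of Equation (9): a Normal(0, N) density integrated
        against an Exponential(lambda^2/2) distribution for the variance N."""
        integrand = lambda N: (np.exp(-C**2 / (2 * N)) / np.sqrt(2 * np.pi * N)
                               * (lam**2 / 2) * np.exp(-(lam**2 / 2) * N))
        value, _ = quad(integrand, 0, np.inf)
        return value

    lam = 1.5
    for C in [0.25, 0.5, 1.0, 2.0]:
        laplace = (lam / 2) * np.exp(-lam * abs(C))     # left-hand side of Equation (9)
        print(f"C = {C:4.2f}:  Laplace density = {laplace:.6f},  mixture integral = {mixture_density(C, lam):.6f}")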

So, if we set \sfrac{\lambda}{2} \cdot e^{-\lambda \cdot |C|} = {\textstyle \int_0^{\infty}} \, \mathrm{Pr}(C|N) \cdot \mathrm{Pr}(N) \cdot \mathrm{d}N, then the optimization problem in Equation (6) simplifies:

(10)   \begin{equation*} \hat{C}(\tilde{P}) = \underset{C \in \mathrm{R}}{\arg\min} \left\{ \, \frac{1}{2 \cdot \sigma^2} \cdot \{\tilde{P} - C\}^2  +  \lambda \cdot | C | \, \right\} \end{equation*}

This simplification is really cool because the optimization problem above is just the optimization problem for the LASSO with a penalty parameter of \sigma^2 \cdot \lambda (see Park and Casella).

There’s something weird about this result, though. Using the LASSO implies that:

(11)   \begin{equation*} \hat{C}(\tilde{P}) = \mathrm{Sign}(\tilde{P}) \cdot  \left\{ \, |\tilde{P}|  -  \sigma^2 \cdot \lambda \, \right\}_+ \end{equation*}

Above, \{ x \}_+ = x if x > 0 and 0 otherwise. So, if the observed pricing error is relatively small, |\tilde{P}| < \sigma^2 \cdot \lambda, then a fully-rational trader will walk away from the market believing that \hat{C}(\tilde{P}) = 0. He will completely neglect the risk coming from the crowd of noise traders.
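
A minimal sketch of this estimator in Python (with placeholder values for \sigma^2 and \lambda) makes the neglect easy to see: small pricing errors map to an estimated crowd demand of exactly zero.

    import numpy as np

    def crowd_demand_estimate(P_tilde, sigma2, lam):
        """Equation (11): soft-threshold the pricing error at sigma^2 * lambda."""
        return np.sign(P_tilde) * np.maximum(np.abs(P_tilde) - sigma2 * lam, 0.0)

    sigma2, lam = 1.0, 1.5          # placeholder values, not calibrated to anything
    for P_tilde in [0.2, 1.0, 1.4, 2.0, 5.0, -3.0]:
        c_hat = crowd_demand_estimate(P_tilde, sigma2, lam)
        print(f"pricing error {P_tilde:5.2f}  ->  estimated crowd demand {c_hat:5.2f}")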

Here’s the logic behind this neglect. If the pricing error is small, then there must not be very many noise traders in the market. If there aren’t very many noise traders, then they can’t affect the equilibrium price very much. And, this means that there’s no way for traders to learn about aggregate noise-trader demand from the equilibrium price, which means that there’s no reason for them to revise their beliefs about noise-trader demand away from zero after seeing the price.

What’s more, this line of reasoning is consistent with the functional form for the LASSO’s penalty parameter, \sigma^2 \cdot \lambda. This expression says that traders will ignore the madness of crowds in the face of more extreme pricing errors (larger values of |\tilde{P}|) when either the crowd of noise traders tends to be smaller (\lambda is larger) or it’s easier to explain pricing errors with supply noise (\sigma^2 is larger). And, this basic intuition should carry over to other situations where the size of the crowd of noise traders has some other fat-tailed distribution rather than an Exponential distribution.


A Tell-Tale Sign of Short-Run Trading

January 26, 2017 by Alex

Motivation

Trading has gotten a lot faster over the last two decades. The term “short-run trader” used to refer to people who traded multiple times a day. Now, it refers to algorithms that trade multiple times a second.

Some people are worried about this new breed of short-run trader making stock prices less informative about firm fundamentals by trading so often on information that’s unrelated to companies’ long-term prospects. But, this is a red herring. By the same logic, market-neutral strategies should be making market indexes like the Russell 3000 less informative about macro fundamentals. And, no one believes this.

Short-run traders aren’t ignoring fundamentals; they’re learning about fundamentals before everyone else by studying order flow. And, this post shows why they also make trading volume look asymmetric and lumpy as a result. The logic is simple. If short-run traders get additional information from order flow, then they’ll use this information to cluster their trading at times when everyone else is moving in the opposite direction.

Benchmark Model

Consider a market with a single company that’s going to pay a dividend, d_t, in future periods t = 1, \, 2, \, \ldots And, suppose that there’s a unit mass of small agents, i \in (0, \, 1], who have noisy priors about these dividends,

    \begin{equation*} d_t \overset{\scriptscriptstyle \text{iid}}{\sim} \mathrm{N}( \mu_t^{(i)}, \, \sfrac{\sigma^2\!}{2} ) \quad \text{for some} \quad  \sigma^2 > 0, \end{equation*}

which are correct on average, d_t = \int_0^1 \mu_t^{(i)} \cdot \mathrm{d}i. This assumption means that, by aggregating agents’ demand, equilibrium prices can contain information about dividends that isn’t known by any individual agent.

This unit mass of agents is split into two different groups: long-term investors and short-run traders. At time t=0, each group of agents trades shares in its own separate fund, f \in \{L, \, H\}, that offers frequency-specific exposure to the company’s dividends at times t=1,\,2. The long-term investors, i \in (0, \, \sfrac{1}{2}], trade the low-frequency fund which has a payout:

    \begin{equation*} d_L \overset{\scriptscriptstyle \text{def}}{=} d_1 + d_2 \end{equation*}

And, the short-run traders, i \in (\sfrac{1}{2}, \, 1], trade the high-frequency fund which has a payout:

    \begin{equation*} d_H \overset{\scriptscriptstyle \text{def}}{=} d_1 - d_2 \end{equation*}

At time t=0, each agent observes the equilibrium price of a frequency-specific fund, p_f, and then chooses the number of shares to buy, x_f^{(i)}, in order to maximize his expected utility at the end of time t=2:

(1)   \begin{equation*} \max_{x_f^{(i)}} \mathrm{E}^{(i)}\left[ \, - e^{ \, - \alpha \cdot \{d_f - p_f\} \cdot x_f^{(i)}  \, } \, \middle| \, p_f \, \right] \quad \text{for some} \quad \alpha > 0 \end{equation*}

Above, \mathrm{E}^{(i)}[\cdot|p_f] denotes agent i’s conditional expectation, and \alpha denotes his risk-aversion parameter.

Let z_t denote the number of shares of the company’s stock that are available for purchase at time t. We say that “markets clear” at time t=0 if the dividend payout from each available share at times t=1,\,2 has been unambiguously assigned to exactly one trader via their fund holdings:

(2)   \begin{align*} {\textstyle \int_0^{\sfrac{1}{2}}} x_L^{(i)} \cdot \mathrm{d}i  + {\textstyle \int_{\sfrac{1}{2}}^1} x_H^{(i)} \cdot \mathrm{d}i &=  z_1 \\ \text{and} \quad {\textstyle \int_0^{\sfrac{1}{2}}} x_L^{(i)} \cdot \mathrm{d}i  - {\textstyle \int_{\sfrac{1}{2}}^1} x_H^{(i)} \cdot \mathrm{d}i &=  z_2 \end{align*}

Let z_L \overset{\scriptscriptstyle \text{def}}{=} z_1 + z_2 and z_H \overset{\scriptscriptstyle \text{def}}{=} z_1 - z_2 denote the number of available shares at each frequency.

An equilibrium then consists of a demand rule, \mathrm{X}(\mathrm{E}^{(i)}[d_f|p_f], \, p_f) = x_f^{(i)}, and a price function, \mathrm{P}(d_f, \, z_f) = p_f, such that 1) demand maximizes the expected utility of each agent given the price and 2) markets clear.

Because the equilibrium price of each fund only depends on its promised payout and the number of available shares, if agents knew the number of available shares, then they could reverse engineer a fund’s future payout at times t=1, \, 2 by studying its equilibrium price at time t=0. And, an equilibrium in such a market wouldn’t be well-defined. So, to make sure that equilibrium prices at time t=0 aren’t fully revealing, let’s assume that the number of available shares in each period is a random variable:

    \begin{equation*} z_t \overset{\scriptscriptstyle \text{iid}}{\sim} \mathrm{N}( 0, \, \sfrac{1}{2}) \end{equation*}

This means thinking about “available” shares as shares that haven’t already been purchased by noise traders.

The key fact about this benchmark model is that agents don’t use order-flow information at time t=1 to update their time t=0 beliefs. As a result, the model really isn’t about short-run traders in spite of how the variables are named. Crouzet/Dew-Becker/Nathanson show that with some clever relabeling you could just as easily think about the low- and high-frequency funds as index and market-neutral funds, respectively.

Trading Volume

Although agents are only active at time t=0 in the benchmark model, each fund has to trade at times t=1,\,2 in order to deliver frequency-specific payouts. Let’s use x_L and x_H to denote aggregate demand:

    \begin{align*} x_L &\overset{\scriptscriptstyle \text{def}}{=} {\textstyle \int_0^{\sfrac{1}{2}}} x_L^{(i)} \cdot \mathrm{d}i \\ \text{and} \quad x_H &\overset{\scriptscriptstyle \text{def}}{=} {\textstyle \int_{\sfrac{1}{2}}^1} \, x_H^{(i)} \cdot \mathrm{d}i \end{align*}

To deliver d_L to every one of its shareholders, the low-frequency fund has to buy x_L shares of the company’s stock between times t=0 and t=1 and then liquidate this position between t=2 and t=3. And, to deliver d_H to every one of its shareholders, the high-frequency fund has to buy x_H shares between times t=0 and t=1, sell 2 \cdot x_H shares between t=1 and t=2, and then buy back x_H shares between times t=2 and t=3.

So, trading volume in the benchmark model is:

    \begin{align*} \mathit{vlm}_{0|1} &\overset{\scriptscriptstyle \text{def}}{=} |x_L| + |x_H| \\ \mathit{vlm}_{1|2} &\overset{\scriptscriptstyle \text{def}}{=} 2 \cdot |x_H| \\ \text{and} \quad \mathit{vlm}_{2|3} &\overset{\scriptscriptstyle \text{def}}{=} |x_L| + |x_H| \end{align*}

The key thing to notice is that trading volume is symmetric, \mathit{vlm}_{0|1} = \mathit{vlm}_{2|3}, because short-run traders don’t get any new information after time t=0.

Model Solution

So, how many shares of each frequency-specific fund are agents going to demand in the benchmark model? To solve the model and answer this question, let’s first guess that the price function is linear:

    \begin{equation*} \mathrm{P}(d_f, \, z_f) = d_f - \sqrt{\sfrac{\sigma^2\!}{\mathit{SNR}}} \cdot z_f \quad \text{for some} \quad \mathit{SNR} > 0 \end{equation*}

This guess introduces a new parameter, \mathit{SNR}, which represents the signal-to-noise ratio of fund prices at time t=0. If this parameter is large, then the time t=0 prices of the low- and high-frequency funds will reveal a lot of the information about the company’s time t=1, \, 2 dividend payouts.

Here’s the upshot of this guess. It implies that each fund’s price is a normally-distributed signal about its future payout, d_f \sim \mathrm{N}(p_f, \, \sfrac{\sigma^2\!}{\mathit{SNR}}). And, with normally-distributed signals, we know how to compute agents’ posterior beliefs about d_f after seeing p_f:

    \begin{equation*} \mathrm{E}^{(i)}[d_f|p_f] =  {\textstyle \left\{ \frac{1}{1 + \mathit{SNR}}\right\}} \cdot \mu_f^{(i)} + {\textstyle \left\{ \frac{\mathit{SNR}}{1 + \mathit{SNR}}\right\}} \cdot p_f \end{equation*}

Above, \mu_f^{(i)} denotes agent i’s priors about a particular fund, either \mu_L^{(i)} \overset{\scriptscriptstyle \text{def}}{=} \mu_1^{(i)} + \mu_2^{(i)} or \mu_H^{(i)} \overset{\scriptscriptstyle \text{def}}{=} \mu_1^{(i)} - \mu_2^{(i)}. And, we can then use these posterior beliefs to compute agents’ equilibrium demand rule by solving the first-order condition of Equation (1) with respect to x_f^{(i)}:

(3)   \begin{equation*} \mathrm{X}(\mathrm{E}^{(i)}[d_f|p_f], \, p_f) = \{1 + \mathit{SNR}\} \cdot {\textstyle \big\{ \frac{\mathrm{E}^{(i)}[d_f|p_f] - p_f}{\sigma} \big\}} \cdot {\textstyle \big\{ \frac{1}{\alpha \cdot \sigma} \big\}} \end{equation*}

Finally, to verify that our original guess about a linear price function was correct, we can plug this equilibrium demand rule into the market-clearing conditions in Equation (2) and solve for p_f:

    \begin{equation*} \mathrm{P}(d_f, \, z_f) = d_f - \alpha \cdot \sigma^2 \cdot z_f \end{equation*}

The resulting price function is indeed linear, so our solution is internally consistent (…though not unique). And, by matching coefficients, we can solve for the equilibrium signal-to-noise ratio, \mathit{SNR} = \{ \alpha \cdot \sigma \}^{-2}. In other words, fund prices at time t=0 reveal more information about a company’s dividend at times t=1, \, 2 when agents are less risk averse (\alpha small) or when they have more precise priors (\sigma small).
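
To see that this guessed price function really does clear markets, here’s a minimal Monte Carlo sketch in Python. The parameter values and the cross-agent dispersion of priors are illustrative assumptions; only the average prior matters for market clearing.

    import numpy as np

    rng = np.random.default_rng(0)
    alpha, sigma = 2.0, 0.5               # illustrative risk aversion and prior dispersion
    SNR = 1 / (alpha * sigma)**2          # equilibrium signal-to-noise ratio
    I = 200_000                           # agents used to approximate each mass-1/2 group

    d = rng.normal(0.0, 1.0, size=2)              # dividends d_1, d_2
    z = rng.normal(0.0, np.sqrt(0.5), size=2)     # available shares z_1, z_2 ~ N(0, 1/2)
    d_L, d_H = d[0] + d[1], d[0] - d[1]
    z_L, z_H = z[0] + z[1], z[0] - z[1]

    # Prices from the verified linear price function, P(d_f, z_f) = d_f - alpha*sigma^2*z_f.
    p_L = d_L - alpha * sigma**2 * z_L
    p_H = d_H - alpha * sigma**2 * z_H

    def aggregate_demand(d_f, p_f):
        # Priors are correct on average; their cross-agent dispersion is arbitrary here.
        mu_f = rng.normal(d_f, sigma, size=I)
        posterior = mu_f / (1 + SNR) + p_f * SNR / (1 + SNR)       # Bayesian update
        x_f = (1 + SNR) * (posterior - p_f) / (alpha * sigma**2)   # demand rule, Equation (3)
        return 0.5 * x_f.mean()                                    # integrate over a mass-1/2 group

    x_L, x_H = aggregate_demand(d_L, p_L), aggregate_demand(d_H, p_H)
    print("x_L + x_H =", round(x_L + x_H, 3), "  vs  z_1 =", round(z[0], 3))
    print("x_L - x_H =", round(x_L - x_H, 3), "  vs  z_2 =", round(z[1], 3))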

Order-Flow Info

How would this solution have to change if short-run traders could learn from time t=1 order flow?

For markets to clear at time t=1, the aggregate demand for the low-frequency fund plus the aggregate demand for the high-frequency fund has to equal the total number of available shares, x_L + x_H = z_1. And, from Equation (3), we know that the aggregate demand for the low-frequency fund is related to the company’s total dividend payout:

    \begin{equation*} x_L = \sqrt{\sfrac{\mathit{SNR}}{\sigma^2}} \cdot \{ d_L - p_L \} \end{equation*}

So, by looking at the time t=1 order flow, short-run traders can get a signal about the company’s time t=2 dividend since d_2 = d_L - d_1:

    \begin{equation*} d_2 \sim \mathrm{N}\left( \{p_L - d_1\} - \sqrt{\sfrac{\sigma^2\!}{\mathit{SNR}}} \cdot x_H, \, \sfrac{\sigma^2\!}{\mathit{SNR}} \right), \end{equation*}

And, this information about the time t=2 dividend is helpful since \mathrm{Cov}[d_2, \, d_H] = - \,\sigma^2.

With this additional signal, the short-run traders who previously invested in the high-frequency fund would now rather trade the company’s stock directly at times t=1, \, 2. Let \tilde{x}_H denote their demand at time t=2. When they observe a high price for the low-frequency asset at time t=0, p_L > 0, and a large dividend payout at time t=1, d_1 > 0, they know that d_2 is likely small. And, as a result, they’ll short more shares at time t=2, |\tilde{x}_H| > |x_H|. By contrast, when they observe a high price for the low-frequency asset at time t=0, p_L > 0, and a small dividend payout at time t=1, d_1 < 0, they know that d_2 is likely large. So, they’ll short fewer shares at time t=2, |\tilde{x}_H| < |x_{H}|.

Either way, trading volume is going to look asymmetric and lumpy as a result, with relatively more of the trading volume clustered at one of the end points. If p_L > 0 and d_1 > 0, then relatively more of the trading volume will occur between time t=2 and t=3 because \mathit{vlm}_{0|1} is unchanged and:

    \begin{equation*} \mathit{vlm}_{2|3} = |x_L| + |x_H| < |x_L| + |\tilde{x}_H| \overset{\scriptscriptstyle \text{def}}{=} \widetilde{\mathit{vlm}}_{2|3} \end{equation*}

Whereas, if p_L > 0 and d_1 < 0, then relatively more of the trading volume will occur between time t=0 and t=1 because \mathit{vlm}_{0|1} is unchanged and now \mathit{vlm}_{2|3} > \widetilde{\mathit{vlm}}_{2|3}.

What’s more, to long-term investors who can’t see short-run order flow, short-run traders are going to add execution risk. The price at which the low-frequency fund executes its time t=2 orders will now vary. And, this variation will be related to the magnitude (but not the sign) of their time t=0 demand.

Finally, note that this analysis shows why it’s easier to model indexers and stock pickers than long-term investors and short-run traders. An equilibrium in either model has to contain a demand rule and a price function (e.g., see the setup in the benchmark model). But, an equilibrium in a model with multi-frequency trade also has to contain a rule for how long-term investors think short-run traders will affect their order execution. And, this rule is the crux of any model with long-term investors and short-run traders.


The Tension Between Learning and Predicting

January 24, 2017 by Alex

1. Motivation

Imagine we’re traders in a market where the cross-section of returns is related to V \geq 1 variables:

    \begin{align*} r_s = \alpha^\star + {\textstyle \sum_v} \, \beta_v^{\star} \cdot x_{s,v} + \epsilon_s^{\star}. \end{align*}

In the equation above, \alpha^\star denotes the mean return, and each \beta_v^\star captures the relationship between returns and the vth right-hand-side variable. Some notation: I’m going to be using a “star” to denote true parameter values and a “hat” to denote in-sample estimates of these values. e.g., \hat{\beta}_v denotes an in-sample estimate of \beta_v^\star. To make things simple, let’s assume that \sum_s x_{s,v} = 0, \sum_s x_{s,v}^2 = S, \sum_s x_{s,v} \cdot x_{s,v'} = 0, and \epsilon_s^{\star} \overset{\scriptscriptstyle \text{iid}}{\sim} \mathrm{N}(0, \, \sigma^2).

Notice that learning about the most likely parameter values,

    \begin{align*} \{ \hat{\alpha}^{\text{ML}}, \, \hat{\beta}_1^{\text{ML}}, \, \ldots, \, \hat{\beta}_V^{\text{ML}} \} &= \arg \max_{\alpha, \, \beta_1, \, \ldots, \, \beta_V}  \left\{ \, \mathrm{Pr}(\mathbf{r} | \{\alpha, \, \beta_1, \, \ldots, \, \beta_V\}) \, \right\}, \end{align*}

is really easy in this setting because \epsilon_s^{\star} \overset{\scriptscriptstyle \text{iid}}{\sim} \mathrm{N}(0, \, \sigma^2). These maximum-likelihood estimates are just the coefficients from a cross-sectional regression,

    \begin{align*} r_s = \hat{\alpha}^{\text{ML}} + {\textstyle \sum_v} \, \hat{\beta}_v^{\text{ML}} \cdot x_{s,v} + \hat{\epsilon}_s^{\text{ML}}. \end{align*}

So, finding \{ \hat{\alpha}^{\text{ML}}, \, \hat{\beta}_1^{\text{ML}}, \, \ldots, \, \hat{\beta}_V^{\text{ML}} \} is a homework question from Econometrics 101. Nothing could be simpler.

But, what if we’re interested in predicting returns,

    \begin{align*} \min_{\alpha, \, \beta_1, \, \ldots, \, \beta_V}  \left\{ \, {\textstyle \frac{1}{S}} \cdot {\textstyle \sum_s} \, \left( r_s - [\alpha + {\textstyle \sum_v} \, \beta_v \cdot x_{s,v}] \right)^2 \, \right\}, \end{align*}

rather than learning about the most likely parameter values? It might seem like this is the same exact problem. And, if we’re only considering 1 right-hand-side variable, then it is the same exact problem. When V = 1 the best predictions come from using \{\hat{\alpha}^{\text{ML}}, \, \hat{\beta}_1^{\text{ML}}\}. But, it turns out that when there are 2 or more right-hand-side variables (and an unknown mean), this is no longer true. When V \geq 2 we can make better predictions with less likely parameters. When V \geq 2 there’s a tension between learning and predicting.

Why? That’s the topic of today’s post.

2. Maximum Likelihood

Finding the most likely (ML) parameter values is equivalent to minimizing the negative log likelihood. So, because we’re assuming that \epsilon_s^\star \overset{\scriptscriptstyle \text{iid}}{\sim} \mathrm{N}(0, \, \sigma^2), this is just

    \begin{align*} - \, \log \mathrm{Pr}(\mathbf{r} | \{\alpha, \, \beta_1, \, \ldots, \, \beta_V\}) &= {\textstyle \frac{1}{2 \cdot (S \cdot \sigma^2)}} \cdot {\textstyle \sum_s} \left(r_s - [\alpha + {\textstyle \sum_v} \, \beta_v \cdot x_{s,v}] \right)^2 + \cdots \end{align*}

where the “+ \cdots” at the end denotes a bunch of terms that don’t include any of the parameters that we’re optimizing over. Differentiating with respect to each parameter then gives the first-order conditions below:

    \begin{align*} 0 &= {\textstyle \frac{1}{S}} \cdot {\textstyle \sum_s}  \left( r_s  -  [\hat{\alpha}^{\text{ML}} + {\textstyle \sum_v} \, \hat{\beta}_v^{\text{ML}} \cdot x_{s,v}]  \right)  \cdot  1 \\ \text{and} \quad 0 &= {\textstyle \frac{1}{S}} \cdot {\textstyle \sum_s} \left( r_s -  [\hat{\alpha}^{\text{ML}} + {\textstyle \sum_v} \, \hat{\beta}_v^{\text{ML}} \cdot x_{s,v}] \right) \cdot  x_{s,v} \quad \text{for each } v = 1, \, \ldots, \, V. \end{align*}

And, solving this system of (V+1) equations and (V+1) unknowns gives the most likely parameter values:

    \begin{align*} \hat{\alpha}^{\text{ML}} &= {\textstyle \frac{1}{S}} \cdot {\textstyle \sum_s} \, r_s \cdot 1 \\ \text{and} \quad \hat{\beta}_v^{\text{ML}}  &= {\textstyle \frac{1}{S}} \cdot {\textstyle \sum_s} \, (r_s - \hat{\alpha}^{\text{ML}}) \cdot x_{s,v} \quad \text{for each } v = 1, \, \ldots, \, V. \end{align*}

Clearly, the most likely parameter values are just the coefficients from a cross-sectional regression.

Now, for the sake of argument, let’s imagine there’s an oracle who knows the true parameter values, \{ \alpha^\star, \, \beta_1^\star, \, \ldots, \, \beta_V^\star \}. With access to this oracle, we can compute our mean squared error when using the maximum-likelihood estimates to predict returns given any choice of true parameter values:

    \begin{align*} \mathrm{E} \left[ \, \left(  r_s  -  [\hat{\alpha}^{\text{ML}} + {\textstyle \sum_v} \, \hat{\beta}_v^{\text{ML}} \cdot x_{s,v}] \right)^2 \, \middle| \, \{ \alpha^\star, \, \beta_1^\star, \, \ldots, \, \beta_V^\star \} \, \right] &= \sigma^2  +  1 \cdot (\sfrac{\sigma^2}{S})  +  V \cdot (\sfrac{\sigma^2}{S}). \end{align*}

The first term, \sigma^2, captures the unavoidable error. i.e., even if we knew the true parameter values, we still wouldn’t be able to predict \epsilon_s \overset{\scriptscriptstyle \text{iid}}{\sim} \mathrm{N}(0, \, \sigma^2). And, the second and third terms, 1 \cdot (\sfrac{\sigma^2}{S}) and V \cdot (\sfrac{\sigma^2}{S}), capture the error that comes from using the most likely parameter values rather than the true parameter values.

3. A Better Predictor

With this benchmark in mind, let’s take a look at a variant of the James-Stein (JS) estimator:

    \begin{align*} \hat{\beta}_v^{\text{JS}} &\overset{\scriptscriptstyle \text{def}}{=}  (1 - \lambda) \cdot \hat{\beta}_v^{\text{ML}} \quad \text{for each } v = 1, \, \ldots, \, V. \end{align*}

In the definition above, \lambda \in [0, \, 1] denotes a bias factor that shrinks the maximum-likelihood estimates towards zero whenever \lambda > 0. So, with access to an oracle, we can compute our mean squared error when using the James-Stein estimates to predict returns just like before:

    \begin{align*} \mathrm{E} \left[ \, \left(  r_s  -  [\hat{\alpha}^{\text{ML}} + {\textstyle \sum_v} \, \hat{\beta}_v^{\text{JS}} \cdot x_{s,v}] \right)^2 \, \middle| \, \{ \alpha^\star, \, \beta_1^\star, \, \ldots, \, \beta_V^\star \} \, \right] &= \sigma^2  +  1 \cdot (\sfrac{\sigma^2}{S})  \\ &\quad + V  \cdot \left\{  \, (1 - \lambda)^2 \cdot (\sfrac{\sigma^2}{S}) + \lambda^2  \cdot ({\textstyle \frac{1}{V}} \cdot {\textstyle \sum_v} \, |\beta_v^\star|^2) \, \right\}. \end{align*}

Now the third term is more complicated. If we use a more biased estimator, \lambda \to 1 and |\hat{\beta}_v^{\text{JS}}| \to 0, then using the most likely parameter values rather than the true parameter values to predict returns is going to cause less damage. But, bias is going to generate really bad predictions whenever the true parameter value is large |\beta_v^\star| \gg 0. The (1 - \lambda)^2 \cdot (\sfrac{\sigma^2}{S}) and \lambda^2 \cdot (\frac{1}{V} \cdot \sum_v \, |\beta_v^\star|^2) terms capture these two opposing forces.

Comparing the maximum-likelihood and James-Stein prediction errors reveals that we should prefer the James-Stein estimator to the maximum-likelihood estimator if there’s a \lambda \in (0, \, 1] such that:

    \begin{align*} (1 - \lambda)^2 \cdot (\sfrac{\sigma^2}{S}) + \lambda^2  \cdot ({\textstyle \frac{1}{V}} \cdot {\textstyle \sum_v} \, |\beta_v^\star|^2) < (\sfrac{\sigma^2}{S}). \end{align*}

But, here’s the thing: if we have access to an oracle, then there’s always going to be some \lambda > 0 that satisfies this inequality. This is easier to see if we rearrange things a bit:

    \begin{align*} {\textstyle  \frac{ \frac{1}{V} \cdot \sum_v \, |\beta_v^\star|^2 }{ (\sfrac{\sigma^2}{S}) } } < {\textstyle  \frac{ 2 - \lambda }{ \lambda } }. \end{align*}

So, there’s always a sufficiently small \lambda such that this inequality holds. Thus, for any \{ \alpha^\star, \, \beta_1^\star, \, \ldots, \, \beta_V^\star \}, there’s some \lambda such that the James-Stein estimates give better predictions than the most likely parameter values.

4. Abandoning the Oracle

Perhaps this isn’t a useful comparison? In the real world, we can’t see the true parameter values when deciding which estimator to use. So, we can’t know ahead of time whether or not we’ve picked a small enough \lambda. It turns out that having to estimate \lambda changes things, but only when there’s just V=1 right-hand-side variable. When there are V \geq 2 variables, James-Stein with an estimated \lambda still gives better predictions.

To see where this distinction comes from, let’s first solve for the optimal choice of \lambda when we still have access to the oracle. This will tell us what we have to estimate when we abandon the oracle. The optimal choice of \lambda solves:

    \begin{align*} \lambda^\star &= \arg \min_{\lambda \in [0, \, 1]} \left\{ \, (1 - \lambda)^2 \cdot (\sfrac{\sigma^2}{S}) + \lambda^2  \cdot  \left( {\textstyle \frac{1}{V}} \cdot {\textstyle \sum_v} \, |\beta_v^\star|^2 \right) \, \right\} \end{align*}

So, if we differentiate, then we can solve the first-order condition for \lambda^\star:

    \begin{align*} \lambda^\star &= {\textstyle  \frac{ (\sfrac{\sigma^2}{S}) }{ \frac{1}{V} \cdot \sum_v \, |\beta_v^\star|^2 + (\sfrac{\sigma^2}{S}) } }. \end{align*}

If the maximum-likelihood estimates are really noisy relative to the size of the true parameter values (i.e., \sfrac{\sigma^2}{S} is close to \frac{1}{V} \cdot \sum_v \, |\beta_v^\star|^2), then using the most likely parameter values rather than the true parameter values is going to increase our prediction error a lot. So, we should use more biased parameter estimates.

But, notice what this formula is telling us. To estimate the right amount of bias, all we have to do is estimate the variance of the true parameter values. We don’t have to estimate every single one. So, we can estimate the variance of the true parameter values as follows,

    \begin{align*} {\textstyle \frac{1}{V-1}} \cdot {\textstyle \sum_v} \, |\hat{\beta}_v^{\text{ML}}|^2 &= (\sfrac{\sigma^2}{S}) + {\textstyle \frac{1}{V}} \cdot {\textstyle \sum_v} \, |\beta_v^\star|^2, \end{align*}

where the factor of (V - 1) on the left-hand side is a degrees-of-freedom correction. To see why we need this correction, recall that

    \begin{align*} \mathrm{E} \left[ \, \left(  r_s  -  [\hat{\alpha}^{\text{ML}} + {\textstyle \sum_v} \, \hat{\beta}_v^{\text{ML}} \cdot x_{s,v}] \right)^2 \, \right] &= \sigma^2  +  1 \cdot (\sfrac{\sigma^2}{S})  +  V \cdot (\sfrac{\sigma^2}{S}), \end{align*}

so not all V maximum-likelihood estimates can move independently. There is an adding-up constraint.

Thus, if we have to estimate the right amount of bias to use, then we should choose:

    \begin{align*} \hat{\lambda} &= {\textstyle \frac{ (\sfrac{\sigma^2}{S}) }{ \frac{1}{V-1} \cdot \sum_v \, |\hat{\beta}_v^{\text{ML}}|^2 } }. \end{align*}

Notice that, when we have to estimate the right amount of bias, \hat{\lambda} > 0 only if V \geq 2. If V = 1, then the denominator is infinite and \hat{\lambda} = 0. After all, if there’s only 1 right-hand-side variable, then the equation to estimate the variance of the true parameter values has the same first-order condition as the equation to estimate \hat{\beta}_1^{\text{ML}}. With this estimated amount of bias, our prediction error becomes

    \begin{align*} \mathrm{E} \left[ \, \left(  r_s  -  [\hat{\alpha}^{\text{ML}} + {\textstyle \sum_v} \, \hat{\beta}_v^{\text{JS}} \cdot x_{s,v}] \right)^2 \, \right] &= \sigma^2  +  1 \cdot (\sfrac{\sigma^2}{S})  +  (1 - \lambda)^2 \cdot (\sfrac{\sigma^2}{S}), \end{align*}

which is always less than the maximum-likelihood prediction error whenever V \geq 2.
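
Here’s a minimal simulation sketch in Python that illustrates this comparison. The sample size, the number of variables, and the distribution of the true coefficients are all illustrative assumptions, and \hat{\lambda} is clipped at 1:

    import numpy as np

    rng = np.random.default_rng(0)
    S, V, sigma, trials = 100, 25, 1.0, 2000      # illustrative values
    beta_star = rng.normal(0.0, 0.15, size=V)     # arbitrary "true" coefficients

    mse_ml = mse_js = 0.0
    for _ in range(trials):
        # Random design with (approximately) the normalization in the text:
        # orthogonal columns with sum of squares S.
        X, _ = np.linalg.qr(rng.normal(size=(S, V)))
        X *= np.sqrt(S)

        r_in = X @ beta_star + rng.normal(0, sigma, size=S)     # estimation sample
        r_out = X @ beta_star + rng.normal(0, sigma, size=S)    # fresh returns, same design

        alpha_ml = r_in.mean()
        beta_ml = X.T @ (r_in - alpha_ml) / S                   # cross-sectional regression
        lam_hat = (sigma**2 / S) / (np.sum(beta_ml**2) / (V - 1))
        beta_js = (1 - min(lam_hat, 1.0)) * beta_ml             # shrink toward zero

        mse_ml += np.mean((r_out - alpha_ml - X @ beta_ml)**2) / trials
        mse_js += np.mean((r_out - alpha_ml - X @ beta_js)**2) / trials

    print(f"out-of-sample MSE, maximum likelihood: {mse_ml:.4f}")
    print(f"out-of-sample MSE, James-Stein:        {mse_js:.4f}")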

5. What This Means

My last post looked at one reason why it’s harder to predict the cross-section of returns when V \geq 2: Bayesian variable selection doesn’t scale. If we’re not sure which subset of variables actually predict returns, then finding the subset of variables that’s the most likely to predict returns means solving a non-convex optimization problem. It turns out that solving this optimization problem means doing an exhaustive search over the powerset containing all 2^V possible subsets of variables. And, this just isn’t feasible when V \gg 2.

But, this scaling problem isn’t the only reason why it’s harder to predict the cross-section of returns when V \geq 2. And, this post points out another one of these reasons: even if you could solve this non-convex optimization problem and find the most likely parameter values, these parameter values wouldn’t give the best predictions. When V \geq 2, there’s a fundamental tension between making good predictions and learning about the most likely parameter values in the data-generating process for returns. So, when V \geq 2 traders are going to solve the prediction problem and live with the resulting biased beliefs about the underlying parameter values. What’s more, the \hat{\lambda} with the best out-of-sample fit in the data is going to quantify how much the desire to make good predictions distorts traders’ beliefs.


Why Bayesian Variable Selection Doesn’t Scale

January 19, 2017 by Alex

1. Motivation

Traders are constantly looking for variables that predict returns. If x is the only candidate variable traders are considering, then it’s easy to use the Bayesian information criterion to check whether x predicts returns. Previously, I showed that using the univariate version of the Bayesian information criterion means solving

(●)   \begin{align*} \hat{\beta} &= \arg \min_{\beta}  \big\{  \,  \underset{\text{Prediction Error}}{{\textstyle \frac{1}{S}} \cdot {\textstyle \sum_s} (r_s - \beta \cdot x_s)^2} + \underset{\text{Penalty}}{ \lambda \cdot 1_{\{ \beta \neq 0 \}} } \,  \big\} \qquad \text{with} \qquad \lambda = {\textstyle \frac{1}{S}} \cdot \log(S) \end{align*}

after standardizing things so that \hat{\mu}_x, \, \hat{\mu}_r = 0 and \hat{\sigma}_x^2 = 1. If the solution is some \hat{\beta} \neq 0, then x predicts returns. Notation: Parameters with hats are in-sample estimates. e.g., if x_s \overset{\scriptscriptstyle \text{iid}}{\sim} \mathrm{N}(0, \, 1), then \frac{1}{S} \cdot \sum_s x_s = \hat{\mu}_x \sim \mathrm{N}(0, \, \sfrac{1}{S}).

But, what if there’s more than 1 variable? There’s an obvious multivariate extension of (●):

(⣿)   \begin{align*} \{\hat{\beta}_1, \, \ldots, \, \hat{\beta}_V \} &= \arg \min_{\beta_1, \, \ldots, \, \beta_V}  \left\{  \,  {\textstyle \frac{1}{S}} \cdot {\textstyle \sum_s} ( r_s - {\textstyle \sum_v} \, \beta_v \cdot x_{s,v} )^2 + \lambda \cdot {\textstyle \sum_v}  1_{\{ \beta_v \neq 0 \}} \,  \right\}. \end{align*}

So, you might guess that it’d be easy to check which subset of V \geq 1 variables predicts returns by evaluating (⣿). But, it’s not. To evaluate the multivariate version of the Bayesian information criterion, traders would have to check 2^V different parameter values. That’s a combinatorial nightmare when V \gg 1. Thus, traders can’t take a strictly Bayesian approach to variable selection when there are lots of variables to choose from.

Why is evaluating (⣿) so hard? That’s the topic of today’s post.

2. Non-Convex Problem

Let’s start by looking at what makes (●) so easy. The key insight is that you face the same-sized penalty no matter what \hat{\beta} \neq 0 you choose when solving the univariate version of the Bayesian information criterion:

    \begin{align*} \lambda \cdot 1_{\{ 0.01 \neq 0 \}} = \lambda \cdot 1_{\{ 100 \neq 0 \}} = \lambda \qquad \text{or, put differently} \qquad {\textstyle \frac{\mathrm{d}\lambda}{\mathrm{d}\beta}} = 0. \end{align*}

So, if you’re going to set \hat{\beta} \neq 0, then you might as well choose the value that minimizes your prediction error:

    \begin{align*} \arg \min_{\beta \neq 0}  \left\{  \,  {\textstyle \frac{1}{S}} \cdot {\textstyle \sum_s} (r_s - \beta \cdot x_s)^2 + \lambda \cdot 1_{\{ \beta \neq 0 \}} \,  \right\} &=  \arg \min_{\beta}  \left\{  \,  {\textstyle \frac{1}{S}} \cdot {\textstyle \sum_s} (r_s - \beta \cdot x_s)^2 + \lambda \,  \right\} \\ &= \arg \min_{\beta}  \left\{  \,  {\textstyle \frac{1}{S}} \cdot {\textstyle \sum_s} (r_s - \beta \cdot x_s)^2 \,  \right\} \\ &= \hat{\beta}^{\text{OLS}}. \end{align*}

Thus, to evaluate (●), all you have to do is check 2 parameter values, \beta = 0 and \beta = \hat{\beta}^{\text{OLS}}, and see which one gives a smaller result. Practically speaking, this means running an OLS regression, r_s = \hat{\beta}^{\text{OLS}} \cdot x_s + \hat{\epsilon}_s, and checking whether or not the penalized residual variance, \hat{\sigma}_\epsilon^2 + \lambda, is smaller than the raw return variance, \hat{\sigma}_r^2.
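
Concretely, here’s a minimal sketch of that check in Python (the simulated data and the coefficient of 0.2 are just for illustration):

    import numpy as np

    def bic_keeps_x(r, x):
        """Univariate check from (●): keep x iff the penalized residual
        variance beats the raw return variance."""
        S = len(r)
        r = r - r.mean()                       # mu-hat_r = 0
        x = (x - x.mean()) / x.std()           # mu-hat_x = 0, sigma-hat_x^2 = 1
        beta_ols = (r @ x) / S
        resid_var = np.mean((r - beta_ols * x)**2)
        return resid_var + np.log(S) / S < np.mean(r**2)

    rng = np.random.default_rng(0)
    S = 500
    x = rng.normal(size=S)
    print(bic_keeps_x(0.2 * x + rng.normal(size=S), x))    # x really does predict returns
    print(bic_keeps_x(rng.normal(size=S), x))              # x doesn't (typically prints False)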

Most explanations for why (⣿) is hard to evaluate focus on the fact that (⣿) is a non-convex optimization problem (e.g., see here and here). But, the univariate version of the Bayesian information criterion is also a non-convex optimization problem. Just look at the region around \beta = 0 in the figure to the left, which shows the objective function from (●). So, non-convexity can only be part of the explanation for why (⣿) is hard to evaluate. Increasing the number of variables must add a missing ingredient.

3. The Missing Ingredient


If there are many variables to consider, then these variables can be correlated. Correlation is the missing ingredient that makes evaluating (⣿) hard. Let’s look at a numerical example to see why.

Suppose there are only S = 9 stocks and V = 3 variables. The diagram above summarizes the correlation structure between these variables and returns. The red bar is the total variation in returns. The blue bars represent the portion of this variation that’s related to each of the 3 variables. If you can draw a vertical line through a pair of bars (i.e., the bars overlap), then the associated variables are correlated. So, because the first 2 blue bars don’t overlap, x_1 and x_2 are perfectly uncorrelated in-sample:

    \begin{align*} \widehat{\mathrm{Cor}}[x_1, \, x_2] &= 0. \end{align*}

Whereas, because the first 2 blue bars both overlap the third, x_1 and x_2 are both correlated with x_3:

    \begin{align*} \widehat{\mathrm{Cor}}[x_1, \, x_3] = \widehat{\mathrm{Cor}}[x_2, \, x_3] &= 0.41. \end{align*}

Finally, longer overlaps denote larger correlations. So, x_3 is the single best predictor of returns since the third blue bar has the longest overlap with the top red bar:

    \begin{align*} \widehat{\mathrm{Cor}}[r, \, x_1] = \widehat{\mathrm{Cor}}[r, \, x_2] &= 0.62 \\ \widehat{\mathrm{Cor}}[r, \, x_3] &= 0.67. \end{align*}

And, this creates a problem. If you had to pick only 1 variable to predict returns, then you’d pick x_3:

    \begin{align*} {\textstyle \frac{1}{9}} \cdot {\textstyle \sum_s} ( r_s - \hat{\beta}_3^{\text{OLS}} \cdot x_{s,3} )^2 + \lambda = 0.80 < 0.86 &= {\textstyle \frac{1}{9}} \cdot {\textstyle \sum_s} ( r_s - \hat{\beta}_1^{\text{OLS}} \cdot x_{s,1} )^2 + \lambda \\ &= {\textstyle \frac{1}{9}} \cdot {\textstyle \sum_s} ( r_s - \hat{\beta}_2^{\text{OLS}} \cdot x_{s,2} )^2 + \lambda. \end{align*}

But, \{x_1, \, x_2 \} is actually the subset of variables that minimizes (⣿) in this example:

    \begin{align*} {\textstyle \frac{1}{9}} \cdot {\textstyle \sum_s} ( r_s - {\textstyle \sum_{v \in \{1,2\}}} \hat{\beta}_v^{\text{OLS}} \cdot x_{s,v} )^2 + 2 \cdot \lambda = 0.72. \end{align*}

In other words, the variable that best predicts returns on its own isn’t part of the subset of variables that collectively best predict returns. In other examples it might be, but there’s no quick way to figure out which kind of example we’re in because (⣿) is a non-convex optimization problem. Until you actually plug \{x_1, \, x_2 \} into (⣿), there’s absolutely no reason to suspect that either variable belongs to the subset that minimizes (⣿). Think about it. x_1 and x_2 are both worse choices than x_3 on their own. And, if you start with x_3 and add either variable, things get even worse:

    \begin{align*} {\textstyle \frac{1}{9}} \cdot {\textstyle \sum_s} ( r_s - \hat{\beta}_3^{\text{OLS}} \cdot x_{s,3} )^2 + \lambda = 0.80 < 0.90 &= {\textstyle \frac{1}{9}} \cdot {\textstyle \sum_s} ( r_s - {\textstyle \sum_{v \in \{3,1\}}} \, \hat{\beta}_v^{\text{OLS}} \cdot x_{s,v} )^2 + 2 \cdot \lambda \\ &= {\textstyle \frac{1}{9}} \cdot {\textstyle \sum_s} ( r_s - {\textstyle \sum_{v \in \{3,2\}}} \, \hat{\beta}_v^{\text{OLS}} \cdot x_{s,v} )^2 + 2 \cdot \lambda. \end{align*}

With many correlated variables, you can never tell how close you are to the subset of variables that best predicts returns. To evaluate (⣿), you’ve got to check all 2^V possible combinations. There are no shortcuts.
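
To make the brute-force nature of the problem concrete, here’s a minimal sketch in Python that evaluates (⣿) the only way you can in general: by checking every one of the 2^V subsets. The simulated data are illustrative and don’t reproduce the numbers above.

    import numpy as np
    from itertools import combinations

    def best_subset(r, X, lam):
        """Evaluate (⣿) by brute force over all 2^V subsets of variables."""
        S, V = X.shape
        best, best_value = (), np.mean(r**2)                 # start from the empty model
        for k in range(1, V + 1):
            for subset in combinations(range(V), k):
                Xs = X[:, list(subset)]
                beta = np.linalg.lstsq(Xs, r, rcond=None)[0]  # OLS on this subset
                value = np.mean((r - Xs @ beta)**2) + lam * k
                if value < best_value:
                    best, best_value = subset, value
        return best, best_value

    # Simulated illustration with V = 3 correlated candidate variables.
    rng = np.random.default_rng(0)
    S = 9
    cov = [[1.0, 0.0, 0.4], [0.0, 1.0, 0.4], [0.4, 0.4, 1.0]]
    X = rng.multivariate_normal(np.zeros(3), cov, size=S)
    r = X[:, 0] + X[:, 1] + rng.normal(0, 0.5, size=S)
    print(best_subset(r, X, lam=np.log(S) / S))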

4. Birthday-Paradox Math

If correlation makes it hard to evaluate (⣿), then shouldn’t we be able to fix the problem by only considering independent variables? Yes… but only in a fairytale world where there are an infinite number of stocks, S \to \infty. The problem is unavoidable in the real world where there are almost as many candidate variables as there are stocks because independent variables are still going to be correlated in finite samples.

Suppose there are V \geq 2 independent variables that might predict returns. Although these variables are independent, they won’t be perfectly uncorrelated in finite samples. So, let’s characterize the maximum in-sample correlation between any pair of variables. After standardizing each variable so that \hat{\mu}_{x,v} = 0 and \hat{\sigma}_{x,v}^2 = 1, the in-sample correlation between x_v and x_{v'} when S \gg 1 is roughly:

    \begin{align*} \hat{\rho}_{v,v'} \sim \mathrm{N}(0, \, \sfrac{1}{S}). \end{align*}

Although the variance of these estimates, \sfrac{1}{S}, goes to zero as S \to \infty, they won’t be exactly zero in finite samples.


Since the normal distribution is symmetric, the probability that x_v and x_{v'} have an in-sample correlation more extreme than c is:

    \begin{align*} \mathrm{Pr}[ |\hat{\rho}_{v,v'}| > c ] &= 2 \cdot \mathrm{Pr}[ \hat{\rho}_{v,v'} > c ] = 2 \cdot \mathrm{\Phi}(-\,c \cdot \!\sqrt{S}). \end{align*}

So, since there are {V \choose 2} \leq \frac{1}{2} \cdot V^2 pairs of variables, the probability that no pair has a correlation more extreme than c is approximately:

    \begin{align*} \mathrm{Pr}[ \max |\hat{\rho}_{v,v'}| \leq c ] &\approx \big( \, 1 - 2 \cdot \mathrm{\Phi}(-\,c \cdot \!{\textstyle \sqrt{S}}) \, \big)^{\frac{1}{2} \cdot V^2}. \end{align*}

Here’s the punchline. Because V^2 shows up as an exponent in the equation above, the probability that all pairwise in-sample correlations happen to be really small is going to shrink exponentially fast as traders consider more and more variables. This means that finite-sample effects are always going to make evaluating (⣿) computationally intractable in real-world settings with many variables, even when the variables are truly uncorrelated as S \to \infty. e.g., the in-sample correlation of \hat{\rho}_{1,3} = \hat{\rho}_{2,3} = 0.41 from the previous example might seem like an unreasonably high number for independent random variables, something that only happens when S=9. But, the figure above shows that even when there are S = 50 stocks, there’s still a 50{\scriptstyle \%} chance of observing an in-sample correlation of at least 0.41 when considering V = 20 candidate variables.
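
Here’s a quick Monte Carlo sketch in Python (using S = 50 and V = 20 as in the example above) that checks this finite-sample effect by simulation:

    import numpy as np
    from itertools import combinations

    rng = np.random.default_rng(0)
    S, V, trials = 50, 20, 2000

    hits = 0
    for _ in range(trials):
        X = rng.normal(size=(S, V))                    # truly independent variables
        X = (X - X.mean(axis=0)) / X.std(axis=0)       # standardize in-sample
        rho = X.T @ X / S                              # in-sample correlation matrix
        max_rho = max(abs(rho[i, j]) for i, j in combinations(range(V), 2))
        hits += max_rho >= 0.41

    print("Pr[ max |rho| >= 0.41 ] is roughly", hits / trials)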

5. What This Means

Market efficiency has been an “organizing principle for 30 years of empirical work” in academic finance. The principle is based on a negative feedback loop: predictable returns suggest profitable trading strategies, but implementing these strategies eliminates the initial predictability. So, if there are enough really smart traders, then they’re going to find the subset of variables that predicts returns and eliminate this predictability. That’s market efficiency.

Even for really smart traders, finding the subset of variables that predicts returns is hard. And, this problem gets harder when there are more candidate variables to choose from. But, while researchers have thought about this problem in the past, they’ve primarily focused on the dangers of p-hacking (e.g., see here and here). If traders regress returns on V = 20 different variables, then they should expect that 1 of these regressions is going to produce a statistically significant coefficient with a p\text{-value} \leq 0.05 even if none of the variables actually predicts returns. So, researchers have focused on correcting p-values.

But, this misses the point. Searching for the subset of variables that predicts returns isn’t hard because you have to adjust your p-values. It’s hard because it requires a brute-force search through the powerset of all possible subsets of predictors. It’s hard because any optimization problem with a hard-thresholding rule like (⣿) can be re-written as an integer programming problem,

    \begin{align*} \min_{{\boldsymbol \delta} \in \{0, \, 1\}^V}  \left\{  \,  {\textstyle \frac{1}{S}}  \cdot {\textstyle \sum_s} (r_s - {\textstyle \sum_v} [\hat{\beta}_v^{\text{OLS}} \cdot x_{s,v}] \cdot \delta_v)^2 \quad \text{subject to} \quad k \geq {\textstyle \sum_v}  \delta_v \, \right\}, \end{align*}

which means that it’s NP-hard (e.g., see here, here, or here). It’s hard because, in the diagram above, finding the subset of 30 variables that predicts returns is equivalent to finding the cheapest way to cover the red bar with blue bars at a cost of \lambda per blue bar, which means solving the knapsack problem.

So, the fact that Bayesian variable selection doesn’t scale is a really central problem. It means that, even if there are lots and lots of really smart traders, they may not be able to find the best subset of variables for predicting returns. You’re probably on a secure wifi network right now. This network is considered “secure” because cracking its 128-bit passcode would involve a brute-force search over 2^{128} parameter values, which would take 1 billion billion years. So, if there are over V = 313 predictors documented in top academic journals, why shouldn’t we consider the subset of variables that actually predicts returns “secure”, too? After all, finding it would involve a brute-force search over 2^{313} parameter values. We might be able to approximate it. But, the exact collection of variables may as well be in a vault somewhere.


The Bayesian Information Criterion

January 3, 2017 by Alex

1. Motivation

Imagine that we’re trying to predict the cross-section of expected returns, and we’ve got a sneaking suspicion that x might be a good predictor. So, we regress today’s returns on x to see if our hunch is right,

    \begin{align*} r_{n,t}  = \hat{\mu}_{\text{OLS}} + \hat{\beta}_{\text{OLS}} \cdot x_{n,t-1}  +  \hat{\epsilon}_{n,t}. \end{align*}

The logic is straightforward. If x explains enough of the variation in today’s returns, then x must be a good predictor and we should include it in our model of tomorrow’s returns, \mathrm{E}_t(r_{n,t+1}) = \hat{\mu}_{\text{OLS}} + \hat{\beta}_{\text{OLS}} \cdot x_{n,t}.

But, how much variation is “enough variation”? After all, even if x doesn’t actually predict tomorrow’s returns, we’re still going to fit today’s returns better if we use an additional right-hand-side variable,

    \begin{align*} {\textstyle \frac{1}{N}} \cdot {\textstyle \sum_{n=1}^N} (r_{n,t} - \hat{\mu}_{\text{OLS}} - \hat{\beta}_{\text{OLS}} \cdot x_{n,t-1})^2  \leq  {\textstyle \frac{1}{N}} \cdot {\textstyle \sum_{n=1}^N} (r_{n,t} - \hat{\mu}_{\text{OLS}})^2. \end{align*}

The effect is mechanical. If we want to explain all of the variation in today’s returns, then all we have to do is include N right-hand-side variables in our OLS regression. With N linearly independent right-hand-side variables we can always perfectly fit N stock returns in-sample, no matter what variables we choose.

The Bayesian information criterion (BIC) tells us that we should include x as a right-hand-side variable if it explains at least \sfrac{\log(N)}{N} of the residual variation,

    \begin{align*} {\textstyle \frac{1}{N}} \cdot {\textstyle \sum_{n=1}^N} (r_{n,t} - \hat{\mu}_{\text{OLS}} - \hat{\beta}_{\text{OLS}} \cdot x_{n,t-1})^2  + {\textstyle \frac{\log(N)}{N}} &\leq {\textstyle \frac{1}{N}} \cdot {\textstyle \sum_{n=1}^N} (r_{n,t} - \hat{\mu}_{\text{OLS}})^2. \end{align*}

But, where does this \sfrac{\log(N)}{N} penalty come from? And, why is following this rule the Bayesian thing to do? Bayesian updating involves learning about a parameter value by combining prior beliefs with evidence from realized data. So, what parameter are we learning about when using the Bayesian information criterion? And, what are our priors beliefs? These questions are the topic of today’s post.

2. Estimation

Instead of diving directly into our predictor-selection problem (should we include x in our model?), let’s pause for a second and solve our parameter-estimation problem (how should we estimate the coefficient on x?). Suppose the data-generating process for returns is

    \begin{align*} r_{n,t} =  \beta_\star \cdot x_{n,t-1} + \epsilon_{n,t} \end{align*}

where \beta_\star \sim \mathrm{N}(0, \, \sigma^2), \epsilon_{n,t} \overset{\scriptscriptstyle \text{iid}}{\sim} \mathrm{N}(0, \, 1), and x is normalized so that \frac{1}{N} \cdot \sum_{n=1}^N x_n^2 = \widehat{\mathrm{Var}}[x_n] = 1. For simplicity, let’s also assume that \mu_\star = 0 in the analysis below.

If we see N returns from this data-generating process, \mathbf{r}_t = \{ \, r_{1,t}, \, r_{2,t}, \, \ldots, \, r_{N,t} \, \}, then we can estimate \beta_\star by choosing the parameter value that would maximize the posterior probability of realizing these returns:

    \begin{align*} \hat{\beta}_{\text{MAP}} &\overset{\scriptscriptstyle \text{def}}{=} \arg \max_{\beta} \{ \, \mathrm{Pr}( \mathbf{r}_t | \mathbf{x}_{t-1}, \, \beta ) \times \mathrm{Pr}(\beta) \, \} \\ &= \arg \min_{\beta} \{ \, - \, {\textstyle \frac{1}{N}} \cdot \log \mathrm{Pr}( \mathbf{r}_t | \mathbf{x}_{t-1}, \, \beta )  -  {\textstyle \frac{1}{N}} \cdot \log \mathrm{Pr}(\beta) \, \}. \end{align*}

This is known as maximum a posteriori (MAP) estimation, and the second equality in the expression above points out how we can either maximize the posterior probability or minimize -(\sfrac{1}{N}) \cdot \log(\cdot) of this function,

    \begin{align*} \mathrm{f}(\beta)  &\overset{\scriptscriptstyle \text{def}}{=}   - \, {\textstyle \frac{1}{N}} \cdot \log \mathrm{Pr}( \mathbf{r}_t | \mathbf{x}_{t-1}, \, \beta )  -  {\textstyle \frac{1}{N}} \cdot \log \mathrm{Pr}(\beta). \end{align*}

We can think about \mathrm{f}(\beta) as the average improbability of the realized returns given \beta_\star = \beta.

So, what is this answer? Because \beta_\star \sim \mathrm{N}(0, \, \sigma^2) and \epsilon_{n,t} \overset{\scriptscriptstyle \text{iid}}{\sim} \mathrm{N}(0, \, 1), we know that

    \begin{align*} \mathrm{f}(\beta)  &= {\textstyle \frac{1}{N}}  \cdot  \left\{ \, {\textstyle \sum_{n=1}^N} {\textstyle \frac{1}{2}} \cdot (r_{n,t} - \beta \cdot x_{n,t-1})^2 + N \cdot {\textstyle \frac{1}{2}} \cdot \log(2 \cdot \pi) \, \right\} \\ &\qquad \quad  + \, {\textstyle \frac{1}{N}}  \cdot  \left\{  \, {\textstyle \frac{1}{2 \cdot \sigma^2}} \cdot (\beta - 0)^2 + {\textstyle \frac{1}{2}} \cdot \log(2 \cdot \pi) + {\textstyle \frac{1}{2}} \cdot \log(\sigma^2) \, \right\} \end{align*}

where the first line is -(\sfrac{1}{N}) \cdot \log \mathrm{Pr}(\mathbf{r}_t|\mathbf{x}_{t-1}, \, \beta) and the second line is -(\sfrac{1}{N}) \cdot \log \mathrm{Pr}(\beta). What’s more, because we’re specifically choosing \hat{\beta}_{\text{MAP}} to minimize \mathrm{f}(\beta), we also know that

    \begin{align*} \mathrm{f}'(\hat{\beta}_{\text{MAP}}) &= 0 = - \, {\textstyle \frac{1}{N}}  \cdot  {\textstyle \sum_{n=1}^N} (r_{n,t} - \hat{\beta}_{\text{MAP}} \cdot x_{n,t-1}) \cdot x_{n,t-1} + {\textstyle \frac{1}{N}}  \cdot  {\textstyle \frac{1}{\sigma^2}} \cdot \hat{\beta}_{\text{MAP}}. \end{align*}

And, solving this first-order condition for \hat{\beta}_{\text{MAP}} tells us exactly how to estimate \beta_\star:

    \begin{align*} \hat{\beta}_{\text{MAP}} &= \frac{ N \cdot \left\{ \frac{1}{N} \cdot \sum_{n=1}^N r_{n,t} \cdot x_{n,t-1} \right\} }{ \frac{1}{\sigma^2} + N \cdot \left\{ \frac{1}{N} \cdot \sum_{n=1}^N x_{n,t-1}^2 \right\} } = \frac{ N \cdot \widehat{\mathrm{Cov}}[r_{n,t}, \, x_{n,t-1}] }{ \frac{1}{\sigma^2} + N \cdot \widehat{\mathrm{Var}}[x_{n,t-1}] }. \end{align*}
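
As a sanity check on this formula, the sketch below (Python, illustrative values) computes \hat{\beta}_{\text{MAP}} two ways: from the closed-form expression above and by numerically minimizing \mathrm{f}(\beta). The two answers should agree, and both should be shrunk toward zero relative to \hat{\beta}_{\text{OLS}}.

    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(2)
    N, sigma2 = 500, 0.05                            # illustrative values
    beta_star = rng.normal(0.0, np.sqrt(sigma2))
    x = rng.standard_normal(N)
    x = (x - x.mean()) / x.std()
    r = beta_star * x + rng.standard_normal(N)

    def f(beta):
        # Average improbability: -(1/N)*log Pr(r|x,beta) - (1/N)*log Pr(beta)
        log_lik = -0.5 * np.sum((r - beta * x) ** 2) - 0.5 * N * np.log(2 * np.pi)
        log_prior = -0.5 * beta ** 2 / sigma2 - 0.5 * np.log(2 * np.pi * sigma2)
        return -(log_lik + log_prior) / N

    # Closed-form MAP estimate from the first-order condition
    beta_map_formula = (x @ r) / (1.0 / sigma2 + x @ x)

    # Numerical minimization of f(beta) should give the same answer
    beta_map_numeric = minimize_scalar(f).x

    beta_ols = (x @ r) / (x @ x)
    print(beta_map_formula, beta_map_numeric, beta_ols)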

3. Selection

Now that we’ve seen the solution to our parameter-estimation problem, let’s get back to solving our predictor-selection problem. Should we include x in our predictive model of tomorrow’s returns? It turns out that answering this question means learning about the prior variance of \beta_\star. Is \beta_\star equally likely to take on any value, \sigma^2 \to \infty? Or, should we assume that \beta_\star = 0 regardless of the evidence, \sigma^2 \to 0?

To see where the first choice comes from, let’s think about the priors we’re implicitly adopting when we include x in our predictive model. Since \beta_\star \sim \mathrm{N}(0, \, \sigma^2), this means looking for a \sigma^2 such that \hat{\beta}_{\text{MAP}} = \hat{\beta}_{\text{OLS}}. Inspecting the solution to our parameter-estimation problem reveals that

    \begin{align*} \lim_{\sigma^2 \to \infty} \hat{\beta}_{\text{MAP}} &= \lim_{\sigma^2 \to \infty}  \left( \frac{ N \cdot \widehat{\mathrm{Cov}}[r_{n,t}, \, x_{n,t-1}] }{ \frac{1}{\sigma^2} + N \cdot \widehat{\mathrm{Var}}[x_{n,t-1}] } \right) = \frac{ \widehat{\mathrm{Cov}}[r_{n,t}, \, x_{n,t-1}] }{ \widehat{\mathrm{Var}}[x_{n,t-1}] } = \hat{\beta}_{\text{OLS}}. \end{align*}

So, by including x, we’re adopting an agnostic prior that \beta_\star is equally likely to be any value under the sun.

To see where the second choice comes from, let’s think about the priors we’re implicitly adopting when we exclude x from our predictive model. This means looking for a \sigma^2 such that \hat{\beta}_{\text{MAP}} = 0 regardless of the realized data, \mathbf{r}_t. Again, inspecting the formula for \hat{\beta}_{\text{MAP}} reveals that

    \begin{align*} \lim_{\sigma^2 \to 0} \hat{\beta}_{\text{MAP}}  &=  \lim_{\sigma^2 \to 0}  \left( \frac{ N \cdot \widehat{\mathrm{Cov}}[r_{n,t}, \, x_{n,t-1}] }{ \frac{1}{\sigma^2} + N \cdot \widehat{\mathrm{Var}}[x_{n,t-1}] } \right) =0. \end{align*}

So, by excluding x, we’re adopting a religious prior that \beta_\star = 0 regardless of any new evidence.
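
Here’s a quick numerical illustration of these two limits using the closed-form \hat{\beta}_{\text{MAP}} from the previous section (again with illustrative values): a huge prior variance recovers the OLS estimate, while a tiny prior variance forces the estimate to zero.

    import numpy as np

    rng = np.random.default_rng(3)
    N = 500
    x = rng.standard_normal(N)
    x = (x - x.mean()) / x.std()
    r = 0.1 * x + rng.standard_normal(N)             # illustrative returns

    def beta_map(sigma2):
        # Closed-form MAP estimate from Section 2
        return (x @ r) / (1.0 / sigma2 + x @ x)

    beta_ols = (x @ r) / (x @ x)
    print(beta_map(1e8), beta_ols)     # agnostic prior: essentially beta_OLS
    print(beta_map(1e-8))              # religious prior: essentially zero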

Thus, when we decide whether to include x in our predictive model, what we’re really doing is learning about our priors. So, after seeing N returns, \mathbf{r}_t, we can decide whether to include x in our predictive model by choosing the prior variance, \sigma^2, that maximizes the posterior probability of realizing these returns,

    \begin{align*} \hat{\sigma}_{\text{MAP}}^2  &\overset{\scriptscriptstyle \text{def}}{=}  \arg \max_{\sigma^2 \in \{\infty, \, 0\}}  \left\{  \,  \mathrm{Pr}( \mathbf{r}_t | \mathbf{x}_{t-1}, \, \sigma^2 ) \times \mathrm{Pr}( \sigma^2 ) \,  \right\} \\ &=  \arg \min_{\sigma^2 \in \{ \infty, \, 0\}}  \left\{  \,  - {\textstyle \frac{1}{N}}  \cdot \log \mathrm{Pr}( \mathbf{r}_t | \mathbf{x}_{t-1}, \, \sigma^2 ) - {\textstyle \frac{1}{N}} \cdot \log \mathrm{Pr}( \sigma^2 ) \,  \right\}, \end{align*}

where the second equality in the expression above points out how we can either maximize the posterior probability or minimize -(\sfrac{1}{N}) \cdot \log(\cdot) of this function—i.e., its average improbability. Either way, if we estimate \hat{\sigma}_{\text{MAP}}^2 \to \infty, then we should include x; whereas, if we estimate \hat{\sigma}_{\text{MAP}}^2 \to 0, then we shouldn’t.

4. Why log(N)/N?

Since we put equal prior weight on the two candidate values of \sigma^2, the \mathrm{Pr}(\sigma^2) term is just a constant that doesn’t affect the comparison. So, all we need is the probability of the realized returns given our choice of priors, which is given by

    \begin{align*} \mathrm{Pr}( \mathbf{r}_t | \mathbf{x}_{t-1}, \, \sigma^2 ) &=  {\textstyle \int_{-\infty}^\infty} \mathrm{Pr}(\mathbf{r}_t | \mathbf{x}_{t-1}, \, \beta) \cdot \mathrm{Pr}(\beta|\sigma^2) \cdot  \mathrm{d}\beta  \\ &=  {\textstyle \int_{-\infty}^\infty} \, e^{-N \cdot \mathrm{f}(\beta)} \cdot  \mathrm{d}\beta. \end{align*}

In this section, we’re going to see how to evaluate this integral. And, in the process, we’re going to see precisely where that \sfrac{\log(N)}{N} penalty term in the Bayesian information criterion comes from.

Here’s the key insight in plain English. The realized returns are affected by noise shocks. By definition, excluding x from our predictive model means that we aren’t learning about \beta_\star from the realized returns, so there’s no way for these noise shocks to affect either our estimate, \hat{\beta}_{\text{MAP}}, or our posterior-probability calculations. By contrast, if we include x in our predictive model, then we are learning about \beta_\star from the realized returns, so these noise shocks will distort both our estimate, \hat{\beta}_{\text{MAP}}, and our posterior-probability calculations. The distortion caused by these noise shocks is going to be the source of the \sfrac{\log(N)}{N} penalty term in the Bayesian information criterion.

Now, here’s the same insight in Mathese. Take a look at the Taylor expansion of \mathrm{f}(\beta) around \hat{\beta}_{\text{MAP}},

    \begin{align*} \mathrm{f}(\beta)  &= \mathrm{f}(\hat{\beta}_{\text{MAP}})  + {\textstyle \frac{1}{2}} \cdot \mathrm{f}''(\hat{\beta}_{\text{MAP}}) \cdot (\beta - \hat{\beta}_{\text{MAP}})^2. \end{align*}

There’s no first-order term because \hat{\beta}_{\text{MAP}} is chosen to minimize \mathrm{f}(\beta), and there are no higher-order terms because both \beta_\star and \epsilon_{n,t} are normally distributed. From the formula for \mathrm{f}(\beta) we can calculate that

    \begin{align*} \mathrm{f}''(\hat{\beta}_{\text{MAP}}) &= {\textstyle \frac{1}{N}} \cdot {\textstyle \sum_{n=1}^N} x_{n,t-1}^2 + {\textstyle \frac{1}{N}} \cdot  {\textstyle \frac{1}{\sigma^2}}. \end{align*}

Recall that \mathrm{f}(\beta) measures the average improbability of realizing \mathbf{r}_t given that \beta_\star = \beta. So, if \mathrm{f}''(\hat{\beta}_{\text{MAP}}) \to \infty for a given choice of priors, then having any \beta_\star \neq \hat{\beta}_{\text{MAP}} is infinitely improbable under those priors. And, this is exactly what we find when we exclude x from our predictive model, \lim_{\sigma \to 0} \mathrm{f}''(\hat{\beta}_{\text{MAP}}) = \infty. By contrast, if we include x in our predictive model, then \lim_{\sigma \to \infty} \mathrm{f}''(\hat{\beta}_{\text{MAP}}) = 1, meaning that we are willing to entertain the idea that \beta_\star \neq \hat{\beta}_{\text{MAP}} due to distortions caused by the noise shocks.
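
To see these two limits numerically, note that with the predictor normalized so that \frac{1}{N} \cdot \sum_{n=1}^N x_{n,t-1}^2 = 1, the curvature formula above reduces to 1 + \sfrac{1}{(N \cdot \sigma^2)}. The short sketch below (illustrative N) just evaluates this expression across a range of prior variances.

    import numpy as np

    N = 500                                    # illustrative sample size

    def f_second_derivative(sigma2):
        # f''(beta_MAP) = (1/N) * sum(x**2) + 1/(N * sigma2), with (1/N) * sum(x**2) = 1
        return 1.0 + 1.0 / (N * sigma2)

    for sigma2 in [1e-6, 1e-3, 1.0, 1e3, 1e6]:
        print(sigma2, f_second_derivative(sigma2))
    # As sigma2 -> 0 the curvature blows up: no value besides beta_MAP = 0 is entertained.
    # As sigma2 -> infinity the curvature settles at 1.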

To see why these distortions warrant a \sfrac{\log(N)}{N} penalty, all we have to do is evaluate the integral. First, let’s think about the case where we exclude x from our predictive model. We just saw that, if \sigma^2 \to 0, then we are unwilling to consider any parameter values besides \hat{\beta}_{\text{MAP}} = 0. So, the integral equation for our posteriors given that \sigma^2 \to 0 simplifies to

    \begin{align*} \lim_{\sigma^2 \to 0} \left\{ \, \mathrm{Pr}( \mathbf{r}_t | \mathbf{x}_{t-1}, \, \sigma^2 ) \, \right\} &= \mathrm{Pr}( \mathbf{r}_t | \mathbf{x}_{t-1}, \, \beta_\star = 0 ) \\ &= {\textstyle \big(\frac{1}{\sqrt{2 \cdot \pi}}\big)^N} \cdot e^{ - \, \sum_{n=1}^N \frac{1}{2} \cdot (r_{n,t} - 0)^2 }. \end{align*}

This means that the average improbability of realizing \mathbf{r}_t given the priors \sigma^2 \to 0 is given by

    \begin{align*} \lim_{\sigma \to 0}  \left\{ \, - {\textstyle \frac{1}{N}} \cdot \log \mathrm{Pr}( \mathbf{r}_t | \mathbf{x}_{t-1}, \, \sigma ) \, \right\} &= {\textstyle \frac{1}{2 \cdot N}} \cdot {\textstyle \sum_{n=1}^N} (r_{n,t} - 0)^2 + {\textstyle \frac{1}{2}} \cdot \log(2 \cdot \pi). \end{align*}

To calculate our posterior beliefs when we include x, let’s use this Taylor expansion around \hat{\beta}_{\text{MAP}} again,

    \begin{align*} \lim_{\sigma \to \infty} \mathrm{Pr}( \mathbf{r}_t | \mathbf{x}_{t-1}, \, \sigma ) &= \lim_{\sigma \to \infty}  \left\{  \, {\textstyle \int_{-\infty}^\infty} \,  e^{-\, N \cdot \mathrm{f}(\beta)}  \cdot  \mathrm{d}\beta \,  \right\} \\ &= \lim_{\sigma \to \infty} \left\{  \, {\textstyle \int_{-\infty}^\infty} \, e^{-\, N \cdot \left\{ \mathrm{f}(\hat{\beta}_{\text{MAP}}) + \frac{1}{2} \cdot \mathrm{f}''(\hat{\beta}_{\text{MAP}}) \cdot (\beta - \hat{\beta}_{\text{MAP}})^2 \right\}} \cdot \mathrm{d}\beta \,  \right\} \\ &= \left\{  \, e^{-\, N \cdot \mathrm{f}(\hat{\beta}_{\text{OLS}})}  \,  \right\} \times  \left\{  \, {\textstyle \int_{-\infty}^\infty} \, e^{ - \, \frac{N}{2} \cdot (\beta - \hat{\beta}_{\text{OLS}})^2} \cdot \mathrm{d}\beta \,  \right\}. \end{align*}

The first term is the probability of observing the realized returns assuming that \beta_\star = \lim_{\sigma^2 \to \infty} \hat{\beta}_{\text{MAP}} = \hat{\beta}_{\text{OLS}}. The second term is a penalty that accounts for the fact that \beta_\star might be different from the estimated \hat{\beta}_{\text{OLS}} in finite samples. Due to the central-limit theorem, this difference between \beta_\star and \hat{\beta}_{\text{OLS}} is going to shrink at a rate of \sqrt{(\sfrac{1}{N})}:

    \begin{align*} {\textstyle \int_{-\infty}^\infty} \,  e^{ - \, \frac{N}{2} \cdot (\beta - \hat{\beta}_{\text{OLS}})^2}  \cdot  \mathrm{d}\beta &= {\textstyle \int_{-\infty}^\infty} \,  e^{ - \, \frac{1}{2 \cdot (\sfrac{1}{N})} \cdot (\beta - \hat{\beta}_{\text{OLS}})^2}  \cdot  \mathrm{d}\beta \\ &= \sqrt{2 \cdot \pi \cdot (\sfrac{1}{N})}  \cdot \int_{-\infty}^\infty \,  {\textstyle \frac{1}{\sqrt{2 \cdot \pi \cdot (\sfrac{1}{N})}}} \cdot e^{ - \, \frac{1}{2 \cdot (\sfrac{1}{N})} \cdot (\beta - \hat{\beta}_{\text{OLS}})^2}  \cdot  \mathrm{d}\beta \\ &= \sqrt{2 \cdot \pi \cdot (\sfrac{1}{N})}. \end{align*}
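
If you’d like to verify this Gaussian integral numerically rather than by recognizing the normal density, a quick quadrature check (illustrative values of N and \hat{\beta}_{\text{OLS}}) will do it.

    import numpy as np
    from scipy.integrate import quad

    N, beta_ols = 500, 0.1                         # illustrative values

    integrand = lambda b: np.exp(-0.5 * N * (b - beta_ols) ** 2)
    # The integrand is negligible outside a narrow window around beta_ols
    numeric, _ = quad(integrand, beta_ols - 1.0, beta_ols + 1.0)
    exact = np.sqrt(2 * np.pi / N)
    print(numeric, exact)                          # both roughly 0.112 for N = 500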

So, the average improbability of realizing \mathbf{r}_t given the priors \sigma^2 \to \infty is given by

    \begin{align*} \lim_{\sigma \to \infty} \left\{ \, - \, {\textstyle \frac{1}{N}} \cdot \log \mathrm{Pr}( \mathbf{r}_t | \mathbf{x}_{t-1}, \, \sigma ) \, \right\} &= {\textstyle \frac{1}{2 \cdot N}} \cdot {\textstyle \sum_{n=1}^N} (r_{n,t} - \hat{\beta}_{\text{OLS}} \cdot x_{n,t-1})^2 + {\textstyle \frac{1}{2}} \cdot \log(2 \cdot \pi) + {\textstyle \frac{1}{2}} \cdot {\textstyle \frac{\log(N)}{N}}  + \mathrm{O}(\sfrac{1}{N}) \end{align*}

where \mathrm{O}(\sfrac{1}{N}) is big-“O” notation denoting terms that shrink at least as fast as \sfrac{1}{N} as N \to \infty.
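
As a numerical check on this approximation, the sketch below (illustrative values, and a flat prior on \beta to mimic the \sigma^2 \to \infty limit) computes -\frac{1}{N} \cdot \log of the integral by quadrature and compares it to the three leading terms above; the gap should be on the order of \sfrac{1}{N}.

    import numpy as np
    from scipy.integrate import quad

    rng = np.random.default_rng(4)
    N = 500                                            # illustrative sample size
    x = rng.standard_normal(N)
    x = (x - x.mean()) / x.std()
    r = 0.1 * x + rng.standard_normal(N)               # illustrative returns

    beta_ols = (x @ r) / (x @ x)

    def f(beta):
        # Average improbability of the data, dropping the prior term (sigma2 -> infinity)
        return 0.5 * np.mean((r - beta * x) ** 2) + 0.5 * np.log(2 * np.pi)

    # -(1/N) * log of the integral, by quadrature. Factor out exp(-N * f(beta_ols))
    # so the integrand doesn't underflow, and integrate over a window that contains
    # essentially all of its mass.
    scaled, _ = quad(lambda b: np.exp(-N * (f(b) - f(beta_ols))),
                     beta_ols - 1.0, beta_ols + 1.0)
    exact = f(beta_ols) - np.log(scaled) / N

    # Laplace/BIC approximation: the three leading terms in the display above
    approx = f(beta_ols) + 0.5 * np.log(N) / N

    print(exact, approx, abs(exact - approx))          # difference should be O(1/N)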

5. Formatting

Bringing everything together, hopefully it’s now clear why we can decide whether to include x in our predictive model by checking whether

    \begin{align*} {\textstyle \frac{1}{N}} \cdot {\textstyle \sum_{n=1}^N} (r_{n,t} - \hat{\beta}_{\text{OLS}} \cdot x_{n,t-1})^2 + {\textstyle \frac{\log(N)}{N}}  + \mathrm{O}(\sfrac{1}{N}) \leq {\textstyle \frac{1}{N}} \cdot {\textstyle \sum_{n=1}^N} (r_{n,t} - 0)^2. \end{align*}

The \sfrac{\log(N)}{N} penalty term accounts for the fact that you’re going to be overfitting the data in sample when you include more right-hand-side variables. This criterion was first proposed in Schwarz (1978), who showed that the criterion becomes exact as N \to \infty. The Bayesian information criterion is often written as an optimization problem as well:

    \begin{align*} \hat{\beta}_{\text{MAP}} &=  \arg \min_{\beta} \left\{  \, {\textstyle \frac{1}{N}} \cdot {\textstyle \sum_{n=1}^N} (r_{n,t} - \beta \cdot x_{n,t-1})^2 + {\textstyle \frac{\log(N)}{N}} \cdot \mathrm{1}_{\{ \beta \neq 0 \}} \, \right\}. \end{align*}

Both ways of writing down the criterion are the same. They just look different due to formatting. There is one interesting idea that pops out of writing down the Bayesian information criterion as an optimization problem. Solving for \hat{\beta}_{\text{MAP}} suggests that you should completely ignore any predictors with sufficiently small OLS coefficients:

    \begin{align*} \hat{\beta}_{\text{MAP}} &=  \begin{cases} \hat{\beta}_{\text{OLS}} &\text{if } |\hat{\beta}_{\text{OLS}}| \geq \sqrt{{\textstyle \frac{\log(N)}{N}}}, \text{ and} \\ 0 &\text{otherwise.} \end{cases} \end{align*}
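
To close, here’s a minimal sketch (same illustrative setup as before) showing that solving the penalized problem by brute force, i.e., comparing the objective at \hat{\beta}_{\text{OLS}} and at 0, agrees with the hard-thresholding rule above.

    import numpy as np

    rng = np.random.default_rng(5)
    N = 500
    x = rng.standard_normal(N)
    x = (x - x.mean()) / x.std()
    r = 0.05 * x + rng.standard_normal(N)              # small true slope (illustrative)

    beta_ols = (x @ r) / (x @ x)

    # Direct comparison of the two candidate solutions of the penalized problem
    obj_include = np.mean((r - beta_ols * x) ** 2) + np.log(N) / N
    obj_exclude = np.mean(r ** 2)
    beta_bic_direct = beta_ols if obj_include <= obj_exclude else 0.0

    # Equivalent hard-thresholding rule
    threshold = np.sqrt(np.log(N) / N)
    beta_bic_threshold = beta_ols if abs(beta_ols) >= threshold else 0.0

    print(beta_ols, threshold, beta_bic_direct, beta_bic_threshold)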
