Research Notebook

Why Bayesian Variable Selection Doesn’t Scale

January 19, 2017 by Alex

1. Motivation

Traders are constantly looking for variables that predict returns. If x is the only candidate variable traders are considering, then it’s easy to use the Bayesian information criterion to check whether x predicts returns. Previously, I showed that using the univariate version of the Bayesian information criterion means solving

(●)   \begin{align*} \hat{\beta} &= \arg \min_{\beta}  \big\{  \,  \underset{\text{Prediction Error}}{{\textstyle \frac{1}{S}} \cdot {\textstyle \sum_s} (r_s - \beta \cdot x_s)^2} + \underset{\text{Penalty}}{ \lambda \cdot 1_{\{ \beta \neq 0 \}} } \,  \big\} \qquad \text{with} \qquad \lambda = {\textstyle \frac{1}{S}} \cdot \log(S) \end{align*}

after standardizing things so that \hat{\mu}_x = \hat{\mu}_r = 0 and \hat{\sigma}_x^2 = 1. If the solution is some \hat{\beta} \neq 0, then x predicts returns. Notation: Parameters with hats are in-sample estimates. For example, if x_s \overset{\scriptscriptstyle \text{iid}}{\sim} \mathrm{N}(0, \, 1), then \frac{1}{S} \cdot \sum_s x_s = \hat{\mu}_x \sim \mathrm{N}(0, \, \sfrac{1}{S}).

But, what if there’s more than 1 variable? There’s an obvious multivariate extension of (●):

(⣿)   \begin{align*} \{\hat{\beta}_1, \, \ldots, \, \hat{\beta}_V \} &= \arg \min_{\beta_1, \, \ldots, \, \beta_V}  \left\{  \,  {\textstyle \frac{1}{S}} \cdot {\textstyle \sum_s} ( r_s - {\textstyle \sum_v} \, \beta_v \cdot x_{s,v} )^2 + \lambda \cdot {\textstyle \sum_v}  1_{\{ \beta_v \neq 0 \}} \,  \right\}. \end{align*}

So, you might guess that it’d be easy to check which subset of V \geq 1 variables predicts returns by evaluating (⣿). But, it’s not. To evaluate the multivariate version of the Bayesian information criterion, traders would have to check 2^V different subsets of variables. That’s a combinatorial nightmare when V \gg 1. Thus, traders can’t take a strictly Bayesian approach to variable selection when there are lots of variables to choose from.

Why is evaluating (⣿) so hard? That’s the topic of today’s post.

2. Non-Convex Problem

Let’s start by looking at what makes (●) so easy. The key insight is that you face the same-sized penalty no matter what \hat{\beta} \neq 0 you choose when solving the univariate version of the Bayesian information criterion:

    \begin{align*} \lambda \cdot 1_{\{ 0.01 \neq 0 \}} = \lambda \cdot 1_{\{ 100 \neq 0 \}} = \lambda \qquad \text{or, put differently} \qquad {\textstyle \frac{\mathrm{d}\lambda}{\mathrm{d}\beta}} = 0. \end{align*}

So, if you’re going to set \hat{\beta} \neq 0, then you might as well choose the value that minimizes your prediction error:

    \begin{align*} \arg \min_{\beta \neq 0}  \left\{  \,  {\textstyle \frac{1}{S}} \cdot {\textstyle \sum_s} (r_s - \beta \cdot x_s)^2 + \lambda \cdot 1_{\{ \beta \neq 0 \}} \,  \right\} &=  \arg \min_{\beta}  \left\{  \,  {\textstyle \frac{1}{S}} \cdot {\textstyle \sum_s} (r_s - \beta \cdot x_s)^2 + \lambda \,  \right\} \\ &= \arg \min_{\beta}  \left\{  \,  {\textstyle \frac{1}{S}} \cdot {\textstyle \sum_s} (r_s - \beta \cdot x_s)^2 \,  \right\} \\ &= \hat{\beta}^{\text{OLS}}. \end{align*}

Thus, to evaluate (●), all you have to do is check 2 parameter values, \beta = 0 and \beta = \hat{\beta}^{\text{OLS}}, and see which one gives a smaller result. Practically speaking, this means running an OLS regression, r_s = \hat{\beta}^{\text{OLS}} \cdot x_s + \hat{\epsilon}_s, and checking whether or not the penalized residual variance, \hat{\sigma}_\epsilon^2 + \lambda, is smaller than the raw return variance, \hat{\sigma}_r^2.
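
In code, this two-point check is just one regression and one comparison. Here’s a minimal sketch in Python (NumPy), where r and x are simulated placeholders for whatever return and signal data a trader actually has:

    import numpy as np

    def univariate_bic_check(r, x):
        """Solve the univariate criterion by comparing beta = 0 against the OLS slope."""
        S = len(r)
        x = (x - x.mean()) / x.std()            # standardize so mu_x = 0 and sigma_x^2 = 1
        r = r - r.mean()                        # and mu_r = 0
        lam = np.log(S) / S                     # the penalty from the criterion above
        beta_ols = (x @ r) / (x @ x)            # OLS slope
        penalized = np.mean((r - beta_ols * x) ** 2) + lam   # penalized residual variance
        raw = np.mean(r ** 2)                                # raw return variance
        return beta_ols if penalized < raw else 0.0

    # Example with simulated data in which x predicts returns by construction.
    rng = np.random.default_rng(0)
    x = rng.normal(size=500)
    r = 0.3 * x + rng.normal(size=500)
    print(univariate_bic_check(r, x))           # prints a slope near 0.3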

Most explanations for why (⣿) is hard to evaluate focus on the fact that (⣿) is a non-convex optimization problem (e.g., see here and here). But, the univariate version of the Bayesian information criterion is also a non-convex optimization problem. Just look at the region around \beta = 0 in the figure to the left, which shows the objective function from (●). So, non-convexity can only be part of the explanation for why (⣿) is hard to evaluate. Increasing the number of variables must add a missing ingredient.

3. The Missing Ingredient

[Figure: the correlation structure between returns and the V = 3 candidate variables. Data + Code]

If there are many variables to consider, then these variables can be correlated. Correlation is the missing ingredient that makes evaluating (⣿) hard. Let’s look at a numerical example to see why.

Suppose there are only S = 9 stocks and V = 3 variables. The diagram above summarizes the correlation structure between these variables and returns. The red bar is the total variation in returns. The blue bars represent the portion of this variation that’s related to each of the 3 variables. If you can draw a vertical line through a pair of bars (i.e., the bars overlap), then the associated variables are correlated. So, because the first 2 blue bars don’t overlap, x_1 and x_2 are perfectly uncorrelated in-sample:

    \begin{align*} \widehat{\mathrm{Cor}}[x_1, \, x_2] &= 0. \end{align*}

Whereas, because the first 2 blue bars both overlap the third, x_1 and x_2 are both correlated with x_3:

    \begin{align*} \widehat{\mathrm{Cor}}[x_1, \, x_3] = \widehat{\mathrm{Cor}}[x_2, \, x_3] &= 0.41. \end{align*}

Finally, longer overlaps denote larger correlations. So, x_3 is the single best predictor of returns since the third blue bar has the longest overlap with the top red bar:

    \begin{align*} \widehat{\mathrm{Cor}}[r, \, x_1] = \widehat{\mathrm{Cor}}[r, \, x_2] &= 0.62 \\ \widehat{\mathrm{Cor}}[r, \, x_3] &= 0.67. \end{align*}

And, this creates a problem. If you had to pick only 1 variable to predict returns, then you’d pick x_3:

    \begin{align*} {\textstyle \frac{1}{9}} \cdot {\textstyle \sum_s} ( r_s - \hat{\beta}_3^{\text{OLS}} \cdot x_{s,3} )^2 + \lambda = 0.80 < 0.86 &= {\textstyle \frac{1}{9}} \cdot {\textstyle \sum_s} ( r_s - \hat{\beta}_1^{\text{OLS}} \cdot x_{s,1} )^2 + \lambda \\ &= {\textstyle \frac{1}{9}} \cdot {\textstyle \sum_s} ( r_s - \hat{\beta}_2^{\text{OLS}} \cdot x_{s,2} )^2 + \lambda. \end{align*}

But, \{x_1, \, x_2 \} is actually the subset of variables that minimizes (⣿) in this example:

    \begin{align*} {\textstyle \frac{1}{9}} \cdot {\textstyle \sum_s} ( r_s - {\textstyle \sum_{v \in \{1,2\}}} \hat{\beta}_v^{\text{OLS}} \cdot x_{s,v} )^2 + 2 \cdot \lambda = 0.72. \end{align*}

In other words, the variable that best predicts returns on its own isn’t part of the subset of variables that collectively best predict returns. In other examples it might be, but there’s no quick way to figure out which kind of example we’re in because (⣿) is a non-convex optimization problem. Until you actually plug \{x_1, \, x_2 \} into (⣿), there’s absolutely no reason to suspect that either variable belongs to the subset that minimizes (⣿). Think about it. x_1 and x_2 are both worse choices than x_3 on their own. And, if you start with x_3 and add either variable, things get even worse:

    \begin{align*} {\textstyle \frac{1}{9}} \cdot {\textstyle \sum_s} ( r_s - \hat{\beta}_3^{\text{OLS}} \cdot x_{s,3} )^2 + \lambda = 0.80 < 0.90 &= {\textstyle \frac{1}{9}} \cdot {\textstyle \sum_s} ( r_s - {\textstyle \sum_{v \in \{3,1\}}} \, \hat{\beta}_v^{\text{OLS}} \cdot x_{s,v} )^2 + 2 \cdot \lambda \\ &= {\textstyle \frac{1}{9}} \cdot {\textstyle \sum_s} ( r_s - {\textstyle \sum_{v \in \{3,2\}}} \, \hat{\beta}_v^{\text{OLS}} \cdot x_{s,v} )^2 + 2 \cdot \lambda. \end{align*}
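
As a check on these numbers, here’s a short sketch (Python/NumPy) that reproduces them directly from the correlations listed above. It assumes that returns are also scaled to unit variance and that the \hat{\beta}_v^{\text{OLS}} in each subset are re-estimated by OLS on that subset; both assumptions are consistent with the values reported:

    import numpy as np

    lam = np.log(9) / 9                         # penalty with S = 9 stocks
    C = np.array([[1.00, 0.00, 0.41],           # in-sample correlations among x1, x2, x3
                  [0.00, 1.00, 0.41],
                  [0.41, 0.41, 1.00]])
    rho = np.array([0.62, 0.62, 0.67])          # in-sample correlations with returns

    def penalized_error(subset):
        """Residual variance of an OLS fit on the subset, plus the penalty."""
        idx = list(subset)
        r_squared = rho[idx] @ np.linalg.solve(C[np.ix_(idx, idx)], rho[idx])
        return 1.0 - r_squared + lam * len(idx)

    for subset in [(2,), (0,), (1,), (0, 1), (0, 2), (1, 2), (0, 1, 2)]:
        print(subset, round(penalized_error(subset), 2))
    # Prints 0.8, 0.86, 0.86, 0.72, 0.9, 0.9, 0.92 -- the winner is {x1, x2}.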

With many correlated variables, you can never tell how close you are to the subset of variables that best predicts returns. To evaluate (⣿), you’ve got to check all 2^V possible combinations. There are no shortcuts.
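
More generally, with raw data in hand, the search really does come down to looping over every subset. Here’s a sketch of that brute force, where r is a hypothetical S-vector of demeaned returns and X a hypothetical S \times V matrix of standardized candidate variables:

    import numpy as np
    from itertools import combinations

    def best_subset(r, X):
        """Evaluate the multivariate criterion by refitting OLS on all 2^V subsets."""
        S, V = X.shape
        lam = np.log(S) / S
        best, best_value = (), np.mean(r ** 2)              # start from the empty model
        for k in range(1, V + 1):
            for subset in combinations(range(V), k):
                Xs = X[:, list(subset)]
                beta, *_ = np.linalg.lstsq(Xs, r, rcond=None)   # OLS on this subset
                value = np.mean((r - Xs @ beta) ** 2) + lam * k
                if value < best_value:
                    best, best_value = subset, value
        return best, best_value

    # The loops visit 2^V - 1 non-empty subsets, so doubling V squares the work:
    # V = 20 already means about a million regressions, and V = 313 is hopeless.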

4. Birthday-Paradox Math

If correlation makes it hard to evaluate (⣿), then shouldn’t we be able to fix the problem by only considering independent variables? Yes… but only in a fairytale world with an infinite number of stocks, S \to \infty. In the real world, where there are almost as many candidate variables as there are stocks, the problem is unavoidable because even independent variables are going to be correlated in finite samples.

Suppose there are V \geq 2 independent variables that might predict returns. Although these variables are independent, they won’t be perfectly uncorrelated in finite samples. So, let’s characterize the maximum in-sample correlation between any pair of variables. After standardizing each variable so that \hat{\mu}_{x,v} = 0 and \hat{\sigma}_{x,v}^2 = 1, the in-sample correlation between x_v and x_{v'} when S \gg 1 is roughly:

    \begin{align*} \hat{\rho}_{v,v'} \sim \mathrm{N}(0, \, \sfrac{1}{S}). \end{align*}

Although the variance of this estimate, \mathrm{E}[(\hat{\rho}_{v,v'} - 0)^2] = \sfrac{1}{S}, vanishes as S \to \infty, our estimates won’t be exactly zero in finite samples.
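
As a quick sanity check on that \mathrm{N}(0, \, \sfrac{1}{S}) approximation, here’s a small simulation sketch that assumes the two variables are i.i.d. standard normal draws:

    import numpy as np

    rng = np.random.default_rng(0)
    S, trials = 100, 50_000

    # Draw many pairs of genuinely independent variables and standardize each draw.
    x = rng.normal(size=(trials, S))
    y = rng.normal(size=(trials, S))
    x = (x - x.mean(axis=1, keepdims=True)) / x.std(axis=1, keepdims=True)
    y = (y - y.mean(axis=1, keepdims=True)) / y.std(axis=1, keepdims=True)

    rho_hat = (x * y).mean(axis=1)              # in-sample correlation for each pair
    print(rho_hat.mean(), rho_hat.std())        # roughly 0.0 and 0.1 = 1/sqrt(S) for S = 100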

[Figure: the probability of observing a large in-sample correlation among independent candidate variables, as a function of S and V. Code]

Since the normal distribution is symmetric, the probability that x_v and x_{v'} have an in-sample correlation more extreme than c is:

    \begin{align*} \mathrm{Pr}[ |\hat{\rho}_{v,v'}| > c ] &= 2 \cdot \mathrm{Pr}[ \hat{\rho}_{v,v'} > c ] = 2 \cdot \mathrm{\Phi}(-c \cdot \!\sqrt{S}). \end{align*}

So, since there are {V \choose 2} \approx \frac{1}{2} \cdot V^2 pairs of variables when V \gg 1, treating these pairwise correlations as roughly independent gives the probability that no pair has a correlation more extreme than c:

    \begin{align*} \mathrm{Pr}[ \max |\hat{\rho}_{v,v'}| \leq c ] &\approx \big( \, 1 - 2 \cdot \mathrm{\Phi}(-c \cdot \!{\textstyle \sqrt{S}}) \, \big)^{\frac{1}{2} \cdot V^2}. \end{align*}

Here’s the punchline. Because V^2 shows up as an exponent in the equation above, the probability that all pairwise in-sample correlations happen to be really small shrinks exponentially fast as traders consider more and more variables. This means that finite-sample effects are always going to make evaluating (⣿) computationally intractable in real-world settings with many variables, even when the variables are truly uncorrelated as S \to \infty. For example, the in-sample correlation of \hat{\rho}_{1,3} = \hat{\rho}_{2,3} = 0.41 from the previous example might seem like an unreasonably high number for independent random variables, something that could only happen in a sample as tiny as S = 9. But, the figure above shows that even with S = 50 stocks, there’s still a 50{\scriptstyle \%} chance of observing at least one pairwise in-sample correlation of 0.41 or more when considering V = 20 candidate variables.
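
To check that last number, here’s a short sketch that plugs S = 50, V = 20, and c = 0.41 into the approximation above and then compares it with a direct simulation in which all 20 candidate variables are, by construction, independent standard normal draws:

    import math
    import numpy as np

    S, V, c = 50, 20, 0.41

    # Closed-form approximation: Pr[|rho_hat| > c] = 2 * Phi(-c * sqrt(S)) = erfc(c * sqrt(S / 2)).
    tail = math.erfc(c * math.sqrt(S / 2))
    pairs = V * (V - 1) // 2
    print(1 - (1 - tail) ** pairs)              # about 0.5

    # Direct simulation of the largest pairwise in-sample correlation.
    rng = np.random.default_rng(0)
    trials, exceed = 2_000, 0
    for _ in range(trials):
        X = rng.normal(size=(S, V))
        corr = np.corrcoef(X, rowvar=False)     # V x V matrix of in-sample correlations
        exceed += np.abs(corr[np.triu_indices(V, k=1)]).max() > c
    print(exceed / trials)                      # same ballpark; the approximation is a bit loose at S = 50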

5. What This Means

Market efficiency has been an “organizing principle for 30 years of empirical work” in academic finance. The principle is based on a negative feedback loop: predictable returns suggest profitable trading strategies, but implementing these strategies eliminates the initial predictability. So, if there are enough really smart traders, then they’re going to find the subset of variables that predicts returns and eliminate this predictability. That’s market efficiency.

Even for really smart traders, finding the subset of variables that predicts returns is hard. And, this problem gets harder when there are more candidate variables to choose from. But, while researchers have thought about this problem in the past, they’ve primarily focused on the dangers of p-hacking (e.g., see here and here). If traders regress returns on V = 20 different variables, then they should expect about 1 of these regressions to produce a statistically significant coefficient with a p\text{-value} \leq 0.05 even if none of the variables actually predicts returns. So, researchers have focused on correcting p-values.

But, this misses the point. Searching for the subset of variables that predicts returns isn’t hard because you have to adjust your p-values. It’s hard because it requires a brute-force search through all 2^V possible subsets of predictors. It’s hard because any optimization problem with a hard-thresholding rule like (⣿) can be re-written as an integer programming problem,

    \begin{align*} \min_{{\boldsymbol \delta} \in \{0, \, 1\}^V}  \left\{  \,  {\textstyle \frac{1}{S}}  \cdot {\textstyle \sum_s} (r_s - {\textstyle \sum_v} [\hat{\beta}_v^{\text{OLS}} \cdot x_{s,v}] \cdot \delta_v)^2 \quad \text{subject to} \quad k \geq {\textstyle \sum_v}  \delta_v \, \right\}, \end{align*}

which means that it’s NP-hard (e.g., see here, here, or here). It’s hard because, in the diagram above, finding the subset of variables that predicts returns is equivalent to finding the cheapest way to cover the red bar with blue bars at a cost of \lambda per blue bar, a covering problem in the same family as the knapsack problem.
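
To see the connection between (⣿) and the integer program above more explicitly, note that minimizing (⣿) is the same as first fixing how many variables get included, solving the constrained problem for that size, and then charging \lambda per included variable:

    \begin{align*} \min_{\beta_1, \, \ldots, \, \beta_V} \left\{ \, {\textstyle \frac{1}{S}} \cdot {\textstyle \sum_s} ( r_s - {\textstyle \sum_v} \, \beta_v \cdot x_{s,v} )^2 + \lambda \cdot {\textstyle \sum_v} 1_{\{ \beta_v \neq 0 \}} \, \right\} &= \min_{0 \leq k \leq V} \left\{ \, \lambda \cdot k + \min_{\sum_v 1_{\{ \beta_v \neq 0 \}} \leq k} \, {\textstyle \frac{1}{S}} \cdot {\textstyle \sum_s} ( r_s - {\textstyle \sum_v} \, \beta_v \cdot x_{s,v} )^2 \, \right\}. \end{align*}

The inner minimization is exactly the cardinality-constrained subset-selection problem, so the penalized criterion and the integer program describe the same search.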

So, the fact that Bayesian variable selection doesn’t scale is a really central problem. It means that, even if there are lots and lots of really smart traders, they may not be able to find the best subset of variables for predicting returns. You’re probably on a secure wifi network right now. This network is considered “secure” because cracking its 128-bit passcode would involve a brute-force search over 2^{128} possible keys, which would take 1 billion billion years. So, if there are over V = 313 predictors documented in top academic journals, why shouldn’t we consider the subset of variables that actually predicts returns “secure”, too? After all, finding it would involve a brute-force search over 2^{313} possible subsets. We might be able to approximate it. But, the exact collection of variables may as well be in a vault somewhere.
