Imagine you’re looking for variables that predict the cross-section of expected returns. No search process is perfect. As you work, you will inevitably uncover both tradable anomalies and spurious correlations. To figure out which are which, you regress returns on each variable, $x_{n,t}$, that you come across:

$$r_{n,t+1} = \hat{\alpha} + \hat{\beta} \cdot x_{n,t} + \hat{\epsilon}_{n,t+1}$$
This is helpful because “any predictive regression can be expressed as a portfolio sort” and vice versa. So, a statistically significant $\hat{\beta}$ suggests a profitable stock-picking strategy.
But, what qualifies as a “statistically significant” test result? If a variable doesn’t actually predict returns, then the probability that its $\hat{\beta}$ will have a $t$-stat greater than $2.00$ is $5\%$ by definition:

$$\Pr\big(\, |t\text{-stat}| > 2.00 \,\big|\, \text{spurious correlation} \,\big) = 5\%$$
But, what if you didn’t just consider one variable on its own? Instead, suppose that you ran $N$ separate regressions. If all $N$ variables were just spurious correlations, the probability that at least one of these $\hat{\beta}$s has a $t$-stat greater than $2.00$ would be much larger than $5\%$:

$$\Pr\big(\, \max_{n \leq N} |t\text{-stat}_n| > 2.00 \,\big|\, \text{all } N \text{ spurious} \,\big) = 1 - 0.95^N$$
Finding a $t$-stat greater than $2.00$ becomes meaningless as you run more and more regressions. e.g., with $N = 60$ candidate variables, $1 - 0.95^{60} \approx 0.95$, so you should expect to see at least one $t$-stat larger than $2.00$ about $95\%$ of the time.
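To make the multiple-testing arithmetic concrete, here is a small Python simulation (my own illustration, not from the original post; the sample length $T = 500$ and the $N = 60$ candidate count are arbitrary choices). It regresses simulated returns on pure-noise signals and counts how often at least one regression clears the $2.00$ bar.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def max_abs_tstat(N=60, T=500):
    """Regress T simulated return observations on each of N pure-noise
    candidate signals; return the largest |t-stat| across the N regressions."""
    r = rng.standard_normal(T)              # returns with no genuine predictors
    best = 0.0
    for _ in range(N):
        x = rng.standard_normal(T)          # a spurious candidate variable
        x_dm = x - x.mean()                 # demeaning = fitting an intercept
        beta = (x_dm @ r) / (x_dm @ x_dm)   # OLS slope estimate
        resid = (r - r.mean()) - beta * x_dm
        se = np.sqrt((resid @ resid) / (T - 2) / (x_dm @ x_dm))
        best = max(best, abs(beta / se))
    return best

trials = 1_000
hits = sum(max_abs_tstat() > 2.00 for _ in range(trials)) / trials
print(f"Pr(at least one |t-stat| > 2.00 across 60 spurious regressions) ≈ {hits:.2f}")
print(f"Benchmark if each test were an independent 5% coin flip: {1 - 0.95**60:.2f}")
```

The simulated frequency lands close to the $1 - 0.95^{60}$ benchmark, which is the point: with enough regressions, a “significant” result is nearly guaranteed even when nothing is there.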
When people in academic finance talk about the problem of “data mining”, this is what they’re referring to. It seems patently obvious that having such a high false-positive rate is a bad thing. And, at first glance, it seems like there’s an easy way to fix the problem: just use a larger cutoff for statistical significance. e.g., researchers have suggested using a $t$-stat greater than $3.00$ rather than $2.00$ to account for the fact that we’ve proposed thousands of candidate variables. But, is our obsession with minimizing the false-positive rate really the right approach? Do we always want to choose our statistical tests so that they have the lowest possible false-positive rate? Not necessarily. And, this post describes two reasons why.
Reason #1
We don’t care about the false-positive rate for its own sake. What we really want to know is: “Conditional on observing a significant test result, how likely is it that we’ve found an honest-to-goodness anomaly?” Using Bayes’ theorem, we can write this conditional probability as

$$\Pr\big(\, \text{anomaly} \,\big|\, \text{significant} \,\big) = \frac{\Pr\big(\, \text{significant} \,\big|\, \text{anomaly} \,\big) \cdot \Pr\big(\, \text{anomaly} \,\big)}{\Pr\big(\, \text{significant} \,\big|\, \text{anomaly} \,\big) \cdot \Pr\big(\, \text{anomaly} \,\big) + \Pr\big(\, \text{significant} \,\big|\, \text{spurious} \,\big) \cdot \Pr\big(\, \text{spurious} \,\big)}$$
Clearly, if we underestimate the false-positive rate, $\Pr(\, \text{significant} \mid \text{spurious} \,)$, then we’re going to overestimate this conditional probability because we’re going to be dividing by a smaller number on the right-hand side.
But, $\Pr(\, \text{significant} \mid \text{spurious} \,)$ isn’t the only term on the right-hand side of the equation! We care about more than just the false-positive rate when updating our priors. e.g., if we knew there were never any anomalies, then we could guarantee that every single significant result was a false positive. So, we would always conclude that $\Pr(\, \text{anomaly} \mid \text{significant} \,) = 0$ regardless of how much we underestimated the false-positive rate.
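A minimal sketch of the Bayes’-rule calculation above (my own code, not the post’s; the function name and the `power` argument for $\Pr(\text{significant} \mid \text{anomaly})$ are illustrative) makes the role of the base rate explicit: if anomalies never exist, the posterior is zero no matter how badly we misjudge the false-positive rate.

```python
def posterior(base_rate, fp_rate, power=1.0):
    """Pr(anomaly | significant) via Bayes' rule.
    base_rate = Pr(anomaly), fp_rate = Pr(significant | spurious),
    power = Pr(significant | anomaly)."""
    evidence = power * base_rate + fp_rate * (1.0 - base_rate)
    return power * base_rate / evidence if evidence > 0 else 0.0

# With a base rate of zero, every significant result is a false positive:
print(posterior(base_rate=0.0, fp_rate=0.05))  # 0.0
print(posterior(base_rate=0.0, fp_rate=0.50))  # still 0.0
```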
Let’s sharpen this insight with a little algebra. First, suppose that the unconditional probability of finding a tradable anomaly is $\pi$:

$$\Pr\big(\, \text{anomaly} \,\big) = \pi$$
Next, suppose that the probability of observing a significant test result for a spurious correlation is $(5 + \epsilon)\%$ for some $\epsilon \geq 0$, while the probability of observing a significant test result for a tradable anomaly is $100\%$:

$$\Pr\big(\, \text{significant} \,\big|\, \text{spurious} \,\big) = (5 + \epsilon)\% \qquad \text{and} \qquad \Pr\big(\, \text{significant} \,\big|\, \text{anomaly} \,\big) = 100\%$$
Thus, $\pi$ represents the base rate of observing anomalies, and $\epsilon$ represents the additional false-positive rate introduced by the data-mining problem described above. Roughly speaking, a larger $\pi$ means that anomalies are more common. And, a large $\epsilon$ means that regressions are easier to run.
We can express our posterior beliefs that a particular variable represents a tradable anomaly given a significant test result as

$$\Pr\big(\, \text{anomaly} \,\big|\, \text{significant} \,\big) = \frac{\pi}{\pi + (5 + \epsilon)\% \cdot (1 - \pi)}$$
Thus, ignoring the additional false positives introduced by data mining biases our posterior beliefs upward by

$$\text{Bias}(\epsilon) = \frac{\pi}{\pi + 5\% \cdot (1 - \pi)} - \frac{\pi}{\pi + (5 + \epsilon)\% \cdot (1 - \pi)}$$
The figure to the right plots this bias (y-axis: $\text{Bias}(\epsilon)$) as a function of the excess false-positive rate (x-axis: $\epsilon$). The line always slopes upward because the more you underestimate a test’s false-positive rate, the more confident you will be that you’ve found an anomaly. However, the shape of the line changes dramatically as you play around with the slider, which adjusts the base rate $\pi$. When $\pi$ is relatively large, the plot has more or less the same upward-sloping concave shape. But, when $\pi$ is close to zero, the plot flattens out dramatically. In other words, if the base rate is sufficiently small, then underestimating the false-positive rate doesn’t affect our posterior beliefs very much.
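The interactive figure itself isn’t reproduced here, but a few lines of Python (my own sketch of the parameterization above, which takes $\Pr(\text{significant} \mid \text{anomaly}) = 100\%$; the grid of $\pi$ and $\epsilon$ values is arbitrary) trace out the same pattern: the bias rises with $\epsilon$, and the whole curve flattens toward zero as the base rate $\pi$ shrinks.

```python
import numpy as np

def bias(pi, eps, alpha=0.05):
    """How much ignoring the extra false-positive rate eps overstates
    Pr(anomaly | significant), assuming Pr(significant | anomaly) = 100%."""
    naive = pi / (pi + alpha * (1.0 - pi))           # pretends the rate is still 5%
    true = pi / (pi + (alpha + eps) * (1.0 - pi))    # uses the actual (5 + eps)% rate
    return naive - true

eps_grid = np.linspace(0.0, 0.25, 6)                 # excess false-positive rates
for pi in (0.10, 0.01, 0.001):                       # slide the base rate downward
    row = "  ".join(f"{bias(pi, e):.3f}" for e in eps_grid)
    print(f"pi = {pi:<5}:  {row}")
```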
This observation implies something a little counterintuitive: a paper can’t simultaneously argue that A) almost all documented anomalies are in fact spurious correlations, and that B) it’s super important for other researchers to adopt the test procedure that the paper proposes to minimize the false-positive rate. Although these two claims often get made one after the other (e.g., here), they aren’t internally consistent. It’s one or the other. Minimizing the false-positive rate can only matter if there’s a non-negligible chance of finding a tradable anomaly.
Reason #2
If you’re headed to the doctor’s office for a pregnancy test, then false positives matter. Finding out you’re pregnant is a big deal. Later discovering that it was a mistake would be traumatic. But, testing for Coeliac disease is different. If you have the disease, then you really want to know. But, treating the disease only involves changing your diet. There’s no need to undergo risky surgery or take expensive medication. So, when testing for Coeliac disease, false positives aren’t such a big deal (unless you looooove bread). And, if given the choice, your doctor should choose a test for the disease that has the lowest false-negative rate, even if that means asking a few perfectly healthy patients to cut gluten out of their diets.
The same sort of logic applies to testing for tradable anomalies. If you can trade on a statistically significant anomaly using liquid actively-traded stocks, then why spend time worrying about the false-positive rate? If you find out you’re wrong, then you can quickly and painlessly exit the position. If this is the sort of world you’re operating in, then you might actually want to set up your statistical tests to minimize your false-negative rate. This is one way to interpret pithy trader sayings like “invest first, investigate later”.
The medical literature also gives a nice way of formalizing this idea using something called the number needed to treat (e.g., see here). Suppose I came to you with a bunch of variables that each seemed to predict the cross-section of expected returns. They each delivered significant excess returns in backtesting. If you choose

$$\text{NNT} = \frac{1}{\Pr\big(\, \text{anomaly} \,\big|\, \text{significant} \,\big) - \Pr\big(\, \text{anomaly} \,\big)}$$

of these variables, then you should expect your selections to contain one more tradable anomaly than if you had just picked $\text{NNT}$ variables at random—i.e., regardless of whether they had delivered significant excess returns in backtesting.
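As a hypothetical numerical example (my numbers, not the post’s), the number needed to treat is just the reciprocal of how far a significant test result moves you off the base rate:

```python
def number_needed_to_treat(base_rate, fp_rate, power=1.0):
    """NNT = 1 / (Pr(anomaly | significant) - Pr(anomaly))."""
    post = power * base_rate / (power * base_rate + fp_rate * (1.0 - base_rate))
    return 1.0 / (post - base_rate)

# e.g., a 1% base rate and a (5 + 10)% = 15% effective false-positive rate:
print(round(number_needed_to_treat(base_rate=0.01, fp_rate=0.15), 1))  # ~18.8
```

So, under these made-up parameters, you would need to pick roughly 19 significant backtests to expect one extra genuine anomaly relative to picking at random.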
Now, think about a portfolio that invests the same amount of money in strategies based on each of the $\text{NNT}$ candidate variables you choose. This portfolio’s expected return will depend on both the profitability of your one extra tradable anomaly and the losses on your other $(\text{NNT} - 1)$ spurious predictors:

$$\mathbb{E}[\text{portfolio return}] = \frac{1}{\text{NNT}} \times \big(\text{anomaly's return}\big) + \frac{\text{NNT} - 1}{\text{NNT}} \times \big(\text{average return on the spurious predictors}\big)$$
Clearly, a higher false-positive rate means a larger $\text{NNT}$. But, the formulation above illustrates why this might not be such a bad thing. If trading on your one genuine anomaly is really profitable and you can quickly identify/exit your remaining spurious positions, then who cares if $\text{NNT}$ is large?
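A back-of-the-envelope calculation (illustrative numbers only) shows why: if the one genuine anomaly earns a lot and the spurious positions are unwound quickly at a small loss, the equal-weighted portfolio still comes out ahead.

```python
def portfolio_expected_return(nnt, anomaly_ret, spurious_ret):
    """Equal-weighted portfolio across NNT chosen strategies: one extra
    genuine anomaly plus (NNT - 1) spurious predictors."""
    return (anomaly_ret + (nnt - 1) * spurious_ret) / nnt

# 19 candidate strategies, one genuine anomaly earning 20%, and the
# spurious positions exited quickly at a 0.5% loss apiece:
print(f"{portfolio_expected_return(19, 0.20, -0.005):+.2%}")  # +0.58%
```

Even with eighteen false positives in the mix, the quick exits keep the portfolio’s expected return positive.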