How Many Assets Are Needed To Test a K-Factor Model?

1. Motivation

Imagine you’re a financial economist who thinks that some risk factor, $f_t$, explains the cross-section of expected returns. And, you decide to test your hunch. First, you regress the realized returns of $N$ different assets on $f_t$ to estimate each asset’s exposure to the risk factor, $\tilde{b}_n$:

$\begin{equation*} r_{n,t} = \tilde{a}_n + \tilde{b}_n \cdot f_t + \tilde{e}_{n,t} \qquad t = 1, \, \ldots, \, T \, \, \text{for each } n \end{equation*}$

Then, you regress these same assets’ average returns on their exposures to the risk factor, $\tilde{b}_n$:

$\begin{equation*} \mathrm{E} [ r_n ] = \hat{\alpha} + \hat{\lambda} \cdot \tilde{b}_n + \hat{\epsilon}_n \qquad n = 1, \, \ldots, \, N \end{equation*}$

If $f_t$ is a priced factor, then the slope coefficient, $\hat{\lambda}$, should be large and the intercept, $\hat{\alpha}$, should be zero.
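As a rough illustration of the two regressions above, here is a minimal Python sketch using NumPy on simulated data. The function name `two_pass_regression`, the simulated parameters, and the assumption that the factor's mean equals its price of risk are illustrative choices of mine, not something taken from the papers cited in this post:

```python
import numpy as np

def two_pass_regression(returns, factor):
    """Two-pass estimate for a single factor (illustrative sketch).

    returns: (T, N) array of realized asset returns.
    factor:  (T,) array of factor realizations.
    """
    T, N = returns.shape
    # Pass 1: time-series regression of every asset's returns on the factor.
    X = np.column_stack([np.ones(T), factor])
    coef, *_ = np.linalg.lstsq(X, returns, rcond=None)  # row 0: a_n, row 1: b_n
    betas = coef[1]
    # Pass 2: cross-sectional regression of average returns on the betas.
    Z = np.column_stack([np.ones(N), betas])
    (alpha, lam), *_ = np.linalg.lstsq(Z, returns.mean(axis=0), rcond=None)
    return betas, alpha, lam

# Simulated example (all numbers are made up for illustration).
rng = np.random.default_rng(0)
T, N = 600, 25
true_betas = rng.uniform(0.5, 1.5, size=N)
factor = 0.5 + rng.normal(size=T)  # assumed price of risk: 0.5
returns = np.outer(factor, true_betas) + 0.5 * rng.normal(size=(T, N))

betas, alpha, lam = two_pass_regression(returns, factor)
print(f"alpha = {alpha:.3f}, lambda = {lam:.3f}")
```

In this simulated world the factor is priced by construction, so the estimated intercept should land near zero and the estimated slope near the factor's sample mean.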

This two-stage methodology dates back to Fama and MacBeth (1973). And, in keeping with the original paper, many financial economists still use a small set of portfolios as the $N$ assets in their empirical analysis. For example, Fama and French (1993) use $N = 25$ portfolios created by sorting stocks based on size and book-to-market. In theory, this is fine. If you’ve found a priced risk factor, then exposure to $f_t$ should affect the average returns of all assets. There is no theoretical guidance about which assets to use.

But, here’s the thing: econometrically, the number of assets has to matter. We live in a world with lots of factors to choose from. This is the “anomaly zoo” coined in Cochrane (2011). If you consider a model with at least as many risk factors as you have assets, $K \geq N$, then of course you can perfectly fit the data:

$\begin{equation*} \mathrm{E} [ r_n ] = 0 + {\textstyle \sum_{f \in \mathcal{K}}} \, \hat{\lambda}_f \cdot \tilde{b}_{n,f} + 0 \qquad n = 1, \, \ldots, \, N \leq K \end{equation*}$

After all, a system of $N$ linear equations with $K$ unknowns is guaranteed to have a solution if $K \geq N$, so long as the $N \times K$ matrix of exposures has full row rank (which it generically does).
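A quick numerical check of this point, as a sketch in Python with NumPy. The "average returns" and "exposures" below are pure noise that I generate for illustration, so any perfect fit is mechanical rather than economic:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 25, 30  # more factors than assets

avg_returns = rng.normal(size=N)     # arbitrary "expected returns"
exposures = rng.normal(size=(N, K))  # arbitrary exposures, unrelated to returns

# With K >= N and full row rank, least squares finds prices of risk that
# reproduce the average returns exactly (up to floating-point error).
lambdas, *_ = np.linalg.lstsq(exposures, avg_returns, rcond=None)
max_pricing_error = np.abs(exposures @ lambdas - avg_returns).max()
print(f"largest pricing error: {max_pricing_error:.2e}")
```

The pricing errors vanish even though the exposures carry no information about returns at all.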

Clearly, you need at least $K$ assets to test a $K$ -factor model. This is obvious. But, you can elaborate on this idea and say something useful in the related setting where $K \leq N \leq F$ . This post answers the question: how many assets do you need when testing a $4$ -factor model chosen in a world with $F = 97$ candidate factors?

2. Problem Formulation

Suppose we live in a world with $N$ assets and $F$ candidate factors. Think about $F = 97$ (McLean and Pontiff; 2016), $F = 333$ (Harvey, Liu, and Zhu; 2016), or $F = 447$ (Hou, Xue, and Zhang; 2017). We are looking for the asset-pricing model with the fewest factors that perfectly explains the cross-section of expected returns:

(★) $\begin{equation*} \min_{\mathcal{K} \subseteq \mathcal{F}} \left\{ \, {\textstyle \sum_{f \in \mathcal{K}}} \, 1_{\{\hat{\lambda}_f \neq 0\}} \quad \text{s.t.} \quad \mathrm{E}[ r_n ] = {\textstyle \sum_{f \in \mathcal{K}}} \, \hat{\lambda}_f \cdot \tilde{b}_{n,f} \, \right\} \end{equation*}$

This is just a mathematical way of applying Occam’s razor to our model-selection problem.

Here’s what I want to know: if we find a solution to this problem, $\hat{\mathcal{K}}$ , how likely is it that $\hat{\mathcal{K}}$ is the only solution? In other words, if we find a $\hat{K}$ -factor model that solves Problem (★) and perfectly explains the cross-section of expected returns, should we celebrate or slow clap? Have we found the simplest model of the world? Or, is this just one of many such $\hat{K}$ -factor models that we were bound to uncover?

3. Similar Exposures

It turns out that the answer to this question is going to critically depend on the similarity of the $N$ assets’ exposures to the $F$ risk factors. To see why, just think about a world where $\mathcal{F}$ contains two copies of some risk factor in $\hat{\mathcal{K}}$ . Clearly, $\hat{\mathcal{K}}$ can’t be a unique solution to Problem (★) because you could just switch out the risk factor for its twin and have another $\hat{K}$ -factor model explaining the cross-section of expected returns.
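To make Problem (★) and the twin-factor argument concrete, here is a brute-force sketch in Python. The function `sparsest_exact_models`, the tolerance, and the tiny simulated world with $N = 6$ and $F = 8$ are all hypothetical choices for illustration; enumerating subsets like this is only feasible when $F$ is small:

```python
from itertools import combinations
import numpy as np

def sparsest_exact_models(avg_returns, exposures, tol=1e-10):
    """Return (k, models): the smallest k that allows an exact fit, and
    every k-factor subset (as column indices) that achieves it."""
    n_factors = exposures.shape[1]
    for k in range(1, n_factors + 1):
        models = []
        for subset in combinations(range(n_factors), k):
            cols = exposures[:, list(subset)]
            lam, *_ = np.linalg.lstsq(cols, avg_returns, rcond=None)
            if np.abs(cols @ lam - avg_returns).max() < tol:
                models.append(subset)
        if models:
            return k, models
    return None, []

rng = np.random.default_rng(0)
exposures = rng.normal(size=(6, 8))  # N = 6 assets, F = 8 candidate factors
# By construction, expected returns are explained by factors 1 and 4 only.
avg_returns = exposures[:, [1, 4]] @ np.array([0.5, -0.3])

k, models = sparsest_exact_models(avg_returns, exposures)

# Now add an exact twin of factor 1 as a ninth candidate factor (index 8).
with_twin = np.column_stack([exposures, exposures[:, 1]])
k_twin, models_twin = sparsest_exact_models(avg_returns, with_twin)

print(k, models)            # the 2-factor model (1, 4) on its own
print(k_twin, models_twin)  # (1, 4) and its twin-swapped version (4, 8)
```

Without the twin, the search turns up a single $2$-factor model. With the twin, it turns up two of them, which is exactly the failure of uniqueness described above.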

Donoho and Huo (2001) give a nice way of generalizing this notion of similarity to situations where $\mathcal{F}$ doesn’t contain multiple copies of the same factor. Specifically, they define the idea of mutual coherence:

$\begin{equation*} \rho_{\max} \overset{\scriptscriptstyle \text{def}}{=} \max_{1 \leq f < f' \leq F} \left\{ \, \frac{ \left| \sum_{n=1}^N \tilde{b}_{n,f} \cdot \tilde{b}_{n,f'} \right| }{ \sqrt{\sum_{n=1}^N \tilde{b}_{n,f}^2} \cdot \sqrt{\sum_{n=1}^N \tilde{b}_{n,f'}^2} } \, \right\} \end{equation*}$

Roughly speaking, you should think about mutual coherence as measuring the maximum correlation between the $N$ assets’ exposures to any pair of risk factors in $\mathcal{F}$. A large value of $\rho_{\max}$ means that the $N$ assets have really similar exposures to some pair of risk factors in $\mathcal{F}$. And, in the extreme case where $\mathcal{F}$ contains two exact copies of the same risk factor, we have $\rho_{\max} = 1$. By contrast, $\rho_{\max} = 0$ is only possible if there are at least as many assets as risk factors, $N \geq F$, and the assets’ exposures to every pair of risk factors are orthogonal.
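The definition of $\rho_{\max}$ translates almost line by line into code. Here is a minimal Python sketch, where the function name `mutual_coherence` is my own and the exposures are assumed to be stored as an $N \times F$ array with no all-zero columns:

```python
import numpy as np

def mutual_coherence(exposures):
    """Largest absolute normalized inner product between any two columns
    of an (N, F) exposure matrix. Assumes no column is identically zero."""
    B = np.asarray(exposures, dtype=float)
    unit = B / np.linalg.norm(B, axis=0, keepdims=True)  # unit-length columns
    gram = np.abs(unit.T @ unit)                         # (F, F) similarities
    np.fill_diagonal(gram, 0.0)                          # ignore f == f'
    return gram.max()

rng = np.random.default_rng(0)
B = rng.normal(size=(25, 5))
B_twin = np.column_stack([B, B[:, 0]])  # append an exact copy of factor 0

print(mutual_coherence(np.eye(4)))  # orthogonal exposures
print(mutual_coherence(B_twin))     # twin factors
```

Orthogonal exposures give a coherence of zero, and a duplicated factor pushes it to one (up to rounding), matching the two extreme cases described above.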

But, since we are thinking about a world where there are more candidate risk factors than assets, $F > N$, we know that $\rho_{\max} > 0$. Even if the $F$ factors really are independent, some of them are going to have to look correlated in such a small sample. And, Welch (1974) gives a lower bound on how correlated the most similar pair must look:

(W) $\begin{equation*} \rho_{\max} \geq \sqrt{\frac{F - N}{N \cdot (F - 1)}} \qquad \text{given } F > N \end{equation*}$
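Here is a small Python sketch of the bound in Equation (W), checked against two sets of made-up exposures. The helper names `welch_bound` and `coherence` are hypothetical. The first example, three directions in the plane spaced 120 degrees apart, is a textbook case where the bound holds with equality; the second uses random exposures with $N = 25$ and $F = 97$:

```python
import numpy as np

def welch_bound(n_assets, n_factors):
    """Lower bound (W) on mutual coherence; stated for n_factors > n_assets."""
    if n_factors <= n_assets:
        raise ValueError("the bound (W) is stated for F > N")
    return np.sqrt((n_factors - n_assets) / (n_assets * (n_factors - 1)))

def coherence(B):
    """Mutual coherence of the columns of an (N, F) matrix."""
    unit = B / np.linalg.norm(B, axis=0, keepdims=True)
    gram = np.abs(unit.T @ unit)
    np.fill_diagonal(gram, 0.0)
    return gram.max()

# N = 2 assets, F = 3 factors: exposure vectors 120 degrees apart.
angles = np.deg2rad([90.0, 210.0, 330.0])
B_tight = np.vstack([np.cos(angles), np.sin(angles)])

# N = 25 assets, F = 97 factors: independent random exposures.
rng = np.random.default_rng(0)
B_random = rng.normal(size=(25, 97))

print(welch_bound(2, 3), coherence(B_tight))
print(welch_bound(25, 97), coherence(B_random))
```

Random exposures will typically sit well above the bound, which is a reminder that (W) describes the best case rather than the typical one.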

4. Theoretical Minimum

Now, we can answer our original question: when can we be sure that a solution to Problem (★) is unique? That is, if we find a $K$-factor model that perfectly explains the cross-section of returns for $N$ assets, should we be surprised? Donoho and Elad (2003) show that, if $\hat{\mathcal{K}}$ is a solution to Problem (★) and

(DE) $\begin{equation*} \hat{K} < \frac{1}{2} \cdot \left( 1 + \frac{1}{\rho_{\max}} \right), \end{equation*}$

then $\hat{\mathcal{K}}$ is a unique solution. There are no other factor models that explain the cross-section of expected returns using $\hat{K}$ or fewer factors. Inserting the bound from Equation (W) into the bound from Equation (DE) and then solving for $N$ yields:

$\begin{equation*} N_{\min} \overset{\scriptscriptstyle \text{def}}{=} \left\lceil \frac{F \cdot (2 \cdot K - 1)^2}{(F - 1) + (2 \cdot K - 1)^2} \right\rceil \end{equation*}$
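
For readers who want the intermediate algebra, here is one way to get there using only Equations (W) and (DE). Condition (DE) is equivalent to $\rho_{\max} < 1 / (2 \cdot K - 1)$. Since Equation (W) says $\rho_{\max}$ can never fall below the Welch bound, this is only possible if the Welch bound itself sits below $1 / (2 \cdot K - 1)$:

$\begin{equation*} \sqrt{\frac{F - N}{N \cdot (F - 1)}} < \frac{1}{2 \cdot K - 1} \quad \Longleftrightarrow \quad (F - N) \cdot (2 \cdot K - 1)^2 < N \cdot (F - 1) \quad \Longleftrightarrow \quad N > \frac{F \cdot (2 \cdot K - 1)^2}{(F - 1) + (2 \cdot K - 1)^2} \end{equation*}$

The first step squares both sides and clears denominators, and the second collects the terms involving $N$. Rounding the right-hand side up to a whole number of assets gives $N_{\min}$.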

What’s this equation telling us? Suppose we find a $\hat{K}$-factor model that perfectly explains the cross-section of expected returns, i.e., a factor model that solves Problem (★). If our empirical analysis used at least $N_{\min}$ assets, then it’s possible for the exposures to satisfy Equation (DE), in which case we can be sure that we’ve found the simplest possible model. Whereas, if we only used $(N_{\min} - 1)$ assets, then the bound in Equation (W) rules out Equation (DE), so there might be another asset-pricing model with the same number of factors that perfectly explains the cross-section of expected returns.

5. Plugging in Numbers

Let’s plug in some numbers to see if our formula for $N_{\min}$ makes any sense. Here’s the first exercise. Suppose that there really are only $F=97$ candidate risk factors like in McLean and Pontiff (2016). The figure to the right plots the minimum number of assets we’d need to include in our empirical analysis (y-axis) to identify a model with $K$ factors (x-axis). We need to study at least $N_{\min} = 9$ assets to be sure that we’ve found a unique $2$ -factor model; we need to study at least $N_{\min} = 21$ assets to be sure that we’ve found a unique $3$ -factor model; and, we need to study at least $N_{\min} = 33$ assets to be sure that we’ve found a unique $4$ -factor model. This is an interesting exercise because it says that we shouldn’t be surprised if someone found a $4$ -factor model that perfectly explained the cross-section of expected returns for the $25$ size- and value-sorted portfolios used in Fama and French (1993). In other words, since Gene and Ken already claimed the first $3$ factors, we can’t do cross-sectional asset pricing tests with just the Fama and French (1993) portfolios any more. Even if we found a $4$ -factor model that perfectly explained the cross-section of expected returns, there’d be no guarantee that it’d be unique.
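The numbers in this first exercise are easy to reproduce. Here is a short Python sketch of the $N_{\min}$ formula, where the function name `n_min` is my own and integer arithmetic is used so the ceiling isn't affected by floating-point rounding:

```python
def n_min(k, f):
    """N_min = ceil( F (2K-1)^2 / ((F-1) + (2K-1)^2) ), computed exactly."""
    c = (2 * k - 1) ** 2
    numerator, denominator = f * c, (f - 1) + c
    return -(-numerator // denominator)  # integer ceiling division

for k in (2, 3, 4):
    print(k, n_min(k, 97))
# prints:
# 2 9
# 3 21
# 4 33
```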

Now, let’s consider a second exercise. In a recent NBER working paper, Kozak, Nagel, and Santosh (2017) used a machine-learning rule to identify a cross-sectional asset-pricing model with $\hat{K} = 33$ factors. For the sake of argument, let’s imagine that this $33$-factor model perfectly explained the cross-section of expected returns. The figure to the left plots the minimum number of assets they’d need to include in their empirical analysis (y-axis) to be sure that they’d found the only such $33$-factor model in a world with $F$ candidate factors (x-axis). There are roughly $2500$ stocks at the moment. So, if their $33$-factor model had perfectly explained the cross-section of expected returns, then you should be really excited by this result if you think there are fewer than $F \approx 6000$ candidate factors.
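The $F \approx 6000$ threshold can be backed out by inverting the $N_{\min}$ formula for $F$. Here is a Python sketch, where `n_min` and `max_candidate_factors` are hypothetical helper names and the closed form comes from rearranging $N_{\min} \leq N$:

```python
def n_min(k, f):
    """N_min = ceil( F (2K-1)^2 / ((F-1) + (2K-1)^2) ), computed exactly."""
    c = (2 * k - 1) ** 2
    return -(-(f * c) // ((f - 1) + c))

def max_candidate_factors(k, n_assets):
    """Largest F for which N_min(k, F) <= n_assets.

    Rearranging F*c <= n_assets*(F - 1 + c) gives
    F <= n_assets*(c - 1) / (c - n_assets) whenever c > n_assets.
    """
    c = (2 * k - 1) ** 2
    if c <= n_assets:
        return None  # N_min stays at or below n_assets however large F gets
    return (n_assets * (c - 1)) // (c - n_assets)

f_star = max_candidate_factors(33, 2500)
print(f_star, n_min(33, f_star), n_min(33, f_star + 1))
# prints: 6121 2500 2501
```

So with $2500$ stocks and $K = 33$, the cutoff is $F = 6121$, in line with the rough figure of $6000$ quoted above.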

Let’s do one last exercise before we call it quits. Notice that, as the number of candidate factors gets large, the minimum number of assets we’d need to identify a $K$ -factor model only depends on $K$ :

(1) $\begin{equation*} \lim_{F \to \infty} N_{\min} = (2 \cdot K - 1)^2 \end{equation*}$

This observation is cool because, by setting $N_{\min} = 2500$ and solving for $K$, we can compute the size of the most complicated factor model that we can estimate with $2500$ stocks: $25 = \lfloor \tfrac{1}{2} \cdot ( \sqrt{2500} + 1) \rfloor$. No matter how many candidate factors there are, if we use the universe of all stocks in our empirical analysis, then a $25$-factor model that perfectly explains the cross-section of expected returns can still satisfy the uniqueness condition in Equation (DE).
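The same calculation works for any number of assets. Here is a tiny Python sketch, with a function name of my own choosing, that inverts the limit in Equation (1):

```python
from math import isqrt

def max_identifiable_factors(n_assets):
    """Largest K with (2K - 1)^2 <= n_assets, from the large-F limit of N_min."""
    return (isqrt(n_assets) + 1) // 2

print(max_identifiable_factors(2500))  # prints: 25
print(max_identifiable_factors(25))    # prints: 3
```

With only the $25$ Fama and French (1993) portfolios, the limiting formula tops out at $K = 3$, which lines up with the first exercise above.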