A Model of Hard-to-Diagnose Mispricings

1. Introduction

Important market events often have a variety of interpretations. For example, a recent Financial Times article outlined several different readings of Facebook’s “feeble showing… in the weeks since its $\mathdollar 16{\scriptstyle \mathrm{bn}}$ initial public offering”. “Maybe Morgan Stanley, which organized the IPO, got complacent. Maybe Facebook neglected to adapt its platform fully to the world of mobile devices. Maybe, if we are to believe the Los Angeles Times, the company, for all its $900{\scriptstyle \mathrm{m}}$ users, is ‘losing its cool’.” The article then tossed another hat into the ring. “Those explanations are wrong. There may be a simpler explanation: political risk… Facebook is less a revolution in technology than a revolution in property rights. It is to social life what enclosure was to grazing. Fed-up users might begin to question Facebook’s claim to full ownership of so much valuable personal information that they, the public, have generated.”

Whatever you think the right answer is, one thing is clear: traders can hold the exact same views for entirely different reasons. Moreover, while these views happen to line up for Facebook, they have wildly different implications for how a trader should behave in the rest of the market. For instance, if you think the poor performance was a result of Morgan Stanley’s hubris, then you should change the way you trade their upcoming IPOs. Alternatively, if you think the poor performance was a consequence of Facebook losing its cool, then you should change the way you trade Zynga. Finally, if you agree with the Financial Times reporter and think the poor performance was due to privacy concerns, then you should change the way you trade other companies, like Apple, which hoard users’ personal information.

Motivated by these observations, this post outlines an asset-pricing model where each asset has many plausibly relevant features, and, in order to turn a profit, arbitrageurs must diagnose which of these is relevant using past data.

2. Feature Space

I study a market with $N = 4$ assets. Let’s begin by looking at how I model each asset’s exposure to $Q \gg 4$ different features, each representing a different explanation for the asset’s performance. I use the indicator function, $x_{n,q}$ , to capture whether or not asset $n$ has exposure to the $q$ th feature:

(1) $\begin{align*} x_{n,q} &= \begin{cases} 1 &\text{if asset $n$ has feature $q$} \\ 0 &\text{else} \end{cases} \end{align*}$

For example, both National Semiconductor and Sequans Communications are in the semiconductor industry, so $x_{\text{NatlSemi},\text{SemiCond}} = 1$ and $x_{\text{Sequans},\text{SemiCond}} = 1$ . However, only National Semiconductor was involved in M&A rumors in Q1 2011, so $x_{\text{NatlSemi},\text{M\&A}} = 1$ but $x_{\text{Sequans},\text{M\&A}} = 0$ . Feature exposures are common knowledge. Everyone knows each value in the $(N \times Q)$ -dimensional matrix $\mathbf{X}$ , so there is no uncertainty about whether or not National Semiconductor belongs to the semiconductor industry. Each asset has exposure to exactly half of the $Q \gg 1$ different features, any of which might turn out to be payout relevant.
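To make the exposure matrix concrete, here is a minimal sketch of $\mathbf{X}$ for a toy market. The design is my own assumption and not part of the model: I create one feature for every pair of assets, which gives $Q = 6$ (far below the $Q \gg 4$ regime in the text) but keeps each asset exposed to exactly half of the features. The names `supports`, `X`, and `has_feature` are hypothetical.

```python
from itertools import combinations

N = 4  # number of assets, as in the text

# Toy assumption: one feature per 2-asset subset, so Q = 6 rather than Q >> 4.
supports = list(combinations(range(N), 2))
Q = len(supports)

# X[n][q] = 1 if asset n has feature q, and 0 otherwise (Equation (1)).
X = [[1 if n in supports[q] else 0 for q in range(Q)] for n in range(N)]


def has_feature(n: int, q: int) -> bool:
    """Indicator x_{n,q}: does asset n have exposure to feature q?"""
    return X[n][q] == 1


# Number of features each asset is exposed to; should equal Q / 2 here.
exposures_per_asset = [sum(row) for row in X]
```

In this toy design every row of `X` sums to $Q/2 = 3$, so the "exactly half" condition holds by construction rather than by random assignment.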

Fundamental values have a sparse representation in this space of $Q$ features. Only $K$ of the $Q$ possible features actually matter:

(2) $\begin{align*} Q \gg N \geq K \end{align*}$

There are enough observations, $N$ , to estimate the value of the $K$ feature-specific shocks using OLS if you knew ahead of time which features to analyze; however, there are many more possible features, $Q$ , than observations. Without an oracle, OLS is an ill-posed problem in this setting. This sparseness assumption embodies the idea that financial markets are large and diverse, so finding the right trading opportunity is a needle-in-a-haystack type problem.

For analytical convenience, I study the case with $N = 4$ and $K = 2$ where $2$ of the assets have exposure to $1$ of the feature-specific shocks and $2$ of the assets have exposure to the other feature-specific shock. For example, if there is a shock to all big-box stores and to all companies based in Ohio, then there are no superstores based in Ohio like Big Lots in the list of $N = 4$ assets. This is the simplest possible model in which the feature-specific average matters and every asset has exposure to the same number of shocks.

If only $2$ of the $Q$ features actually realize shocks and each shock affects a separate subset of $2$ firms, then there are:

(3) $\begin{align*} H = Q \times \frac{1}{6} \cdot (Q - 1) < {Q \choose 2} \end{align*}$

possible combinations of shocks. There are $Q$ different shocks to choose from for the first shock, and only $\sfrac{1}{6}$ th of the remaining $(Q - 1)$ shocks will not overlap assets with the first shock. I index each combination with $h = 1,2,\ldots,H$ where $h_\star$ denotes the true set of shocked features. Let $\mathcal{Q}$ denote the set of all features and $\mathcal{K}_h$ denote the $2$ features associated with index $h$ . Nature selects which of the $H$ combinations of $2$ features realizes feature-specific shocks uniformly at random:

(4) $\begin{align*} \mathrm{Pr}(\mathcal{K}_{h_\star} = \mathcal{K}_h) = \sfrac{1}{H} \end{align*}$

prior to the start of trading in period $t=1$ .

3. Asset Structure

We just saw what might impact asset values. Let’s now examine how these features actually affect markets. I study a model where nature selects fundamental values, $v_n \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0,\sigma_v^2)$ , prior to the start of trading. And, these fundamental values are a function of $2$ components: the particular feature-specific shock affecting each asset together with an idiosyncratic shock:

(5) $\begin{align*} v_n &= \mathbf{x}_n^{\top}{\boldsymbol \beta} + \beta_{0,n} = \sum_{q=1}^Q \beta_q \cdot x_{n,q} + \beta_{0,n} \qquad \text{with} \qquad 2 = \Vert {\boldsymbol \beta} \Vert_0 = \sum_{q=1}^Q 1_{\{\beta_q \neq 0\}} \end{align*}$

where $\beta_q$ denotes the extent to which the $q$ th feature affects fundamental values:

(6) $\begin{align*} \beta_q \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \begin{cases} \mathrm{N}(0, \sigma_{\beta}^2) &\text{if } q \in \mathcal{K}_{h_\star} \\ 0 &\text{else} \end{cases} \end{align*}$

and $\beta_{0,n} \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(0, \sigma_{\beta}^2)$ denotes the idiosyncratic shock. I use the $0$ th subscript to denote the idiosyncratic component of each stock for brevity, but always omit it when writing the $(Q \times 1)$ -dimensional vector of feature-specific shocks, ${\boldsymbol \beta}$ .

Each asset has exposure to only $1$ of the feature-specific shocks since it has exposure to a random subset of $\sfrac{1}{2}$ of all features. Thus, its fundamental volatility is given by:

(7) $\begin{align*} \sigma_v^2 = 2 \cdot \sigma_\beta^2 \end{align*}$

since both the feature-specific shocks and each asset’s idiosyncratic shock have variance $\sigma_{\beta}^2$ . The main benefit of forcing each asset to have exposure to exactly $1$ of the feature-specific shocks is that, under these conditions, every single one of the assets will have identical unconditional variance.
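A quick simulation can sanity-check Equation (7). The sketch below assumes only what the text states: each asset's value is the sum of one feature-specific shock and one idiosyncratic shock, both drawn independently from $\mathrm{N}(0, \sigma_{\beta}^2)$. The function names, the seed, and the value $\sigma_{\beta} = 0.5$ are arbitrary choices for illustration.

```python
import random


def simulate_values(sigma_beta: float, n_draws: int, seed: int = 0) -> list:
    """Draw v = beta_k + beta_0: one feature shock plus one idiosyncratic shock."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, sigma_beta) + rng.gauss(0.0, sigma_beta)
            for _ in range(n_draws)]


def sample_variance(xs: list) -> float:
    """Population-style variance of a list of floats."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


sigma_beta = 0.5  # illustrative value, not taken from the text
values = simulate_values(sigma_beta, n_draws=100_000)
sample_var = sample_variance(values)
theory_var = 2 * sigma_beta ** 2  # Equation (7)
```

With this many draws, `sample_var` should land close to `theory_var`, though the exact figure depends on the seed.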

4. Naifs’ Objective Function

How does information about these feature-specific shocks gradually creep into prices? Through naive asset-specific investors. There are $2$ such investors studying each of the $4$ stocks, one investor per shock. These so-called naifs choose how many shares to hold of a single asset, $\theta_{n,t}^{(k)}$ , in order to maximize their mean-variance utility over end-of-market wealth:

(8) $\begin{align*} \max_{\theta_{n,t}^{(k)} \in \mathrm{R}} \left\{ \, \mathrm{E}_{n,t}^{(k)}[w_{n,t}^{(k)}] - \frac{\gamma}{2} \cdot \mathrm{Var}_{n,t}^{(k)}[w_{n,t}^{(k)}] \, \right\} \quad \text{with} \quad w_{n,t}^{(k)} = (v_n - p_{n,t}) \cdot \theta_{n,t}^{(k)} \end{align*}$

where $\gamma > 0$ is their risk-aversion parameter. The $(k)$ superscript is necessary because there are $2$ kinds of naifs trading each asset: one that has information about the feature-specific shock and one that has information about the idiosyncratic shock.

Naifs trading in each of the $4$ assets see a private signal each period, $\epsilon_{n,t}^{(k)}$ , about how a single shock affects their asset:

(9) $\begin{align*} \epsilon_{n,t}^{(k)} \overset{\scriptscriptstyle \mathrm{iid}}{\sim} \mathrm{N}(\beta_{k}, \sigma_{\epsilon}^2) \end{align*}$

where $k = 0$ denotes a signal about stock $n$’s idiosyncratic component. For example, a naif studying Target Corp might get a signal about how the company’s fundamental value will rise due to an industry-specific supply-chain management innovation (big-box store feature-specific shock). The other naive asset-specific investor studying Target might then get a signal about how the company’s fundamental value will fall due to the unexpected death of their CEO (Target-specific idiosyncratic shock).

I make $3$ key assumptions about how the naifs solve their optimization problem. First, I assume that these investors believe each period that they’ll hold their portfolio until the liquidating dividend at time $t=2$ . Second, I assume that, while naifs see private signals about the size of a feature-specific shock, they do not generalize this information and apply it to other assets with this feature. For example, the naif who realized that Target’s fundamental value will rise due to the supply-chain innovation won’t use this information to reevaluate the correct price of Wal-Mart. Third, these naive investors do not condition on current or past asset prices when forming their expectations. To continue the example, this same investor studying Target won’t analyze the average returns of all big-box stores to get a better sense of how big a value shock the industry-specific supply-chain innovation really was.

All $3$ of these assumptions are motivated by bounded rationality. A naive asset-specific investor must use all his concentration just to figure out the implications of his private signals. With no cognitive capacity left to spare, he can’t implement a more complex, dynamic, trading strategy (first assumption), extend his insight to other companies (second assumption), or use prices to form a more sophisticated forecast of the liquidating dividend value (third assumption). These naifs behave similarly to the newswatchers from Hong and Stein (1999) and also neglect correlations in a similar fashion to Eyster and Weizsacker (2010).

5. Baseline Equilibrium

We now have enough structure to characterize a Walrasian equilibrium with private valuations. When no market-wide arbitrageurs are present, the price of each asset is given by:

(10) $\begin{align*} p_{n,1} &= \frac{1}{2} \cdot \left( \frac{\sigma_{\beta}^2}{\sigma_{\beta}^2 + \sigma_{\epsilon}^2} \right) \cdot \left\{ \epsilon_{n,1}^{(0)} + \epsilon_{n,1}^{(k_n)} \right\} \\ p_{n,2} &= \frac{1}{2} \cdot \left( \frac{\sigma_{\beta}^2}{2 \cdot \sigma_{\beta}^2 + \sigma_{\epsilon}^2} \right) \cdot \sum_{t=1}^2 \left\{ \epsilon_{n,t}^{(0)} + \epsilon_{n,t}^{(k_n)} \right\} \end{align*}$

where $k_n$ denotes the index of the particular shock affecting the $n$ th asset.
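One way to read Equation (10) is as the average of the two naifs' posterior means, where each naif shrinks the sum of his own signals toward the prior mean of zero. The sketch below implements that reading. The function names and the numerical signals are hypothetical, and the interpretation itself is my gloss on the formula rather than something the text spells out.

```python
def naif_posterior_mean(signals: list, var_beta: float, var_eps: float) -> float:
    """Posterior mean of a N(0, var_beta) shock after iid N(shock, var_eps) signals."""
    t = len(signals)
    return var_beta / (t * var_beta + var_eps) * sum(signals)


def baseline_price(idio_signals: list, feature_signals: list,
                   var_beta: float, var_eps: float) -> float:
    """Equation (10): average of the idiosyncratic and feature naifs' beliefs."""
    m0 = naif_posterior_mean(idio_signals, var_beta, var_eps)
    mk = naif_posterior_mean(feature_signals, var_beta, var_eps)
    return 0.5 * (m0 + mk)


# Made-up signals with sigma_beta^2 = sigma_eps^2 = 1.
p1 = baseline_price([0.4], [0.6], var_beta=1.0, var_eps=1.0)
p2 = baseline_price([0.4, 0.2], [0.6, 0.3], var_beta=1.0, var_eps=1.0)
```

Passing one signal per naif reproduces the $t=1$ line of Equation (10), and passing two signals reproduces the $t=2$ line, since the shrinkage factor becomes $\sigma_{\beta}^2 / (2 \cdot \sigma_{\beta}^2 + \sigma_{\epsilon}^2)$ .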

What do these formulas mean for an arbitrageur? Suppose that the big-box store supply-chain innovation occurred and affected assets $n=1,2$ . Naive asset-specific investors neglect the fact that they could use the average returns of assets in the big-box industry to refine their beliefs about the size of the shock. As an arbitrageur, you can profit from this neglected information by deducing the size of the shock from the industry average returns:

(11) $\begin{align*} \widehat{\beta}_k = \frac{1}{2} \cdot \sum_{n = 1}^2 \Delta \tilde{p}_{n,1} &\sim \mathrm{N}\left( \beta_k, \, 2 \cdot \sigma_{\epsilon}^2 \right) \end{align*}$

where $\Delta \tilde{p}_{n,1}$ is given by:

(12) $\begin{align*} \Delta \tilde{p}_{n,1} = 2 \cdot \left( \frac{\sigma_{\beta}^2 + \sigma_{\epsilon}^2}{\sigma_{\beta}^2}\right) \cdot \Delta p_{n,1} \end{align*}$

Simply buy shares of the underpriced big-box stock whose $p_{n,1} < \widehat{\beta}_k$ , and short shares of the overpriced big-box stock whose $p_{n,1} > \widehat{\beta}_k$ .
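The two steps in Equations (11) and (12) can be sketched as follows: first undo the naifs' shrinkage and the $\sfrac{1}{2}$ averaging in each price change, then average the rescaled price changes across the assets that share the feature. The function names and the price changes fed in are hypothetical. The sketch only computes the point estimate and makes no claim about its sampling distribution.

```python
def rescale_price_change(dp: float, var_beta: float, var_eps: float) -> float:
    """Equation (12): rescale a period-1 price change into signal units."""
    return 2.0 * (var_beta + var_eps) / var_beta * dp


def estimate_feature_shock(dps: list, var_beta: float, var_eps: float) -> float:
    """Equation (11): average rescaled price change over assets with the feature."""
    rescaled = [rescale_price_change(dp, var_beta, var_eps) for dp in dps]
    return sum(rescaled) / len(rescaled)


# Made-up period-1 price changes for the two big-box assets.
beta_hat = estimate_feature_shock([0.25, 0.5], var_beta=1.0, var_eps=1.0)
```

With $\sigma_{\beta}^2 = \sigma_{\epsilon}^2 = 1$ the rescaling factor is $4$ , so the two price changes map to $1.0$ and $2.0$ and the estimate is their average.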

Of course, in the real world, you wouldn’t have an oracle. You wouldn’t know ahead of time that the big-box store shock had occurred. Instead, you’d have to not only value the big-box store shock but also identify that the shock had occurred in the first place. Let’s now introduce arbitrageurs to the model and study this joint problem.

6. Arbitrageurs’ Objective Function

Arbitrageurs start out with no private information; however, unlike the naifs, they can observe all $4$ asset returns in period $t=1$ . They can then use this information to both value and identify feature-specific shocks, submitting market orders to maximize their risk-neutral utility over end-of-game wealth:

(13) $\begin{align*} \max_{{\boldsymbol \theta}^{(a)} \in \mathrm{R}^4} \left\{ \, \mathrm{E}\left[ \, \sum_{n=1}^4 (v_n - p_{n,2}) \cdot \theta_n^{(a)} \, \middle| \, \widehat{\mathcal{K}} \, \right] \, \right\} \end{align*}$

where $\widehat{\mathcal{K}}$ is chosen as the model of the world that minimizes the arbitrageurs’ average prediction error over the assets’ fundamental values given the observed period $t=1$ prices, $\Delta \tilde{\mathbf{p}}_1$ . In this model, much like that of Hong and Stein (1999), the naifs effectively serve as market makers.

Because there are more features than assets, $Q \gg 4$ , arbitrageurs must engage in model selection a la Barberis, Shleifer, and Vishny (1998) or Hong, Stein, and Yu (2007). Choosing the right model of the world is their main challenge. It’s figuring out whether Facebook’s IPO failed due to Morgan Stanley’s complacency or due to under-appreciated political risks. If arbitrageurs knew which $2$ features to analyze ahead of time, $\mathcal{K}_{h_\star}$ , then their problem would be dramatically easier. It would be as if they had an oracle sitting on their shoulder interpreting market events for them. They would then be able to use the usual OLS techniques to form beliefs about the size of the $2$ feature-specific shocks:

(14) $\begin{align*} \widehat{\boldsymbol \beta}[\mathcal{K}_{h_\star}] &= \left( \mathbf{X}[\mathcal{K}_{h_\star}]^{\top}\mathbf{X}[\mathcal{K}_{h_\star}] \right)^{-1}\mathbf{X}[\mathcal{K}_{h_\star}]^{\top}\mathbf{y} \end{align*}$

where $\mathbf{X}[\mathcal{K}_h]$ is $\mathbf{X}$ restricted to columns $\mathcal{K}_h$ , and ${\boldsymbol \beta}[\mathcal{K}_h]$ is ${\boldsymbol \beta}$ restricted to rows $\mathcal{K}_h$ . There is no hat over the choice of feature-specific shocks, $\mathcal{K}$ . Only the ${\boldsymbol \beta}$ has a hat over it. Only exact values of the shocks are unknown.
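For intuition about the oracle case, here is a small sketch of Equation (14) with two known features. I assume that $\mathbf{y}$ stands for the vector of rescaled period $t=1$ price changes, which the text does not state explicitly, and I solve the $2 \times 2$ normal equations by hand to avoid any dependencies. The feature labels follow the big-box and Ohio example above, and the numbers are invented.

```python
def ols_two_features(x1: list, x2: list, y: list) -> tuple:
    """Solve (X'X) b = X'y for a design matrix with two columns, x1 and x2."""
    a11 = sum(a * a for a in x1)
    a12 = sum(a * b for a, b in zip(x1, x2))
    a22 = sum(b * b for b in x2)
    c1 = sum(a * v for a, v in zip(x1, y))
    c2 = sum(b * v for b, v in zip(x2, y))
    det = a11 * a22 - a12 * a12
    if det == 0:
        raise ValueError("columns are collinear; OLS is not identified")
    return ((a22 * c1 - a12 * c2) / det, (a11 * c2 - a12 * c1) / det)


x_big_box = [1, 1, 0, 0]  # assets 1 and 2 are big-box stores
x_ohio = [0, 0, 1, 1]     # assets 3 and 4 are based in Ohio
y = [1.0, 2.0, 3.0, 5.0]  # made-up rescaled price changes

beta_hat = ols_two_features(x_big_box, x_ohio, y)
```

Because the two shocked features hit disjoint pairs of assets, the OLS estimate collapses to the within-group average of `y`, which is the same industry-average logic as in the baseline equilibrium.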

By contrast, the market-wide arbitrageurs in this model have to use some thresholding rule to cull the number of potential features down to a manageable number. They have to both select $\widehat{\mathcal{K}}$ and estimate $\widehat{\boldsymbol \beta}[\widehat{\mathcal{K}}]$ . While this daunting real-time econometrics problem is new to the asset-pricing literature, researchers and traders confront this problem every single day. As Johnstone (2013) argues, this sort of behavior “is very common, even if much of the time it is conducted informally, or perhaps most often, unconsciously. Most empirical data analyses involve, at the exploration stage, some sort of search for large regression coefficients, correlations or variances, with only those that appear ‘large’, or ‘interesting’ being retained for reporting purposes, or in order to guide further analysis.”

7. Bayesian Inference

Let’s now turn our attention to how a fully-rational Bayesian arbitrageur with non-informative priors should select which features to use. Bayes’ rule tells us that the posterior probability of a particular combination of shocks, $\mathrm{Pr}( \mathcal{K}_h | \Delta \tilde{\mathbf{p}}_{1} )$ , is proportional to the likelihood of observing the realized data given the combination, $\mathrm{Pr}( \Delta \tilde{\mathbf{p}}_{1} | \mathcal{K}_h )$ , times the prior probability of Nature choosing the combination of shocks, $\mathrm{Pr}( \mathcal{K}_h )$ :

(15) $\begin{align*} \mathrm{Pr}( \mathcal{K}_h | \Delta \tilde{\mathbf{p}}_{1} ) \propto \mathrm{Pr}( \Delta \tilde{\mathbf{p}}_{1} | \mathcal{K}_h ) \times \mathrm{Pr}( \mathcal{K}_h ) \end{align*}$

So, this arbitrageur will select the collection of at most $2$ features that maximizes the log-likelihood of the observed data:

(16) $\begin{align*} \widehat{\mathcal{K}} &= \arg \max_{\mathcal{K} \subset \mathcal{Q}} \, \left\{ \, \log \mathrm{Pr}( \Delta \tilde{\mathbf{p}}_{1} | \mathcal{K}) \ \, \text{s.t.{}} \ \, |\mathcal{K}| \leq 2 \, \right\} \end{align*}$

since each of the combinations of shocks is equally likely.

Why is there an inequality sign in Equation (16)? That is, why isn’t the constraint $|\mathcal{K}| = 2$ ? Because some of the elements in ${\boldsymbol \beta}[\mathcal{K}_{h_\star}]$ will be small. After all, it’s drawn from a Gaussian distribution. A fully-rational Bayesian arbitrageur will want to ignore some of the smaller elements in ${\boldsymbol \beta}[\mathcal{K}_{h_\star}]$ since he faces overfitting risk. For instance, if all Houston-based firms realize a local tax shock that increases their realized returns to the tune of $0.25{\scriptstyle \%}$ per year, then it will be impossible for a market-wide arbitrageur to spot this shock. Firm-level volatility can exceed $40{\scriptstyle \%}$ per year. An arbitrageur trying to recover such a weak signal out from amongst so much noise is more likely to overfit the observed data and draw the wrong inference.

Schwarz (1978) showed that fully Bayesian arbitrageurs in this setting should ignore all coefficients smaller than $\beta_{\min} = \sigma_{\epsilon} \cdot \sqrt{2 \cdot \log(Q)}$ . This is the correct threshold for a Gaussian model in the following sense. Suppose that there were no shocks. That is, we had $\mathcal{K} = \emptyset$ and $v_n = \beta_{0,n}$ for each of the $4$ assets. Then, we would like our estimator to tell us that there are no shocks with overwhelming probability:

(17) $\begin{align*} \mathrm{Pr}\left[ \max_{q \in \mathcal{Q}} | \langle \Delta \tilde{p}_{n,1} \rangle_q| > \beta_{\min} \right] &\leq \alpha \end{align*}$

where $\alpha$ is an arbitrarily small number that is chosen in advance, and $\langle \cdot \rangle_q$ denotes the average over the set of assets with exposure to feature $q$ . This particular choice of $\beta_{\min}$ comes from the fact that:

(18) $\begin{align*} \lim_{Q \to \infty} \frac{\max_{q \in \mathcal{Q}} \sfrac{| \langle \Delta \tilde{p}_{n,1} \rangle_q|}{\sigma_{\epsilon}}}{\sqrt{2 \cdot \log(Q)}} = 1 \end{align*}$

almost surely for a Gaussian model.
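The threshold and the limit in Equation (18) are easy to explore numerically. The sketch below computes $\beta_{\min}$ and then simulates the largest of $Q$ pure-noise statistics. As a simplifying assumption of mine, the $Q$ feature averages are treated as independent standard normals, which ignores the correlation that arises when features share assets. The function names and the seed are arbitrary.

```python
import math
import random


def beta_min(sigma_eps: float, Q: int) -> float:
    """Threshold sigma_eps * sqrt(2 * log(Q)) from the text."""
    return sigma_eps * math.sqrt(2.0 * math.log(Q))


def max_null_ratio(Q: int, seed: int = 0) -> float:
    """max_q |z_q| / sqrt(2 log Q) for Q iid N(0, 1) draws, i.e. no true shocks."""
    rng = random.Random(seed)
    biggest = max(abs(rng.gauss(0.0, 1.0)) for _ in range(Q))
    return biggest / math.sqrt(2.0 * math.log(Q))


threshold = beta_min(sigma_eps=1.0, Q=1000)
ratio = max_null_ratio(Q=100_000)
```

The convergence in Equation (18) is slow, so at any finite $Q$ the simulated `ratio` can sit noticeably away from $1$ . The point of the exercise is only that the largest spurious coefficient grows like $\sqrt{2 \cdot \log(Q)}$ .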

8. Equilibrium with Arbitrageurs

Let’s now wrap up by looking at the effect of these market-wide arbitrageurs on equilibrium asset prices. Prices in period $t=1$ will be the same as before since arbitrageurs have no information in the first period. As a result, they do not trade in period $t=1$ . To solve for time $t=2$ prices as a function of arbitrageur demand, simply observe that market clearing implies:

(19) $\begin{align*} - \, \theta_n^{(a)} &= \sum_{k=0}^1 \frac{1}{\gamma} \cdot \frac{\mathrm{E}_{n,2}^{(k)}[v_n] - p_{n,2}}{\mathrm{Var}_{n,2}^{(k)}[v_n]} \end{align*}$

Some simplification then yields:

(20) $\begin{align*} p_{n,2} &= \frac{1}{2} \cdot \overbrace{\left( \frac{\sigma_{\beta}^2}{2 \cdot \sigma_{\beta}^2 + \sigma_{\epsilon}^2} \right)}^{=B} \cdot \sum_{k=0}^1 \left\{ \epsilon_{n,1}^{(k)} + \epsilon_{n,2}^{(k)} \right\} + \overbrace{\gamma \cdot \sigma_{\beta}^2 \cdot \left( \frac{\sigma_{\beta}^2 + \sigma_{\epsilon}^2}{2 \cdot \sigma_{\beta}^2 + \sigma_{\epsilon}^2} \right)}^{=C} \cdot \theta_n^{(a)} \end{align*}$

Thus, we can see that the price of each asset will be a weighted average of the signals that the naifs receive and the arbitrageurs’ demand. An asset’s price will be higher if the naifs get more positive asset-specific signals or if arbitrageurs demand more as a result of a more positive feature-specific signal.

Suppose that, after observing period $t=1$ returns, arbitrageurs believe that features $\widehat{\mathcal{K}}$ have realized a shock. If they are using the Bayesian information criterion, this means that for each $k \in \widehat{\mathcal{K}}$ the estimated $\widehat{\beta}_k$ was larger in absolute value than $\beta_{\min} = \sigma_{\epsilon} \cdot \sqrt{2 \cdot \log(Q)}$ . It’s possible to write the arbitrageurs’ beliefs about the value of each asset as a linear combination of an asset-specific component, $A_n$ , and the estimated feature-specific shock size, $\widehat{\beta}_{k_n}$ :

(21) $\begin{align*} \mathrm{E}[ v_n | \widehat{\mathcal{K}}, \Delta \tilde{\mathbf{p}}_1] &= A_n + \widehat{\beta}_{k_n} \end{align*}$

The asset-specific component, $A_n$ , comes from the fact that, if arbitrageurs believe that an asset’s value is due in part to a feature-specific shock of size $\widehat{\beta}_{k_n}$ , then they can use these beliefs to update their priors about the size of the asset’s idiosyncratic shock. Plugging this linear formula into arbitrageurs’ optimal portfolio holdings yields:

(22) $\begin{align*} \theta_n^{(a)} &= \frac{A_n}{2 \cdot C} + \left(\frac{1- B}{2 \cdot C}\right) \cdot \widehat{\beta}_{k_n} \end{align*}$

where the coefficient on $\widehat{\beta}_{k_n}$ can be simplified as follows:

(23) $\begin{align*} \frac{1- B}{2 \cdot C} = \frac{1 - \left( \frac{\sigma_{\beta}^2}{2 \cdot \sigma_{\beta}^2 + \sigma_{\epsilon}^2} \right)}{2 \cdot \gamma \cdot \sigma_{\beta}^2 \cdot \left( \frac{\sigma_{\beta}^2 + \sigma_{\epsilon}^2}{2 \cdot \sigma_{\beta}^2 + \sigma_{\epsilon}^2} \right)} = \frac{\frac{\sigma_{\beta}^2 + \sigma_{\epsilon}^2}{2 \cdot \sigma_{\beta}^2 + \sigma_{\epsilon}^2}}{2 \cdot \gamma \cdot \sigma_{\beta}^2 \cdot \left( \frac{\sigma_{\beta}^2 + \sigma_{\epsilon}^2}{2 \cdot \sigma_{\beta}^2 + \sigma_{\epsilon}^2} \right)} = \frac{1}{2} \cdot \frac{1}{\gamma \cdot \sigma_{\beta}^2} \end{align*}$

This result implies that arbitrageurs decrease their demand for an asset with exposure to, say, a negative political-risk shock by $0.50 \times (\gamma \cdot \sigma_{\beta}^2)^{-1}$ shares for every $\mathdollar 1$ increase in the size of the shock.
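The simplification in Equation (23) can be checked numerically. The sketch below codes up $B$ and $C$ exactly as they are labeled in Equation (20) and compares $(1 - B)/(2 \cdot C)$ with the closed form. The function names and parameter values are arbitrary choices for illustration.

```python
def B_coef(var_beta: float, var_eps: float) -> float:
    """B from Equation (20): weight on the naifs' summed signals."""
    return var_beta / (2 * var_beta + var_eps)


def C_coef(gamma: float, var_beta: float, var_eps: float) -> float:
    """C from Equation (20): price impact of arbitrageur demand."""
    return gamma * var_beta * (var_beta + var_eps) / (2 * var_beta + var_eps)


def demand_slope(gamma: float, var_beta: float, var_eps: float) -> float:
    """(1 - B) / (2 C): coefficient on beta-hat in Equation (22)."""
    return (1 - B_coef(var_beta, var_eps)) / (2 * C_coef(gamma, var_beta, var_eps))


gamma, var_beta, var_eps = 2.0, 0.5, 0.3  # illustrative parameter values
slope = demand_slope(gamma, var_beta, var_eps)
closed_form = 0.5 / (gamma * var_beta)  # right-hand side of Equation (23)
```

Notice that $\sigma_{\epsilon}^2$ drops out: changing `var_eps` moves $B$ and $C$ but leaves `slope` unchanged.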

The key implication of this model is that including a shocked feature in the arbitrageurs’ model of the world will yield a price shock of size:

(24) $\begin{align*} \mathrm{E}[p_{n,2}|\widehat{\mathcal{K}} = \mathcal{K}_{h_\star}] - \mathrm{E}[p_{n,2}|\widehat{\mathcal{K}} = \emptyset] &= \frac{1}{2} \cdot \left( \frac{\sigma_{\beta}^2 + \sigma_{\epsilon}^2}{2 \cdot \sigma_{\beta}^2 + \sigma_{\epsilon}^2} \right) \cdot \widehat{\beta}_{k_n} \end{align*}$

For instance, if arbitrageurs were using Bayesian updating, then there would be a discontinuous jump in the effect of a political-risk shock on social media companies like Facebook as the size of the shock crossed the $\beta_{\min}$ threshold.
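The jump can be sketched by combining the threshold rule with Equation (24). I assume here that a feature whose estimated shock falls at or below $\beta_{\min}$ in absolute value is dropped from the arbitrageurs' model and so has no price effect through their demand. The treatment of the exact boundary case, the function name, and the parameter values are all my own choices.

```python
import math


def price_shock(beta_hat: float, var_beta: float, var_eps: float, Q: int) -> float:
    """Equation (24) when |beta_hat| clears beta_min, and 0 otherwise."""
    threshold = math.sqrt(var_eps) * math.sqrt(2.0 * math.log(Q))
    if abs(beta_hat) <= threshold:
        return 0.0  # feature is left out of the arbitrageurs' model
    return 0.5 * (var_beta + var_eps) / (2 * var_beta + var_eps) * beta_hat


var_beta, var_eps, Q = 1.0, 1.0, 1000  # threshold is roughly 3.72 here
just_below = price_shock(3.7, var_beta, var_eps, Q)
just_above = price_shock(3.8, var_beta, var_eps, Q)
```

A small increase in the estimated shock from just below to just above the threshold moves the price effect from zero to a value well above zero, which is the discontinuity described above.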