Research Notebook

Behavioral finance and corporate finance are both organized in the exact same way

February 4, 2023 by Alex

Behavioral finance and corporate finance are both organized in the exact same way. Neither is based on a grand unified theory. Instead, both fields proceed by looking for deviations from a benchmark model. The behavioral-finance literature is a list of ways to violate market efficiency. The corporate-finance literature is a collection of ways to relax the assumptions needed for capital-structure irrelevance. Same setup.

One reason for writing this post is to spread the word about this symmetry. I don’t think it’s widely appreciated. Occasionally I’ll mention it to somebody. When I do, I usually get either a blank stare or a look of sudden recognition. I’d like to live in a world where the comment gets a bland nod in agreement.

I also think it’s noteworthy how differently each field is viewed within the profession given that both fields are calling plays from the same playbook. True, behavioral finance has not proposed an alternative to market efficiency. But, then again, ain’t nobody asking corporate researchers to come up with an alternative to ModiglianiMiller58. Highlighting this disconnect is the other reason for writing this post.

Behavioral finance

Behavioral economists explain market outcomes by pointing to deviations from market efficiency—i.e., the idea that “security prices fully reflect all available information”. In John Cochrane’s words: “Informational efficiency is a natural consequence of competition, relatively free entry, and low costs of information in financial markets. If there is a signal, not now incorporated in market prices, that future values will be high, competitive traders will buy. In doing so, they bid the price up until it fully reflects the available information.”

If there’s a signal that an asset’s future payout will be high, then the present discounted value of that asset’s payout will go up—i.e., \Exp[ \, \textit{Discount Rate} \times \textit{Future Payout} \, ] will increase. If the asset’s current price doesn’t increase accordingly, any trader who sees the signal could profit by buying a share, \Delta = +1:

    \begin{equation*} \Big( \, \underbrace{\Exp[ \, \textit{Discount Rate} \times \textit{Future Payout} \, ]}_{\substack{\text{Present\phantom{j}discounted\phantom{j}value\phantom{j}of} \\ \text{asset w/ same future payout}}} - \textit{Current Price} \, \Big) \times \Delta \end{equation*}

In the process, the trader will push up the current price until there’s no longer any benefit to continuing the trade. And we’d see the opposite pattern with \Delta = -1 in a world where traders saw a negative signal.

Given this benchmark, behavioral economists look for situations where there appears to be a persistent uncorrected pricing error. e.g., under the benchmark of market efficiency, it should not be possible to find situations where \Exp[ \textit{Discount Rate} \times \textit{Future Payout} \, ] > \textit{Current Price} without traders taking action, \Delta = 0. However, JegadeeshTitman93 document that the 30% of stocks with the highest past returns (past winners) tend to have higher future returns than the 30% of stocks with the lowest past returns (past losers). In a world where markets were efficient, traders would immediately bid up the prices of past winners until this pricing error disappeared. So it seems like real-world traders must be making some sort of behavioral error.
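
To make the JegadeeshTitman93 comparison concrete, here is a minimal sketch in Python of the winners-minus-losers calculation. The data frame below is simulated stand-in data, not the actual sample used in the paper, and the 30% cutoffs simply follow the description above; with purely random returns the spread should come out near zero, which is exactly what the market-efficiency benchmark predicts.

    import numpy as np
    import pandas as pd

    # Simulated stand-in cross-section: one row per stock, with the past
    # return used for sorting and the subsequent future return.
    rng = np.random.default_rng(0)
    panel = pd.DataFrame({
        "past_ret": rng.normal(0.00, 0.10, size=3000),
        "fut_ret":  rng.normal(0.01, 0.15, size=3000),
    })

    # Sort on past returns and form the top-30% (winners) and bottom-30%
    # (losers) groups.
    lo, hi = panel["past_ret"].quantile([0.30, 0.70])
    winners = panel.loc[panel["past_ret"] >= hi, "fut_ret"]
    losers = panel.loc[panel["past_ret"] <= lo, "fut_ret"]

    # Under the efficient-markets benchmark this spread should be
    # indistinguishable from zero; in the momentum data it is not.
    print(f"winners-minus-losers future return: {winners.mean() - losers.mean():.4f}")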

Corporate finance

ModiglianiMiller58 taught us that, if all the following assumptions hold, then a firm’s choice of leverage has no effect on its market valuation. (A1) Investors and firms can trade the same set of correctly priced securities. (A2) Investors and firms are taxed in the same way. (A3) Investors and firms face no transaction costs or portfolio restrictions. (A4) There are no bankruptcy costs or costs to issuing new securities. (A5) A firm’s choice of leverage doesn’t directly affect its future cash flows. And finally, (A6) firm leverage doesn’t signal additional information to investors about these cash flows. Firms clearly spend a lot of time worrying about their capital structure. And corporate researchers explain their decisions by pointing out ways that the above assumptions are violated in the real world. That’s the organizing principle behind this literature.

The streamlined proof given in ModiglianiMiller69 is based on a homemade-leverage argument. Suppose there are two firms with different capital structures but identical cash flows. The first firm is unlevered while the second firm has issued debt. In a world where all the above assumptions hold, an investor could effectively lever up the unlevered firm’s cash flows himself by constructing a portfolio that’s long the unlevered firm and short the debt issued by the levered firm:

    \begin{equation*} - \Big( \, \underbrace{[\textit{Unlevered Firm's Value} - \textit{Value of Debt Issuance}]}_{\substack{\text{Cost of building a portfolio that buys unlevered firm and} \\ \text{shorts the debt issued by otherwise identical levered firm}}} - \textit{Equity Value of Levered Firm} \, \Big) \times \Delta \end{equation*}

If there’s any gap between the cost of this portfolio and the equity value of the levered firm, an investor could earn arbitrage profits given the assumptions above. Since both have identical future cash flows, the investor should continue buying the one with the lower current price until there’s no more price difference.
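
Here is a minimal numerical sketch of that homemade-leverage argument, using invented dollar values that are not from any of the papers above. If the levered firm’s equity trades away from the cost of replicating it, the trade below follows the payoff expression given earlier and locks in a riskless profit.

    # Invented values: two firms with identical future cash flows.
    unlevered_firm_value = 1000.0   # market value of the all-equity firm
    levered_firm_debt = 400.0       # market value of the levered firm's debt
    levered_firm_equity = 570.0     # market price of the levered firm's equity

    # Cost of homemade leverage: buy the unlevered firm and short the debt
    # issued by the otherwise identical levered firm.
    replication_cost = unlevered_firm_value - levered_firm_debt   # 600.0

    # Under assumptions A1-A6, any gap between this cost and the levered
    # firm's equity value is an arbitrage: short the expensive side and buy
    # the cheap side (delta = -1 here because the replication is overpriced).
    delta = -1 if replication_cost > levered_firm_equity else +1
    profit = -(replication_cost - levered_firm_equity) * delta
    print(f"riskless profit from the price gap: {profit:.2f}")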

“The entire development of corporate finance since 1958—the publication date of the first MM article—can be seen and described essentially as the sequential (or simultaneous) relaxation of the assumptions listed before.” e.g., if corporate debt is taxed differently than the short positions of individual investors, then the homemade-leverage argument breaks down. Once assumption A2 has been violated, capital structure is no longer irrelevant. In a world where corporations get preferred tax treatment, firms should optimally choose to issue debt since it’d be more expensive for individual investors to homebrew this leverage themselves.

Nobody would say…

Given how I’ve described behavioral finance and corporate finance above, it’s obvious that the two fields are organized in the exact same way. Researchers in each field try to make sense of empirical regularities by pointing to specific deviations from their own respective benchmark. In fact, Mark Rubinstein argues that ModiglianiMiller58’s “real and enduring contribution was to point others in the direction of arbitrage reasoning.” And this sort of reasoning lies at the heart of the Efficient Market Hypothesis in Fama70.

That being said, the behavioral-finance and corporate-finance literatures clearly emphasize different things about their respective benchmark models. ModiglianiMiller58 weren’t trying to argue that the assumptions needed for capital-structure irrelevancy were realistic. As Merton Miller later wrote: “We first had to convince people (including ourselves!) that there could be any conditions, even in a ‘frictionless’ world, where a firm would be indifferent between issuing different kinds of securities.”

By contrast, market efficiency is treated as a good first approximation to the real world. While Franco Modigliani and Merton Miller didn’t think that capital structure was actually irrelevant in the real world, Eugene Fama actively defended the Efficient Market Hypothesis. e.g., Fama98 writes: “There is a developing literature… arguing that stock prices adjust slowly to information… It is time, however, to ask whether this literature, viewed as a whole, suggests that efficiency should be discarded. My answer is a solid no.”

This is fine. But the parallel structures of behavioral finance and corporate finance clearly put the lie to a common criticism of the behavioral literature. It’s often argued that, to be an honest-to-goodness scientific discipline, the behavioral-finance literature needs to offer a single coherent alternative model to challenge the Efficient Market Hypothesis. e.g., later in Fama98, it’s claimed that behavioral economists “rarely test a specific alternative to market efficiency… This is unacceptable… Following the standard scientific rule, however, market efficiency can only be replaced by a better specific model of price formation.”

That is nonsense. It’s a criticism that applies equally well to the corporate-finance literature, which has not produced a better specific model than the one in ModiglianiMiller58. However, no one would claim that Jean Tirole’s textbook is unscientific. Overturning ModiglianiMiller58 isn’t the point of corporate-finance research. Overturning the Efficient Market Hypothesis isn’t the point of behavioral-finance research. In both cases, the point is to provide good explanations for how the real world works. If progress is fastest when researchers organize their thinking relative to a benchmark model, then so be it.

Filed Under: Uncategorized

Asset-pricing models as theories of good synthetic controls

January 18, 2023 by Alex

In 1988, California passed a major piece of tobacco-control legislation called Proposition 99. This bill increased the tax on cigarettes by $0.25 a pack and triggered a wave of bans on smoking indoors throughout the state. After the bill was passed in California, it became more expensive to smoke in California and there were fewer places to do so.

It makes sense that Prop 99 could have caused lots of people in California to stop smoking. And, consistent with this hypothesis, AbadieDiamondHainmueller2010 found that per capita cigarette consumption in California fell by around 40 packs per year from 1985 to 1995. In 1985, the typical Californian bought 100 packs per year. By 1995, the average Californian bought fewer than 60 packs per year.

But was the effect causal? Did the passage of Prop 99 really cause per capita cigarette consumption in California to drop by 40 packs per year? To answer this question, you need to know how many packs of cigarettes each Californian would have bought in 1995 had Prop 99 not been passed.

It’s not obvious how you should compute this counterfactual. You can’t just assume that, in the absence of Prop 99, cigarette consumption in California would have been the same in 1995 as it was in 1985. The popularity of smoking has been falling over time throughout the country. You also can’t naively compare cigarette consumption in California in 1995 to that of a neighboring state, like Nevada, in the same year. People in Nevada are more likely to partake in all sorts of vices (smoking, drinking, gambling, etc). Comparing per capita cigarette sales in California to that of Nevada in 1995 will tend to overstate the effect of Prop 99.

But what if, rather than using just Nevada in 1995 as your stand-in for California sans Prop 99, you instead used a composite Frankenstate that has the same observable characteristics as California? e.g., people in Nevada may be much more likely to smoke, drink, and gamble relative to people in California, but people in Utah are much less likely to do all of those things than Californians. So a weighted average of per capita cigarette consumption in Nevada and Utah in 1995 might represent a good synthetic control for California.

This post first outlines the idea behind using a synthetic control. Then, I make a connection between literatures: when an asset-pricing researcher computes a stock’s abnormal return by subtracting off the return of a replicating portfolio with the same risk exposures, he’s using a synthetic control. In fact, this is exactly what the OG synthetic control paper does! AbadieGardeazabal2003 computes abnormal returns for Basque companies relative to the CAPM and the FamaFrench1993 three-factor model. I wrap up by pointing out some interesting takeaways from this connection for both asset-pricing researchers and metrics folks.

Problem setup

Here’s the canonical synthetic-control problem. Imagine that you’ve got data on how much people spend on smoking and drinking in three different states, n \in \{ \texttt{CA}, \, \texttt{NV}, \, \texttt{UT} \}, in two particular years, t \in \{ \texttt{1985}, \, \texttt{1995} \}. For simplicity, I’m going to talk about Prop 99 as a policy that banned indoor smoking outright:

    \begin{equation*} \mathit{IndoorSmokingPolicy}_{n,t} = \left( \begin{array}{r|cc}  & \texttt{1985} & \texttt{1995} \\ \hline \texttt{CA} & \mathtt{\emptyset} & \texttt{Ban} \\ \texttt{NV} & \mathtt{\emptyset} & \mathtt{\emptyset} \\ \texttt{UT} & \mathtt{\emptyset} & \mathtt{\emptyset} \end{array} \right) \end{equation*}

Thus, you have one state-year observation with a smoking ban in place and five without one.

Let \mathit{CigSales}_{n,t}(p) denote the number of packs bought by the average person in state n during year t given the prevailing indoor smoking policy p \in \{ \mathtt{\emptyset}, \, \texttt{Ban} \}. You want to know how Prop 99 affected cigarette sales:

    \begin{equation*} \underbrace{\mathit{CigSales}_{\texttt{CA},\texttt{95}}(\texttt{Ban})}_{\substack{\text{\phantom{j}observed\phantom{j}} \\ \text{outcome}}} - \underbrace{\mathit{CigSales}_{\texttt{CA},\texttt{95}}(\mathtt{\emptyset})}_{\substack{\text{hypothetical} \\ \text{counterfactual}}} = \text{causal effect of Prop 99 on cigarette sales} \end{equation*}

The first term, \mathit{CigSales}_{\texttt{CA},\texttt{95}}(\texttt{Ban}), is the per capita cigarette sales observed in California during 1995 after Prop 99 had been implemented. The second term, \mathit{CigSales}_{\texttt{CA},\texttt{95}}(\mathtt{\emptyset}), reflects packs per person during 1995 in an alternative world where everything is the same except that Prop 99 was never passed.

The key empirical challenge rests on the fact that, while I can observe cigarette sales for the year 1995 in California where Prop 99 has already been passed, \mathit{CigSales}_{\texttt{CA},\texttt{95}}(\texttt{Ban}), I cannot directly observe cigarette sales in a version of 1995 California where Prop 99 didn’t go into law, \mathit{CigSales}_{\texttt{CA},\texttt{95}}(\mathtt{\emptyset}). This counterfactual world is a hypothetical scenario. It never happened. The challenge is to come up with some stand-in value for \mathit{CigSales}_{\texttt{CA},\texttt{95}}(\mathtt{\emptyset}) based on the data that I can observe in other states.

So, you want to think about a data-generating process where there’s a potential effect coming from a one-time policy change AND a bunch of things that affect statewide cigarette sales during normal times:

    \begin{equation*} \mathit{CigSales}_{n,t}(p)  =  \!\!\! \underbrace{\alpha \cdot 1_{\{p= \texttt{Ban}\}}}_{\substack{\text{causal\phantom{j}effect of one-} \\ \text{time policy change}}} \!\!\! + \underbrace{\mu_t + \lambda \cdot X_n + \varepsilon_{n,t}}_{\substack{\text{determinants of cigarette} \\ \text{sales during normal times}}} \qquad \qquad \varepsilon_{n,t} \overset{\scriptscriptstyle \text{IID}}{\sim} \mathrm{Normal}(0, \, \sigma^2) \end{equation*}

You want to know whether the introduction of Prop 99, which is captured by the 1_{\{p= \texttt{Ban}\}} term, had an effect on cigarette sales in California. i.e., you want to know whether \alpha < 0. If Prop 99 had no effect, then \alpha = 0.

The remaining determinants of statewide cigarette sales during normal times are important because they dictate which observables might be a good stand-in for the counterfactual version of California in 1995 where Prop 99 was never passed as illustrated in the interactive figure below. The left panel depicts the number of cigarette packs purchased by an average resident in each of your three states during 1985 (y axis) as a function of liquor consumption in that state (x axis). The right panel shows the same thing but for 1995. The solid circles represent observed values of per capita annual cigarette sales. The dotted circle represents the counterfactual value for California in a world where Prop 99 was not passed (p = \mathtt{\emptyset}).

X_n represents liquor consumption in state n. People in Nevada are more likely to spend money on all sorts of vices (smoking included) than people in California. X_n is a proxy for this statewide predisposition. So, if you ignore this background variable and naively compare cigarette purchases in California to that in Nevada during 1995, then it’ll look like Prop 99 had an outsized effect. People in Nevada purchase an extra 15 packs per year relative to California in the figure. Thus, using Nevada during 1995 as your counterfactual observation would cause you to overstate the true causal effect of Prop 99 by 15 packs per person annually.

\mu_t is the average per capita cigarette sales during year t across all states. Cigarette sales have been falling over time, so \mu_{\texttt{95}} < \mu_{\texttt{85}}. However, in its initial configuration, cigarette sales are constant over time in the figure above, \Delta \mu = (\mu_{\texttt{95}} - \mu_{\texttt{85}}) = 0. If that were the case, then you could use per capita cigarette sales in California during 1985 as your counterfactual observation. But, by moving the \Delta \mu slider, you can see how a downward time-series trend in cigarette sales would cause you to again overstate the effect of Prop 99:

    \begin{equation*} \Exp\big[ \, \underbrace{\mathit{CigSales}_{\texttt{CA}, \texttt{95}}(\texttt{Ban})}_{\text{observed}} - \underbrace{\mathit{CigSales}_{\texttt{CA}, \texttt{85}}(\mathtt{\emptyset})}_{\text{observed}} \, \big] =  \Delta \mu + \alpha \end{equation*}

Synthetic control

Because smoking has been growing less and less popular over time, you want to create the counterfactual for cigarette sales in California during 1995 using contemporaneous data from other states. But you also recognize that no other state is a perfect doppelganger for California. People in Nevada are more likely to smoke, drink, and gamble relative to people in California. Utah residents are less likely to do all of those things than Californians. So why not do the obvious thing and average these two values?

For concreteness, suppose that the drinking rate in California is exactly halfway between the rates in Utah and Nevada:

    \begin{equation*} X_{\texttt{CA}} = (1/2) \cdot X_{\texttt{UT}} + (1/2) \cdot X_{\texttt{NV}} \end{equation*}

In practice, the weights wouldn’t be exactly (1/2). But you could estimate these values using your 1985 data. And you could use these weights to construct a Voltron-esque counterfactual for cigarette sales in California during 1995 out of the contemporaneous values for Utah and Nevada:

    \begin{equation*} \begin{split} \widehat{\mathit{CigSales}}_{\texttt{CA}, \texttt{95}}(\mathtt{\emptyset}) &=  (1/2) \cdot \overbrace{\mathit{CigSales}_{\texttt{UT},\texttt{95}}(\mathtt{\emptyset})}^{\text{observed}}  +  (1/2) \cdot \overbrace{\mathit{CigSales}_{\texttt{NV},\texttt{95}}(\mathtt{\emptyset})}^{\text{observed}} \\ &= \mu_{\texttt{95}} + \lambda \cdot X_{\texttt{CA}} + (1/2) \cdot (\varepsilon_{\texttt{UT},\texttt{95}} + \varepsilon_{\texttt{NV},\texttt{95}}) \end{split} \end{equation*}

Since it’s made out of observations from 1995, this synthetic control is not confounded by the nationwide drop in cigarette sales from 1985 through 1995. By matching California’s alcohol sales, X_{\texttt{CA}}, this synthetic control also accounts for persistent differences in cigarette sales across states due to differing propensities to partake in all vices. And, if we’ve done everything correctly, then we can compute

    \begin{equation*} \Exp\big[ \,  \underbrace{\mathit{CigSales}_{\texttt{CA}, \texttt{95}}(\texttt{Ban})}_{\text{observed}}  -  \underbrace{\widehat{\mathit{CigSales}}_{\texttt{CA}, \texttt{95}}(\mathtt{\emptyset})}_{\text{calculated}}  \, \big] = \alpha \end{equation*}

where \alpha denotes the true causal effect of Prop 99 on annual per capita cigarette sales in California.
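
Here is a minimal sketch of that construction in Python. The numbers are invented for illustration (they are not the AbadieDiamondHainmueller2010 estimates); the donor weights are chosen to reproduce California’s value of X, exactly as in the equations above.

    # Invented 1995 values: per capita cigarette sales and the background
    # covariate X (think: liquor consumption) for each state.
    cig_1995 = {"CA": 58.0, "NV": 95.0, "UT": 42.0}
    x_state = {"CA": 2.5, "NV": 4.0, "UT": 1.0}

    # Choose donor weights (w_UT, w_NV) that sum to one and match X_CA.
    # With two donors and one covariate this is a single linear equation.
    w_nv = (x_state["CA"] - x_state["UT"]) / (x_state["NV"] - x_state["UT"])
    w_ut = 1.0 - w_nv

    # Synthetic California in 1995 = weighted average of the donor states.
    synthetic_ca = w_ut * cig_1995["UT"] + w_nv * cig_1995["NV"]

    # Estimated causal effect of Prop 99 on per capita cigarette sales.
    alpha_hat = cig_1995["CA"] - synthetic_ca
    print(f"weights: UT={w_ut:.2f}, NV={w_nv:.2f}; alpha_hat = {alpha_hat:.1f}")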

Risk adjustment

The synthetic control for cigarette sales in California during 1995 was a weighted average of cigarette sales in Nevada and Utah where the weights were chosen to replicate California’s value of X_{\mathtt{CA}}. While this approach has some advantages, it’s also somewhat unsatisfying in that there’s no real physical analog to the counterfactual it produces. There’s no process by which you can take a weighted average of Utah and Nevada. This is purely a statistical construct.

The key insight in this post is that, when an asset-pricing researcher computes a risk-adjusted return for some asset relative to particular model, he’s using this same synthetic control methodology. And in an asset-pricing context, there’s a clear physical analog to the resulting counterfactual. The synthetic control represents a portfolio of the underlying assets with appropriately chosen portfolio weights. e.g., in the case of the CAPM, a synthetic control observation for a particular asset is a replicating portfolio with weights chosen so that it has the exact same market beta.

For example, suppose you think expected returns are governed by the CAPM. Then \mu_t = \mathit{RiskfreeRate}_t is the prevailing risk-free rate at time t, X_n = \Cov[\mathit{Return}_{n,t}, \, \mathit{Market}_t] \, / \, \Var[\mathit{Market}_t] is the market beta on the nth asset, and \lambda is the price of an increase in exposure to this market risk factor:

    \begin{equation*} \mathit{Return}_{n,t}(p) =  \underbrace{\alpha \cdot 1_{\{ p = \texttt{hi} \}}}_{\substack{\text{effect\phantom{j}of\phantom{j}anoma-} \\ \text{lous predictor}}} + \underbrace{\mu_t + \lambda \cdot X_n + \varepsilon_{n,t}}_{\substack{\text{what\phantom{j}determines\phantom{j}returns} \\ \text{in asset-pricing model}}} \qquad \qquad \varepsilon_{n,t} \overset{\scriptscriptstyle \text{IID}}{\sim} \mathrm{Normal}(0, \, \sigma^2) \end{equation*}

The asset’s return should be higher if the risk-free rate is higher (\mu_t is high), if it has lots of exposure to market risk (X_n is high), and/or if the price of this exposure to market risk is high (\lambda is high).

The core claim in any asset-pricing model (CAPM included) is that, after controlling for the X variables specified in the model, it shouldn’t be possible to find another predictor, p \in \{\texttt{lo}, \, \texttt{hi} \}, that forecasts returns:

    \begin{equation*} \text{claim: $\alpha = 0$ for every predictor $p$ that you can think of} \end{equation*}

And how would an asset-pricing researcher test to see if \alpha = 0? He’d compare the nth asset’s returns to the returns of a replicating portfolio whose weights were chosen so that it had the exact same value of X_n. e.g., suppose we’re in a CAPM world, and the nth asset has a market beta of X_n = 0.50. If there are two other assets with betas equal to 0.20 and 0.80 respectively, you should compare the nth asset’s return to the return of an equally weighted portfolio of those two assets, X_n = (1/2) \cdot 0.20 + (1/2) \cdot 0.80 = 0.50. Exact same situation! And the original synthetic-control paper (AbadieGardeazabal2003) pointed out as much!
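
Here is a minimal sketch of that comparison, assuming you already have estimated betas and realized returns in hand (the numbers below are made up). The replicating-portfolio weights solve the same matching problem as the synthetic-control weights in the previous section.

    # Made-up inputs: realized returns and estimated CAPM betas.
    ret_n, beta_n = 0.085, 0.50   # the asset being tested
    ret_a, beta_a = 0.040, 0.20   # control asset A
    ret_b, beta_b = 0.090, 0.80   # control asset B

    # Weights (w_a, w_b) with w_a + w_b = 1 that match the target's beta.
    w_b = (beta_n - beta_a) / (beta_b - beta_a)   # = 0.50 in this example
    w_a = 1.0 - w_b

    # The replicating portfolio is the synthetic control; its return is the
    # counterfactual "no-anomaly" return for the target asset.
    replicating_ret = w_a * ret_a + w_b * ret_b
    alpha_hat = ret_n - replicating_ret           # abnormal return
    print(f"weights: ({w_a:.2f}, {w_b:.2f}); abnormal return = {alpha_hat:.4f}")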

Some takeaways

This connection between the synthetic-control literature and the asset-pricing literature delivers some interesting takeaways on both sides. First, let’s think about it from the perspective of an asset-pricing researcher. There have been several recent econometric advances in the study of synthetic controls. e.g., Chen2023 frames the synthetic-control procedure as an online learning problem. The paper then uses this parallel to give policymakers some guidance on when and where synthetic control is most likely to be successful. By framing risk adjustment as a specific instance of a more general approach to producing synthetic controls, asset-pricing researchers might be able to port over some of these recent advances.

I think the message is a bit less positive when traveling from asset pricing back to the econometrics of synthetic controls. It’s been 50 years since Merton1973 introduced the ICAPM, and asset-pricing researchers have yet to agree on which Xs to use when doing our risk adjustments. This fact should give econometricians pause when considering the limits of the synthetic-control approach. In a recent review article, Abadie2021 argues that the synthetic-control methodology offers a “safeguard against specification searches” (p. 406). Judging by the current state of the asset-pricing literature, I’m not sure this is true.

Abadie2021 also argues that, while a researcher using the synthetic-control procedure might make an error in choosing control variables, at least the procedure is transparent about how a counterfactual is being constructed. The procedure itself is certainly transparent. But I’m not sure how many people really think through the logic now that synthetic control has gone mainstream. How many people think of the buildup of pus in a pimple when they use the phrase “coming to a head”? The conceptual metaphor is perfectly transparent. But most people never look. In a similar vein, asset-pricing researchers often use the FamaFrench1993 three-factor model to “control for risk” in spite of the fact that real-world investors aren’t trying to use their stock portfolios to buy insurance against these risk factors. An empirical procedure which initially encourages introspection can eventually turn into a stale thoughtless idiom. The asset-pricing literature suggests that econometricians should be more worried about this trend.

Filed Under: Uncategorized

Interpreting the LASSO as a *really* simple neural network

January 10, 2023 by Alex

Suppose you want to forecast the return of a particular stock using many different predictors (think: past returns, market cap, asset growth, etc…). One way to do this would be to use the LASSO. Alternatively, you could use a neural network to make your forecast. On the surface, these two approaches look very different. However, it turns out that it’s possible to recast the LASSO as a *really* simple neural network.

This post outlines how.

This connection suggests we can use penalized regressions, such as the LASSO, as microscopes for studying more complicated machine-learning models, like neural networks, which often exhibit surprising new behavior. For example, if you include more predictors than observations in an OLS regression, then you’ll be able to perfectly fit your training data but your out-of-sample performance will be terrible. By contrast, highly over-parameterized neural networks often have the best out-of-sample fit.

Because these models are so complicated, it’s often hard to understand why a pattern like this might emerge. Penalized regression models like the LASSO occupy a middle ground between OLS and complicated machine-learning models. Thus, if the LASSO can be viewed as a really simple neural net, then it might be possible to use this intermediate setup as a laboratory for understanding more complicated procedures. That’s the idea behind HastieMontanariRossetTibshirani22. And KellyMalamudZhou22 build on their logic.

General setup

Imagine that you’ve got historical data on the returns of N \gg 1 different stocks, \{ \mathit{Ret}_n \}_{n=1}^N, and you want to make the best forecast possible for the future return of the (N+1)st stock, \widehat{\mathit{Ret}}_{N+1}. You have access to K \gg 1 different return predictors. Let X_{n,k} denote the value of the kth predictor for the nth stock. Assume that each predictor has been normalized to have mean zero and variance (1/K) in the cross-section. Without loss of generality, also assume that the cross-sectional average return is zero.

If there were only one predictor, K = 1, then it’d be possible to estimate the OLS regression below:

    \begin{equation*} \hat{\beta}^{\text{OLS}}  =  \arg \min_{\beta} \, \sum_{n=1}^N \left\{ \, \mathit{Ret}_n - \beta \cdot X_n \, \right\}^2 \end{equation*}

In this case, the solution is given by \hat{\beta}^{\text{OLS}} \propto \sum_{n=1}^N \, (\mathit{Ret}_n - 0) \times (X_n - 0). If the predictor tends to be positive, X_n > 0, for stocks that subsequently realize positive returns, \mathit{Ret}_n > 0, then the OLS slope coefficient associated with it will be positive. It will also be profitable to trade on this predictor.

You can also use an OLS regression to create a return forecast when you have more than one predictor

    \begin{equation*}  \widehat{\mathit{Ret}}_{N+1}^{\text{OLS}} = \sum_{k=1}^K \hat{\beta}_k^{\text{OLS}} \cdot X_{N+1,k} \end{equation*}

provided that you still have more observations than predictors, N > K. If you’ve got K=200 predictors and N=500 stocks in your training data, then you’re in business. However, if your training data only contains N=100 stocks, then you’re SOL. You’ll have to use something other than an OLS regression.
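
A quick numerical illustration of why, using simulated data with the K=200, N=100 numbers from the paragraph above:

    import numpy as np

    rng = np.random.default_rng(0)
    N, K = 100, 200                            # more predictors than stocks
    X = rng.normal(size=(N, K)) / np.sqrt(K)   # normalized as described above

    # The OLS normal equations require inverting the K-by-K matrix X'X, but
    # its rank can be at most N < K, so no unique solution exists.
    gram = X.T @ X
    print(f"X'X is {gram.shape[0]}x{gram.shape[1]} with rank {np.linalg.matrix_rank(gram)}")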

The LASSO

One popular approach is to fit a LASSO specification. This is essentially an OLS regression with an additional absolute-value penalty applied to each predictive coefficient:

    \begin{equation*} \min_{\beta_1,\ldots,\beta_K} \, \sum_{n=1}^N \left\{ \, \mathit{Ret}_n - \sum_{k=1}^K \beta_k \cdot X_{n,k} \, \right\}^2 + 2 \cdot \lambda \cdot \sum_{k=1}^K |\beta_k| \end{equation*}

The pre-factor of \lambda \geq 0 in front of the penalty term is a tuning parameter, which can be optimally chosen via cross-validation. Notice that, when \lambda = 0, there is no penalty at all and the LASSO is equivalent to OLS. But when \lambda > 0, the LASSO’s coefficients will differ from OLS estimates as shown in the interactive figure below.

To see what I mean, let’s return to the case where there’s only one predictor. Alternatively, you could think about a world with orthogonal predictors, \Cov(X_k, \, X_{k'}) = 0 for all k \neq k'. In either case, we have:

    \begin{equation*} \hat{\beta}_k^{\text{LASSO}} = \mathrm{Sign}(\hat{\beta}_k^{\text{OLS}}) \times \max\big\{ 0, \, |\hat{\beta}_k^{\text{OLS}}| - \lambda \big\} \end{equation*}

This expression tells us that the LASSO does two things. First, it shrinks large OLS coefficients toward zero, |\hat{\beta}_k^{\text{LASSO}}| < |\hat{\beta}_k^{\text{OLS}}|. Second, it forces all small OLS coefficients, |\hat{\beta}_k^{\text{OLS}}| < \lambda, to be exactly zero, \hat{\beta}_k^{\text{LASSO}} = 0.
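
In code, the orthogonal-predictor case boils down to the soft-thresholding function below. This is a minimal sketch of the formula above, not tied to any particular library.

    import numpy as np

    def soft_threshold(beta_ols, lam):
        """LASSO coefficient implied by an OLS coefficient when predictors are
        orthogonal: shrink by lam, and zero out anything smaller than lam."""
        return np.sign(beta_ols) * np.maximum(0.0, np.abs(beta_ols) - lam)

    # OLS coefficients of various sizes with a penalty of lam = 0.5: the two
    # small ones are set exactly to zero, the rest are shrunk toward zero.
    beta_ols = np.array([-2.0, -0.3, 0.1, 0.8, 1.5])
    print(soft_threshold(beta_ols, lam=0.5))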

Neural network

The LASSO is still able to make forecasts in situations where there are more predictors than observations because it kills off all the smallest predictors. Morally speaking, if only 5 of your K=200 predictors have any forecasting power, then you shouldn’t need N \geq 200 observations to figure this out. 20 data points should do just fine. An alternative approach to making a return forecast when K > N would be to use a neural network. On the surface, this seems like a very different strategy. Rather than betting on sparsity, large neural networks often perform best when highly over-parameterized.

There are lots of kinds of neural networks. In this post, I’m going to mainly focus on neural networks with only one hidden layer that has the same number of nodes as predictors. e.g., with K=200 predictors, there will be H = 200 hidden nodes. The diagram to the left shows what this would look like in a situation with K=3 predictors and H=3 hidden nodes so that we can see what’s going on.

The value of each hidden node is determined by an activation function that takes a linear combination of predictor values as its input:

    \begin{equation*} H_k = \mathrm{h}\!\left( \beta_{0 \to k} + \sum_{k'=1}^{K} \beta_{k' \to k} \cdot X_{n,k'}\right) \end{equation*}

e.g., you could set \mathrm{h}(z) = z, \mathrm{h}(z) = \max\{0, \, z \}, or something else entirely. \vec{\beta}_k = (\beta_{0 \to k}, \, \beta_{1 \to k}, \ldots, \, \beta_{K \to k}) contains the weights that go into the kth hidden node. It has (K+1) elements due to the intercept term.

The return forecast generated by this neural network, \widehat{\mathit{Ret}}_{N+1}^{\text{NNet}}, is then a weighted average of its K hidden nodes where the weights are chosen by solving the optimization problem below:

    \begin{equation*} \min_{\substack{\alpha_1, \ldots, \alpha_K \\ \vec{\beta}_1, \ldots, \vec{\beta}_K}} \, \sum_{n=1}^N \left\{ \, \mathit{Ret}_n - \sum_{k=1}^K \alpha_k \cdot \mathrm{h}\!\left( \beta_{0 \to k} + \sum_{k'=1}^{K} \beta_{k' \to k} \cdot X_{n,k'}\right) \, \right\}^2 \!\! + \lambda \cdot \sum_{k=1}^K \left( \alpha_k^2 + \beta_{0 \to k}^2 + \sum_{k'=1}^K \beta_{k'\to k}^2 \right) \end{equation*}

This objective function includes a penalty term just like the LASSO, but the penalty is quadratic. It’s equivalent to the common practice of training a neural network via gradient descent with weight decay.

Degrees of freedom

If our goal is to write down the LASSO as a special case of a neural network, then there are two apparent differences that need to be finessed. The first involves degrees of freedom. In the LASSO, there is one parameter that needs to be estimated for each predictor. In the neural network above, each predictor is associated with (K + 2) free parameters. In addition, you must also choose an activation function, \mathrm{h}(\cdot).

To represent the LASSO as a neural network, we’re going to have to shut down (K+1) of the degrees of freedom associated with each predictor. So, let’s start by looking at a neural network that’s “simply connected”—i.e., a network where \beta_{k' \to k} = 0 whenever k' \neq k. Let’s also assume a linear activation function, \mathrm{h}(z) = z, and restrict ourselves to the case where there’s no constant term, \beta_{0 \to k} = 0.

After making these assumptions, we are left with the neural network in the diagram above. There are now only two free parameters associated with each predictor: \alpha_k and \beta_{k \to k}. To estimate all 2 \cdot K of these values, we must minimize the objective below:

    \begin{equation*} \min_{\substack{\alpha_1, \ldots, \alpha_K \\ \beta_{1 \to 1}, \ldots, \beta_{K \to K}}} \, \sum_{n=1}^N \left\{ \, \mathit{Ret}_n - \sum_{k=1}^K \alpha_k \cdot \beta_{k \to k} \cdot X_{n,k} \, \right\}^2 + \lambda \cdot \sum_{k=1}^K (\alpha_k^2 + \beta_{k \to k}^2) \end{equation*}

This looks almost like the LASSO objective function. But there’s still one glaring difference left…

Nature of the penalty

In the LASSO, we’ve got an absolute-value penalty; whereas, the neural network has a quadratic penalty. This seems important! To see why, consider what happens when you replace the LASSO’s absolute-value penalty with a quadratic one:

    \begin{equation*} \min_{\beta_1,\ldots,\beta_K} \, \sum_{n=1}^N \left\{ \, \mathit{Ret}_n - \sum_{k=1}^K \beta_k \cdot X_{n,k} \, \right\}^2 + \lambda \cdot \sum_{k=1}^K \beta_k^2 \end{equation*}

When you do this, you’re left with something called the Ridge regression.

Just like with the LASSO, we can characterize the Ridge estimates relative to OLS in the case where there’s only one predictor or all predictors are orthogonal to one another:

    \begin{equation*} \hat{\beta}_k^{\text{Ridge}} = \left( {\textstyle \frac{1}{1 + \lambda}} \right) \times \hat{\beta}_k^{\text{OLS}} \end{equation*}

When you increase the value of \lambda in the figure to the right, you’ll see that the slope of the line changes. The larger the \lambda, the less \hat{\beta}_k^{\text{Ridge}} changes in response to a change in \hat{\beta}_k^{\text{OLS}}. Notice how this effect is qualitatively different from the effect of increasing \lambda in a LASSO specification. There, \lambda controlled the size of the inaction region. But, provided that |\hat{\beta}_k^{\text{LASSO}}| > 0, the LASSO estimate always moved one-for-one with \hat{\beta}_k^{\text{OLS}}.

However, this Ridge intuition is misleading. In the simply-connected neural-network structure that I outline above, we are not choosing a single coefficient \beta_{k \to k}. Instead, because there is a hidden layer, we are choosing the product \alpha_k \cdot \beta_{k \to k}. And this makes all the difference. For any value of c \geq 0, we have that

    \begin{equation*} \min_{\alpha,\beta \geq 0} \big\{ \, \alpha^2 + \beta^2 : \alpha \cdot \beta = c \, \big\} = 2 \cdot |c|  \end{equation*}

where the minimum is at \alpha_k = \beta_{k \to k} = \sqrt{c}. This is just the inequality relating arithmetic and geometric averages. Substituting in this optimal factorization, the quadratic penalty \lambda \cdot (\alpha_k^2 + \beta_{k \to k}^2) on each product coefficient c_k = \alpha_k \cdot \beta_{k \to k} collapses to 2 \cdot \lambda \cdot |c_k|, which is exactly the LASSO’s absolute-value penalty. It’s what allows a single hidden layer to sneak in a threshold through the back door.
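
Here is a minimal numerical check of that equivalence for a single predictor, using simulated data and scipy’s general-purpose optimizer (an assumption about your toolkit, not part of the original argument). Minimizing the quadratically penalized product objective over (\alpha, \beta) lands on the same coefficient as the LASSO’s soft-threshold formula.

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)

    # One predictor, scaled so that sum(x**2) = 1, which lets the
    # soft-threshold formula apply without any extra rescaling.
    N, lam = 50, 0.4
    x = rng.normal(size=N)
    x /= np.linalg.norm(x)
    y = 1.2 * x + 0.1 * rng.normal(size=N)

    beta_ols = x @ y
    beta_lasso = np.sign(beta_ols) * max(0.0, abs(beta_ols) - lam)

    # Simply connected "neural net": the coefficient is the product a*b and
    # the penalty is quadratic in (a, b), as in the objective above.
    def nn_objective(params):
        a, b = params
        return np.sum((y - a * b * x) ** 2) + lam * (a ** 2 + b ** 2)

    # The objective is nonconvex in (a, b), so take the best of a few starts.
    best = min(
        (minimize(nn_objective, rng.normal(size=2)) for _ in range(10)),
        key=lambda res: res.fun,
    )
    a_hat, b_hat = best.x
    print(f"LASSO coefficient:      {beta_lasso:.4f}")
    print(f"neural-net product a*b: {a_hat * b_hat:.4f}")   # should match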

Some extensions

We’ve just seen that you can think about the LASSO as a simply-connected two-layer neural network with a linear activation function and no bias terms, which was trained via gradient descent with weight decay. This is not my observation. I first saw it in Tibshirani21. The step where you reduce the degrees of freedom is obvious enough. But I had never made the connection with the arithmetic/geometric mean inequality. That second step struck me (and still strikes me) as really cool. It’s also a very concrete example of the flexibility inherent in neural networks. The hidden layer allows a neural network to do things you wouldn’t guess possible based only on the functional forms involved.

In addition to outlining the argument above, Tibshirani21 also gives a couple of other interesting extensions. e.g., the note shows how, by increasing the number of hidden layers in the neural network, you can reproduce the output of a LASSO-like specification below

    \begin{equation*} \min_{\beta_1,\ldots,\beta_K} \, \sum_{n=1}^N \left\{ \, \mathit{Ret}_n - \sum_{k=1}^K \beta_k \cdot X_{n,k} \, \right\}^2 + 2 \cdot \lambda \cdot \sum_{k=1}^K |\beta_k|^q \qquad \text{where} \qquad q \in (0, \, 1] \end{equation*}

The more hidden layers you include, the closer you get to best-subset selection, q=0. The note also shows that it’s possible to write group-LASSO as a neural network that ain’t quite so simply connected.

Filed Under: Uncategorized

Where’s the “narrative” in “narrative economics”?

October 29, 2022 by Alex

Bob Shiller defines “narrative economics” as the study of “how narrative contagion affects economic events”. This research program focuses on two things: “(1) the word-of-mouth contagion of ideas in the form of stories and (2) the efforts that people make to generate new contagious stories or to make stories more contagious.” In other words, if you could somehow get people to stop telling each other tall tales, then stock prices, GDP, interest rates, housing sales, etc would be different.

I’m a huge fan of this work. And even the most ardent critics of narrative economics still appreciate the power of a good narrative. For example, like all economics papers, the Campbell-Cochrane habit-formation paper has an introduction. The introduction gives an intuitive explanation of how the paper works using evocative language. The paper is called By Force of Habit for god’s sake. If that isn’t an effort to make the paper’s story more contagious, I don’t know what is.

But many researchers feel that narrative economics (in its current form) is less than scientific. And I think there’s something to these claims. Narrative economics says that economic outcomes are different because of the fact that people tell each other stories and embellish these stories in the process. But there are no specific parameters corresponding to these two tendencies in any economic model. Put differently, there’s no “narrative” in models of “narrative economics”. Whether or not a paper gets classified as a narrative-based model often comes down to how it’s written. This makes it hard to analyze how tall tales affect economic outcomes. What would the counterfactual world without the contagious narratives look like?

Compare and contrast this state of affairs with how other economic forces get modeled. For example, the Campbell-Cochrane habit-formation paper allows risk aversion to vary over time. We know this because we can directly point to this parameter in the model. And, as a result, we can imagine a world where risk aversion is no longer allowed to vary over time. It’s not clear how to do the same thing in narrative-based models. There’s no parameter that corresponds to the story-telling instinct.

To place narrative economics on firm footing, we need some way of toggling on and off peoples’ tendency to tell tall tales in our models. There needs to be a narrative module in our models. That way, we can analyze the effect of this module on things like stock prices, GDP, interest rates, housing sales, etc. This post highlights two problems with current narrative-based models that make it difficult to accomplish this goal. Then, I conclude by suggesting a way to put the “narrative” into models of narrative economics.

Narratives aren’t the only kind of epidemic

When Shiller talks and writes about narrative economics, he plays up the importance of how stories go “viral”. The reason why narratives have the power to affect economic outcomes is that they can spread contagiously from person to person via word of mouth like a meme or a virus. Shiller has been a strong proponent of using epidemiological models to study financial markets. I’m a huge fan of this idea! I’ve even got a paper which takes this exact approach. Epidemiological models can tell us a lot about how trader interactions affect market outcomes, leading to things like booms and busts.

That being said, it’s important to emphasize that there is no “narrative” in epidemiological models. In these models, something is exchanged when two agents interact with one another. That something could be a virus, a story, or an Egg McMuffin recipe. Anything that these models say about narratives must also apply to a virus or a delicious new way to start your day. Epidemiological models are models of interacting agents, not models of what happens when agents interact. Social finance and narrative economics are distinct fields.

Suppose there are N people, of which, S are currently sick. The remaining N - S = H people are healthy. Each instant, a sick person encounters a healthy person with probability (H/N) \cdot \mathrm{d}t. And, when that happens, the healthy person becomes sick at a rate of \beta per interaction. Otherwise, sick people recover with probability \gamma \cdot \mathrm{d}t each instant. This implies that the total population of sick people will evolve as

(1)   \begin{equation*} \mathrm{d}S = \underset{\text{contract disease}}{[\beta \cdot (H/N) \cdot S] \cdot \mathrm{d}t} - \underset{\text{get healthy}}{[\gamma \cdot S] \cdot \mathrm{d}t} \end{equation*}

Flipping the logic around then says that the population of healthy people must evolve according to:

(2)   \begin{equation*} \mathrm{d}H = \underset{\text{get healthy}}{[\gamma \cdot S] \cdot \mathrm{d}t} - \underset{\text{contract disease}}{[\beta \cdot (S/N) \cdot H] \cdot \mathrm{d}t} \end{equation*}

Notice that there is no biology in this model. There’s nothing specific to the structure of viruses or the life cycle of bacteria. There are just two interacting populations. These could be sick and healthy people. Or they could be excited speculators and rational investors. Epidemiological models can capture how an economic narrative spreads through a population. But they can tell us nothing about what an economic narrative actually is. This is a non-starter. We want to be able to plug the narrative module (whatever that happens to be) into an epidemiological model. But a narrative is different from how it spreads.
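
As a minimal sketch, the same two-population dynamics can be simulated in a few lines without ever saying what is being passed between agents; the parameter values below are arbitrary.

    # Arbitrary parameters: population N, transmission rate beta, recovery
    # rate gamma, and a small time step for the discretized dynamics.
    N, beta, gamma, dt = 1000.0, 0.30, 0.10, 0.1
    S, H = 10.0, 990.0          # start with 10 "sick" (or "excited") agents

    path = []
    for _ in range(1000):
        dS = (beta * (H / N) * S - gamma * S) * dt   # equation (1)
        S, H = S + dS, H - dS                        # equation (2), by construction
        path.append(S)

    # Nothing above is specific to disease: relabel S as "traders who have
    # heard the story" and the same curve describes a spreading narrative.
    print(f"peak: {max(path):.0f} of {N:.0f} agents")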

The tell-tale signs of people telling tall tales

Right now, economists mainly think in terms of models where investors solve forward-looking constrained optimization problems. Narrative economics argues that it’s valuable to think about narratives and models rather than just models. If this is true, then the story-telling instinct must be able to explain phenomena that the existing paradigm can’t. Narratives must be more than just bad explanations. They must be something that is fundamentally outside the current modeling paradigm. Otherwise, it won’t be possible to distinguish the implications of the narrative from the implications of a fine-tuned economic model.

Here’s what I mean. There are several recent papers (e.g., see here and here) that study narratives by incorporating ideas from the causal-inference literature. These papers model narratives using directed acyclic graphs (DAGs). If you have an underlying structural-equation model for the economy, then you can represent this model’s causal implications using a DAG in a way that is largely independent of the magnitudes of the parameter values involved. All that matters is the zero vs nonzero distinction.

Causal relationships without nitty-gritty parameter estimates… this might at first seem like a promising way of modeling narratives. But here’s the thing: anything that can be captured by a DAG can also be written down as a standard economic model. So, if you model narratives using DAGs, it can never be clear which is the real driver—the narrative or the underlying model.

To illustrate, suppose that the Campbell-Cochrane habit-formation model were strictly true. Suppose that the model in that paper was actually the data-generating process for observed asset prices in the real world. In this fictitious scenario, further suppose that whenever you ask traders, they talk about the world in exactly the way that Campbell and Cochrane do in their introduction. In this setup, traders would have a clear narrative about what was going on in the market, and this narrative would fit perfectly into a DAG. But the narrative would not be responsible for the observed market data. If you didn’t ask traders about what they were doing, all economic outcomes would be exactly the same.

DAGs capture the component of narratives that could be incorporated into a well-posed model. If it’s important to consider narratives in addition to standard economic models, then their contribution must come from something that cannot be captured by simply adding a new variable to a DAG. Narrative economics must represent more than just throwing away some of the information in a model.

Narratives determine how people construe events

Narrative economics says that economic outcomes are different because people tell each other stories that get exaggerated with each retelling. To test this claim, we want some way of adding a narrative to an existing economic model. Then, we can flip the story-telling instinct on and off in the model and examine the consequences.

Whatever this narrative module looks like, it won’t be tied to epidemiological models. These models aren’t specific to narratives. They’re called “epidemiological models” for a reason. What’s more, if we want to distinguish narrative economics from the existing model-based paradigm, then a narrative must be more than just a new variable or parameter. Otherwise, it would be easy to achieve the same results using a standard model. If all you do is show that prices tend to go up when there are more positive words in the Wall Street Journal, it’s easy to gin up alternative stories that don’t involve investors telling each other good stories.

We can’t define a narrative by studying epidemiological models. And we can’t associate a narrative with a single new variable or parameter. So where does that leave us?

In his 2013 Nobel Prize lecture, Bob Shiller urged economists to incorporate more ideas from psychology, sociology, and other fields. And I think this is exactly the right way to go. But, rather than turning to epidemiology, let’s look at the subfield of psychology that studies the interface between language and the mind—namely, cognitive linguistics. We want to identify the economic effects of conveying information between people via the medium of story. It stands to reason that stories might be stored differently by the brain relative to statistics, song, interpretive dance, divine proclamation, etc.

Cognitive linguistics tells us that stories affect how people “construe” events as being related to one another. For example, the force dynamics paradigm says that letting something be is not the same as pushing on it with zero force even though these two situations are identical according to every physics textbook. The narrative module we are looking for should tell us when to construe the same events in different ways.

Here’s another example. Suppose there are 600 people with a deadly disease and doctors are asked to choose between two treatments. Treatment A results in 400 deaths. Under treatment B there is a 33\% chance that no one will die but a 67\% chance that all 600 people will die. The narrative module should explain whether doctors frame this famous choice as “treatment A saves 200 people, and treatment B kills everyone 67\% of the time” or as “treatment A kills 400 people, and treatment B saves everyone 33\% of the time”.

There is direct evidence that human brains reason about stories using something like the force dynamics paradigm (e.g., see studies like this one). So we are not talking about layering on a “narrative interpretation” to a model as is the case with epidemiological models. And, if a story can change the *relationships among entire collections of variables*, it’s not easy to account for its predictions using a single existing model. Because its effects are non-local, you would need an entirely new model for each construal of events. Turning off the narrative module would be akin to steadfastly adhering to only one model. A narrative module should account for the way that people pick and choose which model to apply at different points in time. That’s my guess about how to model the “narrative” in “narrative economics”.

Filed Under: Uncategorized

Adversarial examples and quant quakes

September 15, 2022 by Alex

Imagine you’re a quantitative long-short equities trader. If you can predict which stocks will have above-average returns next period and which will have below-average returns, then you can profit by buying the winners and selling short the losers. Return predictability and trading profits are two sides of the same coin. Your entire job boils down to solving this classification problem. Ideally, the resulting trading strategy would be uncorrelated with every other quant strategy. That way, no other quants will eat into your profits.

Yet, even though every quant is actively trying to be his/her own special butterfly, we periodically experience quant quakes where lots of different strategies suddenly sync up and make the same bad trades. The most well-known example of this phenomenon is “The Quant Quake” which took place during the week of August 6th, 2007. “As soon as US markets started trading, the previously wildly successful automated investment algorithms… went horribly awry, and losses mounted at a frightening pace.”

To be clear, quant strategies generally tend to do a good job of being different from one another during normal times. It’s just that, every once in a while, they suddenly line up in the worst possible way. This synchronization seems like it should require a coordinating event. However, one of the most puzzling things about quant quakes is that they often occur at times when there’s nothing special about market conditions. For example, during the first week of August 2007, many sophisticated non-quant traders were totally unaware that a quant quake had taken place until they read about the events after the fact.

Thus, there are two questions to be answered here. First, why might quant strategies sometimes sync up and make the same wrong decisions even though they look quite different from one another during normal times? Second, how can this happen at times when market conditions look normal to non-quants?

This post proposes adversarial examples as a way to answer both these questions.

Suppose you fit a machine-learning model to historical data, and this model does a good job of classifying winners and losers in sample. It’s possible to create a new adversarial example that will reliably fool this model even though no human would ever make the same mistake. What’s more, this adversarial example will likely generalize to very different machine-learning models from the one you’re using so long as the model has also been trained on the same data. Even though quants are all trying to be different from one another, an adversarial example that fools you is likely to fool these other seemingly different quants, too.

The idea in this post is that quant quakes occur when quantitative long-short equities traders encounter a market populated by adversarial examples. The fact that adversarial examples generalize across models explains why different quant strategies suddenly sync up and make the same mistakes. Moreover, because adversarial examples involve changes that don’t fool human observers, it’s possible for the resulting quant quake to occur in market environments where nothing looks out of the ordinary to non-quants.

Classification problem

I start by laying out the simplest possible version of the classification problem you face as a quantitative long-short equity trader. There are N \gg 1 stocks. Next period, the nth stock will realize a relative return of R[n] \in \{-1, \, +\!1 \}. If the stock’s return is above average, then R[n] = +1. If its return is below average, then R[n] = -1. You could add finer gradations—positive, zero, and negative—very positive, positive, zero, negative, and very negative—and so on. But for now let’s just study a binary world for simplicity’s sake.

Each stock’s return is determined in equilibrium as some function of the asset’s K \gg 1 characteristics. Let X_k[n] denote the value of the kth characteristic for the nth stock, and let \boldsymbol{X}[n] = \{ X_k[n] \}_{k=1}^K denote a vector containing the values of all K characteristics for the nth asset. Think about X_k[n] as a single variable in the WRDS database. When talking about the return and/or characteristics of an arbitrary stock, I’ll simply drop the stock-specific argument and write (R, \, \boldsymbol{X}) rather than (R[n], \, \boldsymbol{X}[n]).

You don’t know the true data-generating process for \{ R[n] \}_{n=1}^N. Your goal is to find some function, \mathrm{f}_{\boldsymbol{\theta}}(\boldsymbol{X}), that correctly predicts these outcomes given the observed characteristics \{ \boldsymbol{X}[n] \}_{n=1}^N:

(1)   \begin{equation*} \mathrm{f}_{\boldsymbol{\theta}}: \mathbb{R}^K \to \{ -1, \,+\!1 \} \end{equation*}

\boldsymbol{\theta} represents a vector of tuning parameters involved in constructing such a function. Think about all the strategic choices made when constructing the “simple” Fama and French (1993) HML factor: video tutorial.

The simplest classification rule you might use is

(2)   \begin{equation*} \mathrm{f}_{\boldsymbol{\theta}}(\boldsymbol{X}) = 2 \cdot 1\big\{ \, \theta_0 + {\textstyle \sum_{k=1}^K} \theta_k \cdot X_k > 0 \, \big\} - 1 \end{equation*}

However, it probably makes more sense to use a more complicated machine-learning protocol, such as a multi-layer neural network. You do not need to believe that real-world investors actually use this classification rule to determine returns in equilibrium. You only care about whether it is “as if” they do.

Whatever the functional form of \mathrm{f}_{\boldsymbol{\theta}}(\boldsymbol{X}), you are free to choose the values of \boldsymbol{\theta} to minimize your average classification error in your training sample:

(3)   \begin{equation*} \hat{\boldsymbol{\theta}} = \arg \min_{\boldsymbol{\theta}} \, \frac{1}{N} \cdot \sum_{n=1}^N \, \ell\big( \, \mathrm{f}_{\boldsymbol{\theta}}(\boldsymbol{X}[n]), \, R[n] \, \big) \qquad \text{where} \qquad \ell(a, \, b) = 1\{ a \neq b \} \end{equation*}

In other words, you get penalized one point for each incorrect prediction. And you tune your classification function to minimize your expected penalty for a randomly selected stock in your training sample.
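
Here is a minimal sketch of what Equations (2) and (3) look like in code, using simulated characteristics and returns; the evaluation below is purely illustrative, and in practice you would plug in whatever fitting procedure your chosen \mathrm{f}_{\boldsymbol{\theta}} calls for.

    import numpy as np

    rng = np.random.default_rng(0)
    N, K = 500, 100

    # Simulated training sample: characteristics X and +1/-1 relative returns.
    X = rng.normal(size=(N, K))
    true_theta = rng.normal(size=K)
    R = np.where(X @ true_theta + 0.5 * rng.normal(size=N) > 0, 1, -1)

    def f(theta0, theta, X):
        """Equation (2): the thresholded linear rule, mapping into {-1, +1}."""
        return 2 * (theta0 + X @ theta > 0).astype(int) - 1

    def avg_classification_error(theta0, theta):
        """Equation (3): average 0-1 loss over the training sample."""
        return np.mean(f(theta0, theta, X) != R)

    # Fitting theta-hat means searching for the parameters that make this
    # number as small as possible.
    print(f"error at a random theta: {avg_classification_error(0.0, rng.normal(size=K)):.3f}")
    print(f"error at the true theta: {avg_classification_error(0.0, true_theta):.3f}")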

Market structure

You want to correctly label winner and loser stocks next period. You first choose a classification rule, \mathrm{f}_{\boldsymbol{\theta}}(\boldsymbol{X}). This is the choice you are making when deciding whether to deploy a neural network, a decision tree, or ensemble methods. Next, you fit this classification rule using training data to pin down \hat{\boldsymbol{\theta}}:

(4)   \begin{equation*} \{ (R[n], \, \boldsymbol{X}[n]) \}_{n=1}^N \qquad \text{with each} \qquad (R[n], \, \boldsymbol{X}[n]) \sim \Pi_{\textit{Obs}} \end{equation*}

Let \Pi_{\textit{Obs}} denote the joint distribution of characteristics and returns in the previous trading period. Once you calibrate your model, you can trade on the resulting buy/sell recommendations next period.

You are not the only quant in the market. There are lots of other quantitative long-short equities traders who are following the same protocol. However, as discussed above, each of you wants to be as different from everyone else as possible. You do this by taking very different approaches to classifying stocks in the first step. Each of you is using a different classification rule, \mathrm{f}_{\boldsymbol{\theta}}(\boldsymbol{X}).

However, history only occurs once. You’re all fitting your different models to the same training data. Suppose each stock has K = 100 characteristics. Further suppose that I am using the simple logistic model in Equation (2) with 101 tuning parameters while you are using a multi-layer neural network with thousands of tuning parameters for the same 100 characteristics. \boldsymbol{\theta} is a very different object in each of our models. But we will both be using the same historical cross-section of returns to estimate these tuning parameters.

You might think that, because every quant in the market is using a very different classification rule from the rest, any errors in their buy/sell recommendations will be largely unrelated. If my simple logistic model says to buy a particular stock when the right move would have been to sell it short, why should we expect a multi-layer neural network to make the same mistake? And, if they did, it seems like there should be some obvious red flag signaling why this would be the case.

Adversarial examples

Both of these intuitions are correct when looking at a randomly selected observation from the training data. However, they don’t hold if we’re not sampling from \Pi_{\textit{Obs}}. It’s possible to slightly perturb a stock’s characteristics in a way that will cause your algorithm to misclassify it even though no human observer would notice the difference. The resulting observation is called an adversarial example. And it turns out that adversarial examples tend to generalize across machine-learning algorithms trained on the same data. An adversarial example that fools your multi-layer neural network is likely to fool my simple logistic model.

Here’s the canonical adversarial example from Goodfellow, Shlens, and Szegedy (2014). The authors want to label pictures of animals in the ImageNet database with the kind of animal in the image. So they fit a 22-layer neural network (GoogLeNet). They then choose an image that their fitted model correctly labels as a “panda” (left panel) and strategically add some noise to this image (middle panel). When they ask their fitted model to classify this new altered image (right panel), it says the image contains a “gibbon” with high confidence even though no human would ever mistake the perturbed image for anything other than a panda.

[Figure from Goodfellow, Shlens, and Szegedy (2014): a correctly classified panda image, the adversarial noise, and the perturbed image confidently misclassified as a “gibbon”.]

An adversarial example is defined relative to a particular classification rule and a specific observation that this rule is known to classify correctly. Given these things, you then look for a small amount of noise \boldsymbol{Z} that would flip your algorithm’s recommendation for that stock while leaving its realized return unaffected:

(5)   \begin{equation*} \boldsymbol{Z}^a[n] = \arg \max_{\boldsymbol{Z} \in \mathcal{Z}} \, \ell( \mathrm{f}_{\hat{\boldsymbol{\theta}}}(\boldsymbol{X}[n] + \boldsymbol{Z}), \, R[n]) \end{equation*}

If a stock with characteristics \boldsymbol{X}[n] has return R[n] = +1, then a second stock that has characteristics (\boldsymbol{X}[n] + \boldsymbol{Z}^a[n]) should also have return R[n]= +1. However, while your classification rule correctly labels the first stock as a “buy”, it should spit out \mathrm{f}_{\hat{\boldsymbol{\theta}}}(\boldsymbol{X}[n] + \boldsymbol{Z}^a[n]) = -1 when given the second stock’s info.

The standard way to define “a small amount of noise” in the machine-learning literature is via the \ell_{\infty} norm:

(6)   \begin{equation*} \mathcal{Z} = \big\{ \, \boldsymbol{Z} \, : \, \max_{k=1,\ldots,K} | Z_k | < \epsilon \, \big\} \end{equation*}

\mathcal{Z} is the set of K-dimensional vectors where no individual element is more than \epsilon away from zero. For example, the middle panel in the figure above is a perturbation in which no individual pixel’s intensity is changed by more than a tiny amount.

It’s not surprising that worst-case examples exist. But Goodfellow, Shlens, and Szegedy (2014) show that adversarial examples are easy to generate. You can reliably produce perturbations like the one in the middle panel above. All you have to do is compute the gradient of your loss function with respect to the stock’s characteristics:

(7)   \begin{equation*} \hat{\boldsymbol{G}}[n] = \nabla_{\!\boldsymbol{X}}\big\{ \, \ell( \mathrm{f}_{\hat{\boldsymbol{\theta}}}(\boldsymbol{X}[n]), \, R[n]) \, \big\} \end{equation*}

(Because the zero-one loss in Equation (3) is flat almost everywhere, in practice this gradient is computed for a smooth surrogate of that loss, such as the logistic loss.) To get an adversarial perturbation vector that is close to optimal in the sense of Equation (5), just create a vector that always points a little bit in the same direction as the gradient:

(8)   \begin{equation*} \boldsymbol{Z}[n] = \epsilon \cdot \mathrm{Sign}(\hat{\boldsymbol{G}}[n]) \end{equation*}

This is known as the fast gradient-sign method for constructing adversarial examples.
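
Here’s what the fast gradient-sign method looks like in the toy linear model from the earlier sketch (it reuses theta, Xb, R, and f from that code). Because the zero-one loss is flat almost everywhere, the gradient below is taken with respect to the logistic surrogate used to fit the model; that substitution, the value of \epsilon, and the choice of a near-boundary stock are all illustrative assumptions.

# Continues the earlier sketch: craft the FGSM perturbation of Equations (7)-(8)
# for the correctly classified stock that sits closest to the decision boundary.
eps = 0.05                                                   # the l-infinity budget in Equation (6)

correct = np.flatnonzero(f(theta, Xb) == R)
n = correct[np.argmin(np.abs(Xb[correct] @ theta))]          # correctly labeled stock nearest the boundary
x_n, r_n = Xb[n], R[n]

# Equation (7): gradient of the (surrogate) loss with respect to the characteristics.
# For the linear rule this is -r * sigmoid(-r * score) * theta, intercept excluded.
G = -r_n * theta[1:] / (1 + np.exp(r_n * (x_n @ theta)))

Z = eps * np.sign(G)                                         # Equation (8): the FGSM perturbation
x_adv = np.concatenate([[1.0], x_n[1:] + Z])

print("max |Z_k|      :", np.abs(Z).max())                   # never exceeds eps
print("true label     :", int(r_n))
print("original call  :", int(f(theta, x_n)))                # correct by construction
print("perturbed call :", int(f(theta, x_adv)))              # flips for this near-boundary stock;
                                                             # stocks far from the boundary need a larger eps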

What’s more, adversarial examples generated using the fast gradient-sign method tend to generalize across machine-learning algorithms trained on the same data. Papernot, McDaniel, and Goodfellow (2016) document that it doesn’t matter whether you are using a different neural net, ensemble methods, a decision tree, etc. If it’s been trained on the same data, it’s likely to be fooled by the same adversarial examples.

Here’s the conventional wisdom about why adversarial examples generalize. Suppose that \mathrm{f}_{\hat{\boldsymbol{\theta}}}(\boldsymbol{X}) and \mathrm{f}_{\hat{\boldsymbol{\vartheta}}}(\boldsymbol{X}) are different machine-learning algorithms that both predict winners and losers in historical data on the cross-section of returns. Given that they are both successful in sample, it would be weird if these two algorithms had very different gradients around the observed data points, \{ \boldsymbol{X}[n] \}_{n=1}^N. At the very least, the signs should be the same. And this simple observation implies that an adversarial example created for \mathrm{f}_{\hat{\boldsymbol{\theta}}}(\boldsymbol{X}) via the fast gradient-sign method in Equation (8) will also likely fool \mathrm{f}_{\hat{\boldsymbol{\vartheta}}}(\boldsymbol{X}) and vice versa.
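
To see this logic in miniature, continue the toy sketch above: fit a second linear rule to a random half of the same training data as a (very loose) stand-in for “a different algorithm trained on the same history”, check how often its gradient signs agree with the first model’s, and check whether the perturbed stock from the previous snippet fools it too. The half-sample device is my own simplification, not anything from the papers cited here.

# Continues the sketch above: a stylized check of why FGSM examples transfer.
half = rng.choice(N, size=N // 2, replace=False)             # a random half of the same history
theta2 = fit_theta(Xb[half], R[half])

# For the linear rule, the sign of the input gradient is -r * sign(theta_k), so
# gradient-sign agreement between the two models reduces to sign(theta) agreement.
print("gradient-sign agreement:", np.mean(np.sign(theta[1:]) == np.sign(theta2[1:])))

# The FGSM example crafted against the first rule typically fools the second one
# too, although this is not guaranteed for any particular stock.
print("model 1 on perturbed stock:", int(f(theta, x_adv)))
print("model 2 on perturbed stock:", int(f(theta2, x_adv)), " true label:", int(r_n))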

Thus, the idea in this post is that quant quakes occur when quantitative long-short equities traders encounter a market populated by adversarial examples. Because adversarial examples generalize across models, different kinds of quant strategies can all make the same mistakes. And, because an adversarial example is imperceptibly different from a “normal” stock to human eyes, it’s possible for a quant quake to occur in a market where nothing looks out of the ordinary to non-quants.

Practical implications

So… why are markets sometimes full of adversarial examples?

Good question. I don’t know. Wish I did.

But, even if we don’t yet have an answer to this question, we still gain a lot of insight by looking at quant quakes through the lens of adversarial examples. We’re not just replacing one unknown with another.

To start with, “What economic force might distort stock characteristics in precisely the way needed to create adversarial examples?” is at least a well-posed research question. This is progress. In addition, even if we don’t yet understand the data-generating process of adversarial examples, the existing machine-learning literature offers clear suggestions about how a quant can protect themselves against these events when they do occur. You should add adversarial examples to your training data and refit. And it’s possible to do so because adversarial examples are easy to create via the fast gradient-sign method in Equation (8).
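
Here’s the crudest version of that recipe, continuing the toy sketch above: build a fast gradient-sign perturbation for every stock in the training sample, append the perturbed copies with their original labels, and refit. The details (surrogate loss, \epsilon, refit settings) are the same illustrative assumptions as before, not a production-grade defense.

# Continues the sketch above: crude adversarial training on the toy linear rule.
sig = 1 / (1 + np.exp(np.clip(R * (Xb @ theta), -30, 30)))   # sigmoid(-r * score) > 0 for every stock
Z_all = eps * np.sign(-(R * sig)[:, None] * theta[1:][None, :])  # Equation (8), one row per stock
Xb_adv = np.column_stack([np.ones(N), Xb[:, 1:] + Z_all])

print("clean error, original fit      :", np.mean(f(theta, Xb) != R))
print("adversarial error, original fit:", np.mean(f(theta, Xb_adv) != R))

# Refit on the clean data plus the perturbed copies (with their original labels).
theta_robust = fit_theta(np.vstack([Xb, Xb_adv]), np.concatenate([R, R]))
print("adversarial error after refit  :", np.mean(f(theta_robust, Xb_adv) != R))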

Last but not least, the adversarial-examples hypothesis makes clear predictions about what won’t stop quakes from happening. The amount of money invested in quant strategies has grown dramatically since the OG 2007 Quant Quake. We’ve seen tremors, and “some analysts fear that another 2007-style meltdown would be more severe due to the proliferation of quant strategies.” On the other hand, industry reports often suggest that these fears are overblown. The argument goes that, because quants now trade on so many more characteristics, it’s much less likely that they will all make the same mistakes at the exact same time.

But if the adversarial-examples hypothesis is correct, then having more characteristics makes things worse! As Goodfellow, Shlens, and Szegedy (2014) point out, it’s easier to create convincing adversarial examples when K is larger. Consider the simple logistic classification rule in Equation (2). To change a buy/sell recommendation, the sign of \theta_0 + {\textstyle \sum_{k=1}^K} \theta_k \cdot X_k[n] needs to change. The adversarial perturbation in Equation (8) sets each element of \boldsymbol{Z}^a[n] to \pm \epsilon, aligned with the corresponding gradient sign, so it can shift this sum by \epsilon \cdot {\textstyle \sum_{k=1}^K} |\theta_k|, an amount that grows linearly with K even though no individual characteristic moves by more than \epsilon. So, when K is larger, it’s easier to change buy/sell recommendations.
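
A quick, self-contained numerical illustration (the coefficient scale is made up): hold \epsilon and the typical size of each \theta_k fixed and let K grow. The largest shift a gradient-sign perturbation can induce in the linear score grows roughly linearly in K, even though no single characteristic ever moves by more than \epsilon.

# The shift an FGSM perturbation can induce in the linear score of Equation (2)
# is eps * sum_k |theta_k|, which grows with K while ||Z||_inf stays at eps.
import numpy as np

rng = np.random.default_rng(1)
eps = 0.05                                        # the per-characteristic l-infinity budget
for K in (10, 100, 1_000, 10_000):
    theta_k = rng.normal(size=K)                  # made-up loadings with a fixed typical size
    max_shift = eps * np.abs(theta_k).sum()       # achieved by setting each Z_k = -r * eps * sign(theta_k)
    print(f"K = {K:>6}: max score shift = {max_shift:8.2f}, while max |Z_k| stays at {eps}")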

In some sense, you can think about this as the dark side of “betting on sparsity”. Suppose there are almost as many predictors as observations. If only a few of these predictors matter, it’s straightforward to estimate this handful of coefficients. By contrast, if all predictors matter a little bit, it’s damn near impossible to do so. But this implies that you can hide important information in the data by spreading it diffusely across a large number of predictors. There’s a generalized uncertainty principle at work a la Donoho and Stark (1989).
