Imagine you’re a quantitative long-short equities trader. If you can predict which stocks will have above-average returns next period and which will have below-average returns, then you can profit by buying the winners and selling short the losers. Return predictability and trading profits are two sides of the same coin. Your entire job boils down to solving this classification problem. Ideally, the resulting trading strategy would be uncorrelated with every other quant strategy. That way, no other quants will eat into your profits.
Yet, even though every quant is actively trying to be his/her own special butterfly, we periodically experience quant quakes where lots of different strategies suddenly sync up and make the same bad trades. The best-known example of this phenomenon is “The Quant Quake”, which took place during the week of August 6th, 2007. “As soon as US markets started trading, the previously wildly successful automated investment algorithms… went horribly awry, and losses mounted at a frightening pace.”
To be clear, quant strategies generally do a good job of being different from one another during normal times. It’s just that, every once in a while, they suddenly line up in the worst possible way. This synchronization seems like it should require a coordinating event. However, one of the most puzzling things about quant quakes is that they often occur at times when there’s nothing special about market conditions. For example, during the first week of August 2007, many sophisticated non-quant traders were totally unaware that a quant quake had taken place until they read about the events after the fact.
Thus, there are two questions to be answered here. First, why might quant strategies sometimes sync up and make the same wrong decisions even though they look quite different from one another during normal times? Second, how can this happen at times when market conditions look normal to non-quants?
This post proposes adversarial examples as a way to answer both these questions.
Suppose you fit a machine-learning model to historical data, and this model does a good job of classifying winners and losers in sample. It’s possible to create a new adversarial example that will reliably fool this model even though no human would ever make the same mistake. What’s more, this adversarial example will likely generalize to machine-learning models very different from the one you’re using, so long as those models have also been trained on the same data. Even though quants are all trying to be different from one another, an adversarial example that fools you is likely to fool these other seemingly different quants, too.
The idea in this post is that quant quakes occur when quantitative long-short equities traders encounter a market populated by adversarial examples. The fact that adversarial examples generalize across models explains why different quant strategies suddenly sync up and make the same mistakes. Moreover, because adversarial examples involve changes that don’t fool human observers, it’s possible for the resulting quant quake to occur in market environments where nothing looks out of the ordinary to non-quants.
Classification problem
I start by laying out the simplest possible version of the classification problem you face as a quantitative long-short equity trader. There are $N$ stocks. Next period, the $n$th stock will realize a relative return of $r_n \in \{+1,\, -1\}$. If the stock’s return is above average, then $r_n = +1$. If its return is below average, then $r_n = -1$. You could add finer gradations (positive, zero, and negative; very positive, positive, zero, negative, and very negative; and so on), but for now let’s just study a binary world for simplicity’s sake.
Each stock’s return is determined in equilibrium as some function of the asset’s characteristics. Let $x_{k,n}$ denote the value of the $k$th characteristic for the $n$th stock, and let $\mathbf{x}_n = (x_{1,n},\, \ldots,\, x_{K,n})$ denote a vector containing the values of all $K$ characteristics for the $n$th asset. Think about $x_{k,n}$ as a single variable in the WRDS database. When talking about the return and/or characteristics of an arbitrary stock, I’ll simply drop the stock-specific subscript and write $(r,\, \mathbf{x})$ rather than $(r_n,\, \mathbf{x}_n)$.
You don’t know the true data-generating process for $r$. Your goal is to find some function, $f(\mathbf{x};\, \theta)$, that correctly predicts these outcomes given the observed characteristics $\mathbf{x}$:

$$r \;=\; f(\mathbf{x};\, \theta) \qquad (1)$$
$\theta$ represents a vector of tuning parameters involved in constructing such a function. Think about all the strategic choices made when constructing the “simple” Fama and French (1993) HML factor: video tutorial.
The simplest classification rule you might use is
$$f(\mathbf{x};\, \theta) \;=\; \mathrm{Sign}\!\left[\, \frac{1}{1 + e^{-(\theta_0 + \theta_1 \cdot x_1 + \cdots + \theta_K \cdot x_K)}} \;-\; \frac{1}{2} \,\right] \qquad (2)$$
However, it probably makes more sense to use a more complicated machine-learning protocol, such as a multi-layer neural network. You do not need to believe that real-world investors actually use this classification rule to determine returns in equilibrium. You only care about whether it is “as if” they do.
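For concreteness, here is a minimal Python sketch of the rule in Equation (2). The function name and the toy parameter values are mine, purely for illustration.

```python
import numpy as np

def logistic_rule(x, theta0, theta):
    """Equation (2): label a stock +1 (buy) or -1 (sell) using a logistic score.

    x      : length-K vector of stock characteristics
    theta0 : intercept tuning parameter
    theta  : length-K vector of slope tuning parameters
    """
    score = 1.0 / (1.0 + np.exp(-(theta0 + theta @ x)))   # logistic score in (0, 1)
    return 1 if score > 0.5 else -1                       # above one half => predicted winner

# Toy usage with made-up parameter values
x = np.array([0.3, -1.2, 0.8])                            # K = 3 characteristics for one stock
print(logistic_rule(x, theta0=0.1, theta=np.array([0.5, -0.2, 0.7])))   # prints +1 or -1
```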
Whatever the functional form of $f(\mathbf{x};\, \theta)$, you are free to choose the values of $\theta$ to minimize your average classification error in your training sample:

$$\min_{\theta}\; \mathbb{E}\big[\, \mathcal{L}(\mathbf{x},\, r;\; \theta) \,\big] \qquad \text{where} \qquad \mathcal{L}(\mathbf{x},\, r;\; \theta) \;=\; \mathbb{1}\{\, f(\mathbf{x};\, \theta) \neq r \,\} \qquad (3)$$
In other words, you get penalized one point for each incorrect prediction. And you tune your classification function to minimize your expected penalty for a randomly selected stock in your training sample.
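Here is one way this fitting step might look on simulated data. The data-generating process and sample sizes are made up, and I use scikit-learn’s LogisticRegression, which minimizes a smooth logistic surrogate rather than the raw one-point penalty in Equation (3); the in-sample classification error is then reported directly.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
N, K = 5_000, 10                                   # number of stocks and characteristics (made up)
X = rng.standard_normal((N, K))                    # simulated characteristics
true_theta = rng.standard_normal(K)
r = np.where(X @ true_theta + 0.5 * rng.standard_normal(N) > 0, 1, -1)   # +1 / -1 labels

model = LogisticRegression().fit(X, r)             # pins down the tuning parameters
in_sample_error = np.mean(model.predict(X) != r)   # average one-point penalty, as in Equation (3)
print(f"in-sample classification error: {in_sample_error:.3f}")
```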
Market structure
You want to correctly label winner and loser stocks next period. You first choose a classification rule, $f(\mathbf{x};\, \theta)$. This is the choice you are making when deciding whether to deploy a neural network, a decision tree, or ensemble methods. Next, you fit this classification rule using training data to pin down $\hat{\theta}$:

$$\hat{\theta} \;=\; \arg\min_{\theta}\; \mathbb{E}_{(\mathbf{x},\, r)\, \sim\, \mathcal{D}}\big[\, \mathcal{L}(\mathbf{x},\, r;\; \theta) \,\big] \qquad (4)$$
Let $\mathcal{D}$ denote the joint distribution of characteristics and returns, $(\mathbf{x},\, r)$, in the previous trading period. Once you calibrate your model, you can trade on the resulting buy/sell recommendations next period.
You are not the only quant in the market. There are lots of other quantitative long-short equities traders who are following the same protocol. However, as discussed above, each of you wants to be as different from everyone else as possible. You do this by taking very different approaches to classifying stocks in the first step. Each of you is using a different classification rule, a different functional form for $f(\mathbf{x};\, \theta)$.
However, history only occurs once. You’re all fitting your different models to the same training data. Suppose each stock has $K$ characteristics. Further suppose that I am using the simple logistic model in Equation (2) with $(K+1)$ tuning parameters while you are using a multi-layer neural network with thousands of tuning parameters for the same $K$ characteristics. $\theta$ is a very different object in each of our models. But we will both be using the same historical cross-section of returns to estimate these tuning parameters.
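As a sketch of this “different rules, same history” setup, the snippet below fits a simple logistic model and a small multi-layer neural network to one simulated cross-section of characteristics and returns. The simulated data and the network architecture are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
N, K = 5_000, 10
X = rng.standard_normal((N, K))                    # the one shared training history
true_theta = rng.standard_normal(K)
r = np.where(X @ true_theta + 0.5 * rng.standard_normal(N) > 0, 1, -1)

quant_a = LogisticRegression().fit(X, r)           # K + 1 tuning parameters
quant_b = MLPClassifier(hidden_layer_sizes=(64, 64),
                        max_iter=2_000, random_state=1).fit(X, r)   # thousands of tuning parameters

# Very different rules, identical training data: compare their in-sample buy/sell calls
agree = np.mean(quant_a.predict(X) == quant_b.predict(X))
print(f"share of identical buy/sell recommendations in sample: {agree:.3f}")
```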
You might think that, because every quant in the market is using a very different classification rule from the rest, any errors in their buy/sell recommendations will be largely unrelated. If my simple logistic model says to buy a particular stock when the right move would have been to sell it short, why should we expect a multi-layer neural network to make the same mistake? And, if they did, it seems like there should be some obvious red flag signaling why this would be the case.
Adversarial examples
Both these intuitions are correct when looking at a randomly selected observation from the training data. However, they don’t hold if we’re not sampling from $\mathcal{D}$. It’s possible to slightly perturb a stock’s characteristics in a way that will cause your algorithm to misclassify it even though no human observer would notice the difference. The resulting observation is called an adversarial example. And it turns out that adversarial examples tend to generalize across machine-learning algorithms trained on the same data. An adversarial example that fools your multi-layer neural network is likely to fool my simple logistic model.
Here’s the canonical adversarial example from Goodfellow, Shlens, and Szegedy (2014). The authors want to label pictures of animals in the ImageNet database with the kind of animal in the image. So they fit a 22-layer neural network (GoogLeNet). They then choose an image that their fitted model correctly labels as a “panda” (left panel) and strategically add some noise to this image (middle panel). When they ask their fitted model to classify this new altered image (right panel), it says the image contains a “gibbon” with high confidence even though no human would ever mistake the perturbed image for anything other than a panda.
An adversarial example is defined relative to a particular classification rule, $f(\mathbf{x};\, \hat{\theta})$, and a specific observation, $(\mathbf{x},\, r)$, that this rule is known to classify correctly. Given these things, you then look for a small amount of noise, $\mathbf{a}$, that would flip your algorithm’s recommendation for that stock while leaving its realized return unaffected:

$$\mathbf{a}^{\star} \;=\; \arg\max_{\mathbf{a}\, \in\, \mathcal{A}_{\epsilon}}\; \mathcal{L}(\mathbf{x} + \mathbf{a},\, r;\; \hat{\theta}) \qquad (5)$$
If a stock with characteristics $\mathbf{x}$ has return $r = +1$, then a second stock that has characteristics $\mathbf{x} + \mathbf{a}$ should also have return $r = +1$. However, while your classification rule correctly labels the first stock as a “buy”, $f(\mathbf{x};\, \hat{\theta}) = +1$, it should spit out $f(\mathbf{x} + \mathbf{a};\, \hat{\theta}) = -1$ when given the second stock’s info.
The standard way to define “a small amount of noise” in the machine-learning literature is via the $\ell_{\infty}$ norm:

$$\mathcal{A}_{\epsilon} \;=\; \big\{\, \mathbf{a} \in \mathbb{R}^{K} \; : \; \max_{k} |a_k| \leq \epsilon \,\big\} \qquad (6)$$
$\mathcal{A}_{\epsilon}$ is the set of $K$-dimensional vectors where no individual element is more than $\epsilon$ away from zero. For example, the middle panel in the figure above is a bunch of pixels where the colors aren’t too bright.
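In code, membership in $\mathcal{A}_{\epsilon}$ is just a bound on the largest absolute element of the perturbation; the function name below is mine.

```python
import numpy as np

def in_A_eps(a, eps):
    """Equation (6): is the perturbation inside the l-infinity ball of radius eps?"""
    return np.max(np.abs(a)) <= eps

a = np.array([0.009, -0.004, 0.007])
print(in_A_eps(a, eps=0.01))   # True: no element is more than 0.01 away from zero
```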
It’s not surprising that worst-case examples exist. But Goodfellow, Shlens, and Szegedy (2014) show that adversarial examples are easy to generate. You can reliably produce perturbations like the one in the middle panel above. All one has to do is compute the gradient of your loss function with respect to the stock’s characteristics:

$$\nabla_{\mathbf{x}}\, \mathcal{L}(\mathbf{x},\, r;\; \hat{\theta}) \qquad (7)$$
To get an adversarial perturbation vector that is close to optimal in the sense of Equation (5), just create a vector that always points a little bit in the same direction as the gradient:
$$\mathbf{a} \;=\; \epsilon \cdot \mathrm{Sign}\big[\, \nabla_{\mathbf{x}}\, \mathcal{L}(\mathbf{x},\, r;\; \hat{\theta}) \,\big] \qquad (8)$$
This is known as the fast gradient-sign method for constructing adversarial examples.
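Here is a sketch of the fast gradient-sign method applied to the logistic rule in Equation (2). One caveat: the one-point penalty in Equation (3) has a gradient of zero almost everywhere, so, as is standard in this literature, I differentiate a smooth surrogate (the logistic loss) instead. Variable names and parameter values are mine.

```python
import numpy as np

def fgsm_perturbation(x, r, theta0, theta, eps):
    """Equation (8): eps times the sign of the gradient of a logistic surrogate loss w.r.t. x.

    For the logistic rule, L(x, r) = log(1 + exp(-r * (theta0 + theta @ x))), so
    grad_x L = -r * theta / (1 + exp(r * (theta0 + theta @ x))).
    """
    z = theta0 + theta @ x
    grad_x = -r * theta / (1.0 + np.exp(r * z))      # gradient of the surrogate loss w.r.t. x
    return eps * np.sign(grad_x)

# Toy usage: nudge a correctly classified "buy" toward a "sell" recommendation
theta0, theta = 0.0, np.array([0.5, -0.2, 0.7])
x, r = np.array([0.2, 0.1, -0.1]), +1
a = fgsm_perturbation(x, r, theta0, theta, eps=0.05)
print("perturbation:", a)                            # every element is +/- 0.05
print("original score sign: ", np.sign(theta0 + theta @ x))        # +1 for this toy example
print("perturbed score sign:", np.sign(theta0 + theta @ (x + a)))  # -1 for this toy example
```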
What’s more, adversarial examples generated using the fast gradient-sign method tend to generalize across machine-learning algorithms trained on the same data. Papernot, McDaniel, and Goodfellow (2016) document that it doesn’t matter whether you are using a different neural net, ensemble methods, a decision tree, etc. If it’s been trained on the same data, it’s likely to be fooled by the same adversarial examples.
Here’s the conventional wisdom about why adversarial examples generalize. Suppose that $f(\mathbf{x};\, \theta_f)$ and $g(\mathbf{x};\, \theta_g)$ are different machine-learning algorithms that both predict winners and losers in historical data on the cross-section of returns. Given that they are both successful in sample, it would be weird if these two algorithms had very different gradients around the observed data points, $\{(\mathbf{x}_n,\, r_n)\}$. At the very least, the signs should be the same. And this simple observation implies that an adversarial example created for $f$ via the fast gradient-sign method in Equation (8) will also likely fool $g$ and vice versa.
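The snippet below gives a rough feel for this transferability claim on simulated data: perturbations crafted with the fast gradient-sign method against the simple logistic model are then fed to a neural network trained on the same sample. Actual transfer rates will depend on the data and the models; this is only an illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
N, K, eps = 5_000, 10, 0.25
X = rng.standard_normal((N, K))                      # one shared history of characteristics
true_theta = rng.standard_normal(K)
r = np.where(X @ true_theta + 0.5 * rng.standard_normal(N) > 0, 1, -1)

logit = LogisticRegression().fit(X, r)               # "my" simple model
mlp = MLPClassifier(hidden_layer_sizes=(64, 64),
                    max_iter=2_000, random_state=1).fit(X, r)   # "your" model

# Fast gradient-sign perturbations crafted against the *logistic* model only.
# For the logistic surrogate loss, grad_x has the same sign as -r * coefficients.
theta_hat = logit.coef_.ravel()
A = eps * np.sign(-r[:, None] * theta_hat[None, :])
X_adv = X + A

for name, m in [("logistic model", logit), ("neural network", mlp)]:
    flipped = np.mean(m.predict(X_adv) != m.predict(X))
    print(f"{name}: share of buy/sell calls flipped by the same perturbations = {flipped:.2f}")
```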
Thus, the idea in this post is that quant quakes occur when quantitative long-short equities traders encounter a market populated by adversarial examples. Because adversarial examples generalize across models, different kinds of quant strategies can all make the same mistakes. And, because an adversarial example is imperceptibly different from a “normal” stock to human eyes, it’s possible for a quant quake to occur in a market where nothing looks out of the ordinary to non-quants.
Practical implications
So… why are markets sometimes full of adversarial examples?
Good question. I don’t know. Wish I did.
But, even if we don’t yet have an answer to this question, we still gain a lot of insight by looking at quant quakes through the lens of adversarial examples. We’re not just replacing one unknown with another.
To start with, “What economic force might distort stock characteristics in precisely the way needed to create adversarial examples?” is at least a well-posed research question. This is progress. In addition, even if we don’t yet understand the data-generating process of adversarial examples, the existing machine-learning literature offers a clear suggestion about how a quant can protect himself/herself against these events when they do occur: add adversarial examples to your training data. And it’s possible to do so because adversarial examples are easy to create via the fast gradient-sign method in Equation (8).
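Concretely, the standard recipe from the machine-learning literature is adversarial training: generate fast-gradient-sign examples against your current fit, append them to the training sample with their original labels, and refit. A minimal sketch under the same simulated setup as above (a fuller treatment would iterate this loop and evaluate against freshly crafted examples):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
N, K, eps = 5_000, 10, 0.25
X = rng.standard_normal((N, K))
true_theta = rng.standard_normal(K)
r = np.where(X @ true_theta + 0.5 * rng.standard_normal(N) > 0, 1, -1)

model = LogisticRegression().fit(X, r)

# Step 1: craft fast-gradient-sign examples against the current fit (Equation (8)).
theta_hat = model.coef_.ravel()
X_adv = X + eps * np.sign(-r[:, None] * theta_hat[None, :])

# Step 2: append them to the training data with their *unchanged* labels and refit.
X_aug = np.vstack([X, X_adv])
r_aug = np.concatenate([r, r])
robust_model = LogisticRegression().fit(X_aug, r_aug)

print(f"error on adversarial examples, original fit:         {np.mean(model.predict(X_adv) != r):.2f}")
print(f"error on adversarial examples, adversarial training:  {np.mean(robust_model.predict(X_adv) != r):.2f}")
```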
Last but not least, the adversarial-examples hypothesis makes clear predictions about what won’t stop quakes from happening. The amount of money invested in quant strategies has grown dramatically since the OG 2007 Quant Quake. We’ve seen tremors, and “some analysts fear that another 2007-style meltdown would be more severe due to the proliferation of quant strategies.” On the other hand, industry reports often suggest that these fears are overblown: because quants now trade on so many more characteristics, the argument goes, it’s much less likely that they will all make the same mistakes at the exact same time.
But if the adversarial-examples hypothesis is correct, then having more characteristics makes things worse! As Goodfellow, Shlens, and Szegedy (2014) point out, it’s easier to create convincing adversarial examples when $K$ is larger. Consider the simple logistic classification rule in Equation (2). To change a buy/sell recommendation, the sign of $\theta_0 + \theta_1 \cdot x_1 + \cdots + \theta_K \cdot x_K$ needs to change. The adversarial perturbation in Equation (8) can change the value of this expression by an amount proportional to $\epsilon \cdot \sum_{k=1}^{K} |\theta_k|$ since no element of $\mathbf{a}$ can be more than $\epsilon$ away from zero. So, when $K$ is larger, it’s easier to change buy/sell recommendations.
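A quick back-of-the-envelope check on this scaling argument, with made-up parameter magnitudes: the worst-case shift in the logistic score from an $\ell_{\infty}$-bounded perturbation grows with $K$ even though no individual characteristic moves by more than $\epsilon$.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.01                                    # tiny per-characteristic perturbation

for K in [10, 100, 1_000, 10_000]:
    theta = rng.standard_normal(K)            # made-up slope parameters
    # Worst-case shift in theta_0 + theta_1*x_1 + ... + theta_K*x_K from any perturbation
    # with max |a_k| <= eps: set each a_k = +/- eps with the signs chosen adversarially.
    max_shift = eps * np.sum(np.abs(theta))
    print(f"K = {K:>6}: worst-case shift in the logistic score = {max_shift:7.2f}")
```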
In some sense, you can think about this as the dark side of “betting on sparsity”. Suppose there are almost as many predictors as observations. If only a few of these predictors matter, it’s straightforward to estimate this handful of coefficients. By contrast, if all predictors matter a little bit, it’s damn near impossible to do so. But this implies that you can hide important information in the data by spreading it diffusely across a large number of predictors. There’s a generalized uncertainty principle at work à la Donoho and Stark (1989).