Alex – Research Notebook

Gordon Prime

June 23, 2026 by Alex

At the moment, all of academic finance revolves around present-value logic. Every model starts this way. No one feels compelled to justify the choice. Researchers view the positive-NPV rule as the gold standard of financial decision-making. What would it be like to live on a world where everyone else felt the same way?

Life on Gordon Prime

On Gordon Prime, the whole civilization is organized around discounting and present-value logic. The positive-NPV rule is not an abstract idea that CEOs learn about in business school. It is what everyone does every day, as a matter of course. M&A press releases lead with the NPV surplus the deal creates. Gordian CEOs can quote every major project’s risk-adjusted discount rate and explain how they arrived at the number. NPV calculations are front and center in conference calls and Investor Day presentations.

The citizens of Gordon Prime are not clairvoyant. They cannot sum infinite series in their heads. They need computers to estimate multi-factor models. To cope, Gordian CEOs have developed many helpful shortcuts for picking reasonable discount rates. Like the rule of 72, but for choosing the right $\mathrm{r}$ . Moreover, on Gordon Prime, CEOs routinely make long-term cash-flow forecasts. A CEO who discounts at $\mathrm{r} \approx 5\%$ can predict cash flows $\big(\frac{1}{5\%}\big) = 20$ years in the future. For projects with lower discount rates, $\mathrm{r} \approx 2\%$ , it is not uncommon to see horizons of $\big( \frac{1}{2\%}\big) = 50$ years or more.

Corporations on Gordon Prime announce quarterly cash flows. Income statements exist, but CEOs and shareholders treat them as a mere accounting convention. Earnings get reconciled to cash flows, never the other way around. Because everyone instinctively knows to set “price equal to expected discounted payoff”, Gordian children learn the two Modigliani-Miller irrelevance theorems in kindergarten. Many find the results obvious. Investors on Gordon Prime expect a firm’s leverage and dividend policy to be dictated by frictions. Absent such complications, nobody would think to ask a Gordian CEO about either.

Asset pricing on Gordon Prime is a solved science. It has to be. Everyone needs this input to perform the right NPV calculation. John Cochrane has a counterpart on this distant planet. There, he gave an AFA presidential address titled “Discount Rates: We Know How To Calculate Them.” A Gordian CEO expects her decisions to move her firm’s multiple right away. To choose a corporate policy she has to understand how the market will price the resulting change in her firm’s future payout stream.

Research on Gordon Prime

Now put yourself in the shoes of a Gordian corporate-finance researcher. What would your day look like? You certainly would never survey CEOs about whether they use the positive-NPV rule. Of course they do. CEOs on Gordon Prime won’t shut up about it. They talk about the positive-NPV rule incessantly. On Gordon Prime, the question would be as strange as asking whether CEOs perform arithmetic. Instead, you might run surveys asking CEOs how they pick the right discount rate for specific kinds of unusual projects.

Discount rates are a property of a project’s future cash flows, not a characteristic of the firm. So, on Gordon Prime, it does not matter which company is evaluating a project. A CEO who greenlights the investment at one company would make an identical call while running another. Thus, much of corporate-finance research on Gordon Prime would be done at the project level.

Gordian CEOs make decisions by discounting cash flows that are expected to arrive decades in the future. Current interest rates and market conditions play a minor role. The Gordon Prime version of IBES contains 20- and 30-year cash-flow forecasts for most firms because this is the horizon that matters when thinking in present-value terms. The positive-NPV rule compares the present value of a project’s cash flows against the upfront cost, so the funding source is treated as a minor detail. What matters is a project’s total cost, not where the company gets the money from.

Prestige on Gordon Prime would go to researchers who study how corporate investment moves over time in subtle ways. Gordian researchers treat capital structure and dividend policy as second-tier topics, which only matter due to frictions. When these frictions bite, the dose-response curve is the same for every firm. The documented effects are smooth and continuous. Constrained CEOs invest a little less than otherwise-similar unconstrained ones.

A strange little model

One slow afternoon, you (dear Gordian researcher) write down a strange little model where the CEO does not discount anything. Instead, she makes decisions aimed at increasing her firm’s short-term EPS forecast—i.e., expected earnings over the next twelve months divided by current shares outstanding.

What would follow?

Quite a lot! The EPS-maximizing CEO in your model does not use the positive-NPV rule. Instead of converting the project’s future cash-flow stream into an upfront valuation, she translates the upfront cost into an expense flow. Your EPS maximizer only invests in projects that generate enough income next year to cover their own added financing expense using the firm’s cheapest available source of capital. In other words, the CEO funds projects with income yields, $\mathrm{IY} = \frac{\mathbb{E}[\Delta \mathrm{NOI}_1]}{\mathrm{Cost}}$ , that exceed her firm’s cheapest financing yield, $\mathrm{FY} = \min\{\mathrm{EY}, \mathrm{i}, \mathrm{rf}\}$ . This is the accretive investment rule.

One thing that immediately jumps out at you is the nature of this hurdle rate. It is a property of the firm, not the project. Equity would be cheapest for a firm whose earnings yield is below the riskfree rate, $\mathrm{EY} < \mathrm{rf}$ . Such a firm would fund investments by issuing stock even while sitting on cash. By contrast, a firm with $\mathrm{EY} > \mathrm{rf}$ would lever up, making cash by far the cheapest funding source if/when it ever appears. The line between the two types would fall at $\mathrm{EY} = \mathrm{rf}$ , and it would move around as the riskfree rate moved. The EPS-maximizing CEO in your model takes her firm’s current pricing as given, so her decisions track with market conditions rather than her beliefs about cash flows decades in the future.

It is a strange little model. No discounting. No present value. But sharp predictions, so you keep going.

Life on planet Accreton

To better understand your max EPS theory, you run a thought experiment. You imagine a crazy planet where accretive decision-making is the law of the land. You call it, “Accreton”. It must be a wild place, indeed. What would life on this hypothetical planet look like?

On Accreton, press releases would lead with EPS accretion. The deal’s NPV surplus would often go unmentioned, or even (**GASP**) uncalculated. EPS-maximizing CEOs would use rules of thumb, such as IRRs and payback periods, that avoid choosing a project-specific discount rate altogether. Accretonese companies would obsess over short-term earnings, not cash flows. Since the star of the show is next year’s EPS, no one on planet Accreton would bother making 20-year cash-flow forecasts. Maybe companies would talk about cash flows a bit… but only to reconcile why they differ from net income.

Unlike on Gordon Prime, leverage and payout policy would be first-order concerns on planet Accreton. CEOs would labor over these choices. Analysts would ask hard questions to make sure the values were correct. An Accretonese CEO would not need to know the correct asset-pricing model. It would not matter to her where the company’s PE ratio and marginal interest rate came from. All she would need to know is what the current values are. On planet Accreton, a CEO could take these numbers as given and evaluate corporate policies from there.

On planet Accreton, firms on either side of $\mathrm{EY} = \mathrm{rf}$ would pursue two different constellations of corporate policies. A growth stock ( $\mathrm{EY} < \mathrm{rf}$ ) would see equity as cheap compared to riskfree debt. A value stock ( $\mathrm{EY} > \mathrm{rf}$ ) would see equity as more expensive. The two kinds of firms could disagree about whether to fund the same project. If the Federal Reserve on planet Accreton cut rates, then the value stock might flip its verdict. Accretonese growth stocks fund investments by issuing equity even when cash is available. By contrast, value stocks would find it accretive to lever up until cash becomes their cheapest available source.

Accretonese research

This thought experiment has been fun so far. You decide to keep pushing. What would life be like for a corporate-finance researcher on planet Accreton? Golly gee willikers, you think to yourself as a proud Gordian citizen, the literature on that other planet sure must be different.

For one thing, you figure that no researcher on planet Accreton would bother running surveys asking whether CEOs use the positive-NPV rule. What would be the point? Every corporate statement on planet Accreton leads with EPS growth. The relevant metric there is obviously not NPV.

You imagine that Accretonese researchers would focus on yield spreads, not discount rates. When a CEO on that planet decides whether to fund a project, what matters is the income-vs-financing yield spread, $\mathrm{IY} {-} \mathrm{FY}$ . The discovery of the growth-versus-value divide at $\mathrm{EY} = \mathrm{rf}$ would be one of the foundational discoveries in the literature. On planet Accreton, capital structure and payout would not be sleepy backwaters. They would have a seat at the big boys’ table, right next to real investment. All three would be seen as ways for a CEO to generate value for her shareholders by increasing the firm’s EPS next year.

A firm’s PE ratio is just another way of writing its earnings yield, $\mathrm{PE} = \big( \frac{1}{\mathrm{EY}} \big)$ . You figure that Accretonese researchers would view IRRs and payback periods in a similar light. An IRR is a multi-period generalization of a project’s income yield, $\mathrm{IY}$ . A payback period is the same quantity expressed as a multiple, $\big( \frac{1}{\mathrm{IY}}\big)$ . If there’s a database like IBES on planet Accreton, then you have to imagine that it only contains 1- and 2-year EPS forecasts. Why would anyone there bother to forecast cash flows two decades into the future?

Stranger than fiction

Here is the crazy thing. That planet you dreamed up… (Q: You mean the one where accretive decision-making is the law of the land and most people never discount a cent?) Yes, that one. That planet is a pixel-perfect description of Earth. Every line of it. You did not invent Accreton. You described home.

Yet, corporate-finance researchers act as though they are living on Gordon Prime. They spend their days puzzled by the data that keeps arriving. Discount rates that barely move when the cost of capital does. Investment that lurches with cash flow it should not care about. CEOs quote payback periods even though textbooks call the method “stupid”. Researchers cannot tell CEOs which cost of capital to use because they cannot agree themselves. The literature contains a zoo of different factor models. None of it should be puzzling. It is Accretonese data read by researchers who insist they are somewhere else.

A sensible Gordian researcher would never run a survey asking whether CEOs use the positive-NPV rule or an IRR hurdle. On Gordon Prime the answer is plain, they discount, and everyone can see it. A sensible Accretonese researcher would never run that survey either. Here the answer is just as plain, they go by accretion, and every press release says so. The question only occurs to someone who cannot tell which planet she is standing on. It is the question of a lost interstellar traveler.

What regressions show

You can’t convince researchers that they live on Accreton by pitting the accretive rule against the positive-NPV rule in a horse race to see which one fits the data better. That contest is silly. The first Modigliani-Miller paper was published in 1958. Researchers have spent nearly 70 years adding ingredients to the same present-value framework: financing constraints, agency costs, behavioral biases, adjustment costs, etc. Collectively, all that machinery can be fine-tuned to fit almost any pattern. If we live on Accreton, then any explanatory power associated with the positive-NPV rule must come from either overfitting or these after-market add-ons. A comparison of predictive accuracy cannot separate those two stories.

That is the value of this thought experiment. It shows where the diagnostic evidence lives. It is not in the regression $R^2$ . It is in what CEOs say when they announce an M&A deal, in which numbers lead the press release, in how rarely CEOs mention discount rates. Only 1% of conference calls quote a discount rate. The fact that we have to ask CEOs whether they use the positive-NPV rule doesn’t prove we live on planet Accreton. But it sure is hard to square with the claim that Earth is Gordon Prime.

Why max EPS Persists: Ba (AER 2026) Interpretation

March 29, 2026 by Alex

The question

Researchers currently take it for granted that CEOs should maximize PV[Shareholder Payouts]. Suppose we agree. Under this premise, max EPS is the wrong objective. It’s a misspecified model for how to run a firm. In work with Itzhak Ben-David, I show that EPS maximization solves the 3 core problems in corporate finance: capital structure, real investment, and payout policy.

How does max EPS survive? CEOs have access to textbooks, MBA programs, consultants, and analysts—all of which teach present-value logic. The data generated by decades of corporate decisions are available for everyone to examine. If max EPS is misspecified, why hasn’t it been abandoned?

Ba (2026) provides a formal theory of exactly this phenomenon: when and why misspecified models persist, even when decision-makers are open to switching and have access to infinite data. This note spells out the connection to EPS maximization.

Ba’s framework

An agent uses a subjective model $\theta$ to guide decisions. Each period $t$ , she chooses an action $a_t$ from a finite set $\mathcal{A}$ and observes an outcome $y_t$ drawn from the true (unknown) data-generating process $Q^*(\cdot | a_t)$ . Her model $\theta$ is a parametric family of predicted DGPs, $\{Q^\theta(\cdot | a, \omega)\}_{a \in \mathcal{A}, \omega \in \Omega^\theta}$ , where $\omega$ indexes the parameter space $\Omega^\theta$ . The model is correctly specified if some $\omega$ recovers $Q^*$ ; it is misspecified otherwise.

The agent holds a prior $\pi_0^\theta$ over $\Omega^\theta$ and updates beliefs via Bayes’ rule within the model. Crucially, she also considers a competing model $\theta'$ with its own parameter space $\Omega^{\theta'}$ and prior $\pi_0^{\theta'}$ . She compares models using the Bayes factor

(1) $\begin{equation*} \lambda_t = \frac{\ell_t(\theta')}{\ell_t(\theta)} \end{equation*}$

where $\ell_t(\theta) = \sum_{\omega \in \Omega^\theta} \pi_0^\theta(\omega) \ell_t(\theta, \omega)$ is the marginal likelihood of the data under model $\theta$ , and $\ell_t(\theta, \omega) = \prod_{\tau=0}^{t} q^\theta(y_\tau | a_\tau, \omega)$ is the likelihood conditional on parameter $\omega$ .

As new data rolls in, the agent updates her Bayes factor recursively

(2) $\begin{equation*} \lambda_t = \lambda_{t-1} \cdot \frac{\sum_{\omega' \in \Omega^{\theta'}} \pi_t^{\theta'}(\omega') \, q^{\theta'}(y_t | a_t, \omega')}{\sum_{\omega \in \Omega^\theta} \pi_t^\theta(\omega) \, q^\theta(y_t | a_t, \omega)} \end{equation*}$

If $\lambda_t > \alpha$ where $\alpha \geq 1$ is a switching threshold, the agent switches to $\theta'$ . If $\lambda_t < 1/\alpha$ , she switches back. The threshold $\alpha$ controls switching stickiness. A larger $\alpha$ requires stronger evidence to switch.

Ba (2026) notation	EPS vs. PV application
Initial model $\theta$	Max EPS
Competing model $\theta'$	Max PV[Shareholder Payouts]
Action set $\mathcal{A}$	Corporate decisions: leverage choice, project selection, payout policy
Outcome $y_t$	Observable corporate outcomes: EPS level, EPS growth, stock-price reaction, analyst response
True DGP $Q^*(\cdot \mid a)$	The actual mapping from corporate decisions to outcomes (determined by the full economic environment)
Parameter space $\Omega^\theta$	Parameters of the EPS model (earnings yield, interest rates)
Parameter space $\Omega^{\theta'}$	Parameters of the PV model (discount rates, growth rates, terminal values, risk premia, payout schedules)
Switching threshold $\alpha$	Institutional friction: retraining costs, compensation redesign, regulatory reporting norms, board inertia
Bayes factor $\lambda_t$	Cumulative evidence that PV logic fits the data better than EPS logic

Result 1: Endogenous data lets misspecified models survive forever

The theorem

Theorem 1 (Ba 2026, p. 16). Suppose $\alpha > 1$ . The following are equivalent:

Model $\theta$ is globally robust for at least one full-support prior.

Model $\theta$ is locally robust for at least one full-support prior.

There exists a p-absorbing self-confirming equilibrium (SCE) under model $\theta$ .

A self-confirming equilibrium under $\theta$ is a strategy $\sigma$ supported by a belief $\pi^\theta$ such that (i) $\sigma$ is myopically optimal given $\pi^\theta$ , and (ii) the model’s prediction matches the true DGP on the equilibrium path

(3) $\begin{equation*} q^\theta(\cdot \mid a, \omega) \equiv q^*(\cdot \mid a) \qquad \forall \, a \in \text{supp}(\sigma), \; \forall \, \omega \in \text{supp}(\pi^\theta) \end{equation*}$

The strategy is p-absorbing if a dogmatic $\theta$ -modeler eventually plays only actions in $\text{supp}(\sigma)$ .

The key insight: a misspecified model need not be globally correct. It only needs to be correct on the equilibrium path—for the actions it actually induces. Off-path misspecification is never revealed because the agent’s own actions determine which data are generated. This is why endogenous data is essential: Ba (2026, p. 16, fn. 19) notes that “in an exogenous-data environment, Theorem 1 implies that the sufficient and necessary condition for both local robustness and global robustness is that the model is correctly specified.”

P-absorbingness adds a dynamic requirement on top of the static SCE condition. It is not enough for an SCE to exist; the agent’s belief dynamics must actually converge to it. Ba’s Section 5.2 (Proposition 2, p. 24) shows that this convergence property depends on the direction of belief reinforcement. When beliefs and actions are complements (so that the bias feeds on itself), the dynamics are positively reinforcing and convergence to the SCE is guaranteed. Hence, the SCE is p-absorbing. When beliefs and actions are substitutes, the bias is self-correcting. Dynamics may oscillate and fail to converge. A SCE exists but is not p-absorbing, and the misspecified model is not robust.

Application to EPS

When a CEO maximizes EPS, her decisions shape the observable corporate outcomes. The EPS model’s predictions are tested only against data generated by EPS-driven actions. Predictions about actions a CEO never takes are never tested.

Leverage. An EPS maximizer borrows when $\mathrm{EY} > \mathrm{i}$ (earnings yield exceeds the interest rate) and uses the proceeds to retire shares. This raises EPS mechanically. The outcome the CEO observes is: EPS went up, the stock price did not collapse, analysts applauded the “accretive” transaction. The EPS model’s prediction that the transaction would be good because it is accretive is confirmed by the data the decision itself generated. The CEO does not observe the counterfactual: what would’ve happened under the PV-optimal leverage choice given frictions.

Investment. An EPS maximizer uses $\mathrm{HR} = \min\{\mathrm{EY}, \, \mathrm{i}, \, \mathrm{rf}\}$ as the hurdle rate, not the WACC. She rejects projects with positive NPV but negative first-year EPS impact (dilutive projects) and accepts projects with negative NPV but positive first-year EPS impact (accretive projects). The observed outcome: EPS did not fall, the project looks like it “worked.” The NPV of rejected projects is never observed.

Payout. An EPS maximizer buys back stock whenever buybacks offer a higher yield than investing cash ( $\mathrm{EY} > \mathrm{CY}$ ) rather than evaluating the NPV of the buyback. The observed outcome: EPS went up, the market reacted positively to the announcement. The PV counterfactual (could the cash have been better deployed elsewhere?) is off-path.

Let $\sigma^{EPS}$ be the strategy induced by max EPS, and let $\hat{\omega}$ be a parameter value in the EPS model under which the predicted outcome distribution matches $Q^*(\cdot | a)$ for all $a \in \text{supp}(\sigma^{\mathrm{EPS}})$ . Then $\sigma^{\mathrm{EPS}}$ is an SCE under the EPS model. This is plausible because the EPS model does not mispredict the direction of EPS changes from leverage, buybacks, or accretive acquisitions. It correctly predicts that borrowing at $\mathrm{i} < \mathrm{EY}$ raises EPS, that using cash for buybacks at $\mathrm{EY} > \mathrm{CY}$ raise EPS, and so on. What it gets wrong is the welfare interpretation: whether these EPS changes correspond to value creation. But welfare is not directly observed in $y_t$ ; what is observed are EPS changes, stock-price reactions, and analyst ratings, all of which are consistent with the EPS model’s on-path predictions.

Moreover, the EPS model’s feedback dynamics are positively reinforcing in the sense of Ba’s Proposition 2: EPS-driven decisions raise EPS, which validates the model, which strengthens conviction, which leads to more EPS-driven decisions. This positive feedback ensures that the SCE is p-absorbing. Contrast this with the dynamics facing a CEO who switches to PV logic: she accepts a dilutive acquisition, EPS falls in the short run, analysts downgrade, the stock price drops, and the PV model appears to have failed, creating pressure to revert. The transition to the correct model generates short-run data that seem to disconfirm it. By Theorem 1, the existence of a p-absorbing SCE under the EPS model is sufficient for it to be globally robust. Hence, EPS maximization can persist against any competitor, including PV logic, with infinite data.

Result 2: Concise models can be more robust than correct ones

The theorem

Theorem 2 (Ba 2026, p. 19). Suppose $\alpha > 1$ and model $\theta$ has no traps. Then:

Model $\theta$ is globally robust at prior $\pi_0^\theta$ if and only if $\pi_0^\theta(C^\theta) \geq 1/\alpha$ .

Model $\theta$ is locally robust at all full-support priors if and only if $C^\theta \neq \emptyset$ .

$C^\theta$ is the set of consistent parameters: those $\omega$ for which the pure belief $\delta_\omega$ supports a p-absorbing SCE. The model’s prediction under $\omega$ matches the true DGP at every action in the equilibrium strategy’s support.

The condition $\pi_0^\theta(C^\theta) \geq 1/\alpha$ links three forces: the model’s structure (which determines $C^\theta$ ), the agent’s prior (which determines how much mass falls on $C^\theta$ ), and the switching threshold (which sets the bar). Prior tightness and switching stickiness are substitutes: a higher $\alpha$ lowers the bar for prior concentration, and a tighter prior lowers the bar for stickiness. Any asymptotically accurate model can be globally robust at a given prior, provided switching is sufficiently sticky.

The critical implication: correctly specified models are not necessarily more robust than misspecified ones. A misspecified model with a smaller parameter space $|\Omega^\theta|$ can satisfy the tightness condition more easily. Under an ignorance prior (uniform over $\Omega^\theta$ ), each parameter receives weight $1/|\Omega^\theta|$ . For a model where every parameter is consistent ( $C^\theta = \Omega^\theta$ ), the tightness condition is automatically satisfied at any $\alpha > 1$ , regardless of the prior—the model is unconditionally globally robust. But a correctly specified model with a large parameter space needs a correspondingly tight prior to be globally robust, and under a uniform prior it may fail. Ba (2026, p. 4): “simple misspecified models equipped with entrenched priors can be more robust than complex correctly specified models.”

In the media-bias application (Section 5.1, Proposition 1), Ba makes this concrete: a two-state misspecified model $\hat{\theta}$ is globally robust at all priors and all $\alpha \geq 1$ , while the correctly specified three-state model $\theta$ is globally robust only if $\pi_0^\theta(\omega^M) \geq 1/\alpha$ . The misspecified model permanently replaces the correct one with arbitrarily high probability as the prior on the extreme states increases.

Application to EPS

The EPS model has a small parameter space. For any given decision (borrow or not, invest or not, buy back or not), it requires the CEO to know essentially two things: the earnings yield $\mathrm{EY} = \frac{\mathbb{E}[\mathrm{EPS}]}{\mathrm{Price}}$ and the relevant financing cost (interest rate $\mathrm{i}$ or risk-free rate $\mathrm{rf}$ ). The decision rule is a direct comparison: act if and only if $\mathrm{EY} > \mathrm{HR}$ , where $\mathrm{HR} = \min\{\mathrm{EY}, \, \mathrm{i}, \, \mathrm{rf}\}$ .

The PV model requires knowledge of a much larger parameter space $\Omega^{\theta'}$ : the risk-free rate, market risk premium, firm beta (or multi-factor betas), the project-specific risk adjustment, the terminal growth rate, the expected path of future cash flows, and the probability distribution over states of the world.

Under Theorem 2, the prior tightness condition for global robustness is $\pi_0^\theta(C^\theta) \geq 1/\alpha$ . For the EPS model, if $C^\theta$ encompasses most or all of $\Omega^\theta$ (because the model is consistent for the small set of parameters it uses), then $\pi_0^\theta(C^\theta)$ is close to 1 and the condition is satisfied for any $\alpha > 1$ . The EPS model may be unconditionally globally robust.

For the PV model, even though it is correctly specified ( $C^{\theta'} \neq \emptyset$ ), the prior mass is spread across a large parameter space. Under a diffuse prior, $\pi_0^{\theta'}(C^{\theta'})$ may be small. The PV model is locally robust at all priors (by Theorem 2(ii)), but it is globally robust only if $\pi_0^{\theta'}(C^{\theta'}) \geq 1/\alpha$ . With large $|\Omega^{\theta'}|$ and diffuse prior, this can fail.

Moreover, switching stickiness in corporate settings is very large. Compensation contracts are tied to EPS targets. Analyst coverage is organized around EPS estimates, consensus forecasts, and PE multiples. Regulatory reporting (GAAP earnings) makes EPS the most salient and auditable metric, while PV calculations involve subjective inputs (discount rates, growth assumptions) that are harder to audit and verify. Board education is required to shift from a direct comparison (“is this accretive?”) to a multi-parameter model (“what is the NPV at the appropriate risk-adjusted discount rate?”). All of this amounts to a very high $\alpha$ , which further lowers the bar for the prior tightness that the EPS model must satisfy.

Result 3: Even slight switching friction is enough

The theorem

Theorem 3 (Ba 2026, p. 21). Suppose model $\theta$ has no traps and $\alpha = 1$ . Then model $\theta$ is locally or globally robust at any full-support prior $\pi_0^\theta$ if and only if $C^\theta = \Omega^\theta$ .

When switching is non-sticky ( $\alpha = 1$ ), local and global robustness coincide, robustness at some prior is equivalent to robustness at all priors, and both hold only when every parameter in the model is consistent. This is an extreme demand: the model must be correct for every DGP it entertains, not just on the equilibrium path. Only a model with full prior tightness ( $C^\theta = \Omega^\theta$ ) can survive frictionless comparison.

The set of robust models shrinks discontinuously at $\alpha = 1$ . For any $\alpha > 1$ , models with $C^\theta \neq \Omega^\theta$ can be robust (provided the prior tightness condition is met). At $\alpha = 1$ , they cannot. Ba (2026, p. 21): “the set of locally robust models and supporting priors shrinks discontinuously at $\alpha = 1$ , which highlights how stickiness helps more misspecified models persist.”

The mechanism: at $\alpha = 1$ , there always exists a nearby competing model that fits the data slightly better than the initial model on some dimension. Because there is no switching friction, this marginal improvement is sufficient to trigger a switch. The proof constructs such a competing model by preserving most DGPs in $\theta$ while slightly improving the accuracy of one DGP associated with a parameter in $\Omega^\theta \setminus C^\theta$ .

Application to EPS

Theorem 3 clarifies that the persistence of max EPS depends on switching friction being positive—but the required friction can be arbitrarily small. For any $\alpha > 1$ (even $\alpha = 1.01$ ), the EPS model can be globally robust provided the prior tightness condition is met. The discontinuity at $\alpha = 1$ means that even minimal institutional friction (a small cost of retraining, a slight reluctance to abandon a familiar framework) is qualitatively different from zero friction.

This matters because it addresses a potential objection: “surely CEOs could switch to PV logic if they wanted to; there’s no real barrier.” Ba’s result says that even a negligible barrier is enough, as long as it is positive. The EPS model does not need an enormous moat to survive. It needs (i) a p-absorbing SCE (Result 1), (ii) sufficient prior concentration on consistent parameters (Result 2), and (iii) any positive switching friction at all (Result 3). The first two conditions are structural properties of the EPS model. The third is almost trivially satisfied in any real institution.

Conversely, Theorem 3 identifies the knife-edge case where EPS would be displaced: a world with literally zero switching costs ( $\alpha = 1$ ) and a PV model that is a slight local improvement. In practice, this would correspond to an environment where CEOs face no career risk from short-term EPS misses, no analyst pressure around quarterly earnings, and no cognitive cost of estimating multi-parameter discount rates. These conditions do not describe any real-world setting.

Summary

Ba (2026) provides a formal framework for understanding when misspecified models persist despite competition from correctly specified alternatives. Applied to the EPS-vs.-PV question, the theory identifies three reinforcing mechanisms, one per main result.

Result	Mechanism	Application to EPS
Theorem 1	Misspecified model is robust iff it admits p-absorbing SCE; endogenous data insulates on-path predictions from off-path errors	Accretive actions generate data that confirm max EPS model; positive feedback dynamics ensure convergence to SCE
Theorem 2	Global robustness requires $\pi_0^\theta(C^\theta) \!\geq\! 1/\alpha$ ; concise models concentrate priors; stickiness and prior tightness substitutes	EPS model’s small parameter space makes tightness condition easy to satisfy; minor frictions and reporting norms matter
Theorem 3	Set of robust models shrinks discontinuously at $\alpha = 1$ ; any positive friction qualitatively expands what can persist	Even minimal switching costs suffice for EPS to survive; knife-edge $\alpha = 1$ case (zero friction) does not describe real world

The punchline: even if we take as given that maximizing PV[Shareholder Payouts] is the correct objective, the Ba (2026) framework gives formal reasons—grounded in Bayesian learning theory—for why maximizing EPS can persist indefinitely. The misspecified model is not merely sticky due to inertia or ignorance. It is robust in a precise sense: it admits a self-confirming equilibrium, its directness concentrates prior beliefs, the data it generates through the CEO’s own actions provide continuous apparent validation, and even minimal institutional friction is enough to protect it. These forces can be strong enough that the correct model is permanently abandoned.

Caveat: Is max PV[Shareholder Payouts] actually the correct model?

Everything above assumes that maximizing PV[Shareholder Payouts] is the correctly specified model. It’s the true DGP against which max EPS is judged misspecified. Ba’s framework requires us to designate one model as correct and ask whether the other persists. We chose PV as the correct model because that is what finance theory prescribes. But this assumption deserves scrutiny.

Shareholders do not get to spend corporate earnings. Earnings are an accounting construct; they accrue to the firm, not to the shareholder’s bank account. A dollar of EPS that is retained and reinvested never reaches the shareholder at all. In that sense, maximizing EPS is maximizing a fiction—a number that does not correspond to any cash flow the shareholder actually receives.

But PV[Shareholder Payouts] has a parallel problem. Shareholders do not get to spend the present discounted value of a dollar they expect to receive in 20 years. You cannot eat risk-adjusted returns. The “present value” of a distant payout is a mathematical object, not cash in hand. And yet these distant, heavily discounted payouts are the primary drivers of valuation in the PV framework.

The Gordon growth model makes this concrete. Under the standard formulation, an asset’s price equals expected cash flow next year times a forward-looking multiple

(4) $\begin{equation*} \mathrm{Price} = \mathbb{E}[\mathrm{CF}] \times \bigg( \frac{1}{\mathrm{r} - \mathrm{g}} \bigg) \end{equation*}$

For typical parameter values ( $\mathrm{r} \approx 10\%$ , $\mathrm{g} \approx 5\%$ ), the multiple is roughly $\big(\frac{1}{10\%-5\%}\big) = 20\times$ . But $\big(\frac{1}{\mathrm{r}-\mathrm{g}}\big)$ is also the Macaulay duration of the cash flow stream in years. So the “typical” dollar of present value corresponds to a payout roughly two decades in the future. The PV framework asks the CEO to make decisions today based on the risk-adjusted value of money that shareholders will not receive for 20 years. This is money that never appears on a financial statement, whose value depends on estimates of discount rates and growth rates that are themselves deeply uncertain.

This does not mean PV logic is wrong. It means that both models involve abstractions, and the question of which abstraction is “correct” is less obvious than textbook finance suggests. EPS is a fiction because earnings are not payouts. PV is a fiction because present values are not cash. The Ba (2026) framework shows that even if we grant the PV model the status of correct specification, the EPS model can persist indefinitely. But if we take seriously the possibility that neither model is unambiguously correctly specified, then the persistence of max EPS becomes even less surprising. In Ba’s terms, we may not be in a world where a correctly specified competitor exists at all, in which case the question is not whether EPS will be abandoned but which misspecified model proves more robust. Regardless, history has shown that EPS wins.

Behavioral finance and corporate finance are both organized in the exact same way

February 4, 2023 by Alex

Behavioral finance and corporate finance are both organized in the exact same way. Neither is based on a grand unified theory. Instead, both fields proceed by looking for deviations from a benchmark model. The behavioral-finance literature is a list of ways to violate market efficiency. The corporate-finance literature is a collection of ways to relax the assumptions needed for capital-structure irrelevance. Same setup.

One reason for writing this post is to spread the word about this symmetry. I don’t think it’s widely appreciated. Occasionally I’ll mention it to somebody. When I do, I usually get either a blank stare or a look of sudden recognition. I’d like to live in a world where the comment gets a bland nod in agreement.

I also think it’s noteworthy how differently each field is viewed within the profession given that both fields are calling plays from the same playbook. True, behavioral finance has not proposed an alternative to market efficiency. But, then again, ain’t nobody asking corporate researchers to come up with an alternative to ModiglianiMiller58. Highlighting this disconnect is the other reason for writing this post.

Behavioral finance

Behavioral economists explain market outcomes by pointing to deviations from market efficiency—i.e., the idea that “security prices fully reflect all available information”. In John Cochrane’s words: “Informational efficiency is a natural consequence of competition, relatively free entry, and low costs of information in financial markets. If there is a signal, not now incorporated in market prices, that future values will be high, competitive traders will buy. In doing so, they bid the price up until it fully reflects the available information.”

If there’s a signal that an asset’s future payout will be high, then the present discounted value of that asset’s payout will go up—i.e., $\Exp[ \, \textit{Discount Rate} \times \textit{Future Payout} \, ]$ will increase. If the asset’s current price doesn’t increase accordingly, any trader who sees the signal could profit by buying a share, $\Delta = +1$ :

$\begin{equation*} \Big( \, \underbrace{\Exp[ \, \textit{Discount Rate} \times \textit{Future Payout} \, ]}_{\substack{\text{Present\phantom{j}discounted\phantom{j}value\phantom{j}of} \\ \text{asset w/ same future payout}}} - \textit{Current Price} \, \Big) \times \Delta \end{equation*}$

In the process, the trader will push up the current price until there’s no longer any benefit to continuing the trade. And we’d see the opposite pattern with $\Delta = -1$ in a world where traders saw a negative signal.

Given this benchmark, behavioral economists look for situations where there appears to be a persistent uncorrected pricing error. e.g., under the benchmark of market efficiency, it should not be possible to find situations where $\Exp[ \textit{Discount Rate} \times \textit{Future Payout} \, ] > \textit{Current Price}$ without traders taking action, $\Delta = 0$ . However, JegadeeshTitman93 document that the 30% of stocks with the highest past returns (past winners) tend to have higher future returns than the 30% of stocks with the lowest past returns (past losers). In a world where markets were efficient, traders would immediately bid up the prices of past winners until this pricing error disappeared. So it seems like real-world traders must be making some sort of behavioral error.

Corporate finance

ModiglianiMiller58 taught us that, if all the following assumptions hold, then a firm’s choice of leverage has no affect on its market valuation. (A1) Investors and firms can trade the same set of correctly priced securities. (A2) Investors and firms are taxed in the same way. (A3) Investors and firms face no transaction costs or portfolio restrictions. (A4) There are no bankruptcy costs or costs to issuing new securities. (A5) A firm’s choice of leverage doesn’t directly affect its future cash flows. And finally, (A6) firm leverage doesn’t signal additional information to investors about these cash flows. Firms clearly spend a lot of time worrying about their capital structure. And corporate researchers explain their decisions by pointing out ways that the above assumptions are violated in the real world. That’s the organizing principle behind this literature.

The streamlined proof given in ModiglianiMiller69 is based on a homemade-leverage argument. Suppose there are two firms with different capital structures but identical cash flows. The first firm is unlevered while the second firm as issued debt. In a world where all the above assumptions hold, an investor could effectively lever up the unlevered firm’s cash flows himself by constructing a portfolio that’s long the unlevered firm and short the debt issued by the levered firm:

$\begin{equation*} - \Big( \, \underbrace{[\textit{Unlevered Firm's Value} - \textit{Value of Debt Issuance}]}_{\substack{\text{Cost of building a portfolio that buys unlevered firm and} \\ \text{shorts the debt issued by otherwise identical levered firm}}} - \textit{Equity Value of Levered Firm} \, \Big) \times \Delta \end{equation*}$

If there’s any gap between the cost of this portfolio and the equity value of the levered firm, an investor could earn arbitrage profits given the assumptions above. Since both have identical future cash flows, the investor should continue buying the one with the lower current price until there’s no more price difference.

“The entire development of corporate finance since 1958—the publication date of the first MM article—can be seen and described essentially as the sequential (or simultaneous) relaxation of the assumptions listed before.” e.g., if corporate debt is taxed differently than the short positions of individual investors, then the homemade-leverage argument breaks down. Once assumption A2 has been violated, capital structure is no longer irrelevant. In a world where corporations get preferred tax treatment, firms should optimally choose to issue debt since it’d be more expensive for individual investors to homebrew this leverage themselves.

Nobody would say…

Given how I’ve described behavioral finance and corporate finance above, it’s obvious that the two fields are organized in the exact same way. Researchers in each field try to make sense of empirical regularities by pointing to specific deviations from their own respective benchmark. In fact, Mark Rubinstein argues that ModiglianiMiller58’s “real and enduring contribution was to point others in the direction of arbitrage reasoning.” And this sort of reasoning lies at the heart of the Efficient Market Hypothesis in Fama70.

That being said, the behavioral-finance and corporate-finance literatures clearly emphasize different things about their respective benchmark models. ModiglianiMiller58 weren’t trying to argue that the assumptions needed for capital-structure irrelevancy were realistic. As Merton Miller later wrote: “We first had to convince people (including ourselves!) that there could be any conditions, even in a ‘frictionless’ world, where a firm would be indifferent between issuing different kinds of securities.”

By contrast, market efficiency is treated as a good first approximation to the real world. While Franco Modigliani and Merton Miller didn’t think that capital structure was actually irrelevant in the real world, Eugene Fama actively defended the Efficient Market Hypothesis. e.g., Fama98 writes: “There is a developing literature… arguing that stock prices adjust slowly to information… It is time, however, to ask whether this literature, viewed as a whole, suggests that efficiency should be discarded. My answer is a solid no.”

This is fine. But the parallel structures of behavioral finance and corporate finance clearly put the lie to a common criticism of the behavioral literature. It’s often argued that, to be a honest-to-goodness scientific discipline, the behavioral-finance literature needs to offer a single coherent alternative model to challenge the Efficient Market Hypothesis. e.g., later in Fama98, it’s claimed that behavioral economists “rarely test a specific alternative to market efficiency… This is unacceptable… Following the standard scientific rule, however, market efficiency can only be replaced by a better specific model of price formation.”

That is nonsense. It’s a criticism that applies equally well to the corporate-finance literature, which has not produced a better specific model than the one in ModiglianiMiller58. However, no one would claim that Jean Tirole’s textbook is unscientific. Overturning ModiglianiMiller58 isn’t the point of corporate-finance research. Overturning the Efficient Market Hypothesis isn’t the point of behavioral-finance research. In both cases, the point is to provide good explanations for how the real-world works. If progress is fastest when researchers organize their thinking relative to a benchmark model, then so be it.

Asset-pricing models as theories of good synthetic controls

January 18, 2023 by Alex

In 1988, California passed a major piece of tobacco-control legislation called Proposition 99. This bill increased the tax on cigarettes by $0.25 a pack and triggered a wave of bans on smoking indoors throughout the state. After the bill was passed in California, it became more expensive to smoke in California and there were fewer places to do so.

It makes sense that Prop 99 could have caused lots of people in California to stop smoking. And, consistent with this hypothesis, AbadieDiamondHainmueller2010 found that per capita cigarette consumption in California fell by around 40 packs per year from 1985 to 1995. In 1985, the typical Californian bought 100 packs per year. By 1995, the average Californian bought fewer than 60 packs per year.

But was the effect causal? Did the passage of Prop 99 cause per capita cigarette consumption in California really drop by 40 packs per year? To answer this question, you need to know how many packs of cigarettes each Californian would have bought in 1995 had Prop 99 not been passed.

It’s not obvious how you should compute this counterfactual. You can’t just assume that, in the absence of Prop 99, cigarette consumption in California would have been the same in 1995 as it was in 1985. The popularity of smoking has been falling over time throughout the country. You also can’t naively compare cigarette consumption in California in 1995 to that of a neighboring state, like Nevada, in the same year. People in Nevada are more likely to partake in all sorts of vices (smoking, drinking, gambling, etc). Comparing per capita cigarette sales in California to that of Nevada in 1995 will tend to overstate the effect of Prop 99.

But, what if rather than using just Nevada in 1995 as your stand-in for California sans Prop 99, you instead used a composite Frankenstate that has the same observable characteristics as California. e.g., people in Nevada may be much more likely to smoke, drink, and gamble relative to people in California, but people in Utah are much less likely to do all of those things than Californians. So a weighted average of per capita cigarette consumption in Nevada and Utah in 1995 might represent a good synthetic control for California.

This post first outlines the idea behind using a synthetic control. Then, I make a connection between literatures: when an asset-pricing researcher computes a stock’s abnormal return by subtracting off the return of a replicating portfolio with the same risk exposures, he’s using a synthetic control. In fact, this is exactly what the OG synthetic control paper does! AbadieGardeazabal2003 computes abnormal returns for Basque companies relative to the CAPM and the FamaFrench1993 three-factor model. I wrap up by pointing out some interesting takeaways from this connection for both asset-pricing researchers and metrics folks.

Problem setup

Here’s the canonical synthetic-control problem. Imagine that you’ve got data on how much people spend on smoking and drinking in three different states, $n \in \{ \texttt{CA}, \, \texttt{NV}, \, \texttt{UT} \}$ , in two particular years, $t \in \{ \texttt{1985}, \, \texttt{1995} \}$ . For simplicity, I’m going to talk about Prop 99 as a policy that banned indoor smoking outright:

$\begin{equation*} \mathit{IndoorSmokingPolicy}_{n,t} = \left( \begin{array}{r|cc} & \texttt{1985} & \texttt{1995} \\ \hline \texttt{CA} & \mathtt{\emptyset} & \texttt{Ban} \\ \texttt{NV} & \mathtt{\emptyset} & \mathtt{\emptyset} \\ \texttt{UT} & \mathtt{\emptyset} & \mathtt{\emptyset} \end{array} \right) \end{equation*}$

Thus, you have one state-year observation with a smoking ban in place and five without one.

Let $\mathit{CigSales}_{n,t}(p)$ denote the number of packs bought by the average person in state $n$ during year $t$ given the prevailing indoor smoking policy $p \in \{ \mathtt{\emptyset}, \, \texttt{ban} \}$ . You want to know how Prop 99 affected cigarette sales:

$\begin{equation*} \underbrace{\mathit{CigSales}_{\texttt{CA},\texttt{95}}(\texttt{Ban})}_{\substack{\text{\phantom{j}observed\phantom{j}} \\ \text{outcome}}} - \underbrace{\mathit{CigSales}_{\texttt{CA},\texttt{95}}(\mathtt{\emptyset})}_{\substack{\text{hypothetical} \\ \text{counterfactual}}} = \text{causal effect of Prop 99 on cigarette sales} \end{equation*}$

The first term, $\mathit{CigSales}_{\texttt{CA},\texttt{95}}(\texttt{Ban})$ , is the per capita cigarette sales observed in California during 1995 after Prop 99 had been implemented. The second term, $\mathit{CigSales}_{\texttt{CA},\texttt{95}}(\mathtt{\emptyset})$ , reflects packs per person during 1995 in an alternative world where everything is the same except that Prop 99 was never passed.

The key empirical challenge rests on the fact that, while I can observe cigarette sales for the year 1995 in California where Prop 99 has already been passed, $\mathit{CigSales}_{\texttt{CA},\texttt{95}}(\texttt{Ban})$ , I cannot directly observe cigarette sales in a version of 1995 California where Prop 99 didn’t go into law, $\mathit{CigSales}_{\texttt{CA},\texttt{95}}(\mathtt{\emptyset})$ . This counterfactual world is a hypothetical scenario. It never happened. The challenge is to come up with some stand-in value for $\mathit{CigSales}_{\texttt{CA},\texttt{95}}(\mathtt{\emptyset})$ based on the data that I can observe in other states.

So, you want to think about a data-generating process where there’s a potential effect coming from a one-time policy change AND a bunch of things that affect statewide cigarette sales during normal times:

$\begin{equation*} \mathit{CigSales}_{n,t}(p) = \!\!\! \underbrace{\alpha \cdot 1_{\{p= \texttt{Ban}\}}}_{\substack{\text{causal\phantom{j}effect of one-} \\ \text{time policy change}}} \!\!\! + \underbrace{\mu_t + \lambda \cdot X_n + \varepsilon_{n,t}}_{\substack{\text{determinants of cigarette} \\ \text{sales during normal times}}} \qquad \qquad \varepsilon_{n,t} \overset{\scriptscriptstyle \text{IID}}{\sim} \mathrm{Normal}(0, \, \sigma^2) \end{equation*}$

You want to know whether the introduction of Prop 99, which is captured by the $1_{\{p= \texttt{Ban}\}}$ term, had an effect on cigarette sales in California. i.e., you want to know whether $\alpha < 0$ . If Prop 99 had no effect, then $\alpha = 0$ .

The remaining determinants of statewide cigarette sales during normal times are important because they dictate which observables might be a good stand-in for the counterfactual version of California in 1995 where Prop 99 was never passed as illustrated in the interactive figure below. The left panel depicts the number of cigarette packs purchased by an average resident in each of your three states during 1985 (y axis) as a function of liquor consumption in that state (x axis). The right panel shows the same thing but for 1995. The solid circles represent observed values of per capita annual cigarette sales. The dotted circle represents the counterfactual value for California in a world where Prop 99 was not passed ( $p = \mathtt{\emptyset}$ ).

$X_n$ represents liquor consumption in state $n$ . People in Nevada are more likely to spend money on all sorts of vices (smoking included) than people in California. $X_n$ is a proxy for this statewide predisposition. So, if you ignore this background variable and naively compare cigarette purchases in California to that in Nevada during 1995, then it’ll look like Prop 99 had an outsized effect. People in Nevada purchase an extra $15$ packs per year relative to California in the figure. Thus, using Nevada during 1995 as your counterfactual observation would cause you to overstate the true causal effect of Prop 99 by $15$ packs per person annually.

$\mu_t$ is the average per capita cigarette sales during year $t$ across all states. Cigarette sales have been falling over time, so $\mu_{\texttt{95}} < \mu_{\texttt{85}}$ . However, in its initial configuration, cigarette sales are constant over time in the figure above, $\Delta \mu = (\mu_{\texttt{95}} - \mu_{\texttt{85}}) = 0$ . If that were the case, then you could use per capita cigarette sales in California during 1985 as your counterfactual observation. But, by moving the $\Delta \mu$ slider, you can see how a downward time-series trend in cigarette sales would cause you to again overstate the effect of Prop 99:

$\begin{equation*} \Exp\big[ \, \underbrace{\mathit{CigSales}_{\texttt{CA}, \texttt{95}}(\texttt{Ban})}_{\text{observed}} - \underbrace{\mathit{CigSales}_{\texttt{CA}, \texttt{85}}(\mathtt{\emptyset})}_{\text{observed}} \, \big] = \Delta \mu + \alpha \end{equation*}$

Synthetic control

Because smoking has been growing less and less popular over time, you want to create the counterfactual for cigarette sales in California during 1995 using contemporaneous data from other states. But you also recognize that no other state is a perfect doppelganger for California. People in Nevada are more likely to smoke, drink, and gamble relative to people in California. Utah residents are less likely to do all of those things than Californians. So why no do the obvious thing and average these two values?

For conreteness, suppose that the drinking rate in California are exactly halfway in between the rates in Utah and Nevada:

$\begin{equation*} X_{\texttt{CA}} = (1/2) \cdot X_{\texttt{UT}} + (1/2) \cdot X_{\texttt{NV}} \end{equation*}$

In practice, the weights wouldn’t be exactly $(1/2)$ . But you could estimate these values using your 1985 data. And you could use these weights to construct a Voltron-esque counterfactual for cigarette sales in California during 1995 out of the contemporaneous values for Utah and Nevada:

$\begin{equation*} \begin{split} \widehat{\mathit{CigSales}}_{\texttt{CA}, \texttt{95}}(\mathtt{\emptyset}) &= (1/2) \cdot \overbrace{\mathit{CigSales}_{\texttt{UT},\texttt{95}}(\mathtt{\emptyset})}^{\text{observed}} + (1/2) \cdot \overbrace{\mathit{CigSales}_{\texttt{NV},\texttt{95}}(\mathtt{\emptyset})}^{\text{observed}} \\ &= \mu_{\texttt{95}} + \lambda \cdot X_{\texttt{CA}} + (1/2) \cdot (\varepsilon_{\texttt{UT},\texttt{95}} + \varepsilon_{\texttt{NV},\texttt{95}}) \end{split} \end{equation*}$

Since it’s made out of observations from 1995, this synthetic control is not confounded by the nationwide drop in cigarette sales from 1985 through 1995. By matching California’s alcohol sales, $X_{\texttt{CA}}$ , this synthetic control also accounts for persistent differences in cigarette sales across states due to differing propensities to partake in all vices. And, if we’ve done everything correctly, then we can compute

$\begin{equation*} \Exp\big[ \, \underbrace{\mathit{CigSales}_{\texttt{CA}, \texttt{95}}(\texttt{ban})}_{\text{observed}} - \underbrace{\widehat{\mathit{CigSales}}_{\texttt{CA}, \texttt{95}}(\mathtt{\emptyset})}_{\text{calculated}} \, \big] = \alpha \end{equation*}$

where $\alpha$ denotes the true causal effect of Prop 99 on annual per capita cigarette sales in California.

Risk adjustment

The synthetic control for cigarette sales in California during 1995 was a weighted average of cigarette sales in Nevada and Utah where the weights were chosen to replicate California’s value of $X_{\mathtt{CA}}$ . While this approach has some advantages, it’s also somewhat unsatisfying in that there’s no real physical analog to the counterfactual it produces. There’s no process by which you can take a weighted average of Utah and Nevada. This is purely a statistical construct.

The key insight in this post is that, when an asset-pricing researcher computes a risk-adjusted return for some asset relative to particular model, he’s using this same synthetic control methodology. And in an asset-pricing context, there’s a clear physical analog to the resulting counterfactual. The synthetic control represents a portfolio of the underlying assets with appropriately chosen portfolio weights. e.g., in the case of the CAPM, a synthetic control observation for a particular asset is a replicating portfolio with weights chosen so that it has the exact same market beta.

For example, suppose you think expected returns are governed by the CAPM. Then $\mu_t = \mathit{RiskfreeRate}_t$ is the prevailing risk-free rate at time $t$ , $X_n = \Cov[\mathit{Return}_{n,t}, \, \mathit{Market}_t] \, / \, \Var[\mathit{Return}_{n,t}]$ is the market beta on the $n$ th asset, and $\lambda$ is price of an increase in exposure to this market risk factor:

$\begin{equation*} \mathit{Return}_{n,t}(p) = \underbrace{\alpha \cdot 1_{\{ p = \texttt{hi} \}}}_{\substack{\text{effect\phantom{j}of\phantom{j}anom-} \\ \text{lous predictor}}} + \underbrace{\mu_t + \lambda \cdot X_n + \varepsilon_{n,t}}_{\substack{\text{what\phantom{j}determines\phantom{j}returns} \\ \text{in asset-pricing model}}} \qquad \qquad \varepsilon_{n,t} \overset{\scriptscriptstyle \text{IID}}{\sim} \mathrm{Normal}(0, \, \sigma^2) \end{equation*}$

The asset’s return should be higher if the risk-free rate is higher ( $\mu_t$ is high), if it has lots of exposure to market risk ( $X_n$ is high), and/or if the price of this exposure to market risk is high ( $\lambda$ is high).

The core claim in any asset-pricing model (CAPM included) is that, after controlling for the $X$ variables specified in the model, it shouldn’t be possible to find another predictor, $p \in \{\texttt{lo}, \, \texttt{hi} \}$ , that forecasts returns:

$\begin{equation*} \text{claim: $\alpha = 0$ for every predictor $p$ that you can think of} \end{equation*}$

And how would an asset-pricing researcher test to see if $\alpha = 0$ ? He’d compare the $n$ th asset’s returns to the returns of a replicating portfolio whose weights were chosen so that it had the exact same value of $X_n$ . e.g., suppose we’re in a CAPM world, and the $n$ th asset has a market beta of $X_n = 0.50$ . If there are two other assets with betas equal to $0.20$ and $0.80$ respectively, you should compare the $n$ th asset’s return to the return of a equally weighted portfolio of those two assets, $X_n = (1/2) \cdot 0.20 + (1/2) \cdot 0.80 = 0.50$ . Exact same situation! And the original synthetic-control paper (AbadieGardeazabal2003) pointed out as much!

Some takeaways

This connection between the synthetic-control literature and the asset-pricing literature delivers some interesting takeaways on both sides. First, let’s think about it from the perspective of an asset-pricing researcher. There have been several recent econometric advances in the study of synthetic controls. e.g., Chen2023 frames the synthetic-control procedure as an online learning problem. The paper then uses this parallel to give policymakers some guidance on when and where synthetic control is most likely to be successful. By framing risk adjustment as a specific instance of a more general approach to producing synthetic controls, asset-pricing researchers might be able to port over some of these recent advances.

I think the message is a bit less positive when traveling from asset pricing back to the econometrics of synthetic controls. It’s been 50 years since Merton1973 introduced the ICAPM, and asset-pricing researchers have yet to agree on which $X$ s to use when doing our risk adjustments. This fact should give econometricians pause when considering the limits of the synthetic-control approach. In a recent review article, Abadie2021 argues that the synthetic-control methodology offers a “safeguard against specification searches. (p406)” Judging by the current state of the asset-pricing literature, I’m not sure this is true.

Abadie2021 also argues that, while a researcher using the synthetic-control procedure might make an error in choosing control variables, at least the procedure is transparent about how a counterfactual is being constructed. The procedure itself is certainly transparent. But I’m not sure how many people really think through the logic now that synthetic control has gone mainstream. How many people think of the buildup of pus in a pimple when they use the phrase “coming to a head”? The conceptual metaphor is perfectly transparent. But most people never look. In a similar vein, asset-pricing researchers often use the FamaFrench1993 three-factor model to “control for risk” in spite of the fact that real-world investors aren’t trying to use their stock portfolios to buy insurance against these risk factors. An empirical procedure which initially encourages introspection can eventually turn into a stale thoughtless idiom. The asset-pricing literature suggests that econometricians should be more worried about this trend.

Interpreting the LASSO as a really simple neural network

January 10, 2023 by Alex

Suppose you want to forecast the return of a particular stock using many different predictors (think: past returns, market cap, asset growth, etc…). One way to do this would be to use the LASSO. Alternatively, you could use a neural network to make your forecast. On the surface, these two approaches look very different. However, it turns out that it’s possible to recast the LASSO as a *really* simple neural network.

This post outlines how.

This connection suggests we can use penalized regressions, such as the LASSO, as microscopes for studying more complicated machine-learning models, like neural networks, which often exhibit surprising new behavior. For example, if you include more predictors than observations in an OLS regression, then you’ll be able to perfectly fit your training data but your out-of-sample performance will be terrible. By contrast, highly over-parameters neural networks often have the best out-of-sample fit.

Because these models are so complicated, it’s often hard to understand why a pattern like this might emerge. Penalized regression models like the LASSO occupy a middle ground between OLS and complicated machine-learning models. Thus, if the LASSO can be viewed as a really simple neural net, then it might be possible to use this intermediate setup as a laboratory for understanding more complicated procedures. That’s the idea behind HastieMontanariRossetTibshirani22. And KellyMalamudZhou22 build on their logic.

General setup

Imagine that you’ve got historical data on the returns of $N \gg 1$ different stocks, $\{ \mathit{Ret}_n \}_{n=1}^N$ , and you want to make the best forecast possible for the future return of the $(N+1)$ st stock, $\widehat{\mathit{Ret}}_{N+1}$ . You have access to $K \gg 1$ different return predictors. Let $X_{n,k}$ denote the value of the $k$ th predictor for the $n$ th stock. Assume that each predictor has been normalized to have mean zero and variance $(1/K)$ in the cross-section. Without loss of generality, also assume that the cross-sectional average return is zero.

If there were only one predictor, $K = 1$ , then it’d be possible to estimate the OLS regression below:

$\begin{equation*} \hat{\beta}^{\text{OLS}} = \arg \min_{\beta} \, \sum_{n=1}^N \left\{ \, \mathit{Ret}_n - \beta \cdot X_n \, \right\}^2 \end{equation*}$

In this case, the solution is given by $\hat{\beta}^{\text{OLS}} \propto \sum_{n=1}^N \, (\mathit{Ret}_n - 0) \times (X_n - 0)$ . If the predictor tends to be positive, $X_n > 0$ , for stocks that subsequently realize positive returns, $\mathit{Ret}_n > 0$ , then the OLS slope coefficient associated with it will be positive. It will also be profitable to trade on this predictor, too.

You can also use an OLS regression to create a return forecast when you have more than one predictor

$\begin{equation*} \widehat{\mathit{Ret}}_{N+1}^{\text{OLS}} = \sum_{k=1}^K \hat{\beta}_k^{\text{OLS}} \cdot X_{N+1,k} \end{equation*}$

provided that you still have more observations than predictors, $N > K$ . If you’ve got $K=200$ predictors and $N=500$ stocks in your training data, then you’re in business. However, if your training data only contains $N=100$ stocks, then you’re SOL. You’ll have to use something other than an OLS regression.

The LASSO

One popular approach is to fit a LASSO specification. This is essentially an OLS regression with an additional absolute-value penalty applied to each predictive coefficient:

$\begin{equation*} \min_{\beta_1,\ldots,\beta_K} \, \sum_{n=1}^N \left\{ \, \mathit{Ret}_n - \sum_{k=1}^K \beta_k \cdot X_{n,k} \, \right\}^2 + 2 \cdot \lambda \cdot \sum_{k=1}^K |\beta_k| \end{equation*}$

The pre-factor of $\lambda \geq 0$ in front of the penalty term is a tuning parameter, which can be optimally chosen via cross-validation. Notice that, when $\lambda = 0$ , there is no penalty at all and the LASSO is equivalent to OLS. But when $\lambda > 0$ , the LASSO’s coefficients will differ from OLS estimates as shown in the interactive figure below.

To see what I mean, let’s return to the case where there’s only one predictor. Alternatively, you could think about a world with orthogonal predictors, $\Cov(X_k, \, X_{k'}) = 0$ for all $k \neq k'$ . In either case, we have:

$\begin{equation*} \hat{\beta}_k^{\text{LASSO}} = \mathrm{Sign}(\hat{\beta}_k^{\text{OLS}}) \times \max\big\{ 0, \, |\hat{\beta}_k^{\text{OLS}}| - \lambda \big\} \end{equation*}$

This expression tells us that the LASSO does two things. First, it shrinks large OLS coefficients toward zero, $|\hat{\beta}_k^{\text{LASSO}}| < |\hat{\beta}_k^{\text{OLS}}|$ . Second, it forces all small OLS coefficients, $|\hat{\beta}_k^{\text{OLS}}| < \lambda$ , to be exactly zero, $\hat{\beta}_k^{\text{LASSO}} = 0$ .

Neural network

The LASSO is still able to make forecasts in situations where there are more predictors than observations because it kills off all the smallest predictors. Morally speaking, if only $5$ of your $K=200$ predictors have any forecasting power, then you shouldn’t need $N \geq 200$ observations to figure this out. $20$ data points should do just fine. An alternative approach to making a return forecast when $K > N$ would be to use a neural network. On the surface, this seems like a very different strategy. Instead of a bet on sparsity, large neural networks often perform best when highly over-parameterized.

There are lots of kinds of neural networks. In this post, I’m going to mainly focus on neural networks with only one hidden layer that has the same number of nodes as predictors. e.g., with $K=200$ predictors, there will be $H = 200$ hidden nodes. The diagram to the left shows what this would look like in a situation with $K=3$ predictors and $H=3$ hidden nodes so that we can see what’s going on.

The value of each hidden node is determined by an activation function that takes a linear combination of predictor values as its input:

$\begin{equation*} H_k = \mathrm{h}\!\left( \beta_{0 \to k} + \sum_{k'=1}^{K} \beta_{k' \to k} \cdot X_{n,k'}\right) \end{equation*}$

e.g., you could set $\mathrm{h}(z) = z$ , $\mathrm{h}(z) = \max\{0, \, z \}$ , or something else entirely. $\vec{\beta}_k = (\beta_{0 \to k}, \, \beta_{1 \to k}, \ldots, \, \beta_{K \to k})$ contains the weights that go into the $k$ th hidden node. It has $(K+1)$ elements due to the intercept term.

The return forecast generated by this neural network, $\widehat{\mathit{Ret}}_{N+1}^{\text{NNet}}$ , is then a weighted average of its $K$ hidden nodes where the weights are chosen by solving the optimization problem below:

$\begin{equation*} \min_{\substack{\alpha_1, \ldots, \alpha_K \\ \vec{\beta}_1, \ldots, \vec{\beta}_K}} \, \sum_{n=1}^N \left\{ \, \mathit{Ret}_n - \sum_{k=1}^K \alpha_k \cdot \mathrm{h}\!\left( \beta_{0 \to k} + \sum_{k'=1}^{K} \beta_{k' \to k} \cdot X_{n,k'}\right) \, \right\}^2 \!\! + \lambda \cdot \sum_{k=1}^K \left( \alpha_k^2 + \beta_{0 \to k}^2 + \sum_{k'=1}^K \beta_{k'\to k}^2 \right) \end{equation*}$

This objective function includes a penalty term just like the LASSO, but the penalty is quadratic. It’s equivalent to the common practice of training a neural network via gradient descent with weight decay.

Degrees of freedom

If our goal is to write down the LASSO as a special case of a neural network, then there are two apparent differences that need to be finessed. The first involves degrees of freedom. In the LASSO, there is one parameter that needs to be estimated for each predictor. In the neural network above, each predictor is associated with $(K + 2)$ free parameters. In addition, you must also choose an activation function, $\mathrm{h}(\cdot)$ .

To represent the LASSO as a neural network, we’re going to have to shut down $(K+1)$ of the degrees of freedom associated with each predictor. So, let’s start by looking at a neural network that’s “simply connected”—i.e., a network where $\beta_{k' \to k} = 0$ whenever $k' \neq k$ . Let’s also assume a linear activation function, $\mathrm{h}(z) = z$ , and restrict ourselves to the case where there’s no constant term, $\beta_{0 \to k} = 0$ .

After making these assumptions, we are left with the neural network in the diagram above. There are now only two free parameters associated with each predictor: $\alpha_k$ and $\beta_{k \to k}$ . To estimate all $2 \cdot K$ of these values, we must maximize the objective below:

$\begin{equation*} \min_{\substack{\alpha_1, \ldots, \alpha_K \\ \beta_{1 \to 1}, \ldots, \beta_{K \to K}}} \, \sum_{n=1}^N \left\{ \, Y_n - \sum_{k=1}^K \alpha_k \cdot \beta_{k \to k} \cdot X_{n,k} \, \right\}^2 + \lambda \cdot \sum_{k=1}^K (\alpha_k^2 + \beta_{k \to k}^2) \end{equation*}$

This looks almost like the LASSO objective function. But there’s still one glaring difference left…

Nature of the penalty

In the LASSO, we’ve got an absolute-value penalty; whereas, the neural network has a quadratic penalty. This seems important! To see why, consider replacing the absolute-value penalty in the LASSO with

$\begin{equation*} \min_{\beta_1,\ldots,\beta_K} \, \sum_{n=1}^N \left\{ \, Y_n - \sum_{k=1}^K \beta_k \cdot X_{n,k} \, \right\}^2 + \lambda \cdot \sum_{k=1}^K \beta_k^2 \end{equation*}$

When you do this, you’re left with something called the Ridge regression.

Just like with the LASSO, we can characterize the Ridge estimates relative to OLS in the case where there’s only one predictor or all predictors are orthogonal to one another:

$\begin{equation*} \hat{\beta}_k^{\text{Ridge}} = \left( {\textstyle \frac{1}{1 + \lambda}} \right) \times \hat{\beta}_k^{\text{OLS}} \end{equation*}$

When you increase the value of $\lambda$ in the figure to the right, you’ll see that the slope of the line changes. The larger the $\lambda$ , the less $\hat{\beta}_k^{\text{Ridge}}$ changes in response to a change in $\hat{\beta}_k^{\text{OLS}}$ . Notice how this effect is qualitatively different from the effect of increasing $\lambda$ in a LASSO specification. There, $\lambda$ controlled the size of the inaction region. But, provided that $|\hat{\beta}_k^{\text{LASSO}}| > 0$ , the LASSO estimate always moved one-for-one with $\hat{\beta}_k^{\text{OLS}}$ .

However, this Ridge intuition is misleading. In the simply-connected neural-network structure that I outline above, we are not choosing a single coefficient $\beta_{k \to k}$ . Instead, because there is a hidden layer, we are choosing the product of $\alpha_k \cdot \beta_{k \to k}$ . And this makes all the difference. For any value of $c \geq 0$ , we have that

$\begin{equation*} \min_{\alpha,\beta \geq 0} \big\{ \, \alpha^2 + \beta^2 : \alpha \cdot \beta = c \, \big\} = 2 \cdot |c| \end{equation*}$

where the minimum is at $\alpha_k = \beta_{k \to k} = \sqrt{c}$ . This is just the inequality relating arithmetic and geometric averages. It’s what allows a single hidden layer to sneak in a threshold through the back door.

Some extensions

We’ve just seen that you can think about the LASSO as a simply-connected two-layer neural network with a linear activation function and no bias terms, which was trained via gradient descent with weight decay. This is not my observation. I first saw it in Tibshirani21. The step where you reduce the degrees of freedom is obvious enough. But I had never made the connection with the arithmetic/geometric mean inequality. That second step struck me (and still strikes me) as really cool. It’s also a very concrete example of the flexibility inherent in neural networks. The hidden layer allows a neural network to do things you wouldn’t guess possible based only on the functional forms involved.

In addition to outlining the argument above, Tibshirani21 also gives a couple of other interesting extensions. e.g., the note shows how, by increasing the number of hidden layers in the neural network, you can reproduce the output of a LASSO-like specification below

The more hidden layers you include, the closer you get to best-subset selection, $q=0$ . The note also shows that it’s possible to write group-LASSO as a neural network that ain’t quite so simply connected.