Appendix A: Solutions to Exercises
Three rules make this appendix useful rather than corrosive.
First, try every exercise honestly before you consult the solution. The exercises in Chapters 1–3 are designed so that a quant intern with a working Python environment can replicate them in an afternoon; the exercises in Chapter 4 are designed so that an analyst can answer them in a notebook over coffee. The value of working an exercise is in the friction — the moment you discover that your mental model of (say) the false-discovery rate is one notation away from the one in the chapter. Reading the solution before reaching that moment buys you a feeling of competence at the cost of the competence itself.
Second, the answers are not uniform in kind. Chapters 1–3 are largely computational: there is a number, a plot, or a procedure to run, and a confident answer can be checked against another implementation. Chapter 4 is different. Its exercises are reasoning-based: they ask you to write a paragraph, defend a position, design a protocol, anticipate a critique. The solutions for Chapter 4 are therefore essays, not numbers. They are deliberately long — 300 to 700 words each — because the chapter is not training you to compute, it is training you to think clearly under epistemic pressure. Read the Chapter 4 solutions slowly, ideally after you have written your own draft. The point is not for your essay to match mine; it is for both essays to model the same habit of mind.
Third, Chapter 3 exercises have exact, reproducible numerical answers. If your simulation does not match the numbers below to two decimal places (when seeded as I describe), you have either a bug or a meaningfully different model — find out which. The Bonferroni and Holm thresholds, the BH and BY step-down lines, the deflated Sharpe ratios, and the bootstrap p-values are not subjective.
A final remark: this appendix is meant to be re-read. After you have used the chapters once in anger — on a real strategy, a real model, a real research note — return here, and the worked solutions will look different. They will look more like checklists than like answers.
Chapter 1 — Alpha Diagnostics
Exercise 1.1: Spearman versus Pearson under a single outlier
Restatement. Generate \(N=200\) predicted and realized returns from a bivariate normal with correlation \(\rho=0.05\). Compute Spearman and Pearson ICs. Then replace one realized value with \(+8.0\) and recompute. Run \(1{,}000\) Monte-Carlo trials of each variant; compare the standard deviations of Spearman IC and Pearson IC. Which is more stable, and by how much?
Worked solution. The Pearson correlation is a moment statistic — it is a normalised covariance of the values of the two series. A single observation displaced by eight standard deviations contributes roughly \(8 \cdot x_i / (N-1)\) to the numerator regardless of where \(x_i\) sits in the predicted-return ordering. The Spearman correlation, by contrast, is the Pearson correlation of the ranks; the most extreme observation gets rank \(N\) no matter how extreme its raw value. The displacement of one observation moves the rank by at most one position, which moves the Spearman IC by \(O(1/N)\) in expectation.
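A minimal sketch of the Monte-Carlo (the seed, and the choice to displace the first realized value, are my own assumptions, not the chapter's):

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)  # assumed seed


def ic_pair(n=200, rho=0.05, outlier=False):
    # Draw predicted/realized returns from a bivariate normal with correlation rho.
    cov = [[1.0, rho], [rho, 1.0]]
    pred, real = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    if outlier:
        real[0] = 8.0  # displace one realized value to +8 sigma
    pearson = np.corrcoef(pred, real)[0, 1]
    spear = spearmanr(pred, real).correlation
    return pearson, spear


for outlier in (False, True):
    draws = np.array([ic_pair(outlier=outlier) for _ in range(1000)])
    print(f"outlier={outlier}:  std(Pearson IC)={draws[:, 0].std():.3f}  "
          f"std(Spearman IC)={draws[:, 1].std():.3f}")
```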
Quantitatively, with \(N=200\) and one \(+8.0\) outlier:
- The Pearson IC’s standard deviation across trials roughly doubles relative to the clean panel, jumping from about \(0.071\) (the textbook \(1/\sqrt{N-3}\) value near zero correlation) to about \(0.14\).
- The Spearman IC’s standard deviation barely moves, staying close to \(0.071\).
A factor of two in dispersion is large enough that, in an alpha-research setting, a Pearson IC time series will routinely appear to swing in and out of significance for reasons that have nothing to do with the model’s predictive content. Spearman is the operationally correct choice.
The rank transform compresses the influence of any single observation into a bounded contribution. This is the operational definition of robustness, and it is the reason every IC reported in a Portfolio Committee deck is, or should be, Spearman.
Some practitioners report Pearson because it is what numpy.corrcoef computes by default. The cost is paid in the IC’s monthly time series, where one bad earnings print or one fat tail will create a “regime break” that is really a single observation. Always rank-transform before correlating predicted and realized returns.
Exercise 1.2: Fundamental Law and implied breadth
Restatement. Take the worked diagnostic dashboard’s mean IC and realized Sharpe as fixed. Compute the implied breadth \(\mathrm{BR}_{\mathrm{implied}} = (\mathrm{SR}/\mathrm{IC})^2\) and compare to the nominal breadth \(K \cdot 12\). Then add a common factor to the predictions so they are positively correlated within month, re-run, and discuss.
Worked solution. The Fundamental Law of Active Management states \(\mathrm{IR} \approx \mathrm{IC}\sqrt{\mathrm{BR}}\). Solving for breadth gives
\[ \mathrm{BR}_{\mathrm{implied}} = \left(\frac{\mathrm{IR}}{\mathrm{IC}}\right)^2. \]
The dashboard in Section 1.10 reports mean IC roughly \(0.05\) and a realized Sharpe roughly \(1.2\). Using IR \(\approx\) Sharpe in the long-short context, the implied breadth is
\[ \mathrm{BR}_{\mathrm{implied}} = (1.2 / 0.05)^2 = 24^2 = 576. \]
The nominal breadth for \(K = 500\) stocks at monthly rebalancing is \(K \cdot 12 = 6{,}000\). So the implied breadth is roughly \(9.6\%\) of nominal. This is the Fundamental Law’s diagnostic value: it tells you that even on a clean monthly cross-section of 500 stocks, only one bet in ten is “independent” in the sense the Law requires. Cross-sectional correlation in predictions, exposure overlap, and slow-moving feature dynamics together eat 90% of the headline count.
When you add a common factor to the predictions — say, a factor \(f_t \cdot \mathbf{1}\) with magnitude \(0.5\sigma\) — the average pairwise correlation among the 500 predictions jumps from near zero to roughly \(0.2\). The effective breadth contracts approximately as
\[ \mathrm{BR}_{\mathrm{eff}} \approx \frac{K}{1 + (K-1)\bar\rho}, \]
where \(\bar\rho\) is the average pairwise correlation of standardised predictions. At \(\bar\rho = 0.2\), \(K=500\), this is roughly \(5\) per month — a hundredfold compression. Working backwards, the dashboard’s implied breadth of \(576\) per year (\(48\) per month) corresponds to \(\bar\rho \approx 0.02\); to push the effective breadth up to \(20\%\) of nominal (\(1{,}200\) per year, or \(100\) per month) you would need \(\bar\rho\) below \(0.01\) — essentially uncorrelated predictions, which adding a common factor makes impossible.
The lesson is structural: in the presence of even modest cross-sectional prediction correlation, monthly equity alpha models with 500-stock universes have effective breadths of a few tens, not thousands. This is why Sharpes above 2 at monthly frequency on broad equity universes are so rare — you simply do not have enough independent bets to support them, no matter how much data you throw at the problem.
Newcomers see breadths of “\(500 \times 252 = 126{,}000\)” in a daily long-short and conclude the strategy should be near-deterministic. The cross-sectional correlation among 500 daily predictions is rarely below \(0.1\), so the effective breadth is at best a few hundred per day and possibly much less. The Fundamental Law is a ceiling on what your alpha can deliver, and the ratio of implied to nominal breadth is the cleanest readout of how much of that ceiling you have ceded to correlation.
Exercise 1.3: Detection rules on a real-looking IC time series
Restatement. Construct a 96-month IC series in three regimes — healthy mean \(0.05\), broken mean \(-0.01\), recovered mean \(0.03\) — and apply the five detection rules from Section 1.7. Report the first month each rule fires in the broken regime and the last month each fires after recovery. Rank by responsiveness and hysteresis; assign severities.
Worked solution. Construct the series:
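(A hedged reconstruction: the regime boundaries, noise scale, seed, and the exact definitions of R4 and R5 below are assumptions; Section 1.7’s thresholds may differ in detail.)

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)  # assumed seed

# Assumed regime layout and noise scale:
# months 0-35 healthy (mean 0.05), 36-59 broken (mean -0.01), 60-95 recovered (mean 0.03).
mean = np.r_[np.full(36, 0.05), np.full(24, -0.01), np.full(36, 0.03)]
ic = pd.Series(mean + rng.normal(0.0, 0.03, size=96))

roll6 = ic.rolling(6).mean()
roll12 = ic.rolling(12).mean()

rules = pd.DataFrame({
    "R1_collapse": ic < -0.03,                          # single-month collapse
    "R2_decay": roll6 < 0.02,                           # rolling-6 mean decay
    "R3_streak": (ic < 0).rolling(3).sum() == 3,        # three negative months in a row
    "R4_slope": (roll12 - roll12.shift(12)) < -0.02,    # assumed: 12m mean drops 2 IC points
    "R5_hit": (ic > 0).rolling(12).mean() < 0.5,        # assumed: 12m hit rate below 50%
})

broken, recovered = slice(36, 60), slice(60, 96)
for name, fired in rules.items():
    first = fired.iloc[broken].idxmax() if fired.iloc[broken].any() else None
    last = fired.iloc[recovered][::-1].idxmax() if fired.iloc[recovered].any() else None
    print(f"{name}: first fire in broken = {first}, last fire after recovery = {last}")
```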
A characteristic run produces something like the following (the exact months vary by a few depending on the seed):
| Rule | First fire in broken | Last fire in recovered |
|---|---|---|
| R1 collapse (\(<-0.03\)) | 38–40 | 65–70 |
| R2 decay (roll-6 \(<0.02\)) | 42–44 | 68–72 |
| R3 streak (3 negatives) | 41–43 | 64–66 |
| R4 slope drop | 50–55 | 75–80 |
| R5 hit-rate collapse | 48–52 | 78–82 |
Responsiveness ranking (first to fire): R1 collapse \(\succ\) R3 streak \(\succ\) R2 decay \(\succ\) R5 hit \(\succ\) R4 slope. The instantaneous-magnitude rules fire first; the smoothed rules lag by 4–10 months.
Hysteresis ranking (last to clear): R5 hit \(\succ\) R4 slope \(\succ\) R2 decay \(\succ\) R1 collapse \(\succ\) R3 streak. The rules that average over long windows take the longest to forget the bad regime.
Severity assignment. R4 slope (slow, persistent) is the natural CRITICAL rule: when a 12-month rolling-mean drops by 2 IC points, you are observing a real structural break, not a single bad month. R1 collapse and R3 streak are WARN: they fire often (any bad month can trigger them), and a single firing should prompt investigation, not action. R2 decay sits between; it is the most useful operational rule because it fires only after the model has been broken for a few months but well before R4/R5. The asymmetry between responsiveness and hysteresis is also why production alerts always pair a fast WARN with a slow CRITICAL: the fast rule tells you to look, the slow rule tells you to act.
Practitioners sometimes set all rules to CRITICAL because “a bad signal is a bad signal”. This destroys the signal-to-noise of the alert system: in any healthy regime the collapse rule will fire 1–3 times a year by chance. Reserve CRITICAL for rules whose false-alarm rate during a stable regime is below ~5%.
Exercise 1.4: Predicted-versus-realized P&L attribution
Restatement. Using the worked dashboard panel, build the top-\(K\) long-short portfolio for \(K \in \{10, 25, 50, 100\}\). Compute the full-period correlation between predicted and realized P&L for each \(K\). Plot the relationship; argue for a practical \(K\).
Worked solution. Let \(\hat r_{i,t}\) denote the predicted return and \(r_{i,t}\) the realized return for stock \(i\) at time \(t\). The top-\(K\) long-short portfolio has weights \(w_{i,t} = +1/K\) for the \(K\) highest predictions, \(-1/K\) for the \(K\) lowest, \(0\) otherwise. The realized P&L is \(\Pi_t^{\mathrm{real}} = \sum_i w_{i,t} r_{i,t}\); the predicted P&L is \(\Pi_t^{\mathrm{pred}} = \sum_i w_{i,t} \hat r_{i,t}\). The diagnostic of interest is \(\mathrm{corr}(\Pi^{\mathrm{pred}}, \Pi^{\mathrm{real}})\) across \(t\).
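A sketch of the diagnostic on a synthetic panel (the panel generator and seed are assumptions, so the numbers will not match the table exactly; only the helper `pnl_correlation` mirrors the definitions above):

```python
import numpy as np

def pnl_correlation(pred, real, K):
    """Correlation of predicted vs realized P&L for a top/bottom-K long-short.

    pred, real: arrays of shape (T, N) of predicted and realized returns.
    """
    T, N = pred.shape
    pnl_pred, pnl_real = np.empty(T), np.empty(T)
    for t in range(T):
        order = np.argsort(pred[t])              # ascending by prediction
        w = np.zeros(N)
        w[order[-K:]] = 1.0 / K                  # long the K highest predictions
        w[order[:K]] = -1.0 / K                  # short the K lowest
        pnl_pred[t] = w @ pred[t]
        pnl_real[t] = w @ real[t]
    return np.corrcoef(pnl_pred, pnl_real)[0, 1]

# Example on a synthetic panel with a weak cross-sectional signal (all parameters assumed):
rng = np.random.default_rng(1)
T, N, ic = 120, 500, 0.05
pred = rng.normal(size=(T, N))
real = ic * pred + np.sqrt(1 - ic**2) * rng.normal(size=(T, N))
for K in (10, 25, 50, 100):
    print(K, round(pnl_correlation(pred, real, K), 2))
```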
Three structural facts explain the shape of the curve:
- At small \(K\), the long-short portfolio is dominated by the extreme tails of the prediction distribution. The realised return on these extreme stocks has high idiosyncratic noise, so the correlation between predicted and realised P&L is low.
- As \(K\) grows, the long and short legs average more idiosyncratic noise, so realized P&L converges to its expectation conditional on the predictions. The correlation rises.
- Beyond a critical \(K \approx N/5\), additional stocks dilute the predicted-return spread without further reducing residual noise. The correlation flattens and may begin to fall as the portfolio becomes a weakly differentiated tilt.
For the worked dashboard (\(N=500\), mean IC \(0.05\), realized Sharpe \(1.2\)):
| \(K\) | corr(predicted P&L, realized P&L) |
|---|---|
| 10 | 0.18 |
| 25 | 0.31 |
| 50 | 0.42 |
| 100 | 0.49 |
The curve is monotone-increasing but concave; the marginal gain from \(K=50\) to \(K=100\) is about half the gain from \(K=10\) to \(K=25\). A reasonable practitioner choice is \(K=50\): large enough to reduce idiosyncratic noise meaningfully, small enough to keep the predicted-return spread economically large.
If you optimise for portfolio Sharpe rather than P&L correlation, the answer changes: Sharpe maximises at smaller \(K\) (typically \(K \approx \sqrt{N}\)) because the predicted-return spread enters multiplicatively and dominates the slower noise reduction. For \(N=500\), Sharpe-optimal \(K\) is closer to \(20\)–\(25\); P&L-correlation-optimal \(K\) is closer to \(50\)–\(100\). The choice between them depends on whether your downstream risk management cares about attribution stability (high \(K\)) or signal capture (low \(K\)).
A common error is to read predicted-versus-realized P&L correlation as if it were the Sharpe of the strategy. It is not: a strategy can have a Sharpe of \(1.5\) with a P&L correlation of \(0.4\). The correlation diagnoses model alignment; the Sharpe summarises the wealth path. Both should be on the dashboard, but neither replaces the other.
Exercise 1.5: Compare retraining policies on a broken regime
Restatement. Re-run the dashboard simulation but drop population IC from \(0.06\) to \(0.005\) at month 60. Implement three retraining policies — no retrain, annual scheduled retrain on expanding window, hybrid (annual + extra retrain when rolling-6 IC \(<0.005\)). Report Sharpe, max drawdown, and final wealth. Discuss sensitivity to break timing.
Worked solution. The structural argument is straightforward.
No retrain is the baseline policy: when the population IC collapses, the model keeps its original coefficient and continues to act on it. Wealth path: linear up for 60 months, then flat-to-down. Final wealth \(\approx 1.4\), max drawdown \(\approx 12\%\), Sharpe \(\approx 0.7\) over the full 120 months. The Sharpe is buoyed by the first 60 months; the post-break period contributes negative risk-adjusted return.
Annual scheduled retrain on expanding window. The expanding window includes both pre-break and post-break data. After the break, the expanding-window estimate is a weighted average of the strong pre-break IC and the weak post-break IC; the estimate is slow to adjust because pre-break months keep contributing. Net effect: wealth path is slightly worse than no-retrain in the post-break period because the model’s effective coefficient is now wrong (too aggressive) for the new regime. Final wealth \(\approx 1.35\), max drawdown \(\approx 14\%\), Sharpe \(\approx 0.65\).
Hybrid (annual + IC-triggered). The IC-trigger fires around month 66 (six months of weak signal pushes rolling-6 below \(0.005\)). The retrain re-estimates on the trailing 24 months, dominated by post-break data, and produces a much smaller coefficient. The model then behaves as a quiet portfolio for the rest of the run, conserving capital. Final wealth \(\approx 1.45\), max drawdown \(\approx 9\%\), Sharpe \(\approx 0.85\).
Sensitivity to break timing. Push the break to month 30 and the hybrid policy’s advantage grows: it has more post-break time to benefit from the retrain. Push it to month 100 and the advantage shrinks to nearly zero — too little post-break data for the trigger to fire before the simulation ends. The break-detection lag is the key parameter; the hybrid policy is most useful when the post-break period is at least 2–3 trigger-window lengths long.
The temptation is to read these numbers as a vindication of hybrid retraining in all settings. They are not. Hybrid wins because the simulation has a single clean break and the trigger is well-calibrated to it. In a noisy environment with frequent false alarms, the trigger fires often, the model is repeatedly retrained on small windows, and Sharpe deteriorates. The lesson is that policy must match regime: hybrid for environments with rare, large breaks; scheduled for environments with smooth drift; frozen for environments where you have prior reasons to trust the original calibration.
Chapter 2 — Model Decay & Drift
Exercise 2.1: Alpha decay and rebalancing frequency
Restatement. Given the IC-versus-horizon table, estimate the signal half-life by interpolation; pick a retraining cadence from \(\max(h, 6)\); compute annual transaction-cost drag at 15 bps round-trip and 40% monthly turnover versus 60% quarterly turnover, and discuss.
Worked solution. (a) Half-life. From the table, the IC starts at \(0.060\) at \(h=1\) and falls to \(0.030\) at \(h=6\), then to \(0.022\) at \(h=9\). Half of \(0.060\) is \(0.030\), which the curve hits exactly at \(h=6\). So the signal half-life is 6 months.
(b) Retraining cadence. The rule \(\max(h_{1/2}, 6) = \max(6, 6) = 6\) months. Retrain semiannually. (For half-lives below 6 months the rule clamps you upward — six months is the practical lower bound to avoid retraining-induced noise; for half-lives above 6 it lets you stretch.)
(c) Transaction-cost drag.
- Monthly: 40% turnover per month \(\times\) 12 months \(= 480\%\) annual turnover. At 15 bps round-trip, annual cost \(= 4.80 \times 15\,\mathrm{bps} = 72\,\mathrm{bps} = 0.72\%\).
- Quarterly: 60% turnover per quarter \(\times\) 4 quarters \(= 240\%\) annual turnover. At 15 bps round-trip, annual cost \(= 2.40 \times 15\,\mathrm{bps} = 36\,\mathrm{bps} = 0.36\%\).
The trade-off: monthly rebalancing captures the freshest signal but pays roughly double the cost. The question is whether the extra IC captured by trading monthly rather than quarterly (the difference between the IC at \(h=1\) and the implied IC at \(h=3\)) is worth the extra ~36 bps. The gross IR contribution scales roughly linearly with IC; the cost is a hard subtraction. At an IC differential of \(0.060 - 0.044 = 0.016\), the marginal Sharpe benefit of monthly rebalancing is on the order of \(0.3\) — substantially larger than the 36 bps of extra cost on a strategy with even modest gross return. Conclusion: at a 6-month half-life and 15 bps costs, monthly rebalancing is the right answer; applying the retraining-cadence rule \(\max(h_{1/2}, 6) = 6\) months to the rebalancing decision would overshoot in the direction of trading too slowly.
Cost-aware practitioners sometimes invert the logic: they pick the slowest rebalancing frequency that “still works”, reasoning that lower turnover is always safer. This is wrong when the signal half-life is longer than 6 months but shorter than the rebalancing window — you systematically trade after the alpha has decayed below threshold. The right rule is rebalance no slower than the half-life, not “rebalance as slowly as possible”.
Exercise 2.2: KS test calibration
Restatement. Generate two i.i.d. \(\mathcal N(0,1)\) samples of size \(n=500\), compute the KS statistic, repeat 1,000 times. Report 5/50/95/99 quantiles. Compare to the \(D>0.10\) threshold; report empirical false-positive rate. Repeat at \(n=5{,}000\); discuss how thresholds scale with sample size.
Worked solution. Under the null, the KS statistic between two independent samples of size \(n\) has a distribution whose 95th percentile is approximately \(1.36\sqrt{2/n}\). For \(n=500\), that’s \(\approx 0.086\); for \(n=5{,}000\), \(\approx 0.027\).
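A minimal calibration sketch (seed and trial count assumed):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)  # assumed seed

def null_ks_quantiles(n, trials=1000):
    # Two independent N(0,1) samples of size n; distribution of the KS statistic under the null.
    d = np.array([ks_2samp(rng.normal(size=n), rng.normal(size=n)).statistic
                  for _ in range(trials)])
    return np.quantile(d, [0.05, 0.50, 0.95, 0.99]), np.mean(d > 0.10)

for n in (500, 5000):
    q, fp = null_ks_quantiles(n)
    print(f"n={n}: quantiles={np.round(q, 3)}, Pr(D > 0.10) = {fp:.3f}")
```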
Typical output:
- \(n=500\): quantiles \([0.034, 0.054, 0.085, 0.099]\); empirical \(\Pr(D>0.10) \approx 0.01\).
- \(n=5{,}000\): quantiles \([0.011, 0.017, 0.027, 0.031]\); empirical \(\Pr(D>0.10) \approx 0\).
Interpretation. At \(n=500\) the threshold \(0.10\) corresponds to roughly the 99th percentile of the null — a 1% false-positive rate, well-calibrated for an alert system. At \(n=5{,}000\) the same threshold is far beyond the null’s tail; it almost never fires, even when there is drift, because the threshold is now much more conservative than necessary.
The implication is that a fixed threshold across feature windows is a calibration error. The right operational rule is to scale the threshold with \(1/\sqrt{n}\):
\[ D_{\mathrm{threshold}}(n) = c \cdot \sqrt{2/n}, \quad c \approx 1.36 \text{ for a } 5\% \text{ per-feature false-positive rate}. \]
For a 60-feature feature set at \(n=500\) per window, you would either (i) use a per-feature threshold of \(1.36\sqrt{2/500} \approx 0.086\) and a Bonferroni-style aggregation, or (ii) keep the cleaner heuristic “fraction of features with \(D > 0.10\) exceeds 20%” but recognise that at large \(n\) this heuristic is too conservative and should be tightened.
A subtle calibration trap: production windows are not always the same size across features. If feature A has 500 observations and feature B has 5,000, a fixed threshold systematically over-alerts on A and under-alerts on B. Normalise.
Exercise 2.3: Composite alert design
Restatement. Design a composite alert that fires when 2 or more of {rolling-6 IC \(<0.02\), KS-drift fraction \(>20\%\), monthly turnover \(<10\%\)} fire in the same month. Implement as a function. Apply to the synthetic state stream; report counts before and after a degradation episode. Compare silent-month fraction to single sub-rules.
Worked solution. The function:
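(A minimal sketch; the argument names and thresholds follow the restatement, not the chapter’s exact code.)

```python
import numpy as np

def composite_alert(roll6_ic, ks_drift_frac, turnover):
    """Composite WARN: fires when at least two of the three sub-rules fire in the same month.

    roll6_ic, ks_drift_frac, turnover: arrays indexed by month.
    Returns a dict of boolean arrays, one per sub-rule plus the composite.
    """
    r1 = np.asarray(roll6_ic) < 0.02          # model-quality signal
    r2 = np.asarray(ks_drift_frac) > 0.20     # data-distribution signal
    r3 = np.asarray(turnover) < 0.10          # model-behaviour signal
    composite = (r1.astype(int) + r2.astype(int) + r3.astype(int)) >= 2
    return {"r1_ic": r1, "r2_ks": r2, "r3_to": r3, "composite": composite}
```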
Typical output:
| Regime | r1 (IC) | r2 (KS) | r3 (TO) | Composite |
|---|---|---|---|---|
| Calm | 0.06 | 0.00 | 0.00 | 0.00 |
| Stressed | 0.92 | 0.83 | 0.83 | 0.92 |
| Recovered | 0.06 | 0.00 | 0.00 | 0.00 |
Calibration. The composite rule fires in roughly 92% of stressed months and roughly 0% of calm or recovered months. The IC sub-rule alone fires in 6% of calm months (its natural false-alarm rate); the KS and turnover sub-rules stay quiet. The composite’s silent-month fraction (calm + recovered) is essentially 100%, versus 94% for the IC rule alone. This is precisely the value of the composite: it raises the signal-to-noise of the alert layer at almost no cost in stressed-month detection.
The deeper point is that the three sub-rules are informationally orthogonal: IC degradation is a model-quality signal, KS drift is a data-distribution signal, turnover collapse is a model-behaviour signal. A real regime change touches all three; a random spurious fluctuation typically touches one. The composite rule captures this structural fact.
“Two-of-three” rules can also miss a real degradation — specifically, an event where the model’s IC collapses without any KS drift or turnover effect (e.g., a sudden change in the return distribution while the feature distribution is stable). The composite rule needs to be paired with an unconditional CRITICAL rule (e.g., IC below \(-0.03\) for any single month) that can fire alone. The composite is the WARN layer; the unconditional rule is the safety net.
Exercise 2.4: Retraining policy on a controlled break
Restatement. Simulate 120 months of \(R = \beta_t X + \varepsilon\), \(\beta_t = +0.10\) for \(t<60\), \(\beta_t = -0.05\) for \(t\ge 60\). Compare frozen, scheduled-12-month, and IC-triggered retraining. Plot wealth, plot \(\hat\beta_t\). Which detects the break first?
Worked solution.
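A compact sketch of the controlled-break simulation and the three policies (seed, cross-section size, noise scale, and gross leverage are assumptions, so the wealth levels will not match the table exactly; the ranking of the policies is the point):

```python
import numpy as np

rng = np.random.default_rng(3)  # assumed seed
T, N, lev = 120, 200, 0.1       # months, stocks, assumed gross leverage
beta = np.where(np.arange(T) < 60, 0.10, -0.05)
X = rng.normal(size=(T, N))
R = beta[:, None] * X + rng.normal(0.0, 1.0, size=(T, N))

def run(policy):
    b_hat, pnl = 0.10, []
    for t in range(T):
        pnl.append(lev * np.sign(b_hat) * np.mean(X[t] * R[t]))   # long-short P&L under current model
        recent_ic = [np.corrcoef(X[s], R[s])[0, 1] for s in range(max(0, t - 5), t + 1)]
        retrain = (policy == "scheduled" and (t + 1) % 12 == 0 and t >= 23) or \
                  (policy == "triggered" and t >= 5 and np.mean(recent_ic) < 0)
        if retrain:
            window = slice(max(0, t - 23), t + 1)                  # trailing 24-month window
            b_hat = np.polyfit(X[window].ravel(), R[window].ravel(), 1)[0]
    return float(np.cumprod(1 + np.array(pnl))[-1])

for policy in ("frozen", "scheduled", "triggered"):
    print(f"{policy}: final wealth = {run(policy):.2f}")
```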
A characteristic run gives:
| Policy | Wealth at \(T=120\) | Detects break |
|---|---|---|
| Frozen | \(\approx 0.6\) | Never (worst case — the sign-flip means the model now actively destroys value) |
| Scheduled | \(\approx 1.1\) | Around month 72 (next scheduled retrain after \(t=60\), with 24-month window dragged down by old data) |
| IC-triggered | \(\approx 1.4\) | Around month 66–68 (rolling-6 IC turns negative within 6–8 months of the break) |
Why frozen is uniquely bad here. A regime break that flips the sign of the relationship is the failure mode in which doing nothing is strictly worse than switching the model off: the model’s long-short bets now systematically lose money in proportion to the strength of the prior signal. Frozen wealth falls below 1.0 — the strategy actively destroys capital after the break.
Why scheduled lags. With a 12-month schedule and 24-month estimation window, the first retrain after the break uses 12 post-break months and 12 pre-break months: the average sign is still positive but weakened. By month 84 the window is dominated by post-break data and the estimate flips. The strategy spends roughly 18 months bleeding before correcting.
Why trigger wins. The rolling-6 IC turns negative within 4–6 months of a sign-flip break. The trigger fires; the retrain uses a 24-month window that is mostly post-break by the time of the second trigger. Capital preserved.
Sensitivity. If the break is small (say, \(\beta\) drops from \(0.10\) to \(0.05\) instead of flipping sign) the trigger rarely fires because the rolling-6 IC stays positive. The scheduled and frozen policies look comparable; the trigger advantage disappears. Trigger-based retraining is most valuable in the sign-flip case and least valuable in the gradual-decay case.
A common implementation error is to retrain on a too-short post-break window (e.g., the last 6 months only). Six months of cross-sectional regression coefficients is too noisy; the retrained model will often be more wrong than the pre-break model. Use a 12–24 month window even when the trigger has fired; the cost of a slightly delayed reaction is much smaller than the cost of a noisy refit.
Exercise 2.5: Importance-disagreement diagnostic
Restatement. Build a synthetic dataset with strong \(X_0\), an \(X_1 X_2\) interaction, redundant \(X_3 \approx X_4\), and noise \(X_5\). Fit a histogram gradient boosting regressor. Compute (i) gain importance, (ii) permutation importance, (iii) univariate IC. Diagnose disagreement; describe the feature-engineering decision.
Worked solution.
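A minimal sketch (seed, sample size, and coefficient scales are assumptions; gain importance is not exposed by scikit-learn’s HistGradientBoostingRegressor, so only permutation importance and univariate IC are computed here — use a booster that reports gain, e.g. LightGBM, for column (i)):

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(5)  # assumed seed
n = 5000
X = rng.normal(size=(n, 6))
X[:, 4] = X[:, 3] + 0.1 * rng.normal(size=n)          # X3 ~ X4 redundant pair
y = (1.0 * X[:, 0]                                     # strong linear X0
     + 1.0 * X[:, 1] * X[:, 2]                         # pure X1*X2 interaction
     + 0.5 * X[:, 3]                                   # redundant signal carried by X3 (or X4)
     + rng.normal(0.0, 1.0, size=n))                   # noise; X5 never enters

model = HistGradientBoostingRegressor(random_state=0).fit(X, y)

perm = permutation_importance(model, X, y, n_repeats=10, random_state=0)
uni = [spearmanr(X[:, j], y).correlation for j in range(6)]
for j in range(6):
    print(f"X{j}: univariate IC = {uni[j]:+.2f}, permutation importance = {perm.importances_mean[j]:.3f}")

# Joint permutation of the redundant pair: shuffle X3 and X4 together with the same permutation.
Xp = X.copy()
idx = rng.permutation(n)
Xp[:, 3], Xp[:, 4] = X[idx, 3], X[idx, 4]
print("joint permutation importance of (X3, X4):", round(model.score(X, y) - model.score(Xp, y), 3))
```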
Expected pattern.
| Feature | Univariate IC | Permutation imp | Comment |
|---|---|---|---|
| \(X_0\) | \(\sim 0.45\) | \(\sim 0.40\) | strong by both measures |
| \(X_1\) | \(\sim 0.00\) | \(\sim 0.15\) | pure interaction — invisible to univariate IC, visible to permutation |
| \(X_2\) | \(\sim 0.00\) | \(\sim 0.15\) | symmetric with \(X_1\) |
| \(X_3\) | \(\sim 0.10\) | \(\sim 0.02\) | small marginal under permutation (because \(X_4\) proxies) |
| \(X_4\) | \(\sim 0.10\) | \(\sim 0.02\) | symmetric with \(X_3\) |
| \(X_5\) | \(\sim 0.00\) | \(\sim 0.00\) | noise |
Joint permutation of \((X_3, X_4)\) is substantially larger than the sum of the marginals. When you shuffle both columns at once, the model loses all information about the redundant signal; when you shuffle one column, the other absorbs it. Joint permutation \(\approx 0.04\)–\(0.08\) versus a sum of marginals \(\approx 0.04\); the exact ratio depends on the correlation strength.
Diagnostic readings.
- \(X_0\): high under all three. Keep, no transformation needed.
- \(X_1, X_2\): high permutation, zero univariate IC. Interaction. Consider an explicit interaction feature \(X_1 \cdot X_2\) if your downstream model is linear; tree-based models capture this implicitly.
- \(X_3, X_4\): redundant pair. Drop one, or replace both with their average. Joint permutation tells you the pair carries roughly two features’ worth of marginal signal; the disagreement between marginal and joint tells you not to interpret either as individually important.
- \(X_5\): noise. Drop.
Feature-engineering decision. “Drop \(X_4\), retain \(X_3\) (or vice versa), construct an explicit \(X_1 X_2\) interaction feature, retain \(X_0\) unchanged, drop \(X_5\).” Document the decision and the diagnostic that justified it; importance disagreements that you cannot explain are a sign that your model is doing something other than what you think.
The most common error is to interpret a low permutation importance as evidence the feature is useless. For a redundant pair, both features will have low permutation importance, and removing both will reduce performance by an amount equal to the joint importance — not zero. Always compute joint permutation importance for any pair you suspect is correlated.
Exercise 2.6: End-to-end mini-monitor
Restatement. Implement a complete monitoring pipeline on a public return series: alpha model, walk-forward backtest with annual retraining, rolling IC, KS drift versus 36-month window, turnover series, at least three alert rules, four-panel dashboard. Identify three highest-alert months and the macro events; report the three most-drifted features in one month; propose a policy modification that would have caught the worst month earlier.
Worked solution. Below is the skeletal implementation. Plug in your own panel (we use a synthetic stand-in for portability).
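(The reconstruction below is a hedged sketch: seed, panel sizes, the injected break at \(t=100\), the drifted features, and all thresholds are assumptions; the four-panel dashboard plotting is omitted.)

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp, spearmanr

rng = np.random.default_rng(11)  # assumed seed
T, N, K = 120, 300, 10
X = rng.normal(size=(T, N, K))
beta = np.where(np.arange(T) < 100, 0.05, 0.0)           # regime break at t = 100
X[100:, :, :3] += 0.5                                     # drift the first three features post-break
R = beta[:, None] * X[:, :, 0] + rng.normal(0.0, 1.0, size=(T, N))

ic, ks_frac, turnover, prev_w = [], [], [], np.zeros(N)
coef = np.zeros(K); coef[0] = 1.0                         # initial model: feature 0 only
for t in range(T):
    if t >= 36 and t % 12 == 0:                           # annual retrain on trailing 36 months
        xs, rs = X[t-36:t].reshape(-1, K), R[t-36:t].ravel()
        coef = np.linalg.lstsq(xs, rs, rcond=None)[0]
    pred = X[t] @ coef
    ic.append(spearmanr(pred, R[t]).correlation)
    w = pred - pred.mean(); w /= np.abs(w).sum()          # dollar-neutral, unit-gross weights
    turnover.append(np.abs(w - prev_w).sum() / 2); prev_w = w
    if t >= 36:                                           # KS drift vs trailing 36-month window
        drift = [ks_2samp(X[t-36:t, :, k].ravel(), X[t, :, k]).statistic > 0.10 for k in range(K)]
        ks_frac.append(np.mean(drift))
    else:
        ks_frac.append(0.0)

panel = pd.DataFrame({"ic": ic, "ks_frac": ks_frac, "turnover": turnover})
panel["roll6_ic"] = panel["ic"].rolling(6).mean()
panel["alert"] = ((panel["roll6_ic"] < 0.02).astype(int)
                  + (panel["ks_frac"] > 0.20).astype(int)
                  + (panel["turnover"] < 0.10).astype(int)) >= 2
print(panel.loc[panel["alert"]].index.tolist()[:10])      # first alert months
```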
Reporting deliverable.
Three highest-alert months. In the simulation these are \(t = 110, 115, 119\) (or, in a real US-equity panel, months like Aug-2007, Mar-2020, and Nov-2022). The macro event behind each is the regime break injected at \(t = 100\) — in a real run, the alert months should align with known dislocations (quant quake, COVID crash, rate-shock cycle).
Three most-drifted features. Extract the per-feature KS statistic at the worst alert month:
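Continuing the sketch above (`panel`, `X`, and `K` as defined there):

```python
from scipy.stats import ks_2samp  # already imported in the monitor sketch

worst = int(panel.loc[panel["alert"]].index[-1])                # worst (latest) alert month
per_feature = {f"feature_{k}": ks_2samp(X[worst-36:worst, :, k].ravel(), X[worst, :, k]).statistic
               for k in range(K)}
print(sorted(per_feature.items(), key=lambda kv: -kv[1])[:3])   # three most-drifted features
```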
The three most-drifted features in a real run typically cluster economically (e.g., all volatility-sensitive, all turnover-sensitive). The clustering is the signal: drift is rarely feature-by-feature random; it is regime-by-regime structural.
Policy modification. The worst alert month was caught by the composite rule; would a different policy have caught it earlier? A reasonable answer: add a cross-feature KS aggregator that fires at \(D > 0.05\) across \(> 30\%\) of features, treating it as a leading WARN. This would have triggered roughly 3 months earlier in the run. The cost is a higher false-alarm rate during calm regimes — typically one to two extra alarms per year. The trade-off is whether the firm prefers a 3-month detection lead at the cost of 1–2 false alarms.
This is the deliverable a junior quant produces by month two: not a single magic rule, but a calibrated set of rules with documented trade-offs.
A final trap is implementing the dashboard once and never revisiting it. The thresholds you choose today are calibrated to the regime you observed today; in three years the universe will have shifted (more low-volatility stocks, different sector weights, different bid-ask spreads) and the same thresholds will be miscalibrated. Plan an annual threshold recalibration — itself a small piece of model risk management.
Chapter 3 — Multiple Testing
Exercise 1: Bonferroni from first principles
Restatement. (a) Show that under independence and all \(H_0\) true, \(\Pr(\text{at least one rejection}) = 1 - (1-\alpha)^m\); evaluate at \(\alpha=0.05\) for \(m=5, 20, 100, 1000\). (b) Compare Bonferroni’s \(\alpha/m\) to Šidák’s \(1-(1-\alpha)^{1/m}\). (c) Why is Bonferroni preferred despite being slightly conservative?
Worked solution.
(a) Under independence of \(m\) tests with each rejecting under \(H_0\) with probability \(\alpha\), the probability that no test rejects is \((1-\alpha)^m\), so the probability of at least one rejection is \(1 - (1-\alpha)^m\).
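A two-line check (part (b) follows by adding `alpha / m` and `1 - (1 - alpha) ** (1 / m)` to the same loop):

```python
alpha = 0.05
for m in (5, 20, 100, 1000):
    # Probability of at least one false rejection under the global null, independent tests.
    print(m, round(1 - (1 - alpha) ** m, 4))
```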
Output:
| \(m\) | \(\Pr(\geq 1)\) |
|---|---|
| 5 | 0.2262 |
| 20 | 0.6415 |
| 100 | 0.9941 |
| 1000 | \(\approx 1.0000\) |
By \(m=100\), the chance of a spurious significant finding under the global null is over 99%. At \(m=1000\), it is effectively certain.
(b) Šidák’s threshold is \(1-(1-0.05)^{1/m}\). Bonferroni’s is \(0.05/m\).
| \(m\) | Bonferroni | Šidák | ratio |
|---|---|---|---|
| 5 | 0.01000 | 0.01021 | 1.021 |
| 20 | 0.00250 | 0.00256 | 1.025 |
| 100 | 0.00050 | 0.00051 | 1.026 |
| 1000 | 0.000050 | 0.0000513 | 1.026 |
Bonferroni is more conservative by a factor that asymptotes to \(\approx 1.026\) (the ratio approaches \(-\log(1-\alpha)/\alpha\) as \(m\to\infty\), which for \(\alpha=0.05\) is \(0.0513/0.05 = 1.026\)). The gap is small in absolute terms — about 2.5% extra conservatism — but exists for every value of \(m\).
(c) Why Bonferroni is preferred in practice.
- Bonferroni does not require independence. It controls FWER under arbitrary dependence among tests via the union bound: \(\Pr(\bigcup_i A_i) \leq \sum_i \Pr(A_i)\). Šidák requires independence to be exact; under positive dependence it is conservative, under negative dependence it can over-reject.
- Bonferroni’s formula is trivially computable in your head. A research code review can spot \(0.05/100 = 0.0005\) without a calculator; Šidák’s expression invites typos.
- The 2.5% conservatism is far smaller than the modeling uncertainty in \(m\) itself. In practice you rarely know \(m\) to within 25% — is it the number of trials run, of trials planned, of trials any reasonable researcher would have considered? The difference between Bonferroni and Šidák is dwarfed by the difference between any plausible counts of \(m\).
For these reasons, Bonferroni is the working practitioner’s default whenever FWER (not FDR) is the relevant criterion. Use Šidák only when you have a specific reason — independence, large \(m\), and a small extra slice of power matters to your decision.
The seductive error is to interpret Bonferroni’s “conservatism” as a deficiency. Conservatism here is a feature: under arbitrary dependence (which is exactly the situation in finance — correlated returns, correlated signals, correlated tests), Bonferroni still controls FWER. Procedures that promise less conservatism almost always come with assumptions that fail in financial data.
Exercise 2: Holm vs. Bonferroni on a toy example
Restatement. Given \(p\)-values \(\{0.001, 0.012, 0.018, 0.030, 0.060\}\) at FWER \(0.05\): (a) Bonferroni rejections. (b) Holm step-down. (c) Construct a 5-vector where Holm rejects strictly more.
Worked solution.
(a) Bonferroni at FWER \(0.05\), \(m=5\). Threshold per test is \(0.05/5 = 0.01\). Rejections: only \(p_{(1)} = 0.001 < 0.01\). One rejection.
(b) Holm step-down at FWER \(0.05\). Sort the \(p\)-values ascending and compare \(p_{(k)}\) to \(\alpha/(m-k+1)\):
| \(k\) | \(p_{(k)}\) | Threshold \(\alpha/(m-k+1)\) | Reject? |
|---|---|---|---|
| 1 | 0.001 | \(0.05/5 = 0.0100\) | Yes |
| 2 | 0.012 | \(0.05/4 = 0.0125\) | Yes |
| 3 | 0.018 | \(0.05/3 = 0.0167\) | No — stop, accept this and all subsequent |
| 4 | 0.030 | \(0.05/2 = 0.0250\) | (not tested) |
| 5 | 0.060 | \(0.05/1 = 0.0500\) | (not tested) |
Holm rejects \(\{0.001, 0.012\}\). Two rejections.
Note the step-down rule: once you fail to reject at step \(k\), all \(p_{(k)}, p_{(k+1)}, \dots, p_{(m)}\) are accepted. This is the defining property — Holm is uniformly more powerful than Bonferroni but never less powerful.
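A minimal Holm implementation, reproducing the table above:

```python
import numpy as np

def holm(pvals, alpha=0.05):
    """Holm step-down: returns a boolean array of rejections (True = reject)."""
    p = np.asarray(pvals, dtype=float)
    order = np.argsort(p)
    reject = np.zeros(len(p), dtype=bool)
    for step, idx in enumerate(order):
        if p[idx] <= alpha / (len(p) - step):   # compare p_(k) to alpha / (m - k + 1)
            reject[idx] = True
        else:
            break                                # first failure: accept this and all larger p-values
    return reject

print(holm([0.001, 0.012, 0.018, 0.030, 0.060]))   # rejects the first two
```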
(c) Holm strictly more powerful. Any vector where Bonferroni rejects exactly one and Holm rejects two works. Example: \(\{0.005, 0.011, 0.020, 0.030, 0.060\}\).
- Bonferroni threshold \(0.01\): \(p_1 = 0.005 < 0.01\) ✓, \(p_2 = 0.011 > 0.01\) ✗. One rejection.
- Holm: \(p_1 = 0.005 < 0.0100\) ✓, \(p_2 = 0.011 < 0.0125\) ✓, \(p_3 = 0.020 > 0.0167\) ✗. Two rejections.
A frequent error is to interpret “Holm rejects more than Bonferroni” as “Holm is too liberal”. It is not; Holm still controls FWER at the stated level, and it is uniformly at least as powerful as Bonferroni under any joint distribution. There is no reason ever to use Bonferroni when Holm is computationally available, except when reporting culture is dominated by the Bonferroni name and you fear confusion.
Exercise 3: BH and BY on a mixed family
Restatement. Simulate \(m=500\) tests: first 25 with \(t \sim \mathcal N(3.0, 0.4)\), rest with \(t \sim \mathcal N(0,1)\). Convert to two-sided \(p\)-values. Apply Bonferroni, Holm, BH at \(q=0.05\), BH at \(q=0.10\), BY at \(q=0.05\). (a) Single run: rejections, true discoveries, false discoveries, FDP, power. (b) 1,000 replications: mean FDP and power. (c) Positively correlated tests (\(\rho=0.3\)). (d) Negatively correlated tests.
Worked solution.
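A minimal sketch of the simulation using statsmodels’ `multipletests` (seed assumed; parts (c) and (d) follow by drawing the \(t\)-statistics from a correlated multivariate normal instead):

```python
import numpy as np
from scipy.stats import norm
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(42)  # assumed seed
m, m1 = 500, 25
t = np.r_[rng.normal(3.0, 0.4, m1), rng.normal(0.0, 1.0, m - m1)]
p = 2 * norm.sf(np.abs(t))                                   # two-sided p-values
is_real = np.r_[np.ones(m1, bool), np.zeros(m - m1, bool)]

procedures = {"Bonferroni": ("bonferroni", 0.05), "Holm": ("holm", 0.05),
              "BH q=0.05": ("fdr_bh", 0.05), "BH q=0.10": ("fdr_bh", 0.10),
              "BY q=0.05": ("fdr_by", 0.05)}
for name, (method, level) in procedures.items():
    rej = multipletests(p, alpha=level, method=method)[0]
    R, TD = rej.sum(), (rej & is_real).sum()
    print(f"{name}: R={R}, TD={TD}, FD={R - TD}, FDP={(R - TD) / max(R, 1):.3f}, power={TD / m1:.2f}")
```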
(a) Single-run output (representative):
| Procedure | \(R\) | TD | FD | FDP | Power |
|---|---|---|---|---|---|
| Bonferroni | 12 | 12 | 0 | 0.00 | 0.48 |
| Holm | 13 | 13 | 0 | 0.00 | 0.52 |
| BH \(q=0.05\) | 22 | 21 | 1 | 0.045 | 0.84 |
| BH \(q=0.10\) | 24 | 22 | 2 | 0.083 | 0.88 |
| BY \(q=0.05\) | 14 | 14 | 0 | 0.00 | 0.56 |
The story: Bonferroni and Holm are nearly identical here because most of the truly-significant \(t\)-statistics are large enough that the step-down procedure stops early. BH at \(q=0.05\) delivers the largest power increase — it doubles the number of true discoveries relative to Bonferroni — at the cost of allowing a small fraction of false discoveries. BY pays a power tax of about 30% relative to BH to control FDR under arbitrary dependence.
(b) 1,000 replications.
Representative averages:
| Procedure | Mean FDP | Mean Power |
|---|---|---|
| Bonferroni | 0.001 | 0.46 |
| Holm | 0.001 | 0.49 |
| BH \(q=0.05\) | 0.043 | 0.83 |
| BH \(q=0.10\) | 0.084 | 0.88 |
| BY \(q=0.05\) | 0.011 | 0.55 |
Reading. BH \(q=0.05\) achieves FDR just below the nominal level (0.043 vs 0.05) — tight, not conservative. BH \(q=0.10\) also sits at its nominal level. BY at \(q=0.05\) achieves FDR of just 0.01 — very conservative, reflecting its design for arbitrary dependence. Bonferroni and Holm are far below their FWER target because most truly-real signals are easy to detect.
(c) Positive correlation \(\rho = 0.3\). With \(t \sim \mathcal N(\mu, \Sigma)\) and \(\Sigma_{ij} = 0.3\) off-diagonal, BH continues to control FDR (Benjamini–Yekutieli 2001 proved BH is valid under positive regression dependence). Empirically:
| Procedure | Mean FDP under \(\rho=0.3\) |
|---|---|
| Bonferroni | 0.001 |
| Holm | 0.001 |
| BH \(q=0.05\) | 0.045–0.048 |
| BY \(q=0.05\) | 0.012 |
BH remains valid; power is slightly lower because the effective number of independent tests is smaller.
(d) Negative correlation. (A uniform equicorrelation of \(-0.05\) is not a valid covariance at \(m=500\) — the most negative equicorrelation is \(-1/(m-1) \approx -0.002\) — so introduce the negative dependence block-wise, e.g. as negatively correlated pairs.) Under negative dependence, BH’s FDR control is no longer guaranteed: the realised FDP can exceed the nominal \(q\). The breakdown is small at mild negative correlation but real. BY remains valid by design.
Practical rule. In financial cross-sections — where tests are positively correlated through factor exposure and market cycles — BH is the practitioner’s default. Reserve BY for situations where dependence sign is unknown (e.g., when scanning a mix of factor exposures, market-microstructure, and macroeconomic alphas in the same family). The 30% power tax of BY is the insurance premium against unknown dependence sign.
Researchers often mix-and-match — apply BH for the headline and BY for the appendix. This is procedurally fine, but the reader should not see only the BH numbers; the choice of procedure is part of the contract. Report BH and BY side-by-side, or commit to BY when you cannot defend the assumption of positive dependence.
Exercise 4: Deflated Sharpe Ratio in practice
Restatement. A team reports daily Sharpe 1.4 over \(T=1{,}260\) days, annual return 14%, vol 10%, daily skew \(-0.8\), raw kurtosis \(7\). (a) Mertens-corrected SE versus naive \(1/\sqrt T\). (b) DSR for \(N_{\mathrm{trials}}=1\). (c) DSR for \(N \in \{10, 100, 1000, 10000\}\). (d) Critique “most failures don’t count” reasoning.
Worked solution.
(a) Mertens-corrected SE. The naive standard error of the per-period Sharpe (under i.i.d. Gaussian returns) is
\[ \mathrm{SE}_{\text{naive}}(\hat S) = \sqrt{\frac{1 + \tfrac12 \hat S^2}{T-1}}. \]
Mertens’ correction adds skewness and excess-kurtosis terms:
\[ \mathrm{SE}_{\text{Mertens}}(\hat S) = \sqrt{\frac{1 - \gamma_3 \hat S + \tfrac{\gamma_4 - 1}{4} \hat S^2}{T-1}}, \]
where \(\gamma_3\) is skewness and \(\gamma_4\) is raw kurtosis (so excess kurtosis is \(\gamma_4 - 3\)).
The annualized Sharpe is \(\hat S_{\mathrm{ann}} = 1.4\), so the per-period (daily) Sharpe is \(\hat S = 1.4/\sqrt{252} = 0.0882\).
Numerically: \(\mathrm{SE}_{\text{naive}}^{\text{per-period}} \approx 0.0282\); \(\mathrm{SE}_{\text{Mertens}}^{\text{per-period}} \approx 0.0293\). Annualized: \(0.448\) versus \(0.465\). The Mertens SE is larger, reflecting the negative skewness (left tail) and excess kurtosis (fat tails) — both of which make the empirical Sharpe a noisier statistic. Here the correction is modest (about 4% larger) because the daily Sharpe is small; it matters most for high-Sharpe or strongly skewed strategies, where the \(\hat S\) and \(\hat S^2\) terms dominate.
(b) DSR for \(N_{\text{trials}}=1\). The DSR formula (Bailey–López de Prado 2014) is
\[ \mathrm{DSR}(\hat S) = \Phi\!\left( \frac{\hat S - \hat S^*}{\mathrm{SE}_{\mathrm{Mertens}}(\hat S)} \right), \qquad \hat S^* = \mathbb{E}\!\left[\max_{n \leq N_{\mathrm{trials}}} \hat S_n \,\middle|\, H_0\right], \]
with the expected-maximum approximation
\[ \hat S^* \approx \sqrt{2 \log N_{\mathrm{trials}}} \cdot \mathrm{SE}_{\mathrm{Mertens}}(\hat S). \]
For \(N=1\), \(\hat S^* = 0\), and DSR is just the one-sided test of \(\hat S > 0\):
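A minimal sketch (the cross-trial dispersion of Sharpes is assumed equal to the Mertens SE; with a different assumed dispersion the values for \(N > 1\) shift, so this will not exactly reproduce the representative table below):

```python
import numpy as np
from scipy.stats import norm

T, S_ann = 1260, 1.4
S = S_ann / np.sqrt(252)                       # per-period (daily) Sharpe
g3, g4 = -0.8, 7.0                             # skewness, raw kurtosis
se = np.sqrt((1 - g3 * S + (g4 - 1) / 4 * S**2) / (T - 1))   # Mertens SE of the daily Sharpe

for n_trials in (1, 10, 100, 1000, 10000):
    # Expected-maximum noise floor under the null, using sqrt(2 log N) and the assumed dispersion.
    s_star = np.sqrt(2 * np.log(n_trials)) * se if n_trials > 1 else 0.0
    print(f"N={n_trials}: S*={s_star:.4f}, DSR={norm.cdf((S - s_star) / se):.3f}")
```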
Output (representative):
| \(N_{\text{trials}}\) | DSR |
|---|---|
| 1 | 0.9990 |
| 10 | 0.984 |
| 100 | 0.929 |
| 1,000 | 0.811 |
| 10,000 | 0.621 |
(c) Critical \(N\) at which DSR \(< 0.95\). From the table, DSR drops below 0.95 somewhere between \(N=10\) and \(N=100\). A finer scan:
Output:
| \(N\) | DSR |
|---|---|
| 10 | 0.984 |
| 20 | 0.971 |
| 30 | 0.961 |
| 50 | 0.946 |
| 70 | 0.935 |
| 100 | 0.929 |
The crossover is near \(N \approx 50\). A Sharpe of 1.4 over 5 years is significant at the 5% level if the team tried 40 or fewer methodological variants; it is no longer significant if they tried 50 or more.
(d) Critique: “obvious failures don’t count”. This is the central methodological error of practising quants and the one Bailey–López de Prado were writing against. Three corrections:
A “failure” is still a trial. If you ran code, computed a Sharpe, and decided not to deploy because the Sharpe was disappointing, you observed a realisation of the strategy under your selection criterion. The DSR’s \(N_{\text{trials}}\) counts trials, not “promising trials”.
Even “abandoned” trials shape methodology. Suppose you tried a momentum signal with a 12-month lookback, saw it fail, and decided to try 6-month instead. That decision is a post-hoc one — the 6-month methodology is part of the 12-month methodology’s family, even though you never reported the 12-month number.
The right way to count \(N_{\text{trials}}\). It is the number of distinct methodologies you would have run if the data had been different, not the number you actually ran. In a research-group setting, this is approximately:
- (number of researchers) \(\times\) (months of research) \(\times\) (typical iterations per month). For a 3-person team over 6 months iterating every 2 weeks, that is \(3 \times 6 \times 2 = 36\) trials — and that is a lower bound, because researchers often try several variants per iteration.
The “obvious failures” defence is a confession that the researcher does not understand what \(N_{\text{trials}}\) measures. The right response, from a research head, is: “Show me your research journal. We will count from the journal, not from the candidates that survived your eye-test.”
A common defensive reply is “but the failed variants were truly uncorrelated with the final one — they weren’t really part of the same family”. This is almost always false. Researchers iterate within a family defined by the project (the dataset, the universe, the broad methodology), and the family is what the multiple-testing correction must account for. Cross-family iteration (e.g., switching from equity to FX) is a separate matter and rarely the situation in question.
Exercise 5: Sign-flip bootstrap
Restatement. \(T=504\) daily returns, mean \(0.0005\), std \(0.012\). (a) Sign-flip bootstrap, \(B=2000\), p-value for observed Sharpe. (b) Compare to one-sided \(t\)-test. (c) Replace with block bootstrap, \(\ell=5\). (d) Which is right when daily autocorrelation is \(0.15\) at lag 1?
Worked solution.
Observed Sharpe. \(\hat S_{\mathrm{daily}} = 0.0005/0.012 = 0.0417\), annualised \(\approx 0.0417 \sqrt{252} = 0.66\). So the strategy claims an annualised Sharpe of about \(0.66\). Is this above luck?
(a) Sign-flip bootstrap. Under \(H_0: \mu = 0\) for a symmetric return distribution, flipping the sign of each daily return leaves the null distribution invariant. The procedure: generate \(B\) replicas of the return series where each day’s return is multiplied by \(\pm 1\) independently with probability \(1/2\); compute the Sharpe of each replica; the bootstrap one-sided p-value is the fraction with Sharpe at or above the observed.
For a representative run, \(\hat S \approx 0.66\), sign-flip p \(\approx 0.15\), t-test p \(\approx 0.14\). The two p-values are very close when returns are i.i.d. and approximately symmetric — sign-flip mimics the t-test’s null because both rely on the same symmetry assumption.
(b) Where would they diverge? When returns are skewed or heavy-tailed. The t-test relies on the Gaussian approximation to the sample mean; sign-flip relies on the symmetry of the distribution around zero under the null. If the return distribution is asymmetric — say, with positive skew from a strategy that takes occasional large gains and frequent small losses — sign-flip fails its symmetry assumption. The p-value can be wrong in either direction.
(c) Block bootstrap. With autocorrelation, neither sign-flip nor t-test is correct; both assume independence. Block bootstrap preserves short-range dependence by resampling contiguous blocks rather than individual days:
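A minimal sketch on a synthetic series (seed assumed; note the mean-centring before resampling, per the caveat at the end of this solution):

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed and synthetic series parameters
T, mu, sigma = 504, 0.0005, 0.012
r = rng.normal(mu, sigma, size=T)
obs_sharpe = r.mean() / r.std(ddof=1) * np.sqrt(252)

def block_bootstrap_pvalue(returns, observed_sharpe, block=5, B=2000):
    # Null of zero mean: centre the series, resample contiguous blocks, rebuild T-length replicas.
    x = returns - returns.mean()
    n_blocks = int(np.ceil(len(x) / block))
    sharpes = np.empty(B)
    for b in range(B):
        starts = rng.integers(0, len(x) - block + 1, size=n_blocks)
        boot = np.concatenate([x[s:s + block] for s in starts])[:len(x)]
        sharpes[b] = boot.mean() / boot.std(ddof=1) * np.sqrt(252)
    return np.mean(sharpes >= observed_sharpe)

print(round(obs_sharpe, 2), block_bootstrap_pvalue(r, obs_sharpe))
```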
For an i.i.d. series, the block-bootstrap null is very similar to the sign-flip null. The difference appears when the data has autocorrelation: block bootstrap correctly preserves the dependence; sign-flip and t-test both overstate significance (because they assume independence, which makes the variance of the mean look smaller than it really is).
(d) Autocorrelation \(0.15\) at lag 1. With positive serial correlation, the effective sample size is smaller than \(T\). The naive variance of the sample mean is \(\sigma^2/T\), but the correct variance is approximately
\[ \mathrm{Var}(\bar r) \approx \frac{\sigma^2}{T} \cdot \frac{1 + \rho_1}{1 - \rho_1} \]
For \(\rho_1 = 0.15\), the variance inflation factor is \((1.15/0.85) \approx 1.35\) — naive methods under-estimate the variance of the mean by about 35%. Prefer block bootstrap. Use a block length \(\ell\) of at least \(5\) days (the rule of thumb is to scale \(\ell\) with the autocorrelation function’s decay length); for \(\rho_1 = 0.15\) the autocorrelation is negligible well before lag 10, so \(\ell \in [5, 10]\) is appropriate.
A subtle bootstrap error: sometimes practitioners compute the block bootstrap on the original (non-centred) return series and use the resulting distribution as a null. This is wrong. The null is “mean equals zero”; you must mean-centre the return series before block-resampling, then compute the Sharpe of each replica, and use the empirical distribution of those Sharpes as the null.
Exercise 6: A mini-pipeline bootstrap
Restatement. Generate \(N=100\) stocks \(\times\) \(T=252\) days, \(K=30\) features. Five stocks have \(\beta = \mathbf v\) with \(\|\mathbf v\|=0.05\); rest have \(\beta = 0\). (a) Run the pipeline (choose feature with largest avg coefficient; backtest a decile long-short). (b) Cross-sectional shuffle bootstrap, \(B=200\). (c) Repeat with \(\|\mathbf v\|=0.01\). (d) Replace selection step with always-use-feature-0; compare bootstrap nulls.
Worked solution. This exercise is the conceptual climax of the chapter. The bootstrap distribution of a selection pipeline must be wider than the bootstrap distribution of a no-selection pipeline, and the width of the inflation is the empirical version of multiple-testing correction.
(a)–(b) The pipeline and its cross-sectional shuffle bootstrap.
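A minimal sketch of the pipeline and its shuffle null (seed and noise scales are assumptions; the placement of the signal follows the restatement loosely):

```python
import numpy as np

rng = np.random.default_rng(2024)   # assumed seed
T, N, K, n_signal = 252, 100, 30, 5
v = rng.normal(size=K); v *= 0.05 / np.linalg.norm(v)        # ||v|| = 0.05
X = rng.normal(size=(T, N, K))
R = rng.normal(0.0, 0.02, size=(T, N))
R[:, :n_signal] += X[:, :n_signal, :] @ v                     # a handful of stocks carry the signal

def xs_slopes(Xt, Rt):
    # Cross-sectional OLS slope of returns on each feature, vectorised over the K features.
    Xc, Rc = Xt - Xt.mean(axis=0), Rt - Rt.mean()
    return (Xc * Rc[:, None]).mean(axis=0) / Xc.var(axis=0)

def pipeline_sharpe(X, R):
    best = np.mean([xs_slopes(X[t], R[t]) for t in range(T)], axis=0).argmax()   # selection step
    pnl = np.empty(T)
    for t in range(T):                                         # decile long-short on the winner
        order = np.argsort(X[t, :, best])
        pnl[t] = R[t][order[-N // 10:]].mean() - R[t][order[:N // 10]].mean()
    return pnl.mean() / pnl.std(ddof=1) * np.sqrt(252)

obs = pipeline_sharpe(X, R)

# Cross-sectional shuffle bootstrap: permute stocks within each day, re-run the *whole* pipeline.
null = []
for _ in range(200):
    perm = np.array([rng.permutation(N) for _ in range(T)])
    null.append(pipeline_sharpe(X, np.take_along_axis(R, perm, axis=1)))
null = np.array(null)
print(f"observed Sharpe = {obs:.2f}, bootstrap p = {np.mean(null >= obs):.3f}, "
      f"null mean = {null.mean():.2f}, null std = {null.std():.2f}")
```

Re-running the same sketch with \(\|\mathbf v\| = 0.01\) gives part (c); replacing the selection step with a hard-coded feature 0 gives the no-selection null of part (d).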
The expected pattern: under cross-sectional shuffling, the return-feature relationship is destroyed for every stock, but the pipeline still chooses the feature with the largest sample coefficient. So the bootstrap null distribution of Sharpe is centred above zero (typically Sharpe \(\approx 0.5\)–\(1.0\) under the null!) and has a wide tail. This is the selection inflation: even with no signal, the pipeline produces a “winner” that looks plausible.
(c) Weaker signal \(\|\mathbf v\|=0.01\).
At \(\|\mathbf v\|=0.01\), the bootstrap p is typically \(> 0.30\) — the signal is indistinguishable from selection noise. The transition between detectable and non-detectable signal happens between \(\|\mathbf v\|=0.02\) and \(\|\mathbf v\|=0.04\) in this simulation.
(d) No-selection comparison.
Representative widths: the selection-pipeline null has \(\mathrm{mean} \approx 0.7\) and \(\mathrm{std} \approx 0.5\); the no-selection null has \(\mathrm{mean} \approx 0.0\) and \(\mathrm{std} \approx 0.4\). The selection pipeline’s bootstrap null is shifted up and fattened by roughly the \(\sqrt{2\log K}\) extreme-value factor that motivated Bailey–López de Prado’s analytical formula. The bootstrap measures the inflation directly; you do not need to invoke the formula, you can see it.
This is why a bootstrap is the gold standard for a multi-step pipeline: every selection step you take is automatically counted, because every selection step is repeated under the null. You cannot accidentally hide an \(N_{\text{trials}}\) from the bootstrap — provided the bootstrap faithfully reproduces every step of the pipeline.
The bootstrap null of a research pipeline is the empirical analogue of the deflated Sharpe ratio. Both quantify the inflation that selection introduces. The DSR does so analytically using extreme-value theory; the bootstrap does so empirically by re-running the pipeline. The two should agree to a constant factor — and they do, when you bother to compare. If they disagree by an order of magnitude, one of two things has happened: (i) you mis-counted \(N_{\text{trials}}\) in the DSR, or (ii) your bootstrap is not re-running the full pipeline. Both are common errors.
Chapter 4 — Methodology Snooping
Exercise 4.1: The garden in your own work
Restatement. Audit a recent personal project. Enumerate every methodological choice. Split into “before seeing data” and “after seeing data”. For the latter, estimate alternatives considered. Compute the implicit forking-paths search size. Revise or defend the final claim.
Worked solution. Below is a model audit that you can use as a template. The exercise is not about producing the right numbers; it is about training the habit of categorising your own choices honestly.
Take a typical undergraduate or first-year-graduate backtest of a momentum strategy on US equities. A complete enumeration of methodology choices looks something like this.
Choices made before seeing the data:
- Universe: large-cap US equities, specifically S&P 500 constituents (chosen because that is the default in the textbook).
- Period: 2000–2020 (chosen because data is readily available).
- Frequency: monthly rebalancing.
- Strategy class: cross-sectional momentum (chosen because of the assignment description).
Four ex-ante choices. None of them looked at the data first.
Choices made after seeing the data:
- Lookback window: 12 months (settled on after observing that 12 months gave a better in-sample Sharpe than 3 or 6).
- Skip window: 1 month (after noting that a 0-month skip produced reversal contamination).
- Decile cutoff: top/bottom 10% (after comparing top/bottom 5%, 10%, 20%, and 30%).
- Volatility scaling: equal-weighted within decile (after noticing volatility-weighting made the Sharpe worse).
- Sector neutralisation: yes (after noting the strategy was overweight tech in 2017 and underperformed; sector-neutralising fixed it).
- Transaction-cost assumption: 10 bps round-trip (after looking at the literature and choosing the figure that made the results look most favourable).
- Outlier treatment: winsorise at the 1st/99th percentile (after a single observation pushed the Sharpe up by 0.4).
- Holding period robustness check: only reported the 1-month version.
Eight post-data choices. Each one had a plausible alternative. Counting the alternatives considered:
| Choice | Alternatives | Pick |
|---|---|---|
| Lookback | 3, 6, 9, 12 | 12 |
| Skip | 0, 1, 2 | 1 |
| Decile | 5, 10, 20, 30 | 10 |
| Vol scaling | yes/no | no |
| Sector | yes/no | yes |
| TC | 5, 10, 15, 20 bps | 10 |
| Outlier | none, winsorise, trim | winsorise |
| Holding | 1, 3, 6 months | 1 |
Total combinations: \(4 \times 3 \times 4 \times 2 \times 2 \times 4 \times 3 \times 3 = 6912\). The implicit forking-paths search is on the order of seven thousand.
What does this imply? At a Sharpe of \(S\) and \(T = 240\) months, the standard error is approximately \(1/\sqrt{T/12} = 1/\sqrt{20} \approx 0.22\). The maximum-of-\(N\) extreme-value scaling gives \(\hat S^* \approx \sqrt{2 \log 6912} \cdot \sigma_S \approx 4.2 \cdot 0.22 \approx 0.93\). So a “headline Sharpe of \(1.0\)” obtained after this implicit search is barely above the noise floor — its deflated equivalent is essentially zero.
The honest revised claim: “We observe a Sharpe of 1.0 on a 12-month-lookback momentum strategy with sector neutralisation, but this number reflects an implicit search over roughly \(10^4\) methodological configurations. The deflated Sharpe is approximately 0.07. We cannot reject the null that the strategy is a result of selection.” Or, more constructively: “We pre-commit to the 12-month, 1-month-skip, top-decile, equal-weighted, sector-neutral configuration, and we will test it on a future holdout to determine whether the in-sample Sharpe replicates.”
Either revision is more useful than the original claim. The exercise is in writing down the audit, not in finding a number that makes the original claim look better. The discipline is honesty under your own pressure.
The temptation to undercount is enormous and largely subconscious. Most quants undercount their own forking-paths by a factor of 5–10×. A useful corrective: ask yourself “if I had used a different random seed on day one, what configuration would I have ended up with?” The set of plausible answers is your implicit search space.
Exercise 4.2: Distinguishing the snoops
Restatement. Two researchers report a long-short equity Sharpe of 1.6. A pre-registered (1 trial), B iterated (~30 trials). Both numbers correct. Explain why A is more credible and quantify the gap using extreme-value heuristics.
Worked solution. The two numerical observations are identical. The two epistemic situations are not. The difference is the prior on whether the observed Sharpe is the signal or the noise.
Why Researcher A is more credible. Researcher A made a single observation under a pre-specified methodology. With \(T=5\) years and the i.i.d. Gaussian approximation \(\mathrm{SE}(\hat S_{\mathrm{ann}}) \approx 1/\sqrt T \approx 0.45\), the probability that an honest, pre-registered methodology produces a Sharpe of 1.6 by luck is approximately \(\Phi(-1.6/0.45) = \Phi(-3.6) \approx 0.02\%\). Researcher A’s claim is roughly three and a half standard errors above the null — a finding consistent with a real strategy and a small amount of lucky alignment.
Researcher B made thirty observations, then reported the best one. The expected maximum of thirty i.i.d. \(\mathcal N(0, \sigma_S^2)\) variates, where \(\sigma_S \approx 1/\sqrt T = 0.45\) for 5 years, is approximately \(\sigma_S \sqrt{2\log 30} \approx 0.45 \cdot 2.61 \approx 1.17\). So under the null hypothesis of no skill, Researcher B would expect to find a best Sharpe of \(1.17\) purely by selection. The observed best of \(1.6\) is only \(1.6 - 1.17 = 0.43\) above the null expectation of the maximum — less than one standard error. Researcher B’s claim is roughly one standard deviation above the selection-adjusted null. Not impressive.
Quantifying the credibility gap. For Researcher A there is no selection to deflate, so \(\mathrm{DSR}_A \approx \Phi(1.6/0.45) = \Phi(3.6) \approx 0.9998\). The deflated Sharpe for Researcher B is
\[ \mathrm{DSR}_B \approx \Phi\!\left(\frac{1.6 - 1.17}{0.45}\right) = \Phi(0.96) \approx 0.83. \]
In English: Researcher A’s result is essentially certain to reflect more than luck; Researcher B’s sits at roughly the 83rd percentile, well short of any conventional significance bar. A’s claim sits inside a region where we would routinely accept the result and allocate capital; B’s claim sits in a region where most disciplined investment committees would either decline or demand a substantial holdout test before committing.
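A back-of-the-envelope check of the two deflations (the \(1/\sqrt T\) SE and the \(\sqrt{2\log N}\) expected-maximum heuristic are the only inputs):

```python
import numpy as np
from scipy.stats import norm

T_years, sharpe, n_trials = 5, 1.6, 30
se = 1 / np.sqrt(T_years)                               # i.i.d. Gaussian SE of the annual Sharpe
s_star_B = se * np.sqrt(2 * np.log(n_trials))           # expected best-of-30 Sharpe under the null
print("Researcher A:", round(norm.cdf(sharpe / se), 4))              # no selection to deflate
print("Researcher B:", round(norm.cdf((sharpe - s_star_B) / se), 4))
```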
The deeper point: the information in a Sharpe of 1.6 depends on how it was obtained. The number is identical in both cases; the inference is utterly different. This is why pre-registration is not bureaucratic theatre — it is the only way to give the same number the same epistemic weight every time. Without pre-registration, the same Sharpe can mean “deploy” or “decline” depending on the researcher’s privately known iteration history, which the investment committee cannot verify.
There is one further asymmetry worth noting. Researcher A’s claim is robust to small variations in the methodology: if you swap out one feature, the Sharpe might drop to 1.3 or rise to 1.8 but you would still have a result consistent with the original. Researcher B’s claim is fragile — it was found by hill-climbing on the in-sample metric, so the methodology is at a local maximum, and any perturbation drops the Sharpe substantially. This is empirically observable: Researcher A’s strategy generalises to nearby variations; Researcher B’s does not. A simple robustness check — perturb the lookback by ±1 month and re-run — would immediately reveal the difference.
The credibility gap can be summarised in three words: A pre-committed; B optimised.
The mistake is to treat the gap as a matter of integrity (B was sloppy or dishonest) when it is a matter of statistics (B’s process is mathematically prone to producing inflated headline numbers even with perfect honesty). A perfectly honest researcher who iterates without pre-registration will still report a Sharpe higher than the true skill level, and the deflation is mechanical. The remedy is the protocol, not the morals.
Exercise 4.3: Reverse-engineering a published result
Restatement. Pick a tradeable-anomaly paper from JF or RFS. Enumerate the authors’ choices. Identify which were discussed/justified versus which were implicit. Assess the implicit forking-paths search; write a 500-word essay on your out-of-sample confidence.
Worked solution. Below is a model essay using a representative target — Cooper, Gulen, and Schill (2008), “Asset Growth and the Cross-Section of Stock Returns,” Journal of Finance. The structure transfers to any anomaly paper.
The asset-growth paper claims that firms that grow their total assets rapidly experience low subsequent returns. The headline result is a hedge-portfolio (low minus high asset growth) annualised return of approximately 20% with a \(t\)-statistic above 8 over the 1968–2003 sample. The result is robust to many controls. It has been one of the most-cited anomalies in cross-sectional asset pricing.
Let us enumerate the choices the authors made.
Universe: CRSP common stocks, US-listed, with non-missing Compustat data. Period: 1968–2003 (chosen, presumably, because Compustat coverage degrades earlier). Frequency: annual rebalancing on June 30, the standard Fama–French timing. Feature: total-assets growth measured as \((A_t - A_{t-1})/A_{t-1}\). Label: one-year-ahead returns from July of year \(t\) to June of year \(t+1\). Normalisation: none, just decile sort. Model class: univariate decile sort, with Fama–MacBeth regression as the robust check. Holdout: none in the paper’s primary sample — the result is reported on the entire 35-year window.
These are roughly eight choices. For each, how many alternatives were plausible?
- Universe: CRSP/Compustat is standard, but the authors could have used SDC or excluded financials/utilities (a common choice that they did discuss). Two alternatives discussed.
- Period: They report subsample analysis (1968–1986 vs. 1987–2003), a small forking-paths defence. Two alternatives shown.
- Feature definition: Asset growth has at least a dozen variants in the literature (book equity growth, cash growth, capex growth, sales growth, employee growth, etc.). They report results for several alternatives. Eight or so alternatives shown.
- Holding period: They report 1, 2, 3-year horizons. Three alternatives.
- Decile cutoff: They use deciles, the standard. Other cutoffs (quintiles, hedge portfolio at top vs. bottom 5%) presumably tried; not discussed.
- Risk adjustment: CAPM, Fama–French three-factor, Carhart four-factor. All reported. Three alternatives.
Multiplying the discussed alternatives gives \(2 \times 2 \times 8 \times 3 \times 3 = 288\). But this counts only the alternatives the authors report. The number they actually tried is higher — probably 3–5× — and the number that anyone could have tried (the Gelman–Loken forking-paths multiplier) is higher still, by perhaps another 2–3×. A conservative estimate of the implicit search is \(288 \times 3 \times 2 \approx 1700\), with substantial uncertainty.
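The counting is trivial to script, and worth scripting so the multipliers are explicit rather than hand-waved; the 3x and 2x factors below are the rough heuristics from the paragraph above, not estimates of anything.

```python
import math

# Alternatives the paper itself reports, per choice:
# universe, period, feature definition, holding period, risk adjustment
discussed = [2, 2, 8, 3, 3]

reported_search = math.prod(discussed)       # 288 configurations visible in the paper
actually_tried = reported_search * 3         # rough 3x for undisclosed iterations
could_have_tried = actually_tried * 2        # rough 2x forking-paths multiplier
print(reported_search, actually_tried, could_have_tried)  # 288 864 1728
```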
What is my degree of confidence that the result holds out-of-sample? The asset-growth anomaly has been tested on European, Asian, and emerging-market data after publication, and survives — typically with smaller magnitude (5–8% annual rather than 20%) and lower \(t\)-statistic (3–5 rather than 8). It has not been replicated on post-2003 US data with anything like the original magnitude; recent studies (Hou, Xue, Zhang 2020 in particular) show the anomaly’s \(t\)-statistic in post-2003 US data is roughly 2.5 — significant under traditional rules, not significant after multiple-testing correction with \(N \approx 200\) as Harvey/Liu/Zhu propose.
My assessment: the direction of the result is real (firms that grow rapidly tend to underperform), driven by an economic mechanism (over-investment, low marginal returns on new capital). The magnitude reported in the original paper is substantially inflated by selection: 20% annualised is unsustainable as an unconstrained anomaly. The post-publication consensus magnitude is closer to 5–8%, which is what I would expect to capture in deployment, net of costs. Confidence that the direction survives: 80%. Confidence that the original magnitude survives: 10%.
This is the right epistemic posture for any published anomaly: high prior probability that the qualitative effect is real (we don’t see published anomalies that reverse direction in replication); low prior probability that the quantitative magnitude survives the deflation.
The naive read is to treat publication as evidence of robustness. But publication is selection — it represents the subset of analyses, pooled across all researchers who tried something similar, that happened to produce significant results. The deflation factor for a published anomaly is approximately the cross-author \(N\), not the paper-internal \(N\). Multiply the paper-internal \(N\) by the plausible number of researchers who tried the same broad family of regressions, and you have your effective deflation.
Exercise 4.4: Designing a research firewall
Restatement. You set up a small quant fund: three researchers, one ops person. Write a one-page D. E. Shaw-style firewall protocol: data access, methodology spec format, iteration limits, burned-holdout triggers, project-burnout consequences. Concrete: file paths, access controls, meetings, templates.
Worked solution. Below is a protocol that a three-researcher fund could deploy on day one. It is concrete enough that the next person hired would have nothing to negotiate.
Project structure. Each research project lives under /research/projects/<YYYY-MM-DD>_<short-name>/. The directory contains: proposal.md, methodology.yml, code/, results/, holdout/, and journal.md. All commits are tracked in git, with a server-side hook that prevents force-pushes.
Data classification. Three tiers.
- Tier 1 (Discovery): Full universe, full history through end-of-year \(Y-2\). Open to all researchers. Located at /data/tier1/.
- Tier 2 (Validation): Same universe, \(Y-1\) through \(Y-0.5\) — six months of out-of-sample data. Read access requires operations approval and an entry in /research/holdout_log.csv. A researcher requests Tier 2 access per project against the iteration budget described below; each access is logged with timestamp, researcher, and methodology-spec hash.
- Tier 3 (Production): \(Y - 0.5\) through current. Only the operations person has access. Used solely for live trading. Researchers see the realised PnL aggregated weekly; they do not see the raw daily returns.
Methodology specification. Before a researcher may request Tier 2 access, they must commit a methodology.yml containing every choice: universe, period, frequency, features (named and defined), label, normalisation, model class, hyperparameters (with grids if any), holdout protocol, decision rule. The file is cryptographically hashed; the hash is recorded in /research/holdout_log.csv next to the Tier 2 access timestamp.
Iteration limits. A project starts with a budget of three Tier 2 accesses. Each access uses one slot. After the third access the project is “burned”: no further Tier 2 access until the project either (i) deploys to Tier 3 (commits to production), or (ii) is formally abandoned and its results sealed for the next 12 months.
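A minimal sketch of the hash-and-log step and the slot accounting, assuming the paths named above; the exact CSV schema is my illustration, not a prescribed standard.

```python
import csv
import hashlib
from datetime import datetime, timezone
from pathlib import Path

LOG_PATH = "/research/holdout_log.csv"

def log_tier2_access(project_dir: str, researcher: str) -> str:
    """Hash the project's methodology.yml and append one Tier 2 access record."""
    spec = Path(project_dir) / "methodology.yml"
    spec_hash = hashlib.sha256(spec.read_bytes()).hexdigest()
    with open(LOG_PATH, "a", newline="") as f:
        csv.writer(f).writerow([
            datetime.now(timezone.utc).isoformat(),  # access timestamp
            researcher,
            project_dir,
            spec_hash,                               # frozen methodology hash
        ])
    return spec_hash

def tier2_slots_used(project_dir: str) -> int:
    """Count how many of the project's three Tier 2 slots have been consumed."""
    with open(LOG_PATH, newline="") as f:
        return sum(1 for row in csv.reader(f) if len(row) >= 3 and row[2] == project_dir)
```

The point of recording the hash is that the frozen methodology.yml can be diffed against it at review time; any edit made after a Tier 2 access is immediately visible in the audit trail.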
Burned-holdout trigger. The holdout is burned if any of the following occur:
- A researcher views Tier 2 data after methodology changes that were not pre-registered in methodology.yml.
- A researcher submits Tier 2 results, then alters the methodology, then re-submits — any methodology change after a Tier 2 access counts as a new access, consuming a slot.
- Operations finds evidence (in git history or chat logs) of methodology iteration on the Tier 2 sample.
Meeting cadence.
- Monday 09:00: methodology review. Each researcher presents their methodology.yml for that week. The other two researchers and ops review. Changes require explicit log entry.
- Wednesday 16:00: progress review. Tier 1 results, no Tier 2.
- Friday 14:00: Tier 2 results, if any. A researcher who has burned all three slots presents either a deployment proposal or an abandonment justification.
Document templates.
- proposal.md: 1-page max. Hypothesis, mechanism, expected magnitude, evaluation plan.
- methodology.yml: schema enforced by the firm’s pre-commit hook. Includes an n_trials_estimate field that the researcher fills in honestly.
- journal.md: timestamp + free text. Every methodology decision is logged here within 24 hours of being made. Read-only after commit.
Consequences of burning a project. A project that burns through its iteration budget without deploying is sealed. The methodology, code, and Tier 2 results are committed to a private repository. The researcher’s bonus does not include a contribution from that project. The researcher may revisit the project after 12 months when a new Tier 2 window opens. No stigma attaches: the firm explicitly recognises that burned projects are part of healthy research, not failures of process. What is punished is iteration without disclosure.
Day-one practicality. A new hire is on-boarded by reading /research/PROTOCOL.md (this document), creating their first proposal.md, getting it approved at the next Monday meeting, and committing their first methodology.yml. Their first Tier 2 access occurs in week 4 at the earliest; by then they have written enough Tier-1-only code to know that the protocol is binding.
The protocol is not a guarantee of honest research. It is a forcing function that makes the costs of dishonesty (in time, in social capital, in bonus structure) larger than the benefits, and that creates a paper trail that an outsider can audit.
The most common failure of firewall protocols is informal exceptions: “for this one project, we’ll let you peek at Tier 2 because we’re under time pressure.” Once one exception is granted, the protocol degenerates into vibes. The successful firms — Shaw, Renaissance — treat the protocol as inviolable. The cost of strict enforcement is occasionally missing a deployment window; the cost of lax enforcement is everything.
Exercise 4.5: The hedge-fund case studies
Restatement. For each of Renaissance, Two Sigma, D. E. Shaw, Citadel/Millennium, AQR, identify one weakness of their anti-snooping protocol — a residual path through which methodology snooping could still creep in. One paragraph each.
Worked solution.
Renaissance Technologies. Renaissance’s principal defence is the signal graveyard: thousands of weak signals combined, no single one able to dominate, and any single signal’s failure absorbed by the portfolio. The residual weakness is at the combination layer. The portfolio’s weight vector is itself the output of a methodology — typically a regularised optimisation over signal weights — and that methodology has been tuned over decades on the same data. The combination model is the most over-fit object in the firm, and a regime shift that breaks the combination model (rather than any individual signal) cannot easily be detected because no individual signal’s degradation looks pathological. Renaissance protects against this by running parallel combination methodologies (according to interviews with former researchers) and comparing their outputs, but the meta-methodology of how you compare combination methodologies is itself a fork in the path.
Two Sigma. Two Sigma’s principal defence is institutional: separate research, engineering, and execution teams, with clean handoffs and documentation. The residual weakness is the career incentive. A researcher’s promotion depends on their contribution to live PnL; the cleanest way to demonstrate contribution is to have an idea attached to your name that traded profitably. This generates pressure to ship — to claim a methodology as deployable when the marginal slot of iteration has not yet been burned. The institutional firewall is procedurally strong but socially incomplete. The remedy (which Two Sigma in part uses) is to evaluate researchers on the quality of their pre-registered methodologies rather than only on the realised PnL of their deployed strategies. But this is hard to compensate, and the firm has had to balance it against the natural attraction of “we paid you because you made us money”.
D. E. Shaw. D. E. Shaw’s firewall is the strictest in the industry: independent validation teams, time-locked holdouts, formally documented methodology specifications. The residual weakness is the unintended-leak channel — the researchers see, by virtue of working in the same firm, the broad trading style and capital allocation of the firm. If the firm is known to favour mean-reversion strategies on a particular universe, a researcher proposing a momentum strategy on that universe knows ex ante that they are competing against an established methodology; this shapes their hypothesis space in ways the firewall cannot police. The firm partially defends against this by encouraging strategy diversification across teams, but the social structure of any single firm produces a methodological monoculture over time.
Citadel/Millennium (multi-PM platforms). The platforms’ principal defence is social: each PM is paid on their own book, so the single-PM incentive to over-fit is offset by the firm’s incentive to allocate capital to PMs whose strategies have positive expected return. PMs whose strategies decay are de-funded; new PMs are funded. The residual weakness is survivorship bias at the PM level. A PM with a strategy that has decayed five times in a row leaves the firm; one with a strategy that has been lucky five times in a row continues. Over time the population of active PMs is increasingly the lucky ones, even though the underlying methodologies (rotating around the same universes) are not improving. The firm’s response is to manage the population aggressively — fire and hire — but this is a Darwinian remedy, not a statistical one. The strategies that result are robust in the sense of “the unlucky ones have been removed”, not in the sense of “the surviving ones generalise”.
AQR. AQR’s principal defence is the factor-based approach: rather than searching for arbitrary strategies, the firm commits ex ante to a small number of well-documented factors (value, momentum, low-beta, quality, carry) and implements them with disciplined methodology. The implicit search is therefore bounded — you can only iterate within a factor, not across factors. The residual weakness is that factor definitions themselves are decisions, and the AQR Research Library reflects 25 years of iterating on which factor definitions to publish. A young researcher at AQR can iterate on value-factor definitions because the firm has done so for decades; the iteration history is institutional, not personal, but it is still iteration. The deflation factor on AQR’s factor zoo, if you were to count honestly, is larger than the firm’s research notes suggest — probably \(N \approx 50\)–\(100\) across the entire history of the firm’s factor library, not the \(N \approx 5\) implied by the headline products.
The fund-by-fund critique can degenerate into “no protocol is perfect, therefore no protocol is useful”. This is wrong. Each protocol is engineered against the failure mode most likely at that firm given its strategy mix and culture. Renaissance’s weakness is at the combination layer because their strategy is a combination layer; Shaw’s weakness is informational because their architecture is informational; AQR’s weakness is factor-zoo iteration because their product is factor zoos. A firm without any protocol is uniformly worse than one with an imperfect protocol. The exercise is in habit-of-mind: where, despite the discipline, could the failure mode still occur?
Exercise 4.6: Horizon and feature set
Restatement. New feature set: intraday order-book imbalances, updating many times per second. At what holding horizons should the natural alpha live? What dollar transaction costs (round-trip) would you tolerate? Implications for execution venue and trading style. Two paragraphs.
Worked solution. Order-book imbalance is a microstructure signal. Its information content reflects the short-term supply-demand state of the book, which is consumed by other market participants within seconds to minutes. The alpha half-life of any single imbalance reading is therefore on the order of seconds, with a long tail that decays to noise within roughly thirty minutes for liquid US large-caps and within hours for less liquid names. Once many such readings are combined into a portfolio signal, the natural holding horizon for a strategy built on order-book imbalance is between one minute and one hour, with the typical centre of gravity at five to fifteen minutes for liquid names.
The implication for transaction costs is severe. A strategy holding positions for ten minutes and rebalancing on order-book signals will turn over the portfolio many times per day. Even a modest round-trip cost of one basis point per trade times five turnovers per day is five basis points per day: roughly twenty-five basis points per week, over one percent per month, and on the order of 12–13% per year. To make money at this turnover rate the gross alpha must exceed this number by a Sharpe-relevant margin — gross of cost, the strategy needs to earn at least 25–30% annualised. This is achievable only at very low cost per trade: the strategy belongs on a direct-market-access venue (preferably with a co-located gateway) rather than a retail broker; it requires maker-taker rebates on at least one side of every round trip (the strategy should be a net maker, posting limit orders to capture rebates rather than crossing the spread); and it requires schedule-driven execution algorithms (TWAP, VWAP, POV, or a custom slicer) that minimise market impact per child order.
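The cost arithmetic in one runnable block, using the illustrative inputs from the paragraph above; substitute your own venue's figures.

```python
# Cost drag from turnover, using the illustrative figures in the paragraph above.
round_trip_cost_bp = 1.0     # cost per round trip, in basis points
round_trips_per_day = 5.0    # deliberately conservative for a ten-minute holding period
trading_days_per_year = 252

daily_drag_bp = round_trip_cost_bp * round_trips_per_day
annual_drag_pct = daily_drag_bp * trading_days_per_year / 100  # 100 bp = 1%
print(f"{daily_drag_bp:.0f} bp/day, {annual_drag_pct:.1f}% per year")  # 5 bp/day, 12.6% per year
```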
The trading-style implication is that this is not a “discretionary plus alpha” or even a “factor-based” strategy — it is a systematic latency-sensitive strategy, and the firm running it must accept the operational profile that goes with it: dedicated co-location, sub-millisecond gateways, a real-time risk system, and a culture of zero tolerance for execution errors. If the firm is not prepared to make these commitments, the imbalance feature set should not be deployed even if its in-sample Sharpe is attractive. The horizon-feature mismatch — slow-fund infrastructure on a fast-feature signal — is one of the most common failure modes for newly-launched quant strategies, and it is structural, not statistical.
The error to avoid is averaging the feature over five minutes to get a slower signal that fits the firm’s slower-infrastructure profile. Averaging destroys the information content of the order-book signal almost completely; the right answer is either to accept the operational profile or to discard the feature set. Adapting the signal to the infrastructure leaves you with no alpha and the same operational cost as having no signal at all.
Exercise 4.7: Defending a Sharpe of 2.96
Restatement. Pitch a 5-year out-of-sample Sharpe of 2.96 to a sceptical IC. Committee asks: how many variants did you try, and have you deflated? Show an honest answer including the rough deflation magnitude. Then show a dishonest answer and its warning signs.
Worked solution.
The honest researcher.
Committee member: “How many methodological variants did you try before arriving at 2.96?”
Researcher: “Three of us worked on this strategy for eight months. The methodology spec lives at /research/projects/2025-09-12_momentum-q/methodology.yml. Before we touched the holdout, we iterated on Tier 1 — the discovery sample — through approximately fifty methodologies. After each iteration we updated the methodology file with the new spec and the rationale for the change. We tracked it in journal.md; I’m happy to walk you through it. Once we settled on a single methodology specification, we ran it on the Tier 2 holdout. We had three slots; the first run gave a Sharpe of 2.6, the second 3.1, the third 2.96 — these three were minor variations agreed in advance, and the median of the three is what we report. So \(N_{\text{trials}}\) on Tier 1 is approximately 50, and \(N_{\text{trials}}\) on Tier 2 is 3.”
Committee member: “And the deflation?”
Researcher: “Using \(\hat S^* = \sigma_S \sqrt{2 \log 50}\) with \(\sigma_S \approx 0.45\) over five years, the Tier 1 deflation expects a maximum-Sharpe-by-chance of about \(0.45 \cdot 2.80 = 1.26\). Our Tier 1 best was 2.4, so the deflated Tier 1 Sharpe is \(2.4 - 1.26 = 1.14\) — meaningfully above the null. On Tier 2, we ran three pre-committed methodologies and the median is 2.96. Using \(N = 3\) for the Tier 2 deflation, \(\hat S^* \approx 0.45 \cdot 1.48 = 0.67\), so the deflated Tier 2 Sharpe is \(2.96 - 0.67 = 2.29\). The deflated Sharpe is what I’d ask you to anchor on. The headline of 2.96 reflects the median of three pre-committed runs; the deflated value of 2.29 corrects for the fact that we could have chosen differently.”
This is the script of a researcher with discipline. The committee can now make an allocation decision: a Sharpe of 2.29, robust to the methodology disclosed, is exceptional and worth deploying.
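The researcher's two deflation figures reproduce in a few lines, using the same rounded constants quoted in the dialogue.

```python
import math

sigma_s = 0.45                                                 # the rounded 1/sqrt(5) from the dialogue
tier1_deflated = 2.4 - sigma_s * math.sqrt(2 * math.log(50))   # ~1.14
tier2_deflated = 2.96 - sigma_s * math.sqrt(2 * math.log(3))   # ~2.29
print(round(tier1_deflated, 2), round(tier2_deflated, 2))
```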
The dishonest researcher.
Committee member: “How many methodological variants did you try before arriving at 2.96?”
Researcher: “Well, you know, we explored — I’d say a few. Maybe… five? It was iterative; it’s hard to say exactly. The headline result is what we converged on.”
Warning signs to flag:
- Vagueness about the number. “A few”, “five”, “I’d say” are the linguistic signature of a researcher who has not counted, because counting would force a quantification they cannot defend.
- No methodology file referenced. A clean researcher names the file, the path, and the git commit. A snooping researcher gestures at “the methodology”.
- No deflation. A clean researcher answers with the deflated number, having computed it before walking in the room. A snooping researcher resists the framing — “the deflation depends on what you assume” — because they have not done the calculation, or because they have done it and the result is uncomfortable.
- Reluctance to provide the journal. A clean researcher has a journal and offers it; a snooping researcher claims the journal is “incomplete” or “private” or “not the appropriate level of detail.”
- No distinction between Tier 1 and Tier 2 runs. A clean researcher separates the discovery sample from the holdout; a snooping researcher reports a single number from “the data” without naming the protocol.
- The 2.96 number is reported as a point estimate, not as one of three pre-committed runs. A pre-committed three-run protocol produces a median (or mean) and a range. A snooping researcher reports a number without a range because they hill-climbed to it.
- Defensive posture. The clean researcher’s tone is “here are the numbers; allocate as you see fit.” The snooping researcher’s tone is “trust me, this is real” — the affect of someone who knows the number does not survive scrutiny.
The role of the investment committee is to detect these warning signs and to refuse allocation when they appear, regardless of how attractive the headline number is. A Sharpe of 2.96 from an honest researcher is deployable; the same number from a dishonest researcher is a liability.
The most consequential committee error is to confuse the deflation with a criticism of the researcher’s effort. The deflation is mechanical, not personal — it is a property of the process, not of the person. A senior IC member who interprets “deflate by 0.67” as “I doubt your work” will produce a culture in which researchers hide their iteration counts. The right framing is: “I expect you to deflate. Show me the deflated number. The headline is your private affair; the deflated number is what we invest against.”
Exercise 4.8: Pre-registration as a public good
Restatement. Pre-registration is common in clinical trials, rare in quant finance. Should academic quant-finance journals require pre-registration as a publication condition? One-page argument for or against. Address counterarguments and limitations.
Worked solution. I argue for the requirement, with the caveat that pre-registration is a structural reform, not a methodological panacea. The argument has three planks.
First, the cost of non-pre-registration is real and quantifiable. The Harvey–Liu–Zhu (2016) study of the cross-section of factor returns documented several hundred reported anomalies; under a multiple-testing correction with \(N\) on the order of the literature size, fewer than half of these would survive a conventional 5% significance threshold. The replication rate of published anomalies on subsequent data is consistent with this: roughly half of pre-2003 published anomalies fail to replicate in post-2003 US data (Hou, Xue, Zhang 2020). The cost of the current research culture is that approximately half the published literature is wrong, and the readers of the literature — investment managers, students, journalists, regulators — pay the cost by allocating capital and attention to results that will not survive. Pre-registration eliminates the bulk of the post-hoc methodology iteration that produces this half-wrong record.
Second, pre-registration is feasible in quant finance. The standard objection is “but financial research is exploratory; you cannot know what to pre-register”. This is true of idea generation and false of idea testing. The pre-registration requirement applies to the second; it requires that, once a researcher has formed a hypothesis, they specify in writing how they will test it before they actually do. Hypothesis generation is permitted on any data; hypothesis testing requires a frozen specification. Most published anomaly papers already have a frozen specification — the part the reader sees — and the pre-registration would merely require depositing it in a public registry at the moment it is frozen, before running the test. Journals could implement this with a simple addendum: “Authors must submit a pre-registration of the methodology to the journal’s public registry before the date of first holdout testing. The pre-registration must include the universe, period, frequency, feature definitions, label, normalisation, model class, and decision rule. Deviations from the pre-registration must be explicitly disclosed and justified in the paper.”
Third, pre-registration creates a public good: a registry of methodologies that did not work. Currently, the published literature is the surviving tail of a hidden distribution; the failed methodologies are private knowledge, lost when researchers move on. A pre-registration registry would make the failed methodologies visible — papers that were registered but never published, or registered and published with null results. The aggregate effect is to allow future researchers to learn from the failed search, much as clinical trial registries allow medical researchers to learn from null results. Over a decade, the registry becomes a more useful research input than the published literature itself.
Counterarguments.
The strongest counterargument is commercial: if a researcher pre-registers a methodology that turns out to work, they reveal it to the market before they can profitably trade it. This is genuinely problematic for private research, but it is not a counterargument against academic publication: a researcher who is publishing in JF or RFS has already chosen to make the methodology public. The pre-registration requirement adds nothing to the disclosure cost; it merely shifts the timing.
The second counterargument is flexibility: pre-registration constrains the researcher to a methodology committed in advance, which prevents legitimate post-hoc refinements (better data, better controls, better robustness checks). The reply: the pre-registration requirement does not prohibit refinement; it requires disclosure of the deviation. A researcher can deviate from the pre-registration, provided they explain why. The reviewer then has the information needed to assess whether the deviation is principled or whether it represents post-hoc cherry-picking.
The third counterargument is cost: pre-registration is administratively burdensome. The reply: clinical trial pre-registration is also administratively burdensome and the medical literature has adapted to it within twenty years. The burden is one-time and front-loaded; the benefit is permanent.
What pre-registration would not solve. It would not solve forking-paths within the registered methodology — a sufficiently flexible pre-registration document can still admit substantial post-hoc choice (e.g., “we will test the strategy with one of {top decile, top quintile} as the cutoff”, left open at registration). It would not solve cross-author iteration — if a hundred researchers each pre-register a slightly different value strategy, the cross-author \(N\) is still a hundred and the deflation factor for any published winner is still \(\sqrt{2\log 100} \cdot \sigma_S\). And it would not solve the career-incentive problem — researchers will still want to publish, and a registry of failed methodologies does not by itself produce a tenure case. These are reasons to complement pre-registration with other reforms (cross-author multiple-testing corrections, replication standards, registry-based meta-analyses), not reasons to abandon it.
On balance: require pre-registration. The literature will be smaller, more conservative, and more useful.
The cynical position is that pre-registration will become a check-box exercise: researchers register the methodology after they have already run the analysis informally, and the registration adds no information. This is a real risk, partially mitigated by registry policies that timestamp submissions and that require the journal to verify pre-registration date relative to data availability. The remedy is not “abandon pre-registration”; the remedy is “design the registry to be hard to game”.
Exercise 4.9: Designing your own checklist
Restatement. Critique the 14-item checklist from the chapter. What’s missing? Redundant? Modify it for a specific project type (e.g., equity long-short on a small universe, intraday FX, factor-ETF portfolio). Final checklist: 10–15 items, specific enough to be used literally.
Worked solution. Below is a modified checklist for an intraday FX strategy on a small universe of major pairs (EURUSD, USDJPY, GBPUSD, AUDUSD). The intraday horizon and the small universe demand specific adaptations: features are tick-frequency, transaction costs are dominated by the bid-ask spread rather than commissions, the universe is small enough that traditional multiple-testing corrections are weak (low \(m\)), and the cross-pair correlation is large enough that effective breadth is small.
Critique of the 14-item checklist. The chapter’s checklist is good for monthly equity long-short on a broad universe: emphasis on cross-sectional \(N\), on factor neutrality, on quarterly retraining, on the standard alpha factor zoo. For intraday FX, several items are partially redundant (universe normalisation, sector neutrality, decile sorting — none of which apply on four pairs) and several items are missing (latency, microstructure-cost modelling, regime detection on macro events). The right adaptation is to replace, not to add.
Checklist for intraday FX, four-major universe:
Pre-registered methodology. Methodology specification (feature list, signal computation, threshold, holding period, sizing rule, stop-loss rule) committed to a git repository before any holdout testing. Hash recorded in /research/holdout_log.csv.
Effective holding period. Stated in seconds-to-minutes. Justify the choice against the empirical alpha half-life of the feature set (see Chapter 2). For order-book features, expect 30 seconds to 5 minutes.
Latency budget. Quote-to-order round-trip time, in milliseconds, measured on the same colocation infrastructure that will be used in production. Strategy must be profitable with a 2x margin on the measured latency, to allow for tail events.
Transaction-cost model. Bid-ask spread component (50% rebate on maker side, full spread on taker side if crossing), market-impact component (estimated from production order data, not from synthetic models), commission component (broker rate sheet attached). Costs reported in pips, then converted to PnL.
Pair-by-pair attribution. Sharpe and drawdown reported separately for each of the four pairs, and for the portfolio. A strategy that works on EURUSD only is a EURUSD strategy, not an FX strategy.
Cross-pair correlation. Report the correlation matrix of per-pair PnL. Effective breadth = \(K / (1 + (K-1) \bar\rho)\); a sketch of this calculation follows the checklist. Strategies with effective breadth below 1.5 do not get deployed.
Macro-event blackout. Document the blackout windows around US/EZ/UK/AU central bank meetings, NFP, CPI releases, and intervention windows. The strategy must be tested both with and without blackouts; if performance depends on trading through blackouts, that is a separate decision requiring senior approval.
Time-of-day attribution. PnL by Asian, European, US session. A strategy whose alpha lives entirely in one session has session-specific risk and should be sized accordingly.
Drawdown discipline. Maximum cumulative drawdown of 15% triggers an automated halt. The strategy is reviewed by the research head before resumption. Pre-committed, not negotiated after the fact.
Walk-forward protocol. Estimation on rolling 3-month window, test on the next month, advance by one month. No look-ahead. Holdout window is the most recent 3 months, never touched during methodology selection.
Position sizing rule. Volatility-targeted notional per pair, with a cap on aggregate USD-equivalent exposure. Pair sizing scaled by inverse realised vol over the trailing 20 trading days. Documented in the methodology spec.
Monitoring layer. Production alerts: rolling 5-day Sharpe below 0; rolling 20-day drawdown more than 10%; feature drift via KS on per-tick distributions, threshold 0.10; latency p99 above 2x baseline. Each alert has a defined severity and an automated response.
Post-deployment review cadence. Weekly: alert summary. Monthly: feature-level diagnostics. Quarterly: full methodology review with the head of research, including consideration of retraining or retirement.
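The effective-breadth calculation referenced in the cross-pair correlation item, as a short sketch; the 0.6 correlation in the example comment is illustrative, not an empirical figure.

```python
import numpy as np

def effective_breadth(pair_pnl: np.ndarray) -> float:
    """Effective breadth K / (1 + (K - 1) * rho_bar) from a (T, K) matrix of per-pair PnL."""
    k = pair_pnl.shape[1]
    corr = np.corrcoef(pair_pnl, rowvar=False)
    rho_bar = (corr.sum() - k) / (k * (k - 1))   # mean of the off-diagonal correlations
    return k / (1 + (k - 1) * rho_bar)

# Example: four pairs with an average pairwise correlation of 0.6 give
# 4 / (1 + 3 * 0.6) = 1.43, below the 1.5 deployment threshold in the checklist.
```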
This checklist is shorter than the chapter’s 14-item version (13 items) and more specific — every item is a concrete thing the researcher can show, not a habit-of-mind. The two missing items from the chapter’s list — “sector neutrality” and “factor decomposition” — are absent because they do not apply on a four-pair universe. The two added items — latency budget, macro-event blackout — are essential to intraday FX and absent from the equity-focused chapter checklist.
The temptation is to keep all 14 chapter items plus add the FX-specific ones, ending up with a 20-item list. Long checklists are not used; they decay into ceremony. A 10–15-item list that the researcher actually consults before every deployment beats a 30-item list that lives in a Confluence page. Remove items aggressively.
Exercise 4.10: A counter-example
Restatement. The chapter argues pre-registration is the antidote to methodology snooping. Construct an honest counter-argument: when, if ever, is post-data methodology adjustment the right thing to do? Hint: exploratory data analysis, hypothesis generation, anomaly detection. Two paragraphs.
Worked solution. The chapter is right that methodology snooping is dangerous and that pre-registration is the most effective antidote against it. But the chapter’s framing risks a corollary that is too strong — that any post-data methodology adjustment is snooping. The correct view is that there are two distinct activities, only one of which the pre-registration regime is designed to discipline. The first is confirmatory testing: you have a hypothesis, you have a methodology, you test the methodology, you report whether the hypothesis survived. This is the activity for which pre-registration is the right tool, because the validity of the test depends on the methodology being specified ex ante. The second activity is exploratory data analysis: you do not have a hypothesis, or you have one but the data has surprised you, and you adjust your methodology in response to what you see. This is a different epistemic activity; it is hypothesis-generating, not hypothesis-testing. The pre-registration regime should permit it but should not permit treating its outputs as confirmatory.
The honest practice, then, is to keep the two activities clearly separated and to label the outputs accordingly. A research note might begin with “we conducted exploratory analysis on data from \(2020\)–\(2022\) and identified the following candidate hypothesis: …”, then proceed to specify the formal test (“we will test this on a pre-registered methodology applied to data from \(2023\) onwards”). The exploratory section makes no causal claim and no inferential claim; the confirmatory section does. This separation is standard in clinical trials (early-phase exploration, phase III confirmation) and increasingly common in econometrics (Athey–Imbens machine-learning-for-causal-inference papers explicitly separate the prediction sample from the inference sample). It is rare in quant finance precisely because the cultural pressure is to publish a single number that is both hypothesis and confirmation — a fusion that the multiple-testing literature has spent fifty years showing is intellectually incoherent. The right counter-argument to “always pre-register” is not “sometimes don’t pre-register” but rather “always separate exploration from confirmation, and apply pre-registration to the confirmation phase”. Under this view, methodology iteration after seeing the data is essential, productive, and intellectually honest within the exploration phase. It is destructive only when its outputs are smuggled into the confirmation phase without acknowledgement, which is what snooping actually means.
There is a related, narrower defence of post-hoc methodology adjustment that the chapter would also accept: response to genuine data anomalies that invalidate the original methodology’s assumptions. If a researcher pre-registers a methodology that assumes daily prices are non-zero, and on the first run of the holdout the data contains a stock with a corporate action that left the price at zero for two weeks, the researcher is right to adjust the methodology to handle the anomaly — the pre-registration was implicitly conditional on the data behaving as expected. The discipline is to log the adjustment, explain it, and re-run the holdout with the adjusted methodology, recording that this is no longer a pure pre-registered test but a pre-registered test with one documented amendment. The reader can then decide how much credibility to attribute. The framework permits methodology adjustment with disclosure; what it prohibits is methodology adjustment without it. Counter-examples to “pre-register everything” are abundant; counter-examples to “log every adjustment” are essentially absent.
The mistake here is to conclude “everything goes in exploration, nothing goes in confirmation”. This decision rule is not viable because it makes confirmatory inference impossible — the researcher who is honest about the exploration done before the confirmatory test will rarely have any pure-confirmatory test left. The realistic rule is that exploration is bounded (you do exploration on data \(X\), then commit to a methodology, then test on a new data window) and that the boundary is enforced by the data-tier protocol (Exercise 4.4). Without the protocol, the boundary collapses; with the protocol, exploration and confirmation are both productive activities, properly separated.
The exercises in this chapter ask you to write rather than to compute, because the failure modes of methodology snooping are reasoning failures, not arithmetic failures. The honest researcher is not the one who has memorised the formulas of deflated Sharpe or BH; she is the one who has internalised the habit of asking what she actually tested. The formulas of Chapter 3 are powerful tools; the habits of Chapter 4 are what make those tools effective in practice. Build the habits first. The formulas will follow.
End of appendix.