Chapter 3: Multiple Testing
A quant who mines a thousand candidate signals and reports “the one that worked” is not doing statistics; they are doing storytelling. This chapter equips you with the formal machinery to reason about a family of hypothesis tests rather than a single test in isolation.
Before we go anywhere, five plain-English anchors that the whole chapter rests on:
- Hypothesis test — a procedure that asks “is this pattern real or just noise?” The output is a p-value: “if there were no real effect, how likely is this much pattern by chance?” A small p-value means the pattern would be unlikely under pure noise — evidence that something real may be going on.
- Null hypothesis (\(H_0\)) — the “nothing is going on” baseline. For us, the recurring example is: “this trading rule has zero true edge.”
- Type I error — rejecting \(H_0\) when it is actually true. A false positive: you think you found something, but it is just noise.
- \(\alpha\) (significance level) — the false-positive rate you are willing to tolerate on any one test. Usually \(5\%\).
- Multiple testing — testing many things at once. If you test 100 random rules at \(\alpha = 0.05\), you expect 5 false positives by sheer chance. That is the trap this chapter exists to defuse.
With those in hand, you will learn what the Family-Wise Error Rate (FWER) and the False Discovery Rate (FDR) actually mean, how to apply the Bonferroni, Holm, Benjamini–Hochberg, and Benjamini–Yekutieli procedures, when each is appropriate in a quant workflow, how to deflate an observed Sharpe ratio for the number of trials that produced it (Bailey and López de Prado’s Deflated Sharpe Ratio), and how to construct an empirical bootstrap null distribution for the entire pipeline rather than for a single statistic. By the end of the chapter you should be able to take any candidate set of \(m\) trading rules and decide which — if any — survive a statistically honest filter.
A Cautionary Tale to Frame the Chapter
Where you’ll see this: every time you, or anyone around you, gets excited about a backtest, this is the story you should run through your head first. The reflex “look at that Sharpe!” is exactly what multiple testing exploits.
Derek Muller walks through the famous Ioannidis paper (“Why most published research findings are false”) in plain English. Everything in this chapter is the statistical machinery you need to avoid being one of his examples.
— Veritasium
Five Years on the Walk-Forward
Suppose you run a rule-mining pipeline on a panel of 832 US stocks for the eleven years from 2015 through early 2026. The pipeline scans thousands of feature triplets, requires each candidate rule to clear a \(t\)-statistic of \(2\) on a long training window, requires it to be stable across three disjoint sub-periods, and then assembles the survivors into a monthly-rebalanced portfolio. You walk it forward over 5.3 years from 2019 through late 2024. At the end the simulator reports the following:
| Metric | Strategy | SPY benchmark |
|---|---|---|
| Final wealth (from $10k) | $80,361 | $21,501 |
| Annualized return | 47.8% | 15.4% |
| Sharpe ratio | 1.87 | 0.91 |
| Maximum drawdown | \(-20.1\%\) | \(-24.5\%\) |
| Trades executed | 10,274 | — |
The headline looks magnificent. Multiplied wealth, doubled Sharpe, smaller drawdown. If you were preparing a pitch deck you would stop here.
But two facts about how this number was produced should disturb you. First, the pipeline tested on the order of three million candidate triplets to find the survivors that built the portfolio. Second, the pipeline itself — the choice of \(t\)-threshold, the number of sub-periods, the lookback window, the way disjoint features were enforced, the cost gate level — was iterated seven times before this particular variant emerged as the best. Reported Sharpe across those seven methodology variants ranged from \(1.02\) to \(2.96\) on the same data.
You have two distinct multiple-testing problems stacked on top of each other: a vast number of individual rules tested at the bottom level, and a smaller but still nontrivial number of full methodologies compared at the top level. The behavioural side of the second problem — the Garden of Forking Paths, researcher degrees of freedom, pre-registration, the social-engineering protocols that real hedge funds use to avoid fooling themselves — is the subject of Chapter 4. This chapter handles only the statistical machinery: given that you tested \(m\) things, how do you decide which discoveries to keep?
If you scan \(m = 1{,}000\) candidate strategies under the global null \(H_0: \mu = 0\) and apply the textbook rule “reject when \(p < 0.05\)”, you expect \(0.05 \times 1{,}000 = 50\) false positives before any real signal exists. The 50 strongest-looking strategies in your output are exactly what pure noise would produce. The whole point of this chapter is to choose a decision rule that keeps the family-level error budget under control.
What “the family” Means
Multiple-testing language uses the word family in a technical sense. A family is the collection of hypotheses you tested together and from which you intend to draw a single decision — typically “which of these are real?” If you trade only the rules that pass a filter, the filter is applied to the whole family at once, and the inferential error you care about is a property of the family, not of any individual test.
Two researchers who tested the same 100 rules will reach different conclusions if they disagree about the family. One says, “I only care about the single best rule, and the family is the 100 I tested.” The other says, “I built this set after looking at three other databases that suggested which 100 to try, so the family is really 1,000.” A third — closer to the truth in a quant shop — admits that the same team has been running variants of this pipeline for five years, and the relevant family is some much larger number that nobody has been counting.
The statistical procedures in this chapter all take \(m\) as an input. They cannot pick \(m\) for you. The honesty of the analysis lives in how honestly you count.
The Setup: \(m\) Tests Against a Null
Where you’ll see this: any time someone hands you a list of “candidate signals” — a feature library, a grid-search output, a research note with 30 ideas — the very first question you should ask is “what is the family?” The answer drives every formula below.
Notation and the Confusion Matrix
You test \(m\) null hypotheses \(H_{0,1}, H_{0,2}, \ldots, H_{0,m}\). After your decision rule, each hypothesis is either rejected (“the test calls this rule real”) or not rejected (“the test calls this rule noise”). The truth is either that the null is genuinely true (“no signal”) or genuinely false (“real signal”).
The combinations form a \(2 \times 2\) table that is the foundation of every multiple-testing concept in this chapter:
| | Not rejected | Rejected | Total |
|---|---|---|---|
| \(H_0\) true (no signal) | \(U\) | \(V\) | \(m_0\) |
| \(H_0\) false (real signal) | \(T\) | \(S\) | \(m_1\) |
| Total | \(m - R\) | \(R\) | \(m\) |
Here \(R\) is the number of rejections the procedure makes (each rejection is a discovery in the language of FDR theory), \(V\) is the number of those rejections that are wrong (false discoveries), and \(S\) is the number of correct rejections (true discoveries). The researcher observes \(R\) directly but does not observe \(V\) or \(S\) separately — that is precisely what makes inference hard. Every multiple-testing procedure tries to control some functional of the unobserved \(V\) given that you only see \(R\).
Two functionals dominate the literature, and they correspond to the two definitions we promised in the opening callout:
- Family-Wise Error Rate (FWER) — probability that you make even one false positive in the whole family. Very strict.
- False Discovery Rate (FDR) — expected fraction of false positives among your reported discoveries. Less strict than FWER, and gives you more power when \(m\) is large.
Formally:
The Family-Wise Error Rate (FWER) is the probability that you make any false discovery at all:
\[ \mathrm{FWER} \;=\; \Pr(V \ge 1). \]
Controlling FWER at level \(\alpha\) means: “the chance that even one of my reported rules is a false positive is no more than \(\alpha\).” This is a stringent guarantee: the error budget does not grow with the number of tests, and one false positive in a thousand-rule report is still one false positive.
The False Discovery Rate (FDR) is the expected fraction of false discoveries among the discoveries made:
\[ \mathrm{FDR} \;=\; \mathbb{E}\!\left[\frac{V}{R \vee 1}\right], \]
where \(R \vee 1 = \max(R, 1)\) avoids division by zero when no rejections are made (by convention \(V/R = 0\) when \(R = 0\)). Controlling FDR at level \(q\) means: “on average, no more than \(q\) of every \(100\) rules I report is a false positive.” This is a proportion guarantee, not an event guarantee. It tolerates a few false positives in exchange for substantially more power, especially when \(m\) is large.
A procedure that controls FWER at \(0.05\) automatically controls FDR at \(0.05\), because \(\mathbb{E}[V/(R \vee 1)] \le \Pr(V \ge 1)\). The converse is false: an FDR-controlling procedure may make many false discoveries with probability close to one, as long as it also makes enough true discoveries to keep the ratio small. Choose your error measure based on what you will actually do with the discoveries.
Caption. The \(2\times 2\) table sits behind every multiple-testing procedure: \(V\) counts false positives (FWER controls the chance of any), \(S\) counts true discoveries (FDR controls the false-positive fraction among discoveries).
How Null p-values Behave
A piece of probability theory that every quant should internalize: under the null hypothesis the p-value is uniformly distributed on \([0, 1]\). “Uniform on \([0,1]\)” just means “every value between 0 and 1 is equally likely.” This is essentially the definition of a valid p-value, and it is the single fact behind every result in this chapter.
If \(T\) is a test statistic and \(H_0\) is the null, the p-value is \(p = \Pr(T \ge t_{\text{obs}} \mid H_0)\) (for a one-sided test). View \(p\) as a function of the random data; under \(H_0\) the distribution of \(p\) is \(\mathrm{Unif}(0, 1)\). The consequence is that under the null, \(\Pr(p \le u) = u\) for every \(u \in [0, 1]\). The reflex “small p-value means surprise” is just “small \(u\) has small probability under uniform.”
When you scan \(m\) independent tests against the global null — every hypothesis is genuinely null — you obtain \(m\) independent draws from \(\mathrm{Unif}(0, 1)\). The formula below gives the probability that the smallest of those \(m\) p-values, written \(p_{(1)}\), falls below some threshold \(u\). The minimum p-value is the minimum of \(m\) uniforms, which is well-known to satisfy
\[ \Pr(p_{(1)} \le u) \;=\; 1 - (1 - u)^m. \]
What we just got. Even when every test is pure noise, the minimum p-value is small with high probability. With \(m = 1000\) and \(u = 0.05\), the chance that the smallest p-value lies below \(0.05\) is \(1 - 0.95^{1000} \approx 1 - 5 \times 10^{-23}\), i.e. essentially one. Naively rejecting at \(p < 0.05\) guarantees you find “a winner” even when nothing is real.
The pyodide block below demonstrates this directly. It draws \(m\) independent uniforms (the null distribution of p-values) and counts how often the minimum of the \(m\) values lies below \(0.05\) — i.e. how often a naive analyst would falsely declare a winner.
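A minimal sketch of that experiment (numpy assumed; the simulation size and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_sims, alpha = 5_000, 0.05

for m in (10, 100, 1_000):
    # Each row is one "research project": m null p-values, i.e. m Unif(0,1) draws.
    p = rng.uniform(size=(n_sims, m))
    # How often does the naive analyst find at least one p < alpha?
    simulated = (p.min(axis=1) < alpha).mean()
    theory = 1 - (1 - alpha) ** m
    print(f"m={m:5d}  simulated={simulated:.3f}  theory={theory:.3f}")
```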
What we just got. The probability of seeing some p-value below \(0.05\) rises from about \(40\%\) at \(m = 10\) to essentially one already at \(m = 100\). The naive threshold is meaningless once \(m\) is non-trivial.
The picture below makes the global-null behaviour visceral. We draw \(m = 5{,}000\) p-values from the null (i.e. each is a uniform random number on \([0, 1]\)), plot the histogram, and shade the “naively significant” tail \(p < 0.05\) in red. Under the null, the histogram is flat — every bin has roughly the same height. The red region contains about \(5\%\) of the mass and represents the false positives a naive \(p < 0.05\) rule would harvest.
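One way to draw that picture (matplotlib assumed; colours and bin count are just presentation choices):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
m, alpha = 5_000, 0.05
p = rng.uniform(size=m)                  # the global null: every p-value is Unif(0, 1)

bins = np.linspace(0, 1, 21)             # bin width 0.05, so alpha is a bin edge
plt.hist(p[p >= alpha], bins=bins, color="steelblue", label="null p-values")
plt.hist(p[p < alpha], bins=bins, color="crimson", label="naively 'significant' (p < 0.05)")
plt.axvline(alpha, color="black", linestyle="--")
plt.xlabel("p-value"); plt.ylabel("count"); plt.legend()
plt.title("5,000 p-values under the global null")
plt.show()
```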
Caption. The flat histogram is the signature of a true null: every p-value is equally likely. The crimson bins to the left of \(\alpha = 0.05\) are the false positives a naive rule would report — about \(5\%\) of the family, all of them noise.
The Equity-Mining Illustration
Translate this into the cautionary tale at the start of the chapter. The formula below just multiplies the per-test false-positive rate \(\alpha = 0.05\) by the number of tests, to get the expected number of false positives. A pipeline that tested roughly \(3.2 \times 10^6\) candidate triplets against \(H_0: \mu = 0\) would, if every candidate were null, produce approximately
\[ \mathbb{E}[V] \;=\; m \cdot \alpha \;=\; 3.2 \times 10^6 \times 0.05 \;=\; 1.6 \times 10^5 \]
What we just got. Under naive \(0.05\)-thresholding you would expect about \(160{,}000\) false discoveries. The pipeline in fact retained around \(7.1 \times 10^5\) candidates after the \(|t| \ge 2\) filter. Subtracting the expected false-positive count yields roughly \(5.5 \times 10^5\) rules that might be real — but it is logically incoherent to claim that half a million distinct equity rules represent genuinely different alpha sources on the same 832 stocks. The truth is that the candidates are correlated (many triplets share features), and the global-null analysis above is itself an approximation. What multiple-testing corrections give you is a principled way to push the threshold high enough that the surviving set is more credibly real.
Family-Wise Error Rate (FWER)
Where you’ll see this: if you ever submit a finance paper to a top journal, the referee will ask “how many strategies did you test?” If you can’t answer or refuse to apply a correction, your paper is rejected. Same in real work. The Bonferroni correction is the universally accepted minimum.
The Bonferroni Correction
Bonferroni correction — divide \(\alpha\) by \(m\), and use that tighter cutoff for every test. Strict, simple, conservative.
You have \(m\) p-values, one per candidate rule, and you want to keep the chance of any false positive in the whole batch at or below \(5\%\). The recipe: pick the threshold \(\alpha/m\) (so \(0.05/m\) for the usual \(5\%\) target), then reject any rule whose p-value falls below that number. That is it — no sorting, no stepping, one cutoff for everyone. The cost: as \(m\) grows the cutoff becomes brutally tight, so weak-but-real signals will be missed. The benefit: it works no matter how the tests are correlated.
The oldest and simplest FWER-controlling procedure is the Bonferroni correction. Test each individual null at level \(\alpha/m\). If you wanted FWER \(\le 0.05\) and you tested \(m = 100\) strategies, you would reject each one only if its p-value were below \(0.0005\).
The argument is one line. The “union bound” below is just the fact that the probability of any of several events happening is at most the sum of their individual probabilities. By the union bound,
\[ \Pr(V \ge 1) \;=\; \Pr\!\left(\bigcup_{j: H_{0,j} \text{ true}} \{p_j \le \alpha/m\}\right) \;\le\; \sum_{j: H_{0,j} \text{ true}} \Pr(p_j \le \alpha/m) \;\le\; m_0 \cdot \frac{\alpha}{m} \;\le\; \alpha. \]
What we just got. The chance of even one false positive is at most \(\alpha\). The inequality holds for any joint distribution of the test statistics — independent, dependent, positively correlated, negatively correlated. That is why Bonferroni is the workhorse of regulators, drug trials, and any setting where you cannot make a credible dependence assumption.
It is also why Bonferroni is conservative. The union bound is loose when tests are correlated, and quant tests are almost always correlated (your value, momentum, and quality scores live on the same stocks at the same times). A Bonferroni correction designed for the worst case will reject far fewer rules than a method that exploits the structure of the dependence.
Reject \(H_{0,j}\) whenever \(p_j \le \alpha/m\). The FWER is then at most \(\alpha\), under any dependence structure. This is the universal fallback when you cannot assume anything about how the tests relate to one another.
A concrete check on what Bonferroni costs in the cautionary tale: with \(m = 3.2 \times 10^6\) and \(\alpha = 0.05\), the per-test threshold is \(0.05 / (3.2 \times 10^6) \approx 1.56 \times 10^{-8}\). Under a Gaussian approximation, a two-sided p-value of \(1.56 \times 10^{-8}\) corresponds to \(|t| \approx 5.7\). The vast majority of candidate rules — even ones that look strong on a \(t\)-statistic basis — will not clear this threshold.
The cell below makes this tangible. For a range of family sizes \(m\), it prints the Bonferroni p-value cutoff \(\alpha/m\) and the corresponding minimum \(|t|\) a candidate rule must clear.
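A minimal sketch of that table (scipy assumed for the normal quantile; the grid of family sizes is illustrative):

```python
import numpy as np
from scipy.stats import norm

alpha = 0.05
print(f"{'m':>12} {'alpha/m':>12} {'min |t| (two-sided)':>22}")
for m in (10, 100, 1_000, 10_000, 100_000, 3_200_000):
    cutoff = alpha / m
    # A two-sided Gaussian test rejects when |t| exceeds the (1 - cutoff/2) quantile.
    t_min = norm.ppf(1 - cutoff / 2)
    print(f"{m:>12,} {cutoff:>12.2e} {t_min:>22.2f}")
```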
What we just got. As \(m\) climbs from \(10\) to a few million, the \(|t|\) a candidate must clear rises from about \(2.8\) to north of \(5\). That is the price you pay for an event-level FWER guarantee against \(3.2\) million null hypotheses tested in parallel.
Holm’s Step-Down Procedure
Holm step-down — like Bonferroni, but uses a different (slightly looser) threshold for each ranked p-value. More power than Bonferroni at exactly the same FWER guarantee.
Sort your \(m\) p-values from smallest to largest. Compare the smallest one to the strict Bonferroni cutoff \(\alpha/m\). If it passes, “remove” it (it is rejected), and now compare the next-smallest to a slightly looser cutoff \(\alpha/(m-1)\) — because there are now only \(m-1\) live tests. Keep going. The moment a p-value fails its threshold, stop: that p-value and everything bigger than it stay in the null pile. The intuition is that once you have already rejected the strongest evidence, the remaining tests deserve a little more breathing room.
Bonferroni applies the same threshold \(\alpha/m\) to every test. Holm’s step-down procedure tightens it for the most extreme p-values and relaxes it for the less extreme ones, in a sequence that retains FWER control while almost always rejecting at least as many hypotheses as Bonferroni — and often more.
Order the p-values from smallest to largest: \(p_{(1)} \le p_{(2)} \le \cdots \le p_{(m)}\). The subscript \((k)\) is “the \(k\)-th smallest p-value.” Compare them to a sliding threshold that gets looser as \(k\) grows:
\[ p_{(k)} \;\le\; \frac{\alpha}{m - k + 1}. \]
Start with \(k = 1\). The most extreme p-value \(p_{(1)}\) must clear the strictest threshold \(\alpha/m\) — identical to Bonferroni. If it does, move to \(k = 2\) and compare \(p_{(2)}\) to \(\alpha/(m - 1)\). Continue stepping down. The procedure stops at the first \(k\) where the inequality fails: hypotheses \(H_{0,(1)}, \ldots, H_{0,(k-1)}\) are rejected, and the rest are accepted.
The intuition is straightforward. After you have rejected the strongest hypothesis, there are effectively only \(m - 1\) “live” tests remaining, so the next-strongest hypothesis should be allowed to clear \(\alpha/(m - 1)\) rather than \(\alpha/m\). Holm captures this rigorously while preserving the FWER guarantee under any joint distribution.
Holm dominates Bonferroni in the sense that any hypothesis rejected by Bonferroni is rejected by Holm, and often Holm rejects strictly more. There is essentially no reason to use Bonferroni in a modern workflow when Holm is one line away — statsmodels.stats.multitest.multipletests provides both.
The cell below simulates a small family: 100 candidate strategies, of which 5 have a real edge (\(t \approx 3.5\)) and 95 are pure noise. We run both Bonferroni and Holm at \(\alpha = 0.05\) and count how many true signals each picks up.
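A sketch of that comparison, using statsmodels’ multipletests (the exact counts vary with the random seed):

```python
import numpy as np
from scipy.stats import norm
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(2)
m, m_real, alpha = 100, 5, 0.05

# t-statistics: 5 real signals near t = 3.5, 95 pure-noise draws from N(0, 1).
t = np.concatenate([rng.normal(3.5, 0.3, m_real), rng.normal(0.0, 1.0, m - m_real)])
is_real = np.arange(m) < m_real
p = 2 * norm.sf(np.abs(t))               # two-sided p-values

for method in ("bonferroni", "holm"):
    reject, *_ = multipletests(p, alpha=alpha, method=method)
    print(f"{method:>10}: rejected {reject.sum()} total, "
          f"{(reject & is_real).sum()} of the {m_real} real signals")
```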
What we just got. Bonferroni and Holm rejected nearly the same set; in this small, well-separated example Holm gives at most one extra pick. The gap widens as \(m\) grows and as more hypotheses sit near the rejection boundary.
The graphic below makes the step-down logic visible. We sort \(m = 30\) simulated p-values from smallest to largest and overlay two thresholds: the flat Bonferroni cutoff \(\alpha/m\) (a horizontal line) and the Holm step-down sequence \(\alpha/(m - k + 1)\) (an upward-sloping staircase). Holm walks the staircase from left to right and stops at the first p-value that lies above its step; everything to the left is rejected. Because the staircase rises, Holm always rejects at least as many hypotheses as Bonferroni — and usually more.
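One way to draw that staircase (matplotlib assumed; the simulated p-values are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

rng = np.random.default_rng(3)
m, alpha = 30, 0.05

# A handful of strong signals plus noise, converted to sorted two-sided p-values.
t = np.concatenate([rng.normal(3.0, 0.3, 5), rng.normal(0.0, 1.0, m - 5)])
p_sorted = np.sort(2 * norm.sf(np.abs(t)))
k = np.arange(1, m + 1)

holm = alpha / (m - k + 1)               # step-down staircase, rises with rank k
bonf = np.full(m, alpha / m)             # flat Bonferroni cutoff

plt.semilogy(k, p_sorted, "o", label="sorted p-values")
plt.semilogy(k, holm, drawstyle="steps-mid", color="crimson", label="Holm threshold")
plt.semilogy(k, bonf, "k--", label="Bonferroni threshold")
plt.xlabel("rank k"); plt.ylabel("p-value (log scale)"); plt.legend()
plt.show()
```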
Caption. The crimson staircase is the Holm threshold; Holm rejects every rank to the left of its first failure. The dashed black Bonferroni line sits below the staircase for \(k > 1\), so any hypothesis Bonferroni clears, Holm clears as well — and Holm often clears strictly more.
Šidák and Other FWER Variants
A footnote that occasionally surfaces in textbooks: if the \(m\) tests are exactly independent under the null, the union bound in the Bonferroni argument can be replaced by the exact joint probability, giving the Šidák correction: reject when \(p_j \le 1 - (1 - \alpha)^{1/m}\). This is uniformly less conservative than Bonferroni but only correct under independence. Since financial tests are almost never independent, Šidák is rarely used in practice and we will not develop it further. Holm is the default upgrade from Bonferroni.
FWER control is appropriate when the cost of a single false discovery is high and you have a small enough family that you can afford the conservatism. Two natural examples in real work: (i) you are about to deploy a single rule with real capital and you want strong confidence that it is not noise; (ii) you are reporting a small number of headline findings (say, three) to senior management and one false positive would be embarrassing. For large-scale signal mining where the cost of a false discovery is low but the cost of missing real signals is high, FDR is usually a better target.
False Discovery Rate (FDR)
Where you’ll see this: genomics, drug discovery, fMRI brain studies, and quant finance all rely on FDR because they all do thousands of tests. Bonferroni is too strict for that scale; FDR is what lets you actually surface real signals in a haystack.
Why a Different Metric
Imagine you scan \(m = 10{,}000\) candidate trading rules. Bonferroni-Holm at FWER \(0.05\) requires \(|t|\) in the neighbourhood of \(4.5\) even for \(m\) this modest. If the true effect sizes in your candidate pool are mostly in the \(|t| = 2\) to \(|t| = 3\) range — which is realistic for weak-but-real factor signals — a FWER procedure will discard almost all of them. You guarantee no false positives at the cost of missing nearly every real positive. Statisticians call this low power.
Benjamini and Hochberg (1995) proposed a different framing. Instead of requiring zero false positives in expectation, accept that some discoveries will be false but demand that the proportion be small. In a portfolio of one hundred reported alphas, you may be perfectly content with three or four turning out to be noise as long as the other ninety-six are real. The FDR criterion formalises this: \(\mathbb{E}[V/R] \le q\), for whatever target \(q\) you choose (\(0.10\) is a common default in genomics, \(0.05\) is more typical in finance).
The mathematical magic is that controlling FDR rather than FWER lets the rejection threshold scale with the number of rejections. The more discoveries the procedure makes, the more “noise discoveries” it is allowed to tolerate, and the more permissive the threshold becomes. This is the source of FDR’s enormous power advantage when \(m\) is large.
The Benjamini–Hochberg (BH) Procedure
Benjamini–Hochberg — sort the p-values and find the largest \(k\) where \(p_{(k)} \le k\alpha/m\). Reject everything up to that rank. Controls FDR.
Sort your \(m\) p-values from smallest to largest, call them \(p_{(1)}, p_{(2)}, \ldots, p_{(m)}\). Walk down the list from the largest end, comparing each \(p_{(k)}\) to the threshold \(k\alpha/m\) (so the largest is compared to \(\alpha\), the next-largest to \((m-1)\alpha/m\), and so on, getting tighter as you move toward the small p-values). Find the largest \(k\) for which the p-value sits at or below its threshold. Then reject all of the \(k\) smallest p-values — even those that would individually have failed their own threshold; if a later one passes, the earlier ones come along. The promise: on average, the fraction of false alarms among the rejected set stays below \(q\).
Order the p-values: \(p_{(1)} \le p_{(2)} \le \cdots \le p_{(m)}\). Compare them to a step-up threshold that grows linearly with rank:
\[ p_{(k)} \;\le\; \frac{k}{m} \cdot q. \]
Find the largest \(k\) for which the inequality holds. Reject \(H_{0,(1)}, H_{0,(2)}, \ldots, H_{0,(k)}\) — all of the \(k\) smallest p-values, even those that do not individually satisfy their own threshold (some of the earlier ones may have failed, but they get to come along for the ride if a later one passes).
Three facts justify the procedure.
First, under independence (and under positive regression dependence — see below), this rule controls the FDR at the chosen \(q\). This is Benjamini and Hochberg’s original result; Benjamini and Yekutieli later extended it to the dependent case. The proof is elegant but not obvious.
Second, the threshold \(k/m \cdot q\) is linear in the rank \(k\). The smallest p-value is held to a Bonferroni-like standard (\(1/m \cdot q\)), but later p-values are held to progressively looser standards. The \(m\)th p-value is only required to be below \(q\) — about the same as the per-test threshold an unsophisticated analyst would have used in the first place.
Third, the procedure is a step-up: you look from \(k = m\) downward for the first \(k\) that satisfies the inequality, and you reject all hypotheses with rank at most \(k\). This is the opposite of Holm’s step-down. In Holm, the procedure searches from the smallest p-value upward and stops at the first failure; in BH, it searches from the largest downward and stops at the first success. Different direction, different guarantee.
Caption. Sort ascending, overlay the \(k\alpha/m\) staircase, find the largest \(k\) at which \(p_{(k)}\) sits at or below the line — reject every p-value at rank \(\le k^\star\) (green) and keep the rest as null (red).
The simulation below uses a more realistic mix — 1000 candidates, 50 of which are real signals at modest \(t \approx 2.8\). We compare what Bonferroni, Holm, and BH (at \(q = 0.05\)) each report.
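A sketch of the three-way comparison (counts depend on the seed; the point is the ordering, not the exact numbers):

```python
import numpy as np
from scipy.stats import norm
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(4)
m, m_real = 1_000, 50

t = np.concatenate([rng.normal(2.8, 0.3, m_real), rng.normal(0.0, 1.0, m - m_real)])
is_real = np.arange(m) < m_real
p = 2 * norm.sf(np.abs(t))

for label, method in [("Bonferroni", "bonferroni"), ("Holm", "holm"), ("BH q=0.05", "fdr_bh")]:
    reject, *_ = multipletests(p, alpha=0.05, method=method)
    n_rej, n_true = int(reject.sum()), int((reject & is_real).sum())
    fdp = (n_rej - n_true) / max(n_rej, 1)
    print(f"{label:>10}: R={n_rej:3d}  true={n_true:3d}  false={n_rej - n_true:3d}  FDP={fdp:.2f}")
```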
What we just got. The pattern is the lesson: BH typically rejects more hypotheses than FWER methods, picking up many true signals at a controlled false-discovery proportion. Bonferroni and Holm reject almost nothing because they cannot afford to. BH lets the threshold breathe.
Why \(k/m\)? An Intuitive Derivation
The factor \(k/m\) in the BH threshold is not arbitrary. Here is the back-of-envelope intuition.
Suppose the procedure rejects the \(k\) smallest p-values — everything at or below the threshold \(u = p_{(k)}\). Each true null lands below \(u\) with probability \(u\), so the expected number of false discoveries among those \(k\) rejections is about \(m_0 \cdot u\); if you knew the proportion \(\pi_0 = m_0/m\) of true nulls you could use it exactly, but conservatively you can replace \(\pi_0\) with one and use \(m \cdot u\). The false discovery proportion is then roughly \(m \cdot u / k\), and requiring this to be at most \(q\) is exactly the condition \(p_{(k)} \le k/m \cdot q\). (As a sanity check: under the global null the \(k\)th order statistic of \(m\) uniforms has mean \(k/(m+1) \approx k/m\), so the condition asks the \(k\)th smallest p-value to be about \(q\) times smaller than pure noise would produce.) In words: the \(k\)th smallest p-value must be small enough that, even if every rejection were a null, the expected FDR would be at most \(q\).
This is heuristic, not a proof, but it captures why the threshold has the shape it does. The formal proof, due to Storey and others in addition to BH, leverages a martingale argument on the empirical distribution of p-values.
The BH Picture: a Single Graph
The most intuitive way to see BH is graphically. Plot the sorted p-values \(p_{(k)}\) against rank \(k\). Overlay the threshold line \(y = (k/m) q\). The procedure finds the largest \(k\) at which the data lie at or below the line, and rejects all hypotheses with rank up to that \(k\).
The cell below draws exactly that plot for a simulated family of 200 candidate rules (15 real, 185 null). The red line is the BH threshold; the dashed grey line marks the chosen cutoff \(k^*\).
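A sketch of that plot (the zoomed axis limits are purely for readability; the simulated family is illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

rng = np.random.default_rng(5)
m, m_real, q = 200, 15, 0.05

t = np.concatenate([rng.normal(3.2, 0.4, m_real), rng.normal(0.0, 1.0, m - m_real)])
p_sorted = np.sort(2 * norm.sf(np.abs(t)))
k = np.arange(1, m + 1)
bh_line = k / m * q

# Largest rank k at which the sorted p-value sits at or below the BH line.
below = np.nonzero(p_sorted <= bh_line)[0]
k_star = below[-1] + 1 if below.size else 0

plt.plot(k, p_sorted, ".", label="sorted p-values")
plt.plot(k, bh_line, "r-", label="BH threshold kq/m")
if k_star:
    plt.axvline(k_star, color="grey", linestyle="--", label=f"cutoff k* = {k_star}")
plt.xlim(0, 50); plt.ylim(0, 0.02)       # zoom in on the corner where the action is
plt.xlabel("rank k"); plt.ylabel("p-value"); plt.legend()
plt.show()
print(f"BH rejects the {k_star} smallest p-values")
```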
What we just got. The plot makes the procedure transparent. Where the empirical p-value curve dips below the BH line, you have rejections. The procedure is adaptive: if your data have many small p-values (lots of real signal) the cutoff \(k^*\) is large; if every p-value is uniform-looking, \(k^* = 0\).
Benjamini–Yekutieli for Dependent Tests
Benjamini–Yekutieli (BY) — like BH, but tightens the threshold by a factor of \(\log m\) so the FDR guarantee holds even when tests can be negatively correlated. Safer, less powerful.
Run the same step-up search as BH on your sorted p-values, but compare each \(p_{(k)}\) against the tighter threshold \(\dfrac{k}{m \cdot c(m)} \cdot q\), where \(c(m) = 1 + 1/2 + 1/3 + \cdots + 1/m \approx \log m\). That extra divisor — typically a single-digit number for realistic \(m\) — buys you FDR control under any correlation structure, including negative correlations that would break vanilla BH. The cost is power: BY rejects fewer rules than BH, so use it as a conservative cross-check when you cannot rule out negative dependence (long–short pairs, hedged variants, etc.).
BH’s FDR guarantee assumes either independence among test statistics or a milder condition called positive regression dependence on a subset (PRDS). Roughly, PRDS holds when correlations among test statistics are non-negative — a reasonable approximation for many quant settings (factor exposures of overlapping rules are usually positively correlated through their common underlying factors).
When dependence is unrestricted — including the possibility of negative correlations — BH is no longer guaranteed to control FDR. Benjamini and Yekutieli (2001) showed that the threshold needs to be deflated by a logarithmic factor. The formula below is just the BH threshold divided by \(c(m)\), where \(c(m)\) is the harmonic number — a slowly-growing function of \(m\) that behaves like \(\log m\).
\[ p_{(k)} \;\le\; \frac{k}{m \cdot c(m)} \cdot q, \qquad c(m) \;=\; \sum_{i=1}^{m} \frac{1}{i} \;\approx\; \log m + \gamma_{\text{EM}}, \]
where \(\gamma_{\text{EM}} \approx 0.577\) is the Euler–Mascheroni constant. What we just got. With \(m = 1{,}000\), \(c(m) \approx 7.5\); the BY threshold is about seven-and-a-half times tighter than BH. The BY procedure pays a real power cost for dependence-robustness.
In real work the choice between BH and BY hinges on what you believe about the joint distribution of your test statistics. If your \(m\) candidate rules share underlying risk factors and are positively correlated, BH is fine. If they include short-side and long-side variants of the same trade, or pairs designed to be negatively correlated, BY is safer. Many quant researchers split the difference: run BH for the headline cutoff and BY for the conservative cross-check.
The cell below scales up the simulation — 5000 candidates, 250 real signals — and runs all four procedures side by side so you can see the power ordering directly.
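A sketch of the four-way comparison (statsmodels’ built-in BH and BY implementations; counts vary with the seed):

```python
import numpy as np
from scipy.stats import norm
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(6)
m, m_real = 5_000, 250

t = np.concatenate([rng.normal(3.0, 0.5, m_real), rng.normal(0.0, 1.0, m - m_real)])
is_real = np.arange(m) < m_real
p = 2 * norm.sf(np.abs(t))

methods = [("Bonferroni", "bonferroni"), ("Holm", "holm"),
           ("BH  q=0.05", "fdr_bh"), ("BY  q=0.05", "fdr_by")]
for label, method in methods:
    reject, *_ = multipletests(p, alpha=0.05, method=method)
    n_rej, n_true = int(reject.sum()), int((reject & is_real).sum())
    fdp = (n_rej - n_true) / max(n_rej, 1)
    power = n_true / m_real
    print(f"{label}:  R={n_rej:4d}  FDP={fdp:.3f}  power={power:.2f}")
```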
What we just got. Bonferroni and Holm reject only a tiny set of the most extreme rules; BH rejects roughly the right number with FDP (false discovery proportion) near the target \(q = 0.05\); BY sits between the two, rejecting fewer than BH but more than the FWER procedures.
The two-panel chart below dramatises the power tradeoff. Both panels show the same \(m = 500\) tests (25 real signals, 475 nulls), sorted by p-value. The left panel marks the FWER-controlled rejections (Bonferroni) — a tiny set of the most extreme p-values. The right panel marks the FDR-controlled rejections (BH at \(q = 0.05\)) — a much larger set, almost all of them real signals. The same data, two different error-rate philosophies, very different sized discovery sets.
Caption. Bonferroni (left) rejects almost nothing — pristine but underpowered. BH (right) rejects many more hypotheses, almost all of them real signals (green), at the cost of a controlled fraction of false positives (red). FDR control is what lets the threshold breathe when \(m\) is large.
For a candidate set of trading rules where you cannot rule out positive correlations among rules, run BH at \(q = 0.05\) or \(q = 0.10\) as your main filter and BY at the same \(q\) as a conservative sanity check. If a rule survives BH but not BY, treat it as a “watch list” candidate rather than a confirmed signal. If it survives BY, you have a robust discovery.
Deflated Sharpe Ratio (DSR)
Where you’ll see this: when you (or anyone you work with) reports “the best strategy out of \(N\)” and shows you its Sharpe ratio, the number is biased upward. DSR is the standard tool to undo that inflation. Any serious quant interview, audit, or research note will ask whether DSR has been applied.
Deflated Sharpe Ratio (Bailey & López de Prado) — given that you tested \(N\) strategies and your best one has observed Sharpe \(\hat{SR}\), what is the probability the true Sharpe is greater than \(0\)? The deflation formula accounts for skew, kurtosis, and the multiplicity of trials.
The Selection Problem for Sharpe
Multiple-testing corrections so far operate on p-values. But the quant’s headline number is the Sharpe ratio, not a p-value, and there is a direct version of the multiple-testing problem at the Sharpe level. If you compute the Sharpe ratio of every one of \(N\) candidate strategies and report the maximum, the reported number is a biased estimator of the underlying skill.
Bailey and López de Prado (2014) gave this bias an explicit form and proposed a remedy. The mathematics is intuitive: the maximum of \(N\) independent draws from a distribution exceeds the mean of the distribution by an amount that grows roughly as \(\sqrt{2 \log N}\) in units of the distribution’s standard deviation (this is the extreme-value scaling of standard normals). If you report the maximum-Sharpe strategy across \(N\) candidates, you are inflating the truth by approximately this amount.
To make the formula tight, two refinements are needed. First, the standard error of an estimated Sharpe is not \(1/\sqrt{T}\) — that formula assumes returns are perfectly normal (bell-shaped). Real returns aren’t. Skewness \(\hat\gamma_3\) measures asymmetry (negative skew = occasional big losses, like a short-vol strategy). Kurtosis \(\hat\gamma_4\) measures tail-heaviness (high kurtosis = more extreme events than a bell curve would predict). The formula below is the Mertens (2002) correction: a more honest standard error that absorbs both effects.
\[ \widehat{\mathrm{SE}}(\widehat{SR}) \;=\; \sqrt{\frac{1 - \hat\gamma_3 \widehat{SR} + \tfrac{\hat\gamma_4 - 1}{4} \widehat{SR}^{\,2}}{T - 1}}. \]
What we just got. Heavier tails (high \(\hat\gamma_4\)) inflate the standard error; left-skewed returns (\(\hat\gamma_3 < 0\), common for short-volatility strategies) inflate it further. The naive formula \(\widehat{\mathrm{SE}} = 1/\sqrt{T}\) understates the noise around the point estimate for almost every realistic trading strategy.
Second, the maximum-order-statistic correction itself. If you draw \(N\) values from a standard normal and take the biggest, that “max” is on average larger than zero by roughly \(\sqrt{2\log N}\). The formula below is a more precise version of that quantity — the expected best-Sharpe under pure noise.
\[ \mathbb{E}[\hat{SR}_{\max} \mid H_0] \;\approx\; (1 - \gamma_{\text{EM}})\,\Phi^{-1}\!\left(1 - \frac{1}{N}\right) \;+\; \gamma_{\text{EM}}\,\Phi^{-1}\!\left(1 - \frac{1}{N \cdot e}\right), \]
where \(\gamma_{\text{EM}}\) is again Euler–Mascheroni and \(\Phi^{-1}\) is the inverse standard-normal CDF (the function that turns a probability into a \(z\)-score). For moderate \(N\) this is well-approximated by
\[ \mathbb{E}[\hat{SR}_{\max} \mid H_0] \;\approx\; \sqrt{2 \log N} \;-\; \frac{\log\log N + \log 4\pi}{2 \sqrt{2 \log N}}, \]
but the exact form above is what Bailey–López de Prado use.
Putting both pieces together, the Deflated Sharpe Ratio is the probability that the observed Sharpe exceeds the maximum that would be expected by chance alone under the null. The formula below is a \(z\)-score: “how many standard errors is the observed Sharpe above the expected-best-under-noise?”, then run through the normal CDF \(\Phi(\cdot)\) to convert it to a probability.
\[ \mathrm{DSR}(\widehat{SR}^*) \;=\; \Phi\!\left(\frac{\widehat{SR}^* - \mathbb{E}[\hat{SR}_{\max} \mid H_0]}{\widehat{\mathrm{SE}}(\widehat{SR})}\right). \]
Caption. The DSR formula in five labeled pieces: observed Sharpe, expected-best-of-\(N\) under noise, Mertens-corrected SE (using skew, kurtosis, \(T\)), all wrapped in \(\Phi(\cdot)\) to produce a probability that the true Sharpe is positive.
What we just got. A single number between \(0\) and \(1\). Close to \(1\) means “the observed Sharpe is well above what selection-from-\(N\)-tries would produce under noise.” Close to \(0.5\) means “this looks like noise plus selection.”
A DSR close to one says “even after accounting for the selection across \(N\) trials and the non-normality of returns, the observed Sharpe is well above what chance would produce.” A DSR close to \(0.5\) says “this looks like a draw from the maximum-of-\(N\) null distribution.”
You have one strategy’s daily returns and the count \(N\) of how many strategies you searched through to find it. Compute three things from the returns: the observed Sharpe \(\widehat{SR}\), the skewness \(\hat\gamma_3\), and the kurtosis \(\hat\gamma_4\). Plug them into the Mertens formula to get a fair-sized standard error. Separately, compute the expected best-Sharpe under noise from the order-statistic formula above using only \(N\). The DSR is the normal-CDF of (observed Sharpe − expected best-under-noise) divided by (Mertens SE) — a single probability you read as “chance the real edge is above zero.” Convention: DSR \(\ge 0.95\) counts as a discovery.
DSR is a probability of statistical significance after correcting for selection across multiple trials and return non-normality. It is not a corrected Sharpe ratio you can plug back into an optimizer or report in a marketing deck. Treat it as a hypothesis test: \(\mathrm{DSR} \ge 0.95\) means the strategy clears a one-sided \(5\%\) test against the maximum-of-\(N\) null.
Implementing DSR
The implementation is mechanical. You need: the observed Sharpe, the strategy’s return time series (to estimate \(T\), \(\hat\gamma_3\), \(\hat\gamma_4\), and the standard error), and the effective number of trials \(N\). The first two are easy; the third is the hard one and reflects all the multiple-testing ambiguity discussed above. We will return to it.
The function below packages all of that. It takes a return series and an \(N\) and returns a dictionary with the observed Sharpe, the Mertens SE, the expected-best-under-noise, and the DSR. The script then simulates one weak-edge strategy and shows how the DSR collapses as \(N\) grows.
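A sketch of such a function, assuming numpy and scipy are available. Scaling the expected-best-of-\(N\) term by the Mertens standard error is one reasonable reading of the chapter’s formula (it stands in for the cross-trial dispersion of Sharpe estimates), and the simulated strategy’s parameters are illustrative:

```python
import numpy as np
from scipy.stats import norm, skew, kurtosis

GAMMA_EM = 0.5772156649  # Euler-Mascheroni constant

def deflated_sharpe_ratio(returns, n_trials):
    """DSR in per-period units, following the chapter's formulas (n_trials >= 2)."""
    r = np.asarray(returns, dtype=float)
    T = len(r)
    sr = r.mean() / r.std(ddof=1)                        # observed per-period Sharpe
    g3 = skew(r)
    g4 = kurtosis(r, fisher=False)                       # raw (non-excess) kurtosis
    # Mertens (2002) standard error of the Sharpe estimator.
    se = np.sqrt((1 - g3 * sr + (g4 - 1) / 4 * sr**2) / (T - 1))
    # Expected best-of-N Sharpe under the null, scaled into Sharpe units by the SE.
    e_max = ((1 - GAMMA_EM) * norm.ppf(1 - 1 / n_trials)
             + GAMMA_EM * norm.ppf(1 - 1 / (n_trials * np.e))) * se
    dsr = norm.cdf((sr - e_max) / se)
    return {"sharpe": sr, "se": se, "expected_max_null": e_max, "dsr": dsr}

# Five years of daily returns with a modest real edge (annualized Sharpe around 1.6).
rng = np.random.default_rng(7)
rets = rng.normal(0.001, 0.01, 1_260)
for n in (10, 100, 1_000, 10_000):
    out = deflated_sharpe_ratio(rets, n)
    print(f"N={n:>6,}  SR={out['sharpe']:.3f}  "
          f"E[max|H0]={out['expected_max_null']:.3f}  DSR={out['dsr']:.3f}")
```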
What we just got. The output is the heart of the lesson. The observed Sharpe stays the same — the data did not change. But as the number of trials behind the selection grows, the expected maximum under the null grows, and the deflated significance shrinks toward \(0.5\). A Sharpe that looks impressive when reported in isolation may not survive even modest multiple-testing scrutiny.
Choosing \(N\)
The thorniest decision in DSR is what to put in for \(N\). Three approaches dominate:
Literal count. If your pipeline literally enumerated \(N\) candidate strategies and you reported the best, \(N\) is that count. This is appropriate for an explicit grid search.
Effective count under correlation. If the candidates are correlated, the effective number of independent trials is smaller. A common heuristic is \(N_{\text{eff}} = N \cdot (1 - \bar\rho) + \bar\rho\) for an average pairwise correlation \(\bar\rho\), but this is a rough guide. Bailey–López de Prado discuss a principal-components approach: count the number of eigenvalues of the candidate return correlation matrix that explain, say, \(99\%\) of the variance.
Trial budget. If you cannot reconstruct \(N\) but can defend a trial budget (e.g. “this team has run on the order of \(10^4\) backtests in the past two years”), use that. This is the most honest treatment in research-shop settings where iteration is informal.
Whatever choice you make, document it. The DSR is only as defensible as the \(N\) you put into it.
Bringing It Back to the Cautionary Tale
For the 5.3-year walk-forward pipeline at the top of the chapter, \(\widehat{SR} = 1.87\) and \(T = 10{,}274\) trades. If you take \(N = 10{,}000\) effective independent trials (a not-unreasonable budget given that the pipeline mined a few million candidates but they are heavily correlated, and seven methodology variants were tried), the expected maximum Sharpe under the null on a per-trade basis is roughly \(\sqrt{2 \log 10{,}000}/\sqrt{T} = 4.29/\sqrt{10{,}274} \approx 0.042\). The annualized version of this, depending on how trades aggregate to days, sits somewhere between \(0.5\) and \(1.0\).
The implication is that the headline Sharpe of \(1.87\) is meaningfully above what selection alone would have produced. Some real edge exists. But the honest point estimate of the true Sharpe — after deflating for selection — is closer to \(1.0\) than \(1.87\). That is the gap that real-money traders care about.
Bootstrap Null Distributions
Where you’ll see this: every time your pipeline is more complicated than “one rule, one test” — and that is essentially every pipeline you will ever build — closed-form corrections leave residual uncertainty. The bootstrap is the universal fallback. Modern quant funds use it as the final stage of vetting before any new strategy gets capital.
Bootstrap null distribution — instead of trusting a parametric formula, simulate what “no edge” would look like by re-running the strategy on shuffled or sign-flipped returns thousands of times. Compare your real Sharpe to this empirical null.
Why Bootstrap Beats Analytical Corrections for Pipelines
The corrections we have discussed so far — Bonferroni, Holm, BH, BY, DSR — assume specific forms for the joint distribution of test statistics (typically independence or PRDS) and for the marginal distribution of returns (typically iid Gaussian, with the Mertens correction for non-normality). For a single rule on a single time series, these assumptions are often defensible. For an entire pipeline that includes feature construction, signal selection, methodology choice, and combination logic, they are almost never quite right.
The bootstrap offers a cleaner path: instead of computing a corrected p-value analytically, simulate the entire pipeline on data where the null is known to hold, and read the p-value off the simulated distribution. The corrections are then automatic with respect to whatever dependence and selection your pipeline contains — you do not need to specify them.
The cleanest construction for a return-prediction pipeline is sign randomisation or return shuffling. The idea: take the actual feature matrix and the actual cross-section of returns; randomly permute the returns in a way that breaks any genuine signal but preserves the marginal distribution of returns and most of the dependence structure. Run the entire pipeline on this randomized data and record the headline Sharpe. Repeat \(B\) times (typically \(B = 100\) to \(B = 10{,}000\)). The distribution of bootstrap Sharpe values is the empirical null distribution against which the observed Sharpe is tested.
Three randomization schemes are standard.
Sign-flip bootstrap. Multiply each observed return by an independent \(\pm 1\) sign. This preserves the marginal distribution of returns (up to symmetry) and breaks the alignment between features and returns.
Block bootstrap. Resample contiguous blocks of returns of length \(\ell\) (preserving short-range autocorrelation) and reattach them to the original features at the same dates.
Cross-sectional shuffle. Within each date \(t\), randomly permute the returns across stocks. This preserves the time-series properties of the market portfolio and breaks the cross-sectional alignment with features. This is the right scheme for cross-sectional rule mining.
The block bootstrap is essential when returns exhibit substantial autocorrelation (rare for daily equity returns but common for monthly factor returns). The sign-flip is appropriate for zero-mean tests. The cross-sectional shuffle is what you want for cross-sectional alpha pipelines.
Take the realized return series of your strategy and destroy any true edge by either randomly flipping the sign of each daily return (sign-flip), sampling blocks of consecutive returns with replacement (block bootstrap), or shuffling returns across stocks within each date (cross-sectional). Run the strategy — or, more powerfully, the entire mining pipeline — on this scrambled data and record the headline Sharpe. Repeat \(B\) times (a few hundred to a few thousand). The \(B\) recorded Sharpes form the empirical null distribution. Compute the bootstrap p-value as the fraction of those null Sharpes that meet or beat your real one. Below \(0.05\): your pipeline really beats luck. Above \(0.30\): you are mostly seeing noise plus search.
DSR corrects for the number of trials given a single, fixed statistical procedure. A bootstrap simulates the entire procedure under the null — including any methodology iteration, feature selection, or combination logic — and so corrects for forms of overfitting that a closed-form deflation cannot reach. The cost is computation: a \(100\)-fold bootstrap requires running your pipeline \(100\) times. For an expensive pipeline this is days of compute; for a cheap one, minutes.
A Minimal Bootstrap Example
The pyodide block below illustrates the construction in the simplest setting: a single strategy whose realized Sharpe is to be tested against a sign-flip bootstrap null. We deliberately simulate two cases — a true signal and a no-signal placebo — so you can see the bootstrap distribution behave appropriately in both.
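A minimal sketch of the sign-flip construction. The parameters (three years of daily returns, 15% annualized volatility, annualized Sharpe) are illustrative, so the printed numbers will differ from run to run and from the cell described above:

```python
import numpy as np

def ann_sharpe(r):
    """Annualized Sharpe from daily returns."""
    return r.mean() / r.std(ddof=1) * np.sqrt(252)

def signflip_pvalue(returns, n_boot=2_000, rng=None):
    """One-sided sign-flip bootstrap p-value for H0: zero mean return."""
    rng = rng if rng is not None else np.random.default_rng()
    obs = ann_sharpe(returns)
    signs = rng.choice([-1.0, 1.0], size=(n_boot, len(returns)))
    null = np.array([ann_sharpe(s * returns) for s in signs])
    return obs, null, (null >= obs).mean()

rng = np.random.default_rng(8)
T, ann_vol = 756, 0.15                             # three years of daily data, 15% annual vol
sigma = ann_vol / np.sqrt(252)

real = rng.normal(1.0 * ann_vol / 252, sigma, T)   # true annualized Sharpe of 1.0
noise = rng.normal(0.0, sigma, T)                  # pure-noise placebo, zero true edge

for name, r in [("true signal", real), ("placebo", noise)]:
    obs, null, p = signflip_pvalue(r, rng=rng)
    print(f"{name:>12}: observed SR={obs:5.2f}  "
          f"null 95% tail={np.quantile(null, 0.95):5.2f}  p={p:.3f}")
```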
What we just got. The strategy with a true Sharpe of \(1.0\) produces an observed Sharpe near \(1.0\), an empirical null centred at zero with \(95\%\)-tail near \(0.10\), and a bootstrap p-value near zero — clearly real edge. The pure-noise strategy produces an observed Sharpe of comparable magnitude to the null’s \(95\%\)-tail and a p-value that scatters across \([0, 1]\) — exactly what should happen when there is no edge.
Bootstrapping the Whole Pipeline
The example above bootstraps a single time series of returns. The conceptually more powerful — and computationally more expensive — version applies the same logic to the entire mining-and-selection process (a schematic sketch in code follows the list):
- Take the original panel \((\mathbf X_{i,t}, R_{i,t+1})\).
- Within each date \(t\), randomly permute the cross-section of \(R_{i,t+1}\). Features stay attached to their stocks; only the returns are scrambled.
- Run the entire pipeline on this scrambled panel — feature construction, rule mining, stability filter, methodology selection, portfolio assembly, walk-forward backtest.
- Record the headline Sharpe (or whatever statistic you care about) of the resulting strategy.
- Repeat \(B\) times.
- Compute the bootstrap p-value as the fraction of runs in which the bootstrap headline Sharpe equals or exceeds the observed headline Sharpe.
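A schematic of that loop, assuming a user-supplied `run_pipeline` function (hypothetical here) that encapsulates the mining, selection, and walk-forward steps and returns the headline Sharpe:

```python
import numpy as np

def pipeline_bootstrap_pvalue(features, returns, run_pipeline, B=100, seed=0):
    """Cross-sectional-shuffle bootstrap p-value for an entire mining pipeline.

    `features` and `returns` are (dates x stocks) arrays; `run_pipeline` is a
    user-supplied function (hypothetical here) that performs mining, selection,
    and the walk-forward backtest, and returns the headline Sharpe.
    """
    rng = np.random.default_rng(seed)        # fixed seed: pipeline noise stays out of the null
    observed = run_pipeline(features, returns)

    null_sharpes = np.empty(B)
    for b in range(B):
        shuffled = returns.copy()
        for t in range(shuffled.shape[0]):
            rng.shuffle(shuffled[t])         # permute returns across stocks within each date
        null_sharpes[b] = run_pipeline(features, shuffled)

    pval = (null_sharpes >= observed).mean()
    return observed, null_sharpes, pval
```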
The bootstrap p-value is a direct measure of how lucky your pipeline got. If it is below \(0.05\), the pipeline produces results that pure noise cannot match more than five percent of the time. If it is \(0.30\), your pipeline finds a “winning” result on a third of random datasets — and you have no statistical basis to claim that the original result is anything more than that luck.
For the 5.3-year cautionary tale, a pipeline-level bootstrap might take twenty-five hours of compute (one hundred re-runs at fifteen minutes each). This is a one-time cost and is the single most credible piece of evidence you can produce that a complex pipeline is doing real work.
Three things break a pipeline bootstrap. (i) If your shuffle accidentally preserves some real signal — for example, shuffling returns globally rather than within-date when returns have a market component — the null is not actually null and your p-value is biased toward zero. (ii) If your pipeline contains a step that depends on a random seed, you must fix the seed across bootstrap iterations or the noise in your pipeline becomes part of the null distribution. (iii) If your pipeline takes hours to run, \(B\) will be small (\(\le 100\)) and tail probabilities are unstable; report a Clopper–Pearson confidence interval on the bootstrap p-value rather than the point estimate alone.
A Worked Demonstration: \(m = 200\) Random Strategies
Where you’ll see this: this is the demo to bookmark. It is the single most common interview/whiteboard exercise on multiple testing — “you have 200 candidate strategies, apply every method we discussed, compare what they pick.”
To see all of these procedures interacting on a single dataset, the block below simulates a candidate pool of \(m = 200\) “strategies” where only a handful are real. Each strategy is one year of daily returns. The procedure ranks them by realized Sharpe and shows what each filter — naive p-value, Bonferroni, Holm, BH, BY, and a sign-flip bootstrap — picks.
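A compact sketch of the whole demonstration — naive thresholding, the FWER and FDR procedures via statsmodels, and a sign-flip best-of-\(m\) bootstrap. The simulated panel (8 real signals, 1% daily volatility, 0.2% daily edge) is illustrative and the exact counts vary run to run:

```python
import numpy as np
from scipy.stats import ttest_1samp
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(9)
m, m_real, T = 200, 8, 252        # 200 candidate strategies, 8 real, one year of daily data
sigma, edge = 0.01, 0.002         # daily vol; daily edge of the real signals (t-stat near 3)

returns = rng.normal(0.0, sigma, size=(m, T))          # rows are strategies, columns are days
is_real = np.zeros(m, dtype=bool); is_real[:m_real] = True
returns[is_real] += edge

sharpes = returns.mean(axis=1) / returns.std(axis=1, ddof=1) * np.sqrt(252)
pvals = ttest_1samp(returns, 0.0, axis=1, alternative="greater").pvalue

def report(name, reject):
    n_rej, n_true = int(reject.sum()), int((reject & is_real).sum())
    fdp = (n_rej - n_true) / max(n_rej, 1)
    print(f"{name:<14} R={n_rej:3d}  true={n_true}  false={n_rej - n_true}  "
          f"FDP={fdp:.2f}  power={n_true / m_real:.2f}")

report("naive p<0.05", pvals < 0.05)
for name, method, a in [("Bonferroni", "bonferroni", 0.05), ("Holm", "holm", 0.05),
                        ("BH q=0.05", "fdr_bh", 0.05), ("BH q=0.10", "fdr_bh", 0.10),
                        ("BY q=0.05", "fdr_by", 0.05)]:
    rej, *_ = multipletests(pvals, alpha=a, method=method)
    report(name, rej)

# Sign-flip "best-of-m" bootstrap: the 95% tail of the best Sharpe under pure noise.
B = 500
best_null = np.empty(B)
for b in range(B):
    flipped = returns * rng.choice([-1.0, 1.0], size=(m, T))
    best_null[b] = (flipped.mean(axis=1) / flipped.std(axis=1, ddof=1) * np.sqrt(252)).max()
report("bootstrap best", sharpes > np.quantile(best_null, 0.95))
```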
Read the table column by column. R is the total number of rejections — discoveries the procedure reports. True is the count of those that hit a genuine signal; False is the count of false alarms. FDP is the realized false discovery proportion on this single dataset; Power is the fraction of the eight real signals that the procedure recovered.
You will typically see:
- Naive thresholding rejects about \(0.05 \times (m - m_1) + m_1 = 9.6 + 8 = 17\) on average — so about \(9\) to \(10\) false positives, FDP near \(55\%\). Power near \(100\%\).
- Bonferroni rejects roughly \(m_1\) on the strongest signals; FDP near zero; power \(\approx 50\%\) to \(80\%\) depending on draw.
- Holm rejects one or two more than Bonferroni; same FDP; marginally higher power.
- BH at \(q = 0.05\) rejects close to \(m_1\), with FDP near or below the target.
- BH at \(q = 0.10\) rejects more, FDP slightly higher but bounded.
- BY is between BH and Holm in count.
- Bootstrap “best-of-\(m\)” approves only strategies whose Sharpe exceeds the \(95\%\)-tail of the best-of-\(m\) null — typically the same \(1\) to \(4\) strongest survivors as Bonferroni.
The takeaways are visible in a single glance:
- The naive threshold is unusable in any setting where you tested more than a handful of strategies.
- FWER procedures (Bonferroni, Holm) are conservative — they will miss real signals — but produce nearly-pristine discovery sets.
- FDR procedures (BH, BY) trade a small, controlled rate of false positives for substantially higher power and are usually the right choice in alpha mining.
- The bootstrap is the closest match to “headline best-of-\(m\)” inference and the right test if you are reporting the single best strategy in a marketing context.
Practical Guidance: When to Use Which
Where you’ll see this: every time you stare at a list of methods and ask “which one do I actually run for this project?” Bookmark this section — it is the decision tree you will use for the rest of your career.
The methods you have learned are not interchangeable. Each is the right answer to a particular question. A short menu:
Use Bonferroni or Holm when:
- You have a small family (\(m \le 50\)) and care about strict event-level error control.
- You are about to deploy a single rule with real capital and one false positive would be a disaster.
- You are reporting to a regulator or an audit committee that wants the most conservative possible filter.
- You have no idea what the dependence among tests looks like.
Use Benjamini–Hochberg when:
- You have a large candidate pool (\(m \gg 100\)) and want to surface as many real signals as possible while bounding the proportion of false discoveries.
- The tests are independent or positively correlated — a reasonable assumption for most factor-mining settings.
- You are willing to tolerate a known fraction of false discoveries in your reported set.
Use Benjamini–Yekutieli when:
- You have a large candidate pool but cannot rule out negative correlations among tests (e.g., long–short pairs, hedged variants).
- You want a procedure that is robust to arbitrary dependence structure, at a real but bounded cost in power.
Use the Deflated Sharpe Ratio when:
- Your headline is a Sharpe ratio (or any other metric of cumulative outperformance) and you want to test “is this number significant after correcting for the \(N\) trials behind it?”
- You can defend an effective \(N\) from your selection process.
- The strategy’s return distribution is reasonably approximated by something with finite moments (heavy-tailed crypto returns or jump-prone short-vol strategies need additional care).
Use a bootstrap null distribution when:
- Your pipeline contains steps that no closed-form correction can capture: stability filters, methodology branches, ensemble combinations, calibration cycles.
- You have compute budget for \(B \ge 100\) pipeline re-runs.
- You want a single, defensible “did my pipeline beat luck?” p-value that you can put in a research note.
A multiple-testing correction is a contract between you and the statistician inside you. Once you have specified the family — the procedure, the candidate set, the trial count \(N\) — you owe the reader the family-level result of that contract, not whichever individual rule happens to look best in isolation. The contract you cannot enforce is the one you negotiated after seeing the data. That is the territory of Chapter 4.
A Bridge to Chapter 4
Where you’ll see this: the moment you (or anyone around you) starts iterating on a methodology — “let me try a different lookback window,” “let me change the cost gate” — you have left the territory of this chapter. Chapter 4 starts where the cleanly-enumerable family ends.
The procedures in this chapter assume that you can enumerate the family — that \(m\) is a number you can write down before drawing your conclusions. In the cautionary tale, the rule-level family is enumerable: three-million-plus candidate triplets, however laborious to count. The methodology-level family — seven variants of a walk-forward methodology, plus countless smaller choices about thresholds and filters made by humans staring at intermediate results — is not enumerable. The number \(7\) understates it. The true count is the number of methodology paths the researcher could have followed given the same data, and this is typically enormous.
This is the “Garden of Forking Paths” of Gelman and Loken (2014), and it requires a different kind of remedy: not a correction but a protocol. Pre-registration, time-locked holdouts, air-gapped validation teams, multi-PM social selection, factor-based justification — the practices that real hedge funds use to avoid fooling themselves. Chapter 4 develops these in detail. For now, hold the following thought: every multiple-testing correction in this chapter assumes you know what you tested. The hardest part of quant research is being honest about what you tested.
Exercises
Where you’ll see this: doing the exercises yourself is the only way the procedures will stick. Walking through the toy p-value vectors and the small simulations is exactly the work an interviewer or referee would ask you to reproduce on the spot.
Exercise 1 — Bonferroni from First Principles
You scan \(m\) pairs of value-momentum stock signals. Each test is two-sided at the nominal \(5\%\) level under \(H_0: \mu = 0\).
(a) Show that if all \(m\) tests are mutually independent and all \(H_0\) are true, the probability of at least one rejection at the per-test level \(\alpha\) is \(1 - (1 - \alpha)^m\). Evaluate this for \(\alpha = 0.05\) at \(m = 5, 20, 100, 1000\).
(b) Compare the Bonferroni-adjusted per-test threshold \(\alpha/m\) to the Šidák-adjusted threshold \(1 - (1 - \alpha)^{1/m}\) for the same set of \(m\). Which is more conservative? By how much?
(c) Why is Bonferroni preferred in practice despite being slightly more conservative?
Exercise 2 — Holm vs. Bonferroni on a Toy Example
You test \(m = 5\) strategies and obtain the following p-values: \(\{0.001,\ 0.012,\ 0.018,\ 0.030,\ 0.060\}\).
(a) Which hypotheses does Bonferroni reject at FWER \(0.05\)?
(b) Which hypotheses does Holm reject at FWER \(0.05\)? Walk through the step-down procedure explicitly, listing the threshold compared at each step.
(c) Construct (by hand) a 5-element vector of p-values for which Holm rejects all five hypotheses while Bonferroni rejects only one.
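If you want to verify your walk-through afterwards, here is one minimal way to code Holm's step-down; a sketch, not the only correct implementation.

```python
import numpy as np

def holm_reject(pvals, alpha=0.05):
    """Boolean rejection mask for Holm's step-down procedure at FWER alpha."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    reject = np.zeros(m, dtype=bool)
    for step, idx in enumerate(order):        # thresholds alpha/m, alpha/(m-1), ...
        if p[idx] <= alpha / (m - step):
            reject[idx] = True
        else:
            break                             # step-down: stop at the first failure
    return reject

pvals = [0.001, 0.012, 0.018, 0.030, 0.060]
print("Bonferroni:", np.asarray(pvals) <= 0.05 / len(pvals))
print("Holm:      ", holm_reject(pvals))
```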
Exercise 3 — BH and BY on a Mixed Family
Simulate \(m = 500\) tests where the first \(25\) are real signals with \(t\)-statistics drawn from \(\mathcal N(3.0, 0.4)\) and the rest are null with \(t\)-statistics drawn from \(\mathcal N(0, 1)\). Convert to two-sided p-values.
(a) Apply Bonferroni, Holm, BH at \(q = 0.05\), BH at \(q = 0.10\), and BY at \(q = 0.05\). Report for each procedure: number of rejections, true discoveries, false discoveries, realized FDP, and power.
(b) Repeat the experiment \(1{,}000\) times and report the mean FDP and mean power for each procedure. Which procedures control FDR at or below their stated level? Which procedures sit comfortably below the level (and pay for it in power)?
(c) Now make the test statistics correlated: draw \(\mathbf t \sim \mathcal N(\boldsymbol\mu, \boldsymbol\Sigma)\) where \(\boldsymbol\Sigma\) has \(1\) on the diagonal and \(\rho = 0.3\) off the diagonal (positive correlation). Re-run (b). Does BH still control FDR? Does BY?
(d) Now switch to negative equicorrelation \(\rho = -1/(m-1) \approx -0.002\) (the most negative value that keeps \(\boldsymbol\Sigma\) positive semi-definite at this \(m\)). Repeat (b) and comment on whether BH’s FDR control breaks down.
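A starter sketch for the simulation in (a), assuming `numpy` and `scipy`. The step-up cutoff is written out by hand so the BH/BY thresholds are visible, and the \(0.4\) in the exercise is read as a standard deviation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
m, m_true = 500, 25
t = np.where(np.arange(m) < m_true,
             rng.normal(3.0, 0.4, size=m),   # real signals (0.4 read as a std. dev.)
             rng.normal(0.0, 1.0, size=m))   # nulls
pvals = 2 * stats.norm.sf(np.abs(t))         # two-sided p-values

def step_up_reject(p, q=0.05, by=False):
    """Benjamini-Hochberg step-up (Benjamini-Yekutieli if by=True) rejection mask."""
    p = np.asarray(p, dtype=float)
    m = len(p)
    c = np.sum(1.0 / np.arange(1, m + 1)) if by else 1.0
    order = np.argsort(p)
    below = p[order] <= q * np.arange(1, m + 1) / (m * c)
    k = below.nonzero()[0].max() + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

for name, mask in [("BH q=0.05", step_up_reject(pvals, 0.05)),
                   ("BH q=0.10", step_up_reject(pvals, 0.10)),
                   ("BY q=0.05", step_up_reject(pvals, 0.05, by=True))]:
    V = mask[m_true:].sum()                  # false discoveries (nulls are indices 25..499)
    print(name, "rejections:", mask.sum(),
          "true:", mask[:m_true].sum(), "FDP:", V / max(mask.sum(), 1))
```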
Exercise 4 — Deflated Sharpe Ratio in Practice
A research team reports a single strategy with daily returns over \(T = 1{,}260\) trading days (five years). The annualized Sharpe is \(1.4\), the annualized return is \(14\%\), and the annualized volatility is \(10\%\). Skewness of daily returns is \(-0.8\) and (raw) kurtosis is \(7\).
(a) Compute the Mertens-corrected standard error of the per-period Sharpe and the annualized standard error. Compare to the naive \(1/\sqrt{T}\) approximation. Which is larger? Why?
(b) Compute the DSR assuming \(N_{\text{trials}} = 1\) (no selection). How significant is the result?
(c) Compute the DSR for \(N_{\text{trials}} \in \{10, 100, 1{,}000, 10{,}000\}\). At what point does the DSR drop below \(0.95\)?
(d) Suppose the team admits that they iterated on their methodology over \(N_{\text{trials}} = 50\) rounds. They argue that this is a small number because most rounds were “obvious failures” that did not really count. Critique this argument. What is the right way to count \(N_{\text{trials}}\) in a research-group setting?
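One way to set up the computation, with a loud caveat: the cross-trial spread of null Sharpes is proxied here by the Mertens standard error of the estimate, which is a simplifying assumption; substitute the chapter's own DSR parameterization if it differs.

```python
import numpy as np
from scipy import stats

T, skew, kurt = 1260, -0.8, 7.0          # raw kurtosis (normal = 3)
sr = 1.4 / np.sqrt(252)                  # per-period (daily) Sharpe

# (a) Mertens standard error of the per-period Sharpe estimate vs the naive one.
var_sr = (1 - skew * sr + (kurt - 1) / 4 * sr**2) / (T - 1)
se_sr = np.sqrt(var_sr)
print("Mertens SE (daily):", se_sr, " naive 1/sqrt(T):", 1 / np.sqrt(T))

# (b)-(c) DSR for a range of trial counts. SR0 is the expected maximum of
# N null Sharpe estimates; the cross-trial spread is proxied by se_sr (assumption).
gamma = 0.5772156649                     # Euler-Mascheroni constant
for n_trials in (1, 10, 100, 1_000, 10_000):
    if n_trials == 1:
        sr0 = 0.0                        # no selection: benchmark is zero
    else:
        sr0 = se_sr * ((1 - gamma) * stats.norm.ppf(1 - 1 / n_trials)
                       + gamma * stats.norm.ppf(1 - 1 / (n_trials * np.e)))
    dsr = stats.norm.cdf((sr - sr0) * np.sqrt(T - 1)
                         / np.sqrt(1 - skew * sr + (kurt - 1) / 4 * sr**2))
    print(f"N={n_trials:6d}  SR0(daily)={sr0:.4f}  DSR={dsr:.4f}")
```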
Exercise 5 — Sign-Flip Bootstrap
You have \(T = 504\) daily returns of a strategy with mean \(0.0005\) per day and standard deviation \(0.012\) per day. You want to test \(H_0: \mu = 0\) via a sign-flip bootstrap.
(a) Implement the sign-flip bootstrap with \(B = 2{,}000\) replications. Compute the bootstrap one-sided p-value for the observed Sharpe.
(b) Compare the bootstrap p-value to the one-sided \(t\)-test p-value. Are they similar? Where would they diverge most?
(c) Modify your bootstrap to use a block bootstrap with block length \(\ell = 5\) instead of independent sign flips. (Hint: sample contiguous five-day blocks with replacement from the original return series, concatenate to length \(T\), and compute the Sharpe.) Compare the resulting null distribution to the sign-flip null. When does block length matter?
(d) Suppose your daily returns have autocorrelation \(0.15\) at lag one. Would you prefer sign-flip or block bootstrap? Why?
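A starter sketch for (a) and (c). The synthetic return series is only a stand-in with the stated mean and standard deviation, and the block-bootstrap series is centered so the resampled Sharpes are draws under the zero-mean null (an assumption layered on top of the exercise's hint).

```python
import numpy as np

rng = np.random.default_rng(0)
T, B, ell = 504, 2_000, 5
returns = rng.normal(0.0005, 0.012, T)   # stand-in series with the stated moments

def sharpe(x):
    return x.mean() / x.std(ddof=1)

obs = sharpe(returns)

# (a) Sign-flip bootstrap: under H0 the returns are symmetric about zero,
# so randomly flipping signs produces draws of the Sharpe under the null.
flip_null = np.array([sharpe(returns * rng.choice([-1.0, 1.0], T)) for _ in range(B)])
p_flip = (1 + np.sum(flip_null >= obs)) / (B + 1)

# (c) Block bootstrap, block length 5: resample contiguous blocks of the
# centered series and concatenate to length T.
centered = returns - returns.mean()
n_blocks = -(-T // ell)                  # ceiling division
starts = rng.integers(0, T - ell + 1, size=(B, n_blocks))
block_null = np.array([sharpe(np.concatenate([centered[s:s + ell] for s in row])[:T])
                       for row in starts])
print("sign-flip p:", p_flip,
      " null sd (flip vs block):", flip_null.std(), block_null.std())
```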
Exercise 6 — A Mini-Pipeline Bootstrap
This exercise puts everything together. You will run a small mining pipeline and use the bootstrap to test whether its headline result beats luck.
Setup. Generate a synthetic panel of \(N = 100\) stocks over \(T = 252\) days. Each stock has a \(30\)-dimensional feature vector at each date drawn from \(\mathcal N(0, 1)\). Returns are generated as \(R_{i,t+1} = \beta_i^\top X_{i,t} + \varepsilon_{i,t}\) where \(\beta_i = 0\) for most stocks but \(\beta_i = \mathbf v\) (a fixed direction in feature space) for a randomly chosen \(5\) of the \(100\) stocks, with \(\|\mathbf v\| = 0.05\).
Pipeline. For each of the \(30\) features, run a cross-sectional regression of \(R_{i,t+1}\) on \(X_{j,i,t}\) at each date \(t\), average the coefficients across dates, and report the feature with the largest absolute average coefficient. Backtest a long-short decile portfolio sorted on this feature, and report its annualized Sharpe.
(a) Run the pipeline on the original (signal-bearing) panel. Report the chosen feature, its average coefficient, and the resulting Sharpe.
(b) Run a cross-sectional shuffle bootstrap: within each date, randomly permute the returns across stocks. Run the full pipeline on each shuffled panel. Repeat \(B = 200\) times. Report the bootstrap distribution of the headline Sharpe and the bootstrap p-value of the observed Sharpe.
(c) Repeat (b) with \(\|\mathbf v\| = 0.01\) (much weaker signal). Compare the bootstrap p-value. At what signal magnitude does the bootstrap stop rejecting the null?
(d) Replace the feature-selection step in your pipeline with a no-selection version that simply uses feature \(0\) (the first feature, regardless of its in-sample coefficient). Re-run the bootstrap. Compare the bootstrap null distributions for the selected-feature pipeline and the no-selection pipeline. Which is wider? Why?
Part (d) is the conceptual heart. The bootstrap null distribution of a selection pipeline is wider than the null distribution of a no-selection pipeline because the selection step introduces an additional source of luck — the lucky feature. The width of the bootstrap null is the empirical version of the Bailey–López de Prado deflation. Bootstrap your own pipeline and you will see the multiple-testing inflation factor with your own eyes.
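For readers who want a compressed, runnable starting point, the sketch below implements a stripped-down version of the panel, the pipeline, and the cross-sectional shuffle of part (b). The 2% daily noise volatility and the use of mean(feature × return) as the cross-sectional slope (a shortcut justified by the features being standardized) are assumptions, not part of the exercise statement.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, K = 100, 252, 30
X = rng.normal(size=(T, N, K))                      # features: date x stock x feature
beta = np.zeros((N, K))
v = rng.normal(size=K)
v *= 0.05 / np.linalg.norm(v)                       # fixed direction, norm 0.05
beta[rng.choice(N, 5, replace=False)] = v           # 5 signal-bearing stocks
# R[t] is the return realized after observing X[t]; noise sd 0.02 is an assumption.
R = np.einsum('tnk,nk->tn', X, beta) + rng.normal(0, 0.02, (T, N))

def pipeline_sharpe(X, R):
    """Pick the feature with the largest |average cross-sectional slope|,
    then annualized Sharpe of a long-short decile portfolio sorted on it."""
    slopes = (X * R[:, :, None]).mean(axis=1)        # T x K per-date slopes
    j = np.abs(slopes.mean(axis=0)).argmax()         # selection step
    ranks = X[:, :, j].argsort(axis=1)
    dec = N // 10
    rows = np.arange(T)[:, None]
    ls = R[rows, ranks[:, -dec:]].mean(1) - R[rows, ranks[:, :dec]].mean(1)
    return np.sqrt(252) * ls.mean() / ls.std(ddof=1)

obs = pipeline_sharpe(X, R)
B = 200
# (b) Cross-sectional shuffle: permute returns across stocks within each date.
null = np.array([pipeline_sharpe(X, rng.permuted(R, axis=1)) for _ in range(B)])
print("observed Sharpe:", obs, " bootstrap p:", (1 + np.sum(null >= obs)) / (B + 1))
```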
Chapter Summary
Where you’ll see this: any time you have to defend a research finding — to a referee, a PM, a regulator, your future self — these are the six tools (Bonferroni, Holm, BH, BY, the DSR, and the bootstrap) that you will reach for, in roughly that order of complexity.
You have learned to reason about a family of tests rather than about each test in isolation. The Family-Wise Error Rate \(\Pr(V \ge 1)\) controls the probability of any false discovery; Bonferroni gives you this guarantee under any dependence structure and Holm tightens the Bonferroni cutoff while preserving FWER. The False Discovery Rate \(\mathbb{E}[V/(R\vee 1)]\) controls the expected proportion of false discoveries; Benjamini–Hochberg delivers it under independence or positive regression dependence and Benjamini–Yekutieli does so under arbitrary dependence at a logarithmic cost in power. The Deflated Sharpe Ratio applies the trial-count correction directly to the Sharpe statistic, benchmarking the observed Sharpe against an extreme-value estimate of the best null Sharpe among the trials behind the selected result. The bootstrap null distribution, finally, replaces every closed-form correction with a numerical one: shuffle the data, re-run the pipeline, and compare the observed statistic to its empirical null. Of these tools, the bootstrap is the most general and the most computationally expensive; the BH procedure is your default in real work for large-\(m\) scans; the DSR is what to put in your research note when reporting a single headline.
These procedures share one limitation: they require you to specify \(m\) honestly. In the next chapter we will see that the harder problem in quant research is not implementing a procedure but admitting what was tested. The Garden of Forking Paths waits.