
Chapter 2: Model Decay, Drift Detection & Retraining

What you will learn

Model decay — a quant model that “works” in backtest and the first year of live trading often stops working in year two or three. The Sharpe slowly bleeds out, the IC turns negative, and the desk eventually shuts the model down. This chapter is about catching that early.

A model that backtested at Sharpe 1.5 on data through 2019 will not deliver Sharpe 1.5 in 2024. Returns, volatility surfaces, factor loadings, and feature distributions all drift; signals decay; the market crowds you out. This chapter is about how to see that decay before it eats your P&L. You will learn the canonical decay diagnostics — feature-importance drift, alpha-decay curves, portfolio turnover, and Kolmogorov–Smirnov OOD detection — and how to glue them together into an automated alerting layer. You will also learn how to choose a retraining policy: frozen, scheduled, IC-triggered, or hybrid; compare their wealth curves on real walk-forward results; and pick the cadence that matches your signal’s half-life.

Plain-English glossary for this chapter (each term is unpacked again where it first appears):

  • Regime change — the world changes. The 2020 COVID crash, the 2022 rate-hike cycle, the 2008 GFC — these are regime changes. A model trained on one regime can fail in the next.
  • Feature drift — the distribution of input features changes (e.g., average P/E ratio in 2021 is twice as wide as in 2018). The model is now predicting on inputs unlike anything it was trained on.
  • Crowding — too many funds running similar strategies. The signal still exists on paper but the trades erode it.
  • OOD (out-of-distribution) — new inputs that look unlike training data. A model has no reliable answer for OOD inputs.
  • Turnover — how often the portfolio’s holdings change. High turnover = more trading costs. Spike in turnover = the model is changing its mind a lot, often a sign of confusion.
  • Retraining — periodically refit the model on more recent data so it adapts to the changing world.

Why Models Decay

Where you’ll see this. The first thing you’ll hear on a real trading floor when a model starts losing money is “is it a regime change, feature drift, or crowding?” Those three terms are the working vocabulary. Get them straight now and you’ll follow the post-mortem conversation on day one.

Intuition

Think of a model as a recipe written by tasting bread baked at sea level. The recipe says “bake 25 minutes”. Move the kitchen to a mountain top — lower air pressure, different humidity — and the same recipe burns the crust. The recipe didn’t change. The world changed. Quant models are recipes for predicting returns; markets are the kitchen; and markets move to a new altitude every few years. Three things can break the recipe: the relationship between oven temperature and crust browning changes (regime change), the flour you buy now is different from the flour you tested with (feature drift), or your neighbour started selling the same bread at the same farmer’s market (crowding).

The Comfortable Fiction of a Stationary World

Every supervised learning textbook proves convergence under the i.i.d. assumption: each training example is an independent draw from a fixed joint distribution \(p(X, Y)\), and the test data are drawn from the same distribution (i.i.d. just means “independent and identically distributed” — every sample is a fresh draw from the same fixed bag of possibilities). Under that fiction, you train a model on the past, evaluate it on a held-out sample, and — if the held-out error is acceptable — deploy it forever. Bias and variance trade off; the bias-variance decomposition tells you everything. Cross-validation is the final word.

This is not the world quant traders inhabit. Financial markets are non-stationary by every operational definition. The cross-section of expected returns changes when monetary policy regimes change. The book-to-market premium that earned 4% per year for the half-century before 2007 has been roughly flat (and occasionally negative) since 2010. The post-1985 momentum effect collapsed in March 2009 and has been markedly weaker in mega-caps since 2014. The volatility risk premium narrowed steadily as institutional VIX-selling strategies proliferated. The list is long, the mechanisms vary, but the conclusion is uniform: the joint distribution of features and forward returns is not constant, and a model trained on the joint distribution from \([t_0, t_1]\) will eventually be evaluated on data drawn from a measurably different joint distribution from \([t_2, t_3]\).

The technical name for this phenomenon is distribution shift. The trader’s name for it is model decay. They are the same thing.

Three Mechanisms

Distribution shift in finance comes from at least three distinct mechanisms, and it is worth separating them because the appropriate response differs in each case.

Regime change. Regime change — the world changes. The 2020 COVID crash, the 2022 rate-hike cycle, the 2008 GFC — these are regime changes. A model trained on one regime can fail in the next. More formally: the macroeconomic or institutional environment changes in a way that alters the equilibrium relationship between features and returns. The post-2008 zero-interest-rate environment changed the duration risk premium; the post-2020 inflation shock changed it again. A model that learned the value premium from a sample dominated by stable-inflation decades will misfire in the new regime. Formally, the conditional distribution \(p(Y \mid X)\) changes even when the marginal \(p(X)\) does not. (Read \(p(Y \mid X)\) as “the probability of return \(Y\) given the features \(X\)” — i.e., the relationship the model learned.)

Feature drift. Feature drift — the distribution of input features changes (e.g., average P/E ratio in 2021 is twice as wide as in 2018). The model is now predicting on inputs unlike anything it was trained on. Mega-cap stocks now dominate the cross-section in a way they did not in 1995; the dispersion of price-to-book has widened; the tail of analyst-coverage ratios has thinned. The conditional relationship between features and returns may be unchanged, but the feature distribution at deployment looks unlike the training distribution, and the model is being asked to extrapolate. Tree ensembles and deep models extrapolate poorly. Formally, \(p(X)\) has shifted while \(p(Y \mid X)\) may or may not be stable.

Crowding. Crowding — too many funds running similar strategies. The signal still exists on paper but the trades erode it. Other quantitative traders find the same signal, push prices in the direction of the prediction at the moment of trade, and arbitrage away the next-period excess return. Crowding is not a passive observation about the data; it is an endogenous consequence of deploying the signal at scale (endogenous = caused by the strategy itself rather than by an outside force). The more capital that uses a value signal, the smaller the value premium remaining for any one user. Crowding manifests as a gradual decline in the alpha decay curve at every horizon, often without any visible change in the marginal feature distribution.

All three mechanisms produce the same surface symptom — falling out-of-sample Information Coefficient — but they call for different remedies. Regime change argues for full retraining on recent data. Feature drift argues for either dropping the drifted features or expanding the universe. Crowding argues for either capacity reduction or signal rotation.

Key takeaway

Model decay is not a single phenomenon. Regime change shifts \(p(Y \mid X)\); feature drift shifts \(p(X)\); crowding endogenously degrades the predictability of \(Y\) given \(X\) because everyone else is trading on the same prediction. The diagnostics in this chapter — KS for \(p(X)\), rolling IC for \(p(Y \mid X)\), alpha decay for crowding — are designed to separate the three.

The Alpha Decay Curve as Empirical Object

Alpha decay curve — IC at prediction horizon \(h\) (\(h\) = 1, 5, 10, 20 days). Short-horizon alphas decay fast; long-horizon alphas decay slow. The shape tells you how often to retrade.

Intuition

A weather forecast made at 8 a.m. is very accurate for “the next hour”, reasonably accurate for “this afternoon”, and basically useless for “this day next month”. A stock-return forecast behaves the same way. The alpha decay curve plots forecast accuracy (IC, the rank correlation between forecast and realized return) against the forecast horizon. The shape tells you how often to use the forecast: if accuracy halves every 4 days, you must rebalance every few days; if accuracy halves every 6 months, monthly rebalancing is plenty.

Before we move to monitoring infrastructure, it is worth defining the object we are monitoring. Let \(\hat R_{i,t+1}\) be the model’s one-period-ahead forecast made at time \(t\) (the hat means “estimated”; \(i\) indexes the stock, \(t\) the time). Define the IC at horizon \(h\) as

\[ \text{IC}(h) \;=\; \mathbb E_t \!\left[\, \text{Spearman}\!\bigl(\{\hat R_{i,t+1}\}_i,\; \{R_{i,t+h}\}_i\bigr) \,\right], \]

where \(R_{i,t+h}\) is the realized return for stock \(i\) over the period \([t, t+h]\). The Spearman rank correlation \(\in [-1, 1]\) asks “do the stocks the model ranks highest also realize the highest returns?” — it is the standard quant-finance accuracy score. The function \(h \mapsto \text{IC}(h)\) (“the function that sends horizon \(h\) to its IC”) is the alpha decay curve. A well-behaved alpha decay curve starts at a positive value at \(h = 1\), declines monotonically (i.e., never goes back up), and crosses some “half-decay” threshold \(\text{IC}(1)/2\) at a horizon called the signal half-life.

The half-life is the single most important number for setting your rebalancing frequency. A fast-decaying signal — momentum, news sentiment, intraday flow — has a half-life of one to three months and must be rebalanced monthly or more often to capture its predictive content before it dissipates. A slow-decaying signal — value, profitability, low volatility — has a half-life of six to twelve months and is wasted on monthly turnover; quarterly or semi-annual rebalancing captures essentially the same IC at a fraction of the transaction cost. Matching rebalancing cadence to signal half-life is the single largest implementation-level lever a portfolio manager has.

The cell below simulates a feature whose predictive content decays geometrically (each month it keeps a fixed fraction of its predictive power) and recovers the input half-life from the IC vs. horizon curve. It is the cleanest illustration of what the decay-curve estimate actually does.
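
A minimal sketch of what such a cell might look like, assuming a 500-name cross-section over 240 months, a coefficient that halves every four months, and a log-linear fit of IC(h) to recover the half-life (the sizes and coefficient level are illustrative, not the lab's exact values):

import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
N, T = 500, 240                      # assumed cross-section size and number of months
HALF_LIFE = 4.0                      # months; the input the estimate should recover
horizons = np.arange(1, 11)

ic = []
for h in horizons:
    beta_h = 0.06 * 0.5 ** (h / HALF_LIFE)       # predictive coefficient halves every HALF_LIFE months
    monthly = []
    for _ in range(T):
        x = rng.normal(size=N)                   # feature observed at t
        r = beta_h * x + rng.normal(size=N)      # realized return over [t, t+h]
        monthly.append(spearmanr(x, r)[0])
    ic.append(np.mean(monthly))

# For small coefficients IC(h) is roughly proportional to beta_h, so log IC is linear in h.
slope = np.polyfit(horizons, np.log(np.maximum(ic, 1e-4)), 1)[0]
print("IC(h):", np.round(ic, 4))
print(f"estimated half-life: {np.log(0.5) / slope:.1f} months (input: {HALF_LIFE})")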

What we just got. The empirical decay curve recovers the input half-life of four months almost exactly. In real data the curve is noisier but the shape is qualitatively the same. We will compute it on the HGBR production model later in the chapter.

In real work

Your first question when handed a new signal is what is the half-life? You can answer it before deciding anything else about portfolio construction, transaction costs, or capacity. The decay curve answers it directly.

Feature Importance Drift

Where you’ll see this. When a model starts losing money the first review meeting will project an “importance over time” stacked-bar chart and ask “which feature stopped working?” Reading that chart is a basic-skill question; learning the three different ways importance is measured (and why they disagree) lets you weigh in instead of nodding.

Intuition

Think of a soccer team. Each player has an importance to the team — how much winning depends on that player. When the team starts losing matches you don’t just ask “are we losing?”, you ask “who stopped contributing?” Maybe the star striker is injured (a once-key feature has lost its signal). Maybe the opponents figured out our play patterns (the relationship between players and goals has changed). Maybe two midfielders now do the same job and step on each other’s toes (a once-unique feature has become redundant). Feature-importance drift over time is the same diagnostic for a model.

Why Feature Importance Changes Over Time

A trained model assigns each feature an importance — a measure of how much of its predictive content the model attributes to that input. When you retrain the same architecture on a sliding window of data, the relative ranking of features should be roughly stable if the underlying signal structure is stationary (stationary just means “the statistical structure isn’t changing over time”). When the ranking reorders — when momentum drops from first to fourth, or short-interest jumps from tenth to second — something has changed in the data. The reordering is one of the earliest and most interpretable signs of decay.

Three things can produce a reordering. First, a previously informative feature loses its predictive power (alpha decay of that specific signal). Second, the joint distribution of features changes so that what used to be predictive in isolation now overlaps with other features (redundancy under drift). Third, the noise level in one feature increases relative to others (data quality drift). All three are worth catching, and the feature-importance series is one of the cheapest ways to catch them.

Mechanically, importance drift is measured by retraining on a sliding window and recording importance over time. Either the level changes (a feature’s importance falls from 0.20 to 0.05) or the ranking changes (a feature drops from rank 1 to rank 6). The level series is more informative for understanding magnitude of decay; the rank series is more robust for detecting qualitative changes when the absolute scale is hard to compare across regimes.

Three Ways to Measure Feature Importance

There is no canonical “correct” definition of feature importance for nonlinear models, and the three commonly used definitions give meaningfully different answers. A serious model-monitoring pipeline computes at least two and inspects disagreements.

Tree-native gain importance. For tree ensembles (random forests, gradient boosting), the simplest measure is the total decrease in loss attributable to splits on each feature, summed across all trees and normalized to sum to one. Most scikit-learn tree estimators expose this as model.feature_importances_ (the histogram-based HistGradientBoosting estimators are a notable exception). It is essentially free — no extra computation after training — and reflects the model’s internal assessment of feature usefulness.

The known weakness is that gain importance is biased toward high-cardinality features. A continuous feature with thousands of distinct values offers many more candidate split points than a binary feature; even if the binary feature is economically more important, the continuous feature dominates the gain-importance ranking simply because it gets more opportunities to find a split that lowers training loss by chance. For a panel of normalized cross-sectional ranks (where every feature has the same effective cardinality \(N\)), the bias is mild; for raw heterogeneous features it can be severe.

Permutation importance. A model-agnostic alternative is to fit the model once, compute its out-of-sample score \(s_0\), then permute one feature column across observations (destroying its information content while preserving its marginal distribution), recompute the score \(s_k\), and define the importance as \(s_0 - s_k\). Repeat for each feature, ideally averaging over several permutations to reduce noise. Scikit-learn provides sklearn.inspection.permutation_importance.

Permutation importance is unbiased with respect to cardinality and reflects the model’s actual functional form rather than its internal split bookkeeping. The cost is computational: \(K\) extra inference passes are needed (or \(K \cdot n_\text{repeats}\) for stable estimates). For modern tree ensembles on a cross-section of size \(N \approx 10^4\) and \(K \approx 50\) features, this is a few seconds per retraining window.

Univariate Information Coefficient. A model-free measure (it requires no model at all — just the feature and the returns) is to compute, for each feature \(k\) separately, the cross-sectional rank correlation between that feature at time \(t\) and realized returns at \(t+1\), averaged across months:

\[ \text{IC}_k \;=\; \frac{1}{T} \sum_{t=1}^T \text{Spearman}\!\bigl(\{X_{k,i,t}\}_{i=1}^{N_t},\; \{R_{i,t+1}\}_{i=1}^{N_t}\bigr). \]

This is the predictive power of the feature in isolation, completely outside any model. It tells you whether a feature carries a marginal univariate signal (“marginal” here meaning “on its own, ignoring the others”).

The cell below builds a small dataset where we know the answer — \(X_0\) is a strong solo predictor, \(X_1 \times X_2\) matters only as an interaction, and \(X_3, X_4\) are 95% correlated twins — then runs all three importance measures so we can see which method detects which type of signal.
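
A minimal sketch of what that cell might look like. It uses a RandomForestRegressor for the tree-native gain importance (scikit-learn's HistGradientBoostingRegressor does not expose feature_importances_), sklearn.inspection.permutation_importance on a held-out split, and a plain Spearman correlation as the univariate IC; the sample size, coefficients, and hyperparameters are illustrative assumptions.

import numpy as np
import pandas as pd
from scipy.stats import spearmanr
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
N = 5000
X = rng.normal(size=(N, 5))
X[:, 4] = 0.95 * X[:, 3] + np.sqrt(1 - 0.95**2) * X[:, 4]   # X3 and X4: ~95% correlated twins
y = (1.0 * X[:, 0]                       # X0: strong solo predictor
     + 1.0 * X[:, 1] * X[:, 2]           # X1 x X2: pure interaction, no marginal effect
     + 0.5 * X[:, 3]                     # X3 carries a direct (weaker) signal; X4 is its redundant twin
     + rng.normal(size=N))

cols = [f"X{k}" for k in range(5)]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = RandomForestRegressor(n_estimators=200, min_samples_leaf=20, random_state=0)
model.fit(X_tr, y_tr)

gain = model.feature_importances_                                   # tree-native gain importance
perm = permutation_importance(model, X_te, y_te,
                              n_repeats=10, random_state=0).importances_mean
uni = [spearmanr(X[:, k], y)[0] for k in range(5)]                  # univariate IC (model-free)

print(pd.DataFrame({"gain": gain, "permutation": perm, "univariate_IC": uni},
                   index=cols).round(3))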

When the Three Measures Disagree

The three measures answer subtly different questions, and they diverge in informative ways.

  • A feature with high gain importance but low permutation importance is one that the model splits on heavily during training but that its predictions do not actually depend on. This usually indicates cardinality bias: the model used the feature as a tiebreaker among observations that were essentially equivalent. Treat with suspicion.

  • A feature with high permutation importance but low univariate IC is one that contributes to predictions only through interactions with other features. The classic example is firm size: by itself, size has been a weak return predictor since the 1990s, but in conjunction with momentum (size \(\times\) momentum) it remains highly informative. The model captures the interaction; univariate IC misses it.

  • A feature with high univariate IC but low permutation importance is one that is predictive in isolation, but whose predictive content is redundant with another feature already in the model. Adding earnings-to-price to a model that already uses book-to-market may not change predictions much; both are value proxies.

  • A feature that declines on all three measures simultaneously is unambiguously decaying. This is the signature of crowding or regime change. Treat as a serious finding.

What we just got. Inspect the rows:

  • X0 should rank near the top of all three columns — strong univariate signal, used directly by the model.
  • X1 and X2 should rank near the top of permutation importance and gain importance but near zero on univariate IC. The model captures the interaction; isolation misses it.
  • X3 should have a moderate univariate IC and a moderate permutation importance, but its near-twin X4 will inherit some of the credit, making both look weaker than X3 would look on its own.
  • The disagreements are informative, not contradictory. The pattern itself tells you that an interaction is being exploited and that two features carry overlapping signal.

In real work

For real work, our recommendation is permutation importance as the primary metric (model-agnostic, unbiased) and univariate IC as the secondary metric (model-free, robust to model misspecification). Gain importance is fine for a quick look at a freshly trained tree but should not be trusted in cross-sectional finance where many features have similar cardinality structure but very different economic content.

A Rolling-Window Drift Check

A single snapshot of feature importance tells you about the current model. A series of snapshots tells you whether anything is changing. The standard protocol is to retrain on a rolling window (e.g., trailing five years) every quarter or six months, record the importance vector each time, and plot it as a stacked bar or area chart. Reorderings, level changes, and emerging zeros all become visible.

A simple rule: flag a feature when its recent-window importance differs from its long-run mean by more than \(0.01\) (one IC point) and the sign has flipped or the magnitude has halved. The rule is conservative — most stable features will not trigger it from sampling noise alone — and catches both gradual decay (magnitude shrinkage) and regime flips (sign flip). In the real HGBR pipeline that motivates this chapter it produced a clean “three momentum features decaying simultaneously after 2015” signal that matched the qualitative academic consensus on the post-2009 momentum slowdown.

The next cell simulates 20 quarterly retrainings where feature X0 slowly decays to a third of its starting importance and X1 slowly grows; the others wobble around their baseline. We then apply the drift rule to see whether it correctly flags X0 and X1 while leaving the noisy-but-stable features alone.
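
A stand-in sketch rather than the lab's cell: the quarterly importance snapshots are simulated directly, the long-run baseline is taken as the first eight snapshots, and the halving condition is applied symmetrically (halved or doubled) so that X1's growth is flagged alongside X0's decay; these are interpretation choices layered on the rule stated above.

import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
T = 20                                        # quarterly retraining snapshots
t = np.arange(T)
base = {"X0": 0.060, "X1": 0.010, "X2": 0.030, "X3": 0.025, "X4": 0.020, "X5": 0.015}

imp = pd.DataFrame({f: np.full(T, v) for f, v in base.items()})
imp["X0"] = base["X0"] * (1 - (2 / 3) * t / (T - 1))     # decays to a third of its start
imp["X1"] = base["X1"] + 0.030 * t / (T - 1)             # slowly grows
imp += rng.normal(scale=0.003, size=imp.shape)           # sampling wobble on every feature

def drift_flag(series, baseline=8, recent=4, level=0.01):
    """Flag when the recent mean differs from the baseline mean by more than `level`
    AND the sign has flipped or the magnitude has halved (or doubled)."""
    b, r = series.head(baseline).mean(), series.tail(recent).mean()
    big_move = abs(r - b) > level
    flipped = np.sign(r) != np.sign(b)
    halved_or_doubled = abs(r) < 0.5 * abs(b) or abs(r) > 2.0 * abs(b)
    return bool(big_move and (flipped or halved_or_doubled))

print({f: drift_flag(imp[f]) for f in imp.columns})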

What we just got. The simulated decay of X0 and growth of X1 are detected cleanly; the other features fluctuate inside the noise band and are correctly marked stable. In real work you would supplement this with a permutation-importance run on a held-out window to verify that the drift is not an artifact of split-counting bias.

Out-of-Distribution Detection with the Kolmogorov–Smirnov Test

Where you’ll see this. When you join a quant team, the first “production monitoring” dashboard they show you will have a KS plot per feature, daily. Learn to read it now and you can contribute on week one.

Intuition

Imagine you trained a self-driving car only on dry summer roads. The car works fine in summer; then it snows for the first time and the road looks completely different to its camera. We want a warning light that turns on the moment the inputs look unfamiliar — before the car has crashed and we read about it in the P&L report. The KS test is that warning light. It compares “the distribution of inputs the model was trained on” with “the distribution of inputs the model is seeing today” and gives a single number \(D \in [0,1]\): \(0\) = identical, \(1\) = completely different worlds.

What OOD Detection Buys You

OOD (out-of-distribution) — new inputs that look unlike training data. A model has no reliable answer for OOD inputs. A monitoring suite built only around realized IC has a deep weakness: it diagnoses the problem only after losses have been realized. By the time the rolling 6-month IC falls to zero, the model has been wrong for six months and the portfolio has been bleeding. We want a complementary signal that fires before P&L deteriorates — a leading indicator.

The natural leading indicator is the feature distribution itself. If the inputs the model sees today look unlike the inputs the model saw during training, the predictions are by definition out-of-sample in a way that random-fold cross-validation never measured. The model may still produce useful outputs — or it may extrapolate catastrophically. The honest position is that we do not know what the model is doing on novel inputs, and pausing the strategy or reducing exposure until the inputs return to the training distribution is the prudent default.

Out-of-distribution (OOD) detection is the formal name for this leading-indicator approach. The simplest and most widely used OOD test is the two-sample Kolmogorov–Smirnov (KS) test, which compares two empirical distributions. It returns a \(D\)-statistic in \([0,1]\): 0 = identical, 1 = completely disjoint. Useful for asking “has feature \(X\)’s distribution moved since training?”

The KS Statistic

The formula below looks intimidating but it is doing one simple thing: walking left-to-right along the x-axis and asking “what is the biggest vertical gap I see between the two CDF curves?” That biggest gap is the \(D\)-statistic.

Given a training sample \(\{x_1, \ldots, x_n\}\) from distribution \(F\) and a current sample \(\{y_1, \ldots, y_m\}\) from distribution \(G\), the KS statistic is the supremum (just a fancy word for “maximum over all values of \(x\)”) of the absolute difference between the two empirical CDFs (an empirical CDF \(F_n(x)\) is just “the fraction of my sample with value \(\le x\)”):

\[ D \;=\; \sup_x \bigl| F_n(x) - G_m(x) \bigr|, \]

where

\[ F_n(x) \;=\; \frac{1}{n} \sum_{i=1}^n \mathbb{1}\{x_i \le x\}, \qquad G_m(x) \;=\; \frac{1}{m} \sum_{j=1}^m \mathbb{1}\{y_j \le x\}. \]

Geometrically, \(D\) is the largest vertical gap between the two step-function CDFs. Under the null hypothesis that the two samples come from the same continuous distribution, \(D\) has a known distribution and the test produces a \(p\)-value. In monitoring applications we usually care less about formal significance and more about effect size, because with large samples even tiny distributional differences become “significant”.

Caption. The KS statistic is just the largest vertical gap between two empirical CDFs at any \(x\); a per-feature \(D\) above the \(0.10\) working threshold trips a drift flag.

Operationally useful thresholds for large samples (\(n, m \ge 1000\)):

  • \(D < 0.05\): distributions essentially identical
  • \(0.05 \le D < 0.10\): small shift, usually noise
  • \(0.10 \le D < 0.20\): meaningful shift, worth flagging
  • \(D \ge 0.20\): substantial shift, regime change likely

These are working thresholds, not theorems. They reflect the empirical observation that in stable financial cross-sections, per-feature KS statistics against a five-year training window fluctuate in the \(0.02\)–\(0.08\) range, occasionally spike above \(0.10\), and almost never sit above \(0.20\) except during genuine regime breaks (e.g., March 2020 COVID, December 2018 vol spike).

The next cell runs the KS test under four controlled scenarios — no shift, a small mean shift, a large mean shift, and a variance shift — and overlays the two empirical CDFs so you can see the largest vertical gap that the \(D\)-statistic is summarising.
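
A minimal stand-in using scipy.stats.ks_2samp; the sample sizes and shift magnitudes are illustrative assumptions, and the CDF overlay is reduced to a manual recomputation of the largest vertical gap.

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(3)
n = 5000
train = rng.normal(0.0, 1.0, n)

scenarios = {
    "no shift":         rng.normal(0.0, 1.0, n),
    "small mean shift": rng.normal(0.1, 1.0, n),
    "large mean shift": rng.normal(0.5, 1.0, n),
    "variance shift":   rng.normal(0.0, 1.5, n),
}

for name, current in scenarios.items():
    res = ks_2samp(train, current)
    # D is the largest vertical gap between the two empirical CDFs; recompute it by hand
    grid = np.sort(np.concatenate([train, current]))
    gap = np.abs(np.searchsorted(np.sort(train), grid, side="right") / n
                 - np.searchsorted(np.sort(current), grid, side="right") / n).max()
    print(f"{name:17s}  D = {res.statistic:.3f}  (manual gap {gap:.3f})  p = {res.pvalue:.2g}")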

What we just got. The “no shift” panel shows two essentially overlapping CDFs and a tiny KS statistic; the “large mean shift” panel shows a visible horizontal separation between the curves at the median region, where the vertical gap is maximized. The KS statistic is precisely that maximum vertical gap.

Per-Feature KS Monitoring

In a multi-feature model we cannot reduce drift to a single number; different features can drift independently and for different reasons. The right protocol is per-feature KS: for each of the \(K\) inputs, compute \(D_k\) between the training window and the current evaluation window. Then summarize the panel of \(K\) statistics with two top-line numbers:

  • The maximum \(\max_k D_k\) — the worst-drifted feature.
  • The fraction flagged \(\frac{1}{K}\sum_k \mathbb{1}\{D_k > 0.10\}\) — the share of the input space that has shifted.

A robust monitoring rule, used in many real-work pipelines:

Alert if the fraction flagged exceeds 20% (i.e., more than one in five features has \(D > 0.10\)).

The rule is calibrated against the observation that in stable regimes, a few percent of features will exceed \(D = 0.10\) from sampling noise alone, but \(> 20\%\) flagged simultaneously is essentially impossible unless something has genuinely changed in the input distribution.

The cell below packages the per-feature loop into a reusable detect_ood function and then runs it on two synthetic test sets: one where the recent data is statistically identical to training, and one where 15 of 50 features have been bumped by \(0.4\) standard deviations. The alert should fire on the second only.
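
A sketch of what such a detect_ood helper might look like; the function name matches the text, but the internals, sample sizes, and synthetic test sets are assumptions.

import numpy as np
from scipy.stats import ks_2samp

def detect_ood(X_train, X_recent, ks_threshold=0.10, frac_threshold=0.20):
    """Per-feature KS drift check: returns the worst D, the fraction of features
    with D above ks_threshold, and whether that fraction trips the alert."""
    d = np.array([ks_2samp(X_train[:, k], X_recent[:, k]).statistic
                  for k in range(X_train.shape[1])])
    frac_flagged = float(np.mean(d > ks_threshold))
    return {"max_D": float(d.max()),
            "frac_flagged": frac_flagged,
            "alert": frac_flagged > frac_threshold}

rng = np.random.default_rng(4)
n_train, n_recent, K = 5000, 250, 50
X_train = rng.normal(size=(n_train, K))

X_stable = rng.normal(size=(n_recent, K))               # same distribution as training
X_break = rng.normal(size=(n_recent, K))
X_break[:, :15] += 0.4                                   # 15 of 50 features bumped by 0.4 sd

print("stable:", detect_ood(X_train, X_stable))
print("break: ", detect_ood(X_train, X_break))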

What we just got. In the stable scenario, KS sampling noise produces only a handful of false positives — well below the 20% threshold. In the regime-break scenario, 15 of 50 features shift simultaneously and the alert fires. The rule is asymmetric by design: false negatives during a real regime break are far costlier than occasional false positives.

KS Monitoring on a Walk-Forward Time Series

The static comparison above is just the building block. The full monitoring application is a time series of KS panels: at each evaluation month \(t\), compare the trailing \(L\)-month training window (typically \(L = 36\)) against the cross-section at \(t\), and record the fraction of features flagged. Plot the series. Spikes above the 20% threshold are the alerts.

The walk-forward picture is informative for two reasons. First, it shows that the “20% flagged” rule is rarely violated — in our real HGBR pipeline on the S&P 500 cross-section, the rate sits around 5–10% in stable years and breaches 20% only during March 2020 (COVID), late 2022 (rate-hike cycle), and February 2018 (volmageddon). Each of those breaches coincides with a discrete event a human can name. Second, it lets you ask the follow-up question — which features drove the breach — by drilling into the per-feature KS table for that month.

The next cell simulates a 96-month panel with a known regime break at month 60 (we artificially shift 4 of 12 features), then runs the rolling KS monitor and plots the % flagged through time. The alert should fire the moment the break happens.
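
A stand-in for the rolling monitor: 96 monthly cross-sections of 12 features, 4 of which are shifted by 0.6 standard deviations from month 60 onward, compared against the trailing 36-month window. The cross-section size and the shift magnitude are assumptions, and the plot is reduced to before/after summary statistics.

import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

rng = np.random.default_rng(5)
T, N, K, L = 96, 250, 12, 36              # months, stocks per month, features, trailing window
BREAK, SHIFT = 60, 0.6                    # break month; shift (in sd) applied to 4 of the 12 features
shift_vec = np.r_[np.full(4, SHIFT), np.zeros(K - 4)]

panel = [rng.normal(size=(N, K)) + (shift_vec if t >= BREAK else 0.0) for t in range(T)]

frac_flagged = {}
for t in range(L, T):
    train = np.vstack(panel[t - L:t])                          # trailing 36-month training window
    d = np.array([ks_2samp(train[:, k], panel[t][:, k]).statistic for k in range(K)])
    frac_flagged[t] = float(np.mean(d > 0.10))

series = pd.Series(frac_flagged)
print("mean fraction flagged, pre-break: ", round(series[series.index < BREAK].mean(), 3))
print("mean fraction flagged, post-break:", round(series[series.index >= BREAK].mean(), 3))
print("first month above the 20% alert line:", int(series[series > 0.20].index.min()))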

What we just got. Before the regime break at month 60, the series fluctuates near \(0.05\)–\(0.15\) — quiet sampling noise. After the break, it leaps above \(0.30\) and stays elevated as the rolling window slowly absorbs the new regime. The first alert fires immediately at the break point and stays on until the trailing window has fully ingested the new regime. This is exactly the leading-indicator behavior we wanted: the alert fires the month the inputs change, not six months later when IC has bled down.

A subtle pitfall

KS monitoring is most reliable when applied to cross-sectionally standardized features — features that have been rank-transformed or \(z\)-scored within each month before the test. If you compare raw price levels or raw volumes, you will catch every secular trend in the market as “drift”, even when no economic relationship has changed. Standardize first; monitor the residual distribution.

Turnover Diagnostics

Where you’ll see this. Trading desks watch turnover the way an airline pilot watches engine RPMs — not because turnover itself is bad, but because abnormal turnover (too high or too low compared with its usual range) is one of the earliest signs that the model is misbehaving. Get used to reading the turnover series alongside IC; the two together catch problems that either one alone misses.

Intuition

Turnover — how often the portfolio’s holdings change. High turnover = more trading costs. Spike in turnover = the model is changing its mind a lot, often a sign of confusion. Two failure modes need separate names. A stuck model is one that keeps producing the same forecasts month after month — turnover collapses, the portfolio holds the same names forever, P&L looks flat. A whipsawing model is one that flips its forecasts wildly — turnover spikes, the portfolio rotates almost completely each month, trading costs eat everything. Both are bad. A healthy model lives in the middle band.

Why Turnover Belongs in the Monitoring Stack

Turnover is the fraction of the portfolio that is bought and sold each rebalance. For a top-\(K\) equal-weight long-only book (i.e., we buy the \(K\) stocks with the highest model score and weight them equally), the one-way turnover from month \(t-1\) to month \(t\) is

\[ \tau_t \;=\; \frac{1}{2} \sum_{i \in \mathcal{U}} \bigl| w_{i,t} - w_{i,t-1} \bigr|, \]

where the sum runs over the union \(\mathcal U = \mathcal H_{t-1} \cup \mathcal H_t\) of all stocks held in either month (\(\mathcal H_t\) is just “the set of stocks in the portfolio at month \(t\)”). The \(\tfrac{1}{2}\) converts a two-sided change to a one-way trade (every name you sell, you replace with a name you buy — counting both sides doubles the trade). A turnover of 0.40 means 40% of the book is rotated each month — equivalently, the average holding period is \(1 / 0.40 = 2.5\) months.

Turnover is a behavior of the model in deployment, and it carries diagnostic information that IC does not. Two scenarios where turnover is the leading indicator:

Stuck signal. The model’s predictions are barely changing month-to-month. Possible causes: the feature inputs have stopped varying (data quality problem in a feed), the model has saturated some constraint and is returning the same top-\(K\) stocks every period, or the underlying signal has decayed to noise and the model is just locking in its training-period rankings. In all three cases, turnover drops below its usual range, often beneath a 10% threshold.

Whipsaw signal. The model’s predictions are flipping wildly month-to-month, possibly because the feature inputs are noisier than usual, possibly because the model has been retrained on a too-narrow window and is overfitting to recent noise. Turnover spikes above its usual range — 70% or more, with implied holding periods under six weeks.

Both patterns produce IC deterioration eventually, but they show up in turnover first. A turnover-based alert layer catches model malfunction even when the signal-quality metrics still look acceptable.

Computing Portfolio Turnover

The cell below builds a 60-month series of top-20 portfolios where, every month, we randomly swap a handful of names; then it computes turnover, annualises it, and runs a transaction-cost drag calculation at four different cost assumptions. The output shows you what monthly turnover translates to in annual P&L drag — the single number that decides whether a strategy is implementable.
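
A stand-in for that cell, assuming a 500-name universe, a top-20 equal-weight book, and five names swapped each month (which produces the 25% one-way turnover discussed below):

import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
UNIVERSE, K, T, SWAPS = list(range(500)), 20, 60, 5    # 500-stock universe, top-20 book, 60 months

def one_way_turnover(prev, cur, k=K):
    """Equal-weight top-K book: tau_t = 0.5 * sum_i |w_{i,t} - w_{i,t-1}| with w = 1/K."""
    changed = len(set(prev) ^ set(cur))                # names held in only one of the two months
    return 0.5 * changed / k

holdings = [rng.choice(UNIVERSE, size=K, replace=False).tolist()]
for _ in range(1, T):
    prev = holdings[-1]
    keep = list(rng.choice(prev, size=K - SWAPS, replace=False))
    new = rng.choice([s for s in UNIVERSE if s not in prev], size=SWAPS, replace=False)
    holdings.append(keep + new.tolist())

tau = pd.Series([one_way_turnover(holdings[t - 1], holdings[t]) for t in range(1, T)])
annual_turnover = 12 * tau.mean()

print(f"mean monthly one-way turnover: {tau.mean():.2%}  ->  annual: {annual_turnover:.0%}")
for bps in (5, 10, 20, 30):                             # round-trip cost assumptions
    print(f"  cost {bps:>2d} bps  ->  annual drag {annual_turnover * bps / 1e4:.2%}")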

What we just got. A monthly turnover of 25% implies 300% annual turnover. At a realistic round-trip transaction cost of 10 basis points (a “bp” is one one-hundredth of a percent; 10 bps = 0.10% — typical for liquid US large-cap), the annual drag is \(3.0 \times 0.0010 = 0.30\%\) — manageable. At 30 bps (typical for small-cap or international), the drag is \(0.90\%\) — large enough to halve the Sharpe of many alpha models.

Turnover Thresholds for Health Monitoring

A reasonable health band for a cross-sectional long-only equity book is

\[ 0.10 \;\le\; \tau \;\le\; 0.60 \quad \text{(monthly, one-way)}. \]

Below 0.10 the model is stuck; above 0.60 it is whipsawing. The exact thresholds depend on signal half-life — a fast-decaying momentum signal will naturally run at higher turnover than a slow-decaying value signal — but the band gives a starting place.

A turnover alert is simple:

def rule_low_turnover(state):
    # Stuck model: monthly one-way turnover below the 0.10 floor of the health band.
    return state.get('turnover') is not None and state['turnover'] < 0.10

def rule_high_turnover(state):
    # Whipsawing model: monthly turnover above 0.70, just above the 0.60 ceiling of the health band.
    return state.get('turnover') is not None and state['turnover'] > 0.70

In real work, the low-turnover alert is the more interesting one. Whipsaw is loud and visible in P&L; stagnation is silent. The model that has stopped producing new information looks fine on a daily P&L report — flat, low volatility, no losses — until you compare it to its own historical distribution and discover that it has been producing essentially the same forecast for six consecutive months.

Pair turnover with IC

The most reliable composite alert is low turnover and falling IC fired together. Either alone could be sampling noise; together they almost always indicate a stuck or decayed model. Build the alert layer around composite predicates, not univariate thresholds.

Caption. Four stacked tiles each watch one model-health signal; green/amber/red status with a trigger note next to each — a serious monitoring layer fires only when several tiles cross threshold together.

Retraining Policies

Where you’ll see this. A model that worked for a year and then stopped is the most common scenario in quant finance. The recipes in this section are what separates teams that catch it from teams that lose six months of P&L first.

Intuition

Retraining — periodically refit the model on more recent data so it adapts to the changing world. Think of it like updating the map on your car’s GPS. You can: (1) never update — frozen — works fine in a static neighbourhood but disastrous after a new highway opens; (2) update every January, regardless — scheduled — simple, but the highway can open in June; (3) update only when you get lost — IC-triggered — responsive, but you might over-react to a single missed turn; (4) update every January and whenever you’ve been lost for three trips in a row — hybrid — best of both. A hybrid policy retrains on a fixed schedule (e.g., quarterly) AND when IC turns negative for \(k\) consecutive months.

The Four Standard Policies

Once you have a monitoring layer that tells you when the model is decaying, you need a rule for what to do about it. There are four standard retraining policies in real use, ordered by increasing sophistication.

Static (frozen). Train the model once and never retrain. Used in academic backtests where the question is whether a signal exists, not whether it persists. In real work, frozen models are almost always a mistake; even slow-moving signals like value and quality require periodic retraining as the cross-section composition evolves.

Scheduled. Retrain on a fixed calendar — typically annually, occasionally quarterly. The advantage is operational simplicity: known cadence, known retraining costs, no ambiguity about when to act. The disadvantage is that the model may decay between scheduled retrainings; if the half-life of decay is shorter than the retraining interval, the schedule cannot keep up.

IC-triggered. Retrain whenever a model-health metric breaches a threshold. The classic trigger is “retrain when the rolling 6-month IC has been negative for \(k\) consecutive months” with \(k = 3\) being a typical choice (the consecutive part matters — a single bad month is just noise). The advantage is responsiveness: bad months trigger an immediate refresh. The disadvantage is the noise-chasing problem — short-horizon IC is dominated by sampling variability, and a model retrained on the most recent six months of data has fit largely noise. Pure IC-triggered policies tend to retrain too often during volatile periods.

Hybrid. Combine scheduled retraining with IC-triggered overrides. Retrain at least every twelve months on a fixed schedule, and retrain mid-cycle if rolling IC has been negative for three consecutive months (and at least four months have passed since the last retraining, to avoid back-to-back retrainings). The hybrid policy gets the operational simplicity of scheduled while picking up the responsiveness of IC-triggered for genuine breakdowns.

Caption. Four horizontal timelines, one per policy, with vertical ticks marking retraining events: static never fires, scheduled fires like clockwork, IC-triggered fires only on decay, hybrid combines both — empirically the strongest in real walk-forward tests.

A Worked Comparison

The lab notebook that motivates this chapter runs all four policies on the same HGBR alpha model over a 19-year walk-forward backtest on the US equity cross-section. The walk-forward results, cached in retrain_results_scheduled.parquet and retrain_results_hybrid.parquet, show the following pattern (numbers from the real run):

  • Scheduled (annual): 19 retraining events; annual return 4.4%; annual volatility 15.0%; Sharpe 0.29; final wealth $1.89 per $1 invested.
  • Hybrid (annual + IC trigger): 23 retraining events; annual return 5.1%; annual volatility 14.1%; Sharpe 0.36; final wealth $2.32 per $1 invested.

The hybrid policy fired four mid-year retrainings beyond the calendar schedule — each triggered by a three-month negative-IC streak — and those four extra retrainings shifted Sharpe from 0.29 to 0.36 and lifted final wealth by 23%. The qualitative pattern is robust: in regimes with mid-cycle decay events, hybrid beats scheduled by a meaningful margin; in regimes with no mid-cycle decay, the two are essentially identical.

The code skeleton below shows the structure used to produce those numbers: a model factory, two policy functions, and a single walk-forward loop that calls whichever policy you pass in. Reading this once is worth ten paragraphs of description — the implementation is small.

import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor

# Skeleton only: assumes the lab's earlier cells define df (feature panel with a 'ym' month
# column), feats (feature names), target (forward-return column), K (book size), oos_months
# (sorted out-of-sample months) and rolling_ic_6m_lag (lagged rolling 6-month IC series).

# --- Production HGBR config ---
HGBR_PARAMS = dict(
    loss='absolute_error', learning_rate=0.05, max_depth=2,
    max_leaf_nodes=31, min_samples_leaf=80, l2_regularization=1.0,
    max_iter=300, early_stopping=True, validation_fraction=0.20,
    n_iter_no_change=30, random_state=42,
)

def train_hgbr(panel, feats, target):
    sub = panel[feats + [target]].dropna(subset=[target])
    return HistGradientBoostingRegressor(**HGBR_PARAMS).fit(
        sub[feats].fillna(0).values, sub[target].values)

# --- Two retraining policies ---
def policy_scheduled(month_idx, last_retrain_idx, _state):
    """Retrain every 12 months on a fixed calendar."""
    return (last_retrain_idx is None) or (month_idx - last_retrain_idx >= 12)

def policy_hybrid(month_idx, last_retrain_idx, ic_lag):
    """Annual schedule + IC-triggered override when rolling-6m IC is
    negative for 3 consecutive months AND at least 4 months since last retrain."""
    if (last_retrain_idx is None) or (month_idx - last_retrain_idx >= 12):
        return True
    cur_ym = oos_months[month_idx]
    last3 = ic_lag.loc[:cur_ym].tail(3)
    if len(last3) >= 3 and (last3 < 0).all() and month_idx - last_retrain_idx >= 4:
        return True
    return False

# --- Walk-forward loop ---
def run_policy(policy_fn, label):
    model = None; last_retrain_idx = None; retrains = []; monthly_ret = {}
    for i, cur_ym in enumerate(oos_months):
        if policy_fn(i, last_retrain_idx, rolling_ic_6m_lag):
            train_panel = df[df['ym'] < cur_ym]
            model = train_hgbr(train_panel, feats, target)
            last_retrain_idx = i; retrains.append(cur_ym)
        cur_panel = df[df['ym'] == cur_ym].copy()
        cur_panel['pred'] = model.predict(cur_panel[feats].fillna(0).values)
        top = cur_panel.nlargest(K, 'pred')
        bot = cur_panel.nsmallest(K, 'pred')
        monthly_ret[cur_ym] = top[target].mean() - bot[target].mean()
    return pd.Series(monthly_ret).sort_index(), retrains

The structure is general. To plug in a different model, replace train_hgbr with the corresponding factory. To plug in a different policy, replace the policy function. The walk-forward harness is policy-agnostic and model-agnostic; it is the orchestrator, not the strategy.

Choosing the Retrain Frequency

How often should you retrain even when the trigger does not fire? The answer comes back to the alpha decay curve. If the half-life of your signal is \(h\) months, retraining at intervals materially shorter than \(h\) buys little — the underlying signal has not yet decayed enough for new data to provide a different answer. Retraining at intervals longer than \(2h\) to \(3h\) risks letting the model run on a stale parameterization (i.e., with old coefficients) through a full half-life of decay.

A practical rule:

\[ \text{retrain interval} \;\approx\; \max\bigl(\,h, \; 6\;\text{months}\bigr). \]

The minimum of six months reflects the bare-bones operational cost of retraining (data refresh, hyperparameter validation, model promotion) and the desire to absorb at least two quarters of new data into each retraining window.

For a fast-decaying momentum signal (\(h \approx 3\) months), the rule says retrain every six months. For a slow-decaying value signal (\(h \approx 12\) months), retrain every twelve months. For a mixed signal, take the harmonic mean of the constituent half-lives. The lab’s HGBR signal has an empirical half-life around nine months on US equity data, which is why the annual schedule is roughly right.

The cell below puts the rule to a controlled test. We simulate a stylised market where the true predictive coefficient decays each month and resets to its starting value every 30 months (think of these resets as discrete regime changes). We then race three retraining cadences — frozen, annual, every-six-months (= one half-life) — to see which one tracks the resets and ends with the most cumulative return.
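
A stand-in for the race under stated assumptions: three features, a coefficient that halves every six months within a 30-month regime, and, as an assumption added here so that a stale model genuinely loses its edge, each regime reset also moves the signal onto a different feature. Each policy re-estimates the coefficients by pooled least squares on a trailing 24-month window at its own cadence.

import numpy as np

rng = np.random.default_rng(7)
T, N, K = 120, 300, 3               # months, stocks per month, features
HALF_LIFE, REGIME_LEN = 6, 30       # signal half-life and months between regime resets

def true_beta(t):
    """Coefficient decays geometrically within a regime and resets at t = 30, 60, 90;
    each reset also rotates the signal to a new feature (an assumption of this sketch)."""
    regime, age = divmod(t, REGIME_LEN)
    beta = np.zeros(K)
    beta[regime % K] = 0.10 * 0.5 ** (age / HALF_LIFE)
    return beta

X = rng.normal(size=(T, N, K))
R = np.array([X[t] @ true_beta(t) + 0.5 * rng.normal(size=N) for t in range(T)])

def run(retrain_every, window=24, burn_in=12):
    beta_hat, rets = None, []
    for t in range(burn_in, T):
        if beta_hat is None or (retrain_every and (t - burn_in) % retrain_every == 0):
            Xw = X[t - min(window, t):t].reshape(-1, K)          # pooled trailing window
            Rw = R[t - min(window, t):t].reshape(-1)
            beta_hat = np.linalg.lstsq(Xw, Rw, rcond=None)[0]
        pred = X[t] @ beta_hat
        top, bot = pred >= np.quantile(pred, 0.9), pred <= np.quantile(pred, 0.1)
        rets.append(R[t][top].mean() - R[t][bot].mean())          # top-minus-bottom decile spread
    return float(np.sum(rets))

for label, cadence in [("frozen", None), ("annual", 12), ("every 6 months", 6)]:
    print(f"{label:15s} cumulative decile-spread return: {run(cadence):6.2f}")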

What we just got. The frozen policy locks in the initial coefficient and watches its predictive power evaporate; the annual policy partly recovers but lags the regime resets at \(t = 30, 60, 90\); the every-six-months policy tracks the regime resets closely and ends with the highest cumulative return. The simulation is stylized but the qualitative ranking — match the retrain frequency to the half-life — generalizes.

Key takeaway

The retraining decision is structurally two-dimensional: a scheduled component matched to the signal half-life, and a triggered component matched to a small set of unambiguous decay alerts (multi-month negative IC, OOD breach, importance flip on critical features). The pure-scheduled policy is the safe baseline; the hybrid is the right default for real work.

Automated Alerts: A Walk-Forward Dashboard

Where you’ll see this. Every real trading team has a “morning monitor” page — the first thing the PM looks at after their coffee. Building that page is most of what an entry-level quant engineer does in the first six months on the job.

From Diagnostics to Alerts

The diagnostics from the preceding sections — rolling IC, KS drift fraction, turnover, feature-importance shifts, prediction-distribution moments — are useful one at a time when you are looking at them. Real systems do not have humans looking at them continuously. They have rules: small boolean predicates (i.e., yes/no tests) that fire when a diagnostic crosses a threshold, recorded in an alert log and routed to email, Slack, or a dashboard tile.

A useful alert rule has three properties: it is specific (a single, named condition that any team member can reproduce); it is calibrated (it fires rarely during stable periods, so when it fires it carries information); and it is actionable (the recipient knows what to do — pause the strategy, page the modeler, retrain the model — without needing a debate). The five rules in the table below are the production-grade defaults used in the motivating lab notebook.

  • IC decay (WARN): rolling 6-month IC \(< 0.02\)
  • IC collapse (CRITICAL): monthly IC \(< -0.03\)
  • Uniform signal (CRITICAL): all predictions have the same sign
  • OOD shift (WARN): \(> 20\%\) of features have KS \(D > 0.10\)
  • Low turnover (WARN): monthly turnover \(< 0.10\)

The severity tier matters: WARN alerts are logged and reviewed weekly; CRITICAL alerts are paged immediately and may auto-pause the strategy until a human acknowledges them. The two CRITICAL rules — IC collapse and uniform-signal — are the unambiguous “the model is broken” signals. The three WARN rules are the “something is changing” signals.

Implementing the Alert Layer

The alert layer is a tiny piece of code — five named rules, a state dict (a Python dictionary carrying the current diagnostic values), a loop. The complexity is not in the implementation; it is in the calibration (picking thresholds that fire often enough to be useful and rarely enough to be trustworthy) and in the response policy (what action follows each fire).

The cell below implements the five rules verbatim, then feeds them a synthetic 72-month stream of states where the first 48 months are a healthy regime and the last 24 a degraded regime. The output lists how often each rule fires and shows the first ten alerts in the degraded period — the clustering of alerts is the key behaviour to notice.
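
A sketch of what that cell might look like. The rule thresholds are the five from the table above; the synthetic state stream that feeds them (the distributions inside make_state) is an illustrative assumption, tuned only so that the healthy phase is mostly quiet and the degraded phase is not.

import numpy as np

rng = np.random.default_rng(8)

RULES = [
    ("IC decay",       "WARN",     lambda s: s["rolling_ic_6m"] < 0.02),
    ("IC collapse",    "CRITICAL", lambda s: s["monthly_ic"] < -0.03),
    ("Uniform signal", "CRITICAL", lambda s: s["pred_frac_positive"] in (0.0, 1.0)),
    ("OOD shift",      "WARN",     lambda s: s["ks_frac_flagged"] > 0.20),
    ("Low turnover",   "WARN",     lambda s: s["turnover"] < 0.10),
]

def make_state(month, healthy):
    """Synthetic diagnostic values for one month; the distributions are illustrative assumptions."""
    if healthy:
        return {"monthly_ic": rng.normal(0.03, 0.035), "rolling_ic_6m": rng.normal(0.035, 0.012),
                "pred_frac_positive": rng.uniform(0.35, 0.65),
                "ks_frac_flagged": rng.uniform(0.02, 0.22), "turnover": rng.uniform(0.15, 0.45)}
    return {"monthly_ic": rng.normal(-0.02, 0.03), "rolling_ic_6m": rng.normal(-0.01, 0.01),
            "pred_frac_positive": rng.uniform(0.35, 0.65),
            "ks_frac_flagged": rng.uniform(0.15, 0.40), "turnover": rng.uniform(0.03, 0.15)}

alerts = []
for m in range(72):
    state = make_state(m, healthy=m < 48)
    for name, severity, fires in RULES:
        if fires(state):
            alerts.append({"month": m, "rule": name, "severity": severity})

counts = {}
for a in alerts:
    counts[a["rule"]] = counts.get(a["rule"], 0) + 1
print("fires per rule over 72 months:", counts)
print("first 10 alerts in the degraded phase:",
      [(a["month"], a["rule"]) for a in alerts if a["month"] >= 48][:10])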

What we just got. Before month 48 the system is mostly silent — the occasional WARN from sampling noise but no clustering. After month 48, the rules fire continuously: IC decay, IC collapse, OOD shift, and low turnover all flag the degradation episode. A response policy reading this stream would page the modeling team on the first CRITICAL fire and auto-reduce exposure within a fixed number of consecutive WARN fires.

Silent Months: The Healthy Baseline

A useful complement to “how many alerts fired” is “how many months were completely silent — no rule fired at all”. In a well-calibrated system, the silent-month fraction should sit around 60–80% during normal operation. If it drops below 30%, the rules are too sensitive and team alert fatigue will follow (people start ignoring alerts because they fire too often — a real failure mode). If it rises above 95%, the rules are too lax and real decay will go undetected.

The two-line cell below counts the silent months in our simulation and reports the fraction. Notice how the answer comes out near the calibration band — that is by design.
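
A sketch of that count, slightly expanded to report the healthy and degraded phases separately; it assumes the alerts list built in the previous sketch is still in scope.

# fraction of months with no alert at all, split by regime (healthy: months 0-47)
alert_months = {a["month"] for a in alerts}
for label, months in [("healthy (0-47)", range(48)), ("degraded (48-71)", range(48, 72))]:
    silent = sum(m not in alert_months for m in months) / len(months)
    print(f"{label:17s} silent-month fraction: {silent:.0%}")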

In the simulated stream above, the silent fraction during the healthy phase is around 80% and during the degraded phase drops below 20%. That contrast — silent in calm regimes, vocal in stressed ones — is the operational signature of a well-calibrated alert layer.

A Worked Mini-Dashboard

We close the chapter with a single live cell that wires together the four headline diagnostics — rolling IC, KS drift fraction, turnover, and the composite alert state — into a one-shot dashboard. The cell takes a panel of predictions and realizations, simulates a controlled distribution break in the middle of the sample, and produces the four-panel view a model-monitoring team would inspect each morning. Read the output the way you would read a hospital monitor: each panel is one vital sign, and the alert track at the bottom is the buzzer.
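
The live cell itself is not reproduced here; the sketch below fakes the four headline series directly (the levels and break sizes are assumptions) and wires them into the four-panel layout the text describes.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(9)
T, BREAK = 120, 60                     # months; controlled break at month 60 (illustrative)

months = np.arange(T)
healthy = months < BREAK
# Synthetic headline diagnostics: healthy values before the break, degraded values after.
rolling_ic = np.where(healthy, rng.normal(0.04, 0.012, T), rng.normal(0.00, 0.012, T))
ks_frac    = np.where(healthy, rng.uniform(0.03, 0.15, T), rng.uniform(0.20, 0.40, T))
turnover   = np.where(healthy, rng.uniform(0.20, 0.40, T), rng.uniform(0.04, 0.12, T))

# Composite alert: any of the three WARN-style thresholds from the chapter breached.
alert = (rolling_ic < 0.02) | (ks_frac > 0.20) | (turnover < 0.10)

fig, axes = plt.subplots(4, 1, figsize=(9, 9), sharex=True)
axes[0].plot(months, rolling_ic); axes[0].axhline(0.02, ls="--"); axes[0].set_ylabel("rolling 6m IC")
axes[1].plot(months, ks_frac);    axes[1].axhline(0.20, ls="--"); axes[1].set_ylabel("KS frac > 0.10")
axes[2].plot(months, turnover);   axes[2].axhline(0.10, ls="--"); axes[2].set_ylabel("turnover")
axes[3].bar(months, alert.astype(int), width=1.0);                axes[3].set_ylabel("alert")
axes[3].set_xlabel("month")
fig.suptitle("Model-health mini-dashboard (synthetic, break at month 60)")
fig.tight_layout()
plt.show()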

What we just got. Before month 60, the dashboard is mostly green: rolling IC sits in the healthy range, KS drift fraction fluctuates around 10%, turnover lives in the 20–40% band, and the alert track at the bottom is quiet. At month 60 every panel shifts visibly within one or two periods: rolling IC drops through the alert threshold, KS drift fraction crosses 20%, turnover falls toward the low-turnover band, and the alert bar fires continuously. This is what a successful monitoring stack should look like. The dashboard does not predict the break — no monitor can — but it diagnoses the break immediately, in multiple independent panels, and that simultaneity is what separates a real regime event from sampling noise.

In real work

The dashboard runs on a schedule (cron, Airflow, Prefect — pick a workflow tool) once per trading day after the close. The output is a static PNG dropped into a shared folder, plus a row of alert flags appended to a Postgres or Parquet log. Slack notifications fire on transitions (no-alert \(\to\) alert), not on standing states; this minimizes noise without losing information. Once a week, the team reviews the log and decides whether any standing alerts warrant intervention.

Putting It Together

A complete model-monitoring and retraining system has six moving parts:

  1. A signal-quality diagnostic — rolling IC, computed daily or monthly, displayed as both a level series and a cumulative-IC equity curve.
  2. A drift diagnostic — per-feature KS against a rolling training window, summarized as the fraction of features with \(D > 0.10\).
  3. A behavior diagnostic — portfolio turnover, bounded by an upper and lower threshold.
  4. A feature-importance diagnostic — periodic recomputation of permutation importance and univariate IC, with a drift rule on the rank vector.
  5. An alert layer — five or six named rules of varying severity, logged and routed to a human channel.
  6. A retraining policy — typically hybrid (scheduled \(+\) IC-triggered), with a cadence matched to the empirical signal half-life.

None of the six components is mathematically deep. The KS test was published in 1933; permutation importance has been a standard idea since Breiman (2001); the alpha decay curve was understood at every fundamental quant fund in the 1990s. The work is in the plumbing: writing the diagnostics into a workflow that runs every day, the alert rules into a layer that fires reliably, the retraining policy into a deployable artifact that the trading desk trusts. Done well, it is invisible — the dashboard is green, the alerts are quiet, the strategy delivers its expected Sharpe. Done poorly, it is the difference between a strategy that compounds quietly for a decade and a strategy that runs into a regime break unprepared and is shut down by the risk committee in a single quarter.

Chapter summary
  • Models decay through three distinct mechanisms — regime change (\(p(Y\mid X)\) shifts), feature drift (\(p(X)\) shifts), and crowding (the predictability of \(Y\) given \(X\) degrades because everyone is trading on the same forecast). The chapter’s diagnostics target the three separately.
  • Feature importance can be measured three ways — tree-native gain, permutation, univariate IC. Compute at least two; trust the signal that appears in both.
  • The alpha decay curve \(h \mapsto \text{IC}(h)\) tells you the signal half-life. Match rebalancing frequency to the half-life, not to convention.
  • KS drift monitoring with the “\(> 20\%\) of features with \(D > 0.10\)” rule is a leading indicator of regime change; it fires before P&L deteriorates.
  • Turnover is a behavior diagnostic: too low signals a stuck model, too high signals a whipsawing model; both predict eventual IC deterioration.
  • Retraining policies range from frozen to hybrid; the hybrid policy (annual + IC-triggered overrides) is the real-work default and beat scheduled-only by 0.07 in Sharpe (0.29 versus 0.36) and by 23% in final wealth on the lab’s 19-year US equity walk-forward.
  • The automated alert layer wraps the diagnostics in a small set of named rules and a logging mechanism. Calibrate it so that 60–80% of months are silent during stable regimes and the alerts cluster tightly around real decay events.

Exercises

Exercise 2.1 — Alpha decay and rebalancing frequency

Suppose you have an alpha signal with the following empirical decay curve, measured on five years of out-of-sample data:

  • \(h = 1\) month: \(\text{IC} = 0.060\)
  • \(h = 2\) months: \(\text{IC} = 0.052\)
  • \(h = 3\) months: \(\text{IC} = 0.044\)
  • \(h = 6\) months: \(\text{IC} = 0.030\)
  • \(h = 9\) months: \(\text{IC} = 0.022\)
  • \(h = 12\) months: \(\text{IC} = 0.016\)
  • \(h = 18\) months: \(\text{IC} = 0.010\)
  1. Estimate the signal half-life by linear interpolation.
  2. Using the rule \(\text{retrain interval} \approx \max(h, 6)\), what retraining cadence does the half-life recommend?
  3. Suppose round-trip transaction costs are 15 bps and the model runs at 40% monthly turnover. Compute the annual transaction-cost drag. If the same signal were rebalanced quarterly with 60% per-quarter turnover, what would the new annual drag be? Comment on the trade-off.

Exercise 2.2 — KS test calibration

Generate two samples of size \(n = 500\) each, both drawn i.i.d. from \(\mathcal N(0, 1)\). Compute their KS statistic. Repeat 1,000 times and form the empirical sampling distribution of \(D\) under the null.

  1. Report the 5%, 50%, 95%, and 99% quantiles of the null distribution.
  2. Compare to the working threshold \(D > 0.10\) (the one you’ll see used in real monitoring dashboards). What is the empirical false-positive rate at \(n = 500\)?
  3. Repeat the experiment with \(n = 5{,}000\). Does the threshold \(0.10\) become more conservative or more aggressive as sample size grows? What does this imply about choosing thresholds when window sizes vary across features?

Exercise 2.3 — Composite alert design

Design a composite alert rule that fires only when two or more of the following sub-rules fire in the same month:

  • Rolling 6-month IC below 0.02
  • Fraction of features with \(D > 0.10\) above 20%
  • Monthly turnover below 10%
  1. Implement the composite rule in Python as a function of a state dict.
  2. Apply it to the synthetic state stream in the chapter’s automated-alerts cell. Report the number of times the composite rule fires before and after the simulated degradation episode.
  3. Compare the composite rule’s silent-month fraction to that of any single sub-rule. Which is better calibrated in the sense of firing rarely in calm regimes and reliably during stressed ones?

Exercise 2.4 — Retraining policy on a controlled break

Simulate a 120-month walk-forward backtest of a one-feature linear alpha model where the true coefficient is \(\beta_t = 0.10\) for \(t < 60\) and \(\beta_t = -0.05\) for \(t \ge 60\) (a sign-flip regime break). Each month, observe \(N = 200\) cross-sectional observations \(X_{i,t} \sim \mathcal N(0, 1)\) and \(R_{i,t+1} = \beta_t X_{i,t} + \varepsilon_{i,t+1}\) with \(\varepsilon \sim \mathcal N(0, 1)\). Build a top-decile long-short portfolio each month.

  1. Compare three retraining policies: frozen (estimate \(\hat\beta\) once at \(t = 12\) and never update), scheduled (re-estimate every 12 months on the trailing 24 months), and IC-triggered (re-estimate whenever the trailing 6-month IC falls below zero, with at least 4 months between successive retrainings).
  2. Report the cumulative wealth for each policy at \(t = 120\).
  3. Plot the three wealth paths and the realized \(\hat\beta\) over time. Which policy detects the break first? Which loses the most before the detection?

Exercise 2.5 — Importance-disagreement diagnostic

Build a synthetic dataset with \(N = 5{,}000\) rows and \(K = 6\) features such that:

  • \(X_0\) has a strong univariate effect on \(Y\).
  • \(X_1 \cdot X_2\) has a pure interaction effect on \(Y\) but neither feature has a marginal effect.
  • \(X_3\) and \(X_4\) are 95% correlated and both carry the same weak marginal signal.
  • \(X_5\) is independent noise.

Fit a HistGradientBoostingRegressor with the production defaults from this chapter. Compute (i) gain importance from a companion RandomForestRegressor fitted to the same data (HistGradientBoostingRegressor does not expose feature_importances_), (ii) permutation importance on a held-out set, and (iii) univariate IC.

  1. Which features have high importance under all three measures? Which features have high importance under only one?
  2. For the redundant pair \((X_3, X_4)\), what does permutation importance assign to each individually? What if you permute both columns together (joint permutation importance)? Why is the joint number larger than either marginal?
  3. Describe a real-world feature engineering decision you would make based on this diagnostic.

Exercise 2.6 — End-to-end mini-monitor

Take a publicly available daily return series (e.g., SPY, or any cross-sectional dataset you have access to). Implement, in a single notebook:

  1. A simple alpha model (any model from Chapter 1 will do — even a 1-month momentum signal).
  2. A walk-forward backtest with annual retraining.
  3. A rolling IC series, a KS drift series against the trailing 36-month window, and a portfolio turnover series.
  4. An alert layer with at least three rules.
  5. A four-panel dashboard mirroring the worked dashboard in this chapter.
Then answer:

  1. Identify the three months in your sample with the highest alert density. What macro or market event is occurring in each?
  2. For one of those months, drill into the per-feature KS table and report which three features had the largest drift. Are they economically related (e.g., all volatility-sensitive, all liquidity-sensitive)?
  3. Propose a retraining policy modification (e.g., a new trigger rule) that would have caught the worst alert month earlier. Justify the modification with the diagnostic evidence.
 

Prof. Xuhu Wan · HKUST ISOM · Model Risk in Quantitative Finance