Comparison and Limitations
The finest complication is worthless if the watchmaker cannot tell you, honestly, when it keeps perfect time and when it does not.
This chapter puts Mainspring in context. We compare it systematically against every Timepiece, enumerate its fundamental limitations without euphemism, sketch possible extensions, and describe the hybrid pipeline where Mainspring’s speed meets Escapement’s statistical rigor.
Detailed Comparison
The following table compares Mainspring against each Timepiece along four axes: what the classical method trades away, how Mainspring addresses that limitation, what Mainspring sacrifices in return, and the practical consequence for a user choosing between them.
| Timepiece | Classical limitation | How Mainspring addresses it | What Mainspring sacrifices | When to prefer the Timepiece |
|---|---|---|---|---|
| PSMC | Two haplotypes only; piecewise-constant \(N_e(t)\); slow EM convergence | Processes \(n\) samples jointly; continuous \(N_e(t)\) via normalizing flow; single forward pass | No convergence guarantee; opaque learned representation | When you have a single diploid genome and need interpretable, reproducible inference with well-understood error properties |
| SMC++ | Distinguished lineage assumption; ODE discretization artifacts; limited to ~200 undistinguished samples | Permutation-equivariant encoder; no distinguished lineage; arbitrary \(n\) | Trained on coalescent simulations; may not generalize to non-coalescent data | When you need multi-population split-time estimation (SMC++’s core strength) |
| ARGweaver | \(O(S^2 K)\) per site; hours of MCMC for kilobase regions; limited to ~10 samples | Single forward pass; linear-time sliding-window attention; handles 50–100 samples | No asymptotic exactness; posterior may be miscalibrated | When you need provably correct posterior samples and can afford the compute |
| SINGER | Gibbs sampling is slow; sequential processing; limited scalability | Parallel across genomic windows; batched GPU inference | No GP prior on branch lengths; less smooth posteriors | When the GP prior on branch lengths is scientifically important (e.g., detecting rate variation) |
| tsinfer | No posterior uncertainty; no node times; no demographic inference | Full posterior via gamma output heads; joint topology and dating; demographic decoder | Cannot scale to millions of samples (scaling is tsinfer’s key advantage) | When you have biobank-scale data (>10,000 samples) and need only topology |
| tsdate | Requires fixed topology; factored posterior ignores cross-tree correlations | Joint topology and dating; GNN captures cross-tree consistency | tsdate’s variational gamma is mathematically principled; Mainspring’s GNN is a black box | When you have a high-quality tree sequence from tsinfer and need calibrated date posteriors |
| dadi | Discards linkage; limited to ~3 populations; diffusion PDE is slow for high-dimensional SFS | Full sequence input retains linkage; SFS used as auxiliary loss | Cannot model arbitrary numbers of populations (dadi is more flexible for multi-population models) | When you need multi-population demographic inference with >3 populations |
| moments | Same as dadi (ODE rather than PDE); moment closure approximation | Same as for dadi; SFS loss provides physics-informed regularization | Same as for dadi | When you need fast SFS-based inference with well-characterized approximation error |
| Gamma-SMC | Two haplotypes only; no ARG topology; forward-only (no smoothing across the full chromosome) | Multi-sample; full ARG topology; bidirectional attention | Gamma-SMC’s posterior is analytically motivated; Mainspring’s gamma heads are learned | When you need analytically grounded pairwise coalescence-time posteriors |
| phlash | Composite likelihood from pre-computed pairs; SVGD is expensive | End-to-end training; normalizing flow is fast at inference time | phlash has stronger theoretical grounding (score function estimator, SVGD convergence) | When you need Bayesian demographic inference with well-characterized convergence properties |
No free lunch
Mainspring does not dominate any Timepiece on every axis. The pattern is consistent: Mainspring trades statistical guarantees and interpretability for speed and multi-output capability. This is the fundamental trade-off of amortized inference, and no architectural innovation can fully eliminate it.
What Mainspring Cannot Do
We identified six honest limitations in the overview. Here we expand on each with concrete failure modes.
1. The Simulation Fidelity Gap
Mainspring’s posterior is conditioned on the coalescent model implemented in msprime being correct. The real data-generating process may include:
Gene conversion (short-tract non-reciprocal recombination), which creates patterns that look like closely spaced double crossovers. msprime can simulate gene conversion, but it is not included in the default training prior.
Structural variants (inversions, duplications, translocations), which violate the sequential Markov property by creating non-local genealogical correlations.
Sequencing and phasing errors, which corrupt the genotype matrix in ways unrelated to the evolutionary process.
Background selection and selective sweeps, which distort the genealogy in ways not captured by the neutral coalescent.
When the true generative process lies outside the model family, the network’s posterior \(q(N_e, \mathcal{A} \mid \mathbf{D})\) may be arbitrarily wrong – and, worse, it may be confidently wrong, because the training data never included examples where the posterior should be diffuse.
Mitigation. Phase 4 of curriculum training (SLiM simulations) partially addresses this, but cannot cover all possible model violations. Users should always validate Mainspring’s output against at least one classical method on a subset of their data.
2. No Statistical Guarantees
MCMC methods (ARGweaver, SINGER) produce samples from the true posterior given enough iterations. Variational methods (tsdate’s variational gamma) provide a lower bound on the model evidence. Mainspring provides neither.
The network may produce:
Over-confident posteriors: gamma distributions that are too narrow, covering the true time less often than the nominal credible level.
Biased point estimates: systematically too young or too old node times in certain parts of the tree.
Poorly calibrated demographic posteriors: normalizing flow samples that do not represent the true posterior density.
Mitigation. Monitor calibration on simulated validation data. If the 90% credible interval covers the true value 85% of the time, the posteriors are under-dispersed and should be inflated by a calibration factor.
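This check is straightforward to automate. A minimal sketch, assuming node-time posteriors are available as arrays of gamma shape and rate parameters alongside the true simulated times (the function name and array layout are illustrative):

```python
import numpy as np
from scipy.stats import gamma

def credible_interval_coverage(shapes, rates, true_times, level=0.90):
    """Empirical coverage: fraction of true values falling inside each
    predicted gamma posterior's central credible interval
    (shape/rate parameterization, so scale = 1/rate)."""
    tail = (1.0 - level) / 2.0
    lower = gamma.ppf(tail, a=shapes, scale=1.0 / rates)
    upper = gamma.ppf(1.0 - tail, a=shapes, scale=1.0 / rates)
    return np.mean((true_times >= lower) & (true_times <= upper))
```

If the returned coverage is well below `level`, the predicted intervals are too narrow and should be widened (for example, by scaling the posterior variance) before downstream use.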
3. Extrapolation Failure
Neural networks interpolate well and extrapolate poorly. If the training prior covers \(N_e \in [100, 100{,}000]\) and the true population experienced a bottleneck of \(N_e = 10\), the network has never seen this regime and may produce nonsensical output.
Mitigation. Use a training prior that is deliberately wider than the expected range of real parameters. Validate on held-out simulations at the edges of the prior. Flag predictions that fall near the boundary of the training distribution.
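Boundary flagging can be as simple as comparing inferred values against the training prior in log space. A sketch, using the illustrative \(N_e \in [100, 100{,}000]\) range from above and a heuristic 5% margin (both are assumptions, not Mainspring defaults):

```python
import numpy as np

def near_prior_boundary(estimates, prior_low=1e2, prior_high=1e5, margin=0.05):
    """Flag estimates within `margin` (as a fraction of the log-prior width)
    of either boundary, where the network is extrapolating rather than
    interpolating and its output should not be trusted."""
    log_low, log_high = np.log(prior_low), np.log(prior_high)
    width = log_high - log_low
    x = np.log(np.asarray(estimates, dtype=float))
    return (x < log_low + margin * width) | (x > log_high - margin * width)
```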
4. Interpretability
A PSMC transition matrix can be inspected element by element: each entry has a clear physical meaning (probability of coalescence time changing from interval \(k\) to interval \(l\) between adjacent bins). Mainspring’s attention weights and GNN messages have no such direct interpretation.
We can probe the network with:
Attention maps: which positions attend to which? Do breakpoints in attention correspond to true recombination breakpoints?
Ablation studies: how does performance degrade when each design principle is removed?
Gradient-based attribution: which input sites contribute most to the predicted time of a specific node?
But these are post-hoc analyses, not built-in interpretability.
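As an example of the third probe, attribution for a single node time can be computed with autograd: take the gradient of that one output with respect to the genotype matrix. The `predict_node_times` method below is a hypothetical interface standing in for Mainspring's dating head:

```python
import torch

def site_attribution(model, genotypes, node_index):
    """Saliency of each input site for one predicted node time: the gradient
    of that time w.r.t. the genotype matrix, summed in absolute value over
    haplotypes. `model.predict_node_times` is a hypothetical method."""
    x = genotypes.detach().clone().float().requires_grad_(True)
    node_times = model.predict_node_times(x.unsqueeze(0)).squeeze(0)
    node_times[node_index].backward()
    return x.grad.abs().sum(dim=0)  # one attribution score per site
```

Sites with large scores are the ones the network is actually using to date that node; if they cluster far from the node's genomic window, something is wrong.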
5. Training Cost
A representative training run:
| Resource | Requirement |
|---|---|
| GPU | 4 × A100 (80 GB) for 3 days |
| CPU (simulation) | 64 cores for on-the-fly msprime |
| Storage | ~500 GB for checkpoints and logs |
| Total GPU-hours | ~300 |
| Estimated cloud cost | ~$600–1,200 (depending on provider) |
This is a one-time cost, amortized across all future inference. But it places Mainspring out of reach for labs without GPU access. By contrast, PSMC runs on a laptop.
6. Recombination Map Dependency
Mainspring requires a recombination map as input (or assumes a uniform rate). Errors in the recombination map propagate into errors in the predicted ARG:
Under-estimated recombination rate → too few predicted breakpoints → trees that are too wide, with averaged-out coalescence times.
Over-estimated recombination rate → too many predicted breakpoints → fragmented trees with noisy time estimates.
Methods that operate on summary statistics (dadi, moments) are immune to this because the SFS does not depend on the recombination map (only on the total branch length distribution).
Possible Extensions
Mainspring as described handles a single panmictic population under neutrality. Several extensions are natural:
Population structure. Replace the single-population demographic decoder with a multi-population version that infers migration rates and divergence times. The encoder and topology decoder are already population-agnostic (they process haplotypes without population labels). The demographic decoder would condition on population assignments (known or inferred) and output a structured demographic model.
Natural selection. Selection distorts the genealogy in characteristic ways: selective sweeps produce star-like trees, background selection reduces effective population size in low-recombination regions. A selection-aware Mainspring would add a selection decoder that predicts a selection coefficient \(s\) and a beneficial allele frequency trajectory from the local tree shape.
Ancient DNA. Ancient samples are leaves at non-zero time in the tree. The encoder can accommodate this by adding a “sampling time” feature to each leaf embedding. The training simulations would include samples drawn from different time points.
Iterative refinement with MCMC. Mainspring’s output can serve as the initial state for a classical MCMC sampler (ARGweaver or SINGER). Instead of starting MCMC from a random ARG, start from Mainspring’s prediction. This can reduce burn-in time from hours to minutes.
Self-supervised pre-training. Before training on labeled simulations, pre-train the encoder on unlabeled genotype matrices using a masked-site prediction objective (analogous to BERT’s masked language modeling). This teaches the encoder useful representations of haplotype structure without requiring expensive simulations.
```python
import torch
import torch.nn.functional as F

def masked_site_pretraining_step(model, genotypes, mask_rate=0.15):
    """Self-supervised pre-training: predict masked sites."""
    # Sample the mask in float space; full_like on an integer genotype
    # matrix would truncate mask_rate to zero.
    mask = torch.rand(genotypes.shape, device=genotypes.device) < mask_rate
    masked_genotypes = genotypes.clone()
    masked_genotypes[mask] = 0  # mask token
    Z = model.encoder(masked_genotypes.unsqueeze(0))  # add batch dimension
    predictions = model.site_predictor(Z).squeeze(0)
    loss = F.binary_cross_entropy_with_logits(
        predictions[mask], genotypes[mask].float()
    )
    return loss
```
The Hybrid Pipeline: Mainspring + Escapement
The most powerful use of Mainspring is not as a standalone method but as the first stage of a two-stage pipeline with Escapement.
Mainspring provides a fast, approximate posterior over ARGs and demography. Escapement provides a principled variational inference engine that refines any initial estimate against the coalescent likelihood itself, with no simulations involved. Together:
```
Genotype matrix D
        |
        v
┌─────────────────────┐
│     MAINSPRING      │   ~1 second
│  (amortized, fast)  │
└─────────────────────┘
        |
        v
Initial ARG + N_e(t)
        |
        v
┌─────────────────────┐
│     ESCAPEMENT      │   ~10 minutes
│    (variational,    │
│  likelihood-based)  │
└─────────────────────┘
        |
        v
Refined ARG + N_e(t)
with calibrated posteriors
```
The hybrid pipeline combines the strengths of both:
| Property | Mainspring alone | Escapement alone | Hybrid |
|---|---|---|---|
| Speed | Seconds | Hours (from random init) | Minutes (warm-started) |
| Statistical guarantees | None | ELBO bound | ELBO bound |
| Posterior calibration | Approximate | Principled | Principled |
| Simulation dependency | Yes (training) | No | Amortized (training only) |
| Scalability | 50–100 samples | 20–50 samples | 50–100 samples |
| Output | Full ARG + \(N_e(t)\) | Coalescent times + \(N_e(t)\) | Full ARG + \(N_e(t)\) (refined) |
Why warm-starting matters
Escapement’s variational inference must optimize a complex, multi-modal objective (the coalescent likelihood as a function of the genealogy). From a random initialization, this can take thousands of gradient steps to converge, and may settle in a local optimum far from the truth.
Mainspring’s output provides an initialization that is already close to the global optimum. Escapement then needs only a few hundred gradient steps to refine the estimate and calibrate the posterior. The wall-clock time drops from hours to minutes, and the risk of poor local optima is greatly reduced.
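The principle can be demonstrated on a toy objective. The function below is an arbitrary one-dimensional stand-in for a multi-modal likelihood surface, not Escapement's actual objective; the point is only that plain gradient descent reaches the global optimum from a warm start and gets trapped in a local basin from a cold one:

```python
import torch

def toy_objective(x):
    # Two basins: a global optimum at x = 3 and a local optimum near x = -1.8
    return (x - 3) ** 2 * (x + 2) ** 2 / 10 + 0.1 * (x - 3) ** 2

def refine(x0, steps=200, lr=0.05):
    """Plain gradient descent from initialization x0."""
    x = torch.tensor(float(x0), requires_grad=True)
    for _ in range(steps):
        loss = toy_objective(x)
        loss.backward()
        with torch.no_grad():
            x -= lr * x.grad
        x.grad.zero_()
    return x.item()

cold = refine(-3.0)  # cold start: trapped in the local basin near x = -1.8
warm = refine(2.5)   # warm start near the truth: converges to x = 3
```

The same steps and learning rate yield a qualitatively different answer depending only on the initialization, which is exactly the gap Mainspring's prediction closes for Escapement.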
This is analogous to the role of a mainspring in a mechanical watch: the mainspring provides the initial burst of energy that sets the mechanism in motion. The escapement then regulates that energy into precise, calibrated motion. Neither is sufficient alone – but together they keep perfect time.
When to Use What
A practical decision guide:
| Scenario | Recommended approach |
|---|---|
| Screening 1,000 genomes for demographic events | Mainspring alone (speed is paramount) |
| Careful demographic inference from 50 samples | Hybrid: Mainspring → Escapement |
| Single diploid genome, well-characterized species | PSMC (interpretable, proven, fast enough) |
| Multi-population divergence times | SMC++ (split-time estimation is its core strength) |
| Posterior samples from the full ARG, provably correct | ARGweaver (no shortcut to exactness) |
| Biobank-scale tree sequence (>10,000 samples) | tsinfer (the only method that scales to millions of samples) |
| Teaching and understanding | The Timepieces, always (the whole point of this book) |
The watchmaker’s perspective
A grande complication is impressive, but the master watchmaker still keeps simple tools on the bench. The complication exists because the simpler mechanisms have been mastered first. If you have read this far, you have built every Timepiece by hand. You understand every gear. You can diagnose every failure mode.
That understanding is what makes Mainspring useful rather than dangerous. Without it, the neural network is a black box that occasionally tells the wrong time. With it, the neural network is a powerful tool whose outputs you can check, calibrate, and trust – because you know what the correct answer should look like.