Overview of Escapement
The escapement does not rely on an external standard. It generates its own rhythm from the geometry of the escape wheel and the physics of the balance spring. No reference clock. No calibration signal. Just first principles.
Every neural approach to population genetics faces a fundamental question: where does the training signal come from? Mainspring answers with simulations – millions of msprime runs, each producing a (genotype matrix, ARG, demography) triple. The network learns to invert the simulator. This is amortized inference: pay the computational cost once during training, then enjoy fast inference on new datasets.
But amortized inference carries a hidden debt: the simulation fidelity gap.
The Problem with Simulation-Based Inference
Mainspring’s posterior is conditioned on the simulator being correct:

\[
P_{\text{Mainspring}}(\tau, N_e \mid \mathbf{D}) = P(\tau, N_e \mid \mathbf{D}, \mathcal{M}_{\text{sim}})
\]
where \(\mathcal{M}_{\text{sim}}\) is the simulation model (neutral coalescent, no gene conversion, no structural variants, no sequencing error, no background selection). When the real data-generating process departs from \(\mathcal{M}_{\text{sim}}\) – and it always does – the network’s posterior may be arbitrarily wrong. Worse, it may be confidently wrong, because the training data never included examples where the posterior should be diffuse.
This is not a criticism specific to Mainspring. It is the fundamental limitation of all simulation-based inference, from ABC to neural posterior estimation. The training distribution defines the support of the learned posterior. If reality lies outside that support, the method fails silently.
The simulation fidelity gap
The simulation fidelity gap is the distance between the true data-generating process and the simulator used for training. Every simulation-based method has this gap. The question is not whether the gap exists, but whether you can detect it and what you can do about it.
Classical Timepieces handle this differently: they specify a model and compute the likelihood under that model. If the model is wrong, the likelihood is wrong – but you can at least evaluate the likelihood on real data and check whether the fit is reasonable. Simulation-based methods cannot do this: they never compute the likelihood directly.
Escapement’s Philosophy: The Coalescent IS the Loss Function
Escapement takes a fundamentally different approach. Instead of learning a simulator-to-inference mapping, it uses the coalescent likelihood itself as the training objective. The key observation is simple:
Every Timepiece in this book derives two analytical quantities:
A prior over genealogies: \(P(\tau \mid N_e, \rho)\) from coalescent theory (derived in msprime, PSMC, ARGweaver)
A likelihood of data given a genealogy: \(P(\mathbf{D} \mid \tau, \mu)\) from the mutation model (derived in tsdate, ARGweaver)
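For concreteness, both quantities can be evaluated in a few lines for a single marginal tree. This is a minimal sketch assuming the standard neutral coalescent (time in generations) and an infinite-sites Poisson mutation model; the function names are illustrative, not part of Escapement's API:

```python
import torch

def coalescent_log_prior(waits, Ne):
    """log P(tau | Ne) for one tree: waits[i] is the waiting time while
    n, n-1, ..., 2 lineages remain; each level coalesces at rate C(k,2)/(2 Ne)."""
    n = waits.shape[0] + 1
    k = torch.arange(n, 1, -1, dtype=waits.dtype)   # lineage counts n, ..., 2
    rate = k * (k - 1) / (2.0 * 2.0 * Ne)           # C(k,2) / (2 Ne)
    return (torch.log(rate) - rate * waits).sum()   # sum of exponential log-densities

def mutation_log_lik(S, branch_len, mu, L):
    """log P(D | tau, mu) under infinite sites: S mutations fall on the tree
    as a Poisson draw with mean (total branch length) * mu * L."""
    lam = branch_len * mu * L
    return S * torch.log(lam) - lam - torch.lgamma(S + 1.0)

# Example: n = 4 samples; waiting times while 4, 3, 2 lineages remain.
waits = torch.tensor([500.0, 1_500.0, 9_000.0])
k = torch.tensor([4.0, 3.0, 2.0])
branch_len = (k * waits).sum()                      # total branch length: 24,500
print(coalescent_log_prior(waits, Ne=10_000.0))     # closed form, no simulation
print(mutation_log_lik(torch.tensor(300.0), branch_len, mu=1.25e-8, L=1e6))
```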
Both are analytical. You can evaluate them in closed form for any proposed genealogy \(\tau\). You do not need simulations. The intractable part is the posterior:

\[
P(\tau \mid \mathbf{D}) = \frac{P(\mathbf{D} \mid \tau, \mu)\, P(\tau \mid N_e, \rho)}{\sum_{\tau'} P(\mathbf{D} \mid \tau', \mu)\, P(\tau' \mid N_e, \rho)}
\]
The space of genealogies (tree sequences over \(n\) samples and \(L\) sites) is combinatorial and enormous. No method can enumerate it. But you can approximate the posterior with a parametric distribution \(q(\tau \mid \mathbf{D}, \phi)\) and optimize \(\phi\) to make \(q\) as close to the true posterior as possible. This is variational inference.
The critical insight of Escapement is that the variational posterior \(q\) can be parameterized by a neural network, and the optimization objective – the Evidence Lower Bound (ELBO) – requires only the analytical prior and likelihood that every Timepiece already provides:

\[
\text{ELBO}(\phi) = \mathbb{E}_{\tau \sim q(\tau \mid \mathbf{D}, \phi)}\left[ \log P(\mathbf{D} \mid \tau, \mu) + \log P(\tau \mid N_e, \rho) - \log q(\tau \mid \mathbf{D}, \phi) \right]
\]
No simulations appear in this equation. The loss is computed on the observed data \(\mathbf{D}\), using a sampled genealogy \(\tau \sim q\), and analytical coalescent formulas. The network trains directly on real data.
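To make the three terms concrete, here is a fully runnable toy: with \(n = 2\) samples the genealogy collapses to a single coalescence time \(t\), so a gamma variational posterior and the reparameterization trick are enough. A minimal sketch – the observed mutation count and hyperparameters are invented for illustration, and none of this is Escapement's API:

```python
import torch
from torch.distributions import Exponential, Gamma, Poisson

# Toy ELBO: n = 2 samples, so the only latent variable is the pairwise
# coalescence time t (in generations).
Ne, mu, L = 10_000.0, 1.25e-8, 1_000_000
S = torch.tensor(55.0)  # observed segregating sites (made up for the sketch)

prior = Exponential(rate=torch.tensor(1.0 / (2.0 * Ne)))  # coalescent prior on t
log_alpha = torch.tensor(0.0, requires_grad=True)          # q = Gamma(alpha, beta)
log_beta = torch.tensor(-9.0, requires_grad=True)
opt = torch.optim.Adam([log_alpha, log_beta], lr=0.05)

for step in range(2000):
    q = Gamma(log_alpha.exp(), log_beta.exp())
    t = q.rsample()                                      # reparameterized sample
    branch_len = 2.0 * t                                 # two lineages, each of length t
    log_lik = Poisson(branch_len * mu * L).log_prob(S)   # mutation model: log P(D | t, mu)
    elbo = log_lik + prior.log_prob(t) - q.log_prob(t)   # the three ELBO terms
    loss = -elbo                                         # maximize ELBO = minimize loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The loop maximizes the ELBO by gradient ascent on the variational parameters; `rsample` is what lets gradients flow through the sampled genealogy – the same mechanism Escapement relies on for branch lengths.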
Why “Escapement”
In a mechanical watch, the escapement is the only component that generates its own timing reference. The mainspring provides energy; the gear train transmits it; but the escapement creates the rhythm. It needs no external calibration – just the physics of the balance wheel and the geometry of the escape wheel.
Similarly, Escapement creates its own training signal from the physics of the coalescent. Mainspring needs an external calibration source (simulations). Escapement does not.
What Came Before
The components of Escapement are not new. What is new is their combination into a single differentiable pipeline with a neural variational posterior.
| Method | Strategy | Strength | What Escapement adds |
|---|---|---|---|
| PSMC | Discretize time; HMM forward-backward; EM | Exact under the model; interpretable | Continuous time; multi-sample; neural posterior |
| ARGweaver | MCMC sampling of full ARG | Asymptotically exact posterior | Speed (gradient-based, not MCMC); scalability |
| phlash | Gibbs sampling with GP prior on branch lengths | Smooth posteriors; principled uncertainty | Amortized encoder; parallel across windows |
| tsinfer | Heuristic Li & Stephens matching | Scales to millions of samples | Posterior uncertainty; principled objective |
| tsdate | Hand-derived variational gamma updates (EP) | Mathematically principled; fast | Learned (not hand-derived) variational family; joint topology-time inference |
| Gamma-SMC | Gamma posteriors via forward-backward on SMC | Continuous time; analytical posteriors | Multi-sample; learned encoder; ARG topology |
Escapement synthesizes ideas from all of these. The SMC factorization from PSMC makes the coalescent prior tractable. The gamma posteriors from tsdate and Gamma-SMC inspire the branch-length parameterization. The attention mechanism from tsinfer informs the encoder. The continuous \(N_e(t)\) from phlash motivates the demographic module.
But the architecture is learned, not hand-derived. And the loss is the coalescent likelihood, not a simulation-matching objective.
Mainspring vs. Escapement
The two Complications are complementary, not competing. They represent two fundamentally different approaches to neural inference.
| Property | Mainspring | Escapement |
|---|---|---|
| Training data | Millions of msprime simulations | The observed genotype matrix (no simulations) |
| Loss function | Supervised: match predicted ARG to simulated truth | Unsupervised: maximize the ELBO on real data |
| What it learns | Simulator \(\to\) inference mapping (amortized) | Per-dataset variational posterior (not amortized) |
| Speed at inference | ~1 second (single forward pass) | ~10 minutes (gradient-based optimization per dataset) |
| Generalization | Generalizes across datasets but not across model classes | No generalization needed (optimizes per dataset) |
| Failure mode | Silent failure when data deviates from training simulations | ELBO provides a diagnostic (poor ELBO = poor fit) |
| Statistical guarantees | None (black-box posterior) | ELBO is a lower bound on the model evidence |
| When to prefer | Rapid screening of many datasets; biobank scale | Careful inference on one dataset; model checking needed |
| Scales to | 50–100 samples (limited by GPU memory) | 20–50 samples (limited by ELBO optimization cost) |
| Demographic output | \(N_e(t)\) via normalizing flow | \(N_e(t)\) via learnable piecewise-constant or neural spline |
When each approach wins
Use Mainspring when you need to process many datasets quickly and the training simulations are a good match for your biological system. The simulation fidelity gap is acceptable if you validate on a subset.
Use Escapement when you have one dataset that matters and you need principled uncertainty quantification. The ELBO provides a built-in diagnostic: if it is low, the model does not fit the data well.
Use both (the hybrid pipeline from Comparison and Limitations): run Mainspring for a fast initialization, then refine with Escapement for calibrated posteriors. This is the recommended workflow for careful demographic inference.
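A sketch of the hybrid workflow's shape. The Mainspring calls and the `warm_start` hook are hypothetical (and commented out), used only to show where the hand-off happens; the Escapement calls mirror the training example at the end of this chapter:

```python
import torch
from escapement import Escapement
# Hypothetical import – Mainspring's real interface lives in its own
# Complication and is not specified here:
# from mainspring import Mainspring

# Stand-in genotype matrix: 1 region, 20 haplotypes, 500 sites.
D = torch.bernoulli(torch.full((1, 20, 500), 0.15))

# Step 1 (fast, amortized): one Mainspring forward pass gives an initial
# ARG and demography in about a second. Hypothetical API:
# init_arg, init_ne = Mainspring.load_pretrained().predict(D)

model = Escapement(d_model=64, n_heads=4, n_layers=2, window=64,
                   Ne=10_000, mu=1.25e-8, rho=1e-8, span=100.0)
# Step 2: initialize the variational posterior near the Mainspring estimate.
# A hypothetical hook – warm-starting is covered in the Training chapter:
# model.warm_start(init_arg, init_ne)

# Step 3 (calibration): a shortened ELBO optimization refines the warm start.
# Fewer steps and a low, fixed temperature are assumptions, on the logic
# that the topology is already roughly settled.
optimizer = torch.optim.Adam(model.get_param_groups())
for step in range(200):
    loss = model.loss(D, temperature=0.3)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

ne_trajectory = model.get_Ne()   # calibrated N_e(t)
```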
Honest Limitations
Escapement is not a silver bullet. It has five fundamental limitations that cannot be engineered away.
1. Model misspecification. Escapement assumes the coalescent model is correct. The ELBO optimizes \(q(\tau)\) to maximize \(P(\mathbf{D} \mid \tau, \mu) \cdot P(\tau \mid N_e)\). If the true data-generating process violates the neutral coalescent (selection, population structure, gene conversion), the inferred \(\tau\) and \(N_e(t)\) will be biased. Unlike simulation-based methods, Escapement cannot incorporate selection or population structure into its prior without re-deriving the likelihood.
2. Slower per-dataset. Mainspring runs in seconds. Escapement requires hundreds to thousands of gradient steps per dataset (typically 10–30 minutes on a GPU). For screening thousands of genomic regions, Escapement is impractical as a standalone method.
3. Approximate posterior. The variational posterior \(q(\tau)\) is a factored approximation: topology, branch lengths, and breakpoints are approximately independent. The true posterior has strong correlations (e.g., between adjacent tree topologies, between branch lengths and topology). The ELBO provides a lower bound, but the gap between the ELBO and the true log-evidence can be large.
4. Hard topology optimization. Topology is discrete. Gumbel-softmax provides gradients through discrete choices, but the optimization landscape is rugged. The ELBO as a function of topology has many local optima, and gradient-based optimization may not find the global optimum. This is particularly problematic for deep internal nodes with uncertain ancestry (a sketch of this relaxation, together with the breakpoint relaxation below, follows after this list).
5. Breakpoint detection. Recombination breakpoints are modeled as Bernoulli variables at each position. For closely spaced breakpoints or gene conversion tracts, this local model may miss complex recombination events. The SMC factorization (inherited from PSMC) assumes breakpoints are well-separated – an assumption that breaks down in regions of high recombination.
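Limitations 4 and 5 share a root cause: topology and breakpoints are discrete, so gradients only flow through temperature-controlled relaxations. A minimal sketch of both, using standard PyTorch primitives – the logits are random placeholders for what Escapement's encoder would produce, and `RelaxedBernoulli` is one standard relaxation of Bernoulli indicators, not necessarily the exact one Escapement uses:

```python
import torch
import torch.nn.functional as F
from torch.distributions import RelaxedBernoulli

# Limitation 4: a relaxed discrete choice of which lineage pair coalesces
# next. With 4 lineages there are C(4,2) = 6 candidate pairs.
pair_logits = torch.randn(6, requires_grad=True)
soft_choice = F.gumbel_softmax(pair_logits, tau=0.5)              # differentiable, sums to 1
hard_choice = F.gumbel_softmax(pair_logits, tau=0.5, hard=True)   # one-hot forward, soft backward

# Limitation 5: relaxed per-site breakpoint indicators. Independent
# Bernoullis cannot represent clustered breakpoints or gene-conversion
# tracts – exactly the failure mode described above.
site_logits = torch.randn(500, requires_grad=True)
breakpoints = RelaxedBernoulli(temperature=torch.tensor(0.3),
                               logits=site_logits).rsample()

# Low tau/temperature gives near-discrete samples but noisy gradients; high
# values smooth the landscape but bias the samples. Annealing trades one
# for the other without escaping local optima.
```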
The Road Ahead
The remaining chapters of this Complication build Escapement from first principles:
Variational Inference Without Simulations – Deriving the ELBO for tree sequences, understanding the three terms, and explaining why this was not done before (the discrete topology barrier and the intractable prior).
Architecture – The four modules in detail: genealogy encoder, variational tree posterior, differentiable likelihood, and demographic inference. PyTorch pseudocode for each.
The Differentiable Likelihood – A deep dive into Module 3: mutation log-likelihood, coalescent log-prior, and entropy decomposition. The gradient flow from ELBO back through the reparameterization trick.
Training on Real Data – The training loop, temperature annealing, warm-starting from Mainspring or tsinfer, and practical considerations.
Comparison and Limitations – Systematic comparison against all methods, design principles borrowed from each Timepiece, and the hybrid pipeline.
Each chapter follows the book’s rhythm: motivation, math, code, verification. The math here is the coalescent likelihood – the same equations you derived in the Timepieces, now assembled into a single differentiable objective. The verification is not simulation-matching but ELBO convergence: does the variational posterior assign high probability to genealogies that explain the observed data?
```python
import torch
from escapement import Escapement

# The full model: encoder, variational tree posterior, differentiable
# likelihood, and demographic module (see the Architecture chapter).
model = Escapement(d_model=64, n_heads=4, n_layers=2, window=64,
                   Ne=10_000, mu=1.25e-8, rho=1e-8, span=100.0)

# A stand-in genotype matrix: 1 region, 20 haplotypes, 500 sites.
D = torch.bernoulli(torch.full((1, 20, 500), 0.15))

optimizer = torch.optim.Adam(model.get_param_groups())
for step in range(1000):
    # Anneal the Gumbel-softmax temperature from 1.0 toward 0.1: smooth
    # gradients early, near-discrete topologies late (Training chapter).
    temperature = max(0.1, 1.0 - step / 1000)
    loss = model.loss(D, temperature=temperature)   # the negative ELBO
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 5.0)
    optimizer.step()

results = model.infer(D)        # variational posterior over genealogies
ne_trajectory = model.get_Ne()  # inferred N_e(t)
```
This is all there is. No simulations. No external training data. Just the observed genotype matrix, the coalescent likelihood, and gradient descent.