Overview of Mainspring

The master watchmaker does not discard the tools of their apprenticeship. They build a single instrument that embodies every lesson learned at the bench.

Every Timepiece in this book makes a bargain. PSMC trades multi-sample information for the elegance of a two-haplotype HMM. SMC++ gains samples but discretizes the coalescent into an ODE system. ARGweaver is exact under the discrete sequential Markov coalescent but costs \(O(S^2 K)\) per site (for \(S\) samples and \(K\) discrete time points) and hours of MCMC per kilobase. tsinfer scales to millions of samples but surrenders posterior inference entirely, producing a single point estimate with no uncertainty. tsdate adds dates to the tree sequence but treats the topology as fixed. dadi and moments collapse the genome to a frequency histogram, discarding linkage information. Gamma-SMC maintains full posterior uncertainty over coalescence times but processes only two haplotypes at a time. phlash achieves scalable Bayesian demography but relies on a composite likelihood.

Each Timepiece sits at a different point on the Pareto frontier spanned by accuracy, scalability, and biological realism. No classical method occupies the corner where all three are maximized – the computational cost of exact inference under the full coalescent is simply too high.

Mainspring is an attempt to break this frontier – not by inventing new population genetics (every equation in this book remains valid), but by compiling the structural insights of every Timepiece into a single neural architecture that learns to perform approximate posterior inference in a single forward pass.

What Mainspring Does

Given a genotype matrix \(\mathbf{D} \in \{0,1\}^{n \times L}\) (where \(n\) is the number of haploid samples and \(L\) is the number of segregating sites), Mainspring produces:

  1. A full ancestral recombination graph (ARG) in tskit format – topology, breakpoints, and node times.

  2. A posterior distribution over effective population size trajectories \(N_e(t)\).

Both outputs emerge from a single forward pass through the network. No MCMC sampling, no EM iterations, no iterative optimization at inference time.

\[f_\theta : \mathbf{D} \in \{0,1\}^{n \times L} \;\longrightarrow\; \bigl(\widehat{\text{ARG}},\; q(N_e(t))\bigr)\]

The function \(f_\theta\) is a neural network with parameters \(\theta\) trained on millions of simulated datasets from msprime. At training time, we have access to the true ARG and true demography for each simulation. At inference time, we have only the genotype matrix.
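To make the training side concrete, here is a minimal sketch of how one training pair might be generated. The single-parameter log-uniform prior and the dictionary layout are illustrative placeholders, not Mainspring’s actual training configuration:

```python
import numpy as np
import msprime

rng = np.random.default_rng(1)

# Draw a demography from the training prior (here a single constant
# N_e, sampled log-uniformly over [1e3, 1e5] -- purely illustrative).
true_ne = 10 ** rng.uniform(3, 5)

# Simulate an ARG under that demography, then overlay mutations.
ts = msprime.sim_ancestry(
    samples=10, sequence_length=1e5, recombination_rate=1e-8,
    population_size=true_ne, random_seed=1,
)
ts = msprime.sim_mutations(ts, rate=1.25e-8, random_seed=1)

# Network input: the genotype matrix, shaped (n haplotypes, L sites).
D = ts.genotype_matrix().T

# Supervision targets, known exactly because we simulated them:
# the full ARG (topology, breakpoints, node times) and the demography.
node_times = ts.tables.nodes.time
training_pair = {"input": D, "arg": ts, "ne": true_ne}
```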

Amortized inference

The term amortized means that the computational cost of inference is paid once, during training, and then amortized across all future datasets. Training Mainspring takes days on a GPU cluster. But once trained, inference on a new dataset takes seconds. This is the same economics as compiling a program: the compiler is slow, but the compiled binary is fast. The simulations are the source code; the trained network is the compiled binary.

For a thorough introduction to amortized inference, see The Other Paradigm: Neural Networks and Amortized Inference.

Why “Structure-Aware” Matters

Neural approaches to population genetics are not new. Sheehan & Song (2016) trained a deep network on summary statistics for joint demographic and selection inference. pg-gan (Wang et al. 2021) used a GAN whose discriminator learns to distinguish simulated from real genotype matrices. ImaGene (Torada et al. 2019) applied a CNN to genotype matrices for selection detection. ReLERNN (Adrion et al. 2020) used an RNN for recombination rate estimation. These methods treat their input as a generic image or sequence, applying off-the-shelf architectures without encoding coalescent-specific domain knowledge.

Mainspring is different in a specific, measurable way: every architectural choice corresponds to a mathematical property of the coalescent.

Generic vs. structure-aware neural inference

| Property | Generic approach | Mainspring |
|---|---|---|
| Positional structure | Flat convolution or full attention | Sliding-window causal attention (sequential Markov property from PSMC) |
| Sample exchangeability | Fixed sample ordering | Set Transformer, permutation-equivariant (from SMC++) |
| Output format | Scalar summary statistics | Full ARG in tskit format (from ARGweaver / SINGER) |
| Tree dating | Regression to point estimate | GNN message passing with gamma posteriors (from tsdate / Gamma-SMC) |
| Haplotype relationships | Implicit in convolution filters | Cross-attention as Li & Stephens copying (from tsinfer / lshmm) |
| Demographic output | Point estimate of \(N_e\) | Normalizing flow posterior \(q(N_e(t))\) (from phlash) |
| Physics regularization | None | Differentiable SFS loss (from dadi / moments / momi2) |
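To make the first row concrete: a sliding-window causal mask restricts each site to attend only to a bounded neighborhood of preceding sites, which is the attention-mask analogue of the sequential Markov property. A minimal sketch, with an arbitrary window size:

```python
import torch
import torch.nn.functional as F

def sliding_window_causal_mask(n_sites: int, window: int) -> torch.Tensor:
    """Boolean mask allowing site i to attend to sites j with
    i - window < j <= i. True means "may attend", matching PyTorch's
    boolean attn_mask convention."""
    i = torch.arange(n_sites).unsqueeze(1)
    j = torch.arange(n_sites).unsqueeze(0)
    return (j <= i) & (j > i - window)

# Toy tensors shaped (batch, heads, sites, head_dim).
q = k = v = torch.randn(1, 4, 256, 32)
mask = sliding_window_causal_mask(256, window=64)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```

A dedicated banded implementation makes this cost linear in sequence length; the dense mask above only illustrates the connectivity pattern.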

The result is not that Mainspring is merely “better” in some generic sense – it is that the network converges faster, generalizes better to out-of-distribution demographies, and produces better-calibrated uncertainty estimates than a generic architecture would, because it encodes the right inductive biases.

The Key Insight: Population Genetics Compiled into Architecture

The central thesis of Mainspring can be stated in one sentence:

Core thesis

The mathematical structure of population genetics – the sequential Markov property, permutation invariance of exchangeable samples, the Li & Stephens copying model, message-passing on trees, gamma-distributed coalescence times, and the site frequency spectrum as a sufficient statistic for demography – can be compiled into neural network architecture rather than hard-coded into likelihood functions.
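One of these structural properties is easy to verify numerically. The pooled demographic summary of an exchangeable encoder must be invariant to shuffling the haplotype rows; the toy DeepSets-style encoder below (not Mainspring’s Set Transformer) demonstrates the property:

```python
import torch
import torch.nn as nn

class ToyExchangeableEncoder(nn.Module):
    """Per-haplotype MLP followed by mean pooling over the sample axis,
    so the pooled summary cannot depend on sample ordering."""
    def __init__(self, n_sites: int, hidden: int = 64):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(n_sites, hidden), nn.ReLU())
        self.rho = nn.Linear(hidden, 8)

    def forward(self, D: torch.Tensor) -> torch.Tensor:
        # D: (batch, n haplotypes, L sites)
        return self.rho(self.phi(D).mean(dim=1))

enc = ToyExchangeableEncoder(n_sites=100)
D = torch.randint(0, 2, (1, 20, 100)).float()
perm = torch.randperm(20)
# Shuffling haplotype rows leaves the summary unchanged (up to float error).
assert torch.allclose(enc(D), enc(D[:, perm, :]), atol=1e-6)
```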

What does “compiled” mean here? Consider the analogy to a mechanical watch. Each Timepiece in this book hand-crafts a specific gear train: PSMC builds a transition matrix, tsdate implements belief propagation, ARGweaver constructs an MCMC sampler. These are interpreted approaches – they execute the population-genetic equations step by step at inference time.

Mainspring takes a different approach. It compiles the equations into a fixed neural circuit during training. The circuit does not execute the equations – it has learned to shortcut them. But the circuit’s wiring diagram (the architecture) mirrors the structure of the equations, so the shortcuts are faithful.

This is why Mainspring is a Complication, not a replacement. A complication in horology adds functionality to the basic movement without altering the movement itself. The Timepieces are the movement. Mainspring is a complication that adds speed – but only because the movement is sound.

What Each Timepiece Trades Away

To appreciate what Mainspring attempts to unify, we must be precise about what each Timepiece sacrifices.

The Pareto frontier of classical methods

| Timepiece | Core mechanism | What it trades away | Mainspring’s response |
|---|---|---|---|
| PSMC | HMM on two haplotypes | Multi-sample information; piecewise-constant \(N_e(t)\) | Process all samples jointly; continuous \(N_e(t)\) via normalizing flow |
| SMC++ | ODE-based HMM with distinguished lineage | ARG topology; limited to \(\sim 200\) undistinguished samples | Output full ARG; permutation-equivariant encoder handles arbitrary \(n\) |
| ARGweaver | MCMC sampling of full ARG | Speed (\(O(S^2 K)\) per site); limited scalability | Single forward pass; linear-time sliding-window attention |
| SINGER | Gibbs sampling of ARG with GP prior | Speed (hours for megabase regions); sequential processing | Parallel across genomic windows; batched inference |
| tsinfer | Li & Stephens ancestor matching | Posterior uncertainty; dates; demographic inference | Posterior via dropout/ensemble; GNN dating; demographic decoder |
| tsdate | Inside-outside on fixed topology | Topology uncertainty; demographic inference | Joint topology and dating; demographic decoder |
| dadi | Diffusion equation for SFS | Linkage information; limited to \(\sim 3\) populations | Full sequence input; population structure as latent variable |
| moments | ODE system for SFS moments | Same as dadi (different numerical approach) | SFS used as auxiliary loss, not sole input |
| Gamma-SMC | Gamma-distributed coalescence-time posteriors | Two haplotypes only; no ARG topology | Gamma output heads on multi-sample GNN |
| phlash | SVGD over \(N_e(t)\) with composite likelihood | Composite-likelihood approximation; pre-computed pairs | End-to-end training; full likelihood via simulation |
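The Gamma-SMC row translates directly into code: a gamma output head maps each node embedding to a concentration and a rate, and training minimizes the negative log-likelihood of the true node times. A minimal sketch of such a head (the layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class GammaHead(nn.Module):
    """Maps a node embedding to the parameters of a gamma posterior
    over that node's coalescence time."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, 2)

    def forward(self, h: torch.Tensor) -> torch.distributions.Gamma:
        log_alpha, log_beta = self.proj(h).unbind(-1)
        # Exponentiate so concentration and rate stay strictly positive.
        return torch.distributions.Gamma(log_alpha.exp(), log_beta.exp())

head = GammaHead(dim=32)
h = torch.randn(50, 32)            # embeddings for 50 internal nodes
true_times = torch.rand(50) + 0.1  # true node ages (toy values)
q = head(h)
loss = -q.log_prob(true_times).mean()  # NLL against simulated truth
loss.backward()
```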

Honest Limitations

Mainspring is not a universal solution. It has fundamental limitations that no amount of engineering can fully resolve.

1. The simulation fidelity gap. Mainspring is only as good as its training simulations. If the real data-generating process includes features absent from msprime – gene conversion, structural variants, sequencing error, population structure not captured by the demographic model – the network may produce confidently wrong answers. This is the Achilles’ heel of all simulation-based inference: the posterior is conditioned on the simulator being correct.

2. No statistical guarantees. Unlike MCMC methods (which are asymptotically exact given enough iterations) or variational methods (which provide a lower bound on the evidence), Mainspring’s posterior approximation has no formal guarantees. The network may be miscalibrated, especially in regions of parameter space poorly represented in the training set.

3. Extrapolation. Neural networks extrapolate poorly. If the true demography lies outside the prior used to generate training data – a bottleneck more severe than any in the training set, a population size larger than any simulated – the network will struggle. The training prior must be chosen carefully and validated against held-out scenarios.

4. Interpretability. A Timepiece’s likelihood function can be inspected term by term. The PSMC transition matrix tells you exactly how recombination and population size interact. Mainspring’s learned representations are opaque. We can probe them with attention maps and ablation studies, but we cannot point to a specific neuron and say “this computes the coalescence rate.”

5. Training cost. Training Mainspring requires millions of msprime simulations (each producing a full ARG), hundreds of GPU-hours, and careful hyperparameter tuning. This is a one-time cost, but it is substantial. A lab without GPU resources may find classical methods more practical.

6. Recombination map dependency. Mainspring assumes a known recombination map. Errors in the recombination map propagate into errors in the inferred ARG and demography. This limitation is shared with ARGweaver and SINGER but not with methods that operate on summary statistics (dadi, moments).
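For reference, a non-uniform map enters an msprime simulation as a RateMap; the breakpoints and rates below are arbitrary. Training on varied maps, and supplying the matching map at inference, is what the assumption above amounts to:

```python
import msprime

# A toy two-interval recombination map over 100 kb.
rate_map = msprime.RateMap(
    position=[0, 50_000, 100_000],  # interval breakpoints (bp)
    rate=[1e-8, 5e-8],              # per-bp, per-generation rates
)

# When a RateMap is supplied, the sequence length comes from the map.
ts = msprime.sim_ancestry(
    samples=20, recombination_rate=rate_map,
    population_size=10_000, random_seed=7,
)
```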

When to use Mainspring vs. a Timepiece

Use a Timepiece when you need interpretable, guaranteed inference on a well-characterized problem (e.g., estimating \(N_e(t)\) from a single diploid genome with PSMC). Use Mainspring when you need fast, approximate inference on many datasets and are willing to validate against classical methods on a subset.

The best practice is to use both: Mainspring for rapid screening, a Timepiece for careful analysis of interesting cases. This is the hybrid pipeline described in Comparison and Limitations.
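One concrete form of that validation is a coverage check: simulate datasets with known demography and measure how often the truth lands inside the network’s credible interval. A minimal sketch for the constant-\(N_e\) case, assuming the model interface from the example at the end of this chapter:

```python
import numpy as np
import torch
import msprime

def coverage_check(model, n_reps: int = 200, level: float = 0.90) -> float:
    """Fraction of simulations whose true N_e falls inside the central
    `level` credible interval; well calibrated if close to `level`."""
    lo, hi = (1 - level) / 2, 1 - (1 - level) / 2
    hits = 0
    for seed in range(n_reps):
        true_ne = 10 ** np.random.default_rng(seed).uniform(3, 5)
        ts = msprime.sim_ancestry(
            samples=20, sequence_length=1e5, recombination_rate=1e-8,
            population_size=true_ne, random_seed=seed + 1,
        )
        ts = msprime.sim_mutations(ts, rate=1.25e-8, random_seed=seed + 1)
        D = torch.tensor(ts.genotype_matrix().T, dtype=torch.float32)[None]
        with torch.no_grad():
            # Assumed interface: a 1-D array of posterior draws of N_e.
            draws = np.asarray(model(D)['ne_posterior'])
        q_lo, q_hi = np.quantile(draws, [lo, hi])
        hits += int(q_lo <= true_ne <= q_hi)
    return hits / n_reps
```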

The Road Ahead

The remaining chapters of this Complication build Mainspring from the ground up:

  1. Design Principles – One Per Timepiece – Ten design principles, one from each Timepiece. These are the architectural decisions that distinguish Mainspring from a generic neural network.

  2. Architecture – The four-stage architecture in full detail: genomic encoder, topology decoder, dating GNN, and demographic decoder. PyTorch pseudocode for every component.

  3. Training – The simulation engine, the composite loss function, the curriculum training strategy, and the complete training loop.

  4. Comparison and Limitations – A systematic comparison against every Timepiece, honest limitations revisited, and the hybrid pipeline that combines Mainspring with Escapement for principled refinement.

Each chapter follows the book’s rhythm: motivation, math, code, verification. But the “math” here is architecture design – the translation of population-genetic structure into neural network components. And the “verification” is not analytical but empirical: we check that the network recovers known truths from simulated data.

A minimal example showing Mainspring’s interface (assuming a trained model):

```python
import torch
import msprime

# Simulate a test dataset: 20 diploid individuals over 100 kb.
ts = msprime.sim_ancestry(20, sequence_length=1e5, recombination_rate=1e-8,
                          population_size=10_000, random_seed=42)
ts = msprime.sim_mutations(ts, rate=1.25e-8, random_seed=42)

# Genotype matrix, shaped (batch=1, n haplotypes, L sites).
D = torch.tensor(ts.genotype_matrix().T, dtype=torch.float32).unsqueeze(0)

# Run inference (single forward pass).
model = Mainspring.load_pretrained("mainspring_v1.pt")
model.eval()
with torch.no_grad():
    results = model(D, hard=True)

predicted_arg = results['topology']     # ARG in tskit format
ne_posterior = results['ne_posterior']  # samples from q(N_e(t))
node_times = results['times']           # gamma means
```

Let us begin with the design principles that make Mainspring more than a black box.