Timepiece I: PSMC
The Pairwise Sequentially Markovian Coalescent
The Mechanism at a Glance
PSMC is an algorithm that infers population size history \(N(t)\) from a single diploid genome. It looks at the pattern of heterozygous and homozygous sites along the genome and asks: what demographic history best explains this pattern?
The key insight: at each position along the genome, the two copies of the chromosome share a most recent common ancestor (MRCA). The time to this MRCA – the coalescence time – varies along the genome because of recombination. And the distribution of coalescence times is shaped by the population size history.
More heterozygous sites = the two copies diverged long ago = large \(N(t)\) in the past (large populations take longer to coalesce)
Fewer heterozygous sites = the two copies diverged recently = small \(N(t)\) (bottleneck forced rapid coalescence)
PSMC reads these patterns with a Hidden Markov Model.
If the coalescent is the escapement – the fundamental ticking mechanism – then PSMC is the simplest complete watch you can build around it: just two hands (two haplotypes), a single gear train (the HMM), and a dial that reads out population size through time.
Primary Reference
The four gears of PSMC:
The Continuous-Time Model (the escapement) – The transition density \(q(t|s)\) that describes how the coalescence time changes between adjacent positions, under variable population size \(N(t)\). This is where the coalescent theory from The Workbench comes alive.
Discretization (the gear train) – Converting the continuous coalescence time into discrete time intervals, with a transition matrix \(p_{kl}\) suitable for an HMM. Continuous math becomes finite computation.
The HMM and EM (the mainspring) – The complete Hidden Markov Model: hidden states are time intervals, observations are het/hom, and the Expectation-Maximization algorithm estimates \(N(t)\). This is the engine that iteratively refines our estimate.
Decoding the Clock (the case and dial) – Posterior decoding, scaling to real units, bootstrapping, and interpreting the population size history. The final step: reading the watch.
These gears mesh together into a complete inference machine:
Diploid consensus sequence (het/hom along genome)
|
v
+-----------------------+
| DISCRETIZE TIME |
| |
| Choose intervals |
| [t_k, t_{k+1}) |
| Initialize lambda_k |
+-----------------------+
|
v
+--------> BUILD HMM |
| States: time intervals |
| Emissions: P(het | t_k) |
| Transitions: p_{kl} |
| | |
| v |
| FORWARD-BACKWARD |
| | |
| v |
| MAXIMIZE Q (update |
| theta, rho, lambda_k) |
| | |
+-------- converged? NO |
| |
YES |
| |
v |
+-----------------------+
| DECODE & INTERPRET |
| |
| Posterior decoding |
| Scale to real units |
| Plot N(t) vs time |
+-----------------------+
|
v
Population size history
Why Just Two Sequences?
You might wonder: why would anyone analyze just two sequences when we have methods like SINGER that handle many? Three reasons:
Data efficiency: A single high-coverage diploid genome contains millions of informative sites. Two haplotypes recombine enough times along a full chromosome to trace population size changes over hundreds of thousands of years.
Computational simplicity: With only two sequences, the marginal tree at each position is trivial – just a single coalescence time. No tree topology to infer, no branching structure. Just a number: when did these two copies last share an ancestor? This makes PSMC the ideal first Timepiece – the simplest watch you can build that actually tells useful time.
Historical importance: PSMC was published in 2011, when whole-genome data from single individuals was becoming available. It showed that a single genome contains a remarkable amount of information about the species’ past.
The simplicity of PSMC makes it an ideal first Timepiece to understand, because the PSMC transition density reappears as a gear inside SINGER’s time sampling step (Timepiece VII). And the very next Timepiece – SMC++ (Timepiece II) – generalizes PSMC from two sequences to many. We start here and build the complete watch around it.
Prerequisites for this Timepiece
Before starting PSMC, you should have worked through:
Coalescent Theory – the exponential distribution of coalescence times and the concept of coalescent time units
Hidden Markov Models – the forward-backward algorithm and scaling
The SMC – the Markov approximation and the basic PSMC transition density for constant population size
If you’ve read those chapters, you have all the tools you need. If some concepts are rusty, we’ll point you back to the relevant sections as they come up.
Chapters
- Overview of PSMC
- The Continuous-Time PSMC Model
- Step 1: Coalescence Under Variable Population Size
- Step 2: The Transition Density Under Variable N(t)
- Step 3: Normalization – It Integrates to 1
- Step 4: The Stationary Distribution
- Step 5: Including No-Recombination (The Full Transition)
- Step 6: Approximating C_sigma
- Step 7: The Rate of Pairwise Difference
- Putting It All Together
- Exercises
- Solutions
- Discretizing Time
- The PSMC HMM and EM Algorithm
- Decoding the Clock
- Demo: Running PSMC on Simulated Data