The Coalescent Process

The escapement of the mechanism: how lineages find common ancestors.

The coalescent is the mathematical foundation of msprime. Before we can simulate anything, we need to understand the stochastic process that governs how lineages merge backwards in time. We build this from the very bottom: starting with two lineages, then \(n\), then adding recombination.

If you have worked through the Coalescent Theory chapter, much of the mathematics here will be familiar – but the emphasis is different. There, we developed the theory for its own sake. Here, we are building toward a simulation algorithm: every formula we derive will become a line of code in Hudson’s Algorithm, the main simulation loop – the ticking of the clock.

Think of this chapter as calibrating the escapement: we need to know exactly how fast the gears should turn before we can assemble the movement.

Step 1: Two Lineages, No Recombination

Start with the simplest possible case: two haploid genomes sampled from a population of constant size \(N\). Going one generation back, what is the probability that these two genomes share a parent?

In a haploid Wright-Fisher population of size \(N\), each individual in generation \(t\) chose its parent uniformly at random from the \(N\) individuals in generation \(t-1\). So the probability that our two samples chose the same parent is:

\[P(\text{same parent}) = \frac{1}{N}\]

and the probability they chose different parents is:

\[P(\text{different parents}) = 1 - \frac{1}{N}\]

If they didn’t coalesce in the first generation, the problem resets: we now have two lineages one generation further back, in the same population. The process is memoryless.

Probability Aside – The memoryless property

A random variable \(T\) is memoryless if \(P(T > s + t \mid T > s) = P(T > t)\) for all \(s, t \geq 0\). Knowing that no coalescence has happened so far gives you no information about how much longer you will wait. Among continuous distributions, only the exponential has this property. Among discrete distributions, only the geometric has it. This is why the coalescent waiting time (geometric in discrete time, exponential in continuous time) resets perfectly after each failed generation.

Waiting time distribution. Let \(T_2\) be the number of generations until the two lineages coalesce. The probability that they first coalesce in generation \(g\) is:

\[P(T_2 = g) = \left(1 - \frac{1}{N}\right)^{g-1} \cdot \frac{1}{N}\]

This is a geometric distribution with success probability \(1/N\).

The continuous-time approximation. For large \(N\), we can approximate this discrete geometric distribution with a continuous exponential. Measure time in units of \(N\) generations: let \(t = g / N\). Then:

\[P(T_2 > t) = \left(1 - \frac{1}{N}\right)^{Nt} \approx e^{-t}\]

as \(N \to \infty\). So in units of \(N\) generations, the coalescence time of two lineages is approximately \(\text{Exponential}(1)\).

Calculus Aside – The exponential limit

The key identity is \(\lim_{N \to \infty} (1 - 1/N)^N = e^{-1}\). More generally, \(\lim_{N \to \infty} (1 - c/N)^{Nt} = e^{-ct}\). This follows from taking the logarithm: \(Nt \cdot \ln(1 - c/N) \approx Nt \cdot (-c/N) = -ct\) as \(N \to \infty\), using the first-order Taylor expansion \(\ln(1 - x) \approx -x\) for small \(x\). For \(N = 10{,}000\), the error is about \(0.005\%\) – far smaller than any biological uncertainty.

import numpy as np

def simulate_coalescence_time_discrete(N, n_replicates=10000):
    """Simulate coalescence time for 2 lineages in discrete generations.

    Each generation, the two lineages independently choose a parent
    uniformly from N individuals. If they pick the same one, they coalesce.
    """
    times = np.zeros(n_replicates)
    for rep in range(n_replicates):
        g = 0
        while True:
            g += 1
            # With probability 1/N, the two lineages pick the same parent
            if np.random.random() < 1.0 / N:
                break
        times[rep] = g
    return times

def simulate_coalescence_time_continuous(n_replicates=10000):
    """Simulate coalescence time for 2 lineages (exponential).

    In the continuous-time limit, the waiting time is Exp(1)
    in coalescent units (units of N generations).
    """
    return np.random.exponential(1.0, size=n_replicates)

# Compare: discrete (in units of N generations) vs continuous
N = 10000
discrete_times = simulate_coalescence_time_discrete(N) / N  # rescale to coalescent units
continuous_times = simulate_coalescence_time_continuous()

print(f"Discrete:   mean = {discrete_times.mean():.4f}, "
      f"var = {discrete_times.var():.4f}")
print(f"Continuous: mean = {continuous_times.mean():.4f}, "
      f"var = {continuous_times.var():.4f}")
print(f"Theory:     mean = 1.0000, var = 1.0000")

With the two-lineage case in hand, we can now generalize to \(n\) lineages. The key insight is that each pair of lineages races independently to coalesce, and with \(k\) lineages there are \(\binom{k}{2}\) such pairs.

Step 2: Multiple Lineages, No Recombination

Now take \(k\) lineages in a population of size \(N\). Going one generation back, any pair could coalesce. How many pairs are there?

\[\binom{k}{2} = \frac{k(k-1)}{2}\]

Each pair independently has probability \(1/N\) of coalescing. For large \(N\), the probability that any pair coalesces is approximately:

\[P(\text{any coalescence}) \approx \frac{\binom{k}{2}}{N}\]

(We ignore the possibility of two simultaneous coalescences, which has probability \(O(1/N^2)\).)

Probability Aside – Why we can ignore simultaneous events

The probability that exactly one pair coalesces is \(\binom{k}{2} \cdot \frac{1}{N} \cdot (1 - \frac{1}{N})^{\binom{k}{2}-1} \approx \binom{k}{2}/N\). The probability that two or more pairs coalesce simultaneously is \(O(\binom{k}{2}^2 / N^2)\). For \(k = 100\) and \(N = 10{,}000\), the single-event probability is about \(0.5\), while the double-event probability is about \(0.0025\) – negligible. In the continuous-time limit (\(N \to \infty\)), simultaneous events have probability exactly zero.

The continuous-time limit. In units of \(N\) generations, the waiting time until the next coalescence among \(k\) lineages is:

\[T_k \sim \text{Exponential}\left(\binom{k}{2}\right)\]

with mean \(\frac{1}{\binom{k}{2}} = \frac{2}{k(k-1)}\).

The full coalescent process for \(n\) samples:

  1. Start with \(k = n\) lineages

  2. Wait \(T_k \sim \text{Exp}(\binom{k}{2})\) time units

  3. Choose a random pair to coalesce: \(k \to k - 1\)

  4. Repeat until \(k = 1\)

def simulate_coalescent(n, n_replicates=1):
    """Simulate the standard coalescent for n samples.

    Returns
    -------
    times : list of float
        Coalescence times (in coalescent units of N generations).
    pairs : list of (int, int)
        Which pair coalesced at each event.
    """
    all_results = []

    for _ in range(n_replicates):
        lineages = list(range(n))  # each lineage gets a unique ID
        times = []
        pairs = []
        t = 0.0

        while len(lineages) > 1:
            k = len(lineages)
            rate = k * (k - 1) / 2  # binom(k, 2) -- the coalescence rate
            # Wait exponential time with this rate
            t += np.random.exponential(1.0 / rate)
            # Choose random pair to coalesce
            i, j = sorted(np.random.choice(len(lineages), 2, replace=False))
            pairs.append((lineages[i], lineages[j]))
            times.append(t)
            # Merge: replace i with new node, remove j
            new_node = max(lineages) + 1
            lineages[i] = new_node
            lineages.pop(j)

        all_results.append((times, pairs))

    return all_results

# Simulate and show the coalescent for n=5
results = simulate_coalescent(5, n_replicates=1)
times, pairs = results[0]
print("Coalescent for n=5:")
for i, (t, (a, b)) in enumerate(zip(times, pairs)):
    print(f"  k={5-i}: t={t:.4f}, lineages {a} and {b} coalesce")

Expected time to MRCA

The total time to the Most Recent Common Ancestor (MRCA) is:

\[T_{\text{MRCA}} = \sum_{k=2}^{n} T_k\]

Since each \(T_k\) is independent with mean \(\frac{2}{k(k-1)}\):

\[E[T_{\text{MRCA}}] = \sum_{k=2}^{n} \frac{2}{k(k-1)} = 2\sum_{k=2}^{n} \left(\frac{1}{k-1} - \frac{1}{k}\right) = 2\left(1 - \frac{1}{n}\right)\]

This is remarkable. The expected MRCA time approaches 2 (in units of \(N\) generations) regardless of how large \(n\) is. Adding more samples barely changes the total tree height.

Calculus Aside – Telescoping sums

The partial fraction decomposition \(\frac{2}{k(k-1)} = 2\left(\frac{1}{k-1} - \frac{1}{k}\right)\) gives a telescoping sum: most terms cancel in pairs.

\[\sum_{k=2}^{n} 2\left(\frac{1}{k-1} - \frac{1}{k}\right) = 2\left[\left(\frac{1}{1} - \frac{1}{2}\right) + \left(\frac{1}{2} - \frac{1}{3}\right) + \cdots + \left(\frac{1}{n-1} - \frac{1}{n}\right)\right] = 2\left(1 - \frac{1}{n}\right)\]

Everything except the first and last terms cancels. This is a standard technique in combinatorics and analysis; watch for it whenever you see partial fractions of the form \(1/[k(k-1)]\).

Why? With many lineages, the coalescence rate \(\binom{k}{2} \sim k^2/2\) is so high that events happen almost instantly. Most of the time is spent waiting for the last few lineages to merge.

def expected_tmrca(n):
    """Expected MRCA time for n samples (in coalescent units)."""
    return 2 * (1 - 1.0 / n)

def expected_total_branch_length(n):
    """Expected total branch length of the coalescent tree.

    This equals 2 * H_{n-1}, where H_k is the k-th harmonic number.
    """
    return 2 * sum(1.0 / k for k in range(1, n))

for n in [2, 5, 10, 50, 100, 1000]:
    print(f"n={n:>5d}: E[T_MRCA] = {expected_tmrca(n):.4f}, "
          f"E[total length] = {expected_total_branch_length(n):.4f}")

We now know how to simulate when events happen (exponential waiting times) and what happens (a random pair coalesces). But in the real simulation, coalescence is not the only possible event. Recombination and migration also compete for the clock’s next tick. To handle this competition, we need the exponential race.

Step 3: The Exponential Race

The coalescent with recombination involves multiple types of events (coalescence, recombination, migration) happening at different rates. The simulation needs to determine which event happens next. This is where the exponential race comes in.

Fact. If \(X_1 \sim \text{Exp}(\lambda_1)\) and \(X_2 \sim \text{Exp}(\lambda_2)\) are independent, then:

  1. \(\min(X_1, X_2) \sim \text{Exp}(\lambda_1 + \lambda_2)\)

  2. \(P(X_1 < X_2) = \frac{\lambda_1}{\lambda_1 + \lambda_2}\)

Proof of (1). \(P(\min(X_1, X_2) > t) = P(X_1 > t) P(X_2 > t) = e^{-\lambda_1 t} e^{-\lambda_2 t} = e^{-(\lambda_1 + \lambda_2)t}\).

Proof of (2).

\[P(X_1 < X_2) = \int_0^\infty P(X_2 > t) \cdot f_{X_1}(t) \, dt = \int_0^\infty e^{-\lambda_2 t} \cdot \lambda_1 e^{-\lambda_1 t} \, dt = \lambda_1 \int_0^\infty e^{-(\lambda_1 + \lambda_2) t} \, dt = \frac{\lambda_1}{\lambda_1 + \lambda_2}\]

Calculus Aside – Evaluating the integral

The integral \(\int_0^\infty e^{-\alpha t}\,dt = 1/\alpha\) for \(\alpha > 0\) is one of the most important integrals in probability. It follows immediately from the antiderivative \(\int e^{-\alpha t}\,dt = -\frac{1}{\alpha}e^{-\alpha t} + C\) and the fact that \(e^{-\alpha t} \to 0\) as \(t \to \infty\). In the proof above, \(\alpha = \lambda_1 + \lambda_2\).

This extends to any number of competing exponentials: the minimum of \(\text{Exp}(\lambda_1), \ldots, \text{Exp}(\lambda_m)\) is \(\text{Exp}(\sum_i \lambda_i)\), and event \(i\) wins with probability \(\lambda_i / \sum_j \lambda_j\).

This is the simulation engine. At each step, msprime computes:

  • \(\lambda_{\text{coal}}\) = total coalescence rate

  • \(\lambda_{\text{recomb}}\) = total recombination rate

  • \(\lambda_{\text{mig}}\) = total migration rate

Then draws the minimum, advances time, and executes the winning event. This is the heartbeat of Hudson’s Algorithm – the main simulation loop, the ticking of the clock.

def exponential_race(*rates):
    """Simulate an exponential race.

    Each "competitor" proposes a random waiting time drawn from
    Exp(rate_i). The competitor with the shortest time wins.

    Parameters
    ----------
    rates : floats
        Rate of each competing process.

    Returns
    -------
    winner : int
        Index of the process that fired first.
    time : float
        Waiting time until the first event.
    """
    times = []
    for rate in rates:
        if rate > 0:
            # Draw a random waiting time: higher rate -> shorter wait on average
            times.append(np.random.exponential(1.0 / rate))
        else:
            times.append(np.inf)  # rate=0 means this event never fires

    winner = np.argmin(times)
    return winner, times[winner]

# Example: coalescence (rate=10) vs recombination (rate=5)
wins = np.zeros(2)
for _ in range(10000):
    w, t = exponential_race(10.0, 5.0)
    wins[w] += 1
print(f"Coalescence wins: {wins[0]/100:.1f}% "
      f"(expected: {10/15*100:.1f}%)")
print(f"Recombination wins: {wins[1]/100:.1f}% "
      f"(expected: {5/15*100:.1f}%)")

A practical subtlety

msprime does not actually compute the minimum of several exponential random variables. Instead, it draws each exponential independently and takes the minimum. This is mathematically equivalent but allows each rate to be computed separately (e.g., coalescence rates per population, migration rates per population pair), which is important for the implementation.

With the exponential race in our toolkit, we are ready to add the crucial complication that makes the coalescent interesting (and computationally challenging): recombination.

Step 4: Adding Recombination

Now the crucial complication. In the standard coalescent without recombination, each lineage is a single indivisible entity. With recombination, a lineage can split: part of the genome traces back through one parent, part through the other.

Going backwards in time, a recombination event on a lineage at position \(x\) produces:

  • A left lineage carrying ancestry for positions \([0, x)\)

  • A right lineage carrying ancestry for positions \([x, L)\)

This increases the number of lineages by one.

Rate of recombination. A lineage carrying a segment of length \(\ell\) has a recombination rate of \(\rho \cdot \ell / L\) per coalescent time unit, where \(\rho = 4N_e r L\) is the population-scaled recombination rate for the whole genome. Equivalently, the per-base-pair rate in coalescent units is \(\rho / L\).

The total recombination rate across all lineages is:

\[\lambda_{\text{recomb}} = \frac{\rho}{L} \sum_{i=1}^{k} \ell_i\]

where \(\ell_i\) is the total length of ancestry carried by lineage \(i\).

Why recombination rate depends on segment length

A recombination can only matter where the lineage carries ancestral material. If a lineage has already lost some genomic segments through prior coalescence events, recombination in those lost regions has no effect. Only the remaining “active” segments contribute to the rate.

Closing a confusion gap – Why does recombination increase the lineage count?

In forward time, recombination takes two parental chromosomes and shuffles them into one offspring chromosome. But in the coalescent, we trace ancestry backwards. A present-day chromosome that was formed by recombination has two parents – one for the left half and one for the right half. Tracing backwards, one lineage becomes two. This is the fundamental asymmetry of the backward view: coalescence reduces the lineage count (two children share one parent), while recombination increases it (one child has two parents for different genomic regions).

def coalescent_with_recombination_simple(n, L, rho, max_events=10000):
    """A simple (but slow) coalescent with recombination.

    Parameters
    ----------
    n : int
        Sample size.
    L : float
        Genome length.
    rho : float
        Population-scaled recombination rate (4*Ne*r*L).

    Returns
    -------
    events : list of (time, event_type, details)
    """
    # Each lineage is a list of (left, right) segments
    # Initially, every lineage carries the full genome [0, L).
    lineages = [[(0, L)] for _ in range(n)]
    events = []
    t = 0.0

    for _ in range(max_events):
        k = len(lineages)
        if k <= 1:
            break

        # Coalescence rate: any pair of k lineages can coalesce
        coal_rate = k * (k - 1) / 2

        # Recombination rate: proportional to total ancestry length
        total_length = sum(
            sum(r - l for l, r in segs) for segs in lineages
        )
        recomb_rate = rho * total_length / L

        # Exponential race between coalescence and recombination
        winner, dt = exponential_race(coal_rate, recomb_rate)
        t += dt

        if winner == 0:
            # Coalescence: pick two random lineages and merge
            i, j = sorted(np.random.choice(k, 2, replace=False))
            merged = merge_segments(lineages[i], lineages[j])
            lineages.pop(j)
            lineages[i] = merged
            events.append((t, 'coal', len(lineages)))

        else:
            # Recombination: pick a lineage weighted by length,
            # then pick a breakpoint
            lengths = [sum(r - l for l, r in segs) for segs in lineages]
            probs = np.array(lengths) / sum(lengths)
            idx = np.random.choice(k, p=probs)

            # Pick a random position within the segments
            bp = pick_random_breakpoint(lineages[idx], L)
            left_segs, right_segs = split_at_breakpoint(lineages[idx], bp)

            if left_segs and right_segs:
                lineages[idx] = left_segs
                lineages.append(right_segs)
                events.append((t, 'recomb', len(lineages)))

    return events

def merge_segments(segs_a, segs_b):
    """Merge two segment lists (simplified: just concatenate and sort)."""
    all_segs = sorted(segs_a + segs_b)
    # Merge overlapping intervals (coalescence reduces segment count)
    merged = [all_segs[0]]
    for l, r in all_segs[1:]:
        if l <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], r))
        else:
            merged.append((l, r))
    return merged

def pick_random_breakpoint(segs, L):
    """Pick a random breakpoint within the segment list."""
    total = sum(r - l for l, r in segs)
    target = np.random.uniform(0, total)
    cumulative = 0
    for l, r in segs:
        cumulative += r - l
        if cumulative >= target:
            bp = r - (cumulative - target)
            return bp
    return segs[-1][1]

def split_at_breakpoint(segs, bp):
    """Split segments at breakpoint bp into left and right."""
    left, right = [], []
    for l, r in segs:
        if r <= bp:
            left.append((l, r))       # entirely left of breakpoint
        elif l >= bp:
            right.append((l, r))      # entirely right of breakpoint
        else:
            left.append((l, bp))      # straddles: split into two pieces
            right.append((bp, r))
    return left, right

# Simulate
events = coalescent_with_recombination_simple(n=5, L=1000, rho=10.0)
print(f"Total events: {len(events)}")
for t, etype, k in events[:10]:
    print(f"  t={t:.4f}: {etype:>6s}, lineages={k}")

This simple implementation works, but it is slow: computing the total segment length requires iterating over all segments every step. In Segments & the Fenwick Tree, we will replace this with a Fenwick tree – a clever indexing mechanism for fast event scheduling – that maintains the total in \(O(\log n)\) time.

Step 5: The Coalescent with Recombination in Detail

Let’s be more precise about the rates. In a population of effective size \(N_e\), with \(k\) lineages, each carrying some genomic segments:

Coalescence rate:

In the standard (Hudson) model, any pair of \(k\) lineages can coalesce:

\[\lambda_{\text{coal}} = \frac{\binom{k}{2}}{N_e} = \frac{k(k-1)}{2N_e}\]

In coalescent time units (measured in \(N_e\) generations), the \(N_e\) cancels and the rate is simply \(\binom{k}{2}\).

SMC and SMC’ approximations

In the Sequentially Markov Coalescent (SMC), not all \(\binom{k}{2}\) pairs can coalesce – only those whose ancestral segments overlap genomically. This reduces the rate but also the state space, making inference algorithms (like PSMC and SINGER) tractable.

  • SMC (McVean & Cardin, 2005): After a recombination, the new lineage can only re-coalesce with lineages that share a marginal tree at the breakpoint.

  • SMC’ (Marjoram & Wall, 2006): Relaxes SMC slightly to allow re-coalescence with lineages that have contiguous ancestral segments.

  • SMC(k): A parameterized family that interpolates between SMC and the full coalescent. msprime implements this using “hulls” – bounding boxes of each lineage’s ancestry.

Recombination rate:

A lineage carrying total ancestral material of length \(\ell\) (summed over all its segments) experiences recombination at rate:

\[\lambda_{\text{recomb, lineage}} = \frac{\ell}{L} \cdot \frac{\rho}{2}\]

where \(\rho = 4N_e r L\) is the population-scaled total recombination rate. In practice, with a position-dependent recombination rate map \(r(x)\), the recombination “mass” of a segment \([a, b)\) is:

\[m(a, b) = \int_a^b r(x) \, dx\]

and the total recombination rate is proportional to the total mass across all segments.

Calculus Aside – From rate function to mass

The integral \(m(a,b) = \int_a^b r(x)\,dx\) is simply the area under the recombination rate curve between positions \(a\) and \(b\). When the rate is constant (\(r(x) = r\)), this reduces to \(m(a,b) = r \cdot (b - a)\). When the rate varies (e.g., recombination hotspots), the integral accounts for the fact that some base pairs contribute more to the recombination probability than others. In Segments & the Fenwick Tree, we will store these masses in a Fenwick tree so that the total can be maintained efficiently as segments are created and destroyed.

The breakpoint is chosen uniformly over the total recombination mass (not uniformly over genomic position). This naturally concentrates breakpoints in recombination hotspots.

Now that we have the recombination machinery, there is one more ingredient needed for realistic simulations: the effect of changing population size on coalescence waiting times.

Step 6: The Coalescent Waiting Time with Growth

When the population size changes through time, the coalescence rate changes too. A key case is exponential growth:

\[N(t) = N_0 \cdot e^{-\alpha t}\]

where \(\alpha > 0\) is the growth rate and \(t\) is measured backwards (so the population was smaller in the past for \(\alpha > 0\)).

The coalescence rate with \(k\) lineages at time \(t\) is:

\[\lambda(t) = \frac{\binom{k}{2}}{N(t)} = \frac{\binom{k}{2}}{N_0 e^{-\alpha t}}\]

The waiting time is no longer a simple exponential. We need the cumulative hazard: the integral of the rate from the current time \(t_0\) to some future time \(t_0 + w\):

\[\Lambda(w) = \int_{t_0}^{t_0 + w} \frac{\binom{k}{2}}{N_0 e^{-\alpha s}} \, ds\]

The survival function is \(P(W > w) = \exp(-\Lambda(w))\). To sample from this distribution, we use the inversion method.

Probability Aside – The inversion method

To sample from a distribution with survival function \(S(w) = e^{-\Lambda(w)}\), we draw \(U \sim \text{Uniform}(0,1)\) and solve \(S(W) = U\), i.e., \(\Lambda(W) = -\ln(U)\). Since \(-\ln(U) \sim \text{Exp}(1)\), this is equivalent to drawing \(E \sim \text{Exp}(1)\) and solving \(\Lambda(W) = E\). The advantage is that we can solve for \(W\) analytically when \(\Lambda\) has a closed-form inverse.

Deriving the waiting time formula. Let \(c = \binom{k}{2}\). We draw \(U \sim \text{Exp}(2c)\) (the waiting time under constant size \(N_0 = 1\)). Then we need to solve:

\[U = \int_{t_0}^{t_0 + W} \frac{c}{N(s)} \, ds\]

For constant size: \(W = N_0 \cdot U / c \cdot 2 = N_0 \cdot U\), which gives \(\text{Exp}(c / N_0)\).

For exponential growth, using \(N(s) = N_0 e^{-\alpha(s - t_0)}\):

\[U = \frac{c}{N_0} \int_0^W e^{\alpha s} \, ds = \frac{c}{N_0 \alpha}(e^{\alpha W} - 1)\]

Calculus Aside – Evaluating the growth integral

The integral \(\int_0^W e^{\alpha s}\,ds\) has antiderivative \(\frac{1}{\alpha}e^{\alpha s}\). Evaluating:

\[\int_0^W e^{\alpha s}\,ds = \frac{1}{\alpha}\left[e^{\alpha W} - e^{0}\right] = \frac{e^{\alpha W} - 1}{\alpha}\]

This integral appears frequently in demographic models. When \(\alpha \to 0\), l’Hopital’s rule gives \(\frac{e^{\alpha W} - 1}{\alpha} \to W\), recovering the constant-size case.

Solving for \(W\):

\[e^{\alpha W} = 1 + \frac{\alpha N_0 U}{c}\]

Taking the logarithm:

\[W = \frac{1}{\alpha} \ln\left(1 + \alpha N_0 U \cdot \frac{1}{c}\right)\]

This is valid only if \(1 + \alpha N_0 U / c > 0\). When \(\alpha < 0\) (population was larger in the past), the argument can become zero or negative, meaning the population was so large that coalescence never occurs within the growth epoch – the simulation must wait for a demographic event that changes the growth rate.

import math

def coalescent_waiting_time_constant(k, N):
    """Waiting time for k lineages, constant population size N."""
    rate = k * (k - 1) / 2  # binom(k,2)
    u = np.random.exponential(1.0 / (2 * rate))
    return N * u  # scale from coalescent units to generations

def coalescent_waiting_time_growth(k, N0, alpha, t0):
    """Waiting time for k lineages, exponential growth.

    Parameters
    ----------
    k : int
        Number of lineages.
    N0 : float
        Current population size.
    alpha : float
        Growth rate (positive = population was smaller in the past).
    t0 : float
        Current time.

    Returns
    -------
    w : float
        Waiting time (can be inf if coalescence doesn't occur).
    """
    rate = k * (k - 1) / 2  # binom(k,2)
    u = np.random.exponential(1.0 / (2 * rate))  # draw from Exp(2c)

    if alpha == 0:
        return N0 * u  # constant-size case

    dt = 0  # already at t0
    # Apply the inversion formula derived above
    z = 1 + alpha * N0 * math.exp(-alpha * dt) * u
    if z <= 0:
        return np.inf  # coalescence doesn't happen in this epoch

    return math.log(z) / alpha

# Compare waiting times: constant vs growing population
N0, alpha = 10000, 0.01
k = 10
n_reps = 10000

const_times = [coalescent_waiting_time_constant(k, N0) for _ in range(n_reps)]
growth_times = [coalescent_waiting_time_growth(k, N0, alpha, 0)
                for _ in range(n_reps)]

print(f"Constant N={N0}: mean waiting time = {np.mean(const_times):.1f} gen")
print(f"Growth alpha={alpha}: mean waiting time = {np.mean(growth_times):.1f} gen")
print(f"(Growth concentrates coalescences in the recent past)")

The formula in msprime’s code

In the reference implementation (algorithms.py), the waiting time with growth is computed in Population._get_common_ancestor_waiting_time:

u = random.expovariate(2 * np)
if self.growth_rate == 0:
    ret = self.start_size * u
else:
    z = 1 + self.growth_rate * self.start_size
        * math.exp(-self.growth_rate * dt) * u
    if z > 0:
        ret = math.log(z) / self.growth_rate

This matches our derivation exactly.

With coalescence, recombination, and variable population size all accounted for, there is one more biological mechanism to add: gene conversion.

Step 7: Gene Conversion

Gene conversion is a recombination-like event with a twist: instead of exchanging everything to one side of a breakpoint, it copies a short tract of DNA (typically a few hundred base pairs) from one homolog to the other.

Going backwards in time, a gene conversion event produces two lineages:

  • One carrying a short segment (the converted tract)

  • One carrying everything except that tract

Before gene conversion (going backwards):

Lineage:  ==========================================
                        | gene conversion at position x
                        | with tract length l

After:
Lineage 1: ============     ========================
                       |   |
Lineage 2:             =====
                       x   x+l

The rate of gene conversion initiation is proportional to the total genomic mass of the lineage, just like recombination. The tract length is drawn from a geometric distribution (discrete genome) or exponential distribution (continuous genome):

\[\ell \sim \text{Geometric}(1/\bar{\ell}) \quad \text{or} \quad \ell \sim \text{Exponential}(\bar{\ell})\]

where \(\bar{\ell}\) is the mean tract length (typically ~300-500 bp in humans).

Additionally, gene conversion can occur to the left of the first ancestral segment, producing a lineage that extends the ancestry leftward:

Before:        ===== ======= ====
               |
Gene conversion starting to the left:

After:
Lineage 1:     ===== ======= ====
Lineage 2: ====
           |        |
           (new)    (original start)

This “left extension” has a separate rate proportional to the number of ancestors times the mean gene conversion rate times the tract length.

def gene_conversion_event(segments, gc_position, tract_length, L):
    """Simulate a gene conversion event.

    Gene conversion removes a short tract from the main lineage and
    places it on a new lineage. This is like recombination, but
    instead of splitting left/right, it punches a "hole" in the middle.

    Parameters
    ----------
    segments : list of (left, right)
        Segments of the lineage.
    gc_position : float
        Starting position of the gene conversion.
    tract_length : float
        Length of the converted tract.

    Returns
    -------
    main_segs : list of (left, right)
        Remaining segments of the main lineage.
    tract_segs : list of (left, right)
        Segments of the gene conversion tract lineage.
    """
    gc_left = gc_position
    gc_right = min(gc_position + tract_length, L)

    main_segs = []
    tract_segs = []

    for l, r in segments:
        if r <= gc_left or l >= gc_right:
            # Entirely outside tract: stays with main
            main_segs.append((l, r))
        elif l >= gc_left and r <= gc_right:
            # Entirely inside tract: goes to tract lineage
            tract_segs.append((l, r))
        elif l < gc_left and r > gc_right:
            # Straddles both sides: split into three pieces
            main_segs.append((l, gc_left))
            tract_segs.append((gc_left, gc_right))
            main_segs.append((gc_right, r))
        elif l < gc_left:
            # Overlaps left boundary only
            main_segs.append((l, gc_left))
            tract_segs.append((gc_left, r))
        else:
            # Overlaps right boundary only
            tract_segs.append((l, gc_right))
            main_segs.append((gc_right, r))

    return main_segs, tract_segs

# Example
segs = [(100, 500), (700, 1000)]
main, tract = gene_conversion_event(segs, gc_position=400, tract_length=350, L=1000)
print(f"Original: {segs}")
print(f"Main:     {main}")
print(f"Tract:    {tract}")

With all event types defined, let us now summarize the complete rate table.

Step 8: Putting It Together – Event Rates Summary

At any point in the simulation, with \(k\) lineages in a population of size \(N(t)\), the competing event rates are:

Event

Rate

Effect

Coalescence

\(\binom{k}{2} / N(t)\)

Two lineages merge (\(k \to k-1\))

Recombination

\(\sum_i m_i^{\text{recomb}}\)

One lineage splits (\(k \to k+1\))

Gene conversion (within)

\(\sum_i m_i^{\text{gc}}\)

One lineage splits (\(k \to k+1\))

Gene conversion (left)

\(k \cdot \bar{g} \cdot \bar{\ell}\)

One lineage splits (\(k \to k+1\))

Migration

\(k_j \cdot M_{j \to j'}\)

One lineage moves between populations

where \(m_i^{\text{recomb}}\) is the recombination mass of lineage \(i\) and \(m_i^{\text{gc}}\) is the gene conversion mass.

The simulation draws independent exponential waiting times for each event type, takes the minimum, advances time, and executes the winning event. This is the exponential race in full generality, and it drives the main loop of Hudson’s Algorithm.

Termination condition

The simulation terminates when every genomic position has found its MRCA. This is tracked by an overlap counter \(S[x]\) that records how many lineages carry ancestral material at position \(x\). When \(S[x] = 1\) for all \(x\), we’re done. (In practice, msprime uses S[x] = 0 after a final decrement, stored in an AVL tree keyed by genomic position.) We will build this overlap counter in Segments & the Fenwick Tree.

We now have the complete mathematical specification of the coalescent with recombination. Every rate, every distribution, every event type is defined. But simulating this efficiently requires clever data structures – which brings us to the next chapter.

Exercises

Exercise 1: Verify the coalescent

Simulate 10,000 coalescent trees with \(n = 10\). Compute the mean and variance of \(T_{\text{MRCA}}\). Compare to the theoretical values: \(E[T_{\text{MRCA}}] = 2(1 - 1/n)\), \(\text{Var}(T_{\text{MRCA}}) = \sum_{k=2}^n 4/[k(k-1)]^2\).

Exercise 2: The exponential race

Implement the exponential race for 3 competing events with rates 1, 2, 3. Verify empirically that event \(i\) wins with probability \(\lambda_i / \sum \lambda_j\), and the minimum has rate \(\sum \lambda_j\).

Exercise 3: Coalescent with recombination

Use coalescent_with_recombination_simple to simulate 1000 genealogies with \(n = 5\), \(L = 10^4\), and varying \(\rho\). Plot the number of recombination events as a function of \(\rho\). Compare to the theoretical expectation: \(E[\text{recomb events}] \approx \rho \cdot \sum_{k=2}^n 1/(k-1)\).

Exercise 4: Growth vs. constant size

Simulate coalescent trees under (a) constant \(N = 10{,}000\) and (b) exponential growth with \(N_0 = 10{,}000\), \(\alpha = 0.01\). Plot the distributions of \(T_{\text{MRCA}}\). How does growth affect the shape of the genealogy?

Next: Segments & the Fenwick Tree – the linked-list track that follows each lineage’s ancestral material, and the Fenwick tree – a clever indexing mechanism for fast event scheduling.

Solutions

Solution 1: Verify the coalescent

We simulate 10,000 coalescent trees using simulate_coalescent and compare the empirical mean and variance of \(T_{\text{MRCA}}\) to their theoretical values.

The theoretical variance uses the fact that the \(T_k\) are independent, so:

\[\text{Var}(T_{\text{MRCA}}) = \sum_{k=2}^{n} \text{Var}(T_k) = \sum_{k=2}^{n} \frac{1}{\binom{k}{2}^2} = \sum_{k=2}^{n} \frac{4}{[k(k-1)]^2}\]

since \(T_k \sim \text{Exp}(\binom{k}{2})\) and the variance of an \(\text{Exp}(\lambda)\) random variable is \(1/\lambda^2\).

import numpy as np

n = 10
n_replicates = 10000

# Theoretical values
E_tmrca = 2 * (1 - 1.0 / n)
Var_tmrca = sum(4.0 / (k * (k - 1))**2 for k in range(2, n + 1))

# Simulate
tmrca_values = []
for _ in range(n_replicates):
    results = simulate_coalescent(n, n_replicates=1)
    times, pairs = results[0]
    tmrca_values.append(times[-1])  # last coalescence time is the MRCA

tmrca_values = np.array(tmrca_values)

print(f"E[T_MRCA]:   simulated = {tmrca_values.mean():.4f}, "
      f"theory = {E_tmrca:.4f}")
print(f"Var[T_MRCA]: simulated = {tmrca_values.var():.4f}, "
      f"theory = {Var_tmrca:.4f}")

Solution 2: The exponential race

We run the race 100,000 times with rates \(\lambda_1 = 1, \lambda_2 = 2, \lambda_3 = 3\). The total rate is \(\Lambda = 6\), so the minimum has mean \(1/6\) and event \(i\) wins with probability \(\lambda_i / \Lambda\).

import numpy as np

rates = [1.0, 2.0, 3.0]
total_rate = sum(rates)
n_trials = 100000

wins = np.zeros(3)
min_times = []

for _ in range(n_trials):
    winner, t = exponential_race(*rates)
    wins[winner] += 1
    min_times.append(t)

min_times = np.array(min_times)

print("Win probabilities:")
for i in range(3):
    print(f"  Event {i}: observed = {wins[i]/n_trials:.4f}, "
          f"expected = {rates[i]/total_rate:.4f}")

print(f"\nMinimum time distribution:")
print(f"  Mean: observed = {min_times.mean():.4f}, "
      f"expected = {1.0/total_rate:.4f}")
print(f"  Var:  observed = {min_times.var():.4f}, "
      f"expected = {1.0/total_rate**2:.4f}")

Solution 3: Coalescent with recombination

The expected number of recombination events in a coalescent tree is approximately \(\frac{\rho}{2} \cdot E[L_{\text{total}}]\) where \(E[L_{\text{total}}] = 2 \sum_{k=1}^{n-1} 1/k\). We simulate for several values of \(\rho\) and compare.

import numpy as np

n = 5
L = 10000
n_replicates = 1000
rho_values = [0.1, 1.0, 5.0, 10.0, 20.0, 50.0]

# Theoretical expected total branch length (coalescent units)
E_total_length = 2 * sum(1.0 / k for k in range(1, n))

print(f"E[total branch length] = {E_total_length:.4f}")
print(f"{'rho':>6s}  {'E[recomb] (sim)':>16s}  {'E[recomb] (theory)':>18s}")

for rho in rho_values:
    recomb_counts = []
    for _ in range(n_replicates):
        events = coalescent_with_recombination_simple(n=n, L=L, rho=rho)
        n_recomb = sum(1 for _, etype, _ in events if etype == 'recomb')
        recomb_counts.append(n_recomb)

    mean_recomb = np.mean(recomb_counts)
    # Theory: E[recomb] = (rho/2) * E[L_total]
    expected_recomb = (rho / 2) * E_total_length
    print(f"{rho:6.1f}  {mean_recomb:16.2f}  {expected_recomb:18.2f}")

Solution 4: Growth vs. constant size

Exponential growth (\(\alpha > 0\)) makes the population smaller in the past, which increases the coalescence rate at earlier times. This compresses the genealogy: \(T_{\text{MRCA}}\) is shorter and the tree is more star-shaped (most coalescences happen at roughly the same time in the recent past).

import numpy as np

N0 = 10000
alpha = 0.01
n = 10
n_reps = 10000

# Constant size: standard coalescent, then scale to generations
tmrca_const = []
for _ in range(n_reps):
    results = simulate_coalescent(n, n_replicates=1)
    times, _ = results[0]
    tmrca_const.append(times[-1] * N0)  # scale to generations

# Growth: draw waiting times with the growth formula
tmrca_growth = []
for _ in range(n_reps):
    t = 0.0
    k = n
    while k > 1:
        w = coalescent_waiting_time_growth(k, N0, alpha, t)
        if w == np.inf:
            break
        t += w
        k -= 1
    tmrca_growth.append(t)

tmrca_const = np.array(tmrca_const)
tmrca_growth = np.array(tmrca_growth)

print(f"Constant N={N0}:")
print(f"  Mean T_MRCA = {tmrca_const.mean():.0f} generations")
print(f"  Std  T_MRCA = {tmrca_const.std():.0f} generations")
print(f"\nGrowth alpha={alpha}:")
print(f"  Mean T_MRCA = {tmrca_growth.mean():.0f} generations")
print(f"  Std  T_MRCA = {tmrca_growth.std():.0f} generations")
print(f"\nGrowth reduces T_MRCA because the ancestral population was smaller, "
      f"forcing lineages to coalesce faster.")