The Structured Coalescent Under Selection
The escapement of the mechanism: how lineages coalesce and migrate between two backgrounds.
With the allele frequency trajectory \(x(t)\) in hand (from the previous chapter), we can now build the core of discoal’s algorithm: the structured coalescent that runs backward in time through the sweep phase.
Under neutrality, the coalescent is simple: \(k\) lineages coalesce pairwise at rate \(\binom{k}{2}\) in a population of size \(2N\) (in units of \(2N\) generations). But during a selective sweep, the population is split into two subpopulations that change in size over time. The coalescent becomes structured and inhomogeneous (time-varying).
This chapter derives the event rates from first principles, implements the structured coalescent in Python, and shows how the sweep distorts the genealogy compared to the neutral case.
Note
Prerequisites. This chapter builds directly on:
The Allele Frequency Trajectory – the trajectory \(x(t)\) is the input
The Coalescent Process – coalescence rates for \(k\) lineages
Familiarity with continuous-time Markov chains and exponential waiting times
Step 1: Partitioning the Population
During the sweep, every chromosome in the population carries either the beneficial allele or the wild-type allele at the selected site. This divides the population of \(2N\) chromosomes into two classes:
where \(x(t)\) is the frequency of the beneficial allele at time \(t\) (forward time).
Going backward in time through the sweep (which is how the coalescent operates), \(x\) decreases from 1.0 at the moment of fixation to \(1/(2N)\) at the origin of the mutation. So background \(B\) shrinks and background \(b\) grows as we look further into the past.
At the moment of fixation (\(x = 1\)), all chromosomes are in \(B\) – every sampled lineage starts on the beneficial background. As we trace backward, recombination can move lineages from \(B\) to \(b\) (and vice versa), and coalescence can reduce the number of lineages within each background.
Time (backward) x(t) B size b size
─────────────── ──── ────── ──────
Just fixed 1.00 2N 0
... 0.80 1.6N 0.4N
... 0.50 N N
... 0.20 0.4N 1.6N
... 0.01 0.02N 1.98N
Origin 1/2N 1 2N - 1
The B background shrinks to a single chromosome: the original mutant.
Coalescence within B accelerates dramatically as its size drops.
Step 2: Coalescence Rates Within Each Background
Within each background, coalescence happens just as in the standard coalescent, but with the effective population size equal to the background’s size. If there are \(n_B\) lineages in background \(B\) and \(n_b\) lineages in background \(b\), the pairwise coalescence rates are:
These are the standard coalescent rates \(\binom{k}{2}/(2N_{\text{eff}})\), with \(N_{\text{eff}}\) replaced by the background size.
The crucial consequence: as \(x(t) \to 0\) (going backward), coalescence in \(B\) becomes infinitely fast. This is the bottleneck. When the beneficial allele is at frequency 0.01, there are only \(0.01 \times 2N = 200\) beneficial chromosomes (for \(N = 10{,}000\)). Two lineages in this tiny pool coalesce 100 times faster than they would in the full population. This rapid coalescence is what destroys diversity near the selected site.
Closing a confusion gap – Coalescence between backgrounds?
Lineages in different backgrounds (\(B\) vs \(b\)) cannot coalesce directly. A \(B\) lineage and a \(b\) lineage have, by definition, different alleles at the selected site, so they cannot share a parent in the previous generation. Coalescence only occurs within a background.
However, a lineage can move between backgrounds via recombination. Once two lineages are in the same background, they can then coalesce.
def coalescence_rate(n_lineages, background_size):
"""Coalescence rate for n_lineages in a background of given size.
Parameters
----------
n_lineages : int
Number of lineages in this background.
background_size : float
Effective number of haploid chromosomes in this background.
Returns
-------
rate : float
Total pairwise coalescence rate (per generation).
"""
if n_lineages < 2 or background_size <= 0:
return 0.0
return n_lineages * (n_lineages - 1) / (2.0 * background_size)
# Example: 5 lineages in a B background of size 200 (x = 0.01, N = 10000)
rate_B = coalescence_rate(5, 200)
rate_neutral = coalescence_rate(5, 20000) # same lineages, full population
print(f"Coalescence rate in B (x=0.01): {rate_B:.4f} per generation")
print(f"Neutral coalescence rate: {rate_neutral:.6f} per generation")
print(f"Ratio (speed-up): {rate_B / rate_neutral:.0f}x")
# Output:
# Coalescence rate in B (x=0.01): 0.0500 per generation
# Neutral coalescence rate: 0.000500 per generation
# Ratio (speed-up): 100x
Step 3: Recombination as Migration
Recombination between the neutral locus and the selected site can transfer a lineage from one background to the other. Think of it this way: going backward in time, a lineage at the neutral locus is sitting on some chromosome. If a recombination event occurs between the neutral and selected sites, the neutral allele’s ancestry switches to a different chromosome – one that was randomly chosen from the population.
If the lineage was on a \(B\) chromosome and the recombinant chromosome is from background \(b\) (probability \(1 - x(t)\)), the lineage “migrates” from \(B\) to \(b\). Conversely, a \(b\) lineage can migrate to \(B\) with probability \(x(t)\).
The migration rates per lineage per generation are:
where \(r\) is the recombination rate between the neutral locus and the selected site. The total migration rates for all lineages are:
Why does recombination act as migration?
In the standard coalescent without selection, recombination splits a lineage into two pieces that independently trace their ancestry. But in the structured coalescent with selection, we are tracking which background each lineage is on. A recombination event does not split the lineage – it potentially moves it to the other background.
This is because we are simulating a single neutral locus linked to a selected site. When recombination occurs between them, the neutral locus’s ancestry continues on whatever chromosome the recombinant picked up – which could be from either background.
Compare this to msprime’s standard ARG, where recombination genuinely splits ancestry. In discoal’s structured coalescent, recombination at the selected site acts as a population transfer.
The key parameter that governs escape from the sweep is the ratio \(r/s\):
def migration_rates(n_B, n_b, r, x):
"""Compute migration rates between backgrounds due to recombination.
Parameters
----------
n_B : int
Number of lineages in the beneficial background.
n_b : int
Number of lineages in the wild-type background.
r : float
Recombination rate between neutral locus and selected site.
x : float
Current frequency of the beneficial allele.
Returns
-------
m_B_to_b : float
Total rate of migration from B to b (per generation).
m_b_to_B : float
Total rate of migration from b to B (per generation).
"""
m_B_to_b = n_B * r * (1.0 - x)
m_b_to_B = n_b * r * x
return m_B_to_b, m_b_to_B
Step 4: The Complete Event Rates
At any moment during the sweep, four types of events can occur:
Event |
Rate (per generation) |
Effect |
|---|---|---|
Coalescence in \(B\) |
\(\frac{n_B(n_B-1)}{2 \cdot 2Nx}\) |
Merge two \(B\) lineages |
Coalescence in \(b\) |
\(\frac{n_b(n_b-1)}{2 \cdot 2N(1-x)}\) |
Merge two \(b\) lineages |
Migration \(B \to b\) |
\(n_B \cdot r \cdot (1-x)\) |
Move one lineage from \(B\) to \(b\) |
Migration \(b \to B\) |
\(n_b \cdot r \cdot x\) |
Move one lineage from \(b\) to \(B\) |
The total event rate is the sum of all four:
Given that an event occurs, the probability of each type is proportional to its rate. The time to the next event is exponentially distributed with rate \(\Lambda\).
Probability Aside – Competing exponentials
This is the same “racing exponentials” framework used in Hudson’s algorithm (see Hudson’s Algorithm). When multiple types of events each occur at their own rate, the time to the first event is exponential with rate equal to the sum of all rates. Which event actually occurs is drawn categorically, with probability proportional to each rate.
Formally, if \(T_i \sim \text{Exp}(\lambda_i)\) independently, then \(\min(T_1, \ldots, T_k) \sim \text{Exp}(\lambda_1 + \cdots + \lambda_k)\) and \(P(\text{type } i \text{ occurs}) = \lambda_i / \sum_j \lambda_j\).
Step 5: The Inhomogeneous Process
There is one complication: the rates change over time because \(x(t)\) changes. This makes the process inhomogeneous: we cannot simply draw one exponential waiting time and fast-forward.
discoal handles this by discretizing the trajectory into small time steps (one generation each, or coarser intervals). Within each step, the rates are treated as constant (using the \(x\) value at that step). Events are tested within each step, and if the exponential waiting time exceeds the step length, we advance to the next step with the residual waiting time adjusted.
def structured_coalescent_sweep(trajectory, n_sample, r_site, N, rng=None):
"""Simulate the structured coalescent through a selective sweep.
Runs backward through the trajectory, simulating coalescence and
migration events. Returns the coalescence times.
Parameters
----------
trajectory : ndarray
Allele frequency x(t) at each generation, from origin to fixation.
We process this backward (from fixation to origin).
n_sample : int
Number of sampled lineages.
r_site : float
Recombination rate between the neutral locus and the selected site
(per generation).
N : int
Diploid effective population size.
rng : np.random.Generator, optional
Random number generator.
Returns
-------
coal_times : list of float
Times (in generations backward from fixation) of coalescence events.
n_remaining_B : int
Number of lineages remaining in B at the end of the sweep.
n_remaining_b : int
Number of lineages remaining in b at the end of the sweep.
"""
if rng is None:
rng = np.random.default_rng(42)
two_N = 2 * N
# All lineages start in B (the sweep just fixed)
n_B = n_sample
n_b = 0
coal_times = []
residual = rng.exponential(1.0) # time until next event (in rate-units)
# Process trajectory backward: from fixation (end) to origin (start)
T_sweep = len(trajectory)
for step in range(T_sweep - 1, -1, -1):
x = trajectory[step]
if n_B + n_b <= 1:
break # MRCA found
# Compute rates for this time step
bg_B = two_N * x
bg_b = two_N * (1.0 - x)
rate_coal_B = (n_B * (n_B - 1) / (2.0 * bg_B)) if (n_B >= 2 and bg_B > 0) else 0.0
rate_coal_b = (n_b * (n_b - 1) / (2.0 * bg_b)) if (n_b >= 2 and bg_b > 0) else 0.0
rate_mig_Bb = n_B * r_site * (1.0 - x) if n_B > 0 else 0.0
rate_mig_bB = n_b * r_site * x if n_b > 0 else 0.0
total_rate = rate_coal_B + rate_coal_b + rate_mig_Bb + rate_mig_bB
if total_rate == 0:
continue
# Can an event happen within this generation?
# The residual is in units of "rate * time". One generation = total_rate.
while residual < total_rate:
# An event happens within this step
backward_gen = T_sweep - step
coal_times_entry = None
# Determine which event
u = rng.random() * total_rate
if u < rate_coal_B:
# Coalescence in B
n_B -= 1
coal_times_entry = backward_gen
elif u < rate_coal_B + rate_coal_b:
# Coalescence in b
n_b -= 1
coal_times_entry = backward_gen
elif u < rate_coal_B + rate_coal_b + rate_mig_Bb:
# Migration B -> b
n_B -= 1
n_b += 1
else:
# Migration b -> B
n_b -= 1
n_B += 1
if coal_times_entry is not None:
coal_times.append(coal_times_entry)
if n_B + n_b <= 1:
break
# Draw next event residual
residual = rng.exponential(1.0)
# Subtract this step's contribution from the residual
residual -= total_rate
# At the origin of the sweep, all B lineages coalesce
# (they all trace to the single original mutant)
if n_B > 1:
for _ in range(n_B - 1):
coal_times.append(T_sweep)
n_B = 1
return coal_times, n_B, n_b
Step 6: The Bottleneck Effect
The most important consequence of the structured coalescent is the bottleneck in background \(B\). Let us compute how severe it is.
At the moment the beneficial allele was at frequency \(x\), the effective size of background \(B\) was \(2N \cdot x\). The allele spends most of its time near the boundary frequencies (where \(x\) is close to 0 or 1), because the logistic growth is slowest there.
The expected time that the allele spends between frequency \(x\) and \(x + dx\) (during the sweep) is approximately:
The “effective population size” experienced by \(B\) lineages during this interval is \(2Nx\). So the total coalescent opportunity (integrated inverse effective size) for the \(B\) background over the sweep is:
This integral diverges at \(x = 0\), reflecting the fact that when the allele is at very low frequency, the \(B\) background is tiny and coalescence is nearly instantaneous. In practice, the integral is bounded by the initial frequency \(1/(2N)\):
So the total coalescent opportunity is approximately:
For strong selection (\(2Ns \gg 1\)), this simplifies to \(\approx 1/s\). Compare this to the neutral coalescent, where two lineages coalesce in expected time \(2N\) generations (\(= 1\) coalescent unit). The ratio is:
For \(\alpha = 1000\), the bottleneck compresses the expected coalescence time by a factor of 1000. This is why strong sweeps obliterate diversity.
def demonstrate_bottleneck(N, alpha, n_sample=20, n_reps=500, seed=42):
"""Compare coalescence times under a sweep vs. neutrality.
Parameters
----------
N : int
Diploid population size.
alpha : float
Scaled selection coefficient 2Ns.
n_sample : int
Number of sampled lineages.
n_reps : int
Number of simulation replicates.
seed : int
Random seed.
"""
rng = np.random.default_rng(seed)
s = alpha / (2 * N)
# Simulate under sweep: coalescence times within the sweep
sweep_mean_times = []
for _ in range(n_reps):
traj = deterministic_trajectory(s, N)
# Use r_site = 0 (perfectly linked) to see pure bottleneck effect
coal_times, _, _ = structured_coalescent_sweep(
traj, n_sample, r_site=0.0, N=N, rng=rng
)
if coal_times:
sweep_mean_times.append(np.mean(coal_times))
# Neutral comparison: expected total coalescence time for k lineages
# is sum of 2N/choose(k,2) for k from n to 2
neutral_expected = sum(2*N / (k*(k-1)/2) for k in range(n_sample, 1, -1))
print(f"alpha = {alpha}, s = {s:.5f}, N = {N}")
print(f"Sweep: mean coalescence time = "
f"{np.mean(sweep_mean_times):.0f} generations")
print(f"Neutral: expected total tree height = "
f"{neutral_expected:.0f} generations")
print(f"Compression factor: "
f"{neutral_expected / np.mean(sweep_mean_times):.1f}x")
# demonstrate_bottleneck(10_000, 1000)
Step 7: The Escape Probability
Not all lineages are trapped in the bottleneck. If the recombination rate between the neutral locus and the selected site is high enough, a lineage can escape from background \(B\) to background \(b\) before the bottleneck crushes it.
The probability that a lineage escapes the sweep (recombines off \(B\) before being forced to coalesce) depends on the ratio \(r/s\). Intuitively:
During one generation of the sweep, a lineage recombines with probability \(\sim r\).
The sweep takes \(\sim 1/s\) coalescent time units.
So the expected number of recombination events per lineage during the sweep is \(\sim r/s\) (in appropriate units, this is \(\rho / \alpha\) per site).
The probability that a lineage does not escape (stays in \(B\) through the entire sweep) is approximately \(e^{-r/s}\), and the probability that it does escape is \(1 - e^{-r/s}\).
More precisely, for a hard sweep, the probability that a lineage at recombination distance \(r\) from the selected site is not affected by the sweep is:
This means:
At \(r = 0\) (the selected site itself): no escape. All lineages coalesce in the bottleneck.
At \(r \gg s/\ln(2N)\): nearly all lineages escape. The sweep has no effect.
At \(r \sim s/\ln(2N)\): the transition zone.
def escape_probability(r_site, s, N):
"""Approximate probability that a lineage escapes a hard sweep.
Parameters
----------
r_site : float
Recombination rate between neutral locus and selected site.
s : float
Selection coefficient.
N : int
Diploid population size.
Returns
-------
p_escape : float
Probability of escaping the sweep via recombination.
"""
return 1.0 - np.exp(-r_site / s * 2 * np.log(2 * N))
# How far does the sweep's effect reach?
N = 10_000
s = 0.01 # 2Ns = 200
r_values = np.logspace(-7, -2, 100)
p_esc = [escape_probability(r, s, N) for r in r_values]
print("Recombination distance at 50% escape probability:")
r_half = s / (2 * np.log(2 * N))
print(f" r_50 = s / (2*ln(2N)) = {r_half:.2e}")
print(f" Physical distance (at r=1e-8/bp): {r_half / 1e-8:.0f} bp")
Step 8: Putting It Together – The Full Sweep Simulation
Now we can assemble the complete structured coalescent for a single neutral locus linked to a selected site:
def simulate_sweep_genealogy(n_sample, N, s, r_site, tau_gen=0,
trajectory_mode='deterministic', rng=None):
"""Simulate a genealogy at a neutral locus linked to a selective sweep.
The full algorithm:
1. Generate the allele frequency trajectory.
2. Run neutral coalescent from present to tau (time since fixation).
3. Run structured coalescent through the sweep.
4. Run neutral coalescent for remaining lineages before the sweep.
Parameters
----------
n_sample : int
Number of sampled lineages.
N : int
Diploid effective population size.
s : float
Selection coefficient.
r_site : float
Recombination rate between neutral locus and selected site.
tau_gen : int
Generations since the sweep completed (0 = just fixed).
trajectory_mode : str
'deterministic' or 'stochastic'.
rng : np.random.Generator, optional
Random number generator.
Returns
-------
total_coal_times : list of float
All coalescence times (in generations backward from present).
"""
if rng is None:
rng = np.random.default_rng()
two_N = 2 * N
n = n_sample
total_coal_times = []
# Phase 1: Neutral coalescent from present to tau
t_current = 0
while n > 1 and t_current < tau_gen:
# Rate of coalescence: choose(n,2) / (2N)
rate = n * (n - 1) / (2.0 * two_N)
wait = rng.exponential(1.0 / rate)
if t_current + wait > tau_gen:
break # No coalescence before the sweep phase starts
t_current += wait
n -= 1
total_coal_times.append(t_current)
if n <= 1:
return total_coal_times
# Phase 2: Generate trajectory and run structured coalescent
if trajectory_mode == 'deterministic':
traj = deterministic_trajectory(s, N)
else:
traj = stochastic_trajectory(s, N, rng=rng)
coal_times_sweep, n_B, n_b = structured_coalescent_sweep(
traj, n, r_site, N, rng=rng
)
# Offset sweep coalescence times by tau
for ct in coal_times_sweep:
total_coal_times.append(tau_gen + ct)
# Phase 3: Neutral coalescent for remaining lineages
n_remaining = n_B + n_b
t_current = tau_gen + len(traj)
while n_remaining > 1:
rate = n_remaining * (n_remaining - 1) / (2.0 * two_N)
wait = rng.exponential(1.0 / rate)
t_current += wait
n_remaining -= 1
total_coal_times.append(t_current)
return total_coal_times
# Example: compare genealogies with and without a sweep
N = 10_000
s = 0.01
n = 10
rng = np.random.default_rng(42)
# Neutral genealogy (no sweep)
neutral_times = []
n_temp = n
t = 0
while n_temp > 1:
rate = n_temp * (n_temp - 1) / (2.0 * 2 * N)
t += rng.exponential(1.0 / rate)
n_temp -= 1
neutral_times.append(t)
# Sweep genealogy (r = 0, perfectly linked: maximum effect)
sweep_times = simulate_sweep_genealogy(n, N, s, r_site=0.0, rng=rng)
print(f"Neutral TMRCA: {max(neutral_times):,.0f} generations")
print(f"Sweep TMRCA: {max(sweep_times):,.0f} generations")
print(f"Neutral total: {sum(neutral_times):,.0f} gen")
print(f"Sweep total: {sum(sweep_times):,.0f} gen")
The structured coalescent is the mathematical engine that translates the allele frequency trajectory into a distorted genealogy. In the next chapter, we explore the different types of sweeps – hard, soft, and partial – and see how each produces distinctive genealogical signatures.