The Gamma Approximation

The escapement: the mathematical tick that makes the whole mechanism possible.

This chapter derives the two update rules at the heart of Gamma-SMC: the emission step (exact, via Poisson-gamma conjugacy) and the transition step (approximate, via gamma projection). Together, they allow the posterior TMRCA distribution to be tracked as a gamma distribution throughout the entire forward pass.

Biology Aside – What TMRCA tells us about population history

The time to most recent common ancestor (TMRCA) at a genomic position is the number of generations since the two haplotypes last shared a common ancestor at that site. It is a direct window into population size history: in a small population, lineages coalesce quickly, so TMRCAs are short; in a large population, coalescence is slow and TMRCAs are long. A population bottleneck produces a characteristic signature – a band of similar TMRCAs across many positions – because the small population forced most lineages to coalesce during that period. By estimating the TMRCA distribution at each position along the genome, Gamma-SMC effectively reads the history of population size changes encoded in the pattern of genetic variation between two chromosomes.

In a mechanical watch, the escapement is the component that converts the continuous energy of the mainspring into discrete, regular ticks. In Gamma-SMC, the gamma approximation plays an analogous role: it converts the continuous Bayesian updating process into a sequence of simple parameter updates – \((\alpha, \beta) \to (\alpha', \beta')\) – that can be evaluated in constant time.

The Emission Step: Exact Conjugacy

Suppose the forward density at position \(i\) is approximated as:

\[X_i \mid Y_{1:i-1} \sim \text{Gamma}(\alpha, \beta)\]

We observe \(Y_i\) at position \(i\). Using Bayes’ rule and the Markov property:

\[P(X_i = t \mid Y_{1:i}) \propto P(Y_i \mid X_i = t) \cdot P(X_i = t \mid Y_{1:i-1})\]

The emission model is Poisson with rate \(\theta t\):

\[P(Y_i = y \mid X_i = t) = \frac{(\theta t)^y \cdot e^{-\theta t}}{y!}\]

Biology Aside – Why Poisson emissions?

At each genomic position, we observe either a heterozygous site (\(Y_i = 1\), meaning the two haplotypes differ) or a homozygous site (\(Y_i = 0\), meaning they are identical). Mutations accumulate on the two lineages at a constant rate \(\mu\) per base per generation, so the number of differences between the two haplotypes at a given position is approximately Poisson-distributed with a rate proportional to the TMRCA. Intuitively: the longer two lineages have been separated, the more mutations have accumulated between them, and the more likely they are to differ at any given site. This is the molecular clock principle applied at the single-nucleotide level.

The prior is \(\text{Gamma}(\alpha, \beta)\):

\[P(X_i = t \mid Y_{1:i-1}) = \frac{\beta^\alpha}{\Gamma(\alpha)} t^{\alpha - 1} e^{-\beta t}\]

Multiplying these together:

\[\begin{split}P(X_i = t \mid Y_{1:i}) &\propto (\theta t)^y \cdot e^{-\theta t} \cdot t^{\alpha - 1} \cdot e^{-\beta t} \\ &\propto t^{\alpha + y - 1} \cdot e^{-(\beta + \theta) t}\end{split}\]

This is the kernel of a \(\text{Gamma}(\alpha + y, \beta + \theta)\) distribution. Since the posterior must normalize to a proper density:

\[X_i \mid Y_{1:i} \sim \text{Gamma}(\alpha + Y_i, \; \beta + \theta)\]

This gives us the three cases:

| Observation | Posterior | Interpretation |
|---|---|---|
| \(Y_i = 0\) (hom) | \(\text{Gamma}(\alpha, \beta + \theta)\) | Rate increases (TMRCA shifts shorter). Seeing no mutation is evidence of a more recent ancestor. |
| \(Y_i = 1\) (het) | \(\text{Gamma}(\alpha + 1, \beta + \theta)\) | Shape and rate both increase. Seeing a mutation is evidence of a more distant ancestor (shape goes up) while also incorporating the observation (rate goes up). |
| \(Y_i = -1\) (missing) | \(\text{Gamma}(\alpha, \beta)\) | No change. Missing data carries no information. |

import numpy as np

def gamma_emission_update(alpha, beta, y, theta):
    """Apply the Poisson-gamma conjugate emission update.

    Parameters
    ----------
    alpha : float
        Current shape parameter.
    beta : float
        Current rate parameter.
    y : int
        Observation: 1 (het), 0 (hom), or -1 (missing).
    theta : float
        Scaled mutation rate.

    Returns
    -------
    alpha_new, beta_new : float
        Updated gamma parameters.
    """
    if y == -1:  # missing
        return alpha, beta
    return alpha + y, beta + theta

# Verify: starting from the prior Gamma(1, 1)
alpha, beta = 1.0, 1.0
theta = 0.001

# After a heterozygous observation
a_het, b_het = gamma_emission_update(alpha, beta, 1, theta)
print(f"After het: Gamma({a_het}, {b_het:.4f})")
print(f"  Mean TMRCA = {a_het/b_het:.4f} "
      f"(shifted up -- mutation is evidence of deeper coalescence)")

# After a homozygous observation
a_hom, b_hom = gamma_emission_update(alpha, beta, 0, theta)
print(f"After hom: Gamma({a_hom}, {b_hom:.4f})")
print(f"  Mean TMRCA = {a_hom/b_hom:.4f} "
      f"(shifted down -- no mutation favours recent coalescence)")

# After 100 hom and 1 het
a, b = 1.0, 1.0
for _ in range(100):
    a, b = gamma_emission_update(a, b, 0, theta)
a, b = gamma_emission_update(a, b, 1, theta)
print(f"\nAfter 100 hom + 1 het: Gamma({a:.1f}, {b:.4f}), "
      f"mean = {a/b:.4f}")

Why is this called “conjugacy”?

A prior distribution is conjugate to a likelihood if the posterior has the same functional form as the prior. Here, the gamma prior is conjugate to the Poisson likelihood: regardless of the observation, the posterior is always gamma. This means we never need to leave the gamma family during the emission step – the posterior is always characterized by just two numbers \((\alpha, \beta)\).

Conjugacy is one of the most powerful tools in Bayesian statistics. It turns Bayesian updating from an integration problem (which is generally intractable) into a parameter update (which is \(O(1)\)). The Poisson-gamma pair is one of the classical conjugate families, alongside normal-normal, beta-binomial, and Dirichlet-multinomial.
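
This can be sanity-checked numerically: multiply the gamma prior by the Poisson likelihood on a dense grid, normalize, and compare against the closed-form posterior. A minimal sketch (the parameter values here are arbitrary, chosen only for illustration):

import numpy as np
from scipy.stats import gamma as gamma_dist, poisson

alpha, beta, theta, y = 2.0, 3.0, 0.001, 1

# Brute-force Bayes on a grid: prior times likelihood, then normalize
t = np.linspace(1e-6, 20.0, 20_000)
prior = gamma_dist.pdf(t, a=alpha, scale=1.0 / beta)
likelihood = poisson.pmf(y, theta * t)
unnorm = prior * likelihood
posterior_grid = unnorm / (unnorm.sum() * (t[1] - t[0]))

# Closed-form conjugate posterior: Gamma(alpha + y, beta + theta)
posterior_exact = gamma_dist.pdf(t, a=alpha + y, scale=1.0 / (beta + theta))

print(f"Max |grid - exact|: {np.max(np.abs(posterior_grid - posterior_exact)):.2e}")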

Intuition: what do \(\alpha\) and \(\beta\) track?

Think of \(\alpha\) as a “mutation count” and \(\beta\) as a “rate accumulator.” At the start (the exponential prior \(\text{Exp}(1) = \text{Gamma}(1, 1)\)), we have seen zero mutations and accumulated one unit of rate. Each homozygous site adds \(\theta\) to the rate (more evidence of short TMRCA) without adding to the count. Each heterozygous site adds \(\theta\) to the rate and 1 to the count (evidence of a mutation, hence longer TMRCA). The posterior mean \(\alpha / \beta\) is literally “mutation count divided by rate” – an estimate of the TMRCA.

The Transition Step: Gamma Projection

The emission step is exact. The transition step is where approximation enters.

After incorporating the observation at position \(i\), we have \(X_i \mid Y_{1:i} \sim \text{Gamma}(\alpha, \beta)\). To advance to position \(i + 1\), we must compute:

\[p_{\alpha,\beta}(t) := \int_0^\infty p(t \mid s) \cdot f_{\alpha,\beta}(s) \, ds\]

where \(p(t \mid s)\) is the SMC’ transition density (including the no-recombination case). This compound distribution \(p_{\alpha,\beta}(t)\) is the predictive distribution of the TMRCA at the next position, given our current gamma belief.

Biology Aside – What recombination does to TMRCA

Recombination breaks chromosomes and re-joins segments from different parental chromosomes. As a result, the genealogy of two haplotypes changes from position to position along the genome: at one site they may share an ancestor 10,000 generations ago, but after a recombination event a few bases away, they may share a different ancestor 50,000 generations ago. The transition step models this: it asks “given our current belief about the TMRCA, what is the distribution of the new TMRCA at the next position, accounting for the possibility of recombination?” The more recombination there is (higher \(\rho\)), the more the TMRCA can change between adjacent positions.

The problem: \(p_{\alpha,\beta}(t)\) is not exactly a gamma distribution. The integral mixes over the transition kernel, which introduces skewness and other deviations from the gamma family.

The solution: Empirically, \(p_{\alpha,\beta}(t)\) is very close to a gamma distribution (Schweiger & Durbin verify this with simulation studies; see Appendix A of the supplement). So we approximate it by projecting back onto the gamma family:

\[p_{\alpha,\beta}(t) \approx f_{\alpha',\beta'}(t) = \text{Gamma}(\alpha', \beta')\]

The question becomes: how do we find the best \((\alpha', \beta')\)?

The PDE Approach

For small recombination rate \(\rho\), the post-transition distribution \(p_{\alpha,\beta}\) is a small perturbation of the prior \(f_{\alpha,\beta}\). We can write:

\[\begin{split}\alpha' &= \alpha + u \cdot \rho \\ \beta' &= \beta + v \cdot \rho\end{split}\]

where \(u\) and \(v\) are the “flow” – the direction and magnitude of the gamma parameter change caused by one recombination step.
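
In code, one transition step is then a two-line parameter update. A sketch; the flow values in the example are placeholders, since the real \((u, v)\) come from the precomputed flow field discussed below:

def gamma_transition_update(alpha, beta, u, v, rho):
    """Apply one first-order transition step: move (alpha, beta)
    along the flow (u, v), scaled by the recombination rate rho."""
    return alpha + u * rho, beta + v * rho

# Example with placeholder flow values -- in practice (u, v) are
# looked up from the precomputed flow field at the current (alpha, beta)
alpha_new, beta_new = gamma_transition_update(2.0, 3.0, u=-0.5, v=-1.2, rho=1e-4)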

To find \(u\) and \(v\), we express the perturbation \((p_{\alpha,\beta} - f_{\alpha,\beta}) / \rho\) as a linear combination of the partial derivatives of \(f_{\alpha,\beta}\):

\[\frac{p_{\alpha,\beta}(x) - f_{\alpha,\beta}(x)}{\rho} \approx u \cdot \frac{\partial f_{\alpha,\beta}(x)}{\partial \alpha} + v \cdot \frac{\partial f_{\alpha,\beta}(x)}{\partial \beta}\]

The partial derivatives of the gamma PDF are derived by differentiating \(f_{\alpha,\beta}(x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}\). It is easiest to work with the logarithm first:

\[\log f_{\alpha,\beta}(x) = \alpha \log\beta - \log\Gamma(\alpha) + (\alpha - 1)\log x - \beta x\]

Derivative with respect to \(\alpha\):

\[\frac{\partial}{\partial \alpha} \log f = \log\beta - \psi(\alpha) + \log x\]

where \(\psi(\alpha) = \frac{d}{d\alpha} \ln \Gamma(\alpha)\) is the digamma function – the logarithmic derivative of the gamma function. Since \(\frac{\partial f}{\partial \alpha} = f \cdot \frac{\partial \log f}{\partial \alpha}\):

\[\frac{\partial f_{\alpha,\beta}(x)}{\partial \alpha} = f_{\alpha,\beta}(x) \cdot \left(\log\beta - \psi(\alpha) + \log x\right)\]

Derivative with respect to \(\beta\):

\[\frac{\partial}{\partial \beta} \log f = \frac{\alpha}{\beta} - x\]

since \(\frac{\partial}{\partial \beta}(\alpha\log\beta) = \alpha/\beta\) and \(\frac{\partial}{\partial \beta}(-\beta x) = -x\). Therefore:

\[\frac{\partial f_{\alpha,\beta}(x)}{\partial \beta} = f_{\alpha,\beta}(x) \cdot \left(\frac{\alpha}{\beta} - x\right)\]

from scipy.special import digamma, gammaln
import numpy as np

def gamma_pdf_partials(x, alpha, beta):
    """Compute partial derivatives of the gamma PDF analytically.

    Returns df/dalpha and df/dbeta at each x.
    """
    log_f = (alpha * np.log(beta) - gammaln(alpha)
             + (alpha - 1) * np.log(x) - beta * x)
    f = np.exp(log_f)
    df_dalpha = f * (np.log(beta) - digamma(alpha) + np.log(x))
    df_dbeta = f * (alpha / beta - x)
    return df_dalpha, df_dbeta

# Verify against numerical finite differences
alpha, beta = 3.0, 2.0
x = np.linspace(0.01, 5.0, 100)
eps = 1e-7

df_da_analytic, df_db_analytic = gamma_pdf_partials(x, alpha, beta)

# Numerical d/dalpha
f_plus = np.exp((alpha + eps) * np.log(beta) - gammaln(alpha + eps)
                + (alpha + eps - 1) * np.log(x) - beta * x)
f_minus = np.exp((alpha - eps) * np.log(beta) - gammaln(alpha - eps)
                 + (alpha - eps - 1) * np.log(x) - beta * x)
df_da_numerical = (f_plus - f_minus) / (2 * eps)

print(f"Max error in df/dalpha: {np.max(np.abs(df_da_analytic - df_da_numerical)):.2e}")

We solve for \(u, v\) by least squares over a grid of 2,000 values of \(x\) covering the main support of \(f_{\alpha,\beta}\):

\[\arg\min_{u,v} \sum_{i=1}^{2000} \left( u \cdot \frac{\partial f_{\alpha,\beta}(x_i)}{\partial \alpha} + v \cdot \frac{\partial f_{\alpha,\beta}(x_i)}{\partial \beta} - \frac{p_{\alpha,\beta}(x_i) - f_{\alpha,\beta}(x_i)}{\rho} \right)^2\]
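
This is an ordinary linear least-squares problem in two unknowns, so it can be solved directly with np.linalg.lstsq. A sketch using the gamma_pdf_partials function from above; solve_flow is a hypothetical helper name, and the target vector \((p_{\alpha,\beta}(x_i) - f_{\alpha,\beta}(x_i))/\rho\) is assumed to be precomputed:

def solve_flow(x_grid, alpha, beta, target):
    """Least-squares fit of the flow (u, v) at one (alpha, beta) point.

    x_grid : x values covering the main support of f_{alpha,beta}
    target : the perturbation (p - f) / rho evaluated at x_grid
    """
    df_da, df_db = gamma_pdf_partials(x_grid, alpha, beta)
    A = np.column_stack([df_da, df_db])  # one column per partial derivative
    (u, v), *_ = np.linalg.lstsq(A, target, rcond=None)
    return u, v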

Why least squares and not moment matching?

Moment matching (equating the mean and variance of \(p_{\alpha,\beta}\) to those of \(\text{Gamma}(\alpha', \beta')\)) is another natural approach. The PDE/least-squares method is preferred because it minimizes the \(L^2\) distance between the true perturbation and its gamma approximation over the entire support of the distribution. This gives a better fit in the tails, which matters because the tails of the TMRCA distribution carry information about extreme coalescence times.

The Closed-Form Perturbation

The key quantity \((p_{\alpha,\beta}(t) - f_{\alpha,\beta}(t))/\rho\) can be evaluated analytically. Starting from the SMC’ transition density and integrating against the gamma prior, one obtains (see the full derivation in Appendix E of the supplement):

\[\begin{split}\frac{p_{\alpha,\beta}(t) - f_{\alpha,\beta}(t)}{\rho} \approx \; & e^{-t} \cdot \frac{(\beta t)^\alpha}{\Gamma(\alpha + 1)} \left[ M(\alpha, \alpha+1, (\beta - 1)t) - M(\alpha, \alpha+1, -(\beta+1)t) \right] \\ & + \frac{2t + e^{-2t} - 1}{2} \cdot f_{\alpha,\beta}(t) \\ & + (1 - e^{-2t}) \cdot \frac{\Gamma(\alpha, \beta t)}{\Gamma(\alpha)} \\ & - 2t \cdot f_{\alpha,\beta}(t)\end{split}\]

where:

  • \(M(a, b, z) = {}_1F_1(a, b, z)\) is Kummer’s confluent hypergeometric function, a special function that generalizes the exponential. It is evaluated using the Arb arbitrary-precision library (Johansson, 2017).

  • \(\Gamma(\alpha, x) = \int_x^\infty t^{\alpha-1} e^{-t} dt\) is the upper incomplete gamma function.

The four terms arise from integrating the SMC’ transition density against the gamma prior: the integral splits into three regions (\(s < t\), \(s = t\), and \(s > t\)), plus a correction that removes the no-recombination contribution:

  1. The first term covers recombination at \(s < t\) (the detached lineage coalesces at a later time).

  2. The second term covers the self-coalescence (the lineage re-coalesces onto the same branch at \(t = s\)).

  3. The third term covers recombination at \(s > t\) (the detached lineage coalesces at an earlier time).

  4. The fourth term subtracts the no-recombination contribution (which is already counted in \(f_{\alpha,\beta}\)).
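
The expression can be transcribed into SciPy almost verbatim. One substitution to note: scipy.special.hyp1f1 stands in here for the Arb evaluation of Kummer's \(M\) (adequate for moderate arguments, but not arbitrary-precision), and gammaincc gives the regularized ratio \(\Gamma(\alpha, \beta t)/\Gamma(\alpha)\) directly. A sketch:

from scipy.special import hyp1f1, gammaincc, gammaln
import numpy as np

def gamma_pdf(t, alpha, beta):
    """Gamma(alpha, beta) density, computed in log space for stability."""
    return np.exp(alpha * np.log(beta) - gammaln(alpha)
                  + (alpha - 1) * np.log(t) - beta * t)

def perturbation(t, alpha, beta):
    """Transcription of the closed-form (p - f) / rho displayed above."""
    f = gamma_pdf(t, alpha, beta)
    # Term 1: recombination with s < t (coalescence at a later time)
    term1 = (np.exp(-t) * (beta * t) ** alpha / np.exp(gammaln(alpha + 1))
             * (hyp1f1(alpha, alpha + 1, (beta - 1) * t)
                - hyp1f1(alpha, alpha + 1, -(beta + 1) * t)))
    # Term 2: self-coalescence back onto the same branch
    term2 = (2 * t + np.exp(-2 * t) - 1) / 2 * f
    # Term 3: recombination with s > t (regularized upper incomplete gamma)
    term3 = (1 - np.exp(-2 * t)) * gammaincc(alpha, beta * t)
    # Term 4: subtract the no-recombination contribution
    term4 = -2 * t * f
    return term1 + term2 + term3 + term4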

A critical property: parameter independence

The flow \((u, v)\) at each grid point \((\alpha, \beta)\) depends only on \(\alpha\) and \(\beta\) – not on \(\theta\), \(\rho\), or \(N_e\). This is because the perturbation is normalized by \(\rho\), and the remaining expression involves only the gamma distribution parameters and the SMC’ transition kernel (which, for constant population size, has no free parameters beyond the time units).

This parameter independence is what makes Gamma-SMC’s precomputation strategy possible: the flow field is computed once and reused for any dataset with any \(\theta\) and \(\rho\).
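
Putting the pieces together, the one-time precomputation is just a double loop over the grid, with \(\theta\) and \(\rho\) nowhere in sight. A rough sketch using the perturbation and solve_flow helpers sketched above; the choice of x-range covering the support is illustrative, not the paper's:

def precompute_flow_field(alpha_grid, beta_grid, n_x=2000):
    """Precompute the flow (u, v) at every (alpha, beta) grid point.

    Parameter-independent: theta, rho and N_e never appear here.
    """
    u_field = np.zeros((len(alpha_grid), len(beta_grid)))
    v_field = np.zeros_like(u_field)
    for i, alpha in enumerate(alpha_grid):
        for j, beta in enumerate(beta_grid):
            # x values covering the main support of Gamma(alpha, beta)
            mean, sd = alpha / beta, np.sqrt(alpha) / beta
            x = np.linspace(max(mean - 6 * sd, 1e-8), mean + 10 * sd, n_x)
            target = perturbation(x, alpha, beta)
            u_field[i, j], v_field[i, j] = solve_flow(x, alpha, beta, target)
    return u_field, v_field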

Log-Coordinates

Working directly with \((\alpha, \beta)\) is numerically inconvenient because these parameters span several orders of magnitude. Instead, Gamma-SMC uses log-coordinates based on the mean and coefficient of variation:

\[\begin{split}l_\mu &= \log_{10}\!\left(\frac{\alpha}{\beta}\right) \quad \text{(log-mean)} \\ l_C &= \log_{10}\!\left(\frac{1}{\sqrt{\alpha}}\right) \quad \text{(log-CV)}\end{split}\]

The flow field is evaluated over a grid of these coordinates:

  • 51 values of \(l_\mu\) equally spaced between \(-5\) and \(2\) (i.e., log-spaced means from \(10^{-5}\) to \(10^2\) coalescent time units)

  • 50 values of \(l_C\) equally spaced between \(-2\) and \(0\) (CVs from \(10^{-2}\) to \(1\))

Values with \(C_V > 1\) (\(\alpha < 1\)) are excluded because this would give an infinite gamma density at \(x = 0\), which is unphysical for a TMRCA.

Why log-coordinates?

Log-coordinates serve two purposes. First, they allow the grid to cover many orders of magnitude uniformly – a TMRCA could be \(10^{-3}\) coalescent units (very recent) or \(10^1\) (very ancient), and log-spacing gives equal resolution at all scales. Second, the Taylor expansion used for the PDE approach is performed in \((\log_{10} \alpha, \log_{10} \beta)\) coordinates, not in \((l_\mu, l_C)\) directly. The conversion uses the chain rule:

\[\begin{split}\Delta l_\mu &= \Delta \log_{10}(\alpha) - \Delta \log_{10}(\beta) \\ \Delta l_C &= -0.5 \cdot \Delta \log_{10}(\alpha)\end{split}\]

This avoids issues with the Taylor expansion that would arise from working in the \((l_\mu, l_C)\) coordinate system directly.
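
The conversion is a one-liner in each direction; a minimal sketch of the forward direction:

def flow_to_grid_coords(d_log10_alpha, d_log10_beta):
    """Convert a flow step from (log10 alpha, log10 beta) coordinates
    to the (l_mu, l_C) grid coordinates via the chain rule above."""
    d_l_mu = d_log10_alpha - d_log10_beta
    d_l_C = -0.5 * d_log10_alpha
    return d_l_mu, d_l_C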

def to_log_coords(alpha, beta):
    """Convert (alpha, beta) to log-mean / log-CV coordinates.

    Parameters
    ----------
    alpha : float
        Shape parameter (must be > 0).
    beta : float
        Rate parameter (must be > 0).

    Returns
    -------
    l_mu : float
        log10(alpha/beta), the log-mean TMRCA.
    l_C : float
        log10(1/sqrt(alpha)), the log coefficient of variation.
    """
    l_mu = np.log10(alpha / beta)
    l_C = np.log10(1.0 / np.sqrt(alpha))
    return l_mu, l_C

def from_log_coords(l_mu, l_C):
    """Convert log-coordinates back to (alpha, beta).

    Parameters
    ----------
    l_mu : float
        Log-mean coordinate.
    l_C : float
        Log-CV coordinate.

    Returns
    -------
    alpha, beta : float
    """
    alpha = 10.0 ** (-2 * l_C)
    beta = alpha * 10.0 ** (-l_mu)
    return alpha, beta

# Verify round-trip conversion
for alpha, beta in [(1.0, 1.0), (5.0, 2.5), (100.0, 50.0)]:
    l_mu, l_C = to_log_coords(alpha, beta)
    a_back, b_back = from_log_coords(l_mu, l_C)
    print(f"Gamma({alpha}, {beta}) -> (l_mu={l_mu:.4f}, l_C={l_C:.4f}) "
          f"-> Gamma({a_back:.4f}, {b_back:.4f})")

# The prior Gamma(1, 1) maps to (0, 0) in log-coordinates
l_mu_prior, l_C_prior = to_log_coords(1.0, 1.0)
print(f"\nPrior Gamma(1,1): (l_mu, l_C) = ({l_mu_prior:.1f}, {l_C_prior:.1f})")

# The grid covers:
l_mu_grid = np.linspace(-5, 2, 51)
l_C_grid = np.linspace(-2, 0, 50)
print(f"Grid: {len(l_mu_grid)} x {len(l_C_grid)} = "
      f"{len(l_mu_grid) * len(l_C_grid)} points")

Summary

The gamma approximation rests on two pillars:

  1. Emission step (exact): \(\text{Gamma}(\alpha, \beta) + Y_i \to \text{Gamma}(\alpha + Y_i, \beta + \theta)\). This is Poisson-gamma conjugacy – a textbook result that requires no approximation.

  2. Transition step (approximate): \(\text{Gamma}(\alpha, \beta) \to \text{Gamma}(\alpha + u \cdot \rho, \beta + v \cdot \rho)\), where \((u, v)\) are determined by least-squares fitting to the true perturbation. This requires a one-time precomputation over a two-dimensional grid.

Together, these two rules mean that the entire forward pass of the HMM reduces to a sequence of arithmetic operations on two numbers \((\alpha, \beta)\). No matrix multiplications, no numerical integration during inference, no discretization of time. The cost of one step is \(O(1)\), making the full forward pass \(O(N)\) in the sequence length.
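
As a capstone, here is how the two rules compose into a forward pass, in deliberately simplified form. flow_lookup is a stand-in for interpolation into the precomputed flow field (the subject of the next chapter); gamma_emission_update is the function defined earlier in this chapter:

def forward_pass(observations, theta, rho, flow_lookup):
    """Minimal sketch of the Gamma-SMC forward pass: O(1) work per site.

    flow_lookup(alpha, beta) -> (u, v) stands in for interpolating the
    precomputed flow field at the current gamma parameters.
    """
    alpha, beta = 1.0, 1.0  # the Exp(1) = Gamma(1, 1) prior
    means = []
    for i, y in enumerate(observations):
        if i > 0:
            # Transition: first-order flow step (approximate)
            u, v = flow_lookup(alpha, beta)
            alpha, beta = alpha + u * rho, beta + v * rho
        # Emission: Poisson-gamma conjugate update (exact)
        alpha, beta = gamma_emission_update(alpha, beta, y, theta)
        means.append(alpha / beta)  # posterior mean TMRCA at this site
    return means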

Next: The Flow Field – precomputing the transition machinery into a reusable vector field.