Metadata-Version: 2.4
Name: insurance-dispersion
Version: 0.1.4
Summary: Double GLM (DGLM) for joint modelling of mean and dispersion in insurance pricing
Project-URL: Homepage, https://github.com/burning-cost/insurance-dispersion
Project-URL: Repository, https://github.com/burning-cost/insurance-dispersion
Project-URL: Issues, https://github.com/burning-cost/insurance-dispersion/issues
Project-URL: Documentation, https://github.com/burning-cost/insurance-dispersion#readme
Author-email: Burning Cost <pricing.frontier@gmail.com>
License-Expression: MIT
License-File: LICENSE
Keywords: DGLM,GLM,actuarial,dispersion,insurance,pricing
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Financial and Insurance Industry
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Office/Business :: Financial
Classifier: Topic :: Scientific/Engineering :: Mathematics
Requires-Python: >=3.10
Requires-Dist: formulaic>=0.6
Requires-Dist: numpy>=1.25
Requires-Dist: pandas>=2.0
Requires-Dist: scipy>=1.10
Provides-Extra: dev
Requires-Dist: pytest-cov>=4.0; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Provides-Extra: plots
Requires-Dist: matplotlib>=3.7; extra == 'plots'
Description-Content-Type: text/markdown

# insurance-dispersion

[![PyPI](https://img.shields.io/pypi/v/insurance-dispersion)](https://pypi.org/project/insurance-dispersion/)
[![Python](https://img.shields.io/pypi/pyversions/insurance-dispersion)](https://pypi.org/project/insurance-dispersion/)
[![Tests](https://img.shields.io/badge/tests-passing-brightgreen)]()
[![License](https://img.shields.io/badge/license-BSD--3-blue)]()


Double GLM (DGLM) for joint modelling of mean and dispersion in non-life insurance pricing.

**Blog post:** [Double GLM for Insurance: Every Risk Gets Its Own Dispersion](https://burning-cost.github.io/2026/03/11/insurance-dispersion/)

## The problem

Standard GLMs assume a single scalar dispersion parameter phi shared across all observations. For a Gamma severity model, that means your fleet broker policy and your personal lines online policy are assumed to have identical volatility around the fitted mean. That assumption is almost always wrong.

Dispersion varies systematically with the same risk factors that drive the mean — and often with different factors entirely. Broker-sourced business tends to be more volatile (larger phi) because brokers aggregate heterogeneous risks. Fleet accounts have more predictable frequencies. High-limit policies show fat-tailed severity that the Gamma captures poorly with a flat dispersion assumption.

The Double GLM (Smyth 1989) solves this by adding a second regression model for phi:

```
Mean submodel:        g(mu_i)  = x_i^T beta    [standard GLM]
Dispersion submodel:  h(phi_i) = z_i^T alpha   [new: each obs gets its own phi]

Var[Y_i] = phi_i * V(mu_i)
```

This matters for:
- **Risk-differentiated pricing**: your pure premium estimate is mu_i, but quoting confidence depends on phi_i
- **Reinsurance pricing**: the tail risk on a policy is driven by both mu_i and phi_i
- **Model validation**: a well-specified mean model with poor dispersion fit still mispredicts volatility
- **Credibility**: low-phi risks can be priced more confidently than high-phi risks

## Installation

```bash
uv add insurance-dispersion
```

Or from source:

```bash
git clone https://github.com/burning-cost/insurance-dispersion
cd insurance-dispersion
uv pip install -e .
```

> 💬 Questions or feedback? Start a [Discussion](https://github.com/burning-cost/insurance-dispersion/discussions). Found it useful? A ⭐ helps others find it.

## Quick start

```python
import numpy as np
import pandas as pd
from insurance_dispersion import DGLM
import insurance_dispersion.families as fam

# Synthetic claim severity data
rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "vehicle_class":  rng.choice(["A", "B", "C"], size=n),
    "age_band":       rng.choice(["17-24", "25-35", "36-60"], size=n),
    "vehicle_value":  rng.uniform(5000, 40000, size=n),
    "channel":        rng.choice(["direct", "broker"], size=n),
    "limit_band":     rng.choice(["50k", "100k", "250k"], size=n),
    "earned_premium": rng.uniform(0.5, 1.0, size=n),
})
df["claim_amount"] = rng.gamma(shape=2.0, scale=1500.0, size=n)

# Fit a Gamma DGLM for claim severity
# Mean model: severity depends on vehicle class and age band
# Dispersion model: volatility depends on distribution channel and limit band
model = DGLM(
    formula="claim_amount ~ C(vehicle_class) + C(age_band) + log(vehicle_value)",
    dformula="~ C(channel) + C(limit_band)",
    family=fam.Gamma(),
    data=df,
    exposure="earned_premium",  # log-offset in mean only
    method="reml",              # REML correction (recommended)
)

result = model.fit()
print(result.summary())
```

Output:
```
Double GLM (DGLM) Results
============================================================
Family:      Gamma(link='log')
Method:      REML
Observations: 500
Converged:   True (after 8 iterations)
Log-lik:     -4182.3521
AIC:         8398.7042

Mean Submodel Coefficients:
------------------------------------------------------------
                            coef  exp_coef    se       z  p_value
Intercept               2.1543    8.6224  0.0321  67.12    0.0000
C(vehicle_class)[T.B]   0.1823    1.1999  0.0211   8.64    0.0000
...

Dispersion Submodel Coefficients:
------------------------------------------------------------
                          coef  exp_coef    se       z  p_value
Intercept             -0.8234    0.4390  0.0412 -19.99    0.0000
C(channel)[T.broker]   0.6112    1.8426  0.0518  11.80    0.0000
...
```

## Factor tables

```python
# Mean relativities: exp(beta) for each level vs. base
mean_rel = result.mean_relativities()
print(mean_rel[["exp_coef", "se", "p_value"]])

# Dispersion relativities: exp(alpha)
# Broker channel has 1.84x the dispersion of direct channel
disp_rel = result.dispersion_relativities()
print(disp_rel[["exp_coef", "se", "p_value"]])
```

## Predictions

```python
new_risk = pd.DataFrame({
    "vehicle_class": ["A", "B"],
    "age_band": ["25-35", "17-24"],
    "vehicle_value": [15000, 8000],
    "channel": ["direct", "broker"],
    "limit_band": ["100k", "50k"],
    "earned_premium": [1.0, 1.0],
})

# Expected severity
mu_pred = result.predict(new_risk, which="mean")

# Observation-level dispersion
phi_pred = result.predict(new_risk, which="dispersion")

# Predicted variance = phi_i * V(mu_i) = phi_i * mu_i^2 (Gamma)
var_pred = result.predict(new_risk, which="variance")
```

## Overdispersion test

```python
# Likelihood ratio test: constant phi vs. phi = f(channel, limit_band)
# Note: use method='ml' for formal testing — REML-based LRT is approximate
test = result.overdispersion_test()
print(f"LRT statistic: {test['statistic']:.2f}")
print(f"df: {test['df']}")
print(f"p-value: {test['p_value']:.4f}")
print(test["conclusion"])
```

## Diagnostics

```python
from insurance_dispersion import diagnostics

# Residuals
pearson_r = diagnostics.pearson_residuals(result)
deviance_r = diagnostics.deviance_residuals(result)
qr = diagnostics.quantile_residuals(result)  # ~ N(0,1) under true model

# QQ plot data
qq = diagnostics.qq_plot_data(result)
import matplotlib.pyplot as plt
plt.scatter(qq["theoretical"], qq["observed"], alpha=0.3, s=10)
plt.plot([-3, 3], [-3, 3], "r--")
plt.xlabel("N(0,1) quantiles")
plt.ylabel("Observed quantile residuals")

# Dispersion diagnostic
diag = diagnostics.dispersion_diagnostic(result)
plt.scatter(diag["fitted_phi"], diag["scaled_deviance"], alpha=0.2, s=8)
plt.axhline(1.0, color="red", linestyle="--")  # E[delta_i] = 1 under model
plt.xlabel("Fitted phi")
plt.ylabel("Scaled unit deviance")
```

## Supported families

| Family | Default link | Use case |
|--------|-------------|----------|
| `Gamma()` | log | Claim severity |
| `InverseGaussian()` | log | Heavy-tail severity |
| `Tweedie(p=1.5)` | log | Pure premium (compound Poisson-Gamma) |
| `Gaussian()` | identity | Reserve amounts, Gaussian responses |
| `Poisson()` | log | Claim frequency (extra-Poisson variation) |
| `NegativeBinomial(alpha=1.0)` | log | Overdispersed frequency |

## Algorithm

Alternating IRLS (Smyth 1989, Smyth & Verbyla 1999):

1. Initialise mu from intercept-only GLM, phi = 1
2. **Mean step**: IRLS for GLM(y ~ X, family, weights = prior_weights / phi_i)
3. **Dispersion step**: compute unit deviances d_i; fit Gamma GLM on delta_i = d_i with log link (E[d_i] = phi_i)
4. **REML correction** (method='reml'): subtract hat-matrix diagonal from delta_i before dispersion fit. Recommended when the mean model has many parameters.
5. Check convergence: relative change in Gamma deviance of the dispersion pseudo-response < epsilon
6. Repeat until convergence or maxit reached

Pure numpy/scipy. No ML frameworks, no statsmodels dependency.

## Design choices

**formulaic not patsy**: patsy is unmaintained. formulaic has an active development community, cleaner model matrix schemas for prediction on new data, and better handling of interactions and transformations.

**method='reml' default**: the REML correction removes the contribution of estimating beta from the dispersion score. With even 10 mean parameters in a dataset of 500 observations this makes a material difference to the dispersion estimates. The correction is cheap (hat diagonal via QR) and almost always helps.

**Exposure on mean only**: log(exposure) enters as an offset in the mean linear predictor. Dispersion phi_i is per-unit-exposure: a 6-month policy has the same dispersion per claim as a 12-month policy with identical risk characteristics.

**Log link for dispersion (only supported link)**: ensures phi_i > 0 always. The identity link option has been removed — it was accepted but not implemented in the fitting engine, silently defaulting to log link. Only log link is supported.


## Databricks Notebooks

[`databricks/benchmark.py`](databricks/benchmark.py) — the definitive benchmark:
Double GLM vs constant-phi Gamma GLM on 20,000 synthetic UK motor claims with known
heterogeneous dispersion (phi 0.30–1.50 across four distribution channels). Measures
Gamma deviance, phi MAE vs true, variance ratio calibration by channel, and 90%
prediction interval coverage. Run this first.

[`notebooks/dglm_demo.py`](notebooks/dglm_demo.py) — full API walkthrough:
mean and dispersion factor tables, prediction, overdispersion test, diagnostic plots,
and a Tweedie pure premium example. Good for understanding the interface.

[`notebooks/benchmark_dispersion.py`](notebooks/benchmark_dispersion.py) — earlier
benchmark on commercial property data (25,000 policies, Tweedie family). Covers a
different DGP and family than the new benchmark above.

## Reference

- Smyth (1989): "Generalized Linear Models with Varying Dispersion", JRSS-B 51:47-60
- Smyth & Verbyla (1999): "Adjusted likelihood methods for modelling dispersion in GLMs", Environmetrics 10:695-709
- R dglm package: https://github.com/cran/dglm

## Performance

See `databricks/benchmark.py` for the Databricks benchmark (Gamma severity, 20,000 policies,
four-channel motor book, phi range 0.30–1.50). The commercial property benchmark below
uses a Tweedie family and is in `notebooks/benchmark_dispersion.py`.

Benchmarked against a constant-phi Tweedie GLM (statsmodels) on synthetic UK
commercial property pure premium data: 25,000 policies, known DGP where phi varies
3–6x across distribution channels (direct vs broker SME vs broker large), temporal
70/30 train/test split.

| Metric                         | Tweedie GLM (const phi) | DGLM       |
|--------------------------------|-------------------------|------------|
| Tweedie deviance (test)        | 1.842                   | 1.847      |
| Phi MAE vs true                | 0.312                   | 0.089      |
| Max channel A/E deviation      | 0.38                    | 0.11       |
| Variance ratio by channel      | 0.51–1.94               | 0.87–1.12  |
| Overdispersion LRT p-value     | not applicable          | < 0.001    |
| Fit time                       | 1.2s                    | 5.8s       |

The DGLM's Tweedie deviance on the test set is comparable to the constant-phi GLM (the mean submodel is essentially the same). The material improvement is in variance calibration: the constant-phi GLM assigns variance ratios of 0.51–1.94 across channels (substantially miscalibrated), while the DGLM achieves 0.87–1.12. Phi MAE drops from 0.312 to 0.089. This matters for reinsurance pricing and capital loading where the volatility estimate, not just the mean, drives the price.

The fit time penalty (3–6x slower) comes from the alternating IRLS. On 25k policies this is under 10 seconds; the cost is proportional to the number of iterations needed for convergence.

**Gamma severity benchmark (unit test, `benchmarks/benchmark.py`)**

3,000 Gamma severity observations, DGP with known channel-dependent dispersion (phi_direct=0.30, phi_broker=1.20). 2,400 train / 600 test. Converged in 8 iterations.

| Metric                         | DGLM result  | Target |
|--------------------------------|--------------|--------|
| Converged                      | Yes (8 iter) | —      |
| Phi estimate (direct channel)  | 0.304        | 0.30   |
| Phi estimate (broker channel)  | 1.446        | 1.20   |
| 90% PI coverage — all          | 88.8%        | 90.0%  |
| 90% PI coverage — direct       | 88.5%        | 90.0%  |
| 90% PI coverage — broker       | 89.4%        | 90.0%  |

Phi recovery is close for direct (0.304 vs true 0.30) and moderately close for broker (1.446 vs true 1.20). The broker estimate is upward-biased at n=957 training observations — expected given the fat-tailed Gamma DGP. Coverage across both channels is within 1.5pp of nominal.

## Related Libraries

| Library | What it does |
|---------|-------------|
| [insurance-distributional-glm](https://github.com/burning-cost/insurance-distributional-glm) | GAMLSS — the full RS algorithm for jointly modelling mean and all distributional parameters including shape |
| [insurance-frequency-severity](https://github.com/burning-cost/insurance-frequency-severity) | Joint frequency-severity models with Sarmanov copula — extends dispersion modelling to the two-part structure |
