  Eval delta: FAIL (3 regressions detected)

    instruction_eval / completeness: baseline=4.0, current=3.33 (delta=-0.67, threshold=-0.5)
    instruction_eval / control_balance: baseline=4.0, current=3.33 (delta=-0.67, threshold=-0.5)
    instruction_eval / surface_coverage: baseline=4.0, current=3.33 (delta=-0.67, threshold=-0.5)
  ↳ Eval delta: skipped (no baselines) — 5 surfaces scored, overall 2.7/4.0
  ↳ Eval delta: skipped (no eval datasets)
debug test
Unexpected error: unexpected
bad input
policy violated
network down
----------------------------------------------------------------------
Ran 1616 tests in 14.524s

OK
