$ uv run gz test

[returncode] 0

[stdout]
Running tests...
  Eval delta: FAIL (3 regressions detected)

    instruction_eval / completeness: baseline=4.0, current=3.33 (delta=-0.67, threshold=-0.5)
    instruction_eval / control_balance: baseline=4.0, current=3.33 (delta=-0.67, threshold=-0.5)
    instruction_eval / surface_coverage: baseline=4.0, current=3.33 (delta=-0.67, threshold=-0.5)
  ↳ Eval delta: skipped (no baselines) — 5 surfaces scored, overall 2.7/4.0
  ↳ Eval delta: skipped (no eval datasets)
debug test
Unexpected error: unexpected
bad input
policy violated
network down

................................................................................
................................................................................
................................................................................
................................................................................
................................................................................
................................................................................
................................................................................
................................................................................
................................................................................
................................................................................
................................................................................
................................................................................
................................................................................
................................................................................
................................................................................
................................................................................
................................................................................
................................................................................
................................................................................
................................................................................
................
----------------------------------------------------------------------
Ran 1616 tests in 15.189s

OK

Tests passed.


[stderr]
