$ uv run gz test

[returncode] 0

[stdout]
Running tests...
  Eval delta: FAIL (3 regressions detected)

    instruction_eval / completeness: baseline=4.0, current=3.33 (delta=-0.67, threshold=-0.5)
    instruction_eval / control_balance: baseline=4.0, current=3.33 (delta=-0.67, threshold=-0.5)
    instruction_eval / surface_coverage: baseline=4.0, current=3.33 (delta=-0.67, threshold=-0.5)
  ↳ Eval delta: skipped (no baselines) — 5 surfaces scored, overall 2.7/4.0
  ↳ Eval delta: skipped (no eval datasets)
{
  "passed": true,
  "commands_discovered": 52,
  "commands_checked": 52,
  "commands_with_gaps": 0,
  "gaps": [],
  "undeclared_commands": [],
  "orphaned_docs": []
}
Documentation Coverage Gap Report
========================================

PASSED: 52 commands discovered, 52 checked, all required surfaces present.
debug test
Unexpected error: unexpected
bad input
policy violated
network down

................................................................................
................................................................................
................................................................................
................................................................................
................................................................................
................................................................................
................................................................................
................................................................................
................................................................................
................................................................................
................................................................................
................................................................................
................................................................................
................................................................................
................................................................................
................................................................................
................................................................................
................................................................................
................................................................................
................................................................................
...............................
----------------------------------------------------------------------
Ran 1631 tests in 16.410s

OK

Tests passed.


[stderr]
