Zero-Shot vs Demo-Conditioned Evaluation

WAA Benchmark · GPT-5.1 · 2026-02-19 · 3 tasks × 2 conditions = 6 runs
Zero-Shot Success: 0 / 3
Demo-Conditioned Success: 0 / 3
Closest to Success: Notepad (demo), 1 click away
Key Finding: Demo dramatically improves action coherence
Tasks: Settings, Archive, Notepad