Case Study
What do you actually lose by running locally instead of paying for Claude?
We ran the same 4-agent API design scenario through Cohort's 3 response modes -- local-only (free), hybrid local+Claude, and Claude-only. Same conversation. Same scoring engine. Same agents. The only variable is the inference pipeline.
In production, each agent gets the full Cohort context pipeline -- agent prompt, memory, learned facts, document library, and channel history. This test used short system prompts (~170 tokens), giving you a clean quality baseline. Scroll down to Why the Condenser Exists to see what happens when you feed the full pipeline to a 9B model -- and how Smartest mode's distillation step solves it.
Read the actual responses below and judge for yourself.
Smarter -- Free
"I will integrate FastAPI-Limiter for per-IP rate limiting to block cursor enumeration attempts and implement a dual-key Fernet strategy with a 24-hour grace period for seamless key rotation."
Smartest -- $0.004/call
"HMAC validation must be constant-time (hmac.compare_digest, not ==) to prevent timing side-channels on cursor signatures, and the secret rotation story needs coverage too."
Claude-only -- $0.005/call
"Ship with unsigned cursors now, add a v1: prefix byte so you can transparently migrate to HMAC-signed cursors later without breaking clients -- exactly how Shopify handled their cursor format evolution."
All responses generated March 14, 2026. Smarter: qwen3.5:9b on RTX 3080. Smartest: qwen3.5:9b + Claude CLI. Claude-only: Claude CLI direct.
View raw test data (JSON)
Quality Scores
We scored each arm across 8 dimensions that matter for multi-agent team output. Speed and cost are intentionally excluded -- this is pure quality assessment.
| Dimension | Smarter | Smartest | Claude-only |
|---|---|---|---|
| Technical accuracy | 7 | 9 | 9 |
| Actionability | 6 | 8 | 9 |
| Independent thinking | 5 | 7 | 9 |
| Security depth | 6 | 9 | 8 |
| Real-world citations | 7 | 8 | 9 |
| Lane discipline | 8 | 7 | 8 |
| Builds on teammates | 6 | 7 | 8 |
| Would change your design? | 5 | 8 | 9 |
| Average | 6.3 | 7.9 | 8.6 |
Token Usage
Total tokens processed across all 8 responses per arm. Smartest processes the most tokens because it runs 3 phases (reason + distill + Claude), but Qwen handles the bulk locally for free.
| Metric | Smarter | Smartest | Claude-only |
|---|---|---|---|
| Total tokens in | 4,473 | 16,677 | 6,524 |
| Total tokens out | 765 | 10,027 | 1,522 |
| Tokens processed locally (free) | 5,238 | 23,356 | 0 |
| Tokens sent to Claude (paid) | 0 | 3,348 | 8,046 |
| LLM calls | 8 | 24 (8x3 phases) | 8 |
| Paid token savings vs Claude-only | 100% | 58% | -- |
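The savings figure falls straight out of the paid-token row. A quick check of the arithmetic, using the numbers from the table:

```python
# Paid tokens per arm, taken from the table above
smartest_paid = 3_348
claude_only_paid = 8_046

savings = 1 - smartest_paid / claude_only_paid
print(f"{savings:.0%}")  # -> 58%
```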
Smarter scores 6.3 and it's free. For 95% of agent coordination -- routing, task delegation, status updates, basic analysis -- 6.3 is more than enough. The tiered architecture means you run Smarter for everyday work, then selectively escalate the 1-2 messages that genuinely need premium quality to Smartest or Claude-only. You're not choosing between tiers -- you're combining them.
The Three Pipelines
Each mode uses a different inference pipeline. The scoring engine and agent configuration are identical across all three.
Smarter
S+ -- Free
Local model with thinking. Your GPU does all the work. Zero API calls, zero cost, zero data leaving your machine.
prompt --> qwen3.5:9b --> response
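A minimal sketch of what this single-call pipeline looks like in Python, assuming the local model is served through Ollama's HTTP API (the model tag format suggests Ollama, but Cohort's actual transport may differ):

```python
import requests

def smarter(prompt: str) -> str:
    """Single local call: prompt -> qwen3.5:9b -> response. No paid API involved."""
    resp = requests.post(
        "http://localhost:11434/api/generate",  # default Ollama endpoint (assumption)
        json={"model": "qwen3.5:9b", "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]
```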
Smartest
S++ -- Hybrid
Local model reasons deeply, distills its analysis into a briefing, then Claude polishes the final response.
prompt --> qwen3.5:9b reason --> distill --> Claude CLI --> response
Claude-only
Direct -- Paid
Full conversation context sent directly to Claude. Maximum quality, maximum cost. The baseline for comparison.
prompt --> Claude CLI --> response
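The Claude-only arm is a single paid call. A sketch using the Claude CLI's non-interactive print mode -- the exact invocation Cohort uses isn't shown in the test data, so treat this as an assumption:

```python
import subprocess

def claude_only(prompt: str) -> str:
    """Direct paid call: full prompt -> Claude CLI -> response."""
    result = subprocess.run(
        ["claude", "-p", prompt],  # -p / --print: non-interactive output (assumed invocation)
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```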
Read the Responses
Same scenario, same agents, same phase. Three different pipelines. Click any card to expand the full response.
This is where the differences become clearest. Smarter agrees with everything and adds reasonable suggestions. Smartest digs deep -- timing attacks, EXPLAIN ANALYZE, dual-key rotation from AWS KMS. Claude-only makes the strategic call: "Don't make HMAC a launch blocker. Ship with unsigned cursors and a version prefix." That kind of independent judgment -- disagreeing with the team when the team is over-engineering -- is the hardest thing for any model to produce.
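To make those security points concrete, here is a minimal sketch of what the Smartest and Claude-only suggestions add up to: a versioned cursor that ships unsigned today (v1:) and can migrate to HMAC-signed (v2:) later, verified in constant time. The function names and wire format are illustrative, not Cohort's or the agents' actual output:

```python
import base64
import hashlib
import hmac

SECRET = b"rotate-me"  # illustrative; in production this comes from your key store

def encode_cursor_v1(position: str) -> str:
    # v1: unsigned cursor -- what the Claude-only response suggests shipping at launch
    return "v1:" + base64.urlsafe_b64encode(position.encode()).decode()

def encode_cursor_v2(position: str) -> str:
    # v2: HMAC-signed cursor -- the later migration target
    payload = base64.urlsafe_b64encode(position.encode()).decode()
    sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return f"v2:{payload}:{sig}"

def decode_cursor(cursor: str) -> str:
    version, _, rest = cursor.partition(":")
    if version == "v1":
        return base64.urlsafe_b64decode(rest).decode()
    if version == "v2":
        payload, _, sig = rest.rpartition(":")
        expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
        # Constant-time comparison, per the Smartest response -- never use == here
        if not hmac.compare_digest(sig, expected):
            raise ValueError("invalid cursor signature")
        return base64.urlsafe_b64decode(payload).decode()
    raise ValueError("unknown cursor version")
```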
What Separates Them
Smarter -- The Reliable Executor
Hits the right direction on every response. Names real technologies, stays concise, follows instructions. But it agrees too much. Nobody on the Smarter team says "no, don't do that." The Researcher validates everything. The Developer implements whatever's asked. You wouldn't change your design based on what it said -- but you'd trust it to execute a design you already validated.
Smartest -- The Deep Analyst
Real depth shows up here: timing attacks on HMAC, EXPLAIN ANALYZE under load, the limit+1 infinite loop class of bugs, AWS KMS key rotation patterns. These are things an experienced senior engineer would flag in a real code review. Smartest has the deepest technical knowledge of the three -- particularly on security, where it scored a 9. The tradeoff is that it sometimes buries those gems in longer responses.
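The limit+1 point is worth one concrete illustration: the standard pattern fetches one row more than the page size purely to learn whether a next page exists, and skipping that check is what produces the infinite-loop bugs the Smartest responses flag. A hedged sketch -- table and column names are invented for illustration:

```python
def fetch_page(db, after_id: int | None, limit: int = 50):
    # Fetch limit + 1 rows: the extra row only tells us whether another page exists.
    rows = db.execute(
        "SELECT id, body FROM messages WHERE id > ? ORDER BY id LIMIT ?",
        (after_id or 0, limit + 1),
    ).fetchall()

    has_next = len(rows) > limit
    page = rows[:limit]

    # Only emit a next cursor when there really is a next page --
    # handing back a cursor on the final page is how clients end up looping forever.
    next_cursor = page[-1][0] if has_next and page else None
    return page, next_cursor
```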
Claude-only -- The Senior Engineer
Every sentence earns its place. The Researcher pushing back on HMAC as a launch blocker, citing Stripe and Shopify as precedent. The Tester providing an actual SQL injection payload. The Developer proposing a v1: prefix for cursor format evolution. These aren't just correct -- they're the kind of insights that actually change what you build. Claude-only produces the team you'd want in the room for critical decisions.
Why the Condenser Exists
The test above used short system prompts (~170 tokens). In production, every agent carries its full context: personality, principles, capabilities, persistent memory, learned facts from RSS feeds, and channel conversation history. That's a very different prompt.
We ran the same scenario through the full production pipeline. The 9B local model received the complete agent prompt -- personality, principles, capabilities, persistent memory, RSS-sourced facts, collaboration rules, and channel context. It looped. The model regurgitated its own system prompt text repeatedly, interspersed with Chinese characters meaning "stuck in a loop." Zero usable output from ~16K tokens of input.
This isn't a model quality problem -- it's a context management problem. A 9B model has the knowledge to answer well (it scored 6.3/10 with short prompts), but drowning it in 4,000+ tokens of system context before the actual question causes prompt-following collapse.
Smartest mode solves this with a three-phase pipeline. Phase 1: Qwen processes the full context with its 256K window and reasons about the answer (thinking tokens, free). Phase 2: Qwen distills its own reasoning into a structured briefing (~200-400 tokens of pure signal). Phase 3: Claude receives that tight briefing instead of the raw 4,000+ token prompt.
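In code, the three phases chain together roughly like this. The helper functions stand in for Cohort's actual local-model and Claude transports (see the pipeline sketches above), and the distillation prompt is illustrative, not the one Cohort ships:

```python
def smartest(agent_context: str, question: str) -> str:
    # Phase 1 -- local reasoning: the 9B model reads the full 4,000+ token context
    # and thinks through an answer. All of this is free, local compute.
    reasoning = smarter(f"{agent_context}\n\nQuestion: {question}\n\nThink step by step.")

    # Phase 2 -- local distillation: the same model compresses its own reasoning
    # into a short structured briefing (~200-400 tokens of pure signal).
    briefing = smarter(
        "Distill the analysis below into a briefing with key findings, "
        "a recommended approach, and concrete details. Keep it under 400 tokens.\n\n"
        + reasoning
    )

    # Phase 3 -- paid polish: Claude sees only the briefing plus the question,
    # never the raw context, which is where the paid-token savings comes from.
    return claude_only(f"Briefing from local analysis:\n{briefing}\n\nAnswer: {question}")
```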
The result: Claude gets better input than it would from the raw context. The distillation extracts key findings, recommended approaches, and concrete details -- pre-digested by a model that already did the hard work of synthesizing the agent's full knowledge base. That's why Smartest sent only 3,348 tokens to Claude across all 8 responses, compared to Claude-only's 8,046 tokens -- 58% fewer paid tokens with comparable quality.
The Real Story: You Don't Choose One
In production, Cohort runs Smarter for 95% of agent coordination -- free, fast, good enough. When a message genuinely needs premium quality (a security review, a critical architecture decision), you escalate that one message to Smartest. You pay $0.004 for one Claude call instead of $0.042 for eight.
That's the tiered architecture: local model handles the volume, Claude handles the exceptions. Your GPU does 95% of the thinking. Your wallet barely notices the other 5%.
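A sketch of what that escalation policy can look like in practice, reusing the pipeline functions sketched above. The trigger heuristic is invented for illustration; Cohort's actual routing rules aren't described on this page:

```python
PREMIUM_TRIGGERS = ("security review", "architecture decision", "production incident")

def respond(agent_context: str, message: str) -> str:
    # Default: free local tier for routing, delegation, status updates, basic analysis.
    if not any(trigger in message.lower() for trigger in PREMIUM_TRIGGERS):
        return smarter(f"{agent_context}\n\n{message}")
    # Exception: escalate the handful of messages that genuinely need premium quality.
    return smartest(agent_context, message)
```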