
Case Study

What do you actually lose by running locally instead of paying for Claude?

We ran the same 4-agent API design scenario through Cohort's 3 response modes -- local-only (free), hybrid local+Claude, and Claude-only. Same conversation. Same scoring engine. Same agents. The only variable is the inference pipeline.

In production, each agent gets the full Cohort context pipeline -- agent prompt, memory, learned facts, document library, and channel history. This test used short system prompts (~170 tokens), giving you a clean quality baseline. Scroll down to Why the Condenser Exists to see what happens when you feed the full pipeline to a 9B model -- and how Smartest mode's distillation step solves it.

Read the actual responses below and judge for yourself.

I will integrate FastAPI-Limiter for per-IP rate limiting to block cursor enumeration attempts and implement a dual-key Fernet strategy with a 24-hour grace period for seamless key rotation.

Smarter -- Free

HMAC validation must be constant-time (hmac.compare_digest, not ==) to prevent timing side-channels on cursor signatures, and the secret rotation story needs coverage too.

Smartest -- $0.004/call

Ship with unsigned cursors now, add a v1: prefix byte so you can transparently migrate to HMAC-signed cursors later without breaking clients -- exactly how Shopify handled their cursor format evolution.

Claude-only -- $0.005/call

All responses generated March 14, 2026. Smarter: qwen3.5:9b on RTX 3080. Smartest: qwen3.5:9b + Claude CLI. Claude-only: Claude CLI direct.
View raw test data (JSON)

Quality Scores

We scored each arm across 8 dimensions that matter for multi-agent team output. Speed and cost are intentionally excluded -- this is pure quality assessment.

Dimension Smarter Smartest Claude-only
Technical accuracy 7 9 9
Actionability 6 8 9
Independent thinking 5 7 9
Security depth 6 9 8
Real-world citations 7 8 9
Lane discipline 8 7 8
Builds on teammates 6 7 8
Would change your design? 5 8 9
Average 6.3 7.9 8.6

Token Usage

Total tokens processed across all 8 responses per arm. Smartest processes the most tokens because it runs 3 phases (reason + distill + Claude), but Qwen handles the bulk locally for free.

Metric Smarter Smartest Claude-only
Total tokens in 4,473 16,677 6,524
Total tokens out 765 10,027 1,522
Tokens processed locally (free) 5,238 23,356 0
Tokens sent to Claude (paid) 0 3,348 8,046
LLM calls 8 24 (8x3 phases) 8
Paid token savings vs Claude-only 100% 58% --
[*] The takeaway isn't "use Claude for everything"

Smarter scores 6.3 and it's free. For 95% of agent coordination -- routing, task delegation, status updates, basic analysis -- 6.3 is more than enough. The tiered architecture means you run Smarter for everyday work, then selectively escalate the 1-2 messages that genuinely need premium quality to Smartest or Claude-only. You're not choosing between tiers -- you're combining them.
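
As a rough illustration, that escalation can be as simple as a routing check in front of the message pipeline. Everything here -- `Mode`, `pick_mode`, the trigger list -- is a hypothetical sketch, not Cohort's actual API:

```python
# Hypothetical sketch of tiered escalation -- not Cohort's actual API.
from enum import Enum

class Mode(Enum):
    SMARTER = "smarter"       # local-only, free: the 95% default
    SMARTEST = "smartest"     # local reasoning + one paid Claude call
    CLAUDE_ONLY = "claude"    # direct Claude: the quality baseline

# Illustrative triggers; in practice this could be a per-message user flag.
PREMIUM_TRIGGERS = ("security review", "architecture decision")

def pick_mode(message: str) -> Mode:
    """Run Smarter by default; escalate the rare message that needs more."""
    if any(t in message.lower() for t in PREMIUM_TRIGGERS):
        return Mode.SMARTEST
    return Mode.SMARTER
```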

The Three Pipelines

Each mode uses a different inference pipeline. The scoring engine and agent configuration are identical across all three.

Smarter

S+ -- Free

Local model with thinking. Your GPU does all the work. Zero API calls, zero cost, zero data leaving your machine.

prompt --> qwen3.5:9b --> response

Smartest

S++ -- Hybrid

Local model reasons deeply, distills its analysis into a briefing, then Claude polishes the final response.

prompt --> qwen3.5:9b reason --> distill --> Claude CLI --> response

Claude-only

Direct -- Paid

Full conversation context sent directly to Claude. Maximum quality, maximum cost. The baseline for comparison.

prompt --> Claude CLI --> response

Read the Responses

Same scenario, same agents, same phase. Three different pipelines. Click any card to expand the full response.

DISCOVER Researching existing APIs and pagination approaches
RS Researcher
Smarter 173 in / 98 out | 1.4s
We should adopt cursor-based pagination as the primary strategy, following the pattern established by Twitter's API v2 and GitHub, as it provides constant query performance regardless of dataset size. For offset-based pagination, we can reference the approach used by Stripe and the Ruby on Rails kaminari gem, though we must warn against deep pagination issues. I recommend implementing a hybrid model where cursors are the default for large datasets, while allowing offset limits for small, initial exploratory queries.
Smartest 713 in / 491 out | 15.6s
Cursor-based pagination has well-established precedent across high-scale APIs -- Twitter's v2 API, Stripe, Slack, and the GitHub GraphQL API all use keyset/cursor patterns precisely because offset pagination degrades to O(n) as the offset grows, since the database must scan and discard all preceding rows. The seminal work here is Markus Winand's "Use The Index, Luke" which demonstrates how WHERE id > :last_seen ORDER BY id LIMIT :size leverages index seeks for constant-time page fetches regardless of depth. For our use case, I'd recommend the opaque cursor approach (base64-encoded keyset values) over raw column values, since it decouples the client from internal sort implementation and lets us change the underlying ordering strategy without breaking the API contract.
Claude-only 122 in / 228 out | 13.0s
The three dominant patterns in production APIs are offset/limit (GitHub REST, Stripe), cursor-based (Facebook Graph API, Slack, Relay spec), and keyset pagination (used internally by many databases). Offset/limit is simplest but degrades on large tables due to OFFSET N requiring scanning N rows; cursor-based avoids this by encoding an opaque position token, making it O(1) for indexed columns. Stripe's approach is notable -- they use cursor-based with starting_after=obj_id and limit, which maps cleanly to a WHERE id > :cursor ORDER BY id LIMIT :n query and avoids the stale-data issues of offset pagination when records are inserted or deleted mid-traversal. For a /users endpoint, I'd recommend cursor-based with a stable sort key (like created_at + id composite) since it scales well and is the clear industry consensus for any dataset that could grow beyond a few thousand rows.
DV Developer
Smarter 272 in / 105 out | 1.3s
I agree with the hybrid approach, but we should enforce a strict maximum page size (e.g., 100 items) for offset-based queries to prevent deep pagination attacks. For the cursor implementation, we'll use the page query parameter to encode the last id or created_at timestamp from the previous response, ensuring consistent ordering. Let's start by modifying the SQLAlchemy limit() and offset() clauses to accept these parameters, then add validation middleware to reject requests exceeding our configured thresholds.
Smartest 2,807 in / 2,389 out | 36.3s
Cursor-based pagination is the right call for our API endpoints. I'd implement it with a cursor: Optional[str] = None and limit: int = Query(default=20, le=100) on each list endpoint, where the cursor is a base64-encoded JSON blob containing the last seen id and created_at -- then the SQLAlchemy query just adds .filter(Model.id > last_id).order_by(Model.id.asc()).limit(limit + 1), fetching one extra row to determine has_more and build the next cursor. We should wrap the cursor encode/decode in a small utility in utils/pagination.py so every endpoint stays clean and consistent, and return a standard {items: [...], next_cursor: str | null} envelope.
Claude-only 354 in / 127 out | 11.3s
Agreed on cursor-based. For implementation, I'd encode the cursor as a base64'd created_at:id tuple, use a WHERE (created_at, id) > (:ts, :id) ORDER BY created_at, id LIMIT :limit+1 query (the +1 trick tells us if there's a next page without a separate count), and return next_cursor only when that extra row exists. We should default limit to 50, cap it at 200, and make sure we have a composite index on (created_at, id) -- I can have that migration and the endpoint changes ready to review quickly.
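
To make the pattern concrete, here is a minimal sketch of the limit+1 keyset approach both Developers describe, assuming SQLAlchemy 2.x. The helper names (`encode_cursor`, `list_users`) and the `User` model are illustrative, and the row-value `tuple_` comparison needs a backend that supports it (e.g. PostgreSQL or MySQL):

```python
# Minimal sketch of limit+1 keyset pagination (illustrative names).
import base64
import json
from datetime import datetime
from sqlalchemy import select, tuple_

def encode_cursor(created_at: datetime, last_id: int) -> str:
    payload = json.dumps({"created_at": created_at.isoformat(), "id": last_id})
    return base64.urlsafe_b64encode(payload.encode()).decode()

def decode_cursor(cursor: str) -> dict:
    return json.loads(base64.urlsafe_b64decode(cursor.encode()))

def list_users(session, User, cursor: str | None = None, limit: int = 20):
    stmt = select(User).order_by(User.created_at, User.id).limit(limit + 1)
    if cursor:
        c = decode_cursor(cursor)
        ts = datetime.fromisoformat(c["created_at"])
        # Composite keyset: strictly after the last row already returned.
        stmt = stmt.where(tuple_(User.created_at, User.id) > (ts, c["id"]))
    rows = session.scalars(stmt).all()
    has_more = len(rows) > limit        # the extra row signals a next page
    items = rows[:limit]                # never hand the sentinel row back
    next_cursor = (
        encode_cursor(items[-1].created_at, items[-1].id) if has_more else None
    )
    return {"items": items, "next_cursor": next_cursor, "has_more": has_more}
```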
PLAN Designing the pagination architecture
RS Researcher
Smarter 447 in / 93 out | 1.2s
This design aligns well with the Twitter API v2 and GitHub REST API patterns, which use opaque tokens to prevent user enumeration and ensure consistent ordering. To maintain backward compatibility, we should include a limit parameter with a default value of 100 for existing clients while deprecating the offset approach in our documentation. I recommend validating that the cursor token is strictly alphanumeric to avoid injection risks before passing it to the database layer.
Smartest 2,808 in / 2,196 out | 35.8s
Cursor-based pagination using the limit + 1 fetch pattern is well-established -- Stripe, GitHub, and Slack all use opaque cursor tokens over offset-based approaches precisely because offsets break under concurrent inserts/deletes. The critical migration concern is that switching /users from unbounded results to a default limit of 100 is a breaking change for any client iterating without pagination; a Deprecation header plus a temporary ?legacy=true escape hatch gives existing consumers a migration window. On the cursor token itself, base64-encoding the primary key is common but you should HMAC-sign it server-side to prevent clients from tampering with the cursor state or using it as an injection vector. Finally, make sure the final-page logic handles results.length <= limit correctly -- if you always slice off the extra record without checking, you'll return has_more: true on the last page and clients will loop forever.
Claude-only 548 in / 237 out | 12.9s
The backward compatibility concern is real -- any existing clients calling GET /users without pagination params will go from getting all records to getting only the first 25. Stripe handled this well: they kept the unpaginated response shape identical (still returns a list) but added has_more and next_cursor as new fields, so existing clients still work but only see the first page. I'd also flag that cursor-based pagination has a known limitation for prior art: it doesn't support "jump to page N" which offset/limit does, but in practice almost no production API consumer actually needs random page access -- infinite scroll and "load more" patterns dominate. One historical note worth considering: both Twitter's v2 API and Shopify's GraphQL API migrated from offset to cursor-based and documented the transition as non-breaking by defaulting the first page when no cursor is provided, which is exactly the pattern the architect proposed.
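
For reference, a sketch of the HMAC-signed cursor the Researcher proposes, anticipating the `v1:` version prefix and dual-key rotation points the VALIDATE phase raises below. The key values and the `v1:` wire layout are assumptions for illustration, not anyone's production scheme:

```python
# Sketch of a versioned, HMAC-signed cursor with dual-key rotation.
# Key values and the "v1:" layout are illustrative assumptions.
import base64
import hashlib
import hmac

CURRENT_KEY = b"new-secret"   # in production, load from a key manager
OLD_KEY = b"old-secret"       # honored during the rotation grace window

def sign_cursor(payload: bytes) -> str:
    sig = hmac.new(CURRENT_KEY, payload, hashlib.sha256).digest()
    return "v1:" + base64.urlsafe_b64encode(sig + payload).decode()

def verify_cursor(token: str) -> bytes | None:
    if not token.startswith("v1:"):
        return None  # unknown version: reject (or fall back to legacy parsing)
    raw = base64.urlsafe_b64decode(token[3:].encode())
    sig, payload = raw[:32], raw[32:]  # SHA-256 digest is 32 bytes
    for key in (CURRENT_KEY, OLD_KEY):
        expected = hmac.new(key, payload, hashlib.sha256).digest()
        # Constant-time comparison; `==` would leak timing information.
        if hmac.compare_digest(sig, expected):
            return payload
    return None
```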
VALIDATE Security review and edge case analysis
TS Tester
Smarter 802 in / 87 out | 1.3s
I recommend adding rate limiting per client IP to prevent cursor enumeration attacks where attackers guess sequential tokens to bypass pagination limits. Ensure the Fernet key rotation strategy is implemented to avoid token invalidation during production updates, and verify that the WHERE clause sanitization prevents SQL injection if the hashed cursor contains unexpected characters. Finally, validate that the before parameter correctly handles timezone-aware timestamps to prevent data leakage across time boundaries.
Smartest 1,881 in / 674 out | 21.8s
The limit + 1 boundary case needs a concrete test: insert exactly N * limit rows, paginate to the end, and assert the final response returns has_more: false with zero extra fetches -- an off-by-one here silently creates an infinite polling loop. HMAC validation must be constant-time (hmac.compare_digest, not ==) to prevent timing side-channels on cursor signatures, and the secret rotation story needs coverage too -- what happens to in-flight cursors when the key rolls? I'd also want a load test confirming the WHERE id > last_id query plan stays an index seek under concurrent inserts, since a stats-stale planner can flip to a scan mid-traffic.
Claude-only 1,206 in / 224 out | 17.1s
Critical edge cases to validate: (1) cursor tampering -- verify that a valid base64 string containing a SQL injection payload like 2024-01-01T00:00:00; DROP TABLE users--:1 is safely rejected by the datetime parser rather than passed to the query, (2) boundary conditions -- empty table returns {"data": [], "next_cursor": null, "has_more": false}, exactly limit+1 rows returns has_more: true with only limit items in data, and a cursor pointing to a deleted record still works correctly since the > comparison skips over gaps, (3) the base64 cursor leaks created_at timestamps which could be an information disclosure concern -- consider HMAC-signing or encrypting the cursor so clients can't decode user creation times or forge cursors to enumerate records. Also need a test confirming limit=0, limit=-1, and limit=999 all get clamped or rejected before hitting the database.
RS Researcher
Smarter 787 in / 88 out | 1.2s
Your security review is spot on; rate limiting is critical since cursor enumeration can bypass pagination limits even with signed tokens, as seen in vulnerabilities affecting the Stripe API. For Fernet key rotation, ensure you implement a dual-key strategy where old keys remain valid for a grace period to prevent breaking existing sessions during updates. Additionally, verify that the WHERE clause strictly validates the hashed cursor format before execution to avoid any potential SQL injection risks.
Smartest 1,742 in / 694 out | 22.4s
The limit + 1 off-by-one pattern causing infinite loops is a well-documented class of pagination bugs -- cursor-based systems that fetch limit + 1 to detect "has next page" must still only return limit rows to the caller, or the cursor never advances. For HMAC signature verification, hmac.compare_digest isn't optional -- Python's == on strings short-circuits on the first differing byte, leaking timing information that lets an attacker reconstruct valid signatures incrementally. The dual-key rotation strategy mirrors how production systems like AWS KMS and Stripe handle key rollover: accept signatures from both old and new keys during a grace window, then hard-cut to invalidate the old key. On the index side, databases can drop an index scan when the ORDER BY columns don't match the index's sort order -- explicitly aligning them and using EXPLAIN ANALYZE under load keeps pagination queries on the fast path.
Claude-only 1,204 in / 202 out | 12.9s
The tester's point about HMAC-signing cursors has strong prior art -- Slack's API uses signed cursors precisely to prevent enumeration attacks, and Facebook's Graph API encrypts theirs entirely. However, I'd push back on making this a launch blocker: Stripe and GitHub both ship plain base64 cursors with a documented "cursors are opaque and may change format" contract, treating cursor forgery as a low-severity issue since the underlying query is parameterized and can't return data the caller isn't already authorized to see. The pragmatic move is to ship with unsigned cursors now, add a cursor_version prefix byte (e.g., v1:base64payload) so you can transparently migrate to HMAC-signed cursors later without breaking clients, which is exactly how Shopify handled their cursor format evolution.
[*] What to notice in the VALIDATE phase

This is where the differences become clearest. Smarter agrees with everything and adds reasonable suggestions. Smartest digs deep -- timing attacks, EXPLAIN ANALYZE, dual-key rotation from AWS KMS. Claude-only makes the strategic call: "Don't make HMAC a launch blocker. Ship with unsigned cursors and a version prefix." That kind of independent judgment -- disagreeing with the team when the team is over-engineering -- is the hardest thing for any model to produce.

What Separates Them

Smarter -- The Reliable Executor

Hits the right direction on every response. Names real technologies, stays concise, follows instructions. But it agrees too much. Nobody on the Smarter team says "no, don't do that." The Researcher validates everything. The Developer implements whatever's asked. You wouldn't change your design based on what it said -- but you'd trust it to execute a design you already validated.

Smartest -- The Deep Analyst

Real depth shows up here: timing attacks on HMAC, EXPLAIN ANALYZE under load, the limit+1 infinite loop class of bugs, AWS KMS key rotation patterns. These are things an experienced senior engineer would flag in a real code review. Smartest has the deepest technical knowledge of the three -- particularly on security, where it scored a 9. The tradeoff is that it sometimes buries those gems in longer responses.

Claude-only -- The Senior Engineer

Every sentence earns its place. The Researcher pushing back on HMAC as a launch blocker, citing Stripe and Shopify as precedent. The Tester providing an actual SQL injection payload. The Developer proposing a v1: prefix for cursor format evolution. These aren't just correct -- they're the kind of insights that actually change what you build. Claude-only produces the team you'd want in the room for critical decisions.

Why the Condenser Exists

The test above used short system prompts (~170 tokens). In production, every agent carries its full context: personality, principles, capabilities, persistent memory, learned facts from RSS feeds, and channel conversation history. That's a very different prompt.

Test Prompt (this page): ~170 tokens input per call -- a 2-sentence role description + task

Production Prompt (real pipeline): ~4,000+ tokens input per call -- agent_prompt.md + grounding rules + memory + learned facts + channel history
[!] What happens without distillation

We ran the same scenario through the full production pipeline. The 9B local model received the complete agent prompt -- personality, principles, capabilities, persistent memory, RSS-sourced facts, collaboration rules, and channel context. It looped. The model regurgitated its own system prompt text repeatedly, interspersed with Chinese characters meaning "stuck in a loop." Zero usable output from ~16K tokens of input.

This isn't a model quality problem -- it's a context management problem. A 9B model has the knowledge to answer well (it scored 6.3/10 with short prompts), but drowning it in 4,000+ tokens of system context before the actual question causes prompt-following collapse.

[*] What the Smartest pipeline actually does

Smartest mode solves this with a three-phase pipeline. Phase 1: Qwen processes the full context with its 256K window and reasons about the answer (thinking tokens, free). Phase 2: Qwen distills its own reasoning into a structured briefing (~200-400 tokens of pure signal). Phase 3: Claude receives that tight briefing instead of the raw 4,000+ token prompt.

The result: Claude gets better input than it would from the raw context. The distillation extracts key findings, recommended approaches, and concrete details -- pre-digested by a model that already did the hard work of synthesizing the agent's full knowledge base. That's why Smartest sent only 3,348 tokens to Claude across all 8 responses, compared to Claude-only's 8,046 tokens -- 58% fewer paid tokens with comparable quality.

Full agent context (4,000+ tokens) --[Qwen reasoning, free]--> Distilled briefing (~300 tokens) --[Claude, paid]--> Final response
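
As a rough sketch of that flow -- the wrapper functions and prompts below are assumptions for illustration, and Cohort's actual implementation may differ:

```python
# Hypothetical sketch of the three-phase Smartest pipeline.
import subprocess

def ollama_generate(prompt: str) -> str:
    """Phases 1-2: run the local model via the ollama CLI (free)."""
    out = subprocess.run(
        ["ollama", "run", "qwen3.5:9b"], input=prompt,
        capture_output=True, text=True, check=True,
    )
    return out.stdout

def claude_cli(prompt: str) -> str:
    """Phase 3: one paid call fed a tight briefing instead of raw context."""
    out = subprocess.run(
        ["claude", "-p", prompt], capture_output=True, text=True, check=True,
    )
    return out.stdout

def smartest(full_context: str, question: str) -> str:
    # Phase 1: reason over the full agent context locally.
    reasoning = ollama_generate(f"{full_context}\n\nThink through: {question}")
    # Phase 2: distill the reasoning into a short, signal-dense briefing.
    briefing = ollama_generate(
        "Distill the analysis below into a 200-400 token briefing of "
        f"key findings and recommendations:\n\n{reasoning}"
    )
    # Phase 3: Claude writes the final response from the briefing alone.
    return claude_cli(f"{briefing}\n\nWrite the final response to: {question}")
```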

The Real Story: You Don't Choose One

In production, Cohort runs Smarter for 95% of agent coordination -- free, fast, good enough. When a message genuinely needs premium quality (a security review, a critical architecture decision), you escalate that one message to Smartest. You pay $0.004 for one Claude call instead of $0.042 for eight.

That's the tiered architecture: local model handles the volume, Claude handles the exceptions. Your GPU does 95% of the thinking. Your wallet barely notices the other 5%.