An open, reproducible, end-to-end benchmark of tools that claim to cut the token cost of coding agents. We measure the only thing that matters: did the task succeed, and what did the whole session actually cost, with and without each optimizer. Every run uses the Claude Sonnet model on a single pinned Claude Code version, with 10 runs per task per tool.
What this table shows. How much each tool changes the cost of a coding session versus plain Claude Code, on long sessions only — tasks where vanilla Claude Code burns more than 200,000 tokens. A positive number means cheaper; negative means more expensive.
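To make the sign convention concrete, here is a minimal sketch of how one table cell could be derived. The function names and the example figures are illustrative assumptions, not the benchmark's actual code; only the 200,000-token threshold and the "positive means cheaper" rule come from the text above.

```python
LONG_SESSION_TOKENS = 200_000  # threshold stated in the text


def percent_saving(vanilla_cost, tool_cost):
    """Positive = cheaper than plain Claude Code, negative = more expensive."""
    return (1.0 - tool_cost / vanilla_cost) * 100.0


def long_tasks(vanilla_tokens_by_task):
    """Keep only tasks where the vanilla session burns more than the threshold."""
    return [
        task
        for task, tokens in vanilla_tokens_by_task.items()
        if tokens > LONG_SESSION_TOKENS
    ]


# Made-up costs, for illustration only.
print(percent_saving(2.00, 1.50))  # 25.0  (tool is cheaper)
print(percent_saving(2.00, 2.50))  # -25.0 (tool is more expensive)
```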
Why long sessions: most real work with a coding agent is long and multi-step; short throwaway tasks aren't representative — every tool lands on the same near-zero cost there, so only fixed overhead shows.
All runs use the Claude Sonnet model. With more budget we'd like to push this to multi-million-token sessions; for now this is the regime we can afford to measure, and we add new tools in batches as budget allows.
results.json for your own analysis.
Used as documented, most of these tools don't beat plain Claude Code; several make it more expensive. The dominant reason is adoption: the agent simply doesn't call the optimizer's tools often enough for them to pay back their own overhead. A tool that adds a CLI, an MCP server or a prompt the agent ignores is pure cost.
Where any tool does help, it's on long, expensive sessions (see the session-size split below) — never on the short ones. New tools are added over time as token budget allows; this benchmark is expensive to run, so the board grows in batches rather than all at once.
Almost every tool here advertises an impressive reduction (−58%, −90%, even −99%). Those numbers are usually real but narrow: they measure one of the tool's own functions in isolation — its compressor run on a fixed blob, or its search run on a fixed query — under tightly controlled conditions. That is not how an agent behaves on a real task. In real conditions, three things an isolated compression benchmark never sees usually erase the saving or reverse it:
Together, this is why a tool can headline "−90% tokens" and still make a real session more expensive end to end — which is exactly what THOL measures.
Each run is one fully isolated headless Claude Code session (throwaway HOME, throwaway
/tmp workspace, --strict-mcp-config) on a single pinned model and Claude Code version,
with deterministic fixtures and programmatic verifiers that score against ground truth never present in the workspace.
Every step, manifest and verifier is in the repository.
→ Full method & exact reproduction steps in the README.
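As a rough illustration of the isolation idea (not the harness itself, which lives in the repository), the sketch below runs a command with a throwaway HOME and a throwaway working directory. The helper name `run_isolated` and the directory prefixes are hypothetical, and a stand-in Python command replaces the agent CLI so the example runs anywhere.

```python
import os
import subprocess
import sys
import tempfile


def run_isolated(command, timeout=600):
    """Run `command` with a throwaway HOME and a throwaway working directory."""
    with tempfile.TemporaryDirectory(prefix="home-") as home, \
         tempfile.TemporaryDirectory(prefix="work-") as workspace:
        env = dict(os.environ)
        env["HOME"] = home  # nothing from the real user profile leaks in
        result = subprocess.run(
            command,
            cwd=workspace,
            env=env,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        # Both directories are deleted when the `with` block exits.
        return result.returncode, result.stdout


# Stand-in command (assumption): a real harness would launch the headless
# agent here, passing --strict-mcp-config as described above.
code, out = run_isolated(
    [sys.executable, "-c", "import os; print(os.environ['HOME']); print(os.getcwd())"]
)
```

Because both directories are created fresh and removed afterwards, no state from one run can carry over into the next.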
Task selection. We dropped tasks where vanilla Claude Code finishes in under ~5 turns on
average: they're too trivial to separate the tools (every tool lands on the same near-zero cost — only fixed
overhead shows) and they don't resemble real agent work. The board scores only the substantive tasks that
remain.
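The selection rule can be sketched as a simple filter. The task names and turn counts below are invented, and the cut-off is written as exactly 5 although the text only gives it as approximate.

```python
from statistics import mean

MIN_MEAN_TURNS = 5  # the text says "~5", so treat this as an approximation


def substantive_tasks(vanilla_turns):
    """Return tasks whose vanilla runs average at least MIN_MEAN_TURNS turns."""
    return sorted(
        task
        for task, turns in vanilla_turns.items()
        if mean(turns) >= MIN_MEAN_TURNS
    )


# Hypothetical per-run turn counts for two tasks.
example = {
    "rename-var": [2, 3, 2],
    "refactor-module": [14, 18, 11],
}
print(substantive_tasks(example))  # ['refactor-module']
```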
Aggregation. Costs are aggregated as a geometric mean of per-task ratios, the standard way to average
ratios (it avoids the upward bias of an arithmetic mean). A cost-weighted total of all sessions agrees with
it to within one percentage point. Every successful run is used as-is and all raw per-run data is published for re-analysis;
bootstrap 95% confidence intervals for the headline figures are in results.json.
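The aggregation can be sketched as follows. This is an illustration under assumptions, not the published pipeline: it assumes each ratio is tool cost divided by vanilla cost for one task, and that the bootstrap resamples tasks and takes percentile bounds. The source does not say which bootstrap variant is used.

```python
import math
import random


def geometric_mean_ratio(ratios):
    """Geometric mean of per-task cost ratios (tool cost / vanilla cost)."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


def cost_weighted_ratio(tool_costs, vanilla_costs):
    """Cross-check: total tool cost over total vanilla cost, across all sessions."""
    return sum(tool_costs) / sum(vanilla_costs)


def bootstrap_ci(ratios, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the geometric mean (assumed variant)."""
    rng = random.Random(seed)
    stats = sorted(
        geometric_mean_ratio(rng.choices(ratios, k=len(ratios)))
        for _ in range(n_resamples)
    )
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return stats[lo_idx], stats[hi_idx]


# Why not an arithmetic mean: a halving and a doubling should cancel out.
ratios = [0.5, 2.0]
print(geometric_mean_ratio(ratios))  # 1.0
print(sum(ratios) / len(ratios))     # 1.25, biased upward
```

A geometric mean below 1.0 means the tool is cheaper on a typical task; comparing it with `cost_weighted_ratio` shows whether a few very expensive sessions tell a different story.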
This benchmark is maintained by the author of one of the measured tools (tokenade). The conflict is handled by construction: the harness cannot tell competitors apart, install steps follow each tool's own documentation, verifiers are frozen before any run and check task outcomes (not tool behaviour), the control is a first-class entry whose variance bounds every claim, and all raw data — including runs unfavourable to any tool, tokenade included — is published unedited.
Every optimizer measured, linked to its source. tokenade is the tool maintained by this
benchmark's author; it is held to the same blind pipeline as every other entry.