THOL — Token-Harness Optimizer Leaderboard

An open, reproducible, end-to-end benchmark of tools that claim to cut the token cost of coding agents. We measure the only thing that matters: did the task succeed, and what did the whole session actually cost, with and without each optimizer? Every run uses the Claude Sonnet model, with 10 runs per task per tool on a single pinned Claude Code version.


What this table shows. How much each tool changes the cost of a coding session versus plain Claude Code, on long sessions only — tasks where vanilla Claude Code burns more than 200,000 tokens. A positive number means cheaper; negative means more expensive.

Why long sessions: most real work with a coding agent is long and multi-step; short throwaway tasks aren't representative — every tool lands on the same near-zero cost there, so only fixed overhead shows.

With more budget we'd like to push this to multi-million-token sessions; for now this is the regime we can afford to measure, and we add new tools in batches as budget allows.

How to read this

Cost reduction = how much cheaper an optimizer makes a session versus vanilla Claude Code, in end-to-end USD, aggregated across tasks as a geometric mean (so no single task dominates). Positive = cheaper, negative = more expensive, 0% = no difference; the control (vanilla Claude Code) sits at the 0% line.
Adoption = share of runs in which the agent actually invoked the optimizer's tools; a tool the agent never calls cannot save anything. It's shown as N/A for tools the agent doesn't explicitly call: rtk acts via an automatic hook, and the prompt/context tools (lean-ctx, claude-token-efficient) are just text. (tokenade also has a hook, but it additionally exposes CLI functions that the agent calls by hand, such as map, skeleton, query, exec…, so its adoption counts those.)
Every raw per-run measurement is in results.json for your own analysis.
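To make the two definitions above concrete, here is a minimal Python sketch. It assumes each task has already been reduced to one ratio (tool cost divided by vanilla cost, in USD) and that each run record carries a count of optimizer tool calls. The field name tool_calls and all numbers are hypothetical and are not the actual results.json schema.

```python
import math

def cost_reduction(ratios):
    """Geometric mean of per-task (tool cost / vanilla cost) ratios,
    returned as a reduction fraction: positive = cheaper."""
    log_mean = sum(math.log(r) for r in ratios) / len(ratios)
    return 1.0 - math.exp(log_mean)

def adoption(runs):
    """Share of runs in which the agent called the optimizer at least once.
    'tool_calls' is a hypothetical field name used for illustration."""
    return sum(1 for run in runs if run["tool_calls"] > 0) / len(runs)

# Hypothetical example: the tool halves the cost of one task and doubles another.
ratios = [0.5, 2.0]
reduction = cost_reduction(ratios)          # about 0.0: the two effects cancel
arithmetic = 1.0 - sum(ratios) / len(ratios)  # -0.25: biased by the large ratio

runs = [{"tool_calls": 3}, {"tool_calls": 0}, {"tool_calls": 0}, {"tool_calls": 1}]
share = adoption(runs)                      # 0.5: two of four runs used the tool
```

The halve-one, double-the-other case shows why ratios are averaged geometrically: the arithmetic mean would report the tool as 25% more expensive even though the two effects are symmetric.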

What we found so far

Used as documented, most of these tools don't beat plain Claude Code — several make it more expensive. The dominant reason is adoption: the agent simply doesn't call the optimizer's tools often enough for them to pay back their own overhead. A tool that adds a CLI, an MCP server or a prompt the agent ignores is pure cost.

Where any tool does help, it's on long, expensive sessions (see the session-size split below) — never on the short ones. New tools are added over time as token budget allows; this benchmark is expensive to run, so the board grows in batches rather than all at once.

Why most tools do poorly — despite great compression numbers

Almost every tool here advertises an impressive reduction (−58%, −90%, even −99%). Those numbers are usually real but narrow: they measure one of the tool's own functions in isolation (its compressor run on a fixed blob, or its search run on a fixed query) under tightly controlled conditions. That is not how an agent behaves on a real task. In real conditions, four things an isolated compression benchmark never sees usually erase the saving or reverse it:

Adoption is hard, and mis-adoption backfires. Getting a model to actually call an MCP/CLI tool is difficult; getting a net win is harder still, because the tool has to be used in the right context. When it isn't, the agent gets a poor result, falls back to its normal way of working (re-reading, re-searching) — and learns the tool is unhelpful, so it stops calling it. You pay the tokens for the failed call and lose future adoption. (We even tried forcing adoption with a verbose system prompt — it still cost more; see the GSP experiment below.)
Lossy output compression makes the agent re-fetch. For tools that compress command/CLI output, if the compaction drops the bytes the model actually needed, the model simply re-runs the command — often bypassing the tool — to recover them. Net result: more turns, more tokens, the opposite of the advertised saving.
Overhead is paid on every turn. Adding an MCP server or a system prompt injects tokens at the start of every conversation, and they are re-counted on every agent turn. Most of these are cache reads, billed at a reduced rate — but still billed. Over a long session this standing cost quietly accumulates, and for a tool the agent rarely uses it is pure loss.
Some tools break the context cache. A proxy like Headroom rewrites the growing conversation history on every turn, so the cached prefix no longer matches byte-for-byte. That forces the model to re-read the whole context as fresh input — billed at the full rate instead of the cached rate, which is ~10× cheaper. The few tokens its compression saves are dwarfed by the cache it destroys.
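The last two effects can be illustrated with a toy cost model. This is a sketch under stated assumptions, not THOL's accounting: cost is in relative units where one fresh input token costs 1 and a cached token costs 0.1 (the roughly 10× gap mentioned above), cache-write surcharges and output tokens are ignored, and every parameter name and number is hypothetical.

```python
def session_input_cost(turns, base_tokens, growth_per_turn, overhead_tokens=0,
                       cached_ratio=0.1, cache_intact=True, compression_saving=0.0):
    """Relative input cost of a session whose context grows every turn.

    cache_intact=True : the previous turn's context is re-read at the cached
                        rate and only newly added tokens are billed as fresh.
    cache_intact=False: the prefix is rewritten each turn, so the (optionally
                        compressed) context is billed as fresh on every turn.
    """
    total = 0.0
    prefix = 0  # tokens already sent on the previous turn
    for turn in range(turns):
        context = overhead_tokens + base_tokens + growth_per_turn * turn
        if cache_intact:
            total += prefix * cached_ratio + (context - prefix)
        else:
            total += context * (1.0 - compression_saving)
        prefix = context
    return total

# Hypothetical 50-turn session: 20k starting context, 2k new tokens per turn.
vanilla = session_input_cost(50, 20_000, 2_000)
# An unused tool that injects 5k tokens of schema or prompt into every turn.
with_overhead = session_input_cost(50, 20_000, 2_000, overhead_tokens=5_000)
# A proxy that compresses context by 30% but invalidates the cache every turn.
cache_broken = session_input_cost(50, 20_000, 2_000, cache_intact=False,
                                  compression_saving=0.3)
```

Under these assumptions the standing overhead is re-billed on every turn even at the cached rate, and the cache-breaking variant costs several times more than vanilla despite sending 30% fewer tokens.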

Together, this is why a tool can headline "−90% tokens" and still make a real session more expensive end to end — which is exactly what THOL measures.

Method & reproduction

Each run is one fully isolated headless Claude Code session (throwaway HOME, throwaway /tmp workspace, --strict-mcp-config), single pinned model and Claude Code version, deterministic fixtures, programmatic verifiers scoring against ground truth never present in the workspace. Every step, manifest and verifier is in the repository. → Full method & exact reproduction steps in the README.
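As a rough illustration of what such isolation can look like, the sketch below builds the command, environment and working directory for one headless session. It is not the repository's harness: the function name, directory prefixes and the choice to pass through only PATH and an API key are assumptions, and the authoritative steps are those in the README.

```python
import os
import tempfile

def build_isolated_run(prompt, mcp_config_path, model):
    """Return (cmd, env, workspace) for one throwaway headless session.

    Hypothetical helper: the real harness may set things up differently.
    """
    home = tempfile.mkdtemp(prefix="thol-home-")      # throwaway HOME
    workspace = tempfile.mkdtemp(prefix="thol-ws-")   # throwaway workspace
    env = {"PATH": os.environ.get("PATH", ""), "HOME": home}
    if "ANTHROPIC_API_KEY" in os.environ:
        env["ANTHROPIC_API_KEY"] = os.environ["ANTHROPIC_API_KEY"]
    cmd = [
        "claude", "-p", prompt,           # headless (print) mode
        "--model", model,                 # pinned model identifier
        "--mcp-config", mcp_config_path,  # only the servers under test
        "--strict-mcp-config",            # ignore any other MCP configuration
        "--output-format", "json",
    ]
    return cmd, env, workspace

cmd, env, workspace = build_isolated_run("fix the failing test", "mcp.json", "sonnet")
```

A caller would then pass these to something like subprocess.run(cmd, env=env, cwd=workspace, capture_output=True, text=True) and delete both directories afterwards; the prompt, config path and model alias shown are placeholders.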

Task selection. We dropped tasks where vanilla Claude Code finishes in under ~5 turns on average: they're too trivial to separate the tools (every tool lands on the same near-zero cost, so only fixed overhead shows) and they don't resemble real agent work. The board scores only the substantive tasks that remain.

Aggregation. Costs are aggregated as a geometric mean of per-task ratios, the standard way to average ratios (it avoids the upward bias of an arithmetic mean), and a cost-weighted total of all sessions agrees with it to within a point. Every successful run is used as-is and all raw per-run data is published for re-analysis; bootstrap 95% confidence intervals for the headline figures are in results.json.
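For readers re-analysing the raw data, a percentile bootstrap over per-task ratios can be sketched as follows. How THOL resamples (by task, by run, or both) is not stated here, so resampling tasks with replacement is an assumption, and the ratios are made-up illustration values.

```python
import math
import random

def geomean(values):
    """Geometric mean, computed in log space."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def bootstrap_ci(ratios, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the geometric mean of cost ratios.

    Resamples tasks with replacement (an assumption, see lead-in).
    """
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    stats = sorted(
        geomean(rng.choices(ratios, k=len(ratios))) for _ in range(n_resamples)
    )
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Hypothetical per-task ratios (tool cost / vanilla cost).
ratios = [0.9, 1.1, 1.3, 0.8, 1.05, 1.2]
point = geomean(ratios)
lower, upper = bootstrap_ci(ratios)
```

If the interval contains 1.0, the data does not separate the tool from vanilla Claude Code at that confidence level.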

Impartiality

This benchmark is maintained by the author of one of the measured tools (tokenade). The conflict is handled by construction: the harness cannot tell competitors apart, install steps follow each tool's own documentation, verifiers are frozen before any run and check task outcomes (not tool behaviour), the control is a first-class entry whose variance bounds every claim, and all raw data — including runs unfavourable to any tool, tokenade included — is published unedited.

Tools benchmarked

Every optimizer measured, linked to its source. tokenade is the tool maintained by this benchmark's author; it is held to the same blind pipeline as every other entry.