THOL — Token-Harness Optimizer Leaderboard

An open, reproducible, end-to-end benchmark of tools that claim to cut the token cost of coding agents. We measure the only thing that matters: did the task succeed, and what did the whole session actually cost, with and without each optimizer. Every run uses the Claude Sonnet model on a single pinned Claude Code version, with 10 runs per task per tool.


What this table shows. How much each tool changes the cost of a coding session versus plain Claude Code, on long sessions only — tasks where vanilla Claude Code burns more than 200,000 tokens. A positive number means cheaper; negative means more expensive.
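The sign convention can be sketched in a few lines. This is a minimal illustration, not the harness code: the function names are hypothetical, and it assumes the table figure is the relative change in session cost against the plain Claude Code baseline.

```python
# Hypothetical helper names; only the 200,000-token threshold and the
# sign convention (positive = cheaper) come from the text above.
LONG_SESSION_TOKENS = 200_000


def is_long_session(vanilla_tokens: int) -> bool:
    """A task counts as 'long' when plain Claude Code burns more than 200,000 tokens."""
    return vanilla_tokens > LONG_SESSION_TOKENS


def saving_pct(vanilla_cost: float, tool_cost: float) -> float:
    """Percent change versus plain Claude Code; positive means the tool is cheaper."""
    if vanilla_cost <= 0:
        raise ValueError("vanilla_cost must be positive")
    return 100.0 * (vanilla_cost - tool_cost) / vanilla_cost
```

Under this assumed definition, a session that costs 2.00 without the tool and 1.50 with it scores +25, and one that costs 2.50 with it scores -25.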

Why long sessions: most real work with a coding agent is long and multi-step; short throwaway tasks aren't representative — every tool lands on the same near-zero cost there, so only fixed overhead shows.

With more budget we'd like to push this to multi-million-token sessions; for now this is the regime we can afford to measure, and we add new tools in batches as budget allows.

How to read this

What we found so far

Used as documented, most of these tools don't beat plain Claude Code — several make it more expensive. The dominant reason is adoption: the agent simply doesn't call the optimizer's tools often enough for them to pay back their own overhead. A tool that adds a CLI, an MCP server or a prompt the agent ignores is pure cost.

Where any tool does help, it's on long, expensive sessions (see the session-size split below) — never on the short ones. New tools are added over time as token budget allows; this benchmark is expensive to run, so the board grows in batches rather than all at once.

Why most tools do poorly — despite great compression numbers

Almost every tool here advertises an impressive reduction (−58%, −90%, even −99%). Those numbers are usually real but narrow: they measure one of the tool's own functions in isolation (its compressor run on a fixed blob, or its search run on a fixed query) under tightly controlled conditions. That is not how an agent behaves on a real task. In real conditions, four things an isolated compression benchmark never sees usually erase the saving or reverse it:

  1. Adoption is hard, and mis-adoption backfires. Getting a model to actually call an MCP/CLI tool is difficult; getting a net win is harder still, because the tool has to be used in the right context. When it isn't, the agent gets a poor result, falls back to its normal way of working (re-reading, re-searching) — and learns the tool is unhelpful, so it stops calling it. You pay the tokens for the failed call and lose future adoption. (We even tried forcing adoption with a verbose system prompt — it still cost more; see the GSP experiment below.)
  2. Lossy output compression makes the agent re-fetch. For tools that compress command/CLI output, if the compaction drops the bytes the model actually needed, the model simply re-runs the command — often bypassing the tool — to recover them. Net result: more turns, more tokens, the opposite of the advertised saving.
  3. Overhead is paid on every turn. Adding an MCP server or a system prompt injects tokens at the start of every conversation, and they are re-counted on every agent turn. Most of these are cache reads, billed at a reduced rate — but still billed. Over a long session this standing cost quietly accumulates, and for a tool the agent rarely uses it is pure loss.
  4. Some tools break the context cache. A proxy like Headroom rewrites the growing conversation history on every turn, so the cached prefix no longer matches byte-for-byte. That forces the model to re-read the whole context as fresh input — billed at the full rate instead of the cached rate, which is ~10× cheaper. The few tokens its compression saves are dwarfed by the cache it destroys.

Together, this is why a tool can headline "−90% tokens" and still make a real session more expensive end to end — which is exactly what THOL measures.
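A toy cost model makes points 3 and 4 concrete. It is a sketch under stated assumptions, not THOL's accounting: the function and parameter names are hypothetical, costs are counted in "full-rate input token" units, cached reads are priced at one tenth of the full rate (the ~10× figure quoted above), and cache-write surcharges and output tokens are ignored.

```python
def session_input_cost(turns, growth_per_turn, overhead_tokens=0,
                       cache_hit=True, full_rate=1.0, cached_rate=0.1):
    """Input cost of a session, in full-rate token units (toy model).

    Each turn re-sends the whole context so far. With an intact cache the
    existing prefix is billed at cached_rate and only the new tokens at
    full_rate; with a broken cache everything is billed at full_rate.
    """
    total = 0.0
    context = overhead_tokens  # standing prefix: MCP tool specs, system prompt
    for _ in range(turns):
        if cache_hit:
            total += context * cached_rate + growth_per_turn * full_rate
        else:
            total += (context + growth_per_turn) * full_rate
        context += growth_per_turn
    return total


# Assumed scenario: 50 turns, 2,000 new context tokens per turn.
baseline = session_input_cost(50, 2_000)
# Same session with a 5,000-token standing overhead the agent never uses.
with_overhead = session_input_cost(50, 2_000, overhead_tokens=5_000)
# A proxy that halves every turn's tokens but invalidates the cache.
halved_no_cache = session_input_cost(50, 1_000, cache_hit=False)
```

In this toy scenario the baseline comes to about 345,000 units, the unused 5,000-token overhead raises it to about 370,000, and the cache-breaking proxy costs about 1,275,000 even though it halves the tokens in the context. The exact figures depend entirely on the assumed rates and session shape; the point is only the direction of the effect.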

Method & reproduction

Each run is one fully isolated headless Claude Code session (throwaway HOME, throwaway /tmp workspace, --strict-mcp-config), single pinned model and Claude Code version, deterministic fixtures, programmatic verifiers scoring against ground truth never present in the workspace. Every step, manifest and verifier is in the repository. → Full method & exact reproduction steps in the README.
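For orientation, an isolated invocation along these lines might be assembled as follows. This is only a sketch: the real harness lives in the repository, and the helper name, directory prefixes, and the choice to pass through ANTHROPIC_API_KEY are assumptions made for illustration. The claude flags used (-p, --output-format, --mcp-config, --strict-mcp-config) are existing Claude Code CLI options.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def build_isolated_run(prompt: str, mcp_config: Optional[str] = None):
    """Return (cmd, env, workspace) for one headless, isolated session.

    Hypothetical helper: it only prepares the invocation and does not run it.
    """
    home = Path(tempfile.mkdtemp(prefix="thol-home-"))     # throwaway HOME
    workspace = Path(tempfile.mkdtemp(prefix="thol-ws-"))  # throwaway workspace

    # Minimal environment so no user-level config or MCP servers leak in.
    env = {"HOME": str(home), "PATH": os.environ.get("PATH", "")}
    if "ANTHROPIC_API_KEY" in os.environ:  # assumption: key-based auth
        env["ANTHROPIC_API_KEY"] = os.environ["ANTHROPIC_API_KEY"]

    cmd = ["claude", "-p", prompt, "--output-format", "json",
           "--strict-mcp-config"]
    if mcp_config is not None:
        # With --strict-mcp-config, only servers from this file are loaded.
        cmd += ["--mcp-config", mcp_config]
    return cmd, env, workspace
```

A caller would then pass the result to something like subprocess.run(cmd, env=env, cwd=workspace, capture_output=True) and delete both directories afterwards.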

Task selection. We dropped tasks where vanilla Claude Code finishes in under ~5 turns on average: they're too trivial to separate the tools (every tool lands on the same near-zero cost, so only fixed overhead shows) and they don't resemble real agent work. The board scores only the substantive tasks that remain.

Aggregation. Costs are aggregated as a geometric mean of per-task ratios, the standard way to average ratios (it avoids the upward bias of an arithmetic mean), and a cost-weighted total of all sessions agrees with it to within a point. Every successful run is used as-is and all raw per-run data is published for re-analysis; bootstrap 95% confidence intervals for the headline figures are in results.json.
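The aggregation step can be illustrated with a short sketch. It assumes each ratio is tool cost divided by vanilla cost for one task, and that the bootstrap resamples tasks with replacement using a plain percentile interval; the published analysis may differ in both respects, and the function names are hypothetical.

```python
import math
import random


def geometric_mean(ratios):
    """Geometric mean of positive ratios, computed in log space."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


def bootstrap_ci(ratios, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the geometric mean (assumed scheme)."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    stats = sorted(
        geometric_mean(rng.choices(ratios, k=len(ratios)))
        for _ in range(n_resamples)
    )
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# One task halved in cost, one doubled: the two effects should cancel.
ratios = [0.5, 2.0]
arithmetic = sum(ratios) / len(ratios)  # 1.25, reads as "25% more expensive"
geometric = geometric_mean(ratios)      # 1.0 up to float rounding, i.e. neutral
```

The two-task example shows the upward bias mentioned above: the arithmetic mean reports a 25% cost increase for a tool whose effects cancel exactly, while the geometric mean reports no change.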

Impartiality

This benchmark is maintained by the author of one of the measured tools (tokenade). The conflict is handled by construction: the harness cannot tell competitors apart, install steps follow each tool's own documentation, verifiers are frozen before any run and check task outcomes (not tool behaviour), the control is a first-class entry whose variance bounds every claim, and all raw data — including runs unfavourable to any tool, tokenade included — is published unedited.

Tools benchmarked

Every optimizer measured, linked to its source. tokenade is the tool maintained by this benchmark's author; it is held to the same blind pipeline as every other entry.