The Token Waste Problem: 80% of AI Coding Tokens Are Irrelevant

There’s a number that should change how you think about AI coding costs: roughly 80% of the tokens your agent processes in a typical task are irrelevant to that task.
Not slightly off-topic. Not marginally useful. Irrelevant — code the model reads, processes, and generates around, but that has no structural connection to what you actually asked it to do.
This is the token waste problem. And it’s costing teams real money.
Where the 80% Comes From
Consider a typical Claude Code task on a medium-sized production codebase.
You ask:
“Add a rate limiter to the /payments/charge endpoint.”

The actually relevant code might be:

- The /payments/charge route handler: ~200 tokens
- The existing rate limiting decorator (if any): ~150 tokens
- The middleware configuration: ~100 tokens
- The relevant test file: ~200 tokens
Total relevant context: ~650 tokens.
What Claude Code might actually load:
- The entire payments/ directory: 15,000 tokens
- Shared utilities: 8,000 tokens
- Auth helpers it touched nearby: 4,000 tokens
- Various config files: 3,000 tokens
- Conversation history: 5,000 tokens
Total loaded: ~35,000 tokens.
Relevant fraction: 650 / 35,000 ≈ 1.9%.
The “80% waste” figure is actually conservative. On codebases over 100,000 lines, the relevant fraction is often under 5%.
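The accounting above is simple enough to sketch. The file names and token counts below are the article's illustrative figures, not measurements from a real repository:

```python
# Illustrative token accounting for the rate limiter task.
# Counts are the article's example figures, not real measurements.
relevant = {
    "payments/charge route handler": 200,
    "rate limiting decorator": 150,
    "middleware configuration": 100,
    "relevant test file": 200,
}
loaded = {
    "entire payments/ directory": 15_000,
    "shared utilities": 8_000,
    "nearby auth helpers": 4_000,
    "various config files": 3_000,
    "conversation history": 5_000,
}

relevant_total = sum(relevant.values())   # 650
loaded_total = sum(loaded.values())       # 35,000
fraction = relevant_total / loaded_total

print(f"relevant: {relevant_total} / {loaded_total} = {fraction:.1%}")
# → relevant: 650 / 35000 = 1.9%
```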
Why Agents Over-Load Context
This isn’t a bug or a misconfiguration. It’s a rational strategy in the absence of better information.
AI coding agents face a fundamental uncertainty problem: they don’t know what’s relevant until they’ve read it. The cost of missing something critical (a wrong answer, a broken change) is high. The cost of over-including seems low.
So the default strategy becomes: when in doubt, include more.
This produces what’s variously called:
- Context bloat
- Token waste
- Inefficient context loading
The agent is being cautious, but caution is expensive.
Under the hood, most agents rely on two imprecise methods for context selection:
1. Keyword / Semantic Search
They search for files whose content is semantically similar to the task description.
- Strength: finds files that talk about related topics
- Weakness: doesn’t guarantee those files are structurally connected to the code you’re modifying
2. Directory / Heuristic Loading
They load files that are “nearby” in the filesystem.
- Strength: simple and often “good enough” for small projects
- Weakness: assumes code is organized by feature; in reality, utilities and shared logic often live elsewhere (utils/, lib/, shared/)
Neither approach understands the actual dependency structure of the code.
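A toy contrast makes the gap concrete. Below, a keyword selector picks any file whose content overlaps the task words, while a structural check keeps only files reachable through imports. The file contents and the import map are hypothetical:

```python
task = "add a rate limiter to the payments charge endpoint"

# Hypothetical files: the words each one mentions, and what actually
# imports what. Neither reflects a real codebase.
mentions = {
    "payments/charge.py":   {"rate", "limiter", "charge"},
    "docs/rate_limits.md":  {"rate", "limiter"},      # talks about the topic
    "billing/invoices.py":  {"payments", "charge"},   # similar words, unrelated code
    "middleware/limits.py": {"rate", "limiter"},
}
imports = {  # edges: file -> files it directly imports
    "payments/charge.py": {"middleware/limits.py"},
    "billing/invoices.py": set(),
}

keywords = set(task.split())

# 1. Keyword/semantic selection: any overlap with the task words.
keyword_picks = {f for f, words in mentions.items() if words & keywords}

# 2. Structural selection: the target file plus what it imports.
structural_picks = {"payments/charge.py"} | imports["payments/charge.py"]

print(sorted(keyword_picks))     # includes the docs and billing noise
print(sorted(structural_picks))  # only the structurally connected files
```

The keyword pass pulls in every file that merely talks about rate limiting or payments; the structural pass keeps two files.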
The Structural Solution: Dependency Graphs
The fix is to load context based on the actual dependency graph of your codebase.
A dependency graph built from static analysis — imports, function calls, class inheritance, type references — knows exactly what each piece of code depends on.
For the rate limiter task, a graph-based engine would:
- Start at the /payments/charge handler
- Traverse imports and references to find:
  - The rate limit decorator
  - The middleware configuration
  - Any shared helpers directly used by that handler
- Stop at the boundary instead of pulling in:
  - Unrelated payment models
  - Billing services
  - Adjacent routes that aren’t touched
- Return only the traversed subgraph as context
This isn’t semantic similarity. It’s structural necessity.
The files that get included are the ones that are actually connected to the code being modified.
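A minimal sketch of that traversal, assuming a prebuilt import graph. The graph edges and the one-hop depth limit are illustrative choices, not vexp's actual algorithm:

```python
from collections import deque

# Hypothetical import graph: file -> files it directly depends on.
GRAPH = {
    "payments/charge.py": ["middleware/rate_limit.py", "payments/helpers.py"],
    "middleware/rate_limit.py": ["config/limits.py"],
    "payments/helpers.py": [],
    "config/limits.py": [],
    "payments/models.py": ["db/base.py"],     # unrelated: never reached
    "billing/service.py": ["payments/models.py"],
}

def context_for(entry: str, max_depth: int = 2) -> set[str]:
    """Breadth-first traversal over the dependency graph.

    Includes only files reachable from the entry point within
    max_depth import hops -- the 'boundary' where traversal stops.
    """
    seen = {entry}
    queue = deque([(entry, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # stop at the boundary
        for dep in GRAPH.get(node, []):
            if dep not in seen:
                seen.add(dep)
                queue.append((dep, depth + 1))
    return seen

print(sorted(context_for("payments/charge.py")))
# billing/service.py and payments/models.py are never pulled in
```

The subgraph returned for the handler contains four files; everything that is merely nearby in the filesystem stays out.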
Benchmark Results (Real FastAPI Codebase, 21 Runs per Condition)
Using dependency-graph-based context selection:
- Input token reduction: 65–70%
- Output token reduction: 63%
- Cost reduction: 58%
- Speed improvement: 22%
The output token reduction is particularly telling: when the model receives focused context, it produces focused output. Less noise in, less noise out.
The Compounding Cost of Token Waste
Token waste compounds quickly at team scale.
Solo Developer Example
- 5 sessions/day at $0.50/session → $2.50/day → $50/month
- With 58% reduction: $1.05/day → $21/month
- Savings: $29/month
10-Developer Team Example
Each developer runs 8 sessions/day:
- Without optimization:
- 10 devs × 8 sessions × $0.50 = $40/day
- ≈ $800/month
- With 58% reduction:
- Effective cost ≈ $16.80/day
- ≈ $336/month
- Monthly savings: ~$464
- Annual savings: ~$5,500
Against a vexp Pro subscription at $190/month ($2,280/year), that’s roughly 2.4x ROI on direct API cost alone.
This ignores:
- Time savings from 22% faster task completion
- Quality improvements from less noisy context (fewer re-runs, fewer corrections)
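The direct-cost arithmetic above, reproduced as a quick check. Per-session cost, the 58% reduction, and the subscription price are the article's figures; 20 workdays per month is an assumption:

```python
DEVS = 10
SESSIONS_PER_DAY = 8
COST_PER_SESSION = 0.50     # dollars, article's estimate
REDUCTION = 0.58            # measured cost reduction from the benchmark
WORKDAYS = 20               # assumed workdays per month
SUBSCRIPTION = 190 * 12     # vexp Pro, $190/month

daily = DEVS * SESSIONS_PER_DAY * COST_PER_SESSION       # $40/day
monthly = daily * WORKDAYS                               # $800/month
optimized_monthly = monthly * (1 - REDUCTION)            # $336/month
annual_savings = (monthly - optimized_monthly) * 12      # ~$5,568/year
roi = annual_savings / SUBSCRIPTION                      # ~2.4x

print(f"${monthly:.0f}/mo -> ${optimized_monthly:.0f}/mo, "
      f"saving ${annual_savings:.0f}/yr ({roi:.1f}x ROI)")
```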
What 80% Irrelevant Tokens Do to Output Quality
Token waste isn’t just a cost problem. It’s a quality problem.
Language models exhibit attention dilution: when the context contains a lot of irrelevant content, the model’s attention spreads more broadly, and relevant signals get relatively less weight.
In practice, this shows up as:
1. More Hallucination
The model fills gaps with plausible-sounding but incorrect information, partly because the correct information is buried in irrelevant context.
2. Less Precise Code
You get outputs that are technically correct but don’t match the existing patterns and conventions of the codebase — because those patterns were diluted by noise.
3. Longer, Vaguer Explanations
The model hedges more when it’s uncertain. Irrelevant context increases uncertainty, so explanations get longer and less decisive.
4. More Re-Reads
The model sometimes re-reads files it already processed, burning extra output tokens, because the relevant signal wasn’t prominent enough the first time.
The 63% reduction in output tokens with focused context reflects all of this: shorter outputs because they’re better targeted, not because they’re less accurate.
How to Measure Your Own Token Waste
You can estimate your current token waste ratio with a simple experiment:
- Pick a simple, well-defined task in your codebase
  - e.g. “Fix this specific bug” or “Add this field to this endpoint”
- Manually identify the genuinely necessary files
  - The target file plus its direct dependencies (imports, helpers, config, tests)
- Run the task as usual and record the total tokens the agent processed
- Compute the waste ratio: 1 minus (necessary tokens divided by total tokens processed)
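Once you have the necessary-token count and the total tokens the agent processed, the waste ratio is one minus their quotient. A minimal sketch, using the article's example figures as placeholders for your own measurements:

```python
def waste_ratio(necessary_tokens: int, total_tokens: int) -> float:
    """Fraction of processed tokens that were not needed for the task."""
    return 1 - necessary_tokens / total_tokens

# Placeholder numbers from the article's example; substitute your own.
print(f"{waste_ratio(650, 35_000):.1%}")  # → 98.1%
```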
Nicola
Developer and creator of vexp — a context engine for AI coding agents. I build tools that make AI coding assistants faster, cheaper, and actually useful on real codebases.