The Token Waste Problem: 80% of AI Coding Tokens Are Irrelevant

There’s a number that should change how you think about AI coding costs: roughly 80% of the tokens your agent processes in a typical task are irrelevant to that task.

Not slightly off-topic. Not marginally useful. Irrelevant — code the model reads, processes, and generates around, but that has no structural connection to what you actually asked it to do.

This is the token waste problem. And it’s costing teams real money.

Where the 80% Comes From

Consider a typical Claude Code task on a medium-sized production codebase.

You ask:

“Add a rate limiter to the /payments/charge endpoint.”

The actually relevant code might be:

  • The /payments/charge route handler: ~200 tokens
  • The existing rate limiting decorator (if any): ~150 tokens
  • The middleware configuration: ~100 tokens
  • The relevant test file: ~200 tokens

Total relevant context: ~650 tokens.

What Claude Code might actually load:

  • The entire payments/ directory: 15,000 tokens
  • Shared utilities: 8,000 tokens
  • Auth helpers it touched nearby: 4,000 tokens
  • Various config files: 3,000 tokens
  • Conversation history: 5,000 tokens

Total loaded: ~35,000 tokens.

Relevant fraction:

  • 650 / 35,000 = 1.9% relevant

The “80% waste” figure is actually conservative. On codebases over 100,000 lines, the relevant fraction is often under 5%.
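The accounting above can be written out directly. A minimal sketch, using the rough token estimates from the example (not measurements):

```python
# Token accounting for the rate-limiter example above.
# All counts are the rough estimates from the text.
relevant = {
    "route_handler": 200,
    "rate_limit_decorator": 150,
    "middleware_config": 100,
    "test_file": 200,
}
loaded = {
    "payments_dir": 15_000,
    "shared_utils": 8_000,
    "auth_helpers": 4_000,
    "config_files": 3_000,
    "conversation_history": 5_000,
}

relevant_total = sum(relevant.values())  # 650
loaded_total = sum(loaded.values())      # 35,000
fraction = relevant_total / loaded_total

print(f"{fraction:.1%} relevant")  # 1.9% relevant
```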

Why Agents Over-Load Context

This isn’t a bug or a misconfiguration. It’s a rational strategy in the absence of better information.

AI coding agents face a fundamental uncertainty problem: they don’t know what’s relevant until they’ve read it. The cost of missing something critical (a wrong answer, a broken change) is high. The cost of over-including seems low.

So the default strategy becomes: when in doubt, include more.

This produces what’s variously called:

  • Context bloat
  • Token waste
  • Inefficient context loading

The agent is being cautious, but caution is expensive.

Under the hood, most agents rely on two imprecise methods for context selection:

1. Semantic / Embedding Search

They search for files whose content is semantically similar to the task description.

  • Strength: finds files that talk about related topics
  • Weakness: doesn’t guarantee those files are structurally connected to the code you’re modifying

2. Directory / Heuristic Loading

They load files that are “nearby” in the filesystem.

  • Strength: simple and often “good enough” for small projects
  • Weakness: assumes code is organized by feature; in reality, utilities and shared logic often live elsewhere (utils/, lib/, shared/)

Neither approach understands the actual dependency structure of the code.
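A toy illustration of the gap between the two. The file names and contents below are hypothetical; the point is that keyword matching surfaces files that mention the right words, while structural selection follows what the target actually imports:

```python
# Toy corpus: file -> (contents, files it imports). All hypothetical.
FILES = {
    "payments/charge.py": ("def charge(): ...", ["middleware/rate_limit.py"]),
    "docs/rate_limit_notes.py": ('"""Notes on rate limiting."""', []),
    "middleware/rate_limit.py": ("def rate_limit(): ...", []),
}

query = "rate"

# Keyword/semantic-style selection: match on text content.
keyword_hits = {f for f, (text, _) in FILES.items() if query in text.lower()}

# Structural selection: the target file plus what it directly imports.
structural = {"payments/charge.py", *FILES["payments/charge.py"][1]}

# Files matched on words alone, with no structural connection:
print(keyword_hits - structural)  # {'docs/rate_limit_notes.py'}
```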

The Structural Solution: Dependency Graphs

The fix is to load context based on the actual dependency graph of your codebase.

A dependency graph built from static analysis — imports, function calls, class inheritance, type references — knows exactly what each piece of code depends on.

For the rate limiter task, a graph-based engine would:

  1. Start at the /payments/charge handler
  2. Traverse imports and references to find:
  • The rate limit decorator
  • The middleware configuration
  • Any shared helpers directly used by that handler
  3. Stop at the boundary instead of pulling in:
  • Unrelated payment models
  • Billing services
  • Adjacent routes that aren’t touched
  4. Return only the traversed subgraph as context

This isn’t semantic similarity. It’s structural necessity.

The files that get included are the ones that are actually connected to the code being modified.
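The traversal itself is a plain breadth-first walk over the import graph. A minimal sketch, with a hypothetical `DEPS` graph standing in for the output of static analysis (file names are illustrative):

```python
from collections import deque

# Hypothetical dependency graph: each file maps to the files it directly
# depends on (imports, calls, type references), as static analysis would find.
DEPS = {
    "payments/charge.py": ["middleware/rate_limit.py", "middleware/config.py"],
    "middleware/rate_limit.py": ["middleware/config.py"],
    "middleware/config.py": [],
    "payments/models.py": ["shared/db.py"],        # never reached from charge.py
    "billing/service.py": ["payments/models.py"],  # never reached either
}

def context_files(entry: str, max_depth: int = 2) -> set[str]:
    """Breadth-first traversal from the entry point, stopping at max_depth."""
    seen = {entry}
    queue = deque([(entry, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # boundary: don't expand further
        for dep in DEPS.get(node, []):
            if dep not in seen:
                seen.add(dep)
                queue.append((dep, depth + 1))
    return seen

print(sorted(context_files("payments/charge.py")))
# ['middleware/config.py', 'middleware/rate_limit.py', 'payments/charge.py']
```

Note that `payments/models.py` and `billing/service.py` are never loaded: they share a directory and a domain with the target, but no edge connects them to it.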

Benchmark Results (Real FastAPI Codebase, 21 Runs per Condition)

Using dependency-graph-based context selection:

  • Input token reduction: 65–70%
  • Output token reduction: 63%
  • Cost reduction: 58%
  • Speed improvement: 22%

The output token reduction is particularly telling: when the model receives focused context, it produces focused output. Less noise in, less noise out.

The Compounding Cost of Token Waste

Token waste compounds quickly at team scale.

Solo Developer Example

  • 5 sessions/day at $0.50/session → $2.50/day → $50/month
  • With 58% reduction: $1.05/day → $21/month
  • Savings: $29/month

10-Developer Team Example

Each developer runs 8 sessions/day:

  • Without optimization:
  • 10 devs × 8 sessions × $0.50 = $40/day
  • $800/month
  • With 58% reduction:
  • Effective cost ≈ $16.80/day
  • $336/month
  • Monthly savings: ~$464
  • Annual savings: ~$5,500
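Both examples follow from one formula, assuming 20 workdays per month (the assumption behind the figures above):

```python
def monthly_cost(devs: int, sessions_per_day: float, cost_per_session: float,
                 reduction: float = 0.0, workdays: int = 20) -> float:
    """Monthly API spend; `reduction` is the fractional cost cut (0.58 = 58%)."""
    daily = devs * sessions_per_day * cost_per_session * (1 - reduction)
    return daily * workdays

before = monthly_cost(10, 8, 0.50)                  # $800/month
after = monthly_cost(10, 8, 0.50, reduction=0.58)   # $336/month

print(f"${before - after:.0f}/month saved")  # $464/month saved
```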

Against a vexp Pro subscription at $190/month ($2,280/year), that’s roughly 2.4x ROI on direct API cost alone.

This ignores:

  • Time savings from 22% faster task completion
  • Quality improvements from less noisy context (fewer re-runs, fewer corrections)

What 80% Irrelevant Tokens Do to Output Quality

Token waste isn’t just a cost problem. It’s a quality problem.

Language models exhibit attention dilution: when the context contains a lot of irrelevant content, the model’s attention spreads more broadly, and relevant signals get relatively less weight.

In practice, this shows up as:

1. More Hallucination

The model fills gaps with plausible-sounding but incorrect information, partly because the correct information is buried in irrelevant context.

2. Less Precise Code

You get outputs that are technically correct but don’t match the existing patterns and conventions of the codebase — because those patterns were diluted by noise.

3. Longer, Vaguer Explanations

The model hedges more when it’s uncertain. Irrelevant context increases uncertainty, so explanations get longer and less decisive.

4. More Re-Reads

The model sometimes re-reads files it already processed, burning extra output tokens, because the relevant signal wasn’t prominent enough the first time.

The 63% reduction in output tokens with focused context reflects all of this: shorter outputs because they’re better targeted, not because they’re less accurate.

How to Measure Your Own Token Waste

You can estimate your current token waste ratio with a simple experiment:

  1. Pick a simple, well-defined task in your codebase
  • e.g. “Fix this specific bug” or “Add this field to this endpoint”
  2. Manually identify the genuinely necessary files
  • The target file plus its direct dependencies (imports, helpers, config, tests)
  3. Run the task and record the total input tokens from your agent’s usage report
  4. Divide the necessary tokens by the total loaded: the remainder is your waste ratio
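The comparison can be sketched in a few lines. This assumes a rough chars/4 heuristic for estimating tokens in source files and a session total read from your agent’s usage report; the file paths and the 35,000 figure below are illustrative placeholders:

```python
import os

def estimate_tokens(path: str) -> int:
    """Rough token estimate for a source file: ~4 characters per token."""
    with open(path, encoding="utf-8") as f:
        return len(f.read()) // 4

# Hypothetical inputs: the files you judged necessary, and the session total
# reported by your agent.
necessary_files = ["payments/charge.py", "middleware/rate_limit.py"]
session_input_tokens = 35_000

needed = sum(estimate_tokens(p) for p in necessary_files if os.path.exists(p))
waste = 1 - needed / session_input_tokens
print(f"~{waste:.0%} of input tokens were waste")
```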

Frequently Asked Questions

What percentage of AI coding tokens are actually wasted?
Benchmarks across production codebases show that 70-90% of input tokens in a typical unoptimized session are irrelevant to the task at hand. This includes files loaded by keyword proximity, exploration overhead while the agent maps the codebase, and accumulated stale context from earlier in the session. Only 10-30% of the input tokens actually contribute to the final output.
What are the main types of token waste in AI coding?
There are four main categories: (1) proximity waste — loading files near the task but not relevant to it; (2) keyword waste — loading files that mention relevant terms but aren't part of the call chain; (3) exploration waste — the agent reading files to understand the codebase before starting the actual task; (4) accumulation waste — old tool outputs and previous reads that stay in context even when no longer needed.
How does loading too many files cause token waste?
Each file read consumes a fixed number of tokens regardless of how useful the file turns out to be. When an agent loads a 500-line file to find that only 20 lines were relevant, 480 lines of input tokens were wasted. Multiply this across 30+ files in a complex session and you have thousands of wasted tokens that directly translate to API costs and slower responses.
Can token waste lead to worse AI suggestions?
Yes, directly. An LLM's attention is finite within its context window. When the context contains 90% irrelevant code, the model distributes attention across it all, diluting focus on the actually relevant sections. This leads to suggestions that mix concepts from unrelated parts of the codebase, miss important constraints from buried relevant code, or produce generic responses that ignore project-specific patterns.
What is the most effective way to eliminate token waste?
The highest-impact change is replacing keyword-based file selection with dependency-graph traversal. This eliminates proximity and keyword waste in one step. For exploration waste, session memory means the agent doesn't need to re-explore the codebase on subsequent sessions. For accumulation waste, using the run_pipeline single-call pattern prevents tool output accumulation by returning pre-compressed results.

Nicola

Developer and creator of vexp — a context engine for AI coding agents. I build tools that make AI coding assistants faster, cheaper, and actually useful on real codebases.
