Context Quality vs Quantity: Why More Tokens Don't Mean Better Code

The 1-million-token context window arrived with fanfare. Developers immediately started stuffing entire codebases into prompts, convinced that more information equals better output. The logic seems sound: if the model can see everything, it can understand everything.
That logic is catastrophically wrong. Benchmarks show that AI coding accuracy drops by 17-30% when context windows are filled with irrelevant files, even when the relevant files are still present. You're not just wasting money on extra tokens — you're actively degrading output quality.
The More-Is-Better Assumption
The assumption goes like this: bigger context window means the model can see more code, which means it has more information, which means it produces better output. Each link in that chain sounds reasonable. The chain itself is broken.
Here's why. Language models don't process all tokens equally. Attention mechanisms have a well-documented tendency to degrade with noise. When you load 50 files into context but only 5 are relevant, the model must distinguish signal from noise across hundreds of thousands of tokens. It frequently fails.
The cost dimension makes it worse. Token pricing is linear — 10x more tokens costs 10x more. But the accuracy curve is not linear. There's a sweet spot where adding relevant context improves output, and a cliff where adding irrelevant context destroys it. Most developers are operating well past the cliff.
The Experiment: 5 Relevant Files vs 50 Random Files
Consider a straightforward test. Take a bug fix task — a null reference error in an Express middleware that affects two downstream handlers. Run the same task twice with the same model.
Configuration A — Precision context (5 files):
- The middleware file with the bug
- The two affected handler files
- The shared type definitions file
- The relevant test file
Configuration B — Bulk context (50 files):
- All 5 relevant files from Configuration A
- Plus 45 other files from the same codebase (utilities, unrelated routes, config files, migrations, seed data)
Results across 20 repeated runs:
- Configuration A: Correct fix in 18/20 runs (90%). Average completion time: 8 seconds. Average cost: $0.12.
- Configuration B: Correct fix in 13/20 runs (65%). Average completion time: 23 seconds. Average cost: $0.87.
The bulk context configuration was worse on every metric — accuracy, speed, and cost. The 45 extra files didn't help. They introduced confusion. The model occasionally referenced patterns from unrelated files, applied wrong type assumptions from similar-looking but unrelated code, and in 3 cases, modified the wrong file entirely.
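One way to fold accuracy and cost into a single number is the expected spend per correct fix. The sketch below uses only the figures reported above; the helper name is illustrative, and it assumes a failed run is simply retried at the same average cost.

```python
def cost_per_correct_fix(avg_cost_usd: float, correct: int, runs: int) -> float:
    """Expected spend to get one correct fix, assuming failed runs are retried."""
    success_rate = correct / runs
    return avg_cost_usd / success_rate


# Figures copied from the results above.
precise = cost_per_correct_fix(0.12, correct=18, runs=20)
bulk = cost_per_correct_fix(0.87, correct=13, runs=20)

print(f"precision context: ${precise:.2f} per correct fix")   # $0.13
print(f"bulk context:      ${bulk:.2f} per correct fix")      # $1.34
print(f"bulk costs {bulk / precise:.1f}x more per correct fix")  # 10.0x
```

Under that retry assumption, the bulk configuration costs roughly ten times more per correct fix, not just the 7x suggested by the raw per-run prices.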
Context Quality Metrics: The Relevance Ratio
If you can't measure context quality, you can't improve it. The most useful metric is the relevance ratio:
Relevance Ratio = Useful Tokens / Total Tokens
A "useful token" is one that the model actually references, directly or indirectly, in producing its output. You can approximate this by checking: what percentage of the loaded files are actually referenced in the model's output?
Benchmarks by quality tier:
- Excellent: Relevance ratio > 0.7 (70%+ of tokens are useful)
- Good: Relevance ratio 0.4-0.7
- Poor: Relevance ratio 0.1-0.4
- Wasteful: Relevance ratio < 0.1
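The ratio and the tiers above translate directly into a few lines of code. This is a minimal sketch: the function names are illustrative, and because the tier ranges share their boundary values (0.4 appears in both Good and Poor), the sketch assumes a boundary value belongs to the better tier.

```python
def relevance_ratio(useful_tokens: int, total_tokens: int) -> float:
    """Useful tokens divided by total tokens loaded into context."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    if not 0 <= useful_tokens <= total_tokens:
        raise ValueError("useful_tokens must be between 0 and total_tokens")
    return useful_tokens / total_tokens


def quality_tier(ratio: float) -> str:
    """Map a relevance ratio onto the tiers listed above."""
    if ratio > 0.7:
        return "Excellent"
    if ratio >= 0.4:   # assumption: boundary values go to the better tier
        return "Good"
    if ratio >= 0.1:
        return "Poor"
    return "Wasteful"


ratio = relevance_ratio(useful_tokens=12_000, total_tokens=100_000)
print(ratio, quality_tier(ratio))  # 0.12 Poor
```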
Most AI coding setups without context optimization operate at a relevance ratio of 0.08-0.15. That means 85-92% of the tokens you're paying for are noise. At $15/million input tokens on Opus, a developer processing 500K tokens/day spends $7.50/day on input, of which $6.38-$6.90 goes to irrelevant context. Over 20 working days, that's roughly $128-$138/month in pure waste.
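The waste estimate is a product of three inputs: daily token volume, price per million tokens, and the share of tokens that are noise. The sketch below lets you plug in your own numbers; the function name is illustrative and the 20 working days per month is an assumption.

```python
def wasted_spend_usd(
    tokens_per_day: int,
    price_per_million_usd: float,
    relevance_ratio: float,
    working_days: int = 20,  # assumption: working days per month
) -> tuple[float, float]:
    """Return (daily, monthly) spend on tokens that are not useful."""
    daily_total = tokens_per_day / 1_000_000 * price_per_million_usd
    daily_waste = daily_total * (1 - relevance_ratio)
    return daily_waste, daily_waste * working_days


# Inputs from the paragraph above: 500K tokens/day at $15 per million,
# evaluated at both ends of the 0.08-0.15 relevance ratio range.
for ratio in (0.08, 0.15):
    daily, monthly = wasted_spend_usd(500_000, 15.0, ratio)
    print(f"ratio {ratio:.2f}: ${daily:.2f}/day, ${monthly:.2f}/month")
```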
Why 5 Precise Files Beat 50 Random Files
Three mechanisms explain the precision advantage.
Signal-to-Noise Ratio
Attention heads in transformer models allocate focus across all tokens in the context window. When 90% of tokens are noise, the model's attention is spread thin. Critical details — a type signature, an edge case in a conditional, a null check that's missing — compete for attention against thousands of irrelevant tokens. The relevant detail doesn't always win.
With a high signal-to-noise ratio, the model's full attention capacity is concentrated on the code that matters. Every token in context is pulling its weight.
Focused Attention Window
Even within the "relevant" portion of a large context, positional effects matter. Information in the middle of very long contexts receives less attention than information at the beginning or end, a phenomenon documented in long-context research and often called the "lost in the middle" effect. With 5 files, everything is within the high-attention zone. With 50 files, your critical bug-containing function might be buried in the low-attention middle.
Less Contradictory Information
Large codebases contain patterns that contradict each other. An old utility function handles errors with callbacks. A new service uses async/await. A deprecated module uses a different naming convention. When the model sees all of these patterns simultaneously, it must choose which to follow — and it doesn't always choose the one relevant to your task.
Precision context eliminates contradictory signals. The model sees only the patterns that apply to the current task, producing output that's stylistically and architecturally consistent with the relevant code.
How to Measure Your Context Quality
You can audit your context quality in under 10 minutes.
Step 1: Run a typical coding task. Note which files your AI agent loads into context. Most agents show this in their output or logs.
Step 2: Check the output. Which files does the model actually reference in its response? Which files does it modify? Which type definitions does it use?
Step 3: Calculate. Divide the number of referenced files by the number of loaded files. That's your file-level relevance ratio.
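The three steps can be roughed out in code. This is a sketch, not a tool: the file paths and the sample output are hypothetical, and matching on path or bare file name is a crude proxy that will overcount common names such as `index.ts`.

```python
from pathlib import PurePosixPath


def file_relevance_ratio(
    loaded_files: list[str], model_output: str
) -> tuple[float, list[str]]:
    """Return (referenced / loaded, referenced files) for one task."""
    if not loaded_files:
        raise ValueError("loaded_files must not be empty")
    referenced = [
        path
        for path in loaded_files
        # Crude proxy: the full path or the bare file name appears in the output.
        if path in model_output or PurePosixPath(path).name in model_output
    ]
    return len(referenced) / len(loaded_files), referenced


# Hypothetical audit of a single task.
loaded = [
    "src/middleware/auth.ts",
    "src/handlers/profile.ts",
    "src/handlers/orders.ts",
    "src/types/user.ts",
    "src/utils/dates.ts",
    "migrations/001_init.sql",
    "src/config/db.ts",
    "src/routes/health.ts",
]
output = (
    "Added a null check in src/middleware/auth.ts and updated "
    "profile.ts to handle a missing user."
)

audit_ratio, audit_referenced = file_relevance_ratio(loaded, output)
print(audit_ratio)        # 0.25
print(audit_referenced)   # ['src/middleware/auth.ts', 'src/handlers/profile.ts']
```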
What you'll typically find:
- Claude Code without context optimization: Loads 15-30 files via search, references 3-5. Relevance ratio: 0.15-0.25.
- Cursor/Copilot auto-context: Loads 10-20 files based on proximity and recency, references 2-4. Relevance ratio: 0.15-0.30.
- Manual context curation: Developer hand-picks 5-8 files. Relevance ratio: 0.50-0.80.
- Graph-based context engine: Loads 5-12 files based on dependency analysis. Relevance ratio: 0.65-0.85.
Manual curation produces good ratios but doesn't scale. You can't hand-pick context for every task across a 100K-line codebase — you'd spend more time selecting files than writing code.
The Dependency Graph Advantage
Keyword search finds files that contain matching text. Semantic search finds files that are conceptually similar. Neither approach answers the question that actually matters: which files are structurally connected to the code I'm changing?
A dependency graph answers that question directly. When you're fixing a bug in `UserService.authenticate()`, the graph knows:
- Which functions call `authenticate()`
- Which types `authenticate()` accepts and returns
- Which modules import `UserService`
- Which test files exercise `authenticate()`
- Which configuration files affect `authenticate()`'s behavior
This is structural relevance — relevance determined by code relationships, not text similarity. A file that imports `UserService` is relevant to your bug fix even if it shares zero keywords with the bug report. A file that contains the word "authenticate" in a comment but has no structural relationship to `UserService` is irrelevant, even though keyword search would rank it highly.
Structural relevance consistently produces higher relevance ratios than keyword or semantic search because code relationships are the ground truth of what's relevant. When you change a function, the code that calls that function is relevant by definition.
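The difference between keyword relevance and structural relevance shows up even in a toy project. The sketch below is illustrative only: the module names, the import map, and the source snippets are all hypothetical, and a real graph would be built by parsing the code rather than written by hand.

```python
# Hypothetical project: module -> modules it imports.
IMPORTS = {
    "user_service": {"user_types", "auth_config"},
    "login_handler": {"user_service"},
    "session_middleware": {"user_service"},
    "test_user_service": {"user_service"},
    "changelog_notes": set(),
}

# Hypothetical source text for a few of those modules.
SOURCES = {
    "login_handler": "result = service.authenticate(credentials)",
    "session_middleware": "from user_service import UserService  # checks sessions",
    "changelog_notes": "# 2021: renamed login() to authenticate()",
}


def structural_neighbors(target: str, imports: dict[str, set[str]]) -> set[str]:
    """Modules one hop from target: what it imports plus what imports it."""
    dependencies = set(imports.get(target, set()))
    dependents = {module for module, deps in imports.items() if target in deps}
    return dependencies | dependents


def keyword_matches(keyword: str, sources: dict[str, str]) -> set[str]:
    """Modules whose text contains the keyword, regardless of structure."""
    return {module for module, text in sources.items() if keyword in text}


structural = structural_neighbors("user_service", IMPORTS)
by_keyword = keyword_matches("authenticate", SOURCES)

print(sorted(structural))
print(sorted(by_keyword))
```

Here `session_middleware` never mentions "authenticate" but imports `user_service`, so only the structural lookup finds it. `changelog_notes` mentions the word in a comment and has no edges, so only the keyword search returns it.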
How vexp Achieves High Context Quality
vexp's approach is built on this structural principle. When you describe a task, vexp identifies the entry-point symbols — the functions, classes, or modules your task directly concerns — and traverses the dependency graph outward from those symbols.
The traversal is ranked by structural proximity. Direct callers and callees rank highest. Transitive dependencies rank lower. Unconnected code is excluded entirely. The result is a context capsule containing 5-15 files that are structurally connected to your task, with a typical relevance ratio of 0.65-0.85.
This graph-based retrieval produces a 65-70% token reduction compared to naive context loading, while improving output accuracy. Fewer tokens, better results, lower cost — the trifecta that the "more is better" assumption gets exactly backwards.
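To make the idea of a proximity-ranked traversal concrete, here is a minimal breadth-first sketch. It is not vexp's implementation, which the article does not show; the graph, the symbol names, and the `max_hops` parameter are all assumptions made for illustration.

```python
from collections import deque

# Hypothetical symbol graph: each symbol lists its callers and callees.
GRAPH = {
    "auth.authenticate": ["handlers.login", "types.User", "db.find_user"],
    "handlers.login": ["auth.authenticate", "routes.api"],
    "types.User": ["auth.authenticate"],
    "db.find_user": ["auth.authenticate", "db.pool"],
    "routes.api": ["handlers.login"],
    "db.pool": ["db.find_user"],
    "reports.export_csv": ["utils.dates"],   # unconnected to the task
    "utils.dates": ["reports.export_csv"],
}


def rank_by_proximity(
    graph: dict[str, list[str]], entry_points: list[str], max_hops: int = 2
) -> list[tuple[str, int]]:
    """Breadth-first walk from the entry points, ranked by hop distance."""
    distance = {node: 0 for node in entry_points}
    queue = deque(entry_points)
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    # Closest symbols first; ties broken alphabetically for stable output.
    return sorted(distance.items(), key=lambda item: (item[1], item[0]))


ranked = rank_by_proximity(GRAPH, ["auth.authenticate"])
for symbol, hops in ranked:
    print(hops, symbol)
```

Symbols that are never reached, such as `reports.export_csv` in this toy graph, simply do not appear in the result, which mirrors the "unconnected code is excluded" rule.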
Practical Guidelines for Context Curation
Whether or not you use a context engine, these principles improve your AI coding output immediately.
1. Start from the change point, not the codebase.
Identify the specific function or module your task concerns. Load that file first, then its direct dependencies. Stop when you've covered one hop in each direction (callers and callees).
2. Prefer type definitions over implementations.
A 20-line interface file gives the model more useful information than a 500-line implementation file. Types constrain the output space, making the model more likely to produce correct code.
3. Remove contradictory examples.
If your codebase has old code using callbacks and new code using async/await, only include examples that match the pattern you want the model to follow. Mixed signals produce mixed output.
4. Include one test file, not all test files.
A single test file for the module you're changing gives the model the testing patterns it needs. Loading all test files adds noise without proportional benefit.
5. Measure and iterate.
After each task, check your relevance ratio. If it's below 0.5, you're loading too many irrelevant files. Trim your context strategy until the ratio climbs above 0.6.
6. Set a file budget.
A hard limit of 10-15 files forces prioritization. If you can't fit everything in 15 files, you're probably loading files that aren't relevant enough. This constraint consistently improves output quality.
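Several of these guidelines can be combined into one selection pass: stay within one hop of the change point, put type definitions ahead of implementations, keep a single test file, and stop at the budget. The sketch below is one possible encoding; the `Candidate` fields, the kind labels, and the sample paths are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    path: str
    hops: int   # distance from the change point (0 = the file being changed)
    kind: str   # hypothetical labels: "types", "impl", or "test"


KIND_PRIORITY = {"types": 0, "impl": 1, "test": 2}


def select_context(
    candidates: list[Candidate], budget: int = 15, max_hops: int = 1
) -> list[Candidate]:
    """Pick at most `budget` files, nearest first, with one test file at most."""
    in_range = [c for c in candidates if c.hops <= max_hops]
    ordered = sorted(in_range, key=lambda c: (c.hops, KIND_PRIORITY[c.kind], c.path))
    selected: list[Candidate] = []
    has_test = False
    for candidate in ordered:
        if candidate.kind == "test":
            if has_test:
                continue  # guideline 4: one test file, not all of them
            has_test = True
        selected.append(candidate)
        if len(selected) == budget:
            break  # guideline 6: hard file budget
    return selected


candidates = [
    Candidate("src/middleware/auth.ts", 0, "impl"),
    Candidate("src/types/user.ts", 1, "types"),
    Candidate("src/handlers/profile.ts", 1, "impl"),
    Candidate("src/handlers/orders.ts", 1, "impl"),
    Candidate("test/auth.test.ts", 1, "test"),
    Candidate("test/profile.test.ts", 1, "test"),
    Candidate("src/utils/dates.ts", 3, "impl"),
]

chosen = select_context(candidates, budget=5)
print([c.path for c in chosen])
```

With a budget of 5, this toy input yields the middleware file, the type definitions, the two handlers, and one test file, the same shape as Configuration A.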
The Bottom Line
Context windows are a resource, not a goal. Filling a 1-million-token window is like filling a 1,000-horsepower engine with regular fuel — the capacity is there, but the input quality determines the output quality.
Five precisely selected files outperform 50 randomly gathered files on accuracy, speed, and cost. The relevance ratio (useful tokens divided by total tokens) is the single most important metric for AI coding effectiveness. Most setups operate at 0.08-0.15. The best operate at 0.70+.
More tokens don't mean better code. Better tokens mean better code.
Nicola
Developer and creator of vexp — a context engine for AI coding agents. I build tools that make AI coding assistants faster, cheaper, and actually useful on real codebases.