Context Quality vs Quantity: Why More Tokens Don't Mean Better Code

Nicola · May 20, 2026

The 1-million-token context window arrived with fanfare. Developers immediately started stuffing entire codebases into prompts, convinced that more information equals better output. The logic seems sound: if the model can see everything, it can understand everything.

That logic is catastrophically wrong. Benchmarks show that AI coding accuracy drops by 17-30% when context windows are filled with irrelevant files, even when the relevant files are still present. You're not just wasting money on extra tokens — you're actively degrading output quality.

The More-Is-Better Assumption

The assumption goes like this: bigger context window means the model can see more code, which means it has more information, which means it produces better output. Each link in that chain sounds reasonable. The chain itself is broken.

Here's why. Language models don't process all tokens equally. Attention mechanisms have a well-documented tendency to degrade with noise. When you load 50 files into context but only 5 are relevant, the model must distinguish signal from noise across hundreds of thousands of tokens. It frequently fails.

The cost dimension makes it worse. Token pricing is linear — 10x more tokens costs 10x more. But the accuracy curve is not linear. There's a sweet spot where adding relevant context improves output, and a cliff where adding irrelevant context destroys it. Most developers are operating well past the cliff.

The Experiment: 5 Relevant Files vs 50 Random Files

Consider a straightforward test. Take a bug fix task — a null reference error in an Express middleware that affects two downstream handlers. Run the same task twice with the same model.

Configuration A — Precision context (5 files):

The middleware file with the bug
The two affected handler files
The shared type definitions file
The relevant test file

Configuration B — Bulk context (50 files):

All 5 relevant files from Configuration A
Plus 45 other files from the same codebase (utilities, unrelated routes, config files, migrations, seed data)

Results across 20 repeated runs:

Configuration A: Correct fix in 18/20 runs (90%). Average completion time: 8 seconds. Average cost: $0.12.
Configuration B: Correct fix in 13/20 runs (65%). Average completion time: 23 seconds. Average cost: $0.87.

The bulk context configuration was worse on every metric: accuracy, speed, and cost. The 45 extra files didn't help. They introduced confusion. The model occasionally referenced patterns from unrelated files, applied wrong type assumptions from similar-looking but unrelated code, and in 3 of the 7 failed runs modified the wrong file entirely.
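
One way to fold accuracy and cost into a single figure is cost per correct fix: the average cost of a run divided by the success rate. The sketch below uses only the numbers reported for the two configurations. The function name is illustrative, and the calculation assumes a failed run is simply retried at the same average cost.

```python
def cost_per_correct_fix(avg_cost_usd: float, correct_runs: int, total_runs: int) -> float:
    """Expected spend to obtain one correct fix, assuming independent retries."""
    success_rate = correct_runs / total_runs
    return avg_cost_usd / success_rate


config_a = cost_per_correct_fix(0.12, 18, 20)  # precision context, 5 files
config_b = cost_per_correct_fix(0.87, 13, 20)  # bulk context, 50 files

print(f"A: ${config_a:.2f} per correct fix")
print(f"B: ${config_b:.2f} per correct fix")
print(f"B costs {config_b / config_a:.1f}x more per correct fix")
```

Per run, Configuration B costs about 7x more than Configuration A. Once the lower success rate is priced in, the gap widens to roughly 10x.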

Context Quality Metrics: The Relevance Ratio

If you can't measure context quality, you can't improve it. The most useful metric is the relevance ratio:

Relevance Ratio = Useful Tokens / Total Tokens

A "useful token" is one that the model actually references, directly or indirectly, in producing its output. You can approximate this by checking: what percentage of the loaded files are actually referenced in the model's output?
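
The approximation described above can be sketched in a few lines. This version weights each file by its token count and treats a file's tokens as useful only if the model's output referenced that file. The file paths and token counts are made up for illustration.

```python
def relevance_ratio(token_counts: dict[str, int], referenced: set[str]) -> float:
    """Useful tokens / total tokens, where a file's tokens count as useful
    only if the model's output referenced that file."""
    total = sum(token_counts.values())
    if total == 0:
        return 0.0
    useful = sum(n for path, n in token_counts.items() if path in referenced)
    return useful / total


# Hypothetical context: paths and token counts are illustrative only.
loaded = {
    "middleware/auth.ts": 1200,
    "handlers/profile.ts": 900,
    "types/session.ts": 300,
    "db/migrations/001_init.sql": 4000,
    "scripts/seed.ts": 3600,
}
used = {"middleware/auth.ts", "handlers/profile.ts", "types/session.ts"}

print(f"{relevance_ratio(loaded, used):.2f}")  # 2,400 useful of 10,000 total
```

Three of the five files are referenced, yet the token-weighted ratio is only 0.24, because the two unused files are the largest ones in context.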

Benchmarks by quality tier:

Excellent: Relevance ratio > 0.7 (70%+ of tokens are useful)
Good: Relevance ratio 0.4-0.7
Poor: Relevance ratio 0.1-0.4
Wasteful: Relevance ratio < 0.1
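
The tiers translate directly into a small classifier. The list above leaves the shared boundaries (0.4 and 0.1) ambiguous, so assigning them to the higher tier is an assumption of this sketch.

```python
def quality_tier(ratio: float) -> str:
    """Map a relevance ratio to the quality tiers listed above."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("relevance ratio must be between 0 and 1")
    if ratio > 0.7:
        return "Excellent"
    if ratio >= 0.4:   # assumption: 0.4 itself counts as Good
        return "Good"
    if ratio >= 0.1:   # assumption: 0.1 itself counts as Poor
        return "Poor"
    return "Wasteful"


for sample in (0.75, 0.55, 0.12, 0.08):
    print(sample, quality_tier(sample))
```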

Most AI coding setups without context optimization operate at a relevance ratio of 0.08-0.15. That means 85-92% of the tokens you're paying for are noise. At $15/million input tokens on Opus, a developer processing 500K tokens/day spends $7.50/day, of which $6.38-$6.90 goes to irrelevant context. Over 20 working days, that is roughly $128-$138/month in pure waste.
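
The waste estimate is simple arithmetic: daily spend multiplied by the share of tokens that are not useful. The sketch below uses the price, volume, and ratio range quoted above. The 20-working-day month is an assumption.

```python
PRICE_PER_MILLION_USD = 15.00   # input price quoted in the text
TOKENS_PER_DAY = 500_000        # daily volume quoted in the text
WORKING_DAYS_PER_MONTH = 20     # assumption, not stated as a fixed figure


def daily_waste_usd(relevance_ratio: float) -> float:
    """Daily spend on tokens that are not useful."""
    daily_spend = TOKENS_PER_DAY / 1_000_000 * PRICE_PER_MILLION_USD
    return daily_spend * (1 - relevance_ratio)


for ratio in (0.15, 0.08):
    per_day = daily_waste_usd(ratio)
    per_month = per_day * WORKING_DAYS_PER_MONTH
    print(f"ratio {ratio:.2f}: ${per_day:.2f}/day, ${per_month:.2f}/month")
```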

Why 5 Precise Files Beat 50 Random Files

Three mechanisms explain the precision advantage.

Signal-to-Noise Ratio

Attention in transformer models works like a fixed budget: for each query, the attention weights over all context tokens sum to one. When 90% of tokens are noise, that budget is spread thin. Critical details (a type signature, an edge case in a conditional, a missing null check) compete for attention against thousands of irrelevant tokens. The relevant detail doesn't always win.

With a high signal-to-noise ratio, the model's full attention capacity is concentrated on the code that matters. Every token in context is pulling its weight.

Focused Attention Window

Even within the "relevant" portion of a large context, positional effects matter. Information in the middle of very long contexts receives less attention than information at the beginning or end, an effect documented in long-context and retrieval-augmented generation research and often called "lost in the middle". With 5 files, everything is within the high-attention zone. With 50 files, your critical bug-containing function might be buried in the low-attention middle.

Less Contradictory Information

Large codebases contain patterns that contradict each other. An old utility function handles errors with callbacks. A new service uses async/await. A deprecated module uses a different naming convention. When the model sees all of these patterns simultaneously, it must choose which to follow — and it doesn't always choose the one relevant to your task.

Precision context eliminates contradictory signals. The model sees only the patterns that apply to the current task, producing output that's stylistically and architecturally consistent with the relevant code.

How to Measure Your Context Quality

You can audit your context quality in under 10 minutes.

Step 1: Run a typical coding task. Note which files your AI agent loads into context. Most agents show this in their output or logs.

Step 2: Check the output. Which files does the model actually reference in its response? Which files does it modify? Which type definitions does it use?

Step 3: Calculate. Divide the number of referenced files by the number of loaded files. That's your file-level relevance ratio.
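
The three steps can be roughed out as a script. This sketch assumes you have the list of loaded files and the model's response as plain text, and it counts a file as referenced when its path or bare file name appears in the response. That substring match is crude (common names such as `index.ts` will collide), so treat the result as an estimate.

```python
from pathlib import PurePosixPath


def referenced_files(loaded: list[str], model_output: str) -> list[str]:
    """Loaded files whose full path or bare file name appears in the output."""
    hits = []
    for path in loaded:
        name = PurePosixPath(path).name
        if path in model_output or name in model_output:
            hits.append(path)
    return hits


def file_level_ratio(loaded: list[str], model_output: str) -> float:
    """Referenced files divided by loaded files."""
    if not loaded:
        return 0.0
    return len(referenced_files(loaded, model_output)) / len(loaded)


# Hypothetical session: file names and response text are illustrative only.
loaded = [
    "src/middleware/auth.ts", "src/handlers/profile.ts",
    "src/handlers/orders.ts", "src/types/session.ts",
    "src/utils/date.ts", "src/utils/logger.ts",
    "src/config/env.ts", "src/routes/health.ts",
]
output = ("Added a null check in src/middleware/auth.ts and "
          "updated the Session type in session.ts.")

print(referenced_files(loaded, output))
print(file_level_ratio(loaded, output))  # 2 of 8 files referenced
```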

What you'll typically find:

Claude Code without context optimization: Loads 15-30 files via search, references 3-5. Relevance ratio: 0.15-0.25.
Cursor/Copilot auto-context: Loads 10-20 files based on proximity and recency, references 2-4. Relevance ratio: 0.15-0.30.
Manual context curation: Developer hand-picks 5-8 files. Relevance ratio: 0.50-0.80.
Graph-based context engine: Loads 5-12 files based on dependency analysis. Relevance ratio: 0.65-0.85.

Manual curation produces good ratios but doesn't scale. You can't hand-pick context for every task across a 100K-line codebase — you'd spend more time selecting files than writing code.

The Dependency Graph Advantage

Keyword search finds files that contain matching text. Semantic search finds files that are conceptually similar. Neither approach answers the question that actually matters: which files are structurally connected to the code I'm changing?

A dependency graph answers that question directly. When you're fixing a bug in `UserService.authenticate()`, the graph knows:

Which functions call `authenticate()`
Which types `authenticate()` accepts and returns
Which modules import `UserService`
Which test files exercise `authenticate()`
Which configuration files affect `authenticate()`'s behavior

This is structural relevance — relevance determined by code relationships, not text similarity. A file that imports `UserService` is relevant to your bug fix even if it shares zero keywords with the bug report. A file that contains the word "authenticate" in a comment but has no structural relationship to `UserService` is irrelevant, even though keyword search would rank it highly.

Structural relevance consistently produces higher relevance ratios than keyword or semantic search because code relationships are the ground truth of what's relevant. When you change a function, the code that calls that function is relevant by definition.
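
A toy project makes the difference concrete. The sketch below compares a keyword match with a lookup over import edges. The file names, import lists, and source snippets are hypothetical, and a real tool would derive the edges by parsing the code rather than from a hand-written table.

```python
# Hypothetical project: file -> (files it imports, source text).
PROJECT = {
    "services/user_service.py": ([], "class UserService:\n    def authenticate(self, token): ..."),
    "routes/login.py": (["services/user_service.py"], "svc.authenticate(token)"),
    "routes/session.py": (["services/user_service.py"], "user = svc.current()"),
    "docs/changelog.py": ([], "# TODO: authenticate docs reviewers"),
}


def keyword_matches(term: str) -> set[str]:
    """Files whose text contains the term, regardless of code structure."""
    return {path for path, (_, text) in PROJECT.items() if term in text}


def importers_of(target: str) -> set[str]:
    """Files that import the target module directly."""
    return {path for path, (imports, _) in PROJECT.items() if target in imports}


target = "services/user_service.py"
print(sorted(keyword_matches("authenticate")))
print(sorted(importers_of(target)))
```

Keyword search returns `docs/changelog.py`, which only mentions the word in a comment, and misses `routes/session.py`, which imports the service without ever using the word. The import lookup gets both cases right.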

How vexp Achieves High Context Quality

vexp's approach is built on this structural principle. When you describe a task, vexp identifies the entry-point symbols — the functions, classes, or modules your task directly concerns — and traverses the dependency graph outward from those symbols.

The traversal is ranked by structural proximity. Direct callers and callees rank highest. Transitive dependencies rank lower. Unconnected code is excluded entirely. The result is a context capsule containing 5-15 files that are structurally connected to your task, with a typical relevance ratio of 0.65-0.85.
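
The ranking idea can be illustrated with a breadth-first traversal. This is a minimal sketch of proximity ranking in general, not vexp's actual implementation: the call edges, symbol names, and the `max_hops` cutoff are all assumptions made for the example.

```python
from collections import deque

# Hypothetical call edges (caller, callee); names are illustrative only.
EDGES = [
    ("login_route", "authenticate"),
    ("authenticate", "verify_token"),
    ("verify_token", "load_keys"),
    ("api_gateway", "login_route"),
    ("render_invoice", "format_currency"),  # unconnected to the task
]


def rank_by_proximity(entry_points: list[str], max_hops: int = 2) -> list[tuple[str, int]]:
    """Breadth-first traversal over callers and callees.

    Returns (symbol, hops) pairs sorted by hop count, then name.
    Symbols not reachable within max_hops are left out.
    """
    neighbors: dict[str, set[str]] = {}
    for caller, callee in EDGES:
        neighbors.setdefault(caller, set()).add(callee)
        neighbors.setdefault(callee, set()).add(caller)

    hops = {symbol: 0 for symbol in entry_points}
    queue = deque(entry_points)
    while queue:
        current = queue.popleft()
        if hops[current] == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in sorted(neighbors.get(current, ())):
            if nxt not in hops:
                hops[nxt] = hops[current] + 1
                queue.append(nxt)
    return sorted(hops.items(), key=lambda item: (item[1], item[0]))


ranking = rank_by_proximity(["authenticate"])
print(ranking)
```

Starting from `authenticate`, the direct caller and callee come back at one hop, their neighbors at two hops, and the invoice functions never appear because no edge connects them to the entry point.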

This graph-based retrieval produces a 65-70% token reduction compared to naive context loading, while improving output accuracy. Fewer tokens, better results, lower cost — the trifecta that the "more is better" assumption gets exactly backwards.

Practical Guidelines for Context Curation

Whether or not you use a context engine, these principles improve your AI coding output immediately.

1. Start from the change point, not the codebase.

Identify the specific function or module your task concerns. Load that file first, then its direct dependencies. Stop when you've covered one hop in each direction (callers and callees).

2. Prefer type definitions over implementations.

A 20-line interface file gives the model more useful information than a 500-line implementation file. Types constrain the output space, making the model more likely to produce correct code.

3. Remove contradictory examples.

If your codebase has old code using callbacks and new code using async/await, only include examples that match the pattern you want the model to follow. Mixed signals produce mixed output.

4. Include one test file, not all test files.

A single test file for the module you're changing gives the model the testing patterns it needs. Loading all test files adds noise without proportional benefit.

5. Measure and iterate.

After each task, check your relevance ratio. If it's below 0.5, you're loading too many irrelevant files. Trim your context strategy until the ratio climbs above 0.6.

6. Set a file budget.

A hard limit of 10-15 files forces prioritization. If you can't fit everything in 15 files, you're probably loading files that aren't relevant enough. This constraint consistently improves output quality.
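
A file budget is easy to enforce mechanically once candidates carry a distance from the change point. The sketch below keeps the closest files and, following guideline 2, lets type definitions win ties. The `Candidate` fields and the example paths are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    path: str
    hops: int              # distance from the change point; 0 is the file being changed
    is_types: bool = False  # True for type-definition files


def apply_file_budget(candidates: list[Candidate], budget: int = 15) -> list[str]:
    """Keep the `budget` closest files; type definitions win ties."""
    ranked = sorted(candidates, key=lambda c: (c.hops, not c.is_types, c.path))
    return [c.path for c in ranked[:budget]]


# Hypothetical candidates for the middleware bug fix.
candidates = [
    Candidate("utils/date.ts", 3),
    Candidate("middleware/auth.ts", 0),
    Candidate("handlers/orders.ts", 1),
    Candidate("types/session.ts", 1, is_types=True),
    Candidate("handlers/profile.ts", 1),
]

print(apply_file_budget(candidates, budget=3))
```

With a budget of 3, the file being changed comes first, the type definitions second, and one handler takes the last slot. The distant utility file is dropped without any manual decision.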

The Bottom Line

Context windows are a resource, not a goal. Filling a 1-million-token window is like filling a 1,000-horsepower engine with regular fuel — the capacity is there, but the input quality determines the output quality.

Five precisely selected files outperform 50 randomly gathered files on accuracy, speed, and cost. The relevance ratio, useful tokens divided by total tokens, is the most useful single metric for context quality. Most setups operate at 0.08-0.15. The best operate at 0.70+.

More tokens don't mean better code. Better tokens mean better code.

Frequently Asked Questions

Why does adding more files to AI context actually hurt code quality?

Adding irrelevant files dilutes the model's attention across noise tokens, introduces contradictory coding patterns, and buries critical details in the low-attention middle zone of long contexts. Benchmarks show accuracy drops of 17-30% when context is filled with irrelevant files, even when the relevant files are still present. The model wastes processing capacity distinguishing signal from noise instead of focusing on the actual task.

What is a good relevance ratio for AI coding context?

A relevance ratio above 0.7 (more than 70% of loaded tokens are useful) is excellent. Most AI coding setups without context optimization operate at 0.08-0.15, meaning 85-92% of tokens are wasted. Graph-based context engines like vexp typically achieve 0.65-0.85 by selecting files based on structural code relationships rather than keyword matching.

How many files should I load into AI context for a coding task?

For most tasks, 5-15 files is the optimal range. Start from the specific function or module your task concerns, include its direct callers and callees, add relevant type definitions, and include one test file. Setting a hard limit of 10-15 files forces prioritization and consistently produces better output than loading 30-50 files without filtering.

Is a bigger context window always better for AI coding?

No. A bigger context window is a capacity increase, not a quality increase. It allows you to fit more tokens, but if those tokens are irrelevant, the extra capacity actively degrades output. The optimal strategy is to use the smallest context that includes all structurally relevant code — typically 5-15 files rather than the maximum the window can hold.

How do dependency graphs improve AI coding context compared to keyword search?

Dependency graphs select files based on structural code relationships — which functions call each other, which types are shared, which modules import which. This produces structurally relevant context that directly connects to the task at hand. Keyword search finds files containing matching text, which often includes irrelevant files that happen to share terminology. Graph-based retrieval consistently produces relevance ratios of 0.65-0.85, compared to 0.15-0.25 for keyword-based approaches.

Nicola

Developer and creator of vexp — a context engine for AI coding agents. I build tools that make AI coding assistants faster, cheaper, and actually useful on real codebases.
