RAG for Code: Retrieval-Augmented Generation in AI Development

Standard RAG was built for documents. You embed paragraphs, query by semantic similarity, retrieve the top-k chunks, and feed them to the model. It works beautifully for customer support bots answering questions about PDF manuals. It works poorly for AI agents navigating a 200K-line TypeScript monorepo.
The gap between document RAG and code RAG is structural. Documents are linear — paragraphs relate to adjacent paragraphs. Code is a graph — a function relates to its callers, its callees, its type definitions, and its test cases, none of which are textually adjacent. Applying document retrieval patterns to code produces retrieval that looks right but misses what matters.
What RAG Is and How It Works
Retrieval-Augmented Generation is a two-phase approach: retrieve relevant context from a knowledge base, then generate a response conditioned on that context. Instead of relying on the model's training data alone, RAG gives the model fresh, specific information at inference time.
The standard pipeline has four stages:
- Indexing: Split source material into chunks, generate vector embeddings for each chunk, store embeddings in a vector database.
- Query: Convert the user's question into an embedding using the same model.
- Retrieval: Find the top-k chunks whose embeddings are most similar to the query embedding (cosine similarity or dot product).
- Generation: Feed the retrieved chunks as context to the LLM along with the original question.
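The four stages above can be sketched in a few lines. This is a toy illustration, not a production pipeline: the `embed` function is a bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector database, and all function names and sample chunks are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    """Indexing: embed every chunk and keep (chunk, vector) pairs."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index: list[tuple[str, Counter]], query: str, k: int = 2) -> list[str]:
    """Query + retrieval: embed the query, return the k most similar chunks."""
    query_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question: str, context: list[str]) -> str:
    """Generation: the retrieved chunks become context for the LLM call."""
    return "Context:\n" + "\n".join(context) + "\n\nQuestion: " + question


chunks = [
    "Return policy: electronics may be sent back within the stated window.",
    "Our stores open at 9am on weekdays.",
    "Shipping is free on large orders.",
]
index = build_index(chunks)
top = retrieve(index, "return policy for electronics", k=1)
prompt = build_prompt("return policy for electronics", top)
```

The only part that changes between document RAG and code RAG in this sketch is what goes into `chunks` and which model backs `embed`.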
For documents, this works well. A question about "return policy for electronics" retrieves paragraphs about return policies and electronics, which is exactly what the model needs.
How RAG Works for Code
Adapting RAG to code follows the same pipeline with code-specific adjustments.
Indexing: Parse the codebase into chunks. Common strategies include chunking by function, by class, by file, or by fixed-size blocks with overlap. Generate an embedding for each chunk, using either a code-specific embedding model (such as Voyage's code models) or a general-purpose one (such as OpenAI's `text-embedding-3-large` or Cohere's embed models).
Query: The developer's task description — "fix the authentication middleware timeout" — gets embedded and compared against the code chunk embeddings.
Retrieval: The top-k most semantically similar chunks are returned. If the query mentions "authentication," chunks containing authentication-related code rank highest.
Generation: The retrieved chunks are inserted into the prompt context, and the LLM generates the fix based on that context.
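As an illustration of the indexing step, here is a minimal sketch of function-level chunking. It assumes Python source and uses the standard-library `ast` module; a multi-language codebase would need a different parser. The `chunk_by_function` helper and the sample source are hypothetical.

```python
import ast


def chunk_by_function(source: str, path: str) -> list[dict]:
    """Split a Python file into one chunk per function or class definition."""
    tree = ast.parse(source)
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "path": path,
                "name": node.name,
                "kind": type(node).__name__,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                # Exact source text of the definition, ready to be embedded.
                "text": ast.get_source_segment(source, node),
            })
    # ast.walk gives no ordering guarantee, so sort by position in the file.
    return sorted(chunks, key=lambda c: c["start_line"])


SAMPLE = '''
def validate_token(token):
    return token == "ok"

class AuthMiddleware:
    def handle(self, request):
        return validate_token(request)
'''

chunks = chunk_by_function(SAMPLE, "auth/middleware.py")
names = [c["name"] for c in chunks]
```

Note that a method is emitted both on its own and as part of its class chunk. Whether to keep that overlap is a design choice; it costs index space but gives the retriever two granularities to match against.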
This approach has genuine advantages over pure keyword search. Semantic similarity captures conceptual relationships — a query about "user validation" can retrieve code that uses the term "credential verification" even though the keywords don't overlap. It handles typos, synonyms, and conceptual queries better than grep.
RAG vs Traditional Search for Code
Keyword search (grep, ripgrep, IDE search) matches exact text patterns. It's fast, deterministic, and precise when you know exactly what you're looking for. It fails when you don't know the naming convention, when the concept spans multiple terms, or when you need to find semantically related code.
Vector RAG matches semantic meaning. It finds code that's conceptually related to your query, even with different terminology. It handles natural-language queries ("how does the app handle expired sessions?") far better than keyword search.
Comparison on real tasks:
| Task | Keyword Search | Vector RAG |
|------|---------------|------------|
| Find function `validateToken` | Instant, exact match | Works but slower |
| Find "all auth-related code" | Misses code using different terms | Better recall |
| Find callers of `validateToken` | Requires regex, misses dynamic calls | Misses — no structural awareness |
| Find code affected by changing `User` type | Manual, error-prone | Misses — embeddings don't encode type dependencies |
The last two rows reveal the fundamental limitation. Both approaches fail at structural queries because neither understands code as a graph of relationships.
Limitations of Vector-Based Code RAG
Vector RAG for code has four structural limitations that no amount of embedding model improvement can fully resolve.
Embeddings Miss Structural Relationships
An embedding captures what code looks like, not what it does in the system. Two functions with similar variable names and control flow patterns produce similar embeddings, even if they operate in completely different domains and have zero structural relationship.
Conversely, a type definition file and its consuming function may produce dissimilar embeddings despite being tightly coupled. The type file is declarations; the function is logic. They look nothing alike, but changing one requires changing the other.
Similar-Looking Code Is Not Relevant Code
A codebase with 20 API route handlers produces 20 chunks with similar embeddings — they all follow the same pattern (parse request, validate, call service, return response). When you query for "fix the payments endpoint," vector RAG retrieves several route handlers ranked by similarity to the word "payments." It might return the payments handler, but it also returns other handlers that are structurally irrelevant.
Meanwhile, the `PaymentService` class, the `Stripe` integration module, and the `Transaction` type definition — all critically relevant — rank lower because they look different from a route handler, even though they're structurally essential to fixing the payments endpoint.
Chunk Boundaries Break Context
Code doesn't split cleanly into independent chunks. A function that calls another function, uses a type from a third file, and implements an interface from a fourth file is a node in a web of relationships. Chunking it removes those relationships.
A class in a 200-line file might get chunked into 3-4 pieces. The constructor is in chunk 1, the method with the bug is in chunk 3, and the type it returns is in a different file entirely. RAG retrieves chunk 3 because it matches the query, but without chunk 1 and the type definition, the model lacks the context needed for a correct fix.
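The effect is easy to reproduce on a small scale. The sketch below assumes a naive fixed-size line chunker (the `fixed_size_chunks` helper and the `Cart` class are made up for illustration) and shows the chunk that matches a query about tax ending up without its method signature or constructor.

```python
def fixed_size_chunks(lines: list[str], size: int, overlap: int = 0) -> list[list[str]]:
    """Split lines into windows of `size` lines, sharing `overlap` lines between neighbors."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(lines):
        chunks.append(lines[start:start + size])
        if start + size >= len(lines):
            break  # the last window already reaches the end of the file
        start += step
    return chunks


SOURCE = """\
class Cart:
    def __init__(self):
        self.items = []

    def total(self):
        subtotal = sum(self.items)
        tax = subtotal * TAX_RATE
        return subtotal + tax
"""

chunks = fixed_size_chunks(SOURCE.splitlines(), size=3)

# The chunk a query about "tax" would match:
hit = next(c for c in chunks if any("TAX_RATE" in line for line in c))
has_signature = any("def total" in line for line in hit)
has_constructor = any("__init__" in line for line in hit)
```

Here `hit` contains only the last two lines of `total`. Overlap softens the problem but cannot bring in `TAX_RATE`'s definition or a type that lives in another file.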
Retrieval Accuracy Degrades with Codebase Size
In a 10K-line codebase, top-10 retrieval might capture most relevant code. In a 500K-line codebase, the same top-10 retrieval captures a much smaller fraction. The embedding space gets crowded: more chunks mean more near-neighbors competing for the top-k slots.
Anecdotally, developers report that vector RAG accuracy drops noticeably as codebases grow past roughly 50K lines. The retrieval remains semantically sensible (the returned chunks are "about" the right topic) but structurally incomplete (critical dependencies are missing).
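A back-of-the-envelope calculation shows how quickly the top-k window shrinks relative to the index. The 50-lines-per-chunk figure is an assumption chosen for illustration, not a measurement, and `top_k_share` is a hypothetical helper.

```python
def top_k_share(total_lines: int, lines_per_chunk: int, k: int = 10) -> float:
    """Fraction of all indexed chunks that fits into a top-k result."""
    total_chunks = max(1, total_lines // lines_per_chunk)
    return min(1.0, k / total_chunks)


# Assumed average chunk size: 50 lines.
small = top_k_share(10_000, lines_per_chunk=50)    # 200 chunks in the index
large = top_k_share(500_000, lines_per_chunk=50)   # 10,000 chunks in the index
```

Under these assumptions, top-10 covers 5% of the small index and 0.1% of the large one. This measures the share of the index, not recall, but it shows why a fixed k has to discriminate much harder as the codebase grows.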
The Structural Alternative: Graph-Based Retrieval
A dependency graph indexes code by relationships, not by text content. The graph nodes are code symbols (functions, classes, types, modules). The edges are structural relationships (calls, imports, implements, extends, returns).
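One possible in-memory shape for such a graph is sketched below. The class and field names (`Symbol`, `DependencyGraph`, `outgoing`, `incoming`) and the sample symbols are hypothetical; a real index would also need persistence and cross-file symbol resolution.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Symbol:
    name: str
    kind: str  # e.g. "function", "class", "type", "module"
    path: str


@dataclass
class DependencyGraph:
    symbols: dict[str, Symbol] = field(default_factory=dict)
    # Each edge is stored twice so both directions are cheap to query.
    outgoing: dict[str, set[tuple[str, str]]] = field(default_factory=lambda: defaultdict(set))
    incoming: dict[str, set[tuple[str, str]]] = field(default_factory=lambda: defaultdict(set))

    def add_symbol(self, symbol: Symbol) -> None:
        self.symbols[symbol.name] = symbol

    def add_edge(self, source: str, relation: str, target: str) -> None:
        """Record `source --relation--> target`, e.g. relation='calls'."""
        self.outgoing[source].add((relation, target))
        self.incoming[target].add((relation, source))

    def callers_of(self, name: str) -> set[str]:
        return {src for relation, src in self.incoming.get(name, ()) if relation == "calls"}


graph = DependencyGraph()
graph.add_symbol(Symbol("login_handler", "function", "routes/login.py"))
graph.add_symbol(Symbol("validate_token", "function", "auth/tokens.py"))
graph.add_symbol(Symbol("TokenPayload", "type", "auth/types.py"))
graph.add_edge("login_handler", "calls", "validate_token")
graph.add_edge("validate_token", "returns", "TokenPayload")
```

Storing the reverse edges up front is what makes "who calls this?" a dictionary lookup rather than a scan of the whole codebase.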
Retrieval on a graph works differently from retrieval on vectors:
- Identify entry points: Parse the task to find relevant symbols (the function to fix, the module to refactor).
- Traverse outward: Walk the graph from entry points along structural edges — callers, callees, type dependencies, imports.
- Rank by proximity: Symbols closer to the entry points (fewer hops) rank higher than distant symbols.
- Return subgraph: The relevant slice of the codebase, determined by structural connectivity.
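The four steps above amount to a bounded breadth-first search. The sketch below assumes entry points have already been identified and treats every edge as traversable in both directions; the function name, the `max_hops` parameter, and the sample edges are hypothetical.

```python
from collections import deque


def retrieve_subgraph(
    edges: dict[str, set[str]],
    entry_points: list[str],
    max_hops: int = 2,
) -> list[tuple[str, int]]:
    """Return (symbol, hops) pairs reachable within max_hops, closest first."""
    # Build an undirected neighbor map so callers and callees are both reached.
    neighbors: dict[str, set[str]] = {}
    for source, targets in edges.items():
        for target in targets:
            neighbors.setdefault(source, set()).add(target)
            neighbors.setdefault(target, set()).add(source)

    hops = {entry: 0 for entry in entry_points}
    queue = deque(entry_points)
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand beyond the hop budget
        for nxt in sorted(neighbors.get(node, ())):
            if nxt not in hops:
                hops[nxt] = hops[node] + 1
                queue.append(nxt)

    # Rank by proximity: fewer hops first, name as a deterministic tie-breaker.
    return sorted(hops.items(), key=lambda item: (item[1], item[0]))


EDGES = {
    "login_handler": {"validate_token"},
    "validate_token": {"decode_jwt", "TokenPayload"},
    "decode_jwt": {"base64_decode"},
    "test_validate_token": {"validate_token"},
    "render_homepage": {"load_template"},
}

one_hop = retrieve_subgraph(EDGES, ["validate_token"], max_hops=1)
two_hops = dict(retrieve_subgraph(EDGES, ["validate_token"], max_hops=2))
```

With `max_hops=1` the result is `validate_token` plus its caller, its test, its callee, and its return type. `render_homepage` never appears at any hop count because nothing connects it to the entry point.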
This retrieval method answers structural queries that vector RAG cannot. "What code is affected if I change `UserService.authenticate()`?" is a graph traversal: follow all incoming edges (callers), outgoing edges (callees), and type edges (shared interfaces). The answer is exact and complete for every relationship the parser can resolve statically; dynamic calls and reflection can still escape the graph.
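For the caller side of that question, impact analysis is reverse reachability over the call graph. A minimal sketch, assuming call edges have already been extracted into a plain dictionary (the function name and sample symbols are hypothetical):

```python
def affected_by_change(calls: dict[str, set[str]], changed: str) -> set[str]:
    """Every symbol that directly or transitively calls `changed`."""
    # Invert the call graph: callee -> set of callers.
    callers: dict[str, set[str]] = {}
    for caller, callees in calls.items():
        for callee in callees:
            callers.setdefault(callee, set()).add(caller)

    affected: set[str] = set()
    stack = [changed]
    while stack:
        current = stack.pop()
        for caller in callers.get(current, ()):
            if caller not in affected:
                affected.add(caller)
                stack.append(caller)
    affected.discard(changed)  # a recursive function is not its own blast radius
    return affected


CALLS = {
    "admin_route": {"login_route"},
    "login_route": {"UserService.authenticate"},
    "UserService.authenticate": {"hash_password"},
    "health_check": set(),
}

impact = affected_by_change(CALLS, "UserService.authenticate")
```

`hash_password` is a callee, not a caller, so it is not in `impact`; a fuller version would also walk outgoing and type edges, as described above.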
Comparing RAG Approaches for Code
Vector RAG:
- Strengths: Handles natural-language queries, finds semantically related code, requires no structural parsing
- Weaknesses: Misses structural dependencies, chunk boundary problems, accuracy degrades at scale
- Best for: Exploratory queries ("how does auth work here?"), finding example code, documentation search
Graph RAG:
- Strengths: Exact structural relevance, scales with codebase size, fast incremental updates, high precision
- Weaknesses: Requires structural parsing, doesn't handle natural-language fuzzy queries, limited to parsed languages
- Best for: Bug fixes, refactors, impact analysis, any task where the change point is known
Hybrid RAG:
- Strengths: Combines semantic and structural retrieval, covers both exploratory and targeted tasks
- Weaknesses: More complex pipeline, potential for conflicting signals between retrieval methods
- Best for: General-purpose AI coding where task types vary
The hybrid approach sounds ideal in theory, but implementation matters more than architecture. On structural tasks, a well-implemented graph RAG will generally outperform a poorly implemented hybrid.
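One common way to merge the two signals in a hybrid pipeline is reciprocal rank fusion (RRF), which combines ranked lists without needing their scores to be comparable. This is a generic technique offered as an example, not something the comparison above prescribes; the sample rankings are invented.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists; each item scores sum(1 / (k + rank)) over the lists."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; name breaks ties deterministically.
    return sorted(scores, key=lambda item: (-scores[item], item))


vector_results = ["payments_handler", "orders_handler", "PaymentService"]
graph_results = ["PaymentService", "payments_handler", "Transaction"]

fused = reciprocal_rank_fusion([vector_results, graph_results])
```

Items that both retrievers agree on (`payments_handler`, `PaymentService`) rise to the top, while items found by only one method are kept but ranked lower. That is one way to limit the conflicting-signal problem listed under weaknesses.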
How vexp Implements Graph-Based Retrieval
vexp takes the graph RAG approach, using tree-sitter parsing to build dependency graphs across 30 programming languages. The retrieval pipeline works as follows:
Indexing: Parse every file in the codebase with tree-sitter, extract symbols (functions, classes, types, variables), resolve imports and references to build a directed dependency graph. Compute graph centrality metrics (PageRank, betweenness) for ranking. Store in a local index that updates incrementally — only re-parse changed files.
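To make the centrality step concrete, here is a textbook PageRank power iteration over a small dependency graph. It is a generic sketch and makes no claim about how vexp computes its metrics; the damping factor, iteration count, and sample edges are assumptions.

```python
def pagerank(
    edges: dict[str, set[str]],
    damping: float = 0.85,
    iterations: int = 50,
) -> dict[str, float]:
    """Power-iteration PageRank; an edge A -> B means 'A depends on B'."""
    nodes = set(edges) | {t for targets in edges.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Nodes with no outgoing edges spread their rank evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not edges.get(node))
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        for source, targets in edges.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


DEPENDS_ON = {
    "login": {"User", "validate_token"},
    "signup": {"User"},
    "profile": {"User"},
    "validate_token": {"User"},
}

ranks = pagerank(DEPENDS_ON)
most_central = max(ranks, key=ranks.get)
```

In this toy graph the `User` type collects rank from every other symbol, which matches the intuition that widely depended-on symbols are architectural pivots.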
Retrieval: When a task arrives ("fix the JWT validation bug"), identify entry-point symbols through keyword + semantic + graph centrality hybrid search. Traverse outward from those entry points along dependency edges, collecting the structural neighborhood. Rank results by structural proximity (fewer hops = higher rank) and graph centrality (high-centrality nodes are architectural pivots).
Context assembly: Package the ranked symbols with their source code into a compressed context capsule. Typical output: 5-15 files at a relevance ratio (useful files divided by total files retrieved) of 0.65-0.85, compared to 0.10-0.25 for vector RAG or keyword search.
The result is a 65-70% token reduction versus naive context loading, with higher accuracy because every file in context is structurally connected to the task.
When Vector RAG Wins vs When Graph Retrieval Wins
Neither approach is universally superior. The right choice depends on the task type.
Vector RAG is better when:
- You're exploring an unfamiliar codebase and don't know what to look for
- The query is conceptual ("how does caching work in this project?")
- You need to find code examples or patterns across the codebase
- The codebase uses languages without strong structural parsing support
Graph retrieval is better when:
- You know the change point (a specific function, module, or file)
- The task is a bug fix, refactor, or feature addition to existing code
- You need to understand blast radius (what code is affected by a change)
- Accuracy matters more than exploration breadth
- The codebase is large (50K+ lines) where vector retrieval accuracy degrades
For most professional development work — fixing bugs, building features, refactoring existing code — the change point is known. You know which function is broken, which module needs the feature, which pattern needs refactoring. Graph retrieval dominates these tasks because it starts from known entry points and expands structurally.
Vector RAG excels at the discovery phase — onboarding to a new codebase, finding examples of a pattern, understanding high-level architecture. Once discovery transitions to modification, graph retrieval takes over.
Practical Implications for Developers
If you're evaluating code RAG solutions:
- Test on structural tasks, not just search. Any RAG can find a function by name. Test whether it finds the function's callers, its type dependencies, and the test file that exercises it.
- Measure the relevance ratio. Divide useful files by total files retrieved. If the ratio is below 0.5, the retrieval is producing more noise than signal.
- Test at your codebase scale. A retrieval method that works on a 5K-line demo project may fail on your 200K-line production codebase. Vector RAG is particularly susceptible to this scale degradation.
- Check incremental update speed. If re-indexing after a code change takes minutes, the index is stale during active development — exactly when you need it most. Graph-based indexing with incremental updates (re-parse only changed files) stays current in seconds.
- Consider the integration cost. The best retrieval system is the one your team actually uses. If it requires a separate vector database, embedding API costs, and custom chunking logic, adoption will be low. Native Model Context Protocol (MCP) integration with zero infrastructure is a significant practical advantage.
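The relevance ratio from the checklist is simple to compute once you have hand-labeled which files a task actually needed. A small sketch, with hypothetical file paths and labels:

```python
def relevance_ratio(retrieved: list[str], useful: set[str]) -> float:
    """Useful files divided by total files retrieved (duplicates counted once)."""
    unique = list(dict.fromkeys(retrieved))  # de-duplicate, keep order
    if not unique:
        return 0.0
    hits = sum(1 for path in unique if path in useful)
    return hits / len(unique)


# Files a retriever returned for "fix the payments endpoint" (made-up example).
retrieved = [
    "routes/payments.ts",
    "services/PaymentService.ts",
    "types/Transaction.ts",
    "routes/orders.ts",
]
# Files a reviewer judged necessary for the fix.
useful = {
    "routes/payments.ts",
    "services/PaymentService.ts",
    "types/Transaction.ts",
    "integrations/stripe.ts",
}

ratio = relevance_ratio(retrieved, useful)
noisy = ratio < 0.5
```

This ratio only measures how much of what was retrieved is useful. In the example, `integrations/stripe.ts` was needed but never retrieved, so it is worth tracking missed files alongside the ratio.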
Code RAG is not a solved problem. But the direction is clear — structural understanding of code relationships produces fundamentally better retrieval than text similarity alone. The tools that treat code as a graph rather than a bag of text chunks will deliver the context quality that AI coding agents actually need.
Nicola
Developer and creator of vexp — a context engine for AI coding agents. I build tools that make AI coding assistants faster, cheaper, and actually useful on real codebases.