Same model. Better context. More bugs fixed.
We benchmarked four coding agents on SWE-bench Verified — all running Claude Opus 4.5. The only variable was the context layer. vexp led on both resolution rate and cost, and resolved issues no other agent could.
The results.

Evaluated on a 100-task stratified subset of SWE-bench Verified. All agents use Claude Opus 4.5 for a fair, apples-to-apples comparison. External resolution data from swe-bench/experiments. Cost data from each agent's published benchmarks.
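For readers unfamiliar with the term: a stratified subset keeps each repository's share of tasks the same as in the full benchmark. A minimal sketch of that idea (illustrative only — this is not the actual sampler used; the `repo`/`id` schema is assumed):

```python
import random
from collections import defaultdict

def stratified_subset(tasks, n, seed=42):
    """Sample ~n tasks while preserving each repo's share of the full set."""
    rng = random.Random(seed)
    by_repo = defaultdict(list)
    for t in tasks:
        by_repo[t["repo"]].append(t)
    subset = []
    for repo, group in sorted(by_repo.items()):
        # Proportional allocation, at least one task per repo.
        k = max(1, round(n * len(group) / len(tasks)))
        subset.extend(rng.sample(group, min(k, len(group))))
    return subset

# Toy example: 100 tasks split 20/50/30 across three repos.
tasks = [{"repo": r, "id": i} for i, r in enumerate(
    ["astropy"] * 20 + ["matplotlib"] * 50 + ["xarray"] * 30)]
picked = stratified_subset(tasks, 10)  # keeps the 2:5:3 ratio
```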
| Agent | Pass@1 | $/task | Unique Wins |
|---|---|---|---|
| vexp + Claude Code | 73.0% | $0.67 | 7–10 |
| Live-SWE-Agent | 72.0% | $0.86 | — |
| OpenHands | 70.0% | $1.77 | — |
| Sonar Foundation | 70.0% | $1.98 | — |
All four agents run the same model. The difference is context.
Every agent in this benchmark uses Claude Opus 4.5 with the same cost limit ($3/task) and the same turn budget (250 turns). The scaffolding varies — but the model is identical.
Yet vexp + Claude Code resolves 73% of issues at $0.67 per task. The closest competitor costs 28% more. The most expensive costs 3× more.
The difference isn't the model. It's what the model sees before it writes code.
vexp pre-indexes your codebase into a dependency graph, then delivers a ranked context capsule — full source for pivot files, skeletonized signatures for everything else — bounded to your token budget. The agent starts every task already knowing what matters.
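The capsule idea can be sketched in a few lines. This is a rough illustration, not vexp's implementation: the ranking signal (import in-degree), the `skeleton` helper, and the 4-chars-per-token heuristic are all assumptions standing in for richer graph analysis:

```python
def skeleton(source):
    """Keep only def/class signature lines — a crude 'skeletonized' view."""
    return "\n".join(
        line for line in source.splitlines()
        if line.lstrip().startswith(("def ", "class "))
    )

def build_capsule(files, deps, budget, pivots=3):
    """files: {path: source}; deps: {path: [paths it imports]}.

    Rank files by in-degree (how often they are imported), include full
    source for the top `pivots` files and skeletons for the rest, and
    stop once the token budget is spent.
    """
    indeg = {p: 0 for p in files}
    for imported in deps.values():
        for p in imported:
            if p in indeg:
                indeg[p] += 1
    ranked = sorted(files, key=lambda p: indeg[p], reverse=True)
    capsule, used = [], 0
    for i, path in enumerate(ranked):
        body = files[path] if i < pivots else skeleton(files[path])
        cost = len(body) // 4  # ~4 chars per token heuristic
        if used + cost > budget:
            break
        capsule.append(f"# {path}\n{body}")
        used += cost
    return "\n\n".join(capsule)
```

The key design point the sketch captures: heavily-imported files get full source up front, everything else is compressed to signatures, so the agent's first prompt already contains the load-bearing code.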
This benchmark is evidence for a simple thesis: in agentic coding, context engineering is the highest-leverage intervention available. Better context → fewer wasted turns → lower cost → higher resolution.
“On 7–10 tasks, vexp was the only agent to produce a passing patch. These aren't marginal improvements — they're bugs that the model simply cannot fix without the right context.”
Resolution vs cost. The efficiency map.

Top-left is the best zone: high resolution, low cost.
vexp is the only agent in that quadrant — resolving the most issues while spending the least per task. The scatter also shows the “unique wins” count: issues that only vexp resolved.
The cost advantage compounds at scale. On a 500-task run, the difference between $0.67 and $1.98 per task is $655 — enough to pay for vexp Pro for almost 3 years.
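The savings figure is easy to check from the per-task costs in the table above:

```python
runs = 500
vexp_cost, most_expensive = 0.67, 1.98  # $/task, from the benchmark table
savings = (most_expensive - vexp_cost) * runs
print(f"${savings:.0f}")  # → $655
```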
Where vexp leads — and where it doesn't.

astropy — 80% vs 40%
vexp doubles the resolution rate of the nearest competitor on astropy's complex astronomical computation issues.
xarray, requests — 75–83%
On data-structure-heavy repos, vexp's dependency graph surfaces the exact chain of imports and type relationships.
matplotlib — 43% vs 86%
Sonar Foundation leads on matplotlib. vexp's graph-based context is less effective on rendering-heavy, visual output code. We're investigating why — and it's a good reason to run this benchmark yourself.
Methodology — how we ran this benchmark.
Transparent, reproducible, open-source. No cherry-picking.
Don't trust us. Run it yourself.
Clone. Setup. Run. Under 10 minutes to first result.
```shell
$ git clone https://github.com/Vexp-ai/vexp-swe-bench.git
$ cd vexp-swe-bench && ./setup.sh
$ source .venv/bin/activate
$ node dist/cli.js run
```
Requires Node ≥ 18, Python ≥ 3.10, Docker. Default agent: Claude Code with vexp. Use --no-vexp to run without vexp and compare yourself.
Use code BENCHMARK at vexp.dev/#pricing for 14 days of vexp Pro — free.
No credit card required.
Context is the highest-leverage variable in AI coding.
73% pass rate. $0.67 per task. 7–10 issues no one else solved.
All on your machine. No cloud. No account.
Or install via npm: `npm install -g vexp-cli`
Free tier · No account · No credit card · Zero network calls