Same model. Better context. More bugs fixed.
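We benchmarked four coding agents on SWE-bench Verified, all running Claude Opus 4.5. The model was held constant; what differed was each agent's scaffolding and the context it feeds the model. vexp had the highest resolution rate at the lowest cost per task, and resolved issues no other agent could.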
Highest resolution. Lowest cost.
One leaderboard, one model, four agents — measured on the same 100 tasks.

Evaluated on a 100-task stratified subset of SWE-bench Verified. All agents use Claude Opus 4.5 for a fair, apples-to-apples comparison. External resolution data from swe-bench/experiments. Cost data from each agent's published benchmarks.
| Agent | Pass@1 | $/task | Unique Wins |
|---|---|---|---|
| vexp + Claude Code | 73.0% | $0.67 | 7–10 |
| Live-SWE-Agent | 72.0% | $0.86 | — |
| OpenHands | 70.0% | $1.77 | — |
| Sonar Foundation | 70.0% | $1.98 | — |
All four agents run the same model. The difference is context.
Every agent in this benchmark uses Claude Opus 4.5 with the same cost limit ($3/task) and the same turn budget (250 turns). The scaffolding varies — but the model is identical.
Yet vexp + Claude Code resolves 73% of issues at $0.67 per task. The closest competitor costs 28% more. The most expensive costs nearly 3× as much.
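To check the cost comparison yourself, the short Python sketch below recomputes each agent's cost relative to vexp from the per-task figures in the table. The dictionary and the `relative_cost` helper are illustrative names, not part of the benchmark repository.

```python
# Per-task costs copied from the results table above.
cost_per_task = {
    "vexp + Claude Code": 0.67,
    "Live-SWE-Agent": 0.86,
    "OpenHands": 1.77,
    "Sonar Foundation": 1.98,
}

baseline = cost_per_task["vexp + Claude Code"]


def relative_cost(agent: str) -> float:
    """Return the agent's per-task cost as a multiple of the vexp baseline."""
    return cost_per_task[agent] / baseline


for agent, cost in cost_per_task.items():
    ratio = relative_cost(agent)
    # ratio 1.28 means "28% more"; ratio 2.96 means "nearly 3x as much".
    print(f"{agent:<20} ${cost:.2f}  {ratio:.2f}x  (+{(ratio - 1) * 100:.0f}%)")
```

Live-SWE-Agent comes out at about 1.28× the vexp cost and Sonar Foundation at about 2.96×.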
The difference isn't the model. It's what the model sees before it writes code.
vexp pre-indexes your codebase into a dependency graph, then delivers a ranked context capsule — full source for pivot files, skeletonized signatures for everything else — bounded to your token budget. The agent starts every task already knowing what matters.
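The idea of a budget-bounded capsule can be sketched in a few lines of Python. This is a toy illustration, not vexp's implementation: the function names, the 4-characters-per-token estimate, and the assumption that a ranking of files is already available (for example, from a dependency graph) are all hypothetical.

```python
import ast
from typing import Dict, List, Set


def skeletonize(source: str) -> str:
    """Keep only the first signature line of each top-level def/class.

    Simplification: multi-line signatures are truncated to their first line.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    kept = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            kept.append(lines[node.lineno - 1].rstrip() + " ...")
    return "\n".join(kept)


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: roughly 4 characters per token."""
    return max(1, len(text) // 4)


def build_capsule(files: Dict[str, str], ranked: List[str],
                  pivots: Set[str], budget: int) -> str:
    """Pack files in rank order: full source for pivots, skeletons otherwise."""
    parts, used = [], 0
    for path in ranked:
        body = files[path] if path in pivots else skeletonize(files[path])
        if not body:
            continue
        chunk = f"# {path}\n{body}"
        cost = estimate_tokens(chunk)
        if used + cost > budget:
            continue  # does not fit; a later, smaller file still might
        parts.append(chunk)
        used += cost
    return "\n\n".join(parts)


# Hypothetical two-file project: core.py is the pivot, util.py is a dependency.
files = {
    "core.py": "def solve(x):\n    return helper(x) + 1\n",
    "util.py": "def helper(x):\n    y = x * 2\n    return y\n\nclass Cache:\n    pass\n",
}
capsule = build_capsule(files, ranked=["core.py", "util.py"],
                        pivots={"core.py"}, budget=200)
print(capsule)
```

The pivot file appears in full, while `util.py` is reduced to its signatures, so the agent sees what exists without paying tokens for every body.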
This benchmark is evidence for a simple thesis: in agentic coding, context engineering is the highest-leverage intervention available. Better context → fewer wasted turns → lower cost → higher resolution.
“On 7–10 tasks, vexp was the only agent to produce a passing patch. These aren't marginal improvements — they're bugs that the model simply cannot fix without the right context.”
Resolution vs. cost, mapped.
High resolution at low cost — the quadrant every agent is trying to reach.

Top-left is the best zone: high resolution, low cost.
vexp is the only agent in that quadrant — resolving the most issues while spending the least per task. The scatter also shows the “unique wins” count: issues that only vexp resolved.
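A "unique win" is simply a set difference: the tasks one agent resolved minus everything any other agent resolved. A minimal sketch, assuming you have each agent's resolved task IDs as a set (the agent names and task IDs below are placeholders, not benchmark data):

```python
from typing import Dict, Set


def unique_wins(resolved: Dict[str, Set[str]], agent: str) -> Set[str]:
    """Return the tasks that only `agent` resolved."""
    others: Set[str] = set()
    for name, task_ids in resolved.items():
        if name != agent:
            others |= task_ids
    return resolved[agent] - others


# Placeholder data for illustration only.
resolved = {
    "agent_a": {"task-1", "task-2", "task-3"},
    "agent_b": {"task-2", "task-4"},
    "agent_c": {"task-2", "task-3"},
}

for name in resolved:
    print(name, sorted(unique_wins(resolved, name)))
```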
The cost advantage compounds at scale. On a 500-task run, the difference between $0.67 and $1.98 per task is $655 — enough to pay for vexp Pro for almost 3 years.
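The $655 figure is the per-task gap multiplied by the run size. A small sketch you can adapt to your own task counts (the `projected_savings` helper is an illustrative name):

```python
def projected_savings(tasks: int, cost_low: float, cost_high: float) -> float:
    """Total saved over a run when paying cost_low instead of cost_high per task."""
    return tasks * (cost_high - cost_low)


# 500 tasks at $0.67 versus $1.98 per task.
print(f"${projected_savings(500, 0.67, 1.98):.2f}")  # prints $655.00
```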
Where vexp leads — and where it doesn't.
Resolution rates across all 12 repositories, reported in full.

astropy — 80% vs 40%
vexp doubles the resolution rate of the nearest competitor on astropy's complex astronomical computation issues.
xarray, requests — 75–83%
On data-structure-heavy repos, vexp's dependency graph surfaces the exact chain of imports and type relationships.
matplotlib — 43% vs 86%
Sonar Foundation leads on matplotlib. vexp's graph-based context is less effective on rendering-heavy, visual output code. We're investigating why — and it's a good reason to run this benchmark yourself.
How we ran this benchmark.
Transparent, reproducible, open-source. No cherry-picking.
Don't trust us. Run it yourself.
Clone. Setup. Run. Under 10 minutes to first result.
$ git clone https://github.com/Vexp-ai/vexp-swe-bench.git
$ cd vexp-swe-bench && ./setup.sh
$ source .venv/bin/activate
$ node dist/cli.js run
Requires Node ≥ 18, Python ≥ 3.10, and Docker. Default agent: Claude Code with vexp. Use --no-vexp to run without vexp and compare for yourself.
Use code BENCHMARK at vexp.dev/#pricing for 14 days of vexp Pro — free.
No credit card required.
Context is the highest-leverage variable in AI coding.
73% pass rate. $0.67 per task. 7–10 issues no one else solved.
All on your machine. No cloud. No account.
Install with npm: npm install -g vexp-cli
Free tier · No account · No credit card · Zero network calls