Same model. Better context. More bugs fixed.
We benchmarked four coding agents on SWE-bench Verified — all running Claude Opus 4.5. The only variable was the context layer. vexp led on both resolution rate and cost, and resolved issues no other agent could.
The results.

Evaluated on a 100-task stratified subset of SWE-bench Verified. All agents use Claude Opus 4.5 for a fair, apples-to-apples comparison. External resolution data from swe-bench/experiments. Cost data from each agent's published benchmarks.
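For readers unfamiliar with the term: a stratified subset keeps each repository's share of tasks the same as in the full benchmark. A minimal sketch of that idea (illustrative only — this is not the actual sampler used; the `repo`/`id` schema is assumed):

```python
import random
from collections import defaultdict

def stratified_subset(tasks, n, seed=42):
    """Sample ~n tasks while preserving each repo's share of the full set."""
    rng = random.Random(seed)
    by_repo = defaultdict(list)
    for t in tasks:
        by_repo[t["repo"]].append(t)
    subset = []
    for repo, group in sorted(by_repo.items()):
        # Proportional allocation, at least one task per repo.
        k = max(1, round(n * len(group) / len(tasks)))
        subset.extend(rng.sample(group, min(k, len(group))))
    return subset

# Toy example: 100 tasks split 20/50/30 across three repos.
tasks = [{"repo": r, "id": i} for i, r in enumerate(
    ["astropy"] * 20 + ["matplotlib"] * 50 + ["xarray"] * 30)]
picked = stratified_subset(tasks, 10)  # keeps the 2:5:3 ratio
```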
| Agent | Pass@1 | $/task | Unique Wins |
|---|---|---|---|
| vexp + Claude Code | 73.0% | $0.67 | 7–10 |
| Live-SWE-Agent | 72.0% | $0.86 | — |
| OpenHands | 70.0% | $1.77 | — |
| Sonar Foundation | 70.0% | $1.98 | — |
All four agents run the same model. The difference is context.
Every agent in this benchmark uses Claude Opus 4.5 with the same cost limit ($3/task) and the same turn budget (250 turns). The scaffolding varies — but the model is identical.
Yet vexp + Claude Code resolves 73% of issues at $0.67 per task. The closest competitor costs 28% more. The most expensive costs 3× more.
The difference isn't the model. It's what the model sees before it writes code.
vexp pre-indexes your codebase into a dependency graph, then delivers a ranked context capsule — full source for pivot files, skeletonized signatures for everything else — bounded to your token budget. The agent starts every task already knowing what matters.
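The capsule idea can be sketched in a few lines. This is a rough illustration, not vexp's implementation: the ranking signal (import in-degree), the `skeleton` helper, and the 4-chars-per-token heuristic are all assumptions standing in for richer graph analysis:

```python
def skeleton(source):
    """Keep only def/class signature lines — a crude 'skeletonized' view."""
    return "\n".join(
        line for line in source.splitlines()
        if line.lstrip().startswith(("def ", "class "))
    )

def build_capsule(files, deps, budget, pivots=3):
    """files: {path: source}; deps: {path: [paths it imports]}.

    Rank files by in-degree (how often they are imported), include full
    source for the top `pivots` files and skeletons for the rest, and
    stop once the token budget is spent.
    """
    indeg = {p: 0 for p in files}
    for imported in deps.values():
        for p in imported:
            if p in indeg:
                indeg[p] += 1
    ranked = sorted(files, key=lambda p: indeg[p], reverse=True)
    capsule, used = [], 0
    for i, path in enumerate(ranked):
        body = files[path] if i < pivots else skeleton(files[path])
        cost = len(body) // 4  # ~4 chars per token heuristic
        if used + cost > budget:
            break
        capsule.append(f"# {path}\n{body}")
        used += cost
    return "\n\n".join(capsule)
```

The key design point the sketch captures: heavily-imported files get full source up front, everything else is compressed to signatures, so the agent's first prompt already contains the load-bearing code.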
This benchmark is evidence for a simple thesis: in agentic coding, context engineering is the highest-leverage intervention available. Better context → fewer wasted turns → lower cost → higher resolution.
“On 7–10 tasks, vexp was the only agent to produce a passing patch. These aren't marginal improvements — they're bugs that the model simply cannot fix without the right context.”
Resolution vs cost. The efficiency map.

Top-left is the best zone: high resolution, low cost.
vexp is the only agent in that quadrant — resolving the most issues while spending the least per task. The scatter also shows the “unique wins” count: issues that only vexp resolved.
The cost advantage compounds at scale. On a 500-task run, the difference between $0.67 and $1.98 per task is $655 — enough to pay for vexp Pro for almost 3 years.
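The savings figure is easy to check from the per-task costs in the table above:

```python
runs = 500
vexp_cost, most_expensive = 0.67, 1.98  # $/task, from the benchmark table
savings = (most_expensive - vexp_cost) * runs
print(f"${savings:.0f}")  # → $655
```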
Where vexp leads — and where it doesn't.

astropy — 80% vs 40%
vexp doubles the resolution rate of the nearest competitor on astropy's complex astronomical computation issues.
xarray, requests — 75–83%
On data-structure-heavy repos, vexp's dependency graph surfaces the exact chain of imports and type relationships.
matplotlib — 43% vs 86%
Sonar Foundation leads on matplotlib. vexp's graph-based context is less effective on rendering-heavy, visual output code. We're investigating why — and it's a good reason to run this benchmark yourself.
Methodology — how we ran this benchmark.
Transparent, reproducible, open-source. No cherry-picking.
Don't trust us. Run it yourself.
Clone. Setup. Run. Under 10 minutes to first result.
```shell
$ git clone https://github.com/Vexp-ai/vexp-swe-bench.git
$ cd vexp-swe-bench && ./setup.sh
$ source .venv/bin/activate
$ node dist/cli.js run
```
Requires Node ≥ 18, Python ≥ 3.10, Docker. Default agent: Claude Code with vexp. Use --no-vexp to run without vexp and compare yourself.
Use code BENCHMARK at vexp.dev/#pricing for 14 days of vexp Pro — free.
No credit card required.
Context is the highest-leverage variable in AI coding.
73% pass rate. $0.67 per task. 7–10 issues no one else solved.
All on your machine. No cloud. No account.
Or install via npm: `npm install -g vexp-cli`
Free tier · No account · No credit card · Zero network calls