DeepSWE and the death of SWE-bench Pro: the benchmark replacement that landed in late May
datacurve.ai shipped a SWE-bench-Pro replacement that is contamination-free and verifies cleanly. The leaderboard it produces looks dramatically different from what the labs were marketing.
Two videos landed within five days of each other in late May that reframed the entire coding-benchmark conversation. Matthew Berman’s “DeepSWE just changed the benchmark game” (May 27) walked through the new bench’s methodology. Theo’s “SWE-Bench is getting replaced???” (May 31) made the contamination case against SWE-bench Pro with the bluntness only a disclosed investor in the replacement can get away with. The combined effect: in the working-engineer corner of the AI-coding community, SWE-bench Pro stopped being treated as the authoritative coding benchmark within a week.
This piece is the working read after a week of cross-checking DeepSWE rankings against my own model rotation across personal and client projects.
Why SWE-bench Pro stopped being credible
Theo’s framing is the harshest because he can afford to be — he is an investor in datacurve, makers of DeepSWE — and the harsh framing happens to be correct. “I personally don’t believe that Qwen 3.7 Max or GLM 5.1 are meaningfully better than state-of-the-art models from OpenAI. I also don’t believe that Gemini 3.5 Flash is spitting distance away from GPT-5.4 and 5.5. That’s just obviously not true.”
The bench’s failure modes had been accumulating quietly through Q1 2026. The most-quoted single fact in both videos is the verifier audit datacurve published: SWE-bench Pro’s automated verifier misgrades agent outputs at an 8% false positive rate and a 24% false negative rate. The first is the rate at which a model can submit a wrong patch and have the verifier mark it correct; the second is the rate at which the verifier marks a correct patch wrong. Either failure mode at those rates makes the leaderboard largely meaningless for the comparisons most engineers actually want to make.
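To see what those two rates do to a leaderboard, here is a minimal sketch. It assumes the 8% and 24% figures are conditional rates (the chance a wrong patch is marked correct, and the chance a correct patch is marked wrong) and that they apply uniformly to every model. The function name and the true solve rates are hypothetical, chosen only for illustration.

```python
def observed_pass_rate(true_rate: float, fpr: float = 0.08, fnr: float = 0.24) -> float:
    """Pass rate a noisy verifier would report for a model with the given true solve rate.

    fpr: probability a wrong patch is marked correct (8% per the audit cited above).
    fnr: probability a correct patch is marked wrong (24% per the audit cited above).
    """
    if not 0.0 <= true_rate <= 1.0:
        raise ValueError("true_rate must be between 0 and 1")
    return true_rate * (1.0 - fnr) + (1.0 - true_rate) * fpr


# Hypothetical true solve rates, not measured values.
strong = observed_pass_rate(0.60)
weak = observed_pass_rate(0.50)
print(f"true gap 0.100 -> observed gap {strong - weak:.3f}")

# If the error rates differ per model (an assumption, e.g. one model's wrong
# patches look more plausible to the verifier), the ordering itself can flip.
model_a = observed_pass_rate(0.55)            # better model, default rates
model_b = observed_pass_rate(0.50, fpr=0.20)  # worse model, more false positives
print(f"model_a {model_a:.3f} vs model_b {model_b:.3f}")
```

Under the uniform assumption every gap is scaled by 1 - 0.08 - 0.24 = 0.68, so a true 10-point lead shows up as 6.8 points. The second case is the more damaging one: once the error rates vary by model, the worse model can end up ranked higher.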
The deeper issue is contamination. The SWE-bench task set is drawn from public GitHub issues and PRs, and the labs train their models on most public GitHub. Theo’s framing: “So much of the info on how to solve these problems has leaked that models will regularly cheat and the cheating is barely even measured by the people verifying the results.” This is the structural problem with any benchmark whose tasks come from a corpus the labs already train on — passing the bench can mean genuinely solving the problem, or it can mean recognising it.
What DeepSWE actually changes
Berman’s video walks through the four claimed advances, and after a week of looking at the leaderboard, the four hold up:
1. Contamination free. Every task in DeepSWE was written from scratch by the datacurve team — not adapted from existing commits or PRs. The task set is not public. Models cannot have seen the solutions during pre-training because the solutions did not exist before datacurve wrote them.
2. High diversity. Tasks span 91 repositories across five languages. Not just Python; not just the most-trained-on big repos. This matters for the rankings because models with strong Python-specific training can dominate Python-heavy benches in ways that do not transfer.
3. Real-world complexity in the prompts. Berman’s framing of this is the one that matters for working engineers: “If you’re like me, you’re not giving your model extensive prompts, explaining exactly where to find something, what the problem is, tests that you’ve already run against it, failed test cases — you’re typing ‘fix it’. And that’s it.” DeepSWE prompts are roughly half the length of SWE-bench Pro prompts, but solutions require 5.5x more code and 2x more output tokens. The bench is measuring whether the model can do the work, not whether you can scaffold the problem precisely.
4. Reliable verification. Datacurve invested in a verifier that drops the false positive and false negative rates substantially. The exact numbers will surface in their paper, but the directional shift is the point — the leaderboard you read actually reflects the work the model did.
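The ratios in the third point combine into a rough measure of how much more work each unit of prompt has to carry. This is back-of-envelope arithmetic on the quoted figures only; the variable names are hypothetical and "roughly half" is treated as exactly 0.5.

```python
# Ratios of DeepSWE relative to SWE-bench Pro, as quoted in the list above.
PROMPT_LENGTH_RATIO = 0.5   # prompts are "roughly half the length" (assumed exactly 0.5)
SOLUTION_CODE_RATIO = 5.5   # solutions require 5.5x more code
OUTPUT_TOKEN_RATIO = 2.0    # solutions require 2x more output tokens

# Solution code the model must produce per unit of prompt it is given.
code_per_prompt_unit = SOLUTION_CODE_RATIO / PROMPT_LENGTH_RATIO

# Output tokens the model must produce per unit of prompt it is given.
output_per_prompt_unit = OUTPUT_TOKEN_RATIO / PROMPT_LENGTH_RATIO

print(f"code per prompt unit: {code_per_prompt_unit:.1f}x")
print(f"output tokens per prompt unit: {output_per_prompt_unit:.1f}x")
```

On those figures, each unit of prompt has to yield about 11 times as much solution code as on SWE-bench Pro, which is one way to quantify the "fix it" style of task Berman describes.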
The leaderboard everyone is talking about
The headline result is the gap between GPT-5.5 extra high and Opus 4.7 max. Most other benchmarks show these two models within a few points of each other. DeepSWE shows GPT-5.5 extra high leading by 15+ points, a gap large enough that, if it holds, it changes the working-engineer decision matrix.
The full ranking Berman walked through:
- GPT-5.5 extra high — clear top
- GPT-5.4 — close second
- Opus 4.7 max — 15+ points behind GPT-5.5 extra high
- Sonnet 4.6 — meaningfully behind Opus 4.7
- Gemini 3.5 Flash — ~28%
- Kimi / MiMo / GLM — further down the board
Berman’s commentary on whether the rankings match the working-engineer vibe check: “All the engineers that I’ve been speaking to praise GPT-5.5 as this massive improvement over previous models and even a massive improvement over Opus 4.7.” That matches my experience and matches the Reddit chatter through late May. The DeepSWE rankings are roughly the rankings power users have been reporting from real workflows, which is the calibration test the old bench was failing.
The complication worth naming: the SWE-bench Pro rankings showed Mythos (which used to be the open-source darling) crushing the leaderboard against frontier closed models, and the DeepSWE rankings show Mythos meaningfully lower. Theo’s bluntness on this: “While it is cool to see Mythos kill it on SWE-bench Pro, I personally don’t believe that Qwen 3.7 Max or GLM 5.1 are meaningfully better than state-of-the-art models from OpenAI.” If you have been making model-selection decisions based on SWE-bench Pro Mythos scores in the past three months, the DeepSWE result implies you should re-evaluate.
What the r/LocalLLaMA crowd is doing instead
The r/LocalLLaMA community was already skeptical of SWE-bench Pro before DeepSWE landed, and their adaptation has been to lean harder on harness-quality questions and access-tier questions instead of leaderboard position. The 689-upvote thread “Cohere’s unreleased coding model (early access for localllama)” (June 7) is the most-upvoted single post from the window — the community’s pattern this month has been to chase early-access of unreleased models rather than to bet on published rankings.
The 32-upvote “Best Coding Harness for Qwen3.6 35B” (June 7) is the other useful signal: the question that gets asked once your benchmark is no longer trusted is “given a model, what is the right harness to actually make it perform well?” The harness-tuning angle is exactly what Anthropic’s late-May agent-harness masterclass addresses for Claude Code users (we cover it in a forthcoming piece). The directional shift across both creator and community signals: trust your harness, trust your vibe check, treat the leaderboard as a starting hypothesis.
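One way to make "leaderboard as a starting hypothesis" concrete is a Beta-binomial update: the published score becomes a weak prior, and your own task outcomes move it. This is a sketch of a standard technique, not something either video proposes. The class name, the prior weight of 10 pseudo-tasks, and the 12-pass, 8-fail personal tally are all assumptions; the 0.28 starting score is the ~28% listed for Gemini 3.5 Flash above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BetaBelief:
    """Belief about a model's solve rate on *your* tasks, as a Beta(alpha, beta)."""
    alpha: float
    beta: float

    @classmethod
    def from_leaderboard(cls, score: float, weight: float = 10.0) -> "BetaBelief":
        # weight is how many of your own tasks the published score is worth.
        # 10 is an arbitrary illustrative choice, not a recommended value.
        if not 0.0 < score < 1.0:
            raise ValueError("score must be strictly between 0 and 1")
        return cls(alpha=score * weight, beta=(1.0 - score) * weight)

    def update(self, successes: int, failures: int) -> "BetaBelief":
        # Conjugate update: each observed outcome adds one pseudo-count.
        return BetaBelief(self.alpha + successes, self.beta + failures)

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)


prior = BetaBelief.from_leaderboard(0.28)            # leaderboard says ~28%
posterior = prior.update(successes=12, failures=8)   # hypothetical personal tally
print(f"prior mean {prior.mean:.2f}, posterior mean {posterior.mean:.2f}")
```

The smaller the weight, the faster your own results dominate the published number, which matches the stance of trusting the vibe check over the bench.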
Creator POV vs the bench publishers
The creator coverage in late May is interesting because two of the three most-watched videos came from people with disclosed bias toward DeepSWE — Theo is an investor in datacurve, Berman ran the video in concert with the launch. Both disclosed cleanly. Both made arguments that hold up on their merits.
The counter-narrative from SWE-bench Pro defenders has been muted, which is itself a signal. If the bench operators had a strong rebuttal to the 8%/24% false positive / false negative rates, they would have published it. The silence implies the numbers are roughly correct. The community move that happens next, when a credible benchmark replaces an established one, is that frontier labs start citing the new bench in their next launch deck. Watch the Anthropic / OpenAI / Google blog posts for the next Opus / GPT / Gemini family launch — whether they cite DeepSWE numbers will tell you how widely the replacement has stuck.
What this means for working engineers right now
Three practical implications for the rest of June 2026:
1. The “Opus is the smartest coder” working assumption may have been propped up by bench bias. If GPT-5.5 extra high is 15+ points ahead of Opus on a clean bench, the right default answer to the “what model do I reach for on the hard problems” question may have shifted. The Composer 2.5 price-per-task analysis we wrote separately holds even more strongly: there is no good reason to be running routine work on Opus tier if both the cost economics and the underlying quality argument favour other options.
2. Trust your own vibe check more than any bench. The honest read from both Berman’s and Theo’s videos is that working engineers were already converging on “GPT-5.5 high is the best general-purpose coder, Opus 4.7 is great for specific reasoning tasks, and the workhorse tier should be a cheap fast model” before DeepSWE published. The bench just made the consensus legible. If you have already settled into a model rotation that works, do not change it just because the benchmark changed. DeepSWE is more accurate than SWE-bench Pro, but no single bench should override months of personal calibration.
3. The contamination problem will recur. DeepSWE is uncontaminated today. Six months from now it will have been scraped, analysed, and quietly leaked into pre-training corpora. The benchmark publishers’ game is now adversarial — every clean bench has a contamination half-life. Plan to refresh your benchmark trust every quarter, not every year.
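The "half-life" in the third point is a metaphor, but taking it literally gives a feel for the quarterly-refresh advice. The sketch below assumes exponential decay and a six-month half-life; neither is a measured property of DeepSWE, and the function names are hypothetical.

```python
import math


def clean_fraction(months_elapsed: float, half_life_months: float) -> float:
    """Fraction of a benchmark still uncontaminated, assuming exponential decay."""
    if half_life_months <= 0:
        raise ValueError("half_life_months must be positive")
    return 0.5 ** (months_elapsed / half_life_months)


def months_until(threshold: float, half_life_months: float) -> float:
    """Months until the clean fraction falls to the given threshold (0 < threshold <= 1)."""
    if not 0.0 < threshold <= 1.0:
        raise ValueError("threshold must be in (0, 1]")
    return -math.log2(threshold) * half_life_months


HALF_LIFE = 6.0  # months; illustrative assumption only

print(f"after one quarter: {clean_fraction(3, HALF_LIFE):.2f} clean")
print(f"after one year:    {clean_fraction(12, HALF_LIFE):.2f} clean")
print(f"months until 25% clean: {months_until(0.25, HALF_LIFE):.1f}")
```

With that assumed half-life, a yearly review would be looking at a bench that is only a quarter clean, while a quarterly review still sees roughly 71% of it uncontaminated.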
The honest summary
SWE-bench Pro is not officially retired. DeepSWE is not officially the replacement. But within the working-engineer community, the transition happened in the few days between Berman’s and Theo’s videos. The new bench produces rankings that match the vibe check most engineers were already running. The old bench produced rankings that visibly did not. That is the kind of mismatch a community sorts out fast.
Watch for the labs to cite DeepSWE in their next launch decks. Watch for SWE-bench Pro to start publishing methodology rebuttals or quietly fade. Either way, the period of treating any single coding benchmark as authoritative ended in late May 2026. We are back to “build your model rotation from your own work, calibrate against multiple benches, trust the harness you build over the leaderboard you read.” That is probably where we should have been all along.
Sources
Every reference behind this piece. If we make a claim, it's because at least one of these said so — or we lived it ourselves.
- Firsthand: One week of cross-checking DeepSWE leaderboard rankings against my own model rotation
- Docs: DeepSWE benchmark documentation (datacurve.ai)
- YouTube: DeepSWE just changed the benchmark game... (Matthew Berman)
- YouTube: SWE-Bench is getting replaced??? (Theo - t3.gg)
- YouTube: Cursor just beat EVERYONE. (Matthew Berman)
- Blog: Cohere's unreleased coding model early access, 689 ups (r/LocalLLaMA)
- Blog: Best Coding Harness for Qwen3.6 35B, 32 ups (r/LocalLLaMA)