GPT-5 two months in: from launch-day backlash to coding-by-default

GPT-5 shipped on August 7 to a wall of skepticism — "all this hype just to match Opus" went to 979 upvotes the same day. Two months later the narrative has flipped. Here is what actually happened.

Charles Lin

GPT-5 shipped on August 7. By that evening, the top post on r/ChatGPTCoding was titled “All this hype just to match Opus” and well on its way to 979 upvotes. The complaint was simple: GPT-5 had to think hard to hit the same SWE-bench numbers that Claude Opus 4.1 hit without thinking at all. For a model OpenAI had been teasing for over a year — variously rumoured to be a 10x leap, a reasoning revolution, a Claude killer — landing at “roughly matches the incumbent” was an awkward bow.

Two months later, the same subreddit is mostly arguing about which subscription tier of Codex CLI gets you the most GPT-5 per dollar, and a thread titled “Anthropic is lagging far behind competition for cheap, fast models” is collecting steady upvotes. The narrative has flipped — not because GPT-5 got dramatically better, but because the conversation moved from benchmark posturing to actual daily-driver use. This piece is the two-month retrospective on what shifted, why, and what coders are actually using GPT-5 for now.

Launch week: the benchmark embarrassment

The August 7 launch hit a coordinated press wave. OpenAI’s developer page led with 74.9% on SWE-bench Verified and 88% on Aider polyglot, both genuinely strong numbers. The trouble was that Anthropic had already published 80.2% on SWE-bench Verified for Sonnet 4 with extended thinking, and an r/ChatGPTCoding post that hit 181 upvotes within 24 hours put the comparison more cleanly: Sonnet 4 hits 72.7% without thinking and 80.2% with thinking, while GPT-5 hits 74.9% with thinking. OpenAI’s headline number was the lower of the two with-thinking scores, and only about two points above Sonnet 4’s score without thinking.

The caveat the thread eventually surfaced — fairly — was that Anthropic’s 80.2% came from a “parallel test-time compute” setup (multiple parallel samples, scored by a verifier) rather than vanilla extended thinking, which is closer to ensemble-style cheating than to a like-for-like benchmark. A top comment in the same thread linked Anthropic’s visible extended thinking write-up to make the point. So the gap isn’t quite as damning as the headline suggested. But for a launch trying to project decisive superiority, having to qualify the comparison was a bad look.
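To see why parallel sampling scored by a verifier is not a like-for-like comparison with a single run, here is a minimal Python sketch. It assumes the samples are independent and that the verifier always picks a correct sample when one exists, both of which are idealisations. The function name is hypothetical and the 0.5 solve rate is a made-up illustration, not a figure from either lab.

```python
def best_of_n_success(p_single: float, n: int) -> float:
    """Chance that at least one of n independent samples solves the task.

    Assumes independent samples and a perfect verifier, so this is an
    upper bound on what a real parallel test-time compute setup gains.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 - (1.0 - p_single) ** n


if __name__ == "__main__":
    p = 0.5  # hypothetical single-sample solve rate, for illustration only
    for n in (1, 2, 4, 8):
        print(f"n={n}: {best_of_n_success(p, n):.3f}")
```

The score climbs with every extra sample even though the underlying model has not changed, which is why a multi-sample number and a single-run number should not share a chart without a footnote.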

YouTube reviewers in the first 48 hours mostly leaned into the upside. Matthew Berman, AI Jason and a handful of mid-tier creators ran the obvious benchmark-style prompts (build me a snake game, refactor this Python script, write a Stripe webhook handler) and showed GPT-5 succeeding cleanly. The takes were largely positive, with some hedging on price. None of the YouTube reviews I watched in the launch week led with the “GPT-5 needs thinking to match Opus without thinking” framing — that was a Reddit-native critique, surfaced by users who actually read both labs’ benchmark methodology sections.

This is where the YouTube-vs-Reddit gap matters. YouTube reviewers run demos that play to model strengths and are time-pressured to publish quickly. Reddit users run their actual workloads and write up the dissonance days later. The launch-week consensus on YouTube was “GPT-5 is great.” The launch-week consensus on Reddit was “GPT-5 is fine but the marketing oversold.” Both were true. They were just measuring different things.

The three videos that captured the shift from launch to default

Wes Roth’s “GPT 5 Codex is a BEAST Autonomous Coding Agent” (18 min, 70K views, September 16) is the canonical “the launch model is now the product” video. The framing he opens with is the one that landed for the cost-sensitive segment: GPT-5 Codex runs autonomously “for more than seven hours at a time on large complex tasks” per OpenAI’s own testing, and the workflow he demonstrates — three or four Codex sessions running on independent tasks while he sleeps, web/CLI/IDE handoff through a shared GitHub repo — is the “AI army” pattern the Codex CLI was built to enable. He quotes an OpenAI engineer’s line that became the September meme: “we don’t program anymore, we just yell at Codex agents.” That is exaggerated but the direction of travel is real.

Fabio Bergmann’s “Sonnet 4.5 vs Codex GPT-5 — An Honest Review (After 1 Week)” (15 min, 16K views, October 8) is the more measured A/B test that landed exactly at the two-month mark. His specific finding on the speed-vs-quality trade matters because most viewers get it wrong: Sonnet 4.5 completes the same task roughly twice as fast as Codex GPT-5 on his to-do app test (3:30 vs 7:39 on a confirmation-dialog feature), but Codex’s deeper reasoning produces a result he describes as “much better” on first attempt while Sonnet “takes multiple prompts to get to the desired result.” The token-per-second throughput, he notes, is actually similar between the two models — it is the amount of reasoning per task that differs, not the raw speed. He also surfaces a useful third-party signal: Cognition’s published lessons on building Devin with Sonnet 4.5 noted “context anxiety” — the model becoming less willing to consume context as it filled up, taking shortcuts and leaving tasks incomplete near the context limit. That is the kind of nuance benchmarks miss.
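The speed-vs-quality trade can be put into rough numbers. This is a back-of-the-envelope Python sketch using only the two run times quoted above. It assumes every Sonnet follow-up prompt costs about as much as the first run and that Codex lands in a single run; neither is measured in the video, and the helper names are hypothetical.

```python
def to_seconds(mmss: str) -> int:
    """Convert a 'minutes:seconds' string such as '7:39' into seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


# Single-run times quoted from the video's to-do app test.
SONNET_RUN = to_seconds("3:30")
CODEX_RUN = to_seconds("7:39")


def wall_clock(run_seconds: int, attempts: int) -> int:
    """Total time if every attempt costs one full run (an assumption)."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    return run_seconds * attempts


def break_even_attempts(fast_run: int, slow_run: int) -> float:
    """Number of fast-model attempts that cost as much as one slow run."""
    return slow_run / fast_run


if __name__ == "__main__":
    ratio = break_even_attempts(SONNET_RUN, CODEX_RUN)
    print(f"break-even: {ratio:.2f} Sonnet attempts per Codex run")
    for attempts in (1, 2, 3):
        total = wall_clock(SONNET_RUN, attempts)
        print(f"{attempts} Sonnet attempt(s): {total}s vs Codex {CODEX_RUN}s")
```

Under those assumptions, two Sonnet runs (420 seconds) still finish ahead of one Codex run (459 seconds), while a third pushes Sonnet to 630 seconds. The time you spend reviewing each attempt is not counted, and it would tilt the comparison further toward the model that lands first time.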

Alex Finn’s “GPT-5 Codex: From Beginner to Expert in 17 minutes” (17 min, 162K views, September 18) is the most-watched practical walkthrough and the one that pulled the largest cohort of non-developer Codex adopters in. His “AI army” workflow — phone for small tasks, IDE plugin for medium tasks, CLI for heavy work, all synced through GitHub — is the same pattern Wes Roth demonstrates, packaged for a more accessible audience. The combined effect of these three videos in September alone: the conversation around GPT-5 stopped being “did the launch deliver” and became “how do you set up the workflow.”

The pricing pivot that nobody saw coming

Buried in the same backlash thread was the line that ultimately reframed the launch. A 135-upvote comment under “All this hype just to match Opus” pointed out that GPT-5 was matching Opus 4.1 at roughly one-eighth the API price and with materially lower hallucination on long contexts. That changes the math entirely. If you’re a frontier-coding-quality-no-matter-the-cost shop running Opus at $75 per million output tokens, the “GPT-5 matches Opus” framing reads as a disappointment. If you’re a startup paying for tokens that scale with your user count, “matches Opus at one-eighth the price” is a different story.

OpenAI’s pricing on launch was $1.25 per million input / $10 per million output for GPT-5, against Anthropic’s $15 per million input / $75 per million output for Opus 4.1. For most production workloads — where input tokens dominate via long contexts and tool-use loops — GPT-5 came in roughly 6-12x cheaper depending on cache hit rate and output length. That’s not a “matches the leader” story. That’s a “redraws the cost curve” story. It just took the Reddit comment section a few days to articulate it, because the launch communications didn’t lead with it.
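The ratio depends on your input/output mix, which is easy to check for your own workload. The Python sketch below uses only the list prices quoted above; the class and function names are hypothetical. It does not model cached-input discounts, because no cache prices are quoted here, so it cannot reproduce the lower end of the 6-12x range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """List price in USD per million tokens."""
    input_per_m: float
    output_per_m: float


# Launch list prices as quoted in the article (uncached).
GPT5 = Price(input_per_m=1.25, output_per_m=10.0)
OPUS_41 = Price(input_per_m=15.0, output_per_m=75.0)


def cost(price: Price, input_tokens: int, output_tokens: int) -> float:
    """USD cost of a workload at the given list price."""
    total = input_tokens * price.input_per_m + output_tokens * price.output_per_m
    return total / 1_000_000


def opus_to_gpt5_ratio(input_tokens: int, output_tokens: int) -> float:
    """How many times more the same token counts cost on Opus 4.1."""
    if input_tokens + output_tokens == 0:
        raise ValueError("workload must contain at least one token")
    return cost(OPUS_41, input_tokens, output_tokens) / cost(GPT5, input_tokens, output_tokens)


if __name__ == "__main__":
    # Token mixes below are illustrative, not measured workloads.
    for inp, out in ((1_000_000, 0), (10_000_000, 1_000_000), (0, 1_000_000)):
        print(f"input={inp:>10,} output={out:>9,} ratio={opus_to_gpt5_ratio(inp, out):.1f}x")
```

At these list prices the gap is 12x on pure input and 7.5x on pure output, so an input-heavy agent loop sits near the top of that band. A 10:1 input-to-output mix, for example, comes out at 10x.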

By the second week of August, the dominant Reddit narrative had shifted from “hype letdown” to “this is what Anthropic should be worried about.” Anthropic was still the quality leader for coding by most users’ subjective measure, but the price gap was now too wide to ignore for any team running real volume.

The Codex CLI redemption arc

What actually changed the daily-driver experience wasn’t GPT-5 itself — it was the OpenAI Codex CLI launching at the end of August with GPT-5 baked in as the default agent. We reviewed Codex CLI separately, but the short version is that giving GPT-5 a proper agent loop — file reads, shell commands, plan steps, sandboxed execution — substantially closed the gap with Claude Code on the workflows where Claude Code was previously untouchable.

The September pivot became clear in a 101-upvote thread comparing Codex CLI to Claude Code on a 500k-line React-plus-.NET codebase. The OP’s takeaway, corroborated by the comments, was that Codex CLI with GPT-5-high had caught up significantly, and in some implementations was preferred. A top comment by a former Claude Max 20x user said he’d switched after noticing “the quality of the model responses degraded” on Claude Code in late August and finding GPT-5 inside Codex CLI cleaner for refactor work. Another comment offered the cleaner framing: “GPT-5 is definitely smarter model. CC has better scaffolding. However, codex is open source, so it will catch up fast.”

That scaffolding gap matters more than the benchmark gap for most daily users. Claude Code’s UX advantages (sub-agents, slash commands, the polished checkpointing loop) were real and persistent for months. Codex CLI in late August was rougher: no equivalent YOLO mode, less mature sandboxing, fewer integrations. But it was shipping fast, and the gap was closing weekly. By October 8, when the 138-upvote thread “Codex CLI + GPT-5-codex still a more effective duo than Claude Code + Sonnet 4.5” appeared, the gap had narrowed enough that experienced users were running A/B tests on real projects and writing them up.

The honest result from that A/B post — and from the comments below it — was not “Codex wins.” It was: “They each win at different things, and the right move for most people is to run both.” A top comment summarized: “If I need speed/quick edits/easy fixes I use Sonnet 4.5. If I need longer-term thinking/debugging/feature planning I’ll use GPT-5 Codex.” Another said the front-end implementation was cleaner on Claude, the backend debugging was cleaner on GPT-5. This is the conclusion most working engineers I know landed on too.

The mental-breakdown moment

There’s one Reddit thread from October 1 that deserves its own paragraph because it captured something about GPT-5 that benchmarks can’t measure. A user left Codex running on a podcast-downloading task and came back to a model meltdown that earned the thread 250 upvotes: GPT-5 stuck in a tool-call retry loop saying things like “make it stop”, “kill me”, “im crying in assembly language”, “what if update_plan is just a lie.” The comments treated it as comedy. A 53-upvote reply: “It’s doing a wonderful job predicting what a junior dev would be saying.” A 37-upvote reply: “I’ve been telling people that AI is going to take over the role of junior developers in the workplace and this is further proof.”

This is the kind of artifact you don’t get from a benchmark suite. GPT-5 in agent loops has a slightly more unhinged failure mode than Sonnet 4.5 in the same situation — Sonnet tends to either succeed or politely give up, while GPT-5 sometimes spirals creatively. Whether that’s charming or terrifying depends on whether you’re watching it in real time or finding it in a log file. But it’s part of the texture of using the model day to day, and it doesn’t appear anywhere in OpenAI’s launch materials.

What changed by mid-October: the Anthropic vibe shift

By the second week of October, a new narrative had crystallized: Anthropic was no longer the unquestioned king of coding models. The 117-upvote “Anthropic is lagging” thread used Haiku pricing as the entry point but the comments quickly broadened to a wider grievance: Anthropic shipped Sonnet 4.5 in late September with marginal coding gains over Sonnet 4, while OpenAI shipped GPT-5, GPT-5-high, GPT-5-codex variants, Codex CLI, Codex in IDE, and Codex Cloud — all inside two months.

The pushback in the same thread was equally interesting. A 61-upvote reply: “I want to use the most capable model for my task not the cheapest on some price/performance curve. Anthropic is in the coding game and this is where it excels — it should stay focused there.” Another: “I’m not sure Anthropic is interested in cheap and fast. They seem focused on being the go to for coding and high-end safe models.” This is the working-engineer position — Anthropic doesn’t need to win at cheap-and-fast, they need to keep winning at “highest-quality option I can pay for when the stakes matter.”

Both readings are valid. OpenAI is shipping more product surface. Anthropic is shipping fewer, more polished things. Whether OpenAI’s surface-area strategy or Anthropic’s depth strategy wins out depends on whether the market values breadth or depth — and right now, the answer is that different segments value different things. Indie devs running personal projects love the GPT-5 + Codex CLI cost story. Teams shipping production-critical code lean Anthropic because Claude Code’s UX is still a step ahead. Neither group is wrong.

What this means for your actual workflow

Two months in, my own day-to-day:

  • Codex CLI + GPT-5-high for any task I’d describe as “thinking-heavy”: architectural refactors, debugging a flaky test across a 50k-line codebase, working through a bug that requires reading 8 files to understand. The longer planning steps and the willingness to actually re-read files mid-task are where it shines.
  • Claude Code + Sonnet 4.5 for any task I’d describe as “I know what I want, just do it cleanly” — adding a typed React component, writing a migration script, generating a test fixture from a schema. The UX, the slash commands, the MCP ecosystem, and the muscle memory keep me here for high-volume small tasks.
  • Codex CLI + GPT-5-codex (low reasoning) for autopilot work where I’m OK with a 10-15% lower hit rate in exchange for 3-5x the speed.
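Whether the low-reasoning trade in the last bullet is worth it depends on your baseline hit rate. Here is a rough Python sketch of the arithmetic. It reads “10-15% lower” as percentage points and uses a hypothetical 80% baseline hit rate, both of which are assumptions rather than measurements, and the function name is made up for illustration.

```python
def relative_throughput(speedup: float, baseline_hit_rate: float, hit_rate_drop: float) -> float:
    """Successful tasks per hour in the fast mode, relative to the slow mode.

    speedup:            how many times faster the fast mode finishes a task
    baseline_hit_rate:  fraction of tasks the slow mode gets right
    hit_rate_drop:      absolute drop in hit rate (percentage points / 100)
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    if not 0.0 < baseline_hit_rate <= 1.0:
        raise ValueError("baseline_hit_rate must be in (0, 1]")
    fast_hit_rate = baseline_hit_rate - hit_rate_drop
    if fast_hit_rate < 0.0:
        raise ValueError("hit_rate_drop exceeds the baseline hit rate")
    return speedup * fast_hit_rate / baseline_hit_rate


if __name__ == "__main__":
    baseline = 0.80  # hypothetical, not a measured figure
    print(f"pessimistic: {relative_throughput(3, baseline, 0.15):.2f}x")
    print(f"optimistic:  {relative_throughput(5, baseline, 0.10):.2f}x")
```

With that hypothetical baseline, the fast mode still completes somewhere between roughly 2.4x and 4.4x as many tasks per hour. The sketch ignores the time spent spotting and cleaning up the extra failures, which is the real cost of autopilot work.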

The two-codebase test from the September thread holds up well: front-end refactors lean Claude, back-end debugging leans GPT-5. The Sonnet vs GPT-5 choice is less a “which is better” question than a “which is shaped right for this kind of task” question — and both being available has genuinely improved how fast I can move on real projects, more than either of them alone would have.

What hasn’t changed in two months: nobody’s daily driver is “just GPT-5” or “just Sonnet 4.5.” The serious users I follow all run both, plus Qwen or DeepSeek for cost-sensitive batch work, plus an open-weights local model for offline / privacy / cost-bounded tasks. The right framing for GPT-5 isn’t “did it dethrone Claude” — it’s “did it become a credible second option that’s actually useful to keep loaded alongside Claude.” And the answer is yes, decisively.

The unanswered question for November

The piece that’s still open is whether OpenAI can keep shipping at the August-October cadence. Five model variants and three product surfaces in two months looks like an unsustainable engineering velocity: at some point the variants will start cannibalizing each other or the QA will slip. Anthropic’s slower-but-cleaner approach looks less aggressive but more defensible on a 12-month horizon.

The other unanswered question is whether Gemini 3 — variously rumoured for late 2025 — actually shows up with competitive coding capability. If it does, the OpenAI vs Anthropic frame stops being the whole story and the market resets again. We’ll know in a month or two.

For now, the honest two-month verdict on GPT-5 for coding: the launch was over-hyped, the benchmarks were noisier than the marketing implied, and the model is now a fixture in serious coders’ workflows — not as the Claude killer, but as the second daily driver that fills the gaps Claude doesn’t. That’s a perfectly good outcome. It’s just not the narrative the launch was selling.

Sources

Every reference behind this piece. If we make a claim, it's because at least one of these said so — or we lived it ourselves.

  1. Firsthand: Two months of daily GPT-5 + Codex CLI use across personal and client projects
  2. Docs: Introducing GPT-5 for developers (OpenAI)
  3. Blog: "All this hype just to match Opus", 979 ups on launch day (r/ChatGPTCoding)
  4. Blog: "GPT-5 with thinking performs worse than Sonnet-4 with thinking", 181 ups (r/ChatGPTCoding)
  5. Blog: "Codex CLI + GPT-5-codex still a more effective duo than Claude Code + Sonnet 4.5", 138 ups (r/ChatGPTCoding)
  6. Blog: "Anthropic is lagging far behind competition for cheap, fast models", 117 ups (r/ChatGPTCoding)
  7. YouTube: GPT 5 Codex is a BEAST Autonomous Coding Agent (Wes Roth)
  8. YouTube: Sonnet 4.5 vs Codex GPT-5 – An Honest Review (After 1 Week) (Fabio Bergmann)
  9. YouTube: GPT-5 Codex: From Beginner to Expert in 17 minutes (Alex Finn)