Codex CLI + GPT-5 vs Claude Code + Sonnet 4.5: the agent-loop showdown

Both shipped major upgrades within four weeks of each other. After running them daily for a month, here is the honest head-to-head — where each wins, where each fails.

Charles Lin · October 20, 2025

In the four weeks between Codex CLI’s full GPT-5 release (late August) and Claude Code’s Sonnet 4.5 update (late September), two products that started in very different places ended up close enough to A/B test on the same task. By mid-October I had run both as my daily driver for over a week each, on a mix of a 200k-line client codebase, a personal Astro site rewrite, and a backend refactor at work. The takeaways below are the consolidated read after that month, cross-checked against the 102-upvote Reddit comparison on a 500k-line codebase and the 138-upvote “more effective duo” follow-up that landed in early October.

The short version: they are different tools that both happen to live in a terminal, and the right move for most engineers is to run both. That said, the specific failure modes and strength areas matter more than the overall verdict — so here is the comparison broken down by the dimensions that show up in daily work.

Four videos that shaped how people picked sides this fall

The single most useful piece of independent coverage on this exact comparison is Steve from Builder.io’s “Codex vs Claude Code: which AI coding agent is better?” (8 min, 94K views, September 28). Steve is a credible source on this question specifically — he was “a cursor agent power user for months” who wrote a widely-referenced Cursor tips post, then “Claude Code came out and that became my go-to,” then within weeks switched again: “I now use Codex as my daily driver.” The reasoning he gives in the video is granular and worth quoting because it is the cleanest articulation of the pro-Codex case from someone who genuinely tried both.

His take on the agents: “Codex is so similar to Claude Code that I really wonder if they trained off of Claude Code’s outputs as well. I’ve noticed some small things like the Codex agent likes to reason for longer but seems to have a faster token per second output. Claude Code likes to spend less time reasoning but has a slower tokens per second output.” On model options: “I like these model options [low/medium/high/minimal reasoning] better than the just two model options in Claude Code.” On pricing: “GPT-5 is actually a significantly more efficient model under the hood than Claude Sonnet… Codex can offer you more usage at a lower price.” On limits: “In my experience, a lot more people will be fine on the $20 a month Codex plan than they would on the $17 a month Claude plan where people seem to hit the limits pretty quickly. Even on the $100 and $200 plans for Claude, some heavy users still hit limits. Whereas the Codex, I’ve almost never heard of someone hitting limits on the pro plans.” On the one thing Claude Code still wins: “Claude does have better MCP integration with lots of connectors you can just click to install.”

OpenAI’s own “Using OpenAI Codex CLI with GPT-5-Codex” (6 min, 204K views, October 14) is the vendor walkthrough and worth watching because it lays out the specific Codex CLI ergonomics that Claude Code does not have: a /model switcher that lets you change reasoning level mid-session, three approval modes (read-only / auto / full access) that persist per-project, and codex resume for picking up any previous session. The auto mode “stays in the boundaries of your project. It’s not going to affect anything else on your laptop” — which is the answer to Steve’s complaint about Claude Code’s permission system. OpenAI also leans into the multiplayer-game-built-in-one-prompt demo, which is showy but does correctly illustrate the “long-running agentic” framing the company is positioning Codex around.

Alex Finn’s “GPT-5 Codex: From Beginner to Expert in 17 minutes” (17 min, 162K views, September 18) is the more practical workflow video and the one that captures why heavy users switched. His framing is the “AI army” pattern: ChatGPT app on the phone for small tasks dispatched while you are away from the computer, Codex extension in Cursor for medium tasks, Codex CLI in the terminal for the heavy work — all synchronised through a GitHub repo. “Claude Code is an amazing AI coding tool, but you have to handhold it step by step. The difference with Codex… is I’m going to show you how to create your own AI army that’s going to be doing the coding for you.” Whether his specific workflow is for you is a separate question — but the parallel-task-dispatch pattern he is articulating is real, and it is genuinely something Claude Code cannot replicate today.

The counterweight is Cole Medin’s “Claude Sonnet 4.5 — The New Coding King? (Sonnet 4.5 vs. GPT-5 Codex)” (11 min, 38K views, September 30). He runs the same PRP — a Stripe integration into an existing agentic chat application — through both tools in parallel, live, no dry run. His honest framing acknowledges the migration trend Steve articulates: “a lot of people have been switching over to Codex from Claude Code. And so I’m really curious if Sonnet 4.5 is enough to bring everyone back.” He does not declare a winner. Neither does the Reddit data that followed.

The setup

Two reasonably-matched workflows after I had been on each for a week:

Codex CLI + GPT-5-high

$200/month ChatGPT Pro subscription (rate-limit-bound rather than token-bound)
codex invoked from terminal, AGENTS.md at repo root, sandbox mode workspace-write
GPT-5-high as the agent model; GPT-5-codex available for low-reasoning autopilot tasks
Optional MCP servers connected for filesystem, GitHub, browser

Claude Code + Sonnet 4.5

Claude Max 20x subscription ($200/month, rate-limit-bound)
claude from terminal, CLAUDE.md at repo root, slash commands and sub-agents configured
Sonnet 4.5 as default model; Opus 4.1 available for harder tasks
MCP ecosystem maturity gives it more pre-built tool options

I tried to be deliberate about the same prompt to both on enough tasks that the comparison was not anecdotal. The findings below are weighted toward patterns that repeated across multiple tasks, not one-off impressions.

Where Codex CLI + GPT-5-high wins

1. Reading large codebases to find the right thing

This is the dimension where the 500k-line Reddit thread and my own testing agreed most clearly. GPT-5-high is patient about reading files. Asked to fix a bug whose root cause spans 6-8 files, it will systematically read each one — sometimes re-reading mid-task when it realizes it missed context. Sonnet 4.5 in Claude Code is faster to read, often correct, but more willing to make confident edits before fully understanding the surrounding code. On the bigger codebase, this difference compounded: Codex CLI found 2 of 3 cross-file bug roots that Claude Code’s first pass missed, before I had even given it a second chance.

A top comment under the 500k-codebase thread put it cleanly: “GPT-5 is definitely smarter model. CC has better scaffolding.” The smartness gap is real but narrow. The scaffolding gap is wider but moving fast.

2. Multi-step debugging with stateful tool calls

For tasks that span more than 15-20 agent turns — debugging a flaky integration test, working through a tricky migration, hunting an N+1 query across an ORM-heavy backend — GPT-5-high sustains the planning state better. Sonnet 4.5 starts drifting around turn 15-20 in my logs, especially if the task involves bouncing between filesystem reads and shell command output. Codex CLI’s update_plan tool seems to help here: GPT-5 actually uses it to write down what it learned mid-task, and the plan persists across turns.

This is also the same trait that produced the 250-upvote “Codex had a mental breakdown” thread — when the update_plan tool gets stuck, GPT-5 spirals creatively rather than giving up cleanly. That is the trade-off: more stamina on healthy paths, more dramatic failure modes when something breaks.
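The general idea of a plan that survives across turns can be sketched in a few lines. This is a minimal illustration under assumptions: it is not the schema or behavior of Codex CLI's update_plan tool, and the `Plan` and `PlanStep` names, fields, and status strings are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PlanStep:
    description: str
    status: str = "pending"  # hypothetical states: "pending" or "completed"


@dataclass
class Plan:
    """Hypothetical plan state kept outside the model's context window."""

    steps: list[PlanStep] = field(default_factory=list)
    notes: list[str] = field(default_factory=list)

    def complete(self, index: int, note: str) -> None:
        """Mark a step done and record what was learned while doing it."""
        self.steps[index].status = "completed"
        self.notes.append(note)

    def next_step(self) -> Optional[PlanStep]:
        """Return the first unfinished step, or None when the plan is done."""
        for step in self.steps:
            if step.status != "completed":
                return step
        return None
```

The point of the sketch is that findings are written down next to the steps, so a later turn can read them back instead of relying on what is still in context.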

3. Cost-per-task on heavy work

On the ChatGPT Pro subscription, the rate limits are weekly rather than per-message, and GPT-5-codex (the lower-reasoning variant) is much faster than GPT-5-high for routine work. Steve’s video lands the pricing point with specifics: GPT-5 costs “anywhere from 2/3 to a half of what Sonnet costs and closer to a tenth of what Opus costs.” His Builder.io data showed “GPT-5 and GPT-5 codex costs a third of what Claude Sonnet does.” A Reddit comment under the “more effective duo” thread captured the lived economics: “Codex is insanely good. On the Plus subscription, ran out of weekly limit in 2 days. But it was 2 days of heavy usage. Something that would have taken me at least 2 months if done manually.”

If you are paying per API call rather than by subscription, the comparison is by token. GPT-5 via API is cheaper per token than Claude Sonnet 4.5 (GPT-5 is $1.25/$10 per million input/output tokens vs Sonnet 4.5 at $3/$15), and the gap widens in practice because Sonnet 4.5 typically generates more tokens per task (its prose is more verbose). Price does not settle the choice, though: most subscription users I know prefer Codex; most API-billed users I know prefer Claude.
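To make the per-token versus per-task distinction concrete, here is a small sketch of the arithmetic. The list prices are the ones quoted above, read as USD per million input/output tokens. The token counts in the usage example are made-up illustrative numbers, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """List price in USD per one million tokens."""

    input_per_mtok: float
    output_per_mtok: float


# Prices as quoted in the text above.
SONNET_4_5 = Pricing(input_per_mtok=3.00, output_per_mtok=15.00)
GPT_5 = Pricing(input_per_mtok=1.25, output_per_mtok=10.00)


def task_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one task given its token counts."""
    return (
        input_tokens * pricing.input_per_mtok
        + output_tokens * pricing.output_per_mtok
    ) / 1_000_000


if __name__ == "__main__":
    # Hypothetical token counts for a single agent task (assumption).
    print(f"Sonnet 4.5: ${task_cost(SONNET_4_5, 400_000, 30_000):.2f}")
    print(f"GPT-5:      ${task_cost(GPT_5, 400_000, 30_000):.2f}")
```

At equal token counts the model with the lower list price is cheaper. A model that spends more tokens on the same task shifts that, which is why per-task cost is the number to track.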

4. The parallel-task-dispatch pattern

Alex Finn’s “AI army” framing maps onto a real product capability that Claude Code does not yet match. Codex’s web/mobile/CLI/IDE integration through a shared GitHub repo lets you actually fire off three small tasks from your phone in the morning, return to your laptop at lunch with three PRs queued for review, and use the CLI for the harder afternoon work. Claude Code’s mobile story is a much thinner equivalent. For solo developers or founders working in bursts across devices, this matters.

Where Claude Code + Sonnet 4.5 wins

1. UX, slash commands, and the workflow loop

This is where Claude Code’s nine-month head start shows. The slash commands (/init, /review, /security-review, custom user-defined commands), the sub-agent system, the granular permission UI, the way CLAUDE.md files compose hierarchically across nested directories — none of this exists in Codex CLI yet at the same polish level. Codex CLI’s AGENTS.md is close to CLAUDE.md but more limited, and there is no equivalent sub-agent system. For workflows where you have already built up a personal automation layer on top of Claude Code, the migration cost to Codex CLI is significant.
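As a rough illustration of what "compose hierarchically" means, the sketch below collects instruction files from the repo root down to the directory being worked in, so more specific files come last. This is a model for explanation only, not Claude Code's actual loader. The function name and the root-first ordering are assumptions.

```python
from pathlib import Path


def collect_instruction_files(
    repo_root: Path, work_dir: Path, filename: str = "CLAUDE.md"
) -> list[Path]:
    """Return existing instruction files from repo_root down to work_dir.

    Hypothetical helper: general files come first, more specific ones last.
    """
    repo_root = repo_root.resolve()
    work_dir = work_dir.resolve()
    # Raises ValueError if work_dir is not inside repo_root.
    relative = work_dir.relative_to(repo_root)

    found: list[Path] = []
    current = repo_root
    # The empty string stands for the repo root itself.
    for part in ("", *relative.parts):
        current = current / part if part else current
        candidate = current / filename
        if candidate.is_file():
            found.append(candidate)
    return found
```

Passing `filename="AGENTS.md"` applies the same walk to the Codex-style convention file, which makes it easy to see how much of a nested setup would need porting.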

A comment under the same “more effective duo” thread made the case bluntly: “If it’s about the model itself stripped from any DX add-ons, I’d say Claude is on par with Codex high. Adding all the add-ons and the DX that Claude Code has, Codex doesn’t stand a chance.” That is overstated, since Codex CLI is good enough on raw capability to overcome a DX gap, but the directional point is correct. Claude Code is a more polished tool today.

2. Front-end and React work specifically

Two A/B tests I ran on real React migrations had Claude Code producing cleaner output in less iteration. The 500k-codebase Reddit thread’s exact framing: “Claude outdid GPT-5 in frontend implement and GPT-5 outshone Claude in debugging and implementing backend.” This matched my experience. Sonnet 4.5 seems to have absorbed more idiomatic React 19 patterns and is more conservative about adding incidental complexity to component files. GPT-5 will sometimes refactor a component “while it’s there” in a way that is technically an improvement but expands the diff beyond what I asked for.

3. Quick-edit tasks where you know exactly what you want

For tasks where the spec is precise — “add a new field to this Zod schema, plumb it through to the API route handler, write a test” — Sonnet 4.5 + Claude Code is noticeably faster. GPT-5-high will do the same task correctly but with more deliberation: it will re-read the schema file, check imports, consider edge cases. That is valuable when the task is ambiguous; it is overhead when the task is well-specified. Sonnet 4.5 is better-tuned for the “I know what I want, just do it cleanly” case.

4. Sub-agents and parallel task dispatch within a session

Claude Code’s sub-agent system lets you dispatch a child agent for a research task while the parent agent continues with the implementation. For complex tasks where you need to do independent research in parallel — e.g. surveying three different libraries for an API client choice, then making the implementation decision — this is currently a clean win for Claude Code. Codex CLI does not have an equivalent in-session abstraction yet. (Alex Finn’s “AI army” is the cross-device version of this; Claude Code’s sub-agents are the in-session version. They are different.)
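The shape of the in-session pattern, independent research running in parallel followed by a single decision, can be sketched with plain Python threads. Nothing here calls Claude Code: `research_library` is a hypothetical stand-in for a child agent, and its scoring is an arbitrary placeholder so the sketch runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def research_library(name: str) -> dict:
    """Hypothetical stand-in for a child agent's research task."""
    # Arbitrary placeholder score (assumption); a real task would return findings.
    return {"library": name, "score": len(name)}


def survey_then_decide(candidates: list[str]) -> str:
    """Run one research task per candidate in parallel, then pick one.

    candidates must be non-empty.
    """
    with ThreadPoolExecutor(max_workers=len(candidates)) as pool:
        # map preserves input order even though tasks run concurrently.
        reports = list(pool.map(research_library, candidates))
    best = max(reports, key=lambda report: report["score"])
    return best["library"]


if __name__ == "__main__":
    print(survey_then_decide(["lib_a", "lib_bb", "lib_ccc"]))
```

The parent only sees the finished reports, which mirrors the description above: research happens off to the side and the implementation decision stays in one place.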

5. MCP ecosystem maturity

This is the one feature Steve from Builder.io explicitly conceded in his pro-Codex video: “Claude does have better MCP integration with lots of connectors you can just click to install.” Codex CLI does support MCP servers, but the directory of pre-built ones for Claude Code is wider and the install ergonomics are better. If your stack relies on Figma, Playwright, Linear, Context7, or any of the dozen popular community MCP servers, Claude Code is the lower-friction path.

Where they are roughly tied

Pure code quality on standard tasks — write a SQL query, generate a TypeScript interface from a JSON sample, write a unit test from a spec. Both models do this well, and the result is usually indistinguishable. The 0.5% SWE-bench gap is invisible in real use.
Test generation — both produce reasonable tests. Sonnet 4.5 leans toward more verbose tests with edge cases; GPT-5 leans toward fewer, more focused tests. Pick whichever style you prefer.
Refactoring small modules — anything under 500 lines of changes, both will do well.
Following AGENTS.md / CLAUDE.md instructions — both honor the project conventions file reasonably well, with occasional drift that is roughly equivalent across the two.

Where both still struggle

It is worth being honest that neither tool has solved several persistent issues:

Long-running agent loops without supervision. Both will occasionally enter a confidently incorrect loop where they make the same wrong edit repeatedly. Codex CLI’s failure mode is more theatrical (see the mental-breakdown thread); Claude Code’s is quieter but still happens.
Test selection. Both will sometimes generate the wrong tests for the change they made — tests that do not cover the actual edited behavior, or tests that test the mock rather than the implementation.
Knowing when to ask for clarification. Both are biased toward executing rather than asking. This is genuinely a product design choice (asking too much would be annoying) but it does mean they will occasionally guess wrong on ambiguous prompts.
Managing context drift on very large diffs. If you are applying a 2000-line refactor across 30 files in a single agent loop, both will lose track of what they did three steps ago.
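One cheap mitigation for the repeated-edit loop is a guard in whatever harness wraps the agent: fingerprint each proposed edit and stop when the same one keeps coming back. This is a sketch under assumptions. The class name, the threshold, and the idea of wiring it into either tool are hypothetical, not features of Codex CLI or Claude Code.

```python
import hashlib
from collections import Counter


class RepeatedEditGuard:
    """Flags an agent loop that keeps proposing the identical edit."""

    def __init__(self, max_repeats: int = 3) -> None:
        self.max_repeats = max_repeats
        self._seen: Counter = Counter()

    def should_stop(self, file_path: str, new_content: str) -> bool:
        """Record an edit; return True once it has occurred max_repeats times."""
        # Hash path and content together so the same text in two files differs.
        digest = hashlib.sha256(
            f"{file_path}\0{new_content}".encode("utf-8")
        ).hexdigest()
        self._seen[digest] += 1
        return self._seen[digest] >= self.max_repeats
```

The guard only catches byte-identical repeats. A loop that varies whitespace or comments on each pass would slip through, so it is a tripwire for supervision rather than a fix.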

Creator POV vs Reddit dissent

Most YouTube head-to-head reviews of these two products in the first month leaned toward declaring a winner. The format pushes for one. Steve from Builder.io declared Codex the winner from a working-engineer perspective with specific reasoning; Cole Medin ran a live A/B test and declined to declare a winner; Alex Finn evangelised Codex specifically for the multi-device workflow without claiming it was strictly better at code generation.

The honest picture from Reddit and from my own daily work is that there is not a single winner — the right answer depends on what your specific task looks like and what your existing workflow already invested in. The 196-upvote “I can’t stop vibe coding with Codex CLI” thread and the Claude-Code-defending comments under the “Anthropic is lagging” post are both honest signals from real users — they are just describing different use cases.

Reconciling the YouTube and Reddit takes: YouTube reviews captured the launch-week excitement on either side. Reddit captured the second-month settling. The settling pattern is “use both, switch by task type” rather than “pick a winner.”

The honest one-month verdict

If you can only run one, the choice depends on what kind of engineer you are:

Founders / solo builders / vibe coders: Codex CLI + ChatGPT Pro. The rate limits favor heavy bursts, GPT-5-high handles the planning load when the spec is fuzzy, and the cost story is better than Claude Max 20x at heavy usage. Alex Finn’s multi-device pattern is real value.
Senior engineers in mature codebases: Claude Code + Max 20x. The UX, the sub-agents, the MCP ecosystem, the slash commands, and the more conservative editing behavior all matter more when you are working in code that lots of other people also touch.
Specifically large-codebase debugging or backend-heavy work: lean Codex CLI. The patience advantage is real and shows up.
Specifically React / front-end-heavy work or quick well-specified edits: lean Claude Code. The output is cleaner and the tooling is faster.

The more useful framing — borrowed from a top comment in the 500k-codebase thread — is to run both and treat them as complementary daily drivers. The combined cost is $400/month for both top subscriptions; the productivity delta over running just one is, in my honest measurement, a solid 30-40% on tricky tasks. For an engineer whose hourly rate is anywhere above $100, the math is trivial.
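The break-even behind "the math is trivial" is short enough to write out. The $400/month and $100/hour figures come from the paragraph above; the function name is illustrative.

```python
def break_even_hours(monthly_cost_usd: float, hourly_rate_usd: float) -> float:
    """Hours per month the tools must save to pay for themselves."""
    if hourly_rate_usd <= 0:
        raise ValueError("hourly_rate_usd must be positive")
    return monthly_cost_usd / hourly_rate_usd


if __name__ == "__main__":
    # $400/month for both subscriptions at a $100/hour rate.
    print(break_even_hours(400, 100))  # prints 4.0
```

Measured against running just one $200 subscription, the marginal cost of the second is $200/month, so at the same rate the second tool has to save two hours a month.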

The deeper question, which we will revisit in late November when a GPT-5.x update and presumably a Claude 4.x update are both live, is whether the gap closes further or one pulls ahead. My read at one month: they are going to stay close for the foreseeable future, and the right discipline is to use the one that is shaped right for the task in front of you, not the one you have brand loyalty to.

Sources

Every reference behind this piece. If we make a claim, it's because at least one of these said so — or we lived it ourselves.

Firsthand One month of running both daily across personal projects and a 200k-line client codebase
Docs Codex CLI release notes — OpenAI
Docs Claude Code release notes — Anthropic
YouTube Codex vs Claude Code: which AI coding agent is better? — Steve (Builder.io)
YouTube Using OpenAI Codex CLI with GPT-5-Codex — OpenAI
YouTube GPT-5 Codex: From Beginner to Expert in 17 minutes — Alex Finn
YouTube Claude Sonnet 4.5 - The New Coding King? (Sonnet 4.5 vs. GPT 5 Codex) — Cole Medin
Reddit Codex CLI vs Claude Code (adding features to a 500k codebase) — r/ChatGPTCoding
Reddit Codex CLI + GPT-5-codex still a more effective duo than Claude Code + Sonnet 4.5 — r/ChatGPTCoding
Reddit I can't stop vibe coding with Codex CLI — r/ChatGPTCoding