Grok 4 for coding: separating the claims from the reality

Elon Musk claimed Grok 4 beats Cursor. Theo, Fireship and Matthew Berman piled in within two days; r/singularity called it disappointing within four. A working read after hands-on testing.

Charles Lin · July 18, 2025

Theo’s July 10, 2025 video — “Grok 4 just dropped, it’s the best model right now (yes really)” — set the tone for the YouTube reaction wave: a creator who openly distrusts Elon Musk’s marketing, watching the livestream, conceding that the benchmark numbers are real and the model is competitive at the frontier. AI Explained landed “Grok 4 - 10 New Things to Know” the same day with a more measured technical breakdown. Fireship’s “Grok 4 pushes humanity closer to AGI… but there’s a problem” followed on July 11. Matthew Berman ran “Grok 4 Fully Tested (INSANE)” the same day. Four creators, four flavours, one shared headline: xAI shipped a real frontier model, and the cope from a year ago — “Grok is unserious” — no longer holds.

By July 13, the Reddit response had inverted. r/singularity’s “Grok 4 disappointment is evidence that benchmarks are meaningless” (882 upvotes) crystallised what working users were already reporting: the benchmark dominance does not translate to the things developers actually do. The opening line of the thread is the punchline of this entire launch week — “I’ve heard nothing but massive praise and hype for grok 4, people calling it the smartest AI in the world, but then why does it seem that it still does a subpar job for me for many things, especially coding?”

The gap between those two framings is the whole story.

What Grok 4 actually is

xAI’s mid-2025 frontier model, shipped July 9. Two tiers — Grok 4 and Grok 4 Heavy (which spawns multiple agent instances in parallel for the hardest problems). 256K context window. Always-on reasoning trace. Available via the xAI API and the Grok consumer product. Pricing is competitive with Claude 4 Opus on the base tier; the Heavy tier and the per-token rates above 128K context get expensive fast.

The headline benchmark numbers are what drove the YouTube wave: top scores on Humanity’s Last Exam, ARC-AGI, and a handful of competition-math benchmarks. The “Grok-4 benchmarks” thread on r/singularity (753 upvotes) circulated the launch slides within hours. The numbers are real. What they predict about day-to-day coding is the contested question.

The Musk “better than Cursor” claim

The framing that lit Reddit up was Musk’s “Grok 4 works better than Cursor” line during the launch livestream, which r/ChatGPTCoding’s top Grok 4 thread (1096 upvotes) immediately dismantled. The category error is obvious — Cursor is an IDE that wraps any model; Grok 4 is a model. The charitable reading is that Grok 4 inside a Cursor-like agent loop produces better outputs than Cursor’s default routing. The actual reading is that the claim isn’t testable as stated, and the community knows it.

Top comments on that thread are uniformly skeptical, several from users who’d tested Grok 4 against Claude inside Cursor and reported the opposite. The pattern is consistent across the launch-week threads: founder hype claims no longer move the dev subreddits the way they did in 2023. The community wants benchmarks they can reproduce, with workloads they recognise.

Where Grok 4 actually performs

From my own testing across a handful of real coding tasks, and triangulated against the launch-week Reddit experiments:

Reasoning-heavy debugging — strong. Comparable to OpenAI o3 in my tests on hard logic puzzles and race-condition hunts. The r/ClaudeAI “Tested Claude 4 Opus vs Grok 4 on 15 Rust coding tasks” thread (415 upvotes) makes the same point more forcefully: Grok 4 caught every race condition and deadlock in a 30k-line Rust codebase, including a tokio::RwLock deadlock that Opus missed. That’s a real, reproducible win.

Math-flavoured code — notably good. Numerical algorithms, statistical computing, graph algorithms. xAI clearly trained heavily on competition-math-adjacent code.

Multi-file edits and frontend work — measurably worse than Claude 4 Sonnet. r/ChatGPTCoding’s “Grok 4 still doesn’t come close to Claude 4 on frontend dev” thread (152 upvotes) pointed at the Design Arena leaderboard where Grok 4 sat at 10th — behind Grok 3 at 6th and both Claude variants at the top. My own multi-file refactor tests matched: Grok drops files from context, especially on larger sessions.

Rate limits and tooling friction — significant. The Rust-test thread flagged “brutal” rate limits and a 2x price step above 128K tokens. In a Cursor-style agent loop that fires dozens of requests per task, this is the difference between “model I default to” and “model I reach for on specific problems.”

Rule following — middling. The Rust tester reported Grok ignored custom coding rules on 2 of 15 tasks; Opus followed them perfectly. Small sample, but the pattern matches what other launch-week reports describe.
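To make the pricing point concrete, here is a rough Python sketch of how a 2x step above 128K tokens compounds in an agent loop whose context grows each turn. The per-token rates are placeholders rather than xAI's published prices, and the billing rule (the whole request is charged at 2x once its input exceeds 128K tokens) is an assumption about how the step is applied. `Pricing`, `request_cost` and `agent_task_cost` are hypothetical names.

```python
from dataclasses import dataclass

LONG_CONTEXT_THRESHOLD = 128_000  # tokens; the 128K step reported in the Rust-test thread
LONG_CONTEXT_MULTIPLIER = 2.0     # the reported 2x price step


@dataclass(frozen=True)
class Pricing:
    """Placeholder rates in USD per million tokens (not real published prices)."""
    input_per_million: float
    output_per_million: float


def request_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request.

    Assumption: the whole request is billed at the higher rate once its
    input exceeds the threshold.
    """
    base = (
        input_tokens * pricing.input_per_million
        + output_tokens * pricing.output_per_million
    ) / 1_000_000
    if input_tokens > LONG_CONTEXT_THRESHOLD:
        return base * LONG_CONTEXT_MULTIPLIER
    return base


def agent_task_cost(pricing: Pricing, context_sizes: list[int], output_tokens: int) -> float:
    """Total cost of an agent task that sends one request per context size."""
    return sum(request_cost(pricing, n, output_tokens) for n in context_sizes)


if __name__ == "__main__":
    placeholder = Pricing(input_per_million=1.0, output_per_million=5.0)
    # Context grows as the agent accumulates files and tool output.
    growing_context = [64_000, 128_000, 192_000]
    for n in growing_context:
        print(n, round(request_cost(placeholder, n, 1_000), 3))
    print("task total:", round(agent_task_cost(placeholder, growing_context, 1_000), 3))
```

With these placeholder numbers the third request costs roughly three times the second even though the context grew by only half, which is why the step matters in long agent sessions.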

Creator POV vs Reddit dissent

Theo’s framing is the most honest of the YouTube wave: he calls the model genuinely impressive on benchmarks, names the rate-limit and pricing problems as real, and concludes that “best model right now” is true on the leaderboard and complicated in practice. AI Explained’s “10 things” video sticks tight to the technical claims and avoids the hype tone entirely. Fireship gets the entertainment value out of the launch but flags the gap between marketing and Musk’s reliability record. Matthew Berman runs the model through his standard prompt set and finds it strong on reasoning, average on practical builds.

What none of the YouTube creators have time to do — and what the subreddits did within 72 hours — is stress-test the model on the messy, multi-file, agentic workflows that working engineers actually run. That’s where the picture inverts.

The Reddit dissent splits cleanly:

“Benchmarks don’t predict my workflow” — the r/singularity disappointment thread’s core argument. Even the top reply concedes the model is impressive at reasoning, then notes Claude is still better at coding. Multiple commenters separately mention they hit the same wall: great math, weaker code.
“xAI is releasing a specialized coding model later” — recurring across threads. The implicit acknowledgement from xAI’s own roadmap: Grok 4 base is not the coding model. The dedicated coding variant is the one to wait for.
“Where are Grok 3’s weights?” r/LocalLLaMA’s “Friendly reminder that Grok 3 should be now open-sourced” (1446 upvotes) landed July 11, two days after launch. xAI promised the prior-generation weights would open when the new model shipped. That has not happened. The trust deficit is now baked into how the open-frontier community reads xAI’s announcements.

The mature read across both camps: Grok 4 is a legitimate frontier model that wins specific niches and loses the day-to-day coding default. The hype framing oversells what it demonstrates; the dismissive framing undersells what it actually does well.

How Grok 4 fits in a real coding stack

For a working engineer in mid-July 2025:

1. Default coding model — Claude 4 Sonnet (Opus for harder work). Nothing in the launch-week evidence dislodges that. The Rust-test thread, the frontend-dev thread, and my own multi-file tests all point the same way.

2. Hard reasoning and bug hunts — Grok 4 earns a slot here. The race-condition finding in the Rust test is the kind of thing that justifies routing specific problems to it, even if it’s not your daily driver.

3. Math-heavy or numerical code — Grok 4 is now a credible primary choice. xAI’s training mix shows in this category.

4. Long-context analysis — Gemini 2.5 Pro still wins on price-per-token at long context. Grok 4’s pricing doubles past 128K, which kills the use case.

5. Cost-conscious bulk — DeepSeek V3 remains the floor. Grok 4 isn’t in this conversation.

The honest take: Grok 4 has a real niche where it competes hard. As a default coding model, it doesn’t displace Claude. As a “specialist you route to for the right problem” it’s worth adding to your stack.
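The routing list above can be written down as a small lookup table. This is a sketch of one way to encode it, not a recommendation of a specific tool: the `TaskKind` names and the `route` function are hypothetical, the model labels are display strings rather than real API model identifiers, and the 128K cutoff reuses the price step mentioned earlier.

```python
from enum import Enum


class TaskKind(Enum):
    DEFAULT_CODING = "default coding"
    HARD_CODING = "harder coding work"
    HARD_REASONING = "hard reasoning and bug hunts"
    NUMERICAL = "math-heavy or numerical code"
    LONG_CONTEXT = "long-context analysis"
    BULK = "cost-conscious bulk"


# Labels mirror the article's picks; they are not provider model IDs.
ROUTES: dict[TaskKind, str] = {
    TaskKind.DEFAULT_CODING: "Claude 4 Sonnet",
    TaskKind.HARD_CODING: "Claude 4 Opus",
    TaskKind.HARD_REASONING: "Grok 4",
    TaskKind.NUMERICAL: "Grok 4",
    TaskKind.LONG_CONTEXT: "Gemini 2.5 Pro",
    TaskKind.BULK: "DeepSeek V3",
}

LONG_CONTEXT_CUTOFF = 128_000  # tokens; where Grok 4's pricing is reported to double


def route(kind: TaskKind, prompt_tokens: int = 0) -> str:
    """Pick a model label for a task.

    Assumption: a Grok-bound prompt above the cutoff is rerouted to the
    long-context pick, because the price step makes Grok 4 a poor fit there.
    """
    choice = ROUTES[kind]
    if choice == "Grok 4" and prompt_tokens > LONG_CONTEXT_CUTOFF:
        return ROUTES[TaskKind.LONG_CONTEXT]
    return choice
```

Keeping the table as plain data makes it cheap to revise when the next launch reshuffles the picks.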

The honest critique

What the launch-week evidence does and doesn’t support:

Grok 4 isn’t a Claude replacement for coding. Multiple independent tests this week — across Rust, frontend, multi-file refactors — show Claude 4 Sonnet/Opus still ahead on the workloads most engineers run daily. xAI knows this; the dedicated coding model they’ve signalled is coming exists because Grok 4 base is not it.
The rate limits are real. Several launch-week reports describe hitting walls fast. Until xAI lifts them, Grok 4 inside an agentic loop is uncomfortable as a primary model.
The open-weights credibility gap matters. The Grok 3 weights were promised on Grok 4 launch and didn’t appear. For engineers evaluating xAI as a long-term dependency, that’s a real signal about how the company treats its own commitments.
Benchmark dominance isn’t workflow dominance. This isn’t unique to Grok 4 — every frontier launch in 2025 has had some version of this gap — but the magnitude is unusually large here. Top of Humanity’s Last Exam, mid-pack on Design Arena’s frontend voting. Both numbers are real. Only one of them predicts how your IDE session goes.
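One way to live with the rate limits in the meantime is to try Grok 4 first on the problems it suits and fall through to the default model when it throttles. The sketch below is generic Python with no real SDK calls: `RateLimited`, `complete_with_fallback` and the provider callables are hypothetical stand-ins for whatever client wrappers a team already has.

```python
from typing import Callable, Sequence


class RateLimited(Exception):
    """Hypothetical error raised by a client wrapper when a provider throttles."""


# (label, function that takes a prompt and returns a completion)
Provider = tuple[str, Callable[[str], str]]


def complete_with_fallback(prompt: str, providers: Sequence[Provider]) -> tuple[str, str]:
    """Try providers in order; return (label, completion) from the first that answers."""
    for label, call in providers:
        try:
            return label, call(prompt)
        except RateLimited:
            continue  # throttled: move on to the next provider in the list
    raise RuntimeError("every provider was rate limited")


if __name__ == "__main__":
    def throttled_specialist(prompt: str) -> str:
        raise RateLimited("stub: specialist is throttled")

    def default_model(prompt: str) -> str:
        return f"stub answer to: {prompt}"

    print(complete_with_fallback(
        "find the deadlock",
        [("Grok 4", throttled_specialist), ("Claude 4 Sonnet", default_model)],
    ))
```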

The underlying arc, though, is durable. xAI is now a real frontier player, not a meme. A year ago that wasn’t obvious. The fact that Theo — who has been openly hostile to Musk on Twitter — opened his launch video with “yes really” tells you the calibration has shifted. The remaining question for working engineers is not whether to take xAI seriously, but where in the routing stack Grok 4 belongs. For most of us in mid-2025, the answer is “specialist slot, not default” — and that’s still a meaningful upgrade from where Grok sat last year.

For the broader model comparison, see our Claude vs GPT vs Gemini coding piece. For the cost-tier alternative, see our DeepSeek V3 review.

Sources

Every reference behind this piece. If we make a claim, it's because at least one of these said so — or we lived it ourselves.

YouTube: Theo - t3.gg, "Grok 4 just dropped, it's the best model right now (yes really)"
YouTube: AI Explained, "Grok 4 - 10 New Things to Know"
YouTube: Fireship, "Grok 4 pushes humanity closer to AGI… but there's a problem"
YouTube: Matthew Berman, "Grok 4 Fully Tested (INSANE)"
Docs: xAI, Grok product page and API documentation
Reddit: r/ChatGPTCoding, "Elon Musk: [Grok 4] Works better than Cursor." (1096 upvotes)
Reddit: r/singularity, "Grok 4 disappointment is evidence that benchmarks are meaningless" (882 upvotes)
Reddit: r/singularity, "Grok-4 benchmarks" (753 upvotes)
Reddit: r/ClaudeAI, "Tested Claude 4 Opus vs Grok 4 on 15 Rust coding tasks" (415 upvotes)
Reddit: r/ChatGPTCoding, "Grok 4 still doesn't come close to Claude 4 on frontend dev" (152 upvotes)
Reddit: r/LocalLLaMA, "Friendly reminder that Grok 3 should be now open-sourced" (1446 upvotes)
Firsthand: Tested Grok 4 via the xAI API on a few real coding tasks