Skip to content
TopInsight .co
An abstract sphere in dark space with electric-blue lightning-like traces across its surface — suggestion of Grok / xAI branding.

Grok 4 for coding: separating the claims from the reality

Elon Musk claimed Grok 4 outperforms Cursor and the rest. The Reddit reaction was unsurprisingly skeptical. After testing, here is the working read.

C Charles Lin ·

Elon Musk claimed in mid-2025 that Grok 4 works better than Cursor. The Reddit reaction was predictably skeptical — the r/ChatGPTCoding thread on this claim hit 1100 upvotes, mostly with people saying “show benchmarks, then we’ll talk.”

After testing Grok 4 across a handful of real coding tasks, here is the working read.

What Grok 4 actually is

xAI’s latest frontier model. Strong on math/reasoning benchmarks, large context window (256K+), pitched as a Claude / GPT-4 competitor. Available via the xAI API and the Grok consumer product.

The marketing positions it as the best at a few things. The reality is more nuanced.

Where Grok 4 actually performs

In limited testing:

Reasoning-heavy debugging: comparable to OpenAI o1-mini in my tests. Not better than Claude 3.7 Sonnet for typical coding, but on hard logic puzzles it’s competitive.

Math-flavoured code: notably good at numerical algorithms, statistical computing, graph algorithms. xAI clearly trained heavily on this.

Multi-file edits: worse than Claude 3.7 Sonnet. The agent loop is less reliable; Grok occasionally drops files from context, especially on larger refactors.

Code review: medium-quality. Catches obvious issues. Misses subtleties that Claude 3.7 catches consistently.

The Musk claim, evaluated

The “better than Cursor” framing is a category error. Cursor is an IDE that wraps any model; Grok 4 is a model. Comparing them is comparing a kitchen to a chef.

What Musk plausibly meant: Grok 4 used as the model behind a Cursor-like tool produces better outputs than Cursor’s default model setup. This is hard to verify directly because:

  • Cursor’s “default model” has shifted over time
  • The specific tasks tested aren’t disclosed
  • Reproducibility evidence is thin

The Reddit response — “marketing, show me the benchmark numbers” — is reasonable. xAI has not published SWE-bench Verified scores comparable to Anthropic / OpenAI / Google. Until they do, the claim is more PR than evidence.

How Grok 4 fits in a real coding stack

For a working engineer in mid-2025:

  • Default coding model: Claude 3.7 Sonnet
  • Hard reasoning / debugging: OpenAI o3-mini
  • Long-context analysis: Gemini 2.5 Pro
  • Math / numerical code: Grok 4 is a credible add to your routing setup
  • Cost-conscious bulk: DeepSeek V3

The honest take: Grok 4 has a real niche (math-heavy code) where it competes with the alternatives. As a default coding model, it doesn’t displace Claude — and the marketing framing oversells what the model can demonstrate.

What Reddit is actually saying

The 1100-upvote Reddit thread on the “better than Cursor” claim:

  • Top comments uniformly skeptical
  • Several “I tested it, here are my findings” replies with measured criticism
  • The community treats xAI’s marketing as separate from the actual model quality

The pattern: r/ChatGPTCoding has matured past being moved by founder hype claims. The community wants verifiable benchmarks, not statements.

The bigger lesson

LLM marketing in 2025 has converged on “we’re better than X.” Most of these claims are partial, context-dependent, or framed in ways that don’t reproduce. The right response is the Reddit response: ask for SWE-bench Verified, HumanEval, and at minimum a published methodology. Until then, treat claims as PR.

Grok 4 is a credible model. It’s not the best at coding generally. It has specific niches where it competes. For our routing setup, it’s a “maybe add for math-heavy tasks” — not “replace Claude.”

For the broader model comparison, see our Claude vs GPT vs Gemini piece. For the cost-tier alternative, DeepSeek V3 review.

Sources

Every reference behind this piece. If we make a claim, it's because at least one of these said so — or we lived it ourselves.

  1. Firsthand Tested Grok 4 via the xAI API on a few coding tasks
  2. Docs xAI Grok documentation — xAI
  3. Blog r/ChatGPTCoding — Grok 4 vs Cursor reaction thread (1100 ups) — r/ChatGPTCoding
  4. YouTube Independent Grok 4 coding evaluations — Various