Gemini 3 Pro launch: dominates benchmarks, but the model is not the moat anymore

Google shipped Gemini 3 Pro in November with benchmark numbers that should have been a knockout. The shipped reality: the model wins, but the agentic stack still belongs to Anthropic.

Charles Lin · November 22, 2025

Google launched Gemini 3 Pro on November 18 with benchmark numbers that, on paper, should have been a knockout. According to early independent analysis — including Matthew Berman’s late-November coverage — Gemini 3 Pro beat GPT-5.1 on most major benchmarks and pulled clear of Claude Sonnet 4.5 on reasoning-heavy tasks, hitting 45.8% on Humanity’s Last Exam with code execution and search. By the benchmark scoreboard, Google won the November model race decisively.

The interesting thing is how little that benchmark win has changed daily workflows two weeks later. The shipped reality — captured cleanly in IndyDevDan’s November 24 video “I gave Gemini 3 Pro its own computer” — is that the model is real, the benchmark dominance is real, and the agentic coding stack still belongs to Anthropic. Dan’s exact framing after running 15 agent sandboxes split between Gemini 3 Pro, Claude Sonnet 4.5, and Codex 5.1 Max: “Gemini 3 Pro dominates every benchmark, but Claude Code still delivers the most reliable results. Why? Because it’s not just about the model anymore. It’s about the complete agentic experience — the agent, the tooling, the workflows, and how well they work together.”

That’s the November 2025 lesson in one sentence. This piece works through what Gemini 3 Pro is actually good for, where Google’s Antigravity IDE landed in its first week, and why the engineers I follow are running Gemini 3 Pro as a complement to Claude Code rather than a replacement.

What Gemini 3 Pro actually shipped

The headline numbers from Google’s developer announcement:

State-of-the-art on reasoning-heavy benchmarks — top of Humanity’s Last Exam, GPQA Diamond, MMLU, AIME, and aggregate intelligence indexes
Strong on coding benchmarks — competitive with GPT-5.1 on SWE-bench Verified, ahead on certain web-development specific evaluations
Native computer-use capability — designed for agentic browser and OS workflows, paired with the new Antigravity IDE that ships as the reference experience
1M context window — same as the previous generation, but with reportedly better long-context retention
Price below GPT-5.1 — competitive token pricing for the capability tier

The launch shipped alongside Google Antigravity, an “Agent-First IDE” that’s the first real product-grade IDE Google has shipped specifically for AI-assisted development. This isn’t Project IDX. This is a Cursor / Windsurf / Claude Code competitor that ships with Gemini 3 Pro as the default model and an Agent Manager that orchestrates multiple parallel agents in-IDE.

The model is genuinely impressive. The benchmark numbers aren’t padded. Engineers who tried it through November consistently describe the raw output quality as at least competitive with Sonnet 4.5 and sometimes ahead. The question — and what this piece argues — is whether raw model quality is what determines daily-driver choice in November 2025. It isn’t, and that’s the more important story.

What IndyDevDan and AI Jason found in the field

The two strongest in-the-wild Gemini 3 Pro reviews from creators I trust both shipped within a week of launch:

IndyDevDan (November 24) ran a head-to-head test: he gave Gemini 3 Pro, Claude Sonnet 4.5, and GPT Codex 5.1 Max each their own E2B agent sandbox and asked them to build full-stack applications in parallel, including SQLite CRUD interfaces, note-taking apps with persistence, and image generation tools. The results, in Dan’s framing: “Gemini 3 Pro dominates every benchmark, but Claude Code still delivers the most reliable results.” The model is smart enough; the tooling around it (E2B integration, Agent Skills, sandbox orchestration) is where Claude Code still pulls ahead. Dan’s broader thesis is the more important point: “Welcome to 2025 where model intelligence isn’t the limitation anymore — YOU are. Every new release unlocks incredible capabilities, but most engineers aren’t using enough compute.”

AI Jason (November 23) took a different angle: assuming Gemini 3 is the chosen model, how do you prompt-engineer it to perform 10x on specific use cases? Jason’s framework is essentially a 3-step prompt guide focused on what Anthropic did to make Sonnet 4.5 frontend-design-strong and how to apply equivalent prompt patterns to Gemini 3. The takeaway: Gemini 3 Pro responds well to very specific design references and worked examples. It is more sensitive to prompt scaffolding than Sonnet, but capable of dramatic uplift when the scaffolding is right.

Together these two videos capture a useful pattern: Gemini 3 Pro wins when you push it hard, with the right scaffolding, on tasks where raw reasoning matters. It loses when the workflow needs polished agentic tooling, because the surrounding ecosystem isn’t there yet.
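To make the "reliability, not benchmarks" comparison concrete, here is a minimal sketch of a head-to-head harness: the same tasks are fanned out to several agents in parallel and each agent gets a pass rate. This is not Dan's setup and it does not call the E2B API. The agents here are hypothetical stub functions (`steady_agent`, `flaky_agent`), and `run_matrix` is a name invented for illustration. In a real run each callable would wrap whatever starts a sandboxed agent and checks its output.

```python
from concurrent.futures import ThreadPoolExecutor


def run_matrix(agents, tasks, max_workers=8):
    """Run every (agent, task) pair in parallel and return a pass rate per agent.

    `agents` maps a label to a callable that takes a task name and
    returns True when the run produced a working result.
    """
    if not tasks:
        raise ValueError("need at least one task")
    pairs = [(name, task) for name in agents for task in tasks]

    def attempt(pair):
        name, task = pair
        try:
            return name, bool(agents[name](task))
        except Exception:
            # A crashed run counts as a failure, not as a harness error.
            return name, False

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(attempt, pairs))

    passes = {name: 0 for name in agents}
    for name, ok in results:
        passes[name] += int(ok)
    return {name: passes[name] / len(tasks) for name in agents}


# Hypothetical stand-ins for real agent runs (assumptions, not measurements).
def steady_agent(task):
    return True


def flaky_agent(task):
    return task != "notes_app"


TASKS = ["sqlite_crud", "notes_app", "image_tool"]
rates = run_matrix({"steady": steady_agent, "flaky": flaky_agent}, TASKS)
```

The design choice worth noting is that a crash and a wrong answer are scored the same way. For a daily driver, either one means you had to step in.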

The Antigravity IDE first-week reality

Antigravity is the new product from Google that’s supposed to close the agentic-tooling gap. The Reddit reaction to it in the first week was mixed — captured cleanly in the 443-upvote r/ChatGPTCoding review “I tried Google’s new Antigravity IDE so you don’t have to” from November 21.

The OP’s verdict had real positives: “The ‘Agent Manager’ is the real deal. Unlike the linear chat in VS Code/Cursor, here you can spawn multiple agent threads.” The agent orchestration UI is more ambitious than anything else shipped so far, Anthropic’s parallel sub-agent system included.

But the comments did what Reddit does — surface the friction the launch demo glossed over:

“The multi-agent orchestration is a nice UI, but like you say, it just makes it easier to do lazy vibe-coding. I also find that Gemini 3.0 is way more aggressive at doing stuff you probably don’t want.” — 41 upvotes

“Sonnet 4.5 I tried it too, but it often breaks my files. It’ll need some time to mature.” — 35 upvotes (re: Gemini 3 in Antigravity, the commenter misnamed the model)

“Creating tests for a file at the same time as it’s being refactored, by two separate agents who are unaware of each other, seems… wrong.” — 22 upvotes

“I left cursor a long time ago for CLI tools, mainly codex and claude code. I’m curious how it compares to those, tbh I’m not particularly impressed with gemini for coding, seems over-hyped but I didn’t spend too much time with it.” — 15 upvotes

The pattern across these comments is consistent: Antigravity’s agent orchestration UI is genuinely innovative, but Gemini 3 in agent mode is too aggressive and edits files in ways users didn’t ask for. That’s not a model-quality issue per se. It’s an agentic-guardrails issue, where Antigravity’s defaults push the model to do more than the user wanted. Six months from now this is probably fixed. Today, it’s the lived experience.
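One way to picture the kind of guardrail these complaints are asking for is a scope check: before accepting an agent's changes, compare the files it touched against the paths the task was supposed to cover. The sketch below is an assumption about how such a check could look, not a description of how Antigravity or Claude Code work internally. `out_of_scope` and `accept_changes` are hypothetical names.

```python
from fnmatch import fnmatchcase


def out_of_scope(changed_paths, allowed_patterns):
    """Return the changed paths that match none of the allowed glob patterns.

    Note: fnmatch's `*` also matches `/`, so "src/ui/*" covers nested folders.
    """
    return [
        path
        for path in changed_paths
        if not any(fnmatchcase(path, pattern) for pattern in allowed_patterns)
    ]


def accept_changes(changed_paths, allowed_patterns):
    """Return (ok, stray): ok is False when any file falls outside the scope."""
    stray = out_of_scope(changed_paths, allowed_patterns)
    return (not stray, stray)


# Example: the task was "restyle the button", but the agent also touched
# the dependency manifest and a database module.
ok, stray = accept_changes(
    ["src/ui/Button.tsx", "package.json", "src/api/db.py"],
    ["src/ui/*"],
)
```

In practice the list of changed paths could come from something like `git diff --name-only`, and a non-empty `stray` list would route the change to manual review rather than reject it outright.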

The reconciliation: model quality vs agentic experience

YouTube creators in launch-week mode emphasized the benchmark dominance. Matthew Berman’s late-November analysis carried the “Why Gemini 3 Shifted the AI Race” framing — that the benchmark gap had real strategic implications for the OpenAI-vs-Anthropic frame. He’s not wrong on the benchmarks.

Reddit users in production-use mode emphasized the agentic friction. The Antigravity thread, the comments about Gemini 3 being “way more aggressive” in agent mode, and the pre-Gemini-3 234-upvote “Left gemini for 30 minutes and came back to this 🤦” thread from October 30 all carry the same theme: Gemini’s autonomous behavior is harder to trust than Claude’s, regardless of which model is “smarter” on a benchmark.

Both are true. They’re measuring different things. Benchmarks measure raw reasoning capacity on contained tasks. Daily-driver experience measures whether the agent does the right thing when you leave it alone for 15 minutes. Gemini 3 Pro is currently winning at the former and losing at the latter, and the latter is what determines workflow choice for most working engineers.

This is exactly IndyDevDan’s framing: “it’s not just about the model anymore. It’s about the complete agentic experience — the agent, the tooling, the workflows, and how well they work together.” The model is part of the product. It’s no longer the whole product. Whichever lab figures out how to ship the model and the polished agentic surface together wins; whichever lab ships the model in isolation and trusts the ecosystem to catch up loses ground despite winning the benchmark race.

What this means for your daily workflow

Two weeks in, my own pattern with Gemini 3 Pro:

Use it as a research / reasoning model. Asking it hard questions, exploring architectural trade-offs, debugging through complex error messages — its raw reasoning is genuinely strong. Better than Sonnet 4.5 for the “explain this and consider edge cases” prompt category. About equal to GPT-5.1 for the same.
Use it in Antigravity for genuinely parallel work. When you have 5 independent UI variations to explore in parallel, Antigravity’s Agent Manager is the cleanest UI for that workflow today. Use with caution and review every diff.
Don’t use it as a Claude Code replacement. The agent loop in Claude Code is still meaningfully more reliable for daily multi-file work. The MCP ecosystem, the slash commands, the sub-agent system, and the muscle memory of months of use compound in ways Antigravity can’t replicate in one launch.
Don’t use it on autonomous long-running tasks until the agentic guardrails mature. The Reddit complaints about Gemini 3 being “aggressive” in agent mode are legitimate.
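The "genuinely parallel work" condition above can be checked mechanically before dispatching agents: if two agents are allowed to edit the same file, the work is not independent. The following is a minimal sketch under that assumption. `conflicting_assignments` and the agent labels are hypothetical, and nothing here reflects a real Antigravity API.

```python
from itertools import combinations


def conflicting_assignments(assignments):
    """Find pairs of agents whose editable file sets overlap.

    `assignments` maps an agent label to the file paths it may edit.
    Returns a list of (agent_a, agent_b, shared_paths) tuples.
    """
    conflicts = []
    for a, b in combinations(sorted(assignments), 2):
        shared = set(assignments[a]) & set(assignments[b])
        if shared:
            conflicts.append((a, b, sorted(shared)))
    return conflicts


# Example: one agent refactors a module while another writes tests that
# are also allowed to touch the same module.
plan = {
    "refactor": ["src/cart.py"],
    "tests": ["tests/test_cart.py", "src/cart.py"],
    "docs": ["README.md"],
}
conflicts = conflicting_assignments(plan)
```

An empty result is a reasonable precondition for running agents side by side. A non-empty one suggests running those two tasks in sequence instead.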

What the launch actually shifted

Google now has a viable model in the frontier-coding conversation for the first time in 2025. That’s a real change from the August-October “Anthropic vs OpenAI” duopoly. The competitive dynamic is now three-way at the model layer, and Google’s bringing real money and a polished IDE (Antigravity) to the fight rather than just an API.

But the agentic-tooling moat is real. Claude Code’s six-month head start in workflow polish, MCP, sub-agents, and the surrounding ecosystem keeps Anthropic competitive even though Gemini 3 Pro has the benchmark lead. The 2025 lesson, repeated three times now with the launches of GPT-5, Sonnet 4.5, and Gemini 3, is that shipping a state-of-the-art model isn’t enough anymore. You ship the model with the agent, or your benchmark win doesn’t translate to user share.

The next test for Google is whether Antigravity matures fast enough to challenge Claude Code’s tooling lead. The Reddit feedback in week one was mixed-positive — real innovation in agent orchestration UI, real friction in agent behavior defaults. Six months gives them time to fix the friction. Whether they ship at the iteration speed needed to close the gap is the open question.

For now: Gemini 3 Pro is real, the benchmark dominance is real, and the model is not the moat anymore. Run it as a complementary tool in your stack. Don’t expect it to replace Claude Code yet. The engineers who use it best in November 2025 are the ones who use it alongside their existing setup, not instead of it.

Sources

Every reference behind this piece. If we make a claim, it's because at least one of these said so — or we lived it ourselves.

YouTube: IndyDevDan, "I gave Gemini 3 Pro its own computer... it's official, Claude Code has COMPETITION"
YouTube: AI Jason (Jason Zhou), "okay, but I want Gemini 3 to perform 10x for my specific use case"
YouTube: Matthew Berman, "The Industry Reacts to Gemini 3..."
YouTube: Theo - t3.gg, "Gemini 3 Pro is the best model ever made"
YouTube: AICodeKing, "Gemini 3 Pro (Fully Tested): This MODEL Broke MY BENCHMARKS!"
Docs: Google, "Introducing Gemini 3 for developers"
Reddit: r/ChatGPTCoding, "I tried Google's new Antigravity IDE so you don't have to (vs Cursor/Windsurf)" (443 upvotes)
Reddit: r/ChatGPTCoding, "Left gemini for 30 minutes and came back to this 🤦" (234 upvotes, pre-Gemini-3 frustration baseline)
Firsthand: Two weeks of running Gemini 3 Pro alongside Claude Code and Codex CLI on real projects