DeepSeek V4: the 1M-context + 75%-cheaper launch that made everyone else look slow

V4 ships native 1M context, two new attention mechanisms, and pricing 10-100x cheaper than the closed frontier. The technical report is a primer on what compounded efficiency actually looks like.

Charles Lin

DeepSeek shipped V4 on May 27 with two model variants — V4 Pro and V4 Flash — and a 58-page technical report that is genuinely worth reading even if you have no intention of running the model. The headline numbers are notable: native 1M context with ~70% retrieval accuracy at that length, pricing roughly 10-100x cheaper than the closed frontier, and (after they made the launch-week 75% discount permanent two days later) a real cost-per-task that makes “use frontier model by default” hard to justify for most workloads.

Bycloud’s two videos through late May and early June walk through what V4 actually is and how the cost economics work. This piece is the working read after three days running V4 Pro and V4 Flash as the cheap-tier routing models in our production setup.

What V4 actually shipped

Bycloud’s “How Did DeepSeek V4 Make V4 So Cheap?” (May 27, launch-day coverage) is the cleanest single summary of the launch. The pricing table he runs through is the one that should reset most engineers’ default model assumptions:

| Model | Input $/M | Output $/M |
| --- | --- | --- |
| DeepSeek V4 (75% off) | $0.435 | $0.87 |
| DeepSeek V4 cache hit | ~$0.00365 | |
| GLM 5.1 | $1.40 | $4.40 |
| Gemini 3.1 Pro | $2.00 | $12.00 |
| Opus 4.6 | $5.00 | $25.00 |

Bycloud’s framing on what that means in practice: “With the same budget, a company can use DeepSeek for seven years and only four months if they use Claude.” That number sounds like hyperbole until you do the math on a meaningful production workload. For a startup running an agent loop that processes 100M input tokens and 100M output tokens per month, V4 costs ~$130 against ~$3,000 for Opus 4.6, roughly a 23x gap. The difference at scale is not “we save some money”; it is “we can afford to run agents on tasks we previously could not afford to consider.”
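The arithmetic behind those monthly figures can be checked in a few lines. This is a minimal sketch using the per-million-token prices from the table above; the workload split (100M input plus 100M output tokens per month) is an assumption chosen because it reproduces the ~$130 and ~$3,000 figures, and the dictionary keys are illustrative labels, not API model identifiers.

```python
# Prices in USD per million tokens, copied from the pricing table above.
# The keys are illustrative labels, not real API model names.
PRICES = {
    "deepseek-v4": {"input": 0.435, "output": 0.87},
    "opus-4.6": {"input": 5.00, "output": 25.00},
}


def monthly_cost(model: str, input_mtok: float, output_mtok: float) -> float:
    """Return the monthly cost in USD for a volume given in millions of tokens."""
    price = PRICES[model]
    return input_mtok * price["input"] + output_mtok * price["output"]


# Assumed workload: 100M input tokens and 100M output tokens per month.
v4 = monthly_cost("deepseek-v4", 100, 100)
opus = monthly_cost("opus-4.6", 100, 100)
print(f"V4: ${v4:,.2f}  Opus: ${opus:,.2f}  ratio: {opus / v4:.1f}x")
```

Under that assumed split the ratio comes out at about 23x, which is in the same range as the 21x implied by “seven years versus four months” (84 months against 4). A workload that is mostly input tokens, or that hits the cache often, would shift the ratio.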

The discount story is also worth quoting because it tells you something about DeepSeek’s confidence: the launch-week discount became permanent within 48 hours. Bycloud’s reaction in the video: “Oh, never mind, they just made the discount permanent now. And we all get to know how they have done that completely for free.” DeepSeek’s V3 economics already suggested they were making 50-70% margins at their headline prices. The V4 permanent discount implies the underlying inference cost dropped enough that the new price is sustainable, not promotional.

The two attention mechanisms that made it possible

DeepSeek’s technical report introduces two new attention mechanisms — CSA (Compressed Sparse Attention) and HCA (Heavily Compressed Attention) — that together unlock the long-context + low-cost combination. Bycloud’s longer technical follow-up “The Insane Infrastructure Design of DeepSeek V4” (June 5) covers the implementation in detail.

The short version, in the framing bycloud uses: in standard attention, the KV cache grows by one entry per token, so a 1M-token context means 1M cached entries to attend over. Most tokens do not need to remain at full resolution. CSA compresses the parts of the attention pattern that the model has learned carry little useful long-range information. HCA compresses further still on the genuinely static parts of the context.

The net effect across both mechanisms is that V4 can serve 1M-context workloads with memory and compute proportional to meaningful context rather than total context. This is what other labs have been trying to do for two years and failing to ship at scale. DeepSeek shipped it as a launch feature with documentation and a permissive license.
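To get a feel for why cache size matters at this length, here is a back-of-the-envelope sketch of KV-cache memory. It is not an implementation of CSA or HCA; it only shows how cache size scales linearly with the number of stored entries, and what happens when fewer entries are kept. The layer count, head count, head dimension, and the 10x compression ratio are all hypothetical values picked for illustration, not figures from the technical report.

```python
from dataclasses import dataclass


@dataclass
class KVConfig:
    """Hypothetical transformer shape; none of these values come from the V4 report."""
    layers: int
    kv_heads: int
    head_dim: int
    bytes_per_value: int = 2  # 16-bit values (fp16 or bf16)


def kv_cache_bytes(cfg: KVConfig, tokens: int, compression_ratio: int = 1) -> int:
    """Bytes of KV cache when one entry is stored per `compression_ratio` tokens."""
    entries = -(-tokens // compression_ratio)  # ceiling division
    # Factor 2: one key vector and one value vector per head, per layer.
    per_entry = 2 * cfg.layers * cfg.kv_heads * cfg.head_dim * cfg.bytes_per_value
    return per_entry * entries


cfg = KVConfig(layers=60, kv_heads=8, head_dim=128)  # illustrative only
full = kv_cache_bytes(cfg, 1_000_000)
compressed = kv_cache_bytes(cfg, 1_000_000, compression_ratio=10)  # assumed ratio
print(f"full cache: {full / 1e9:.2f} GB, compressed: {compressed / 1e9:.2f} GB")
```

With these made-up dimensions, a full-resolution 1M-token cache needs roughly 246 GB per sequence, which is why storing fewer or smaller entries is the lever that makes long context affordable to serve.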

The retrieval accuracy at 1M tokens is the part that makes this matter for working engineers. Most “long context” claims from labs in the past have been technically true but practically useless — the model can read 1M tokens but it cannot reliably retrieve from them. V4 holds roughly 70% retrieval accuracy at the full 1M context length, closing in on Opus 4.6’s accuracy at that scale. For workflows that involve reading large codebases, large document sets, or extensive agent-history context, V4 is now a credible option where previously only the most expensive closed models were.
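Retrieval accuracy at a given context length is something you can spot-check on your own workloads. The sketch below is a minimal needle-in-a-haystack harness, not the benchmark behind the ~70% figure: it hides a passphrase at different depths in filler text and counts how often a model returns it. The `ask` callable is a hypothetical wrapper around whatever client you use, and `stub_model` is a stand-in that finds the answer by string search so the example runs without an API.

```python
from typing import Callable


def build_haystack(filler: str, needle: str, n_sentences: int, depth: float) -> str:
    """Place `needle` at relative `depth` (0.0 = start, 1.0 = end) among filler sentences."""
    position = int(depth * n_sentences)
    sentences = [filler] * n_sentences
    sentences.insert(position, needle)
    return " ".join(sentences)


def retrieval_accuracy(ask: Callable[[str], str], depths: list[float],
                       n_sentences: int = 1000) -> float:
    """Fraction of trials in which the model's answer contains the hidden passphrase."""
    hits = 0
    for i, depth in enumerate(depths):
        secret = f"code-{i:04d}"
        needle = f"The secret passphrase is {secret}."
        context = build_haystack("The sky was grey that day.", needle, n_sentences, depth)
        answer = ask(context + "\n\nWhat is the secret passphrase?")
        hits += secret in answer
    return hits / len(depths)


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call: locates the passphrase by string search."""
    marker = "The secret passphrase is "
    start = prompt.find(marker)
    if start < 0:
        return ""
    start += len(marker)
    return prompt[start:prompt.index(".", start)]


print(retrieval_accuracy(stub_model, [0.0, 0.25, 0.5, 0.75, 1.0]))
```

To test a real model, replace `stub_model` with a function that sends the prompt to your endpoint and returns the text, then raise `n_sentences` until the prompt approaches the context length you care about.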

Where V4 actually fits in a routing setup

After three days running V4 Pro and V4 Flash in our production rotation, the segmentation that has settled in:

V4 Flash as default workhorse for bulk routine work. This covers generating boilerplate from specifications, writing test cases from existing code, transforming structured data, summarising long documents, and RAG-adjacent retrieval workflows where the context is the work and the model just needs to produce a clean summary. At cache-hit prices of roughly $0.00365 per million tokens, the cost of re-reading cached context ceases to be a budgeting concern at any production volume.

V4 Pro for cost-sensitive coding tasks where the routing logic justifies the swap. Multi-file refactors with clear specs, well-defined feature implementations, test-driven development workflows where the test suite catches regressions. The quality is not Opus 4.8, but the cost difference is large enough that the routing matters at production scale.

Keep frontier models for the cases that justify them. Hard debugging, novel architectural decisions, ambiguous specs, tasks where the cost of a wrong answer compounds across many subsequent decisions. Opus 4.8 and GPT-5.5 are not displaced from this slot.

Long-context analysis is where V4 specifically wins. The 1M-context + 70% retrieval combination means V4 is now the right tool for “read this entire monorepo and answer questions about cross-file patterns” or “ingest a year of meeting transcripts and identify recurring themes.” Previously this slot belonged to Gemini Pro alone. V4 enters as a much cheaper alternative.
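The segmentation above can be written down as a small routing function. This is a sketch of the decision logic only, under stated assumptions: the task kinds, the tier labels, and the 200,000-token threshold for “long context” are hypothetical names and values, and sending long-context work to the Pro variant rather than Flash is our assumption, since the segmentation does not say which variant handles it.

```python
from dataclasses import dataclass

# Hypothetical task categories, loosely following the segmentation above.
ROUTINE_KINDS = {"boilerplate", "test_generation", "data_transform", "summarise"}
CODING_KINDS = {"refactor", "feature", "tdd"}
HARD_KINDS = {"hard_debugging", "architecture"}

LONG_CONTEXT_TOKENS = 200_000  # assumed threshold, tune for your workload


@dataclass
class Task:
    kind: str
    context_tokens: int
    well_specified: bool = True


def route(task: Task) -> str:
    """Return a tier label: 'v4-flash', 'v4-pro', or 'frontier'."""
    # Hard or ambiguous work stays on the frontier tier regardless of size.
    if task.kind in HARD_KINDS or not task.well_specified:
        return "frontier"
    # Long-context analysis goes to V4 (the Pro variant is an assumption).
    if task.context_tokens >= LONG_CONTEXT_TOKENS:
        return "v4-pro"
    if task.kind in ROUTINE_KINDS:
        return "v4-flash"
    if task.kind in CODING_KINDS:
        return "v4-pro"
    # Unknown task kinds fall back to the most capable tier.
    return "frontier"


print(route(Task("summarise", 12_000)))
print(route(Task("refactor", 40_000, well_specified=False)))
```

Defaulting unknown work to the frontier tier is the conservative choice: a misrouted task costs more money, but not a wrong answer.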

The r/LocalLLaMA reception

The r/LocalLLaMA community’s reaction was muted at the model level: by their leaderboard count, V4 ranks as the third-best open-weights model, behind Kimi K2.6 and MiMo V2.5 Pro on pure benchmark scores. But the specific posts that surfaced through the launch window are useful signals.

The r/LocalLLaMA “Tips on hitting nearly 200 tok/s for DeepSeek v4 Flash on Hopper” thread is the practical-deployment signal that V4 Flash is genuinely fast on modern GPUs, with the community working out the configuration details together. That kind of thread is what tells you a model is being seriously deployed, not just benchmarked.

The 689-upvote “Cohere’s unreleased coding model (early access for localllama)” thread is the meta-signal: even with V4 just launched, the highest-engagement post in the window was about the next unreleased model rather than the one that had just shipped. r/LocalLLaMA’s pattern through June 2026 is to chase early access to whatever has not yet shipped rather than to commit to whatever just did. That is the community signal that even at V4’s launch price, the pressure on the open-model leaderboard is continuous.

Creator POV vs the working-engineer position

Bycloud’s coverage on V4 has been technically deep and quietly enthusiastic — the framing across both his videos is “DeepSeek did the work no other lab has been willing to do on the efficiency frontier, and they did it openly.” AICodeKing’s “Minimax M3” coverage from June 1 covers the model that landed in the same week with a similar cost-engineering pitch — the broader pattern is that the “cheap with long context” tier has gone from “no credible options” to “multiple credible options” in a single month.

The dissent worth quoting is from working engineers who tried V4 in production and reported that the model is competent but not Opus-tier on hard coding tasks. That is the right calibration. V4 is not a replacement for Opus 4.8 on the work where Opus is the right tool. It is a replacement for “we cannot afford to use Opus, so we don’t run agents on this slice of work.” Those are different segments of the same engineer’s daily decision. V4 expands what is affordable; it does not displace the frontier.

The deeper strategic read on V4 is the one bycloud closes his second video on: this is what compounded efficiency looks like. DeepSeek has been shipping efficiency innovations every six months for three years, each one absorbed and built on by the next. The Western labs ship capability releases on a similar cadence but treat efficiency as a side project. V4 is the result of treating efficiency as the main project. Whether the Western labs adopt that posture in the next twelve months will determine whether the open-weights cost-frontier story stays a DeepSeek story or becomes a broader market story.

What this means for working engineers right now

Three practical implications for June 2026:

1. If you are not yet running a cheap-tier model in your routing setup, V4 is the right entry point. The cost gap to Opus is large enough that the routing complexity pays back within a week. Pick the slice of your work that is “well-specified and high-volume” and try V4 Pro on it. If quality holds, expand. If not, you have learned something useful about where the model floor actually is for your workloads.

2. The 1M-context capability is genuinely new and unlocks workflows that did not exist last quarter. Try giving V4 a full repository’s worth of context for a one-shot question that previously required RAG plumbing. The result may not be perfect, but the comparison against your RAG pipeline at the same cost may surprise you.

3. Plan to re-evaluate your cheap-tier model every two months. The open-weights cost frontier is moving fast enough that locking in any single cheap-tier choice for longer than a quarter is leaving value on the table. V4 today; whatever Cohere ships next; whatever the V4.x increment looks like in August. The discipline is to have a working routing setup and refresh the cheap-tier slot regularly.
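For the second point, packing a repository into a single prompt takes very little code. The sketch below walks a directory, concatenates matching files under a header per file, and stops before an estimated token budget is exceeded. The four-characters-per-token estimate is a rough heuristic rather than a real tokenizer, and the `### FILE:` header format, the suffix filter, and the function name are all illustrative choices.

```python
from pathlib import Path

CHARS_PER_TOKEN = 4  # rough heuristic, not a real tokenizer


def pack_repo(root: str, budget_tokens: int = 1_000_000,
              suffixes: tuple[str, ...] = (".py", ".md")) -> tuple[str, int]:
    """Concatenate matching files under `root` until the estimated token budget is hit.

    Returns the packed prompt text and the estimated number of tokens used.
    """
    parts: list[str] = []
    used = 0
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        chunk = f"### FILE: {path.relative_to(root).as_posix()}\n{text}\n"
        cost = len(chunk) // CHARS_PER_TOKEN + 1
        if used + cost > budget_tokens:
            break  # stop at the first file that would overflow the budget
        parts.append(chunk)
        used += cost
    return "".join(parts), used
```

A typical call would be `prompt, used = pack_repo(".", budget_tokens=900_000)`, leaving headroom below the 1M limit for the question and the answer. Append the question after the packed text and compare the result against your RAG pipeline on the same query.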

The honest summary

V4 is the model the budget-constrained side of the market needed and the model the frontier labs should be embarrassed they have not shipped. The cost economics are real, the long-context capability is real, the technical report is worth reading, and the open release means the engineering community can study and build on what DeepSeek did rather than waiting for the closed labs to catch up.

The strategic question for the rest of 2026 is whether anyone meaningfully matches DeepSeek’s efficiency posture or whether the open-weights cost frontier stays a one-lab story. My read is that we see at least one credible Western competitor by end of year — the gap between V4 and what any closed lab is shipping at cheap-tier pricing is too large to leave unanswered — but the bar V4 just set is high enough that it will take another six months of investment to clear. The cheap-tier market is now meaningfully more competitive than the frontier tier, which is exactly the inversion working engineers have been waiting for.

Sources

Every reference behind this piece. If we make a claim, it's because at least one of these said so — or we lived it ourselves.

  1. Firsthand: Three days of running DeepSeek V4 Pro and Flash as cheap-tier routing models in production
  2. Docs: DeepSeek, V4 technical report
  3. YouTube: bycloud, “How Did DeepSeek V4 Make V4 So Cheap?”
  4. YouTube: bycloud, “The Insane Infrastructure Design of DeepSeek V4”
  5. YouTube: AICodeKing, “Minimax M3 (Fully Tested) + FULLY FREE API: This is ACTUALLY GOOD!”
  6. Forum: r/LocalLLaMA, “Tips on hitting 200 tok/s for DeepSeek v4 Flash on Hopper”
  7. Forum: r/LocalLLaMA, “Cohere's unreleased coding model early access” (689 upvotes)