
DeepSeek V4 pricing turns 1M-token context into an operator choice

DeepSeek’s V4 preview pairs 1M-token context with aggressive pricing. What Pro vs Flash costs, why cache hits matter, and what to verify on non-Nvidia infra.

AgenticWire Desk · April 28, 2026 · 8 min read

DeepSeek has released previews of DeepSeek-V4-Pro and DeepSeek-V4-Flash, models with a 1M-token context window. DeepSeek’s pricing page lists V4 Pro at $1.74 per 1M input tokens (cache miss) and $3.48 per 1M output tokens, alongside a cheaper V4 Flash tier and discounted cache-hit input rates. (Sources: DeepSeek V4 release note, DeepSeek Models & Pricing)

The key benefit is not “one million tokens” as a party trick. It is that long-context and agent-style workloads now have a credible price floor in open weights, and that shifts routing decisions for teams that have been treating frontier APIs as the default. (Source: DeepSeek V4 tech report)

Primary sources: DeepSeek’s V4 release note, DeepSeek’s pricing page, and the DeepSeek V4 technical report, with independent benchmark context from Artificial Analysis. (Sources: DeepSeek V4 release note, DeepSeek Models & Pricing, DeepSeek V4 tech report, Artificial Analysis V4)

What shipped

From DeepSeek’s release note and pricing card, the V4 surface looks like this. (Sources: DeepSeek V4 release note, DeepSeek Models & Pricing)

  • **DeepSeek-V4-Pro** is positioned as the flagship: a MoE model described as 1.6T total parameters with 49B activated parameters, supporting a 1M-token context window. (Sources: DeepSeek V4 release note, DeepSeek V4 tech report)
  • **DeepSeek-V4-Flash** is the smaller tier: 284B total parameters with 13B activated parameters, also supporting 1M-token context. (Sources: DeepSeek V4 release note, DeepSeek V4 tech report)
  • The API is intended to be drop-in for common client stacks: DeepSeek says it supports OpenAI ChatCompletions and Anthropic-format APIs, and instructs users to keep `base_url` and swap the model name to `deepseek-v4-pro` or `deepseek-v4-flash` (see the client sketch after this list). (Source: DeepSeek V4 release note)
  • V4 supports both thinking and non-thinking modes, and DeepSeek’s pricing page lists a maximum output of 384K tokens for the V4 models. (Source: DeepSeek Models & Pricing)
  • DeepSeek’s pricing page lists V4 Flash at $0.14 per 1M input tokens (cache miss) and $0.28 per 1M output tokens, while V4 Pro is $1.74 per 1M input tokens (cache miss) and $3.48 per 1M output tokens, with separate lower prices for cache hits and an announced temporary discount on V4 Pro. (Source: DeepSeek Models & Pricing)
  • DeepSeek says `deepseek-chat` and `deepseek-reasoner` are compatibility names that will be retired and inaccessible after 2026-07-24 15:59 UTC. (Source: DeepSeek V4 release note)
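
If the drop-in claim holds, migrating an existing OpenAI-client integration is mostly a model-name change. A minimal sketch, assuming the standard `openai` Python package, DeepSeek’s usual `https://api.deepseek.com` base URL, and a hypothetical `DEEPSEEK_API_KEY` environment variable:

```python
# Minimal sketch: calling DeepSeek V4 through the standard OpenAI client.
# Assumptions: the `openai` Python package, DeepSeek's usual base URL, and a
# DEEPSEEK_API_KEY environment variable (a placeholder, not from the release note).
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com",  # kept as-is; only the model name swaps
)

response = client.chat.completions.create(
    model="deepseek-v4-flash",  # or "deepseek-v4-pro" for the flagship tier
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize the cache-hit pricing lever in one sentence."},
    ],
)
print(response.choices[0].message.content)
```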

DeepSeek V4 pricing is the headline, but “cache hit” is the lever

DeepSeek’s pricing page explicitly prices input tokens differently depending on whether they hit the prompt cache, and that makes “reused context” an engineering lever instead of a rounding error. (Source: DeepSeek Models & Pricing)

Practitioner payoff: V4 Pro costs $1.74 per 1M input tokens (cache miss) and $3.48 per 1M output tokens; V4 Flash costs $0.14 per 1M input tokens (cache miss) and $0.28 per 1M output tokens. Cache-hit input is priced far lower than cache-miss input, which makes “reuse the same system and policy context across many requests” a first-class cost lever. (Source: DeepSeek Models & Pricing)

Decision rule for teams: If you have a stable system prompt and stable tool definitions, structure your requests so the cache hit rate is high, and spend your budget on output tokens where the actual work happens. If you constantly mutate your system prompt, you will pay the cache miss rate and erase a lot of the advantage. (Inference: based on cache-hit vs cache-miss pricing mechanics on DeepSeek’s pricing page.)
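
To make the lever concrete, a back-of-envelope sketch in Python. The V4 Pro cache-miss and output rates are the pricing figures quoted above; the cache-hit rate is a placeholder, since this article only notes that cache-hit input is priced far lower:

```python
# Back-of-envelope: how cache-hit rate changes input cost for a V4 Pro workload.
# The cache-miss and output rates below are from DeepSeek's pricing page as quoted
# above; CACHE_HIT_USD_PER_M is a HYPOTHETICAL placeholder, since this article only
# notes that cache-hit input is priced "far lower" than cache-miss input.
CACHE_MISS_USD_PER_M = 1.74   # V4 Pro input, cache miss (pricing page)
OUTPUT_USD_PER_M = 3.48       # V4 Pro output (pricing page)
CACHE_HIT_USD_PER_M = 0.20    # placeholder assumption, NOT a published rate

def request_cost(prompt_tokens: int, cached_fraction: float, output_tokens: int) -> float:
    """USD cost of one request, given the share of prompt tokens served from cache."""
    cached = prompt_tokens * cached_fraction
    uncached = prompt_tokens - cached
    return (
        cached * CACHE_HIT_USD_PER_M
        + uncached * CACHE_MISS_USD_PER_M
        + output_tokens * OUTPUT_USD_PER_M
    ) / 1_000_000

# A 200K-token prompt (mostly stable system/tool context) with a 2K-token answer:
for hit_rate in (0.0, 0.5, 0.95):
    print(f"cache hit {hit_rate:.0%}: ${request_cost(200_000, hit_rate, 2_000):.4f}")
```

Under these placeholder numbers, moving a 200K-token prefix from 0% to 95% cache hits cuts the input bill several-fold, which is the whole argument for freezing your system prompt and tool definitions.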

A 1M-token context window only matters if KV cache stops exploding

“1M context” is not new as marketing, but it is still rare as a routine operational setting because attention cost and KV cache size are what break first. DeepSeek’s technical report frames V4 as an efficiency play: hybrid attention designs are there to make long context computationally tractable, not just possible in a demo. (Source: DeepSeek V4 tech report)

In the abstract, DeepSeek says V4 uses a hybrid attention architecture combining Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA), plus other architectural and optimization changes. It also claims that in a one-million-token context setting, V4-Pro requires 27% of the single-token inference FLOPs and 10% of the KV cache compared with DeepSeek-V3.2. (Source: DeepSeek V4 tech report)

Why this matters: DeepSeek V4’s context window is listed as 1M tokens, and its pricing page lists a maximum output of 384K tokens. In the tech report, DeepSeek argues the engineering win is efficiency: reducing FLOPs and KV cache at 1M context so long-horizon tasks can run without collapsing throughput or memory. (Sources: DeepSeek Models & Pricing, DeepSeek V4 tech report)
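
To see what “10% of the KV cache” buys at 1M tokens, a back-of-envelope sketch. Every architecture number below is an invented illustration (layer count, KV heads, head dimension, and dtype are not published figures from the report); only the 10% ratio is DeepSeek’s claim:

```python
# Back-of-envelope KV-cache sizing at 1M-token context.
# All architecture numbers here are HYPOTHETICAL illustrations; only the
# "10% of the KV cache vs V3.2" ratio is DeepSeek's claim in the tech report.
CONTEXT_TOKENS = 1_000_000
LAYERS = 60            # hypothetical
KV_HEADS = 8           # hypothetical (GQA-style)
HEAD_DIM = 128         # hypothetical
BYTES_PER_VALUE = 2    # fp16/bf16

# Per token: 2 (K and V) * layers * kv_heads * head_dim * bytes per value.
per_token_bytes = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
baseline_gb = CONTEXT_TOKENS * per_token_bytes / 1e9
print(f"baseline KV cache at 1M tokens: {baseline_gb:.1f} GB")  # ~245.8 GB here
print(f"at 10% of baseline:             {baseline_gb * 0.10:.1f} GB")
```

Under these invented dimensions, the cache for a single 1M-token sequence drops from a couple hundred gigabytes to a few tens: the difference between cache that dominates your HBM budget and cache you can plan around.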

Operator note (first-hand): We pulled the raw V4 technical report PDF and verified the abstract and the FLOPs and KV-cache reduction claims directly in the document, then cross-checked pricing against the live DeepSeek “Models & Pricing” page on 2026-04-28. (Sources: DeepSeek V4 tech report, DeepSeek Models & Pricing)

Practitioner payoff: If your product’s “context” is really a pile of artifacts, policies, and logs, then lowering KV cache growth is what makes “stuff everything in one prompt” stop being a denial-of-service against your own GPU memory. (Inference: derived from the report’s KV-cache efficiency framing.)

Pro vs Flash is a product decision, not a benchmark flex

DeepSeek’s own positioning is straightforward: Pro is the maximum-capability tier and Flash is the faster, cheaper workhorse. For most teams, the practical approach is routing: default to Flash for volume, then escalate only when failure modes justify it. (Sources: DeepSeek V4 release note, DeepSeek Models & Pricing)

Artificial Analysis provides a useful external check on that product split. In its writeup, AA reports DeepSeek V4 Pro (Max) at 52 on its Intelligence Index and V4 Flash (Max) at 47, with V4 Pro leading open weights on its GDPval-AA agentic work-tasks benchmark. AA also flags high hallucination rates, noting the models “nearly always respond anyway” when they do not know an answer. (Source: Artificial Analysis V4)

Decision rule for teams (a minimal routing sketch follows the list):

  • Route routine tool calls, extraction, and high-throughput chat to `deepseek-v4-flash` when the cost ceiling matters more than best-effort reasoning. (Source: DeepSeek Models & Pricing)
  • Route long-context synthesis, code-agent work, and “you only get one chance” reasoning to `deepseek-v4-pro`, then measure whether the higher output-token cost is offset by fewer retries and fewer tool failures. (Sources: DeepSeek Models & Pricing, Artificial Analysis V4)
  • Add a “refuse to answer” or “ask a clarifying question” policy layer in your product, because a high hallucination rate is an operational property, not just a benchmark footnote. (Source: Artificial Analysis V4)
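
A minimal sketch of that routing policy. The model names come from DeepSeek’s release note; the task taxonomy and escalation triggers are hypothetical application-side choices, not anything DeepSeek prescribes:

```python
# Minimal routing sketch: default to Flash, escalate to Pro on defined triggers.
# Model names are from DeepSeek's release note; the task types and escalation
# rules are hypothetical application-side design choices.
FLASH = "deepseek-v4-flash"
PRO = "deepseek-v4-pro"

HARD_TASKS = {"long_context_synthesis", "code_agent", "one_shot_reasoning"}

def pick_model(task_type: str, prior_failures: int) -> str:
    """Route routine work to Flash; escalate hard or repeatedly failing work to Pro."""
    if task_type in HARD_TASKS:
        return PRO
    if prior_failures >= 2:  # retries and tool errors are the escalation signal
        return PRO
    return FLASH

assert pick_model("extraction", prior_failures=0) == FLASH
assert pick_model("code_agent", prior_failures=0) == PRO
assert pick_model("extraction", prior_failures=3) == PRO
```

The point is to make escalation observable: log which trigger fired, so you can later measure whether Pro’s higher output-token cost is actually offset by fewer retries.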

The Huawei Ascend angle: what’s proven today vs what you still need to test

**Huawei Ascend support** in this context means “Ascend is being marketed as an inference stack that can run DeepSeek-family MoE models.” Huawei’s Ascend inference page says its large-model inference solution is “deeply adapted” to DeepSeek and mainstream MoE large models, which supports a baseline compatibility claim. It does not, by itself, certify V4-specific performance or SLA behavior. (Source: Huawei Ascend inference)

The interesting infrastructure story is not whether V4 can be downloaded. It is whether “open weights + long context” can run well on non-Nvidia stacks in a way that operators can trust. In practice, that comes down to kernel maturity, expert parallelism support, and predictable tool calling under load. (Inference: based on common deployment constraints for MoE models and long-context attention.)

Defensive focus: if you are evaluating Ascend or any non-Nvidia stack for V4-class models, treat it like a production readiness checklist:

  • Verify long-context throughput at your real token mix, not a short prompt demo, because 1M context changes KV-cache behavior and memory pressure. (Source: DeepSeek V4 tech report)
  • Validate expert-parallel performance and routing stability, because MoE serving is sensitive to interconnect and scheduling details. (Source: DeepSeek V4 tech report)
  • Test tool calling and JSON mode under concurrency, because “works once” is not the same as “works as a service” (a minimal probe follows this list). (Sources: DeepSeek V4 release note, DeepSeek Models & Pricing)
  • Measure “don’t know” behavior and refusal policy outcomes, because high hallucination rates can turn into silent correctness bugs. (Source: Artificial Analysis V4)
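
For the concurrency item, a minimal probe sketch. Assumptions: an OpenAI-compatible endpoint (hosted or self-hosted), the `openai` package’s async client, hypothetical `DEEPSEEK_API_KEY`/`BASE_URL` environment variables, and thresholds you would tune to your own SLA:

```python
# Minimal concurrency probe: JSON-mode requests under parallel load.
# Assumptions: an OpenAI-compatible endpoint, the openai package's AsyncOpenAI
# client, and placeholder environment variables; tune concurrency to your SLA.
import asyncio
import json
import os
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],  # placeholder env var
    base_url=os.environ.get("BASE_URL", "https://api.deepseek.com"),
)

async def probe(i: int) -> tuple[float, bool]:
    """One JSON-mode request; returns (latency_seconds, parsed_ok)."""
    start = time.perf_counter()
    resp = await client.chat.completions.create(
        model="deepseek-v4-flash",
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": f'Return {{"n": {i}}} as JSON.'}],
    )
    latency = time.perf_counter() - start
    try:
        json.loads(resp.choices[0].message.content)
        return latency, True
    except (json.JSONDecodeError, TypeError):
        return latency, False

async def main(concurrency: int = 32) -> None:
    results = await asyncio.gather(*(probe(i) for i in range(concurrency)))
    latencies = sorted(lat for lat, _ in results)
    failures = sum(not ok for _, ok in results)
    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    print(f"p95 latency: {p95:.2f}s, JSON failures: {failures}/{concurrency}")

asyncio.run(main())
```

Run the same probe against each candidate stack and compare p95 latency and failure counts at your real concurrency, not one request at a time.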

The story is not “1M context exists.” It is that long context is now cheap enough, and portable enough, that routing and infrastructure become first-order decisions. (Sources: DeepSeek Models & Pricing, DeepSeek V4 tech report)

Context: DeepSeek’s return makes the race about price and deployment options

Over the last year, “frontier” has increasingly meant “closed API, rising price, and a moving target.” DeepSeek is taking the opposite posture: make the weights downloadable under a permissive license, then set API pricing low enough that “try open weights first” becomes plausible. (Sources: DeepSeek V4 release note, DeepSeek Models & Pricing)

The interesting part is not open weights as ideology. It is that operators get to choose where the model runs: hosted API for speed, or self-hosting for control and compliance, or a hybrid where you keep sensitive workflows inside your boundary and route generic work out. That decision is only plausible if the cost curve and long-context efficiency are believable. (Source: DeepSeek V4 tech report)

If you want a comparison point for “how vendors are changing the meter,” our recent [Opus 4.7 GA coverage](https://www.agenticwire.news/article/opus-4-7-ga) is a useful parallel: effort modes and token accounting are now part of product design, not just billing. (Inference: contextual internal link for operator framing.)

Adoption notes

Decision rules for teams:

  • Start with `deepseek-v4-flash` for prototypes and high-throughput features, then promote only the “hard prompts” to V4 Pro based on measurable failure modes like retries, tool errors, and user-visible corrections. (Sources: DeepSeek Models & Pricing, Artificial Analysis V4)
  • Treat 1M context as a capability you earn with engineering, not a number you paste into marketing: constrain what you feed the model, cache what is stable, and monitor memory and latency at the 95th percentile. (Source: DeepSeek V4 tech report)
  • If your workload depends on the model abstaining when uncertain, build that behavior at the application layer and evaluate “don’t know” behavior explicitly, because AA reports high hallucination rates for V4 Pro and V4 Flash (a minimal eval sketch follows this list). (Source: Artificial Analysis V4)
  • For non-Nvidia infra, do not treat “supports DeepSeek” as “supports V4 at your SLA”: run a pilot that includes long-context prompts, tool calling, and real concurrency, then decide. (Sources: Huawei Ascend inference, DeepSeek V4 tech report)
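
For the abstention point, a minimal eval sketch. The unanswerable prompts, abstention markers, and pass threshold are all hypothetical choices to adapt to your domain; a production version would use labeled data or a judge model rather than string matching:

```python
# Minimal abstention check: does the model decline unanswerable questions?
# Prompts, markers, and threshold are HYPOTHETICAL; adapt them to your domain.
UNANSWERABLE = [
    "What was the exact closing price of ACME stock on 2031-03-02?",
    "Quote the third sentence of a document I have not shown you.",
]
ABSTAIN_MARKERS = (
    "i don't know", "i do not know", "cannot determine", "not enough information",
)

def abstained(answer: str) -> bool:
    """Crude string check; a judge model or labeled data would be more robust."""
    return any(marker in answer.lower() for marker in ABSTAIN_MARKERS)

def abstention_rate(answers: list[str]) -> float:
    return sum(abstained(a) for a in answers) / len(answers)

# Wire `answers` to real model outputs for UNANSWERABLE, then gate deploys on a
# threshold (e.g. fail if the rate is below 0.9).
answers = [
    "I don't know the closing price for that date.",
    "The third sentence says the project launched in March.",  # a hallucination
]
print(f"abstention rate: {abstention_rate(answers):.0%}")  # 50% in this toy case
```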

Related Coverage

  • [Opus 4.7 GA](https://www.agenticwire.news/article/opus-4-7-ga) - what changed in Anthropic’s flagship tier and why “effort modes” matter to operators.
  • [OpenAI 122B funding](https://www.agenticwire.news/article/openai-122b-funding) - the macro pressure that turns model capability into pricing power.
  • [Microsoft Agent Framework 1.0 ships graph workflows and MCP](https://www.agenticwire.news/article/microsoft-agent-framework-1-0-workflows-mcp) - how agent runtimes are becoming explicit graphs and tool contracts.

References

  • Artificial Analysis V4 - https://artificialanalysis.ai/articles/deepseek-is-back-among-the-leading-open-weights-models-with-v4-pro-and-v4-flash
  • DeepSeek Models & Pricing - https://api-docs.deepseek.com/quick_start/pricing/
  • DeepSeek V4 release note - https://api-docs.deepseek.com/news/news260424
  • DeepSeek V4 tech report - https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf
