xAI shipped Grok 4.3 on April 30, 2026. Independent analysts at Artificial Analysis report an Intelligence Index score of 53 for Grok 4.3, improved agentic benchmark results versus Grok 4.20 0309 v2, and a cost of about $395 to run the full Artificial Analysis Intelligence Index suite, lower than the prior Grok generation under the same methodology. (Sources: Artificial Analysis thread, Artificial Analysis Grok 4.3)
The headline is not “a new version number.” It is a shift in cost-per-intelligence on Artificial Analysis’ composite leaderboard: higher measured capability alongside cheaper full-suite evaluation costs driven by lower per-token pricing, even when the model consumes more output tokens than Grok 4.20 0309 v2 on the same benchmark battery. (Sources: Artificial Analysis thread, Artificial Analysis Grok 4.3)
Primary sources: Artificial Analysis’ unrolled thread summary and the Grok 4.3 model card on Artificial Analysis. (Sources: Artificial Analysis thread, Artificial Analysis Grok 4.3)
What shipped
Artificial Analysis positions Grok 4.3 as a proprietary reasoning model that moves xAI up the Intelligence Index while improving benchmark economics. (Sources: Artificial Analysis thread, Artificial Analysis Grok 4.3)
- Intelligence Index: Grok 4.3 scores 53 on the Artificial Analysis Intelligence Index, placing it just above Muse Spark and Claude Sonnet 4.6 and about four points ahead of the latest Grok 4.20 in Artificial Analysis’ ranking narrative. (Source: Artificial Analysis thread)
- API sticker prices: Artificial Analysis lists Grok 4.3 at $1.25 per million input tokens and $2.50 per million output tokens, with a cache hit input rate of $0.20 on the model page snapshot used for this article. (Source: Artificial Analysis Grok 4.3)
- Full-suite evaluation cost: Artificial Analysis reports about $395 to run the Intelligence Index for Grok 4.3 in the thread narrative, and $395.17 on the model page, framed as roughly 20% lower than Grok 4.20 0309 v2 for the same suite. (Sources: Artificial Analysis thread, Artificial Analysis Grok 4.3)
- Price cuts versus Grok 4.20: The thread cites 37.5% lower input-token prices and 58.3% lower output-token prices as drivers, alongside an older headline-style claim of about 40% lower input and 60% lower output versus Grok 4.20. (Source: Artificial Analysis thread)
- Throughput: The Grok 4.3 model page lists 189.9 output tokens per second on xAI’s API in the captured snapshot, which Artificial Analysis ranks highly on speed among evaluated models. (Source: Artificial Analysis Grok 4.3)
- Verbosity: Grok 4.3 uses about 44% more output tokens than Grok 4.20 0309 v2 to complete the Intelligence Index, and the model page records 88M output tokens for that run. Even so, Artificial Analysis' narrative frames Grok 4.3 as comparatively less verbose than some other leading models. (Sources: Artificial Analysis thread, Artificial Analysis Grok 4.3)
- Modalities and context: The model page lists text and image input, text output, a 1M token context window, and classifies Grok 4.3 as a reasoning model. (Source: Artificial Analysis Grok 4.3)
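The percentage cuts and sticker prices above can be cross-checked with a few lines of arithmetic. This is our own back-of-the-envelope sketch, not an Artificial Analysis calculation: given Grok 4.3's listed $1.25/$2.50 rates and the cited 37.5%/58.3% reductions, the implied Grok 4.20 0309 v2 prices fall out directly.

```python
# Back-compute the implied Grok 4.20 0309 v2 prices from the percentage
# cuts cited for Grok 4.3. The $1.25/$2.50 sticker prices and the
# 37.5%/58.3% figures come from the article's sources; the arithmetic
# here is our own cross-check.

GROK_43_INPUT = 1.25    # $ per 1M input tokens (Artificial Analysis card)
GROK_43_OUTPUT = 2.50   # $ per 1M output tokens

INPUT_CUT = 0.375       # "37.5% lower input-token prices"
OUTPUT_CUT = 0.583      # "58.3% lower output-token prices"

implied_420_input = GROK_43_INPUT / (1 - INPUT_CUT)
implied_420_output = GROK_43_OUTPUT / (1 - OUTPUT_CUT)

print(f"Implied Grok 4.20 input price:  ${implied_420_input:.2f} per 1M tokens")
print(f"Implied Grok 4.20 output price: ${implied_420_output:.2f} per 1M tokens")
```

The implied prices land at roughly $2.00 input and $6.00 output, which is consistent with the thread's rounder "about 40% / about 60%" framing.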
Practitioner payoff: score up, benchmark bill down
Teams that route traffic by “leaderboard tier” and API price should treat Artificial Analysis’ Intelligence Index as one composite signal, not a replacement for task-specific evals. Still, the Grok 4.3 story is unusually concrete on economics: Artificial Analysis explicitly ties the Intelligence Index run cost to combined token usage and per-token pricing, and reports a lower suite cost for Grok 4.3 than Grok 4.20 0309 v2 despite higher output-token usage on that suite. (Sources: Artificial Analysis thread, Artificial Analysis Grok 4.3)
Practitioner payoff: If your internal workloads involve long outputs and agent-style turns, output-token price and verbosity dominate your bill more than input-token price does. Grok 4.3’s published output price is $2.50 per million output tokens on Artificial Analysis’ card, and the thread emphasizes that output-token volume rose versus Grok 4.20 0309 v2 even as total suite cost fell. That pattern rewards teams that measure dollars per successful task, not dollars per million tokens in isolation. (Sources: Artificial Analysis thread, Artificial Analysis Grok 4.3)
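As a minimal sketch of that "dollars per task" framing: the helper below applies Grok 4.3's published per-million rates (including the $0.20 cache-hit rate from the card snapshot) to a token mix you measure yourself. The function name and the example workload numbers are hypothetical placeholders, not anything from the sources.

```python
# Minimal sketch: estimate dollars per task from your own token mix using
# Grok 4.3's published rates ($1.25 input, $2.50 output, $0.20 cache-hit
# input, all per 1M tokens). The workload numbers below are hypothetical.

def cost_per_task(input_tokens: int, output_tokens: int,
                  input_price: float = 1.25, output_price: float = 2.50,
                  cached_input_tokens: int = 0, cache_price: float = 0.20) -> float:
    """Dollar cost of one task; prices are per 1M tokens."""
    fresh_input = input_tokens - cached_input_tokens
    return (fresh_input * input_price
            + cached_input_tokens * cache_price
            + output_tokens * output_price) / 1_000_000

# Hypothetical agent turn: 20k input tokens (half cache hits), 8k output.
print(f"${cost_per_task(20_000, 8_000, cached_input_tokens=10_000):.4f}")
```

Plugging in your harness's measured averages per successful task, rather than a single prompt, is what makes this comparable across models with different verbosity profiles.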
Why this matters: A lower Intelligence Index run cost is not the same as “cheap in production,” but it is a meaningful signal that xAI is pushing Grok 4.3 toward a more favorable spot on Artificial Analysis’ intelligence-versus-cost charts, which many buyers use as a first-pass filter before deeper evaluations. (Sources: Artificial Analysis thread, Artificial Analysis Grok 4.3)
Operator note (first-hand): On 2026-05-02 we fetched https://artificialanalysis.ai/models/grok-4-3 over HTTPS and captured the published Intelligence Index score (53), pricing lines ($1.25 input, $2.50 output), output speed (189.9 tokens per second), Intelligence Index output-token total (88M), and suite cost total ($395.17) directly from the live page content returned to the client. (Source: Artificial Analysis Grok 4.3)
Agentic lifts: GDPval-AA, instruction following, and support simulations
Artificial Analysis highlights Grok 4.3’s largest single benchmark jump on GDPval-AA, its agentic evaluation focused on real-world tasks. Grok 4.3 posts an Elo of 1500, up 321 points from 1179 for Grok 4.20 0309 v2. Artificial Analysis says Grok 4.3 surpasses Gemini 3.1 Pro Preview, Muse Spark, GPT-5.4 mini (xhigh), and Kimi K2.5 on that benchmark snapshot, while still trailing GPT-5.5 (xhigh) by 276 Elo points with an expected win rate of about 17% head-to-head under a standard Elo framing. (Source: Artificial Analysis thread)
Practitioner payoff: If your product roadmap looks like “agents that complete multi-step workflows in messy domains,” GDPval-AA is closer to that risk surface than a pure coding leaderboard. The magnitude of the jump matters: 321 Elo points is a headline-grade move, even if GPT-5.5 (xhigh) remains the leader on Artificial Analysis’ snapshot. (Source: Artificial Analysis thread)
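The "about 17%" head-to-head figure follows from the standard Elo expectation formula, which is worth having on hand when reading Elo-style leaderboards. This is the textbook formula, not an Artificial Analysis artifact:

```python
# Standard Elo expected score: with the opponent rated `rating_diff`
# points higher, the lower-rated side's expected win rate is
# 1 / (1 + 10^(diff / 400)). A 276-point gap yields roughly 17%.

def elo_expected_win_rate(rating_diff: float) -> float:
    """Expected score for the lower-rated side, given the opponent's lead."""
    return 1.0 / (1.0 + 10.0 ** (rating_diff / 400.0))

print(f"{elo_expected_win_rate(276):.1%}")  # roughly 17%
```

The same formula also puts the 321-point jump in perspective: before the update, Grok 4.20 0309 v2 would have been expected to lose the large majority of head-to-head comparisons against Grok 4.3's new rating.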
On instruction following and customer-support style simulations, Artificial Analysis reports that Grok 4.3 gains five points on 𝜏²-Bench Telecom to 98%, described as in line with GLM-5.1, and maintains an 81% IFBench score carried forward from Grok 4.20 0309 v2. Those lines support the thread’s theme that Grok 4.3 is intentionally competitive on agentic customer-support scenarios, not only on abstract reasoning scores. (Source: Artificial Analysis thread)
Decision rule for teams: Treat telecom-style tool simulations as a directional signal for regulated or procedure-heavy support flows, then validate on your own transcripts, ticket taxonomy, and tool contracts. Benchmark leaders can still fail where your tools differ or where compliance constraints narrow allowable actions. (Inference: common deployment practice when benchmarks approximate but do not equal production.)
Knowledge stack tradeoff: AA-Omniscience accuracy versus non-hallucination rate
Artificial Analysis also reports a mixed picture on AA-Omniscience, its knowledge-and-hallucination framing. Grok 4.3 gains eight points on AA-Omniscience Accuracy, but loses eight points on AA-Omniscience Non-Hallucination Rate versus Grok 4.20 0309 v2. On Non-Hallucination Rate, Grok 4.20 0309 v2 still leads in Artificial Analysis’ snapshot, followed by MiMo-V2.5-Pro, with Grok 4.3 further down that specific column. (Source: Artificial Analysis thread)
Why this matters: Teams evaluating “accuracy-first” assistants versus “refuse-when-unsure” assistants should separate those objectives. A gain on accuracy without a gain on non-hallucination suggests different failure modes: more correct answers when the model commits, but not necessarily safer abstention behavior. (Inference: interpretation of reported metric directions from Artificial Analysis.)
Defensive focus: If you ship customer-facing answers grounded in policy documents, run red-team prompts that reward abstention, measure hallucination-style failures on your own corpus, and do not assume leaderboard non-hallucination rankings transfer across domains. (Sources: Artificial Analysis thread, Artificial Analysis Grok 4.3)
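To make the accuracy-versus-abstention distinction concrete, here is a toy metric sketch. The definitions below are one common framing, not necessarily Artificial Analysis' exact methodology: accuracy counts correct answers over all questions, while a non-hallucination-style rate also credits abstentions. The example label sets are invented.

```python
# Toy illustration of why accuracy and non-hallucination can move in
# opposite directions. Definitions are one common framing (assumption),
# not Artificial Analysis' published methodology.

from collections import Counter

def knowledge_metrics(labels: list[str]) -> tuple[float, float]:
    """labels: 'correct', 'wrong', or 'abstain' per question."""
    n = len(labels)
    c = Counter(labels)
    accuracy = c["correct"] / n
    non_hallucination = (c["correct"] + c["abstain"]) / n
    return accuracy, non_hallucination

# A model that commits more often can gain accuracy while hallucinating more:
cautious = ["correct"] * 5 + ["abstain"] * 4 + ["wrong"] * 1
bold = ["correct"] * 7 + ["wrong"] * 3
print(knowledge_metrics(cautious))  # (0.5, 0.9)
print(knowledge_metrics(bold))      # (0.7, 0.7)
```

The "bold" model wins on accuracy and loses on non-hallucination, which mirrors the direction of the reported Grok 4.3 movement without implying anything about its magnitude.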
The practical read is not “Grok 4.3 wins every column.” It is that xAI shipped a model that improves Artificial Analysis’ headline intelligence score and several agentic tracks while cutting the analyst suite’s run cost, with explicit tradeoffs visible on omniscience-style metrics. (Sources: Artificial Analysis thread, Artificial Analysis Grok 4.3)
Context: benchmark suites increasingly double as pricing reviewers
Artificial Analysis has spent years turning “model comparisons” into repeatable methodology around blended price ratios, cache-aware pricing, token-use measurements, and composite indices. Grok 4.3 is another data point in that meta-story: vendors compete not only on capability slides but on how expensive their models are to run through the same public evaluation harness. (Sources: Artificial Analysis thread, Artificial Analysis Grok 4.3)
If you want adjacent framing on how frontier releases interact with operator economics, our GPT-5.5 agentic shift coverage walks through how OpenAI positioned GPT-5.5 across coding and knowledge work, which pairs well with reading GDPval-AA movements as part of a broader agentic trendline. (Inference: editorial pointer; no benchmark equivalence implied.)
Adoption notes
Decision rules for teams:
- Pin your evaluation harness before you pin the model. If Grok 4.3’s strengths on Artificial Analysis are agentic tracks and instruction-following simulations, mirror those tasks with your own tools and data before switching production routes. (Sources: Artificial Analysis thread, Artificial Analysis Grok 4.3)
- Recompute total cost with your token mix. Artificial Analysis’ suite cost combines usage and published pricing; your application may emit shorter or longer outputs than the Intelligence Index harness, which changes where input versus output pricing bites hardest. (Sources: Artificial Analysis thread, Artificial Analysis Grok 4.3)
- Treat omniscience metrics as a policy question. If non-hallucination rate moved in the wrong direction for your risk appetite, add retrieval grounding, citation requirements, and escalation paths independent of the headline Intelligence Index score. (Source: Artificial Analysis thread)
- Compare against Sonnet-class alternatives on real workflows. Artificial Analysis places Grok 4.3 near Claude Sonnet 4.6 on the Intelligence Index narrative in the thread; the right choice still depends on latency, compliance posture, and toolchain fit. (Source: Artificial Analysis thread)
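The "recompute total cost with your token mix" rule above can be sketched as a blended-price calculation using Grok 4.3's listed rates. The mix ratios below are hypothetical; your harness's measured input:output split should replace them.

```python
# Sketch: how the input:output token mix shifts the blended per-1M price,
# using Grok 4.3's listed $1.25/$2.50 rates. The example mixes are
# hypothetical placeholders, not measured workloads.

INPUT_PRICE, OUTPUT_PRICE = 1.25, 2.50  # $ per 1M tokens

def blended_price(output_share: float) -> float:
    """Weighted per-1M-token price for a given output-token share."""
    return (1 - output_share) * INPUT_PRICE + output_share * OUTPUT_PRICE

for share in (0.1, 0.3, 0.5):  # e.g. RAG-heavy, mixed, generation-heavy
    print(f"output share {share:.0%}: ${blended_price(share):.2f} per 1M tokens")
```

Because the output rate is double the input rate here, a shift from a retrieval-heavy mix to a generation-heavy one moves the blended price materially, which is exactly why suite-level costs and production costs can diverge.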
Related coverage
- GPT-5.5 Arrives: The Agentic Shift in Coding, Research, and Knowledge Work - how OpenAI framed GPT-5.5 across agentic workloads that overlap GDPval-style narratives.
- Claude Opus 4.7 is GA: the migration checklist for agentic coding - migration discipline when upgrading flagship tiers used behind coding agents.
- DeepSeek V4 pricing turns 1M-token context into an operator choice - a contrasting open-weights story where API sticker price and cache mechanics dominate routing decisions.
References
- Artificial Analysis Grok 4.3 - https://artificialanalysis.ai/models/grok-4-3
- Artificial Analysis thread - https://threadreaderapp.com/thread/2049987001655714250.html