On June 4, 2026, NVIDIA released Nemotron 3 Ultra, an open 550-billion-parameter mixture-of-experts (MoE) model with 55 billion active parameters built for agent planning in long-running workflows. The same day, AibleClaw added Ultra as a governed planning backend customers can run on a private server or via an NVIDIA Cloud Partner endpoint. For teams shipping autonomous agents locally, the durable answer is simple: put a frontier planning model at the orchestration layer, run it through NVIDIA NIM or AibleClaw-managed infrastructure, and pair it with NemoClaw plus OpenShell when you need a reference secure runtime.

Nemotron 3 Ultra targets the hard turns: multi-step plans, tool routing, sub-agent delegation, and recovery after failed actions. NVIDIA reports 5x higher throughput than comparable open models and up to 30% lower cost to complete agentic benchmarks because the model uses fewer total tokens per task. (Source: NVIDIA Nemotron 3 Ultra blog)

Key takeaways

  • Nemotron 3 Ultra is a 550B/55B MoE open model post-trained for agent harnesses, not single-turn chat.
  • June 4, 2026 general availability spans Hugging Face, ModelScope, OpenRouter, and build.nvidia.com as an NVIDIA NIM microservice.
  • AibleClaw can auto-install Ultra on a private server or point to an existing cloud endpoint for governed long-running agents.
  • NemoClaw (blueprint) and OpenShell (secure runtime) form NVIDIA's reference stack for running harnesses like OpenClaw and Hermes Agent safely.
  • As models absorb orchestration, planning model choice matters more than heavyweight framework features. (Source: The New Stack container wars)

What shipped on June 4, 2026

Nemotron 3 Ultra is NVIDIA's open frontier model for long-running agent orchestration. It combines hybrid Mamba-Transformer layers for long context, NVFP4 quantization so one checkpoint runs on Hopper, Blackwell, and Ampere GPUs, and Multi-Teacher On-Policy Distillation (MOPD) from more than ten domain-specific teacher models. (Source: NVIDIA Nemotron 3 Ultra blog)

On agent benchmarks NVIDIA published, Ultra hits 91% on PinchBench agent productivity, 40% on EnterpriseOps-Gym long-horizon planning (ahead of GLM 5.1 at 33% and Kimi K2.6 at 29%), and 95% on Ruler @1M long-context recall. Throughput tests on the Artificial Analysis Intelligence Index show 5x faster inference using Blackbox endpoints. Cost experiments on SWE-bench and Terminal-Bench 2.0 report up to 30% lower spend because the model completes tasks with fewer tokens per turn. (Source: NVIDIA Nemotron 3 Ultra blog)

NVIDIA also shipped the surrounding Agent Toolkit: NemoClaw blueprints (available now), OpenShell secure runtime (early preview), and CUDA-X libraries exposed as agent skills. Nemotron 3 Ultra is post-trained for leading harnesses including Hermes Agent, LangChain Deep Agents, OpenClaw, OpenHands, and OpenCode. Weights, data recipes, and OpenMDW-1.1 licensing ship open for fine-tuning. (Sources: NVIDIA enterprise agents newsroom, NVIDIA Nemotron 3 Ultra blog)

AibleClaw, Aible's enterprise layer for governed "claw" agents, announced same-day support. Customers can route planning calls to an NVIDIA partner endpoint or have Aible automatically install and configure Ultra on a private server inside the customer's cloud boundary. (Source: AibleClaw Nemotron announcement)

Why agent planning is the new bottleneck

Single-turn chat is a poor fit for production agents. Each planning cycle, tool call, sub-agent handoff, and validation step appends tokens to the context window. Costs compound, and goal drift rises when the orchestration model cannot hold a multi-hour plan steady.

NVIDIA frames the fix as a system of models: a frontier planning model for hard orchestration decisions and smaller models for high-volume execution and validation. Nemotron 3 Ultra sits in the planning slot. (Source: NVIDIA Nemotron 3 Ultra blog)

The framework market is moving the same direction from the opposite side. The New Stack's June 2026 "container wars" analysis argues hyperscalers now give away thin harnesses (model, tools, prompt) as on-ramps to paid inference runtimes, while newer models handle native tool use and self-correction that once required graph orchestrators. AWS's Strands team put it plainly: "We realized that we no longer needed such complex orchestration to build agents, because models now have native tool-use and reasoning capabilities." (Source: The New Stack container wars)

Inference: When the planning model improves, the expensive layer shifts up-stack. Teams debating LangGraph versus CrewAI should first ask whether their planning model can complete multi-step workflows without constant rerouting.

"NVIDIA NemoClaw provides enterprise software developers with the open building blocks to create more secure, long-running AI coworkers that amplify human expertise as they reshape how work gets done." — Jensen Huang, founder and CEO, NVIDIA (Source: NVIDIA enterprise agents newsroom)

Stack comparison: planning model, harness, runtime, enterprise layer

Nemotron 3 Ultra is only one layer. Local agent planning fails when teams treat the model weights as the whole stack. The table below maps four common assembly patterns.

LayerNemoClaw + OpenShell (reference)OpenClaw + NIM (DIY open)AibleClaw + Ultra (enterprise)Cloud NIM endpoint only
Planning modelNemotron 3 Ultra via build.nvidia.com or NIMSame; self-managed NIM on private GPUUltra on private server or partner endpointPartner-hosted Ultra; no local GPU
HarnessHermes, OpenClaw, OpenHands via NemoClaw blueprintPick harness; wire JSON config to NIMOpenClaw inside OpenShell; Aible governanceAPI-only; harness runs elsewhere
Runtime / policyOpenShell early preview; policy + privacy controlsYou own sandboxing (see AgenticWire sandbox guide)Deterministic execution, pre-approved tools, audit trailsInference only; runtime is your problem
Best forDevelopers reproducing NVIDIA's secure reference stackGPU-rich teams comfortable owning securityRegulated enterprises needing governed long-running clawsFast benchmark pilots without hardware

(Sources: NVIDIA Nemotron 3 Ultra blog, NVIDIA enterprise agents newsroom, AibleClaw Nemotron announcement)

Key benefit: AibleClaw adds the enterprise control plane Aible calls deterministic execution: successful Ultra runs can serialize into NVIDIA AI-Q plans for scheduled reuse, and outputs under the open Nemotron license can post-train Nemotron 3 Super or Nano models via Aible's intern-feedback pipeline. (Source: AibleClaw Nemotron announcement)

How to adopt local agent planning with Nemotron 3 Ultra

Step 1 - Choose your deployment plane. Options include a self-hosted NVIDIA NIM microservice on private Hopper/Blackwell/Ampere hardware, an NVIDIA Cloud Partner endpoint, or AibleClaw-managed private server install. NVIDIA lists Hugging Face, ModelScope, OpenRouter, and build.nvidia.com as Day-one channels. (Source: NVIDIA enterprise agents newsroom)

Step 2 - Pick a harness matched to your workflow. Ultra is post-trained across OpenClaw, Hermes Agent, OpenHands, and OpenCode. NVIDIA's GitHub getting-started repo documents config for each. For a minimal loop, OpenClaw plus NIM is enough to test planning quality. (Source: NVIDIA Nemotron 3 Ultra blog)

Step 3 - Add a secure runtime before production. NemoClaw installs OpenShell, where agent-generated code and tool calls execute under policy. NVIDIA positions OpenShell for on-premises, hybrid cloud, DGX Spark, and Windows paths with Microsoft. Skip this only for isolated benchmarks. (Source: NVIDIA enterprise agents newsroom)

Step 4 - Validate on a multi-step workflow, not a single prompt. Aible's joint hackathon with the NemoClaw team required OpenClaw to locate the correct agent, pick a dataset, run analysis, post to Slack, and save the plan for reuse. Nemotron 3 Ultra completed the chain with fewer backtracks, passed Aible's hallucination check on quantitative claims, and saved a deterministic AI-Q plan on the first try. (Source: AibleClaw Nemotron announcement)

Step 5 - Distill if latency or cost still bite. Use Ultra as a teacher under Nemotron's permissive license to bootstrap smaller Super or Nano models with user feedback on reasoning steps via NeMo Customizer. That closes the cold-start gap closed models block with restrictive output licenses. (Source: AibleClaw Nemotron announcement)

Operator note (first-hand): On June 12, 2026, this pipeline run fetched the NVIDIA developer blog, NVIDIA newsroom release, and Aible Newswire post over HTTPS without error. The aiagentstore.ai weekly digest (June 6-12) did not list Aible or Nemotron 3 Ultra, confirming the June 4 launch sits outside that aggregator's current window even though primary sources are live.

For a consumer-GPU baseline on local open models, see AgenticWire's local LLM agentic coding on RTX 3090. Ultra targets datacenter-class MoE inference, not laptop-class weights.

When Ultra beats investing in a heavier framework

MindStudio's 2026 agentic model roundup notes that 95% single-step tool accuracy fails over 20-step workflows because errors compound. That math favors a stronger planning model before another orchestration abstraction. Closed frontier models still lead many SaaS agent products, but Ultra's open weights plus NIM packaging give teams an on-prem planning tier without API egress. (Source: MindStudio agentic models 2026)

NVIDIA's MOPD training, built on NeMo RL, explicitly optimizes harness-native trajectories: plan, call tools, read observations, delegate, validate, recover. SWEBench Verified scores between 65% and 70.4% held consistent across Pi, OpenHands, Hermes, OpenCode, and Mini SWE Agent, signaling harness portability matters less when the model is post-trained across them. (Source: NVIDIA Nemotron 3 Ultra blog)

Decision rule: If your agents already run inside a secure runtime and your bottleneck is bad plans on turn three, upgrade the planning model first. If the bottleneck is ungoverned tool access, fix runtime and identity first; compare patterns in AgenticWire's OpenAI Agents SDK sandbox versus harness safety piece.

Long-running agents also need memory architecture separate from planning. AgenticWire's Mem0 vs Zep vs Letta comparison covers vector, graph, and OS-tier memory when Ultra's context window is not enough for months-long claws.

For framework selection when you still need explicit graphs, see the AutoGen AG2 agent frameworks guide.

FAQ

What is NVIDIA Nemotron 3 Ultra used for in agent workflows?

Nemotron 3 Ultra is an open 550B-parameter MoE model with 55B active parameters post-trained for long-running agent orchestration: planning, tool routing, sub-agent delegation, validation, and error recovery across many turns. NVIDIA positions it for the hard planning calls while smaller models handle high-volume execution. (Source: NVIDIA Nemotron 3 Ultra blog)

Can you run Nemotron 3 Ultra on a private server?

Yes. NVIDIA ships Ultra as open weights and an NVIDIA NIM microservice you can deploy on private Hopper, Blackwell, or Ampere GPUs with NVFP4 quantization. AibleClaw also offers automatic install and configuration on a customer-controlled private server, or you can point to an existing NVIDIA Cloud Partner endpoint. (Sources: NVIDIA enterprise agents newsroom, AibleClaw Nemotron announcement)

What is the difference between NemoClaw and AibleClaw?

NemoClaw is NVIDIA's open-source blueprint that installs OpenShell and wires popular harnesses (Hermes, OpenClaw) to Nemotron models. AibleClaw is Aible's enterprise product for governed long-running agents with deterministic execution, pre-approved tools, audit trails, and optional private Ultra deployment. AibleClaw can run inside the same OpenShell runtime NemoClaw sets up. (Sources: NVIDIA Nemotron 3 Ultra blog, AibleClaw Nemotron announcement)

How does Nemotron 3 Ultra compare on agent planning benchmarks?

NVIDIA reports 40% on EnterpriseOps-Gym long-horizon planning (vs 33% for GLM 5.1 and 29% for Kimi K2.6), 91% on PinchBench, and 5x higher throughput than open peers in its class, with up to 30% lower cost on SWE-bench and Terminal-Bench 2.0 token usage. Treat vendor benchmarks as directional; reproduce on your harness before production cutover. (Source: NVIDIA Nemotron 3 Ultra blog)

Do you still need a heavy agent framework with Nemotron 3 Ultra?

Not always. Models with strong native tool use shrink the orchestration layer frameworks once provided. You still need a harness (OpenClaw, Hermes, OpenHands), a secure runtime (OpenShell), and often an enterprise governance layer (AibleClaw) for regulated workloads. Ultra reduces how much custom graph logic you must maintain inside the harness. (Sources: The New Stack container wars, NVIDIA Nemotron 3 Ultra blog)

References