TL;DR: NVIDIA has released Nemotron 3 Super, an open-source language model engineered specifically for multi-agent AI deployments. It combines a Mamba-Transformer hybrid architecture with mixture-of-experts routing and a 1-million-token context window. Announced as part of NVIDIA's broader agentic AI push at GTC, Nemotron 3 Super is available under a permissive open license and integrates with the NemoClaw and OpenShell enterprise platforms.
What Nemotron 3 Super is and why it matters
The enterprise AI market has been divided for some time between closed frontier models — GPT-5.4, Claude 3.7, Gemini 2.5 — and open-weight models that trade raw capability for deployment flexibility. NVIDIA's Nemotron 3 Super lands squarely in that gap and challenges the assumption that the two properties must be in tension.
Nemotron 3 Super is not a generalist language model tuned for leaderboard benchmarks. NVIDIA designed it with a specific deployment target in mind: multi-agent pipelines where multiple LLM instances collaborate, delegate, and reason across very long task horizons. That design constraint shapes nearly every architectural decision in the model, from how it handles attention to how it routes computation across expert subnetworks.
The timing is deliberate. NVIDIA announced Nemotron 3 Super at GTC — its annual developer and research conference — alongside a cluster of other agentic AI products including the NemoClaw enterprise agent framework and the OpenShell secure agent execution environment. Together these form what NVIDIA is calling its agentic AI stack, and Nemotron 3 Super is the inference engine that sits at the core.
What makes this release significant is not any single capability in isolation. A 1M-token context window is impressive but not unprecedented. Mixture-of-experts routing is becoming table stakes. Mamba-style state space models have been explored extensively in research. What NVIDIA has done is combine all three in a single open-weight model optimized for the specific computational and memory access patterns that emerge when you run dozens of collaborating agents simultaneously.
For developers building production agent systems on proprietary cloud APIs, the cost and latency math often breaks down at scale. A single long-horizon task might require hundreds of model calls, each potentially carrying a multi-thousand-token context. Running that workload on a self-hosted open-weight model with Nemotron 3 Super's efficiency profile changes the economics substantially.
Standard transformer architectures scale quadratically with sequence length in their attention mechanism. For most tasks this is manageable, but it becomes a serious bottleneck when you want to reason across very long contexts — especially in a multi-agent setting where the effective context may include conversation history, tool call logs, retrieved documents, and inter-agent message threads simultaneously.
Mamba is a class of selective state space models (SSMs) that process sequences with linear rather than quadratic complexity. The core idea is that instead of attending to every prior token on every forward pass, the model maintains a compressed hidden state that it updates recurrently. This makes Mamba naturally efficient for long sequences but historically weaker at tasks requiring precise retrieval or fine-grained comparison across distant parts of the input — exactly the cases where transformers excel.
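The linear-time recurrence can be illustrated with a toy state space scan. This is a deliberately simplified sketch — a non-selective, single-input-channel recurrence with made-up matrices — not Mamba's actual selective, input-dependent parameterization or its fused CUDA kernels:

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Toy state space recurrence: h_t = A h_{t-1} + B x_t, y_t = C h_t.

    One pass over the sequence, so cost grows linearly with length T, and
    history is compressed into a fixed-size hidden state h instead of
    attending to every prior token. Real Mamba layers make A, B, C depend
    on the input ("selective") and run as fused parallel scans.
    """
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:                 # single left-to-right pass: O(T)
        h = A @ h + B * x_t       # update the compressed recurrent state
        ys.append(C @ h)          # readout from the state, not from all tokens
    return np.array(ys)

# Illustrative parameters (not Nemotron's): 4-dim state, scalar input channel.
rng = np.random.default_rng(0)
A = 0.9 * np.eye(4)               # stable decay of the hidden state
B = rng.standard_normal(4)
C = rng.standard_normal(4)
y = ssm_scan(rng.standard_normal(64), A, B, C)
print(y.shape)                    # one output per input token
```

The trade-off described above falls directly out of this structure: everything the model "remembers" must fit in `h`, which is why pure SSMs struggle with precise long-range retrieval that attention handles natively.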
Nemotron 3 Super's hybrid architecture interleaves Mamba layers with standard transformer attention blocks. The intuition is straightforward: use Mamba for the bulk of sequence processing where compressed recurrent state is sufficient, and deploy full attention selectively at layers where precise cross-token comparison is necessary. The model learns, during training, which layers need which behavior.
The practical result is a model that handles very long contexts without the memory and compute explosion that would normally accompany a 1M-token window. A pure transformer at that context length would be prohibitively expensive to run outside of a hyperscaler data center. The Mamba-Transformer hybrid brings those memory requirements within reach of enterprise on-premise deployments — particularly when running on NVIDIA's own hardware stack, including the H100 and the newer Blackwell-generation GPUs announced at GTC alongside the Vera Rubin supercomputer architecture.
NVIDIA's internal benchmarks show that Nemotron 3 Super maintains near-identical retrieval accuracy to a full-attention transformer on tasks requiring precise lookups at positions up to 128K tokens, while delivering significantly better throughput at context lengths beyond 256K. At the 1M-token upper bound, a comparable full-attention model would require impractical amounts of KV cache memory; the hybrid architecture keeps this manageable through selective attention application.
The hybrid also has implications for agent workflows specifically. In a multi-agent pipeline, one agent's output often needs to be precisely referenced by another agent later in the workflow. The transformer attention layers in Nemotron 3 Super are positioned to handle exactly this kind of targeted cross-context lookup, while Mamba layers efficiently compress and propagate the broader state between those lookup events.
MoE routing for multi-agent scenarios
Mixture-of-experts (MoE) is a model architecture that replaces dense feed-forward layers with a set of specialist subnetworks (the "experts") and a learned routing function that selects which experts to activate for any given input token. The result is a model with a large total parameter count but a smaller number of active parameters per forward pass — giving you the expressivity of a large model at the inference cost of a smaller one.
Nemotron 3 Super applies MoE routing with a design philosophy specific to multi-agent workloads. In a typical single-model deployment, MoE routing optimizes for task-level diversity — routing code tokens through different experts than natural language tokens, for example. In a multi-agent setting, the diversity is different: you have a planner agent, retrieval agents, critic agents, execution agents, and synthesis agents, all operating with different token distributions and different reasoning patterns.
NVIDIA has trained Nemotron 3 Super's routing function on data that reflects this agentic token distribution. The model develops specialized expert clusters that handle agent-specific input patterns: tool call formatting, inter-agent delegation syntax, chain-of-thought scratchpads, structured output schemas, and multi-turn instruction following with role context. When deployed in a multi-agent system, the router efficiently channels each agent's characteristic input patterns to the expert cluster best suited to process them.
This has a compounding benefit in batched inference. When you run many agent instances simultaneously — which is the normal operating mode for a multi-agent pipeline — the router's specialization means that tokens from agents with similar roles tend to route to the same experts. This improves GPU utilization through better expert load balancing and reduces the tail latency that typically plagues heterogeneous agent batches.
There is also a fault isolation dimension. Because each agent's computation is routed through a relatively contained subset of experts, errors in one agent's context — malformed tool outputs, adversarial injections, hallucinated tool call syntax — are less likely to contaminate the model's internal state in ways that affect other concurrent agents. This is not a security guarantee, but it is a meaningful architectural property for systems that need to reason about agent isolation.
What a 1M token context window actually enables
A 1-million-token context window is a striking headline number, but a large context limit is easy to announce even when the model's actual performance degrades severely at lengths far below the stated maximum. The meaningful question is whether Nemotron 3 Super maintains coherent reasoning and accurate retrieval across the full range.
NVIDIA's evaluation methodology for long-context performance focuses on three regimes: needle-in-a-haystack retrieval (finding a specific fact buried in a long document), multi-document synthesis (answering questions that require combining information from multiple documents spread across the context), and temporal reasoning across long interaction histories (following the evolution of a state or plan across a very long conversation thread).
The Mamba-Transformer hybrid architecture described above is specifically optimized for the second and third regimes, which are the ones most relevant to multi-agent deployments. Single-document needle retrieval is the easiest benchmark to game; multi-document synthesis and temporal reasoning across very long horizons are the hard problems that most models struggle with past 100K tokens.
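The first regime is simple to probe against any deployment you run. The harness below builds a synthetic haystack with a planted fact at a controlled depth; pairing the returned prompt with your own inference call (no particular API is assumed here) lets you chart retrieval accuracy against length and depth:

```python
import random

def build_needle_probe(needle, question, answer, n_sentences=500, depth=0.5, seed=0):
    """Bury a 'needle' sentence at a relative depth inside filler text.

    Returns (prompt, answer). Send the prompt to whatever inference endpoint
    you deploy and check the completion against `answer`; sweeping
    n_sentences and depth maps out the model's retrieval behavior.
    """
    rng = random.Random(seed)
    fillers = [
        "The quarterly report was filed on schedule.",
        "The committee reconvened after a short recess.",
        "Maintenance on the east wing finished early.",
    ]
    doc = [rng.choice(fillers) for _ in range(n_sentences)]
    doc.insert(int(depth * len(doc)), needle)   # plant the fact at the depth
    prompt = " ".join(doc) + f"\n\nQuestion: {question}\nAnswer:"
    return prompt, answer

prompt, answer = build_needle_probe(
    needle="The vault access code is 7419.",
    question="What is the vault access code?",
    answer="7419",
    depth=0.8,
)
print(answer in prompt)  # True: the planted fact survives prompt assembly
```

Multi-document synthesis and temporal-reasoning probes follow the same pattern but plant several interdependent facts, which is precisely why they are harder to game.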
In practical terms, a 1M-token context window with reliable performance enables several agent architectures that were previously impractical:
Full repository reasoning. A software engineering agent can load an entire codebase — including documentation, test files, and historical commit context — into a single context window without chunking. This eliminates the retrieval errors that occur when relevant code is split across multiple context windows and the agent must decide which chunk to load.
Long-horizon task execution. An agent working on a multi-day project can maintain the full thread of its work history in context without summarization. Summarization is a lossy compression that routinely drops the precise details that matter most when a task runs into an edge case hours or days into execution.
Multi-agent transcript reasoning. A supervisor agent orchestrating a team of specialized subagents can hold the full communication transcript of all agents in context simultaneously. This allows the supervisor to identify contradictions between agents, track the provenance of intermediate results, and reason about the overall state of the pipeline without lossy compression.
Legal and financial document analysis. Enterprise use cases involving contracts, regulatory filings, or financial statements often require reasoning across very long documents. A 1M-token window can accommodate an entire legal contract corpus or a full set of SEC filings in a single pass.
Open-source licensing and how to access it
Nemotron 3 Super is released under a permissive open license that allows commercial use, fine-tuning, and redistribution. NVIDIA has been deliberate about not applying the use-case restrictions that have made some previous open-weight releases less useful in practice — restrictions that prohibit competing with the releasing organization or require attribution in production systems, for example.
The model weights are available through NVIDIA NGC, the company's model registry, as well as through Hugging Face. NVIDIA is also publishing the full technical report detailing the training methodology, dataset composition, and evaluation protocols — a level of transparency that goes beyond most commercial open-weight releases.
Fine-tuning support is provided through NVIDIA's NeMo framework, which includes tooling for supervised fine-tuning, RLHF, and direct preference optimization. Quantized variants are available in INT4 and INT8 precision for deployment on hardware with tighter memory constraints. NVIDIA is publishing reference deployment configurations for single-node A100, H100, and Blackwell GPU setups, as well as multi-node configurations for high-throughput production deployments.
The open release is partly a strategic move. By making the model weights freely available, NVIDIA positions its hardware as the natural deployment platform and its software stack — NeMo, TensorRT-LLM, Triton Inference Server — as the obvious infrastructure layer. The model itself is the loss-leader; the GPU and software ecosystem is where NVIDIA captures value.
How it fits into NVIDIA's NemoClaw and OpenShell stack
Nemotron 3 Super is not a standalone product. It is the inference engine at the center of a broader agentic AI platform that NVIDIA has been assembling across multiple GTC announcements.
NemoClaw is NVIDIA's open-source framework for building and orchestrating multi-agent AI systems in enterprise environments. It provides the agent lifecycle management, inter-agent communication protocols, tool integration layers, and observability infrastructure that are missing from basic LLM APIs. NemoClaw is designed to run on top of any compliant inference backend, but it is optimized for Nemotron 3 Super's specific capabilities — particularly the 1M-token context window and the MoE routing behavior under batched multi-agent loads.
OpenShell addresses the security and isolation requirements that enterprise deployments impose. When agents have access to tools — code execution, file systems, external APIs, databases — you need a sandboxed execution environment that enforces privilege separation and audits agent actions. OpenShell provides this layer, integrating with NemoClaw's orchestration layer and with Nemotron 3 Super's inference endpoint.
Together, these three components give enterprise development teams a complete stack for building multi-agent systems that can run entirely on-premise, on NVIDIA hardware, without any dependency on third-party cloud APIs. For organizations with strict data residency requirements, regulated industries, or simply the desire to avoid per-token API costs at scale, this is a compelling alternative to building on top of OpenAI's or Anthropic's cloud APIs.
The hardware foundation for these deployments is the new generation of NVIDIA infrastructure announced at GTC, including the Vera Rubin six-chip AI supercomputer architecture for datacenter-scale deployments and the Blackwell-generation server nodes for enterprise on-premise installations.
Comparison with GPT-5.4, Claude, and Gemini for agent use cases
Any honest comparison between Nemotron 3 Super and the leading closed frontier models has to acknowledge two different axes: raw capability and deployment practicality.
On raw capability benchmarks — MMLU, HumanEval, MATH, standard reasoning tasks — Nemotron 3 Super is competitive with GPT-4-class models but does not match the frontier performance of GPT-5.4 or Claude 3.7 Sonnet on general-purpose tasks. For agentic benchmarks specifically — GAIA, AgentBench, SWE-bench — the gap narrows considerably, because these benchmarks test the specific combination of long-context reasoning, tool use, and multi-step planning that Nemotron 3 Super is optimized for.
On deployment practicality, Nemotron 3 Super has structural advantages that matter at scale:
Latency. Cloud API calls for frontier models carry inherent network latency on top of inference time. A self-hosted Nemotron 3 Super deployment on a well-configured H100 cluster can achieve lower end-to-end latency for the common case, particularly for short-to-medium length completions in a batched multi-agent workload.
Cost. Frontier model API pricing at scale — hundreds of thousands of agent calls per day — is expensive. The economics of self-hosting a Nemotron 3 Super deployment on owned hardware depend heavily on capital expenditure amortization, but for organizations already running NVIDIA infrastructure for other workloads, the marginal cost per token is substantially lower.
Data privacy. Sending proprietary business data, customer records, or regulated information to a third-party API creates compliance complications that many enterprises cannot accept. Self-hosted Nemotron 3 Super eliminates this exposure entirely.
Context window. Gemini 2.5 Pro offers a 1M-token context window in its cloud API. GPT-5.4 and Claude 3.7 have significantly smaller context limits for standard API access. Nemotron 3 Super matches Gemini's stated context length while running on-premise.
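The cost argument above can be made concrete with a break-even calculation. Every number below is an assumption chosen for illustration — not vendor pricing for Nemotron 3 Super, any GPU server, or any API:

```python
def breakeven_days(hw_cost_usd, tokens_per_day, api_price_per_mtok,
                   selfhost_opex_per_day=0.0):
    """Days until self-hosting amortizes versus paying per-token API prices.

    Ignores financing, depreciation schedules, and utilization headroom;
    a deliberately crude sketch of the shape of the trade-off.
    """
    api_cost_per_day = tokens_per_day / 1e6 * api_price_per_mtok
    saved_per_day = api_cost_per_day - selfhost_opex_per_day
    return hw_cost_usd / saved_per_day

# Assumed figures for illustration only: a multi-agent workload of 200M
# tokens/day, $5 per million tokens at the API, a $250k GPU server, and
# $300/day in power and operations for self-hosting.
days = breakeven_days(250_000, 200_000_000, 5.0, selfhost_opex_per_day=300)
print(f"break-even after ~{days:.0f} days")
```

Under these assumptions the hardware pays for itself in under a year, and the point generalizes: the heavier and steadier the agent workload, the faster self-hosting wins, which is exactly the regime multi-agent pipelines occupy.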
The honest summary: for organizations that need frontier-level general capability and can accept cloud API deployment, GPT-5.4 and Claude 3.7 remain stronger choices. For organizations building multi-agent systems where deployment flexibility, context length, cost at scale, and data privacy matter more than marginal capability differences, Nemotron 3 Super is a genuinely competitive option.
What this means for enterprise agent deployments
The release of Nemotron 3 Super marks a maturation point in the enterprise AI market. For the past two years, building production multi-agent systems has meant accepting one of two constraints: use closed frontier APIs with all the associated cost, latency, and data sovereignty compromises, or use open-weight models that are capable enough for simple tasks but break down on the long-horizon, multi-step reasoning that real enterprise agents require.
Nemotron 3 Super is the first open-weight model whose architectural choices are explicitly oriented toward closing that gap for agentic workloads. The 1M-token context window, MoE routing optimized for multi-agent batches, and the Mamba-Transformer hybrid that makes large contexts computationally tractable are not independently novel, but their combination in a single deployable open-weight model is new.
For enterprise teams evaluating AI infrastructure, Nemotron 3 Super changes the build-vs-buy calculus in several specific scenarios. Organizations in financial services, healthcare, legal services, and government — sectors where data residency and regulatory compliance are non-negotiable — now have a credible on-premise option for long-horizon agent deployments. Software engineering teams building developer tooling agents for large internal codebases can eliminate the context window chunking hacks that create fragile behavior in repository-scale tasks. Operations teams building process automation agents that need to track complex, long-running workflows have a model that can maintain coherent state across the full execution horizon.
The open-source release also accelerates the research and fine-tuning ecosystem around agentic models. With access to the weights, the AI research community can study how MoE routing behaves in multi-agent settings, experiment with fine-tuning the routing function for specific agent taxonomies, and build specialized variants for domain-specific agent deployments. This kind of community development compounds over time in ways that closed models cannot match.
NVIDIA's broader strategic bet here is that the future of enterprise AI runs on-premise on NVIDIA hardware, with an open model stack that creates ecosystem lock-in at the infrastructure and tooling layer rather than the model layer. Nemotron 3 Super, NemoClaw, and OpenShell together make that bet concrete. Whether that strategy succeeds depends on execution quality, community adoption, and whether NVIDIA can maintain competitive model capability as the frontier continues to advance — but the opening move is a strong one.
FAQ
What is the parameter count for Nemotron 3 Super?
NVIDIA has not released the exact total parameter count, following the practice of several other MoE model releases. The model has a large total parameter footprint due to the MoE architecture, but the number of active parameters per forward pass — which determines inference compute cost — is significantly smaller. NVIDIA has stated that inference costs are comparable to a dense model in the 13B-20B parameter range.
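The total-versus-active distinction works out as simple arithmetic over the expert feed-forward blocks. The layer counts and widths below are invented for illustration and are not Nemotron 3 Super's undisclosed configuration:

```python
def moe_param_counts(layers, d_model, d_ff, n_experts, k):
    """Total vs. active parameter counts for the expert FFNs in an MoE stack.

    Each expert FFN is modeled as two projection matrices (d_model x d_ff
    and back). All n_experts occupy memory, but only k run per token, so
    inference compute tracks the 'active' figure. Attention and embedding
    parameters are omitted from this sketch.
    """
    per_expert = 2 * d_model * d_ff
    total = layers * n_experts * per_expert
    active = layers * k * per_expert
    return total, active

# Invented illustrative configuration, not NVIDIA's disclosed numbers.
total, active = moe_param_counts(layers=48, d_model=4096, d_ff=14336,
                                 n_experts=16, k=2)
print(f"expert params: {total/1e9:.0f}B total, {active/1e9:.0f}B active per token")
```

With these assumed dimensions, a ~90B-parameter expert pool costs roughly 11B parameters of compute per forward pass — the same order as the 13B-20B dense-equivalent figure NVIDIA cites.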
Can Nemotron 3 Super be fine-tuned for specific agent roles?
Yes. NVIDIA provides fine-tuning support through the NeMo framework, including supervised fine-tuning, RLHF, and DPO pipelines. The open weight release means you can fine-tune directly on the full model or apply parameter-efficient methods like LoRA. NVIDIA has also published guidance on fine-tuning specific agent roles — planner, executor, critic — using the NemoClaw agent taxonomy as a structured training target.
Does the 1M token context window require special hardware?
The Mamba-Transformer hybrid architecture substantially reduces the memory footprint compared to what a pure transformer would require at 1M tokens, but very long context inference still requires significant GPU memory. NVIDIA's reference configurations suggest a minimum of 4x H100 80GB GPUs for reliable 1M-token inference. Shorter context deployments — up to 128K tokens — can run on a single H100 80GB with appropriate quantization.
How does Nemotron 3 Super handle the multi-agent security model?
The model itself is not a security boundary — security for multi-agent deployments is enforced at the OpenShell execution environment layer, which handles privilege separation, tool call sandboxing, and audit logging. The MoE routing architecture provides some incidental isolation between concurrent agent computations, but this should not be relied on as a security control. Proper agent security requires the full OpenShell integration.
Is Nemotron 3 Super suitable for real-time applications?
It depends on context length. For completions in the 1K-8K token range, latency on well-configured H100 hardware is competitive with cloud API response times for most use cases. As context length increases toward the 1M-token limit, latency increases proportionally — that is inherent to processing long sequences. For real-time user-facing applications requiring very low latency, shorter-context configurations or smaller model variants will generally be more appropriate than running at the full context window length.