
NVIDIA Nemotron 3 is the first family of open-weight models designed from the ground up for multi-agent AI workloads. The Nano model activates just 3.2 billion of its 30 billion parameters per forward pass and still outperforms GPT-OSS-20B on most standard benchmarks. Super, the mid-tier option released in March 2026, uses a novel LatentMoE architecture to deliver 2.2x the throughput of GPT-OSS-120B while matching it on accuracy. Ultra, the 500B-parameter flagship, is expected in mid-2026. All three models share the same hybrid Mamba-Transformer mixture-of-experts backbone, a 1M-token context window, and open weights released under the NVIDIA Open Model License.

That combination of efficiency, long context, and openness is what makes Nemotron 3 more than yet another model drop. It is an architecture thesis: that agentic workloads need a fundamentally different model design than chat does.

Related: AI Agent Frameworks Compared: LangGraph, CrewAI, AutoGen

Why Agentic AI Needs a Different Architecture

Standard Transformer models process every token with the same compute budget. That works fine for chat, where context windows stay under 8K tokens and latency per response is the primary concern. Agent workloads look completely different.

A typical multi-agent pipeline might involve a planning agent that reads 50,000 tokens of context, calls three tools, passes results to a coding agent, which generates 4,000 tokens of output, passes that to a review agent, and loops twice. The total token count for a single task easily hits 200K-500K. At Transformer-only attention costs, which scale quadratically with sequence length, running that pipeline at scale gets expensive fast.
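To make the scaling concrete, here is a toy cost model (my own illustration, not NVIDIA's numbers) comparing the token-pair interactions of full self-attention, which grow quadratically, against a linear-time layer:

```python
# Illustrative only: full self-attention does O(n^2) token-pair interactions,
# while a linear-time layer does O(n) work in sequence length.

def attention_cost(n_tokens: int) -> int:
    """Token-pair interactions for full self-attention: n * n."""
    return n_tokens * n_tokens

def linear_cost(n_tokens: int) -> int:
    """Work for a linear-time layer: one fixed-size update per token."""
    return n_tokens

# An 8K chat turn vs. a 500K-token agent pipeline (the upper end above):
chat, agent = 8_000, 500_000
print(agent / chat)                                  # 62.5x more tokens
print(attention_cost(agent) / attention_cost(chat))  # ~3,906x the attention cost
print(linear_cost(agent) / linear_cost(chat))        # only 62.5x the linear cost
```

The gap between the last two numbers is the whole argument: at 500K tokens, quadratic attention pays roughly 3,900 times the 8K-chat cost, while a linear-time layer pays only 62.5 times.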

Mamba layers solve the long-context problem. Mamba-2, the selective state-space model architecture, processes sequences in linear time relative to length. It maintains a compressed state that grows with the information content of the sequence, not its raw length. For an agent reading through 100K tokens of code to find a bug, Mamba layers handle the bulk of that context at a fraction of the cost of full attention.
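As a rough intuition, a state-space layer updates a fixed-size state once per token, so total cost grows linearly and memory stays constant regardless of sequence length. The sketch below is a toy scalar recurrence with made-up coefficients, not Mamba-2's actual selective, matrix-valued formulation:

```python
# Toy linear state-space recurrence: h_t = a*h_{t-1} + b*x_t.
# One O(1) state update per token, so an n-token sequence costs O(n) total,
# and the state size never grows with sequence length.

def ssm_scan(inputs, a=0.9, b=0.1):
    """Scan the sequence once, carrying a single scalar state."""
    h = 0.0
    outputs = []
    for x in inputs:       # one pass; memory footprint is constant
        h = a * h + b * x
        outputs.append(h)
    return outputs

# An impulse at t=0 decays geometrically through the state:
ys = ssm_scan([1.0, 0.0, 0.0])
# ys ≈ [0.1, 0.09, 0.081]
```

Mamba-2's "selective" twist, per the papers, is that the update coefficients depend on the input, letting the state keep or discard information; the constant-memory, single-pass structure is the part this sketch shows.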

But Mamba alone is not enough. State-space models can struggle with tasks that require precise, long-range token-to-token comparisons, exactly the kind of reasoning agents need when they match a function signature on line 40 to a call on line 8,000. Transformer attention layers excel at this. The hybrid approach gives you both: Mamba layers handle the bulk of context processing cheaply, while strategically placed attention layers handle the comparisons that need global precision.

Inside the Nemotron 3 Architecture

The three Nemotron 3 models share architectural principles but differ in scale and sophistication.

Nano: 30B Total, 3.2B Active

Nemotron 3 Nano is the production workhorse. Its 52 layers break down into 23 Mamba-2 layers, 23 MoE feed-forward layers, and 6 grouped-query attention (GQA) layers. Each MoE layer contains 128 experts plus 1 shared expert, with 6 experts activated per token. The result: only 3.2B parameters fire per forward pass out of 30B total.
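The routing step can be sketched in a few lines. This is a generic top-k softmax gate with toy logits: it matches the shape described above (6 of 128 routed experts per token, with the shared expert always on and bypassing the router) but none of the real weights:

```python
import math

def route(logits, k=6):
    """Select the top-k routed experts and softmax-normalize their gate weights.
    The shared expert is not routed; it runs for every token regardless."""
    topk = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in topk)                     # subtract max for stability
    gates = [math.exp(logits[i] - m) for i in topk]
    total = sum(gates)
    return topk, [g / total for g in gates]

# 128 router logits for one token (toy values, monotonically decreasing):
logits = [-i * 0.01 for i in range(128)]
experts, gates = route(logits)
# experts -> [0, 1, 2, 3, 4, 5]; gate weights sum to 1
```

Only the 6 selected experts (plus the shared one) run their feed-forward pass for that token, which is how 30B stored parameters collapse to 3.2B active.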

On NVIDIA’s own benchmarks, Nano delivers 3.3x higher throughput than Qwen3-30B-A3B on a single H200 GPU in an 8K-input / 16K-output setting. That throughput advantage compounds in multi-agent pipelines where you are running dozens of parallel agent calls.

The 1M-token context window is native, not bolted on through RoPE scaling. On the RULER benchmark for long-context evaluation, Nano maintains its accuracy advantage over GPT-OSS-20B and Qwen3-30B across context lengths from 4K to 128K tokens.

Super: 120B Total, 12B Active

Nemotron 3 Super, released in March 2026, introduces two architectural innovations that push the efficiency frontier further.

LatentMoE replaces the standard MoE routing mechanism. In a regular MoE, a router network sends each token to a subset of experts based on the token’s embedding. LatentMoE adds a latent representation step: tokens are first projected into a lower-dimensional space before routing, which allows the model to learn more nuanced expert specialization. NVIDIA reports this achieves better accuracy per parameter and per FLOP than standard MoE.
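NVIDIA has not published LatentMoE's exact equations in the announcement; the sketch below only illustrates the described idea of routing on a down-projected latent rather than the raw token embedding, with made-up dimensions and weights:

```python
# Illustrative latent routing: down-project the token, then score experts
# in the latent space. All matrices and sizes here are hypothetical.

def matvec(m, v):
    """Plain matrix-vector product: (rows x cols) @ (cols,) -> (rows,)."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def latent_route(token, down_proj, router_w, k=2):
    """Project the token into a lower-dimensional latent, then pick top-k experts."""
    latent = matvec(down_proj, token)   # e.g. 4-dim token -> 2-dim latent
    logits = matvec(router_w, latent)   # one logit per expert
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

# 4-dim token, 2-dim latent, 3 experts (all values illustrative):
down = [[1, 0, 0, 0], [0, 1, 0, 0]]
router = [[1, 0], [0, 1], [-1, -1]]
picked = latent_route([2, 1, 0, 0], down, router)
# picked -> experts [0, 1]
```

The claimed benefit is that the router scores experts on a learned, compressed representation instead of the full embedding, which NVIDIA says yields better accuracy per parameter and per FLOP.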

Multi-Token Prediction (MTP) allows the model to predict multiple future tokens in a single forward pass. On SPEED-Bench, Super achieves an average acceptance length of 3.45 tokens per verification step, compared to 2.70 for DeepSeek-R1. That translates to up to 3x wall-clock speedups through speculative decoding without requiring a separate draft model.
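The acceptance-length numbers translate directly into fewer full forward passes. A back-of-envelope model, ignoring draft and verification overhead (so these are upper bounds on the real speedup):

```python
import math

def forward_passes(n_tokens: int, acceptance_len: float) -> int:
    """Verification passes needed to emit n_tokens when each pass
    accepts ~acceptance_len tokens on average."""
    return math.ceil(n_tokens / acceptance_len)

# Emitting 4,000 tokens of code (the coding-agent step from the pipeline example):
print(forward_passes(4_000, 1.0))    # 4000 passes: plain one-token-per-pass decoding
print(forward_passes(4_000, 3.45))   # 1160 passes at Super's reported acceptance length
print(forward_passes(4_000, 2.70))   # 1482 passes at DeepSeek-R1's reported figure
```

In this idealized model the speedup equals the acceptance length, which is why the 3.45-vs-2.70 gap compounds into meaningful wall-clock differences over long generations.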

Combined, these innovations let Super deliver up to 7.5x higher throughput than Qwen3.5-122B while matching its benchmark accuracy, a gap that matters when you are paying for GPU-hours in production.

Ultra: 500B Total, 50B Active

Nemotron 3 Ultra is the unreleased flagship, expected in H1 2026. With 500B total parameters and approximately 50B active per token, it targets deep research, strategic planning, and large-scale multi-agent coordination. Ultra was trained using NVFP4 precision on Blackwell GPUs, which signals that NVIDIA is designing models specifically for their latest hardware stack. The open release will include weights, training recipes, and most of the training data.

Related: The Open-Source Agentic AI Stack in 2026: What Teams Actually Run in Production

Who Is Building on Nemotron 3

The early adopter list reads like a who’s who of enterprise software. NVIDIA’s announcement names Accenture, Cadence, CrowdStrike, Cursor, Deloitte, EY, Oracle Cloud Infrastructure, Palantir, Perplexity, ServiceNow, Siemens, Synopsys, and Zoom as companies integrating Nemotron 3 into production workflows.

CrowdStrike: Cybersecurity Triage

CrowdStrike’s use case highlights why Nemotron 3’s architecture matters for agents. Their security operations center processes millions of alerts daily. Each alert needs context: what happened before, what the system topology looks like, what similar alerts have been seen. That is a long-context, high-throughput workload where Nano’s 3.2B active parameters and 1M context window hit the sweet spot. Running a frontier model at that scale would be economically impractical.

Cursor: Code Agent Acceleration

Cursor, the AI code editor, is integrating Nemotron 3 for code understanding and generation tasks. Code agents are the canonical multi-step agentic workload: read file, understand dependencies, plan changes, generate code, review output, iterate. Super’s Multi-Token Prediction is especially valuable here, as code has high token-to-token predictability, making speculative decoding particularly effective.

Greptile: Real-World Code Review Benchmarks

Greptile’s hands-on evaluation of Nemotron 3 Super for code review found the model returned useful reviews in 12.5 seconds with just 2 tool calls. Their assessment: Super “punches above its weight class,” performing comparably to much larger models on structured code understanding tasks while maintaining lower latency.

Nemotron 3 vs. the Competition

The models Nemotron 3 competes against are not the frontier giants (GPT-5.3, Claude Opus 4.6) but the open-weight efficiency tier: OpenAI’s GPT-OSS family, Alibaba’s Qwen3/3.5, and DeepSeek.

| Model | Total Params | Active Params | Context | Throughput (8K in / 16K out) |
| --- | --- | --- | --- | --- |
| Nemotron 3 Nano | 30B | 3.2B | 1M | 3.3x vs Qwen3-30B-A3B |
| Nemotron 3 Super | 120B | 12B | 1M | 2.2x vs GPT-OSS-120B |
| Qwen3-30B-A3B | 30B | 3B | 128K | 1x (baseline) |
| GPT-OSS-120B | 120B | ~12B | 128K | 1x (baseline) |

The throughput advantage comes from two sources: Mamba layers reduce the per-token compute for long contexts, and MoE routing keeps the active parameter count low. The 1M context window is a qualitative difference, not just a quantitative one. Agents that can process entire codebases, full security logs, or complete document repositories in a single context window behave fundamentally differently than agents limited to 128K.

Related: NVIDIA Agent Toolkit: What GTC 2026's Biggest Launch Means for Enterprise AI Agents

What This Means for the Open Model Landscape

Nemotron 3 represents a strategic shift in how NVIDIA approaches the model layer. Rather than competing directly with OpenAI and Anthropic on general-purpose chat, NVIDIA is carving out a niche: the best open models for agentic workloads, optimized for NVIDIA hardware.

Three implications stand out:

The Mamba-Transformer hybrid is becoming the default for efficiency-first models. AI21’s Jamba pioneered the pattern in 2024. Nemotron 3 validates it at scale with enterprise adoption. Expect Qwen and Meta to ship their own hybrid variants within 12 months.

Open-weight models with 1M context are now real. A year ago, 1M context was a proprietary feature from Google and Anthropic. Nemotron 3 delivers it in an open model you can self-host, fine-tune, and deploy without API rate limits.

NVIDIA is building a vertical stack. Nemotron 3 models run on NVIDIA hardware, serve through NVIDIA NIM, integrate with the NVIDIA Agent Toolkit, and get profiled by AgentIQ. That is not just an open model release. It is an ecosystem play.

For teams building production agent systems, Nemotron 3 Nano is worth evaluating today. It is available on Hugging Face, through NVIDIA NIM, and on DeepInfra. Super landed in March 2026 and is already available free on OpenRouter. Ultra will complete the family when it ships later this year.

Frequently Asked Questions

What is NVIDIA Nemotron 3?

Nemotron 3 is NVIDIA’s family of open-weight AI models (Nano, Super, Ultra) built specifically for agentic AI workloads. They use a hybrid Mamba-Transformer mixture-of-experts architecture with a 1M-token context window, delivering high throughput at low active parameter counts.

How does the Mamba-Transformer hybrid architecture work in Nemotron 3?

Nemotron 3 interleaves Mamba-2 state-space layers (which process long sequences in linear time) with Transformer attention layers (which handle precise long-range comparisons) and MoE feed-forward layers. Nano uses 23 Mamba-2 layers, 23 MoE layers, and 6 attention layers across its 52-layer stack.

What is the difference between Nemotron 3 Nano, Super, and Ultra?

Nano has 30B total / 3.2B active parameters for efficient edge and single-GPU deployment. Super has 120B total / 12B active parameters with LatentMoE and multi-token prediction for collaborative agent workloads. Ultra has 500B total / 50B active parameters for deep research and large-scale multi-agent coordination.

Is Nemotron 3 open source?

Nemotron 3 is open-weight under the NVIDIA Open Model License. NVIDIA releases model weights, training recipes, pre- and post-training software, and most training data. Nano is available on Hugging Face and NVIDIA NIM. Super launched in March 2026. Ultra is expected in H1 2026.

How does Nemotron 3 compare to GPT-OSS and Qwen models?

Nemotron 3 Nano delivers 3.3x higher throughput than Qwen3-30B-A3B on a single H200 GPU, and Super delivers 2.2x the throughput of GPT-OSS-120B. Super also achieves up to 7.5x higher throughput than Qwen3.5-122B while matching benchmark accuracy. The key advantages are the 1M context window and Mamba-based efficiency.