Nemotron 3 Super is a 120-billion-parameter open-source large language model developed by NVIDIA and released in 2026. The model combines a hybrid Mamba-Attention architecture with a mixture-of-experts (MoE) design, activating only 12 billion of its parameters during inference. It represents a significant step forward in efficient language model design, offering substantial performance improvements over comparable open-source baselines while keeping compute costs low through its sparse activation mechanism.
Nemotron 3 Super employs a hybrid Mamba-Attention architecture, integrating selective state space model (Mamba) layers with traditional transformer attention. This combination lets the model process long sequences efficiently while retaining the expressive power of attention for complex reasoning tasks. The mixture-of-experts design activates only 12 billion of the model's 120 billion parameters per token during inference, reducing compute and memory requirements compared to dense models of equivalent scale.
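The sparse-activation idea behind MoE layers can be illustrated with a minimal top-k routing sketch. This is a toy implementation with made-up sizes, not the actual Nemotron 3 Super router; it only shows the mechanism by which each token runs through a small subset of experts.

```python
import numpy as np

# Illustrative top-k expert routing, the mechanism behind MoE sparse
# activation. All sizes here are toy values, not the real model's.
rng = np.random.default_rng(0)

def moe_layer(x, w_router, experts, top_k=2):
    """Route each token to its top_k experts; only those experts run."""
    logits = x @ w_router                             # (tokens, n_experts)
    chosen = np.argsort(logits, axis=-1)[:, -top_k:]  # indices of the best experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        scores = logits[t, chosen[t]]
        gates = np.exp(scores - scores.max())
        gates /= gates.sum()                          # softmax over chosen experts only
        for g, e in zip(gates, chosen[t]):
            w1, w2 = experts[e]                       # tiny two-layer expert MLP
            out[t] += g * (np.maximum(x[t] @ w1, 0.0) @ w2)
    return out

d, n_experts, d_ff = 16, 8, 32
experts = [(rng.normal(size=(d, d_ff)), rng.normal(size=(d_ff, d)))
           for _ in range(n_experts)]
w_router = rng.normal(size=(d, n_experts))
x = rng.normal(size=(4, d))
y = moe_layer(x, w_router, experts)
print(y.shape)  # (4, 16)
```

With top_k=2 of 8 experts, each token exercises only a quarter of the expert parameters, which is the same principle that lets a 120B-parameter MoE run with a 12B-parameter active footprint.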
The sparse activation pattern of MoE systems allows Nemotron 3 Super to achieve higher throughput and lower latency than fully dense competitors. This efficiency gain becomes increasingly valuable in production environments where serving costs and response time are critical operational considerations.
The model was trained on approximately 25 trillion tokens, providing broad exposure to diverse textual data across many domains and languages. Its extended context window of 1 million tokens enables processing of substantially longer documents than models with typical 4K-128K windows. This capacity allows the model to maintain coherence across entire books, codebases, or comprehensive documentation sets without intermediate summarization or chunking strategies.
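A quick check makes the chunking point concrete. The ~150,000-token figure for a full-length novel is a rough illustrative assumption, not a measured value.

```python
# Whether a document fits in one forward pass, for a few window sizes.
# The 150k-token novel figure is a rough illustrative assumption.
def fits_in_context(n_tokens, window):
    return n_tokens <= window

novel_tokens = 150_000
for window in (4_096, 128_000, 1_000_000):
    print(window, fits_in_context(novel_tokens, window))
```

Only the 1M-token window admits the whole document; the smaller windows would force the chunk-and-summarize pipelines the extended context avoids.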
The large training token budget supports the model's ability to capture nuanced linguistic patterns, domain-specific terminology, and complex reasoning capabilities required for sophisticated language understanding and generation tasks.
Nemotron 3 Super demonstrates up to 2.2x throughput improvement over GPT-OSS-120B, a comparable 120-billion-parameter open-weight model. This efficiency gain derives from the combination of selective state space layers, mixture-of-experts sparsity, and optimized kernel implementations. The throughput advantage translates directly into lower inference latency and higher request-processing capacity on fixed hardware.
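On fixed hardware, a throughput multiplier cuts serving cost per token by the same factor. The GPU price and baseline tokens/sec below are made-up illustrative values, not benchmark results.

```python
# Hypothetical serving-cost arithmetic: a 2.2x throughput gain on fixed
# hardware cuts cost per token by 2.2x. All dollar figures and the
# baseline throughput are made-up illustrative values.
speedup = 2.2
gpu_cost_per_hour = 4.0            # hypothetical $/hour
baseline_tokens_per_sec = 1500.0   # hypothetical baseline throughput

def cost_per_million_tokens(tokens_per_sec):
    return gpu_cost_per_hour / (tokens_per_sec * 3600) * 1e6

baseline_cost = cost_per_million_tokens(baseline_tokens_per_sec)
improved_cost = cost_per_million_tokens(baseline_tokens_per_sec * speedup)
print(round(baseline_cost / improved_cost, 2))  # 2.2
```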
The performance improvement enables practical deployment scenarios previously constrained by computational requirements, including real-time inference, high-concurrency serving, and edge deployment applications where resource efficiency directly impacts operational viability.
The model's extended context window and efficient architecture make it well suited to applications that analyze long-form content. Research workflows can process complete academic papers or research corpora in a single pass; software development use cases leverage the extended context for whole-codebase understanding and generation; and content analysis and summarization tasks maintain coherence across documents that exceed standard context lengths.
The open-source nature of Nemotron 3 Super enables fine-tuning for domain-specific applications, custom instruction sets, and specialized tasks within organizations' infrastructure and compliance frameworks.
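One common way organizations fine-tune an open model cheaply is low-rank adaptation (LoRA): freeze the pretrained weights and train only a small rank-r delta. The sketch below shows the core arithmetic with toy sizes; it is not Nemotron 3 Super's actual fine-tuning recipe.

```python
import numpy as np

# LoRA-style sketch: freeze pretrained weight W, train only the small
# rank-r factors A and B so the effective weight is W + B @ A.
# Toy sizes; not the model's actual fine-tuning configuration.
rng = np.random.default_rng(1)
d_out, d_in, r = 64, 64, 4
W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable low-rank factor
B = np.zeros((d_out, r))                # zero init: the delta starts at zero

def adapted_forward(x):
    return W @ x + B @ (A @ x)          # frozen path + low-rank update

x = rng.normal(size=(d_in,))
print(np.allclose(adapted_forward(x), W @ x))  # True: B == 0 before training
print(W.size, A.size + B.size)                 # 4096 frozen vs 512 trainable
```

Because only A and B are trained (here 512 parameters against 4,096 frozen ones), the adapter can be stored and swapped per task while the base model stays untouched, which fits the in-house compliance scenarios described above.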
Compared with closed-source models of similar scale, Nemotron 3 Super's open availability lets organizations retain complete control over model deployment, data privacy, and operational transparency. The efficiency improvements over previous open-source baselines lower the computational barriers to adoption, democratizing access to capable language models across organizations with varying resource constraints.
The hybrid architecture represents an evolution beyond purely dense transformer designs, incorporating insights from selective state space model research while maintaining the interpretability and stability properties of traditional attention mechanisms.