TL;DR: Jensen Huang took the stage at SAP Center in San Jose on March 16, 2026 for his GTC keynote in front of 39,000 attendees from 190 countries, and unveiled Vera Rubin — NVIDIA's first extreme co-designed, six-chip AI platform and direct successor to Blackwell. The Rubin GPU delivers 5x faster inference and 3.5x faster training than Blackwell, with a 10x reduction in inference token cost. The platform ships in the second half of 2026.
What you will learn
- What Jensen Huang announced at GTC 2026
- The six-chip Vera Rubin architecture explained
- Performance benchmarks vs Blackwell
- Inference cost economics: the 10x reduction
- NVLink 6 and co-packaged optics
- The Vera CPU: NVIDIA's proprietary processor
- Agentic AI and physical AI focus
- Shipping timeline and availability
- What this means for hyperscalers
- Competitive landscape: AMD, Google, and custom silicon
- What this means for developers and AI teams
- Frequently asked questions
What Jensen announced
GTC 2026 was billed as an "agentic AI inflection point," and Jensen Huang delivered a two-hour keynote that lived up to the framing. The event filled SAP Center in San Jose to capacity — 39,000 attendees on-site with hundreds of thousands watching the livestream — representing the largest gathering in GTC history.
The centerpiece announcement was Vera Rubin, NVIDIA's next-generation AI platform. Named after the pioneering astrophysicist who confirmed the existence of dark matter, Vera Rubin follows the company's now-established tradition of naming compute platforms after groundbreaking scientists. The announcement confirmed what samples shipped to customers in late February had already hinted: NVIDIA is not slowing its one-year cadence on new architectures.
Huang opened by framing the competitive and economic context. AI model performance continues to compound at roughly 3x improvement per year, he said, and the wave of agentic AI — autonomous systems that take continuous action rather than responding to individual queries — is driving a step change in compute demand. The implication is that Blackwell, which only reached volume production in late 2025, will already need a successor within 12 months.
Vera Rubin is that successor. It is not an incremental upgrade. It is a ground-up redesign of the full stack: GPU, CPU, interconnect, optics, and memory, all co-designed to function as a single coherent AI supercomputer at rack scale.
The additional thread running through the keynote was physical AI — robotics, autonomous vehicles, and real-world simulation. NVIDIA announced new milestones across its DRIVE, Omniverse, and Isaac platforms. But the compute foundation underpinning all of those is Vera Rubin.
More details are available at the official GTC 2026 news hub and the NVIDIA developer deep-dive on the Rubin platform.
The six-chip architecture explained
The defining characteristic of Vera Rubin is its scope. NVIDIA describes it as the first "extreme co-designed" AI platform — meaning every chip in the system was designed together, from the ground up, specifically to work as a unified whole. The platform comprises six distinct chips:
1. Rubin GPU. The headline chip. Built on a new process node and featuring a next-generation Transformer Engine. It delivers the core inference and training compute improvements.
2. NVIDIA Vera CPU. NVIDIA's proprietary Arm-based processor, purpose-built to feed the GPU efficiently at scale. This is not a third-party CPU bolted onto the system — it is co-designed with the GPU to minimize bottlenecks in data movement.
3. NVLink Switch 6. The chip that enables the ultra-high-bandwidth interconnect binding multiple GPUs into a coherent fabric. NVLink 6 runs at 3.6 TB/s bisection bandwidth — the measure of total data flow across the network midpoint.
4. Co-packaged optics (CPO) switch. NVIDIA's second-generation CPO switch, built with TSMC packaging technology, provides the high-speed optical interconnects between nodes that allow the system to scale beyond a single rack.
5 & 6. Two additional chips handling memory, I/O, and reliability functions. Full specifications on these components are expected at Hot Chips later this year.
The key architectural insight is that previous NVIDIA platforms were essentially GPU-centric, with other components sourced externally and integrated at the system level. Vera Rubin breaks that pattern. By owning the full silicon stack, NVIDIA can optimize data pathways across every boundary — GPU-to-CPU, GPU-to-GPU, rack-to-rack — in ways that are simply not possible when procuring components from independent vendors.
This is a strategic shift as much as a technical one. It mirrors what hyperscalers like Google (TPU), Amazon (Trainium/Inferentia), and Microsoft (Maia) have been doing with their custom silicon programs. NVIDIA is reclaiming vertical integration of the AI compute stack.
Performance benchmarks vs Blackwell
NVIDIA provided specific benchmark comparisons between Vera Rubin and the current Blackwell generation at GTC. The headline numbers:
- 5x faster inference throughput than Blackwell
- 3.5x faster training throughput than Blackwell
- 10x lower cost per inference token
The 5x inference improvement is particularly significant given where AI spending is heading. The dominant workload for deployed AI systems is not training — it is inference. Every API call to an LLM, every agentic task execution, every real-time recommendation is an inference operation. As AI agents multiply and run continuously rather than responding to discrete user inputs, inference volumes grow exponentially.
The 3.5x training improvement matters for labs and enterprises developing or fine-tuning their own models. A 3.5x speedup on training runs of the scale hyperscalers operate at translates to meaningful reductions in time-to-deployment for new model versions.
These numbers should be treated as peak or representative figures rather than universal guarantees — actual performance depends heavily on model architecture, batch size, and system configuration. But NVIDIA's benchmark methodology has historically been conservative enough that the directional claims hold up under third-party scrutiny.
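To see what the multipliers mean in practice, here is a back-of-envelope sketch. The Blackwell baseline throughputs, token budget, and cluster size are illustrative assumptions chosen for the example; only the 5x and 3.5x multipliers come from NVIDIA's claims, and real runs will not scale perfectly linearly:

```python
# Back-of-envelope estimator for NVIDIA's claimed Vera Rubin multipliers.
# Baseline per-GPU throughputs below are illustrative assumptions, not specs.

BLACKWELL_INFER_TOK_S = 10_000   # assumed per-GPU inference tokens/sec
BLACKWELL_TRAIN_TOK_S = 2_000    # assumed per-GPU training tokens/sec

RUBIN_INFERENCE_X = 5.0          # claimed inference speedup vs Blackwell
RUBIN_TRAINING_X = 3.5           # claimed training speedup vs Blackwell

def training_days(total_tokens: float, gpus: int, tok_per_gpu_s: float) -> float:
    """Days to process total_tokens, assuming ideal linear scaling."""
    return total_tokens / (gpus * tok_per_gpu_s * 86_400)

budget = 15e12        # a hypothetical 15-trillion-token training run
cluster = 16_384      # hypothetical GPU count

blackwell = training_days(budget, cluster, BLACKWELL_TRAIN_TOK_S)
rubin = training_days(budget, cluster, BLACKWELL_TRAIN_TOK_S * RUBIN_TRAINING_X)

rubin_infer = BLACKWELL_INFER_TOK_S * RUBIN_INFERENCE_X
print(f"Per-GPU inference: {BLACKWELL_INFER_TOK_S:,} -> {rubin_infer:,.0f} tokens/sec")
print(f"Training run: Blackwell {blackwell:.1f} days, Rubin {rubin:.1f} days")
```

Under these toy inputs the run drops from roughly 5.3 days to 1.5 days; the point is the ratio, not the absolute figures.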
Inference cost economics: the 10x reduction
The most commercially significant number in the Vera Rubin announcement is not raw throughput — it is the 10x reduction in inference token cost.
To understand why that matters, consider the unit economics of running a large language model at production scale. For a mid-sized enterprise deploying an AI assistant across 10,000 employees, inference cost is the primary variable in whether the product is economically viable. At Blackwell-era pricing, many use cases are marginal. At one-tenth of that cost, they become straightforward.
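A rough sketch of that unit-economics argument, with every input (query volumes, token counts, the Blackwell-era blended price) chosen purely for illustration rather than taken from any published price list:

```python
# Illustrative unit economics for the claimed 10x token-cost reduction.
# All inputs are assumptions for the sketch, not NVIDIA or cloud list prices.

employees = 10_000
queries_per_employee_day = 40
tokens_per_query = 2_000             # prompt + completion
workdays_per_month = 21

blackwell_usd_per_m_tokens = 2.00    # assumed blended cost per million tokens
rubin_usd_per_m_tokens = blackwell_usd_per_m_tokens / 10  # claimed reduction

monthly_tokens = (employees * queries_per_employee_day
                  * tokens_per_query * workdays_per_month)

def monthly_cost(usd_per_m: float) -> float:
    """Monthly inference bill at a given cost per million tokens."""
    return monthly_tokens / 1e6 * usd_per_m

print(f"Blackwell-era: ${monthly_cost(blackwell_usd_per_m_tokens):,.0f}/month")
print(f"Rubin-era:     ${monthly_cost(rubin_usd_per_m_tokens):,.0f}/month")
```

At these assumed volumes the bill falls from about $33,600 to about $3,360 a month; a use case that is marginal at the first number is trivial at the second.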
The cost reduction also changes the competitive calculus for AI startups building on NVIDIA infrastructure via cloud providers. AWS, Azure, and GCP all pass GPU compute costs through to customers in some form. A 10x reduction in cost of goods sold for inference either expands margins for cloud providers, reduces prices for end users, or some combination of both. Historical patterns suggest competition between cloud providers will push most of the benefit to customers relatively quickly.
There is also a second-order effect on model size and quality. When inference is expensive, developers are incentivized to use the smallest model that meets their accuracy threshold. When inference is cheap, the constraint relaxes. Teams will run larger, more capable models across more of their workload. That shifts average model quality upward across the ecosystem — which in turn drives more adoption, more inference volume, and more demand for GPU capacity.
NVIDIA is deliberately engineering this flywheel. Cheaper inference drives more deployment, which drives more revenue for NVIDIA's customers, which drives more investment in AI infrastructure, which drives more demand for the next chip generation.
NVLink 6 and co-packaged optics
The networking layer of Vera Rubin deserves focused attention because interconnect has become the binding constraint in large-scale AI training and inference.
NVLink 6 delivers 3.6 TB/s bisection bandwidth across the GPU fabric. Bisection bandwidth measures the total data throughput across the midpoint of the network — effectively, how much data can flow between any two halves of a cluster simultaneously. At 3.6 TB/s, NVLink 6 is roughly double the bandwidth of the NVLink generation that shipped with Blackwell.
Why does this matter? Large language models, particularly those running across multiple GPUs or nodes, spend significant time moving activations, gradients, and key-value cache data between processing units. If the interconnect cannot keep up with the compute, GPUs sit idle waiting for data. NVLink 6's bandwidth increase means the ratio of useful compute to idle wait time improves materially.
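The effect can be sketched with the standard cost model for a ring all-reduce, the collective operation used to synchronize gradients. The per-link bandwidth figures and model size below are assumptions for illustration, not published NVLink specifications:

```python
# Sketch: why interconnect bandwidth governs idle time in distributed training.
# Estimates ring all-reduce time for one gradient sync; inputs are assumptions.

def allreduce_seconds(payload_bytes: float, n_gpus: int,
                      link_bytes_per_s: float) -> float:
    """Ring all-reduce moves ~2*(n-1)/n of the payload over each link."""
    return 2 * (n_gpus - 1) / n_gpus * payload_bytes / link_bytes_per_s

params = 70e9 * 2          # 70B parameters in FP16 -> bytes (assumed model)
gpus = 72                  # assumed rack-scale GPU count
blackwell_link = 1.8e12    # assumed effective link throughput, bytes/sec
rubin_link = 3.6e12        # doubled bandwidth with NVLink 6 (assumption)

for name, bw in [("Blackwell-era", blackwell_link), ("NVLink 6", rubin_link)]:
    t = allreduce_seconds(params, gpus, bw)
    print(f"{name}: {t * 1e3:.1f} ms per gradient sync")
```

Doubling the bandwidth halves the synchronization time, and every millisecond saved there is a millisecond the GPUs spend computing instead of waiting.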
The co-packaged optics (CPO) switch is the complementary innovation at the rack-to-rack and cluster level. Electrical interconnects lose signal integrity over distance and dissipate significant power as heat; optical links carry signals much farther with minimal loss and far lower thermal overhead. NVIDIA's second-generation CPO switch, co-developed with TSMC, integrates photonic components directly into the chip package rather than as a separate external module. This reduces latency, improves signal integrity, and lowers the power overhead of inter-rack communication.
The combination of NVLink 6 for intra-node connectivity and CPO for inter-node connectivity addresses the two primary networking bottlenecks at different scales of deployment.
The Vera CPU
One of the less-discussed but strategically important elements of Vera Rubin is the NVIDIA Vera CPU.
NVIDIA's prior platforms paired the GPU with the Arm-based Grace CPU, which was built around standard off-the-shelf Arm cores. Vera Rubin introduces a fully custom NVIDIA processor, purpose-built to feed data to the Rubin GPU at the rate the GPU can consume it.
The shift matters for two reasons. First, performance: when CPU and GPU are co-designed by the same team with a shared understanding of memory access patterns, pipeline depths, and cache hierarchies, the interface between them can be optimized in ways that are not possible with off-the-shelf processors. The result is lower latency in data delivery to the GPU and higher sustained throughput under realistic workloads.
Second, supply chain control: by owning the CPU design, NVIDIA reduces its dependence on third-party suppliers for a critical component. This is the same logic that drove Apple to design its own A-series and M-series chips — control over the critical path improves both performance and predictability.
Full microarchitectural details on the Vera CPU were not disclosed at GTC. NVIDIA has confirmed it is an Arm-based design, consistent with the Grace architecture lineage, but with substantial proprietary extensions. Additional technical disclosures are expected at Hot Chips 2026.
Agentic AI and physical AI focus
Jensen Huang spent considerable time at the GTC keynote on the thematic framing of where AI is going. The phrase "agentic AI inflection point" appeared repeatedly, and it was not incidental to the product roadmap.
The argument NVIDIA is making is structural: the shift from AI models that respond to queries to AI agents that take continuous, multi-step actions represents a qualitative change in compute demand. A language model that answers one question consumes compute once. An AI agent managing a software deployment pipeline, monitoring a manufacturing line, or coordinating a supply chain consumes compute continuously — dozens or hundreds of inference calls per task, running around the clock.
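The scale of that shift can be illustrated with a toy comparison. Every number here is an assumption chosen to show the shape of the argument, not a measured workload:

```python
# Toy comparison of daily inference demand: reactive chat vs a continuously
# running agent. Call counts and token sizes are illustrative assumptions.

chat_calls_per_user_day = 20     # a user asks ~20 questions a day
agent_steps_per_task = 50        # plan/act/observe loop iterations per task
tasks_per_day = 200              # an always-on agent's daily task load
tokens_per_call = 1_500          # assumed average prompt + completion size

chat_tokens = chat_calls_per_user_day * tokens_per_call
agent_tokens = agent_steps_per_task * tasks_per_day * tokens_per_call

print(f"Reactive chat: {chat_tokens:,} tokens/day")
print(f"Agent:         {agent_tokens:,} tokens/day "
      f"({agent_tokens // chat_tokens}x)")
```

Even with generous chat usage, the always-on agent in this sketch generates hundreds of times more inference volume, which is exactly the demand curve NVIDIA is building for.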
NVIDIA's "Build-a-claw" demonstration illustrated proactive AI assistants that anticipate user needs rather than waiting to be prompted. The demo was consumer-facing but the underlying message was enterprise-focused: as AI shifts from reactive to proactive, the number of inference calls per user per hour increases by an order of magnitude.
Physical AI — robotics and autonomous systems — received substantial attention at GTC. NVIDIA announced updates to its Isaac robotics platform, its DRIVE autonomous vehicle stack, and its Omniverse simulation environment. All three rely on continuous inference: robots and autonomous vehicles do not pause to process sensor data, they run inference in real time at high frequency.
Vera Rubin's architecture — with its emphasis on inference throughput, low token cost, and continuous workloads — is explicitly designed for this shift. The 5x inference improvement is most valuable not for batch processing jobs but for latency-sensitive, continuously running agentic systems.
Shipping timeline and availability
NVIDIA confirmed at GTC that Vera Rubin will ship in the second half of 2026. Early samples were already in customer hands as of late February, when NVIDIA's CFO Colette Kress confirmed on the Q4 FY26 earnings call that "first Vera Rubin samples shipped to customers earlier this week."
The H2 2026 target is consistent with NVIDIA's one-year architecture cadence, which the company formalized with its Blackwell and Hopper generations. Blackwell reached volume production in late 2025; Vera Rubin shipping in H2 2026 maintains that tempo.
Availability will follow NVIDIA's standard tiered rollout. Major cloud providers — AWS, Azure, GCP, Oracle Cloud — typically receive priority access and early inventory to build out their AI accelerator offerings. Enterprise customers with direct NVIDIA relationships come next. Broader availability through cloud providers' standard instance catalogs typically follows six to twelve months after the initial hyperscaler deployments.
Pricing has not been disclosed. Blackwell GPU pricing in cloud configurations ran approximately $2-3 per GPU-hour depending on configuration and cloud provider. Vera Rubin's cost per token reduction suggests either lower headline pricing, higher throughput at similar pricing, or both.
What this means for hyperscalers
AWS, Azure, GCP, and Oracle Cloud will be the first buyers of Vera Rubin in volume. The implications for each differ somewhat based on their AI strategies.
Microsoft Azure has the most direct stake given its deep OpenAI partnership. OpenAI's inference workloads — which underpin both the ChatGPT consumer product and the API business — are among the largest AI inference deployments in the world. A 5x inference throughput improvement and 10x cost reduction directly improves the unit economics of that business.
Google Cloud is in the unusual position of being both a major NVIDIA customer and a developer of its own competing TPU hardware. Google will adopt Vera Rubin for workloads where NVIDIA's software ecosystem (CUDA, cuDNN, TensorRT) is the path of least resistance, while continuing to push its own TPU v6 for internal training workloads. The two approaches coexist in practice.
AWS offers both NVIDIA GPUs (via P and G instance families) and its own Trainium and Inferentia chips. Amazon is likely to adopt Vera Rubin for GPU-native workloads while using Trainium 2 for its proprietary model training. The 10x cost reduction, if it flows through to AWS instance pricing, would significantly improve the economics of GPU-accelerated inference on the platform.
Oracle Cloud Infrastructure has positioned itself aggressively as a GPU-first cloud, winning several high-profile AI workloads. OCI is likely to be a fast mover on Vera Rubin inventory given that GPU performance is central to its competitive positioning.
Competitive landscape
Vera Rubin does not ship into a vacuum. The competitive picture for AI accelerators has changed materially since Blackwell launched.
AMD MI450 is expected to reach market in late 2026, targeting similar workloads. AMD's ROCm software stack has improved, but the CUDA ecosystem moat remains significant. Most AI frameworks, libraries, and toolchains are optimized first (and sometimes exclusively) for CUDA. Vera Rubin's claimed 10x cost-per-token reduction, if it holds at launch, raises the performance-per-dollar bar AMD must clear to win deployments.
Google TPU v6 (Trillium) is in production and powers Google's internal training and inference at scale. The TPU architecture is purpose-built for transformer workloads and highly competitive on a performance-per-watt basis. But TPUs are available only on Google Cloud — they are not a general-purpose option for teams building on other clouds or on-premises.
Custom silicon from Amazon (Trainium 2), Microsoft (Maia 200), and Meta represents the hyperscaler bet on reducing NVIDIA dependence. These chips are optimized for specific internal workloads and are not generally available to external customers. They serve as negotiating leverage and cost-reduction tools rather than market alternatives.
Cerebras and Groq target specific high-throughput inference use cases with unconventional architectures. Both companies have compelling performance stories for certain workloads but lack NVIDIA's ecosystem, scale, and software support.
The realistic assessment: Vera Rubin ships into a market where NVIDIA holds approximately 85-90% share of AI accelerator revenue. The six-chip co-design strategy and 10x cost reduction are designed to make that position more defensible, not less. The risk for competitors is not just catching up to Blackwell — it is catching up to a target that will have moved again by the time their responses are in production.
What this means for developers and AI teams
For teams building AI products today, Vera Rubin's immediate relevance is limited — the chip ships in H2 2026 and broad cloud availability will follow in 2027. But the announcements change the planning calculus in a few specific ways.
Model size decisions. If inference token cost drops 10x within 18 months, committing today to architectures optimized for small, cheap models may be unnecessary. Teams whose use cases are only marginally viable at current inference prices should consider whether a brief wait changes the economics significantly.
On-premises vs cloud. The Vera Rubin announcement reinforces that the performance-per-dollar of cloud GPU instances will continue to improve. Teams evaluating multi-million dollar on-premises GPU investments should factor in the depreciation risk of hardware that will be outpaced by a 10x cost improvement within 12-18 months.
Agentic architectures. The clear signal from NVIDIA's product roadmap is that continuous, agentic workloads are the target use case. Teams building reactive, query-response AI products should consider whether the underlying architecture would support agentic patterns — because the economics will increasingly favor systems that run continuously rather than episodically.
CUDA investment. The Vera Rubin platform is fully backward compatible with CUDA code written for Blackwell and prior generations. Investment in CUDA optimization and NVIDIA-specific tooling continues to pay dividends across generations.
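The depreciation risk in the on-premises point above can be made concrete with a toy amortization calculation (all figures are assumptions):

```python
# Toy depreciation sketch for the on-prem decision: if a 10x cost-per-token
# improvement arrives mid-lifecycle, the effective amortization window for
# today's hardware shrinks. All figures below are assumptions.

capex = 5_000_000           # assumed on-prem cluster purchase (USD)
planned_life_months = 48    # planned amortization period
tokens_per_month = 2e12     # assumed cluster serving capacity

naive_cost_per_m = capex / planned_life_months / (tokens_per_month / 1e6)

# If cloud cost per token drops 10x at month 18, the cluster may only be
# price-competitive until then, concentrating the capex into fewer months.
competitive_months = 18
effective_cost_per_m = capex / competitive_months / (tokens_per_month / 1e6)

print(f"Planned amortization:    ${naive_cost_per_m:.4f} per 1M tokens")
print(f"If obsolete at month 18: ${effective_cost_per_m:.4f} per 1M tokens")
```

The effective cost per token in this sketch rises by the ratio of the planned to the competitive lifetime, which is the hidden line item in any on-prem proposal written today.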
For a closer look at the technical specifications, the NVIDIA developer blog on the Rubin platform provides the most detailed public breakdown available, and Computer Weekly's coverage offers useful analysis of the enterprise implications. The full keynote is available at nvidia.com/gtc/keynote.
Frequently asked questions
What is NVIDIA Vera Rubin?
Vera Rubin is NVIDIA's next-generation AI computing platform, announced by Jensen Huang at GTC 2026 on March 16. It is described as the first "extreme co-designed" six-chip AI platform and is the direct successor to the Blackwell architecture. The platform includes a new Rubin GPU, the NVIDIA Vera CPU, NVLink 6, a second-generation co-packaged optic switch, and two additional supporting chips for memory and I/O.
Who is Vera Rubin named after?
Vera Rubin (1928–2016) was an American astrophysicist best known for providing some of the strongest observational evidence for the existence of dark matter. NVIDIA has a tradition of naming its compute architectures after pioneering scientists, including Hopper (Grace Hopper), Blackwell (David Blackwell), and Ampere (André-Marie Ampère).
How much faster is Vera Rubin than Blackwell?
NVIDIA's benchmarks show Vera Rubin delivering 5x faster inference throughput and 3.5x faster training throughput than Blackwell. These are representative figures under optimized conditions; actual performance gains depend on workload type, model architecture, and system configuration.
What is the 10x inference cost reduction?
NVIDIA claims Vera Rubin reduces the cost of generating each inference token by 10x compared to Blackwell. This means that deploying an AI model on Vera Rubin infrastructure should cost approximately one-tenth as much per query as the same deployment on Blackwell hardware. The reduction comes from higher throughput per chip combined with better memory and interconnect efficiency.
What are the six chips in Vera Rubin?
The six chips are: (1) the Rubin GPU, (2) the NVIDIA Vera CPU, (3) the NVLink 6 switch chip, (4) the second-generation co-packaged optic switch, and two additional chips handling memory and I/O functions. Full specifications on the memory and I/O chips are expected at Hot Chips 2026.
What is NVLink 6?
NVLink 6 is NVIDIA's next-generation GPU interconnect technology included in Vera Rubin. It delivers 3.6 TB/s bisection bandwidth — the total data throughput available across the network midpoint — which is roughly double the bandwidth of the NVLink generation in Blackwell. Higher bisection bandwidth reduces GPU idle time spent waiting for data, improving sustained throughput for large distributed workloads.
What is co-packaged optics (CPO)?
Co-packaged optics integrates photonic (optical) components directly into the chip package rather than using separate external optical modules. This reduces signal latency, improves integrity over longer distances, and lowers power overhead compared to electrical interconnects. NVIDIA's second-generation CPO switch in Vera Rubin is co-developed with TSMC and handles inter-node communication within and between racks.
What is the NVIDIA Vera CPU?
The Vera CPU is NVIDIA's proprietary Arm-based processor, purpose-built to serve as the host CPU for the Vera Rubin platform. Unlike the Grace CPU in prior platforms, which was built on standard Arm cores, the Vera CPU was co-designed alongside the Rubin GPU to optimize data delivery and minimize bottlenecks at the CPU-GPU interface. Full microarchitectural details are expected later in 2026.
When does Vera Rubin ship?
NVIDIA confirmed at GTC 2026 that Vera Rubin will ship in the second half of 2026. First samples were already in customer hands as of late February 2026. Broad availability through major cloud providers is expected to follow six to twelve months after initial deployment.
Which cloud providers will offer Vera Rubin?
NVIDIA has not announced a specific cloud availability list, but AWS, Microsoft Azure, Google Cloud, and Oracle Cloud Infrastructure are the most likely launch partners based on historical patterns with prior GPU generations. All four are major NVIDIA customers and have announced plans for continued AI infrastructure investment through 2026 and beyond.
How does Vera Rubin compare to AMD's competing GPU?
AMD's MI450, expected in late 2026, targets similar AI training and inference workloads. NVIDIA's claimed 5x inference improvement and 10x cost reduction over its own current generation sets a high bar for AMD to match. The more significant challenge for AMD is the CUDA software ecosystem — most AI frameworks and optimized libraries are built first for NVIDIA hardware, creating switching costs independent of raw performance.
Does Vera Rubin support existing CUDA code?
Yes. NVIDIA maintains backward compatibility across GPU generations. Code written and optimized for Blackwell and prior CUDA architectures will run on Vera Rubin without modification. This is a deliberate strategy to preserve developer investment in the NVIDIA ecosystem.
What workloads is Vera Rubin designed for?
Vera Rubin is explicitly optimized for agentic AI inference — continuous, multi-step AI workloads that run autonomously rather than responding to individual queries. Jensen Huang framed the platform as purpose-built for the "agentic AI inflection point," where AI agents run continuously, generating far more inference operations per hour than traditional conversational AI systems.
What is physical AI, and how does Vera Rubin relate?
Physical AI refers to AI systems that operate in the physical world — robotics, autonomous vehicles, drones, and industrial automation. These systems require continuous high-frequency inference (a robot must process sensor data in real time, not in batches). Vera Rubin's inference performance improvements make it particularly well-suited for physical AI deployments. NVIDIA also announced updates to its Isaac (robotics), DRIVE (autonomous vehicles), and Omniverse (simulation) platforms at GTC 2026, all of which run on NVIDIA hardware.
What was the "Build-a-claw" event at GTC?
"Build-a-claw" was a demonstration at GTC 2026 showcasing proactive AI assistants — systems that anticipate user needs and take action without being explicitly prompted. It was designed to illustrate the shift from reactive AI (waiting for a query) to proactive AI (continuously monitoring and acting). The demo highlighted the agentic AI use case that Vera Rubin is designed to serve.
What does Vera Rubin mean for AI token pricing?
If Vera Rubin's 10x cost reduction flows through to cloud providers' pricing, it would significantly lower the cost of AI API calls for developers and enterprises. Historical patterns suggest competition between AWS, Azure, and GCP will pass a substantial portion of infrastructure cost savings to customers, though timelines vary. The economics of deploying larger, more capable models becomes substantially more favorable at one-tenth the per-token cost.
Is Vera Rubin available for on-premises deployment?
NVIDIA has not specified on-premises availability details at this stage. Prior NVIDIA platforms have been available both through cloud providers and via enterprise hardware partners such as Dell, HPE, and Lenovo. Vera Rubin is expected to follow a similar pattern, but the hardware configuration and pricing for on-premises deployments have not been disclosed.
What is the Transformer Engine in Vera Rubin?
The Transformer Engine in Vera Rubin is an upgraded version of the technology introduced in Hopper and refined in Blackwell. It provides hardware-level acceleration for the attention mechanisms and matrix multiplications that dominate transformer model computation. Combined with support for FP8 and lower-precision numerics, it allows the GPU to perform more useful work per clock cycle for AI workloads than general-purpose floating-point hardware.
What security and reliability features does Vera Rubin include?
NVIDIA disclosed that Vera Rubin includes a new Reliability, Availability, and Serviceability (RAS) engine and expanded confidential computing support. The RAS engine improves error detection and hardware fault tolerance for enterprise deployments, while confidential computing enables encrypted computation for sensitive workloads. Full details are expected at subsequent technical conferences.
What is the significance of the GTC 2026 attendance numbers?
GTC 2026 drew over 39,000 in-person attendees from 190 countries, making it one of the largest AI and computing conferences ever held. The attendance figures reflect the degree to which NVIDIA has positioned GTC as the industry's central event — comparable to what Apple's WWDC is for consumer software development. The scale signals the breadth of companies and governments now treating AI infrastructure as a strategic priority.