TL;DR: Jensen Huang steps onto the GTC 2026 stage on March 16 to formally launch the Vera Rubin GPU and Vera CPU — the platform that succeeds Grace Blackwell. Early disclosures point to a 3.3x to 5x inference performance jump over Blackwell, NVLink 6 delivering 3.6 TB/s of per-GPU bandwidth across a six-chip supercomputer design, and a shipping timeline targeting the second half of 2026. Hyperscaler pre-orders are expected to be announced at the keynote. This is the most consequential GPU launch since Blackwell itself.
Table of contents
- What we know about Vera Rubin GPU and CPU
- Performance expectations: 3-5x over Blackwell
- NVLink 6 and the interconnect breakthrough
- Why agentic AI inference needs new hardware
- The Rubin platform: 6-chip supercomputer architecture
- Shipping timeline and hyperscaler pre-orders
- Competition: AMD MI400, Intel Falcon Shores, custom chips
- What this means for AI infrastructure planning in 2026-2027
- Frequently asked questions
What we know about Vera Rubin GPU and CPU
Nvidia's GPU Technology Conference returns to San Jose on March 17–21, 2026, with Jensen Huang's keynote kicking off the event on March 16. GTC has become the annual milestone where Nvidia resets the AI hardware roadmap, and this year's event is particularly consequential: it marks the formal debut of the Vera Rubin platform.
The Vera Rubin architecture is named after the pioneering astronomer whose measurements of galaxy rotation curves provided the first compelling evidence for dark matter — consistent with Nvidia's tradition of naming GPU architectures after influential scientists (Hopper, Blackwell, Ampere, Volta). The platform was first outlined publicly at GTC 2025, where Nvidia introduced its next-generation roadmap. Vera Rubin is the successor to the Grace Blackwell NVL72 rack, which itself only began wide deployment in late 2025.
What makes the Vera Rubin announcement structurally different from prior launches is the dual-chip architecture at its core. The platform pairs the new Vera Rubin GPU with the Vera CPU — both designed in tandem to maximize memory bandwidth and data movement efficiency. The Vera CPU is Nvidia's in-house ARM-based processor, succeeding the Grace CPU that shipped with the Grace Hopper and Grace Blackwell superchips. Integrating the CPU and GPU design at the silicon level, rather than treating the CPU as an afterthought, is central to how Nvidia is attacking the memory wall that increasingly limits AI inference performance.
This is not a minor generational bump. It is an architectural rethinking of how compute, memory, and interconnect work together under one Nvidia-designed roof.
Performance expectations: 3-5x over Blackwell
The headline metric being discussed ahead of the keynote is a 3.3x to 5x inference performance improvement over Blackwell. The spread in that range is meaningful: the 3.3x figure likely reflects compute-bound workloads at high batch sizes, while the 5x number reflects memory-bandwidth-sensitive inference tasks like long-context reasoning — exactly the kind of workload that dominates in agentic AI deployments.
Nvidia's previous generation-over-generation claims have held up in practice. Hopper to Blackwell delivered roughly 2.5–4x on inference per dollar depending on the task. If Vera Rubin hits even the lower end of its claimed range, the economics of large-scale AI inference shift materially again.
The architectural improvements driving these gains include:
Higher-bandwidth HBM memory. Vera Rubin is expected to ship with HBM4 memory, a significant step up from the HBM3e used in current Blackwell H200 configurations. HBM4 targets roughly 1.5–2x the bandwidth of HBM3e per stack, and with multiple stacks per chip, the per-GPU memory bandwidth ceiling rises substantially.
Larger memory capacity per chip. Agentic AI workloads increasingly require holding large model weights, extended context windows, and retrieved document sets in GPU memory simultaneously. More on-chip memory directly translates to fewer I/O bottlenecks and lower latency per inference call.
Improved FP4 and INT4 inference. Blackwell introduced FP4 (4-bit floating point) precision support, which roughly doubled raw inference throughput versus FP8 — and quadrupled it versus the FP16 path on prior generations. Vera Rubin is expected to extend that capability with better hardware support for mixed-precision inference and improved sparsity acceleration — techniques that allow modern AI models to skip redundant computation during inference.
Redesigned SM architecture. The streaming multiprocessors that execute tensor operations are expected to be redesigned for better utilization on the irregular, dynamic workloads characteristic of agentic AI. Traditional training workloads are dense and predictable; inference for multi-step AI agents is neither.
Taken together, these improvements explain how Nvidia is targeting a 3–5x range rather than a single number. The actual gain a customer sees depends heavily on what they are running. But across the majority of relevant inference workloads in 2026, the direction is clearly upward and meaningfully so.
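A rough way to see how these factors multiply is a memory-bound decode estimate: when generating tokens one at a time, the GPU must stream the full set of model weights from HBM for each token, so throughput is bounded by bandwidth divided by weight bytes. The sketch below uses entirely hypothetical figures — an assumed ~8 TB/s Blackwell-class bandwidth, ~16 TB/s for an HBM4-class part, and a 70B-parameter model — to show how a bandwidth doubling and a precision halving compound:

```python
# Back-of-envelope: memory-bandwidth-bound decode throughput.
# Every number here is an illustrative assumption, not an Nvidia spec.

def decode_tokens_per_sec(bandwidth_tb_s: float, params_b: float, bits_per_param: int) -> float:
    """Upper bound on single-stream decode rate when weight streaming dominates."""
    weight_bytes = params_b * 1e9 * bits_per_param / 8
    return bandwidth_tb_s * 1e12 / weight_bytes

MODEL_B = 70  # hypothetical 70B-parameter model

# Blackwell-class: ~8 TB/s HBM3e (approximate), FP8 weights
blackwell = decode_tokens_per_sec(8.0, MODEL_B, 8)

# Rubin-class: assume ~2x bandwidth from HBM4, FP4 weights
rubin = decode_tokens_per_sec(16.0, MODEL_B, 4)

print(f"Blackwell-class bound: {blackwell:.0f} tok/s")
print(f"Rubin-class bound:     {rubin:.0f} tok/s")
print(f"Ratio: {rubin / blackwell:.1f}x")  # bandwidth x2 * precision x2 = 4x
```

This compounding is why the high end of the 3.3-5x range is quoted for memory-bandwidth-sensitive workloads: the bandwidth and precision levers multiply rather than add.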
NVLink 6 and the interconnect breakthrough
If the GPU die is the engine, NVLink is the highway system that determines how far and how fast data can travel between engines. And on the Vera Rubin platform, that highway is being fundamentally widened.
NVLink 6 is expected to deliver 3.6 TB/s of bandwidth per GPU — the total throughput each chip can exchange with the rest of the NVLink fabric. For comparison, NVLink 4 (Hopper) delivered around 0.9 TB/s per GPU in NVL8 configurations, and NVLink 5 in Blackwell NVL72 configurations delivers approximately 1.8 TB/s per GPU. Each generation has doubled the figure.
That 3.6 TB/s number is not just an engineering milestone. It changes what is computationally feasible:
Multi-model concurrency. Many real-world agentic AI deployments run multiple specialized models simultaneously — a routing model, a generation model, a tool-use model, a retrieval model. High-bandwidth interconnect allows these models to share activations and context states in near real time, enabling tighter coordination than what is possible when models communicate only over PCIe or slower inter-GPU fabrics.
Longer effective context. One of the hard limits on transformer models today is that attention computation scales quadratically with context length. Distributing that computation across many GPUs helps, but only if the inter-GPU communication overhead is low relative to the compute. NVLink 6's bandwidth increase pushes that balance in favor of longer context.
KV cache sharing. In high-throughput inference deployments, caching the key-value attention states (the KV cache) for repeated or related prompts dramatically reduces redundant computation. Higher bandwidth makes it practical to share KV cache across a larger pool of GPUs, improving cache hit rates without introducing latency penalties.
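Both the quadratic-attention and KV-cache points can be made concrete with back-of-envelope numbers. The model shape below is a hypothetical 80-layer, grouped-query-attention configuration (roughly 70B-class), not a published spec:

```python
# Scaling of attention cost and KV cache with context length.
# Model shape is a hypothetical 70B-class configuration, not a real spec.

LAYERS = 80
KV_HEADS = 8          # grouped-query attention
HEAD_DIM = 128
BYTES_PER_VALUE = 2   # FP16/BF16 cache entries

def kv_cache_gb(context_len: int) -> float:
    """GB of K+V cache for one sequence: 2 (K and V) * layers * heads * head_dim * len * bytes."""
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * context_len * BYTES_PER_VALUE / 1e9

for ctx in (8_192, 131_072):
    # Attention score computation scales with ctx^2 per layer (ignoring constants)
    rel_attn = (ctx / 8_192) ** 2
    print(f"ctx={ctx:>7}: KV cache ~ {kv_cache_gb(ctx):6.1f} GB, "
          f"attention cost ~ {rel_attn:.0f}x the 8K baseline")
```

At 128K context, the cache for a single sequence runs into tens of gigabytes while attention compute grows 256-fold over the 8K baseline — which is why both distributing the compute and sharing the cache across a high-bandwidth GPU pool matter.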
NVLink 6 also matters for multi-node scaling. The handoff from the in-rack NVLink domain to inter-rack networking is where many hyperscaler deployments hit bottlenecks today. Nvidia is expected to announce NVLink Switch System 6 alongside the GPU launch, extending the 3.6 TB/s fabric across full-rack and multi-rack configurations.
Why agentic AI inference needs new hardware
The shift from generative AI (a user asks, a model answers) to agentic AI (a model autonomously plans, decides, acts, and loops back) is the central architectural change in AI software since GPT-3. And it is driving a hardware capability gap that Vera Rubin is specifically designed to close.
Here is the core problem: traditional inference benchmarks measure throughput in tokens per second at a fixed batch size. A user types a prompt, the GPU generates tokens, done. That model made sense for chatbots.
Agentic AI looks nothing like that.
A single agentic task might involve:
- An orchestrator model generating a multi-step plan (long-form generation)
- Tool calls dispatched to specialized sub-models (multiple parallel small generations)
- Retrieval-augmented generation fetching documents from a vector store (memory-bandwidth-intensive)
- Code execution and output interpretation (variable-length, unpredictable sequences)
- Self-critique and revision loops (iterative re-inference on growing context)
Each of these steps has different compute profiles. The workload is irregular, bursty, and memory-intensive in ways that existing GPU architectures — optimized primarily for batch training and single-shot inference — handle inefficiently.
Vera Rubin's design choices address this directly:
Persistent memory for KV cache. Rather than reconstructing the KV cache from scratch for each step in an agentic loop, the platform supports much larger on-chip and near-chip memory tiers, allowing the cache to persist across loop iterations. This alone can cut per-step latency by 40–60% for complex agents with long conversation histories.
Hardware task scheduling. The redesigned Vera CPU handles agent orchestration workloads natively, offloading the control-plane overhead that currently runs on separate host CPUs. This reduces the round-trip latency between agent decision steps.
Fine-grained power scaling. Agentic workloads have high variance in compute demand within a single task. A generation step might peg the GPU; a tool-call dispatch step might use 5% of available FLOPs. Vera Rubin's power management is expected to support finer-grained frequency and voltage scaling to maintain efficiency across that range, rather than burning full power during idle-heavy phases.
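The persistent-KV-cache point is straightforward to quantify. Without persistence, every iteration of an agent loop re-prefills the entire accumulated history; with it, only the new tokens are processed. The token counts below are illustrative assumptions:

```python
# Why persistent KV cache matters for agent loops: without it, each iteration
# re-prefills the entire (growing) history. Figures are illustrative assumptions.

def prefill_tokens(history: int, step_tokens: int, iterations: int, persistent: bool) -> int:
    """Total tokens run through prefill across an agent loop."""
    total = 0
    for _ in range(iterations):
        if persistent:
            total += step_tokens             # only the new tokens are prefilled
        else:
            total += history + step_tokens   # whole context re-prefilled each step
        history += step_tokens
    return total

cold = prefill_tokens(history=4_000, step_tokens=1_000, iterations=10, persistent=False)
warm = prefill_tokens(history=4_000, step_tokens=1_000, iterations=10, persistent=True)
print(f"without persistence: {cold:,} prefill tokens")
print(f"with persistence:    {warm:,} prefill tokens ({cold / warm:.1f}x less work)")
```

Even in this modest ten-step loop the redundant prefill work is nearly an order of magnitude larger than the new-token work, and the gap widens as histories grow.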
Jensen Huang has been explicit about this framing. On the Q4 FY26 earnings call, he described Vera Rubin as specifically targeted at the "agentic AI inflection point" — his assertion that AI has crossed from useful to indispensable, with agents running continuously rather than episodically.
The numbers support him. A single AI agent running multi-step inference tasks continuously generates far more GPU-hours of demand per deployment than an equivalent chatbot. As enterprises move from chatbots to agents, the demand per deployment multiplies. Vera Rubin is sized for that new baseline.
The Rubin platform: 6-chip supercomputer architecture
Vera Rubin is not just a single chip. It is the foundation chip of the Rubin platform — a full-stack rack-scale computing architecture that Nvidia previewed at GTC 2025.
The Rubin platform at full scale is a six-chip supercomputer: a single node or rack unit integrates six Vera Rubin GPU chips, each paired with its Vera CPU counterpart, all interconnected over NVLink 6. This follows Nvidia's NVL-class (NVLink system) rack architecture, and the Vera Rubin generation is expected to carry an NVL-series designation analogous to Blackwell's NVL72.
Key architectural characteristics of the full platform:
Unified memory address space. Across the six-chip configuration, the full HBM memory across all chips appears as a single addressable memory pool to software. This eliminates explicit data movement programming for model sharding and makes it far easier to deploy 100B+ parameter models without manual pipeline parallelism tuning.
Dedicated NVLink Switch fabric. The chips do not communicate point-to-point. An NVLink Switch ASIC (the NVSwitch) sits in the fabric, providing a full all-to-all communication topology at line rate. Any chip can send data to any other chip at the full bisection bandwidth, not just to its nearest neighbor.
Liquid cooling by default. At the power levels these chips operate, air cooling is no longer viable. The Rubin platform is designed for direct liquid cooling as a baseline requirement, not an option. This has infrastructure implications for data centers that are still primarily air-cooled and will need to retrofit before they can deploy Vera Rubin at scale.
GB200-class form factor. Nvidia's Blackwell generation introduced the GB200 (Grace Blackwell Superchip) as the unit of compute in the NVL72 rack, later refreshed as the GB300. Vera Rubin is expected to follow the same physical form factor logic, likely designated as something like VR200 or an analogous name, maintaining compatibility with the rack infrastructure hyperscalers have already built or are currently building for Blackwell.
This continuity in physical form factor is deliberate and significant. Hyperscalers do not tear down and rebuild data center infrastructure between generations. By maintaining rack-level hardware compatibility, Nvidia ensures that customers who deploy Blackwell infrastructure today are not stranded — they can swap in Vera Rubin compute nodes without redesigning their facilities.
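The unified-memory claim above reduces to a simple capacity check at planning time. The per-chip HBM capacity below is a hypothetical HBM4 figure, and the KV-cache budget is an arbitrary placeholder:

```python
# Sketch: checking whether a model fits in the pooled HBM of a six-chip node.
# The 288 GB-per-chip figure is a hypothetical HBM4 capacity, not an Nvidia spec.

HBM_PER_CHIP_GB = 288
CHIPS = 6
POOL_GB = HBM_PER_CHIP_GB * CHIPS  # unified address space spans all chips

def fits(params_b: float, bits: int, kv_budget_gb: float = 200) -> bool:
    """True if quantized weights plus a KV-cache budget fit in the pooled memory."""
    weights_gb = params_b * bits / 8
    return weights_gb + kv_budget_gb <= POOL_GB

print(POOL_GB)       # 1728 GB pooled across the node
print(fits(400, 8))  # 400B at FP8: 400 GB weights + 200 GB KV budget -> fits
print(fits(1800, 8)) # 1.8T at FP8: 1800 GB weights -> does not fit
print(fits(1800, 4)) # the same model at FP4 halves the weight footprint -> fits
```

The point of the unified address space is that this is the whole calculation: software sees one pool, so fitting the model no longer requires manually partitioning weights across six devices.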
Shipping timeline and hyperscaler pre-orders
The current timeline, consistent with what Nvidia CFO Colette Kress indicated on the Q4 FY26 earnings call, points to Vera Rubin samples already in customer hands as of late February 2026, with broader commercial availability in H2 2026.
That means a gap of up to three quarters between the GTC 2026 announcement and when customers outside the early-access program can actually deploy the hardware at scale, with volume availability landing squarely in Q3-Q4 2026. The cadence is consistent with how Blackwell shipped: announced at GTC 2024, initial samples in mid-2024, mass production ramping through Q3-Q4 2025.
What to watch at the March 16 keynote on the pre-order and customer front:
Hyperscaler commitments. Microsoft, Google, Amazon, and Meta are all expected to announce early access or pre-order agreements at the keynote or in accompanying press releases. These announcements are as much business signals as technical ones — they tell the market that the largest AI infrastructure buyers have evaluated the platform and committed capital. Given that all four have already flagged double-digit-billion annual AI capex plans for 2026 and 2027, absorbing the first Vera Rubin allocations is consistent with their stated strategies.
Sovereign AI buyers. GTC has increasingly become a venue for Nvidia to announce national-level partnerships with governments building domestic AI infrastructure. Saudi Arabia, UAE, Japan, and several European nations have all made sovereign AI investments in recent years. Vera Rubin announcements targeted at sovereign cloud buyers are likely.
CSP listing timelines. AWS, Azure, and Google Cloud will each need to certify and list Vera Rubin-based instances before enterprise customers can access the hardware without buying it directly. Cloud instance availability typically lags physical hardware availability by one to two quarters. Announcements of planned instance types (analogous to the p5 for H100 or p5e for H200) are expected.
One important economic note: Vera Rubin systems will cost more per unit than Blackwell systems. That is consistent with every prior generation transition. But on a performance-per-dollar basis, the 3–5x improvement means the effective cost of running a given AI workload should fall — which is what drives adoption, not the sticker price on the chassis.
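That economics argument is simple arithmetic. Assuming, purely for illustration, a 40% higher system price than Blackwell and the claimed 3.3-5x performance range:

```python
# Illustrative perf-per-dollar arithmetic; prices and multipliers are assumptions.

blackwell_price = 1.0   # normalized system price
rubin_price = 1.4       # assume a 40% higher sticker price (hypothetical)

for perf in (3.3, 5.0):  # claimed inference speedup range
    cost_per_work = rubin_price / perf  # relative to Blackwell's 1.0 price / 1.0 perf
    print(f"{perf}x perf: cost per unit of inference falls to "
          f"{cost_per_work:.2f}x of Blackwell ({(1 - cost_per_work) * 100:.0f}% cheaper)")
```

Under these assumptions, effective inference cost drops by well over half even at the low end of the performance range — which is the number buyers actually optimize for.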
Competition: AMD MI400, Intel Falcon Shores, custom chips
Nvidia does not compete in a vacuum, and the landscape around Vera Rubin is worth examining clearly.
AMD MI400. AMD's current roadmap has the MI350 series (successor to MI300X) shipping in late 2025 and early 2026, with the MI400 series targeting 2026 availability. AMD has made genuine progress in the data center GPU market with MI300X, particularly on inference tasks that benefit from the chip's large unified HBM memory pool (192GB on the MI300X). The MI400 is expected to incorporate CDNA 4 architecture and compete more directly on the inference-optimized workloads where Nvidia is targeting Vera Rubin. AMD's challenge remains software ecosystem depth: ROCm is improving, but CUDA's decade-long head start in tooling, libraries, and developer familiarity is not closed in a single generation.
Intel Falcon Shores. Intel's Falcon Shores GPU — its next-generation data center accelerator — has been through several delays and scope changes, and is now expected to ship in the 2025-2026 window as a more narrowly focused accelerator. Intel's Gaudi 3 found some traction for inference at lower price points, and Falcon Shores aims to build on that. Intel's path to competing with Nvidia's full-stack platform (silicon + networking + software + ecosystem) remains the hardest challenge in the industry. Hardware is only one component.
Custom silicon. The most structurally significant competitive threat to Nvidia is not AMD or Intel — it is the hyperscalers themselves. Google's TPU v5 is already deployed at massive scale for Google's own workloads. Amazon's Trainium 2 and Inferentia 3 chips are increasingly used for AWS-internal AI workloads, with external customer access growing. Microsoft is investing heavily in its Maia AI accelerator family. Meta has deployed its MTIA inference chips internally.
The key question is not whether these custom chips can match Vera Rubin in raw performance — most cannot yet. The question is whether they can match it on the specific workloads their operators care about, at lower cost, without depending on Nvidia's supply chain. For the hyperscalers, reducing Nvidia dependency is a strategic objective, not just a cost optimization.
For now, Nvidia's competitive moat remains the full-stack advantage: world-class silicon, the NVLink interconnect that no competitor has matched, and the CUDA software ecosystem — nearly two decades of optimized libraries, tooling, and developer investment across research and enterprise applications. That moat does not disappear because AMD ships a faster chip. It erodes slowly, as alternative stacks mature and specific workloads migrate.
Vera Rubin is Nvidia's answer to that slow erosion: ship a platform so capable, so well-integrated, and so ahead of the competition that migrating away requires accepting a multi-year performance disadvantage. If the 3–5x claims hold, that strategy is working.
What this means for AI infrastructure planning in 2026-2027
For enterprise and hyperscale buyers who are in the middle of multi-year AI infrastructure planning cycles, the Vera Rubin announcement creates a familiar strategic tension: commit to Blackwell now, or wait?
The calculus depends on your deployment timeline:
If you need inference capacity in H1 2026, Blackwell is your only option. Vera Rubin will not be in volume production until H2 2026 at the earliest. The Grace Blackwell NVL72 is an extremely capable platform, and the cost per token on current Blackwell inference is already 5–10x lower than Hopper was two years ago. Waiting is not free when you have production workloads to run.
If you are planning infrastructure for late 2026 and 2027, the Vera Rubin timeline is directly relevant. A 3–5x improvement in inference performance per chip, combined with the memory bandwidth increases from HBM4, means that a given inference workload will require proportionally fewer chips per unit output. If you are sizing a cluster today that will be operational in Q4 2026, building it around Blackwell and then refreshing with Vera Rubin in 2027 is a more efficient use of capital than building a very large Blackwell cluster that will be partially displaced within 18 months.
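The sizing argument can be sketched with hypothetical numbers — an assumed per-chip Blackwell throughput on a given workload and the low end of the claimed Vera Rubin multiplier:

```python
# Cluster sizing under a generation transition; all throughput figures hypothetical.
import math

TARGET_TOK_S = 5_000_000   # cluster-wide inference target, tokens/s (assumed)
BLACKWELL_TOK_S = 2_500    # per-chip throughput on the workload (assumed)
RUBIN_MULT = 3.3           # low end of the claimed improvement

blackwell_chips = math.ceil(TARGET_TOK_S / BLACKWELL_TOK_S)
rubin_chips = math.ceil(TARGET_TOK_S / (BLACKWELL_TOK_S * RUBIN_MULT))

print(f"Blackwell chips needed: {blackwell_chips:,}")  # 2,000
print(f"Rubin chips needed:     {rubin_chips:,}")
```

Even at the conservative 3.3x multiplier, the same target throughput needs less than a third of the chips — which is the core of the build-small-now, refresh-later argument.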
For software teams, the GTC 2026 announcement is a signal to start validating your inference stack against the new capabilities — specifically the larger KV cache capacity, the NVLink 6 bandwidth available for distributed inference, and the FP4/INT4 precision improvements. Workloads that currently require multiple Blackwell chips due to memory constraints may fit on a single Vera Rubin chip. That has real implications for inference serving architecture, batching strategies, and cost models.
For AI product builders, the downstream implication is that inference costs will continue to fall through 2026 and 2027. The pattern from the last three GPU generations has been consistent: each new generation reduces cost per token by 3–10x within 18 months of volume availability. Products that are currently economically marginal due to inference costs may become viable. That is a planning variable, not a guarantee, but it is consistent with Nvidia's roadmap trajectory.
For investors and analysts, the Vera Rubin launch confirms that the AI capex cycle has at least another 18–24 months of structural demand ahead of it. Hyperscalers do not pre-order next-generation hardware unless they have confidence in continued workload growth to justify it. The pre-orders expected at GTC 2026 are the most explicit forward demand signal Nvidia can provide.
One broader structural note: Nvidia is not just selling chips. With each new platform generation, it deepens the stack of hardware-optimized software, networking, and services that surround the silicon. NVLink, NCCL, cuDNN, TensorRT, Triton, and NIM microservices are all components of a platform that compounds in value with each new generation. Vera Rubin chips running on NVLink 6 fabric, managed by Nvidia's networking software stack, accelerated by TensorRT-LLM — that full stack is what customers are buying. And that full stack is what competitors would need to replicate to displace it.
GTC 2026 is where Nvidia raises the bar again.
Frequently asked questions
What is Vera Rubin, and how does it differ from Blackwell?
Vera Rubin is Nvidia's next-generation GPU and CPU platform, succeeding the Grace Blackwell architecture. Where Blackwell paired the Grace CPU with the Blackwell GPU in the GB200 superchip, Vera Rubin is a ground-up redesign pairing the new Vera Rubin GPU with the new Vera CPU. The expected performance improvement is 3.3x to 5x on AI inference workloads compared to Blackwell, driven by HBM4 memory, NVLink 6 interconnect, and architectural changes targeting agentic AI workloads specifically.
When will Vera Rubin GPU ship?
Nvidia has shipped early samples to select customers as of late February 2026, consistent with CFO Colette Kress's statement on the Q4 FY26 earnings call. Broader commercial availability is expected in the second half of 2026. Volume production sufficient for large-scale hyperscaler deployments is targeted for Q3-Q4 2026.
What is NVLink 6, and why does it matter?
NVLink 6 is Nvidia's sixth-generation GPU interconnect, expected to deliver 3.6 TB/s of bandwidth per GPU — roughly double NVLink 5 in the Blackwell generation. Higher bandwidth matters for agentic AI because agents run multiple models in parallel, require persistent KV cache sharing across chips, and process long-context workloads where inter-chip communication is a primary bottleneck.
How does AMD's MI400 compare to Vera Rubin?
AMD's MI400 series targets 2026 availability and will compete directly with Vera Rubin on inference workloads. AMD has improved its competitive position significantly with MI300X's large unified memory architecture. However, Nvidia's NVLink interconnect, the CUDA software ecosystem, and the full-stack integration of Vera Rubin are advantages that AMD has not fully replicated. Performance comparisons will depend heavily on specific workloads and software optimization once both chips are in full production.
Which hyperscalers are expected to adopt Vera Rubin?
Microsoft, Google, Amazon, and Meta are all expected to announce early access or pre-order agreements at or around the GTC 2026 keynote on March 16. All four have disclosed aggressive AI infrastructure capex plans for 2026 and 2027 that are consistent with first-wave Vera Rubin adoption. Sovereign AI buyers from the Middle East, Asia, and Europe are also expected to feature in Nvidia's GTC announcements.
What is the six-chip Rubin platform architecture?
The Rubin platform at full scale integrates six Vera Rubin GPU chips in a single rack-scale node, all interconnected via NVLink 6 through an NVLink Switch fabric. This creates a unified memory address space across all six chips, allowing very large AI models to run without manual pipeline parallelism configuration. The platform is designed for liquid cooling as a baseline requirement. It follows the same rack-level physical form factor logic as the Blackwell NVL72, enabling data center compatibility with existing Blackwell infrastructure.
Is it worth waiting for Vera Rubin instead of deploying Blackwell now?
It depends on deployment timing. For workloads that need to go into production before Q3 2026, Blackwell is the right choice — the hardware is available, well-understood, and extremely capable. For infrastructure planning targeting late 2026 and 2027, Vera Rubin's 3–5x performance improvement and lower cost per token make it the right planning assumption. Companies with flexible timelines should factor Vera Rubin availability into their infrastructure roadmaps.
What does GTC 2026 mean for AI inference costs?
If Vera Rubin delivers on its 3–5x inference performance claims, the effective cost per token for running AI workloads will continue to fall materially through 2026 and 2027. This follows the pattern of every prior Nvidia GPU generation transition and makes AI products that are currently marginal on unit economics progressively more viable. For software builders, it is a reason to revisit pricing models and product architectures that were designed around higher inference costs.
GTC 2026 runs March 17–21, 2026, in San Jose, California. Jensen Huang's keynote is scheduled for March 16. Nvidia's official announcements, technical sessions, and customer case studies will be streamed live at nvidia.com/gtc.