TL;DR: NVIDIA signed a $20 billion non-exclusive licensing deal with Groq in December 2025, surfaced publicly on March 2, to build a dedicated AI inference chip using Groq's LPU architecture. OpenAI is the primary customer with a committed 3-gigawatt inference capacity build-out. Groq founder Jonathan Ross and president Sunny Madra join NVIDIA, while Groq continues independently under new CEO Simon Edwards. The new chip debuts at GTC 2026 in San Jose on March 16.
What you will learn
- The deal: $20 billion for Groq's inference technology
- What is an LPU and why it matters
- GPUs vs LPUs: the fundamental difference
- Why NVIDIA is betting against its own product
- OpenAI's 3GW commitment: the inference economy
- Jonathan Ross joins NVIDIA: what it signals
- GTC March 16: what to expect
- Competition: AMD, Intel, and custom silicon
- What this means for AI infrastructure costs
- Frequently asked questions
The deal: $20 billion for Groq's inference technology
The headline figure is $20 billion. But the structure of the deal matters as much as the size.
NVIDIA did not acquire Groq. It signed a non-exclusive licensing agreement to use Groq's Language Processing Unit architecture in a new chip that NVIDIA will design, manufacture, and sell. Groq remains an independent company. GroqCloud, Groq's inference API service, continues to operate. The founder and president joined NVIDIA individually, not as part of an acquisition. Simon Edwards steps in as Groq's new CEO to continue the independent business.
The licensing structure is deliberate. Groq retains the right to continue licensing its architecture to others. NVIDIA gets access to the LPU design without paying acquisition premiums, without inheriting Groq's liabilities, and without triggering the kind of antitrust scrutiny that a full acquisition of a key semiconductor IP holder would face in the current regulatory climate.
The deal was signed in December 2025. It remained confidential for roughly ten weeks before SiliconAngle reported on March 1 that NVIDIA was working on a "top-secret AI inference chip," followed by Tom's Hardware confirming the Groq licensing arrangement on March 2. The information went from industry rumor to confirmed fact in less than 24 hours — which is itself a signal that multiple sources inside the deal were ready to talk.
The $20 billion price point is striking in context. It is larger than most semiconductor acquisitions. For a licensing deal, it is extraordinary. NVIDIA is not buying factories, employees, or a customer list. It is buying the right to use a specific approach to chip design that Groq developed over nearly a decade. That NVIDIA would pay that figure for an architectural license signals how urgently it believes it needs a different kind of chip for inference workloads.
What is an LPU and why it matters
Groq coined the term Language Processing Unit, positioning it as a purpose-built alternative to GPUs for the specific task of running language model inference at scale.
The core insight behind LPU design is that inference — generating tokens from a trained model — has fundamentally different computational characteristics than training. Training requires massive parallel floating-point operations across enormous batches of data. The GPU's strength is exactly this: thousands of cores running the same operation simultaneously on different data. That parallelism was built for graphics rendering and adapted brilliantly for neural network training.
Inference is different. A typical inference request is a single conversation. The model must generate tokens sequentially — each token depends on all previous tokens. You cannot parallelize across the sequence the way you can parallelize across a training batch. The memory bandwidth requirements are intense because the model weights must be loaded from memory for every token generation step. A large language model generating a response is, at the hardware level, primarily a memory-bound operation, not a compute-bound one.
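The memory-bound character of decoding can be made concrete with a back-of-envelope roofline model. The model size and the H100-class bandwidth and FLOPS figures below are illustrative assumptions, not numbers from the deal:

```python
# Roofline sketch: at batch size 1, every generated token requires
# streaming all model weights from memory, but only ~2 FLOPs per weight.
def decode_cost(n_params, mem_bw_gbps, tflops, bytes_per_param=2):
    """Return (per-token time, memory-limited time, compute-limited time)."""
    t_mem = n_params * bytes_per_param / (mem_bw_gbps * 1e9)  # full weight read
    t_compute = 2 * n_params / (tflops * 1e12)                # one MAC per weight
    return max(t_mem, t_compute), t_mem, t_compute

# Illustrative: a 70B-parameter model in FP16 on H100-class hardware
# (~3,350 GB/s HBM bandwidth, ~1,000 dense FP16 TFLOPS).
t_token, t_mem, t_compute = decode_cost(70e9, 3350, 1000)
print(f"memory-limited:  {t_mem * 1e3:.1f} ms/token")      # ~41.8 ms
print(f"compute-limited: {t_compute * 1e3:.2f} ms/token")  # ~0.14 ms
print(f"batch-1 ceiling: {1 / t_token:.0f} tokens/sec")
```

The two-orders-of-magnitude gap between the memory-limited and compute-limited times is exactly what "memory-bound, not compute-bound" means: at batch size 1 the arithmetic units sit idle waiting on weight traffic.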
LPU architecture is designed around this reality. Groq's chips use deterministic scheduling — the execution of every operation is planned at compile time, with no dynamic branching or cache misses during inference. The chip knows exactly what memory it will access and when, before the request even starts. This eliminates the nondeterministic behavior that makes GPUs energy-hungry and latency-variable during inference.
The result is measurable: Groq's public benchmarks consistently showed inference speeds of 500-800 tokens per second on large models, compared to 50-150 tokens per second from comparable GPU-based systems. Energy per token generated on Groq hardware is dramatically lower — a fact that becomes financially significant at OpenAI's scale of inference requests.
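Taking the quoted benchmark ranges at face value, a quick sanity check on the implied throughput advantage:

```python
# Arithmetic on the benchmark ranges quoted above (tokens/sec).
groq_tps = (500, 800)  # Groq public benchmarks
gpu_tps = (50, 150)    # comparable GPU-based systems

conservative = groq_tps[0] / gpu_tps[1]  # slowest Groq vs fastest GPU
optimistic = groq_tps[1] / gpu_tps[0]    # fastest Groq vs slowest GPU
print(f"throughput advantage: {conservative:.1f}x to {optimistic:.0f}x")
```

Even the conservative end of that band, roughly 3x, is large enough to matter at hyperscale; the optimistic end depends on which GPU configuration is used as the baseline.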
GPUs vs LPUs: the fundamental difference
To understand why this deal matters, you have to understand what GPUs were actually designed to do and where that design creates friction for inference workloads.
A modern NVIDIA GPU — an H100 or the newer Blackwell B200 — contains more than 16,000 CUDA cores. These cores are designed to execute the same instruction on many different pieces of data simultaneously. NVIDIA calls its execution model SIMT (Single Instruction, Multiple Threads), a close relative of classic SIMD (Single Instruction, Multiple Data) execution. It was invented for rendering the millions of pixels in a 3D scene. It was then discovered to be equally excellent for the matrix multiplications that define deep learning. For training, where you run the same forward and backward pass on thousands of examples in a batch, this parallelism is close to ideal.
For inference, you are usually running one request at a time (or a modest batch of them). The sequential nature of token generation means the GPU's massive parallel throughput is underutilized for much of the inference process. Meanwhile, the GPU still consumes peak power. The H100's thermal design power is 700 watts. A data center full of H100s running inference at low batch sizes is burning enormous power for a fraction of the theoretical compute.
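Dividing board power by the throughput figures quoted earlier gives a rough upper bound on energy per token at low batch sizes. This attributes the full 700 W to a single request stream, which overstates the cost for multi-tenant serving, but it illustrates the mismatch:

```python
# Rough energy-per-token bound for a GPU serving at low batch size.
tdp_watts = 700  # H100 SXM thermal design power
for tps in (50, 150):
    print(f"at {tps} tok/s: {tdp_watts / tps:.1f} J per token")
```

At the low end of the range, that is 14 joules spent for a single token, most of it on silicon that is idle while weights stream from memory.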
LPU architecture attacks this mismatch directly. Rather than a sea of general-purpose cores, the LPU uses statically scheduled compute units with dedicated memory paths for weight loading. The deterministic scheduler eliminates cache management overhead. The chip runs cooler, produces output faster per query, and consumes less energy per token.
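The compile-time scheduling idea can be caricatured in a few lines. This is a deliberately toy sketch of the concept, not Groq's actual compiler or instruction set:

```python
# Toy illustration of static scheduling: every weight fetch and matmul
# is planned before execution begins, so running a request is a straight
# replay of a fixed plan with no runtime decisions, branches, or cache
# management.
def compile_schedule(layers):
    """Plan the full operation sequence at 'compile time'."""
    schedule = []
    for i, _ in enumerate(layers):
        schedule.append(("fetch_weights", i))
        schedule.append(("matmul", i))
    return tuple(schedule)  # frozen: nothing changes at run time

plan = compile_schedule(["attn_0", "mlp_0", "attn_1", "mlp_1"])
print(len(plan), "pre-planned steps, identical for every request")
```

Because the plan is identical for every request, latency is predictable to the cycle, which is the property that makes per-token energy and tail latency so much tighter than on a dynamically scheduled GPU.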
VentureBeat's characterization of the deal was precise: "inference is splitting in two." The AI compute market is separating into training workloads, where GPUs remain optimal, and inference workloads, where purpose-built deterministic architectures may hold significant structural advantages.
Why NVIDIA is betting against its own product
VentureBeat's headline on the Groq deal was: "Nvidia just admitted the general-purpose GPU era is ending." That framing is aggressive but not inaccurate.
NVIDIA has built a $3 trillion market cap on the premise that GPUs are the right hardware for AI. The H100 and H200 dominate the training compute market. Blackwell is extending that dominance into the next generation. Every hyperscaler — Microsoft, Google, Amazon, Meta — has committed to buying NVIDIA chips in quantities that have defined the entire AI build-out of 2023–2026.
So why would NVIDIA sign a $20 billion deal to license an alternative architecture that implicitly concedes GPUs are not optimal for inference?
The answer is that NVIDIA is doing what great technology companies do when they see a discontinuity coming: they cannibalize themselves before a competitor does it for them.
The inference problem is real and growing. As frontier models become commoditized — as GPT-4-level capability becomes available from dozens of providers — the competitive edge in AI shifts from model capability to inference economics. The company that can serve a response at the lowest cost per token wins the economics of AI at scale. GPUs, for all their training dominance, are expensive and inefficient inference engines. Custom silicon from Google (TPUs), Amazon (Trainium), and Microsoft (Maia) is already eating into the inference workload that would otherwise go to NVIDIA hardware.
NVIDIA had two options: watch the inference market fragment away from GPUs, or build the best inference chip in the world. By licensing Groq's architecture and bringing Jonathan Ross inside the tent, it chose the latter. The $20 billion license fee is the price of remaining relevant in the inference layer that is about to become the dominant AI compute workload by volume.
OpenAI's 3GW commitment: the inference economy
The customer detail buried inside the deal announcement is more significant than the headline number. OpenAI has committed to 3 gigawatts of dedicated inference capacity using the new NVIDIA-Groq chip.
Three gigawatts is not a meaningful figure in the abstract. To contextualize it: a large nuclear reactor generates about 1 gigawatt. The entire country of Ireland averages about 3.5 gigawatts of electricity demand. OpenAI is committing to a build-out of AI inference infrastructure that consumes the power equivalent of approximately three nuclear reactors — dedicated to generating tokens.
This is not a speculative future projection. OpenAI's current inference load — handling hundreds of millions of ChatGPT users, enterprise API customers, and its growing suite of operator-built applications — already demands substantial compute. The commitment to 3GW of LPU-based inference capacity signals that OpenAI expects that load to grow by an order of magnitude, and that it expects the new chip to be economically superior to H100/H200-based inference at that scale.
The economic logic is straightforward. If LPU-based inference reduces energy consumption per token by even 40–50% compared to GPU-based inference, the savings on 3GW of infrastructure are enormous — potentially billions of dollars per year in operating costs. The chip itself may cost more upfront (NVIDIA will price it accordingly), but the total cost of ownership over a multi-year deployment likely favors the purpose-built architecture.
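A hedged back-of-envelope version of that logic. The electricity price, utilization, and PUE below are assumptions for illustration, not disclosed deal terms:

```python
# Annual electricity cost of 3 GW of inference capacity, and what a
# 40-50% energy-per-token reduction would save. All inputs are assumed.
GW = 3
HOURS_PER_YEAR = 8760
PRICE_PER_KWH = 0.08  # assumption: typical US industrial rate
UTILIZATION = 0.8     # assumption: average draw vs nameplate capacity
PUE = 1.3             # assumption: data center power usage effectiveness

annual_kwh = GW * 1e6 * HOURS_PER_YEAR * UTILIZATION * PUE
annual_cost = annual_kwh * PRICE_PER_KWH
print(f"baseline electricity bill: ~${annual_cost / 1e9:.2f}B/yr")
for saving in (0.40, 0.50):
    print(f"{saving:.0%} energy reduction saves ~${annual_cost * saving / 1e9:.2f}B/yr")
```

Under these assumptions the electricity savings alone approach a billion dollars a year; amortized hardware, cooling capacity, and grid interconnection costs scale with the same efficiency gains, which is how the total reaches the multi-billion range.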
OpenAI as the anchor customer also signals something about the deal's strategic purpose. OpenAI is not a company that adopts experimental hardware for production inference. If OpenAI is committing 3GW of capacity to this chip, it has seen internal benchmarks that satisfy its reliability, throughput, and total cost requirements. That validation matters for every other potential buyer watching the GTC announcement.
Jonathan Ross joins NVIDIA: what it signals
Jonathan Ross is not an ordinary hire. He is the founder of Groq and one of the most consequential chip architects of the past decade — before Groq, he was the lead architect on Google's first-generation TPU, the inference-only accelerator that gave Google a decisive efficiency edge in serving neural networks from 2015 onward. He then left to build Groq specifically because he believed the inference problem required a fundamentally different approach.
His decision to join NVIDIA, and to bring Groq president Sunny Madra with him, is a signal that he believes the collaboration can produce something that neither company could build alone. NVIDIA brings manufacturing relationships with TSMC, packaging expertise, and the world's most sophisticated chip design organization. Ross brings the LPU architecture and ten years of inference-specific engineering insight.
The unusual structure of the arrangement — Ross at NVIDIA while Groq continues independently — suggests NVIDIA wants Ross's direct involvement in designing the new chip, not just access to documented IP. Architecture knowledge that lives primarily in the heads of the engineers who created it is notoriously difficult to transfer through documentation alone. By bringing the inventor inside, NVIDIA avoids the implementation pitfalls that typically occur when a company licenses architecture it did not design.
For Groq as a company, the leadership transition to Simon Edwards is a pragmatic restructuring. Edwards's mandate appears to be managing the existing GroqCloud business and the non-exclusive licensing pipeline while the founders focus on the NVIDIA collaboration. The company is not shutting down; it is reorganizing around a reality in which its most significant customer is now also its licensing partner.
GTC March 16: what to expect
NVIDIA's GPU Technology Conference returns to San Jose on March 16, and the Groq chip announcement is expected to be the centerpiece of Jensen Huang's keynote.
GTC 2024 was notable for the Blackwell architecture reveal, and GTC 2025 for Blackwell Ultra and NVIDIA's infrastructure partnerships. GTC 2026 appears likely to mark a more significant philosophical shift: NVIDIA announcing that it is building chips that are not GPUs.
Based on the information confirmed as of March 2, what GTC is expected to include:
The new chip's name is not yet publicly confirmed. Some industry observers expect NVIDIA to introduce a new product line separate from the Hopper/Blackwell GPU families — potentially an "inference series" or "language processing" branded product to distinguish it from the training-focused GPU lineup.
Performance benchmarks will be central to the announcement. Expect NVIDIA to present side-by-side inference latency and energy comparisons between the new chip and H100/H200 at representative model sizes and batch configurations. The numbers need to be compelling enough to justify the $20 billion investment and to give OpenAI's 3GW commitment public validation.
Pricing will not be announced at GTC — it rarely is at hardware reveals — but the analyst community will be focused on inferring it from the performance numbers. If the chip delivers 5–6x inference throughput per watt versus H100, NVIDIA can charge a significant premium while still offering customers lower total cost of ownership.
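A sketch of that pricing headroom. Every input is an assumption for illustration, not NVIDIA pricing or disclosed TCO data; `energy_share` is an assumed fraction of baseline total cost that goes to power and cooling:

```python
# If the new chip draws similar power per unit, a perf/watt gain is also
# a throughput gain, so both capex and energy per token shrink by it.
def relative_cost_per_token(price_premium, perf_per_watt_gain, energy_share=0.4):
    """Cost per token relative to an H100 baseline of 1.0 (assumed split
    between amortized hardware and power/cooling)."""
    capex_share = 1 - energy_share
    return (capex_share * price_premium + energy_share) / perf_per_watt_gain

# A hypothetical 2x price premium with a 5x perf/watt advantage:
r = relative_cost_per_token(price_premium=2.0, perf_per_watt_gain=5.0)
print(f"relative cost per token: {r:.2f}")
```

Under these assumptions the customer's cost per token drops to roughly a third of the GPU baseline even after paying double per chip, which is the arithmetic behind "significant premium with lower total cost of ownership."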
Jensen Huang's keynote style has become the defining theatrical event of the chip industry. Expect a similar production to GTC 2025, with customer cameos, live benchmark demonstrations, and a narrative that frames the chip as a new category rather than an iteration.
Competition: AMD, Intel, and custom silicon
The Groq deal does not exist in a vacuum. Every major chip company is pursuing inference efficiency, and NVIDIA is entering this segment with significant headwind from custom silicon built by its largest customers.
Google TPUs: Google's Tensor Processing Units have been in production for a decade. The fifth-generation TPU v5e is optimized for cost-efficient serving and powers much of Google's Gemini inference infrastructure. Google does not sell TPUs externally, but internal deployment gives it inference economics that NVIDIA-based competitors cannot match at Google's scale.
Amazon Trainium and Inferentia: Amazon has invested heavily in both training (Trainium) and inference (Inferentia) custom silicon through its Annapurna Labs subsidiary. Inferentia 2 is in production across AWS, and Amazon has publicly stated it will continue investment in both lines. AWS's ability to offer inference at lower cost on custom silicon is a direct competitive threat to NVIDIA-based cloud inference.
AMD MI300X: AMD has positioned the MI300X as the GPU alternative for inference workloads, with higher memory bandwidth than comparable NVIDIA parts and more HBM capacity per chip — 192 GB versus the H100's 80 GB. AMD has won inference contracts at Microsoft and others. The MI300X is not an LPU — it is still a GPU — but it addresses some of the memory bandwidth constraints that make NVIDIA GPUs less efficient at inference.
Intel Gaudi: Intel's Gaudi 3 accelerator is a more distant competitor, but Intel's manufacturing relationships and cost structure give it a presence in the market that cannot be ignored.
What distinguishes the NVIDIA-Groq chip from all of these competitors is the deterministic scheduling approach. Google TPUs use systolic array architecture. AMD MI300X uses conventional GPU architecture with higher memory capacity. None of them implement Groq's specific approach to eliminating dynamic scheduling overhead. If the benchmark data at GTC validates the LPU advantage, NVIDIA will have a chip with a differentiated architectural argument rather than just better numbers on the same metric axes.
What this means for AI infrastructure costs
The downstream implication of a successful LPU-based inference chip is a meaningful reduction in the cost of running AI at scale. That reduction matters not just for hyperscalers but for every enterprise that is trying to make the economics of AI-powered products work.
Current inference costs for frontier models are a real constraint on product development. Running GPT-4 level capability on GPU infrastructure costs enough that many product teams are forced to choose between quality and economics. They use smaller, cheaper models for the majority of interactions and reserve frontier models for high-value queries. This architectural compromise exists primarily because GPU-based inference is expensive.
If LPU-based inference reduces energy per token by 50–70% — the range that Groq's existing hardware suggests is achievable — the unit economics of AI products change materially. Applications that are currently marginal at frontier model quality become economically viable. Voice AI, real-time code review, continuous document processing — all of these workloads become cheaper to run.
The effect compounds at OpenAI's scale. At 3GW of committed inference capacity, even a 40% reduction in energy per token translates to billions of dollars in annual operating cost savings. Those savings can be passed to customers through lower API pricing, retained as margin, or reinvested in further model development. Any of those outcomes is good for the broader AI ecosystem.
For the AI infrastructure investment thesis broadly, the deal also signals a bifurcation of the chip market that investors need to model. Training compute demand is not going away — frontier model training is getting more expensive, not less. But inference compute demand is growing faster than training demand, and inference may ultimately be served by a completely different class of hardware. The companies that own both ends of that equation — NVIDIA with this deal, Google with TPUs, Amazon with Trainium and Inferentia — will have structural cost advantages that pure GPU players cannot match.
Frequently asked questions
Is NVIDIA acquiring Groq?
No. The deal is a non-exclusive licensing agreement. NVIDIA licensed Groq's LPU architecture and brought the founder and president on board as employees, but Groq continues as an independent company under new CEO Simon Edwards. GroqCloud remains active. Groq retains the right to license its architecture to other chip companies. The structure was deliberately chosen to avoid acquisition complexity and antitrust scrutiny.
What is the difference between a GPU and an LPU?
A GPU (Graphics Processing Unit) is a general-purpose parallel processor originally designed for rendering graphics. Its thousands of cores excel at running the same operation on many data points simultaneously — ideal for training neural networks. An LPU (Language Processing Unit) is purpose-built for inference: generating tokens sequentially from a trained model. LPUs use deterministic scheduling, meaning every memory access and compute operation is planned at compile time, eliminating the dynamic overhead that makes GPUs energy-inefficient at inference.
Why is OpenAI committing 3 gigawatts to this chip before it is publicly announced?
OpenAI has presumably seen internal benchmark data that justifies the commitment. At OpenAI's scale of inference demand — serving hundreds of millions of users plus enterprise API customers — even a moderate improvement in energy efficiency per token represents billions of dollars in operating cost savings annually. The 3GW commitment is a bet on those economics, not on an unproven concept.
Does this deal mean NVIDIA GPUs will stop being used for AI?
No. GPUs will continue to dominate training workloads for the foreseeable future. The new chip targets inference specifically. Most large-scale AI deployments use separate hardware for training and inference already. NVIDIA is extending its product portfolio into inference with a purpose-built chip, not replacing its GPU line.
When can enterprises buy the new chip?
No purchase availability date has been publicly confirmed. The chip is expected to be unveiled at GTC on March 16. Production availability timelines typically follow hardware reveals by 6–18 months for new chip architectures. Given that OpenAI has committed 3GW of capacity, a 2026 production timeline is plausible, but NVIDIA has not confirmed this.
How does this affect existing GPU investments that enterprises have already made?
Existing GPU infrastructure remains fully capable for inference in the near term. The new chip represents a cost efficiency advantage at scale, not a capability gap. Enterprises that have already purchased H100 or Blackwell systems for inference will not see those investments become obsolete — they will simply have a more efficient option available when they expand capacity. The LPU advantage is most pronounced at very high inference volumes where energy costs are the dominant operating expense.