Microsoft Maia 200: the inference chip built to challenge NVIDIA
Microsoft launches the Maia 200 inference chip capable of running today's largest AI models on a single node, joining Google and Amazon in the custom silicon race.
TL;DR: Microsoft launched the Maia 200, a 3nm custom inference chip delivering 10+ petaFLOPS of FP4 compute with 216 GB HBM3e memory at 750W -- roughly half the power draw of NVIDIA's Blackwell GPUs. It clusters to 6,144 accelerators and can run the largest frontier models on a single node, offering 30% better performance-per-dollar. With Google, Amazon, and now Microsoft fielding production-grade custom silicon, NVIDIA's ~80% market share faces structural pressure.
Microsoft's Maia 200 is a 3nm inference accelerator built on TSMC silicon, delivering 10+ petaFLOPS of FP4 compute, 216 GB of HBM3e memory, and 30% better performance-per-dollar than any prior hardware in Microsoft's fleet — all at 750W, roughly half the power draw of NVIDIA's Blackwell GPUs.
The chip clusters to 6,144 accelerators across 1,536 nodes, making it the first hyperscaler-built silicon capable of serving today's largest frontier models without sharding across multiple GPU islands.
NVIDIA still controls ~80% of the AI accelerator market. But with Google, Amazon, and now Microsoft each fielding production-grade custom silicon, that number is under structural pressure for the first time.
Microsoft first revealed custom AI silicon in November 2023 with the Maia 100, a chip designed for training workloads and initially deployed internally with limited scope. The Maia 100 was a proof of concept — evidence that Microsoft could design and tape out its own accelerators. It was never made available to Azure customers and generated more press than production traffic.
Maia 200 is a different kind of announcement. This is not a research project or a credibility signal. It is production infrastructure, deployed in Microsoft's US Central datacenter near Des Moines, Iowa as of January 2026, with the US West 3 region near Phoenix coming next. The chip is already running workloads for Microsoft's own products — including the latest GPT-5 class models in Microsoft Foundry and inference traffic for Microsoft 365 Copilot.
The strategic pivot from Maia 100 to Maia 200 reflects a broader industry shift. Training large models happens rarely and at enormous cost. Inference — running those models millions of times per second to answer user queries — is the ongoing operational expense. As AI gets embedded into every enterprise product, inference becomes the dominant cost center. That is the market Microsoft is targeting.
Microsoft built Maia 200 on TSMC's N3 (3nm) process node, the same cutting-edge fabrication technology used by Apple's A18 Pro and NVIDIA's upcoming Rubin architecture. The chip packs over 140 billion transistors into a single die.
- Compute: 10+ petaFLOPS of FP4 and 5+ petaFLOPS of FP8 per chip.
- Memory: 216 GB of HBM3e at 7 TB/s of bandwidth.
- Power: 750W TDP, roughly half the draw of NVIDIA's Blackwell B200.
- Networking and clustering: four chips per tray over direct non-switched links, scaling to 6,144 accelerators across 1,536 nodes on a custom fabric.
The 216 GB of HBM3e per chip is the headline number for inference. Larger memory capacity means larger models fit without aggressive quantization or model splitting across chips. For context, the full GPT-4 class models require hundreds of gigabytes of memory to load. A cluster of Maia 200 chips at this memory density can serve frontier-scale models with substantially less inter-chip communication overhead than GPU clusters with tighter per-chip memory budgets.
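A quick way to see why 216 GB matters is a weights-only footprint check. This is a back-of-envelope sketch: the 400B parameter count is hypothetical, and real deployments also need headroom for KV cache and activations.

```python
# Back-of-envelope check (weights only, ignoring KV cache and
# activations): does a model of a given size fit in one chip's HBM?

def weight_footprint_gb(params_billions: float, bits_per_param: int) -> float:
    """Memory needed to hold the weights alone, in GB."""
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

MAIA_200_HBM_GB = 216  # per-chip HBM3e capacity from the spec above

# A hypothetical 400B-parameter model at different serving precisions:
for bits in (16, 8, 4):
    gb = weight_footprint_gb(400, bits)
    fits = gb <= MAIA_200_HBM_GB
    print(f"FP{bits}: {gb:.0f} GB -> fits on one chip: {fits}")
```

The arithmetic shows why FP4 support and large HBM capacity go together: at 16-bit precision the same model needs four chips' worth of memory before serving a single request.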
Direct comparisons across custom silicon are methodologically difficult — each vendor reports different benchmarks under favorable conditions. The table below uses the most commonly cited, publicly available figures.
| Chip | Vendor | Process | FP8 Compute | FP4 Compute | Memory | Bandwidth | TDP | External Sales |
|---|---|---|---|---|---|---|---|---|
| Maia 200 | Microsoft | TSMC 3nm | 5+ PFLOPs | 10+ PFLOPs | 216 GB HBM3e | 7 TB/s | 750W | No |
| TPU v7 (Ironwood) | Google | TSMC 3nm | — | — | 192 GB HBM | 7.37 TB/s | ~800W | Via GCP |
| Amazon Trainium3 | AWS | TSMC 3nm | 2.52 PFLOPs | — | 144 GB HBM3e | 4.9 TB/s | ~600W | No |
| NVIDIA H200 | NVIDIA | TSMC 4nm | 1.98 PFLOPs | — | 141 GB HBM3e | 4.8 TB/s | 700W | Yes |
| NVIDIA Blackwell B200 | NVIDIA | TSMC 4nm | 9 PFLOPs | 20 PFLOPs | 192 GB HBM3e | 8 TB/s | ~1,200W | Yes |
Key takeaways from the table:
- Maia 200 has the largest per-chip memory (216 GB) of any accelerator listed.
- Blackwell B200 leads on raw compute (20 PFLOPs FP4), but at ~1,200W it draws roughly 60% more power than Maia 200.
- Only NVIDIA sells chips externally; Google exposes TPUs through GCP, while Microsoft and Amazon keep their silicon captive.
The absence of external sales is the most important constraint. Microsoft cannot monetize Maia 200 directly as a hardware product. Its value is entirely captured through Azure service margins and reduced dependence on NVIDIA procurement.
From 2020 through 2023, the dominant AI infrastructure story was training. Organizations raced to acquire GPU clusters to train ever-larger foundation models. NVIDIA captured that market almost completely — estimates from Omdia and Bloomberg Intelligence consistently put NVIDIA's share of AI accelerator revenue at 70–80%.
The economics of inference are structurally different from training, and this is why hyperscalers are building inference-specific chips now rather than general-purpose accelerators.
Training happens once (or rarely) at enormous upfront compute cost. It demands the highest possible raw throughput, supports long job durations, and tolerates complex distributed programming models.
Inference happens continuously, millions of times per second, at latency budgets measured in milliseconds. It rewards memory capacity (to load the model), memory bandwidth (to move activations), and power efficiency (to minimize cost-per-token). Raw FLOPS matter less than the ratio of useful work per watt and per dollar.
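The bandwidth point can be made concrete with a standard roofline-style estimate: in the decode phase, each generated token must stream the full weight set from HBM, so memory bandwidth caps single-stream throughput. A sketch with assumed numbers (batching amortizes this in practice):

```python
# Rough upper bound for single-stream decode throughput. Assumption:
# decode is memory-bandwidth bound, so every token streams the full
# weight set from HBM; real servers batch requests to amortize this.

def max_tokens_per_sec(bandwidth_tb_s: float, weight_gb: float) -> float:
    """Bandwidth in GB/s divided by GB moved per token."""
    return bandwidth_tb_s * 1000 / weight_gb

# Hypothetical 200 GB of FP4 weights on one Maia 200 (7 TB/s HBM3e):
print(f"{max_tokens_per_sec(7.0, 200):.0f} tokens/s upper bound")  # -> 35
```

This is why the spec sheet leads with memory bandwidth rather than FLOPS: for serving, the bytes moved per token, not the arithmetic, is usually the binding constraint.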
NVIDIA's GPUs are general-purpose. They are excellent at both training and inference, but they carry the cost of that generality — high TDP, premium pricing, and a software stack designed for flexibility rather than inference-specific optimization.
A purpose-built inference chip can make different tradeoffs: larger on-chip SRAM buffers, specialized attention kernels, lower precision formats optimized for serving rather than training, and aggressive power capping that maps to lower operating costs in a datacenter running 24/7.
Microsoft claims 30% better performance per dollar than the latest generation hardware in its fleet. That is the real number driving the Maia 200 program — not competitive optics, but operating margin on Azure AI services.
One of the more technically interesting aspects of Maia 200 is its clustering design. Four chips per tray are connected via direct non-switched links — meaning chip-to-chip communication within a tray bypasses the fabric entirely. This is similar in principle to NVIDIA's NVLink, but implemented as part of a proprietary tray architecture rather than a general-purpose interconnect.
Beyond the tray, Microsoft built a custom network fabric called the Maia AI Transport Protocol. This handles both intra-rack and inter-rack communication using the same protocol, reducing the software complexity of managing heterogeneous network layers at scale.
At full deployment, a Maia 200 cluster spans 1,536 nodes containing 6,144 accelerators. That is the scale required to run today's largest frontier models — GPT-4 class and beyond — as a single coherent serving system rather than a loosely coupled collection of independent GPU servers.
The significance: most cloud GPU deployments for large model inference involve complex model-parallelism schemes where the model is sliced across dozens or hundreds of GPUs, with constant inter-GPU communication creating latency and throughput bottlenecks. Maia 200's memory capacity and cluster coherence are designed to minimize that overhead. A single node can hold enough of a frontier model to handle inference requests with reduced cross-node communication compared to smaller-memory alternatives.
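The effect of per-chip memory on sharding can be sketched with simple arithmetic. The 600 GB model size and the 15% reservation for KV cache and activations are assumptions for illustration, not published figures:

```python
import math

def chips_needed(model_gb: float, per_chip_gb: float,
                 overhead: float = 0.15) -> int:
    """Minimum chips to hold the weights, reserving an assumed
    fraction of HBM for KV cache and activations."""
    usable = per_chip_gb * (1 - overhead)
    return math.ceil(model_gb / usable)

# Hypothetical 600 GB frontier model:
print(chips_needed(600, 216))  # Maia 200 at 216 GB each -> 4
print(chips_needed(600, 141))  # H200 at 141 GB each     -> 6
```

Fewer chips per model replica means fewer shard boundaries, and every boundary removed is inter-chip communication that no longer sits on the critical path of each token.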
Any honest assessment of Maia 200 requires confronting what NVIDIA is shipping. The Blackwell B200 delivers 20 petaFLOPS of FP4 compute — twice the Maia 200's figure. Blackwell Ultra is already shipping. And NVIDIA's next architecture, Vera Rubin, is entering full production ahead of schedule with an H2 2026 launch target.
Rubin raises the stakes dramatically:
- 50 petaFLOPS of FP4 compute, 2.5x the Blackwell B200's 20 petaFLOPS.
- A claimed 10x lower cost-per-token than Blackwell.
- Full production ahead of schedule, targeting an H2 2026 launch.
NVIDIA's one-year release cadence means that by the time Maia 200 reaches broad Azure deployment, the competition will be Rubin, not Blackwell. Microsoft is not building Maia 200 to win on raw performance against NVIDIA's best. It is building it to eliminate NVIDIA from a specific, high-volume slice of its workload — internal inference for Microsoft's own products.
That framing matters. Maia 200 does not need to beat Rubin. It needs to be cheaper, more power-efficient, and controllable enough to serve a defined set of workloads profitably. For Microsoft's internal economics, 30% better performance-per-dollar at 750W is a more useful metric than absolute FLOPS.
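Microsoft's 30% figure can be restated as cost per token. A minimal arithmetic sketch; the normalization is illustrative, since Microsoft has published no absolute token costs:

```python
# Restating "30% better performance-per-dollar" as cost per token.
# Baseline is normalized to 1.0; the uplift is Microsoft's claim.

baseline_tokens_per_dollar = 1.0
maia_tokens_per_dollar = baseline_tokens_per_dollar * 1.30

relative_cost = (1 / maia_tokens_per_dollar) / (1 / baseline_tokens_per_dollar)
print(f"Relative cost per token: {relative_cost:.2f}")  # -> 0.77
```

At hyperscale volume, shaving roughly 23% off the cost of every token served is worth more to Azure's margins than closing the raw-FLOPS gap with Rubin.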
The risk is NVIDIA's pace. Rubin promises 10x lower cost-per-token than Blackwell. If that claim holds in production, the economic case for custom silicon narrows significantly — unless hyperscalers can iterate their own chips fast enough to keep up.
Microsoft is not acting in isolation. The movement toward hyperscaler-built AI silicon is a structural trend accelerating across the industry.
Google has deployed TPUs since 2016 and now controls an estimated 58% of the custom cloud AI accelerator market. Google's seventh-generation Ironwood TPU features 192 GB HBM and 7.37 TB/s bandwidth per chip, targeting both inference and training. Google Cloud offers TPU access externally via GCP.
Amazon has Trainium2 in production (400,000 chips deployed for Anthropic's "Project Rainier") and Trainium3 entering full deployment in early 2026 — promising 40% better energy efficiency than Trainium2 at TSMC 3nm. Amazon does not sell Trainium externally; it is captive to AWS workloads.
Meta has MTIA (Meta Training and Inference Accelerator) scaling across its recommendation and ranking workloads, targeting the massive inference load generated by Facebook, Instagram, and Threads.
The market numbers support the urgency. Bloomberg Intelligence estimates the AI accelerator market will exceed $600 billion by 2033, up from $116 billion in 2024. Custom ASIC shipments from cloud providers are projected to grow 44.6% in 2026, compared to 16.1% for GPU shipments. Custom silicon is expected to capture 15–25% market share by mid-decade, primarily driven by internal hyperscaler inference workloads.
NVIDIA's ~80% market share is real but concentrated in training and external cloud sales. The custom ASIC growth is largely invisible to revenue-share analyses because hyperscalers are consuming their own chips rather than selling them — the displacement of NVIDIA shows up as a slowdown in GPU procurement, not as a competing product in a market report.
Maia 200 is not available for Azure customers to provision directly. There is no "Standard_Maia200" VM SKU. The chip is infrastructure, not a product — it runs behind the abstractions that power Azure AI Foundry, Azure OpenAI Service, and Microsoft 365 Copilot.
For Azure customers, the impact is indirect but real: the services they already consume (Azure AI Foundry, Azure OpenAI Service, Microsoft 365 Copilot) run on cheaper, more power-efficient silicon, and Microsoft's room to price them competitively grows accordingly.
The OpenAI relationship is worth examining closely. OpenAI is both Microsoft's most important AI partner and, increasingly, a strategic liability in terms of GPU procurement costs. Every token generated by ChatGPT or the OpenAI API running on Azure costs Microsoft compute. Maia 200 reduces the per-token cost of that compute. The chip is, in part, a cost-management instrument for the OpenAI partnership.
For enterprises running Azure AI workloads, the practical implication is straightforward: Microsoft's ability to offer competitive inference pricing on its platform is directly tied to the performance of programs like Maia 200. The chip is infrastructure investment that underwrites the economics of the AI services layer above it.
Maia 200 is an inference accelerator. That is its purpose and also its constraint. It is not designed for training large models from scratch. It does not offer the general-purpose programmability of NVIDIA's CUDA ecosystem. It will not be available as a standalone product on the open market.
Microsoft's software stack for Maia 200 is built around the ONNX Runtime and Azure's internal serving infrastructure. Porting arbitrary PyTorch training code to run on Maia 200 is not the use case. This is a difference from NVIDIA's offering, where the same GPU handles training experiments, fine-tuning runs, and production inference behind a single programming model.
For customers who need flexibility — the ability to train, fine-tune, and serve from the same hardware pool — NVIDIA GPUs remain the more versatile option. Maia 200's specialization is a feature for Microsoft's internal cost structure and a limitation for any use case that falls outside its target workload profile.
The power consumption figure also bears scrutiny. 750W per chip is lower than Blackwell's 1,200W, but a 6,144-chip cluster still draws substantial power. Maia 200's efficiency advantage compounds at scale, but the absolute power demand of frontier model inference remains enormous regardless of which silicon is serving it.
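To put the absolute draw in perspective, here is a rough annual electricity estimate. The $0.08/kWh rate and 24/7 full-TDP operation are assumptions, not figures from Microsoft, and cooling overhead is ignored:

```python
# Rough annual electricity cost for a full 6,144-chip cluster at TDP.
# Assumptions: 24/7 operation at full TDP, $0.08/kWh industrial power,
# no cooling overhead; none of these figures come from Microsoft.

HOURS_PER_YEAR = 8760
PRICE_PER_KWH = 0.08  # USD, assumed

def annual_power_cost(tdp_watts: float, chips: int = 1) -> float:
    """Electricity cost per year in USD for a group of chips."""
    return tdp_watts / 1000 * HOURS_PER_YEAR * PRICE_PER_KWH * chips

cluster_mw = 750 * 6144 / 1e6  # ~4.6 MW of silicon alone
print(f"Cluster draw: {cluster_mw:.1f} MW")
print(f"Maia 200 cluster:     ${annual_power_cost(750, 6144):,.0f}/yr")
print(f"Same count of B200s:  ${annual_power_cost(1200, 6144):,.0f}/yr")
```

Even under these favorable assumptions, the cluster draws several megawatts continuously; the per-chip efficiency gain changes the bill, not the category of the problem.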
Is Maia 200 available on Azure for external customers? No. Maia 200 is Microsoft-internal infrastructure. It powers services like Azure AI Foundry, Azure OpenAI Service, and Microsoft 365 Copilot, but there is no direct customer-facing SKU. Azure customers continue to access GPU instances backed by NVIDIA hardware for general compute workloads.
How does Maia 200 compare to NVIDIA Blackwell in raw performance? NVIDIA Blackwell B200 delivers 20 petaFLOPS of FP4 compute compared to Maia 200's 10+ petaFLOPS. Blackwell leads on raw throughput. Maia 200's advantage is power efficiency (750W vs. ~1,200W) and cost-per-inference for the specific workloads it targets.
Can Maia 200 train large AI models? No. Maia 200 is an inference accelerator. Its architecture — large HBM capacity, high memory bandwidth, inference-optimized tensor cores — is purpose-built for serving models, not training them. Microsoft uses NVIDIA GPUs and other hardware for model training.
Why are hyperscalers building their own chips instead of buying from NVIDIA? Three reasons: cost, supply chain control, and optimization. Custom silicon cuts inference cost-per-token by eliminating the NVIDIA margin and enabling workload-specific hardware tradeoffs. It reduces dependency on NVIDIA's supply chain and pricing. And it allows fine-tuned architectural choices — like large on-chip SRAM and inference-specific precision formats — that a general-purpose GPU cannot make. Microsoft claims 30% better performance-per-dollar vs. its prior fleet as the quantified outcome.
Does NVIDIA's Rubin architecture make Maia 200 obsolete? Not in the near term. Rubin arrives H2 2026 with 50 petaFLOPS of FP4 compute and promises 10x lower cost-per-token than Blackwell. But Maia 200 is already in production, runs at lower TDP, and is not subject to external procurement costs. The question for Microsoft is whether Rubin's efficiency improvements outpace Maia 200's operational cost advantages — and whether Microsoft can iterate Maia 300 fast enough to stay competitive.
What AI models run on Maia 200? Microsoft has confirmed that Maia 200 powers the latest GPT-5.2 class models from OpenAI running on Azure, as well as inference workloads for Microsoft 365 Copilot. The full scope of models running on Maia 200 has not been publicly disclosed.
Where is Maia 200 currently deployed? As of January 2026, Maia 200 is live in Microsoft's US Central datacenter near Des Moines, Iowa. The US West 3 region near Phoenix, Arizona is the next planned deployment, with additional regions to follow.
What happens to Microsoft's NVIDIA GPU fleet? Maia 200 does not replace NVIDIA GPUs in Microsoft's infrastructure. It supplements them for specific inference workloads. Microsoft will continue to procure NVIDIA hardware — including Blackwell and eventually Rubin — for training runs, flexible compute, and workloads that fall outside Maia 200's optimization target. The relationship is displacement at the margin, not wholesale replacement.