Microsoft Maia 200: the inference chip built to challenge NVIDIA
Microsoft launches the Maia 200 inference chip capable of running today's largest AI models on a single node, joining Google and Amazon in the custom silicon race.
TL;DR: Microsoft launched the Maia 200, a 3nm custom inference chip delivering 10+ petaFLOPS of FP4 compute with 216 GB HBM3e memory at 750W -- roughly half the power draw of NVIDIA's Blackwell GPUs. It clusters to 6,144 accelerators and can run the largest frontier models on a single node, offering 30% better performance-per-dollar. With Google, Amazon, and now Microsoft fielding production-grade custom silicon, NVIDIA's ~80% market share faces structural pressure.
Microsoft's Maia 200 is a 3nm inference accelerator built on TSMC silicon, delivering 10+ petaFLOPS of FP4 compute, 216 GB of HBM3e memory, and 30% better performance-per-dollar than any prior hardware in Microsoft's fleet — all at 750W, roughly half the power draw of NVIDIA's Blackwell GPUs.
The chip clusters to 6,144 accelerators across 1,536 nodes, making it the first hyperscaler-built silicon capable of serving today's largest frontier models without sharding across multiple GPU islands.
NVIDIA still controls ~80% of the AI accelerator market. But with Google, Amazon, and now Microsoft each fielding production-grade custom silicon, that number is under structural pressure for the first time.
Microsoft first revealed custom AI silicon in November 2023 with the Maia 100, a chip designed for training workloads and initially deployed internally with limited scope. The Maia 100 was a proof of concept — evidence that Microsoft could design and tape out its own accelerators. It was never made available to Azure customers and generated more press than production traffic.
Maia 200 is a different kind of announcement. This is not a research project or a credibility signal. It is production infrastructure, deployed in Microsoft's US Central datacenter near Des Moines, Iowa as of January 2026, with the US West 3 region near Phoenix coming next. The chip is already running workloads for Microsoft's own products — including the latest GPT-5 class models in Microsoft Foundry and inference traffic for Microsoft 365 Copilot.
The strategic pivot from Maia 100 to Maia 200 reflects a broader industry shift. Training large models happens rarely and at enormous cost. Inference — running those models millions of times per second to answer user queries — is the ongoing operational expense. As AI gets embedded into every enterprise product, inference becomes the dominant cost center. That is the market Microsoft is targeting.
Microsoft built Maia 200 on TSMC's N3 (3nm) process node, the same cutting-edge fabrication technology used by Apple's A18 Pro and NVIDIA's upcoming Rubin architecture. The chip packs over 140 billion transistors into a single die.
- Compute: 10+ petaFLOPS of FP4 and 5+ petaFLOPS of FP8 per chip.
- Memory: 216 GB of HBM3e at 7 TB/s of bandwidth.
- Power: 750W TDP, roughly half the draw of NVIDIA's Blackwell B200.
- Networking and clustering: four chips per tray over direct non-switched links, scaling to 6,144 accelerators across 1,536 nodes on a custom fabric.
The 216 GB of HBM3e per chip is the headline number for inference. Larger memory capacity means larger models fit without aggressive quantization or model splitting across chips. For context, the full GPT-4 class models require hundreds of gigabytes of memory to load. A cluster of Maia 200 chips at this memory density can serve frontier-scale models with substantially less inter-chip communication overhead than GPU clusters with tighter per-chip memory budgets.
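A quick way to see why 216 GB matters is a weights-only footprint check. This is a back-of-envelope sketch: the 400B parameter count is hypothetical, and real deployments also need headroom for KV cache and activations.

```python
# Back-of-envelope check (weights only, ignoring KV cache and
# activations): does a model of a given size fit in one chip's HBM?

def weight_footprint_gb(params_billions: float, bits_per_param: int) -> float:
    """Memory needed to hold the weights alone, in GB."""
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

MAIA_200_HBM_GB = 216  # per-chip HBM3e capacity from the spec above

# A hypothetical 400B-parameter model at different serving precisions:
for bits in (16, 8, 4):
    gb = weight_footprint_gb(400, bits)
    fits = gb <= MAIA_200_HBM_GB
    print(f"FP{bits}: {gb:.0f} GB -> fits on one chip: {fits}")
```

The arithmetic shows why FP4 support and large HBM capacity go together: at 16-bit precision the same model needs four chips' worth of memory before serving a single request.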
Direct comparisons across custom silicon are methodologically difficult — each vendor reports different benchmarks under favorable conditions. The table below uses the most commonly cited, publicly available figures.
| Chip | Vendor | Process | FP8 Compute | FP4 Compute | Memory | Bandwidth | TDP | External Sales |
|---|---|---|---|---|---|---|---|---|
| Maia 200 | Microsoft | TSMC 3nm | 5+ PFLOPs | 10+ PFLOPs | 216 GB HBM3e | 7 TB/s | 750W | No |
| TPU v7 (Ironwood) | Google | TSMC 3nm | — | — | 192 GB HBM | 7.37 TB/s | ~800W | Via GCP |
| Amazon Trainium3 | AWS | TSMC 3nm | 2.52 PFLOPs | — | 144 GB HBM3e | 4.9 TB/s | ~600W | No |
| NVIDIA H200 | NVIDIA | TSMC 4nm | 1.98 PFLOPs | — | 141 GB HBM3e | 4.8 TB/s | 700W | Yes |
| NVIDIA Blackwell B200 | NVIDIA | TSMC 4nm | 9 PFLOPs | 20 PFLOPs | 192 GB HBM3e | 8 TB/s | ~1,200W | Yes |
Key takeaways from the table:
- Maia 200 has the largest per-chip memory (216 GB) of any accelerator listed.
- Blackwell B200 leads on raw compute (20 PFLOPs FP4), but at ~1,200W it draws roughly 60% more power than Maia 200.
- Only NVIDIA sells chips externally; Google exposes TPUs through GCP, while Microsoft and Amazon keep their silicon captive.
The absence of external sales is the most important constraint. Microsoft cannot monetize Maia 200 directly as a hardware product. Its value is entirely captured through Azure service margins and reduced dependence on NVIDIA procurement.
From 2020 through 2023, the dominant AI infrastructure story was training. Organizations raced to acquire GPU clusters to train ever-larger foundation models. NVIDIA captured that market almost completely — estimates from Omdia and Bloomberg Intelligence consistently put NVIDIA's share of AI accelerator revenue at 70–80%.
The economics of inference are structurally different from training, and this is why hyperscalers are building inference-specific chips now rather than general-purpose accelerators.
Training happens once (or rarely) at enormous upfront compute cost. It demands the highest possible raw throughput, supports long job durations, and tolerates complex distributed programming models.
Inference happens continuously, millions of times per second, at latency budgets measured in milliseconds. It rewards memory capacity (to load the model), memory bandwidth (to move activations), and power efficiency (to minimize cost-per-token). Raw FLOPS matter less than the ratio of useful work per watt and per dollar.
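The bandwidth point can be made concrete with a standard roofline-style estimate: in the decode phase, each generated token must stream the full weight set from HBM, so memory bandwidth caps single-stream throughput. A sketch with assumed numbers (batching amortizes this in practice):

```python
# Rough upper bound for single-stream decode throughput. Assumption:
# decode is memory-bandwidth bound, so every token streams the full
# weight set from HBM; real servers batch requests to amortize this.

def max_tokens_per_sec(bandwidth_tb_s: float, weight_gb: float) -> float:
    """Bandwidth in GB/s divided by GB moved per token."""
    return bandwidth_tb_s * 1000 / weight_gb

# Hypothetical 200 GB of FP4 weights on one Maia 200 (7 TB/s HBM3e):
print(f"{max_tokens_per_sec(7.0, 200):.0f} tokens/s upper bound")  # -> 35
```

This is why the spec sheet leads with memory bandwidth rather than FLOPS: for serving, the bytes moved per token, not the arithmetic, is usually the binding constraint.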
NVIDIA's GPUs are general-purpose. They are excellent at both training and inference, but they carry the cost of that generality — high TDP, premium pricing, and a software stack designed for flexibility rather than inference-specific optimization.
A purpose-built inference chip can make different tradeoffs: larger on-chip SRAM buffers, specialized attention kernels, lower precision formats optimized for serving rather than training, and aggressive power capping that maps to lower operating costs in a datacenter running 24/7.
Microsoft claims 30% better performance per dollar than the latest generation hardware in its fleet. That is the real number driving the Maia 200 program — not competitive optics, but operating margin on Azure AI services.
One of the more technically interesting aspects of Maia 200 is its clustering design. Four chips per tray are connected via direct non-switched links — meaning chip-to-chip communication within a tray bypasses the fabric entirely. This is similar in principle to NVIDIA's NVLink, but implemented as part of a proprietary tray architecture rather than a general-purpose interconnect.
Beyond the tray, Microsoft built a custom network fabric called the Maia AI Transport Protocol. This handles both intra-rack and inter-rack communication using the same protocol, reducing the software complexity of managing heterogeneous network layers at scale.
At full deployment, a Maia 200 cluster spans 1,536 nodes containing 6,144 accelerators. That is the scale required to run today's largest frontier models — GPT-4 class and beyond — as a single coherent serving system rather than a loosely coupled collection of independent GPU servers.
The significance: most cloud GPU deployments for large model inference involve complex model-parallelism schemes where the model is sliced across dozens or hundreds of GPUs, with constant inter-GPU communication creating latency and throughput bottlenecks. Maia 200's memory capacity and cluster coherence are designed to minimize that overhead. A single node can hold enough of a frontier model to handle inference requests with reduced cross-node communication compared to smaller-memory alternatives.
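The effect of per-chip memory on sharding can be sketched with simple arithmetic. The 600 GB model size and the 15% reservation for KV cache and activations are assumptions for illustration, not published figures:

```python
import math

def chips_needed(model_gb: float, per_chip_gb: float,
                 overhead: float = 0.15) -> int:
    """Minimum chips to hold the weights, reserving an assumed
    fraction of HBM for KV cache and activations."""
    usable = per_chip_gb * (1 - overhead)
    return math.ceil(model_gb / usable)

# Hypothetical 600 GB frontier model:
print(chips_needed(600, 216))  # Maia 200 at 216 GB each -> 4
print(chips_needed(600, 141))  # H200 at 141 GB each     -> 6
```

Fewer chips per model replica means fewer shard boundaries, and every boundary removed is inter-chip communication that no longer sits on the critical path of each token.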
Any honest assessment of Maia 200 requires confronting what NVIDIA is shipping. The Blackwell B200 delivers 20 petaFLOPS of FP4 compute — twice the Maia 200's figure. Blackwell Ultra is already shipping. And NVIDIA's next architecture, Vera Rubin, is entering full production ahead of schedule with an H2 2026 launch target.
Rubin raises the stakes dramatically:
- 50 petaFLOPS of FP4 compute, 2.5x the Blackwell B200's 20 petaFLOPS.
- A claimed 10x lower cost-per-token than Blackwell.
- Full production ahead of schedule, targeting an H2 2026 launch.
NVIDIA's one-year release cadence means that by the time Maia 200 reaches broad Azure deployment, the competition will be Rubin, not Blackwell. Microsoft is not building Maia 200 to win on raw performance against NVIDIA's best. It is building it to eliminate NVIDIA from a specific, high-volume slice of its workload — internal inference for Microsoft's own products.
That framing matters. Maia 200 does not need to beat Rubin. It needs to be cheaper, more power-efficient, and controllable enough to serve a defined set of workloads profitably. For Microsoft's internal economics, 30% better performance-per-dollar at 750W is a more useful metric than absolute FLOPS.
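Microsoft's 30% figure can be restated as cost per token. A minimal arithmetic sketch; the normalization is illustrative, since Microsoft has published no absolute token costs:

```python
# Restating "30% better performance-per-dollar" as cost per token.
# Baseline is normalized to 1.0; the uplift is Microsoft's claim.

baseline_tokens_per_dollar = 1.0
maia_tokens_per_dollar = baseline_tokens_per_dollar * 1.30

relative_cost = (1 / maia_tokens_per_dollar) / (1 / baseline_tokens_per_dollar)
print(f"Relative cost per token: {relative_cost:.2f}")  # -> 0.77
```

At hyperscale volume, shaving roughly 23% off the cost of every token served is worth more to Azure's margins than closing the raw-FLOPS gap with Rubin.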
The risk is NVIDIA's pace. Rubin promises 10x lower cost-per-token than Blackwell. If that claim holds in production, the economic case for custom silicon narrows significantly — unless hyperscalers can iterate their own chips fast enough to keep up.
Microsoft is not acting in isolation. The movement toward hyperscaler-built AI silicon is a structural trend accelerating across the industry.
Google has deployed TPUs since 2016 and now controls an estimated 58% of the custom cloud AI accelerator market. Google's seventh-generation Ironwood TPU features 192 GB HBM and 7.37 TB/s bandwidth per chip, targeting both inference and training. Google Cloud offers TPU access externally via GCP.
Amazon has Trainium2 in production (400,000 chips deployed for Anthropic's "Project Rainier") and Trainium3 entering full deployment in early 2026 — promising 40% better energy efficiency than Trainium2 at TSMC 3nm. Amazon does not sell Trainium externally; it is captive to AWS workloads.
Meta has MTIA (Meta Training and Inference Accelerator) scaling across its recommendation and ranking workloads, targeting the massive inference load generated by Facebook, Instagram, and Threads.
The market numbers support the urgency. Bloomberg Intelligence estimates the AI accelerator market will exceed $600 billion by 2033, up from $116 billion in 2024. Custom ASIC shipments from cloud providers are projected to grow 44.6% in 2026, compared to 16.1% for GPU shipments. Custom silicon is expected to capture 15–25% market share by mid-decade, primarily driven by internal hyperscaler inference workloads.
NVIDIA's ~80% market share is real but concentrated in training and external cloud sales. The custom ASIC growth is largely invisible to revenue-share analyses because hyperscalers are consuming their own chips rather than selling them — the displacement of NVIDIA shows up as a slowdown in GPU procurement, not as a competing product in a market report.
Maia 200 is not available for Azure customers to provision directly. There is no "Standard_Maia200" VM SKU. The chip is infrastructure, not a product — it runs behind the abstractions that power Azure AI Foundry, Azure OpenAI Service, and Microsoft 365 Copilot.
For Azure customers, the impact is indirect but real: the services they already consume (Azure AI Foundry, Azure OpenAI Service, Microsoft 365 Copilot) run on cheaper, more power-efficient silicon, and Microsoft's room to price them competitively grows accordingly.
The OpenAI relationship is worth examining closely. OpenAI is both Microsoft's most important AI partner and, increasingly, a strategic liability in terms of GPU procurement costs. Every token generated by ChatGPT or the OpenAI API running on Azure costs Microsoft compute. Maia 200 reduces the per-token cost of that compute. The chip is, in part, a cost-management instrument for the OpenAI partnership.
For enterprises running Azure AI workloads, the practical implication is straightforward: Microsoft's ability to offer competitive inference pricing on its platform is directly tied to the performance of programs like Maia 200. The chip is infrastructure investment that underwrites the economics of the AI services layer above it.
Maia 200 is an inference accelerator. That is its purpose and also its constraint. It is not designed for training large models from scratch. It does not offer the general-purpose programmability of NVIDIA's CUDA ecosystem. It will not be available as a standalone product on the open market.
Microsoft's software stack for Maia 200 is built around the ONNX Runtime and Azure's internal serving infrastructure. Porting arbitrary PyTorch training code to run on Maia 200 is not the use case. This is a difference from NVIDIA's offering, where the same GPU handles training experiments, fine-tuning runs, and production inference behind a single programming model.
For customers who need flexibility — the ability to train, fine-tune, and serve from the same hardware pool — NVIDIA GPUs remain the more versatile option. Maia 200's specialization is a feature for Microsoft's internal cost structure and a limitation for any use case that falls outside its target workload profile.
The power consumption figure also bears scrutiny. 750W per chip is lower than Blackwell's 1,200W, but a 6,144-chip cluster still draws substantial power. Maia 200's efficiency advantage compounds at scale, but the absolute power demand of frontier model inference remains enormous regardless of which silicon is serving it.
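To put the absolute draw in perspective, here is a rough annual electricity estimate. The $0.08/kWh rate and 24/7 full-TDP operation are assumptions, not figures from Microsoft, and cooling overhead is ignored:

```python
# Rough annual electricity cost for a full 6,144-chip cluster at TDP.
# Assumptions: 24/7 operation at full TDP, $0.08/kWh industrial power,
# no cooling overhead; none of these figures come from Microsoft.

HOURS_PER_YEAR = 8760
PRICE_PER_KWH = 0.08  # USD, assumed

def annual_power_cost(tdp_watts: float, chips: int = 1) -> float:
    """Electricity cost per year in USD for a group of chips."""
    return tdp_watts / 1000 * HOURS_PER_YEAR * PRICE_PER_KWH * chips

cluster_mw = 750 * 6144 / 1e6  # ~4.6 MW of silicon alone
print(f"Cluster draw: {cluster_mw:.1f} MW")
print(f"Maia 200 cluster:     ${annual_power_cost(750, 6144):,.0f}/yr")
print(f"Same count of B200s:  ${annual_power_cost(1200, 6144):,.0f}/yr")
```

Even under these favorable assumptions, the cluster draws several megawatts continuously; the per-chip efficiency gain changes the bill, not the category of the problem.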
Is Maia 200 available on Azure for external customers? No. Maia 200 is Microsoft-internal infrastructure. It powers services like Azure AI Foundry, Azure OpenAI Service, and Microsoft 365 Copilot, but there is no direct customer-facing SKU. Azure customers continue to access GPU instances backed by NVIDIA hardware for general compute workloads.
How does Maia 200 compare to NVIDIA Blackwell in raw performance? NVIDIA Blackwell B200 delivers 20 petaFLOPS of FP4 compute compared to Maia 200's 10+ petaFLOPS. Blackwell leads on raw throughput. Maia 200's advantage is power efficiency (750W vs. ~1,200W) and cost-per-inference for the specific workloads it targets.
Can Maia 200 train large AI models? No. Maia 200 is an inference accelerator. Its architecture — large HBM capacity, high memory bandwidth, inference-optimized tensor cores — is purpose-built for serving models, not training them. Microsoft uses NVIDIA GPUs and other hardware for model training.
Why are hyperscalers building their own chips instead of buying from NVIDIA? Three reasons: cost, supply chain control, and optimization. Custom silicon cuts inference cost-per-token by eliminating the NVIDIA margin and enabling workload-specific hardware tradeoffs. It reduces dependency on NVIDIA's supply chain and pricing. And it allows fine-tuned architectural choices — like large on-chip SRAM and inference-specific precision formats — that a general-purpose GPU cannot make. Microsoft claims 30% better performance-per-dollar vs. its prior fleet as the quantified outcome.
Does NVIDIA's Rubin architecture make Maia 200 obsolete? Not in the near term. Rubin arrives H2 2026 with 50 petaFLOPS of FP4 compute and promises 10x lower cost-per-token than Blackwell. But Maia 200 is already in production, runs at lower TDP, and is not subject to external procurement costs. The question for Microsoft is whether Rubin's efficiency improvements outpace Maia 200's operational cost advantages — and whether Microsoft can iterate Maia 300 fast enough to stay competitive.
What AI models run on Maia 200? Microsoft has confirmed that Maia 200 powers the latest GPT-5.2 class models from OpenAI running on Azure, as well as inference workloads for Microsoft 365 Copilot. The full scope of models running on Maia 200 has not been publicly disclosed.
Where is Maia 200 currently deployed? As of January 2026, Maia 200 is live in Microsoft's US Central datacenter near Des Moines, Iowa. The US West 3 region near Phoenix, Arizona is the next planned deployment, with additional regions to follow.
What happens to Microsoft's NVIDIA GPU fleet? Maia 200 does not replace NVIDIA GPUs in Microsoft's infrastructure. It supplements them for specific inference workloads. Microsoft will continue to procure NVIDIA hardware — including Blackwell and eventually Rubin — for training runs, flexible compute, and workloads that fall outside Maia 200's optimization target. The relationship is displacement at the margin, not wholesale replacement.