TL;DR: Anthropic has made the 1 million token context window generally available for both Claude Opus 4.6 and Claude Sonnet 4.6 as of March 13, 2026. There is no longer a beta flag to set and no pricing multiplier above 200K tokens: a 900K-token call is billed at the same per-token rate as a 9K call. The per-request media limit has also risen from 100 to 600 images or PDF pages. Opus 4.6 scores 78.3% on MRCR v2 at 1M tokens, the highest recall figure among frontier models at that length. The window is live on the Claude Platform, Microsoft Azure AI Foundry, and Google Cloud Vertex AI.
Table of contents
- What changed and when
- How the pricing works
- MRCR v2 and GraphWalks BFS: what the benchmarks actually test
- Recall performance against GPT-5.4 and Gemini 3.1 Pro
- Media limits: 600 images and PDF pages per request
- How long-context requests work
- Where it is available: platform support and API access
- Practical use cases that 1M context unlocks
- Limitations and things to watch
- What this means for teams building on Claude
- 20 frequently asked questions
What changed and when
On March 13, 2026, Anthropic published a blog post on claude.com and an announcement on X declaring the 1M token context window generally available for Claude Opus 4.6 and Claude Sonnet 4.6.
The announcement is short but the implications are significant. Three things changed simultaneously:
The beta header is gone. During the extended preview period, developers had to pass a specific anthropic-beta header to unlock context beyond 200K tokens. That requirement is removed. Any API call to claude-opus-4-6 or claude-sonnet-4-6 now accepts up to 1 million tokens by default.
The pricing multiplier is gone. Anthropic had previously charged a premium above the 200K token threshold. That surcharge has been eliminated. Requests are billed at the same flat per-token rate regardless of context length.
The media ceiling increased sixfold. The limit on images and PDF pages per request has risen from 100 to 600. A single API call can now include an entire book-length PDF or several hundred images alongside the rest of the prompt.
This is a meaningful operational change for teams who had been calculating separate budget lines for long-context usage, or who had been avoiding the 1M window because of billing unpredictability.
How the pricing works
Pricing for Claude Opus 4.6 and Claude Sonnet 4.6 is unchanged from the standard rates announced when these models launched.
The key point is that these rates apply uniformly across the full context window. A request with 900,000 input tokens costs the same per token as a request with 9,000 input tokens. There is no pricing step-function at 200K, 500K, or any other threshold.
That eliminates a category of cost modeling complexity. Previously, a team building a retrieval-augmented pipeline had to account for two different per-token rates depending on whether any single call exceeded the premium threshold. The billing surface is now flat.
For comparison: during the beta period, long-context pricing varied but developers often reported effective rates 1.5 to 2x the base rate for requests in the 500K–1M range. That premium disappears entirely.
The flat pricing also changes the calculus around chunking strategies. Teams that had been splitting long documents across multiple shorter API calls to avoid the premium tier may now find it simpler and cheaper to pass the full document in a single request and let the model reason across the whole thing.
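As a sanity check on the flat-rate claim, here is a minimal cost sketch using the list prices cited later in this article ($5.00 and $3.00 per million input tokens for Opus 4.6 and Sonnet 4.6, respectively). Actual billed amounts depend on your platform and agreement; the helper function is illustrative, not an official calculator.

```python
# Input-cost sketch under flat per-token pricing: no step-function
# at 200K, 500K, or any other threshold.

OPUS_INPUT_PER_MTOK = 5.00    # $ per million input tokens (Opus 4.6)
SONNET_INPUT_PER_MTOK = 3.00  # $ per million input tokens (Sonnet 4.6)

def input_cost(tokens: int, rate_per_mtok: float) -> float:
    """Input cost in dollars; the rate is flat across the window."""
    return tokens / 1_000_000 * rate_per_mtok

# One 900K-token call vs. three 300K-token chunks: identical input cost,
# so chunking no longer saves money on its own.
single = input_cost(900_000, OPUS_INPUT_PER_MTOK)
chunked = sum(input_cost(300_000, OPUS_INPUT_PER_MTOK) for _ in range(3))
print(single, chunked)  # both print 4.5
```

The remaining reasons to chunk are quality and latency, not billing.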
MRCR v2 and GraphWalks BFS: what the benchmarks actually test
Anthropic published two benchmark figures alongside the GA announcement. Understanding what these benchmarks actually measure helps determine how much weight to place on the numbers.
MRCR v2 (Multi-needle Retrieval and Coreference Resolution)
MRCR v2 is a long-context recall benchmark that tests whether a model can accurately retrieve and reason over multiple pieces of information scattered across a very long document. The name references its two core challenges:
Multi-needle retrieval places several distinct facts — "needles" — at different positions in a large context window filled with semantically plausible but irrelevant text. The model must locate all of them and integrate them into a coherent answer. A single-needle test only verifies that the model can find one fact; MRCR v2 tests coherence across multiple simultaneous retrievals.
Coreference resolution tests whether the model correctly tracks entities across long distances. When a name mentioned on page 1 is referenced only by pronoun on page 400, the model must connect them without confabulating a different referent. This is the failure mode that causes "hallucination" in long-document summarization — the model loses track of who did what and substitutes plausible-sounding but incorrect information.
Opus 4.6 scores 78.3% on MRCR v2 at 1M tokens. That means the model correctly resolves and integrates roughly four out of five multi-needle retrieval challenges across a full million-token context. For a document-length context, that is a high bar.
GraphWalks BFS (Breadth-First Search traversal)
GraphWalks BFS tests a different long-context skill: reasoning over relational structure rather than linear text retrieval. The benchmark encodes a graph as a large text document and asks the model to walk paths through that graph correctly. BFS traversal requires the model to track which nodes have already been visited, maintain a frontier of discovered but not-yet-expanded nodes, expand it systematically, and avoid revisiting or skipping nodes — all from in-context state rather than external memory.
Sonnet 4.6 scores 68.4% on GraphWalks BFS at 1M tokens. This matters most for use cases involving knowledge graphs, dependency trees, and codebases where the model needs to trace relationships across a large amount of structure rather than retrieve a specific fact.
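For intuition, the bookkeeping GraphWalks BFS demands is exactly what an explicit BFS implementation does with a queue and a visited set, state the model must instead hold entirely in-context. A minimal reference traversal over a hypothetical graph (not taken from the benchmark itself):

```python
from collections import deque

def bfs_order(graph: dict, start: str) -> list:
    """Visit nodes in breadth-first order, tracking a frontier queue
    and a visited set, the state the benchmark requires the model to
    keep in-context rather than in external memory."""
    visited = {start}
    frontier = deque([start])
    order = []
    while frontier:
        node = frontier.popleft()
        order.append(node)
        for neighbor in graph.get(node, []):
            if neighbor not in visited:    # no revisited nodes
                visited.add(neighbor)
                frontier.append(neighbor)  # no skipped nodes
    return order

# Hypothetical example graph: a diamond a -> {b, c} -> d.
g = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(bfs_order(g, "a"))  # ['a', 'b', 'c', 'd']
```

The benchmark asks the model to reproduce this traversal order from a textual encoding of the graph, which is why revisits and skips are the characteristic failure modes.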
Recall performance against GPT-5.4 and Gemini 3.1 Pro
The frontier model context window landscape changed significantly in early 2026. Comparing Claude's 1M GA to competitors requires accuracy about what each model actually offers.
GPT-5.4, launched by OpenAI on March 5, 2026, also supports a 1 million token context window (specifically 922K input, 128K max output). GPT-5.4 is a capable model with strong benchmark performance across many tasks. However, Anthropic's published MRCR v2 data positions Opus 4.6 at 78.3% at the 1M mark. OpenAI has not published a comparable MRCR v2 figure for GPT-5.4 at 1M tokens. For teams where long-context recall fidelity is the primary selection criterion, the MRCR v2 number is currently the most directly relevant available data point.
Gemini 3.1 Pro, released by Google in February 2026, supports up to 1,048,576 tokens (approximately 1M) of input context. On MRCR v2 at the 128K range, Gemini 3.1 Pro and Claude Opus 4.6 score similarly — both around 84.9% at that shorter context length. At the full 1M token mark, Gemini 3.1 Pro's recall performance is less thoroughly documented than Claude's, and Anthropic explicitly positions Opus 4.6's 78.3% MRCR v2 score as the highest among frontier models at that length.
The comparison is imperfect — different benchmarks test different things — but the directional finding holds: Anthropic has the most transparent, directly stated long-context recall figures at the 1M token mark among the major frontier labs as of this writing.
Media limits: 600 images and PDF pages per request
The sixfold increase in media capacity per request is practically significant for multimodal workflows.
The previous limit of 100 images or PDF pages per request was a hard constraint for certain use cases: an audit workflow scanning a 200-page report, a legal review team processing a multi-exhibit case file, or a medical records pipeline summarizing a year's worth of lab results and imaging reports. Workarounds required either splitting the document across multiple API calls (losing cross-document context) or preprocessing to compress content before submission.
The new limit of 600 images or PDF pages per request substantially changes the calculus for these workflows. A 600-page PDF can be processed in a single call. A batch of product images for catalog enrichment can be submitted together. A legal team can provide an entire deposition transcript, supporting exhibits, and case law references in one context window.
Important caveats apply. Media tokens consume context just as text does — a high-resolution image can cost several hundred to several thousand tokens depending on dimensions and detail level. Developers building pipelines near the 600-media-item ceiling need to account for total token consumption across both media and text components to stay within the 1M token limit.
Anthropic has not specified whether the 600-item limit refers to total media items or to images and pages separately. For workflows mixing images and PDF pages, testing the exact ceiling empirically is recommended until the documentation is more explicit.
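Until the documentation pins down the exact ceiling, a cheap pre-flight budget check avoids failed requests. The sketch below is hypothetical: the 1,500-tokens-per-page figure is purely an illustrative assumption (the article only says media items cost several hundred to several thousand tokens), so measure your own content rather than trusting these constants.

```python
# Pre-flight budget check for a mixed media + text request.
# ASSUMPTION: per-item token counts are supplied by the caller; the
# constants below come from this article, not from a formal spec.

CONTEXT_LIMIT = 1_000_000   # 1M-token context window
MEDIA_ITEM_LIMIT = 600      # images or PDF pages per request

def check_budget(text_tokens: int, media_token_counts: list) -> int:
    """Raise if the request would exceed either ceiling; return total."""
    if len(media_token_counts) > MEDIA_ITEM_LIMIT:
        raise ValueError(
            f"{len(media_token_counts)} media items exceeds the "
            f"{MEDIA_ITEM_LIMIT}-item ceiling")
    total = text_tokens + sum(media_token_counts)
    if total > CONTEXT_LIMIT:
        raise ValueError(
            f"{total} tokens exceeds the {CONTEXT_LIMIT}-token window")
    return total

# 400 pages at an assumed ~1,500 tokens each plus a 50K-token prompt
# fits comfortably under both limits.
check_budget(50_000, [1_500] * 400)
```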
How long-context requests work
For developers who have not used the API above 200K tokens before, there are a few operational realities worth understanding.
Requests over 200K tokens work automatically. There is no mode switch, no header, no configuration toggle. Pass more than 200K tokens in your messages array and the API handles it. The max_tokens parameter continues to control the maximum response length, which remains separate from the input context length.
Latency scales with context. Processing a 1M-token context takes meaningfully longer than processing a 10K-token context. Time-to-first-token will be higher. Streaming responses (via server-sent events) remain the recommended approach for interactive applications — they allow the UI to start rendering output before the full response is generated, masking some of the latency.
Prompt caching works at scale. Anthropic's prompt caching feature, which reduces cost and latency for repeated context prefixes, applies across the full context window. A team building a legal research tool that repeatedly queries the same 500K-token case document can cache that document prefix and pay cache read rates on subsequent calls. Cache read pricing is significantly lower than standard input pricing.
Context compaction is available for agentic workloads. Claude Opus 4.6 introduced adaptive context compaction for long-running agentic sessions. Rather than hitting the context ceiling and failing, the model can intelligently summarize earlier conversation turns to make room for new content. This feature works alongside the 1M context GA rather than being superseded by it — a session that starts compact and grows toward 1M tokens can use compaction to extend its effective lifespan beyond even the 1M ceiling.
Where it is available: platform support and API access
The 1M token context GA is available across three platforms simultaneously.
Claude Platform (api.anthropic.com). Direct API access is available immediately. Authentication uses standard Anthropic API keys. No additional configuration is required. The model identifiers are claude-opus-4-6 and claude-sonnet-4-6 (check the API documentation for the exact versioned identifiers, as Anthropic periodically updates these).
Microsoft Azure AI Foundry. Azure customers accessing Claude through Azure AI Foundry can use the 1M context window without modifications to their existing API integration. Billing continues through Azure's standard model pricing pipeline, which may differ from the list prices above depending on your Azure agreement.
Google Cloud Vertex AI. Vertex AI customers have the same access. The 1M window is available on the same model identifiers used for Claude on Vertex. As with Azure, billing flows through the Google Cloud pricing model, which may reflect Anthropic's list prices or negotiated enterprise rates.
Third-party integrations. Tools and platforms built on top of the Claude API — including coding assistants, document processors, and enterprise software — need to update their own integrations to pass larger context windows if they had been artificially truncating inputs for cost reasons. There is no guarantee that a third-party tool exposes the full 1M context simply because the underlying API now supports it.
Practical use cases that 1M context unlocks
A 1M token context window at flat pricing changes the economics of several categories of application.
Full codebase analysis
A 1M token context can hold approximately 750,000 words or roughly 3–4 million characters of code. That is enough to fit substantial production codebases — a mid-sized TypeScript monorepo, an entire Python data platform, or a large mobile app — in a single context window. Developers can ask the model to trace a bug across the full call graph, refactor an API surface consistently across all call sites, or generate a comprehensive architecture document from the actual code rather than handwritten diagrams.
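A rough packing script makes the budget concrete. The four-characters-per-token figure is a common heuristic rather than the tokenizer, and the helper itself is hypothetical; for production use, count tokens with the provider's tooling.

```python
# Pack a source tree into one prompt string with a rough token budget.
from pathlib import Path

def pack_repo(root: str, suffixes=(".py", ".ts"),
              limit_tokens=1_000_000) -> str:
    """Concatenate matching source files under a heuristic token cap
    (~4 characters per token; an estimate, not the real tokenizer)."""
    parts, total = [], 0
    for path in sorted(Path(root).rglob("*")):
        if path.suffix not in suffixes or not path.is_file():
            continue
        text = path.read_text(errors="ignore")
        total += len(text) // 4
        if total > limit_tokens:
            raise ValueError(
                "codebase exceeds the context budget; split or use retrieval")
        # Label each file so the model can cite paths in its answer.
        parts.append(f"### {path}\n{text}")
    return "\n\n".join(parts)
```

The resulting string becomes the bulk of a single user message, with the actual question appended at the end.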
Legal and compliance document review
Complex M&A transactions, regulatory filings, or litigation matters can involve thousands of pages of documents. At 600 PDF pages per request, a legal team can submit a full deal room's worth of contracts and ask the model to identify inconsistent definitions, missing representations, or conflicting indemnification clauses across the entire set.
Research synthesis
Academic literature reviews, policy analyses, and competitive intelligence reports often require synthesizing dozens of long documents. A researcher can submit an entire corpus of papers and ask the model to synthesize findings, identify methodological inconsistencies, or map citation relationships — without the information loss that comes from chunking documents and retrieving only the most-similar fragments.
Extended agentic sessions
Autonomous AI agents running multi-step tasks accumulate context as they work. An agent that retrieves web pages, runs code, reads files, and refines its approach can operate for much longer without hitting context limits. Combined with context compaction for sessions that exceed even 1M tokens, Claude agents can now handle genuinely long-running tasks.
Financial data analysis
Earnings calls, analyst reports, SEC filings, and market data can be substantial in aggregate. A 1M token window allows an analyst to load multiple quarters of filings alongside the relevant earnings calls and ask the model to explain the delta in margin guidance, trace changes in accounting treatment, or identify disclosed risks that evolved over time.
Limitations and things to watch
Transparency requires acknowledging where the 1M context window does not fully solve the underlying problems.
Recall degrades with context length. The 78.3% MRCR v2 score for Opus 4.6 is the highest published figure at 1M tokens, but it also means roughly one in five multi-needle retrieval tasks fails at that context length. For most use cases, this accuracy level is sufficient. For applications where every missed reference has legal or financial consequences, human review of model outputs remains necessary.
Cost at scale is still real. A 1M input token request with Opus 4.6 costs $5.00 for the input alone. An application that makes hundreds of such calls per day accumulates meaningful API costs. The elimination of the premium surcharge changes the unit economics, but does not make long-context calls free.
Latency for interactive applications. A million-token context incurs processing time before the model begins generating output. Applications that require near-real-time response — a customer-facing chat interface, for example — may find 1M token calls impractical. The appropriate use of this capability is for batch, asynchronous, or power-user workflows rather than latency-sensitive interactions.
Model identifier stability. Anthropic periodically updates specific model identifiers. Production applications should pin to a specific versioned model identifier rather than a generic alias to avoid unexpected behavior changes when Anthropic updates the underlying model weights.
The 600-item media limit is still a ceiling. While the sixfold increase helps significantly, very large document sets still require batching. A warehouse of 10,000 product images, for example, still needs to be split across multiple requests.
What this means for teams building on Claude
The practical implication of the 1M token GA at flat pricing is the elimination of a planning assumption that has constrained Claude-based application architecture since long-context APIs became available.
Teams that had designed chunking and retrieval pipelines to stay under 200K tokens — or to manage the premium pricing above that — now face a genuine architectural choice. Passing more context directly is not always the right answer (retrieval-augmented generation often produces better results because it focuses the model on the most relevant content), but it is now a legitimate engineering trade-off to evaluate on its merits rather than avoid by default.
For greenfield applications, the 1M context GA effectively commoditizes "fit the whole document" as a capability. The differentiation question shifts toward what the model does with that context — the quality of reasoning, the consistency of output, and the accuracy of recall — rather than whether the context can be accepted at all.
The MRCR v2 and GraphWalks BFS scores represent Anthropic's current answer to the recall quality question. A 78.3% recall rate at 1M tokens for Opus 4.6 is a strong published figure. Teams evaluating Claude against alternatives should request comparable benchmark data from other vendors at equivalent context lengths before drawing conclusions.
The generally available 1M context is also a statement about Anthropic's infrastructure confidence. Moving a capability from beta to GA signals that the engineering team believes the feature is stable, performant, and supportable at scale. The infrastructure investment required to serve 1M token requests reliably at production volumes is substantial. GA status implies Anthropic is committed to maintaining that service level.
20 frequently asked questions
1. What models support the 1M token context window?
Claude Opus 4.6 and Claude Sonnet 4.6. Both are now generally available with the full 1M token context at standard pricing as of March 13, 2026.
2. Do I need to change my API code to use the 1M context?
No. Remove any anthropic-beta header if you had set it for long-context access. Beyond that, simply pass more tokens in the messages array. No other configuration change is required.
3. What does 1 million tokens correspond to in words or pages?
Roughly 750,000 words, approximately 1,500 pages of standard text, or 3–4 million characters of code depending on the language and formatting.
4. Is the pricing really flat across the full 1M window?
Yes. $5.00 per million input tokens for Opus 4.6 and $3.00 per million input tokens for Sonnet 4.6, regardless of how much of the context window you use. There is no surcharge at any token count.
5. How does this compare to what I was paying during the beta?
During the beta period, requests above 200K tokens were subject to a pricing multiplier. That multiplier is now gone. If you were using long-context in beta, your costs for equivalent calls will be lower.
6. What is MRCR v2 and why does Anthropic cite it?
MRCR v2 (Multi-needle Retrieval and Coreference Resolution) tests whether a model can accurately find and integrate multiple facts scattered across a large context. It is a proxy for real-world long-document recall quality. Anthropic cites Opus 4.6's 78.3% score as evidence of reliable recall at 1M tokens.
7. What is GraphWalks BFS and why does it matter?
GraphWalks BFS tests whether a model can correctly traverse a graph encoded as text — visiting nodes in breadth-first order without revisiting or skipping. It measures structured relational reasoning rather than fact retrieval. Sonnet 4.6 scores 68.4% on this benchmark at 1M tokens.
8. How does Claude Opus 4.6 compare to GPT-5.4 on long-context recall?
GPT-5.4 also supports approximately 1M tokens of context (922K input, 128K output) but OpenAI has not published a comparable MRCR v2 score at 1M tokens. The available data makes a direct apples-to-apples comparison difficult as of this writing.
9. How does Claude compare to Gemini 3.1 Pro?
Gemini 3.1 Pro supports approximately 1M input tokens. At the 128K context length, both Gemini 3.1 Pro and Claude Opus 4.6 score similarly on MRCR v2. At the full 1M token mark, Anthropic claims the highest published recall figure among frontier models. Google has not published an equivalent 1M-token MRCR v2 score for Gemini 3.1 Pro.
10. Does prompt caching work across the full 1M window?
Yes. Prompt caching applies across the entire context. Repeated calls using a shared context prefix (such as a large document that stays constant while the question changes) benefit from cache read pricing on subsequent calls, which is significantly cheaper than full input pricing.
11. What happened to the 100-image and 100-PDF-page limit?
It was raised sixfold. The new limit is 600 images or PDF pages per request. This applies to both Opus 4.6 and Sonnet 4.6.
12. Can I use this on Azure and Google Cloud, or only directly through Anthropic?
All three platforms are supported at GA. Claude via Microsoft Azure AI Foundry and Google Cloud Vertex AI both support the 1M context window. Billing for Azure and Vertex goes through those platforms' respective pricing models.
13. Will the 1M context window be added to older Claude models?
Anthropic has not announced plans to backport the 1M context GA to Claude 3.x models. The announcement specifically covers Claude Opus 4.6 and Claude Sonnet 4.6.
14. Does Haiku support 1M tokens?
Claude Haiku is not included in the 1M GA announcement. Check the Claude models documentation for the current context window specifications for each model tier.
15. How should I handle latency for 1M token requests?
Use streaming (server-sent events) for interactive applications to start receiving output before the full response is complete. For batch workloads, design for asynchronous processing with appropriate timeout handling. Do not set short synchronous timeouts on long-context calls.
16. What is context compaction and how does it interact with the 1M window?
Context compaction, available for Opus 4.6, intelligently summarizes earlier conversation turns when a session approaches the context ceiling. It works alongside the 1M window, not instead of it — a very long agentic session that grows past 1M tokens can use compaction to continue operating.
17. Are there use cases where chunked retrieval is still better than full-context?
Yes. RAG (retrieval-augmented generation) focuses the model on the most relevant content, which often improves output quality by reducing noise. For tasks where precision matters more than comprehensive coverage — targeted question answering, for example — a well-designed retrieval pipeline may outperform stuffing a full document into context.
18. What model identifier should I use in production?
Use a specific versioned model identifier rather than a generic alias. Anthropic updates model aliases over time. Check the official models documentation for the current pinnable identifiers for Opus 4.6 and Sonnet 4.6.
19. Does the 1M context window apply to the claude.ai consumer product?
The GA announcement focuses on the API. The claude.ai web and app experience may reflect different context limits depending on the subscription tier. Check the current claude.ai plan documentation for consumer-facing context window specifications.
20. Where can I read Anthropic's official announcement?
The official blog post is at claude.com/blog/1m-context-ga. The X announcement is at x.com/claudeai/status/2032509548297343196. API documentation for the affected models is at platform.claude.com/docs/en/about-claude/models/whats-new-claude-4-6.