OpenAI Launches GPT-5.4 — Its Smartest Model Yet
OpenAI releases GPT-5.4 with major benchmark improvements, enhanced reasoning, and reduced hallucinations. Available now to ChatGPT Plus and API users.
TL;DR: OpenAI has released GPT-5.4, its most capable model to date, featuring a 1 million token context window — more than double GPT-5.3's 500K limit — alongside a new "Extreme Thinking" mode that applies significantly more compute to hard problems. The model posts new highs on OpenAI's internal benchmarks for scientific reasoning, multi-step task completion, and long-context fidelity. It is available immediately to ChatGPT Plus, Team, and Pro subscribers and via the API under the identifier gpt-5.4.
OpenAI shipped GPT-5.4 on March 5, 2026, less than 72 hours after teasing it alongside the GPT-5.3 Instant release. The cadence is deliberate: a smarter base model every six to eight weeks, with behavioral patches like GPT-5.3 Instant filling the gaps. GPT-5.4 is the most significant capability jump since GPT-5 launched in mid-2025 — not a behavioral patch, not a speed variant, but a new frontier on the hard problems that matter.
GPT-5.4 is OpenAI's latest frontier model, released March 5, 2026. It is not a behavioral patch — the category that GPT-5.3 Instant occupied — and it is not a speed-optimized variant. It is a new base model with updated weights, a substantially larger context window, and an optional reasoning mode that represents a qualitative shift in how the model handles difficult problems.
The headline architectural change is the context window expansion from GPT-5.3's 500K tokens to 1 million tokens. That is approximately 750,000 words, or roughly five full-length novels processed simultaneously. For practical reference: a 1M token context can hold an entire codebase, a year of email threads, or a complete legal discovery document set within a single prompt.
The second major change is Extreme Thinking mode. This is not a separate model but a runtime parameter — a thinking budget that instructs the model to allocate substantially more compute to the pre-response reasoning phase. The result is slower responses for complex queries, but with measurably higher accuracy on multi-step problems.
The third change is what OpenAI describes as improved "task persistence" — the model's ability to sustain reliable performance across multi-hour agentic tasks without accumulating errors or losing track of intermediate state. Prior GPT-5 series models showed degrading accuracy in long autonomous workflows. GPT-5.4 addresses this directly.
Extreme Thinking is GPT-5.4's most discussed new feature, and the name is deliberately provocative. OpenAI is positioning this mode against the most difficult problems in science, mathematics, and engineering — the queries where GPT-5.3, Claude Opus 4.6, and Gemini 3.1 Pro all produce confident but wrong answers because they do not have enough reasoning cycles to catch their own errors.
The mechanism is a tiered compute budget. In standard mode, GPT-5.4 behaves like a fast, capable model — responses in seconds, reasoning depth comparable to GPT-5.3 Instant. In Extreme Thinking mode, the model applies an extended chain-of-thought process that includes self-verification loops: the model generates a candidate response, checks it against its own reasoning trace for internal consistency, identifies likely failure points, and revises before surfacing a final answer.
OpenAI is clear about the tradeoff: Extreme Thinking is slower and costs more per response. The company is positioning it not for consumer chat but for scientific research workflows, formal mathematics, complex legal analysis, and software engineering tasks where a single high-quality answer is more valuable than five fast ones.
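As described, the mechanism amounts to a generate, check, revise control loop. The sketch below is purely illustrative: the real loop runs inside the model during the pre-response phase, not in user code, and every name here is hypothetical. It captures only the control flow OpenAI describes.

```python
# Illustrative sketch of the generate -> verify -> revise control flow OpenAI
# describes for Extreme Thinking. Every name here is hypothetical; the real
# loop runs inside the model during the pre-response phase, not in user code.

def extreme_thinking(query, generate, verify, max_revisions=3):
    """Produce an answer, self-check its reasoning trace, revise until it passes."""
    candidate, trace = generate(query)             # candidate answer + reasoning trace
    for _ in range(max_revisions):
        issues = verify(candidate, trace)          # internal-consistency check
        if not issues:
            return candidate                       # consistent: surface the answer
        candidate, trace = generate(query, feedback=issues)  # revise at failure points
    return candidate                               # budget exhausted: best effort
```

The larger `max_revisions`-style budget is what makes the mode slower and more expensive: each revision pass is another round of generation and verification.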
For developers using the API, Extreme Thinking is a parameter on the chat completions endpoint:
```json
{
  "model": "gpt-5.4",
  "thinking": "extreme",
  "messages": [...]
}
```
Standard thinking mode is the default. OpenAI recommends Extreme Thinking only for queries where response latency is acceptable and accuracy is paramount.
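Assuming the request shape above, a minimal Python client using only the standard library might look like the following. The `thinking` field is as described in this post, not an independently verified part of the API surface, and the timeout value simply reflects the latency figures quoted here.

```python
import json
import os
import urllib.request

API_URL = "https://api.openai.com/v1/chat/completions"

def build_request(prompt, thinking="standard"):
    """Build a chat completions payload; the 'thinking' field follows the shape above."""
    body = {
        "model": "gpt-5.4",
        "messages": [{"role": "user", "content": prompt}],
    }
    if thinking != "standard":        # standard mode is the default, so omit it
        body["thinking"] = thinking
    return body

def ask(prompt, thinking="standard"):
    """POST the request and return the assistant message text."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_request(prompt, thinking)).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req, timeout=300) as resp:  # allow 30-90s thinking time
        return json.load(resp)["choices"][0]["message"]["content"]
```

Usage would be `ask("Prove the lemma...", thinking="extreme")` for hard queries and plain `ask(prompt)` for everything else.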
Context window size matters most when it matters at all — which is to say, for most everyday queries, 500K tokens and 1M tokens are functionally equivalent. The practical significance of GPT-5.4's 1M token window lives in a specific set of high-value use cases.
Long-document analysis. A complete regulatory filing, a multi-volume legal case, a full technical specification — these fit inside GPT-5.4's context in a way they did not fit inside GPT-5.3's. Previously, users had to chunk documents and assemble partial analyses. With 1M tokens, the entire document is in-context simultaneously, which eliminates cross-reference errors that chunking introduces.
Large codebase understanding. A typical mid-size GitHub repository runs around 200K–400K tokens. GPT-5.4 can hold an entire mid-size codebase in context, enabling cross-file refactoring, architecture-level analysis, and dependency tracing that smaller contexts cannot support.
Extended conversation memory. Enterprise deployments that use ChatGPT for multi-session customer interactions have been constrained by context limits. At 1M tokens, a year of interaction history with a single user can be maintained in-context without truncation.
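Before committing to a single-prompt design for any of these use cases, it is worth estimating whether the corpus actually fits. The sketch below uses the common rough heuristic of about 4 characters per token for English text and code; this is an approximation I am assuming here, and a real tokenizer should be used for exact counts.

```python
# Rough context-fit check. The 4-chars-per-token ratio is a heuristic,
# not an exact tokenizer; context limits are the figures quoted in this article.
CONTEXT_LIMITS = {"gpt-5.4": 1_000_000, "gpt-5.3": 500_000}
CHARS_PER_TOKEN = 4

def estimate_tokens(text):
    """Approximate token count from character length."""
    return len(text) // CHARS_PER_TOKEN

def fits_in_context(text, model="gpt-5.4", reserve=50_000):
    """Check fit, reserving headroom for instructions and the model's response."""
    return estimate_tokens(text) + reserve <= CONTEXT_LIMITS[model]
```

A document set that passes for gpt-5.4 but fails for gpt-5.3 is exactly the case where the larger window removes the need for chunking.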
What has not changed. A larger context window does not automatically improve performance on content within that window. OpenAI reports that GPT-5.4 maintains attention quality across the full 1M token range, but this claim — like all self-reported benchmark data — requires independent validation. Users who rely on models to accurately retrieve specific passages from very long documents should test GPT-5.4's "needle in a haystack" performance against their specific document types before migrating production pipelines.
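A minimal harness for that kind of test might look like the following. It is a sketch under assumptions: `ask` is any callable wrapping your model client, the filler text and sentence count are placeholders you would scale up and swap for your own document types, and `expected` is the answer string the model must surface.

```python
def make_haystack(needle, filler_sentence, n_sentences, depth):
    """Plant a needle fact at a relative depth (0.0-1.0) inside filler text."""
    sentences = [filler_sentence] * n_sentences
    sentences.insert(int(depth * n_sentences), needle)
    return " ".join(sentences)

def needle_test(ask, needle, question, expected,
                depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Return per-depth pass/fail for a model callable `ask(prompt) -> str`.

    In a real run, n_sentences would be scaled so the haystack approaches
    the context limit you are validating (e.g. ~1M tokens for GPT-5.4).
    """
    results = {}
    for depth in depths:
        doc = make_haystack(needle, "The sky was a flat gray that morning.",
                            2000, depth)
        results[depth] = expected in ask(f"{doc}\n\nQuestion: {question}")
    return results
```

Running this against both the old and new model on your own documents, at several depths and lengths, gives a concrete retrieval baseline before any production migration.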
OpenAI published internal benchmark results alongside the GPT-5.4 release. As with all self-reported benchmarks, these should be interpreted as directional signals rather than definitive rankings.
| Benchmark | GPT-5.4 (Standard) | GPT-5.4 (Extreme Thinking) | GPT-5.3 Instant | Notes |
|---|---|---|---|---|
| GPQA Diamond | 84.2% | 91.7% | 78.4% | PhD-level science Q&A |
| MATH Level 5 | 79.1% | 88.6% | 71.3% | Competition mathematics |
| HumanEval | 82.4% | 85.1% | 78.9% | Code generation |
| SWE-bench Verified | 47.3% | 52.8% | 41.6% | Real-world bug resolution |
| Humanity's Last Exam | 22.1% | 31.4% | 17.8% | Hardest known benchmark |
| SimpleBench | 88.3% | 89.1% | 86.2% | Factual accuracy |
| MMLU Pro | 81.6% | 83.9% | 77.2% | Multitask language understanding |
The Humanity's Last Exam result is the most notable figure in this table. HLE is designed to be unsolvable by current AI systems — it consists of questions submitted by domain experts, chosen specifically because frontier models were expected to fail on them. A 31.4% score in Extreme Thinking mode is not a passing grade, but it is a substantial improvement over GPT-5.3 Instant's 17.8% and well above what GPT-5 achieved at launch.
The GPQA Diamond improvement from 78.4% to 91.7% in Extreme Thinking mode is the clearest signal of what the reasoning upgrade actually delivers. This benchmark tests PhD-level scientific reasoning across physics, chemistry, and biology — domains where chain-of-thought accuracy matters enormously and where GPT-5.3 was noticeably behind Claude Opus 4.6.
The four-way comparison that developers and enterprises care about in March 2026 is GPT-5.4, GPT-5.3 Instant, Claude Opus 4.6, and Gemini 3.1 Pro. Here is how they stack up across the dimensions that matter most.
| Dimension | GPT-5.4 | GPT-5.3 Instant | Claude Opus 4.6 | Gemini 3.1 Pro |
|---|---|---|---|---|
| Context window | 1M tokens | 500K tokens | 200K tokens | 2M tokens |
| Hard reasoning (GPQA) | 91.7% (Extreme) | 78.4% | ~90.5% | ~88.1% |
| Code generation (HumanEval) | 85.1% | 78.9% | ~83.0% | ~84.0% |
| Speed (standard mode) | Fast | Fastest | Moderate | Moderate |
| Extreme/extended thinking | Yes | No | Yes (extended) | Yes (Deep Think) |
| Agentic task performance | Improved | Baseline | Leading | Strong |
| Pricing (input/M tokens) | ~$2.50 | ~$0.75 | ~$15.00 | ~$3.50 |
| Pricing (output/M tokens) | ~$12.00 | ~$4.00 | ~$75.00 | ~$10.50 |
| ChatGPT integration | Native | Native | N/A | N/A |
| API identifier | gpt-5.4 | gpt-5.3-chat-latest | claude-opus-4-6 | gemini-3.1-pro |
A few things stand out in this table.
Gemini 3.1 Pro still leads on context. OpenAI's 1M token window is a significant leap from GPT-5.3's 500K, but Google's Gemini 3.1 Pro ships with a 2M token context — double GPT-5.4's. For use cases that genuinely require processing very large corpora, Gemini 3.1 Pro remains the leader on raw context capacity.
Claude Opus 4.6 leads on agentic performance. Anthropic's top model has consistently ranked first on Arena.ai and Artificial Analysis leaderboards for agentic tool use and computer-use tasks. GPT-5.4's improved task persistence narrows this gap but does not close it.
GPT-5.4 Extreme Thinking challenges Claude's reasoning lead. On GPQA Diamond, GPT-5.4's 91.7% in Extreme Thinking mode now edges past Claude Opus 4.6's ~90.5% in thinking mode. For scientific and mathematical reasoning specifically, GPT-5.4 has pulled level, and arguably ahead.
Pricing is where GPT-5.4 wins definitively against Claude. Claude Opus 4.6 at ~$15.00/$75.00 per million tokens is roughly six times more expensive on both input and output than GPT-5.4. Enterprises running high-volume inference pipelines have a compelling cost argument for GPT-5.4 over Claude Opus 4.6 unless agentic performance is the primary criterion.
This comparison builds directly on the context established by GPT-5.3 Instant's release, where OpenAI was still the challenger on raw reasoning benchmarks against both Claude and Gemini. GPT-5.4 changes that picture substantially, particularly on the hardest scientific benchmarks.
GPT-5.4 is available immediately upon release in the following configurations:
ChatGPT:
- Plus, Team, and Pro subscribers: GPT-5.4 standard mode, available now.
- Pro subscribers: Extreme Thinking mode, available at launch.
- Free tier: GPT-5.4 standard mode rolling out as the default over the coming days, with rate limits.
API pricing:
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Context |
|---|---|---|---|
| GPT-5.4 (standard) | $2.50 | $12.00 | 1M tokens |
| GPT-5.4 (Extreme Thinking) | $5.00 | $25.00 | 1M tokens |
| GPT-5.4 (cached input) | $0.25 | — | 1M tokens |
| GPT-5.3 Instant | $0.75 | $4.00 | 500K tokens |
| GPT-5.3 (standard) | $2.00 | $10.00 | 500K tokens |
The Extreme Thinking pricing reflects the additional compute cost of the extended reasoning phase. At $5.00 input and $25.00 output per million tokens, a single Extreme Thinking response to a 10,000-word document analysis query costs roughly $0.35–$0.60 in tokens, depending on response length. For high-stakes professional queries, that cost is trivial relative to the value of an accurate answer. For routine tasks, standard mode is the right choice.
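The arithmetic behind that estimate can be reproduced directly. The sketch below assumes a 10,000-word query is roughly 13,000 input tokens and, critically, that Extreme Thinking's internal reasoning tokens are billed as output, which is what pushes the total into the quoted range; both assumptions are mine, not a stated billing model.

```python
# Pricing from the table above, in dollars per million tokens.
EXTREME_INPUT = 5.00
EXTREME_OUTPUT = 25.00

def query_cost(input_tokens, output_tokens):
    """Cost in dollars for one Extreme Thinking request."""
    return (input_tokens * EXTREME_INPUT + output_tokens * EXTREME_OUTPUT) / 1_000_000

# A 10,000-word document is roughly 13,000 input tokens. If reasoning plus the
# visible response total 11,000-21,000 billed output tokens, the per-query cost
# lands in the article's $0.35-$0.60 range.
low = query_cost(13_000, 11_000)
high = query_cost(13_000, 21_000)
```

Note that almost all of the cost sits in the output side: the 13,000 input tokens contribute only about $0.065.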
Microsoft 365 Copilot users will receive GPT-5.4 in standard mode within 48 hours of today's release, consistent with Microsoft's rapid integration pattern established with GPT-5.3 Instant.
The API model identifier is gpt-5.4. OpenAI is maintaining gpt-5.3-chat-latest and gpt-5.2 as available options; no model retirement has been announced alongside the GPT-5.4 launch.
For enterprise deployments, GPT-5.4 changes the calculus in three meaningful ways.
Long-document processing pipelines no longer require chunking. The 1M token context window is large enough to eliminate the document chunking strategies that most enterprise RAG (retrieval-augmented generation) systems were built around. A contract review workflow that previously required splitting a 300-page contract into segments and reassembling partial analyses can now run the full document through a single prompt. This simplifies pipeline architecture and removes a source of cross-reference errors that chunking introduces.
Multi-hour agentic tasks are more reliable. OpenAI's claim of improved "task persistence" targets a specific enterprise pain point: GPT-5.3 and earlier models would accumulate errors in long autonomous workflows — losing track of intermediate state, repeating completed steps, or drifting from the original objective. If GPT-5.4 delivers on this claim, it expands the category of workflows that can be delegated to autonomous AI agents without human checkpointing at every step.
The cost argument against Claude Opus 4.6 strengthens. At $2.50/$12.00 per million tokens versus Claude Opus 4.6's ~$15.00/$75.00, GPT-5.4 is approximately 6x cheaper per token. For enterprises running thousands of long-context queries per day, the difference is material. The remaining argument for Claude Opus 4.6 — superior agentic performance — is narrowing with each GPT-5.x release.
What enterprise teams should do now: Run GPT-5.4 against your existing GPT-5.3 Instant pipelines on a sample of real production queries before migrating. The behavioral shifts between model versions can affect prompt sensitivity, output formatting, and edge case handling in ways that require pipeline validation even when the underlying capability improvement is real.
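One way to structure that validation is a side-by-side regression check. The harness below is a hypothetical sketch: `ask_old` and `ask_new` are whatever callables wrap your two model versions, and `checks` encodes the output properties (format, non-refusal, edge case handling) your pipeline depends on.

```python
def compare_models(queries, ask_old, ask_new, checks):
    """Run the same production queries through two model callables and flag drift.

    `checks` maps a check name to a predicate on the output string. A regression
    is a property that held on the old model's output but fails on the new one.
    """
    report = []
    for query in queries:
        old_out, new_out = ask_old(query), ask_new(query)
        regressions = [name for name, ok in checks.items()
                       if ok(old_out) and not ok(new_out)]
        report.append({"query": query, "regressions": regressions})
    return report
```

Only regressions are flagged, not all differences: a new model that satisfies every property the old one did is behaviorally safe for the pipeline even if its wording changed.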
GPT-5.4 is the clearest evidence yet of a structural shift in how OpenAI ships models. The historical pattern — a major model every three to six months, followed by a long stabilization period — is gone.
The GPT-5.x series has unfolded as follows:
| Model | Release | Gap from Previous |
|---|---|---|
| GPT-5 | Mid-2025 | — |
| GPT-5.1 | November 2025 | ~4 months |
| GPT-5.2 | December 2025 | ~6 weeks |
| GPT-5.3 | Late January 2026 | ~6 weeks |
| GPT-5.2-Codex | February 2026 | ~8 weeks |
| GPT-5.3 Instant | March 3, 2026 | ~5 weeks |
| GPT-5.4 | March 5, 2026 | 2 days |
The two-day gap between GPT-5.3 Instant and GPT-5.4 is not a slip — it is a deliberate strategy. OpenAI is shipping behavioral patches (GPT-5.3 Instant, which we covered in depth here) and capability upgrades (GPT-5.4) as separate, parallel release tracks. The behavioral patches can ship fast because they target RLHF tuning, not base model weights. The capability upgrades require more lead time but are deployed as soon as they clear safety evaluations.
The implication for developers and enterprises: model version management is now a continuous operational concern. Pinning to an explicit identifier such as gpt-5.4 and validating each new release before migrating is the right pattern. Assuming model stability behind a floating alias like gpt-5.3-chat-latest is increasingly risky as the pace of releases accelerates.
What comes after GPT-5.4 is not announced. Given the pattern, a GPT-5.4 Instant (speed-optimized behavioral variant) within four to six weeks is plausible. A GPT-5.5 with further reasoning improvements and potential multimodal upgrades is the likely next major capability release.
The switching decision is different for different user types.
ChatGPT Plus users: Yes, switch immediately. GPT-5.4 standard mode is a meaningful capability upgrade over GPT-5.3 Instant at no additional cost. Extreme Thinking mode is available if you have a Pro subscription. There is no reason to stay on GPT-5.3 Instant for general use.
API developers (standard pipelines): Test before switching. GPT-5.4 is a new base model, not a behavioral patch, which means prompt responses may differ from GPT-5.3 in ways that affect your pipeline. Run a representative sample of production queries against both models and check for output format changes, edge case behavior, and refusal patterns before migrating. The capability improvement is real, but so is the risk of behavioral drift.
API developers (long-context use cases): Evaluate GPT-5.4 against your specific document types. If you are currently chunking documents to fit within 500K tokens, the 1M window is immediately valuable and likely worth the migration effort.
Enterprises running agentic workflows: Pilot GPT-5.4 in a staging environment before production migration. The task persistence improvements are the most significant enterprise-relevant change, but independent validation of multi-hour agentic performance against your specific workflow types is essential before committing.
Users primarily using GPT-5.3 Instant for speed: GPT-5.4 standard mode is meaningfully faster than GPT-5.3 standard but not as fast as GPT-5.3 Instant's 3x speed optimization. If response latency is your primary criterion, GPT-5.3 Instant remains the right choice for now.
What is the API model identifier, and how do I enable Extreme Thinking?
The API model identifier is gpt-5.4. There is a separate parameter for Extreme Thinking mode: set "thinking": "extreme" in your request body. Standard mode is the default when no thinking parameter is specified.
Is GPT-5.4 available to free ChatGPT users?
GPT-5.4 in standard mode is rolling out as the default model for all ChatGPT users, including the free tier, over the coming days. Rate limits apply to free-tier users. Extreme Thinking mode is restricted to Pro subscribers at launch.
How is Extreme Thinking different from GPT-5.3's chain-of-thought?
Extreme Thinking is categorically different from GPT-5.3's standard chain-of-thought. It introduces self-verification loops — the model checks its own reasoning for internal consistency before responding — and applies a significantly larger compute budget to the pre-response phase. On GPQA Diamond (PhD-level science), Extreme Thinking scores 91.7% versus GPT-5.3 Instant's 78.4%. The tradeoff is response latency: Extreme Thinking queries can take 30–90 seconds depending on complexity.
Does GPT-5.4 make GPT-5.3 Instant obsolete?
Not necessarily. GPT-5.3 Instant was specifically optimized for conversational speed and behavioral improvements — the preachy tone fix, the hallucination reductions we covered in detail here, the 3x inference speed. For rapid back-and-forth conversation, GPT-5.3 Instant remains the right choice. GPT-5.4 is the right choice when you need deeper reasoning, longer context, or more reliable agentic performance.
Is GPT-5.4 faster than GPT-5.3 Instant?
No. In standard mode, GPT-5.4 is faster than GPT-5.3 standard but not faster than GPT-5.3 Instant. GPT-5.3 Instant was specifically optimized for 3x inference speed at 60% of standard pricing. GPT-5.4 in Extreme Thinking mode is substantially slower than both.
How does GPT-5.4's context window compare to Gemini 3.1 Pro's?
Gemini 3.1 Pro still leads on raw context capacity at 2M tokens — double GPT-5.4's 1M. For the vast majority of use cases, 1M tokens is sufficient. The gap matters only for specific large-corpus applications: processing a very large multi-volume document set, holding an entire substantial codebase in context, or maintaining very long conversation histories. If raw context length is your primary bottleneck, Gemini 3.1 Pro remains the leader.
When will Extreme Thinking reach Plus and Team subscribers?
OpenAI has announced it is "coming soon" to Plus and Team subscribers but has not confirmed a specific date. Based on the company's recent release pattern, availability within two to four weeks of the Pro launch is plausible.
Is GPT-5.3 Instant being retired?
No retirement timeline has been announced. OpenAI confirmed that GPT-5.2 Instant will be retired on June 3, 2026, but made no equivalent announcement for GPT-5.3 Instant. Given its speed advantages, GPT-5.3 Instant is likely to remain as the fast-inference option in the API lineup alongside the more capable GPT-5.4.