Claude Opus 4 and Sonnet 4 launch with hybrid thinking and SWE-bench records
Anthropic releases Claude Opus 4 and Sonnet 4 with hybrid instant-and-extended thinking, setting new SWE-bench records at 72.5% and 72.7% respectively.
TL;DR: Anthropic has released Claude Opus 4 and Claude Sonnet 4, the first models in the Claude lineup to combine an instant response mode and extended thinking in a single hybrid model. Opus 4 scores 72.5% on SWE-bench Verified and 43.2% on Terminal-bench, and Anthropic is calling it the "world's best coding model" for sustained agentic work. Sonnet 4 scores 72.7% on SWE-bench, fractionally ahead of Opus 4 on that benchmark, at one-fifth the price: $3/$15 per million tokens versus Opus 4's $15/$75. Both models can use tools during extended thinking, support parallel tool calls, and ship with significantly improved memory capabilities for long-running agent workflows. Pro, Max, Team, and Enterprise plans include both models plus extended thinking; Sonnet 4 is also available to free users.
Every AI model before the Claude 4 generation made a binary choice at training time: reason fast or reason slow. Fast models returned answers quickly but could not pause to verify multi-step logic. Extended thinking models — like Claude 3.7 Sonnet — could work through hard problems, but the extended thinking mode was a distinct operating state, not something you could dial up or down per request.
Claude Opus 4 and Sonnet 4 change that. Both are hybrid models, meaning a single set of model weights supports two operating modes: near-instant responses for straightforward queries, and extended thinking for problems that benefit from deeper reasoning. The model does not switch between different checkpoints. The same weights produce both behaviors, controlled by the developer at inference time through effort controls that modulate how much thinking the model does before generating output.
This is the architecture that matters more than any single benchmark number. It means developers building on the Claude API can write applications where the model reasons extensively on hard subproblems and returns quickly on easy ones — within the same session, the same context window, the same pricing tier. The intelligence-to-cost ratio becomes dynamic rather than fixed.
The practical implication is significant. A coding agent using Claude 4 can return instant autocomplete suggestions for boilerplate code, spend thirty seconds thinking through a complex algorithm before proposing a solution, and decide on its own which mode is appropriate for each step. Previous architectures forced a single setting for the entire session.
Anthropic has also added effort controls — developer-configurable parameters that let you specify how much extended thinking to apply and at what cost. This is a direct response to the criticism that extended thinking models are expensive for production workloads: you now have the knobs to tune the cost-intelligence trade-off per call.
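As a sketch of what that per-call dial looks like in practice: the Messages API exposes extended thinking through a `thinking` parameter with a token budget. The field names below follow the public API shape; the model IDs and the budget values are illustrative placeholders, not official identifiers.

```python
def build_request(prompt: str, thinking_budget: int = 0,
                  model: str = "claude-sonnet-4-0") -> dict:
    """Build a Messages API request body with an optional thinking budget.

    thinking_budget=0 means near-instant mode; a positive budget enables
    extended thinking. max_tokens must exceed the thinking budget so the
    model still has room for the visible answer.
    """
    body = {
        "model": model,
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": prompt}],
    }
    if thinking_budget > 0:
        body["thinking"] = {"type": "enabled", "budget_tokens": thinking_budget}
        body["max_tokens"] = thinking_budget + 1024  # room for the answer
    return body

# Same session, different effort per call:
quick = build_request("Rename this variable for clarity: usr_nm")
deep = build_request("Find the race condition in this scheduler.",
                     thinking_budget=8000)
```

The cost-intelligence trade-off then lives in one integer per request rather than in a per-session model choice.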
SWE-bench Verified is the benchmark that matters most to developers because it tests something real: given an actual GitHub issue from a real open-source repository, can the model resolve it by writing working code? Not by describing what it would do. Not by proposing a patch. By submitting code that passes the repository's existing test suite.
Claude Opus 4 scores 72.5% on SWE-bench Verified. Claude Sonnet 4 scores 72.7% — a hair ahead. Both results represent a substantial advance over Claude 3.7 Sonnet's prior scores and position the Claude 4 family at or near the top of the global benchmark leaderboard at launch.
To calibrate what these numbers mean: a score of 72% means the model resolved 72 out of every 100 real-world GitHub issues it was given. The test set spans Python repositories across multiple domains — web frameworks, data science libraries, CLI tools, API clients. The issues range from one-line fixes to multi-file refactors that require understanding a codebase's internal architecture. There is no test-set memorization advantage available; the issues were selected specifically to resist contamination.
The slightly higher Sonnet 4 score versus Opus 4 on this specific benchmark is not a hierarchy inversion. SWE-bench measures a narrow axis of software engineering skill: issue resolution in isolated Python repositories. Opus 4 is designed for a different performance profile — sustained operation across very long agent workflows, complex multi-step reasoning, and tasks that require hours of continuous execution. A single-issue resolution test is not the right measure of that capability.
The correct read on both numbers: the Claude 4 generation solves roughly three out of four real GitHub issues autonomously. For a developer evaluating AI coding tools, that is the operative fact.
Terminal-bench is a less well-known benchmark but arguably more predictive of agent behavior than SWE-bench for teams building agentic software engineering workflows. It evaluates the model's ability to navigate real terminal environments: managing file systems, installing dependencies, running builds, executing tests, debugging output, and completing multi-step command-line tasks without human intervention.
Opus 4's 43.2% Terminal-bench score is the highest any model achieved on the benchmark at the time of the Claude 4 launch. This matters specifically for the use case Anthropic is positioning Opus 4 toward: long-running agent workflows where the model operates autonomously over hours, not seconds.
SWE-bench measures whether you can fix a bug. Terminal-bench measures whether you can operate a development environment. The distinction is the difference between a model that gives you good code and a model that can actually ship. Teams building CI/CD pipelines, automated testing infrastructure, or full software development agents care about Terminal-bench more than SWE-bench because their agent is not just writing code — it is running it, debugging it, and iterating on failures.
Anthropic specifically describes Opus 4 as optimized for "sustained performance on complex, long-running tasks and agent workflows" and notes it can "work continuously for several hours" and "thousands of steps." Terminal-bench is the proxy for whether that claim holds under real conditions.
The Claude 4 release creates a genuine choice architecture for developers, and the right answer depends heavily on workload type.
| Dimension | Claude Opus 4 | Claude Sonnet 4 |
|---|---|---|
| SWE-bench Verified | 72.5% | 72.7% |
| Terminal-bench | 43.2% | Not specified |
| Input pricing (per M tokens) | $15 | $3 |
| Output pricing (per M tokens) | $75 | $15 |
| Price ratio | 5x more expensive | Baseline |
| Best for | Long-running agent workflows, complex reasoning, sustained multi-step tasks | Most coding tasks, chat, moderate complexity, cost-sensitive production |
| Free tier availability | No | Yes |
| Extended thinking | Yes | Yes |
| Parallel tool use | Yes | Yes |
The 5x price differential is the central decision variable. For most development tasks — code review, bug fixing, feature implementation, code explanation — Sonnet 4 at 72.7% SWE-bench is the better economic choice. It costs dramatically less and scores marginally higher on the benchmark most teams use to evaluate coding capability.
Opus 4's case is made on Terminal-bench and on workflow duration, not on SWE-bench. If your agent runs for hours, executes hundreds of tool calls, needs to maintain state and build context across a long session, and operates in a real terminal environment rather than an isolated code-fix setup — Opus 4 is the correct choice and the price is justified. For everything else, Sonnet 4 is the rational default.
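That decision rule is simple enough to capture in code. A minimal routing sketch following the heuristics above; the model names and thresholds are illustrative, not official identifiers or recommended cutoffs:

```python
OPUS = "claude-opus-4"
SONNET = "claude-sonnet-4"

def pick_model(expected_tool_calls: int, needs_terminal: bool,
               expected_hours: float) -> str:
    """Route a workload: Opus 4 only for long-running, terminal-heavy
    agent sessions; Sonnet 4 as the cost-rational default."""
    if needs_terminal or expected_tool_calls > 100 or expected_hours >= 1:
        return OPUS
    return SONNET

assert pick_model(5, False, 0.1) == SONNET   # code review, bug fix
assert pick_model(300, True, 3.0) == OPUS    # multi-hour autonomous agent
```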
The frontier AI pricing landscape in early 2026 has stratified into clear tiers. Understanding where Claude 4 sits relative to comparable models matters for teams making platform commitments.
| Model | Input (per M tokens) | Output (per M tokens) | SWE-bench | Notes |
|---|---|---|---|---|
| Claude Opus 4 | $15 | $75 | 72.5% | Best for long agentic workflows |
| Claude Sonnet 4 | $3 | $15 | 72.7% | Strong value for coding tasks |
| GPT-5.3-Codex | ~$20 | ~$100 | ~80%+ | Current Terminal-bench leader (~77.3%) |
| Gemini 3.1 Pro | ~$6 | ~$30 | ~80.6% | 1M context window, 60% cheaper than Opus |
| DeepSeek R2 | ~$1 | ~$5 | ~68% | Significant cost advantage, lower ceiling |
A few things stand out in this comparison. First, Sonnet 4 is priced competitively against Gemini 3.1 Pro while delivering comparable SWE-bench performance — the choice between them depends more on ecosystem preferences than raw cost-performance ratio. Second, Opus 4 at $15/$75 is expensive in absolute terms but not unreasonable for the specific workload it targets; long-running agent tasks are inherently token-intensive, and the real cost question is output quality per task completed, not output price per token. Third, DeepSeek R2 represents a different cost tier entirely but with a meaningful performance gap on complex reasoning tasks.
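To make "output quality per task completed, not output price per token" concrete, here is the arithmetic for a single token-heavy agent run. The token counts are hypothetical; the rates come from the table above.

```python
# $ per million tokens (input, output), from the pricing table
PRICES = {
    "claude-opus-4": (15.0, 75.0),
    "claude-sonnet-4": (3.0, 15.0),
}

def task_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one task at list prices."""
    inp, out = PRICES[model]
    return (input_tokens * inp + output_tokens * out) / 1_000_000

# A hypothetical agent run: 200K tokens read, 20K tokens written.
print(task_cost("claude-opus-4", 200_000, 20_000))    # 4.5
print(task_cost("claude-sonnet-4", 200_000, 20_000))  # 0.9
```

At these rates the same run costs $4.50 on Opus 4 and $0.90 on Sonnet 4, so the 5x ratio holds regardless of task size; what changes the calculus is whether the cheaper model completes the task at all.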
The practical pricing footnote: Anthropic's consumer plans price differently from API access. Pro ($20/month), Max ($100/month), Team, and Enterprise plans include both models plus extended thinking. For teams with moderate, predictable usage who do not want to manage API billing, the subscription plans can be significantly more cost-effective than metered API access.
The current competitive benchmark picture is nuanced in ways that summary headlines typically flatten.
| Benchmark | Claude Opus 4 | Claude Sonnet 4 | GPT-5.3 | Gemini 3.1 Pro |
|---|---|---|---|---|
| SWE-bench Verified | 72.5% | 72.7% | ~80%+ | ~80.6% |
| Terminal-bench | 43.2% | — | ~77.3% | — |
| LiveCodeBench | — | — | — | ~2,887 Elo |
| Context Window | 200K | 200K | ~128K | 1M |
The honest read on this table: GPT-5.3 and Gemini 3.1 Pro have pulled ahead of the Claude 4 launch versions on SWE-bench, with both sitting around 80% versus Claude 4's 72-73%. That gap is meaningful, not cosmetic — about 8 percentage points on a benchmark that directly measures real-world coding performance.
Where Claude 4 maintains distinction is workflow continuity. GPT-5.3's ~77.3% Terminal-bench score now leads Opus 4's 43.2% on autonomous terminal navigation, but Opus 4's strength is sustained multi-hour agent operation, which Terminal-bench does not fully capture. Context window parity at 200K is broadly adequate for most agentic tasks; Gemini 3.1 Pro's 1M token window is a real advantage only for specific use cases (full codebase analysis in a single prompt).
The overall picture is that the Claude 4 launch is competitive but not dominant. Sonnet 4 at $3/$15 with 72.7% SWE-bench is an excellent value proposition. Opus 4 at $15/$75 needs to justify its cost on the specific use cases it targets, and for long-running agent workflows, the Terminal-bench result and Anthropic's engineering decisions around memory and context management make a credible case.
The benchmark numbers are important, but the most operationally significant changes in Claude 4 are the agent infrastructure improvements that do not show up in any benchmark table.
Memory. Both Opus 4 and Sonnet 4 ship with significantly improved memory capabilities when given access to local files. The model can now extract key facts from long conversations and save them to disk, building persistent knowledge across sessions. This is not in-weights memory — the model is not trained with your session data. It is tool-mediated memory: the model actively decides what is worth preserving, writes it to a file, and reads it back in future sessions. For agent workflows that run over days rather than hours, this is a qualitative shift. The model no longer starts each session from zero.
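On the developer side, tool-mediated memory reduces to a pair of file-backed operations the model can be handed as tools. A minimal sketch, assuming a JSON file as the store; the function names and schema here are illustrative, not Anthropic's official memory tooling:

```python
import json
from pathlib import Path

def remember(store: Path, key: str, fact: str) -> None:
    """Persist one extracted fact to disk; survives across sessions."""
    notes = json.loads(store.read_text()) if store.exists() else {}
    notes[key] = fact
    store.write_text(json.dumps(notes, indent=2))

def recall(store: Path) -> dict:
    """Load everything previous sessions saved; empty on first run."""
    return json.loads(store.read_text()) if store.exists() else {}
```

In a real deployment these would be registered as tool definitions in the API request, and the model decides when to call each one.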
Agent teams. In Claude Code — Anthropic's developer coding tool — you can now assemble teams of Claude agents that work on tasks together. One agent might own architecture, another handles test writing, a third manages documentation. The orchestration is handled by the model itself: you describe the project and the team structure, and Claude coordinates task distribution, parallel execution, and result synthesis. This mirrors the way software engineering teams actually work and represents a substantial advance over single-agent sequential execution.
Context compaction. On the API, Claude 4 supports compaction — the model can summarize its own context when it approaches the context window limit, distilling the most important information to carry forward rather than simply truncating earlier content. For tasks that require thousands of steps, this is the mechanism that makes multi-hour operation practically feasible. Without compaction, long agent sessions degrade when context fills. With compaction, the model maintains coherent operation indefinitely.
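Mechanically, compaction is a summarize-and-replace step over the message history. A sketch under simplifying assumptions: the `summarize` callable stands in for a model call, and the character-count threshold is a stand-in for real token accounting.

```python
def compact(history: list[str], char_limit: int, summarize) -> list[str]:
    """If the transcript exceeds char_limit, replace the oldest portion
    with a summary and keep the most recent messages verbatim."""
    if sum(len(m) for m in history) <= char_limit or len(history) < 3:
        return history
    keep = max(1, len(history) // 2)          # recent messages kept as-is
    summary = summarize(history[:-keep])      # distill the older portion
    return [summary] + history[-keep:]

# Stub summarizer: in production this is itself a model call.
stub = lambda msgs: f"[summary of {len(msgs)} earlier messages]"
h = compact(["a" * 400, "b" * 400, "c" * 400, "d" * 400], 1000, stub)
# h now holds one summary string plus the two most recent messages
```

The design choice worth noting is replacement rather than truncation: the oldest content is distilled, not dropped, so decisions made early in a session remain visible thousands of steps later.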
These three features — memory, agent teams, and context compaction — are more significant for production engineering teams than any single benchmark improvement. They address the failure modes that made previous Claude models impractical for long-running autonomous work.
Claude 3.7 Sonnet introduced extended thinking, but with an important limitation: the model could think before calling a tool, and it could think after receiving a tool result, but it could not interleave reasoning and tool use in the same thinking trace. The extended thinking state and the tool use state were separate operating modes.
Claude 4 removes this constraint. Both Opus 4 and Sonnet 4 can use tools during extended thinking — specifically, can invoke tools like web search as part of the reasoning process itself, not just before or after it. This means the model's chain of thought can now include live information retrieval, verification of facts during reasoning, and iterative refinement based on tool outputs, all within a single thinking trace.
The architectural significance is this: the model can now reason about a problem, discover a gap in its knowledge mid-reasoning, fill that gap via tool call, and continue reasoning with the new information — without breaking the thinking state and without the developer having to orchestrate this externally. For tasks that require current information, that involves multi-step lookups, or that benefit from hypothesis-test-revise cycles, this changes the quality ceiling substantially.
For code generation specifically: a model working on a complex implementation can now search documentation, verify API signatures, check dependency compatibility, and reason through the results — all in one extended thinking pass. Previous architectures required either pre-loading all relevant documentation into context (expensive) or breaking the task into multiple turns (slow and stateful).
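At the API level, enabling this is a matter of sending tool definitions and a thinking budget in the same request. A sketch of the request body; the model ID is a placeholder, and the web-search tool's type string is version-suffixed in Anthropic's docs, so treat the exact identifier below as illustrative:

```python
def thinking_with_tools(prompt: str, budget: int = 8000) -> dict:
    """Request body enabling extended thinking alongside tools, so the
    model can search mid-reasoning rather than only before or after it."""
    return {
        "model": "claude-opus-4-0",           # placeholder model ID
        "max_tokens": budget + 4096,          # thinking budget + answer room
        "thinking": {"type": "enabled", "budget_tokens": budget},
        "tools": [
            # Server-side web search; the type string is versioned in the
            # official docs and may differ from this illustration.
            {"type": "web_search_20250305", "name": "web_search",
             "max_uses": 5},
        ],
        "messages": [{"role": "user", "content": prompt}],
    }
```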
Anthropic's rollout tiers Claude 4 access in ways that differ from some competitor launches.
| Tier | Opus 4 | Sonnet 4 | Extended Thinking |
|---|---|---|---|
| Free | No | Yes | No |
| Pro ($20/month) | Yes | Yes | Yes |
| Max ($100/month) | Yes | Yes | Yes (higher limits) |
| Team | Yes | Yes | Yes |
| Enterprise | Yes | Yes | Yes |
| API | Yes ($15/$75) | Yes ($3/$15) | Yes (billed by tokens) |
Sonnet 4 reaching free tier users is notable. The previous free offering was Claude 3.5 Haiku — a smaller, faster model without extended thinking. Sonnet 4 is a full frontier model. Giving free users access to a model that scores 72.7% on SWE-bench, supports parallel tool calls, and includes improved memory capabilities is a significant upgrade to the free offering and a clear competitive response to ChatGPT's GPT-4o access on the free tier.
The extended thinking availability across all paid plans — including the $20/month Pro tier — removes the concern that hybrid thinking was a premium-only feature. Any developer on a Pro plan can use extended thinking in the Claude.ai interface; API access to extended thinking is available at standard per-token pricing.
The Claude 4 launch is best understood as a systems-level bet, not a benchmark win. On raw SWE-bench numbers, GPT-5.3 and Gemini 3.1 Pro are ahead at launch. Anthropic is not winning the point-in-time benchmark race with the Claude 4 generation.
What Anthropic is doing instead is building the infrastructure for the next phase of AI value creation: autonomous agents that operate continuously, maintain memory, coordinate in teams, and handle real software engineering environments over sustained sessions. The hybrid thinking architecture, context compaction, agent teams, and tool use during extended thinking are all investments in that specific future.
The bet makes sense if you believe — as Anthropic clearly does — that the value frontier for AI software engineering is moving from "best single-query response" to "most reliable multi-hour autonomous operation." SWE-bench measures the former. Terminal-bench, memory persistence, and agent team coordination measure the latter.
The competitive risk is that GPT-5.3 and Gemini 3.1 Pro are also building toward that future and currently lead on key benchmarks. Claude 4's 8-point SWE-bench deficit to the leaders is not noise. Teams evaluating models for production coding workloads will see that gap and factor it into their decisions.
The honest position: Claude Sonnet 4 at $3/$15 is one of the most cost-effective frontier coding models available and the right default for most development workloads. Claude Opus 4 is a specialized tool for sustained agentic operation that requires the features no benchmark currently measures well. Neither is the clear market leader at launch, but both are credible production choices for the right workloads.
What Anthropic has built is a platform for the kind of AI engineering work that does not yet have a benchmark — the kind where the model runs for three hours, navigates a real codebase, builds context across sessions, coordinates with other agent instances, and ships something that works. That is where the Claude 4 generation is designed to win, and it may take the rest of 2026 for the industry to develop the measurement frameworks that confirm whether it does.
What is the difference between Claude Opus 4 and Claude Sonnet 4?
Opus 4 is optimized for sustained long-running agent workflows and achieves 43.2% on Terminal-bench — the highest recorded at launch. It costs $15/$75 per million tokens (input/output). Sonnet 4 scores marginally higher on SWE-bench Verified at 72.7% versus Opus 4's 72.5%, costs five times less at $3/$15 per million tokens, and is available to free users. For most coding and reasoning tasks, Sonnet 4 is the better economic choice. Opus 4's case is made specifically on long-running agent tasks that require hours of continuous operation and real terminal environment navigation.
What is hybrid thinking in Claude 4?
Hybrid thinking means a single set of model weights supports both an instant response mode and an extended thinking mode. Previous models fixed that choice per model rather than per request. Claude 4 lets developers configure the level of reasoning per API call, using effort controls to balance intelligence, speed, and cost. The model can return instant responses for simple queries and engage extended thinking for complex ones, within the same application and session.
Why does Sonnet 4 score higher than Opus 4 on SWE-bench if Opus 4 is the "better" model?
SWE-bench Verified measures isolated single-issue code resolution in Python repositories. Opus 4 is not optimized for that specific benchmark — it is optimized for sustained multi-step agent operation, real terminal environment navigation (Terminal-bench), and long-running workflows. The SWE-bench difference is within noise range (72.5% vs 72.7%) and should not be used to conclude that Sonnet 4 is more capable overall. The models have different performance profiles for different task types.
Can Claude 4 use tools while in extended thinking mode?
Yes. This is a new capability in Claude 4 and a meaningful architectural advance over Claude 3.7. Both Opus 4 and Sonnet 4 can invoke tools — including web search — during the extended thinking process itself, not just before or after it. This allows the model to retrieve information mid-reasoning, verify facts, and incorporate results into the ongoing chain of thought without breaking the thinking state.
How does Claude 4 compare to GPT-5 and Gemini 3 on coding?
On SWE-bench Verified, GPT-5.3 and Gemini 3.1 Pro score approximately 80%+, about 8 points ahead of the Claude 4 generation at launch. GPT-5.3 leads on Terminal-bench at approximately 77.3% versus Opus 4's 43.2%. Gemini 3.1 Pro offers a 1M token context window and is roughly 60% cheaper than Opus 4 on output. Claude Sonnet 4 at $3/$15 is competitively priced against Gemini 3.1 Pro with comparable SWE-bench performance. The choice between these models depends on specific workload requirements, context window needs, and cost sensitivity.
Is Claude 4 available on the free plan?
Claude Sonnet 4 is available to free users. Claude Opus 4 requires a paid plan (Pro at $20/month or higher). Extended thinking is available on Pro, Max, Team, and Enterprise plans but not on the free tier.
What is Terminal-bench and why does it matter?
Terminal-bench evaluates a model's ability to operate in real terminal environments: navigating file systems, managing dependencies, running builds, executing tests, and completing multi-step command-line tasks autonomously. It is a better proxy than SWE-bench for teams building agentic software engineering workflows where the model needs to operate a development environment, not just write code. Opus 4's 43.2% Terminal-bench score was the highest recorded at the Claude 4 launch, though GPT-5.3 has since surpassed it at approximately 77.3%.
What are agent teams in Claude Code?
Agent teams is a Claude Code feature that lets you assemble multiple Claude instances working on a task together. Different agents can own different aspects of a project — architecture, testing, documentation — and coordinate autonomously. The orchestration is model-managed rather than manually configured. This feature ships with the Claude 4 generation and requires Claude Code, Anthropic's developer coding tool.