GPT-5.4 Can Now Use Your Computer: OpenAI's Most Agentic Model Yet
OpenAI's GPT-5.4 ships with native computer use and a 1M token context window, competing directly with Anthropic's Claude Opus 4.6 for agentic AI.
TL;DR: OpenAI GPT-5.4 launched on March 5, 2026, with general availability on March 7. It is the first GPT-5 series model to ship native computer use, a 1 million token context window, and a new Tool Search capability that cuts token usage by 47%. On OSWorld-Verified, the standard desktop automation benchmark, GPT-5.4 scores 75.0%, surpassing human expert performance at 72.4%. API pricing starts at $2.50 per million input tokens. ChatGPT Pro ($200/month) subscribers get access immediately.
OpenAI gave GPT-5.4 a keyboard and a mouse. The model got to work.
That sentence is not a metaphor. On March 5, 2026, OpenAI released GPT-5.4 with native computer use built directly into the model. Two days later, it reached general availability across ChatGPT Pro and the Responses API. The model observes a screen through sequential screenshots, reasons about what it sees, and executes real mouse clicks and keystrokes across browsers and desktop apps, with no changes required to the target software.
GPT-5.4 with a 1 million token context window and native computer use is the clearest step yet from AI assistant to AI agent. The assistant takes instructions. The agent takes a keyboard.
GPT-5.4 also surpassed human performance on OSWorld-Verified (75.0% versus the human expert baseline of 72.4%), matched or exceeded industry professionals in 83.0% of comparisons on GDPval across 44 occupations, and scored a 27.7 percentage point improvement over GPT-5.2 on the same desktop benchmark. If you work in software, law, finance, or any role with repetitive computer-based tasks, this release affects your work directly.
Four capabilities define this release. Each matters on its own. Together they form a production-ready platform for agentic deployment that no frontier model had shipped as a single generally available package before March 7, 2026.
Native computer use lets GPT-5.4 control browsers and desktop applications by observing screenshots and issuing structured action sequences. The model identifies interactive elements visually, reasons about the interface, and outputs an actions[] array that a lightweight local harness executes. It works on macOS, Windows, and Linux, on any web application, without requiring programmatic access to the target software.
GPT-5.4 is OpenAI's first general-purpose model to ship native, state-of-the-art computer use capabilities as a standard feature, not a plugin or add-on.
On OSWorld-Verified, GPT-5.4 scores 75.0%. Human performance on the same benchmark sits at 72.4%. That is the first time any frontier model has cleared the human expert baseline on this benchmark. The previous best from OpenAI was GPT-5.2 at 47.3%.
The 1 million token context window is the largest at general availability for any frontier model as of March 2026. GPT-5.3 topped out at 128K tokens. That is not an incremental step. It is a category change in the class of problems you can work with in a single session. Think full codebases, complete contract bundles, or a year of organizational communication, all in one inference call.
Tool Search is new in GPT-5.4 and largely overlooked in early coverage. The model can look up tool definitions on-demand rather than loading every available tool at the start of a session. In OpenAI's own testing, this reduced token usage by 47% while maintaining task accuracy. For developers building agents with large tool sets, that is a significant cost reduction.
Three model variants shipped simultaneously: Standard (the default for most use cases), Thinking (for deep multi-step reasoning), and Pro (maximum performance, available to Pro and Enterprise plans). The Thinking variant handles tasks that require long chains of intermediate reasoning before acting.
Key stat: GPT-5.4 on GDPval, which tests across 44 real occupations with professional evaluators, matched or exceeded industry professionals in 83.0% of comparisons. That benchmark measures practical professional value, not abstract reasoning patterns.
The technical mechanism is a closed perception-action loop. GPT-5.4 receives a screenshot of the current screen state, reasons about the visible interface in the context of the assigned task, outputs a structured action sequence, and waits for the updated screenshot before deciding on the next step. This repeats until the task completes or the model requests human input.
The local harness that executes actions is lightweight: it handles screenshot capture and action execution using OS-level accessibility APIs. The actual reasoning runs on OpenAI's infrastructure. A standard two-core cloud VM handles the input/output loop for most workflows.
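The loop described above can be sketched in a few lines. This is an illustrative driver, not OpenAI's harness: `capture_screenshot`, `execute`, and `model_step` are hypothetical stand-ins for the real screen-capture, input-injection, and API calls.

```python
# Illustrative sketch of the perception-action loop: observe a screenshot,
# let the model reason and emit an actions[] array, execute it, repeat.
# All three callables are stand-ins for a real harness's OS and API wiring.

def run_agent(task, model_step, capture_screenshot, execute, max_steps=50):
    """Drive a screenshot -> reason -> act loop until the model signals done."""
    history = []
    for _ in range(max_steps):
        shot = capture_screenshot()             # observe current screen state
        step = model_step(task, shot, history)  # {"actions": [...], "done": bool}
        for action in step["actions"]:          # e.g. {"type": "click", "x": 100, "y": 200}
            execute(action)
        history.append(step)
        if step["done"]:
            return history
    raise TimeoutError("agent did not finish within max_steps")
```

The key property the sketch shows is that the model never acts blind: each action batch is chosen against a fresh screenshot, and the accumulated history lets it recover from steps that did not land as expected.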
GPT-5.4 operates in two control modes, selected automatically without user instruction.
Visual GUI control uses coordinate-based interaction derived from screen understanding. The model identifies interactive elements visually and clicks at specific coordinates. This works on any application, any operating system, with no programmatic access to the target software required.
Playwright scripting takes over when tasks need speed or precision that visual clicking cannot reliably deliver. GPT-5.4 writes and executes browser automation scripts when that approach is more reliable for the specific step. The model selects the appropriate mode based on the task context.
Both modes work together on complex workflows. A single agentic session might use Playwright for a structured data extraction step, then switch to visual clicking to navigate a legacy application that does not expose a reliable DOM, then return to scripting to submit the output. The model handles those transitions without instruction.
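A heuristic for that mode choice might look like the following. The real selection happens inside the model; this hypothetical function only illustrates the shape of the decision, and the `has_reliable_dom` and `kind` fields are invented for the example.

```python
# Hypothetical heuristic for the dual-mode dispatch described above.
# Field names are illustrative, not part of any real API.

def choose_control_mode(step):
    """Pick 'playwright' scripting when a step is structured and the page
    exposes a reliable DOM; fall back to 'visual' coordinate clicking otherwise."""
    if step.get("has_reliable_dom") and step.get("kind") in {"extract", "submit"}:
        return "playwright"   # scripted automation: faster, more precise
    return "visual"           # coordinate clicks work on any UI, even legacy apps
```

Run against the three-step workflow from the paragraph above, the heuristic would script the extraction, click through the legacy application visually, then script the final submission.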
The Verge's hands-on analysis of GPT-5.4 computer use tested it against real knowledge-work tasks. Their assessment: task completion is meaningfully better than prior computer use implementations, with the main failure mode being multi-step workflows that involve login-gated applications with non-standard authentication flows.
OpenAI GPT-5.4 computer use works by perceiving screens as a vision task, reasoning about the interface, and executing action sequences via a lightweight local harness, with no changes required to the target application.
Numbers at this scale stop feeling concrete. Translation helps.
One million tokens is roughly 750,000 words. That is a large slice of a major open-source codebase with comments. It is ten average novels. It is a year of Slack messages for a 50-person engineering team, or a full M&A contract bundle from a mid-size transaction.
The shift from 128K to 1M tokens does not just give you more room. It changes which problems you can approach without breaking them into fragments.
Full codebase review becomes practical. Feed an entire repository into GPT-5.4 and ask it to find security vulnerabilities, trace how a proposed refactor breaks existing behavior, or explain how a pull request interacts with every file it touches. A code reviewer who has read the diff is useful. A code reviewer who has simultaneously read the full project history, test suite, documentation, and the diff is a different tool entirely.
Legal due diligence changes meaningfully. Upload a complete contract bundle and ask GPT-5.4 to identify indemnification clauses that conflict across documents, flag non-standard termination provisions, or find every instance where a defined term is used inconsistently across exhibits. Law firms running early pilots report that tasks previously taking junior associates several weeks are completing in hours. Ars Technica's context window technical review tested a 900-page contract archive and found coherent cross-document reasoning throughout.
Research synthesis at a scale that retrieval-augmented generation cannot match. Load the complete text of relevant papers on a topic and ask the model to map where findings contradict each other, identify which methodologies replicate most consistently, and locate genuine open questions. This is a different kind of intellectual work than any RAG pipeline offers, because you are working with complete papers rather than retrieved fragments.
Long-horizon agentic tasks stay coherent. For computer use workflows that unfold across dozens of steps and multiple applications, a 1M token context means the agent holds the complete task history, intermediate outputs, tool call results, and error logs without truncation. Agents that previously lost coherent state midway through complex multi-app workflows can now maintain it across a full working session.
The honest caveat: large context does not mean perfect recall at every position. The "lost in the middle" problem, where models attend less reliably to information placed in the center of a very long context, is improved in GPT-5.4 but not fully resolved. For retrieval-critical tasks, explicit position markers help. For synthesis tasks where the model reasons across a large document set, performance in early testing is strong.
Pricing above 272K tokens: prompts that exceed the 272K token standard tier are billed at 2x the input rate and 1.5x the output rate for the full session. Cached input pricing stays at $0.25 per million tokens regardless of context length.
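A quick cost estimator, using only the rates stated in this article ($2.50/M input, $15.00/M output, $0.25/M cached input, with the 2x/1.5x long-context multipliers once the prompt crosses 272K tokens). The threshold logic here applies the multiplier based on uncached prompt size, which is one reading of "for the full session"; treat it as a sketch, not billing documentation.

```python
# Estimate GPT-5.4 session cost from the rates stated above.
# Sessions whose prompt exceeds 272K tokens are billed at 2x input
# and 1.5x output for the full session; cached input stays at $0.25/M.

LONG_CONTEXT_THRESHOLD = 272_000

def session_cost(input_tokens, output_tokens, cached_tokens=0):
    """Estimated session cost in dollars."""
    long_ctx = input_tokens > LONG_CONTEXT_THRESHOLD
    input_rate = 2.50 * (2 if long_ctx else 1)      # $/M tokens
    output_rate = 15.00 * (1.5 if long_ctx else 1)  # $/M tokens
    return (input_tokens * input_rate
            + output_tokens * output_rate
            + cached_tokens * 0.25) / 1_000_000
```

A 1M-token prompt with 50K output tokens lands at $5.00 + $1.13, a little over six dollars per call, which is why the cached-input rate matters so much for long-running agent sessions.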
Key finding: GPT-5.4's 1M token context window enables a category of tasks that required multiple fragmented calls in any prior frontier model. Full document corpora, complete codebases, and long-horizon agentic sessions are now single-inference-call problems.
Tool Search is the least-discussed GPT-5.4 feature and arguably the most economically significant for developers building production agents.
Prior to GPT-5.4, an agent had to load all available tool definitions at the start of every session. If your agent has access to 50 tools, all 50 definitions land in the context window before any task begins. That costs tokens, and for large tool sets, it costs a lot of them.
GPT-5.4 can look up tool definitions on-demand instead. The model identifies which tools it needs for the current task step, retrieves those definitions, and proceeds. Tools not needed for a given step add zero token overhead.
OpenAI's internal testing showed a 47% reduction in token usage across representative agentic workflows, with no meaningful drop in task accuracy. For an enterprise deployment running thousands of agent sessions daily, 47% lower token consumption at the same output quality changes the economics of building on GPT-5.4 relative to alternatives.
This matters more than per-action pricing for high-volume deployments. API cost in agentic systems scales with session length and tool set size. Tool Search attacks both.
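The token math behind that claim is easy to see in miniature. GPT-5.4's Tool Search runs inside the model; this concept sketch just contrasts eager and on-demand loading of tool definitions client-side, with invented tool names and token counts.

```python
# Concept sketch: eager vs. on-demand tool definition loading.
# Tool names and per-definition token counts are illustrative only.

def eager_overhead(registry):
    """Token cost of loading every tool definition up front."""
    return sum(tool["def_tokens"] for tool in registry.values())

def lazy_overhead(registry, tools_used):
    """Token cost when only the tools actually invoked are loaded."""
    return sum(registry[name]["def_tokens"] for name in tools_used)
```

With 50 tools at roughly 300 tokens per definition, eager loading spends 15,000 tokens before the task starts; a session that touches three tools spends 900. The saving compounds across every session in a high-volume deployment.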
Here is how the three main computer use models compare across the dimensions that matter for production deployment as of March 2026:
| Capability | GPT-5.4 | Claude Opus 4.6 | Google Project Mariner |
|---|---|---|---|
| Computer use GA | ✓ | ✓ | ✗ (limited preview) |
| Context window | 1M tokens | 200K (1M in beta) | 1M tokens |
| OSWorld-Verified score | 75.0% | 72.7% | 35.2% (preview) |
| Human expert baseline | 72.4% (surpassed) | 72.4% (near-matched) | 72.4% (well below) |
| Tool Search | ✓ | ✗ | ✗ |
| Sandboxed browser | ✓ | ✓ | ✓ |
| Full-desktop mode | ✓ (enterprise) | ✓ (enterprise) | ✗ |
| Persistent memory | ✓ | ✗ | ✗ |
| API access | ✓ | ✓ | ✗ |
| Playwright scripting | ✓ | ✗ | ✓ |
| Input pricing (standard) | $2.50/M tokens | $3.00/M tokens | Not announced |
| Cached input pricing | $0.25/M tokens | $0.30/M tokens | Not announced |
| GDPval professional match | 83.0% | Not reported | Not reported |
| Consumer access | ChatGPT Pro ($200/mo) | Claude Pro ($20/mo) | Waitlist only |
The OSWorld-Verified numbers tell the clearest story. GPT-5.4 at 75.0% clears the human expert baseline of 72.4% for the first time. Claude Opus 4.6 at 72.7% comes close to that baseline but has not crossed it. Google's Project Mariner preview at 35.2% is not in the same competitive range.
Claude Opus 4.6 costs $3.00 per million input tokens at standard pricing, versus GPT-5.4's $2.50. GPT-5.4 is cheaper on input, and the Tool Search feature further reduces effective token consumption. Claude Opus 4.6 does not have Tool Search.
Claude Opus 4.6 leads on mathematical reasoning benchmarks. If your use case involves multi-step mathematical problem-solving, Anthropic's model maintains a measurable advantage there.
Google's Project Mariner has potential in Google Workspace-centric environments (Docs, Sheets, Gmail, Drive integration). Without API access or general availability as of March 2026, it is not yet a production option.
For developers building agentic workflows today, GPT-5.4 is the most capable generally available option: the highest task completion rate on the standard benchmark, the largest context window, the only Tool Search feature, and the only persistent memory system in production.
API pricing for GPT-5.4 through the Responses API:

- Standard input: $2.50 per million tokens
- Input above 272K tokens: $5.00 per million (2x rate)
- Cached input: $0.25 per million tokens
- Output: $15.00 per million tokens
- Extended output (above 272K context): $22.50 per million (1.5x rate)

Computer use carries no separate surcharge, and Tool Search is included at no additional cost.
For context on competitive positioning: Claude Opus 4.6's API runs at $3.00 per million input tokens and higher output costs. GPT-5.4's standard tier is cheaper on input. The cached rate at $0.25 per million tokens gives persistent-context deployments significant cost advantages.
ChatGPT access tiers:

- ChatGPT Pro ($200/month): full GPT-5.4 access, including the Pro variant
- ChatGPT Enterprise: full access, plus full-desktop mode configuration
- ChatGPT Standard ($20/month): does not include GPT-5.4
GPT-5.4 Pro, the maximum-performance variant, requires ChatGPT Pro or Enterprise plan access. The Standard and Thinking variants are available via the API to all registered developers.
The economic case for building on GPT-5.4 rather than Claude Opus 4.6 is straightforward for most agentic workloads: lower base input cost, better task completion rate, Tool Search reducing total token consumption, and cached context pricing that makes long-running sessions economically viable.
These are workflow categories where GPT-5.4's capability combination creates practical value at production scale, based on early enterprise beta feedback reported by TechCrunch's GPT-5.4 launch coverage.
Software development. Feed a complete codebase into a 1M token context. Ask GPT-5.4 to review a pull request against the full project, trace a bug across the entire call stack, or identify where a proposed refactor breaks existing behavior. The computer use layer handles interacting with the development environment: running tests, reading error output, adjusting code, re-running. One developer testing this on a 600K-token monorepo reported that the model identified three non-obvious dependency conflicts the existing CI pipeline had missed.
Legal document review. Upload a contract bundle into the 1M context. Request a comprehensive review: conflicting indemnification clauses across documents, non-standard provisions relative to market norms, definitions used inconsistently across exhibits. Cross-document analysis is coherent because the entire bundle is in context at once, not fragmented across multiple calls.
Multi-system data reconciliation. Use computer use to log into a CRM, pull account data, open a billing system, compare records, flag discrepancies, and draft correction requests. This replaces the most repetitive knowledge-work tasks without requiring any of the target systems to have AI integrations. The model operates them through their standard browser interfaces.
Research aggregation. Load the full text of papers on a topic into the 1M context. Ask for a structured comparison of findings, an identification of where methodologies differ, and a flag of which conclusions are replicated versus isolated to single studies. This is faster and more coherent than any RAG-based approach to the same task.
Automated reporting. Configure a workflow that learns your reporting format, data sources, and preferred output structure over several sessions. By the fourth run, the model produces near-final drafts with minimal manual correction. The OpenAI Agents SDK provides orchestration support for multi-step, multi-agent versions of these workflows.
Each of these use cases is buildable today using the Responses API with computer use enabled.
Summary: GPT-5.4 is most immediately valuable for tasks where the bottleneck has been context length (codebase review, legal document analysis), action automation (multi-system reconciliation), or agent coherence over long sessions (agentic reporting workflows).
Sandboxed browser mode is on by default for all users and has specific restrictions. The model cannot access stored passwords or password managers. It cannot download and execute arbitrary files. It cannot install browser extensions. It cannot access file systems outside the sandboxed browser environment. These apply regardless of what instructions a user gives.
Before consequential, irreversible actions, GPT-5.4 is trained to request explicit confirmation: sending an email to an external recipient, submitting a payment, deleting files, changing account settings. All such actions generate an audit log entry accessible through the API response object.
OpenAI maintained the same high cyber-risk classification used for GPT-5.3-Codex and added expanded cyber safety systems, monitoring tools, trusted access controls, and request blocking for higher-risk activity on Zero Data Retention surfaces.
Developers can configure the model's safety behavior to suit different levels of risk tolerance by specifying custom confirmation policies in the API. A fully automated overnight workflow can be configured to minimize interruptions. A human-in-the-loop workflow can require confirmation at each major step.
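A confirmation-policy gate for that spectrum might be shaped like this. The policy names, action types, and function are hypothetical illustrations of the configurability described above, not OpenAI's actual API.

```python
# Hypothetical confirmation-policy gate mirroring the risk-tolerance
# configuration described above. Names are illustrative, not a real API.

IRREVERSIBLE = {"send_email", "submit_payment", "delete_file", "change_setting"}

def needs_confirmation(action_type, policy):
    """Decide whether an action should pause for human confirmation.

    policy: 'autonomous'    -> minimize interruptions (overnight batch runs)
            'irreversible'  -> pause only on consequential actions (default)
            'human_in_loop' -> pause on every action
    """
    if policy == "autonomous":
        return False
    if policy == "human_in_loop":
        return True
    return action_type in IRREVERSIBLE
```

The default posture corresponds to the middle setting: routine clicks and keystrokes flow through, while anything on the irreversible list blocks until a human approves it.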
Full-desktop mode, giving the model access to native applications outside the browser, requires an enterprise plan and explicit configuration. It is not available at the consumer tier.
The active, unsolved problem in this space: prompt injection. A malicious website or document instructing the model to take actions the user did not request. OpenAI has detection systems for this. It is an open research problem for the entire computer use space, including Anthropic's Claude Opus 4.6. Neither company has a complete technical solution.
The default safety posture for GPT-5.4 computer use is sandboxed browser access with mandatory confirmation for irreversible actions. Full-desktop access requires enterprise enrollment and explicit configuration.
The strategic picture is worth seeing clearly. GPT-5.4 is not a better chatbot. It is an infrastructure layer for a specific product vision: AI that works like an employee rather than a tool.
Sam Altman has described this framing in interviews across early 2026. The goal is not AI systems you invoke, but AI systems that work continuously, learn your preferences, handle workflows, and escalate when they need human judgment. Computer use is what makes that vision concrete.
An AI that only generates text is a drafting tool. An AI that can log into your CRM, pull this quarter's pipeline data, update opportunity stages, draft the board memo, and schedule the review meeting is a different kind of thing entirely.
The Operator product, the B2B agentic platform scheduled for broader availability in Q2 2026, is the enterprise wrapper above the API. It lets non-technical business users configure agent workflows: define what an AI is authorized to do in a given context, set the data sources it can access, specify which actions it takes autonomously versus which conditions trigger human escalation, and monitor work through an audit log.
The infrastructure components are in place. GPT-5.4 is the reasoning core. Computer use is the execution layer. Tool Search reduces the token overhead of large tool sets. Persistent memory is the continuity layer that makes each session more effective than the last.
The benchmark progress is not slowing: GPT-5.2 scored 47.3% on OSWorld-Verified roughly a year ago. GPT-5.4 scores 75.0% today, above human expert performance. An improvement of that magnitude in twelve months makes the "wait and see" calculation increasingly expensive.
The companies building agentic workflows now, defining scope, tuning confirmation policies, and building institutional knowledge about what these agents can be trusted to do without supervision, will operate from an advantage that late movers will find difficult to close.
GPT-5.4 is OpenAI's latest frontier language model, launched in limited access on March 5, 2026, reaching general availability on March 7, 2026. It is the first GPT-5 series model to include native computer use, a 1 million token context window, Tool Search, and three model variants: Standard, Thinking, and Pro.
GPT-5.4 computer use works through a closed perception-action loop. The model receives a screenshot of the current screen, reasons about what it sees, outputs a structured action sequence (clicks, keystrokes, scrolls), and waits for an updated screenshot before deciding on the next step. A lightweight local harness executes the physical actions while AI reasoning runs on OpenAI's servers.
GPT-5.4 has a 1 million token context window, approximately 750,000 words. This is up from 128K tokens in GPT-5.3. Prompts above 272K tokens are billed at 2x the standard input rate and 1.5x the output rate for the full session. Cached input stays at $0.25 per million tokens regardless of context length.
GPT-5.4 scores 75.0% on OSWorld-Verified, the standard benchmark for desktop automation. Human expert performance on the same benchmark sits at 72.4%. This makes GPT-5.4 the first frontier model to surpass the human expert baseline on this benchmark. The previous best from OpenAI was GPT-5.2 at 47.3%.
GPT-5.4 scores 75.0% on OSWorld-Verified versus Claude Opus 4.6 at 72.7%. GPT-5.4 surpasses the human expert baseline of 72.4%; Claude Opus 4.6 comes close but has not crossed it. GPT-5.4 also has a larger generally available context window (1M versus Claude Opus 4.6's 200K standard), Tool Search, and persistent memory. Claude Opus 4.6 leads on mathematical reasoning benchmarks.
Tool Search lets GPT-5.4 look up tool definitions on-demand instead of loading all available tools at the start of every session. OpenAI's testing showed a 47% reduction in token usage across representative agentic workflows with no drop in accuracy. For developers building agents with large tool sets, this materially reduces the cost of running production workloads on GPT-5.4.
Standard input costs $2.50 per million tokens. Input above 272K tokens costs $5.00 per million (2x rate). Cached input costs $0.25 per million tokens. Output costs $15.00 per million tokens. Extended output costs $22.50 per million (1.5x rate). Computer use carries no separate surcharge. Tool Search is included at no additional cost.
GPT-5.4 computer use supports macOS, Windows, and Linux in sandboxed browser mode. Full-desktop mode, which gives access to native applications across all three operating systems, requires an enterprise plan and explicit enablement through the API. Sandboxed browser mode is on by default for all users.
GPT-5.4 is not free. It requires a ChatGPT Pro subscription ($200/month) for access through the ChatGPT interface, or developer access via the Responses API. The ChatGPT Standard plan ($20/month) does not include GPT-5.4.
GPT-5.4 Pro is the maximum-performance variant of GPT-5.4, available to ChatGPT Pro and Enterprise plan subscribers. The standard GPT-5.4 covers most professional use cases. GPT-5.4 Thinking is the deep-reasoning variant for multi-step problems requiring long chains of intermediate reasoning before acting. All three variants use the same underlying 1M context window and computer use infrastructure.
Full-desktop mode requires an enterprise plan. Prompts above 272K tokens are billed at higher rates. The "lost in the middle" problem, where attention degrades for information placed in the center of a very long context, is improved but not fully resolved. Prompt injection, where malicious content on a web page attempts to hijack the agent's actions, remains an active research problem without a complete technical solution. Multi-step workflows involving login-gated applications with non-standard authentication flows are more error-prone.
Persistent memory for agents builds a structured memory profile from observed user behavior across sessions. It stores preferences, inferred behavioral patterns, recurring project structures, and technical environment details automatically. Users can view, edit, or delete all stored memory from account settings. Enterprise deployments can disable it entirely, scope it to specific workflows, or require explicit user confirmation before new memories are written.
The Responses API is the current API interface for GPT-5.4, replacing the earlier Chat Completions API for agentic use cases. It handles tool calls, computer use actions, Tool Search, and multi-turn agentic sessions with persistent state. Developers building computer use applications should use the Responses API rather than the older completions interface.
Sandboxed browser mode is on by default and restricts the model from accessing stored passwords, downloading executable files, installing browser extensions, or accessing the local file system. The model requests explicit confirmation before consequential irreversible actions such as sending emails, submitting payments, or deleting files. All actions are logged to an audit trail accessible through the API response. Developers can configure custom confirmation policies to match different risk tolerance levels.
For most knowledge-work automation tasks, GPT-5.4 leads on task completion rate, context window size, and token efficiency. If your use case requires mathematical reasoning, Claude Opus 4.6 maintains a measurable edge on math benchmarks. If you need the highest desktop automation score available, Tool Search for token efficiency, and persistent memory across sessions, GPT-5.4 is the current best option at general availability.
OpenAI Operator is a B2B agentic platform that lets enterprise users configure AI agent workflows without writing code. It uses GPT-5.4 as the underlying reasoning core. Operator lets businesses define the scope of what an agent is authorized to do, set access controls, configure confirmation policies, and monitor work through an audit log. Broader rollout is scheduled for Q2 2026.
Web research and data extraction, form submission and data entry across SaaS applications, email drafting and scheduling through browser-based clients, cross-document analysis within the 1M context window, full-codebase code review, and multi-step reporting workflows that pull from multiple web-based data sources. Tasks involving login-gated applications with non-standard authentication flows remain more error-prone and benefit from pre-authentication before handing control to the agent.
In sandboxed browser mode, GPT-5.4 cannot access stored passwords or password managers. For workflows requiring authenticated access, users typically pre-authenticate in the browser session before the agent takes control, or configure SSO through an enterprise deployment that handles authentication separately from the agent's action loop.
GDPval tests AI performance across 44 real-world occupations using professional evaluators from those fields. GPT-5.4 matched or exceeded industry professionals in 83.0% of comparisons. This benchmark measures practical professional value rather than abstract reasoning patterns, making it one of the more meaningful measures of real-world capability for knowledge-work automation.
Google Project Mariner is in limited preview as of March 2026 with no public API. Its OSWorld preview score is 35.2% versus GPT-5.4's 75.0%. Project Mariner's integration with Google Workspace (Docs, Sheets, Gmail, Drive) is a potential advantage in Google-centric environments, but without general availability or API access, it is not yet a production alternative to GPT-5.4 or Claude Opus 4.6.
GPT-5.4 is available now for ChatGPT Pro subscribers and via the OpenAI Responses API. Computer use requires explicit enablement; sandboxed browser mode is the default. Full-desktop access requires an enterprise plan. For production multi-agent deployments, the OpenAI Agents SDK provides orchestration support.
Sources: OpenAI GPT-5.4 announcement, TechCrunch GPT-5.4 launch coverage, The Verge computer use analysis, Ars Technica context window review, DataCamp GPT-5.4 guide.