TL;DR: Andreessen Horowitz has led a $43M Series A in Deeptune, a New York startup building high-fidelity reinforcement learning environments — "training gyms" — where AI agents practice workplace tasks before deployment. The round arrives as Jensen Huang projects 7.5 million AI agents entering the workforce, Microsoft launches Copilot Cowork, and a16z identifies agent-training environments as a core new infrastructure category. With seven-figure ARR in its first six months, Deeptune is targeting 10x growth as enterprise demand for agent reliability infrastructure spikes.
If Jensen Huang's 7.5 million AI agents need training, someone has to build the gym. Deeptune just raised $43 million from Andreessen Horowitz to do exactly that — constructing simulated professional environments where agents learn to navigate Slack, Salesforce, ticketing systems, and financial tools before they ever touch a real enterprise workflow.
What you will learn
- What Deeptune actually builds — and why "training gym" is the right metaphor
- The a16z thesis: why agent infrastructure is the new platform layer
- The agent training problem — why scale breaks current approaches
- How training gyms work: simulation, reward signals, and evaluation loops
- The competitive landscape: who else is building environments
- Enterprise use cases: where trained agents create real value
- The broader agent stack — where Deeptune sits
- What this means for the AI infrastructure market
- Key takeaways
What Deeptune actually builds — and why "training gym" is the right metaphor
Deeptune's core product is a set of high-fidelity reinforcement learning environments that simulate the day-to-day workflows of professional roles. Think accountants reconciling invoices, customer support reps navigating ticketing queues, or DevOps engineers triaging monitoring alerts — but entirely synthetic, sandboxed, and instrumented for agent learning rather than human work.
The company was founded by Tim Lupo (CEO) and Lukas Schmit, whose team of roughly twenty engineers and operators draws from Anthropic, Scale AI, Palantir, Hebbia, Glean, Retool, and Uber. They are based in New York and have been operating with unusual capital efficiency: the company crossed seven-figure ARR within six months of launch, before closing this Series A.
The "training gym" framing is deliberate and precise. Just as a human athlete does not walk onto a field for the first time during a championship game, an AI agent should not navigate a live Salesforce instance for the first time while handling a real customer escalation. Deeptune builds the practice facility — fully simulated, instrumented with reward signals, and designed to push agents through increasingly difficult task variants until they can complete them reliably.
The environments currently simulate workflows across Slack, Salesforce, financial reconciliation tools, IT monitoring platforms, and customer support systems. Agents practice navigating multi-step tasks that cross application boundaries: retrieving a customer record in Salesforce, updating a ticket in a support system, and sending a resolution summary in Slack — all within a single coherent workflow. This cross-application complexity is precisely what makes enterprise agent deployment hard in practice and what makes Deeptune's simulation fidelity the core differentiator.
The round was led by Andreessen Horowitz and joined by 776, Abstract Ventures, and Inspired Capital. Angel investors include Noam Brown, the OpenAI researcher behind o1's reasoning breakthrough, Mercor CEO Brendan Foody, and Applied Compute CEO Yash Patil — a roster that signals the company is well-networked inside the research and infrastructure communities it depends on.
The a16z thesis: why agent infrastructure is the new platform layer
Andreessen Horowitz did not stumble into this investment. The firm has been publicly developing an agent infrastructure thesis for the better part of a year, and Deeptune is a direct expression of it.
In its Big Ideas 2026 publication, a16z identified reinforcement learning environments as a new category of companies worth backing at the infrastructure level. The firm's framing: to achieve the next level of capability from AI models, reinforcement learning must be applied by situating the model in very specific scenarios. Environments that put agents in a kind of gym — to practice repeatedly and evaluate task completion at increasingly higher success rates — are the missing layer between model capability and production deployment.
a16z has deployed a dedicated $1.7 billion slice of its $15 billion mega-fund toward AI infrastructure. The broader infrastructure thesis centers on a fundamental shift from human-speed to agent-speed workloads. Enterprise backends were built around a 1:1 ratio of human actions to system responses. They are not architected for a single agentic goal to trigger a recursive fan-out of thousands of sub-tasks, database queries, and internal API calls in rapid succession. The infrastructure gaps this creates — reliability, error handling, context management, and evaluation — are exactly where a16z is placing bets.
Every previous computing platform — PCs, smartphones, cloud — created massive infrastructure companies that captured more durable value than the applications running on top of them. AWS captured more value than most apps running on AWS. The App Store and Google Play captured more value than most individual apps. If agents are the next platform, the infrastructure companies that enable agent development, training, deployment, and monitoring will be the AWS-equivalents of the agent era.
Deeptune solves the evaluation and training side of that gap. An agent cannot be deployed reliably if there is no mechanism to measure its performance, identify failure modes, and improve it through practice. The training gym is that mechanism. a16z's investment signals that it views training environments not as a research experiment but as a production infrastructure category with real enterprise buyers — Deeptune's early ARR validates the demand side of that thesis.
The firm has also noted that major AI labs are reportedly considering spending more than a billion dollars on such environments. If foundation model providers need Deeptune's environments to train the next generation of agents, that is a very different market size than enterprise SaaS adoption alone. It is potentially a compute-level procurement decision.
The agent training problem — why scale breaks current approaches
The agent training problem is not obvious until you try to solve it at scale, and then it becomes the only problem that matters.
Current approaches to training AI agents for enterprise tasks rely primarily on three methods, each with fundamental scaling limitations.
Prompt engineering is the dominant method today. Teams write detailed system prompts telling agents what to do, how to behave, and what to avoid. This works for simple tasks but collapses as complexity increases: for multi-step, multi-application workflows with edge cases, error states, and ambiguous inputs, the instructions required to handle every scenario exceed both what fits in a context window and what any human can anticipate. The agent that performs correctly in a demo fails on the third variant of a task it was never shown.
Supervised fine-tuning on human demonstrations has a data problem. Collecting high-quality human demonstrations of enterprise workflows is expensive, slow, and creates a dataset that is instantly stale as the underlying software tools change their interfaces. The agent trained on Salesforce Classic demonstrations will fail when the interface updates. Maintaining human demonstration datasets at the scale required to cover enterprise software diversity is economically untenable.
RLHF (Reinforcement Learning from Human Feedback) is effective but expensive. Human evaluators rate agent outputs, and the model learns from those ratings. At scale, this requires hundreds of human evaluators working continuously, which is prohibitively expensive and slow. It also introduces human bias and inconsistency — two evaluators may rate the same agent response differently.
Evaluation benchmarks like SWE-bench and GAIA test agent capabilities on standardized tasks, but they do not train agents for your specific workflows. An agent that scores well on SWE-bench may still fail at your company's particular code review process because SWE-bench does not simulate your codebase, your coding standards, or your CI/CD pipeline.
Reinforcement learning from environment interaction solves these problems — but only if the environments are good enough. An agent trained in a low-fidelity simulation will fail when deployed in a real system. The simulation needs to accurately reproduce the state transitions, error conditions, and interface behaviors of real enterprise software. Building that fidelity is the hard technical problem Deeptune has chosen to solve.
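Deeptune's actual interfaces are not public, but the interaction pattern described above can be sketched gym-style. Everything below (the `TicketEnv` class, its action names, its reward scheme) is an illustrative toy, not Deeptune's API; it shows only the shape of environment-based training: reset, act, observe, receive reward.

```python
# Illustrative toy, not Deeptune's API: a three-step support workflow
# (look up the account, update the ticket, send a summary) framed as an
# RL environment with the usual reset/step loop.
class TicketEnv:
    ACTIONS = ["lookup_account", "update_ticket", "send_summary"]

    def reset(self):
        self.progress = 0
        return {"progress": self.progress}

    def step(self, action):
        # The workflow is only correct if the three actions run in order.
        if self.progress < 3 and action == self.ACTIONS[self.progress]:
            self.progress += 1
            return {"progress": self.progress}, 1.0, self.progress == 3
        return {"progress": self.progress}, -1.0, True  # wrong step ends the episode

def run_episode(env, policy):
    obs, total, done = env.reset(), 0.0, False
    while not done:
        obs, reward, done = env.step(policy(obs))
        total += reward
    return total

# A policy that has learned the workflow earns +3; one that skips ahead fails.
trained = lambda obs: TicketEnv.ACTIONS[obs["progress"]]
print(run_episode(TicketEnv(), trained))                     # → 3.0
print(run_episode(TicketEnv(), lambda obs: "send_summary"))  # → -1.0
```

The fidelity problem lives in the `step` function: a production-grade environment must reproduce the real system's state transitions and error conditions, not a three-branch toy.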
The global reinforcement learning market is projected to grow from roughly $11.6 billion in 2025 to more than $90 billion by 2034. That growth is driven by exactly this shift: from static, web-scale training data toward interactive, simulation-based learning for agents operating in real-world task environments.
How training gyms work: simulation, reward signals, and evaluation loops
The mechanics of Deeptune's training environments break into three components that form a closed loop: the simulation layer, the reward signal layer, and the evaluation loop.
The simulation layer is the environment itself — a faithful reproduction of the software interfaces and state transitions that an enterprise agent will encounter in production. This includes the full interaction graph of a CRM system like Salesforce: creating records, querying against filters, updating field values, triggering workflow automations, and handling the edge cases that real data produces. Deeptune builds these simulations by studying actual enterprise workflow patterns, which means the environments are not generic — they are calibrated to the specific software stacks enterprise customers actually use.
The reward signal layer defines what "correct" looks like within the simulation. Unlike supervised learning, where correctness is defined by matching a human demonstration, reinforcement learning requires a reward function: a quantitative signal that tells the agent whether its actions moved toward or away from the goal. In enterprise workflow contexts, this is harder to specify than it sounds. Did the agent resolve the customer ticket correctly? Success is not just "ticket closed" — it depends on resolution quality, time taken, compliance with process requirements, and downstream consequences. Deeptune's expertise is partly in building reward functions that capture real enterprise success criteria, not proxy metrics that agents can exploit.
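As a concrete illustration of why "ticket closed" is a poor proxy, here is a hypothetical composite reward for the ticket scenario; the field names and weights are invented for illustration, not Deeptune's actual reward design.

```python
# Hypothetical composite reward for a simulated ticket workflow.
# All field names and weights below are invented for illustration.
def ticket_reward(outcome):
    if not outcome["ticket_closed"]:
        return 0.0
    reward = 1.0
    reward += 0.5 * outcome["resolution_quality"]          # env-graded, 0..1
    reward -= 0.1 * max(0, outcome["minutes_taken"] - 10)  # penalize slow work
    if not outcome["followed_escalation_policy"]:
        reward -= 2.0  # compliance violations outweigh speed and quality gains
    return reward

ok = {"ticket_closed": True, "resolution_quality": 0.9,
      "minutes_taken": 5, "followed_escalation_policy": True}
print(round(ticket_reward(ok), 2))  # → 1.45
# Closing the ticket while skipping the escalation policy nets a loss:
print(ticket_reward({**ok, "followed_escalation_policy": False}) < 0)  # → True
```

The second call is the exploit-resistance point: an agent that learns to close tickets fast while ignoring process requirements is punished, not rewarded.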
The evaluation loop is what distinguishes a training gym from a sandbox. The loop systematically exposes agents to task variants of increasing difficulty, measures performance across the distribution, identifies failure modes, and feeds that information back into the training process. This closed loop is what produces reliability gains over time rather than one-time optimization. An agent that completes 95% of straightforward customer service tasks and 40% of complex escalations has a specific failure mode that the evaluation loop can surface and the training process can target.
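The measurement step can be sketched as success rates bucketed by difficulty tier, so tier-specific gaps become visible. The `agent` and `scenario` shapes here are hypothetical stand-ins.

```python
from collections import defaultdict

# Illustrative sketch: success rate per difficulty tier, so that
# tier-specific failure modes surface instead of averaging away.
def evaluate(agent, scenarios):
    tally = defaultdict(lambda: [0, 0])  # tier -> [successes, attempts]
    for scenario in scenarios:
        tier = scenario["difficulty"]
        tally[tier][1] += 1
        tally[tier][0] += agent(scenario)  # agent returns True on success
    return {tier: ok / n for tier, (ok, n) in tally.items()}

# A hypothetical agent that handles routine tickets but fails escalations:
agent = lambda s: s["difficulty"] == "routine"
scenarios = ([{"difficulty": "routine"}] * 20
             + [{"difficulty": "escalation"}] * 10)
print(evaluate(agent, scenarios))  # → {'routine': 1.0, 'escalation': 0.0}
```

An aggregate score of 67% would hide the pattern; the per-tier breakdown is what tells the training process which task distribution to sample next.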
The combination implements curriculum learning — simulations start simple and progressively increase in difficulty, preventing the agent from being overwhelmed by edge cases before mastering the basics. The evaluation layer is designed to be composable: enterprise customers can define custom evaluation criteria — specific compliance rules, brand voice requirements, accuracy thresholds — and stack them into evaluation pipelines that run automatically after each training iteration.
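The composable-pipeline idea can be sketched as a list of pass/fail evaluators applied to each episode result. The checks below (a crude PII rule, an all-caps tone proxy, a score threshold) are invented stand-ins for the compliance, brand-voice, and accuracy criteria described above, which in practice would be customer-defined.

```python
# Invented stand-in evaluators; each inspects one episode result and
# returns pass/fail. Real criteria would be defined by the customer.
def compliance(res):   return "ssn" not in res["reply"].lower()
def brand_voice(res):  return not res["reply"].isupper()
def accuracy(res):     return res["task_score"] >= 0.9

PIPELINE = [compliance, brand_voice, accuracy]

def gate(result, pipeline=PIPELINE):
    """Promote an agent checkpoint only if every stacked check passes."""
    return all(check(result) for check in pipeline)

ok  = {"reply": "Refund issued, receipt attached.", "task_score": 0.95}
bad = {"reply": "REFUND ISSUED.", "task_score": 0.95}
print(gate(ok), gate(bad))  # → True False
```

Because the pipeline is just a list, a customer can append its own checks without touching the training loop, which is the practical meaning of "composable" here.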
The key advantage over human-in-the-loop training is speed. An agent that would take weeks to evaluate through human feedback loops can run through thousands of simulated scenarios overnight. This compresses training cycles from weeks to hours and makes iterative improvement economically viable at enterprise scale.
The competitive landscape: who else is building environments
Deeptune is not the only company that identified agent training environments as a category worth building. The competitive landscape includes a mix of UI-focused gyms, domain-specific environments, and data infrastructure plays — most of them significantly earlier stage.
The competitive picture has a clear shape: most competitors are seed-stage, narrowly focused on one or two workflow types, and primarily serving a handful of customers. Deeptune has differentiated on two dimensions — breadth of environments (multiple professional domains and software stacks) and early commercial traction (seven-figure ARR before Series A).
The more important competitive pressure comes from the demand side. Data-labeling incumbents like Scale AI are reportedly racing to build their own environment offerings. If the market for agent training environments is large enough for Scale AI to pursue, the category has definitional legitimacy — but Deeptune is operating against a much better-capitalized potential entrant. The $43 million Series A is partly about moving fast enough that Deeptune owns the workflow-simulation niche before larger platforms can commoditize it.
Foundation model providers — OpenAI, Anthropic, Google DeepMind — are also building their own agent evaluation frameworks, though these tend to be internal tooling for model development rather than commercial offerings. The risk for Deeptune is that a foundation model provider decides to productize its internal evaluation infrastructure. The counterargument is that enterprise-specific workflow simulation requires deep domain knowledge that generic model providers are unlikely to build from scratch for every vertical.
LangChain and LangSmith provide tooling for building and monitoring LLM applications, including agents. LangSmith's tracing and evaluation features overlap with Deeptune's evaluation capabilities, but LangChain is primarily a development framework rather than a training infrastructure. It helps you build and debug agents, not systematically train them through simulated practice.
Patronus AI focuses on AI evaluation and red-teaming, helping companies test their AI systems for failures. Evaluation is one component of Deeptune's training loop, but Patronus does not provide the simulated training environments that make reinforcement learning possible.
Enterprise use cases: where trained agents create real value
The business case for training gyms is straightforward when you examine the specific failures that untrained agents produce in production.
Customer service agents are among the most common enterprise deployment targets. An agent handling inbound support tickets must navigate a CRM to retrieve account history, consult a knowledge base for resolution steps, apply business rules for escalation, and update the ticket with the correct resolution code. Each of these sub-tasks has failure modes. An agent that misclassifies a ticket, retrieves the wrong account, or applies an incorrect resolution code creates more work for the human team that has to clean it up. Training gyms let companies measure and improve agent performance on each sub-task before deployment, with target success rates defined by the business rather than by what the model can achieve out of the box.
Financial operations agents face an even more consequential failure environment. An agent reconciling invoices, flagging payment exceptions, or routing expense approvals must operate within strict compliance requirements. The cost of an error is not just rework — it is audit risk, regulatory exposure, and potentially material financial misstatement. A training gym that simulates financial workflows with realistic edge cases (duplicate invoices, currency conversion errors, multi-entity consolidation) allows companies to validate agent reliability against compliance thresholds before production deployment.
DevOps and IT operations agents must navigate monitoring dashboards, incident management systems, and infrastructure tooling under time pressure. An agent that misroutes an incident, fails to escalate correctly, or takes a remediation action that makes an outage worse is a significant operational liability. Simulation environments that reproduce realistic incident scenarios — including the cascade failures and ambiguous signals that characterize real outages — allow teams to certify agent behavior before it matters.
Code review and development pipeline agents are a newer but fast-growing use case, driven by the success of tools like Cursor and GitHub Copilot at the individual developer level. Enterprise teams want agents that can review PRs for compliance with internal standards, identify security vulnerabilities, or manage routine pipeline maintenance. Training gyms calibrated to specific codebases and engineering processes let teams build agents that behave consistently within their specific development culture, not just against generic benchmarks.
Sales and outreach agents that craft personalized emails, qualify leads, and schedule meetings need to understand the company's ICP (ideal customer profile), value propositions, and competitive positioning. Simulation environments can generate realistic prospect profiles and conversation scenarios, letting sales agents practice before touching real leads — reducing the cost of bad outreach and protecting brand reputation.
The pattern across these use cases is consistent: enterprises have the foundation model capability but lack the task-specific training infrastructure to make agents reliable enough for production deployment. That gap is Deeptune's market.
The broader agent stack — where Deeptune sits
Understanding Deeptune's position requires placing it in the emerging agent infrastructure stack, which has four layers that are each attracting significant investment.
Layer 1: Foundation models — the base intelligence layer (GPT-5, Claude, Gemini). This is where the raw capability comes from. Foundation model providers compete on benchmark performance, reasoning ability, and tool-use sophistication.
Layer 2: Orchestration and workflow — frameworks that allow developers to chain model calls, manage state, and coordinate multi-agent workflows (LangGraph, AutoGPT-style frameworks, OpenAI's agent SDK). This layer handles the plumbing between models and tasks.
Layer 3: Tool and environment access — the connections between agents and the systems they interact with. This includes browser control, API integrations, computer use interfaces, and the simulation environments Deeptune provides. Training gyms live at this layer — they are high-fidelity reproductions of the tool environment agents will operate in.
Layer 4: Evaluation and reliability — the infrastructure for measuring whether agents work correctly, identifying failure modes, and improving performance over time. Deeptune's reward signal and evaluation loop components sit here.
Deeptune occupies layers three and four simultaneously, which is a defensible position. A company that builds both the simulation environment and the evaluation methodology is harder to displace than one that provides only a simulation layer (which can be replaced by better simulations) or only an evaluation framework (which can be replaced by better metrics). The integration between the two is where the compound value lives.
The analogy to cloud computing is instructive. In the early cloud era, companies ran workloads directly on raw compute instances, manually configuring everything. Over time, specialized infrastructure emerged — container orchestration (Kubernetes), CI/CD pipelines (Jenkins, GitHub Actions), monitoring (Datadog) — that made cloud deployment reliable and scalable. Agent training gyms are the CI/CD equivalent for the agent era: the infrastructure that transforms raw model capability into reliable production behavior.
The agent stack is still being assembled. In the past twelve months, the orchestration layer has consolidated around a handful of frameworks. The foundation model layer has clear leaders. The training and evaluation layer — Deeptune's territory — is the least mature, which means it has the most room to grow and the most risk of consolidation.
What this means for the AI infrastructure market
Deeptune's $43 million round is a data point in a larger pattern: the AI infrastructure investment thesis is shifting from model training and inference toward the layers that make models useful in production.
The 2023-2024 infrastructure wave was dominated by GPU compute (CoreWeave, Lambda Labs), model serving (Together AI, Fireworks AI), and foundational tooling (LangChain, vector databases). Those categories are maturing. The 2025-2026 wave is moving up the stack toward reliability, evaluation, and workflow-specific optimization — the unglamorous infrastructure that determines whether agents actually work for the business outcome they were hired for.
Jensen Huang framed this shift at GTC 2026, projecting 7.5 million AI agents entering the workforce and calling agentic systems "the new computer." Microsoft's Copilot Cowork is a concrete enterprise product built on that premise. Salesforce has Agentforce. ServiceNow has AI Agents. Every major enterprise software company is building an agent layer. The number of deployed agents will grow by orders of magnitude in the next 24 months.
What none of those announcements adequately addressed is the gap between "we have an agent" and "the agent works reliably at the task we need." Training gyms are the answer to that gap.
The reinforcement learning environment market is projected to grow from $11.6 billion in 2025 to over $90 billion by 2034. That growth is predicated on the same assumption Deeptune is betting on: that RL-based agent training becomes the standard approach for enterprise deployment, rather than a research technique used primarily by AI labs.
Deployment without training is deployment without reliability. And unreliable agents in production create real damage — wrong refunds, missed compliance requirements, bad code merged to main, misleading financial analyses. The cost of agent failures at scale will make current concerns about AI hallucination look quaint.
The seven-figure ARR in the first six months suggests the demand is real, not hypothetical. Enterprises are not buying training environments as a speculative investment in future capability — they are buying them because they are trying to deploy agents now and encountering exactly the reliability and evaluation gaps that Deeptune's environments address.
Training infrastructure is also sticky. Once an enterprise has built hundreds of simulated environments for its agents on Deeptune's platform, switching costs become prohibitive. The training data, the reward functions, the evaluation benchmarks, and the institutional knowledge embedded in those simulations represent months of work. That is the kind of lock-in that venture investors prize — and that makes training gyms a potentially more durable business than many other AI infrastructure categories.
The remaining question is whether Deeptune can build a defensible position before the category attracts the kind of capital that commoditizes early movers. The $43 million gives the team runway to expand the simulation library, deepen enterprise relationships, and establish the evaluation methodology as a standard before competitors catch up. That is the race Deeptune is now running.
For the enterprise AI market, the signal is clear: the agent economy infrastructure layer is real, it has paying customers, and it is attracting serious capital. The companies that build the training infrastructure for AI agents may ultimately have the same strategic position that database providers had in the software era — invisible to end users, indispensable to the systems that serve them.
TL;DR
- The deal: Andreessen Horowitz led a $43M Series A in Deeptune, with participation from 776, Abstract Ventures, and Inspired Capital, plus angels including OpenAI researcher Noam Brown.
- The product: High-fidelity reinforcement learning environments — "training gyms" — that simulate professional workflows (Slack, Salesforce, support systems, financial tools) so AI agents can learn tasks before deployment.
- The team: ~20 people in New York, founded by Tim Lupo and Lukas Schmit, with talent from Anthropic, Scale AI, Palantir, Hebbia, and Retool.
- Early traction: Seven-figure ARR in the first six months, targeting 10x growth on new capital.
- The a16z thesis: Agent training environments are a new infrastructure category; major AI labs may spend over $1 billion on such environments; the RL market grows from $11.6B (2025) to $90B+ (2034).
- The training problem: Prompt engineering, supervised fine-tuning, and RLHF all break at enterprise scale — systematic simulation-based training is the missing layer.
- The competitive field: Mostly seed-stage companies (Turing, Mechanize, Fleet, Bespoke Labs, Vmax) focused on narrow workflow types; Deeptune leads on breadth and commercial traction.
- The market moment: Jensen Huang projects 7.5M agents, Microsoft ships Copilot Cowork, and enterprises are deploying agents at scale — creating urgent demand for training and evaluation infrastructure.
- The strategic position: Deeptune occupies the training environment and evaluation layers of the agent stack simultaneously, a defensible combination that positions it as foundational infrastructure for the enterprise agent economy.