TL;DR: AT&T's Chief Data Officer Andy Markus revealed that rebuilding the company's internal AI assistant on a multi-agent architecture — where small language models handle routine work and large models tackle only complex reasoning — cut AI costs by 90% and more than tripled daily token throughput, from 8 billion to 27 billion. The "bigger model is better" assumption that drove enterprise AI spending for the past three years is, by AT&T's account, simply wrong for most production workloads. This is the clearest enterprise case study yet for agentic AI architecture as a cost strategy rather than just a capability story.
What you will learn
- The headline numbers: 90% cheaper, 3x more tokens
- What happened: Ask AT&T's transformation
- The architecture: super agents and worker agents
- Small language models explained: why less is more
- LangChain multi-agent stack: how it works
- The math: from 8 billion to 27 billion tokens per day
- Why "bigger equals better" is dead for enterprise AI
- Who else is doing this: enterprise SLM adoption
- What this means for AI infrastructure spending
- Frequently asked questions
The headline numbers
The numbers AT&T's Chief Data Officer presented are not incremental improvements. They are the kind of transformation that forces a rethink of an entire category assumption.
Before the rebuild, AT&T's "Ask AT&T" internal AI assistant was processing approximately 8 billion tokens per day. The cost to run that workload was significant — large language model inference at enterprise scale is expensive, and AT&T is a company with over 150,000 employees who interact with internal tooling constantly.
After the rebuild onto a multi-agent architecture that routes tasks between large and small language models based on complexity, throughput more than tripled, to 27 billion tokens per day. The cost dropped by 90%.
Let that combination sit for a moment. Three times the work. One tenth the cost. In the same production environment, serving the same users, solving the same business problems. The only thing that changed was which model handled which task.
This is not a controlled benchmark on curated data. This is a production system at one of the largest telecommunications companies in the world, running at scale, serving real employee queries, and delivering those results with real cost consequences visible on an actual budget line.
Andy Markus, the CDO who oversaw the transformation, summarized the outcome in terms that are unusually direct for an executive at a company this size: "I believe the future of agentic AI is many, many, many small language models."
That is not a hedge. That is a strategic position.
What happened: Ask AT&T's transformation
Ask AT&T began as AT&T's attempt to bring generative AI inside the organization — a conversational internal assistant that employees could query for information, help with tasks, and navigate the kind of complex institutional knowledge that accumulates in any large enterprise over decades.
The initial version was built on the approach that most enterprises took when they first deployed generative AI in 2023 and 2024: route everything through a capable frontier model. The logic was sound at the time. Frontier models — GPT-4, Claude 2, the early Gemini models — were dramatically more capable than anything that had come before. They could handle ambiguity, write coherently, synthesize information across domains, and respond to poorly specified queries in ways that actually helped users. If you needed smart answers, you used a smart model.
The problem was cost. AT&T discovered what most large enterprises eventually discover when they move from pilot to production: the economics of frontier LLM inference do not scale gracefully. Every query — whether it was a complex multi-step reasoning task or a simple lookup of a standard operating procedure — hit the same expensive model. There was no distinction between what required a 70-billion-parameter model and what could be answered by something much smaller.
At 8 billion tokens per day, those undifferentiated costs become very visible very quickly.
Markus and his team's response was not to cap usage or reduce the product's scope. It was to ask a different question: what if most of those tokens do not actually need to be processed by the most expensive model available? What if the frontier model is only genuinely necessary for a fraction of the queries the system receives?
That question led to a fundamental architectural redesign.
The architecture: super agents and worker agents
The rebuilt Ask AT&T system is organized around a clear division of cognitive labor that mirrors how well-run human organizations actually work.
At the top of the hierarchy sit what Markus's team calls "super agents." These are large language models — frontier or near-frontier models with the full reasoning capability of the kind of models AT&T was previously routing everything through. Super agents handle tasks that genuinely require that level of capability: complex multi-step reasoning, synthesis across disparate information sources, ambiguous queries that require judgment about what the user actually needs, and coordination of the overall task flow.
Below the super agents sit the "worker agents." These are small language models — models that are significantly smaller in parameter count, faster, cheaper to run, and tightly focused on specific domains or task types. Worker agents handle the routine work: retrieving specific information, executing well-defined sub-tasks, performing operations with clear inputs and outputs, and handling the high-volume repetitive queries that make up the majority of any enterprise AI system's actual workload.
The super agent's job is to decompose incoming requests, route sub-tasks to the appropriate worker agents, interpret their outputs, and synthesize a final response. Worker agents do not need to understand context or exercise judgment beyond their narrow domain. They need to be accurate, fast, and cheap within that domain.
The key insight is that enterprise AI workloads are not uniformly complex. The distribution of complexity across real queries looks something like a power law: a small percentage of queries require genuine frontier-model reasoning, while the large majority are variations of queries the system has seen many times before and that could be handled by a much simpler model without any loss in quality that the user would notice.
Routing based on complexity means paying frontier-model prices only for frontier-model problems.
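In plain Python, that routing decision can be sketched as follows. This is an illustrative sketch, not AT&T's code: the model names, cost figures, and keyword heuristic are all assumptions, and a production router would use a trained classifier rather than keyword matching.

```python
# Hypothetical sketch of complexity-based model routing.
# Model names, costs, and the keyword heuristic are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Route:
    model: str          # which model tier handles the query
    cost_per_1k: float  # assumed relative cost per 1K tokens

# Stand-in for a trained complexity classifier: treat queries with
# reasoning-flavored verbs as complex, everything else as routine.
COMPLEX_MARKERS = ("compare", "analyze", "plan", "why", "trade-off")

def route(query: str) -> Route:
    """Send multi-step reasoning to the super agent (frontier LLM);
    send routine lookups to a domain worker agent (SLM)."""
    if any(marker in query.lower() for marker in COMPLEX_MARKERS):
        return Route(model="super-agent-llm", cost_per_1k=0.01)
    return Route(model="worker-agent-slm", cost_per_1k=0.0005)

print(route("What is the VPN reset procedure?").model)
print(route("Analyze why churn rose in Q3 and plan a fix").model)
```

Under the power-law distribution described above, most queries fall through to the cheap branch, which is the entire economic point of the architecture.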
Small language models explained: why less is more
A small language model is not simply a worse version of a large language model. It is a different tool, optimized for a different part of the problem space.
Large language models derive their capability from scale. Training models with hundreds of billions of parameters on enormous datasets produces systems that generalize across almost any domain, handle novel inputs gracefully, and can reason through problems they have never explicitly seen before. That generalization is expensive, both to train and to run: a query to GPT-4 or Claude 3.7 Sonnet consumes significantly more compute per token than a query to a smaller, domain-specific model.
Small language models trade that generalization for efficiency and specificity. A model trained specifically on telecommunications support queries, standard operating procedures, and AT&T's internal knowledge base does not need to be capable of writing poetry, analyzing legal documents, or explaining quantum mechanics. It needs to be very good at a narrow range of tasks, and it can be made very good at those tasks with a fraction of the parameters a frontier model requires.
Markus put it directly: "small LMs are just about as accurate as LLMs on a given domain." That observation is consistent with what academic research on domain-specific fine-tuning has been showing for years. When you evaluate a small model fine-tuned on domain-specific data against a large general model on tasks within that domain, the accuracy gap narrows dramatically — often to the point where the quality difference is not detectable in real user interactions.
The tradeoff is that the small model fails immediately outside its domain. Ask a telecommunications SLM about constitutional law and it will produce something inadequate. But in a well-architected agentic system, that query never reaches the telecommunications SLM in the first place. The super agent handles routing, and the right tool processes the right task.
This is not a new idea in software engineering. Microservices architectures are built on the same principle: many small, specialized services that each do one thing well, orchestrated by a coordination layer that routes requests appropriately. The multi-agent AI architecture is the same concept applied to language model inference.
LangChain multi-agent stack: how it works
AT&T rebuilt Ask AT&T on LangChain, one of the most widely deployed orchestration frameworks in enterprise AI. LangChain provides the infrastructure for building multi-agent systems: tooling for defining agent roles, managing state across multi-step tasks, routing between models, handling tool calls, and chaining outputs from one agent as inputs to the next.
In the AT&T implementation, LangChain functions as the orchestration layer that sits between the user interface and the underlying models. When an employee submits a query to Ask AT&T, LangChain's orchestration logic intercepts it before it reaches any model.
The orchestration layer first classifies the query: is this a complex reasoning task or a routine retrieval and execution task? That classification determines which model handles it. Complex queries route to a super agent — a frontier LLM with full context and the tools to coordinate a multi-step response. Routine queries route directly to the appropriate worker agent SLM.
Worker agents in the LangChain stack are implemented as specialized agents with defined tool access and domain scope. An SLM configured as a policy lookup agent has access to AT&T's internal policy database, is tuned on the specific formatting and language of those policies, and is evaluated on accuracy within that domain. It is not aware of and cannot access the tools available to the billing SLM or the network operations SLM.
The super agent, when it receives a complex query, may decompose it into sub-tasks and invoke multiple worker agents sequentially or in parallel, depending on whether the sub-tasks have dependencies. It collects the outputs, evaluates whether they answer the original query, and synthesizes a final response.
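The decompose-invoke-synthesize loop described above can be sketched in plain Python. The worker functions, the hard-coded plan, and the sub-task names below are illustrative stand-ins; they are not AT&T's implementation or LangChain's actual API, and a real super agent would generate the decomposition dynamically with an LLM.

```python
# Hypothetical sketch of super-agent orchestration: decompose a query,
# run independent sub-tasks in parallel, then synthesize.
# Worker names and the fixed plan are illustrative assumptions.

from concurrent.futures import ThreadPoolExecutor

def policy_worker(task: str) -> str:
    return f"[policy] {task}"        # stand-in for a policy-lookup SLM

def billing_worker(task: str) -> str:
    return f"[billing] {task}"       # stand-in for a billing SLM

WORKERS = {"policy": policy_worker, "billing": billing_worker}

def super_agent(query: str) -> str:
    # Step 1: decompose. A real frontier LLM would produce this plan
    # from the query; here it is hard-coded for illustration.
    sub_tasks = [("policy", "find the refund policy"),
                 ("billing", "pull the last invoice")]
    # Step 2: the sub-tasks are independent, so invoke workers in parallel.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda t: WORKERS[t[0]](t[1]), sub_tasks))
    # Step 3: synthesize a final answer from the worker outputs.
    return " | ".join(results)

print(super_agent("Can I get a refund on my last bill?"))
```

When sub-tasks do have dependencies, the loop runs them sequentially instead, feeding one worker's output into the next worker's input.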
What this architecture eliminates is the expensive pattern of sending every query through the most capable (and most expensive) model in the stack as a default. LangChain's routing logic is the mechanism that makes selective LLM invocation possible at production scale.
The math: from 8 billion to 27 billion tokens per day
The volume increase from 8 billion to 27 billion tokens per day deserves specific examination because it appears counterintuitive. If you reduced cost by 90%, you might expect throughput to stay flat or even decline. Instead it more than tripled. Why?
The answer lies in the economics of SLM inference relative to LLM inference. A small language model can process tokens at a fraction of the cost per token that a frontier model requires. When you replace a large percentage of LLM inference with SLM inference, the same budget that was processing 8 billion tokens per day through expensive models can now process dramatically more tokens per day through a mix of expensive and cheap models.
AT&T did not shrink its AI budget to achieve the 90% cost reduction. It restructured how that budget was spent. A smaller portion of the budget now goes to frontier model inference for complex tasks. A larger portion goes to SLM inference for routine tasks. The SLM inference is so much cheaper per token that the total token volume the budget can support expands dramatically.
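Working backwards from the reported figures shows how strongly the traffic must skew toward SLMs. In the back-of-the-envelope sketch below, the token volumes and the 90% cost cut come from the article; the 50:1 price ratio between frontier LLM and SLM is an assumption, so the derived traffic split is illustrative only.

```python
# Token volumes and the 90% cost cut are the article's figures;
# the 50:1 LLM-to-SLM per-token price ratio is an assumption.

old_tokens, new_tokens = 8e9, 27e9     # tokens per day, before and after
cost_ratio = 0.10                      # new spend is 10% of old spend

# Blended cost per token fell by this factor (spend ratio / volume ratio).
per_token_ratio = cost_ratio / (new_tokens / old_tokens)

# Assume the SLM costs 1/50th of the frontier model per token, and solve
#   share_llm * 1 + (1 - share_llm) * slm_price = per_token_ratio
# for the fraction of tokens the frontier model must handle.
slm_price = 1 / 50                     # in units of the LLM price
share_llm = (per_token_ratio - slm_price) / (1 - slm_price)

print(f"blended per-token cost: {per_token_ratio:.1%} of the all-LLM cost")
print(f"implied frontier-model share of tokens: {share_llm:.1%}")
```

Under these assumptions, only about 1% of token volume needs the frontier model, which is consistent with Markus's "many, many, many small language models" framing.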
The 27 billion tokens per day figure suggests something else important: usage grew substantially. When inference is cheap, product teams deploy AI features more aggressively. Internal users query the system more often because the system responds faster (SLMs have lower latency than frontier models at inference time). Features that would have been cost-prohibitive to build on an all-LLM stack become viable on a mixed SLM/LLM stack.
This is a virtuous cycle. Lower cost enables more deployment. More deployment generates more usage. More usage demonstrates ROI. Demonstrated ROI justifies further investment. The 90% cost cut is not just a budget saving — it is the mechanism that allowed AT&T to triple the productive output of its AI investment.
Why "bigger equals better" is dead for enterprise AI
The "bigger model is better" assumption drove enterprise AI procurement decisions from 2022 through most of 2025. The reasoning was understandable: frontier models had clearly demonstrated that scale produced qualitative capability improvements that smaller models could not match. GPT-4 could do things GPT-3 simply could not. The jump from 7-billion to 70-billion parameter models in the open-source ecosystem was similarly dramatic.
Enterprise procurement teams drew a logical but flawed conclusion: if bigger is better, always buy the biggest model you can access. Producing inadequate results with a model that is too small seemed riskier than paying too much for a model more powerful than needed.
That reasoning made sense in 2023, when the enterprise use cases were primarily exploratory. Pilots, proofs of concept, and demo applications are not sensitive to inference cost. The question in a pilot is whether the model can do the thing at all. Cost per token is not the constraint.
The reasoning breaks down completely in production. When you are processing 8 billion tokens per day, cost per token is the primary constraint. At production scale, a model that is 10x more expensive to run than a smaller model with equivalent quality on your specific tasks will eventually force a budget reckoning — and the reckoning looks like what AT&T experienced.
The AT&T case makes explicit what a growing body of academic and industry evidence has been suggesting: for most production enterprise workloads, domain-specific SLMs match or approach frontier model accuracy on in-domain tasks. The broad capability of a frontier model is being purchased and paid for in full, even though the vast majority of enterprise queries do not require it.
Markus's framing — "the future of agentic AI is many, many, many small language models" — is not a dismissal of large models. It is a statement about task allocation. Large models still have a role. That role is reasoning about complex, ambiguous, high-stakes problems. For everything else, the right model is the smallest one that can do the job acceptably well.
Who else is doing this: enterprise SLM adoption
AT&T is not pioneering a concept so much as executing it at scale and being public about the results. The shift toward smaller, specialized models in enterprise AI is happening across industries and has been accelerating throughout 2025 and into 2026.
Microsoft has been the most aggressive public advocate for SLMs in enterprise contexts. The Phi model family — Phi-3, Phi-3.5, and subsequent iterations — is explicitly designed for enterprise deployment at lower cost. Microsoft's positioning for these models emphasizes on-device and on-premises deployment where data governance requirements prevent cloud-based frontier model access, and cost efficiency for high-volume applications.
Google's Gemma family serves a similar function. Meta's Llama small-parameter variants have become the backbone of countless enterprise fine-tuning projects precisely because they are small enough to tune cheaply and deploy on modest infrastructure.
In financial services, several major banks have moved routine customer service query routing to fine-tuned SLMs while reserving frontier models for fraud detection edge cases and complex product recommendation. The pattern is structurally identical to what AT&T built: tiered model selection based on task complexity.
Healthcare AI deployments have moved in the same direction for slightly different reasons. In healthcare, data governance and privacy requirements make it difficult to route sensitive patient information to external frontier model APIs. On-premises SLMs, fine-tuned on medical literature and clinical documentation, are increasingly the default for production deployments where data cannot leave the organization's infrastructure.
The trend is consistent across verticals: enterprises that built on all-LLM stacks are rebuilding on mixed architectures. The sequence is almost always the same — pilot with frontier model, scale to production, encounter cost crisis, redesign with SLMs for routine tasks, achieve dramatic cost reduction.
What this means for AI infrastructure spending
The implications of the AT&T transformation — and the broader enterprise SLM trend it represents — for AI infrastructure investment are significant and cut in several directions simultaneously.
For frontier model API providers: The AT&T result is not good news for companies whose business model depends on enterprises routing all production workloads through their most expensive models. If 90% of enterprise token volume migrates to SLMs, frontier model API revenue from those workloads drops accordingly. The frontier model providers will argue — correctly — that their models remain essential for the complex reasoning tasks at the top of the agent hierarchy. That is true. It just means they capture a smaller fraction of total enterprise token volume than they did when they were the only option.
For cloud infrastructure providers: Mixed LLM/SLM architectures favor cloud providers that offer flexible inference infrastructure across model sizes rather than those that primarily optimize for frontier model serving. AWS Bedrock, Azure AI, and Google Vertex AI have all expanded their model catalogs to include SLM options precisely because enterprise demand is moving in this direction.
For open-source model ecosystems: SLM adoption in enterprise directly benefits the open-source model communities. Fine-tuning a domain-specific SLM on proprietary enterprise data is much more feasible with an open-source base model than with a closed frontier model. The Llama, Mistral, and Phi families are the base models most commonly used for this purpose. Enterprise adoption of SLM architectures is accelerating the economic case for continued open-source model development.
For enterprise AI strategy overall: The most important implication is that "how much AI can we afford" is no longer a question about model access. The access question has been largely solved — frontier models are available via API, SLMs are available open-source. The question is architecture: how do you design a system that routes tasks to the right model at the right cost, at production scale, without sacrificing quality? That is an engineering and organizational design problem, not a procurement problem. The companies that solve it will have a durable cost advantage over those that do not.
AT&T's 90% cost reduction is not the end of that journey. It is the proof of concept that enterprise AI architecture decisions matter as much as model selection.
Frequently asked questions
What is the difference between a large language model and a small language model?
Large language models (LLMs) typically have tens of billions to hundreds of billions of parameters and are trained on broad internet-scale datasets. They excel at generalization — handling novel tasks, ambiguous queries, and reasoning across domains. Small language models (SLMs) have far fewer parameters, often between 1 billion and 14 billion, and are frequently fine-tuned on specific domain data. SLMs are faster, cheaper to run, and approach or match LLM accuracy within their specific domain, but fail quickly outside that domain. The distinction matters for production cost because inference cost scales with model size.
What is "Ask AT&T" and how do AT&T employees use it?
Ask AT&T is AT&T's internal AI assistant deployed for employee use across the company. It functions similarly to a conversational assistant that employees can query for information — policy lookups, operational guidance, knowledge base retrieval, and task assistance. At 27 billion tokens per day of throughput, it is processing an enormous volume of internal queries, suggesting substantial adoption across a workforce of over 150,000 employees.
Why did AT&T choose LangChain for orchestration?
LangChain is one of the most mature and widely deployed orchestration frameworks for multi-agent AI systems. It provides tooling for defining agent roles, managing state across multi-step interactions, routing between models, and chaining outputs. AT&T's use of LangChain is consistent with broader enterprise adoption — the framework's flexibility in supporting heterogeneous model backends (mix of frontier LLMs and SLMs) makes it well-suited for the kind of tiered architecture AT&T built.
Does switching to SLMs mean lower quality outputs for employees?
According to AT&T's CDO Andy Markus, no. His stated position is that "small LMs are just about as accurate as LLMs on a given domain." For in-domain tasks — the queries that worker agent SLMs handle — fine-tuned small models approach frontier model accuracy. Quality loss occurs only when an SLM is asked to handle tasks outside its tuned domain, which the orchestration architecture prevents by routing complex or out-of-domain queries to super agent LLMs.
Can smaller enterprises replicate what AT&T did?
The architectural pattern is not AT&T-specific. Any organization with high-volume, repetitive AI workloads and the engineering capability to implement LangChain-based orchestration can pursue the same approach. The barriers are primarily technical: building reliable routing logic, fine-tuning domain-specific SLMs, and managing the operational complexity of a multi-model architecture. These are solvable engineering problems. The 90% cost reduction AT&T achieved is at the high end of what should be expected, because AT&T's scale amplifies the savings, but meaningful cost reductions are achievable at smaller scale.
What does this mean for AI companies that sell frontier model access?
It adds pressure on pricing and forces a clearer articulation of the use cases where frontier models are genuinely irreplaceable. The enterprise AI market is bifurcating: a high-value, lower-volume market for complex reasoning tasks that require frontier capability, and a high-volume, lower-value-per-query market for routine tasks that SLMs can handle. Frontier model providers need to compete primarily on quality for complex tasks rather than on convenience for all tasks. That is a different competitive dynamic than they were operating in during 2023 and 2024.