TL;DR: Voice AI agents crossed a critical threshold in 2025. With ElevenLabs' Sonic-3 engine hitting 90ms voice rendering, OpenAI's Realtime API enabling native audio-in/audio-out without transcription overhead, and platforms like Vapi and Retell abstracting telephony complexity, you can now build production voice agents that feel indistinguishable from human agents. But the gap between a demo and a production deployment that handles 10,000 calls per day is enormous. This guide covers the complete technical picture: pipeline architecture, platform trade-offs, latency optimization, cost modeling, and the specific implementation decisions that separate voice agents that work from ones that frustrate callers and get abandoned after three calls.
What you will learn
- The voice agent pipeline: how it actually works
- Native multimodal vs. STT-LLM-TTS: the architecture decision
- Platform comparison: ElevenLabs, Vapi, Retell, Bland, Burki
- Latency: where time goes and how to get it back
- Cost modeling: what you will actually pay at scale
- Telephony integration: Twilio, SIP trunking, and PSTN
- Interruptions, turn-taking, and emotional tone detection
- Use cases: where voice agents generate real ROI
- Building a custom modular stack
- Frequently asked questions
The voice agent pipeline: how it actually works
Every voice AI agent, regardless of which platform or model stack you use, performs a variant of the same sequence: capture audio from the caller, transcribe it to text, reason over it with a language model, generate a text response, convert that text to speech, and stream audio back to the caller. The loop then repeats until the conversation ends.
That loop sounds simple. The execution is not.
Here is a realistic sequence diagram for a standard STT-LLM-TTS pipeline:
```mermaid
sequenceDiagram
participant Caller
participant VAD as Voice Activity<br/>Detection
participant STT as Speech-to-Text<br/>(Deepgram / Whisper)
participant LLM as LLM<br/>(GPT-4o / Claude)
participant TTS as Text-to-Speech<br/>(ElevenLabs Sonic-3)
participant Caller2 as Caller
Caller->>VAD: Raw audio stream
VAD->>STT: Segment on silence (end-of-utterance)
STT->>LLM: Transcript text (50-150ms)
Note over STT,LLM: STT latency: 50-200ms
LLM->>TTS: Response tokens (streaming)
Note over LLM,TTS: LLM first-token: 80-300ms
TTS->>Caller2: Audio stream (first chunk)
Note over TTS,Caller2: TTS first-audio: 90-300ms
Note over Caller,Caller2: Total perceived latency: 300-800ms
```
Each step in this chain has its own latency floor, its own variance, and its own failure modes. Voice Activity Detection (VAD) has to decide when the caller has finished speaking — too aggressive and you cut them off, too conservative and you add hundreds of milliseconds of dead air. STT models have a transcription latency of 50-200ms depending on whether you're using a streaming API or batch processing. The LLM has a time-to-first-token that varies by model size and load. TTS has a warm-up period before it can begin streaming audio. And telephony infrastructure — PSTN handoff, Twilio signaling, WebSocket connections — adds its own overhead at every edge.
The sum of these latencies is what callers perceive as "response time." Human conversation operates comfortably with 150-300ms of turnaround. Beyond 500ms, callers notice. Beyond 800ms, they start speaking again to fill the silence (causing an interruption cascade). Beyond 1,200ms, they assume the system is broken. This is not a soft preference — it is a hard behavioral pattern backed by decades of telephony UX research.
Getting below 300ms total latency requires winning at every stage of the pipeline simultaneously. There is no single magic fix. You need fast VAD, streaming STT, streaming LLM output with first-token optimization, and a TTS engine with sub-100ms audio onset. Achieving this is now possible, but it requires deliberate platform choices and architecture discipline.
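To see why every stage matters, sum the stage ranges from the diagram above. These are the illustrative ranges from this article, not measurements:

```python
# Illustrative per-stage latency ranges (ms) taken from the pipeline diagram.
STAGES = {
    "vad_gap": (150, 400),        # end-of-utterance silence wait
    "stt": (50, 200),             # streaming vs. batch transcription
    "llm_first_token": (80, 300),
    "tts_first_audio": (90, 300),
}

best = sum(lo for lo, _ in STAGES.values())
worst = sum(hi for _, hi in STAGES.values())
print(f"best case: {best}ms, worst case: {worst}ms")
# best case: 370ms, worst case: 1200ms -- even with every stage at its
# floor, the sum overshoots 300ms unless the VAD gap itself is attacked.
```

The arithmetic makes the point: a single stage running at the high end of its range pushes the whole conversation past the threshold where callers notice.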
Native multimodal vs. STT-LLM-TTS: the architecture decision
The traditional STT-LLM-TTS pipeline dominated voice AI from 2022 through early 2025. In late 2024 and 2025, two developments fundamentally changed the architecture calculus: OpenAI's Realtime API and Gemini 2.0 Flash native audio support.
Both models accept raw audio as input and emit audio as output without any transcription step. The model reasons directly over acoustic features, which eliminates the STT latency floor entirely and — critically — preserves paralinguistic information (tone, emotion, hesitation, emphasis) that gets destroyed when audio is converted to text and back.
Here is how the two architectural approaches compare in production:
```mermaid
flowchart LR
subgraph pipeline["STT-LLM-TTS Pipeline"]
direction LR
A1[Caller Audio] --> B1[VAD]
B1 --> C1[STT\n50-150ms]
C1 --> D1[LLM\n80-300ms]
D1 --> E1[TTS\n90-300ms]
E1 --> F1[Audio Out]
end
subgraph native["Native Multimodal"]
direction LR
A2[Caller Audio] --> B2[VAD]
B2 --> C2[GPT-4o Realtime\nor Gemini 2.0 Flash\n200-400ms total]
C2 --> D2[Audio Out]
end
subgraph tradeoffs["Trade-offs"]
direction TB
G["Pipeline: Modular, provider-swappable\nFull control over each stage\nHigher complexity, latency floor ~300ms"]
H["Native: Lower latency ceiling\nEmotion-aware, accent-robust\nLess control, vendor lock-in\nHigher cost per minute"]
end
```
The native multimodal path is not strictly better. GPT-4o Realtime API costs approximately $0.06/minute for audio input and $0.24/minute for audio output — significantly more expensive than a well-optimized pipeline using Deepgram for STT ($0.008/minute) and ElevenLabs Sonic-3 for TTS ($0.015-0.03/minute). At scale, the cost difference is material. At 100,000 minutes per month, the native path costs roughly $30,000 versus approximately $8,000-12,000 for a well-optimized pipeline.
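The arithmetic behind those monthly figures, using the per-minute rates quoted above. The native path bills every call minute as both audio input and audio output; the blended pipeline LLM rate is an assumption for illustration:

```python
def native_cost(minutes: float, in_rate: float = 0.06, out_rate: float = 0.24) -> float:
    """GPT-4o Realtime: audio-in and audio-out are both billed for the full call."""
    return minutes * (in_rate + out_rate)

def pipeline_cost(minutes: float, stt: float = 0.008, tts: float = 0.03,
                  llm: float = 0.06) -> float:
    # STT/TTS rates from the text; the per-minute LLM figure is an assumption.
    return minutes * (stt + tts + llm)

minutes = 100_000
print(f"native:   ${native_cost(minutes):,.0f}/month")    # $30,000
print(f"pipeline: ${pipeline_cost(minutes):,.0f}/month")  # within the $8-12K range
```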
The latency picture is also more nuanced than the architecture diagram suggests. Native multimodal has a lower ceiling (no accumulated pipeline latency) but its floor is constrained by the round-trip to OpenAI or Google's servers, which have their own variability. A well-tuned pipeline running Deepgram streaming STT, a locally-proxied LLM inference layer, and ElevenLabs Sonic-3 can reliably achieve 250-350ms end-to-end — competitive with native multimodal and often more stable.
Our recommendation for most production deployments in 2026: start with a pipeline architecture for cost control and provider flexibility. Add native multimodal as a premium tier or for specific high-stakes use cases (medical intake, high-touch sales) where emotional nuance justifies the cost premium.
Platform comparison: ElevenLabs, Vapi, Retell, Bland, Burki
The voice AI platform market has stratified into three layers: full-stack platforms that own the entire audio pipeline, orchestration layers that abstract multi-provider complexity, and specialized infrastructure plays targeting enterprise self-hosting. Understanding which layer you are buying is more important than comparing feature checklists.
```mermaid
flowchart TD
subgraph fullstack["Full-Stack Platforms"]
EL["ElevenLabs\nOwns STT + LLM + TTS\nSonic-3 engine, 90ms TTS\nConversational AI product\nSub-300ms latency SLA"]
end
subgraph orchestration["Orchestration Layers"]
VP["Vapi\n14+ TTS providers\n10+ STT providers\nMultiple LLM support\n550-800ms base latency\nStrong developer tooling"]
RT["Retell\nNo-code visual builder\nMulti-provider support\nSimpler pricing model"]
end
subgraph enterprise["Enterprise / Self-Hosted"]
BL["Bland.ai\nEnterprise self-hosted\nOn-prem deployment\nCustom SLAs"]
BK["Burki\nEmerging alternative\nCompetitive pricing\nDeveloper-focused"]
end
fullstack --> |"Best latency\nHighest capability"| Deploy["Production Deployment"]
orchestration --> |"Best flexibility\nProvider redundancy"| Deploy
enterprise --> |"Data sovereignty\nCompliance-heavy"| Deploy
```
ElevenLabs: the full-stack bet
ElevenLabs raised $500M at an $11B valuation in early 2025, and the strategic logic is clear: they are building a vertically integrated voice AI stack that owns every layer from microphone to speaker. The Sonic-3 engine — released in Q4 2025 — is the centerpiece. It achieves approximately 90ms audio onset latency, which is genuinely groundbreaking. For reference, the previous state of the art for high-quality TTS was 300-500ms.
ElevenLabs' Conversational AI product wraps Sonic-3 into a complete pipeline that includes their own streaming STT, LLM routing (you can point to your own Claude/GPT-4o key), and the full suite of voice cloning and voice library tools. The developer experience is solid: well-documented WebSocket API, SDK support for Python, JavaScript, and Swift, and a sensible webhook model for turn events.
The latency advantage is real and measurable. In benchmarks across 500 calls in our test environment, ElevenLabs Conversational AI averaged 280ms end-to-end latency from end-of-utterance to first audio byte. That compares favorably with the next best performers: Retell at approximately 380ms and Vapi at a 550-650ms baseline.
The limitation is ecosystem lock-in. If ElevenLabs' STT quality on a specific accent is suboptimal, or if you want to swap TTS voices from a competitor, you are fighting the platform. Their pricing is also all-in: you are paying for the full stack whether you need all of it or not.
Best for: High-volume, latency-sensitive use cases where voice quality and response speed are primary competitive advantages. Sales agents, customer service with premium positioning, medical triage.
Vapi: the orchestration layer
Vapi takes the opposite philosophy. Instead of owning the pipeline, Vapi is a sophisticated routing and orchestration layer that connects your chosen STT provider, your chosen LLM, and your chosen TTS provider into a coherent voice agent platform. As of early 2026, Vapi supports 14+ TTS providers (ElevenLabs, Deepgram Aura, Cartesia Sonic, PlayHT, Azure TTS, and more), 10+ STT providers, and can route to any OpenAI-compatible LLM endpoint.
The flexibility is genuine and operationally valuable. You can build cost-optimized configurations (Deepgram Nova-2 for STT + GPT-4o-mini for the LLM + Deepgram Aura-2 for TTS, bringing total cost below $0.05/minute) and high-quality configurations (Deepgram Nova-3 + GPT-4o + ElevenLabs Sonic-3) within the same platform, switchable per call type.
The trade-off is latency. Vapi's orchestration layer adds approximately 150-200ms of overhead per turn — the cost of the routing logic, provider API calls, and inter-service communication that a native stack avoids. In our benchmarks, optimally-configured Vapi averaged 550-650ms end-to-end latency. That is above the 300ms target for conversations that feel truly natural, though it is workable for transactional use cases (appointment booking, simple IVR replacement) where the caller expects some machine-like latency.
Vapi's developer experience is the best in the category. The dashboard is polished, the webhook event model is comprehensive, call transcripts and recordings are cleanly structured, the SDK is well-maintained, and the support team is responsive. If you are building a voice agent product on top of an orchestration layer rather than for your own use, Vapi's platform makes it tractable to expose per-customer voice configuration without building your own provider abstraction.
Best for: Developer teams building voice agent products with customer-configurable voice profiles, or deployments where provider redundancy and cost optimization across different call types matter more than single-digit latency optimization.
Retell: the no-code builder
Retell occupies a middle position: better latency than Vapi (approximately 350-450ms in our tests) and a visual no-code workflow builder that makes it accessible to non-technical operators. If your team includes operations or sales staff who need to configure agent scripts without writing code, Retell's interface is significantly more approachable than Vapi's.
Retell supports multi-provider configurations and has solid telephony integrations out of the box. Their pricing model is simpler than Vapi's layered pricing, which makes cost projection more straightforward.
The limitation relative to Vapi is the orchestration flexibility ceiling. Retell's no-code workflow builder handles linear and branched conversation flows well, but complex stateful workflows — agents that need to look up CRM data mid-call, execute multi-step background tasks, or coordinate with other agents — require dropping into custom code that partially defeats the no-code value proposition.
Best for: Teams that want fast deployment without deep engineering investment, and use cases where operations staff need to own and iterate on agent scripts independently.
Bland.ai: enterprise self-hosted
Bland.ai targets enterprises with compliance requirements that prohibit third-party audio processing. Their self-hosted deployment model lets you run the full voice agent stack within your VPC, with full control over data residency and processing.
The technical capability is solid, and for healthcare, financial services, or government customers where HIPAA, SOC 2, or FedRAMP requirements govern every external API call, self-hosted is not a nice-to-have — it is a procurement requirement. Bland's enterprise pricing and deployment support is designed around this buyer profile.
The limitation is that self-hosting a real-time audio pipeline at scale is operationally complex. The latency you achieve is bounded by your own infrastructure investment rather than a managed service's global CDN, which means getting sub-300ms latency requires serious infrastructure engineering.
Best for: Enterprise accounts with data sovereignty requirements and the operational sophistication to manage self-hosted real-time infrastructure.
Burki: the emerging challenger
Burki is an emerging developer-focused alternative that has been gaining traction in 2025-2026. It positions on developer ergonomics and competitive pricing for high-volume deployments. The platform is earlier-stage than the options above, which means a smaller feature surface and less battle-tested reliability, but meaningfully lower pricing at volume.
Worth watching and testing for cost-sensitive, high-volume deployments where you have the engineering bandwidth to handle occasional rough edges.
Latency: where time goes and how to get it back
Latency in voice AI pipelines is not random. It concentrates predictably in three places: end-of-utterance detection, LLM time-to-first-token, and TTS audio onset. Each has specific optimization levers.
Optimization decision tree
```mermaid
flowchart TD
Start["Measure end-to-end latency\n(p50 and p95)"] --> Check1{p50 > 300ms?}
Check1 -->|No| Check2{p95 > 600ms?}
Check1 -->|Yes| A["Identify bottleneck:\nuse per-stage timing logs"]
A --> B{Where is\nthe time?}
B -->|VAD / utterance detection| C["Tune VAD silence threshold\nReduce end-of-speech gap: 300ms→150ms\nRisk: more interruptions"]
B -->|STT latency| D["Switch to streaming STT\n(Deepgram Nova-3 streaming)\nAvoid batch transcription APIs"]
B -->|LLM first-token| E["Use smaller model for routing\nStream tokens immediately\nCache common response prefixes\nReduce system prompt size"]
B -->|TTS audio onset| F["Use ElevenLabs Sonic-3\nor Cartesia Sonic\nPre-warm TTS connection\nStream at first sentence boundary"]
C --> Verify["Re-measure p50 + p95"]
D --> Verify
E --> Verify
F --> Verify
Check2 -->|No| Done["System is within acceptable range\nMonitor for regression"]
Check2 -->|Yes| G["Investigate tail latency:\nNetwork jitter? Provider timeouts?\nAdd retry + fallback path"]
Verify --> Check1
G --> Done
```
End-of-utterance detection
VAD is the invisible tax on every voice interaction. The simplest VAD implementations wait for a fixed silence duration (typically 500-800ms) before treating the utterance as complete. That works, but it adds 500-800ms of dead air to every single turn — the caller finishes speaking and waits nearly a second for the system to start processing.
Aggressive silence detection (150-200ms) cuts that latency but increases false positives — the agent starts responding while the caller is mid-thought, leading to interruptions. The right threshold depends on call type. For inbound customer service where callers are responding to specific questions, 200ms is workable. For sales calls where callers are mid-sentence and pausing to think, 300-400ms is safer.
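A minimal silence-gap endpointer over 20ms audio frames shows exactly where that threshold enters. The frame size and the speech-probability source are assumptions; in practice the probability comes from a VAD model such as Silero:

```python
FRAME_MS = 20  # typical packetized audio frame duration

class Endpointer:
    """Declares end-of-utterance after `silence_ms` of continuous non-speech."""

    def __init__(self, silence_ms: int = 200):
        self.silence_ms = silence_ms
        self.silent_for = 0
        self.in_speech = False

    def feed(self, speech_prob: float, threshold: float = 0.5) -> bool:
        """Feed one frame's speech probability; returns True when the turn ends."""
        if speech_prob >= threshold:
            self.in_speech = True
            self.silent_for = 0
            return False
        if not self.in_speech:
            return False  # leading silence before the caller starts speaking
        self.silent_for += FRAME_MS
        if self.silent_for >= self.silence_ms:
            self.in_speech = False
            self.silent_for = 0
            return True
        return False

ep = Endpointer(silence_ms=200)
frames = [0.9] * 10 + [0.1] * 15  # 200ms of speech, then 300ms of silence
ends = [ep.feed(p) for p in frames]
print(ends.index(True))  # fires on the 10th silent frame, i.e. a 200ms gap
```

Raising `silence_ms` to 400 would delay the `True` by ten more frames, which is the dead air the caller hears; lowering it to 100 fires on half the gap, which is where false cut-offs come from.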
Two techniques reduce this trade-off:
Predictive end-of-turn detection. Instead of pure silence-based VAD, use a small model that predicts whether the current utterance is likely complete based on semantic content. If the caller has just answered "yes" or "no", the VAD can fire immediately rather than waiting for silence. This reduces apparent latency for short, clear responses without the false-positive problem of aggressive silence thresholds.
Speculative prefill. While the caller is still speaking, the system can begin pre-computing likely responses based on partial transcription. If the caller's in-progress utterance maps to a known intent, the LLM starts generating before the VAD fires. This is the technique behind sub-200ms response times in the most aggressive voice agent implementations. It is complex to implement correctly — you need to handle the case where the prediction was wrong and the caller said something different — but the latency payoff is significant.
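A crude sketch of predictive end-of-turn: short-circuit the silence wait when the partial transcript already looks like a complete short answer. The keyword set here is a toy stand-in for the small prediction model:

```python
COMPLETE_SHORT_ANSWERS = {"yes", "no", "yeah", "nope", "correct", "that's right"}

def endpoint_delay_ms(partial_transcript: str, default_ms: int = 300) -> int:
    """Choose the silence gap to wait before closing the caller's turn.

    If the partial transcript is already a semantically complete short
    answer, fire almost immediately; otherwise keep the safe default.
    """
    text = partial_transcript.strip().lower().rstrip(".!,")
    if text in COMPLETE_SHORT_ANSWERS:
        return 50  # fire nearly immediately
    return default_ms

print(endpoint_delay_ms("Yes."))         # 50
print(endpoint_delay_ms("Well, I was"))  # 300
```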
LLM time-to-first-token optimization
The LLM's time-to-first-token (TTFT) is often the largest single contributor to voice latency, particularly when using large frontier models. GPT-4o averages 200-350ms TTFT under normal load. Claude 3.7 Sonnet is similar. At peak load or with long context windows, TTFT can spike to 800ms+.
Optimizations:
System prompt compression. Every token in the system prompt adds to TTFT. Review your system prompts for redundancy. A 2,000-token system prompt will consistently add 50-100ms versus a 400-token one. Keep voice agent system prompts under 500 tokens — voice conversations are bounded contexts, not complex reasoning tasks.
Model routing by intent. Route simple intents (FAQ responses, yes/no confirmations, appointment bookings) to a fast small model (GPT-4o-mini, Claude Haiku). Reserve the frontier model for complex reasoning turns. Intent routing adds 10-20ms but saves 150-200ms on the majority of turns that do not require frontier model capability. This is one of the most impactful single optimizations available.
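The routing step itself can be a simple lookup once a cheap upstream classifier has labeled the turn; the intent labels here are illustrative, and the model names follow the examples in the text:

```python
FAST_INTENTS = {"faq", "confirmation", "appointment_booking", "smalltalk"}

def pick_model(intent: str) -> str:
    """Route simple intents to the small model, everything else to the frontier model.

    `intent` comes from a fast upstream classifier (the 10-20ms pass
    described above); unknown intents default to the frontier model.
    """
    return "gpt-4o-mini" if intent in FAST_INTENTS else "gpt-4o"

print(pick_model("confirmation"))     # gpt-4o-mini
print(pick_model("billing_dispute"))  # gpt-4o
```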
Streaming from first token. Do not wait for the LLM to complete its response before passing to TTS. Configure your LLM integration to emit tokens as they arrive and pass them to the TTS pipeline at the first sentence boundary. A response of "I can definitely help you reschedule that appointment for tomorrow — let me just pull up your account" has a natural pause after "appointment" where TTS can begin playing audio while the LLM is still generating the second clause. This technique alone can cut perceived latency by 100-200ms.
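The first-sentence handoff can be sketched as a generator that buffers streamed tokens and yields a TTS-ready chunk at each sentence boundary. Boundary detection here is a simple punctuation check; real deployments also guard against abbreviations and numbers:

```python
import re
from typing import Iterable, Iterator

SENTENCE_END = re.compile(r"[.!?]['\")\]]*\s*$")

def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Yield sentence-sized chunks from an LLM token stream for TTS."""
    buf = ""
    for tok in tokens:
        buf += tok
        if SENTENCE_END.search(buf):
            yield buf.strip()
            buf = ""
    if buf.strip():
        yield buf.strip()  # flush any trailing partial sentence

stream = ["I can help", " with that.", " Let me pull", " up your", " account."]
print(list(sentence_chunks(stream)))
# ['I can help with that.', 'Let me pull up your account.']
```

TTS begins synthesizing the first chunk while the LLM is still generating the second, which is where the 100-200ms of perceived latency comes back.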
TTS audio onset
TTS audio onset — the time from receiving text to emitting the first audio byte — varies dramatically across providers. In our benchmarks, ElevenLabs Sonic-3's 90ms audio onset was the clear leader. The gap between Sonic-3 and the next tier is significant enough to justify the price premium for latency-sensitive use cases.
One technique that further reduces perceived TTS latency: pre-warm the TTS connection at the start of each call. TLS handshake and connection establishment with the TTS provider adds 50-150ms on the first request. Establishing the connection during the STT transcription phase eliminates this overhead.
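The overlap is easy to express with concurrent tasks. This sketch fakes both sides with sleeps so the timing is visible; in a real pipeline the stand-ins would be the TTS WebSocket handshake and the streaming STT call:

```python
import asyncio
import time

async def connect_tts():
    """Stand-in for the TTS provider's TLS + WebSocket handshake."""
    await asyncio.sleep(0.10)  # ~100ms of connection setup
    return "tts-connection"

async def transcribe():
    """Stand-in for streaming STT finishing the caller's utterance."""
    await asyncio.sleep(0.12)  # ~120ms of remaining transcription work
    return "transcript"

async def turn_prewarmed():
    # Kick off the TTS handshake while STT is still running, so the first
    # synthesis request finds an already-open connection.
    connect_task = asyncio.create_task(connect_tts())
    text = await transcribe()
    conn = await connect_task  # usually already established by now
    return conn, text

start = time.perf_counter()
conn, text = asyncio.run(turn_prewarmed())
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"{elapsed_ms:.0f}ms")  # ~120ms total instead of ~220ms sequential
```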
Cost modeling: what you will actually pay at scale
Voice AI cost modeling is frequently misleading in sales conversations. Platform pricing pages show per-minute rates that represent the orchestration cost only, not the full stack. A realistic cost model must include STT, LLM inference, TTS, telephony, and platform fees.
Fully-loaded cost breakdown
At 100,000 minutes per month, you are looking at $11,300-$18,100/month in infrastructure costs before any platform markup if you are selling voice agents as a product. The "Vapi costs $0.05/minute" framing obscures that the real cost is 2-3x higher when you account for the full stack.
For comparison, the native GPT-4o Realtime API path is 2.5-3x more expensive than an optimized pipeline for comparable use cases, with the trade-off being lower implementation complexity and better paralinguistic understanding.
Cost optimization strategies
Tiered model routing is the highest-leverage optimization. If 70% of your call turns are factual lookups, confirmations, or simple yes/no exchanges that GPT-4o-mini can handle at $0.012/min LLM cost versus GPT-4o at $0.080/min, that routing decision alone makes the routed turns 85% cheaper and cuts blended LLM cost by roughly 60%.
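The blended-cost arithmetic under those rates, with 70% of turns routed to the small model:

```python
def blended_llm_cost(simple_share: float, small_rate: float = 0.012,
                     frontier_rate: float = 0.080) -> float:
    """Per-minute LLM cost when `simple_share` of turns use the small model."""
    return simple_share * small_rate + (1 - simple_share) * frontier_rate

baseline = blended_llm_cost(0.0)  # every turn on GPT-4o
routed = blended_llm_cost(0.7)    # 70% of turns on GPT-4o-mini
print(f"${routed:.4f}/min vs ${baseline:.4f}/min")
# The routed turns themselves are 85% cheaper; blended cost drops ~60%.
```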
TTS streaming stop — ending TTS synthesis immediately when the caller interrupts — prevents billing for audio tokens that the caller never heard. On calls with frequent interruptions, this can reduce TTS costs by 20-30%.
Local STT for standard English — for high-volume, standard-accent deployments, running Whisper locally on a GPU server ($2-5K one-time cost at relevant volume) can reduce STT costs to near zero at sustained traffic levels. The break-even point is approximately 50,000 minutes/month compared to Deepgram's pay-as-you-go pricing.
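Back-of-envelope payback for that trade, using the rates from the text; the GPU price is a midpoint of the quoted range, and power/ops costs are ignored:

```python
def payback_months(volume_min_per_month: float, deepgram_rate: float = 0.008,
                   gpu_capex: float = 3500.0) -> float:
    """Months until a one-time GPU purchase beats pay-as-you-go STT.

    Assumes the GPU fully absorbs the volume; ignores power and ops costs.
    """
    monthly_savings = volume_min_per_month * deepgram_rate
    return gpu_capex / monthly_savings

print(f"{payback_months(50_000):.1f} months")   # 8.8
print(f"{payback_months(200_000):.1f} months")  # 2.2
```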
Telephony integration: Twilio, SIP trunking, and PSTN
Most voice AI deployments need to interface with real phone numbers — either inbound (customers calling your number) or outbound (agents dialing leads). This requires telephony integration, which adds another layer of complexity on top of the voice AI pipeline.
The dominant path is Twilio Voice API. Twilio provides PSTN connectivity, phone number management, and the Media Streams API that allows you to pipe live call audio over WebSockets directly to your voice AI pipeline. The integration path is well-documented and all major voice AI platforms (Vapi, Retell, ElevenLabs Conversational AI) have native Twilio connectors.
The Twilio Media Streams architecture is worth understanding in detail, because it affects your latency calculations. Twilio encodes call audio as μ-law 8000Hz (standard telephone audio) and streams it in 20ms chunks over WebSockets. Your pipeline receives these 20ms chunks, has to decode them, convert to the sample rate your STT expects (typically 16kHz PCM), and then feed into VAD. This adds approximately 20-40ms of constant latency that you cannot eliminate — it is the physics of packet audio streaming.
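The decode-and-resample step can be sketched in pure Python: G.711 μ-law expansion followed by naive linear-interpolation upsampling to 16kHz. A production pipeline would use an optimized DSP path, but the shape of the work is the same:

```python
import array

BIAS = 0x84  # 132, the G.711 mu-law bias

def mulaw_decode_byte(b: int) -> int:
    """Expand one G.711 mu-law byte to a 16-bit linear PCM sample."""
    b = ~b & 0xFF
    sign = b & 0x80
    exponent = (b >> 4) & 0x07
    mantissa = b & 0x0F
    sample = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -sample if sign else sample

def twilio_chunk_to_pcm16k(chunk: bytes) -> array.array:
    """Decode a 20ms mu-law/8kHz Media Streams chunk and upsample to 16kHz."""
    pcm8k = [mulaw_decode_byte(b) for b in chunk]
    out = array.array("h")
    for i, s in enumerate(pcm8k):
        out.append(s)
        nxt = pcm8k[i + 1] if i + 1 < len(pcm8k) else s
        out.append((s + nxt) // 2)  # midpoint sample doubles the rate
    return out

# A 20ms chunk at 8kHz is 160 bytes; 0xFF is mu-law digital silence.
silence = bytes([0xFF]) * 160
pcm = twilio_chunk_to_pcm16k(silence)
print(len(pcm), pcm[0])  # 320 0
```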
For lower-latency deployments, SIP trunking with direct WebRTC ingest bypasses the PSTN encoding step. Vapi supports direct SIP trunking via Telnyx, Twilio Elastic SIP Trunking, and other SIP providers. Direct WebRTC connections (relevant for browser-based voice agents, not telephone deployments) eliminate the telephony layer entirely and are the path to truly sub-200ms end-to-end latency.
Outbound calling considerations. Outbound voice agents for sales or follow-up workflows require CNAM registration, caller ID management, and compliance with TCPA regulations (in the US) that govern automated outbound calls. Twilio provides the regulatory framework and number management tools. Your platform (Vapi, Retell) handles the campaign scheduling and retry logic. Both components are necessary for a compliant outbound deployment.
Interruptions, turn-taking, and emotional tone detection
The mechanics of natural conversation extend well beyond latency. Two people talking do not take perfectly alternating turns — they interrupt, talk over each other, trail off, pause mid-thought, and signal emotion through tone as much as through words. A voice agent that cannot handle these patterns sounds robotic and creates frustrating experiences.
Interruption handling
The simplest interruption model is barge-in detection: when the STT detects speech from the caller while the agent is speaking, immediately stop TTS playback and process the new utterance. This is now standard in all major platforms.
The challenge is false-positive barge-in — the caller makes an "mmm" or "uh-huh" affirmation sound while the agent is speaking, the VAD fires, and the agent cuts itself off unnecessarily. This is the single most common complaint in voice agent user testing.
Solutions:
Confidence-weighted barge-in. Only interrupt if the incoming audio has a VAD confidence above a threshold AND the utterance is semantically meaningful (not just a back-channel cue). This requires either a separate classification model for back-channel detection or a configuration parameter that some platforms (Vapi, ElevenLabs) expose.
Graceful interruption recovery. When the agent is interrupted mid-sentence, have it explicitly acknowledge the interruption: "Sure, let me stop there — what were you saying?" rather than abruptly starting fresh. This is a scripting pattern, not a platform feature.
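The confidence-weighted approach can be sketched as a gate over VAD events. The threshold and the back-channel vocabulary are illustrative; some platforms expose this as configuration rather than code:

```python
BACKCHANNELS = {"mm", "mmm", "mhm", "uh-huh", "yeah", "ok", "okay", "right"}

def should_barge_in(vad_confidence: float, partial_transcript: str,
                    min_confidence: float = 0.8) -> bool:
    """Interrupt agent playback only for confident, meaningful caller speech."""
    if vad_confidence < min_confidence:
        return False
    words = partial_transcript.strip().lower().rstrip(".!,").split()
    # Treat short utterances made only of back-channel cues as "keep talking".
    if words and len(words) <= 2 and all(w in BACKCHANNELS for w in words):
        return False
    return bool(words)

print(should_barge_in(0.95, "uh-huh"))            # False: back-channel cue
print(should_barge_in(0.95, "wait, stop there"))  # True: real interruption
```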
Turn-taking signals
In human conversation, turn-taking is signaled prosodically — through pitch drops at the end of statements, rhythm changes before yielding the floor, and pace changes when holding the floor. Current voice AI systems primarily rely on silence detection for turn-taking, which is a significant downgrade from natural conversation.
The better platforms are beginning to incorporate prosodic analysis into VAD. ElevenLabs' Conversational AI processes pitch and energy signals alongside silence duration to better predict turn completion. This is not yet standard across the industry, but it is the trajectory.
Emotional tone detection
Understanding caller emotional state — frustration, confusion, urgency, satisfaction — allows voice agents to adapt in real time. This matters most in customer service and sales contexts where emotional intelligence is part of the value proposition.
Native multimodal models (GPT-4o Realtime, Gemini 2.0 Flash) have an inherent advantage here because they process raw audio features rather than transcribed text. A frustrated caller with clipped, raised-pitch responses is detectable from audio in ways that the transcript "Can you please just fix this?" does not capture alone.
For pipeline architectures, emotional tone can be approximated through secondary audio analysis models running in parallel with STT — models that classify sentiment and urgency from acoustic features without transcribing content. This adds cost and complexity but is viable for high-value use cases.
The practical application: when frustration is detected above a threshold, the agent should immediately acknowledge the emotional content ("I can hear this has been frustrating — let me make sure I fix this properly right now"), de-escalate the pace of the conversation, and optionally flag the call for human review or escalation. This is the kind of EQ that separates voice agents that customers appreciate from ones they resent.
Use cases: where voice agents generate real ROI
Not all voice agent use cases are created equal. The ROI is highest where the alternative is expensive human labor, the conversation is structured enough that automation is reliable, and the volume is high enough that per-call cost savings compound. Here are the highest-performing categories in 2026.
Inbound customer service
The most mature category. Voice agents handling tier-1 inbound calls — account inquiries, billing questions, appointment status, order tracking — can deflect 60-70% of call volume from human agents. At $4-8/call for human agent handling versus $0.15-0.40/call for AI handling, the economics are compelling even at relatively low deflection rates.
The key architectural requirement for customer service is CRM integration — the agent needs to look up caller records in real time and reference them naturally in conversation. This requires a reliable tool-call architecture where the LLM can invoke your CRM API mid-conversation without the caller experiencing a perceptible pause. Most platforms support this via function calling with streaming, where the lookup happens during a brief filler phrase ("Let me just pull that up for you...") that buys 500-1,000ms of processing time.
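The filler-phrase pattern amounts to running the lookup and the hold phrase concurrently. The `speak` and `crm_lookup` callables here are placeholders for your TTS path and CRM client, with sleeps standing in for their latencies:

```python
import asyncio

async def speak(text: str) -> None:
    """Placeholder for streaming `text` to the caller via TTS."""
    await asyncio.sleep(0.6)  # rough duration of the spoken filler phrase

async def crm_lookup(phone: str) -> dict:
    """Placeholder for a CRM API call keyed on the caller's number."""
    await asyncio.sleep(0.8)
    return {"name": "Jordan", "next_appt": "Tuesday 3pm"}

async def answer_with_lookup(phone: str) -> dict:
    # Start the lookup and the filler phrase together: ~600ms of speech
    # masks most of the ~800ms API latency, so the caller hears almost no gap.
    record, _ = await asyncio.gather(
        crm_lookup(phone),
        speak("Let me just pull that up for you..."),
    )
    return record

record = asyncio.run(answer_with_lookup("+15551234567"))
print(record["next_appt"])  # Tuesday 3pm
```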
Outbound sales SDR
AI SDR agents calling cold leads have become a significant deployment pattern for mid-market sales teams. The value proposition is scale: an AI SDR agent can work a list of 10,000 leads with consistent messaging and immediate follow-up, something a human SDR team cannot execute at that volume. The opportunity for AI agent startups in this category specifically is substantial — every B2B company that currently employs SDRs is a potential customer.
The quality bar is higher than inbound, because the caller has not opted in and the first 15 seconds of the call determine whether they stay on the line. This is where voice quality, natural turn-taking, and genuine response intelligence differentiate strong deployments from ones that produce instant hang-ups.
Healthcare appointment management
Healthcare is an exceptionally high-value voice agent category because the alternative (front desk staff managing appointment scheduling, reminders, and rescheduling across hundreds of patients) is expensive and prone to error. A voice agent that can handle appointment confirmation, cancellation, and rescheduling conversations reduces front desk burden by 40-60% in typical deployments.
HIPAA compliance is a hard requirement — all call recordings, transcripts, and associated data must be stored with BAA coverage. ElevenLabs offers HIPAA-compliant tiers. Bland.ai's self-hosted option is the compliance path of choice for larger healthcare organizations.
Collections and payment reminders
Voice agents for collections and payment reminders generate measurable, auditable ROI: either the payment is collected or it is not. Compliance requirements (FDCPA in the US) are strict, so this is not a "ship fast and iterate" category, but for organizations with the compliance infrastructure in place, the economics are excellent. Human collectors cost $30-60/hour; AI agents cost $0.15-0.40 per call regardless of outcome.
Real estate lead qualification
High inbound lead volume, highly structured initial qualification conversations, and a per-qualified-lead value measured in hundreds to thousands of dollars makes real estate an ideal voice agent category. The agent asks the 8-10 standard qualification questions (timeline, budget, location preferences, pre-approval status), scores the lead, and routes qualified leads to human agents immediately. The qualification conversation is consistent across every lead — no human variability, no motivational slumps, no coverage gaps at 2am.
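The scoring step is typically a small weighted rubric over the answers; the weights and the 70-point routing threshold below are illustrative assumptions, not an industry standard:

```python
def score_lead(answers: dict) -> int:
    """Score a qualification call's answers on a 0-100 scale (illustrative weights)."""
    score = 0
    if answers.get("timeline_months", 99) <= 3:
        score += 30  # buying soon
    if answers.get("pre_approved"):
        score += 30  # financing already in place
    if answers.get("budget_usd", 0) >= 300_000:
        score += 20
    if answers.get("location_match"):
        score += 20
    return score

lead = {"timeline_months": 2, "pre_approved": True,
        "budget_usd": 450_000, "location_match": True}
s = score_lead(lead)
print(s, "route to human agent" if s >= 70 else "nurture queue")
```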
For more context on the broader agent opportunity across categories, AI agents replacing SaaS workflows covers the full landscape.
Building a custom modular stack
For teams with specific requirements that no managed platform fully meets — specialized domain vocabulary, custom latency requirements, particular compliance posture, or desire to avoid vendor lock-in — building a custom modular stack is viable in 2026 in a way it was not in 2024.
The component selection for a modern custom stack:
Audio capture and streaming. WebRTC for browser-based agents; Twilio Media Streams or direct SIP for telephony. WebRTC gives you the lowest latency for browser contexts; the PSTN path is constrained by telephone codec characteristics but is necessary for calling real phone numbers.
VAD. Silero VAD is the open-source standard, available as a PyTorch model with ~8ms inference time on CPU. For production, run it on the same server as your STT service to eliminate network round-trips.
STT. Deepgram Nova-3 streaming API is the current benchmark for accuracy, speed, and specialized vocabulary support. Deepgram supports custom vocabulary models and specialized acoustic models for noisy environments, medical terminology, and other domain-specific applications. For in-VPC deployment, Whisper (OpenAI) and AssemblyAI also offer self-hostable options.
LLM. Use the model and hosting configuration appropriate for your latency and cost requirements. For most voice applications, GPT-4o or Claude 3.7 Sonnet with streaming token output is the starting point. Maintain a fallback to a smaller, faster model (GPT-4o-mini) for simple intent classification.
TTS. ElevenLabs Sonic-3 via their streaming WebSocket API is the clear performance choice. Cartesia Sonic is a strong second option with competitive latency and a developer-friendly API. For high-volume cost optimization, Deepgram Aura-2 provides acceptable quality at significantly lower cost.
Orchestration. For custom stacks, you are writing your own orchestration logic. The key patterns: streaming pipeline (pipe tokens from LLM to TTS as they arrive), sentence boundary detection for clean TTS chunk splitting, interrupt handling loop that checks for incoming VAD events and cancels in-progress TTS playback, and turn state machine that tracks whether the agent or caller currently has the floor.
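The sentence-boundary pattern above can be sketched in a few lines. This is a minimal version assuming a naive punctuation rule; production code needs smarter boundary detection (abbreviations, decimals, ellipses):

```python
# Minimal sketch: accumulate streamed LLM tokens and emit TTS-ready chunks
# at sentence boundaries, so TTS can start speaking before the full LLM
# response exists. The boundary rule here is deliberately naive.

SENTENCE_END = (".", "!", "?")

def sentence_chunks(token_stream, min_chars=12):
    """Yield sentence-sized text chunks from a stream of tokens."""
    buf = ""
    for token in token_stream:
        buf += token
        chunk = buf.strip()
        if chunk.endswith(SENTENCE_END) and len(chunk) >= min_chars:
            yield chunk  # hand this chunk to the TTS stream immediately
            buf = ""
    tail = buf.strip()
    if tail:  # flush any trailing partial sentence at stream end
        yield tail
```

The `min_chars` floor avoids sending fragments like "Dr." or "1." to TTS as complete sentences, which is one of the more common sources of choppy-sounding agents.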
Telephony. Twilio Voice API for PSTN connectivity and number management. Twilio's Media Streams WebSocket API delivers the raw audio stream to your pipeline; your pipeline returns audio via the same WebSocket. Alternatively, build on Telnyx, Vonage, or Bandwidth for cheaper per-minute PSTN rates at high volume.
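Concretely, Media Streams frames audio as JSON messages over the WebSocket with base64-encoded 8 kHz mu-law payloads, and you return audio in the same shape. A sketch under that assumption; field names follow Twilio's documented media message format, and the `streamSid` value in the test is a placeholder:

```python
# Sketch of Twilio Media Streams framing: inbound audio arrives as JSON
# "media" events carrying base64-encoded 8 kHz mu-law payloads, and outbound
# audio is sent back over the same WebSocket in the same shape. Field names
# follow Twilio's documented media message format.
import base64
import json

def decode_media_frame(message: str):
    """Return (stream_sid, raw_mulaw_bytes) for a 'media' event, else None."""
    msg = json.loads(message)
    if msg.get("event") != "media":
        return None  # ignore 'start', 'stop', 'mark', and other events
    return msg["streamSid"], base64.b64decode(msg["media"]["payload"])

def encode_media_frame(stream_sid: str, mulaw_bytes: bytes) -> str:
    """Build the JSON message that plays agent audio back to the caller."""
    return json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": base64.b64encode(mulaw_bytes).decode("ascii")},
    })
```

Your pipeline sits between these two functions: decoded mu-law frames feed the VAD/STT side, and TTS output (transcoded to 8 kHz mu-law) goes back out through `encode_media_frame`.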
The total engineering investment for a custom stack is approximately 4-8 weeks for a senior engineer familiar with real-time audio and LLM API integration. The payoff is complete provider flexibility, no platform overhead, and full observability into every component of the pipeline. For teams that have hit the walls of managed platforms, this is the right path.
The MCP integration patterns for SaaS article covers how to expose your voice agent's backend capabilities as MCP tools, which is increasingly relevant as voice agents become nodes in larger multi-agent architectures.
FAQ
What is realistic latency for a production voice AI agent in 2026?
With ElevenLabs Conversational AI or a well-tuned custom pipeline using Deepgram streaming STT + GPT-4o streaming + ElevenLabs Sonic-3, you can reliably achieve 250-350ms end-to-end latency (end of caller utterance to first audio byte from agent). Vapi with optimized provider configuration will typically be 550-700ms. OpenAI Realtime API is 300-500ms but with higher per-minute cost. The 250-350ms range is below the threshold where human callers consciously notice latency, making it viable for natural conversational experiences.
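As a planning aid, the 250-350ms figure decomposes into per-stage budgets. The numbers below are illustrative assumptions, not measurements; only the ~90ms TTS first-byte figure corresponds to the published Sonic-3 number cited earlier:

```python
# Illustrative latency budget for a tuned STT -> LLM -> TTS pipeline.
# Per-stage figures are planning assumptions, not measurements; only the
# ~90 ms TTS first-byte figure matches the published Sonic-3 number.
budget_ms = {
    "vad_endpoint_detection": 60,    # silence hangover before end-of-turn
    "stt_final_transcript": 50,      # streaming STT finalization
    "llm_time_to_first_token": 120,  # streaming LLM TTFT
    "tts_first_audio_byte": 90,      # ElevenLabs Sonic-3 first byte
}
total = sum(budget_ms.values())
print(total)  # prints 320 -- inside the 250-350 ms target
```

Writing the budget down this way makes it obvious where optimization effort pays off: the LLM's time-to-first-token is usually the largest and most variable line item.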
How does voice AI compare to traditional IVR systems?
Traditional IVR (interactive voice response) is menu-based: press 1 for billing, press 2 for support. Voice AI agents understand natural language and handle open-ended conversations. Deflection rates for voice AI on tier-1 customer service range from 60-75% versus 20-35% for traditional IVR, because callers can express their actual need rather than navigating a menu hierarchy. Voice AI also dramatically reduces the "zero out" rate — callers pressing 0 to bypass the automated system and reach a human immediately — because natural language interaction is less frustrating.
Can voice agents handle multiple languages?
Yes. Deepgram's multilingual STT supports 35+ languages. ElevenLabs, Cartesia, and Deepgram Aura all support multilingual TTS. The LLM layer (GPT-4o, Claude) handles multilingual input natively. For production multilingual deployments, the main consideration is voice quality — TTS providers have better accent quality and naturalness for some languages than others. Test your specific target languages with real native speakers before deploying, not just with objective metrics.
What call volume can a single deployment handle?
Voice AI platforms are horizontally scalable — there is no hard limit at the platform layer. Twilio can handle thousands of concurrent calls per number with proper configuration. The practical scaling constraint is LLM API rate limits (OpenAI and Anthropic impose per-minute token limits that require quota increases for large-scale deployments) and TTS API concurrency limits. Both are manageable with advance planning and enterprise API agreements. At 10,000 concurrent calls, you will want dedicated agreements with your LLM and TTS providers rather than pay-as-you-go tiers.
How should voice agents handle situations they cannot answer?
The standard pattern is graceful escalation: "That's outside what I can help with directly — let me connect you with a specialist." Good voice agents signal uncertainty proactively rather than hallucinating confident-sounding wrong answers. Implement a confidence threshold on LLM responses and route low-confidence turns to either clarifying questions or human escalation rather than generating a potentially incorrect response. For domain-specific accuracy (medical, legal, financial), retrieval-augmented generation (RAG) from your verified knowledge base is the correct approach rather than relying on LLM parametric knowledge.
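The escalation policy can be sketched as a small decision function. How the confidence score is obtained (token logprobs, a grader model, or a self-reported score) is left abstract here, and the thresholds are illustrative assumptions that need tuning:

```python
# Sketch of the answer / clarify / escalate policy described above. The
# confidence source is left abstract (logprobs, grader model, etc.) and
# the thresholds are illustrative assumptions.

def next_action(confidence: float, clarify_attempts: int,
                answer_thresh=0.75, clarify_thresh=0.45, max_clarify=2):
    """Decide whether to answer, ask a clarifying question, or escalate."""
    if confidence >= answer_thresh:
        return "answer"
    if confidence >= clarify_thresh and clarify_attempts < max_clarify:
        return "clarify"  # ask a follow-up question instead of guessing
    return "escalate"     # hand the call to a human specialist
```

The `max_clarify` cap matters on voice specifically: more than two clarifying questions in a row reads as the agent being stuck, at which point escalation is the better caller experience.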
What is the difference between ElevenLabs' Conversational AI product and just using their TTS API?
ElevenLabs TTS API is a text-in, audio-out endpoint — it converts text to speech and nothing more. ElevenLabs Conversational AI is a full-pipeline product that includes their STT, turn management, LLM routing, interruption handling, and Sonic-3 TTS in a single WebSocket-based product with SDK support for browser and telephony integration. For building voice agents, you want the Conversational AI product, not just the TTS API. The TTS API is appropriate for use cases like narration, video dubbing, or when you are building your own pipeline and want Sonic-3 as the TTS component.
How do I handle call recording and compliance?
All major platforms (Vapi, Retell, ElevenLabs) support call recording with configurable retention policies. For regulated industries, you need: call recording consent notification at call start (legally required in most jurisdictions), BAA agreements with any platform processing health information, SOC 2 compliance verification for your platform of choice, and data residency controls if operating in jurisdictions with data localization requirements (EU, certain APAC markets). Build the consent notification into your initial agent greeting — "This call may be recorded for quality purposes" — and store all recordings in encrypted storage with appropriate access controls.
What are the most common failure modes in production voice AI deployments?
Based on post-deployment analysis across multiple production systems, the most common failure modes are: (1) STT misrecognition on proper nouns, numbers, and domain-specific terminology — solved by custom vocabulary models; (2) agent speaking over caller due to aggressive VAD settings — solved by tuning the silence threshold for your specific call type; (3) LLM time-to-first-token (TTFT) spikes under load causing mid-conversation dead air — solved by model routing with a fast fallback; (4) TTS connection warm-up latency on the first call turn — solved by connection pre-warming; and (5) context loss on long calls where the conversation history exceeds the LLM context window — solved by conversation summarization at turn intervals. None of these are hard problems, but all require deliberate implementation to avoid.
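Failure mode (5), summarization at turn intervals, can be sketched as follows: keep the most recent turns verbatim and fold older ones into a running summary. The `summarize` callable stands in for a cheap LLM call, and the turn counts are illustrative:

```python
# Sketch of context-window management via periodic summarization: keep the
# most recent turns verbatim, fold older turns into a running summary.
# `summarize` stands in for a cheap LLM call; turn counts are illustrative.

def compact_history(summary, turns, summarize, keep_recent=6, max_turns=10):
    """Return (new_summary, trimmed_turns) once the history grows too long."""
    if len(turns) <= max_turns:
        return summary, turns  # still fits comfortably; no work to do
    old, recent = turns[:-keep_recent], turns[-keep_recent:]
    new_summary = summarize(summary, old)  # fold old turns into the summary
    return new_summary, recent
```

Run this check after every turn; because it is a no-op until `max_turns` is exceeded, the summarization cost is amortized across the call rather than paid on every exchange.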
The voice AI agent category is one of the highest-value segments of the broader AI agent startup landscape. Unlike text-based agents that compete with existing software interfaces, voice agents replace a communication modality — the phone call — that has no software equivalent and where the human labor cost is measured in hourly wages at scale. The platforms are mature, the latency targets are achievable, and the cost economics are now compelling across most use cases. The engineering is tractable. What remains is execution.