Here’s a number that should stop you cold: 87% of companies have now deployed voice AI agents. And here’s the number nobody wants to talk about: only 12% of them are actually satisfied with the results.
That’s not from a skeptic’s blog post. That’s from AssemblyAI’s State of Voice Agents report, written by people who have been processing production voice traffic at massive scale. Nearly 9 out of 10 voice agent implementations are falling short of expectations. In an industry that’s been breathlessly hyping conversational AI for years, that is a stunning failure rate.
I’ve been building voice applications for 30 years. I’ve seen this pattern before — the excitement, the early deployments, the quiet disappointment. What’s different now is that the technology is actually good enough to succeed. So why are most implementations failing anyway?
The Gap Is Not a Technology Problem
This is the part most vendors won’t tell you: the voice agent failure crisis is not primarily a technology problem. The underlying models are capable. The speech recognition is better than it’s ever been. The text-to-speech quality is frankly stunning — ElevenLabs, Cartesia, and others have cracked what seemed impossible just three years ago.
The gap is an implementation philosophy problem. Most deployments prioritize technical feasibility over actual outcomes. Teams ask “can we make this work?” instead of “will this accomplish something specific that matters to the business?”
Those are very different questions, and conflating them is where projects go off the rails.
The Counterintuitive Lesson: Smarter Is Not Better
Here’s the insight that took me a long time to fully internalize: giving your voice agent more intelligence is often the wrong move.
The instinct is to make the agent handle more, adapt more, improvise more. Let the LLM do its thing. What could go wrong? The answer, from founders who’ve processed millions of production calls, is: quite a lot.
“Anytime you let an LLM decide what to say, you’re at risk,” as one production voice AI founder put it bluntly. The implementations that actually perform — that hit their task completion rates, that customers don’t hate — are the ones that tightly constrain the conversation. They script the predictable paths and reserve LLM reasoning for the edges.
This probably sounds obvious to anyone who built Alexa skills or IVR systems back in the day. We learned it the hard way with early chatbots too. But every generation of new voice AI developers seems to have to relearn it: constraint enables success. Open-ended AI conversations at scale are an invitation for expensive, unpredictable failures.
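To make "script the predictable paths and reserve LLM reasoning for the edges" concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the intent names, the keyword table, the canned replies, and the `llm_fallback` callable are hypothetical, and a production system would likely use a trained intent classifier rather than keyword matching.

```python
from typing import Callable, Optional, Tuple

# Hypothetical pre-approved replies for the high-volume, predictable paths.
SCRIPTED_REPLIES = {
    "hours": "We are open Monday through Friday, 9 a.m. to 5 p.m.",
    "reschedule": "Sure, I can help you reschedule. What day works best?",
}

# Hypothetical keyword table standing in for a real intent classifier.
INTENT_KEYWORDS = {
    "hours": ("open", "hours", "close"),
    "reschedule": ("reschedule", "move my appointment"),
}


def classify_intent(utterance: str) -> Optional[str]:
    """Return the first scripted intent whose keywords appear, else None."""
    text = utterance.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return intent
    return None


def respond(utterance: str, llm_fallback: Callable[[str], str]) -> Tuple[str, str]:
    """Return (source, reply). Scripted paths never call the LLM."""
    intent = classify_intent(utterance)
    if intent is not None:
        return "scripted", SCRIPTED_REPLIES[intent]
    # Only unrecognized requests reach the (hypothetical) LLM callable.
    return "llm", llm_fallback(utterance)
```

Returning the source alongside the reply makes it easy to log how often calls leave the scripted paths, which is one rough way to see how much of the conversation the LLM is actually deciding.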
What the Production Data Actually Shows
Beyond the philosophy, real deployments surface some hard technical truths that are worth calling out:
Latency is still the silent killer. The current production standard for acceptable voice agent response time is sub-1.6 seconds — from end of user speech to first audio out. Early systems launching with 3.5-second latency were disasters. Users tolerate a lot from a voice agent, but they will not tolerate feeling like they’re talking to something slow. Interestingly, human-to-human phone calls naturally have 1-2 second pauses, which gives voice agents some cover — but only if the latency is predictably within that window, not randomly spiking.
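Because the point is predictability rather than a good average, a latency check is more useful on a high percentile than on the mean. The sketch below is illustrative only: the 1.6-second budget comes from the figure above, while the choice of p95, the nearest-rank percentile method, and the function names are my assumptions.

```python
import math
from typing import Dict, Sequence

LATENCY_BUDGET_S = 1.6  # end of user speech to first audio out, per the figure above


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty list of latency samples (seconds)."""
    if not samples:
        raise ValueError("no latency samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based nearest rank
    return ordered[max(rank, 1) - 1]


def latency_report(samples: Sequence[float]) -> Dict[str, object]:
    """Summarize response latency and flag whether p95 stays under the budget."""
    p50 = percentile(samples, 50)
    p95 = percentile(samples, 95)
    return {"p50": p50, "p95": p95, "within_budget": p95 < LATENCY_BUDGET_S}
```

With samples of 0.9, 1.0, 1.1, 1.2, and 3.5 seconds, the median is a healthy 1.1 seconds, but the single spike pushes p95 to 3.5 seconds and the report fails the budget. That is the "randomly spiking" case a mean or median would hide.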
Redundancy isn’t optional at scale. Every component in your voice stack — transcription, LLM, TTS, telephony — will fail independently at some point. The teams succeeding in production run parallel providers: Twilio plus Telnyx, multiple transcription vendors, pre-cached speech for high-frequency responses. This isn’t engineering perfectionism; it’s table stakes for anything handling serious call volume.
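One simple shape this can take is ordered failover: try the primary provider, and fall through to the next on any error. The sketch below is a simplified assumption, not how any particular team does it. The provider functions are stand-ins rather than real vendor SDK calls, and real stacks may race providers in parallel or add timeouts and health checks instead of trying them strictly in sequence.

```python
from typing import Callable, List, Sequence, Tuple


class AllProvidersFailed(Exception):
    """Raised when every provider in the chain has errored."""


def call_with_failover(
    providers: Sequence[Tuple[str, Callable[[str], str]]], payload: str
) -> Tuple[str, str]:
    """Try each (name, function) pair in order; return (name, result) from the first success."""
    errors: List[str] = []
    for name, provider in providers:
        try:
            return name, provider(payload)
        except Exception as exc:  # broad on purpose: any provider failure triggers failover
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Hypothetical stand-ins for two vendors of the same component.
def flaky_primary(text: str) -> str:
    raise TimeoutError("primary timed out")


def steady_backup(text: str) -> str:
    return text.upper()
```

Calling `call_with_failover([("primary", flaky_primary), ("backup", steady_backup)], "hello")` returns `("backup", "HELLO")`. Returning the provider name makes it possible to track how often the backup is actually carrying traffic.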
Voicemail detection is still broken. I’m going to name this because nobody talks about it enough: voicemail detection across virtually all vendors is terrible. Missed detections, where the agent thinks it’s reached a human when it’s actually talking to a voicemail greeting, happen constantly. This is an unsexy problem that meaningfully impacts outbound call performance, and it remains largely unsolved.
Monitoring matters more than deployment. Clients who get this right care intensely about post-call analytics and QA. They want to understand every conversation, not just the ones that flagged an error. The ability to systematically review what happened and why is what separates improving systems from stagnant ones. Deployment is a starting line, not a finish line.
Vertical Specialization Is Where This Is Going
The generic “voice agent that can handle anything” play is failing. The implementations actually working in production are deeply vertical: healthcare intake, banking collections, insurance claims, staffing outreach. They know their domain, they know their user population, and they’ve optimized relentlessly for a narrow set of high-volume interactions.
This matches where I see AI development heading more broadly. The romanticized vision of general-purpose AI agents that do everything is giving way to practical reality: specialized agents that do one thing extremely well are where the value is. Voice is just arriving at this lesson a bit earlier because the stakes of a bad conversation are so immediately apparent — the customer hangs up.
Samsung just shipped their callable agent architecture to 300 million devices — Bixby going fully LLM-powered, callable from other agents as part of an agentic chain. That’s a significant signal. It’s not a “voice assistant gets smarter” story; it’s voice becoming an action layer inside a larger agentic system. The conversational interface isn’t the end product anymore — it’s the input mechanism for agents that actually do things.
The Metric That Actually Matters
One of the most striking things I’ve encountered from people running millions of production calls: the best metric they found was whether the customer thanked the agent at the end of the call.
Not task completion rate (important, but gameable). Not conversation length (a proxy, not a signal). Not sentiment scores. Whether a human, at the end of an automated phone call, spontaneously said “thank you.”
That’s when you’ve nailed it. That’s when the experience was human enough, helpful enough, and smooth enough that the person’s instinct was to be polite.
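If you log transcripts, a rough version of this metric is cheap to compute. The sketch below is only an illustration under my own assumptions: the `(speaker, text)` transcript shape, the phrase list, and the "last two customer turns" window are all hypothetical, and a plain pattern match will miscount phrases like "no thanks".

```python
import re
from typing import List, Sequence, Tuple

Turn = Tuple[str, str]  # (speaker, text); speaker is "agent" or "customer"

# Hypothetical phrase list; tune it against real transcripts.
THANKS_PATTERN = re.compile(r"\b(thank you|thanks|appreciate it)\b", re.IGNORECASE)


def customer_thanked(turns: Sequence[Turn], closing_turns: int = 2) -> bool:
    """True if any of the customer's last `closing_turns` turns contains a thank-you."""
    customer_texts: List[str] = [text for speaker, text in turns if speaker == "customer"]
    return any(THANKS_PATTERN.search(text) for text in customer_texts[-closing_turns:])


def thank_rate(calls: Sequence[Sequence[Turn]]) -> float:
    """Fraction of calls that ended with the customer thanking the agent."""
    if not calls:
        return 0.0
    return sum(customer_thanked(call) for call in calls) / len(calls)
```

Restricting the check to the closing turns is deliberate: a "thanks" at the start of a call is a greeting habit, while one at the end says something about how the call went.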
We’re not there yet at scale. But we’re closer than the 12% satisfaction figure suggests. The technology is capable. The lessons from production are now widely available. What’s needed is the discipline to build constrained, outcome-focused, monitored systems instead of chasing the demo.
Thirty years of building voice applications taught me that the difference between a good voice experience and a bad one almost never comes down to the technology. It comes down to the choices made by the people building it.
The 88% who are unsatisfied made different choices. The good news: those choices are fixable.
Pete Haas is CTO at WellSaid AI and has been building voice applications and conversational AI systems for over 30 years. He writes weekly on Voice AI at conversationcurve.com.