For most of my career, building voice applications has meant accepting a fundamental constraint: one side talks, the other listens. Not because that’s how human conversation works — it’s not — but because it’s how the technology worked. Sequential pipelines. Silence thresholds. Awkward pauses. The walkie-talkie model, dressed up in machine learning clothing.
That era is ending. And if you build voice applications for a living, you should be paying close attention.
The Pipeline We’ve All Been Living With
If you’ve spent any time building voice apps — IVRs, voice assistants, conversational agents — you know this architecture by heart:
- User speaks
- ASR transcribes the audio to text
- LLM processes the text and generates a response
- TTS converts the response back to audio
- System speaks
- Repeat
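The loop above can be sketched in a few lines. This is a minimal illustration, not any vendor's API: `fake_asr`, `fake_llm`, and `fake_tts` are hypothetical stand-ins for real services, and the "audio" is just encoded text so the example stays runnable.

```python
from typing import Callable, List

# Hypothetical stand-ins for real ASR, LLM, and TTS services.
def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8")

def fake_llm(text: str) -> str:
    return f"You said: {text}"

def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")

def half_duplex_turn(
    audio_in: bytes,
    asr: Callable[[bytes], str] = fake_asr,
    llm: Callable[[str], str] = fake_llm,
    tts: Callable[[str], bytes] = fake_tts,
) -> bytes:
    """One full turn: each stage blocks until the previous one finishes."""
    transcript = asr(audio_in)
    reply = llm(transcript)
    return tts(reply)

def run_conversation(user_turns: List[bytes]) -> List[bytes]:
    # The system is not listening while it is inside half_duplex_turn.
    return [half_duplex_turn(turn) for turn in user_turns]
```

Nothing in this structure has a place for input that arrives while a turn is being processed, which is exactly the limitation.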
It’s clean. It’s sequential. And it has one massive problem: it’s nothing like how humans actually talk to each other.
Real conversation is chaotic and overlapping. We interrupt. We trail off and invite a response mid-sentence. We say “actually, wait” and change direction. We confirm understanding with brief backchannels — “uh-huh,” “right,” “got it” — without surrendering the floor. The half-duplex pipeline can’t handle any of that gracefully. So we’ve spent thirty years papering over it with silence detection, barge-in hacks, and increasingly clever prompt engineering.
It’s been good enough. But good enough and natural are very different things.
What Full-Duplex Actually Means
Full-duplex voice AI sounds simple: both parties can talk simultaneously, just like a phone call. But the implementation challenges are genuinely hard, which is why we haven’t had production-scale systems until now.
Three problems have to be solved at the same time:
Continuous audio processing. The model has to process incoming audio while it’s generating output. There’s no recording window, no “your turn” signal. It’s a continuous stream in both directions simultaneously.
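To make the control-flow shape concrete, here is a toy sketch using Python's `asyncio`. It only shows that listening and speaking run concurrently instead of taking turns. The frame lists and the `listen`/`speak` coroutines are assumptions for illustration; a real system would be reading from and writing to audio devices or network streams.

```python
import asyncio
from typing import List

async def listen(mic_frames: List[str], log: List[str]) -> None:
    """Consume incoming frames, yielding to the event loop after each one."""
    for frame in mic_frames:
        log.append(f"heard:{frame}")
        await asyncio.sleep(0)  # stand-in for awaiting the next audio frame

async def speak(out_frames: List[str], log: List[str]) -> None:
    """Emit outgoing frames, yielding to the event loop after each one."""
    for frame in out_frames:
        log.append(f"spoke:{frame}")
        await asyncio.sleep(0)  # stand-in for awaiting the playback buffer

async def full_duplex_session(mic_frames: List[str], out_frames: List[str]) -> List[str]:
    log: List[str] = []
    # Both directions are live at once; neither waits for the other to finish.
    await asyncio.gather(listen(mic_frames, log), speak(out_frames, log))
    return log

def run_session(mic_frames: List[str], out_frames: List[str]) -> List[str]:
    return asyncio.run(full_duplex_session(mic_frames, out_frames))
```

Calling `run_session(["u1", "u2"], ["a1", "a2"])` produces a log in which heard and spoke entries are interleaved rather than grouped by speaker.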
Acoustic echo cancellation. When the AI speaks through a speaker, that sound enters the microphone. The system has to cancel its own voice from the input signal in real time — otherwise it hears itself and spirals into confusion. This is as much a signal processing problem as an AI problem.
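For a feel of the signal processing side, here is a minimal sketch of one classic approach, a normalized least-mean-squares (NLMS) adaptive filter. I'm not claiming this is what any particular full-duplex product uses. The tap count, step size, and the single-delay echo path in `simulate_echo` are all assumptions chosen to keep the example small.

```python
import random
from typing import List

def nlms_echo_cancel(
    far_end: List[float],
    mic: List[float],
    taps: int = 8,
    mu: float = 0.5,
    eps: float = 1e-8,
) -> List[float]:
    """Return the mic signal with an adaptive estimate of the far-end echo removed."""
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent far-end samples, newest first
    residual = []
    for x, d in zip(far_end, mic):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        error = d - echo_estimate  # what is left after removing the estimated echo
        norm = sum(h * h for h in history) + eps
        weights = [w + mu * error * h / norm for w, h in zip(weights, history)]
        residual.append(error)
    return residual

def simulate_echo(far_end: List[float], delay: int = 3, gain: float = 0.6) -> List[float]:
    """Hypothetical echo path: the speaker output reaches the mic delayed and attenuated."""
    return [gain * far_end[i - delay] if i >= delay else 0.0 for i in range(len(far_end))]
```

This toy case has no near-end speech in the microphone signal. Handling the moment when the user and the agent talk at once (double-talk) is where production echo cancellers get complicated.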
Turn-taking intelligence. This is the hard one. In half-duplex systems, end-of-turn detection is solved with silence thresholds: if the user stops talking for 500ms, assume they’re done. It works okay, but it’s brittle. It mistakes thinking pauses for turn endings. It cuts people off mid-sentence. It has no concept of conversational state.
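Here is roughly what that silence-threshold logic looks like, as a sketch. The 500ms threshold comes from the description above; the 20ms frame size, the energy floor, and the function name are assumptions for illustration.

```python
from typing import List, Optional

def end_of_turn_index(
    frame_energies: List[float],
    frame_ms: int = 20,
    silence_ms: int = 500,
    energy_floor: float = 0.01,
) -> Optional[int]:
    """Return the frame index at which the turn is declared over, or None."""
    needed = silence_ms // frame_ms  # consecutive quiet frames required
    quiet_run = 0
    heard_speech = False
    for i, energy in enumerate(frame_energies):
        if energy >= energy_floor:
            heard_speech = True
            quiet_run = 0
        elif heard_speech:
            quiet_run += 1
            if quiet_run >= needed:
                return i
    return None
```

Feed it ten frames of speech, a 520ms thinking pause, and ten more frames of speech, and it declares the turn over during the pause. The rest of the sentence arrives after the system has already decided to respond.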
Full-duplex turn-taking requires the model to understand why there’s silence, not just that there is silence. Is this a cognitive pause? A trailing invitation to respond? Background noise? An interruption attempt? Getting this right requires combining acoustic signals with semantic understanding of the conversation — knowing that “so, the thing is…” is probably not an endpoint, while “and that’s really all I have to say” probably is.
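A deliberately crude sketch of the idea: let what was just said change how much silence you require before treating the turn as finished. Real systems would use learned models over audio and text, not keyword lists. The cue phrases and the millisecond values below are assumptions made up for illustration.

```python
# Hypothetical cue lists; a production system would learn these signals.
HOLD_CUES = ("the thing is", "so", "and", "but", "um", "because")
CLOSE_CUES = ("all i have to say", "that's it", "thanks")

def _normalize(text: str) -> str:
    # Lowercase, straighten curly apostrophes, drop trailing punctuation.
    return text.lower().replace("\u2019", "'").rstrip(" .\u2026!?,")

def _ends_with_phrase(text: str, cues) -> bool:
    # Match whole trailing words so "also" does not count as ending in "so".
    return any(text == cue or text.endswith(" " + cue) for cue in cues)

def required_silence_ms(transcript: str) -> int:
    """How long a pause must last before the turn is treated as finished."""
    text = _normalize(transcript)
    if _ends_with_phrase(text, CLOSE_CUES):
        return 200   # speaker sounds finished: respond quickly
    if _ends_with_phrase(text, HOLD_CUES):
        return 1500  # speaker sounds mid-thought: wait longer
    return 500       # no strong cue: fall back to a plain threshold

def likely_end_of_turn(transcript: str, silence_ms: int) -> bool:
    return silence_ms >= required_silence_ms(transcript)
```

With these assumed values, a 600ms pause after "so, the thing is…" is not treated as an endpoint, while a 300ms pause after "and that's really all I have to say" is.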
The Latency Story Has Changed Too
Separate from the duplex problem, the raw latency numbers in 2026 are remarkable compared to even two years ago.
Time-to-first-audio (TTFA) in production voice systems used to mean 1-2 seconds in the best case. That’s long enough for users to feel the system “thinking.” It creates that characteristic robotic rhythm — statement, pause, response — that marks a conversation as AI-generated even when the voice sounds human.
Models shipping now are hitting 75-100ms TTFA. Sub-800ms end-to-end. That’s in the range of normal human response latency. When latency drops below perception thresholds, the whole interaction dynamic changes. The “AI pause” disappears. The conversation starts to feel ambient rather than transactional.
I’ve been building voice applications for thirty years, and I can tell you: latency is the variable that matters more than almost anything else for perceived naturalness. People forgive imperfect transcription. They forgive slightly odd phrasing. They do not forgive feeling like they’re talking to something slow. Sub-100ms TTFA is a bigger deal than it sounds on paper.
What This Breaks (Intentionally)
Here’s the thing about full-duplex that doesn’t get talked about enough: it doesn’t just improve existing voice apps. It makes some of the assumptions baked into existing voice app design actively wrong.
Consider turn design. Traditional voice UX is built around clear turn boundaries. Scripts are designed with explicit hand-off signals. Prompts are written to elicit complete, bounded responses before passing control. A lot of voice app design work is essentially about managing the awkwardness of half-duplex interaction.
Full-duplex makes those constraints evaporate — but it also means your existing interaction patterns may feel wrong. A voice agent designed for half-duplex will feel stilted and over-structured in a full-duplex environment. The scaffolding you built to compensate for technology limitations becomes visible when the limitations go away.
This is a rewrite problem, not an upgrade problem. And it’s an opportunity.
What Builders Should Be Doing Right Now
I’m not suggesting you tear up your production voice stack this week. But here’s where I’d be focusing attention:
Audit your turn-taking assumptions. Go through your conversation flows and identify every place you’ve engineered around half-duplex limitations. Explicit hand-off prompts. Silence threshold tuning. Barge-in logic. These are the seams that will show when you migrate to full-duplex systems.
Start experimenting with full-duplex in low-stakes contexts. Internal tools, demos, prototypes. The interaction design patterns for full-duplex are genuinely different, and you don’t want to learn them in production.
Think carefully about interruption handling. Full-duplex means users can interrupt. That’s a feature, not a bug — but your agent needs to handle it gracefully. What does it mean for your system when a user cuts the agent off mid-sentence? Do you start over? Incorporate the new input? This needs to be designed, not just inherited from the model defaults.
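One possible policy, sketched under my own assumptions: when the user cuts in, record only the part of the agent's reply that was actually played, mark it as interrupted, and then append the user's new input. The next response is then generated from what the user really heard. The `Conversation` class and its method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Conversation:
    # (speaker, text) pairs in the order they were actually heard
    history: List[Tuple[str, str]] = field(default_factory=list)

    def agent_spoke(self, planned_words: List[str], words_played: int) -> None:
        """Record only what the user heard before any interruption."""
        text = " ".join(planned_words[:words_played])
        if words_played < len(planned_words):
            text += " [interrupted]"
        self.history.append(("agent", text.strip()))

    def user_said(self, text: str) -> None:
        self.history.append(("user", text))
```

For example, if the agent planned "Your flight leaves at nine from gate twelve" and the user cut in after four words, the history holds "Your flight leaves at [interrupted]" followed by the user's correction. Whether you then restart, resume, or replan is the design decision; this only makes sure the decision is based on accurate state.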
Revisit your latency SLAs. If your architecture adds significant overhead on top of model latency — network hops, preprocessing, logging pipelines — now is the time to optimize. The model giving you 75ms TTFA doesn’t help if your infrastructure is adding 800ms before the audio hits the user.
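The arithmetic is worth writing down. This sketch uses the 75ms and 800ms figures from the paragraph above; how the 800ms splits across network, preprocessing, and logging is a made-up breakdown for illustration, and it assumes each stage sits on the critical path.

```python
from typing import Dict

def time_to_first_audio_ms(model_ttfa_ms: float, overheads_ms: Dict[str, float]) -> float:
    """User-perceived TTFA when every overhead is on the critical path."""
    return model_ttfa_ms + sum(overheads_ms.values())

def model_share(model_ttfa_ms: float, overheads_ms: Dict[str, float]) -> float:
    """Fraction of perceived latency that the model itself accounts for."""
    return model_ttfa_ms / time_to_first_audio_ms(model_ttfa_ms, overheads_ms)

# Hypothetical breakdown of 800ms of infrastructure overhead.
overheads = {"network_hops": 300.0, "preprocessing": 250.0, "logging": 250.0}

total = time_to_first_audio_ms(75.0, overheads)  # 875.0 ms
share = model_share(75.0, overheads)             # under 10% of the total
```

In this scenario the model accounts for less than a tenth of what the user waits for, so shaving the model's latency further barely moves the number that matters.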
The Bigger Picture
I’ve watched voice technology go through several inflection points. The move from DTMF to speech recognition. The move from grammar-based ASR to statistical models. The move from scripted IVRs to LLM-driven agents. Each shift felt significant at the time, but in retrospect each was an incremental improvement on the same fundamental architecture.
Full-duplex feels different to me. It’s not an improvement on half-duplex interaction — it’s the elimination of a constraint that has shaped how we design voice systems at a foundational level. When voice AI can listen and speak simultaneously, the whole design space opens up. Voice assistants that feel like background presences rather than query interfaces. Ambient agents that can jump in during a pause rather than waiting to be explicitly addressed. Real-time coaching that interrupts gracefully rather than waiting for sentence boundaries.
We’ve been building for the technology we had. The technology just changed. Time to catch up.
Pete Haas is CTO at WellSaid AI and has been building voice applications for thirty years. He writes about voice AI, conversational systems, and the intersection of AI and human communication at Conversation Curve.