The 800-Millisecond Conversation
When a visitor speaks into a Talking Widget on a business website, a chain of events unfolds across multiple AI systems in roughly 800 milliseconds. By the time the AI voice agent begins its response, the visitor's words have been converted to text, interpreted by a large language model, converted back to natural speech, and streamed directly to the browser. That entire cycle (hear, understand, think, speak) happens faster than a human can form a reply.
Understanding how this works matters for businesses evaluating AI voice agents, developers building with the technology, and anyone curious about what separates a genuinely useful voice agent from a clunky phone menu. There are five distinct layers to the stack.
The Full Pipeline: A Visual Walk-Through
Here is the complete journey a single utterance takes, from the moment you finish speaking to the moment the AI voice begins its response.
The raw audio waveform from your microphone is streamed over WebRTC to an acoustic model. The model — trained on millions of hours of speech — converts the waveform into a probability distribution over possible word sequences, producing a transcript. Modern systems like Deepgram Nova-3 and Whisper use transformer architectures that process audio in streaming chunks, enabling near-instantaneous transcription rather than waiting for a full sentence.
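The streaming behaviour described above means the client receives a flow of provisional ("interim") transcripts that are later replaced by confirmed ("final") ones. A minimal sketch of merging those events into a running transcript is shown below; the event shape (`{ isFinal, text }`) is an assumption for illustration, not any particular vendor's message format.

```javascript
// Minimal sketch: merging streaming STT events into a running transcript.
// The event shape ({ isFinal, text }) is assumed for illustration; real
// providers define their own streaming message formats.
function createTranscriptMerger() {
  let finalText = "";   // confirmed words that will not change
  let interimText = ""; // provisional words, revised as more audio arrives

  return {
    push(event) {
      if (event.isFinal) {
        finalText += (finalText ? " " : "") + event.text;
        interimText = "";
      } else {
        interimText = event.text; // each interim replaces the previous one
      }
    },
    current() {
      return interimText ? `${finalText} ${interimText}`.trim() : finalText;
    },
  };
}

// Simulated event stream standing in for a live connection:
const merger = createTranscriptMerger();
merger.push({ isFinal: false, text: "do you" });
merger.push({ isFinal: false, text: "do you do work" });
merger.push({ isFinal: true, text: "do you do work in Penrith" });
console.log(merger.current()); // "do you do work in Penrith"
```

This is why a voice agent can begin reasoning about a sentence before the speaker has finished it: interim results arrive while the audio is still streaming.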
Typical latency: 100–200ms

The transcript text is passed to a large language model along with a system prompt containing your business information: services, pricing, hours, FAQs, and booking logic. The LLM, a neural network trained on vast amounts of text, generates a contextually appropriate, helpful response. Crucially, the conversation is stateful: the full history is resent to the LLM on every turn, so the agent remembers what was said earlier and can carry context forward. This is what enables natural, multi-turn conversations rather than single isolated Q&A pairs.
Typical latency: 200–400ms (first token)

The LLM's text response is converted to natural-sounding speech audio. Modern neural TTS systems, like Telnyx NaturalHD, use neural vocoders that model human vocal tract acoustics, prosody (the rhythm and intonation of speech), and emotional tone. The key advancement over older TTS was the shift from concatenative synthesis (stitching together recorded speech fragments) to neural synthesis (generating the audio waveform directly from a learned model). This is why modern AI voices sound genuinely natural rather than robotic.
Typical latency: 80–150ms to first audio chunk

WebRTC (Web Real-Time Communication) is the open standard that enables audio to flow between your browser and the AI infrastructure without plugins, apps, or dial-in numbers. It handles packet loss, jitter buffering, echo cancellation, and adaptive bitrate in real time. For talking websites specifically, WebRTC means any visitor can speak to the AI directly from their browser, on desktop or mobile, with zero friction. No phone number needed, no app to download.
Ongoing: <50ms round-trip on good connections

Beyond conversation, AI voice agents can take action mid-call. Through function calling, a capability built into modern LLMs, the AI can query a calendar for available slots, create a contact in a CRM, send an SMS confirmation, or trigger a webhook. The caller experiences this as the AI smoothly handling an appointment booking; behind the scenes, a structured API call is made and returns a result that the LLM incorporates into its next response.
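The action layer can be pictured as a small dispatcher: the LLM emits a structured tool call, and the application routes it to a real handler. The tool names and call shape below are illustrative stand-ins, not a specific vendor's API, and the handlers return canned data where production code would hit a calendar API or SMS gateway.

```javascript
// Sketch of the action layer: routing a structured tool call from the
// LLM to an application handler. Tool names and the call shape are
// hypothetical, chosen for illustration.
const tools = {
  // Hypothetical handler: production code would query a real calendar.
  check_availability: ({ date }) => ({ date, slots: ["09:00", "11:30", "14:00"] }),
  // Hypothetical handler: production code would call an SMS gateway.
  send_sms: ({ to, body }) => ({ delivered: true, to, body }),
};

function dispatchToolCall(call) {
  const handler = tools[call.name];
  if (!handler) throw new Error(`Unknown tool: ${call.name}`);
  return handler(call.arguments);
}

// As the LLM might request mid-conversation:
const availability = dispatchToolCall({
  name: "check_availability",
  arguments: { date: "2025-06-03" },
});
console.log(availability.slots.length); // 3
```

The result object is fed back into the LLM's context, which is how "I have 9am, 11:30, or 2pm available" ends up in the agent's next spoken sentence.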
Triggered on-demand during conversation

Speech Recognition: How the AI Hears You
The acoustic model at the heart of speech-to-text has changed dramatically in the last three years. Older systems used Hidden Markov Models and trained on relatively small, controlled datasets. Modern systems — including the Whisper family from OpenAI and commercial systems like Deepgram — use transformer attention mechanisms and are trained on hundreds of thousands of hours of real-world speech, including diverse accents, background noise, and informal speech patterns.
For business voice agents, this matters for one key reason: Australian accents, regional dialects, and industry-specific terminology all need to be handled correctly. A plumber discussing "flexi hose" or a dentist mentioning "periodontal" needs the STT to transcribe accurately; misrecognition at this stage degrades everything downstream. The best modern STT systems achieve better than 99% word accuracy on clear speech (a word error rate under 1%), falling to around 94–97% accuracy in noisy environments.
One important feature in real-time voice agents is endpointing: detecting when the speaker has finished talking. Good endpointing prevents the AI from interrupting mid-sentence, and stops brief pauses from being misread as the end of a turn. Endpointing models are trained separately from transcription models and run continuously alongside them.
Large Language Models: The Brain of the Agent
The LLM is where the intelligence lives. Unlike a decision tree or a FAQ lookup, an LLM generates responses by predicting the most appropriate continuation of the conversation given everything it knows — the system prompt, the conversation history, and the caller's most recent message.
This generative approach is what allows AI voice agents to handle questions they've never explicitly been trained on. If a caller asks "do you do work in Penrith?" and the business's service area includes the Blue Mountains, the LLM can reason that Penrith is adjacent to that area and respond appropriately — even if "Penrith" never appeared in the training data for that specific agent.
System prompts: programming the agent's personality and knowledge
Every AI voice agent is configured through a system prompt — a set of instructions that defines how the agent should behave, what it knows, what it should avoid, and what tone to use. For a dental clinic, this might include: the clinic's services and pricing, the booking system API credentials, instructions to always ask for the caller's name and date of birth, and guidance on how to handle after-hours emergency calls.
The LLM reads this system prompt at the start of every conversation and uses it as the governing context for everything it says. Well-written system prompts produce consistent, on-brand agents. Poorly written prompts produce agents that go off-topic, give wrong information, or adopt an inappropriate tone.
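As a sketch of how that governing context is assembled, the function below builds the message list sent to the LLM on each turn: system prompt first, then the accumulated history, then the caller's latest utterance. The prompt wording and field names are illustrative, not Talking Widget's actual configuration.

```javascript
// Sketch: assembling the per-turn message list from a system prompt,
// conversation history, and the latest utterance. Field names and
// prompt text are hypothetical, for illustration only.
function buildMessages(business, history, latestUtterance) {
  const systemPrompt = [
    `You are the phone receptionist for ${business.name}.`,
    `Services: ${business.services.join(", ")}.`,
    `Hours: ${business.hours}.`,
    `Always ask for the caller's name before booking.`,
  ].join("\n");

  return [
    { role: "system", content: systemPrompt }, // governing context, sent every turn
    ...history,                                // prior turns: context is cumulative
    { role: "user", content: latestUtterance },
  ];
}

const messages = buildMessages(
  { name: "Smile Dental", services: ["check-ups", "whitening"], hours: "Mon-Fri 9-5" },
  [
    { role: "user", content: "Hi, do you do whitening?" },
    { role: "assistant", content: "We do! Would you like to book a session?" },
  ],
  "Yes please, next Tuesday"
);
console.log(messages.length);  // 4
console.log(messages[0].role); // "system"
```

Because the system prompt is prepended on every turn, editing it changes the agent's behaviour immediately, with no retraining involved.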
Text-to-Speech: Making the AI Sound Human
Voice quality is the single biggest factor in whether callers trust an AI voice agent or find it uncanny. Early TTS systems sounded robotic because they concatenated short recordings of human speech — an approach that produced audible stitching artefacts and flat intonation. Neural TTS changed everything.
Neural voice synthesis works by training a model to map text to mel spectrograms (a frequency-based representation of audio), then passing those spectrograms through a neural vocoder that generates the final audio waveform. The models learn not just pronunciation but prosody — the natural rise and fall of pitch, the micro-pauses between clauses, the subtle emphasis on key words — all of which are essential for speech that sounds genuinely natural.
The voices available today — including the Telnyx NaturalHD voice family used in Talking Widget — are trained on high-quality studio recordings of professional voice actors, then fine-tuned for different styles and emotional registers. The result is speech that most callers cannot distinguish from a human recording.
WebRTC: Why No App Is Needed
WebRTC was developed by Google and standardised by the W3C and IETF. It is supported natively in every major browser — Chrome, Safari, Firefox, Edge — without any plugin or installation. For a talking website, this means a visitor can click a button and immediately start a voice conversation with no friction whatsoever.
The protocol handles all the hard parts of real-time audio: establishing peer connections, negotiating codecs (Opus is standard for voice), managing network address translation (NAT traversal), handling packet loss with forward error correction, and adapting bitrate when network conditions change. From the developer's perspective, WebRTC reduces to a few lines of JavaScript. From the caller's perspective, it simply works.
Speech-to-text: streaming transcription, Australian English optimised, 99%+ accuracy on clear speech, sub-200ms latency
Language model: 235B parameter mixture-of-experts model, hosted directly by Telnyx with no external API routing and no added latency penalty
Text-to-speech: neural vocoder synthesis, multiple voice personas, streaming audio output, Australian-English prosody
Transport: open W3C standard (WebRTC), native in all browsers, Opus codec, adaptive bitrate, no app install required
Why This Matters for Your Business
The five-layer pipeline above is not academic. It has direct implications for what AI voice agents can and cannot do, and why the best-implemented ones feel remarkably natural while poorly implemented ones feel frustrating.
Latency compounds. If each layer adds 200ms of delay, a conversation with a 1-second round-trip feels like talking to someone on a bad satellite connection. The best implementations push the total latency below 800ms by streaming audio from TTS as the LLM generates tokens, rather than waiting for the full response before beginning synthesis.
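Adding up the midpoints of the stage latencies quoted earlier shows why the budget works out. The figures below are the article's own ranges reduced to midpoints; streaming overlaps the stages, so the perceived delay is lower still.

```javascript
// Rough latency budget using midpoints of the stage figures quoted
// above. Streaming overlaps stages, so the real perceived delay is
// lower than this naive sum; the sum just shows the budget fits.
const stagesMs = {
  stt: 150,           // streaming transcription (100-200ms)
  llmFirstToken: 300, // time to first LLM token (200-400ms)
  ttsFirstChunk: 115, // time to first audio chunk (80-150ms)
  network: 50,        // WebRTC round-trip (<50ms on good connections)
};

const naiveTotal = Object.values(stagesMs).reduce((a, b) => a + b, 0);
console.log(naiveTotal); // 615: within the ~800ms target even before overlap
```

Any single stage blowing its budget, say an LLM that takes a full second to produce its first token, pushes the whole conversation into the awkward satellite-call zone.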
Context is cumulative. Because the LLM maintains conversation history, the agent gets more useful as the conversation progresses — just as a human receptionist becomes more helpful once they understand what you need. A caller who mentions "I'm calling about my appointment next Tuesday" doesn't need to re-explain that context two exchanges later.
Actions close the loop. Voice agents that only talk are useful. Voice agents that talk and then actually book the appointment, send the confirmation SMS, and create the CRM contact are transformative. The action layer turns conversation into revenue.
Hear the technology in action
Don't read about it — experience it. Click the widget in the corner of this page and talk to Maya, the AI voice agent powering Talking Widget. Ask anything about how it works.
Try it now on our homepage →