Real-time Voice AI Agents: Build Guide & Learning Path

A curated learning path for building real-time voice AI agents climbed to 174 points on Hacker News this week, and it’s worth your attention if you’ve been circling this space. According to Hacker News, the modern voice stack has converged on a clear pattern: a real-time transport layer (WebRTC or telephony), a streaming pipeline of speech-to-text, LLM, and text-to-speech, plus a turn-taking model that decides when the agent should speak. The guide moves from foundations to production telephony, with resources tagged Beginner, Intermediate, or Advanced.

What stands out here is the recommended order. Skip a step and you’ll waste weeks on the wrong abstraction.

Quick Start

You’ll learn the five layers of a production voice agent and which resources to hit at each stage. You need basic Python or TypeScript, an LLM API key, and patience for latency math.

Step 1: Lock In the Foundations

Start with the mental model. Without it, every framework decision is a coin flip.

Voice AI & Voice Agents: An Illustrated Primer (Kwindla Hultman Kramer): the de facto textbook for the field, free and regularly updated.
Voice Agent Architecture (LiveKit): visual walkthrough of streaming patterns and where latency accumulates.
Everything You Need to Know About Voice AI Agents (Deepgram): end-to-end primer on ASR, LLM reasoning, and synthesis.
Core Latency in AI Voice Agents (Twilio): end-of-turn detection, silence thresholds, smart endpointing.
Advice on Building Voice AI in June 2025 (Daily.co): P50/P95 latency-budget guidance from Pipecat’s creators.

Why it matters: latency, not raw quality, is what kills voice agents. Get this in your bones first.
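To make the latency point concrete, here is a minimal Python sketch of a per-turn latency budget. The stage names and most of the numbers are illustrative assumptions, not figures from the guide; only the 200 ms speech-to-text and 300 ms LLM targets echo bars mentioned later in this article. It also assumes the stages run strictly one after another, whereas real streaming pipelines overlap them in part.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    budget_ms: float  # target upper bound for this stage


# Placeholder budgets for illustration only; replace with your own measurements.
STAGES = [
    Stage("endpointing (end-of-turn wait)", 400.0),  # assumed
    Stage("speech-to-text first byte", 200.0),       # bar cited in this article
    Stage("LLM time-to-first-token", 300.0),         # bar cited in this article
    Stage("text-to-speech first audio", 150.0),      # assumed
    Stage("network transport", 100.0),               # assumed
]


def total_budget_ms(stages: list[Stage]) -> float:
    """Sum per-stage budgets, assuming the stages run strictly in sequence."""
    return sum(stage.budget_ms for stage in stages)


def largest_stage(stages: list[Stage]) -> Stage:
    """Return the stage that consumes the largest share of the budget."""
    return max(stages, key=lambda stage: stage.budget_ms)
```

Calling total_budget_ms(STAGES) adds up the placeholders, and largest_stage(STAGES) tells you where to look first once real measurements replace them.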

Step 2: Pick One Framework and Ship Hello-World

For open-source production, LiveKit Agents and Pipecat are the two safest bets. For managed dashboards, Vapi, Retell, and Bland win on time-to-first-call.

LiveKit Agents Voice AI Quickstart: working assistant in under 10 minutes via Python or TypeScript over WebRTC.
Pipecat Quickstart: scaffolds a Deepgram + OpenAI + Cartesia pipeline you can talk to in the browser in 5 minutes.
Ultravox: open-weight multimodal speech LLM that skips the separate ASR stage for ~150 ms time-to-first-token (TTFT) (Advanced).

Step 3: Swap Components to Learn the Layers

Once you’ve shipped something that talks, swap pieces to feel what each layer contributes.
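One way to make that swapping cheap is to hide each layer behind a small interface. The sketch below is an assumption about how you might structure your own code, not how LiveKit Agents or Pipecat are built. Every class name is hypothetical, and the stubs process whole turns rather than streaming, purely to keep the example short.

```python
from typing import Protocol


class SpeechToText(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class LanguageModel(Protocol):
    def reply(self, text: str) -> str: ...


class TextToSpeech(Protocol):
    def synthesize(self, text: str) -> bytes: ...


class VoicePipeline:
    """Runs one user turn through STT -> LLM -> TTS (non-streaming sketch)."""

    def __init__(self, stt: SpeechToText, llm: LanguageModel, tts: TextToSpeech) -> None:
        self.stt = stt
        self.llm = llm
        self.tts = tts

    def handle_turn(self, audio: bytes) -> bytes:
        text = self.stt.transcribe(audio)
        answer = self.llm.reply(text)
        return self.tts.synthesize(answer)


# Stub components so the sketch runs without any external service.
class FakeSTT:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")  # pretends the audio bytes are already text


class ShoutingLLM:
    def reply(self, text: str) -> str:
        return text.upper()


class FakeTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")  # pretends the text bytes are audio


pipeline = VoicePipeline(FakeSTT(), ShoutingLLM(), FakeTTS())
```

Replacing any one stub with a wrapper around a real engine leaves VoicePipeline untouched, which is the point of the exercise: you change one layer and listen for the difference.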

Speech-to-text (streaming, first-byte under 200 ms is the bar):

openai/whisper for DIY ASR
SYSTRAN/faster-whisper, up to 4x faster than openai/whisper, with optional INT8 quantization
NVIDIA NeMo (Parakeet/Canary) for top-leaderboard open models
Moonshine for tiny on-device streaming (~190 MB)

Text-to-speech:

Coqui TTS (idiap fork): the most battle-tested OSS toolkit
Piper: fast local neural TTS, runs on a Raspberry Pi
Kokoro 82M: Apache-licensed, tops community ELO arenas, CPU-friendly
F5-TTS: diffusion-transformer with zero-shot voice cloning
Orpheus-TTS: Llama-3B-based expressive TTS controlled with emotion tags
Sesame CSM: context-aware multi-speaker with Mimi codec (Advanced)

LLMs for real-time: a time-to-first-token under 300 ms changes how the conversation feels. Groq, Cerebras, and SambaNova Cloud all push tokens fast. For full speech-to-speech, look at OpenAI Realtime API, Google Gemini Live, and Moshi from kyutai-labs (open-source, full-duplex, with 200 ms latency).
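Time-to-first-token is easy to measure yourself. The sketch below times how long a token stream takes to yield its first item. The fake_token_stream generator is a hypothetical stand-in for a provider's streaming response, not any vendor's API; in practice you would pass in whatever iterator your client library returns.

```python
import time
from typing import Iterable, Iterator


def fake_token_stream(first_delay_s: float, tokens: list[str]) -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM response."""
    time.sleep(first_delay_s)  # simulated wait before the first token arrives
    for token in tokens:
        yield token


def time_to_first_token_ms(stream: Iterable[str]) -> tuple[float, str]:
    """Return (milliseconds until the first token, the first token itself)."""
    start = time.perf_counter()
    first = next(iter(stream))  # blocks until the stream produces something
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return elapsed_ms, first
```

Because a generator does no work until it is first advanced, the simulated delay falls inside the timed region. With a real client, start the timer just before you send the request so network time is counted too.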

Step 4: Connect to a Real Phone Number

This is where toy demos become product. The guide points to WebRTC fundamentals and SIP/telephony resources before suggesting tutorials and starter repos. Resist the urge to skip those fundamentals.

Step 5: Make It Safe Enough to Ship

The final layer covers evaluation and testing, production deployment and scaling, plus ethics, safety, and regulation. Hacker News commenters flagged endpointing (knowing when the user is done talking) as the most underestimated problem in the stack. AssemblyAI’s deep dive on intelligent turn detection is the clearest treatment.
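For a feel of why endpointing is hard, it helps to see the naive baseline first. The sketch below declares the turn over once speech has been heard and a fixed stretch of silence follows. It is an assumed, simplified energy-threshold approach, not the intelligent turn detection the resources above describe, and the frame size, energy threshold, and 500 ms silence window are placeholder values.

```python
import math
from typing import Optional, Sequence


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one frame of samples in [-1.0, 1.0]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def find_end_of_turn(
    frames: Sequence[Sequence[float]],
    frame_ms: int = 20,              # assumed frame duration
    energy_threshold: float = 0.01,  # assumed speech/silence cutoff
    silence_ms: int = 500,           # assumed silence needed to end the turn
) -> Optional[int]:
    """Return the index of the frame where the turn is declared over, else None."""
    needed = math.ceil(silence_ms / frame_ms)  # consecutive silent frames required
    heard_speech = False
    silent_run = 0
    for index, frame in enumerate(frames):
        if rms(frame) >= energy_threshold:
            heard_speech = True
            silent_run = 0  # any speech resets the silence counter
        elif heard_speech:
            silent_run += 1
            if silent_run >= needed:
                return index
    return None
```

The weakness is visible in the parameters: a short window cuts off anyone who pauses to think, while a long one adds dead air to every single turn. That trade-off is what smarter turn-detection models try to escape.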

Practical Next Steps

Pick LiveKit Agents or Pipecat. Ship a hello-world today. Measure your P95 latency against a budget before swapping any component. Then connect a real phone number before worrying about model quality. The full curated list, with every resource tagged by level, lives at the original source.
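If you log one response latency per turn, P50 and P95 take only a few lines. The sketch below uses the nearest-rank definition of a percentile, which is one of several common conventions; the function name and the choice of definition are assumptions, not something the guide prescribes.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100.0)  # 1-based rank
    return ordered[rank - 1]
```

Feed it per-turn latencies in milliseconds and compare percentile(latencies, 95) against your budget. The P95 is the number that tells you how the slow turns feel, which an average hides.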
