Running the same prompts that work in chat completion inside OpenAI Realtime API? They’ll fail you, and the failure mode won’t be obvious until you’re live in front of real users.
One developer spent a month running a production voice tutor on the Realtime API (open source here). The finding: Realtime isn’t chat completion with audio on top. It’s a different system with different rules, and most of what you know about prompting doesn’t transfer. The mismatch runs deeper than syntax. It’s a different mental model for how the model processes instructions and context over time.
The core difference
Chat completion is forgiving. You can patch mid-conversation, inject a clarifying message, nudge the tone, add context. The model reads it and adjusts. Made the assistant too formal? Drop a message telling it to loosen up. Forgot to include a constraint? Add it on the next turn. The architecture gives you room to iterate as the conversation unfolds.
Realtime is live audio. The session is streaming. There’s no turn to inject corrections into. Your system prompt isn’t one message in a conversation. It’s the entire personality for the full session. If it’s vague, you’ll notice. You just can’t fix it while it’s happening. Think of it less like texting back and forth and more like briefing someone right before they walk into a live interview. Whatever you didn’t cover in the briefing, they’re winging it on stage. The stakes of your upfront spec are completely different.
This also changes how you think about length and specificity. In chat prompting, shorter is often better. In Realtime, a longer, denser behavioral spec tends to outperform a minimal one, because the model has to carry it through the entire session without any correction anchors along the way.
What breaks from chat prompting
- 🔁 Mid-session corrections ignored roughly 40% of the time. The model stops responding to them in long sessions. After several minutes of streaming audio, the original system prompt context decays in influence. A correction you inject at minute eight carries far less weight than you’d expect.
- Few-shot examples backfire. Pasting “Example user: X / Example AI: Y” in the system prompt makes the model treat those as real conversation turns. Instead of learning from the pattern, it gets confused about where the conversation actually started. This is especially damaging in tutoring or interview-style apps where turn sequence matters.
- Tool calls mid-speech interrupt the audio stream and sound like glitches. The model stops itself mid-sentence to call a function, then tries to resume. Users hear a hard cut, sometimes silence, then a response that doesn’t always reconnect smoothly to what came before. In production, this tanks perceived quality fast.
What actually works 🎙️
- Voice-first framing: “respond conversationally, in 1-2 sentences, like you’re sitting next to the user” cuts verbosity by roughly 50%. Without this, the model defaults to the same structured, multi-paragraph style it uses in text, which sounds unnatural when spoken aloud and makes sessions feel robotic.
- Behavioral descriptions instead of examples: “when the user asks for steps, give them numbered, one at a time, wait for confirmation.” This pattern gives the model a rule it can apply consistently rather than a template it might misinterpret as conversation history.
- Context injection as a dummy user turn: Inject screen state via
conversation.item.createwithrole: userright before each response. Fresh context the model actually uses, not stale system prompt updates. This is especially useful in apps where the user’s environment changes, like moving between screens or completing steps in a flow.
How to adapt a chat prompt for Realtime
- Rewrite the system prompt as a full personality spec. Not instructions for one exchange, a character that stays consistent for 10+ minutes with no reinforcement mid-session. Define tone, pacing, how it handles confusion, what it does when the user goes off-topic. Anything you’d normally course-correct in chat needs to be baked in upfront.
- Replace every few-shot example with a behavioral rule. “When the user does X, do Y” instead of showing a conversation snippet. Read through your existing prompt and flag every place you used an example. Convert each one to a declarative rule. This alone fixes most of the context-confusion issues.
- Add “always finish your current sentence before invoking tools.” This eliminates the mid-speech interruption bug roughly 80% of the time. It’s a single instruction that costs nothing and the audio quality improvement is immediate.
- Inject live context as a dummy user turn, not a system prompt update. The model treats it as current information rather than stored memory. Pair this with a light prefix like “Context update:” so the model understands it’s receiving state, not a user question requiring a response.
One pattern worth calling out: test your Realtime prompt by imagining someone reading it aloud as a briefing, not as written instructions. If it sounds like a memo, it’ll perform like one. If it sounds like a coherent description of a person and how they behave, the model will carry that identity much further into the session without drift.
The full open-source implementation is at github.com/tryskilly/skilly. The code is worth reading alongside the prompting notes if you’re building anything on the Realtime API. The prompt files in particular show exactly how these rules look when applied to a real production tutor, not just a toy demo.
Frequently Asked Questions
Q: How should I structure prompts for Realtime API differently than chat completion?
Realtime’s system prompt is a one-shot constitution, it’s not reinforced with each turn like chat completion. You can’t layer mid-conversation corrections without sounding jarring because there’s no natural seam in the stream. Build your system prompt to be comprehensive and specific upfront, with imperative language and explicit behavioral bounds (e.g., “always finish your current sentence before invoking tools” beats “be polite about tool calls”).
Q: Do few-shot examples work the same way in Realtime?
No, Realtime confuses example turns for real user turns, which messes with the model’s behavior. Use behavioral descriptions instead (“When asked for steps, provide numbered steps one at a time and wait for confirmation”). The key: imperative voice, present tense, explicit bounds. “You typically give numbered steps” will be ignored ~50% of the time because the model treats “typically” as a hint, not a rule.
Q: How do you prevent tool calls from breaking up speech mid-sentence?
Prompt the model to finish its current sentence before invoking any tools, this gives it a concrete sequence contract to follow, not just a vague politeness rule. It works ~80% of the time when phrased clearly. This reframes the problem from “how to be polite” to “what’s the exact ordering rule.”
Q: What’s the best way to keep responses short in voice conversations?
Use voice-aware prompts: “respond conversationally in 1-2 sentences, like you’re sitting next to the user.” This alone cuts verbosity by ~50%. Pair it with persona anchoring through voice selection (shimmer, alloy, etc.) and a 1-sentence personality description (“warm, patient teacher who never makes the user feel dumb”).
Q: How do you inject new context without the user noticing?
Instead of stuffing screen state into your system prompt (which goes stale), inject fresh context via a dummy user turn right before each response using conversation.item.create with role: user, content: “[screen now shows: …]”. The model treats it as part of the natural conversation, making the seam less jarring than baking everything into the system prompt upfront.
What I learned from running OpenAI Realtime API in production for a month — prompting + state management notes
by u/engmsaleh in PromptEngineering