Real-time voice AI gets a major upgrade from Google

Google DeepMind just dropped Gemini 3.1 Flash Live, its most capable audio and voice model to date. The model is designed for real-time dialogue, targeting developers building voice-first AI agents and enterprises that need reliable task execution at scale.

The headline number: 90.8% on ComplexFuncBench Audio, a benchmark that measures multi-step function calling with various constraints. That’s a notable jump over Google’s previous model, according to Google DeepMind, and it signals that voice AI is getting meaningfully better at handling complex, chained tasks rather than just simple Q&A.

Why this matters

Voice-first AI has been stuck in an awkward middle ground. Models could talk, but they couldn’t reliably do things. Multi-step tasks such as booking a flight, pulling data from multiple sources, or executing a sequence of API calls mid-conversation broke down fast. A 90.8% score on complex function calling suggests that gap is closing.

For developers, the practical implication is clear: building voice agents that handle real workflows (not just scripted responses) is becoming viable. For enterprises, it means fewer fallbacks to human operators when a voice agent hits a multi-step request.

What’s new

  • Better reasoning during live audio: The model can handle complex, multi-constraint tasks while maintaining natural conversation flow
  • Improved reliability: Google DeepMind emphasizes that 3.1 Flash Live is built for production-scale deployment, not just demos
  • Speed + quality: The “Flash” lineage means low latency, but this version doesn’t sacrifice reasoning quality to get there
  • Broad availability: Rolling out across Google products, with API access for developers

The bigger picture

This release fits a clear industry trend. OpenAI, ElevenLabs, and others have all been pushing voice AI capabilities in recent months. But the competition is shifting from “can it sound natural?” to “can it actually get things done while sounding natural?” Google is going directly after that second question.

The ComplexFuncBench Audio benchmark is worth watching. Function calling, the ability to trigger actions, pull data, and chain operations during a conversation, is what separates a voice chatbot from a voice agent. Scoring above 90% there puts real pressure on competitors to publish comparable numbers.
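
To make "chaining operations" concrete, here is a minimal sketch of what a chained function call looks like on the application side. It does not use the Gemini API or any real SDK: the tools `search_flights` and `book_flight`, the `TOOLS` registry, and the plan format are all hypothetical names invented for illustration. The point is only that the second call depends on the result of the first, which is what makes multi-step requests harder than one-shot lookups.

```python
from typing import Any, Callable

# Hypothetical tools with canned data; a real agent would call external APIs here.
def search_flights(origin: str, destination: str) -> list[dict[str, Any]]:
    return [
        {"flight_id": "F100", "price": 420},
        {"flight_id": "F200", "price": 310},
    ]

def book_flight(flight_id: str) -> dict[str, Any]:
    return {"status": "confirmed", "flight_id": flight_id}

# Hypothetical registry mapping tool names to Python callables.
TOOLS: dict[str, Callable[..., Any]] = {
    "search_flights": search_flights,
    "book_flight": book_flight,
}

def run_plan(plan: list[dict[str, Any]]) -> list[Any]:
    """Execute tool calls in order. An argument may be a callable that
    derives its value from the results of earlier steps (the "chain")."""
    results: list[Any] = []
    for step in plan:
        args = {
            key: value(results) if callable(value) else value
            for key, value in step["args"].items()
        }
        results.append(TOOLS[step["name"]](**args))
    return results

def cheapest_flight_id(results: list[Any]) -> str:
    # Pick the cheapest option from the first step's search results.
    return min(results[0], key=lambda flight: flight["price"])["flight_id"]

# Step 2 depends on step 1: book whichever flight the search found cheapest.
plan = [
    {"name": "search_flights", "args": {"origin": "SFO", "destination": "JFK"}},
    {"name": "book_flight", "args": {"flight_id": cheapest_flight_id}},
]

results = run_plan(plan)
print(results[1])
```

In a real voice agent the model, not a hand-written plan, would decide which function to call next and with which arguments. The sketch only shows the dependency structure such a model has to get right.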

For developers evaluating which voice platform to build on, 3.1 Flash Live is worth testing against your specific use cases. Benchmarks tell one story; production performance with your APIs and your edge cases tells another.
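
One way to run that kind of test is to record the function calls a voice agent actually makes for each scripted request and compare them with the calls you expected. The sketch below is an assumption about how such a check might look, not the scoring method used by ComplexFuncBench Audio or by Google DeepMind. The case format and the `pass_rate` helper are hypothetical, and the sample data is made up.

```python
from typing import Any

# A call is a (function name, arguments) pair. This format is an assumption.
Call = tuple[str, dict[str, Any]]

def pass_rate(cases: list[dict[str, list[Call]]]) -> float:
    """Fraction of cases where the recorded call sequence exactly
    matches the expected one (same functions, order, and arguments)."""
    if not cases:
        raise ValueError("no test cases provided")
    passed = sum(1 for case in cases if case["expected"] == case["actual"])
    return passed / len(cases)

# Made-up example data: the third case fails on a wrong argument value.
cases = [
    {
        "expected": [("get_balance", {"account": "checking"})],
        "actual": [("get_balance", {"account": "checking"})],
    },
    {
        "expected": [("find_order", {"id": "A1"}), ("cancel_order", {"id": "A1"})],
        "actual": [("find_order", {"id": "A1"}), ("cancel_order", {"id": "A1"})],
    },
    {
        "expected": [("book_table", {"guests": 4})],
        "actual": [("book_table", {"guests": 2})],
    },
]

rate = pass_rate(cases)
print(f"{rate:.1%} of cases passed")
```

Exact matching is deliberately strict. Depending on the use case, a looser comparison (ignoring call order, or tolerating extra read-only calls) may be a fairer measure.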

More details are available in Google DeepMind’s official announcement.
