Microsoft drops three AI models to rival Google and OpenAI

Microsoft AI just released three foundational models that handle text transcription, voice generation, and image-to-video creation. The launch, reported by TechCrunch AI, marks a clear signal that Microsoft isn’t content just reselling OpenAI’s tech, it wants its own competitive stack.

The three models come from Microsoft’s MAI Superintelligence team, led by Microsoft AI CEO Mustafa Suleyman, which was formed back in November 2025.

What’s in the box

Here’s what each model does:

  • MAI-Transcribe-1 — Speech-to-text across 25 languages. Microsoft claims it’s 2.5x faster than its own Azure Fast transcription service. Starts at $0.36 per hour.
  • MAI-Voice-1 — Generates 60 seconds of audio in just one second. Supports custom voice creation. Starts at $22 per 1 million characters.
  • MAI-Image-2 — A video-generating model (despite the name). Originally launched on MAI Playground on March 19, now getting wider availability. Starts at $5 per 1M tokens for text input and $33 per 1M tokens for image output.

All three models are now available on Microsoft Foundry, with the transcription and voice models also accessible through MAI Playground, Microsoft’s new LLM testing platform.

The pitch: cheaper than the competition

In a market overflowing with AI models, Microsoft is betting on price as a differentiator. The company explicitly positioned these models as cheaper alternatives to comparable offerings from Google and OpenAI, according to TechCrunch AI.

That’s a notable move. Microsoft has poured over $13 billion into OpenAI and still hosts OpenAI’s models across its product lineup. Yet here it is, building competing models and undercutting its own partner on price.

The OpenAI relationship: it’s complicated

Suleyman addressed the obvious tension head-on. In an interview with VentureBeat, he reaffirmed Microsoft’s commitment to OpenAI. But a recent renegotiation of that partnership specifically opened the door for Microsoft to pursue its own superintelligence research, as Suleyman told The Verge.

Microsoft compared the approach to its chip strategy, it both builds its own custom silicon and buys from external suppliers. The message: these things aren’t mutually exclusive.

“At Microsoft AI, we’re building Humanist AI. We have a distinct view when creating our AI models, putting humans at the center, optimizing for how people actually communicate, training for practical use,” Suleyman wrote in a blog post. He also teased that more models are coming soon, both in Foundry and directly in Microsoft products.

Why this matters

This release tells us two things. First, Microsoft is serious about reducing its dependency on OpenAI for foundational model capabilities. Having your own transcription, voice, and video generation stack means you control the pricing, the roadmap, and the integration depth.

Second, the multimodal focus is deliberate. These aren’t chatbots or reasoning models, they’re practical building blocks for developers who need speech, audio, and visual generation at scale. The speed claims (2.5x faster transcription, 60 seconds of audio in one second) suggest Microsoft is targeting production workloads, not research demos.

With Suleyman promising more models soon, Microsoft’s MAI team is clearly just getting started. For more details on the launch, check out the full coverage at TechCrunch AI.

Scroll to Top