Key Takeaways
- Three proprietary AI models from Microsoft—MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2—are now accessible via Microsoft Foundry.
- The transcription model MAI-Transcribe-1 demonstrates superior accuracy across 25 languages, surpassing OpenAI’s Whisper on all 25 and Google’s Gemini Flash on 22 of them in benchmarks.
- A contract renegotiation with OpenAI in late 2025 granted Microsoft freedom to develop frontier AI models independently.
- Remarkably, each AI model was created by engineering teams with fewer than 10 members, utilizing approximately 50% of the GPU resources required by competitors.
- Microsoft AI CEO Mustafa Suleyman announced intentions to develop a frontier large language model, aiming for complete AI self-sufficiency.
In a bold strategic move, Microsoft unveiled three proprietary AI models on Wednesday, positioning itself as a direct competitor to OpenAI, Google, and emerging AI startups in the artificial intelligence landscape.
The newly released trio—MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2—is currently accessible through Microsoft Foundry and the newly introduced MAI Playground. The models address speech-to-text conversion, text-to-speech synthesis, and image generation. Mustafa Suleyman, CEO of Microsoft AI, characterized the debut as the initial offering from his “superintelligence team,” established merely six months prior.
MSFT stock completed its weakest quarter since 2008, declining approximately 17% year-to-date. The model introduction marks Suleyman’s first public response to shareholder demands for tangible returns on the corporation’s substantial AI investments.
The flagship offering, MAI-Transcribe-1, achieves the lowest average Word Error Rate on the FLEURS benchmark across the 25 most-used languages in Microsoft products, averaging 3.8%. The company asserts it surpasses OpenAI’s Whisper-large-v3 across all 25 languages and exceeds Google’s Gemini 3.1 Flash on 22 of them. The system handles MP3, WAV, and FLAC files up to 200MB, delivering batch processing 2.5 times faster than Azure’s current speech-to-text solution. Internal testing is underway within Teams and Copilot Voice.
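Word Error Rate, the metric behind the FLEURS figures above, is the word-level edit distance between a reference transcript and the model’s output, divided by the number of reference words. A minimal sketch of the standard computation (illustrative only, not Microsoft’s evaluation code):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed as a word-level Levenshtein distance via dynamic programming."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + cost, # substitution or match
            )
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word in a five-word reference -> WER of 0.2 (20%)
print(word_error_rate("hello world how are you", "hello word how are you"))
```

A benchmark score like 3.8% means roughly one word error per 26 reference words, averaged over the test set.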
MAI-Voice-1 produces 60 seconds of realistic audio output in just one second and enables custom voice generation from merely seconds of sample audio input. Pricing is set at $22 per million characters. MAI-Image-2 secures a top-three position on the Arena.ai leaderboard and is being integrated into Bing and PowerPoint, with pricing at $5 per million input tokens and $33 per million image output tokens. WPP has emerged as an early enterprise adopter implementing it at scale.
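The published per-unit rates translate directly into usage costs. A back-of-the-envelope sketch using the figures quoted above (function names are illustrative, not part of any official SDK):

```python
def voice_cost(characters: int, rate_per_million: float = 22.0) -> float:
    """Synthesis cost at the announced $22 per million characters (MAI-Voice-1)."""
    return characters / 1_000_000 * rate_per_million

def image_cost(input_tokens: int, output_tokens: int) -> float:
    """Generation cost at the announced $5 per million input tokens and
    $33 per million image output tokens (MAI-Image-2)."""
    return input_tokens / 1_000_000 * 5.0 + output_tokens / 1_000_000 * 33.0

# Narrating a 10,000-character script costs about 22 cents at these rates
print(f"${voice_cost(10_000):.2f}")
```

At these rates, a workload of one million characters of narration runs $22, and one million input plus one million image output tokens runs $38.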
Renegotiated Agreement Unlocks New Possibilities
This product release would have been impossible twelve months ago. Before October 2025, Microsoft faced contractual restrictions preventing independent pursuit of artificial general intelligence under its initial 2019 agreement with OpenAI.
When OpenAI pursued expanded computing infrastructure beyond Microsoft’s ecosystem—establishing agreements with SoftBank and other partners—Microsoft initiated contract renegotiations. The updated terms permitted Microsoft to pursue proprietary frontier models while maintaining licensing rights to all OpenAI developments through 2032.
Suleyman told VentureBeat: “Back in September of last year, we renegotiated the contract with OpenAI, and that enabled us to independently pursue our own superintelligence.” He emphasized that the partnership with OpenAI continues through at least 2032.
Lean Teams Deliver Substantial Results
Among the most striking revelations from the announcement: each model emerged from a team of roughly 10 engineers. Suleyman said the audio model was developed by 10 people and attributed performance gains to model architecture and data strategies rather than headcount.
“Our image team, equally, is less than 10 people,” he revealed. This methodology contrasts sharply with prevailing industry practices, where organizations like Meta have allegedly offered individual researchers compensation packages valued between $100 million and $200 million.
Microsoft indicates its pricing strategy is intentionally competitive—structured to undercut Amazon and Google. Suleyman characterized it as “the cheapest of any of the hyperscalers.” The company is planning frontier-scale GPU clusters for deployment within the next 12 to 18 months.
Suleyman confirmed that a large language model is part of future development plans, stating Microsoft’s objective is to become “completely independent” and to deliver “state of the art models across all modalities.”