Xiaomi Launches MiMo AI Voice Models to Challenge Google and OpenAI

April 25, 2026

Xiaomi has announced an update to its MiMo voice AI platform with the launch of the MiMo-V2.5-TTS series and MiMo-V2.5-ASR. The company describes the new lineup as a full-link voice model system designed for the agent era, covering both speech output and speech input. The launch follows Xiaomi’s MiMo-V2-TTS model from March, which focused on detailed control over tone, emotion, and speaking style. With this release, Xiaomi is positioning MiMo as a competitor to voice AI offerings from Google and OpenAI.

Text-to-speech models

The MiMo-V2.5-TTS lineup includes three separate models and is available at no cost for a limited time through Xiaomi’s MiMo Open Platform. The base MiMo-V2.5-TTS model includes preset voices and supports adjustments for speech rate, tone, and emotion. MiMo-V2.5-TTS-VoiceDesign allows users to create entirely new voice timbres using a short input sentence. MiMo-V2.5-TTS-VoiceClone reproduces a specific voice using a small number of samples while maintaining consistency across different speaking styles and instructions. Across the lineup, Xiaomi prioritizes natural language instructions over structured parameters.

Users can describe how a voice should sound in plain language, similar to directing a voice actor. The system also supports layered script-style inputs for use cases such as game characters and audio dramas, allowing separate control of character traits, scenes, and dialogue. Inline audio tags let users adjust emotion or delivery at specific points within a sentence. Multiple tags can be combined in the same text, and they work in both Chinese and English.
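
As a rough illustration of what a layered script-style input could look like, the sketch below assembles character traits, a scene description, and tagged dialogue into a single plain-text prompt. The article does not document Xiaomi's actual input format, so everything here is an assumption: the `Character:` / `Scene:` / `Dialogue:` labels, the square-bracket tag syntax such as `[shouting]`, and the `Line` and `build_script` names are all hypothetical and are not part of any MiMo API.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """One line of dialogue. The text may carry inline tags.

    The "[tag]" syntax is a hypothetical stand-in; the real tag
    format used by MiMo-V2.5-TTS is not described in the article.
    """
    speaker: str
    text: str


def build_script(character: str, scene: str, lines: list[Line]) -> str:
    """Compose a layered prompt: character traits, scene, then dialogue."""
    parts = [
        f"Character: {character}",
        f"Scene: {scene}",
        "Dialogue:",
    ]
    parts += [f"{line.speaker}: {line.text}" for line in lines]
    return "\n".join(parts)


# Example: one character, one scene, one tagged line of dialogue.
prompt = build_script(
    character="Old sailor, gravelly voice",
    scene="Storm at sea, night",
    lines=[Line("Captain", "[shouting] Hold the line!")],
)
print(prompt)
```

Keeping the three layers as separate arguments mirrors the separate control of character traits, scenes, and dialogue described above; how the resulting text would actually be submitted to the platform is left out because the article does not specify it.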

Speech recognition model

Xiaomi is also releasing MiMo-V2.5-ASR as an open-source speech recognition model. The company designed it for real-world scenarios such as bilingual conversations, regional dialects, and noisy environments. Supported Chinese dialects include Wu, Cantonese, Minnan, and Sichuanese. The model can switch between Chinese and English without preset language tags. It can also recognize song lyrics even when vocals are mixed with music. For meetings and multi-speaker environments, the system transcribes overlapping conversations with speaker separation. Xiaomi said the model maintains accuracy in high-noise settings and with far-field audio capture.

Structured transcripts and availability

MiMo-V2.5-ASR includes built-in phonetics and context-based punctuation, reducing the need for post-processing. Xiaomi said the model delivers state-of-the-art or near-state-of-the-art results on benchmarks covering bilingual recognition, dialect processing, and code-switching tasks. The TTS models are available through Xiaomi’s platform and can be tested in MiMo Studio. The ASR model is available with open-source weights and code for direct use or customization.