Voxtral transcribes at the speed of sound.
- Mistral AI launches Voxtral Transcribe 2, featuring ultra-low 200ms latency and state-of-the-art speaker diarization.
- Voxtral Realtime is released with open weights under Apache 2.0, supporting efficient execution on edge devices.
- New models offer 13-language support and significant cost advantages over GPT-4o mini and Gemini 2.5 Flash.
Mistral AI has significantly raised the bar for speech technology with the launch of Voxtral Transcribe 2, a next-generation model family designed for both high-efficiency batch processing and live interactions. The release includes two distinct versions: Voxtral Mini Transcribe V2 for high-volume batch tasks and Voxtral Realtime, which achieves sub-200ms latency. This ultra-fast response time is critical for building voice agents that feel natural and fluid rather than lagging mid-conversation.
A standout feature is the introduction of precision speaker diarization—the process of identifying and labeling different speakers within an audio stream—which is essential for accurately transcribing meetings or interviews. Furthermore, the models support context biasing, allowing users to provide specific technical terms or names to improve accuracy. This ensures that industry-specific jargon or unique proper nouns are captured correctly, overcoming a common hurdle for generic transcription services.
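To make the diarization idea concrete, here is a minimal sketch of what working with diarized output can look like. The `Segment` structure and speaker labels below are assumptions for illustration, not Mistral's actual API schema: diarization systems commonly emit a list of time-stamped segments tagged with anonymous speaker labels, which client code then collapses into readable conversational turns.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str   # diarization label, e.g. "SPEAKER_00" (hypothetical format)
    start: float   # start time in seconds
    end: float     # end time in seconds
    text: str      # transcribed words for this segment

def merge_turns(segments: list[Segment]) -> list[Segment]:
    """Collapse consecutive segments from the same speaker into single turns,
    turning fragmented output into a readable meeting transcript."""
    turns: list[Segment] = []
    for seg in segments:
        if turns and turns[-1].speaker == seg.speaker:
            last = turns[-1]
            turns[-1] = Segment(last.speaker, last.start, seg.end,
                                last.text + " " + seg.text)
        else:
            turns.append(seg)
    return turns

segments = [
    Segment("SPEAKER_00", 0.0, 1.2, "Good morning everyone,"),
    Segment("SPEAKER_00", 1.2, 2.5, "let's get started."),
    Segment("SPEAKER_01", 2.6, 4.0, "Sounds good."),
]
for turn in merge_turns(segments):
    print(f"[{turn.start:.1f}-{turn.end:.1f}] {turn.speaker}: {turn.text}")
```

This post-processing step is where diarization quality matters most: if the model mislabels speakers mid-utterance, turns fragment and the transcript becomes hard to follow.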
Mistral has released Voxtral Realtime with open weights under the Apache 2.0 license. With a compact 4B parameter footprint, the model is optimized for local inference on edge devices, providing a privacy-first solution for sensitive enterprise data. By combining low word error rates (a measure of transcription accuracy) with competitive pricing, Mistral is directly challenging established industry giants across various benchmarks in the speech-to-text landscape.
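For readers unfamiliar with the metric, word error rate is computed as the word-level edit distance (substitutions, insertions, and deletions) between a reference transcript and the model's hypothesis, divided by the number of reference words. A small self-contained sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference length,
    computed via standard Levenshtein dynamic programming over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between first i reference words and first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting i words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting j words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the meeting starts at noon", "the meeting starts at noon"))  # 0.0
print(wer("the meeting starts at noon", "a meeting starts at midnight"))  # 0.4
```

A WER of 0.05 means roughly one word in twenty is wrong, so seemingly small differences between models compound quickly over long recordings.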