Voxtral transcribes at the speed of sound
- Mistral releases Voxtral Transcribe 2, featuring both open-weights and API-based audio models.
- Voxtral-Mini-4B-Realtime-2602 offers high-performance transcription under an Apache-2.0 license via Hugging Face.
- The managed API includes speaker diarization and context biasing, priced at $0.003 per minute.
Mistral AI has launched its next-generation audio transcription suite, Voxtral Transcribe 2, marking a significant step forward in real-time speech processing. Building on its initial Voxtral release in 2025, the company has adopted a dual-track strategy: an open-weights model for local deployment and a managed API for enterprise use. Developers can therefore choose between keeping full data sovereignty on their own hardware and using Mistral's optimized cloud infrastructure for scalability.
The open-weights version, named Voxtral-Mini-4B-Realtime-2602, is particularly noteworthy for developers. Because it is released under the permissive Apache 2.0 license, it can be integrated into custom applications and run on private servers with few restrictions. Early demonstrations highlight its low latency: it transcribes technical terms such as "WebAssembly" and "Django" accurately in near real time, without requiring a persistent cloud connection.
For those who prefer a managed solution, the "voxtral-mini-latest" model on the Mistral API adds speaker diarization, which lets the model distinguish between different speakers in a single recording and attribute each passage to the right person. It also supports context biasing, a technique that helps the model recognize specific technical terms or unusual names when they are supplied as hints during transcription. With pricing set at a competitive $0.18 per hour (equivalent to $0.003 per minute), Mistral is positioning itself as a direct challenger to established speech-to-text incumbents, offering a strong balance of performance and affordability.
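As a quick check on the quoted prices, the per-minute and per-hour figures agree: 60 minutes × $0.003 = $0.18. The Python sketch that follows turns that arithmetic into a simple cost estimator. It assumes, hypothetically, that billing is strictly proportional to audio duration, with no per-request minimum and no rounding up to whole minutes; Mistral's actual billing rules may differ. The function name `estimate_cost_usd` is illustrative only and is not part of any Mistral SDK.

```python
from decimal import Decimal, ROUND_HALF_UP

# Per-minute list price for the managed API, as quoted in the article.
PRICE_PER_MINUTE_USD = Decimal("0.003")


def estimate_cost_usd(duration_seconds: int) -> Decimal:
    """Estimate the API cost of transcribing `duration_seconds` of audio.

    Hypothetical assumption: billing is strictly proportional to audio
    length, with no per-request minimum and no rounding up to whole minutes.
    """
    if duration_seconds < 0:
        raise ValueError("duration must be non-negative")
    minutes = Decimal(duration_seconds) / Decimal(60)
    # Round the result to four decimal places (hundredths of a cent).
    return (minutes * PRICE_PER_MINUTE_USD).quantize(
        Decimal("0.0001"), rounding=ROUND_HALF_UP
    )


# One hour of audio: 60 min * $0.003 = $0.18, matching the hourly price.
hourly = estimate_cost_usd(3600)
print(hourly)  # 0.1800
```

Using `Decimal` rather than `float` keeps the result free of binary floating-point rounding noise, which matters when summing costs across many short recordings.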