Qwen3-TTS Family is Now Open Sourced: Voice Design, Clone, and Generation
- Alibaba's Qwen team open-sources Qwen3-TTS, a multilingual text-to-speech family supporting 10 languages.
- Models enable high-fidelity voice cloning from 3 seconds of reference audio, plus description-based voice design via natural language prompts.
- Lightweight 0.6B and 1.7B parameter versions released under the Apache 2.0 license for local and browser execution.
The Qwen team has officially open-sourced Qwen3-TTS, a family of text-to-speech models designed for high-fidelity voice synthesis and cloning. Unlike traditional speech tools, this release provides state-of-the-art voice cloning from as little as three seconds of reference audio, allowing users to replicate a speaker's vocal characteristics with minimal input. The models are trained on a massive dataset of 5 million hours of audio across 10 languages, and use a dual-track architecture that enables real-time, streaming audio generation.
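The streaming idea can be illustrated with a minimal sketch: audio tokens are decoded to PCM samples in small chunks as they arrive, so playback can begin before the full utterance has been generated. Everything below is hypothetical stand-in code, not Qwen3-TTS's actual API.

```python
def generate_tokens(text):
    """Hypothetical stand-in for the model: yields one audio token per character."""
    for ch in text:
        yield ord(ch) % 256


def decode_chunk(tokens):
    """Hypothetical stand-in for the codec decoder: maps tokens to PCM in [-1, 1]."""
    return [t / 128.0 - 1.0 for t in tokens]


def stream_pcm(text, chunk_size=4):
    """Decode audio incrementally: yield PCM as soon as a chunk of tokens is ready."""
    buf = []
    for tok in generate_tokens(text):
        buf.append(tok)
        if len(buf) == chunk_size:
            yield decode_chunk(buf)  # playback could start here, mid-generation
            buf = []
    if buf:  # flush any trailing partial chunk
        yield decode_chunk(buf)


chunks = list(stream_pcm("hello world"))
```

The point of the chunked loop is latency: a real streaming TTS pipeline hands each decoded chunk to the audio device immediately instead of waiting for the whole waveform.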
A standout feature is description-based control, which lets users "design" voices through text prompts like "gruff voice" or "energetic pirate." This fine-grained manipulation is possible because the framework treats audio as a sequence of discrete tokens that a large language model (LLM) can predict, just as it would predict text. By releasing versions from 0.6B to 1.7B parameters under the Apache 2.0 license, the team makes these capabilities accessible to anyone with basic hardware, or even just a web browser via Hugging Face.
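The "audio as tokens" framing can be sketched in a few lines: a waveform is quantized to the nearest entry in a small codebook, yielding a discrete token sequence an LLM can model, and detokenizing maps indices back to approximate samples. The tiny codebook below is invented for illustration; real neural codecs use large learned codebooks over latent features, not raw samples.

```python
# Made-up 5-entry codebook; real audio codecs learn thousands of entries.
CODEBOOK = [-0.75, -0.25, 0.0, 0.25, 0.75]


def tokenize(samples):
    """Map each sample to the index of its nearest codebook entry."""
    return [min(range(len(CODEBOOK)), key=lambda i: abs(CODEBOOK[i] - s))
            for s in samples]


def detokenize(tokens):
    """Map token indices back to (approximate) sample values."""
    return [CODEBOOK[t] for t in tokens]


wave = [0.8, 0.1, -0.3, -0.9, 0.0]
tokens = tokenize(wave)      # → [4, 2, 1, 0, 2]
approx = detokenize(tokens)  # → [0.75, 0.0, -0.25, -0.75, 0.0]
```

Once speech is a token sequence like `[4, 2, 1, 0, 2]`, the same next-token machinery that powers text LLMs can generate it, which is what makes prompt-based voice control possible.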
Simon Willison highlighted how low the barrier to entry for such powerful tools has become, demonstrating that simple CLI commands can generate high-quality audio. Using the mlx-audio library, a framework optimized for Apple Silicon, developers can now run these models locally. This shift marks a moment where sophisticated voice cloning moves from high-end research labs to the personal computers of students and hobbyists alike, provided they have sufficient video memory (VRAM).
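A local run might look roughly like the following. This is a hedged sketch: the exact module path, flags, and model identifiers should be checked against the mlx-audio README, and `<hugging-face-model-id>` is a placeholder, not a real model name.

```shell
# Requires Apple Silicon; install the library first:
pip install mlx-audio

# Generate speech from the command line (flags per mlx-audio's docs;
# substitute a real Qwen3-TTS model ID from Hugging Face):
python -m mlx_audio.tts.generate \
  --model <hugging-face-model-id> \
  --text "Hello from a locally generated voice."
```

The first invocation downloads the model weights from Hugging Face, so expect it to be slow once and fast thereafter.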