Artificial Analysis Launches New Speech-to-Text Accuracy Benchmark
- Artificial Analysis releases AA-WER v2.0 featuring a proprietary voice agent dataset called AA-AgentTalk.
- Cleaned versions of VoxPopuli and Earnings22 datasets reduce Word Error Rate by up to 5.6%.
- ElevenLabs Scribe v2 tops the leaderboard with 2.3% WER, outperforming Google’s Gemini 3 Pro.
Artificial Analysis has unveiled AA-WER v2.0, a significant overhaul of its Speech-to-Text (STT) benchmarking suite designed to reflect modern AI usage. While traditional benchmarks often rely on formal recordings like parliamentary sessions, the new version introduces AA-AgentTalk, a proprietary dataset specifically modeling interactions with voice agents. This change is crucial as enterprises increasingly deploy AI for customer service, where natural speech patterns and varied accents differ significantly from structured public readings.
The update also focuses on data integrity by releasing cleaned transcriptions for the VoxPopuli and Earnings22 datasets. By manually correcting ground-truth transcripts—the verified text used as a reference to measure model accuracy—the team removed errors that previously penalized models for transcribing audio correctly. This refinement led to a measurable drop in Word Error Rate (WER), the standard metric for transcription mistakes, computed by comparing a model's output against the reference transcript.
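To make the metric concrete: WER counts the minimum number of word-level substitutions, insertions, and deletions needed to turn the model's output into the reference transcript, divided by the number of reference words. A minimal sketch (standard Levenshtein dynamic programming, not Artificial Analysis's actual scoring code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution out of five reference words -> 20% WER
print(wer("the meeting starts at seven", "the meeting start at seven"))  # 0.2
```

This is why cleaning the ground truth matters: an error in the reference transcript registers as a model mistake in this calculation even when the model transcribed the audio correctly.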
Performance results show a competitive landscape where specialized models are challenging general-purpose AI. ElevenLabs Scribe v2 currently leads the benchmark with an overall WER of 2.3%, establishing it as the current state of the art (SOTA) in transcription accuracy. The release also includes an open-source text normalizer, which ignores trivial formatting differences like "7:00pm" versus "7pm," ensuring that evaluations focus strictly on linguistic accuracy rather than stylistic preference.
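The normalizer's job is to map stylistic variants of the same content to a canonical form before scoring, so a formatting choice is not counted as a word error. A toy illustration of the idea (this is not the released normalizer; the rules and function name are invented for the example):

```python
import re

def normalize(text: str) -> str:
    """Toy text normalizer: lowercase, canonicalize on-the-hour times, drop punctuation."""
    t = text.lower()
    # Map "7:00pm" / "7:00 pm" to "7pm" so time formatting doesn't count as an error
    t = re.sub(r"(\d+):00\s*(am|pm)", r"\1\2", t)
    # Strip remaining punctuation and collapse whitespace
    t = re.sub(r"[^\w\s]", "", t)
    return " ".join(t.split())

print(normalize("The call is at 7:00pm."))  # "the call is at 7pm"
print(normalize("the call is at 7pm"))      # "the call is at 7pm"
```

Because both variants normalize to the same string, a model that writes "7:00pm" where the reference says "7pm" incurs no penalty, keeping the metric focused on what was said rather than how it was typeset.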