Honest pros, cons, and verdict on this audio processing tool
Lowest published price point at $0.003/min for batch transcription, roughly one-fifth the cost of ElevenLabs Scribe v2
Starting Price: Free
Free Tier: Yes
Category: Audio Processing
Skill Level: Any
Next-generation speech-to-text models offering state-of-the-art transcription quality, real-time diarization, and ultra-low latency for voice applications. Includes batch transcription and real-time streaming capabilities across 13 languages.
Voxtral Transcribe 2 is a speech-to-text model family from Mistral AI that offers state-of-the-art transcription, speaker diarization, and sub-200ms streaming latency, with pricing starting at $0.003 per minute. It is built for developers, voice-agent builders, contact centers, and media teams that need accurate, low-cost transcription at scale across 13 languages.
The family includes two models: Voxtral Mini Transcribe V2, a batch transcription model achieving approximately 4% word error rate on the FLEURS benchmark, and Voxtral Realtime, a 4B-parameter streaming model released under the Apache 2.0 license on Hugging Face. Realtime uses a novel streaming architecture that transcribes audio as it arrives rather than chunking offline models, allowing latency to be configured down to sub-200ms for voice agents while staying within 1-2% word error rate of offline accuracy. At a 2.4-second delay, Realtime matches the batch model, making it suitable for live subtitling. Both support English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian, and Dutch.
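To put the $0.003 per minute batch rate in perspective, here is a minimal cost sketch. It assumes flat per-minute billing with no volume discounts or minimum charges, which this review does not confirm. The function name `batch_cost_usd` and the sample volumes are hypothetical, and the comparison figure is only implied by the "roughly one-fifth" ratio quoted above, not a published price.

```python
from decimal import Decimal

# Batch rate quoted in this review (USD per audio minute).
BATCH_RATE_PER_MIN = Decimal("0.003")

# Assumption: the review says the batch rate is roughly one-fifth of the
# ElevenLabs Scribe v2 price, so 5x is an implied ratio, not a quoted price.
IMPLIED_COMPARISON_MULTIPLIER = Decimal("5")


def batch_cost_usd(audio_hours, rate_per_min=BATCH_RATE_PER_MIN):
    """Estimate the cost in USD of transcribing `audio_hours` of audio.

    Hypothetical helper: assumes flat per-minute billing.
    """
    minutes = Decimal(str(audio_hours)) * 60
    return (minutes * rate_per_min).quantize(Decimal("0.01"))


if __name__ == "__main__":
    comparison_rate = BATCH_RATE_PER_MIN * IMPLIED_COMPARISON_MULTIPLIER
    for hours in (10, 1_000, 50_000):  # illustrative volumes only
        voxtral = batch_cost_usd(hours)
        implied = batch_cost_usd(hours, comparison_rate)
        print(f"{hours} h: ${voxtral} vs. roughly ${implied} at 5x the rate")
```

Under these assumptions, 1,000 hours of audio (60,000 minutes) comes to $180.00 at the quoted rate, against roughly $900.00 at five times that rate.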
Advanced speech-to-text and text-to-speech API with industry-leading accuracy, real-time streaming, and support for 30+ languages. Built for developers creating voice applications, call transcription, and conversational AI.
Starting price: Free
Production-grade speech-to-text API with Universal-3 Pro model, real-time streaming, and audio intelligence features for voice AI applications.
Starting price: Free
Voxtral Transcribe 2 delivers on its promises as an audio processing tool. While it has some limitations, the benefits outweigh the drawbacks for most users in its target market.
Yes, Voxtral Transcribe 2 is good for audio processing work. Users particularly appreciate the lowest published price point, $0.003/min for batch transcription, which is roughly one-fifth the cost of ElevenLabs Scribe v2. However, keep in mind that context biasing is optimized for English; support for other languages is labeled experimental.
Yes, Voxtral Transcribe 2 offers a free tier. However, premium features unlock additional functionality for professional users.
Voxtral Transcribe 2 is best for meeting intelligence platforms that transcribe multilingual recordings at high volume and need speaker diarization for who-said-what attribution, and for voice agents and virtual assistants that require sub-200ms transcription latency in a pipeline with an LLM and TTS for natural conversation. It's particularly useful for audio processing professionals who need speaker diarization with start/end timestamps.
Popular Voxtral Transcribe 2 alternatives include Deepgram and AssemblyAI. Each has different strengths, so compare features and pricing to find the best fit.
Last verified March 2026