Next-generation speech-to-text models offering state-of-the-art transcription quality, real-time diarization, and ultra-low latency for voice applications. Includes batch transcription and real-time streaming capabilities across 13 languages.
Voxtral Transcribe 2 is a speech-to-text model family from Mistral AI that delivers state-of-the-art transcription, speaker diarization, and sub-200ms streaming latency, with pricing starting at $0.003 per minute. It is built for developers, voice-agent builders, contact centers, and media teams that need accurate, low-cost transcription at scale across 13 languages.
The family includes two models: Voxtral Mini Transcribe V2, a batch transcription model achieving approximately 4% word error rate on the FLEURS benchmark, and Voxtral Realtime, a 4B-parameter streaming model released under the Apache 2.0 license on Hugging Face. Realtime uses a novel streaming architecture that transcribes audio as it arrives rather than chunking offline models, allowing latency to be configured down to sub-200ms for voice agents while staying within 1-2% word error rate of offline accuracy. At a 2.4-second delay, Realtime matches the batch model, making it suitable for live subtitling. Both support English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian, and Dutch.
Key capabilities include speaker diarization with start/end timestamps, context biasing of up to 100 words or phrases for proper nouns and domain vocabulary, word-level timestamps, noise robustness for factory floors and call centers, and support for recordings up to 3 hours per request. Based on our analysis of 870+ AI tools, Voxtral Mini Transcribe V2 offers one of the most aggressive price-performance ratios in the speech-to-text category: Mistral claims it outperforms GPT-4o mini Transcribe, Gemini 2.5 Flash, AssemblyAI Universal, and Deepgram Nova on accuracy while processing audio approximately 3x faster than ElevenLabs Scribe v2 at one-fifth the cost. Compared to the other Audio Processing tools in our directory, Voxtral is differentiated by combining open-weights deployment (Realtime under Apache 2.0), enterprise-grade compliance (GDPR and HIPAA via on-premise or private cloud), and one of the lowest published API rates at $0.003/min for batch and $0.006/min for realtime.
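To make the published rates concrete, here is a minimal cost-estimate sketch. It uses only the two per-minute prices quoted above. The function name, the `mode` labels, and the assumption that billing is strictly proportional to audio minutes (with no minimums or rounding per request) are illustrative and not taken from Mistral's documentation.

```python
from decimal import Decimal

# Per-minute API rates quoted in this listing (USD).
# Assumption: cost scales linearly with audio minutes; verify against current pricing.
RATES_PER_MINUTE = {
    "batch": Decimal("0.003"),     # Voxtral Mini Transcribe V2
    "realtime": Decimal("0.006"),  # Voxtral Realtime via the API
}


def estimate_cost(minutes: float, mode: str = "batch") -> Decimal:
    """Return the estimated USD cost of transcribing `minutes` of audio."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    if mode not in RATES_PER_MINUTE:
        raise ValueError(f"unknown mode: {mode!r}")
    cost = Decimal(str(minutes)) * RATES_PER_MINUTE[mode]
    return cost.quantize(Decimal("0.01"))  # round to whole cents


if __name__ == "__main__":
    # 1,000 hours of recordings = 60,000 minutes
    print(estimate_cost(60_000, "batch"))     # 180.00
    print(estimate_cost(60_000, "realtime"))  # 360.00
```

At these rates, 1,000 hours of audio works out to about $180 in batch mode and $360 in realtime mode.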
Voxtral Realtime is built on a novel streaming architecture that transcribes audio as it arrives rather than batching it into chunks. Latency is configurable down to sub-200ms while staying within 1-2% WER of the offline model, making it suitable for production voice agents where conversational responsiveness is critical.
Voxtral Mini Transcribe V2 generates transcripts annotated with speaker labels and precise start/end times for each turn. This is engineered for meeting transcription, interview analysis, and multi-party call processing, with diarization error rate benchmarked across Switchboard, CallHome, AMI-IHM, AMI-SDM, SBCSAE, and TalkBank multilingual datasets.
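As a rough illustration of how speaker-labelled turns with start/end times might be consumed, the sketch below merges consecutive segments from the same speaker and renders a readable transcript. The `Segment` shape and its field names are hypothetical. The actual response format of the API is not described in this listing and may differ.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    # Hypothetical shape for one diarized segment; real field names may differ.
    speaker: str
    start: float  # seconds from the beginning of the recording
    end: float    # seconds from the beginning of the recording
    text: str


def fmt_ts(seconds: float) -> str:
    """Format a time offset in seconds as HH:MM:SS (fractions truncated)."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def merge_turns(segments):
    """Merge consecutive segments spoken by the same speaker into one turn."""
    merged = []
    for seg in sorted(segments, key=lambda s: s.start):
        if merged and merged[-1].speaker == seg.speaker:
            last = merged[-1]
            merged[-1] = Segment(
                last.speaker, last.start, max(last.end, seg.end),
                f"{last.text} {seg.text}",
            )
        else:
            merged.append(seg)
    return merged


def render(segments) -> str:
    """Render merged turns as one line per speaker turn."""
    return "\n".join(
        f"[{fmt_ts(t.start)} - {fmt_ts(t.end)}] {t.speaker}: {t.text}"
        for t in merge_turns(segments)
    )


if __name__ == "__main__":
    demo = [
        Segment("Speaker 1", 0.0, 2.5, "Hi."),
        Segment("Speaker 1", 2.5, 4.0, "Thanks for joining."),
        Segment("Speaker 2", 4.2, 6.0, "Happy to be here."),
    ]
    print(render(demo))
```

Merging adjacent same-speaker segments keeps meeting and interview transcripts readable without discarding the underlying timestamps.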
Developers can supply up to 100 custom words or phrases to steer the model toward correct spellings of proper nouns, technical terms, or industry jargon. This is especially valuable for medical, legal, and technical transcription where standard models routinely miss domain terms. Optimized for English; experimental in other languages.
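Because the biasing list is capped at 100 entries, it can help to clean and validate terms before sending them. The following is a minimal sketch of that pre-processing step. The helper name and the choice to raise an error rather than truncate are assumptions, and the listing does not say how the terms are actually passed to the API.

```python
MAX_BIAS_TERMS = 100  # cap on custom words or phrases stated in this listing


def prepare_bias_terms(terms):
    """Normalize whitespace, drop empty and duplicate terms, and enforce the cap.

    Duplicates are detected case-insensitively; the first spelling seen is kept.
    """
    seen = set()
    cleaned = []
    for term in terms:
        normalized = " ".join(term.split())  # collapse runs of whitespace
        key = normalized.casefold()
        if normalized and key not in seen:
            seen.add(key)
            cleaned.append(normalized)
    if len(cleaned) > MAX_BIAS_TERMS:
        raise ValueError(
            f"{len(cleaned)} terms supplied; at most {MAX_BIAS_TERMS} are allowed"
        )
    return cleaned


if __name__ == "__main__":
    print(prepare_bias_terms(["  Voxtral ", "voxtral", "", "atrial  fibrillation"]))
    # ['Voxtral', 'atrial fibrillation']
```

Failing loudly on an oversized list avoids silently dropping the domain terms that matter most.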
Voxtral Realtime is released as open weights on the Hugging Face Hub under the permissive Apache 2.0 license. With its 4B-parameter footprint, it can run on edge devices, allowing fully private, on-device transcription for sensitive deployments without any audio leaving the user's environment.
Voxtral Mini Transcribe V2 accepts single requests up to 3 hours long, eliminating most chunking overhead for podcasts, depositions, or full meetings. It also maintains accuracy in challenging acoustic environments such as factory floors, busy call centers, and field recordings.
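For recordings that exceed the 3-hour per-request limit, some client-side splitting is still needed. The sketch below plans the fewest possible requests for a given duration. The function name is hypothetical, and splitting at fixed offsets is a simplification: a real pipeline would likely cut at silences so that words are not split across requests.

```python
MAX_REQUEST_SECONDS = 3 * 60 * 60  # 3-hour per-request limit stated in this listing


def plan_requests(duration_s: float, max_s: float = MAX_REQUEST_SECONDS):
    """Return (start, end) offsets in seconds that cover the whole recording."""
    if max_s <= 0:
        raise ValueError("max_s must be positive")
    spans = []
    start = 0.0
    while start < duration_s:
        end = min(start + max_s, duration_s)
        spans.append((start, end))
        start = end
    return spans


if __name__ == "__main__":
    # A 2-hour meeting fits in a single request.
    print(plan_requests(2 * 3600))
    # A 7.5-hour deposition needs three requests.
    print(plan_requests(7.5 * 3600))
```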
Free
Voxtral Mini Transcribe V2 (batch API): $0.003/min
Voxtral Realtime (API): $0.006/min
Voxtral Realtime (open weights): Free (Apache 2.0)
Contact Sales
Mistral released Voxtral Transcribe 2 in 2026, introducing two new models: Voxtral Mini Transcribe V2 for batch transcription and Voxtral Realtime for live streaming. Updates include a novel streaming architecture with sub-200ms configurable latency, expanded language support to 13 languages, support for recordings up to 3 hours, context biasing of up to 100 custom terms, an audio playground inside Mistral Studio, and the open-weights release of Voxtral Realtime on Hugging Face under Apache 2.0.