Stay free if you only need to test Voxtral Transcribe 2 directly in-browser and upload up to 10 audio files. Upgrade if you need the full model weights on Hugging Face Hub and the 4B-parameter footprint that runs on edge devices. Most solo builders can start free.
Why it matters: Context biasing is optimized for English; support for other languages is labeled experimental.
Available from: Voxtral Mini Transcribe V2 (API)
Why it matters: With overlapping speech, the model typically transcribes only one speaker rather than separating concurrent voices.
Available from: Voxtral Mini Transcribe V2 (API)
Why it matters: Only 13 languages are supported, far fewer than Whisper's 99+, so niche language coverage trails competitors like Deepgram.
Available from: Voxtral Mini Transcribe V2 (API)
Why it matters: Voxtral Realtime is open-weights, but Mini Transcribe V2 is API-only, limiting self-hosted batch workflows.
Available from: Voxtral Mini Transcribe V2 (API)
Why it matters: Documentation and tooling are newer than incumbents like AssemblyAI or Deepgram, so ecosystem integrations are still maturing.
Available from: Voxtral Mini Transcribe V2 (API)
Why it matters: Direct support can save hours of troubleshooting when you get stuck on critical projects.
Available from: Voxtral Mini Transcribe V2 (API)
Voxtral Mini Transcribe V2 costs $0.003 per minute via API for batch transcription, and Voxtral Realtime costs $0.006 per minute for streaming. Mistral positions this as the lowest price point in the category, roughly one-fifth the cost of ElevenLabs Scribe v2 at comparable quality. Voxtral Realtime is also available as free open weights under the Apache 2.0 license on Hugging Face, so self-hosters only pay infrastructure costs. There is also a free audio playground in Mistral Studio for testing.
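At these per-minute rates, monthly spend is easy to estimate. A quick sketch (the prices come from the figures above; the helper function is our own):

```python
# Estimate monthly transcription spend at the published per-minute rates.
BATCH_PER_MIN = 0.003     # Voxtral Mini Transcribe V2 (batch API), $/min
REALTIME_PER_MIN = 0.006  # Voxtral Realtime (streaming API), $/min

def monthly_cost(hours_per_month: float, rate_per_min: float) -> float:
    """Dollar cost for a given number of audio hours at a per-minute rate."""
    return hours_per_month * 60 * rate_per_min

# 100 hours of batch audio per month:
print(round(monthly_cost(100, BATCH_PER_MIN), 2))     # 18.0
# The same 100 hours streamed in real time:
print(round(monthly_cost(100, REALTIME_PER_MIN), 2))  # 36.0
```

Even at realtime rates, 100 hours of audio per month stays well under $50, which is what makes the "lowest price point in the category" claim plausible to sanity-check against your own volume.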
Both Voxtral Mini Transcribe V2 and Voxtral Realtime natively support 13 languages: English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian, and Dutch. According to Mistral's FLEURS benchmark results, non-English performance significantly outpaces competitors. Note that the context biasing feature is optimized primarily for English, with support for other languages still considered experimental.
Voxtral Realtime uses a novel streaming architecture that transcribes audio as it arrives, rather than adapting offline models by processing audio in chunks. Latency is configurable: at sub-200ms it powers responsive voice agents while staying within 1-2% word error rate of the batch model, and at a 2.4-second delay it fully matches Voxtral Mini Transcribe V2's accuracy, making it ideal for live subtitling. The 4B-parameter footprint means it can also run on edge devices for privacy-sensitive deployments.
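The latency/accuracy tradeoff above maps cleanly onto use cases. A sketch of that decision (the 200ms and 2.4s figures come from the text; the function itself is a hypothetical helper, not part of any Voxtral SDK):

```python
# Choose a target transcription delay for Voxtral Realtime by use case.
# Figures are from Mistral's published latency settings; the mapping is ours.
def pick_delay(use_case: str) -> float:
    """Return a target transcription delay in seconds."""
    if use_case == "voice_agent":
        # Sub-200 ms keeps a voice agent responsive, at a ~1-2% WER
        # penalty versus the batch model.
        return 0.2
    if use_case == "live_subtitles":
        # A 2.4 s delay matches Voxtral Mini Transcribe V2's accuracy,
        # and subtitle viewers tolerate a couple seconds of lag.
        return 2.4
    raise ValueError(f"unknown use case: {use_case!r}")

print(pick_delay("voice_agent"))  # 0.2
```

The general rule: interactive applications pay a small accuracy cost for immediacy, while anything a human reads after the fact should take the longer delay for free accuracy.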
Yes. Mistral states that both models support GDPR and HIPAA-compliant deployments through secure on-premise or private cloud setups. The open-weights release of Voxtral Realtime under Apache 2.0 is particularly relevant here because it allows organizations to run transcription entirely within their own infrastructure, with no audio leaving their environment. This makes it well-suited for healthcare, legal, financial services, and other regulated industries.
Per Mistral's published benchmarks, Voxtral Mini Transcribe V2 outperforms GPT-4o mini Transcribe, Gemini 2.5 Flash, AssemblyAI Universal, and Deepgram Nova on word error rate across FLEURS, while costing $0.003/min, significantly less than incumbents. It also processes audio approximately 3x faster than ElevenLabs Scribe v2 at one-fifth the cost. Compared to OpenAI Whisper (open source), Voxtral covers fewer languages (13 vs 99+) but offers higher accuracy in supported languages plus a hosted API with diarization and streaming built in.
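These comparisons all hinge on word error rate (WER): the word-level edit distance between a hypothesis transcript and a reference, divided by the reference word count. If you want to spot-check any of these engines on your own audio rather than trust vendor benchmarks, a minimal self-contained implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1] / len(ref)

print(wer("the quick brown fox", "the quick brown dog"))  # 0.25
```

Normalize casing and punctuation the same way for every engine before comparing, since scoring conventions can swing WER more than model quality does.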
Start with the free plan; upgrade when you need more.
Get Started Free → Still not sure? Read our full verdict →
Last verified March 2026