Voxtral Transcribe 2 Review 2026

Name: Voxtral Transcribe 2
Brand: Voxtral Transcribe 2
Availability: InStock

Honest pros, cons, and verdict on this speech-to-text tool

✅ Lowest published price point at $0.003/min for batch transcription, roughly one-fifth the cost of ElevenLabs Scribe v2

Starting Price

Free

Free Tier

Yes

What is Voxtral Transcribe 2?

Next-generation speech-to-text models offering state-of-the-art transcription quality, real-time diarization, and ultra-low latency for voice applications. Includes batch transcription and real-time streaming capabilities across 13 languages.

Voxtral Transcribe 2 is a speech-to-text model family from Mistral AI that delivers state-of-the-art transcription, speaker diarization, and sub-200ms streaming latency, with pricing starting at $0.003 per minute. It is built for developers, voice-agent builders, contact centers, and media teams that need accurate, low-cost transcription at scale across 13 languages.

The family includes two models. Voxtral Mini Transcribe V2 is a batch transcription model that achieves approximately 4% word error rate (WER) on the FLEURS benchmark. Voxtral Realtime is a 4B-parameter streaming model released under the Apache 2.0 license on Hugging Face. Rather than running an offline model over successive chunks of audio, Realtime uses a streaming architecture that transcribes audio as it arrives, so latency can be configured down to sub-200ms for voice agents while WER stays within 1-2% of offline accuracy. At a 2.4-second delay, Realtime matches the batch model's accuracy, making it suitable for live subtitling. Both models support English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian, and Dutch.
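The accuracy figures above are word error rates: the substitutions, deletions, and insertions needed to turn a transcript into the reference, divided by the number of reference words, so 4% WER is roughly 4 errors per 100 words. Below is a minimal sketch of the standard word-level edit-distance calculation. It only lowercases and splits on whitespace; published benchmark scores such as FLEURS usually apply fuller text normalization first, so numbers from this sketch are not directly comparable to Mistral's.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev is the previous row of the DP table: prev[j] is the edit distance
    # between the reference words processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: a reference word is missing
                curr[j - 1] + 1,     # insertion: an extra hypothesis word
                prev[j - 1] + cost,  # substitution, or a match when cost == 0
            )
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the quick brown fox jumps", "the quick brown box jumps"))  # 0.2
```

One substituted word out of five reference words gives a WER of 0.2 (20%).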

Key Features

✓Speaker diarization with start/end timestamps

✓Sub-200ms configurable streaming latency

✓Context biasing with up to 100 custom words/phrases

✓Word-level timestamps

✓13-language multilingual support

✓Audio playground in Mistral Studio
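Diarized output pairs each transcript segment with a speaker label and start/end timestamps. The sketch below shows one way such segments might be turned into who-said-what turns and per-speaker talk time. The Segment class and its fields are hypothetical and are not the API's actual response schema; map the real response fields onto it.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Segment:
    # Hypothetical shape for one diarized segment (not the official schema).
    speaker: str
    start: float  # seconds
    end: float    # seconds
    text: str


def talk_time(segments: list[Segment]) -> dict[str, float]:
    """Total seconds attributed to each speaker label."""
    totals: dict[str, float] = defaultdict(float)
    for seg in segments:
        totals[seg.speaker] += seg.end - seg.start
    return dict(totals)


def merge_turns(segments: list[Segment]) -> list[Segment]:
    """Merge consecutive segments from the same speaker into a single turn."""
    turns: list[Segment] = []
    for seg in sorted(segments, key=lambda s: s.start):
        if turns and turns[-1].speaker == seg.speaker:
            last = turns[-1]
            turns[-1] = Segment(last.speaker, last.start, seg.end, f"{last.text} {seg.text}")
        else:
            turns.append(Segment(seg.speaker, seg.start, seg.end, seg.text))
    return turns


segments = [
    Segment("A", 0.0, 2.0, "Hi"),
    Segment("A", 2.0, 3.5, "there."),
    Segment("B", 3.5, 6.0, "Hello."),
]
print(talk_time(segments))  # {'A': 3.5, 'B': 2.5}
for turn in merge_turns(segments):
    print(f"[{turn.start:.1f}-{turn.end:.1f}] {turn.speaker}: {turn.text}")
```

New Segment objects are built when merging, so the input list is left unchanged.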

Pricing Breakdown

Mistral Studio Audio Playground

Free

✓Test Voxtral Transcribe 2 directly in-browser
✓Upload up to 10 audio files
✓Toggle diarization and timestamp granularity
✓Add context bias terms
✓Supports .mp3, .wav, .m4a, .flac, .ogg up to 1GB each

Voxtral Mini Transcribe V2 (API)

$0.003/min

✓Batch transcription via API
✓Speaker diarization with timestamps
✓Context biasing (up to 100 terms)
✓Word-level timestamps
✓Support for recordings up to 3 hours

Voxtral Realtime (API)

$0.006/min

✓Real-time streaming transcription
✓Configurable latency down to sub-200ms
✓13 languages supported
✓Purpose-built for voice agents and live applications
✓Matches batch accuracy at 2.4s delay
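Because both API tiers are priced per minute of audio, cost scales linearly with volume. Below is a minimal estimator using the list prices above. The tier keys are illustrative labels, not official API model identifiers, and the sketch ignores taxes and any volume discounts or price changes after this review.

```python
# Per-minute list prices in USD, as given in this review's pricing breakdown.
# The keys are illustrative labels, not official API model identifiers.
PRICE_PER_MIN_USD = {
    "mini-transcribe-v2": 0.003,  # batch
    "realtime": 0.006,            # streaming
}


def estimate_cost(audio_minutes: float, tier: str) -> float:
    """Estimated USD cost for a volume of audio, rounded to cents."""
    if audio_minutes < 0:
        raise ValueError("audio_minutes must be non-negative")
    # An unknown tier raises KeyError from the dictionary lookup.
    return round(audio_minutes * PRICE_PER_MIN_USD[tier], 2)


# 1,000 hours of audio = 60,000 minutes.
print(estimate_cost(60_000, "mini-transcribe-v2"))  # 180.0
print(estimate_cost(60_000, "realtime"))            # 360.0
```

At these prices, 1,000 hours of audio comes to about $180 on the batch tier and $360 on Realtime.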

Pros & Cons

✅Pros

•Lowest published price point at $0.003/min for batch transcription, roughly one-fifth the cost of ElevenLabs Scribe v2
•Sub-200ms streaming latency makes it viable for real-time voice agents, with only 1-2% WER degradation versus offline mode
•Voxtral Realtime ships as open weights under Apache 2.0, enabling private on-device deployment for sensitive workloads
•Approximately 4% word error rate on FLEURS benchmark, beating GPT-4o mini Transcribe, Gemini 2.5 Flash, AssemblyAI Universal, and Deepgram Nova per Mistral's published comparisons
•Native multilingual support across 13 languages with strong non-English performance, not just English-first adaptation
•Long-form support up to 3 hours per request reduces chunking overhead for meetings and podcasts
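Recordings within the 3-hour limit can go in a single request, but longer ones still need splitting. Below is a sketch of planning request windows with a short overlap so a word at a boundary is not cut in half. The function name and the 5-second default overlap are assumptions, and text in the overlapping regions has to be de-duplicated when the transcripts are stitched back together.

```python
MAX_REQUEST_SECONDS = 3 * 60 * 60  # 3-hour per-request limit cited in this review


def plan_chunks(
    duration_s: float,
    max_s: float = MAX_REQUEST_SECONDS,
    overlap_s: float = 5.0,  # assumed overlap, not a documented value
) -> list[tuple[float, float]]:
    """Return (start, end) windows in seconds, each no longer than max_s."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    if not 0 <= overlap_s < max_s:
        raise ValueError("overlap_s must be in [0, max_s)")

    step = max_s - overlap_s  # how far each window advances
    chunks: list[tuple[float, float]] = []
    start = 0.0
    while True:
        end = min(start + max_s, duration_s)
        chunks.append((start, end))
        if end >= duration_s:
            break
        start += step
    return chunks


print(plan_chunks(4 * 3600))  # [(0.0, 10800.0), (10795.0, 14400)]
```

A recording of 3 hours or less yields a single window, so no chunking happens at all.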

❌Cons

•Context biasing is optimized for English; support for other languages is labeled experimental
•With overlapping speech, the model typically transcribes only one speaker rather than separating concurrent voices
•Only 13 languages supported, fewer than competitors like Whisper (99+) or Deepgram for niche language coverage
•Realtime model is open-weights but Mini Transcribe V2 is API-only, limiting self-hosted batch workflows
•Documentation and tooling are newer than incumbents like AssemblyAI or Deepgram, so ecosystem integrations are still maturing
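Given the 100-term cap and the English-only optimization of context biasing, it can help to clean a bias list before sending it. The helper below is hypothetical. It trims whitespace, drops case-insensitive duplicates (keeping the first spelling), rejects lists over 100 terms, and warns when the language is not English. Pass the result to whatever context-bias field the API exposes; the exact parameter name is not covered in this review.

```python
import warnings

MAX_BIAS_TERMS = 100  # limit cited in this review


def prepare_context_bias(terms: list[str], language: str = "en") -> list[str]:
    """Normalize a context-bias list and enforce the 100-term cap."""
    seen: set[str] = set()
    cleaned: list[str] = []
    for term in terms:
        t = " ".join(term.split())  # trim and collapse internal whitespace
        key = t.casefold()          # case-insensitive duplicate check
        if t and key not in seen:
            seen.add(key)
            cleaned.append(t)

    if len(cleaned) > MAX_BIAS_TERMS:
        # Raise instead of truncating so no terms are silently dropped.
        raise ValueError(f"{len(cleaned)} bias terms; the limit is {MAX_BIAS_TERMS}")
    if language.lower() != "en":
        warnings.warn("context biasing outside English is labeled experimental", stacklevel=2)
    return cleaned


print(prepare_context_bias(["  Voxtral ", "voxtral", "Mistral   Studio"]))  # ['Voxtral', 'Mistral Studio']
```

The case-insensitive check will merge terms that differ only in capitalization; switch to an exact-match key if that distinction matters for your vocabulary.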

Who Should Use Voxtral Transcribe 2?

✓Meeting intelligence platforms transcribing multilingual recordings with speaker diarization for who-said-what attribution at high volume
✓Voice agents and virtual assistants requiring sub-200ms transcription latency in a pipeline with an LLM and TTS for natural conversation
✓Contact center automation that transcribes calls in real time so AI systems can analyze sentiment, suggest responses, and populate CRM fields mid-conversation
✓Live multilingual subtitle generation for media and broadcast workflows, using context biasing to handle proper nouns and technical terminology
✓Compliance and audit documentation in regulated industries (healthcare, finance, legal), where on-premise deployment of the open-weights Voxtral Realtime model can support HIPAA/GDPR requirements and word-level timestamps enable precise audit trails
✓Edge or on-device transcription for privacy-first applications using the open-weights Voxtral Realtime model on a 4B-parameter footprint
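For subtitle generation, word-level timestamps can be grouped into cues. Below is a sketch that emits SubRip (SRT) text, starting a new cue after a fixed number of words or at a pause. The Word class, the 7-word limit, and the 0.8-second gap threshold are assumptions for illustration, not values from Mistral.

```python
from dataclasses import dataclass


@dataclass
class Word:
    # Hypothetical shape for one word-level timestamp (not the official schema).
    text: str
    start: float  # seconds
    end: float    # seconds


def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def words_to_srt(words: list[Word], max_words: int = 7, max_gap: float = 0.8) -> str:
    """Group words into cues, breaking on word count or a silence gap."""
    cues: list[list[Word]] = []
    for w in words:
        if cues and len(cues[-1]) < max_words and w.start - cues[-1][-1].end <= max_gap:
            cues[-1].append(w)
        else:
            cues.append([w])

    blocks = []
    for i, cue in enumerate(cues, start=1):
        text = " ".join(w.text for w in cue)
        blocks.append(f"{i}\n{srt_time(cue[0].start)} --> {srt_time(cue[-1].end)}\n{text}\n")
    return "\n".join(blocks)  # a blank line separates consecutive cues


words = [
    Word("Hello", 0.0, 0.4), Word("world.", 0.5, 0.9),
    Word("Next", 3.0, 3.3), Word("line.", 3.4, 3.8),
]
print(words_to_srt(words))
```

Breaking at pauses tends to keep cues aligned with natural phrases; the word cap keeps each cue short enough to read on screen.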

Who Should Skip Voxtral Transcribe 2?

×You rely on context biasing in languages other than English, where support is still labeled experimental
×Your audio has frequent overlapping speech, since the model typically transcribes only one speaker when voices overlap
×You need languages beyond the 13 supported; Whisper (99+) and Deepgram offer broader coverage for niche languages

Alternatives to Consider

Deepgram

Speech-to-text, text-to-speech and voice agent APIs with industry-leading latency, accuracy and per-language model quality.

Starting price: Free

AssemblyAI

Developer speech AI API platform for transcription, real-time speech-to-text, speech understanding, guardrails, and voice agents.

Starting price: Free

Our Verdict

✅

Voxtral Transcribe 2 is a solid choice

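Voxtral Transcribe 2 delivers on its promises as a speech-to-text tool. Its limitations are real (context biasing is experimental outside English, overlapping speech is not separated, and only 13 languages are supported), but for cost- and latency-sensitive transcription in those languages, the benefits outweigh the drawbacks.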

Frequently Asked Questions

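What is Voxtral Transcribe 2?

Voxtral Transcribe 2 is Mistral AI's speech-to-text model family. It comprises Voxtral Mini Transcribe V2 for batch transcription and Voxtral Realtime, an open-weights 4B-parameter streaming model, and offers speaker diarization, word-level timestamps, context biasing, and support for 13 languages.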

Is Voxtral Transcribe 2 good?

Yes, for speech-to-text workloads. Its standout strength is price: $0.003/min for batch transcription, roughly one-fifth the cost of ElevenLabs Scribe v2. Keep in mind, however, that context biasing is optimized for English; support for other languages is labeled experimental.

Is Voxtral Transcribe 2 free?

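Partly. The Mistral Studio audio playground is free for in-browser testing, and the Voxtral Realtime weights are available under Apache 2.0 for self-hosting. API usage is paid per minute: $0.003/min for Voxtral Mini Transcribe V2 and $0.006/min for Voxtral Realtime.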

Who should use Voxtral Transcribe 2?

Voxtral Transcribe 2 is best suited to meeting intelligence platforms that need multilingual transcription with speaker diarization for who-said-what attribution at high volume, and to voice agents and virtual assistants that need sub-200ms transcription latency in a pipeline with an LLM and TTS. Contact centers, live subtitling, regulated-industry documentation, and privacy-first on-device deployments are also a good fit.

What are the best Voxtral Transcribe 2 alternatives?

Popular Voxtral Transcribe 2 alternatives include Deepgram and AssemblyAI. Each has different strengths, so compare features and pricing to find the best fit.

Last verified March 2026
