Voxtral Transcribe 2

Next-generation speech-to-text models offering state-of-the-art transcription quality, real-time diarization, and ultra-low latency for voice applications. Includes batch transcription and real-time streaming capabilities across 13 languages.

Starting at: Free

Overview

Voxtral Transcribe 2 is a family of speech-to-text models from Mistral AI that offers state-of-the-art transcription, speaker diarization, and streaming latency configurable below 200ms, with pricing starting at $0.003 per minute. It's built for developers, voice-agent builders, contact centers, and media teams that need accurate, low-cost transcription at scale across 13 languages.

The family includes two models: Voxtral Mini Transcribe V2, a batch transcription model achieving approximately 4% word error rate (WER) on the FLEURS benchmark, and Voxtral Realtime, a 4B-parameter streaming model released under the Apache 2.0 license on Hugging Face. Realtime uses a novel streaming architecture that transcribes audio as it arrives rather than running an offline model over successive chunks, so latency can be configured down to sub-200ms for voice agents while staying within 1-2% WER of the offline model. At a 2.4-second delay, Realtime matches the batch model's accuracy, making it suitable for live subtitling. Both models support English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian, and Dutch.
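Word error rate (WER), the metric behind the accuracy figures above, is the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The sketch below is a minimal illustration only: the function name and the simple lowercase, whitespace-based tokenization are assumptions, and published benchmark results usually apply text normalization that is omitted here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (one fewer than the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word inserted in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

By this definition, a WER of about 4% means roughly four word-level errors (substitutions, insertions, or deletions) per 100 reference words.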

Key capabilities include speaker diarization with start/end timestamps, context biasing of up to 100 words or phrases for proper nouns and domain vocabulary, word-level timestamps, noise robustness for factory floors and call centers, and support for recordings up to 3 hours per request.

Based on our analysis of 870+ AI tools, Voxtral Mini Transcribe V2 offers one of the most aggressive price-performance ratios in the speech-to-text category. Mistral claims it outperforms GPT-4o mini Transcribe, Gemini 2.5 Flash, AssemblyAI Universal, and Deepgram Nova on accuracy while processing audio approximately 3x faster than ElevenLabs Scribe v2 at one-fifth the cost. Compared to the other Audio Processing tools in our directory, Voxtral stands out for combining open-weights deployment (Realtime under Apache 2.0), enterprise-grade compliance (GDPR and HIPAA via on-premise or private cloud), and one of the lowest published API rates: $0.003/min for batch and $0.006/min for realtime.

Key Features

Sub-200ms Streaming Architecture

Voxtral Realtime is built on a novel streaming architecture that transcribes audio as it arrives rather than batching it into chunks. Latency is configurable down to sub-200ms while staying within 1-2% WER of the offline model, making it suitable for production voice agents where conversational responsiveness is critical.

Speaker Diarization with Timestamps

Voxtral Mini Transcribe V2 generates transcripts annotated with speaker labels and precise start/end times for each turn. This is engineered for meeting transcription, interview analysis, and multi-party call processing, with diarization error rate benchmarked across Switchboard, CallHome, AMI-IHM, AMI-SDM, SBCSAE, and TalkBank multilingual datasets.
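To illustrate how speaker-labeled, timestamped output can be turned into a who-said-what transcript, here is a small Python sketch. The `Segment` shape (speaker, start, end, text) and all function names are hypothetical and chosen for illustration; they are not Mistral's response schema, so adapt the field names to whatever the API actually returns.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: str  # hypothetical speaker label, e.g. "speaker_1"
    start: float  # start time in seconds
    end: float    # end time in seconds
    text: str


def fmt_time(seconds: float) -> str:
    """Format seconds as HH:MM:SS, truncating fractions of a second."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def merge_turns(segments):
    """Merge consecutive segments from the same speaker into one turn."""
    turns = []
    for seg in sorted(segments, key=lambda s: s.start):
        if turns and turns[-1].speaker == seg.speaker:
            last = turns[-1]
            turns[-1] = Segment(last.speaker, last.start,
                                max(last.end, seg.end),
                                f"{last.text} {seg.text}")
        else:
            turns.append(Segment(seg.speaker, seg.start, seg.end, seg.text))
    return turns


def render(segments) -> str:
    """Render one line per speaker turn with its time range."""
    return "\n".join(
        f"[{fmt_time(t.start)}-{fmt_time(t.end)}] {t.speaker}: {t.text}"
        for t in merge_turns(segments)
    )
```

Merging adjacent segments per speaker keeps meeting notes readable, while the original per-segment timestamps remain available for audit use.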

Context Biasing for Domain Vocabulary

Developers can supply up to 100 custom words or phrases to steer the model toward correct spellings of proper nouns, technical terms, or industry jargon. This is especially valuable for medical, legal, and technical transcription where standard models routinely miss domain terms. Optimized for English; experimental in other languages.
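Because the bias list is capped at 100 entries, it can help to clean and validate it before sending a request. The Python sketch below shows one way to do that; the helper name and the de-duplication rule are assumptions, and how the list is actually passed to the API is deliberately not shown.

```python
MAX_BIAS_TERMS = 100  # limit stated in this review


def prepare_bias_terms(terms):
    """Trim whitespace, drop blanks and case-insensitive duplicates, enforce the cap.

    The first spelling of each term is kept, since spelling is the point
    of context biasing.
    """
    seen = set()
    cleaned = []
    for term in terms:
        normalized = " ".join(term.split())  # collapse internal whitespace
        key = normalized.casefold()
        if not normalized or key in seen:
            continue
        seen.add(key)
        cleaned.append(normalized)
    if len(cleaned) > MAX_BIAS_TERMS:
        raise ValueError(
            f"{len(cleaned)} bias terms supplied; the limit is {MAX_BIAS_TERMS}"
        )
    return cleaned
```

Raising an error instead of silently truncating makes it obvious when a domain glossary needs to be prioritized by hand.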

Open Weights Under Apache 2.0

Voxtral Realtime is released as open weights on the Hugging Face Hub under the permissive Apache 2.0 license. With its 4B-parameter footprint, it can run on edge devices, allowing fully private, on-device transcription for sensitive deployments without any audio leaving the user's environment.

Long-Form & Noise-Robust Audio Processing

Voxtral Mini Transcribe V2 accepts single requests up to 3 hours long, eliminating most chunking overhead for podcasts, depositions, or full meetings. It also maintains accuracy in challenging acoustic environments such as factory floors, busy call centers, and field recordings.

Pricing Plans

Mistral Studio Audio Playground

Free

✓ Test Voxtral Transcribe 2 directly in-browser
✓ Upload up to 10 audio files
✓ Toggle diarization and timestamp granularity
✓ Add context bias terms
✓ Supports .mp3, .wav, .m4a, .flac, .ogg up to 1GB each
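The playground limits listed above can be checked locally before uploading. This Python sketch is only an illustration: the function name is made up, and it assumes "1GB" means 10**9 bytes, which the listing does not specify.

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".ogg"}
MAX_FILE_BYTES = 1_000_000_000  # assumption: "1GB" interpreted as 10**9 bytes
MAX_FILES = 10


def check_upload(files):
    """Return a list of problems for an iterable of (filename, size_bytes) pairs."""
    files = list(files)
    problems = []
    if len(files) > MAX_FILES:
        problems.append(f"too many files: {len(files)} > {MAX_FILES}")
    for name, size_bytes in files:
        ext = Path(name).suffix.lower()  # extension check is case-insensitive
        if ext not in ALLOWED_EXTENSIONS:
            problems.append(f"{name}: unsupported format {ext or '(none)'}")
        if size_bytes > MAX_FILE_BYTES:
            problems.append(f"{name}: {size_bytes} bytes exceeds the 1GB limit")
    return problems
```

An empty list means the batch passes all three checks (count, format, size).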

Voxtral Mini Transcribe V2 (API)

$0.003/min

✓ Batch transcription via API
✓ Speaker diarization with timestamps
✓ Context biasing (up to 100 terms)
✓ Word-level timestamps
✓ Support for recordings up to 3 hours
✓ 13 languages supported

Voxtral Realtime (API)

$0.006/min

✓ Real-time streaming transcription
✓ Configurable latency down to sub-200ms
✓ 13 languages supported
✓ Purpose-built for voice agents and live applications
✓ Matches batch accuracy at 2.4s delay

Voxtral Realtime (Open Weights)

Free (Apache 2.0)

✓ Full model weights on Hugging Face Hub
✓ 4B-parameter footprint, runs on edge devices
✓ Apache 2.0 license, commercial use allowed
✓ Self-hosted, privacy-first deployment
✓ GDPR/HIPAA-compatible on-prem use

Enterprise / Private Cloud

Contact Sales

✓ GDPR and HIPAA-compliant deployments
✓ Secure on-premise or private cloud setup
✓ Dedicated support
✓ Custom SLAs
✓ Volume pricing

Best Use Cases

Meeting intelligence platforms transcribing multilingual recordings with speaker diarization for who-said-what attribution at high volume

Voice agents and virtual assistants requiring sub-200ms transcription latency in a pipeline with an LLM and TTS for natural conversation

Contact center automation that transcribes calls in real time so AI systems can analyze sentiment, suggest responses, and populate CRM fields mid-conversation

Live multilingual subtitle generation for media and broadcast workflows, using context biasing to handle proper nouns and technical terminology

Compliance and audit documentation in regulated industries (healthcare, finance, legal), with on-premise HIPAA/GDPR deployment and word-level timestamps for precise audit trails

Edge or on-device transcription for privacy-first applications using the open-weights Voxtral Realtime model on a 4B-parameter footprint

Limitations & What It Can't Do

We believe in transparent reviews. Here's what Voxtral Transcribe 2 doesn't handle well:

⚠ Cannot reliably transcribe overlapping speech; typically captures only one speaker when multiple people talk simultaneously
⚠ Language support capped at 13 languages, narrower than Whisper's 99+ for global localization needs
⚠ Context biasing accuracy varies outside English, where it remains experimental
⚠ Voxtral Mini Transcribe V2 is API-only, with no published self-hosting path for batch workloads
⚠ Maximum single-request audio length is 3 hours, requiring chunking for longer recordings such as full-day events
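For recordings that exceed the 3-hour request limit, the audio has to be split before submission. The Python sketch below plans chunk boundaries only; the function name and the optional overlap parameter are assumptions for illustration, and actually cutting the audio (ideally at silences rather than fixed offsets) is left to an audio tool of your choice.

```python
MAX_REQUEST_SECONDS = 3 * 60 * 60  # 3-hour per-request limit stated in this review


def plan_chunks(duration_s, max_len_s=MAX_REQUEST_SECONDS, overlap_s=0.0):
    """Return (start, end) offsets in seconds that cover the whole recording.

    overlap_s repeats a little audio at each boundary so words cut at the
    edge of one chunk appear whole in the next (an assumed strategy, not a
    documented requirement).
    """
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    if not 0 <= overlap_s < max_len_s:
        raise ValueError("overlap must be non-negative and shorter than a chunk")

    chunks = []
    start = 0.0
    while True:
        end = min(start + max_len_s, duration_s)
        chunks.append((start, end))
        if end >= duration_s:
            break
        start = end - overlap_s  # step back by the overlap for the next chunk
    return chunks
```

For example, an 8-hour event with no overlap yields three requests: two full 3-hour chunks and a final 2-hour chunk.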

Pros & Cons

✓ Pros

✓ One of the lowest published price points at $0.003/min for batch transcription, roughly one-fifth the cost of ElevenLabs Scribe v2 according to Mistral
✓ Sub-200ms streaming latency makes it viable for real-time voice agents, with only 1-2% WER degradation versus offline mode
✓ Voxtral Realtime ships as open weights under Apache 2.0, enabling private on-device deployment for sensitive workloads
✓ Approximately 4% word error rate on the FLEURS benchmark, beating GPT-4o mini Transcribe, Gemini 2.5 Flash, AssemblyAI Universal, and Deepgram Nova per Mistral's published comparisons
✓ Native multilingual support across 13 languages with strong non-English performance, not just English-first adaptation
✓ Long-form support up to 3 hours per request reduces chunking overhead for meetings and podcasts

✗ Cons

✗ Context biasing is optimized for English; support for other languages is labeled experimental
✗ With overlapping speech, the model typically transcribes only one speaker rather than separating concurrent voices
✗ Only 13 languages supported, fewer than competitors like Whisper (99+) or Deepgram (30+) for niche language coverage
✗ Realtime model is open-weights but Mini Transcribe V2 is API-only, limiting self-hosted batch workflows
✗ Documentation and tooling are newer than those of incumbents like AssemblyAI or Deepgram, so ecosystem integrations are still maturing

Frequently Asked Questions

How much does Voxtral Transcribe 2 cost?

Voxtral Mini Transcribe V2 costs $0.003 per minute via API for batch transcription, and Voxtral Realtime costs $0.006 per minute for streaming. Mistral positions this as the lowest price point in the category — roughly one-fifth the cost of ElevenLabs Scribe v2 at comparable quality. Voxtral Realtime is also available as free open weights under the Apache 2.0 license on Hugging Face, so self-hosters only pay infrastructure costs. There is also a free audio playground in Mistral Studio for testing.
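To make the per-minute rates concrete, the following Python sketch estimates API spend for a given audio volume. It assumes straight per-minute billing with no rounding rules, minimum charges, or volume discounts, none of which are described here; the function name is illustrative.

```python
from decimal import Decimal

# Published per-minute API rates quoted in this review (USD).
RATES_PER_MINUTE = {
    "batch": Decimal("0.003"),     # Voxtral Mini Transcribe V2
    "realtime": Decimal("0.006"),  # Voxtral Realtime
}


def estimate_cost(minutes, mode="batch"):
    """Estimated USD cost for `minutes` of audio at the listed rate."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    # Decimal avoids binary floating-point drift in currency arithmetic.
    return RATES_PER_MINUTE[mode] * Decimal(str(minutes))


if __name__ == "__main__":
    monthly_minutes = 10_000 * 60  # 10,000 hours of audio
    print("batch:   ", estimate_cost(monthly_minutes, "batch"))
    print("realtime:", estimate_cost(monthly_minutes, "realtime"))
```

Under these assumptions, 10,000 hours of audio (600,000 minutes) comes to $1,800 for batch and $3,600 for realtime, and a single 3-hour recording costs $0.54 in batch mode.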

What languages does Voxtral support?

Both Voxtral Mini Transcribe V2 and Voxtral Realtime natively support 13 languages: English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian, and Dutch. According to Mistral's FLEURS benchmark results, non-English performance significantly outpaces competitors. Note that the context biasing feature is optimized primarily for English, with support for other languages still considered experimental.
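A simple pre-flight check can route unsupported languages to another service before any audio is sent. In this Python sketch the ISO 639-1 codes are our own mapping of the 13 languages named above; whether the API accepts these exact codes is an assumption to verify against Mistral's documentation.

```python
# ISO 639-1 codes for the 13 languages listed in this review.
SUPPORTED_LANGUAGES = {
    "en": "English", "zh": "Chinese", "hi": "Hindi", "es": "Spanish",
    "ar": "Arabic", "fr": "French", "pt": "Portuguese", "ru": "Russian",
    "de": "German", "ja": "Japanese", "ko": "Korean", "it": "Italian",
    "nl": "Dutch",
}


def is_supported(language_tag: str) -> bool:
    """Accept tags such as 'pt', 'pt-BR', or 'zh_CN' and compare the primary subtag."""
    primary = language_tag.strip().lower().replace("_", "-").split("-")[0]
    return primary in SUPPORTED_LANGUAGES
```

For instance, `is_supported("pt-BR")` returns `True`, while `is_supported("sv")` (Swedish) returns `False`.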

How does Voxtral Realtime achieve sub-200ms latency?

Voxtral Realtime uses a novel streaming architecture that transcribes audio as it arrives, rather than adapting offline models by processing audio in chunks. Latency is configurable: at sub-200ms it powers responsive voice agents while staying within 1-2% word error rate of the batch model, and at 2.4 seconds delay it fully matches Voxtral Mini Transcribe V2's accuracy — ideal for live subtitling. The 4B-parameter footprint means it can also run on edge devices for privacy-sensitive deployments.

Is Voxtral suitable for HIPAA or GDPR-regulated workflows?

Yes. Mistral states that both models support GDPR and HIPAA-compliant deployments through secure on-premise or private cloud setups. The open-weights release of Voxtral Realtime under Apache 2.0 is particularly relevant here because it allows organizations to run transcription entirely within their own infrastructure, with no audio leaving their environment. This makes it well-suited for healthcare, legal, financial services, and other regulated industries.

How does Voxtral compare to Whisper, Deepgram, and AssemblyAI?

Per Mistral's published benchmarks, Voxtral Mini Transcribe V2 outperforms GPT-4o mini Transcribe, Gemini 2.5 Flash, AssemblyAI Universal, and Deepgram Nova on word error rate across FLEURS, while costing $0.003/min — significantly less than incumbents. It also processes audio approximately 3x faster than ElevenLabs Scribe v2 at one-fifth the cost. Compared to OpenAI Whisper (open source), Voxtral covers fewer languages (13 vs 99+) but offers higher accuracy in supported languages plus a hosted API with diarization and streaming built in.

What's New in 2026

Mistral released Voxtral Transcribe 2 in 2026, introducing two new models: Voxtral Mini Transcribe V2 for batch transcription and Voxtral Realtime for live streaming. Updates include a novel streaming architecture with sub-200ms configurable latency, expanded language support to 13 languages, support for recordings up to 3 hours, context biasing of up to 100 custom terms, an audio playground inside Mistral Studio, and the open-weights release of Voxtral Realtime on Hugging Face under Apache 2.0.

Alternatives to Voxtral Transcribe 2

Deepgram

AI Model APIs

Advanced speech-to-text and text-to-speech API with industry-leading accuracy, real-time streaming, and support for 30+ languages. Built for developers creating voice applications, call transcription, and conversational AI.

AssemblyAI

AI Model APIs

Production-grade speech-to-text API with Universal-3 Pro model, real-time streaming, and audio intelligence features for voice AI applications.
