Stay free if you only need $200 in free API credits on signup and access to Nova STT, Aura TTS, and the Voice Agent API. Upgrade if you need discounted volume pricing on STT and TTS, plus higher concurrency and rate limits. Most solo builders can start free.
Known limitations (all apply from the Pay As You Go tier up):

- Aura TTS offers a smaller voice catalog and less expressive range than specialized providers like ElevenLabs or PlayHT.
- Custom model fine-tuning is gated behind enterprise contracts with significant minimum commitments.
- The cloud API requires internet connectivity by default; offline use requires the more expensive self-hosted tier.
- Documentation depth on advanced features (custom vocabulary tuning, on-prem ops) lags behind hyperscaler competitors.
- Audio files longer than ~4 hours typically need to be chunked client-side for optimal batch performance.
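The ~4-hour batch ceiling is straightforward to work around by computing chunk boundaries client-side before upload. A minimal sketch, where the 3-hour chunk length and 5-second overlap are illustrative choices on our part, not Deepgram recommendations:

```python
def chunk_spans(total_seconds, chunk_seconds=3 * 3600, overlap_seconds=5):
    """Yield (start, end) second offsets for client-side audio chunking.

    Chunks overlap slightly so a word split at a boundary lands whole in
    at least one chunk; duplicates can be deduped after transcription.
    A 3-hour chunk length stays safely under the ~4-hour mark.
    """
    spans = []
    start = 0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        spans.append((start, end))
        if end == total_seconds:
            break
        start = end - overlap_seconds
    return spans

# A 10-hour recording becomes four overlapping chunks:
print(chunk_spans(10 * 3600))
```

Feed each span to your audio slicer of choice, transcribe the pieces in parallel, and stitch the transcripts back together by offset.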
Deepgram's Nova model consistently posts the lowest word error rates in independent benchmarks, particularly on conversational audio with accents, crosstalk, or background noise. Real-world deployments report 15-30% relative WER reductions compared to Google Speech-to-Text and AWS Transcribe. Against AssemblyAI, Deepgram tends to win on streaming latency and pricing, while AssemblyAI is competitive on long-form batch accuracy. For multilingual conversational use, the new Flux model raises the bar further with built-in language detection across 10 languages.
Deepgram offers $200 in free credits on signup with no credit card required, which translates to roughly 750 hours of pre-recorded Nova transcription (streaming's higher per-minute rate yields fewer hours). Pay-as-you-go STT pricing starts around $0.0043 per minute for pre-recorded Nova and $0.0077 per minute for streaming, with TTS billed per character. Growth and Enterprise tiers offer volume discounts, committed-use contracts, and custom model training. This pricing is typically 50-75% below Google Cloud Speech and AWS Transcribe at comparable quality levels.
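To sanity-check the free-credit runway yourself, divide the credit balance by the per-minute rate. At the rates quoted above (verify current pricing before relying on them), $200 works out to roughly 775 hours pre-recorded or 433 hours streaming:

```python
def free_credit_hours(credit_usd, rate_per_min):
    """Estimate how many audio hours a credit balance buys at a given rate."""
    return credit_usd / rate_per_min / 60

# Rates as quoted in this review; check Deepgram's pricing page for current numbers.
prerecorded = free_credit_hours(200, 0.0043)
streaming = free_credit_hours(200, 0.0077)
print(round(prerecorded), round(streaming))  # → 775 433
```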
End-to-end speech-to-text latency is typically 100-300ms over the WebSocket streaming API, with interim results returned even faster. The unified Voice Agent API further compresses round-trip time by collocating STT, LLM orchestration, and TTS — eliminating the network hops you'd see when stitching three separate vendors together. The new Flux model adds intelligent endpointing so the system reliably knows when a user has stopped speaking, which is critical for natural turn-taking in phone-quality conversations.
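To see interim results in practice, you open a WebSocket to the live-transcription endpoint with your desired options in the query string, then stream raw audio frames. The sketch below only builds that URL; the parameter names follow Deepgram's streaming docs at the time of writing, so confirm them against the current API reference before use (authentication goes in an `Authorization: Token <key>` header on the connection itself):

```python
from urllib.parse import urlencode

def streaming_url(model="nova-2", language="en", interim_results=True):
    """Build the wss URL for Deepgram's live-transcription endpoint.

    interim_results=true asks the server to push partial hypotheses
    before each utterance is finalized, which is what keeps perceived
    latency in the low hundreds of milliseconds.
    """
    params = {
        "model": model,
        "language": language,
        "interim_results": str(interim_results).lower(),
    }
    return "wss://api.deepgram.com/v1/listen?" + urlencode(params)

print(streaming_url())
```

Connect to the printed URL with any WebSocket client, send audio bytes as binary frames, and read JSON transcript messages as they arrive.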
Yes — self-hosted deployment is one of Deepgram's key differentiators in the speech API category. Enterprise customers can run the same Nova and TTS models inside their own VPC, on-premises data centers, or air-gapped environments. This makes it viable for HIPAA-regulated medical transcription, financial services with data-residency rules, and government workloads. Most major cloud-only competitors do not offer a comparable self-hosted option.
Deepgram supports 30+ languages for transcription, with the new 2026 Flux model offering conversational STT in 10 languages (English, Spanish, German, French, Hindi, Russian, Portuguese, Japanese, Italian, and Dutch), with automatic language detection. Beyond raw transcription, the Audio Intelligence API adds summarization, sentiment analysis, topic detection, intent recognition, speaker diarization, and smart formatting. These can be applied to both batch files and live streams via flags on the same API call.
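Those Audio Intelligence features are toggled as query flags on the same `/v1/listen` call used for plain transcription. A hedged sketch of assembling such a request URL (flag names match Deepgram's docs at the time of writing; confirm against the current API reference):

```python
from urllib.parse import urlencode

# Each flag enables one Audio Intelligence feature on top of transcription.
features = {
    "model": "nova-2",
    "smart_format": "true",
    "diarize": "true",     # speaker diarization
    "summarize": "v2",     # summarization
    "sentiment": "true",   # sentiment analysis
    "topics": "true",      # topic detection
    "intents": "true",     # intent recognition
}
url = "https://api.deepgram.com/v1/listen?" + urlencode(features)
print(url)
# POST the audio bytes (or a {"url": ...} JSON body) to `url` with an
# "Authorization: Token <DEEPGRAM_API_KEY>" header; the response carries
# the transcript plus one result section per enabled feature.
```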
Start with the free plan — upgrade when you need more.
Last verified March 2026