Open source voice cloning desktop application with support for multiple TTS engines that allows users to clone any voice and generate natural speech locally.
Voicebox is a free, open-source desktop application in the Voice/Audio category that performs voice cloning and text-to-speech generation locally across multiple TTS engines. It is released under the MIT license and is built for developers, game designers, content creators, and privacy-conscious users who need professional voice synthesis without cloud dependencies, API keys, or per-character fees.
The application bundles seven distinct TTS engines: Qwen3-TTS (1.7B and 0.6B parameter variants by Alibaba), Chatterbox and Chatterbox Turbo (by Resemble AI, 350M params), LuxTTS (by ZipVoice, 48kHz output), Qwen CustomVoice (with nine preset speakers), TADA (by Hume AI, 3B and 1B variants), and Kokoro (by hexgrad, 82M params under Apache 2.0). Together these engines cover up to 23 languages, support delivery instructions in natural language, handle paralinguistic tags like [laugh] and [sigh], and deliver performance exceeding 150x realtime on CPU with approximately 1GB VRAM. The TADA engine can produce 700+ seconds of coherent long-form audio without drift, making it viable for audiobook production.
Use cases span game NPC dialogue generation, AI agent voice replies, accessibility readouts, audiobook batch processing, podcast automation, and integrations with tools like Stream Deck via a localhost URL. Voicebox ships with a built-in REST API (curl-compatible) that mirrors commercial TTS endpoints but runs entirely on-device. Compared to the other Voice/Audio tools in our directory, many of which rely on subscription-based cloud APIs like ElevenLabs ($5 to $330/month) or Play.ht, Voicebox offers zero-cost, unlimited, offline inference with source code on GitHub. Based on our analysis of 870+ AI tools, it is one of the few voice cloning solutions that is both genuinely local-first and multi-engine in a single unified studio interface, with native builds for macOS (Apple Silicon and Intel x64), Windows 64-bit MSI, and Linux.
Voicebox is the only studio in our directory that bundles 7 distinct TTS engines (Qwen3-TTS, Chatterbox, Chatterbox Turbo, LuxTTS, Qwen CustomVoice, TADA, and Kokoro) into a single UI. Users can pick the right model per task: LuxTTS for fast iteration, TADA for long-form audiobooks, Chatterbox for 23-language reach, or Kokoro for minimal-footprint CPU use.
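For scripted workflows, the per-task guidance above can be captured as a small lookup table. This is only a sketch: the task labels, the lowercase engine identifier strings, and the fallback engine are hypothetical choices made for the example, not names confirmed by Voicebox.

```python
# Task-to-engine hints taken from the review text above.
# ASSUMPTION: the task labels and engine identifier strings are illustrative
# and may not match the names Voicebox itself uses.
ENGINE_FOR_TASK = {
    "fast_iteration": "luxtts",
    "long_form": "tada",
    "multilingual": "chatterbox",
    "low_footprint_cpu": "kokoro",
}


def pick_engine(task, default="qwen3-tts"):
    """Return the suggested engine for a task, or a fallback if the task is unlisted."""
    # ASSUMPTION: the fallback engine is an arbitrary choice for this sketch.
    return ENGINE_FOR_TASK.get(task, default)


if __name__ == "__main__":
    for task in ("long_form", "fast_iteration", "dubbing"):
        print(task, "->", pick_engine(task))
```

Keeping the mapping in one place means a batch script can switch engines per job without scattering engine names through the code.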
Every model runs entirely on the user's machine with no API keys, no rate limits, no per-character fees, and no internet required after download. This preserves the privacy of voice samples and generated audio, and eliminates ongoing cost regardless of volume, which is particularly valuable for commercial projects generating thousands of lines.
Qwen3-TTS and Qwen CustomVoice accept an `instruct` parameter (e.g. "warm, slow, cinematic" or "authoritative and clear") that steers tone, pace, and emotion at generation time. This is a significant differentiator over most open-source TTS tools, giving users commercial-grade prosody control without training custom models.
Voicebox exposes a curl-compatible HTTP endpoint that takes text, profile_id, engine, and instruct fields and returns WAV audio. This turns the desktop app into a local inference server for games, AI agents, scripts, Stream Deck macros, or any tool that can hit a URL, all without authentication or external network dependencies.
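As a rough illustration of how a script might call the local endpoint described above, the sketch below builds a request carrying the four named fields (`text`, `profile_id`, `engine`, `instruct`) and saves the returned WAV bytes. The address and route (`http://127.0.0.1:8000/generate`), the JSON body encoding, and the example profile and engine identifiers are assumptions made for this example; the real values should be taken from the Voicebox documentation.

```python
import json
import urllib.request

# ASSUMPTION: placeholder host, port, and route; this review does not state them.
VOICEBOX_URL = "http://127.0.0.1:8000/generate"


def build_payload(text, profile_id, engine, instruct=None):
    """Collect the four fields named in the review into a JSON-ready dict."""
    if not text.strip():
        raise ValueError("text must not be empty")
    payload = {"text": text, "profile_id": profile_id, "engine": engine}
    if instruct:
        # The review describes instruct only for Qwen3-TTS and Qwen CustomVoice,
        # so it is left out of the payload when not provided.
        payload["instruct"] = instruct
    return payload


def build_request(payload, url=VOICEBOX_URL):
    """Wrap the payload in a POST request. ASSUMPTION: the API accepts a JSON body."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(payload, out_path, url=VOICEBOX_URL, timeout=120):
    """Send the request and write the returned audio bytes to out_path."""
    with urllib.request.urlopen(build_request(payload, url), timeout=timeout) as resp:
        audio = resp.read()
    with open(out_path, "wb") as fh:
        fh.write(audio)
    return len(audio)
```

With the app running, a call such as `synthesize_to_file(build_payload("Welcome back, traveler.", "narrator-01", "qwen3-tts", instruct="warm, slow, cinematic"), "line_001.wav")` would write one clip. `narrator-01` and `qwen3-tts` are hypothetical identifiers used only for illustration.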
Chatterbox Turbo supports embedded tags like [laugh], [sigh], and [gasp] directly in text for expressive delivery, while Hume AI's TADA engine produces 700+ seconds of coherent audio without drift. Combined, these features let users produce emotionally nuanced long-form content (audiobooks, narrated tutorials, podcast episodes) in a single generation pass.
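When preparing scripts that use the bracketed tags mentioned above, it can help to check the text before sending it to an engine. The following is a minimal sketch, assuming tags are lowercase words in square brackets and that only the three tags named in this review are known; the full tag list supported by Chatterbox Turbo is not given here, and the helper names are hypothetical.

```python
import re

# Tags named in this review. ASSUMPTION: the engine may support others.
KNOWN_TAGS = {"laugh", "sigh", "gasp"}

# ASSUMPTION: a tag is a lowercase word (letters or underscores) in square brackets.
TAG_PATTERN = re.compile(r"\[([a-z_]+)\]")


def find_tags(text):
    """Return every bracketed tag name in order of appearance."""
    return TAG_PATTERN.findall(text)


def unknown_tags(text, known=KNOWN_TAGS):
    """Return the sorted tag names in text that are not in the known set."""
    return sorted(set(find_tags(text)) - set(known))


def strip_tags(text):
    """Remove tags and collapse leftover whitespace, for engines without tag support."""
    without_tags = TAG_PATTERN.sub("", text)
    return re.sub(r"\s+", " ", without_tags).strip()


if __name__ == "__main__":
    line = "Well, [laugh] that was close. [sigh] Never again. [cough]"
    print(find_tags(line))
    print(unknown_tags(line))
    print(strip_tags(line))
```

Checking for unrecognized tags up front avoids a tag being read aloud literally, and `strip_tags` lets the same script be reused with an engine that does not interpret tags.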
Voicebox v0.2.0 ships with a seven-engine architecture that includes Hume AI's TADA (3B/1B) for coherent long-form generation of 700+ seconds, Alibaba's Qwen3-TTS with natural-language delivery instructions, Chatterbox Turbo with paralinguistic tags ([laugh], [sigh], [gasp]), and ZipVoice's LuxTTS delivering 48kHz output at 150x realtime on CPU. The project is released in 2026 under the MIT license, alongside sister projects Spacebot and Spacedrive.