OpenAI Realtime API

OpenAI's API for real-time voice conversations and audio processing, enabling low-latency speech-to-speech interactions.

Starting at $40 per 1M audio input tokens (gpt-4o-mini-realtime)

Overview

The OpenAI Realtime API is a paid, usage-based developer service for voice and audio AI that delivers sub-second speech-to-speech interactions. Pricing starts at $40 per 1M audio input tokens and $80 per 1M audio output tokens (gpt-4o-mini-realtime), and developers can build low-latency voice agents without stitching together separate STT, LLM, and TTS pipelines.

Rather than cascading audio through discrete speech-to-text, language model, and text-to-speech stages — an approach that typically adds 2–5 seconds of round-trip latency — the Realtime API accepts audio (and text) input and returns audio (and text) output through a single streaming connection. This unified architecture delivers approximately 300 ms first-audio latency over WebRTC, preserves prosodic and emotional nuances of speech, and enables natural turn-taking behaviors such as interruption handling and back-channeling that feel much closer to human conversation than traditional cascaded voice stacks.

Under the hood, the Realtime API exposes a persistent, bidirectional session — established via WebSocket or WebRTC — over which developers exchange structured events. These events cover session configuration (voice selection, instructions, modalities, turn detection settings), conversation state (adding user messages, managing conversation items), response generation (triggering model responses, streaming audio deltas), and tool/function calling. The event-driven model lets applications react incrementally as audio tokens stream back, so users start hearing responses within hundreds of milliseconds rather than waiting for a full generation to complete.
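To make the event-driven model concrete, here is a minimal sketch of a client-side handler that accumulates streamed audio as events arrive. It assumes the server sends JSON events with a "type" field, that audio chunks arrive base64-encoded in events named "response.audio.delta", and that "response.done" marks the end of a response. Those names follow the beta event schema as we understand it and may differ in newer API versions, so check the current event reference. The messages are simulated; no network connection is made.

```python
import base64
import json


class RealtimeEventHandler:
    """Collects streamed audio bytes from simulated server events."""

    def __init__(self) -> None:
        self.audio = bytearray()
        self.done = False

    def handle(self, raw_message: str) -> None:
        event = json.loads(raw_message)
        event_type = event.get("type", "")
        if event_type == "response.audio.delta":
            # Assumed shape: audio arrives as base64 chunks in "delta".
            self.audio.extend(base64.b64decode(event["delta"]))
        elif event_type == "response.done":
            self.done = True
        # Every other event type is ignored in this sketch.


# Simulated server messages standing in for a live session.
chunks = [b"\x00\x01", b"\x02\x03"]
messages = [
    json.dumps({
        "type": "response.audio.delta",
        "delta": base64.b64encode(chunk).decode("ascii"),
    })
    for chunk in chunks
] + [json.dumps({"type": "response.done"})]

handler = RealtimeEventHandler()
for message in messages:
    handler.handle(message)
```

In a real client, each decoded chunk would be handed to an audio player as soon as it arrives instead of being buffered until the end, which is what keeps perceived latency low.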

The API supports server-side voice activity detection (VAD) with configurable silence thresholds (default 500 ms), which automatically detects when a user starts and stops speaking, enabling hands-free, always-listening experiences. It also supports function calling in the same way the standard Chat Completions and Responses APIs do, which means voice agents can look up data, trigger workflows, or interact with external systems mid-conversation. Developers can pick from a set of built-in voices, tune the model's persona via system instructions, and switch seamlessly between text and audio modalities within a single session.
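As an illustration of the session settings described above, the sketch below builds a session configuration event with server-side VAD and a 500 ms silence threshold. The event type "session.update" and the field names ("modalities", "voice", "turn_detection", "silence_duration_ms") follow the beta schema as we understand it, and the voice name and instructions are placeholder values, so treat all of them as assumptions to verify against the current API reference.

```python
import json


def build_session_update(instructions: str, voice: str, silence_ms: int = 500) -> dict:
    """Build a session configuration event (assumed beta field names)."""
    return {
        "type": "session.update",
        "session": {
            "modalities": ["audio", "text"],
            "instructions": instructions,
            "voice": voice,
            "turn_detection": {
                "type": "server_vad",
                # How long the user must be silent before the turn ends.
                "silence_duration_ms": silence_ms,
            },
        },
    }


# Placeholder persona and voice; serialize before sending over the connection.
event = build_session_update("You are a concise support agent.", "alloy")
payload = json.dumps(event)
```

A longer silence threshold reduces the chance of cutting users off mid-sentence at the cost of slower responses, so it is worth tuning per use case.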

Typical use cases include customer support voice agents, voice-enabled copilots inside web and mobile apps, language tutoring and pronunciation coaching, accessibility tools, in-car and smart-device assistants, and interactive gaming NPCs. Because OpenAI offers both WebRTC (ideal for browsers and mobile clients) and WebSocket (ideal for server-to-server scenarios) transports, teams can build end-user experiences that connect devices directly to OpenAI while keeping their own backend in the loop for authentication, business logic, and tool execution. The Realtime API is positioned as the foundation for a new generation of voice-first AI products, combining the reasoning quality of GPT-class models with the immediacy required for real conversation.

Vibe Coding Friendly?

Difficulty: intermediate

Suitability for vibe coding depends on your experience level and the specific use case.

Key Features

Speech-to-speech model: A unified model that directly ingests audio and emits audio, preserving tone, emotion, and timing that are typically lost in cascaded STT + LLM + TTS pipelines.

Bidirectional streaming over WebRTC and WebSocket: Persistent connections carry typed events for session setup, conversation updates, response generation, and audio deltas, enabling approximately 300 ms first-audio latency over WebRTC.

Server-side voice activity detection and interruption: Built-in VAD with configurable silence detection thresholds automatically segments user turns and lets users barge in on the model, with the server cleanly truncating in-flight responses.

Tool and function calling: Developers can define tools at session start; the model will issue tool calls mid-conversation, receive results, and continue speaking without breaking the voice flow.

Multimodal, mixed input/output: A single session can combine text and audio in either direction, allowing patterns like silent context injection, text transcripts alongside audio, or text-only fallbacks.

Configurable voices and instructions: System instructions, temperature, modalities, and a choice of built-in voices can be set per session to shape persona, style, and behavior of the voice agent.
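The tool-calling round trip in the feature list can be sketched as follows. This assumes the model's tool request reaches the client as a conversation item of type "function_call" carrying "name", "call_id", and a JSON string of "arguments", and that the client replies with a "conversation.item.create" event holding a "function_call_output" item followed by "response.create". Those shapes reflect the beta schema as we understand it. The lookup_order tool and its data are hypothetical.

```python
import json
from typing import Callable


def lookup_order(order_id: str) -> dict:
    """Hypothetical backend tool; a real one would query your own systems."""
    return {"order_id": order_id, "status": "shipped"}


TOOLS: dict[str, Callable[..., dict]] = {"lookup_order": lookup_order}


def handle_function_call(item: dict) -> list[dict]:
    """Run the requested tool and return the client events to send back."""
    arguments = json.loads(item["arguments"])
    result = TOOLS[item["name"]](**arguments)
    return [
        {
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": item["call_id"],  # ties the result to the request
                "output": json.dumps(result),
            },
        },
        # Ask the model to continue speaking using the tool result.
        {"type": "response.create"},
    ]


# Simulated function-call item as the model might emit it.
item = {
    "type": "function_call",
    "name": "lookup_order",
    "call_id": "call_123",
    "arguments": '{"order_id": "A1"}',
}
events = handle_function_call(item)
```

Running the tool on your own backend, as here, keeps credentials and business logic out of the client even when audio flows directly between the device and OpenAI.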

Pricing Plans

Pay-as-you-go API usage

From $40/1M audio input tokens (gpt-4o-mini-realtime)

✓ gpt-4o-realtime: $100 per 1M audio input tokens, $200 per 1M audio output tokens; text at $5/$20 per 1M tokens
✓ gpt-4o-mini-realtime: $40 per 1M audio input tokens, $80 per 1M audio output tokens; text at $2.50/$10 per 1M tokens
✓ Access to all Realtime-capable GPT models
✓ WebRTC and WebSocket transports included
✓ Built-in tool/function calling and VAD at no additional charge

Enterprise / Scale

Custom volume discounts

✓ Negotiated per-token rates below published list prices for high-volume commitments
✓ Enterprise agreements with data handling and compliance commitments
✓ Higher rate limits and committed throughput capacity
✓ Support SLAs and dedicated account management

Best Use Cases

Building voice-first customer support agents that can understand speech, call backend tools, and respond with natural-sounding audio in real time

Embedding conversational voice copilots into web and mobile applications where hands-free interaction improves usability

Creating language learning and pronunciation coaching products that require immediate, expressive spoken feedback

Powering accessibility tools such as voice-controlled interfaces or reading assistants for users with visual or motor impairments

Developing interactive voice experiences for games, interactive fiction, and virtual characters with expressive dialogue

Prototyping smart-device and in-vehicle assistants that need low-latency speech-to-speech reasoning with tool execution

Limitations & What It Can't Do

We believe in transparent reviews. Here's what OpenAI Realtime API doesn't handle well:

⚠ The Realtime API is a cloud-only service with no self-hosted or offline deployment option, which rules out fully air-gapped environments. Audio tokens cost materially more than text tokens, roughly 8–20× per token at the list prices above, so long conversations or high-concurrency deployments can become costly at scale. The set of available voices is curated by OpenAI, and there is no supported mechanism for cloning a user's or brand's custom voice. Latency and quality ultimately depend on the end user's network conditions, which matters most for mobile or global deployments. Finally, although the event-driven protocol is powerful, it requires careful state management on the client: handling reconnection, barge-in, partial audio buffering, and tool-call orchestration is non-trivial compared to a simple request/response API.
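The audio-versus-text price multiple can be worked out directly from the per-token rates listed under Pricing Plans. The short calculation below uses only those published figures; the dictionary layout and key names are our own.

```python
# USD per 1M tokens, copied from the Pricing Plans section of this page.
PRICES = {
    "gpt-4o-realtime": {"audio_in": 100, "audio_out": 200, "text_in": 5, "text_out": 20},
    "gpt-4o-mini-realtime": {"audio_in": 40, "audio_out": 80, "text_in": 2.5, "text_out": 10},
}


def audio_to_text_ratios(rates: dict) -> dict:
    """Return how many times more an audio token costs than a text token."""
    return {
        "input": rates["audio_in"] / rates["text_in"],
        "output": rates["audio_out"] / rates["text_out"],
    }


ratios = {model: audio_to_text_ratios(rates) for model, rates in PRICES.items()}
for model, ratio in ratios.items():
    print(f"{model}: input {ratio['input']:.0f}x, output {ratio['output']:.0f}x")
# gpt-4o-realtime: input 20x, output 10x
# gpt-4o-mini-realtime: input 16x, output 8x
```

Comparing against cheaper text-only models would produce larger multiples, so the figure depends on which baseline you pick.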

Pros & Cons

✓ Pros

✓ Single speech-to-speech pipeline eliminates the latency and quality loss of chaining separate STT, LLM, and TTS services
✓ Supports both WebRTC and WebSocket transports, making it suitable for browser, mobile, and server-side integrations
✓ Built-in server-side voice activity detection and interruption handling produce natural turn-taking without custom audio engineering
✓ Native function/tool calling within voice sessions lets agents invoke APIs, look up data, and complete tasks mid-conversation
✓ Preserves prosody, tone, and emotional nuance that are typically lost when transcribing speech to text first
✓ Backed by OpenAI's infrastructure and model quality, giving production-grade reasoning, multilingual coverage, and reliability

✗ Cons

✗ Audio token pricing is significantly higher than text-only API usage, which can make long or high-volume voice sessions expensive
✗ Realtime streaming and persistent connections add architectural complexity compared to stateless REST endpoints
✗ Limited set of built-in voices and no support for fully custom voice cloning restricts brand personalization
✗ Tight coupling to OpenAI means vendor lock-in and no on-premise or offline deployment option for sensitive workloads
✗ Event-driven API surface has a steeper learning curve and fewer mature SDK abstractions than standard chat completions

Frequently Asked Questions

What transports does the OpenAI Realtime API support?

The Realtime API supports WebRTC, which is recommended for browser and mobile clients that need the lowest possible latency, and WebSockets, which are better suited for server-to-server integrations where a backend service mediates between users and the API.
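For the server-to-server WebSocket path, a minimal sketch of the connection parameters might look like the following. The endpoint URL, the "model" query parameter, and the "OpenAI-Beta" header reflect the beta-era connection details as we understand them and may have changed, and the model name is the one used on this page rather than a verified model ID. The function only builds the values; opening the socket is left to whichever WebSocket client you use.

```python
from urllib.parse import urlencode


def websocket_connection_params(api_key: str, model: str) -> tuple[str, dict[str, str]]:
    """Return the URL and headers for a Realtime WebSocket session (assumed)."""
    url = "wss://api.openai.com/v1/realtime?" + urlencode({"model": model})
    headers = {
        "Authorization": f"Bearer {api_key}",
        # Assumed: required during the beta; may be unnecessary today.
        "OpenAI-Beta": "realtime=v1",
    }
    return url, headers


# Placeholder key; load the real one from a secret store on your backend.
url, headers = websocket_connection_params("sk-placeholder", "gpt-4o-mini-realtime")
```

Because the API key travels in a header, this pattern belongs on a backend server and not in browser or mobile code.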

Does the Realtime API handle interruptions and turn-taking automatically?

Yes. The API includes server-side voice activity detection (VAD) that detects when a user starts and stops speaking, automatically segments turns, and allows users to interrupt the model mid-response, which the model gracefully handles by truncating its current output.
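On a WebSocket connection the client typically still has some work to do when the user barges in: stop local playback and tell the server how much audio was actually heard. The sketch below assumes a "speech started" notification triggers this function, and that "response.cancel" and "conversation.item.truncate" (with "item_id", "content_index", and "audio_end_ms") are the relevant client events. Those names follow the beta schema as we understand it. The item ID and timing values are hypothetical.

```python
from typing import Optional


def on_speech_started(active_item_id: Optional[str], played_ms: int) -> list[dict]:
    """Build the client events to send when the user interrupts playback."""
    if active_item_id is None:
        # Nothing is playing, so there is nothing to cancel or truncate.
        return []
    return [
        {"type": "response.cancel"},
        {
            "type": "conversation.item.truncate",
            "item_id": active_item_id,
            "content_index": 0,
            # Only audio up to this point was actually heard by the user.
            "audio_end_ms": played_ms,
        },
    ]


# Hypothetical state: the assistant item "item_abc" had played for 1.2 seconds.
interrupt_events = on_speech_started("item_abc", 1200)
idle_events = on_speech_started(None, 0)
```

Truncating to the heard portion keeps the conversation history aligned with what the user actually experienced, which matters for follow-up questions.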

Can I use function calling and tools in a voice session?

Yes. The Realtime API supports the same tool and function-calling paradigm as OpenAI's other APIs. You can register tools during session configuration, and the model can decide to call them mid-conversation so the voice agent can fetch data or trigger external actions.

Is the Realtime API limited to audio, or can it handle text as well?

The API is multimodal: a single session can accept and produce text, audio, or both. Developers can configure which modalities are enabled and can mix text inputs (for example, system instructions or silent context updates) with streaming audio within the same conversation.

How is pricing calculated for the Realtime API?

Usage is billed per token with separate rates for audio and text. For the gpt-4o-realtime model, audio input costs $100 per 1M tokens and audio output costs $200 per 1M tokens, while text input is $5 and text output is $20 per 1M tokens. The more affordable gpt-4o-mini-realtime model charges $40 per 1M audio input tokens and $80 per 1M audio output tokens, with text at $2.50 input and $10 output per 1M tokens. Because speech generates more tokens per second than equivalent text, audio-heavy sessions are priced higher, and developers should monitor session duration and output length to control costs.
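A rough cost estimate for a single session can be computed from the rates above. The token counts in the example are hypothetical inputs, not measurements; actual counts come from the usage data reported for your own sessions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """USD per 1M tokens, as listed on this page."""
    audio_in: float
    audio_out: float
    text_in: float
    text_out: float


RATES = {
    "gpt-4o-realtime": Rates(100, 200, 5, 20),
    "gpt-4o-mini-realtime": Rates(40, 80, 2.5, 10),
}


def session_cost(model: str, audio_in: int = 0, audio_out: int = 0,
                 text_in: int = 0, text_out: int = 0) -> float:
    """Return the estimated session cost in USD for the given token counts."""
    rates = RATES[model]
    total = (audio_in * rates.audio_in + audio_out * rates.audio_out
             + text_in * rates.text_in + text_out * rates.text_out)
    return total / 1_000_000


# Hypothetical session: 50k audio in, 30k audio out, 2k text in, 1k text out.
cost = session_cost("gpt-4o-mini-realtime", audio_in=50_000, audio_out=30_000,
                    text_in=2_000, text_out=1_000)
print(f"${cost:.3f}")  # $4.415
```

Audio output dominates in this example, which is why capping response length is usually the most effective cost control.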

What's New in 2026

Through early 2026, OpenAI has continued to iterate on the Realtime API with a focus on production readiness: improved voice quality and more natural-sounding built-in voices, broader language coverage, more robust interruption handling, and tighter integration with the Responses API and Agents SDK so voice agents can share the same tool definitions and orchestration logic as text-based agents. Transport improvements on WebRTC have reduced first-audio latency, and pricing has trended downward as newer, more efficient Realtime-capable models are released alongside OpenAI's latest GPT generations.
