Open-source LLM observability and evaluation platform built on OpenTelemetry. Self-host it free with no feature gates, or use Arize's managed cloud.
An open-source tool that helps you see inside your AI's thinking — debug and improve AI performance with visual tracing.
Arize Phoenix is the leading open-source option for teams that want to see exactly what their LLM applications are doing in production without paying per-trace fees or getting locked into a vendor. Built on OpenTelemetry, it works with any framework and any model provider.
Most LLM observability tools charge per trace or per seat. LangSmith, the most common alternative, has a free tier but pushes you toward paid plans as trace volume grows. Phoenix is fully open source with no feature gates. You self-host it, you own the data, and you pay nothing for the software itself.
The OpenTelemetry foundation matters. If you already instrument your services with OpenTelemetry (and most production teams do), Phoenix slots into your existing observability stack. You don't need a separate SDK or proprietary agent. Traces from your LLM calls flow through the same pipeline as your application metrics.
Phoenix captures traces from LLM applications: every prompt, completion, tool call, and retrieval step. You see latency breakdowns, token usage, error rates, and the actual content flowing through your system. When a user reports a bad response, you can trace back through the exact chain of events that produced it.
The evaluation framework lets you score outputs against test cases. Define what "good" looks like for your use case, run evaluations against production data, and track quality over time. This replaces the manual spot-checking that most teams rely on.
Experiments compare changes side by side. Swap a prompt, change a model, adjust retrieval parameters, and see how outputs change across the same set of inputs. This is where Phoenix saves the most time: instead of guessing whether a change improved quality, you get evidence.
Source: phoenix.arize.com
Self-hosting Phoenix on a $24/month cloud VM handles most teams' trace volumes. LangSmith, by contrast, charges per seat plus usage-based trace fees: the Plus plan runs $39/seat/month. A 5-person team on LangSmith Plus pays $195/month before usage fees. The same team running Phoenix on a $24 VM pays $24/month and keeps full data ownership. At 20 engineers, LangSmith costs $780/month; Phoenix still costs $24/month (or less if you're already running Kubernetes).
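The comparison above reduces to a per-seat line versus a flat infrastructure line. A quick sketch of that arithmetic, using the prices quoted in this review (subject to change; check each vendor's pricing page):

```python
# Illustrative cost model using the prices quoted in this review
# (subject to change; verify against each vendor's pricing page).

def monthly_cost_langsmith(seats: int, per_seat: float = 39.0) -> float:
    """Seat-based pricing: cost scales linearly with team size."""
    return seats * per_seat

def monthly_cost_phoenix(vm_cost: float = 24.0) -> float:
    """Self-hosted: flat infrastructure cost regardless of team size."""
    return vm_cost

for team in (5, 20):
    print(f"{team} seats: LangSmith ${monthly_cost_langsmith(team):.0f}/mo, "
          f"Phoenix ${monthly_cost_phoenix():.0f}/mo")
```

The crossover favors Phoenix at any team size here, but the model omits the DevOps time discussed next, which is the real variable cost of self-hosting.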
The tradeoff: Phoenix requires someone to maintain the deployment. If your team doesn't have DevOps capacity, the managed Arize Cloud option or LangSmith's hosted service saves that operational burden.
Phoenix runs anywhere: local Docker container for development, Kubernetes Helm chart for production clusters, or a simple pip install for quick experimentation. The Helm chart (added mid-2025) makes Kubernetes deployment straightforward with configurable resource limits and persistent storage.
For teams already running Kubernetes, Phoenix deploys as a standard service alongside your existing observability stack (Grafana, Prometheus, Jaeger). The OpenTelemetry compatibility means traces flow naturally through your existing collectors.
The documentation lags behind the feature set. Power users on GitHub and Reddit note that some newer features lack clear guides. You'll spend time reading source code and community discussions to understand advanced configuration.
The UI is functional but not polished compared to commercial tools. LangSmith's interface is more refined, with better collaboration features for teams reviewing traces together.
No built-in alerting. Phoenix shows you what happened but won't page you when something goes wrong. You'll need to connect it to your existing alerting system (PagerDuty, Slack webhooks) through custom integration.
Community support replaces dedicated customer success. For enterprise teams that need guaranteed response times, the managed Arize Cloud service or a commercial alternative may be worth the premium.
Developers on GitHub (12,000+ stars) praise Phoenix for its zero-cost entry and OpenTelemetry compatibility. A Kubernetes subreddit thread from June 2025 highlighted the Helm chart deployment as a welcome addition for teams wanting in-cluster observability without external SaaS dependencies.
Christopher Brown, CEO of Decision Patterns and former UC Berkeley CS lecturer, noted that "Phoenix integrated into our team's existing data science workflows and enabled the exploration of unstructured text data to identify root causes of unexpected user inputs."
The main complaint in community discussions: the learning curve is steeper than commercial alternatives that offer guided onboarding. Teams without existing observability experience may struggle with initial setup.
For tracing and evaluation, yes. Phoenix covers the core functionality. You'll miss LangSmith's polished UI, collaborative annotation features, and hosted convenience. If cost and data ownership matter more than UX polish, Phoenix is the better choice.
A single VM with 4GB RAM handles development and small production workloads. For high-volume production (millions of traces per day), deploy on Kubernetes with the Helm chart and allocate based on your trace volume. Storage is the main scaling concern.
Yes. The OpenTelemetry-based approach means Phoenix traces calls to OpenAI, Anthropic, Google, local models, or any provider. Framework integrations exist for LangChain, LlamaIndex, and most popular AI frameworks.
No feature gates on the open-source version. Arize Cloud adds managed hosting, enterprise SSO, team management, and dedicated support. The observability and evaluation features are identical.
Phoenix is the right choice for teams with DevOps capacity who want full LLM observability without per-trace fees or vendor lock-in. The OpenTelemetry foundation, zero-cost self-hosting, and no feature restrictions make it the most cost-effective option in the category. If you need managed hosting and polished UX, LangSmith is the commercial alternative. But for teams that value data ownership and cost control, Phoenix is hard to beat.
The best open-source LLM observability tool for teams that want full tracing, evaluation, and experimentation without per-trace fees. Built on OpenTelemetry for vendor-neutral integration. Requires self-hosting and DevOps capacity.
Visualizes embedding spaces using UMAP dimensionality reduction, showing clusters of queries, retrieval results, and model outputs. Detects distribution drift between evaluation and production data, highlighting when new inputs diverge from training distribution.
Use Case:
Discovering that customer queries about a newly launched product create an embedding cluster far from your existing knowledge base, explaining poor retrieval quality.
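Phoenix implements this with UMAP projections and its own drift metrics. As a conceptual illustration only (not Phoenix's actual algorithm), drift detection can be reduced to measuring the distance between the centroids of two embedding sets; the vectors and threshold below are toy values:

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dims = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]

def centroid_drift(reference, production):
    """Euclidean distance between the centroids of two embedding sets."""
    a, b = centroid(reference), centroid(production)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy 2-D embeddings: the knowledge base clusters near the origin,
# while queries about a new product land far away.
knowledge_base = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.2]]
new_product_queries = [[3.0, 3.1], [2.9, 3.0], [3.1, 2.9]]

drift = centroid_drift(knowledge_base, new_product_queries)
print(drift > 1.0)  # arbitrary threshold for this toy example
```

Real embedding spaces have hundreds of dimensions and need more robust statistics than a single centroid distance, which is exactly the gap Phoenix's visual clustering fills.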
Captures hierarchical traces using the OpenInference specification — an open standard for LLM observability. Auto-instrumentation for LangChain, LlamaIndex, OpenAI, and other frameworks captures LLM calls, retriever spans, tool executions, and custom spans.
Use Case:
Auto-instrumenting a LlamaIndex RAG pipeline to capture every retrieval, reranking, and generation step without modifying application code.
Includes pre-built evaluation functions for hallucination detection (using citation verification), QA correctness, chunk relevance, toxicity, and summarization quality. Each evaluator is based on published research methodologies and can run locally.
Use Case:
Running hallucination detection on every production trace to calculate a daily hallucination rate and track it over time as you iterate on your system.
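Once each trace carries an evaluator label, the daily rate is plain aggregation. A minimal sketch, assuming traces have already been scored (the label strings and data shape are hypothetical, not Phoenix's schema):

```python
from collections import defaultdict
from datetime import date

def daily_hallucination_rate(scored_traces):
    """Per-day hallucination rate from (day, label) pairs.

    Labels 'hallucinated' / 'factual' are illustrative placeholders
    for whatever your evaluator emits.
    """
    counts = defaultdict(lambda: [0, 0])  # day -> [hallucinated, total]
    for day, label in scored_traces:
        counts[day][1] += 1
        if label == "hallucinated":
            counts[day][0] += 1
    return {day: bad / total for day, (bad, total) in counts.items()}

traces = [
    (date(2025, 6, 1), "factual"),
    (date(2025, 6, 1), "hallucinated"),
    (date(2025, 6, 2), "factual"),
    (date(2025, 6, 2), "factual"),
]
print(daily_hallucination_rate(traces))
```

In practice you would pull the labeled traces from Phoenix rather than build the pairs by hand; the point is that the tracked metric is a simple ratio you can chart and alert on.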
Create versioned datasets from production traces or manual uploads. Run experiments that compare different configurations (models, prompts, retrieval strategies) against the same dataset with statistical significance testing.
Use Case:
Comparing three different chunking strategies for your RAG pipeline by running each against a golden dataset of 200 queries and measuring retrieval precision.
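Phoenix's experiment tooling handles the bookkeeping, but the underlying comparison logic is easy to picture. A sketch using exact-match accuracy as a stand-in for a real evaluator (all names and data here are hypothetical):

```python
def evaluate_strategy(answers, golden):
    """Exact-match accuracy of one configuration against a golden dataset.

    answers: question -> produced answer for this configuration.
    golden:  question -> expected answer.
    """
    hits = sum(1 for q, expected in golden.items() if answers.get(q) == expected)
    return hits / len(golden)

golden = {"q1": "A", "q2": "B", "q3": "C"}

# One answer set per chunking strategy, run against the same questions.
runs = {
    "chunk_256":  {"q1": "A", "q2": "B", "q3": "X"},
    "chunk_512":  {"q1": "A", "q2": "B", "q3": "C"},
    "chunk_1024": {"q1": "A", "q2": "X", "q3": "X"},
}

scores = {name: evaluate_strategy(ans, golden) for name, ans in runs.items()}
best = max(scores, key=scores.get)
print(best, scores[best])
```

Holding the dataset fixed while varying one configuration at a time is what turns "I think this prompt is better" into a measurable comparison.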
Specialized metrics for RAG systems including NDCG, precision@k, recall@k, and MRR computed from trace data. Visualizes retrieval performance over time and identifies queries where retrieval consistently fails.
Use Case:
Identifying that retrieval precision drops below 50% for queries containing technical acronyms, indicating a need for query expansion.
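The metrics named above are standard information-retrieval formulas, which Phoenix computes from trace data. For reference, minimal binary-relevance implementations:

```python
import math

def precision_at_k(relevant, ranked, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def reciprocal_rank(relevant, ranked):
    """1/rank of the first relevant result; 0.0 if none retrieved.
    MRR is the mean of this value across queries."""
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / i
    return 0.0

def ndcg_at_k(relevant, ranked, k):
    """NDCG@k with binary relevance: DCG of the ranking divided by
    the DCG of an ideal ranking (all relevant docs first)."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, d in enumerate(ranked[:k], start=1) if d in relevant)
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

relevant = {"doc1", "doc4"}
ranked = ["doc3", "doc1", "doc2", "doc4", "doc5"]
print(precision_at_k(relevant, ranked, 3))  # 1 relevant in top 3
print(reciprocal_rank(relevant, ranked))    # first hit at rank 2 -> 0.5
```

Recall@k is the same top-k count divided by the total number of relevant documents instead of by k.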
Phoenix launches as a local server directly from Jupyter or Colab notebooks with px.launch_app(). All data stays local. The UI opens in-browser alongside your notebook for an integrated analysis workflow.
Use Case:
Running a quick investigation in a Jupyter notebook to understand why a specific category of user queries produces low-quality responses.
Pricing: the open-source version is free forever. For Arize Cloud (managed hosting), check the website for pricing or contact sales.
ML teams building RAG systems who need deep analytical visibility into retrieval quality, embedding distributions, and document relevance
Data scientists who want notebook-integrated LLM observability for iterative debugging and experimentation
Teams evaluating LLM application quality using research-grade evaluation methodologies with local data processing
Organizations needing to detect distribution drift between their evaluation datasets and actual production query patterns
No. Phoenix is Arize's open-source LLM observability tool that runs locally. The Arize platform is a separate commercial product for production ML monitoring. They share some concepts but Phoenix is standalone, free, and doesn't require an Arize account.
Phoenix can handle production workloads, but its local-first design means you need to set up persistent storage and infrastructure for team access. Arize offers a hosted version for production scale. Many teams use Phoenix locally for development/debugging and the Arize platform for production monitoring.
Phoenix is stronger in analytical depth — embedding visualization, drift detection, and ML-informed evaluation. Langfuse is stronger in operational workflows — prompt management, team collaboration, and production deployment. Phoenix is the better debugging and analysis tool; Langfuse is the better team platform.
Yes, the tracing and evaluation features work for any LLM application. However, Phoenix's most differentiated features — embedding visualization, retrieval metrics, drift detection — are specifically designed for RAG and retrieval-heavy applications. For pure chatbot monitoring, other tools may offer more relevant features.
Kubernetes Helm chart deployment support added in mid-2025 for in-cluster AI observability. Active development continues with regular releases on GitHub.
People who use this tool also find these helpful
AI observability platform with Loop agent that automatically generates better prompts, scorers, and datasets to optimize LLM applications in production.
Enterprise-grade monitoring for AI agents and LLM applications built on Datadog's infrastructure platform. Provides end-to-end tracing, cost tracking, quality evaluations, and security detection across multi-agent workflows.
API gateway and observability layer for LLM usage analytics.
LLMOps platform for prompt engineering, evaluation, and optimization with collaborative workflows for AI product development teams.
Open-source LLM engineering platform for traces, prompts, and metrics.
Tracing, evaluation, and observability for LLM apps and agents.
See how Arize Phoenix compares to CrewAI and other alternatives.
AI Agent Builders
CrewAI is an open-source Python framework for orchestrating autonomous AI agents that collaborate as a team to accomplish complex tasks. You define agents with specific roles, goals, and tools, then organize them into crews with defined workflows. Agents can delegate work to each other, share context, and execute multi-step processes like market research, content creation, or data analysis. CrewAI supports sequential and parallel task execution, integrates with popular LLMs, and provides memory systems for agent learning. It's one of the most popular multi-agent frameworks with a large community and extensive documentation.
Agent Frameworks
Open-source multi-agent framework from Microsoft Research with asynchronous architecture, AutoGen Studio GUI, and OpenTelemetry observability. Now part of the unified Microsoft Agent Framework alongside Semantic Kernel.
AI Agent Builders
Graph-based stateful orchestration runtime for agent loops.
AI Agent Builders
SDK for building AI agents with planners, memory, and connectors.