Open-source LLM observability and evaluation platform built on OpenTelemetry. Self-host for free with comprehensive tracing, experimentation, and quality assessment for AI applications.
Arize Phoenix is a free, open-source LLM observability and evaluation platform in the Analytics & Monitoring category. Built on OpenTelemetry standards, it is designed for engineering teams that need comprehensive tracing, experimentation, and quality assessment for AI applications without vendor lock-in or per-trace fees.
With over 18,000 GitHub stars and millions of monthly PyPI downloads, Phoenix has established itself as one of the most widely adopted open-source tools for monitoring and debugging large language model applications. It provides end-to-end visibility into LLM calls, retrieval-augmented generation pipelines, multi-agent workflows, and tool invocations through standardized OpenTelemetry and OpenInference semantic conventions.
Phoenix addresses a critical gap in the AI development lifecycle: the need for deep, actionable observability that works across any LLM provider, framework, or deployment environment. Unlike proprietary alternatives that lock teams into specific ecosystems, Phoenix gives developers complete freedom to self-host their observability infrastructure, retain full ownership of their data, and integrate with existing OpenTelemetry-compatible backends such as Jaeger, Grafana Tempo, and Datadog.
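Because instrumentation emits standard OTLP, the same OpenInference spans can be routed to a non-Phoenix backend. Here is a minimal sketch, assuming a local Jaeger or Tempo collector listening on gRPC port 4317 and the openinference-instrumentation-openai package installed:

```python
# Route OpenInference spans to any OTLP-compatible collector instead of
# (or alongside) Phoenix; the endpoint below is an assumption.
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from openinference.instrumentation.openai import OpenAIInstrumentor

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)

# Instrument the openai SDK against this provider; spans now flow to the
# configured backend using standard OpenTelemetry semantics.
OpenAIInstrumentor().instrument(tracer_provider=provider)
```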
The platform's evaluation framework is one of its strongest differentiators. Teams can define custom LLM-as-a-judge evaluators for hallucination detection, QA correctness, relevance scoring, toxicity filtering, summarization quality, and RAG-specific metrics like retrieval relevance and context precision. These evaluators can run in batch against versioned datasets or be applied in real time to production traces, enabling continuous quality monitoring and regression testing.
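A minimal sketch of running the built-in hallucination evaluator in batch with phoenix.evals; the sample rows and the choice of judge model (gpt-4o-mini) are assumptions for illustration:

```python
import pandas as pd
from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# Each row pairs the user query and retrieved context with the model's answer
# (hypothetical sample data).
df = pd.DataFrame(
    {
        "input": ["What is the refund window?"],
        "reference": ["Refunds are accepted within 30 days of purchase."],
        "output": ["You can request a refund within 30 days."],
    }
)

results = llm_classify(
    dataframe=df,
    model=OpenAIModel(model="gpt-4o-mini"),
    template=HALLUCINATION_PROMPT_TEMPLATE,
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),  # "factual" / "hallucinated"
)
print(results["label"].value_counts())
```

The same evaluator output can be logged back to Phoenix as span annotations, which is what enables the continuous production monitoring described above.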
Experiment management in Phoenix allows teams to systematically compare different prompts, models, retrieval strategies, and system configurations. Each experiment runs against a curated dataset and produces side-by-side results with statistical summaries, making it straightforward to quantify improvements before shipping changes to production. Combined with the annotation and dataset curation workflows, this creates a complete feedback loop from production data to improved model performance.
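A hedged sketch of that workflow with phoenix.experiments: the dataset rows, the canned task, and the substring evaluator are placeholders, and the parameter-binding-by-name convention (input, output, expected) reflects the documented API at the time of writing:

```python
import pandas as pd
import phoenix as px
from phoenix.experiments import run_experiment

# Upload a tiny golden dataset (placeholder rows) to a running Phoenix server.
dataset = px.Client().upload_dataset(
    dataset_name="qa-golden-set",
    dataframe=pd.DataFrame(
        {
            "question": ["What is OpenTelemetry?"],
            "answer": ["An open standard for traces, metrics, and logs."],
        }
    ),
    input_keys=["question"],
    output_keys=["answer"],
)

def task(input):
    # Stand-in for the system under test (prompt + model + retrieval).
    return "OpenTelemetry is an open standard for traces, metrics, and logs."

def contains_answer(output, expected) -> bool:
    # Toy evaluator: does the output overlap with the golden answer?
    return expected["answer"].lower() in output.lower()

experiment = run_experiment(dataset, task, evaluators=[contains_answer])
```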
Phoenix supports auto-instrumentation for all major LLM frameworks including LangChain, LlamaIndex, CrewAI, Haystack, DSPy, AutoGen, Semantic Kernel, and LiteLLM, as well as direct SDK support for OpenAI, Anthropic, Google Vertex and Gemini, AWS Bedrock, Mistral, and Cohere. Adding tracing to an existing application typically requires just a few lines of instrumentation code.
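For an OpenAI-based app, those few lines look roughly like the following sketch, assuming the arize-phoenix and openinference-instrumentation-openai packages are installed and a Phoenix collector is running on its default local port:

```python
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

# Point OTLP export at the local Phoenix collector and name the project.
tracer_provider = register(project_name="my-llm-app")
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

# Every openai call below now emits OpenInference spans automatically.
from openai import OpenAI

client = OpenAI()
client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello, Phoenix!"}],
)
```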
Deployment options range from a simple pip install for local development to production-grade Kubernetes deployments with Helm charts and PostgreSQL-backed persistent storage. The managed Phoenix Cloud service offers the same capabilities without infrastructure overhead, starting with a generous free tier for individuals and small teams. For enterprises requiring SSO, RBAC, audit logging, and SLAs, the commercial Arize AX platform extends Phoenix with enterprise-grade features and dedicated support.
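At the simplest end of that range, a local instance is two lines after `pip install arize-phoenix`:

```python
import phoenix as px

# Start the Phoenix server and UI in-process (http://localhost:6006 by default).
session = px.launch_app()
print(session.url)
```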
The platform's embedding analysis capabilities, including UMAP and PCA projections, enable teams to visualize semantic drift, identify clustering patterns in retrieval results, and diagnose underperforming document segments in RAG pipelines. This visual approach to debugging is particularly valuable for teams working with large document collections where statistical metrics alone may not reveal the root cause of quality issues.
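A minimal sketch of loading a corpus for embedding visualization; the toy documents and random 16-dimensional vectors stand in for real model embeddings:

```python
import numpy as np
import pandas as pd
import phoenix as px

# Toy corpus: three short documents with random vectors in place of
# real embeddings.
df = pd.DataFrame(
    {
        "text": ["refund policy", "shipping times", "warranty terms"],
        "embedding": [np.random.rand(16).astype(np.float32) for _ in range(3)],
    }
)

schema = px.Schema(
    embedding_feature_column_names={
        "doc_embedding": px.EmbeddingColumnNames(
            vector_column_name="embedding",
            raw_data_column_name="text",
        )
    }
)

# Launch the UI with the corpus as the primary inferences set; the embeddings
# view projects the vectors for visual cluster and drift inspection.
px.launch_app(primary=px.Inferences(dataframe=df, schema=schema, name="docs"))
```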
Phoenix is developed and maintained by Arize AI, a well-funded company with a dedicated engineering team that ships frequent releases and maintains active community support through GitHub, Slack, and comprehensive documentation. The project is licensed under the Elastic License 2.0, which permits free use for all purposes except offering Phoenix as a competing managed service.
Leading open-source LLM observability platform offering comprehensive tracing, evaluation, and experimentation without vendor lock-in. Ideal for teams with DevOps capacity who need deep analytical insights into LLM application behavior, RAG pipeline quality, and multi-agent workflow debugging. Phoenix stands out for its OpenTelemetry foundation, which ensures trace portability and avoids ecosystem lock-in, and its robust evaluation framework that supports both automated LLM-as-a-judge scoring and human annotation workflows. The self-hosted model with zero licensing costs makes it particularly attractive for regulated industries and cost-conscious teams, though the operational overhead of managing infrastructure and the steeper learning curve compared to polished SaaS alternatives like LangSmith should be weighed against these benefits. With over 18,000 GitHub stars and strong backing from Arize AI, the project demonstrates sustained momentum and community adoption.
Pricing:
- Self-hosted (open source): Free
- Phoenix Cloud: Free tier available; usage-based beyond limits
- Arize AX (enterprise): Custom / contact sales
Through late 2025 and into 2026, Phoenix has shipped a steady stream of updates:

- Agent-focused tracing has expanded with deeper support for LangGraph, CrewAI, and AutoGen, including visualizations for multi-agent coordination and tool-call sequence inspection.
- The evaluation framework gained new built-in evaluators for code generation quality, multi-turn conversation coherence, and structured output validation.
- Session and thread-based tracing now provides better visibility into conversational AI applications, grouping related interactions and tracking context evolution across turns (see the sketch after this list).
- The prompt playground gained multi-model comparison, letting teams test prompts against several providers simultaneously and feed results directly into experiments.
- Guardrails integration enables teams to define and monitor safety boundaries alongside performance metrics.
- The annotation workflow has been streamlined with bulk labeling tools, inter-annotator agreement metrics, and API-driven integration with external labeling platforms.
- Infrastructure improvements include faster trace ingestion, improved query performance for large datasets, and better support for high-cardinality span attributes in production environments.
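As a hedged illustration of session-based tracing, the openinference-instrumentation package exposes a `using_session` context manager that stamps spans with a shared session ID so Phoenix can group conversation turns; the session ID and the stubbed LLM call below are hypothetical:

```python
from openinference.instrumentation import using_session

def call_llm(message: str) -> str:
    # Placeholder for an instrumented LLM call (e.g. an openai client request).
    return f"echo: {message}"

def handle_turn(session_id: str, user_message: str) -> str:
    # Spans created inside this context carry session.id, so Phoenix groups
    # them into a single conversational thread.
    with using_session(session_id=session_id):
        return call_llm(user_message)

print(handle_turn("session-123", "Hi there"))
```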