Open-source AI observability and evaluation platform built on OpenTelemetry for tracing, debugging, and monitoring LLM applications and AI agents in production.
Phoenix by Arize is an open-source AI observability platform designed for experimentation, evaluation, and troubleshooting of LLM applications and AI agents. Built on OpenTelemetry standards, Phoenix provides detailed tracing that reveals where LLM workflows break — including issues with retrieval, tool execution, and hallucination.
The platform offers both real-time monitoring and offline evaluation capabilities. Phoenix automatically captures traces from 20+ popular frameworks including LangChain, LlamaIndex, OpenAI, and Anthropic SDKs, providing detailed visibility into agent execution flows, token usage, latency, and failure patterns. The tracing system supports complex multi-agent workflows with agent tracing graphs and dependency mapping across interactions.
Phoenix's evaluation engine supports multiple scoring methods: LLM-as-a-judge evaluators, code-based checks, and human annotation labels. Pre-built evaluators handle hallucination detection, relevance scoring, toxicity assessment, and custom business metrics. The platform supports both automated evaluation during development and continuous online evaluation in production, with alerts for performance degradation or safety violations.
The experiment playground enables rapid prompt iteration — replay traced LLM calls, compare different prompt variations side-by-side, adjust parameters, and measure impact with statistical rigor. This tight feedback loop between production traces and experimentation accelerates prompt optimization.
Token and cost tracking covers 100+ models across providers, giving teams visibility into spend per agent, per workflow, and per model. This financial observability helps optimize model selection and identify cost reduction opportunities.
Phoenix is available in two forms: the open-source library (free, self-hosted) and Arize AX (managed cloud platform). The open-source Phoenix library provides full observability features with self-managed infrastructure. Arize AX adds managed hosting, online evaluations, the Alyx AI assistant for trace debugging, product observability (monitors and custom metrics), and enterprise security features including SOC 2 Type II and HIPAA compliance.
Originally announced at Arize AI's Observe 2023 summit to address LLM hallucination detection, Phoenix has evolved into a comprehensive observability solution adopted by teams at companies like Handshake, GetYourGuide, and hundreds of others building production AI applications.
Phoenix is the strongest open-source option for AI observability — the OpenTelemetry foundation ensures interoperability, the multi-method evaluation engine covers diverse quality needs, and the experiment playground closes the loop between production monitoring and prompt improvement. The free self-hosted option removes cost barriers for getting started. Arize AX adds managed infrastructure and enterprise features for teams that need compliance, online evaluations, and the Alyx AI assistant.
Automatic trace collection from 20+ frameworks (LangChain, LlamaIndex, OpenAI, Anthropic) with agent tracing graphs, multi-agent workflow visualization, and span-level detail on every LLM call, tool invocation, and retrieval step.
Use Case:
Debugging a multi-agent customer service system by tracing exactly which agent handled a query, what retrieval documents were used, which tool calls were made, and where the response quality degraded.
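The auto-instrumentation described above is typically a few lines of setup. The following is a minimal sketch assuming the `arize-phoenix` and `openinference-instrumentation-openai` packages are installed and a Phoenix collector is running locally; the project name is illustrative, and exact module paths can vary by version, so consult the Phoenix docs for your release.

```python
# Sketch: point an OpenTelemetry tracer provider at a local Phoenix server,
# then patch the OpenAI SDK so every call emits a span automatically.
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

# "customer-service-agents" is a hypothetical project name.
tracer_provider = register(project_name="customer-service-agents")

# After this, each chat/completion call is traced (model, prompt,
# token counts, latency) with no further changes to application code.
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
```

This is setup configuration rather than application logic; once registered, framework calls made anywhere in the process are captured.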
Score traces and spans using LLM-based evaluators (hallucination, relevance, toxicity), code-based checks (regex, assertions), or human annotation labels. Supports both offline batch evaluation and continuous online evaluation in production.
Use Case:
Running hallucination detection on every production response while maintaining a human labeling queue for edge cases, creating a continuous quality improvement loop.
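Code-based checks are the cheapest tier of the evaluation stack described above. The sketch below is a toy, self-contained example of that idea, not Phoenix's evaluator API: a few deterministic assertions run against a response string, returning a pass/fail verdict per check.

```python
import re

def code_based_eval(response: str) -> dict:
    """Hypothetical code-based checks: cheap, deterministic assertions
    of the kind run alongside LLM-as-a-judge evaluators."""
    checks = {
        # Fail if an internal placeholder leaked into the answer.
        "no_placeholder": "[REDACTED]" not in response,
        # Require at least one citation marker like [1].
        "has_citation": bool(re.search(r"\[\d+\]", response)),
        # Keep answers within a rough length budget.
        "within_length": len(response) <= 2000,
    }
    return {"passed": all(checks.values()), "checks": checks}

result = code_based_eval("Refunds are processed within 5-7 days [1].")
```

Because these checks are pure code, they can run on every production span at negligible cost, reserving LLM-based evaluators for the judgments regexes cannot make.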
Replay traced LLM calls with different prompts, models, or parameters. Compare results side-by-side with evaluation scoring. Iterate rapidly on prompt engineering without deploying changes to production.
Use Case:
Taking a poorly-performing production trace, replaying it with three different prompt variations, scoring each with relevance and accuracy evaluators, and deploying the winner.
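The replay-and-compare loop the playground automates can be sketched in plain Python. Everything here is a stand-in: `call_model` is a stub returning canned responses instead of re-invoking a traced model, and the scorer is a toy keyword-overlap relevance metric rather than Phoenix's built-in evaluators.

```python
def call_model(prompt_variant: str, query: str) -> str:
    # Hypothetical stub; a real replay would re-invoke the traced LLM call
    # with the new prompt variant.
    canned = {
        "terse": "Reset via settings.",
        "steps": "Open Settings, choose Account, then click Reset Password.",
        "apology": "Sorry to hear that! Passwords are tricky.",
    }
    return canned[prompt_variant]

def relevance_score(response: str, expected_keywords: set[str]) -> float:
    # Toy metric: fraction of expected keywords present in the response.
    words = {w.strip(".,!").lower() for w in response.split()}
    return len(words & expected_keywords) / len(expected_keywords)

query = "How do I reset my password?"
expected = {"settings", "account", "reset", "password"}

# Score each prompt variation against the same traced query.
scores = {
    variant: relevance_score(call_model(variant, query), expected)
    for variant in ("terse", "steps", "apology")
}
winner = max(scores, key=scores.get)  # best-scoring variant ships
```

The value of the real playground is that the replayed inputs come from actual production traces, so the comparison reflects real traffic rather than hand-written test cases.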
Track token usage and costs across 100+ models from all major providers. Attribute costs to specific agents, workflows, and traces for financial visibility and optimization.
Use Case:
Identifying that a sales agent's summarization step consumes 60% of total token budget, then testing a smaller model for that specific step to reduce costs while maintaining quality.
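The cost-attribution arithmetic behind findings like the one above is straightforward once token counts are on each span. This sketch uses invented per-token prices and span data purely for illustration; Phoenix's cost tracking does this aggregation from real traces and real provider rates.

```python
# Hypothetical per-1K-token prices in USD, not real provider rates.
PRICE_PER_1K = {"big-model": 0.03, "small-model": 0.002}

# (agent, step, model, tokens) tuples as they might be read off spans.
spans = [
    ("sales-agent", "summarize", "big-model", 60_000),
    ("sales-agent", "draft-email", "big-model", 25_000),
    ("support-agent", "classify", "small-model", 15_000),
]

def cost_by(key_index: int) -> dict[str, float]:
    """Aggregate span cost by agent (index 0) or step (index 1)."""
    totals: dict[str, float] = {}
    for span in spans:
        key, model, tokens = span[key_index], span[2], span[3]
        totals[key] = totals.get(key, 0.0) + tokens / 1000 * PRICE_PER_1K[model]
    return totals

per_step = cost_by(1)   # spend per workflow step
per_agent = cost_by(0)  # spend per agent
```

With per-step totals in hand, the optimization question becomes concrete: here the summarize step dominates the sales agent's spend, so it is the natural candidate for a cheaper model.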
Built-in evaluators that detect when LLM responses contain fabricated information, are irrelevant to the query, or violate quality thresholds — with automatic flagging and alerting.
Use Case:
Monitoring a medical information chatbot for factual accuracy, automatically flagging responses where the model generates unsupported claims, and routing flagged interactions for human review.
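The flag-and-route flow in that use case can be illustrated with a deliberately naive groundedness check: flag a response when too few of its content words appear in the retrieved context. This is a toy stand-in for Phoenix's LLM-based hallucination evaluator, useful only to show the shape of the routing logic.

```python
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "in", "and"}

def support_ratio(response: str, context: str) -> float:
    """Fraction of the response's content words found in the context."""
    resp = {w.lower().strip(".,") for w in response.split()} - STOPWORDS
    ctx = {w.lower().strip(".,") for w in context.split()}
    return len(resp & ctx) / max(len(resp), 1)

def needs_human_review(response: str, context: str,
                       threshold: float = 0.6) -> bool:
    # True -> route to the human review queue instead of auto-approving.
    return support_ratio(response, context) < threshold

context = "Ibuprofen is an NSAID used to reduce fever and treat pain."
grounded = "Ibuprofen is an NSAID used to treat pain."
fabricated = "Ibuprofen cures bacterial infections overnight."
```

A word-overlap heuristic like this misses paraphrase and entailment, which is exactly why production systems layer an LLM judge on top; the routing structure, though, stays the same.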
Arize's built-in AI agent for trace debugging and analysis. Alyx can explain span context, debug traces, create dashboards and widgets, optimize prompts, and search traces using natural language.
Use Case:
Asking Alyx 'Why did response quality drop for support queries last Tuesday?' and getting an analysis of trace patterns, evaluation scores, and potential root causes.
Pricing:
- $0 (open-source, self-hosted): Teams wanting full AI observability without vendor lock-in who can manage their own infrastructure
- $0: Individuals and startups getting started with production AI observability
- $50/month: Small teams and startups needing higher volume and longer retention for production applications
- Custom: Enterprise organizations requiring compliance, custom retention, and self-hosted or multi-region deployments
Phoenix expanded its evaluation capabilities with online evaluations (session evals and agent path evals), introduced the Alyx AI assistant for natural-language trace debugging, added agent tracing graphs for multi-agent visualization, and launched AX Pro at $50/month with 50K included spans. The platform now supports 100+ model cost tracking and multi-region data residency for enterprise deployments.
Related tools in Analytics & Monitoring:
- LangSmith: trace, analyze, and evaluate LLM applications and agents with deep observability into every model call, chain step, and tool invocation.
- A leading open-source LLM observability platform for production AI applications: comprehensive tracing, prompt management, evaluation frameworks, and cost optimization with enterprise security (SOC2, ISO27001, HIPAA), self-hostable with full feature parity.
- An open-source LLM observability platform and API gateway providing cost analytics, request logging, caching, and rate limiting through a simple proxy-based integration requiring only a base URL change.