Open-source LLM evaluation framework with 50+ research-backed metrics including hallucination detection, tool use correctness, and conversational quality. Pytest-style testing for AI agents with CI/CD integration.
A testing framework for AI applications — write tests that check if your AI's responses are accurate, helpful, and safe, just like writing unit tests for code.
DeepEval is an open-source evaluation framework designed for comprehensive testing of LLM applications and AI agents. It provides over 50 research-backed metrics that cover the full spectrum of agent quality assessment, from basic response relevancy to complex multi-turn conversational coherence and tool use correctness. The framework is designed to work like pytest for LLMs — familiar, fast, and easy to integrate into existing development workflows.
The metric suite includes hallucination detection, answer relevancy, faithfulness, contextual precision and recall (for RAG), tool correctness (for agent tool use), conversational relevancy, knowledge retention, bias detection, toxicity scoring, and more. Each metric is backed by academic research and validated against human judgment benchmarks, ensuring scores are meaningful and actionable.
DeepEval's approach to agent testing is particularly strong. The tool correctness metric evaluates whether agents call the right tools with correct parameters, essential for validating agent behavior. Conversational metrics assess multi-turn interactions for coherence, topic adherence, and knowledge retention across conversation turns.
The framework supports synthetic test data generation using an LLM to create diverse test cases from your documents, reducing the manual effort of building evaluation datasets. A built-in red-teaming module generates adversarial inputs to test agent robustness against prompt injection, bias, and toxicity.
DeepEval integrates with pytest, enabling LLM tests alongside unit tests in CI/CD pipelines. Tests can gate deployments — if quality scores drop below defined thresholds, the build fails. This prevents bad prompts and regressions from reaching production.
The Confident AI cloud platform layers on top of DeepEval, adding collaboration features, dataset management, LLM tracing with full context (inputs, outputs, tool calls, latency, token cost), real-time monitoring, performance alerting, and dashboards. Confident AI pricing starts at $19.99/user/month for Starter, $49.99/user/month for Premium, with Team and Enterprise plans offering custom pricing with self-hosted deployment, SOC 2 compliance, SSO, and HIPAA support.
Was this helpful?
DeepEval is the most comprehensive open-source LLM evaluation framework available, with 50+ metrics covering everything from basic relevancy to complex agent tool use and adversarial robustness. The pytest-style integration makes it natural for development teams already using Python testing workflows. Free for self-hosted use with Confident AI cloud providing collaboration and monitoring from $19.99/month. Best for teams needing rigorous, automated quality testing for AI applications.
Comprehensive metric suite covering hallucination detection, answer relevancy, faithfulness, contextual precision/recall, tool correctness, conversational coherence, knowledge retention, bias, toxicity, and more — each validated against human judgment benchmarks.
Use Case:
Running a full quality audit on a customer support chatbot using hallucination, relevancy, and faithfulness metrics to catch responses that fabricate information or drift from the knowledge base.
Tool correctness metric specifically evaluates whether AI agents call the right tools with correct parameters and in the right sequence — essential for validating agent behavior in production.
Use Case:
Testing an e-commerce agent to verify it correctly calls the inventory API before the order API, passes valid product IDs, and handles out-of-stock scenarios without hallucinating availability.
Write LLM tests using familiar pytest patterns with decorators and assertions. Tests run alongside unit tests in existing CI/CD pipelines. Failed quality thresholds block deployments automatically.
Use Case:
Adding DeepEval tests to a GitHub Actions pipeline that runs on every pull request — if hallucination scores exceed 10% or relevancy drops below 0.85, the PR can't merge.
Generate diverse test datasets from your documents using LLMs. Creates edge cases, adversarial inputs, and comprehensive test coverage without manual data curation.
Use Case:
Generating 500 test questions from a product documentation corpus, including paraphrases, multi-hop questions, and out-of-scope queries to stress-test a RAG chatbot.
Automated adversarial testing that generates prompt injection attempts, bias probes, toxicity triggers, and jailbreak prompts to test agent robustness before deployment.
Use Case:
Running red-team evaluations against a customer-facing agent to verify it resists prompt injection, doesn't generate biased responses, and handles toxic inputs gracefully.
Cloud platform layering on DeepEval with LLM tracing (full context: inputs, outputs, tool calls, latency, token costs), real-time monitoring, performance alerting, collaborative dataset management, prompt versioning, and dashboards. Available as SaaS or self-hosted.
Use Case:
Monitoring a production RAG system's quality in real-time — receiving alerts when hallucination rates spike, drilling into individual traces to identify root causes, and tracking quality trends across model versions.
Free
forever
Free
month
$19.99/per user/month
per user/month
$49.99/per user/month
per user/month
Custom pricing for teams
Custom pricing for enterprise
Ready to get started with DeepEval?
View Pricing Options →Integrating automated LLM evaluation into CI/CD pipelines using pytest — blocking deployments when hallucination, relevancy, or faithfulness scores drop below defined thresholds
Testing AI agents to verify they call the correct tools with proper parameters in the right sequence — catching tool misuse, incorrect API calls, and parameter errors before production
Running automated adversarial testing against customer-facing AI systems to identify vulnerabilities to prompt injection, bias amplification, and toxic output generation
Evaluating retrieval-augmented generation systems with contextual precision, recall, and faithfulness metrics to ensure answers stay grounded in retrieved documents
Monitoring production LLM application quality in real-time with tracing, alerting, and dashboards — identifying quality regressions and cost anomalies across model versions
DeepEval works with these platforms and services:
We believe in transparent reviews. Here's what DeepEval doesn't handle well:
Weekly insights on the latest AI tools, features, and trends delivered to your inbox.
DeepEval expanded to 50+ evaluation metrics (from 14+ in 2024), including enhanced agent tool use evaluation and conversational metrics. Confident AI platform added LLM tracing at $1/GB-month, no-code evaluation workflows, auto-dataset curation from traces, real-time alerting, and self-hosted deployment. Y Combinator backed. SOC 2 compliance added for Team and Enterprise tiers.
AI Memory & Search
Open-source framework for evaluating RAG pipelines and AI agents with automated metrics for faithfulness, relevancy, and context quality.
LLM Observability
AI observability platform for evals, production tracing, prompt management, and regression detection.
AI Observability
LangSmith is LangChain's commercial observability, evaluation and prompt management platform for LLM apps and agents in production.
AI Observability
Phoenix is Arize's open-source LLM observability project, and it has quietly become the default way tens of thousands of teams see what their agents are actually doing in production. The pitch is simple: `pip install arize-phoenix`, instrument with OpenInference (or any OpenTelemetry-compatible library), and every LLM call, tool invocation, retrieval, and embedding shows up as a spanned timeline you can filter, search, and replay. No vendor account required, no proprietary SDK lock-in. The Open
No reviews yet. Be the first to share your experience!
Get started with DeepEval and see if it's the right fit for your needs.
Get Started →Take our 60-second quiz to get personalized tool recommendations
Find Your Perfect AI Stack →Explore 20 ready-to-deploy AI agent templates for sales, support, dev, research, and operations.
Browse Agent Templates →