Open-source framework for systematically testing and evaluating prompts, models, and AI agent behaviors, with automated red-teaming built in.
Test your AI prompts systematically — run hundreds of test cases to find the best prompt before going live.
Promptfoo is an open-source testing and evaluation framework designed to help developers systematically test LLM applications, prompts, and AI agent behaviors. It provides a CLI-driven workflow for defining test cases, running evaluations across multiple models and prompt variants, and comparing results with automated scoring, which is essential for building reliable AI agents that behave predictably in production.
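For orientation, a minimal `promptfooconfig.yaml` might look like the sketch below; the prompt text, model id, and test cases are illustrative assumptions, not taken from this page:

```yaml
# promptfooconfig.yaml -- a minimal sketch: two prompt variants, one provider,
# and test cases with pass/fail assertions (all values here are illustrative)
prompts:
  - "Summarize in one sentence: {{text}}"
  - "You are a concise technical editor. Summarize: {{text}}"

providers:
  - openai:gpt-4o-mini   # example model id; any supported provider works

tests:
  - vars:
      text: "Promptfoo is an open-source framework for evaluating LLM prompts."
    assert:
      - type: icontains      # case-insensitive substring check
        value: "promptfoo"
  - vars:
      text: "It runs test cases across models and scores the outputs."
    assert:
      - type: javascript     # custom pass/fail logic as an inline expression
        value: output.length < 300
```

Running `npx promptfoo@latest eval` then executes every prompt, provider, and test combination, and `npx promptfoo@latest view` opens the results UI.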
The framework supports a wide range of assertion types including exact matching, semantic similarity, model-graded evaluations, and custom JavaScript/Python assertions. Developers can test across multiple LLM providers simultaneously, comparing how different models handle the same prompts and scenarios. This is particularly valuable for agent development where choosing the right model for each task is critical.
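A sketch of how those assertion types combine in a single test case, with several providers evaluated side by side. The model ids, the `threshold` value, and the inline assertion snippets are assumptions for illustration; current provider identifiers should be checked against the promptfoo docs:

```yaml
# One test case graded four ways across three providers (ids are examples)
providers:
  - openai:gpt-4o-mini
  - anthropic:messages:claude-3-5-sonnet-20241022
  - ollama:chat:llama3            # local model served by Ollama

prompts:
  - "Answer with only the city name: {{question}}"

tests:
  - vars:
      question: "In which city is the Eiffel Tower?"
    assert:
      - type: equals              # exact match
        value: "Paris"
      - type: similar             # semantic similarity against a reference
        value: "Paris, France"
        threshold: 0.8
      - type: llm-rubric          # model-graded evaluation
        value: "Names exactly one city with no extra commentary"
      - type: python              # custom Python assertion, inline expression
        value: "'paris' in output.lower()"
```

Because every provider runs the same assertions, the results grid shows directly which model passes which checks, which is the comparison that matters when picking a model per task.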
Promptfoo's automated red-teaming capability is a standout feature for agent security. It can automatically generate adversarial inputs to test agent robustness against prompt injection, jailbreaking, data exfiltration, and other attack vectors. This helps developers identify and fix agent vulnerabilities before deployment.
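The red-teaming workflow is driven by the same config file. The sketch below shows the general shape; the specific plugin and strategy names are assumptions chosen to illustrate the idea, and the current catalog should be verified in the promptfoo docs:

```yaml
# Red-team sketch: promptfoo generates adversarial probes against the target
# (plugin/strategy names below are illustrative; verify against current docs)
targets:
  - openai:gpt-4o-mini            # the system under test

redteam:
  purpose: "Customer support agent for an online bookstore"
  numTests: 10                    # adversarial cases to generate per plugin
  plugins:
    - pii                         # probes for personal-data leakage
    - excessive-agency            # checks for actions beyond the agent's remit
  strategies:
    - jailbreak                   # iterative jailbreak attempts
    - prompt-injection            # wraps each probe in injection payloads
```

A command along the lines of `npx promptfoo@latest redteam run` generates the probes, runs them against the target, and reports which attacks got through.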
The framework integrates with CI/CD pipelines, enabling automated testing of agent behaviors on every code change. Results are displayed in an interactive web UI that makes it easy to compare outputs, identify regressions, and track improvements over time. Promptfoo supports all major LLM providers, including OpenAI, Anthropic, Google, AWS Bedrock, and local models via Ollama. With its focus on practical testing workflows, Promptfoo has become one of the most popular open-source tools for LLM evaluation.
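As an illustration of the CI hookup, a GitHub Actions job can run the evaluation on every push. The workflow below is a sketch assuming a Node runner, an OpenAI provider, and a config file at the repository root:

```yaml
# .github/workflows/evals.yml -- run promptfoo on every push (a sketch)
name: llm-evals
on: [push, pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - name: Run prompt evaluations
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}   # provider credential
        run: npx promptfoo@latest eval --config promptfooconfig.yaml
```

Promptfoo also publishes an official GitHub Action that wraps this pattern and surfaces eval results on pull requests.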
Related tools:

- Voice Agents: AI observability platform with a Loop agent that automatically generates better prompts, scorers, and datasets from production data. Free tier available; Pro at $25/seat/month.
- Analytics & Monitoring: LangSmith lets you trace, analyze, and evaluate LLM applications and agents with deep observability into every model call, chain step, and tool invocation.
- Analytics & Monitoring: Former LLMOps platform for prompt engineering and evaluation, acquired by Anthropic in August 2025; its technology is now integrated into the Anthropic Console as the Workbench and Evaluations features.
- Testing & Quality: DeepEval, an open-source LLM evaluation framework with 50+ research-backed metrics including hallucination detection, tool-use correctness, and conversational quality, offering pytest-style testing for AI agents with CI/CD integration.