Stay free if you only need the core evaluation features: fluent assertions, stochastic evaluation, and model comparison. Upgrade only if you need the optional add-ons planned on top of the MIT-licensed core, which are not yet available and still in the planning phase. Most solo builders can start free.
Why it matters:
- .NET-only: Python, JavaScript, and Go teams cannot use it and must rely on DeepEval, PromptFoo, or LangSmith instead.
- Red Team coverage spans 6 of the OWASP LLM Top 10 categories (60%), leaving 4 categories uncovered compared to specialized security scanners.
- Commercial/Enterprise add-ons are still in the planning phase, so enterprises requiring vendor SLAs and paid support have no tier to purchase yet.
- A small community relative to Python-era evaluation tools means fewer third-party integrations, tutorials, and Stack Overflow answers.
No. AgentEval is built exclusively for .NET and ships on NuGet (nuget.org/packages/AgentEval). Python teams should use DeepEval, PromptFoo, or LangSmith for equivalent AI agent evaluation capabilities. Based on our analysis of 870+ AI tools, AgentEval is one of the few mature agent evaluation frameworks targeting the Microsoft/.NET ecosystem specifically, which is precisely its positioning.
Yes. Any .NET agent that implements IChatClient can be tested via the IChatClient.AsEvaluableAgent() one-liner extension method. A Semantic Kernel bridge is also included for SK-based agents. This cross-framework design means you are not locked into MAF, though MAF is where the deepest integration exists with automatic tool call tracking and token/cost telemetry.
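A minimal sketch of that wiring. Only `IChatClient` (from `Microsoft.Extensions.AI`) and the `AsEvaluableAgent()` extension are taken from the text above; the `AgentEval` namespace, the `CreateYourChatClient()` placeholder, and the commented-out evaluation call are illustrative assumptions, not confirmed API.

```csharp
using Microsoft.Extensions.AI;   // defines IChatClient
using AgentEval;                 // assumed root namespace; not confirmed by the source

// `client` can be any IChatClient implementation: OpenAI, Azure, Ollama, or a
// custom wrapper. CreateYourChatClient() is an app-specific placeholder.
IChatClient client = CreateYourChatClient();

// The one-liner bridge described above -- no MAF dependency required,
// though tool-call tracking and cost telemetry are deepest under MAF.
var agent = client.AsEvaluableAgent();

// Downstream usage is a guess at shape only; scenario/assertion names are
// not documented in the source:
// var result = await agent.EvaluateAsync(myScenario);
```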
DeepEval and RAGAS are Python frameworks with larger communities and broader metric catalogs. AgentEval is their .NET counterpart, offering equivalent coverage for RAG metrics (Faithfulness, Relevance, Context Precision/Recall), plus unique additions like the 192-probe Red Team module and fluent tool-chain assertions. Choose based on language ecosystem — AgentEval for C#/.NET shops, DeepEval/RAGAS for Python. All three are open source.
It scales with repetition count: 100 tests × 50 repetitions equals 5,000 LLM calls, roughly $15–$30 per test suite at GPT-4 pricing. AgentEval's recommended pattern is to use live stochastic evaluation only for new scenarios and switch to trace record/replay for regression testing in CI, which eliminates API costs entirely. The comparer's RunsPerModel option (typically 5) gives statistical stability without runaway cost.
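The cost arithmetic above, written out as a back-of-envelope check. The $0.003–$0.006 blended per-call range is implied by dividing the quoted $15–$30 suite estimate by 5,000 calls; it is not an official price.

```csharp
// Back-of-envelope cost model for live stochastic evaluation.
int tests = 100;
int repetitions = 50;
int llmCalls = tests * repetitions;          // 100 x 50 = 5,000 calls

// Implied blended cost per call, derived from the $15-$30 suite estimate.
double lowPerCall  = 15.0 / llmCalls;        // $0.003 per call
double highPerCall = 30.0 / llmCalls;        // $0.006 per call

Console.WriteLine(
    $"{llmCalls} calls ~= ${llmCalls * lowPerCall:F0}-${llmCalls * highPerCall:F0} per suite");
```

This is why the recommended split matters: live calls only for new scenarios, trace record/replay (zero API cost) for CI regression runs.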
The Red Team module runs 192 attack probes across 9 attack types: Prompt Injection, Jailbreaks, PII Leakage, System Prompt Extraction, Indirect Injection, Excessive Agency, Insecure Output Handling, API Abuse, and Encoding Evasion. This covers 6 of the OWASP LLM Top 10 2025 vulnerabilities (60% coverage) with MITRE ATLAS technique mapping, and results can be exported directly to PDF for compliance reporting via result.ExportAsync("security-report.pdf", ExportFormat.Pdf).
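A sketch of the export flow. The `ExportAsync("security-report.pdf", ExportFormat.Pdf)` call is quoted verbatim from the text; the namespaces and the `redTeam.RunAsync(agent)` entry point are hypothetical stand-ins for whatever the module's actual API is.

```csharp
using AgentEval;          // assumed root namespace
using AgentEval.RedTeam;  // hypothetical; the module's namespace is not given

// Hypothetical entry point: run the 192 probes across the 9 attack types
// against an evaluable agent.
var result = await redTeam.RunAsync(agent);

// Export for compliance reporting -- this call is shown verbatim in the source.
await result.ExportAsync("security-report.pdf", ExportFormat.Pdf);
```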
Start with the free plan — upgrade when you need more.
Get Started Free →
Still not sure? Read our full verdict →
Last verified March 2026