Stanford NLP's framework for programming language models with declarative Python modules instead of prompts, featuring automatic optimizers that compile programs into effective prompts and fine-tuned weights.
Automatically fine-tunes your AI's instructions so it gives better answers — like having a compiler that optimizes your AI's performance instead of hand-writing prompts.
DSPy (Declarative Self-improving Python) is a framework from Stanford NLP that flips the standard approach to working with language models. Instead of writing and tweaking prompts by hand, you write structured Python programs using declarative modules, and DSPy's optimizers automatically compile those programs into effective prompts or fine-tuned weights for your target LLM. Think of it as the jump from assembly to a high-level language, but for AI programming.
The Core Idea: Modules Over Prompts

In DSPy, you define what you want — input/output signatures like question -> answer or context, question -> reasoning, answer — and compose modules that implement this logic. A module might chain a retriever with a language model, add a self-consistency check, or implement multi-hop reasoning. The key insight: you describe the structure of your AI program, not the exact text of your prompts. DSPy handles prompt engineering automatically.
This matters because hand-crafted prompts are brittle. Change your model from GPT-4 to Claude, and prompts that worked perfectly may degrade. Swap in a smaller model, and few-shot examples that fit GPT-4's context window need complete rework. DSPy programs are model-portable — the optimizer generates model-specific prompts from your program structure.
Optimizers: The Compiler Analogy

DSPy's optimizers are what make it genuinely different from other frameworks. Given a program, a metric function, and a small set of examples (often just 10-50), optimizers like BootstrapFewShot, COPRO, and MIPROv2 automatically find the best prompts, few-shot demonstrations, or fine-tuning data for your program. A typical optimization run costs about $2 and takes 20 minutes with a cloud LLM. The result: DSPy-optimized programs on small models (Llama2-13b) routinely outperform hand-prompted GPT-3.5 on the same tasks.
What You Can Build

DSPy handles the patterns that matter in production AI: RAG pipelines where retrieval and generation need to work together effectively, multi-hop reasoning chains that break complex questions into retrievable sub-questions, classification with structured outputs and confidence scores, agent loops where the LM decides which tools to use and how to combine results, and complex QA systems that need to reason over multiple documents.
The framework integrates with every major LLM provider through LiteLLM — OpenAI, Anthropic, Google Gemini, Databricks, Ollama for local models, and any OpenAI-compatible endpoint.
Community and Maturity

DSPy has 25,000+ GitHub stars, an active Discord community, and backing from Stanford HAI. The research paper was published at ICLR 2024 with significant follow-up work. Production deployments span enterprise RAG systems, research pipelines, and commercial AI products. The framework is fully open-source under MIT license with no paid tier.
DSPy is a paradigm-shifting framework that replaces manual prompt engineering with programmatic optimization. Revolutionary for teams building complex LLM pipelines who need measurable, reproducible quality improvements. The learning curve is steep and documentation assumes ML familiarity, but the payoff — model-portable programs with systematically optimized prompts — is substantial for production AI systems.
Define LLM tasks declaratively with input/output field specifications. Each field has descriptions and optional constraints. Predictors compile signatures into optimized prompts automatically.
Use Case:
Defining a question-answering task with context and question inputs producing a concise answer — without writing any prompt text manually.
BootstrapFewShot selects optimal few-shot examples from training data. MIPROv2 optimizes instructions and examples jointly. BayesianSignatureOptimizer uses Bayesian methods to explore the prompt space efficiently.
Use Case:
Improving a classification pipeline's accuracy from 72% to 89% by running MIPROv2 with 200 labeled examples, automatically discovering the best instruction phrasing and few-shot examples.
Pre-built modules include ChainOfThought, ReAct, ProgramOfThought, and Retrieve. Modules compose using standard Python — loops, conditionals, function calls — enabling complex multi-step programs.
Use Case:
Building a multi-step research system that retrieves documents, reasons through them with ChainOfThought, and generates code to analyze findings.
dspy.Assert and dspy.Suggest add runtime validation to LLM outputs. Assertions fail and trigger retries with feedback; Suggestions guide without hard failures. Both integrate into the optimization loop.
Use Case:
Ensuring a medical information system always includes citations by asserting that generated answers contain source references, with automatic retry on failure.
Built-in evaluation tools for measuring program quality with custom metrics. Supports accuracy, F1, exact match, and custom scoring. Evaluations drive optimizer decisions and regression testing.
Use Case:
Running nightly evaluations of a RAG pipeline against 500 golden QA pairs, tracking retrieval recall and answer accuracy across code and model changes.
Configure different LMs for different modules within the same program. Native retriever integrations for ColBERT, ChromaDB, Pinecone, Weaviate, and Milvus. Switch models without code changes via LiteLLM.
Use Case:
Using a fast, cheap model for initial retrieval and classification while routing complex reasoning to a more capable model, all within one DSPy program.
Teams building retrieval-augmented generation pipelines where retrieval and generation quality need systematic optimization — not prompt guessing — with measurable metrics and regression testing.
Organizations deploying AI across multiple LLM providers who need programs that automatically re-optimize when switching from GPT-4 to Claude to Llama without rewriting prompts.
Teams using DSPy's optimizers to achieve competitive accuracy on smaller, cheaper models (Llama, Mistral) — reducing inference costs by 10-50x compared to hand-prompted large models.
Research teams building multi-hop reasoning, question decomposition, or tool-use agent loops that require measurable quality metrics and reproducible optimization across experiments.
Frequently Asked Questions
How much training data do I need?

It depends on the optimizer. BootstrapFewShot works with 10-20 examples for simple tasks. MIPROv2 benefits from 50-200+. Start with 20-50 examples and scale up if metrics plateau. The framework includes utilities for creating training examples from existing data, and you can bootstrap examples from a strong teacher model.
Can I inspect or edit the prompts DSPy generates?

Yes. After optimization, you can read the compiled few-shot examples and instructions through each module's demos and instructions attributes, and dspy.inspect_history(n=1) shows the last prompts actually sent to the LLM. While you can manually edit prompts, it's generally better to adjust your metric or add data and re-optimize — that's the point of the framework.
How is DSPy different from LangChain?

LangChain is an orchestration toolkit where you manually write prompts and chain LLM calls. DSPy is a compiler where you declare what you want and the system optimizes how to ask. LangChain gives more control over prompt details; DSPy gives systematic, measurable quality improvement. They solve different problems and can be used together.
Does DSPy work with open-source and local models?

Yes. DSPy supports any model through its LM abstraction — OpenAI, Anthropic, Together.ai, Ollama, vLLM, HuggingFace Transformers, and any OpenAI-compatible API. Optimization is particularly valuable for smaller open-source models, where the right prompt and few-shot examples can significantly close the gap with larger commercial models.
In 2026, DSPy continued active development with improved MIPROv2 optimizer for more efficient prompt search, MLflow integration for experiment tracking, expanded multi-agent pipeline support, and growing adoption in enterprise production systems. The framework surpassed 25K GitHub stars with contributions from 200+ developers.
People who use this tool also find these helpful
A user-friendly AI agent building platform that simplifies the creation of intelligent automation workflows with drag-and-drop interfaces and pre-built components.
An innovative AI agent creation platform that enables users to build emotionally intelligent and creative AI agents with advanced personality customization and artistic capabilities.
The standard framework for building LLM applications with comprehensive tool integration, memory management, and agent orchestration capabilities.
CrewAI is an open-source Python framework for orchestrating autonomous AI agents that collaborate as a team to accomplish complex tasks. You define agents with specific roles, goals, and tools, then organize them into crews with defined workflows. Agents can delegate work to each other, share context, and execute multi-step processes like market research, content creation, or data analysis. CrewAI supports sequential and parallel task execution, integrates with popular LLMs, and provides memory systems for agent learning. It's one of the most popular multi-agent frameworks with a large community and extensive documentation.
Open-source standard that gives AI agents a common API to communicate, regardless of what framework built them. Free to implement. Backed by the AI Engineer Foundation but facing competition from Google's A2A and Anthropic's MCP.
Open-source CLI that scaffolds AI agent projects across frameworks like CrewAI, LangGraph, and LlamaStack with one command. Think create-react-app, but for agents.
See how DSPy compares to LangChain and other alternatives
AI Agent Builders
Data framework for RAG pipelines, indexing, and agent retrieval.
Agent Frameworks
Open-source multi-agent framework from Microsoft Research with asynchronous architecture, AutoGen Studio GUI, and OpenTelemetry observability. Now part of the unified Microsoft Agent Framework alongside Semantic Kernel.