Stanford NLP's framework for programming language models with declarative Python modules instead of prompts, featuring automatic optimizers that compile programs into effective prompts and fine-tuned weights.
Automatically fine-tunes your AI's instructions so it gives better answers — like having a compiler that optimizes your AI's performance instead of hand-writing prompts.
DSPy (Declarative Self-improving Python) is a framework from Stanford NLP that flips the standard approach to working with language models. Instead of writing and tweaking prompts by hand, you write structured Python programs using declarative modules, and DSPy's optimizers automatically compile those programs into effective prompts or fine-tuned weights for your target LLM. Think of it as the jump from assembly to a high-level language, but for AI programming.
The Core Idea: Modules Over Prompts

In DSPy, you define what you want — input/output signatures like question -> answer or context, question -> reasoning, answer — and compose modules that implement this logic. A module might chain a retriever with a language model, add a self-consistency check, or implement multi-hop reasoning. The key insight: you describe the structure of your AI program, not the exact text of your prompts. DSPy handles prompt engineering automatically.
This matters because hand-crafted prompts are brittle. Change your model from GPT-4 to Claude, and prompts that worked perfectly may degrade. Swap in a smaller model, and few-shot examples that fit GPT-4's context window need complete rework. DSPy programs are model-portable — the optimizer generates model-specific prompts from your program structure.
Optimizers: The Compiler Analogy

DSPy's optimizers are what make it genuinely different from other frameworks. Given a program, a metric function, and a small set of examples (often just 10-50), optimizers like BootstrapFewShot, COPRO, and MIPROv2 automatically find the best prompts, few-shot demonstrations, or fine-tuning data for your program. A typical optimization run costs about $2 and takes 20 minutes with a cloud LLM. The result: DSPy-optimized programs on small models (Llama2-13b) routinely outperform hand-prompted GPT-3.5 on the same tasks.
What You Can Build

DSPy handles the patterns that matter in production AI: RAG pipelines where retrieval and generation need to work together effectively, multi-hop reasoning chains that break complex questions into retrievable sub-questions, classification with structured outputs and confidence scores, agent loops where the LM decides which tools to use and how to combine results, and complex QA systems that need to reason over multiple documents.
The framework integrates with every major LLM provider through LiteLLM — OpenAI, Anthropic, Google Gemini, Databricks, Ollama for local models, and any OpenAI-compatible endpoint.
Community and Maturity

DSPy has 25,000+ GitHub stars, an active Discord community, and backing from Stanford HAI. The research paper was published at ICLR 2024 with significant follow-up work. Production deployments span enterprise RAG systems, research pipelines, and commercial AI products. The framework is fully open-source under MIT license with no paid tier.
DSPy is a paradigm-shifting framework that replaces manual prompt engineering with programmatic optimization. Revolutionary for teams building complex LLM pipelines who need measurable, reproducible quality improvements. The learning curve is steep and documentation assumes ML familiarity, but the payoff — model-portable programs with systematically optimized prompts — is substantial for production AI systems.
Define LLM tasks declaratively with input/output field specifications. Each field has descriptions and optional constraints. Predictors compile signatures into optimized prompts automatically.
Use Case:
Defining a question-answering task with context and question inputs producing a concise answer — without writing any prompt text manually.
BootstrapFewShot selects optimal few-shot examples from training data. MIPROv2 optimizes instructions and examples jointly. BayesianSignatureOptimizer uses Bayesian methods to explore the prompt space efficiently.
Use Case:
Improving a classification pipeline's accuracy from 72% to 89% by running MIPROv2 with 200 labeled examples, automatically discovering the best instruction phrasing and few-shot examples.
Pre-built modules include ChainOfThought, ReAct, ProgramOfThought, and Retrieve. Modules compose using standard Python — loops, conditionals, function calls — enabling complex multi-step programs.
Use Case:
Building a multi-step research system that retrieves documents, reasons through them with ChainOfThought, and generates code to analyze findings.
dspy.Assert and dspy.Suggest add runtime validation to LLM outputs. Assertions fail and trigger retries with feedback; Suggestions guide without hard failures. Both integrate into the optimization loop.
Use Case:
Ensuring a medical information system always includes citations by asserting that generated answers contain source references, with automatic retry on failure.
Built-in evaluation tools for measuring program quality with custom metrics. Supports accuracy, F1, exact match, and custom scoring. Evaluations drive optimizer decisions and regression testing.
Use Case:
Running nightly evaluations of a RAG pipeline against 500 golden QA pairs, tracking retrieval recall and answer accuracy across code and model changes.
Configure different LMs for different modules within the same program. Native retriever integrations for ColBERT, ChromaDB, Pinecone, Weaviate, and Milvus. Switch models without code changes via LiteLLM.
Use Case:
Using a fast, cheap model for initial retrieval and classification while routing complex reasoning to a more capable model, all within one DSPy program.
Teams building retrieval-augmented generation pipelines where retrieval and generation quality need systematic optimization — not prompt guessing — with measurable metrics and regression testing.
Organizations deploying AI across multiple LLM providers who need programs that automatically re-optimize when switching from GPT-4 to Claude to Llama without rewriting prompts.
Teams using DSPy's optimizers to achieve competitive accuracy on smaller, cheaper models (Llama, Mistral) — reducing inference costs by 10-50x compared to hand-prompted large models.
Research teams building multi-hop reasoning, question decomposition, or tool-use agent loops that require measurable quality metrics and reproducible optimization across experiments.
Frequently Asked Questions
How much training data do I need?

It depends on the optimizer. BootstrapFewShot works with 10-20 examples for simple tasks. MIPROv2 benefits from 50-200+. Start with 20-50 examples and scale up if metrics plateau. The framework includes utilities for creating training examples from existing data, and you can bootstrap examples from a strong teacher model.
Can I inspect or edit the prompts DSPy generates?

Yes. After optimization, you can read the compiled few-shot examples and instructions through each module's demos and instructions attributes, and dspy.inspect_history(n=1) shows the last prompts actually sent to the LLM. While you can manually edit prompts, it's generally better to adjust your metric or add data and re-optimize — that's the point of the framework.
How is DSPy different from LangChain?

LangChain is an orchestration toolkit where you manually write prompts and chain LLM calls. DSPy is a compiler where you declare what you want and the system optimizes how to ask. LangChain gives more control over prompt details; DSPy gives systematic, measurable quality improvement. They solve different problems and can be used together.
Does DSPy work with open-source and local models?

Yes. DSPy supports any model through its LM abstraction — OpenAI, Anthropic, Together.ai, Ollama, vLLM, HuggingFace Transformers, and any OpenAI-compatible API. Optimization is particularly valuable for smaller open-source models, where the right prompt and few-shot examples can significantly close the gap with larger commercial models.
In 2026, DSPy continued active development with improved MIPROv2 optimizer for more efficient prompt search, MLflow integration for experiment tracking, expanded multi-agent pipeline support, and growing adoption in enterprise production systems. The framework surpassed 25K GitHub stars with contributions from 200+ developers.
People who use this tool also find these helpful
A user-friendly AI agent building platform that simplifies the creation of intelligent automation workflows with drag-and-drop interfaces and pre-built components.
An innovative AI agent creation platform that enables users to build emotionally intelligent and creative AI agents with advanced personality customization and artistic capabilities.
The standard framework for building LLM applications with comprehensive tool integration, memory management, and agent orchestration capabilities.
CrewAI is an open-source Python framework for orchestrating autonomous AI agents that collaborate as a team to accomplish complex tasks. You define agents with specific roles, goals, and tools, then organize them into crews with defined workflows. Agents can delegate work to each other, share context, and execute multi-step processes like market research, content creation, or data analysis. CrewAI supports sequential and parallel task execution, integrates with popular LLMs, and provides memory systems for agent learning. It's one of the most popular multi-agent frameworks with a large community and extensive documentation.
Open-source standard that gives AI agents a common API to communicate, regardless of what framework built them. Free to implement. Backed by the AI Engineer Foundation but facing competition from Google's A2A and Anthropic's MCP.
Open-source CLI that scaffolds AI agent projects across frameworks like CrewAI, LangGraph, and LlamaStack with one command. Think create-react-app, but for agents.
See how DSPy compares to LangChain and other alternatives
AI Agent Builders
Data framework for RAG pipelines, indexing, and agent retrieval.
Agent Frameworks
Open-source multi-agent framework from Microsoft Research with asynchronous architecture, AutoGen Studio GUI, and OpenTelemetry observability. Now part of the unified Microsoft Agent Framework alongside Semantic Kernel.