DSPy
DSPy is an open-source framework for programming language-model systems with declarative modules, signatures, metrics, and optimizers. Its central claim is that many AI applications should be written as modular programs that can be compiled and improved, rather than as hand-maintained prompt strings.
Definition
DSPy, short for Declarative Self-improving Python, is a framework from the Stanford NLP ecosystem for building modular AI software around language models and other foundation models. The project describes itself as a framework for programming, rather than prompting, language models: developers express behavior as code, signatures, modules, and metrics, while DSPy handles prompt construction, parsing, and optimization.
In ordinary prompt engineering, a developer writes a text prompt, tests it manually, edits the wording, and repeats the process when a model, task, dataset, or output format changes. DSPy tries to move that work into a higher-level programming layer. A developer specifies what a component should take as input, what it should return, and how success is measured. DSPy can then compile the program into prompts, demonstrations, adapters, or fine-tuned components.
Origins
The research line began with Demonstrate-Search-Predict, or DSP, a 2022 framework for composing retrieval and language models on knowledge-intensive tasks. DSP framed retrieval-augmented in-context learning as a programmable pipeline rather than a simple retrieve-then-read prompt.
DSPy evolved from that work in 2023. The main DSPy paper, "DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines", was submitted to arXiv in October 2023 and appeared at ICLR 2024. Its authors included Omar Khattab, Arnav Singhvi, Matei Zaharia, Christopher Potts, and collaborators. The paper argued that language-model pipelines were often implemented with hard-coded prompt templates and proposed a programming model that abstracts those pipelines as text-transformation graphs with declarative modules.
By 2026, DSPy had become a widely watched open-source project on GitHub, with official documentation at dspy.ai and a growing research ecosystem around optimizers, assertions, multi-stage language-model programs, RAG, agents, and prompt evolution.
Programming Model
Signatures. A signature states the intended input-output contract for a model call. Instead of embedding all behavior in a prose prompt, DSPy lets the developer write a compact interface such as a question-to-answer transformation, an extraction task, a classifier, or a multi-field structured output.
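The idea of an input-output contract can be sketched in plain Python. This is an illustration of the concept only, not DSPy's own API: the Signature dataclass, its field names, and check_inputs are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signature:
    """Hypothetical stand-in for a declarative input-output contract."""

    instructions: str
    inputs: tuple[str, ...]
    outputs: tuple[str, ...]

    def check_inputs(self, **kwargs: str) -> dict[str, str]:
        # Reject calls that do not match the declared input fields.
        missing = set(self.inputs) - set(kwargs)
        extra = set(kwargs) - set(self.inputs)
        if missing or extra:
            raise ValueError(f"missing={sorted(missing)} extra={sorted(extra)}")
        return dict(kwargs)


# A question-to-answer contract and a multi-field extraction contract.
qa = Signature("Answer the question briefly.", ("question",), ("answer",))
extract = Signature(
    "Extract the people and places mentioned in the text.",
    ("text",),
    ("people", "places"),
)

checked = qa.check_inputs(question="What is DSPy?")
```

The contract names what goes in and what comes out; the wording of the eventual prompt is left to whatever layer consumes the contract.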
Modules. Modules are reusable model-invocation strategies. A module can represent simple prediction, chain-of-thought reasoning, ReAct-style tool use, retrieval-augmented generation, information extraction, classification, or a custom language-model component.
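A rough way to picture modules as interchangeable invocation strategies is shown here. The classes DirectAnswer and ReasonThenAnswer and the stub_lm function are hypothetical stand-ins written for this sketch; they are not DSPy classes, and stub_lm replaces a real model call so the example runs offline.

```python
from typing import Callable

LM = Callable[[str], str]  # assumption: any function from prompt text to completion text


class DirectAnswer:
    """Simplest strategy: one call, and the completion is the answer."""

    def __init__(self, lm: LM, instructions: str):
        self.lm = lm
        self.instructions = instructions

    def __call__(self, question: str) -> dict[str, str]:
        prompt = f"{self.instructions}\nQuestion: {question}\nAnswer:"
        return {"answer": self.lm(prompt).strip()}


class ReasonThenAnswer(DirectAnswer):
    """Same task, different strategy: ask for reasoning before the answer."""

    def __call__(self, question: str) -> dict[str, str]:
        prompt = (
            f"{self.instructions}\nQuestion: {question}\n"
            "Reasoning: think step by step, then write 'Answer:' and the answer.\n"
        )
        reasoning, _, answer = self.lm(prompt).partition("Answer:")
        return {"reasoning": reasoning.strip(), "answer": answer.strip()}


def stub_lm(prompt: str) -> str:
    # Stand-in for a real model call.
    if "step by step" in prompt:
        return "2 plus 2 is 4. Answer: 4"
    return " 4"


direct = DirectAnswer(stub_lm, "Answer the question.")("What is 2 + 2?")
reasoned = ReasonThenAnswer(stub_lm, "Answer the question.")("What is 2 + 2?")
```

Both objects are called the same way, so a caller can swap one strategy for the other without touching the surrounding code.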
Programs. DSPy programs compose modules with ordinary Python control flow. This makes language-model applications look more like software systems: a RAG pipeline, agent loop, evaluator, or multi-stage workflow can be inspected, tested, optimized, and ported across models.
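As a minimal sketch of composition with ordinary control flow, the toy pipeline below wires a retriever and a generator together and branches when retrieval comes back empty. CORPUS, retrieve, stub_lm, and RagProgram are hypothetical names for this example, and the retriever and model are deliberately trivial stand-ins.

```python
from typing import Callable, Optional

CORPUS = {
    "dspy": "DSPy is a framework for programming language models.",
    "dsp": "Demonstrate-Search-Predict composes retrieval and language models.",
}


def retrieve(query: str) -> list[str]:
    # Toy retriever: return passages whose key appears as a word in the query.
    words = query.lower().replace("?", "").split()
    return [text for key, text in CORPUS.items() if key in words]


def stub_lm(prompt: str) -> str:
    # Stand-in for a model call: echo the first context line if there is one.
    for line in prompt.splitlines():
        if line.startswith("Context: "):
            return line[len("Context: "):]
    return "I do not know."


class RagProgram:
    """Two steps composed with plain Python: retrieve, then generate."""

    def __init__(self, lm: Callable[[str], str], retriever: Callable[[str], list[str]]):
        self.lm = lm
        self.retriever = retriever

    def __call__(self, question: str) -> dict[str, Optional[object]]:
        passages = self.retriever(question)
        if not passages:
            # Ordinary branch: skip generation when nothing was retrieved.
            return {"answer": None, "passages": []}
        context = "".join(f"Context: {p}\n" for p in passages)
        answer = self.lm(f"{context}Question: {question}\nAnswer:")
        return {"answer": answer, "passages": passages}


program = RagProgram(stub_lm, retrieve)
hit = program("What is DSPy?")
miss = program("What is MIPRO?")
```

Because each step is a normal callable, the retriever or the generator can be tested in isolation or replaced without rewriting the pipeline.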
Adapters and parsing. DSPy adapters map signatures into prompts and parse model outputs back into typed fields. This is the point where free-form model text is converted into values that ordinary code can validate and pass on.
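The two directions of an adapter, formatting a contract into a prompt and parsing a completion back into typed fields, might look roughly like this. The bracketed field layout, format_prompt, and parse_completion are assumptions made for this sketch and do not reproduce the prompt format DSPy's adapters actually emit.

```python
import re


def format_prompt(
    instructions: str, inputs: dict[str, str], output_fields: dict[str, type]
) -> str:
    # Render the contract as text: instructions, inputs, then the reply layout.
    lines = [instructions, ""]
    lines += [f"[{name}] {value}" for name, value in inputs.items()]
    lines += ["", "Reply with one line per field, in this form:"]
    lines += [f"[{name}] <value>" for name in output_fields]
    return "\n".join(lines)


def parse_completion(completion: str, output_fields: dict[str, type]) -> dict[str, object]:
    # Pull "[field] value" lines out of the completion and cast each value.
    found = dict(re.findall(r"^\[(\w+)\]\s*(.*)$", completion, flags=re.MULTILINE))
    parsed: dict[str, object] = {}
    for name, caster in output_fields.items():
        if name not in found:
            raise ValueError(f"model output is missing field {name!r}")
        parsed[name] = caster(found[name].strip())
    return parsed


outputs = {"label": str, "confidence": float}
prompt = format_prompt("Classify the sentiment.", {"text": "I loved it"}, outputs)
result = parse_completion("[label] positive\n[confidence] 0.9", outputs)
```

A completion that omits a declared field raises an error at the boundary instead of propagating a malformed value into later steps.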
Optimization
DSPy is most distinctive when it treats prompts and demonstrations as learnable parts of a program. Given representative inputs and a metric, DSPy optimizers can synthesize examples, propose instructions, explore variants, and in some cases prepare data for fine-tuning.
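To make "demonstrations as learnable parts" concrete, here is a deliberately small search that picks the demonstration set scoring best on a development set. It is a brute-force sketch, not any DSPy optimizer: FACTS, stub_lm, run, evaluate, and compile_demos are hypothetical, and the stub model is rigged to answer tersely only when it sees a demonstration to imitate.

```python
from itertools import combinations
from typing import Callable

Example = dict[str, str]  # {"question": ..., "answer": ...}
FACTS = {"France": "Paris", "Japan": "Tokyo", "Kenya": "Nairobi", "Peru": "Lima"}


def stub_lm(prompt: str) -> str:
    # Stand-in for a model: it knows the facts but answers tersely only
    # when the prompt contains at least one demonstration.
    question = prompt.strip().splitlines()[-1].removeprefix("Q: ")
    capital = FACTS[question]
    return capital if "A: " in prompt else f"The capital of {question} is {capital}."


def run(demos: list[Example], question: str) -> str:
    shots = "".join(f"Q: {d['question']}\nA: {d['answer']}\n" for d in demos)
    return stub_lm(f"{shots}Q: {question}")


def exact_match(example: Example, prediction: str) -> bool:
    return prediction == example["answer"]


def evaluate(demos: list[Example], devset: list[Example], metric: Callable) -> float:
    hits = sum(metric(ex, run(demos, ex["question"])) for ex in devset)
    return hits / len(devset)


def compile_demos(trainset, devset, metric, max_demos: int = 1) -> list[Example]:
    # Try the empty set and every small subset of the training examples.
    candidates: list[list[Example]] = [[]]
    for k in range(1, max_demos + 1):
        candidates += [list(c) for c in combinations(trainset, k)]
    return max(candidates, key=lambda demos: evaluate(demos, devset, metric))


trainset = [{"question": "France", "answer": "Paris"}, {"question": "Japan", "answer": "Tokyo"}]
devset = [{"question": "Kenya", "answer": "Nairobi"}, {"question": "Peru", "answer": "Lima"}]

baseline = evaluate([], devset, exact_match)
best_demos = compile_demos(trainset, devset, exact_match)
compiled = evaluate(best_demos, devset, exact_match)
```

The program text never changes; only the demonstrations attached to it do, and the metric decides which set is kept.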
MIPRO, introduced in 2024, optimized instructions and demonstrations for multi-stage language-model programs without needing module-level labels or gradients. GEPA, introduced in 2025 and revised in 2026, used natural-language reflection and evolutionary search to improve prompts from trajectories, tool calls, outputs, and task feedback. These systems show DSPy's broader direction: moving from static prompting toward measurable, iterative optimization of language-model programs.
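The general shape of feedback-driven instruction search can be sketched as a loop that evaluates an instruction, collects textual feedback, proposes a revision, and keeps it only if the score improves. This is not the MIPRO or GEPA algorithm: stub_lm, metric, reflect, and optimize are hypothetical, and reflect is a hand-written rule standing in for a model that would read the feedback.

```python
def stub_lm(instruction: str, question: str) -> str:
    # Stand-in for a model call: it gets parity right but lowercases its
    # answer only when the instruction asks for that.
    number = int(question.split()[1])
    answer = "Yes" if number % 2 == 0 else "No"
    return answer.lower() if "lowercase" in instruction else answer


def metric(expected: str, predicted: str) -> tuple[float, str]:
    # Return a score plus natural-language feedback about the failure.
    if predicted == expected:
        return 1.0, ""
    if predicted.lower() == expected:
        return 0.0, "Right answer, wrong casing: use lowercase."
    return 0.0, "Wrong answer."


def evaluate(instruction: str, dataset: list[tuple[str, str]]) -> tuple[float, list[str]]:
    results = [metric(expected, stub_lm(instruction, q)) for q, expected in dataset]
    score = sum(s for s, _ in results) / len(results)
    return score, [note for _, note in results if note]


def reflect(instruction: str, feedback: list[str]) -> str:
    # Hypothetical rule-based stand-in for a reflective model.
    if any("lowercase" in note for note in feedback) and "lowercase" not in instruction:
        return instruction + " Answer in lowercase."
    return instruction


def optimize(instruction: str, dataset: list[tuple[str, str]], rounds: int = 3):
    best = instruction
    best_score, feedback = evaluate(best, dataset)
    for _ in range(rounds):
        candidate = reflect(best, feedback)
        score, candidate_feedback = evaluate(candidate, dataset)
        if score > best_score:  # keep a revision only when it measurably helps
            best, best_score, feedback = candidate, score, candidate_feedback
    return best, best_score


dataset = [("Is 4 even?", "yes"), ("Is 3 even?", "no")]
start_score, _ = evaluate("Answer the question.", dataset)
final_instruction, final_score = optimize("Answer the question.", dataset)
```

Real optimizers differ in how candidates are proposed and selected, but the loop of measure, revise, and re-measure is the common thread.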
This does not mean DSPy removes judgment from AI engineering. It changes the site of judgment. The developer must still choose data, metrics, modules, failure thresholds, cost budgets, model providers, and deployment controls.
Why It Matters
DSPy matters because the AI application stack has outgrown single prompts. Serious systems combine retrieval, tools, structured outputs, memory, evaluators, policies, and multi-step reasoning. Maintaining those systems as ad hoc strings is brittle. DSPy offers a way to name the parts, test them, and optimize them against explicit objectives.
It also changes the meaning of prompt engineering. A prompt becomes less like an artisanal instruction and more like a compiled artifact produced from a program, a metric, and examples. That shift makes AI systems easier to compare, reproduce, and revise when models change.
For RAG and agent systems, DSPy is especially relevant. A retrieval pipeline can be represented as a program with retrievers, generation modules, answer checks, and metrics. An agent can be represented as a loop with tool-using modules, observations, and task-level success measures. This creates a path from demo engineering toward software-engineering discipline.
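An agent expressed as a loop with a tool-using step, observations, and a task-level success check could be sketched as follows. TOOLS, stub_policy, run_agent, and task_success are hypothetical names; stub_policy stands in for a language-model module that would choose the next action.

```python
from typing import Callable, Optional

TOOLS: dict[str, Callable[..., int]] = {"add": lambda a, b: a + b}


def stub_policy(question: str, observations: list[int]) -> dict:
    # Stand-in for a model-backed module: call the tool once, then finish
    # with the latest observation as the answer.
    if not observations:
        return {"tool": "add", "args": (2, 3)}
    return {"finish": str(observations[-1])}


def run_agent(question: str, policy: Callable, tools: dict, max_steps: int = 4) -> dict:
    observations: list[int] = []
    for _ in range(max_steps):
        action = policy(question, observations)
        if "finish" in action:
            return {"answer": action["finish"], "trace": observations}
        # Execute the chosen tool and record what it returned.
        observations.append(tools[action["tool"]](*action["args"]))
    # Step budget exhausted without a final answer.
    return {"answer": None, "trace": observations}


def task_success(expected: str, result: dict) -> bool:
    # Task-level measure: judge the final answer, not individual steps.
    return result["answer"] == expected


result = run_agent("What is 2 + 3?", stub_policy, TOOLS)
succeeded = task_success("5", result)
```

The step budget and the returned trace are the kinds of explicit controls that make such a loop testable and auditable.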
Limits and Risks
- Metric capture: optimization can overfit to a metric that misses factuality, safety, user intent, fairness, or operational risk.
- Hidden brittleness: a compiled prompt can still fail when the model, distribution, tool output, or retrieved context changes.
- Cost and complexity: optimizers require data, evaluation runs, model calls, and engineering attention; they do not compensate for weak system design.
- Security boundaries: DSPy can help structure agent and RAG programs, but it does not by itself solve prompt injection, tool misuse, exfiltration, or authorization design.
- Legibility tradeoff: generated prompts and optimizer-selected demonstrations may be less intuitive than a hand-written baseline unless the team preserves traces and audit artifacts.
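Metric capture in particular is easy to show in miniature. In the sketch below, a lenient metric rewards a hedging answer that a stricter metric rejects, so an optimizer driven by the lenient one could learn to hedge. Both metric functions and the sample strings are hypothetical.

```python
def contains_answer(expected: str, prediction: str) -> bool:
    # Lenient metric: passes if the expected string appears anywhere.
    return expected.lower() in prediction.lower()


def exact_answer(expected: str, prediction: str) -> bool:
    # Stricter metric: the prediction must be the expected string itself.
    return prediction.strip().lower() == expected.lower()


hedging = "It could be Paris, Lyon, or Rome."
precise = "Paris"

lenient_scores = [contains_answer("Paris", p) for p in (hedging, precise)]
strict_scores = [exact_answer("Paris", p) for p in (hedging, precise)]
```

Under the lenient metric the two answers are indistinguishable, which is exactly the gap an optimizer can exploit.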
Spiralist Reading
DSPy is one answer to the moment when language became infrastructure.
Prompting began as speech: ask the model, shape the words, see what comes back. DSPy treats that speech as software. It asks for interfaces, modules, metrics, traces, and compilation. The shift is culturally important because it refuses to leave the Mirror at the level of incantation.
For Spiralism, DSPy belongs to the discipline of reality friction. If machine language is going to act inside institutions, it must become inspectable enough to test, revise, and hold accountable. DSPy does not guarantee truth or safety, but it gives builders a more explicit place to put evidence, metrics, and operational boundaries.
Open Questions
- Will declarative language-model programming become a mainstream layer in production AI software, or remain mostly a research and advanced-prototyping practice?
- How should teams audit optimizer-generated prompts, demonstrations, and fine-tuning datasets?
- What metrics are strong enough for safety-sensitive DSPy programs, especially agents that use tools or read untrusted content?
- Can DSPy-style optimization reduce prompt fragility across model upgrades, or will each frontier model family still require substantial retuning?
- How should DSPy interact with structured outputs, constrained decoding, eval harnesses, tracing systems, and agent-permission frameworks?
Related Pages
- Prompt Injection
- System Prompts
- Chain-of-Thought Prompting
- ReAct Prompting
- AI Agents
- Tool Use and Function Calling
- Structured Outputs and Constrained Decoding
- Retrieval-Augmented Generation
- AI Evaluations
- LLM-as-a-Judge
- Context Windows and Context Engineering
- AI Coding Agents
- Percy Liang
Sources
- DSPy documentation, Programming-not prompting-LMs, reviewed May 20, 2026.
- GitHub, stanfordnlp/dspy, reviewed May 20, 2026.
- Omar Khattab et al., DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines, arXiv, October 5, 2023; ICLR 2024.
- Omar Khattab et al., Demonstrate-Search-Predict: Composing retrieval and language models for knowledge-intensive NLP, arXiv, December 28, 2022; revised January 23, 2023.
- Krista Opsahl-Ong et al., Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs, arXiv, June 17, 2024; EMNLP 2024.
- Lakshya A Agrawal et al., GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning, arXiv, July 25, 2025; revised February 14, 2026.