DSPy: Programming—not prompting—Foundation Models. A Brief Intro

Posted Feb 10, 2026 Updated Feb 10, 2026

7 min read

DSPy Overview

DSPy is a framework that lets people code their AI agent systems without any prompting (almost). It’s similar to LangChain and LangGraph, except it’s more convenient to code up and allows us to sort of fine tune generated prompts through training and validation sets. Those sets contain inputs to AI system and expected outputs from AI system. In addition, DSPy works with language models (LM) and natively does not support coding agents such as codex. However, there are some repositories on github that wrap the codex agent and allows us to integrate it within DSPy workflow.

DSPy allows us to write prompt through “intentions”. We specify the input that LM receives, as well as the output that language model should return. Such intention is called “signature”. Here is an example:

  
# Define a module (ChainOfThought) and assign it a signature (return an answer, given a question).
qa = dspy.ChainOfThought('question -> answer')

# Run with the default LM configured with `dspy.configure` above.
response = qa(question="How many floors are in the castle David Gregory inherited?")
print(response.answer)

The possible output would be:

The castle David Gregory inherited has 7 floors.

However, additional instructions can also be optionally provided:

  
toxicity = dspy.Predict(
    dspy.Signature(
        "comment -> toxic: bool",
        instructions="Mark as 'toxic' if the comment includes insults, harassment, or sarcastic derogatory remarks.",
    )
)
comment = "you are beautiful."
toxicity(comment=comment).toxic

Output:

False

Such simple signatures may not be enough to define LM’s behavior, so specifying the type of input and output, as well as class based definitions are also possible. Class based definitions allow more things to define and to receive and return more than one inputs and outputs.

Here is an example:

  
class CheckCitationFaithfulness(dspy.Signature):
    """Verify that the text is based on the provided context."""

    context: str = dspy.InputField(desc="facts here are assumed to be true")
    text: str = dspy.InputField()
    faithfulness: bool = dspy.OutputField()
    evidence: dict[str, list[str]] = dspy.OutputField(desc="Supporting evidence for claims")

context = "The 21-year-old made seven appearances for the Hammers and netted his only goal for them in a Europa League qualification round match against Andorran side FC Lustrains last season. Lee had two loan spells in League One last term, with Blackpool and then Colchester United. He scored twice for the U's but was unable to save them from relegation. The length of Lee's contract with the promoted Tykes has not been revealed. Find all the latest football transfers on our dedicated page."

text = "Lee scored 3 goals for Colchester United."

faithfulness = dspy.ChainOfThought(CheckCitationFaithfulness)
faithfulness(context=context, text=text)

Possible output:

Prediction(
    reasoning="Let's check the claims against the context. The text states Lee scored 3 goals for Colchester United, but the context clearly states 'He scored twice for the U's'. This is a direct contradiction.",
    faithfulness=False,
    evidence={'goal_count': ["scored twice for the U's"]}
)

DSPy works with LMs and allows various both user-defined and native LM tool calls. The difference is user-defined tool call is “decided” by LM whether to use it or not explicitly, while in native LM tool call, tool is also user-defined, but LM “emits” tool call. Very similar concepts actually.

DSPy also supports both http based as well as stdio based MCP server usage, but the tool calls of MCP server should be converted to DSPy regular tools first. Then they are provided to LMs just like any other user-defined tools.

Besides dspy.ChainOfThought, there are other modules as well. Here is a full list:

dspy.Predict: Basic predictor. Does not modify the signature. Handles the key forms of learning (i.e., storing the instructions and demonstrations and updates to the LM).
dspy.ChainOfThought: Teaches the LM to think step-by-step before committing to the signature’s response.
dspy.ProgramOfThought: Teaches the LM to output code, whose execution results will dictate the response.
dspy.ReAct: An agent that can use tools to implement the given signature.
dspy.MultiChainComparison: Can compare multiple outputs from ChainOfThought to produce a final prediction.
dspy.RLM: A Recursive Language Model that explores large contexts through a sandboxed Python REPL with recursive sub-LLM calls. Use when context is too large to fit in the prompt effectively.

The adapters in DSPy serve as a bridge between module and LM. The main purpose of using them is for more technical things, such as extracting the message list, extracting messages of particular role and asking LM to return a json output.

DSPy Evaluation

DSPy evaluation is a stage where the crafted AI system gets evaluated using user-defined metrics. Evaluation requires a dataset of inputs and outputs. A DSPy metric is just a function in Python that takes a sample from training or dev set and the output from your DSPy program, and outputs a score. Once both dataset and metric are ready, evaluation can be run in a simple Python loop:

  
scores = []
for x in devset:
    pred = program(**x.inputs())
    score = metric(x, pred)
    scores.append(score)

DSPy Optimization

Once the evaluation stage is ready, DSPy optimizers can be used to tune the prompts of the program. First of all, the set should split into training set and test set. The 20% of the set can be labeled as training set, and the rest 80% can be labeled as test set, or vice versa depending on the type of DSPy optimizer being used. After first few optimization runs, you are either very happy with everything or you’ve made a lot of progress but you don’t like something about the final program or the metric. This is a good chance to revise human defined stages of the program, such as the program workflow or the evaluation. Iterative development is key. DSPy gives the pieces to do that incrementally: iterating on the data, the program structure, the metric, and the optimization steps.

A DSPy optimizer is an algorithm that can tune the parameters of a DSPy program (i.e., the prompts and/or the LM weights) to maximize the metrics specified, like accuracy.

A typical DSPy optimizer takes three things:

Your DSPy program. This may be a single module (e.g., dspy.Predict) or a complex multi-module program.
Your metric. This is a function that evaluates the output of your program, and assigns it a score (higher is better).
A few training inputs. This may be very small (i.e., only 5 or 10 examples) and incomplete (only inputs to your program, without any labels).

What does a DSPy Optimizer tune? How does it tune them?

Different optimizers in DSPy tune program’s quality by synthesizing good few-shot examples for every module, like dspy.BootstrapRS,1 proposing and intelligently exploring better natural-language instructions for every prompt, like dspy.MIPROv2,2 and dspy.GEPA,3 and building datasets for the program modules and using them to finetune the LM weights in the system, like dspy.BootstrapFinetune.4

Here is an example usage of optimizers, optimizing prompts for a ReAct agent in this case:

  
import dspy
from dspy.datasets import HotPotQA

dspy.configure(lm=dspy.LM('openai/gpt-4o-mini'))

def search(query: str) -> list[str]:
    """Retrieves abstracts from Wikipedia."""
    results = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')(query, k=3)
    return [x['text'] for x in results]

trainset = [x.with_inputs('question') for x in HotPotQA(train_seed=2024, train_size=500).train]
react = dspy.ReAct("question -> answer", tools=[search])

tp = dspy.MIPROv2(metric=dspy.evaluate.answer_exact_match, auto="light", num_threads=24)
optimized_react = tp.compile(react, trainset=trainset)

What does it mean for our research?

DSPy cannot really be used for a full scale AI agent system development. In addition, we need a set of good inputs and outputs for prompt optimization. Besides, the only benefit it can bring us is to further optimize our existing prompts. Writing an optimal prompt may not be the focus for our current stage. Once we have a satisfactory level framework up and running with at least minimal performing prompts, we can further optimize our prompts using GEPA separately outside DSPy library. If we want to use dspy.GEPA instead, we would need to reimplement relative parts of our framework with dspy.

This post is licensed under CC BY 4.0 by the author.