
Programming, Not Prompting: A Hands-On Guide to DSPy

A practical deep dive into declarative AI programming

Image generated by the author with DALL-E 3

The modern GenAI landscape is built around prompting. We instruct LLMs like ChatGPT or Claude using long, highly detailed, step-by-step guides to achieve the desired outcomes. Crafting these prompts takes a lot of time and effort, but we’re willing to spend it since better prompts usually lead to better results.

However, reaching an optimal prompt is often a challenging task. It’s a trial-and-error process, where it’s not always clear what will work best for your specific task or a given LLM. As a result, it can take many iterations to arrive at a satisfactory result, especially when your prompt is several thousand words long. 

To address these challenges, researchers at Stanford NLP created the DSPy framework. DSPy stands for Declarative Self-improving Python. This framework allows you to build modular AI applications. It is based on the idea that LLM tasks can be treated as programming rather than manual prompting. Using standard building blocks, you can create a wide range of AI applications: from simple classifiers to RAG (Retrieval Augmented Generation) systems or even agents.

This approach seems promising. It would be exciting to build AI applications the same way as we build traditional software. So, I decided to give DSPy a try.

In this article, we’ll explore the DSPy framework and its capabilities for building LLM pipelines. We’ll start with a simple combinatorics task to cover the basics. Then, we’ll apply DSPy to a real business problem: classifying NPS (Net Promoter Score) detractor comments. Based on this example, we’ll also test one of the framework’s most promising features: automatic instruction optimisation.

DSPy basics

We’ll begin exploring the DSPy framework by installing the package.

pip install -U dspy

As mentioned above, DSPy defines LLM applications in a structured and modular way. Every application is built using three main components:

  • language model — LLM that will answer our questions, 
  • signature — a declaration of the program’s input and output (what task we want to solve),
  • module — the prompting technique (how we want to solve the task).  

Let’s try it out with a very simple example.

As usual, we will start with a language model — the core of any LLM-powered application. I will be using a local model (Llama by Meta), accessed via Ollama. If you don’t have Ollama installed yet, you can follow the instructions in the documentation.

To create a language model in a DSPy application, we need to initialise a dspy.LM object and set it as the default LLM for our app. It goes without saying that DSPy supports not only local models but also popular APIs, such as OpenAI and Anthropic.

import dspy
llm = dspy.LM('ollama_chat/llama3.2', 
  api_base='http://localhost:11434', 
  api_key='', temperature = 0.3)
dspy.configure(lm=llm)
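
By the way, switching to a hosted API later only requires changing the dspy.LM line. Here’s a minimal sketch, assuming an OpenAI key is set in the environment (the model name is just an illustration):

# sketch: using a hosted API instead of a local model;
# assumes the OPENAI_API_KEY environment variable is set
llm = dspy.LM('openai/gpt-4o-mini', temperature = 0.3)
dspy.configure(lm=llm)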

We have our language model set up. The next step is to define the task by creating a module and a signature.

A signature defines the input and output for the model. It tells the model what we’re giving it and what result we expect in the end. The signature doesn’t tell the model how to solve the task; it’s only a declaration.

There are two ways to define a signature in DSPy: inline or using a class. For our first quick example, we will use a simple inline approach, but we will cover class-based definitions later in the article.

Modules are the building blocks of DSPy applications. They abstract different prompting strategies, such as Chain-of-Thought or ReAct. Modules are designed to work with any signature, so you don’t need to worry about compatibility yourself.

Here are some of the most commonly used DSPy modules (you can find the full list in the documentation): 

  • dspy.Predict — a basic predictor;
  • dspy.ChainOfThought — guides an LLM to think step-by-step before returning a final answer;
  • dspy.ReAct — a basic agent that can call tools.

We will start with the simplest one, dspy.Predict, and build a basic model that can answer combinatorics questions. Since we expect the answer to be an integer, I’ve specified that in the signature.

simple_model = dspy.Predict("question -> answer: int")

That’s all we need. Now, we can start asking questions.

simple_model(
  question="""I have 5 different balls and I randomly select 4. 
    How many possible combinations of the balls I can get?"""
)

# Prediction(answer=210)

We got the answer, but unfortunately, it’s incorrect. Still, let’s see how it works under the hood. We can see the full logs using the dspy.inspect_history command.

dspy.inspect_history(n = 1)

# System message:
# 
# Your input fields are:
# 1. `question` (str):
# Your output fields are:
# 1. `answer` (int):
# All interactions will be structured in the following way, with 
# the appropriate values filled in.
# 
# Inputs will have the following structure:
# [[ ## question ## ]]
# {question}
# 
# Outputs will be a JSON object with the following fields.
# {
#   "answer": "{answer}  # note: the value you produce must be a single int value"
# }
# In adhering to this structure, your objective is: 
#   Given the fields `question`, produce the fields `answer`.
# 
# User message:
# [[ ## question ## ]]
# I have 5 different balls and I randomly select 4. How many possible 
# combinations of the balls I can get?
# Respond with a JSON object in the following order of fields: `answer` 
# (must be formatted as a valid Python int).
# 
# Response:
# {"answer": 210}

We can see that DSPy has generated a detailed and well-structured prompt for us. That’s quite handy.

One last quick note before we move on to fixing the model: I noticed that DSPy enables caching for LLM responses by default. Caching might be helpful in some cases, for example, saving costs on debugging. However, if you want to disable it, you can either update the config or bypass it for a specific call.

# updating config
dspy.configure_cache(enable_memory_cache=False, enable_disk_cache=False)

# bypassing the cache for a specific module
math_model = dspy.Predict("question -> answer: float", cache = False)

Back to our task, let’s try adding reasoning to see if it improves the result. It’s as easy as changing the module.

dspy.configure(adapter=dspy.JSONAdapter()) 
# I've also switched to the JSON adapter since it works better
# for models that support structured output

cot_model = dspy.ChainOfThought("question -> answer: int")
cot_model(question="""I have 5 different balls and I randomly select 4. 
  How many possible combinations of the balls I can get?""")

# Prediction(
#   reasoning='This is a combination problem, where we need to find 
#     the number of ways to choose 4 balls out of 5 without considering 
#     the order. The formula for combinations is nCr = n! / (r!(n-r)!), 
#     where n is the total number of items and r is the number of items 
#     being chosen. In this case, n = 5 and r = 4.',
#   answer=5
# )

Hooray! The reasoning worked, and we got the correct result this time. Let’s see how the prompt has changed. The reasoning field has been added to the output variables.

dspy.inspect_history(n = 1) 

# System message:
# 
# Your input fields are:
# 1. `question` (str):
# Your output fields are:
# 1. `reasoning` (str): 
# 2. `answer` (int):
# All interactions will be structured in the following way, 
# with the appropriate values filled in.

# Inputs will have the following structure:
# [[ ## question ## ]]
# {question}

# Outputs will be a JSON object with the following fields.
# {
#   "reasoning": "{reasoning}",
#   "answer": "{answer}        # note: the value you produce must be a single int value"
# }
# In adhering to this structure, your objective is: 
#   Given the fields `question`, produce the fields `answer`.

Let’s test our system with a slightly more challenging question. 

print(cot_model(question="""I have 25 different balls and I randomly select 9. 
  How many possible combinations of the balls I can get?"""))

# Prediction(
#   reasoning='This is a combination problem, where the order of selection 
#     does not matter. The number of combinations can be calculated using 
#     the formula C(n, k) = n! / (k!(n-k)!), where n is the total 
#     number of items and k is the number of items to choose.',
#   answer=55
# )

The answer is definitely wrong. The LLM stated the correct formula but gave 55 instead of the correct result (2,042,975). This is expected: the model hallucinated because it couldn’t perform the calculation accurately. So, it’s a perfect use case for an agent. We will equip our agent with a tool to do calculations and, hopefully, it will solve the problem.

Before we jump into building our first DSPy agentic flow, let’s set up observability. It will help us understand the agent’s thought process. DSPy is integrated with MLflow (an observability tool), making it easy to track everything in a user-friendly interface.

To begin, we’ll make a couple of initial setup calls.

pip install -U mlflow

#  It is highly recommended to use SQL store when using MLflow tracing
python3 -m mlflow server --backend-store-uri sqlite:///mydb.sqlite

If you haven’t changed the default, MLflow will be running on port 5000. Next, we need to add a few lines of Python to our program to start tracking.

import mlflow

# Tell MLflow about the server URI.
mlflow.set_tracking_uri("http://127.0.0.1:5000")

# Create a unique name for your experiment.
mlflow.set_experiment("DSPy")
mlflow.dspy.autolog()

Then, let’s define a calculation tool. We will give our agent the superpower to execute Python code.

from dspy import PythonInterpreter

def evaluate_math(expr: str) -> str:
  # Executes a Python expression in a sandboxed interpreter
  # and returns the result
  with PythonInterpreter() as interp:
    return interp(expr)
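
Before handing the tool to an agent, we can sanity-check it with a direct call (the expression below is just an illustration):

# calling the tool directly, outside of any agent
print(evaluate_math("import math; math.comb(5, 4)"))
# expected output: 5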

Now, we have everything we need to create our agent. As you can see, defining a DSPy agent is concise and straightforward. 

react_model = dspy.ReAct(
  signature="question -> answer: int", 
  tools=[evaluate_math]
)

response = react_model(question="""I have 25 different balls and I randomly 
  select 9. How many possible combinations of the balls I can get?""")

print(response.answer) 
# 2042975

Thanks to the math capabilities, we got the correct answer. Let’s take a look at how the agent came up with this answer.

print(response.trajectory)

# {'thought_0': 'To find the number of possible combinations of balls I can get, we need to calculate the number of combinations of 9 balls from a set of 25.',
#  'tool_name_0': 'evaluate_math',
#  'tool_args_0': {'expr': 'math.comb(25, 9)'},
#  'observation_0': 'Execution error in evaluate_math: \nTraceback (most recent call last):\n  File "/Users/marie/Documents/github/llm_env/lib/python3.11/site-packages/dspy/predict/react.py", line 89, in forward\n    trajectory[f"observation_{idx}"] = self.tools[pred.next_tool_name](**pred.next_tool_args)\n                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/Users/marie/Documents/github/llm_env/lib/python3.11/site-packages/dspy/utils/callback.py", line 326, in sync_wrapper\n    return fn(instance, *args, **kwargs)\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/Users/marie/Documents/github/llm_env/lib/python3.11/site-packages/dspy/adapters/types/tool.py", line 166, in __call__\n    result = self.func(**parsed_kwargs)\n             ^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/var/folders/7v/1ln722x97kd8bchgxpmdkynw0000gn/T/ipykernel_84644/1271922619.py", line 4, in evaluate_math\n    return interp(expr)\n           ^^^^^^^^^^^^\n  File "/Users/marie/Documents/github/llm_env/lib/python3.11/site-packages/dspy/primitives/python_interpreter.py", line 149, in __call__\n    return self.execute(code, variables)\n           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\ndspy.primitives.python_interpreter.InterpreterError: NameError: ["name \'math\' is not defined"]',
#  'thought_1': 'The math.comb function is not defined. We need to import the math module.',
#  'tool_name_1': 'evaluate_math',
#  'tool_args_1': {'expr': 'import math; math.comb(25, 9)'},
#  'observation_1': 2042975,
#  'thought_2': 'We need to import the math module before using its comb function.',
#  'tool_name_2': 'evaluate_math',
#  'tool_args_2': {'expr': 'import math; math.comb(25, 9)'},
#  'observation_2': 2042975,
#  'thought_3': 'We need to import the math module before using its comb function.',
#  'tool_name_3': 'evaluate_math',
#  'tool_args_3': {'expr': 'import math; math.comb(25, 9)'},
#  'observation_3': 2042975,
#  'thought_4': 'We need to import the math module before using its comb function.',
#  'tool_name_4': 'evaluate_math',
#  'tool_args_4': {'expr': 'import math; math.comb(25, 9)'},
#  'observation_4': 2042975}

Overall, the trajectory makes sense. The LLM correctly tried to calculate the number of combinations with math.comb(25, 9). I didn’t know that such a function existed, so that was a win. However, it initially forgot to import the math module, causing the execution to fail. On the next iteration, it corrected the Python code and got the result. For some reason, though, it repeated precisely the same action three more times. Not ideal, but we still got our answer.
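
If you want to guard against such redundant loops, dspy.ReAct accepts a max_iters argument that caps the number of tool-calling steps. A minimal sketch (the cap of 3 is an arbitrary choice):

react_model_capped = dspy.ReAct(
  signature="question -> answer: int", 
  tools=[evaluate_math],
  max_iters=3  # stop after at most 3 tool-calling steps
)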

Since we enabled MLflow, we can also view the complete log of the agent’s execution through the UI. It’s often more convenient than reading the trajectory as plain text.

Image by author

Finally, we’ve successfully built an app that can accurately answer combinatorics questions and learned the basics of DSPy along the way. Now, it’s time to move on to actual business tasks.

NPS topic modelling

As we’ve covered the basics, let’s take a look at a real-world example. Imagine you’re a product analyst at a fashion retail company, and your task is to identify the most significant customer pain points. The company regularly conducts an NPS survey, so you decide to base your analysis on comments from NPS detractors.

Together with your product team, you reviewed a bunch of NPS comments, looked through previous customer research and brainstormed a list of key problems that customers might be facing in the product. As a result, you identified the following key topics:

  • Slow or Unreliable Shipping,
  • Inaccurate Product Descriptions or Photos,
  • Limited Size or Shade Availability,
  • Unresponsive or Generic Customer Support,
  • Website or App Bugs,
  • Confusing Loyalty or Discount Systems,
  • Complicated Returns or Exchanges,
  • Customs and Import Charges,
  • Difficult Product Discovery,
  • Damaged or Incorrect Items.

We have a list of hypotheses and now just need to understand which problems customers mention most often. Fortunately, with LLMs, there’s no need to spend hours reading NPS comments ourselves. We will use DSPy to do topic modelling. 

Let’s start by defining a signature. The model will get an NPS comment as input, and we expect it to return one or more topics as output. From a declaration perspective, output will be an array of strings from a predefined list. Since this use case is a bit more complex, we will use a class-based signature for this task. 

We need to create a class that inherits from dspy.Signature. This class should include a docstring that will be shared with the model as its objective. We also need to define the input and output fields, along with their respective types.

from typing import Literal, List

class NPSTopic(dspy.Signature):
  """Classify NPS topics"""

  comment: str = dspy.InputField()
  answer: List[Literal['Slow or Unreliable Shipping', 
    'Inaccurate Product Descriptions or Photos', 
    'Limited Size or Shade Availability', 'Difficult Product Discovery',
    'Unresponsive or Generic Customer Support', 
    'Website or App Bugs', 'Confusing Loyalty or Discount Systems', 
    'Complicated Returns or Exchanges', 'Customs and Import Charges', 
    'Damaged or Incorrect Items']] = dspy.OutputField()

The next step is to define the module. Since we don’t need any tools, I will use a chain-of-thought prompting approach.

nps_topic_model = dspy.ChainOfThought(NPSTopic)

That’s it. We can give it a try. Based on a single example, the model performs quite well.

response = nps_topic_model(
  comment = """Absolutely frustrated! Every time I find something I love, 
    it's sold out in my size. What's the point of having a wishlist 
    if nothing is ever available?""")

print(response.answer)
# ["Limited Size or Shade Availability"]

You might be wondering why we’re discussing such a straightforward task. It took us just 2 minutes to build the prototype. That’s true, but the goal here is to see how DSPy optimisation works in practice using this example.

Optimisation is one of the framework’s standout features. DSPy can automatically tune the model weights and adjust instructions to optimise for the evaluation criteria you specify.

There are a bunch of DSPy optimisers available: 

  • Automatic few-shot learning (for example, BootstrapFewShot or BootstrapFewShotWithRandomSearch) automatically selects the best examples and adds them to the signature, implementing a few-shot learning prompt.
  • Automatic instruction optimisation (for example, MIPROv2) can simultaneously adjust instructions and select examples for few-shot learning.
  • Automatic fine-tuning (for example, BootstrapFinetune) adjusts the language model’s weights.

In this article, I will focus solely on instruction optimisation. I’ve decided to start with the MIPROv2 optimiser (which stands for “Multiprompt Instruction Proposal Optimizer Version 2”), since it can tweak prompts and add examples at the same time. For more details, check the article “Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs” by Opsahl-Ong et al.

If you’re interested in a fine-tuning example, you can look it up in the documentation.

For optimisation, we will need: a DSPy program (which we already have), a metric and a set of examples (ideally, divided into training and validation sets).

For the training set, I synthesised 100 examples of NPS comments with labels. Let’s split these into training and validation sets. 
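
For reference, each record in nps_data is a plain dict holding a comment and its labelled topics, along these lines (the comment below is a made-up example):

# shape of the synthesised dataset (illustrative record)
nps_data = [
  {
    'comment': "Ordered two weeks ago and my parcel still hasn't arrived.",
    'topics': ['Slow or Unreliable Shipping']
  },
  # ... 99 more records like this
]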

import random

trainset = []
valset = []
for rec in nps_data: 
  if random.random() <= 0.5:
    trainset.append(
      dspy.Example(
        comment = rec['comment'],
        answer = rec['topics']
      ).with_inputs('comment')
    )
  else: 
    valset.append(
      dspy.Example(
        comment = rec['comment'],
        answer = rec['topics']
      ).with_inputs('comment')
    )

Now, let’s define the function that will calculate the metric. We need a custom function because the default function dspy.evaluate.answer_exact_match doesn’t work with arrays.

def list_exact_match(example, pred, trace=None):
  """Custom metric for comparing lists of topics"""
  try:
    pred_answer = pred.answer
    expected_answer = example.answer
      
    # Convert to sets for order-independent comparison
    if isinstance(pred_answer, list) and isinstance(expected_answer, list):
      return set(pred_answer) == set(expected_answer)
    else:
      return pred_answer == expected_answer
  except Exception as e:
    print(f"Error in metric: {e}")
    return False
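
We can quickly sanity-check the metric on a toy pair of expected and predicted answers (the values below are made up):

ex = dspy.Example(
  comment = 'The app crashes every time I open my cart.',
  answer = ['Website or App Bugs']
).with_inputs('comment')
pred = dspy.Prediction(answer = ['Website or App Bugs'])

print(list_exact_match(ex, pred))
# True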

Now, we have everything we need to start the optimisation process. 

tp = dspy.MIPROv2(metric=list_exact_match, auto="light", num_threads=24)
opt_nps_topic_model =  tp.compile(
  nps_topic_model, 
  trainset=trainset, 
  valset=valset,
  requires_permission_to_run = False, provide_traceback=True)

I chose auto = "light" to keep the number of iterations low, but even with this setting, optimisation might take quite a long time (60–90 mins).

Once it’s ready, we can run both models on the validation set and compare their results. 

import tqdm

tmp = []

for e in tqdm.tqdm(valset):
  comment = e.comment 
  prev_resp = nps_topic_model(comment = comment) 
  new_resp = opt_nps_topic_model(comment = comment)

  tmp.append(
    {
      'comment': comment,
      'sot_answer': e.answer,
      'prev_answer': prev_resp.answer,
      'new_answer': new_resp.answer
    }
  )
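
To quantify the comparison, here is a minimal sketch of the accuracy calculation over the collected records, reusing the same order-independent set comparison as our metric:

def accuracy(records, field):
  # share of examples where the predicted topic list exactly
  # matches the labelled one (ignoring order)
  return sum(
    set(r[field]) == set(r['sot_answer']) for r in records
  ) / len(records)

print(f"baseline: {accuracy(tmp, 'prev_answer'):.1%}")
print(f"optimised: {accuracy(tmp, 'new_answer'):.1%}")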

We’ve gained a significant improvement in accuracy: from 62.3% to 82%. That’s a really cool result. 

Let’s compare the prompts to see what changed. The optimiser updated the objective from the high-level one we initially defined, “Classify NPS topics”, to a more specific one: “Classify customer feedback comments related to online shopping issues, such as website bugs, product availability, and inaccurate descriptions, into relevant NPS topics.” Additionally, the algorithm selected five examples to include in the prompt.

Image by author

I tried a simpler version of the optimiser (BootstrapFewShotWithRandomSearch) that only adds examples to the prompt, and it achieved roughly the same results, 77% accuracy. This suggests that few-shot prompting is the main driver of the accuracy improvement.

tp_v2 = dspy.BootstrapFewShotWithRandomSearch(list_exact_match, 
  num_threads=24, max_bootstrapped_demos = 10)

opt_v2_nps_topic_model = tp_v2.compile(
  nps_topic_model, 
  trainset=trainset, 
  valset=valset)
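
Once you’re happy with an optimised program, you can persist it and reload it later without re-running the optimisation. A minimal sketch (the file name is arbitrary):

# save the optimised program (instructions + selected demos)
opt_nps_topic_model.save("nps_topic_model.json")

# load it back into a fresh module with the same signature
loaded_model = dspy.ChainOfThought(NPSTopic)
loaded_model.load("nps_topic_model.json")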

That’s it for the topic modelling task. We achieved remarkable results using a small local model and just a few lines of code.

You can find the full code on GitHub.

Summary

In this article, we’ve explored the DSPy framework and its capabilities. Now, it’s time to wrap things up with a quick summary.

  • DSPy (Declarative Self-improving Python) is a modular, declarative framework for building AI applications, originally developed by researchers at Stanford NLP.
  • Its core philosophy is “Programming — not prompting — LMs”. So, the framework encourages you to create applications using structured building blocks like modules or signatures rather than handcrafted prompts. While I really like the idea of building LLM applications more like traditional software, I’ve grown so accustomed to prompting that it feels somewhat uncomfortable to give up this level of control.
  • The most impressive feature of the framework is definitely optimisers. DSPy allows you to automatically improve the pipelines either by tuning prompts (both adjusting instructions and adding optimal few-shot examples) or by fine-tuning the language model’s weights. 

Thank you so much for reading. I hope this article was insightful for you.

Reference

This article is inspired by the “DSPy: Build and Optimise Agentic Apps” short course from DeepLearning.AI.

