I edited a system prompt and had no way to prove it changed anything. So I built a measurement tool.

A few months ago I was on a team project. The tech lead asked me to update a chatbot's system prompt to make the responses sound more formal. I made the change, ran the chatbot a few times, and realized I had no real way to verify it worked.

I was making API calls manually, reading outputs side by side, and guessing.

So I built the tool I wished I had.


The problem with the current options

Right now, if you want to compare two system prompts, you have roughly three options:

  1. Run them manually and eyeball the outputs. Slow and unreliable.
  2. Set up promptfoo with a YAML config. Way too much overhead for a quick gut-check.
  3. Use a platform like LangSmith. Requires signup and sends your data to a cloud dashboard.

None of these fit the use case of "I just want to validate this one change before I push it."


What I built

compare-prompts takes a dict of prompts and a list of test inputs, runs every prompt against every input on your model of choice, and prints a side-by-side behavioral comparison table in your terminal. No config files, no setup.

pip install compare-prompts

from compare_prompts import compare

compare(
    prompts={
        "original": "You are a helpful assistant.",
        "formal":   "You are a professional, formal assistant.",
    },
    inputs=[
        "Explain what a database is.",
        "What is recursion?",
        "Write a short poem about coding.",
    ],
    model="gpt-4o-mini"
)

  Running 2 prompts x 3 inputs = 6 calls...

                    Prompt Comparison Results
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
                         original             formal
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  avg length (tokens)    187                  143  (-24%)
  tone                   casual (68%)         formal (74%)
  uses lists             67%                  33%
  uses headers           33%                  66%  (+100%)
  avg cost (USD)*        $0.0021              $0.0016  (-24%)
  refusal rate           0%                   0%
  reading level          high school          college
  avg sentence length    18.3 words           24.1 words  (+32%)

* requires: pip install "compare-prompts[all]"

Each column is a prompt. Each row is a measured behavior. Numbers in parentheses are deltas from your baseline (here, the "original" prompt).
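To make the deltas concrete, here is a minimal sketch of the arithmetic behind a number like (-24%). The helper name pct_delta is made up for illustration and is not part of compare-prompts; it just assumes the delta is the relative change from the baseline column, rounded to a whole percent.

```python
def pct_delta(baseline: float, value: float) -> str:
    """Format the relative change from baseline as a signed whole percent."""
    if baseline == 0:
        return "n/a"  # relative change is undefined for a zero baseline
    change = (value - baseline) / baseline * 100
    return f"{change:+.0f}%"


# Values taken from the example table above
print(pct_delta(187, 143))    # avg length (tokens)  -> -24%
print(pct_delta(18.3, 24.1))  # avg sentence length  -> +32%
print(pct_delta(33, 66))      # uses headers         -> +100%
```

Under that assumption, the three results line up with the deltas printed in the table.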


What it measures

  Metric                 What it tells you
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  avg length (tokens)    Is the change making responses longer or shorter?
  tone                   Dominant writing style across 9 categories
  avg cost (USD)         Estimated API cost per call based on token usage
  uses lists             Does the phrasing change how often the model uses bullet points?
  uses headers           Same for markdown headers
  refusal rate           Is the new prompt accidentally making the model more cautious?
  reading level          Flesch-Kincaid grade
  avg sentence length    Proxy for response density
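For a feel of what the structural rows involve, here is a rough sketch of how metrics like these could be computed from raw response text. This is not the compare-prompts implementation: every function name below is hypothetical, the regexes are simple heuristics, and only the Flesch-Kincaid grade formula is the standard published one. It takes word, sentence, and syllable counts as inputs, because counting syllables well is a separate problem.

```python
import re


def uses_lists(text: str) -> bool:
    """True if any line starts with a bullet (-, *, +) or a numbered item (1. or 1))."""
    pattern = r"^[ \t]*(?:[-*+]|\d+[.)])[ \t]+\S"
    return re.search(pattern, text, re.MULTILINE) is not None


def uses_headers(text: str) -> bool:
    """True if any line is an ATX-style markdown header (# through ######)."""
    return re.search(r"^#{1,6}[ \t]+\S", text, re.MULTILINE) is not None


def avg_sentence_length(text: str) -> float:
    """Mean words per sentence, splitting on ., ! and ? (a rough heuristic)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)


def fk_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid grade level from pre-computed counts."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def rate(responses: list[str], check) -> float:
    """Share of responses (0.0 to 1.0) for which check(response) is true."""
    return sum(map(check, responses)) / len(responses) if responses else 0.0


responses = [
    "Intro:\n- first\n- second",
    "## Title\nPlain text.",
    "No structure here.",
]
print(f"uses lists:   {rate(responses, uses_lists):.0%}")
print(f"uses headers: {rate(responses, uses_headers):.0%}")
```

Each per-response check returns a boolean, so a percentage row is just the share of responses where the check fired.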

Supported models

compare(..., model="gpt-4o-mini")                        # OpenAI
compare(..., model="groq/llama-3.3-70b-versatile")       # Groq
compare(..., model="gemini/gemini-2.0-flash")             # Google Gemini
compare(..., model="ollama/llama3")                       # Ollama (local)
compare(..., model="anthropic/claude-3-5-haiku-20241022") # Anthropic

For 2,600+ additional models (Azure, AWS Bedrock, Vertex AI, OpenRouter):

pip install "compare-prompts[all]"

Three scenarios where this is actually useful

  1. You made a prompt "warmer." Did the tone column actually change, or did you just think it did?
  2. You are about to deploy a rewritten, shorter version of a 1,000-word inherited prompt. Confirm the refusal rate held at 0% before you push it.
  3. You want to know if a different model produces responses behaviorally close enough to your current one to justify switching.
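The second scenario boils down to a before/after check on refusal rate. As a hedged sketch of that idea (the marker phrases and the helper names is_refusal, refusal_rate, and refusal_regressed are assumptions for illustration, not how compare-prompts detects refusals), a simple phrase match on the opening of each response could look like this:

```python
# Hypothetical marker phrases; a real detector would need a broader list.
REFUSAL_MARKERS = (
    "i can't",
    "i cannot",
    "i'm sorry, but",
    "i am unable to",
    "i won't",
)


def is_refusal(response: str) -> bool:
    """Heuristic: a refusal phrase appears in the first 120 characters."""
    opening = response.strip().lower()[:120].replace("\u2019", "'")
    return any(marker in opening for marker in REFUSAL_MARKERS)


def refusal_rate(responses: list[str]) -> float:
    """Share of responses (0.0 to 1.0) flagged as refusals."""
    if not responses:
        return 0.0
    return sum(is_refusal(r) for r in responses) / len(responses)


def refusal_regressed(baseline: list[str], candidate: list[str]) -> bool:
    """True if the candidate prompt refuses more often than the baseline."""
    return refusal_rate(candidate) > refusal_rate(baseline)


baseline = [
    "Sure, here is a poem.",
    "Recursion is when a function calls itself.",
]
candidate = [
    "I'm sorry, but I can't help with that.",
    "Recursion is when a function calls itself.",
]
print(refusal_regressed(baseline, candidate))  # True
```

Only the opening of each response is checked, so a phrase like "I cannot" buried deep in an otherwise helpful answer does not count as a refusal.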

What is next

Tone detection is currently keyword-based. I want to replace it with something smarter. CSV/JSON export, streaming output, and a .gitignore generator for the init command are on the roadmap.
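For readers wondering what "keyword-based" means in practice, here is a minimal sketch of the general technique. It is not the compare-prompts code: the two categories, the keyword sets, and the detect_tone name are all made up for illustration (the real tool covers 9 categories).

```python
import re
from collections import Counter

# Hypothetical keyword sets, one per tone category.
TONE_KEYWORDS = {
    "formal": {"therefore", "furthermore", "regarding", "consequently", "moreover"},
    "casual": {"hey", "yeah", "gonna", "cool", "awesome"},
}


def detect_tone(text: str, default: str = "neutral") -> str:
    """Return the tone whose keywords appear most often, or default if none do."""
    words = re.findall(r"[a-z']+", text.lower())
    hits = Counter()
    for tone, keywords in TONE_KEYWORDS.items():
        hits[tone] = sum(1 for word in words if word in keywords)
    # On a tie, the category listed first in TONE_KEYWORDS wins.
    tone, count = hits.most_common(1)[0]
    return tone if count > 0 else default


print(detect_tone("Furthermore, the index is therefore required."))  # formal
print(detect_tone("Hey, that's gonna be awesome!"))                  # casual
print(detect_tone("A database stores data."))                        # neutral
```

The weakness is easy to see: tone carried by sentence structure rather than vocabulary goes undetected, which is the usual reason to move past keyword matching.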

PRs are open: https://github.com/OmarMashal0/compare-prompts

Source: dev.to
