I edited a system prompt and had no way to prove it changed anything. So I built a measurement tool.
A few months ago, on a team project, the tech lead asked me to update a chatbot's system prompt so its responses sounded more formal. I made the change, ran the chatbot a few times, and realized I had no real way to verify that it worked.
I was making API calls by hand, reading outputs side by side, and guessing.
So I built the tool I wished I had.
The problem with the current options
To compare two system prompts right now, your options are basically:
- Run them manually and eyeball the outputs. Slow and unreliable.
- Set up promptfoo with a YAML config. Way too much overhead for a quick gut-check.
- Use a platform like LangSmith. Requires signup and sends your data to a cloud dashboard.
None of these fit the use case of "I just want to validate this one change before I push it."
What I built
compare-prompts takes a dict of prompts and a list of test inputs, runs them all through your model of choice, and prints a side-by-side behavioral comparison table in your terminal. No config files, no setup.
pip install compare-prompts
from compare_prompts import compare

compare(
    prompts={
        "original": "You are a helpful assistant.",
        "formal": "You are a professional, formal assistant.",
    },
    inputs=[
        "Explain what a database is.",
        "What is recursion?",
        "Write a short poem about coding.",
    ],
    model="gpt-4o-mini"
)
Running 2 prompts x 3 inputs = 6 calls...
Prompt Comparison Results
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
                       original        formal
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
avg length (tokens)    187             143 (-24%)
tone                   casual (68%)    formal (74%)
uses lists             67%             33%
uses headers           33%             66% (+100%)
avg cost (USD)*        $0.0021         $0.0016 (-24%)
refusal rate           0%              0%
reading level          high school     college
avg sentence length    18.3 words      24.1 words (+32%)
* requires: pip install "compare-prompts[all]"
Each column is a prompt. Each row is a measured behavioral difference. Signed percentages in parentheses are changes relative to your baseline, which is the first column (original) in this run.
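To make the delta column concrete, here is a minimal sketch of how a relative change like `(-24%)` can be derived from two averages. The function name `relative_delta` and the choice to print nothing when the change rounds to zero are assumptions for illustration. The tool's own rules for which rows get a delta may differ.

```python
def relative_delta(baseline: float, value: float) -> str:
    """Format the percentage change of `value` relative to `baseline`.

    Hypothetical helper, not part of compare-prompts.
    """
    if baseline == 0:
        # A relative change is undefined against a zero baseline.
        return "" if value == 0 else "(n/a)"
    pct = round((value - baseline) / baseline * 100)
    if pct == 0:
        return ""
    # The "+d" format spec always prints the sign, so increases read "+32".
    return f"({pct:+d}%)"


print(relative_delta(187, 143))    # avg length in the table above
print(relative_delta(18.3, 24.1))  # avg sentence length in the table above
```

Plugging in the averages from the table reproduces the deltas shown there for length and sentence length.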
What it measures
| Metric | What it tells you |
|---|---|
| avg length (tokens) | Is the change making responses longer or shorter? |
| tone | Dominant writing style across 9 categories |
| avg cost (USD) | Estimated API cost per call based on token usage |
| uses lists | Does the phrasing change how often the model uses bullet points? |
| uses headers | Same for markdown headers |
| refusal rate | Is the new prompt accidentally making the model more cautious? |
| reading level | How hard the response is to read (Flesch-Kincaid grade level) |
| avg sentence length | Proxy for response density |
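For a sense of what sits behind a few of these rows, the sketch below computes list usage, header usage, average sentence length, and a Flesch-Kincaid grade from raw response text. Everything here is an assumption for illustration: the regexes, the naive vowel-group syllable counter, and all function names are hypothetical and not taken from compare-prompts. Only the Flesch-Kincaid grade formula itself is the standard one.

```python
import re

# Assumed patterns for markdown structure; the real tool may use other rules.
LIST_LINE = re.compile(r"^[ \t]*(?:[-*+]|\d+[.)])[ \t]+\S", re.MULTILINE)
HEADER_LINE = re.compile(r"^#{1,6}[ \t]+\S", re.MULTILINE)
SENTENCE_END = re.compile(r"[.!?]+(?:\s+|$)")


def uses_lists(text: str) -> bool:
    """True if any line starts with a bullet or a numbered-list marker."""
    return bool(LIST_LINE.search(text))


def uses_headers(text: str) -> bool:
    """True if any line starts with a markdown header marker."""
    return bool(HEADER_LINE.search(text))


def split_sentences(text: str) -> list[str]:
    """Split on sentence-ending punctuation and drop empty pieces."""
    return [s for s in SENTENCE_END.split(text) if s.strip()]


def avg_sentence_length(text: str) -> float:
    """Average number of whitespace-separated words per sentence."""
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)


def count_syllables(word: str) -> int:
    """Naive estimate: one syllable per run of vowels, at least one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_kincaid_grade(text: str) -> float:
    """Standard Flesch-Kincaid grade formula with estimated syllables."""
    sentences = split_sentences(text)
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)


def rate(responses: list[str], predicate) -> float:
    """Percentage of responses for which `predicate` returns True."""
    if not responses:
        return 0.0
    return 100.0 * sum(1 for r in responses if predicate(r)) / len(responses)
```

With three test inputs, `rate(responses, uses_lists)` can only land on 0, 33.3, 66.7, or 100, which is why the percentages in a small run look coarse. Running more inputs gives finer-grained rates.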
Supported models
compare(..., model="gpt-4o-mini") # OpenAI
compare(..., model="groq/llama-3.3-70b-versatile") # Groq
compare(..., model="gemini/gemini-2.0-flash") # Google Gemini
compare(..., model="ollama/llama3") # Ollama (local)
compare(..., model="anthropic/claude-3-5-haiku-20241022") # Anthropic
For 2,600+ additional models (Azure, AWS Bedrock, Vertex AI, OpenRouter):
pip install "compare-prompts[all]"
Three scenarios where this is actually useful
- You made a prompt "warmer." Did the tone column actually change, or did you just think it did?
- You are about to deploy a rewritten, shorter version of a 1,000-word inherited prompt. Confirm the refusal rate held at 0% before you push it.
- You want to know if a different model produces responses behaviorally close enough to your current one to justify switching.
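The model-switching scenario amounts to running the same prompts and inputs once per model and reading the tables against each other. A minimal sketch, assuming only what the example earlier shows: `compare` accepts `prompts`, `inputs`, and `model` as keyword arguments and prints its table. The wrapper `compare_across_models` is hypothetical, and it takes the comparison function as a parameter so it can be tried with a stub before spending API calls.

```python
from typing import Callable


def compare_across_models(
    compare_fn: Callable[..., object],
    prompts: dict[str, str],
    inputs: list[str],
    models: list[str],
) -> None:
    """Run the same prompt/input grid once per model, one table per model.

    Hypothetical wrapper; `compare_fn` is expected to print its own results.
    """
    for model in models:
        print(f"=== {model} ===")
        compare_fn(prompts=prompts, inputs=inputs, model=model)


def dry_run(prompts, inputs, model):
    """Stub that reports how many calls a real run would make."""
    print(f"would run {len(prompts)} prompts x {len(inputs)} inputs on {model}")


compare_across_models(
    dry_run,
    prompts={"current": "You are a helpful assistant."},
    inputs=["Explain what a database is.", "What is recursion?"],
    models=["gpt-4o-mini", "ollama/llama3"],
)
```

To run it for real, pass `compare` imported from `compare_prompts` in place of `dry_run`. Each model adds a full set of calls, so keep the input list short for a first pass.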
What is next
Tone detection is currently keyword-based. I want to replace it with something smarter. CSV/JSON export, streaming output, and a .gitignore generator for the init command are on the roadmap.
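For readers unfamiliar with the approach, keyword-based tone detection can be as simple as counting hits against per-tone word lists. The sketch below is an assumption about the general technique, not the tool's implementation: the two categories, the word lists, and the `detect_tone` function are all hypothetical (the tool uses 9 categories).

```python
import re
from collections import Counter

# Hypothetical keyword lists, invented for this example.
TONE_KEYWORDS = {
    "formal": {"therefore", "furthermore", "regarding", "consequently", "thus"},
    "casual": {"hey", "yeah", "gonna", "awesome", "cool", "stuff"},
}


def detect_tone(text: str) -> tuple[str, float]:
    """Return the dominant tone and its share of all keyword hits."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for label, keywords in TONE_KEYWORDS.items():
        counts[label] = sum(1 for w in words if w in keywords)
    total = sum(counts.values())
    if total == 0:
        # No keyword matched, so there is no evidence for any tone.
        return ("neutral", 0.0)
    label, hits = counts.most_common(1)[0]
    return (label, hits / total)


print(detect_tone("Hey, that's gonna be awesome!"))
print(detect_tone("Therefore, regarding the matter, yeah."))
```

The weakness is easy to see: tone carried by sentence structure or word choice outside the lists is invisible to this kind of matcher, which is a fair reason to want something smarter.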
PRs are open: https://github.com/OmarMashal0/compare-prompts