I edited a system prompt and had no way to prove it changed anything. So I built a measurement tool.

A few months ago I was on a team project. The tech lead asked me to update a chatbot's system prompt to make the responses sound more formal. I made the change, ran the chatbot a few times, and realized I had no real way to verify it worked.

I was making API calls manually, reading outputs side by side, and guessing.

So I built the tool I wished I had.

The problem with the current options

To compare two system prompts right now, your options are basically:

Run them manually and eyeball the outputs. Slow and unreliable.
Set up promptfoo with a YAML config. Way too much overhead for a quick gut-check.
Use a platform like LangSmith. Requires signup and sends your data to a cloud dashboard.

None of these fit the use case of "I just want to validate this one change before I push it."

What I built

compare-prompts takes a dict of prompts and a list of test inputs, runs them all through your model of choice, and prints a side-by-side behavioral comparison table in your terminal. No config files, no setup.

pip install compare-prompts

from compare_prompts import compare

compare(
    prompts={
        "original": "You are a helpful assistant.",
        "formal":   "You are a professional, formal assistant.",
    },
    inputs=[
        "Explain what a database is.",
        "What is recursion?",
        "Write a short poem about coding.",
    ],
    model="gpt-4o-mini"
)

  Running 2 prompts x 3 inputs = 6 calls...

                    Prompt Comparison Results
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
                         original             formal
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  avg length (tokens)    187                  143  (-24%)
  tone                   casual (68%)         formal (74%)
  uses lists             67%                  33%
  uses headers           33%                  67%  (+100%)
  avg cost (USD)*        $0.0021              $0.0016  (-24%)
  refusal rate           0%                   0%
  reading level          high school          college
  avg sentence length    18.3 words           24.1 words  (+32%)

* requires: pip install "compare-prompts[all]"

Each column is a prompt and each row is a measured behavior. Numbers in parentheses are percentage changes relative to the baseline prompt, which is "original" in this run.
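To make the parenthesized numbers concrete, here is a minimal sketch of a percent-change calculation that reproduces the deltas in the table above. The function name pct_delta and the "(n/a)" fallback are my own choices for illustration, not necessarily what the package does internally.

```python
def pct_delta(baseline: float, candidate: float) -> str:
    """Format the relative change from baseline to candidate, e.g. '(-24%)'."""
    if baseline == 0:
        # Relative change is undefined against a zero baseline (assumed fallback).
        return "(n/a)"
    change = (candidate - baseline) / baseline * 100
    return f"({change:+.0f}%)"


print(pct_delta(187, 143))        # avg length (tokens)  -> (-24%)
print(pct_delta(0.0021, 0.0016))  # avg cost (USD)       -> (-24%)
print(pct_delta(18.3, 24.1))      # avg sentence length  -> (+32%)
```

A relative change is easier to scan than two raw numbers, but it gets noisy when the baseline is small, so it is worth reading the raw values next to it.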

What it measures

Metric	What it tells you
avg length (tokens)	Is the change making responses longer or shorter?
tone	Dominant writing style across 9 categories
avg cost (USD)	Estimated API cost per call based on token usage
uses lists	Does the phrasing change how often the model uses bullet points?
uses headers	Does it change how often the model uses Markdown headers?
refusal rate	Is the new prompt accidentally making the model more cautious?
reading level	Flesch-Kincaid grade level, shown as a band such as high school or college
avg sentence length	Proxy for response density
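If you are curious how metrics like these can be computed, here is a rough sketch of the formatting and readability rows using only the standard library. Everything here is an assumption for illustration: the regexes, the helper names, and the naive syllable counter are mine, and the package may compute these differently. The Flesch-Kincaid formula itself is the standard one.

```python
import re

# Hypothetical patterns: a line starting with a bullet or numbered marker,
# and a line starting with one to six '#' characters.
BULLET = re.compile(r"^\s*(?:[-*+]|\d+[.)])\s+\S", re.MULTILINE)
HEADER = re.compile(r"^#{1,6}\s+\S", re.MULTILINE)


def share(responses, pattern):
    """Fraction of responses in which the pattern appears at least once."""
    return sum(bool(pattern.search(r)) for r in responses) / len(responses)


def sentences(text):
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def words(text):
    return re.findall(r"[A-Za-z']+", text)


def syllables(word):
    """Rough estimate: count vowel groups, with at least one per word."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def avg_sentence_length(text):
    # Assumes non-empty text; an empty string would divide by zero.
    return len(words(text)) / len(sentences(text))


def fk_grade(text):
    """Flesch-Kincaid grade: 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    w, s = words(text), sentences(text)
    syl = sum(syllables(x) for x in w)
    return 0.39 * len(w) / len(s) + 11.8 * syl / len(w) - 15.59


responses = [
    "# Databases\n\nA database stores data.",
    "- fast\n- simple",
    "Recursion is a function calling itself.",
]
print(share(responses, BULLET), share(responses, HEADER))
print(avg_sentence_length("The cat sat. The dog ran off."))
```

Treat the grade as a rough signal. The vowel-group syllable count is crude, so two tools can disagree by a grade or so on the same text.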

Supported models

compare(..., model="gpt-4o-mini")                        # OpenAI
compare(..., model="groq/llama-3.3-70b-versatile")       # Groq
compare(..., model="gemini/gemini-2.0-flash")             # Google Gemini
compare(..., model="ollama/llama3")                       # Ollama (local)
compare(..., model="anthropic/claude-3-5-haiku-20241022") # Anthropic

For 2,600+ additional models (Azure, AWS Bedrock, Vertex AI, OpenRouter):

pip install "compare-prompts[all]"

Three scenarios where this is actually useful

You made a prompt "warmer." Did the tone column actually change, or did you just think it did?
You are about to deploy a rewritten, shorter version of a 1,000-word inherited prompt. Confirm the refusal rate held at 0% before you push it.
You want to know if a different model produces responses behaviorally close enough to your current one to justify switching.
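For the second scenario, it helps to know what a refusal check can look like. Below is a minimal sketch that flags a response as a refusal when its opening contains a stock phrase. The marker list, the 200-character window, and the function names are all hypothetical and are not taken from the package.

```python
# Hypothetical marker phrases; a real list would need to be longer.
REFUSAL_MARKERS = (
    "i can't help with",
    "i cannot help with",
    "i can't assist",
    "i cannot assist",
    "i'm unable to",
    "i am unable to",
)


def is_refusal(response: str) -> bool:
    # Assumption: a refusal announces itself near the start of the response.
    head = response.strip().lower()[:200]
    head = head.replace("\u2019", "'")  # normalize curly apostrophes
    return any(marker in head for marker in REFUSAL_MARKERS)


def refusal_rate(responses) -> float:
    """Fraction of responses flagged as refusals."""
    return sum(is_refusal(r) for r in responses) / len(responses)


print(refusal_rate([
    "I'm unable to help with that request.",
    "A database is an organized collection of data.",
]))
```

Phrase matching misses soft refusals, such as a polite redirect that never answers, so a 0% reading is a sanity check and not proof.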

What is next

Tone detection is currently keyword-based. I want to replace it with something smarter. CSV/JSON export, streaming output, and a .gitignore generator for the init command are on the roadmap.
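To show why keyword-based tone detection is limited, here is a toy version of the idea. The two categories and their word lists are made up for this example (the real tool reports 9 categories), and the function is a sketch of the general technique, not the package's code.

```python
import re
from collections import Counter

# Hypothetical vocabularies for two illustrative tone categories.
TONE_KEYWORDS = {
    "formal": {"therefore", "furthermore", "however", "regarding", "consequently"},
    "casual": {"hey", "yeah", "gonna", "cool", "stuff", "basically"},
}


def keyword_tone(text: str) -> str:
    """Return the tone whose keywords appear most often, or 'neutral' if none do."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = Counter()
    for tone, vocab in TONE_KEYWORDS.items():
        hits[tone] = sum(t in vocab for t in tokens)
    tone, count = hits.most_common(1)[0]
    return tone if count > 0 else "neutral"


print(keyword_tone("Hey, that's basically cool stuff."))
print(keyword_tone("Furthermore, the index is therefore required."))
# Clearly formal wording, but it contains none of the listed keywords:
print(keyword_tone("We would be delighted to assist you with your enquiry."))
```

The last call shows the weakness. The sentence reads as formal to a person, yet it falls through to neutral because tone lives in phrasing and structure more than in any fixed word list.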

PRs are open: https://github.com/OmarMashal0/compare-prompts