Prompt Hashing for Duplicate Detection: Cutting LLM Waste With SHA-256


You know your LLM bill is higher than it should be. You can see total spend and total tokens in the OpenAI dashboard. What you can't see: how many of those requests are asking the exact same question that was answered five minutes ago.

Prompt hashing is the cheapest LLM optimization available — no model changes, no prompt rewrites, under 1ms added to the request path, no false positives. You hash the request, check the cache, skip the LLM on a hit — and don't pay for that call.

TL;DR:

  • The average production app sends 15–30% duplicate LLM requests. SHA-256 exact hashing catches all of them with zero false positives
  • The hash key must cover model + full messages array + normalized generation params — not just the prompt string
  • When teams first connect Preto, the average exact duplicate rate on day one is 18% — pure recoverable waste

Why Duplicates Are More Common Than You Think

The first reaction: "our requests aren't duplicates — every user's query is different." True for open-ended chat. Not true for most production LLM use cases:

Support/FAQ bots. "What are your hours?" "How do I reset my password?" "What's your return policy?" These arrive hundreds of times per day, word for word. Every one hits the LLM fresh.

Scheduled jobs. Weekly reports, nightly summaries, cron-triggered pipelines. Same prompt template, same data, same request — billed every run.

Application-layer cache misses. The most common root cause: someone added LLM calls to a hot path without adding a cache. Every page load calls the LLM even when nothing changed. The cache was on the roadmap. It never shipped.

SHA-256 hashing at the proxy layer catches all of these before they reach the provider — even if the application has no caching at all. You stop paying for duplicates without touching a line of app code.


What to Hash (Most Implementations Get This Wrong)

The naive approach: hash the last user message. Wrong in production.

"Summarize this document" sent to GPT-4o with temperature 0.2 should cache differently from the same string sent to GPT-4o-mini with temperature 0.8. Same prompt text, different request, different output.

The hash key must include everything that deterministically affects the output:

  • Model name — gpt-4o and gpt-4o-mini are different requests
  • Full messages array — role + content for every message, including system prompt
  • Generation params — temperature, max_tokens, top_p, frequency_penalty, presence_penalty

Exclude: user field, request IDs, stream flag, stream_options.


The Go Implementation

The critical requirement: canonical serialization. The same logical request must always produce the same byte sequence. Hashing the raw request body fails this, since clients differ in field order and whitespace. We rebuild the key from typed structs, whose fields marshal in a fixed order:

import (
    "crypto/sha256"
    "encoding/hex"
    "encoding/json"
    "math"
    "strings"
    // plus your OpenAI client package for the request/message types,
    // e.g. openai "github.com/sashabaranov/go-openai"
)

type CacheKey struct {
    Model    string    `json:"model"`
    Messages []Message `json:"messages"`
    Params   KeyParams `json:"params"`
}

type Message struct {
    Role    string `json:"role"`
    Content string `json:"content"`
}

type KeyParams struct {
    Temperature      float64 `json:"temperature,omitempty"`
    MaxTokens        int     `json:"max_tokens,omitempty"`
    TopP             float64 `json:"top_p,omitempty"`
    FrequencyPenalty float64 `json:"frequency_penalty,omitempty"`
    PresencePenalty  float64 `json:"presence_penalty,omitempty"`
}

func ComputePromptHash(req *openai.ChatCompletionRequest) string {
    key := CacheKey{
        Model:    req.Model,
        Messages: normalizeMessages(req.Messages),
        Params: KeyParams{
            Temperature: math.Round(req.Temperature*1000) / 1000,
            MaxTokens:   req.MaxTokens,
            TopP:        math.Round(req.TopP*1000) / 1000,
            // ... other params
        },
    }
    b, _ := json.Marshal(key) // struct fields marshal in declaration order — deterministic
    h := sha256.Sum256(b)
    return hex.EncodeToString(h[:])
}

func normalizeMessages(msgs []openai.ChatCompletionMessage) []Message {
    out := make([]Message, len(msgs))
    for i, m := range msgs {
        out[i] = Message{
            Role:    strings.TrimSpace(m.Role),
            Content: strings.TrimSpace(m.Content),
        }
    }
    return out
}

The float normalization (math.Round(x*1000)/1000) matters. Different SDK versions can produce 0.7 vs 0.6999999999999998 for the same value. Without normalization, these hash differently and your cache never hits.
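A standalone sanity check of that rounding step (the helper name `normTemp` is illustrative):

```go
import "math"

// normTemp rounds a generation parameter to 3 decimal places so that
// float-representation noise from different SDKs hashes identically.
func normTemp(t float64) float64 {
    return math.Round(t*1000) / 1000
}
```

`normTemp(0.6999999999999998)` and `normTemp(0.7)` produce the same float, so the two requests hash to the same key.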

Redis cache:

const cacheTTL = 4 * time.Hour

func (c *Cache) Get(ctx context.Context, hash string) (*CachedResponse, bool) {
    val, err := c.redis.Get(ctx, "ph:"+hash).Bytes()
    if err != nil {
        return nil, false // miss (or Redis error): fall through to the provider
    }
    var resp CachedResponse
    if err := json.Unmarshal(val, &resp); err != nil {
        return nil, false // corrupt entry: treat as a miss
    }
    return &resp, true
}

func (c *Cache) Set(ctx context.Context, hash string, resp *CachedResponse) {
    b, err := json.Marshal(resp)
    if err != nil {
        return
    }
    c.redis.Set(ctx, "ph:"+hash, b, cacheTTL)
}
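End to end, the path is: hash the request, look up, call the provider only on a miss. A minimal in-memory sketch of that flow (a map stands in for Redis, a closure for the provider call; names are illustrative):

```go
// memCache stands in for the Redis cache in this sketch.
type memCache map[string]string

// handle returns the cached response on a hit; on a miss it invokes
// call (the stand-in for the LLM request), stores the result, and
// reports whether the cache was hit.
func handle(cache memCache, hash string, call func() string) (string, bool) {
    if v, ok := cache[hash]; ok {
        return v, true // hit: the provider is never called
    }
    v := call() // miss: pay for the call once
    cache[hash] = v
    return v, false
}
```

The second request with the same hash returns the stored response without invoking the call at all, which is exactly where the savings come from.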

Real Duplicate Rates by Use Case (Anonymized Production Data)

Use Case                 | Exact Duplicate Rate | Notes
Support / FAQ bots       | 35–45%               | Users ask the same questions constantly
Scheduled / batch jobs   | 20–60%               | Retried jobs, multi-instance deploys
Code assist / review     | 8–15%                | Boilerplate patterns, shared stubs
General chat / copilot   | 3–8%                 | Lowest — open-ended queries rarely repeat
Weighted average         | ~18%                 | Discovered on day one of proxy connection

18% is not an edge case — it's the first thing visible when a team adds a proxy with prompt hashing. Some teams see 40%. That's the floor of recoverable waste, before semantic caching, model routing, or any other optimization.


What to Do With the Hash Beyond Caching

Duplicate rate reporting. Track cache hit rate per endpoint. This surfaces which parts of your application send duplicate traffic — usually a missing application-layer cache that should be fixed at the root.
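A sketch of the counters behind such a report (types and names are illustrative):

```go
// endpointStats accumulates cache hits and misses for one endpoint.
type endpointStats struct {
    hits, misses int
}

func (s *endpointStats) record(hit bool) {
    if hit {
        s.hits++
    } else {
        s.misses++
    }
}

// dupRate is the fraction of requests served from cache,
// i.e. the endpoint's exact duplicate rate.
func (s *endpointStats) dupRate() float64 {
    total := s.hits + s.misses
    if total == 0 {
        return 0
    }
    return float64(s.hits) / float64(total)
}
```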

Cost projection. A hash seen 200 times this month × $0.008/request (an illustrative per-request cost) = $1.60 in recoverable waste for one prompt. Multiply across all duplicate hashes for a concrete monthly saving figure.
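The same arithmetic across all hashes, as a sketch (a flat per-request cost is an assumption; real billing is per-token):

```go
// monthlyWaste sums recoverable spend across duplicate hashes.
// counts maps prompt hash -> times seen this month; costPerRequest
// is a flat illustrative figure.
func monthlyWaste(counts map[string]int, costPerRequest float64) float64 {
    total := 0.0
    for _, n := range counts {
        if n > 1 { // hash seen more than once: duplicate traffic
            total += float64(n) * costPerRequest
        }
    }
    return total
}
```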

Abuse detection. A single hash 10,000 times in an hour from one user is a different pattern from organic duplicates. Rate limit by hash to catch prompt injection loops and runaway retry logic.
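A sketch of per-hash limiting, in-memory for a single window (production would pair a Redis counter with an expiry; names are illustrative):

```go
// hashLimiter caps how many times a single prompt hash is served
// within one time window (window reset omitted in this sketch).
type hashLimiter struct {
    limit  int
    counts map[string]int
}

func newHashLimiter(limit int) *hashLimiter {
    return &hashLimiter{limit: limit, counts: make(map[string]int)}
}

// allow returns false once a hash exceeds the per-window limit.
func (l *hashLimiter) allow(hash string) bool {
    l.counts[hash]++
    return l.counts[hash] <= l.limit
}
```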


Where Hashing Falls Short

SHA-256 catches exact duplicates. It misses semantic duplicates.

"What are your business hours?" and "When do you open?" hash differently. They're the same question.

Semantic caching handles this with embedding similarity — it can catch 2–3x more waste, but requires an embedding model, a vector store, and careful threshold tuning to avoid false positives.
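The comparison at the heart of semantic caching is vector similarity between prompt embeddings, typically cosine similarity (a sketch; the embedding model, vector store, and threshold tuning are the hard parts):

```go
import "math"

// cosine returns the cosine similarity of two equal-length
// embedding vectors: 1.0 for identical direction, 0.0 for orthogonal.
func cosine(a, b []float64) float64 {
    var dot, na, nb float64
    for i := range a {
        dot += a[i] * b[i]
        na += a[i] * a[i]
        nb += b[i] * b[i]
    }
    if na == 0 || nb == 0 {
        return 0
    }
    return dot / (math.Sqrt(na) * math.Sqrt(nb))
}
```

A semantic cache treats two prompts as duplicates when their embeddings score above a tuned threshold, which is precisely where false positives can creep in.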

The right order: implement exact hashing first. Zero-risk, one afternoon, immediate savings. Add semantic caching once you've measured the remaining duplicate rate.


We're building Preto.ai — the proxy layer that computes prompt hashes on every request and surfaces duplicate rates, recoverable waste, and per-feature cost breakdown in real time. Free up to 10K requests.

Source: dev.to
