TL;DR: Search quality problems cost users and revenue. OpenSearch Search Relevance lets you measure exactly what's broken, iterate on fixes, and prove improvement with metrics. This guide walks you through a real-world search tuning workflow.
The Problem: Search That Works for You, Not Your Users
You've built a solid search system. Elasticsearch. OpenSearch. Full-text search running smoothly. But then you hear it:
"Why can't I find what I'm looking for?"
"Your search results are terrible."
"I have to refine my query five times to get what I need."
This is the gap between search that works and search that matters. Your system might be technically sound, but the relevance—the quality of what you return—is broken. And here's the catch: you can't improve what you don't measure.
Most teams guess. They tweak analyzers, adjust boost factors, shuffle query logic, deploy, and hope. Sometimes it helps. Sometimes it makes things worse. Nobody knows because they're not measuring.
This is where OpenSearch Search Relevance comes in.
What is OpenSearch Search Relevance?
OpenSearch Search Relevance is a plugin ecosystem that turns search tuning from guesswork into science. It does three key things:
- Captures ground truth: You build a query set—representative questions your users actually ask
- Runs experiments: You configure multiple search strategies and compare them side by side
- Computes metrics: You get nDCG, precision, recall, and MRR scores—the same metrics information retrieval researchers use
The result: you know exactly what's working, what isn't, and by how much.
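Of those metrics, MRR (mean reciprocal rank) is the least self-explanatory: it rewards putting the first relevant result as high as possible. A minimal sketch of the computation (query and doc IDs are invented for illustration):

```python
def mean_reciprocal_rank(results_per_query, relevant_per_query):
    """MRR: average over queries of 1/rank of the first relevant result."""
    total = 0.0
    for query, ranked_docs in results_per_query.items():
        for rank, doc in enumerate(ranked_docs, start=1):
            if doc in relevant_per_query[query]:
                total += 1.0 / rank
                break
    return total / len(results_per_query)

results = {
    "comfortable running shoes": ["boot_001", "shoe_002", "shoe_003"],
    "slip-on shoes for office": ["shoe_009", "boot_004"],
}
relevant = {
    "comfortable running shoes": {"shoe_002", "shoe_003"},
    "slip-on shoes for office": {"shoe_009"},
}
print(mean_reciprocal_rank(results, relevant))  # (1/2 + 1/1) / 2 = 0.75
```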
A Real-World Scenario: E-commerce Search Gone Wrong
Let's say you're an e-commerce platform. Your search is powered by OpenSearch. Basic setup: BM25 scoring, standard analyzer, some field boosting. Reasonable. But your metrics show a problem:
Users searching "comfortable running shoes" get back:
- Hiking boots (nope)
- Dress shoes on sale (nope)
- A few running shoes at the bottom of page 2 (finally!)
Your nDCG score at position 10 is 0.42. That's bad. Users are leaving frustrated.
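To make that 0.42 concrete: nDCG@10 compares the graded gain of your actual top 10 against the gain of the ideal ordering of those same results. A minimal sketch, with made-up grades that roughly match this scenario:

```python
import math

def dcg(grades):
    # Graded gain (2^grade - 1), discounted by log2 of the position.
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(grades))

def ndcg_at_k(grades, k=10):
    """nDCG@k: DCG of the observed top-k vs. the ideal (sorted) ordering."""
    ideal = sorted(grades, reverse=True)
    return dcg(grades[:k]) / dcg(ideal[:k])

# Grades (0-3) for the observed top 10: boots and dress shoes up top,
# the actual running shoes buried near the bottom.
observed = [0, 0, 1, 0, 2, 0, 0, 1, 0, 3]
print(round(ndcg_at_k(observed), 2))  # 0.41 -- close to the 0.42 above
```

A perfect ordering would score 1.0; burying the grade-3 results at the bottom is exactly what drags the score down.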
The question: what's broken? The analyzer? The query type? The field weights? Without measurement, you're shooting in the dark.
Step 1: Build Your Query Set
First, you collect representative queries. Not made-up queries—real questions from your search logs, support tickets, user research. For the e-commerce example:
- "comfortable running shoes"
- "women's waterproof winter boots"
- "lightweight hiking shoes under $100"
- "best cross-training footwear"
- "slip-on shoes for office"
You grade the relevance of top results for each query. The grading is simple:
- Grade 3: Perfect match (what you wanted to buy)
- Grade 2: Good match (acceptable alternative)
- Grade 1: Poor match (tangentially related)
- Grade 0: Irrelevant (why is this here?)
This becomes your ground truth. This is what good search looks like for your domain.
Step 2: Set Up Your Baseline Search Configuration
You define how search works today. For our e-commerce example:
Query:

{
  "bool": {
    "must": [
      {
        "multi_match": {
          "query": "<user_query>",
          "fields": ["title^2", "description", "category"]
        }
      }
    ],
    "filter": [
      { "term": { "is_active": true } }
    ]
  }
}

Analyzer: "standard" (default tokenization, lowercase)
This is your baseline. We'll measure it, then try to beat it.
Step 3: Run an Experiment
Now you hypothesize: "The standard analyzer is losing words. If we use a synonym-aware analyzer and boost title matches more, we'll rank better."
You create a second search configuration:
Query:

{
  "bool": {
    "must": [
      {
        "multi_match": {
          "query": "<user_query>",
          "fields": ["title^3", "tags^2", "description"]
        }
      }
    ],
    "filter": [
      { "term": { "is_active": true } }
    ]
  }
}

Analyzer: "custom_with_synonyms"
- Tokenizer: standard
- Filters: lowercase, stop words, synonyms (running/jogging/athletic)
You run a PAIRWISE_COMPARISON experiment:
- Take your query set
- Execute each query with both configurations
- Present results side by side to human evaluators (or use implicit signals like CTR)
- Compute metrics for each
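The scoring step of that comparison boils down to running both configurations over the same query set and measuring each result list against the judgments. A minimal offline sketch (doc IDs and judgments are invented; "relevant" here means a human grade of 2 or higher):

```python
# Score two configurations against the same graded judgments.

def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def recall(ranked, relevant):
    """Fraction of all relevant docs that were retrieved at all."""
    return sum(1 for doc in ranked if doc in relevant) / len(relevant)

judged_relevant = {"shoe_123", "shoe_456", "shoe_789"}

baseline_results = ["boot_001", "shoe_123", "boot_002", "dress_004", "shoe_456"]
variant_results = ["shoe_123", "shoe_456", "shoe_789", "boot_001", "dress_004"]

for name, ranked in [("baseline", baseline_results), ("variant", variant_results)]:
    print(name,
          "P@5:", precision_at_k(ranked, judged_relevant, k=5),
          "recall:", round(recall(ranked, judged_relevant), 2))
```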
The output:
Baseline (standard analyzer):
- nDCG@10: 0.42
- Precision@10: 0.35
- Recall: 0.48
Variant (synonym-aware, title boost):
- nDCG@10: 0.68
- Precision@10: 0.58
- Recall: 0.71
That's a 62% improvement in nDCG. The variant wins decisively.
Step 4: Iterate
One win doesn't mean you're done. You run another experiment:
"What if we add a custom BM25 parameter tuning? Default is k1=1.2, b=0.75. We have short product titles—maybe b=0.5 would work better (less impact from field length)."
You create variant #3, measure it, and compare all three. Now you have data-driven evidence, not hunches.
Why This Matters
You stop guessing. Every change is measured against your ground truth.
You build confidence. When nDCG goes from 0.42 to 0.68, you're not wondering if you broke something—you know you improved it.
You compound gains. Each iteration is small (sometimes), but over months, small improvements stack: ten consecutive 10% wins multiply to a 2.6x total improvement (1.1^10 ≈ 2.59).
You communicate value. When leadership asks "Did the search redesign help?", you show metrics, not opinions.
You catch regressions. New feature breaks relevance? Your metrics catch it before users do.
Technical Implementation: Getting Started
Prerequisites
- OpenSearch cluster (2.x or later) with the search-relevance plugin installed
- OpenSearch Dashboards with dashboards-search-relevance plugin
Workflow
1. Create a query set via the UI or API
   - POST to the search-relevance query set endpoint
   - Provide queries + graded judgments
2. Define search configurations
   - Store in index templates or as JSON documents
   - Configurations are just OpenSearch queries + analyzer settings
3. Create an experiment
   - Specify baseline vs. variant(s)
   - Set the experiment type (PAIRWISE_COMPARISON, etc.)
   - Run via API or UI
4. Evaluate results
   - The dashboard shows nDCG, precision, recall, MRR
   - Human evaluators refine judgments (optional)
   - Export results for reporting
Example: Creating a Query Set via API
curl -X POST "localhost:9200/.search-relevance-queries/_doc" \
-H 'Content-Type: application/json' \
-d'{
"query_set_name": "ecommerce_footwear_q2_2026",
"queries": [
{
"query_text": "comfortable running shoes",
"judgments": [
{"doc_id": "shoe_123", "grade": 3},
{"doc_id": "shoe_456", "grade": 3},
{"doc_id": "boot_789", "grade": 0}
]
}
]
}'
Best Practices for Search Quality Evaluation
1. Represent your domain. Query sets should reflect real user behavior. If 40% of your queries are brand-specific ("Nike Air Max"), weight them accordingly.
2. Grade consistently. Define grade rubrics upfront. Grade 3 should mean the same thing across all evaluators. Consider inter-rater agreement checks (Kappa scores).
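As a concrete example of an agreement check, Cohen's kappa compares the observed agreement between two raters against the agreement you'd expect by chance. A minimal sketch (the grade lists are invented; values near 1.0 mean strong agreement, near 0 means no better than chance):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters grading the same items."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same grade at random.
    expected = sum(counts_a[g] * counts_b[g]
                   for g in set(counts_a) | set(counts_b)) / (n * n)
    return (observed - expected) / (1 - expected)

a = [3, 2, 0, 3, 1, 0, 2, 3]
b = [3, 2, 1, 3, 1, 0, 2, 2]
print(round(cohens_kappa(a, b), 2))  # 0.67 -- substantial, but worth a rubric review
```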
3. Start with quick wins. Don't boil the ocean. Fix analyzer issues, obvious field weight problems, missing synonyms. You'll see 20-30% gains fast.
4. Measure multiple metrics. nDCG is great, but also track precision (false positives matter) and recall (missed results matter). Together they tell the full story.
5. A/B test in production. Once confident in experiments, shadow your baseline for a week. Real user behavior > offline metrics.
6. Monitor over time. As your catalog changes, re-evaluate. New product types? Seasonal shifts? Your query set may need updates.
Common Pitfalls
Overfitting to your query set. If you tune only to 20 queries, you might break search for the other 980 query types you haven't measured.
Fix: Expand your query set regularly. Aim for 100+ representative queries.
Ignoring search latency. You improved nDCG by 10%, but query time went from 50ms to 500ms. Users see slower search as worse search.
Fix: Track latency alongside relevance metrics.
Forgetting about cold starts. Your new analyzer is great, but what about rare queries with few matches? Relevance breaks down at the edges.
Fix: Define fallback strategies. What happens when your perfect query gets zero results?
Not evaluating at scale. Your query set works, but only 30% of real user queries are covered. The other 70% are long-tail.
Fix: Use implicit signals (CTR, dwell time) to sample long-tail queries.
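One hedged way to do that is to derive provisional grades from click-through rate, and only for query/doc pairs with enough traffic. The thresholds below are illustrative assumptions, not established cutoffs; calibrate them against a human-judged sample before trusting them:

```python
def ctr_to_grade(clicks, impressions, min_impressions=50):
    """Map click-through rate to a provisional 0-3 grade.
    Thresholds are illustrative assumptions -- calibrate against
    a human-judged sample before using them as ground truth."""
    if impressions < min_impressions:
        return None  # not enough data to judge
    ctr = clicks / impressions
    if ctr >= 0.30:
        return 3
    if ctr >= 0.15:
        return 2
    if ctr >= 0.05:
        return 1
    return 0

print(ctr_to_grade(40, 100))  # 3
print(ctr_to_grade(2, 100))   # 0
print(ctr_to_grade(1, 10))    # None -- too few impressions
```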
Next Steps
Start small. Pick 50 queries. Grade top-10 results. Measure baseline nDCG.
Hypothesize. What analyzer issue, field weight problem, or query logic gap would improve things?
Experiment. Create one variant. Run the experiment. Compare metrics.
Iterate. Keep the winner. Try the next improvement.
Scale. Once confident, expand your query set and refine your configuration.
Wrapping Up
Search quality problems are hidden until you measure them. OpenSearch Search Relevance gives you the tools to turn search tuning from guesswork into data-driven iteration.
Your users asked for better search. Now you can prove you delivered.
I'm Prithvi S, Staff Software Engineer at Cloudera and an open-source enthusiast. Follow my work on GitHub: https://github.com/iprithv