For years, Elasticsearch was one of those tools I would almost automatically reach for whenever a system needed search.
And honestly, for many use cases, it is still excellent.
If you need full-text search, filtering, aggregations, faceting, observability queries, or log exploration, Elasticsearch is a very mature and powerful engine. Its lexical search capabilities, especially through BM25 and inverted indexes, are battle-tested.
But my problem started when the product requirement changed from:
“Find documents that contain these words.”
to:
“Find documents that match the meaning of what the user is asking, but still respect exact keywords when they matter.”
That is where pure keyword search was no longer enough.
We needed real hybrid search.
Not just semantic search.
Not just BM25.
Not a manually glued-together ranking system.
We needed a search engine that could combine:
- exact keyword matching,
- semantic similarity,
- metadata filtering,
- predictable latency,
- and production-level throughput.
At first, I tried to make Elasticsearch do it.
That decision taught me a lot.
Eventually, it also pushed me toward Weaviate.
This article is a practical breakdown of what went wrong, why hybrid search is harder than it looks, how Weaviate approaches the problem, and how I think about evaluating vector search quality in production.
The problem: keyword search was no longer enough
Traditional search engines are very good at lexical matching.
If a user searches for:
linux kernel tuning
BM25 can rank documents that contain terms like linux, kernel, and tuning very well.
But what happens when the user searches for:
how to reduce context switching overhead in a container runtime
The best document might not contain that exact sentence.
It might talk about:
- Linux namespaces
- cgroups
- scheduler behavior
- CPU pressure
- process isolation
- runtime overhead
A pure keyword engine may miss that document, or rank it lower, because the vocabulary is different.
This is where semantic search becomes useful.
Instead of only matching words, semantic search converts both documents and queries into vectors, usually called embeddings. These embeddings represent the meaning of the text in a high-dimensional space.
Documents with similar meaning end up close to each other in that vector space.
So now the search engine can understand that:
"improve API latency under load"
is related to:
"reduce p99 response time during high traffic"
even if the exact words are different.
But semantic search alone is also not perfect.
Sometimes exact terms matter a lot.
For example:
PostgreSQL jsonb index performance
In this query, the terms PostgreSQL, jsonb, and index are not optional. A purely semantic result about generic database performance may sound similar, but it is not good enough.
That is why hybrid search matters.
A strong search system should understand both:
- What the user literally typed
- What the user actually means
My first approach: forcing Elasticsearch to behave like a vector search engine
The first implementation was based on Elasticsearch.
The idea looked simple:
- Use BM25 for keyword scoring.
- Use vector similarity for semantic scoring.
- Combine both scores into one final score.
- Return the top results.
Conceptually, it made sense.
In practice, it became painful.
The challenge was score fusion.
BM25 scores and vector similarity scores are not naturally comparable.
BM25 might produce values like:
12.4
27.8
43.1
while cosine similarity might produce values like:
0.71
0.82
0.91
If you combine them naively, the ranking becomes unstable.
You cannot simply do:
final_score = bm25_score + cosine_similarity
because the scales are completely different.
You need normalization, weighting, ranking logic, and careful testing.
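To make the scale problem concrete, here is a minimal Go sketch that compares the naive sum with a min-max normalized, weighted sum. It reuses the example scores above and assumes that each position in the two slices refers to the same document. The `minMaxNormalize` helper and the `keywordWeight` value are hypothetical names chosen for illustration, not part of any library and not the scoring logic from my original system.

```go
package main

import "fmt"

// minMaxNormalize rescales scores into [0, 1]. If all scores are equal
// (or the slice is empty), it returns zeros to avoid dividing by zero.
func minMaxNormalize(scores []float64) []float64 {
	out := make([]float64, len(scores))
	if len(scores) == 0 {
		return out
	}
	lo, hi := scores[0], scores[0]
	for _, s := range scores {
		if s < lo {
			lo = s
		}
		if s > hi {
			hi = s
		}
	}
	if hi == lo {
		return out
	}
	for i, s := range scores {
		out[i] = (s - lo) / (hi - lo)
	}
	return out
}

func main() {
	// Example values from the text; index i is assumed to be the same document.
	bm25 := []float64{12.4, 27.8, 43.1}
	cosine := []float64{0.71, 0.82, 0.91}

	nb := minMaxNormalize(bm25)
	nc := minMaxNormalize(cosine)

	const keywordWeight = 0.5 // hypothetical weight, tune per use case
	for i := range bm25 {
		naive := bm25[i] + cosine[i] // dominated almost entirely by BM25
		fused := keywordWeight*nb[i] + (1-keywordWeight)*nc[i]
		fmt.Printf("doc %d: naive=%.2f normalized=%.3f\n", i, naive, fused)
	}
}
```

In the naive column the cosine score barely moves the result, which is exactly the instability described above. Min-max normalization is only one option, and it is sensitive to outliers in each result set.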
In my first version, I tried using custom scripting logic to combine lexical and vector scores at query time.
That was the mistake.
The production bottleneck
Custom scoring logic can look elegant in a prototype.
Under production traffic, it can become expensive very quickly.
The main issue was that vector math and score fusion were happening during query execution. That meant every search request had to do extra scoring work on top of the normal search pipeline.
As traffic increased, the symptoms became obvious:
- CPU usage started spiking.
- Tail latency became unstable.
- p95 and p99 response times became much worse.
- Scaling required more resources than expected.
- Search quality improvements were difficult to test safely.
The biggest lesson was not that Elasticsearch is bad.
The lesson was:
A system optimized for lexical search should not always be forced to become the core of a high-throughput vector search architecture.
Elasticsearch can support vector search in modern versions, and for many teams it may be enough. But in my case, the implementation became too complex and too expensive for the kind of hybrid search behavior we needed.
I wanted a system where hybrid search was not an afterthought.
That is when I started looking seriously at Weaviate.
Why Weaviate felt like the right tool
Weaviate is an open-source vector database written in Go.
That immediately caught my attention because I work a lot with Go, backend architecture, and performance-sensitive systems. But the language itself was not the main reason.
The real reason was the architecture.
Weaviate was designed around vector search from the beginning.
Instead of treating embeddings as an extra field attached to a traditional search engine, Weaviate treats vectors as a first-class part of the storage and search model.
At a high level, it gives you:
- vector search,
- BM25 keyword search,
- hybrid search,
- metadata filtering,
- schema-based data modeling,
- GraphQL and REST APIs,
- and production-oriented indexing options.
For my use case, the most important part was that Weaviate could combine semantic and keyword search natively.
No custom query-time score script.
No fragile manual scoring layer.
No forcing two different ranking systems together in application code.
Quick mental model: how vector search works
Before going deeper into Weaviate, it is useful to understand the basic idea behind vector search.
A machine learning model converts text into an embedding:
"Linux kernel performance tuning"
becomes something like:
[0.012, -0.431, 0.227, 0.091, ...]
That vector may have hundreds or thousands of dimensions depending on the embedding model.
The same thing happens to every document in your database.
When a user sends a query, the query is also converted into a vector. Then the search engine tries to find the nearest vectors.
The simplest way to think about this is distance.
For example, Euclidean distance between two vectors can be represented as:
d(p, q) = sqrt(sum((q_i - p_i)^2))
Another common approach is cosine similarity, which compares the angle between two vectors.
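Both measures fit in a few lines of Go. This is a minimal sketch using plain `[]float64` slices and assuming the two vectors have the same length; the function names are my own, and real vector databases use heavily optimized implementations instead.

```go
package main

import (
	"fmt"
	"math"
)

// euclideanDistance returns sqrt(sum((q_i - p_i)^2)).
// It assumes p and q have the same length.
func euclideanDistance(p, q []float64) float64 {
	var sum float64
	for i := range p {
		d := q[i] - p[i]
		sum += d * d
	}
	return math.Sqrt(sum)
}

// cosineSimilarity returns dot(p, q) / (|p| * |q|),
// or 0 if either vector has zero length.
func cosineSimilarity(p, q []float64) float64 {
	var dot, np, nq float64
	for i := range p {
		dot += p[i] * q[i]
		np += p[i] * p[i]
		nq += q[i] * q[i]
	}
	if np == 0 || nq == 0 {
		return 0
	}
	return dot / (math.Sqrt(np) * math.Sqrt(nq))
}

func main() {
	a := []float64{1, 0}
	b := []float64{2, 0} // same direction as a, different magnitude
	c := []float64{0, 3} // orthogonal to a

	fmt.Println("euclidean(a, b):", euclideanDistance(a, b))
	fmt.Println("cosine(a, b):   ", cosineSimilarity(a, b))
	fmt.Println("cosine(a, c):   ", cosineSimilarity(a, c))
}
```

Note how `a` and `b` are a full unit apart in Euclidean terms but identical in direction. Cosine similarity ignores magnitude, which is one reason it is popular for text embeddings.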
In real production systems, searching every vector one by one would be too slow at scale. If you have millions of documents, exact brute-force search is usually not practical.
That is why vector databases use Approximate Nearest Neighbor algorithms, usually called ANN.
ANN algorithms trade a small amount of recall (they may occasionally miss a true nearest neighbor) for a large improvement in speed.
One of the most popular ANN algorithms is HNSW.
HNSW: the graph behind fast vector search
Weaviate's default vector index is HNSW, which stands for Hierarchical Navigable Small World.
You can think of HNSW as a graph structure built on top of your vectors.
Instead of scanning every vector, the engine navigates through a graph to quickly move toward the nearest neighbors.
A simplified mental model:
- Similar vectors are connected.
- The graph has multiple layers.
- Higher layers allow faster long-distance jumps.
- Lower layers refine the search near the best candidates.
This makes search much faster than brute force while still returning high-quality results.
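The navigation idea can be sketched as a greedy walk over a single-layer neighbor graph. To be clear, this is not HNSW itself: real HNSW keeps several layers, tracks a candidate list rather than a single node, and builds the graph incrementally. The toy graph, the `greedySearch` function, and the one long-range link are all assumptions made for illustration.

```go
package main

import "fmt"

// sqDist returns the squared Euclidean distance between two equal-length vectors.
func sqDist(a, b []float64) float64 {
	var s float64
	for i := range a {
		d := a[i] - b[i]
		s += d * d
	}
	return s
}

// greedySearch walks a proximity graph from entry toward the query.
// At each step it moves to the neighbor closest to the query and stops
// when no neighbor is closer than the current node.
func greedySearch(vectors [][]float64, neighbors [][]int, entry int, query []float64) int {
	current := entry
	best := sqDist(vectors[current], query)
	for {
		next := current
		for _, n := range neighbors[current] {
			if d := sqDist(vectors[n], query); d < best {
				best, next = d, n
			}
		}
		if next == current {
			return current // local minimum: no neighbor is closer
		}
		current = next
	}
}

func main() {
	// Eight points on a line: vectors[i] = {i}, each linked to its direct neighbors.
	vectors := make([][]float64, 8)
	neighbors := make([][]int, 8)
	for i := range vectors {
		vectors[i] = []float64{float64(i)}
		if i > 0 {
			neighbors[i] = append(neighbors[i], i-1)
		}
		if i < len(vectors)-1 {
			neighbors[i] = append(neighbors[i], i+1)
		}
	}
	// One long-range link, standing in for the shortcuts that upper layers provide.
	neighbors[0] = append(neighbors[0], 4)
	neighbors[4] = append(neighbors[4], 0)

	fmt.Println("nearest node:", greedySearch(vectors, neighbors, 0, []float64{6.2}))
}
```

Starting at node 0, the walk jumps straight to node 4 through the long link and then refines step by step toward node 6. A greedy walk like this can get stuck in a local minimum on less friendly graphs, which is part of why real implementations keep multiple candidates.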
For large-scale semantic search, this matters a lot.
A vector database is not only about storing embeddings. The real value is in how efficiently it can index, traverse, filter, and rank them under load.
Hybrid search: why score fusion is harder than it looks
The hardest part of hybrid search is not running BM25.
It is not running vector search either.
The hard part is combining the results correctly.
BM25 and vector search produce different types of scores.
BM25 is based on term frequency, inverse document frequency, and field-length normalization.
Vector similarity is based on distance or similarity in embedding space.
These scores do not mean the same thing.
So instead of directly adding scores together, a better strategy is often to combine rankings.
This is where rank fusion becomes useful.
Reciprocal Rank Fusion: a better way to combine results
One common technique for hybrid ranking is Reciprocal Rank Fusion, usually shortened to RRF.
Instead of saying:
“This document has a BM25 score of 30 and a vector score of 0.88, so let’s add them.”
RRF says:
“Where did this document rank in the keyword result list, and where did it rank in the vector result list?”
Then it combines ranks.
A simplified formula looks like this:
RRF_score = 1 / (k + rank_keyword) + 1 / (k + rank_vector)
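Here rank_keyword and rank_vector are the document's 1-based positions in each result list, and k is a small smoothing constant (60 is a common choice). The exact implementation details can vary, but the idea is powerful: ranking position becomes more important than raw score scale.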
This avoids many of the problems caused by incompatible scoring systems.
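The formula translates almost directly into code. The following is a minimal sketch, assuming each input list holds unique document IDs ordered best first and that a document missing from a list simply contributes nothing for that list. The `rrf` function and the document IDs are made up for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// rrf fuses ranked lists of document IDs using Reciprocal Rank Fusion.
// Ranks are 1-based; a document missing from a list adds nothing for that list.
func rrf(k float64, lists ...[]string) []string {
	scores := map[string]float64{}
	for _, list := range lists {
		for i, id := range list {
			scores[id] += 1.0 / (k + float64(i+1))
		}
	}

	ids := make([]string, 0, len(scores))
	for id := range scores {
		ids = append(ids, id)
	}
	// Sort by fused score, highest first; break ties by ID for deterministic output.
	sort.Slice(ids, func(a, b int) bool {
		if scores[ids[a]] != scores[ids[b]] {
			return scores[ids[a]] > scores[ids[b]]
		}
		return ids[a] < ids[b]
	})
	return ids
}

func main() {
	keyword := []string{"doc-a", "doc-b", "doc-c"} // BM25 order
	vector := []string{"doc-c", "doc-a", "doc-d"}  // vector order

	fmt.Println(rrf(60, keyword, vector)) // [doc-a doc-c doc-b doc-d]
}
```

`doc-a` wins because it ranks near the top of both lists, even though it is first in only one of them. No raw BM25 or cosine value appears anywhere in the computation.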
In Weaviate, hybrid search also exposes an alpha parameter, which lets you control the balance between keyword and vector search.
Conceptually:
alpha = 0.0 -> keyword-focused search
alpha = 1.0 -> vector-focused search
alpha = 0.5 -> balanced hybrid search
That is extremely useful in real systems.
Different product areas may need different search behavior.
For example:
- Documentation search may need more semantic matching.
- SKU or product-code search may need stronger keyword matching.
- Support ticket search may need a balance of both.
- Log or error search may need exact matching for IDs and stack traces.
Being able to tune this without rewriting the whole scoring system is a big win.
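To show what that tuning does to a ranking, here is a conceptual sketch of an alpha-weighted rank fusion. It is not Weaviate's actual implementation. It only assumes that alpha scales the vector list's contribution and `1 - alpha` scales the keyword list's. The `fuse` function, the constant `k`, and the document IDs are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// fuse orders document IDs by an alpha-weighted reciprocal-rank score.
// alpha = 0 uses only the keyword ranking; alpha = 1 uses only the vector ranking.
func fuse(alpha float64, keyword, vector []string) []string {
	const k = 60.0 // smoothing constant, assumed for this sketch
	scores := map[string]float64{}
	for i, id := range keyword {
		scores[id] += (1 - alpha) / (k + float64(i+1))
	}
	for i, id := range vector {
		scores[id] += alpha / (k + float64(i+1))
	}

	ids := make([]string, 0, len(scores))
	for id := range scores {
		ids = append(ids, id)
	}
	// Highest score first; ties broken by ID for deterministic output.
	sort.Slice(ids, func(a, b int) bool {
		if scores[ids[a]] != scores[ids[b]] {
			return scores[ids[a]] > scores[ids[b]]
		}
		return ids[a] < ids[b]
	})
	return ids
}

func main() {
	keyword := []string{"sku-exact", "sku-similar", "guide"}
	vector := []string{"guide", "sku-exact", "sku-similar"}

	for _, alpha := range []float64{0, 0.5, 1} {
		fmt.Printf("alpha=%.1f -> %v\n", alpha, fuse(alpha, keyword, vector))
	}
}
```

With the same two input lists, moving alpha from 0 to 1 shifts the top result from the exact SKU match to the semantically closest guide. That is the behavior you want to be able to choose per product area.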
Example: hybrid search with the Go client
Here is a simplified example using the Weaviate Go client.
	package main

	import (
		"context"
		"fmt"
		"log"

		"github.com/weaviate/weaviate-go-client/v5/weaviate"
		"github.com/weaviate/weaviate-go-client/v5/weaviate/graphql"
	)

	func main() {
		// Connect to a local Weaviate instance.
		cfg := weaviate.Config{
			Host:   "localhost:8080",
			Scheme: "http",
		}

		client, err := weaviate.NewClient(cfg)
		if err != nil {
			log.Fatal(err)
		}

		ctx := context.Background()

		// Hybrid query over the Article class, returning the fused score per hit.
		result, err := client.GraphQL().Get().
			WithClassName("Article").
			WithFields(
				graphql.Field{Name: "title"},
				graphql.Field{Name: "summary"},
				graphql.Field{Name: "_additional", Fields: []graphql.Field{
					{Name: "score"},
				}},
			).
			WithHybrid(client.GraphQL().HybridArgumentBuilder().
				WithQuery("best practices for linux kernel tuning").
				WithAlpha(0.7),
			).
			WithLimit(10).
			Do(ctx)
		if err != nil {
			log.Fatal(err)
		}
		// GraphQL-level errors come back in the response body, not in err.
		if len(result.Errors) > 0 {
			log.Fatalf("graphql error: %s", result.Errors[0].Message)
		}

		fmt.Printf("%+v\n", result)
	}
In this example, alpha = 0.7 means the ranking weights vector similarity more heavily than keyword matching, but keyword matching still contributes to the final ranking.
That is exactly the kind of control I wanted.
Production benchmark: what changed after migration
In one internal test, I used a dataset of around 1.5 million text documents, including articles, technical notes, and internal documentation.
The goal was not to create a perfect academic benchmark. The goal was to compare the behavior of the previous implementation with the new architecture under similar load.
The difference was significant.
| System | Search Type | Fusion Strategy | Approx. p99 Latency | CPU Behavior Under Load |
|---|---|---|---|---|
| Elasticsearch | Hybrid | Custom query-time scoring | ~850 ms | Frequent CPU spikes |
| Weaviate | Hybrid | Native hybrid ranking | ~45 ms | Much more stable |
The exact numbers will depend on hardware, schema, vector dimensions, indexing configuration, filters, and query patterns.
But the important part was not only the latency improvement.
The bigger win was operational simplicity.
After the migration:
- ranking logic became easier to reason about,
- query latency became more predictable,
- CPU usage was more stable,
- scaling was easier,
- and experiments with different alpha values became much safer.
That matters a lot in production.
Performance is not only about the fastest possible average response time.
It is also about predictability.
A search system that is fast for most requests but unstable at p99 can still hurt the user experience badly.
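If you collect raw latency samples yourself, tail percentiles are easy to compute. This is a small sketch using the nearest-rank method on latencies in milliseconds. The `percentile` function and the synthetic samples are illustrative assumptions, not the measurements from the benchmark above, and monitoring systems often use histogram-based estimates instead.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// percentile returns the p-th percentile (0 < p <= 100) of the samples
// using the nearest-rank method. It returns 0 for an empty input.
func percentile(samples []float64, p float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	// Copy before sorting so the caller's slice is left untouched.
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)

	rank := int(math.Ceil(p * float64(len(sorted)) / 100))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

func main() {
	// 100 synthetic latencies in ms: 98 fast requests and 2 slow outliers.
	samples := make([]float64, 0, 100)
	for i := 0; i < 98; i++ {
		samples = append(samples, 20)
	}
	samples = append(samples, 900, 1200)

	fmt.Printf("p50=%.0fms p95=%.0fms p99=%.0fms\n",
		percentile(samples, 50), percentile(samples, 95), percentile(samples, 99))
}
```

In this synthetic data, p50 and p95 both look healthy at 20 ms, while p99 is 900 ms. An average would hide those outliers almost completely.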
The hidden challenge: evaluating search quality
Performance is only half of the story.
A fast search engine that returns bad results is still a bad search engine.
After moving to vector or hybrid search, one of the most important questions becomes:
How do we know the new search is actually better?
This gets especially important when changing embedding models.
For example, imagine moving from one embedding model to a newer one.
The new model may produce better vectors.
Or it may be worse for your specific domain.
You cannot rely only on intuition.
You need evaluation.
Metric 1: Mean Reciprocal Rank
Mean Reciprocal Rank, or MRR, measures how early the first relevant result appears.
If the correct result is ranked first, the reciprocal rank is:
1 / 1 = 1.0
If it appears in position 5:
1 / 5 = 0.2
Then you average this across many test queries.
MRR is useful when the user usually needs one best answer.
For example:
"How do I configure cgroups v2 memory limits?"
If the best document is result number one, the search is doing well.
If the best document is buried on page three, the user experience is poor.
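A minimal MRR implementation might look like the sketch below. It assumes each evaluation case is a ranked list of result IDs plus a set of IDs judged relevant for that query. The `evalCase` type and the sample data are hypothetical.

```go
package main

import "fmt"

// evalCase pairs one query's ranked results with its set of relevant IDs.
type evalCase struct {
	results  []string
	relevant map[string]bool
}

// reciprocalRank returns 1/rank of the first relevant result (1-based),
// or 0 if none of the returned results is relevant.
func reciprocalRank(results []string, relevant map[string]bool) float64 {
	for i, id := range results {
		if relevant[id] {
			return 1.0 / float64(i+1)
		}
	}
	return 0
}

// meanReciprocalRank averages reciprocalRank over all cases.
func meanReciprocalRank(cases []evalCase) float64 {
	if len(cases) == 0 {
		return 0
	}
	var sum float64
	for _, c := range cases {
		sum += reciprocalRank(c.results, c.relevant)
	}
	return sum / float64(len(cases))
}

func main() {
	cases := []evalCase{
		// Relevant result at rank 1: reciprocal rank 1.0.
		{results: []string{"a", "b", "c"}, relevant: map[string]bool{"a": true}},
		// Relevant result at rank 5: reciprocal rank 0.2.
		{results: []string{"x", "y", "z", "w", "q"}, relevant: map[string]bool{"q": true}},
	}
	fmt.Printf("MRR = %.2f\n", meanReciprocalRank(cases)) // MRR = 0.60
}
```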
Metric 2: Recall@K
Recall@K answers a different question:
Out of all relevant documents, how many did we return in the top K results?
For example, Recall@10 is the fraction of all relevant documents that appear in the first 10 results.
This is useful when there may be multiple correct results.
For documentation, research, support tickets, or internal knowledge bases, Recall@K can be more useful than only checking the top result.
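As a sketch, Recall@K is a short function over the same kind of inputs: a ranked list of result IDs and a set of relevant IDs. It assumes result IDs are unique and that every entry in the `relevant` map is `true`. The function name and sample data are my own.

```go
package main

import "fmt"

// recallAtK returns the fraction of relevant documents that appear in the
// first k results. It returns 0 when there are no relevant documents or k <= 0.
func recallAtK(results []string, relevant map[string]bool, k int) float64 {
	if len(relevant) == 0 || k <= 0 {
		return 0
	}
	if k > len(results) {
		k = len(results)
	}
	hits := 0
	for _, id := range results[:k] {
		if relevant[id] {
			hits++
		}
	}
	return float64(hits) / float64(len(relevant))
}

func main() {
	relevant := map[string]bool{"a": true, "b": true, "c": true, "d": true}
	results := []string{"a", "x", "b", "y", "z"}

	// Two of the four relevant documents appear in the top 3.
	fmt.Println("Recall@3:", recallAtK(results, relevant, 3))
}
```

Note that the denominator is the total number of relevant documents, not K. If a query has more relevant documents than K, a perfect score is impossible by construction, which is worth keeping in mind when comparing queries.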
Metric 3: NDCG
NDCG stands for Normalized Discounted Cumulative Gain.
It is useful when relevance is not binary.
A result can be:
- perfect,
- good,
- somewhat related,
- or irrelevant.
NDCG rewards highly relevant documents appearing near the top of the result list.
This is closer to how real users experience search.
A search result ranked #1 matters more than a result ranked #9.
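Here is a compact sketch of NDCG@K. It assumes a hypothetical grading scale (3 = perfect, 2 = good, 1 = somewhat related, 0 = irrelevant) and uses the `gain / log2(position + 1)` form of DCG. As a simplification, the ideal ordering is built only from the grades of the returned results; a stricter evaluation would use all judged documents for the query.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// dcg computes discounted cumulative gain over the first k graded results,
// using grade / log2(position + 1) with 1-based positions.
func dcg(grades []float64, k int) float64 {
	if k > len(grades) {
		k = len(grades)
	}
	var sum float64
	for i := 0; i < k; i++ {
		sum += grades[i] / math.Log2(float64(i+2))
	}
	return sum
}

// ndcg divides the DCG of the actual ordering by the DCG of the ideal
// ordering (the same grades sorted from highest to lowest).
func ndcg(grades []float64, k int) float64 {
	ideal := append([]float64(nil), grades...)
	sort.Sort(sort.Reverse(sort.Float64Slice(ideal)))

	best := dcg(ideal, k)
	if best == 0 {
		return 0 // no relevant results at all
	}
	return dcg(grades, k) / best
}

func main() {
	// Grades of the returned results, in the order the engine ranked them.
	// Assumed scale: 3 = perfect, 2 = good, 1 = somewhat related, 0 = irrelevant.
	ranked := []float64{2, 3, 0, 1}

	fmt.Printf("NDCG@4 = %.3f\n", ndcg(ranked, 4))
}
```

A perfectly ordered list scores 1.0. Swapping a "perfect" result below a "good" one, as in this example, lowers the score, and the penalty is largest near the top of the list.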
Metric 4: visual inspection with UMAP or t-SNE
Metrics are important, but visual inspection can also help.
One useful technique is to export a sample of your embeddings and reduce them to two dimensions using UMAP or t-SNE.
Then you can plot them and inspect whether similar documents form clear clusters.
For example, if you are indexing technical articles, you might expect clusters like:
- Linux kernel
- PostgreSQL
- Kubernetes
- Go concurrency
- distributed systems
- observability
If everything is mixed together randomly, your embedding model may not be representing your domain well.
This does not replace proper evaluation, but it can reveal issues quickly.
Practical lessons from the migration
Here are the biggest lessons I took from this project.
1. Do not treat vector search as just another field type
Embeddings change the architecture of search.
They are not just another column in your database.
You need to think about indexing, memory, distance metrics, filtering, ranking, and evaluation.
2. Hybrid search is usually better than pure semantic search
Pure semantic search can be impressive in demos, but production users often search with exact terms.
They search for:
- product codes,
- error messages,
- function names,
- ticket IDs,
- file names,
- database fields,
- framework names.
A good system should not ignore those signals.
3. Score fusion deserves serious attention
Combining BM25 and vector similarity incorrectly can make search worse.
Rank fusion techniques are often safer than manually combining raw scores.
4. p99 latency matters more than a beautiful demo
A search demo with 100 documents is easy.
A production system with millions of documents, filters, concurrent users, and unpredictable queries is different.
Always test tail latency.
5. Measure quality before changing embedding models
A newer model is not automatically better for your data.
Build a small evaluation dataset with real queries and expected results.
Even 50 to 100 carefully selected queries can reveal a lot.
When I would still use Elasticsearch
This migration does not mean I would never use Elasticsearch again.
I would still choose Elasticsearch for many use cases:
- log search,
- observability,
- analytics-heavy search,
- faceted search,
- complex aggregations,
- mature operational environments already built around Elastic.
Elasticsearch is still a great tool.
The point is not “Elasticsearch bad, Weaviate good.”
The point is:
Choose the system that matches the shape of your problem.
For my specific case, the problem was high-throughput hybrid semantic search.
For that, Weaviate was a better fit.
Final thoughts
The biggest engineering mistake is often not choosing a bad tool.
It is choosing a good tool for the wrong job.
Elasticsearch is a powerful search engine, but when I needed production-grade hybrid search with strong vector search behavior, custom score fusion became too expensive and too hard to maintain.
Weaviate gave me a cleaner architecture:
- native vector search,
- BM25 support,
- hybrid ranking,
- HNSW indexing,
- metadata filtering,
- and a much simpler path to experimentation.
The migration improved latency, reduced operational complexity, and made ranking behavior easier to tune.
But the most important lesson was deeper:
Search quality is a product feature, not just an infrastructure feature.
You need to measure it, tune it, and treat it as part of the user experience.
If you are building search for technical documentation, e-commerce, internal knowledge bases, support systems, or AI-powered products, hybrid search is probably worth serious consideration.
And if you are currently fighting custom score scripts, unstable p99 latency, and hard-to-debug ranking behavior, it may be time to ask whether your search engine is doing the job it was originally designed to do.
Discussion
Have you implemented hybrid search in production?
I would be interested to hear:
- whether you used Elasticsearch, OpenSearch, Weaviate, Qdrant, Pinecone, or another system,
- how you handled score fusion,
- and how you measured search quality beyond just latency.