Where we left off
In our last post, Full Text Search in the Age of MCP, we saw how full-text search struggles with the long, natural-language queries typical of MCP systems. PostgreSQL’s `plainto_tsquery` was a good start, but it broke down when faced with verbose inputs. We combined techniques—trigram matching, semantic search with embeddings, and hybrid strategies—to restore relevance. That gave us a system that worked, but it left an important question open: how do we know it really works?
This post is about building the tools to answer that question.
Datasets
If you want to evaluate a search system, you need test data: pairs of questions and answers. The answers are not a single “correct” string, but rather the set of documents or chunks from your corpus that a good search should return.
Without a dataset, we’re just guessing. With one, we can measure, compare, and improve.
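As a rough sketch, such a dataset can be as simple as a list of questions paired with the chunk IDs a good search should return. The shape below (including the field names `question` and `relevant_chunk_ids`) is illustrative, not Exmeralda’s actual schema:

```elixir
# A minimal, illustrative dataset: each entry pairs a question with the
# chunk IDs a good search is expected to return. Field names are made up
# for this sketch.
dataset = [
  %{
    question: "How do I configure Exmeralda to use semantic search?",
    relevant_chunk_ids: ["chunk-112", "chunk-113"]
  },
  %{
    question: "Why does plainto_tsquery return no results for long inputs?",
    relevant_chunk_ids: ["chunk-207"]
  }
]
```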
So, what is a good question?
A good question is one that represents the real use cases of your system:
- It should be expressed in natural language, not just keywords.
- It should be answerable from your corpus.
- It should be specific enough that you can judge relevance.
Three examples
- “How do I configure Exmeralda to use semantic search?”
  - Relevant chunks: code/config docs about semantic search integration.
- “What’s the difference between `plainto_tsquery` and `phraseto_tsquery`?”
  - Relevant chunks: docs on PostgreSQL full-text search functions.
- “Why does `ts_rank` sometimes give low scores for good results?”
  - Relevant chunks: ranking explanation, weighting examples.
These questions are realistic, corpus-anchored, and evaluable.
Classic Evaluation Metrics
Information retrieval research has a long tradition of metrics. The most common are:
- Precision: of the returned results, how many were relevant?
- Recall: of all relevant results, how many did we return?
- nDCG (Normalized Discounted Cumulative Gain): do we rank the most relevant results near the top?
For now, we’ll stick with the basics: precision and recall.
The Easy One: Precision
Precision measures correctness. If we return 10 documents and 8 are relevant, precision is 0.8.
Evaluating precision requires two things:
- A question.
- A set of relevant chunks (the “ground truth”).
This is relatively straightforward: you can usually judge whether a chunk is relevant to a question.
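Assuming both the search results and the ground truth are plain lists of chunk IDs, a minimal precision helper is just the size of the intersection divided by the number of returned results:

```elixir
defmodule Eval do
  @doc "Fraction of returned chunk IDs that appear in the ground truth."
  def precision(returned_ids, relevant_ids) do
    returned = MapSet.new(returned_ids)
    relevant = MapSet.new(relevant_ids)

    case MapSet.size(returned) do
      0 -> 0.0
      n -> MapSet.size(MapSet.intersection(returned, relevant)) / n
    end
  end
end

# 8 of the 10 returned chunks are relevant -> 0.8
Eval.precision(Enum.to_list(1..10), Enum.to_list(3..12))
```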
Fixing Fulltext Search
Before:
We have seen that classical fulltext search is most powerful with a few significant keywords, and keyword extraction is comparatively simple for large language models. But what about questions like these:
- “What’s the role of `ts_rank` normalization in long queries?”
- “How do trigram matches help when fulltext breaks?”
- “Why does `plainto_tsquery` return no results for long inputs?”
These aren’t “keyword” queries. They’re natural language prompts, and that’s where fulltext search alone struggles.
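To make that failure mode concrete, here is a hedged Ecto sketch (the `Chunk` schema, its `body` column, and `Repo` are assumptions for illustration). Because `plainto_tsquery` joins every lexeme with AND, a verbose question has to match in full, and usually nothing qualifies:

```elixir
import Ecto.Query

question = "Why does plainto_tsquery return no results for long inputs?"

# plainto_tsquery ANDs all lexemes together, so every word of the
# question must appear in the same chunk -- long prompts rarely match.
from(c in Chunk,
  where:
    fragment(
      "to_tsvector('english', ?) @@ plainto_tsquery('english', ?)",
      c.body,
      ^question
    ),
  select: c.id
)
|> Repo.all()
```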
After:
With dataset-driven evaluation, we can see exactly how classical FTS fails—and how combining it with trigram and semantic search improves precision. Instead of guessing, we can measure.
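As a sketch of what “measuring instead of guessing” can look like, the dataset and `Eval.precision/2` from above can be combined with a hypothetical `Search.run/1` that returns chunk IDs for one query:

```elixir
# Hypothetical: Search.run/1 returns the chunk IDs found for one query.
mean_precision =
  dataset
  |> Enum.map(fn %{question: q, relevant_chunk_ids: relevant} ->
    Eval.precision(Search.run(q), relevant)
  end)
  |> then(fn scores -> Enum.sum(scores) / length(scores) end)
```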
The Difficult One: Recall
The Problem with the Ground Truth
It’s easy to tell whether a list of sources is relevant to a certain question. It’s much harder to determine whether a search has found all the sources that might be relevant. Recall requires knowing the full set of relevant documents, and in practice, that set is debatable.
A quick fix
Assume our combined search (fulltext + trigram + semantic) has reasonably high recall. By lowering thresholds and expanding hit counts, we can approximate a recall close to 1. Then, we use precision-based filtering to prune irrelevant chunks.
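A sketch of that approach, with a hypothetical `Search.run/2` that accepts `:threshold` and `:limit` options and a hypothetical `relevant?/2` judgement (human or LLM-assisted):

```elixir
# A deliberately wide pass approximates "all relevant chunks" for one question.
candidate_pool = Search.run(question, threshold: 0.1, limit: 200)

# Keep only candidates judged relevant (manually or via an LLM) --
# this pruned pool becomes the ground truth for recall.
ground_truth = Enum.filter(candidate_pool, &relevant?(&1, question))

# Recall of the normal, stricter configuration against that pool.
returned = MapSet.new(Search.run(question))

MapSet.size(MapSet.intersection(returned, MapSet.new(ground_truth))) /
  max(length(ground_truth), 1)
```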
The Good Questions, Slight Return
With a set of good questions and near-complete candidate results, we can now rank and filter. This gives us a practical way to approximate recall without infinite labeling.
Bootstrapping
We started with a broken search system. By iterating, combining methods, and evaluating precision/recall, we ended up with:
- A better search engine.
- A labeled dataset of questions and relevant chunks.
- A foundation for future improvements (new embedding models, reranking, hybrid scoring).
Now, instead of hunches, we have a feedback loop.
Conclusion
Search isn’t finished when it works. It’s finished when you can prove it works—and refine it when it doesn’t.
By generating datasets, asking the right questions, and measuring with precision and recall, we turn search from a black box into an iterative, improvable system.
The Age of Refinement is about this loop: evaluate, improve, repeat.
Elixir is an excellent choice for AI applications due to its scalability, fault tolerance, and concurrency model. Its lightweight processes and message-passing architecture make it ideal for orchestrating complex AI workflows efficiently. bitcrowd's first Elixir ML project dates back to 2020, and we have since enabled various clients to build and scale their AI projects.
bitcrowd is an excellent choice if you need a scalable RAG system or a fully integrated AI pipeline. We help you build, optimize, and maintain it with a focus on reliability and performance.
Drop us a line via email if you want to build your next AI project with Elixir. Or book a call with us to discuss your project.