How Text Search is Broken
When we released Exmeralda in May 2025, we made a simple yet interesting implementation mistake. This blog post is the log of what happened when "I just went to fix that little error in the text search".
Here is what our code looked like:
generation =
  Generation.new(query, opts)
  |> Embedding.generate_embedding(embedding_provider())
  |> Retrieval.retrieve(:fulltext_results, &query_fulltext(&1, scope))
  |> Retrieval.retrieve(:semantic_results, &query_with_pgvector(&1, scope))
  |> Retrieval.reciprocal_rank_fusion(@retrieval_weights, :rrf_result)
  |> Retrieval.deduplicate(:rrf_result, [:id])
We wanted to combine the strengths of full-text and semantic search, and were using reciprocal rank fusion to merge the two result lists. In the two-week sprint leading up to the launch, we did not check the full-text results. When we finally did, we were quite surprised.
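Reciprocal rank fusion itself is pleasantly simple: every result list contributes a score of weight / (k + rank) per document, and the fused list is sorted by the summed scores. A minimal sketch of the idea (not the actual Retrieval module; the weights and k = 60 are illustrative defaults):
# Sketch of reciprocal rank fusion - not the actual Retrieval module.
# Each list contributes weight / (k + rank) per document id; documents
# are then sorted by their summed score. k = 60 is a common default.
defmodule RRFSketch do
  def fuse(result_lists, weights, k \\ 60) do
    result_lists
    |> Enum.zip(weights)
    |> Enum.flat_map(fn {results, weight} ->
      results
      |> Enum.with_index(1)
      |> Enum.map(fn {doc_id, rank} -> {doc_id, weight / (k + rank)} end)
    end)
    |> Enum.group_by(fn {doc_id, _score} -> doc_id end, fn {_doc_id, score} -> score end)
    |> Enum.map(fn {doc_id, scores} -> {doc_id, Enum.sum(scores)} end)
    |> Enum.sort_by(fn {_doc_id, score} -> score end, :desc)
  end
end

# RRFSketch.fuse([["a", "b"], ["b", "c"]], [1.0, 1.0])
# -> [{"b", 0.0325...}, {"a", 0.0164...}, {"c", 0.0161...}]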
We tried a typical search phrase against req:
What is the purpose of Req.Test.OwnershipError? There does not seem to be any documentation associated with it.
And the results were all marked false for a match:
[
["https://hexdocs.pm/req/Req.ChecksumMismatchError.html", false],
["https://hexdocs.pm/req/Req.DecompressError.html", false],
["https://hexdocs.pm/req/Req.ArchiveError.html", false]
]
What was going on? Our search code looked quite innocent:
# the original fts implementation
def get_fulltext_results(library_name, query, limit \\ 3) do
  Repo.all(
    from(c in Chunk,
      where: c.library == ^library_name,
      order_by: [desc: fragment("search @@ plainto_tsquery(?)", ^query)],
      select: %{
        id: c.id,
        rank: fragment("search @@ plainto_tsquery(?)", ^query),
        source: c.source,
        content: c.content
      },
      limit: ^limit
    )
  )
end
This is because the typical tsvector helper functions, like plainto_tsquery, join all input words with an AND operator. While very fast, this does not work well for typical search phrases: a simple query like "What is req?" is compiled to 'what' & 'is' & 'req'.
We see that what and is are not removed and are required for a match through the AND operator. As there are no matches and all results are false, the order of records (and thus the result) depends on the random order of chunks in the database. If you re-evaluate the ingestion in the cell above, you will notice how the list of retrieved documents changes.
A more complex query
What is the purpose of Req.Test.OwnershipError? There does not seem to be any documentation associated with it.
is broken down to
["'what' & 'is' & 'the' & 'purpose' & 'of' & 'req.test.ownershiperror' & 'there' & 'does' & 'not' & 'seem' & 'to' & 'be' & 'any' & 'documentation' & 'associated' & 'with' & 'it'"]
On the upside, plainto_tsquery removes all special characters for us:
evil_query = """
"SELECT plainto_tsquery($1)::text;", ["-- (SMPOL)
DROP sampletable;-- DROP/*comment*/\\"sampletable SELECT CASE WHEN (1=1) THEN 'A' ELSE 'B' END;"]
\\"
'The Fat & Rats:C'
"""
{:ok, result} = Exmeralda.Repo.query("SELECT plainto_tsquery($1)::text;", [evil_query])
Enum.at(result.rows, 0)
This is compiled to:
SELECT plainto_tsquery($1)::text; ["\"SELECT plainto_tsquery($1)::text;\", [\"-- (SMPOL)\nDROP sampletable;-- DROP/*comment*/\\\"sampletable SELECT CASE WHEN (1=1) THEN 'A' ELSE 'B' END;\"]\n\\\" \n'The Fat & Rats:C' \n"]
Or, more readable:
["'select' & 'plainto' & 'tsquery' & '1' & 'text' & 'smpol' & 'drop' & 'sampletable' & 'drop' & 'comment' & 'sampletable' & 'select' & 'case' & 'when' & '1' & '1' & 'then' & 'a' & 'else' & 'b' & 'end' & 'the' & 'fat' & 'rats' & 'c'"]
The tsquery language option
We can remove many unwanted words by using the 'english' option:
query = "What is the purpose of Req.Test.OwnershipError? There does not seem to be any documentation associated with it."
{:ok, result} = Exmeralda.Repo.query("SELECT plainto_tsquery('english',$1)::text;", [query])
-> ["'purpos' & 'req.test.ownershiperror' & 'seem' & 'document' & 'associ'"]
The function also converts the words we enter into lexemes, which means that e.g. purpose and purposes are both mapped to purpos, and associated becomes associ (as would association).
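You can verify the stemming directly with the same Repo.query trick as above (a quick sketch; the word list is arbitrary):
{:ok, result} =
  Exmeralda.Repo.query(
    "SELECT to_tsvector('english', $1)::text;",
    ["purpose purposes associated association"]
  )
Enum.at(result.rows, 0)
-> ["'associ':3,4 'purpos':1,2"]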
However, this is often not what we want either. If we take a simple query like "What is req?", only req survives, and the intention of the question is gone:
{:ok, result} = Exmeralda.Repo.query("SELECT plainto_tsquery('english', $1)::text;", ["what is req?"])
-> ["'req'"]
But before we dive into that, let's make sure we only get true results back:
Oh postgres, why can't you be true?
We changed the query so that only matches the @@ operator considers "true" are returned:
# update_one: only allow matches via the @@ operator:
def get_fulltext_results_update_one(library_name, search_term, limit \\ 3) do
  Repo.all(
    from(c in Chunk,
      cross_join: query in fragment("plainto_tsquery(?)", ^search_term),
      where: fragment("f1 @@ search"), # <--- here
      where: c.library == ^library_name,
      order_by: [desc: fragment("ts_rank(search, f1)")],
      select: %{
        id: c.id,
        source: c.source,
        content: c.content,
        rank: fragment("ts_rank(search, f1) as rank")
      },
      limit: ^limit
    )
  )
end
As expected, we now get an empty result:
query = """
What is the purpose of Req.Test.OwnershipError? There does not seem to be any documentation associated with it.
"""
IO.puts "No 'english':"
fulltext_boolean_matches = Exmeralda.Topics.Rag.get_fulltext_results_update_one("req", query, 3)
-> No 'english':
[] # <-- see? empty!
With the matches solved, let's explore the results when we use the 'english' option of plainto_tsquery:
# update_two: using plainto_tsquery('english', ?)
def get_fulltext_results_update_two(library_name, search_term, limit \\ 3) do
  Repo.all(
    from(c in Chunk,
      cross_join: query in fragment("plainto_tsquery('english', ?)", ^search_term),
      where: fragment("f1 @@ search"), # boolean match filter, as in update_one
      where: c.library == ^library_name,
      order_by: [desc: fragment("ts_rank(search, f1, 32)")],
      select: %{
        id: c.id,
        source: c.source,
        content: c.content,
        rank: fragment("ts_rank(search, f1) as rank")
      },
      limit: ^limit
    )
  )
end
Let's see how much better our search has become through the 'english' option:
IO.puts "With 'english':"
query = """
What is the purpose of Req.Test.OwnershipError? There does not seem to be any documentation associated with it.
"""
fulltext_boolean_matches = Exmeralda.Topics.Rag.get_fulltext_results_update_two("req", query, 10)
-> With 'english':
[] # <-- see? still empty!
Wow, still no luck. As explained above, documents would have to fulfill a lot of AND-ed criteria. But what about a shorter query?
What is req?
iex> Exmeralda.Topics.Rag.get_fulltext_results_update_two("req", "What is req?", 50)
|> Enum.frequencies_by(fn chunk -> [chunk.source] end)
-> %{
["CHANGELOG.md"] => 4,
["README.md"] => 2,
["Req.Request.html"] => 5,
["Req.Steps.html"] => 6,
["Req.Test.html"] => 3,
["Req.html"] => 7,
["changelog.html"] => 5,
["lib/req.ex"] => 5,
["lib/req/finch.ex"] => 2,
["lib/req/request.ex"] => 4,
["lib/req/steps.ex"] => 3,
["lib/req/test.ex"] => 2,
["readme.html"] => 2
...
}
Wishful thinking
So far, the results are disappointing: we still get no matches for our long query:
"What is the purpose of Req.Test.OwnershipError? There does not seem to be any documentation associated with it."
And we get far too many (and random) results for the short one:
"What is req?"
This is understandable, as the PostgreSQL full-text search functions are meant to match short search terms, e.g. to filter articles containing the phrase "Quantum Computing".
Can we get more sensible results with ts_rank?
Intro: ts_rank
Instead of doing a boolean search with @@ and plainto_tsquery, we can use ts_rank. ts_rank counts how often a lexeme appears in a text. As our chunks are all in a similar length range, we give no penalty for long documents (see this part of the postgres documentation for details).
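To get a feel for the normalization flags, we can ask PostgreSQL directly. A quick sketch (the sample document is made up): flag 0 ignores the document length, 2 divides the rank by the document length, and 32 rescales the rank into the range 0..1.
{:ok, result} =
  Exmeralda.Repo.query(
    """
    SELECT ts_rank(to_tsvector('english', $1), plainto_tsquery('english', $2), 0),
           ts_rank(to_tsvector('english', $1), plainto_tsquery('english', $2), 2),
           ts_rank(to_tsvector('english', $1), plainto_tsquery('english', $2), 32);
    """,
    ["Req is a batteries-included HTTP client for Elixir", "http client"]
  )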
# update_three: dropping the @@ operator
def get_fulltext_results_update_three(library_name, search_term, limit \\ 3) do
  Repo.all(
    from(c in Chunk,
      cross_join: query in fragment("plainto_tsquery('english', ?)", ^search_term),
      where: c.library == ^library_name,
      order_by: [desc: fragment("ts_rank(search, f1, 32)")], # <-- look here!
      select: %{
        id: c.id,
        source: c.source,
        content: c.content,
        rank: fragment("ts_rank(search, f1) as rank") # <-- and here !!
      },
      limit: ^limit
    )
  )
end
Is it getting better?
iex> Exmeralda.Topics.Rag.get_fulltext_results_update_three("req", query, 7)
-> [
["Req.Test.OwnershipError.html", 0.17241628468036652],
["lib/req/test/ownership.ex", 0.0030119779985398054],
["LICENSE.md", 0.0021660723723471165],
["LICENSE.md", 1.3680694621598377e-7],
["Req.Request.html", 9.999999682655225e-21],
["Req.Request.html", 9.999999682655225e-21],
["Req.DecompressError.html", 9.999999682655225e-21]
]
Much better. There is still some noise, but we get workable results. The values are very small, but at least two of the top four chunks are relevant:
["Req.Test.OwnershipError.html", 0.17241628468036652],
["lib/req/test/ownership.ex",0.0030119779985398054],
["LICENSE.md", 0.0021586192306131124]
["Req.DecompressError.html", 9.999999682655225e-21]
Let's re-introduce a cut-off, maybe at 0.003? It's also time to clean up our query a bit:
# Update_four: refactoring and introduction of min-rank
def get_fulltext_results_update_four(library_name, search_term, limit \\ 3) do
  min_rank = 0.003

  ranked_subquery =
    from(c in Chunk,
      select: %{
        id: c.id,
        source: c.source,
        content: c.content,
        rank: fragment("ts_rank(search, plainto_tsquery('english', ?))", ^search_term)
      },
      where: c.library == ^library_name
    )

  Repo.all(
    from(r in subquery(ranked_subquery),
      select: %{
        id: r.id,
        source: r.source,
        rank: r.rank,
        content: r.content
      },
      where: r.rank > ^min_rank,
      order_by: [desc: r.rank],
      limit: ^limit
    )
  )
end
Here we go:
iex> Exmeralda.Topics.Rag.get_fulltext_results_update_four("req", query, 7)
-> [
["Req.Test.OwnershipError.html", 0.17241628468036652],
["lib/req/test/ownership.ex", 0.0030119779985398054]
]
Neat. This is how we want it. It is important that all searches have a cut-off. Eliminating bad matches allows the use of simple strategies like reciprocal_rank_fusion without further cleanup down the line.
But wait - what about partial matches?
query_for_ownership_error = "OwnershipError"
fulltext_boolean_matches = Exmeralda.Topics.Rag.get_fulltext_results_2f5cbfa_3("req", query_for_ownership_error, 7)
Enum.map(fulltext_boolean_matches, fn chunk -> [chunk.source, chunk.rank] end)
-> []
Buhuuu...
No partial matches
tsvector is not magical when it comes to matching OwnershipError against Req.Test.OwnershipError. This, of course, is not ideal.
Enter Trigrams: pg_trgm
The PostgreSQL extension pg_trgm offers us partial matches. To use it, we need to activate it.
The setup
If you are not running the database connection as a superuser (and you really should not), you might need to add this manually:
psql> CREATE EXTENSION IF NOT EXISTS pg_trgm;
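A quick sanity check that the extension is active (a sketch): similarity/2 compares the trigram sets of the two strings, show_trgm/1 lists the trigrams themselves.
{:ok, result} =
  Exmeralda.Repo.query(
    "SELECT similarity($1, $2), show_trgm($2);",
    ["Req.Test.OwnershipError", "OwnershipError"]
  )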
Now we can add a GIN index with the gin_trgm_ops operator class to the chunks table in a migration:
defmodule Exmeralda.Repo.Migrations.AddPgTrgm do
  use Ecto.Migration

  def up do
    execute("""
    CREATE INDEX IF NOT EXISTS chunks_content_trgm_idx
    ON chunks
    USING GIN (content gin_trgm_ops);
    """)
  end

  def down do
    execute "DROP INDEX IF EXISTS chunks_content_trgm_idx;"
  end
end
Our function looks very similar to get_fulltext_results_update_four:
def get_trgm_results(library_name, query_string, limit \\ 3) do
  min_score = 0.065

  ranked_subquery =
    from(c in Chunk,
      select: %{
        id: c.id,
        source: c.source,
        content: c.content,
        score: fragment("similarity(?, ?)", c.content, ^query_string)
      },
      where: c.library == ^library_name
    )

  Repo.all(
    from(r in subquery(ranked_subquery),
      select: %{
        id: r.id,
        source: r.source,
        score: r.score,
        content: r.content
      },
      where: r.score > ^min_score,
      order_by: [desc: r.score],
      limit: ^limit
    )
  )
end
The first trgm Query
Let's try the one-word query that failed with tsvector again, this time with trgm. We had 4 matches for Test.OwnershipError, so let's see what the leaderboard looks like over here:
tgrm_matches = Exmeralda.Topics.Rag.get_trgm_results("req", "OwnershipError", 8)
-> [
["lib/req/test/ownership.ex", 0.09677419066429138],
["lib/req/test/ownership_error.ex", 0.07389162480831146],
["lib/req/application.ex", 0.0662650614976883],
["changelog.html", 0.04237288236618042]
]
Great! We now get the partial match we are looking for! But why do we find changelog.html and lib/req/application.ex in the search results?
01 # lib/req/application.ex
02
03 defmodule Req.Application do
...
09 def start(_type, _args) do
10 children = [
11 {Finch,
12 name: Req.Finch,
13 pools: %{
14 default: Req.Finch.pool_options(%{})
15 }},
16 {DynamicSupervisor, strategy: :one_for_one, name: Req.FinchSupervisor},
17 {Req.Test.Ownership, name: Req.Test.Ownership}
18 ]
...
22 end
Ah. The Req.Test.Ownership in line 17. Looks like trgm is finding occurrences that tsvector is missing. But what about changelog.html?
"</span>
<span class=\"p\" data-group-id=\"3838964925-12\">
{</span><span class=\"ss\">:error</span><span class=\"p\">,
</span><span class=\"w\"> </span><span class=\"n\">reason</span><
span class=\"p\" data-group-id=\"3838964925-12\">}</span>
<span class=\"w\"> </span><span class=\"o\">-></span>
<span class=\"w\">\n</span><span class=\"p\" data-group-id=\"3838964925-13\">
{</span><span class=\"n\">request</span><span class=\"p\">,</span><span class=\"w\"> </span><span class=\"nc\">RuntimeError</span><span class=\"o\">.</span>
<span class=\"n\">exception</span><span class=\"p\" data-group-id=\"3838964925-14\">
(</span><span class=\"n\">inspect</span><span class=\"p\" data-group-id=\"3838964925-15\">
(</span><span class=\"n\">reason</span><span class=\"p\" data-group-id=\"3838964925-15\">)
</span><span class=\"p\" data-group-id=\"3838964925-14\">)</span><span class=\"p\" data-group-id=\"3838964925-13\">}</span><span class=\"w\">\n </span><span class=\"k\" data-group-id=\"3838964925-5\">end</span><span class=\"w\">\n</span><span class=\"k\" data-group-id=\"3838964925-1\">end</span><span class=\"w\">"],
Is pg_trgm confused by the HTML? Let's find out! We ingest the same documents again, but convert HTML files to markdown:
:ok = Exmeralda.Topics.Rag.ingest_from_hex("req", "0.5.10",
[chunk_size: 6000,
chunk_overlap: 0,
convert_html_to_md: true])
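Under the hood, the point of the conversion is to strip away the <span> soup we saw above before chunking, so that the trigrams are computed on prose rather than on markup. A minimal sketch of the idea, assuming the Floki HTML parser is available (the actual convert_html_to_md option may well use a dedicated HTML-to-Markdown converter):
# Sketch only: reduce HTML to its text content before chunking.
# Assumes Floki is available; a real HTML-to-Markdown conversion would
# also preserve headings, links and code blocks.
def html_to_text(html) do
  {:ok, document} = Floki.parse_document(html)
  Floki.text(document, sep: " ")
end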
Better now?
query = """
I am trying to get a resource from a zipped connection (using my own battle-proven
Apache server), but I keep getting a Req.ArchiveError. What is going on here?
"""
tgrm_matches = Exmeralda.Topics.Rag.get_trgm_results("req", query, 8)
->
[
["Req.ArchiveError.html", 0.1906779706478119],
["api-reference.html", 0.1855670064687729],
["Req.TransportError.html", 0.18149466812610626],
["Req.Request.html", 0.17661097645759583],
["Req.HTTPError.html", 0.1760299652814865],
["lib/req/test/ownership_error.ex", 0.1631944477558136],
["CHANGELOG.md", 0.16225165128707886],
["lib/req/transport_error.ex", 0.16044776141643524]
]
Semantic Search with Embedding Models
So now we have seen tsvector and pg_trgm. Both are powerful, but operate entirely on textual similarity. For instance, this works well:
query = """
A file was incomplete. I get a server error
"""
tgrm_matches = Exmeralda.Topics.Rag.get_trgm_results("req", query, 8)
-> ["lib/req/decompress_error.ex", 0.1071428582072258]
But this doesn't:
query = """
A file was incomplete. I get a server fault
"""
tgrm_matches = Exmeralda.Topics.Rag.get_trgm_results("req", query, 8)
-> []
For both ts_rank and trgm, server fault and server error are entirely different things. Ok, admittedly we are trying hard to avoid the word error, because that already gives our system the right idea, but you get the point. Imagine we are searching for a piece of code, and the functions and variables in question are simply named entirely differently.
If you are not familiar with semantic search through embedding models, you might want to read up on it in this post.
Installing pgvector
Again, you need to install an extension with elevated privileges:
psql> CREATE EXTENSION IF NOT EXISTS vector;
With this, we can add columns with the type vector.
defmodule Exmeralda.Repo.Migrations.AddEmbeddings do
  use Ecto.Migration

  def change do
    alter table("chunks") do
      add :embedding_jina_v2_code, :vector, size: 768, null: true
      add :embedding_all_minilm_l6_v2, :vector, size: 384, null: true
      add :embedding_nomic_embed_text, :vector, size: 768, null: true
      add :embedding_mxbai_embed_large, :vector, size: 1024, null: true
    end
  end
end
This is a collection of well-known models. hex2context uses all_minilm_l6_v2, while hexdocs_mcp uses mxbai_embed_large and has previously used nomic_embed_text. Exmeralda uses Jina.
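For completeness, this is roughly what a single-model pgvector lookup can look like on the Ecto side. A minimal sketch, not the actual Exmeralda.Topics.Rag.semantic_search/4: it assumes the pgvector Elixir package is installed (Pgvector.new/1 plus its Postgrex and Ecto types), that the query has already been embedded, and that we only consult the jina_v2_code column. <=> is pgvector's cosine distance operator, so smaller values mean closer.
# Sketch of a single-model pgvector lookup - not the actual semantic_search/4.
# Assumes the pgvector Elixir package is installed and its types registered.
def semantic_search_sketch(library_name, query_embedding, limit \\ 8) do
  vector = Pgvector.new(query_embedding)

  Repo.all(
    from(c in Chunk,
      where: c.library == ^library_name,
      select: %{
        id: c.id,
        source: c.source,
        # cosine distance: 0.0 is identical, larger is further away
        distance: fragment("? <=> ?", c.embedding_jina_v2_code, ^vector)
      },
      order_by: fragment("? <=> ?", c.embedding_jina_v2_code, ^vector),
      limit: ^limit
    )
  )
end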
:ok =
  Exmeralda.Topics.Rag.ingest_from_hex("req", "0.5.10", [
    chunk_size: 6000,
    chunk_overlap: 0,
    convert_html_to_md: true,
    embedding_models: [:jina_v2_code, :nomic_embed_text, :all_minilm_l6_v2, :mxbai_embed_large]
  ])
query = """
A file was incomplete. I get a server fault!
"""
semantic_matches = Exmeralda.Topics.Rag.semantic_search("req", query, [:jina_v2_code, :nomic_embed_text, :all_minilm_l6_v2, :mxbai_embed_large ], 8)
-> [
:mxbai_embed_large,
[
["Req.ArchiveError.html", 0.8825895652492148],
["Req.Steps.html", 0.8861386254170186],
["CHANGELOG.md", 0.8877450658370508],
["lib/req/archive_error.ex", 0.8882296960744931],
["CHANGELOG.md", 0.8982783964506256],
["lib/req/steps.ex", 0.9088519660172513],
["Req.DecompressError.html", 0.909644379977974],
["api-reference.html", 0.9114014126397327]
]
...
]
Here you can find the results from :jina_v2_code, :all_minilm_l6_v2 and :nomic_embed_text:
[
[
:jina_v2_code,
[
["lib/req/decompress_error.ex", 1.1001978869605287],
["lib/req/archive_error.ex", 1.1065949543349611],
["Req.ArchiveError.html", 1.1109306281584121],
["lib/req/finch.ex", 1.1283726470574522],
["Req.HTTPError.html", 1.1437832728481354],
["changelog.html", 1.1477073891751752],
["Req.Test.OwnershipError.html", 1.1518759602352782],
["Req.DecompressError.html", 1.1523340774388808]
]
],
[
:all_minilm_l6_v2,
[
["CHANGELOG.md", 1.187255884476853],
["api-reference.html", 1.1992400226507995],
["CHANGELOG.md", 1.214449018027894],
["Req.DecompressError.html", 1.2167737539185737],
["Req.html", 1.2187207047291677],
["Req.ArchiveError.html", 1.2209354759338176],
["lib/req/transport_error.ex", 1.223696768657922],
["Req.HTTPError.html", 1.2254885691837656]
]
],
[
:nomic_embed_text,
[
["Req.HTTPError.html", 0.9255421003661427],
["Req.ArchiveError.html", 0.9344331806147583],
["Req.TransportError.html", 0.9396665969010519],
["lib/req/archive_error.ex", 0.9512982607615971],
["lib/req/http_error.ex", 0.9652939336935658],
["Req.DecompressError.html", 0.9682236133489371],
["lib/req/transport_error.ex", 0.9696966898486066],
["Req.ChecksumMismatchError.html", 0.9715665051662066]
]
],
[
:mxbai_embed_large,
[
["Req.ArchiveError.html", 0.8825895652492148],
["Req.Steps.html", 0.8861386254170186],
["CHANGELOG.md", 0.8877450658370508],
["lib/req/archive_error.ex", 0.8882296960744931],
["CHANGELOG.md", 0.8982783964506256],
["lib/req/steps.ex", 0.9088519660172513],
["Req.DecompressError.html", 0.909644379977974],
["api-reference.html", 0.9114014126397327]
]
]
]
Most models get a hint of the correct files. Given the broadness of the input, this is not too bad. What do the distances look like when there is no obvious match?
query = """
What is the capital of France?
"""
semantic_matches = Exmeralda.Topics.Rag.semantic_search("req", query, [:jina_v2_code, :nomic_embed_text, :all_minilm_l6_v2, :mxbai_embed_large], 8)
-> [
[
:jina_v2_code,
[
["search.html", 1.2013792773813514],
["changelog.html", 1.2675223081081528],
["lib/req/test/ownership.ex", 1.27652497253914],
["Req.Steps.html", 1.2884749489577967],
["Req.Test.OwnershipError.html", 1.2887939635806138],
["lib/req/transport_error.ex", 1.2910635011599565],
["lib/req/checksum_mismatch_error.ex", 1.2951447768272124],
[".formatter.exs", 1.3046684492170826]
]
],
(...)
Here you can find the results from :mxbai_embed_large, :all_minilm_l6_v2 and :nomic_embed_text:
[
[
:jina_v2_code,
[
["search.html", 1.2013792773813514],
["changelog.html", 1.2675223081081528],
["lib/req/test/ownership.ex", 1.27652497253914],
["Req.Steps.html", 1.2884749489577967],
["Req.Test.OwnershipError.html", 1.2887939635806138],
["lib/req/transport_error.ex", 1.2910635011599565],
["lib/req/checksum_mismatch_error.ex", 1.2951447768272124],
[".formatter.exs", 1.3046684492170826]
]
],
[
:all_minilm_l6_v2,
[
["Req.Request.html", 1.3125475012040022],
["changelog.html", 1.326029751652449],
["Req.Request.html", 1.347188674972259],
["Req.html", 1.34829745311331],
["Req.html", 1.350020261895043],
["lib/req/utils.ex", 1.351296238774175],
["lib/req/test/ownership.ex", 1.3532148635308432],
["Req.Steps.html", 1.3565381583398395]
]
],
[
:nomic_embed_text,
[
["Req.html", 0.9798611585883052],
["Req.html", 0.9798611585883052],
["search.html", 0.9825806991787288],
["CHANGELOG.md", 1.0566829877389166],
["Req.ChecksumMismatchError.html", 1.0655728951074148],
["Req.Test.html", 1.066903190664944],
["Req.Test.html", 1.066903190664944],
["Req.Steps.html", 1.0781216828668398]
]
],
[
:mxbai_embed_large,
[
["search.html", 1.1419219912593435],
["Req.Request.html", 1.177317278184068],
["Req.html", 1.1789670853942418],
["lib/req.ex", 1.1825928941494814],
["Req.html", 1.184483360627748],
["Req.Response.html", 1.1891909155259033],
["README.md", 1.1894402965235962],
["lib/req/steps.ex", 1.1905512509070648]
]
]
]
A fuzzy Query with ts_rank & trgm ...
Before we move on, let's try something hard:
query = """
What are the most recent changes in this library?
"""
iex> Exmeralda.Topics.Rag.get_fulltext_results_2f5cbfa_3("req", query, 7)
-> []
iex> Exmeralda.Topics.Rag.get_trgm_results("req", query, 8)
-> [
[".formatter.exs", 0.11363636702299118],
["lib/req/transport_error.ex", 0.10784313827753067],
["lib/req/test/ownership_error.ex", 0.10132158547639847]
]
... and, in Contrast, with the Embedding Models.
iex> query = "What are the most recent changes in this library?"
iex> Exmeralda.Topics.Rag.semantic_search("req", query, [:jina_v2_code, :nomic_embed_text, :all_minilm_l6_v2, :mxbai_embed_large ], 8)
jina_v2_code:
[
["changelog.html", 0.6787963813404295],
["changelog.html", 0.7811186107777701],
(...)
]
all_minilm_l6_v2:
[
["lib/req/response_async.ex", 0.8177686940327334],
["Req.Response.Async.html", 0.8348016598126115],
["CHANGELOG.md", 0.8403415712638496],
(...)
]
nomic_embed_text:
[
["readme.html", 1.034428927580288],
["lib/req.ex", 1.044201525943257],
["CHANGELOG.md", 1.0647588168422053],
(...)
]
mxbai_embed_large:
[
["CHANGELOG.md", 0.6185546412466434],
["CHANGELOG.md", 0.6508479987416208],
(...)
]
Here, the results of the embedding models differ a lot: mxbai_embed_large and jina_v2_code rank the changelog first, while all_minilm_l6_v2 and nomic_embed_text only list CHANGELOG.md in third place.
Again, you can find the complete results from :jina_v2_code, :all_minilm_l6_v2, :nomic_embed_text and :mxbai_embed_large here:
query = """
What are the most recent changes in this library?
"""
semantic_matches = Exmeralda.Topics.Rag.semantic_search("req", query, [:jina_v2_code, :nomic_embed_text, :all_minilm_l6_v2, :mxbai_embed_large ], 8)
[
[
:jina_v2_code,
[
["changelog.html", 0.6787963813404295],
["changelog.html", 0.7811186107777701],
["readme.html", 0.7866344675718526],
["lib/req/finch.ex", 0.8164027820860295],
["CHANGELOG.md", 0.8586129017187336],
["api-reference.html", 0.9161379148926494],
["Req.html", 0.9174212832035994],
["lib/req.ex", 0.9254078172251721]
]
],
[
:all_minilm_l6_v2,
[
["lib/req/response_async.ex", 0.8177686940327334],
["Req.Response.Async.html", 0.8348016598126115],
["CHANGELOG.md", 0.8403415712638496],
["lib/req.ex", 0.869153003390171],
["lib/req/steps.ex", 0.8702174684426948],
["lib/req/finch.ex", 0.8776942055053603],
["lib/req.ex", 0.8799764263723598],
["lib/req/steps.ex", 0.8882293605494803]
]
],
[
:nomic_embed_text,
[
["readme.html", 1.034428927580288],
["lib/req.ex", 1.044201525943257],
["CHANGELOG.md", 1.0647588168422053],
["lib/req/response.ex", 1.070890618559592],
["mix.exs", 1.0846005999352963],
["Req.ChecksumMismatchError.html", 1.0880410657102961],
["lib/req/steps.ex", 1.0933889610591538],
["Req.html", 1.094495954904274]
]
],
[
:mxbai_embed_large,
[
["CHANGELOG.md", 0.6185546412466434],
["CHANGELOG.md", 0.6508479987416208],
["Req.Response.Async.html", 0.6764724812517728],
["Req.html", 0.6895154541197983],
["lib/req/steps.ex", 0.6902661595261197],
["lib/req/response_async.ex", 0.6943797380713216],
["lib/req/request.ex", 0.6989134223104326],
["Req.Steps.html", 0.6994476796039114]
]
]
]
Of course, this kind of query is where the embedding models shine, and the text-based search strategies fail.
A Systematic Evaluation of Document Retrieval
Now that we have our testing field, let's automate testing:
- For each HTML file in the docs from hexdocs, we let a commercial LLM devise questions that would be answered by that file.
- For each match with either the HTML file or the corresponding .ex file, we give the candidate a point.
- We expect to get some additional matches for related or unrelated files. We will use the ratio between them to evaluate accuracy, precision, recall and f1 score (a sketch of this bookkeeping follows below).
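A minimal sketch of that bookkeeping (not the final evaluation code): given the list of retrieved sources and the set of sources we consider relevant for a question, we can derive precision, recall and f1 directly. Accuracy additionally needs the total number of candidate files, so it is left out here.
# Sketch only - counts hits per question and derives precision, recall and f1.
defmodule RetrievalMetricsSketch do
  def score(retrieved, relevant) do
    true_positives = Enum.count(retrieved, &(&1 in relevant))
    false_positives = length(retrieved) - true_positives
    false_negatives = Enum.count(relevant, &(&1 not in retrieved))

    precision = safe_div(true_positives, true_positives + false_positives)
    recall = safe_div(true_positives, true_positives + false_negatives)

    f1 =
      if precision + recall == 0.0,
        do: 0.0,
        else: 2 * precision * recall / (precision + recall)

    %{precision: precision, recall: recall, f1: f1}
  end

  defp safe_div(_numerator, 0), do: 0.0
  defp safe_div(numerator, denominator), do: numerator / denominator
end

# RetrievalMetricsSketch.score(["Req.ArchiveError.html", "CHANGELOG.md"],
#                              ["Req.ArchiveError.html", "lib/req/archive_error.ex"])
# -> %{precision: 0.5, recall: 0.5, f1: 0.5}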
The Prompt:
You are given a piece of technical documentation.
Perform two tasks:
- Extract the key assertions
- Invent realistic Stack-Overflow-style questions
(...)
"""
You are given a piece of technical documentation.
Perform two tasks:
1. **Extract the key assertions**
• Read the text carefully.
• List every important assertion the docs make (what the feature *is*, how it *works*, guarantees, limits, options, notes, etc.).
• Phrase each important assertion as a single, self-contained sentence.
• Outline the assertions that only this document makes and that will most likely be unique to this document in a list.
2. **Invent realistic Stack-Overflow-style questions**
• Think of developers encountering issues that this doc resolves.
• For **each** imagined user, write:
– **problem** : a first-person sentence that includes a tiny code snippet or concrete detail (e.g. `Req.get!("…", into: :self)`).
– **code** : that mini snippet as a plain string.
– **question**: the direct question they would post (“Why does …?”, “How can I …?”).
• Produce 3–10 of these question objects.
• Every question must be answerable solely with the assertions from task 1.
The documentation is:
======= begin documentation =======
#{markdown_content}
======= end documentation =======
**OUTPUT — return only this JSON structure, with no markdown, no comments:**
{
"questions": [
{
"problem": "<first-person problem statement>",
"code": "<inline code snippet>",
"question": "<direct question>"
}
/* 3-10 such objects total */
]
}
RULES:
• Output must be valid, minified JSON.
• Do **not** include answers or any extra keys.
• Do **not** mention these instructions or add other text.
"""
This yields questions like the following:
{
"filename":"Req.ArchiveError.html",
"question":"I'm calling Req.Steps.decode_body(response) on a potentially corrupted
archive response and it suddenly raises Req.ArchiveError.
Req.Steps.decode_body(response)
Why am I getting a Req.ArchiveError while trying to unpack the archive?"},
{"filename":"Req.ArchiveError.html",
"question":"When I run Req.Steps.decode_body(resp) with a tar.gz file, the function
throws a Req.ArchiveError after failing to unpack it.\nReq.Steps.decode_body(resp)\n
What causes Req.ArchiveError to be returned from decode_body in this scenario?"},
{"filename":"Req.ArchiveError.html",
"question":"I pass an HTTP response body to
Req.Steps.decode_body, hoping it will uncompress automatically, but it raises
Req.ArchiveError instead.\nReq.Steps.decode_body(my_response)\n
How do I handle Req.ArchiveError triggered by decode_body when unpacking an archive?"}
(...)
The questions are realistic enough in that they have the structure of a user's question, but each should ideally yield only one matching file.
The Results
The results look like the following row:
query | target file | ts_rank | trgm | jina_v2_code | nomic_embed_text | all_minilm_l6_v2 | mxbai_embed_large |
---|---|---|---|---|---|---|---|
I tried calling 'HTTP.get(\"/docs/missing_page\")' but I'm getting a 'Page not found' error message.\nHTTP.get(\"/docs/missing_page\")\nWhy am I getting 'Page not found', and what should I do to locate the correct documentation page? | 404.html | | | | | | |
They all rank matches in order of importance. Some of them are exact matches, others are just background noise. We now need to establish a way to tell them apart. This sounds like the perfect cliffhanger, and this post is already quite long.
77 more examples of the file rankings that the different models produce
query | target file | ts_rank | trgm | jina_v2_code | nomic_embed_text | all_minilm_l6_v2 | mxbai_embed_large |
---|---|---|---|---|---|---|---|
I tried calling 'HTTP.get(\"/docs/missing_page\")' but I'm getting a 'Page not found' error message.\nHTTP.get(\"/docs/missing_page\")\nWhy am I getting 'Page not found', and what should I do to locate the correct documentation page? | 404.html | | | | | | |
When I visit 'mysite.com/api/v1/unknown', I see a message that the page does not exist.\nmysite.com/api/v1/unknown\nHow can I find the correct URL or documentation if the page I'm looking for isn't there? | 404.html | | | | | | |
I tried searching for a method doc by name but ended up on an error page saying it doesn't exist.\nsearch_method('fooBarMethod')\nWhere do I go after seeing the page not found message to locate the doc for the method I'm looking for? | 404.html | | | | | | |
I'm not sure how to proceed after seeing a 404 error while looking up 'someRandomEndpoint' in the docs.\nHTTP.get(\"/docs/someRandomEndpoint\")\nShould I use the search sidebar or the API reference to find what I'm looking for, and how does it help? | 404.html | | | | | | |
I'm calling Req.Steps.decode_body(response) on a potentially corrupted archive response and it suddenly raises Req.ArchiveError.\nReq.Steps.decode_body(response)\nWhy am I getting a Req.ArchiveError while trying to unpack the archive? | Req.ArchiveError.html | | | | | | |
When I run Req.Steps.decode_body(resp) with a tar.gz file, the function throws a Req.ArchiveError after failing to unpack it.\nReq.Steps.decode_body(resp)\nWhat causes Req.ArchiveError to be returned from decode_body in this scenario? | Req.ArchiveError.html | | | | | | |
I pass an HTTP response body to Req.Steps.decode_body, hoping it will uncompress automatically, but it raises Req.ArchiveError instead.\nReq.Steps.decode_body(my_response)\nHow do I handle Req.ArchiveError triggered by decode_body when unpacking an archive? | Req.ArchiveError.html | | | | | | |
I'm testing a custom archive through Req.Steps.decode_body(response), but the function aborts with Req.ArchiveError whenever I attempt to decode it.\nReq.Steps.decode_body(response)\nWhy does decode_body return Req.ArchiveError for my custom archive format? | Req.ArchiveError.html | | | | | | |
I'm trying to confirm data integrity with Req.Steps.checksum/1, but when the computed value doesn't match, I see an exception.\nReq.Steps.checksum!(source_data)\nWhat does Req.ChecksumMismatchError represent in this context? | Req.ChecksumMismatchError.html | | | | | | |
I used Req.Steps.checksum/1 to validate a downloaded file's SHA256, but the app crashed with Req.ChecksumMismatchError.\nReq.Steps.checksum(file_path, expected: "abc123...")\nWhy is this error triggered during the checksum step? | Req.ChecksumMismatchError.html | | | | | | |
When I run Req.Steps.checksum/1 on my data, I'm sometimes getting Req.ChecksumMismatchError in production.\nReq.Steps.checksum(data_bytes, expected: "def456...")\nWhat causes a checksum mismatch with this function? | Req.ChecksumMismatchError.html | | | | | | |
I'm using Req.Steps.checksum/1 to verify a binary stream, but I keep encountering Req.ChecksumMismatchError.\nReq.Steps.checksum(stream_data, expected: "xyz789...")\nHow does this exception indicate the checksums differ? | Req.ChecksumMismatchError.html | | | | | | |
When I call Req.Steps.decompress_body/1 on invalid data, I get a Req.DecompressError.\nReq.Steps.decompress_body/1\nWhy does Req.DecompressError occur when attempting to decompress a response? | Req.DecompressError.html | | | | | | |
I'm using Req.Steps.decompress_body/1 to handle a compressed response, but I'm seeing a Req.DecompressError in my logs.\nReq.Steps.decompress_body/1\nWhat causes a Req.DecompressError with a compressed response in Req? | Req.DecompressError.html | | | | | | |
After calling Req.Steps.decompress_body/1 with a broken gzip file, the function raises Req.DecompressError.\nReq.Steps.decompress_body/1\nHow can I resolve a Req.DecompressError that occurs during decompression? | Req.DecompressError.html | | | | | | |
I'm calling | Req.HTTPError.html | | | | | | |
I created a custom Req adapter for an internal API and I'm raising | Req.HTTPError.html | | | | | | |
When I see | Req.HTTPError.html | | | | | | |
After upgrading Req, I'm seeing a standardized error message: | Req.HTTPError.html | | | | | | |
I'm catching | Req.HTTPError.html | | | | | | |
In the follow-up post of this series, we will see how a common ground truth can be established and how we can evaluate the accuracy, precision, recall and f1 score of the different models. We will then use this toolset to introduce techniques to boost the search results across all mechanisms.
Elixir is an excellent choice for these applications due to its scalability, fault tolerance, and concurrency model. Its lightweight processes and message-passing architecture make it ideal for orchestrating complex AI workflows efficiently. bitcrowd's first Elixir ML project dates back to 2020, and we have since enabled various clients to build and scale their AI projects.
bitcrowd is an excellent choice if you need a scalable RAG system or a fully integrated AI pipeline. We help you build, optimize, and maintain it with a focus on reliability and performance.
Drop us a line via email if you want to build your next AI project with Elixir. Or book a call with us to discuss your project.