How Text Search is Broken
When we released Exmeralda in May 2025, we made a simple yet interesting implementation mistake. This blog post is the log of what happened when "I just went to fix that little error in the text search".
Here is what our code looked like:
generation =
  Generation.new(query, opts)
  |> Embedding.generate_embedding(embedding_provider())
  |> Retrieval.retrieve(:fulltext_results, &query_fulltext(&1, scope))
  |> Retrieval.retrieve(:semantic_results, &query_with_pgvector(&1, scope))
  |> Retrieval.reciprocal_rank_fusion(@retrieval_weights, :rrf_result)
  |> Retrieval.deduplicate(:rrf_result, [:id])
We wanted to combine the strengths of full-text and semantic search, and were using reciprocal rank fusion to merge the two result lists. In the two-week sprint leading up to the launch, we did not check the full-text results. When we finally did, we were quite surprised.
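Reciprocal rank fusion itself is pleasantly simple: every result list contributes a score of weight / (k + rank) per document, and the fused list is sorted by the summed scores. A minimal sketch of the idea (not the actual Retrieval module; the weights and k = 60 are illustrative defaults):
# Sketch of reciprocal rank fusion - not the actual Retrieval module.
# Each list contributes weight / (k + rank) per document id; documents
# are then sorted by their summed score. k = 60 is a common default.
defmodule RRFSketch do
  def fuse(result_lists, weights, k \\ 60) do
    result_lists
    |> Enum.zip(weights)
    |> Enum.flat_map(fn {results, weight} ->
      results
      |> Enum.with_index(1)
      |> Enum.map(fn {doc_id, rank} -> {doc_id, weight / (k + rank)} end)
    end)
    |> Enum.group_by(fn {doc_id, _score} -> doc_id end, fn {_doc_id, score} -> score end)
    |> Enum.map(fn {doc_id, scores} -> {doc_id, Enum.sum(scores)} end)
    |> Enum.sort_by(fn {_doc_id, score} -> score end, :desc)
  end
end

# RRFSketch.fuse([["a", "b"], ["b", "c"]], [1.0, 1.0])
# -> [{"b", 0.0325...}, {"a", 0.0164...}, {"c", 0.0161...}]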
We tried a typical search phrase against req:
What is the purpose of Req.Test.OwnershipError? There does not seem to be any documentation associated with it.
And the results were all marked false for a match:
[
["https://hexdocs.pm/req/Req.ChecksumMismatchError.html", false],
["https://hexdocs.pm/req/Req.DecompressError.html", false],
["https://hexdocs.pm/req/Req.ArchiveError.html", false]
]
What was going on? Our search code looked quite innocent:
# the original fts implementation
def get_fulltext_results(library_name, query, limit \\ 3) do
  Repo.all(
    from(c in Chunk,
      where: c.library == ^library_name,
      order_by: [desc: fragment("search @@ plainto_tsquery(?)", ^query)],
      select: %{
        id: c.id,
        rank: fragment("search @@ plainto_tsquery(?)", ^query),
        source: c.source,
        content: c.content
      },
      limit: ^limit
    )
  )
end
This is because the typical tsvector helper functions, like plainto_tsquery, join all input words with an AND operator. While very fast, this does not work well for typical search phrases: a simple query like "What is req?" is compiled to 'what' & 'is' & 'req'.
We see that what and is are not removed and are required for a match through the AND operator. As there are no matches and all results are false, the order of records (and thus the result) depends on the random order of chunks in the database. If you re-evaluate the ingestion in the cell above, you will notice how the list of retrieved documents changes.
A more complex query
What is the purpose of Req.Test.OwnershipError? There does not seem to be any documentation associated with it.
is broken down to
["'what' & 'is' & 'the' & 'purpose' & 'of' & 'req.test.ownershiperror' & 'there' & 'does' & 'not' & 'seem' & 'to' & 'be' & 'any' & 'documentation' & 'associated' & 'with' & 'it'"]
On the upside, plainto_tsquery removes all special characters for us:
evil_query = """
"SELECT plainto_tsquery($1)::text;", ["-- (SMPOL)
DROP sampletable;-- DROP/*comment*/\\"sampletable SELECT CASE WHEN (1=1) THEN 'A' ELSE 'B' END;"]
\\"
'The Fat & Rats:C'
"""
{:ok, result} = Exmeralda.Repo.query("SELECT plainto_tsquery($1)::text;", [evil_query])
Enum.at(result.rows, 0)
This is compiled to:
SELECT plainto_tsquery($1)::text; ["\"SELECT plainto_tsquery($1)::text;\", [\"-- (SMPOL)\nDROP sampletable;-- DROP/*comment*/\\\"sampletable SELECT CASE WHEN (1=1) THEN 'A' ELSE 'B' END;\"]\n\\\" \n'The Fat & Rats:C' \n"]
Or, more readable:
["'select' & 'plainto' & 'tsquery' & '1' & 'text' & 'smpol' & 'drop' & 'sampletable' & 'drop' & 'comment' & 'sampletable' & 'select' & 'case' & 'when' & '1' & '1' & 'then' & 'a' & 'else' & 'b' & 'end' & 'the' & 'fat' & 'rats' & 'c'"]
The tsquery language option
We can remove many unwanted words by using the 'english' option:
query = "What is the purpose of Req.Test.OwnershipError? There does not seem to be any documentation associated with it."
{:ok, result} = Exmeralda.Repo.query("SELECT plainto_tsquery('english',$1)::text;", [query])
-> ["'purpos' & 'req.test.ownershiperror' & 'seem' & 'document' & 'associ'"]
The function also converts the words we enter into lexemes, which means that e.g. purpose and purposes are both mapped to purpos, and associated becomes associ (as would association).
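You can verify the stemming directly with the same Repo.query trick as above (a quick sketch; the word list is arbitrary):
{:ok, result} =
  Exmeralda.Repo.query(
    "SELECT to_tsvector('english', $1)::text;",
    ["purpose purposes associated association"]
  )
Enum.at(result.rows, 0)
-> ["'associ':3,4 'purpos':1,2"]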
However, this is often not what we want either. If we take a simple query like "What is req?", only req survives, and the intention of the question is gone:
{:ok, result} = Exmeralda.Repo.query("SELECT plainto_tsquery('english', $1)::text;", ["what is req?"])
-> ["'req'"]
But before we dive into that, let's make sure we only get true results back:
Oh postgres, why can't you be true?
We changed the query so that only matches the @@ operator considers "true" are returned:
# update_one: only allow matches via the @@ operator:
def get_fulltext_results_update_one(library_name, search_term, limit \\ 3) do
  Repo.all(
    from(c in Chunk,
      cross_join: query in fragment("plainto_tsquery(?)", ^search_term),
      where: fragment("f1 @@ search"), # <--- here
      where: c.library == ^library_name,
      order_by: [desc: fragment("ts_rank(search, f1)")],
      select: %{
        id: c.id,
        source: c.source,
        content: c.content,
        rank: fragment("ts_rank(search, f1) as rank")
      },
      limit: ^limit
    )
  )
end
As expected, we now get an empty result:
query = """
What is the purpose of Req.Test.OwnershipError? There does not seem to be any documentation associated with it.
"""
IO.puts "No 'english':"
fulltext_boolean_matches = Exmeralda.Topics.Rag.get_fulltext_results_update_one("req", query, 3)
-> No 'english':
[] # <-- see? empty!
With the matches solved, let's explore the results when we use the 'english' option of plainto_tsquery:
# update_two: using plainto_tsquery('english', ?)
def get_fulltext_results_update_two(library_name, search_term, limit \\ 3) do
  Repo.all(
    from(c in Chunk,
      cross_join: query in fragment("plainto_tsquery('english', ?)", ^search_term),
      where: fragment("f1 @@ search"), # boolean match filter, as in update_one
      where: c.library == ^library_name,
      order_by: [desc: fragment("ts_rank(search, f1, 32)")],
      select: %{
        id: c.id,
        source: c.source,
        content: c.content,
        rank: fragment("ts_rank(search, f1) as rank")
      },
      limit: ^limit
    )
  )
end
Let's see how much better our search has become through the 'english' option:
IO.puts "With 'english':"
query = """
What is the purpose of Req.Test.OwnershipError? There does not seem to be any documentation associated with it.
"""
fulltext_boolean_matches = Exmeralda.Topics.Rag.get_fulltext_results_update_two("req", query, 10)
-> With 'english':
[] # <-- see? still empty!
Wow, still no luck. As explained above, documents would have to fulfill a lot of AND-ed criteria. But what about a shorter query?
What is req?
iex> Exmeralda.Topics.Rag.get_fulltext_results_update_two("req", "What is req?", 50)
|> Enum.frequencies_by(fn chunk -> [chunk.source] end)
-> %{
["CHANGELOG.md"] => 4,
["README.md"] => 2,
["Req.Request.html"] => 5,
["Req.Steps.html"] => 6,
["Req.Test.html"] => 3,
["Req.html"] => 7,
["changelog.html"] => 5,
["lib/req.ex"] => 5,
["lib/req/finch.ex"] => 2,
["lib/req/request.ex"] => 4,
["lib/req/steps.ex"] => 3,
["lib/req/test.ex"] => 2,
["readme.html"] => 2
...
}
Wishful thinking
So far, the results are disappointing: we still get no matches for our long query:
"What is the purpose of Req.Test.OwnershipError? There does not seem to be any documentation associated with it."
And we get far too many (and random) results for the short one:
"What is req?"
This is understandable, as the PostgreSQL full-text search functions are meant to match short search terms, e.g. to filter articles containing the phrase "Quantum Computing".
Can we get more sensible results with ts_rank?
Intro: ts_rank
Instead of doing a boolean search with @@ and plainto_tsquery, we can use ts_rank. ts_rank counts how often a lexeme appears in a text. As our chunks are all in a similar length range, we give no penalty for long documents (see this part of the postgres documentation for details).
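To get a feel for the normalization flags, we can ask PostgreSQL directly. A quick sketch (the sample document is made up): flag 0 ignores the document length, 2 divides the rank by the document length, and 32 rescales the rank into the range 0..1.
{:ok, result} =
  Exmeralda.Repo.query(
    """
    SELECT ts_rank(to_tsvector('english', $1), plainto_tsquery('english', $2), 0),
           ts_rank(to_tsvector('english', $1), plainto_tsquery('english', $2), 2),
           ts_rank(to_tsvector('english', $1), plainto_tsquery('english', $2), 32);
    """,
    ["Req is a batteries-included HTTP client for Elixir", "http client"]
  )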
# update_three: dropping the @@ operator
def get_fulltext_results_update_three(library_name, search_term, limit \\ 3) do
  Repo.all(
    from(c in Chunk,
      cross_join: query in fragment("plainto_tsquery('english', ?)", ^search_term),
      where: c.library == ^library_name,
      order_by: [desc: fragment("ts_rank(search, f1, 32)")], # <-- look here!
      select: %{
        id: c.id,
        source: c.source,
        content: c.content,
        rank: fragment("ts_rank(search, f1) as rank") # <-- and here !!
      },
      limit: ^limit
    )
  )
end
Is it getting better?
iex> Exmeralda.Topics.Rag.get_fulltext_results_update_three("req", query, 7)
-> [
["Req.Test.OwnershipError.html", 0.17241628468036652],
["lib/req/test/ownership.ex", 0.0030119779985398054],
["LICENSE.md", 0.0021660723723471165],
["LICENSE.md", 1.3680694621598377e-7],
["Req.Request.html", 9.999999682655225e-21],
["Req.Request.html", 9.999999682655225e-21],
["Req.DecompressError.html", 9.999999682655225e-21]
]
Much better. There is still some noise, but we get workable results. The values are very small, but at least two of the top four chunks are relevant:
["Req.Test.OwnershipError.html", 0.17241628468036652],
["lib/req/test/ownership.ex",0.0030119779985398054],
["LICENSE.md", 0.0021586192306131124]
["Req.DecompressError.html", 9.999999682655225e-21]
Let's re-introduce a cut-off, maybe at 0.003? It's also time to clean up our query a bit:
# Update_four: refactoring and introduction of min-rank
def get_fulltext_results_update_four(library_name, search_term, limit \\ 3) do
  min_rank = 0.003

  ranked_subquery =
    from(c in Chunk,
      select: %{
        id: c.id,
        source: c.source,
        content: c.content,
        rank: fragment("ts_rank(search, plainto_tsquery('english', ?))", ^search_term)
      },
      where: c.library == ^library_name
    )

  Repo.all(
    from(r in subquery(ranked_subquery),
      select: %{
        id: r.id,
        source: r.source,
        rank: r.rank,
        content: r.content
      },
      where: r.rank > ^min_rank,
      order_by: [desc: r.rank],
      limit: ^limit
    )
  )
end
Here we go:
iex> Exmeralda.Topics.Rag.get_fulltext_results_update_four("req", query, 7)
-> [
["Req.Test.OwnershipError.html", 0.17241628468036652],
["lib/req/test/ownership.ex", 0.0030119779985398054]
]
Neat. This is how we want it. It is important that all searches have a cut-off. Eliminating bad matches allows the use of simple strategies like reciprocal_rank_fusion without further cleanup down the line.
But wait - what about partial matches?
query_for_ownership_error = "OwnershipError"
fulltext_boolean_matches = Exmeralda.Topics.Rag.get_fulltext_results_2f5cbfa_3("req", query_for_ownership_error, 7)
Enum.map(fulltext_boolean_matches, fn chunk -> [chunk.source, chunk.rank] end)
-> []
Buhuuu...
No partial matches
tsvector is not magical when it comes to matching OwnershipError against Req.Test.OwnershipError. This, of course, is not ideal.
Enter Trigrams: pg_trgm
The PostgreSQL extension pg_trgm offers us partial matches. To use it, we need to activate it.
The setup
If you are not running the database connection as a superuser (and you really should not), you might need to add this manually:
psql> CREATE EXTENSION IF NOT EXISTS pg_trgm;
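A quick sanity check that the extension is active (a sketch): similarity/2 compares the trigram sets of the two strings, show_trgm/1 lists the trigrams themselves.
{:ok, result} =
  Exmeralda.Repo.query(
    "SELECT similarity($1, $2), show_trgm($2);",
    ["Req.Test.OwnershipError", "OwnershipError"]
  )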
Now we can add a GIN index with the gin_trgm_ops operator class to the chunks table in a migration:
defmodule Exmeralda.Repo.Migrations.AddPgTrgm do
  use Ecto.Migration

  def up do
    execute("""
    CREATE INDEX IF NOT EXISTS chunks_content_trgm_idx
    ON chunks
    USING GIN (content gin_trgm_ops);
    """)
  end

  def down do
    execute "DROP INDEX IF EXISTS chunks_content_trgm_idx;"
  end
end
Our function looks very similar to get_fulltext_results_update_four:
def get_trgm_results(library_name, query_string, limit \\ 3) do
  min_score = 0.065

  ranked_subquery =
    from(c in Chunk,
      select: %{
        id: c.id,
        source: c.source,
        content: c.content,
        score: fragment("similarity(?, ?)", c.content, ^query_string)
      },
      where: c.library == ^library_name
    )

  Repo.all(
    from(r in subquery(ranked_subquery),
      select: %{
        id: r.id,
        source: r.source,
        score: r.score,
        content: r.content
      },
      where: r.score > ^min_score,
      order_by: [desc: r.score],
      limit: ^limit
    )
  )
end
The first trgm Query
Let's try the one-word query that failed with tsvector again, this time with trgm. We had 4 matches for Test.OwnershipError, so let's see what the leaderboard looks like over here:
tgrm_matches = Exmeralda.Topics.Rag.get_trgm_results("req", "OwnershipError", 8)
-> [
["lib/req/test/ownership.ex", 0.09677419066429138],
["lib/req/test/ownership_error.ex", 0.07389162480831146],
["lib/req/application.ex", 0.0662650614976883],
["changelog.html", 0.04237288236618042]
]
Great! We now get the partial match we are looking for! But why do we find changelog.html and lib/req/application.ex in the search results?
01 # lib/req/application.ex
02
03 defmodule Req.Application do
...
09 def start(_type, _args) do
10 children = [
11 {Finch,
12 name: Req.Finch,
13 pools: %{
14 default: Req.Finch.pool_options(%{})
15 }},
16 {DynamicSupervisor, strategy: :one_for_one, name: Req.FinchSupervisor},
17 {Req.Test.Ownership, name: Req.Test.Ownership}
18 ]
...
22 end
Ah. The Req.Test.Ownership in line 17. Looks like trgm is finding occurrences that tsvector is missing. But what about changelog.html?
"</span>
<span class=\"p\" data-group-id=\"3838964925-12\">
{</span><span class=\"ss\">:error</span><span class=\"p\">,
</span><span class=\"w\"> </span><span class=\"n\">reason</span><
span class=\"p\" data-group-id=\"3838964925-12\">}</span>
<span class=\"w\"> </span><span class=\"o\">-></span>
<span class=\"w\">\n</span><span class=\"p\" data-group-id=\"3838964925-13\">
{</span><span class=\"n\">request</span><span class=\"p\">,</span><span class=\"w\"> </span><span class=\"nc\">RuntimeError</span><span class=\"o\">.</span>
<span class=\"n\">exception</span><span class=\"p\" data-group-id=\"3838964925-14\">
(</span><span class=\"n\">inspect</span><span class=\"p\" data-group-id=\"3838964925-15\">
(</span><span class=\"n\">reason</span><span class=\"p\" data-group-id=\"3838964925-15\">)
</span><span class=\"p\" data-group-id=\"3838964925-14\">)</span><span class=\"p\" data-group-id=\"3838964925-13\">}</span><span class=\"w\">\n </span><span class=\"k\" data-group-id=\"3838964925-5\">end</span><span class=\"w\">\n</span><span class=\"k\" data-group-id=\"3838964925-1\">end</span><span class=\"w\">"],
Is pg_trgm confused by the HTML? Let's find out! We ingest the same documents again, but convert HTML files to markdown:
:ok = Exmeralda.Topics.Rag.ingest_from_hex("req", "0.5.10",
[chunk_size: 6000,
chunk_overlap: 0,
convert_html_to_md: true])
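Under the hood, the point of the conversion is to strip away the <span> soup we saw above before chunking, so that the trigrams are computed on prose rather than on markup. A minimal sketch of the idea, assuming the Floki HTML parser is available (the actual convert_html_to_md option may well use a dedicated HTML-to-Markdown converter):
# Sketch only: reduce HTML to its text content before chunking.
# Assumes Floki is available; a real HTML-to-Markdown conversion would
# also preserve headings, links and code blocks.
def html_to_text(html) do
  {:ok, document} = Floki.parse_document(html)
  Floki.text(document, sep: " ")
end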
Better now?
query = """
I am trying to get a resource from a zipped connection (using my own battle-proven
Apache server), but I keep getting a Req.ArchiveError. What is going on here?
"""
tgrm_matches = Exmeralda.Topics.Rag.get_trgm_results("req", query, 8)
->
[
["Req.ArchiveError.html", 0.1906779706478119],
["api-reference.html", 0.1855670064687729],
["Req.TransportError.html", 0.18149466812610626],
["Req.Request.html", 0.17661097645759583],
["Req.HTTPError.html", 0.1760299652814865],
["lib/req/test/ownership_error.ex", 0.1631944477558136],
["CHANGELOG.md", 0.16225165128707886],
["lib/req/transport_error.ex", 0.16044776141643524]
]
Semantic Search with Embedding Models
So now we have seen tsvector and pg_trgm. Both are powerful, but operate entirely on textual similarity. For instance, this works well:
query = """
A file was incomplete. I get a server error
"""
tgrm_matches = Exmeralda.Topics.Rag.get_trgm_results("req", query, 8)
-> ["lib/req/decompress_error.ex", 0.1071428582072258]
But this doesn't:
query = """
A file was incomplete. I get a server fault
"""
tgrm_matches = Exmeralda.Topics.Rag.get_trgm_results("req", query, 8)
-> []
For both ts_rank and trgm, server fault and server error are entirely different things. Ok, admittedly we are trying hard to avoid the word error, because that already gives our system the right idea, but you get the point. Imagine we are searching for a piece of code, and the functions and variables in question are simply named entirely differently.
If you are not familiar with semantic search through embedding models, you might want to read up on it in this post.
Installing pgvector
Again, you need to install an extension with elevated privileges:
psql> CREATE EXTENSION IF NOT EXISTS vector;
With this, we can add columns with the type vector.
defmodule Exmeralda.Repo.Migrations.AddEmbeddings do
  use Ecto.Migration

  def change do
    alter table("chunks") do
      add :embedding_jina_v2_code, :vector, size: 768, null: true
      add :embedding_all_minilm_l6_v2, :vector, size: 384, null: true
      add :embedding_nomic_embed_text, :vector, size: 768, null: true
      add :embedding_mxbai_embed_large, :vector, size: 1024, null: true
    end
  end
end
This is a collection of well-known models. hex2context uses all_minilm_l6_v2, while hexdocs_mcp uses mxbai_embed_large and has previously used nomic_embed_text. Exmeralda uses Jina.
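For completeness, this is roughly what a single-model pgvector lookup can look like on the Ecto side. A minimal sketch, not the actual Exmeralda.Topics.Rag.semantic_search/4: it assumes the pgvector Elixir package is installed (Pgvector.new/1 plus its Postgrex and Ecto types), that the query has already been embedded, and that we only consult the jina_v2_code column. <=> is pgvector's cosine distance operator, so smaller values mean closer.
# Sketch of a single-model pgvector lookup - not the actual semantic_search/4.
# Assumes the pgvector Elixir package is installed and its types registered.
def semantic_search_sketch(library_name, query_embedding, limit \\ 8) do
  vector = Pgvector.new(query_embedding)

  Repo.all(
    from(c in Chunk,
      where: c.library == ^library_name,
      select: %{
        id: c.id,
        source: c.source,
        # cosine distance: 0.0 is identical, larger is further away
        distance: fragment("? <=> ?", c.embedding_jina_v2_code, ^vector)
      },
      order_by: fragment("? <=> ?", c.embedding_jina_v2_code, ^vector),
      limit: ^limit
    )
  )
end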
:ok =
  Exmeralda.Topics.Rag.ingest_from_hex("req", "0.5.10", [
    chunk_size: 6000,
    chunk_overlap: 0,
    convert_html_to_md: true,
    embedding_models: [:jina_v2_code, :nomic_embed_text, :all_minilm_l6_v2, :mxbai_embed_large]
  ])
query = """
A file was incomplete. I get a server fault!
"""
semantic_matches = Exmeralda.Topics.Rag.semantic_search("req", query, [:jina_v2_code, :nomic_embed_text, :all_minilm_l6_v2, :mxbai_embed_large ], 8)
-> [
:mxbai_embed_large,
[
["Req.ArchiveError.html", 0.8825895652492148],
["Req.Steps.html", 0.8861386254170186],
["CHANGELOG.md", 0.8877450658370508],
["lib/req/archive_error.ex", 0.8882296960744931],
["CHANGELOG.md", 0.8982783964506256],
["lib/req/steps.ex", 0.9088519660172513],
["Req.DecompressError.html", 0.909644379977974],
["api-reference.html", 0.9114014126397327]
]
...
]
Here you can find the results from :jina_v2_code, :all_minilm_l6_v2 and :nomic_embed_text:
[
[
:jina_v2_code,
[
["lib/req/decompress_error.ex", 1.1001978869605287],
["lib/req/archive_error.ex", 1.1065949543349611],
["Req.ArchiveError.html", 1.1109306281584121],
["lib/req/finch.ex", 1.1283726470574522],
["Req.HTTPError.html", 1.1437832728481354],
["changelog.html", 1.1477073891751752],
["Req.Test.OwnershipError.html", 1.1518759602352782],
["Req.DecompressError.html", 1.1523340774388808]
]
],
[
:all_minilm_l6_v2,
[
["CHANGELOG.md", 1.187255884476853],
["api-reference.html", 1.1992400226507995],
["CHANGELOG.md", 1.214449018027894],
["Req.DecompressError.html", 1.2167737539185737],
["Req.html", 1.2187207047291677],
["Req.ArchiveError.html", 1.2209354759338176],
["lib/req/transport_error.ex", 1.223696768657922],
["Req.HTTPError.html", 1.2254885691837656]
]
],
[
:nomic_embed_text,
[
["Req.HTTPError.html", 0.9255421003661427],
["Req.ArchiveError.html", 0.9344331806147583],
["Req.TransportError.html", 0.9396665969010519],
["lib/req/archive_error.ex", 0.9512982607615971],
["lib/req/http_error.ex", 0.9652939336935658],
["Req.DecompressError.html", 0.9682236133489371],
["lib/req/transport_error.ex", 0.9696966898486066],
["Req.ChecksumMismatchError.html", 0.9715665051662066]
]
],
[
:mxbai_embed_large,
[
["Req.ArchiveError.html", 0.8825895652492148],
["Req.Steps.html", 0.8861386254170186],
["CHANGELOG.md", 0.8877450658370508],
["lib/req/archive_error.ex", 0.8882296960744931],
["CHANGELOG.md", 0.8982783964506256],
["lib/req/steps.ex", 0.9088519660172513],
["Req.DecompressError.html", 0.909644379977974],
["api-reference.html", 0.9114014126397327]
]
]
]
Most models get a hint of the correct files. Given the broadness of the input, this is not too bad. What do the distances look like when there is no obvious match?
query = """
What is the capital of France?
"""
semantic_matches = Exmeralda.Topics.Rag.semantic_search("req", query, [:jina_v2_code, :nomic_embed_text, :all_minilm_l6_v2, :mxbai_embed_large], 8)
-> [
[
:jina_v2_code,
[
["search.html", 1.2013792773813514],
["changelog.html", 1.2675223081081528],
["lib/req/test/ownership.ex", 1.27652497253914],
["Req.Steps.html", 1.2884749489577967],
["Req.Test.OwnershipError.html", 1.2887939635806138],
["lib/req/transport_error.ex", 1.2910635011599565],
["lib/req/checksum_mismatch_error.ex", 1.2951447768272124],
[".formatter.exs", 1.3046684492170826]
]
],
(...)
Here you can find the results from :mxbai_embed_large, :all_minilm_l6_v2 and :nomic_embed_text:
[
[
:jina_v2_code,
[
["search.html", 1.2013792773813514],
["changelog.html", 1.2675223081081528],
["lib/req/test/ownership.ex", 1.27652497253914],
["Req.Steps.html", 1.2884749489577967],
["Req.Test.OwnershipError.html", 1.2887939635806138],
["lib/req/transport_error.ex", 1.2910635011599565],
["lib/req/checksum_mismatch_error.ex", 1.2951447768272124],
[".formatter.exs", 1.3046684492170826]
]
],
[
:all_minilm_l6_v2,
[
["Req.Request.html", 1.3125475012040022],
["changelog.html", 1.326029751652449],
["Req.Request.html", 1.347188674972259],
["Req.html", 1.34829745311331],
["Req.html", 1.350020261895043],
["lib/req/utils.ex", 1.351296238774175],
["lib/req/test/ownership.ex", 1.3532148635308432],
["Req.Steps.html", 1.3565381583398395]
]
],
[
:nomic_embed_text,
[
["Req.html", 0.9798611585883052],
["Req.html", 0.9798611585883052],
["search.html", 0.9825806991787288],
["CHANGELOG.md", 1.0566829877389166],
["Req.ChecksumMismatchError.html", 1.0655728951074148],
["Req.Test.html", 1.066903190664944],
["Req.Test.html", 1.066903190664944],
["Req.Steps.html", 1.0781216828668398]
]
],
[
:mxbai_embed_large,
[
["search.html", 1.1419219912593435],
["Req.Request.html", 1.177317278184068],
["Req.html", 1.1789670853942418],
["lib/req.ex", 1.1825928941494814],
["Req.html", 1.184483360627748],
["Req.Response.html", 1.1891909155259033],
["README.md", 1.1894402965235962],
["lib/req/steps.ex", 1.1905512509070648]
]
]
]
A fuzzy Query with ts_rank & trgm ...
Before we move on, let's try something hard:
query = """
What are the most recent changes in this library?
"""
iex> Exmeralda.Topics.Rag.get_fulltext_results_2f5cbfa_3("req", query, 7)
-> []
iex> Exmeralda.Topics.Rag.get_trgm_results("req", query, 8)
-> [
[".formatter.exs", 0.11363636702299118],
["lib/req/transport_error.ex", 0.10784313827753067],
["lib/req/test/ownership_error.ex", 0.10132158547639847]
]
... and, in Contrast, with the Embedding Models.
iex> query = "What are the most recent changes in this library?"
iex> Exmeralda.Topics.Rag.semantic_search("req", query, [:jina_v2_code, :nomic_embed_text, :all_minilm_l6_v2, :mxbai_embed_large ], 8)
jina_v2_code:
[
["changelog.html", 0.6787963813404295],
["changelog.html", 0.7811186107777701],
(...)
]
all_minilm_l6_v2:
[
["lib/req/response_async.ex", 0.8177686940327334],
["Req.Response.Async.html", 0.8348016598126115],
["CHANGELOG.md", 0.8403415712638496],
(...)
]
nomic_embed_text:
[
["readme.html", 1.034428927580288],
["lib/req.ex", 1.044201525943257],
["CHANGELOG.md", 1.0647588168422053],
(...)
]
mxbai_embed_large:
[
["CHANGELOG.md", 0.6185546412466434],
["CHANGELOG.md", 0.6508479987416208],
(...)
]
Here, the results of the embedding models differ a lot: mxbai_embed_large and jina_v2_code rank the changelog first, while all_minilm_l6_v2 and nomic_embed_text only list CHANGELOG.md in third place.
Again, you can find the complete results from :jina_v2_code, :all_minilm_l6_v2, :nomic_embed_text and :mxbai_embed_large here:
query = """
What are the most recent changes in this library?
"""
semantic_matches = Exmeralda.Topics.Rag.semantic_search("req", query, [:jina_v2_code, :nomic_embed_text, :all_minilm_l6_v2, :mxbai_embed_large ], 8)
[
[
:jina_v2_code,
[
["changelog.html", 0.6787963813404295],
["changelog.html", 0.7811186107777701],
["readme.html", 0.7866344675718526],
["lib/req/finch.ex", 0.8164027820860295],
["CHANGELOG.md", 0.8586129017187336],
["api-reference.html", 0.9161379148926494],
["Req.html", 0.9174212832035994],
["lib/req.ex", 0.9254078172251721]
]
],
[
:all_minilm_l6_v2,
[
["lib/req/response_async.ex", 0.8177686940327334],
["Req.Response.Async.html", 0.8348016598126115],
["CHANGELOG.md", 0.8403415712638496],
["lib/req.ex", 0.869153003390171],
["lib/req/steps.ex", 0.8702174684426948],
["lib/req/finch.ex", 0.8776942055053603],
["lib/req.ex", 0.8799764263723598],
["lib/req/steps.ex", 0.8882293605494803]
]
],
[
:nomic_embed_text,
[
["readme.html", 1.034428927580288],
["lib/req.ex", 1.044201525943257],
["CHANGELOG.md", 1.0647588168422053],
["lib/req/response.ex", 1.070890618559592],
["mix.exs", 1.0846005999352963],
["Req.ChecksumMismatchError.html", 1.0880410657102961],
["lib/req/steps.ex", 1.0933889610591538],
["Req.html", 1.094495954904274]
]
],
[
:mxbai_embed_large,
[
["CHANGELOG.md", 0.6185546412466434],
["CHANGELOG.md", 0.6508479987416208],
["Req.Response.Async.html", 0.6764724812517728],
["Req.html", 0.6895154541197983],
["lib/req/steps.ex", 0.6902661595261197],
["lib/req/response_async.ex", 0.6943797380713216],
["lib/req/request.ex", 0.6989134223104326],
["Req.Steps.html", 0.6994476796039114]
]
]
]
Of course, this kind of query is where the embedding models shine, and the text-based search strategies fail.
A Systematic Evaluation of Document Retrieval
Now that we have our testing field, let's automate testing:
- For each HTML file in the docs from hexdocs, we let a commercial LLM devise questions that would be answered by that file.
- For each match with either the HTML file or the corresponding .ex file, we give the candidate a point.
- We expect to get some additional matches for related or unrelated files. We will use the ratio between them to evaluate accuracy, precision, recall and f1 score (a sketch of this bookkeeping follows below).
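A minimal sketch of that bookkeeping (not the final evaluation code): given the list of retrieved sources and the set of sources we consider relevant for a question, we can derive precision, recall and f1 directly. Accuracy additionally needs the total number of candidate files, so it is left out here.
# Sketch only - counts hits per question and derives precision, recall and f1.
defmodule RetrievalMetricsSketch do
  def score(retrieved, relevant) do
    true_positives = Enum.count(retrieved, &(&1 in relevant))
    false_positives = length(retrieved) - true_positives
    false_negatives = Enum.count(relevant, &(&1 not in retrieved))

    precision = safe_div(true_positives, true_positives + false_positives)
    recall = safe_div(true_positives, true_positives + false_negatives)

    f1 =
      if precision + recall == 0.0,
        do: 0.0,
        else: 2 * precision * recall / (precision + recall)

    %{precision: precision, recall: recall, f1: f1}
  end

  defp safe_div(_numerator, 0), do: 0.0
  defp safe_div(numerator, denominator), do: numerator / denominator
end

# RetrievalMetricsSketch.score(["Req.ArchiveError.html", "CHANGELOG.md"],
#                              ["Req.ArchiveError.html", "lib/req/archive_error.ex"])
# -> %{precision: 0.5, recall: 0.5, f1: 0.5}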
The Prompt:
You are given a piece of technical documentation.
Perform two tasks:
- Extract the key assertions
- Invent realistic Stack-Overflow-style questions
(...)
"""
You are given a piece of technical documentation.
Perform two tasks:
1. **Extract the key assertions**
• Read the text carefully.
• List every important assertion the docs make (what the feature *is*, how it *works*, guarantees, limits, options, notes, etc.).
• Phrase each important assertion as a single, self-contained sentence.
• Outline the assertions that only this document makes and that will most likely be unique to this document in a list.
2. **Invent realistic Stack-Overflow-style questions**
• Think of developers encountering issues that this doc resolves.
• For **each** imagined user, write:
– **problem** : a first-person sentence that includes a tiny code snippet or concrete detail (e.g. `Req.get!("…", into: :self)`).
– **code** : that mini snippet as a plain string.
– **question**: the direct question they would post (“Why does …?”, “How can I …?”).
• Produce 3–10 of these question objects.
• Every question must be answerable solely with the assertions from task 1.
The documentation is:
======= begin documentation =======
#{markdown_content}
======= end documentation =======
**OUTPUT — return only this JSON structure, with no markdown, no comments:**
{
"questions": [
{
"problem": "<first-person problem statement>",
"code": "<inline code snippet>",
"question": "<direct question>"
}
/* 3-10 such objects total */
]
}
RULES:
• Output must be valid, minified JSON.
• Do **not** include answers or any extra keys.
• Do **not** mention these instructions or add other text.
"""
This yields questions like the following:
{
"filename":"Req.ArchiveError.html",
"question":"I'm calling Req.Steps.decode_body(response) on a potentially corrupted
archive response and it suddenly raises Req.ArchiveError.
Req.Steps.decode_body(response)
Why am I getting a Req.ArchiveError while trying to unpack the archive?"},
{"filename":"Req.ArchiveError.html",
"question":"When I run Req.Steps.decode_body(resp) with a tar.gz file, the function
throws a Req.ArchiveError after failing to unpack it.\nReq.Steps.decode_body(resp)\n
What causes Req.ArchiveError to be returned from decode_body in this scenario?"},
{"filename":"Req.ArchiveError.html",
"question":"I pass an HTTP response body to
Req.Steps.decode_body, hoping it will uncompress automatically, but it raises
Req.ArchiveError instead.\nReq.Steps.decode_body(my_response)\n
How do I handle Req.ArchiveError triggered by decode_body when unpacking an archive?"}
(...)
The questions are realistic enough in that they have the structure of a user's question, but each should ideally yield only one matching file.
The Results
The results look like the following row:
query | target file | ts_rank | trgm | jina_v2_code | nomic_embed_text | all_minilm_l6_v2 | mxbai_embed_large |
---|---|---|---|---|---|---|---|
I tried calling 'HTTP.get(\"/docs/missing_page\")' but I'm getting a 'Page not found' error message.\nHTTP.get(\"/docs/missing_page\")\nWhy am I getting 'Page not found', and what should I do to locate the correct documentation page? | 404.html | | | | | | |
They all rank matches in order of importance. Some of them are exact matches, others are just background noise. We now need to establish a way to tell them apart. This sounds like the perfect cliffhanger, and this post is already quite long.
77 more examples of the file rankings that the different models produce
query | target file | ts_rank | trgm | jina_v2_code | nomic_embed_text | all_minilm_l6_v2 | mxbai_embed_large |
---|---|---|---|---|---|---|---|
I tried calling 'HTTP.get(\"/docs/missing_page\")' but I'm getting a 'Page not found' error message.\nHTTP.get(\"/docs/missing_page\")\nWhy am I getting 'Page not found', and what should I do to locate the correct documentation page? | 404.html | | | | | | |
When I visit 'mysite.com/api/v1/unknown', I see a message that the page does not exist.\nmysite.com/api/v1/unknown\nHow can I find the correct URL or documentation if the page I'm looking for isn't there? | 404.html | | | | | | |
I tried searching for a method doc by name but ended up on an error page saying it doesn't exist.\nsearch_method('fooBarMethod')\nWhere do I go after seeing the page not found message to locate the doc for the method I'm looking for? | 404.html | | | | | | |
I'm not sure how to proceed after seeing a 404 error while looking up 'someRandomEndpoint' in the docs.\nHTTP.get(\"/docs/someRandomEndpoint\")\nShould I use the search sidebar or the API reference to find what I'm looking for, and how does it help? | 404.html | | | | | | |
I'm calling Req.Steps.decode_body(response) on a potentially corrupted archive response and it suddenly raises Req.ArchiveError.\nReq.Steps.decode_body(response)\nWhy am I getting a Req.ArchiveError while trying to unpack the archive? | Req.ArchiveError.html | | | | | | |
When I run Req.Steps.decode_body(resp) with a tar.gz file, the function throws a Req.ArchiveError after failing to unpack it.\nReq.Steps.decode_body(resp)\nWhat causes Req.ArchiveError to be returned from decode_body in this scenario? | Req.ArchiveError.html | | | | | | |
I pass an HTTP response body to Req.Steps.decode_body, hoping it will uncompress automatically, but it raises Req.ArchiveError instead.\nReq.Steps.decode_body(my_response)\nHow do I handle Req.ArchiveError triggered by decode_body when unpacking an archive? | Req.ArchiveError.html | | | | | | |
I'm testing a custom archive through Req.Steps.decode_body(response), but the function aborts with Req.ArchiveError whenever I attempt to decode it.\nReq.Steps.decode_body(response)\nWhy does decode_body return Req.ArchiveError for my custom archive format? | Req.ArchiveError.html | | | | | | |
I'm trying to confirm data integrity with Req.Steps.checksum/1, but when the computed value doesn't match, I see an exception.\nReq.Steps.checksum!(source_data)\nWhat does Req.ChecksumMismatchError represent in this context? | Req.ChecksumMismatchError.html | | | | | | |
I used Req.Steps.checksum/1 to validate a downloaded file's SHA256, but the app crashed with Req.ChecksumMismatchError.\nReq.Steps.checksum(file_path, expected: "abc123...")\nWhy is this error triggered during the checksum step? | Req.ChecksumMismatchError.html | | | | | | |
When I run Req.Steps.checksum/1 on my data, I'm sometimes getting Req.ChecksumMismatchError in production.\nReq.Steps.checksum(data_bytes, expected: "def456...")\nWhat causes a checksum mismatch with this function? | Req.ChecksumMismatchError.html | | | | | | |
I'm using Req.Steps.checksum/1 to verify a binary stream, but I keep encountering Req.ChecksumMismatchError.\nReq.Steps.checksum(stream_data, expected: "xyz789...")\nHow does this exception indicate the checksums differ? | Req.ChecksumMismatchError.html | | | | | | |
When I call Req.Steps.decompress_body/1 on invalid data, I get a Req.DecompressError.\nReq.Steps.decompress_body/1\nWhy does Req.DecompressError occur when attempting to decompress a response? | Req.DecompressError.html | | | | | | |
I'm using Req.Steps.decompress_body/1 to handle a compressed response, but I'm seeing a Req.DecompressError in my logs.\nReq.Steps.decompress_body/1\nWhat causes a Req.DecompressError with a compressed response in Req? | Req.DecompressError.html | | | | | | |
After calling Req.Steps.decompress_body/1 with a broken gzip file, the function raises Req.DecompressError.\nReq.Steps.decompress_body/1\nHow can I resolve a Req.DecompressError that occurs during decompression? | Req.DecompressError.html | | | | | | |
I'm calling | Req.HTTPError.html | | | | | | |
I created a custom Req adapter for an internal API and I'm raising | Req.HTTPError.html | | | | | | |
When I see | Req.HTTPError.html | | | | | | |
After upgrading Req, I'm seeing a standardized error message: | Req.HTTPError.html | | | | | | |
I'm catching | Req.HTTPError.html | | | | | | |
In the follow-up post of this series, we will see how a common ground truth can be established and how we can evaluate the accuracy, precision, recall and f1 score of the different models. We will then use this toolset to introduce techniques to boost the search results across all mechanisms.
Elixir is an excellent choice for these applications due to its scalability, fault tolerance, and concurrency model. Its lightweight processes and message-passing architecture make it ideal for orchestrating complex AI workflows efficiently. bitcrowd's first Elixir ML project dates back to 2020, and we have since enabled various clients to build and scale their AI projects.
bitcrowd is an excellent choice if you need a scalable RAG system or a fully integrated AI pipeline. We help you build, optimize, and maintain it with a focus on reliability and performance.
Drop us a line via email if you want to build your next AI project with Elixir. Or book a call with us to discuss your project.