A RAG for Elixir | bitcrowd blog

June 12, 2024 · 11 min read

Christoph Beck

Head of Intergalactic Mischief

Max Mulatz

Whitespace wrestler on a functional fieldtrip

Joshua Bauer

10-finger penguin-masseur

Abstract

This is the second part of a series of blog posts on using a RAG (Retrieval Augmented Generation) information system for your codebase. Together we explore how this can empower your development team. Check out the first post for an introduction into the topic if you haven't already.

In this episode we explore how we can adapt our RAG system for Ruby codebases from the first episode to read and understand Elixir code. We will take a look at LangChain and text "splitting" and "chunking".

Let's dive right into it.

Background

Our RAG system was built with the idea to discover Ruby codebases. In order to have conversations about Elixir codebases as well, we need to make sure our LLM "understands" Elixir code. This is where LangChain comes into play.

LangChain is a toolkit around all things LLMs, including RAG. We use it to parse our documents or codebase and generate a vector database from it. In our simple RAG system, we specify which file endings (.rb) and which programming language (Ruby) our documents have.

The ingestion of programming source code into an LLM with LangChain was initially only supported for Python, C and a few other languages. Then this issue proposed using a parser library like Tree-sitter to make it easier to add support for many more languages. The discussion is worth reading.

Finally, this pull request introduced support for a lot more languages, including Ruby, based on this proposal. It was a school project:

I am submitting this for a school project as part of a team of 5. Other team members are @LeilaChr, @maazh10, @Megabear137, @jelalalamy. This PR also has contributions from community members @Harrolee and @Mario928.

Our plan is to use this as a starting point to enable basic parsing of Elixir source code with LangChain. With that, we should be able to have conversations with our RAG system about Elixir codebases as well.

Splitting / Chunking text

To read our Elixir codebase, the parser needs some rules on where to split the provided source code files. Generally for RAG, when ingesting (reading in) a text file, PDF, etc., the system will try to split it into chunks, ideally along its semantic meaning. In text documents, meaning is often grouped along:

chapters
paragraphs
sentences
words

If your embedding model has enough context capacity, you would try to split along chapters or paragraphs, because human readable text often groups meanings that way. If those are too big, you would try to break between sentences, and, as a last resort, words. One would generally try to avoid splitting inside words. Take for instance "sense" and "nonsense", which carry quite a different meaning.
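The fallback order described above can be sketched in a few lines of plain Python. This is a minimal illustration, not LangChain's splitter: the name `split_text`, the `SEPARATORS` list and the assumption that chapters are separated by two blank lines are ours. Unlike a production splitter, it also does not merge small neighbouring pieces back together.

```python
import re

# Separators ordered from coarsest to finest. We assume chapters are
# separated by two blank lines and paragraphs by one blank line.
SEPARATORS = ["\n\n\n", "\n\n", r"(?<=[.!?])\s+", r"\s+"]


def split_text(text, max_chars, separators=SEPARATORS):
    """Split text into chunks of at most max_chars, preferring coarse boundaries."""
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []
    if not separators:
        # Last resort: a single word is still too long, so we cut inside it.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    pieces = [p for p in re.split(separators[0], text) if p.strip()]
    chunks = []
    for piece in pieces:
        # Pieces that are still too big get split again with finer separators.
        chunks.extend(split_text(piece, max_chars, separators[1:]))
    return chunks


if __name__ == "__main__":
    sample = "First paragraph. It has two sentences.\n\nSecond one."
    for chunk in split_text(sample, max_chars=25):
        print(repr(chunk))
```

With a generous `max_chars` the text stays in one piece. The smaller the limit gets, the further down the separator list the function has to go.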

Splitting / Chunking code

Embedding code is a bit underdeveloped, but the strategy is to break the code into pieces by inserting new lines so that it looks a bit more like natural text, and then let the embedding model deal with the task of making sense (inferring meaning) from it. Interestingly, the models trained on that task do that surprisingly well.

As mentioned, LangChain has dedicated document loaders for source code and a guide on how to add new ones based on Tree-sitter. So we went ahead and implemented a document loader and parser for Elixir source code in LangChain. It only covers the core basics of the language, but it was already enough for our proof-of-concept RAG application. With LangChain now supporting Elixir out of the box, people can use the parser in a variety of scenarios and will come up with ways to improve it to fit more use cases. Our implementation is only the groundwork. You can have a look at the PR if you're interested in what's necessary to add parsing support for another programming language in LangChain. Spoiler: not much if you can utilize Tree-sitter.

The core of LangChain's programming language parsers based on Tree-sitter is their CHUNK_QUERY. For our Elixir parser it looks like this:

CHUNK_QUERY = """
    [
        (call target: ((identifier) @_identifier
            (#any-of? @_identifier "defmodule" "defprotocol" "defimpl"))) @module
        (call target: ((identifier) @_identifier
            (#any-of? @_identifier "def" "defmacro" "defmacrop" "defp"))) @function
        (unary_operator operator: "@" operand: (call target: ((identifier) @_identifier
              (#any-of? @_identifier "moduledoc" "typedoc" "doc")))) @comment
    ]
""".strip()

We are using Tree-sitter's own tree query language here. Without diving into the details, our query makes sure to distinguish top level modules, functions and comments. The document loader will then take care of loading each chunk into a separate document and split the lines accordingly. The approach is the same for all programming languages.
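To get a feeling for what the three captures select, here is a rough line-based approximation in plain Python. It is not how Tree-sitter or LangChain work: the real query matches nodes in the syntax tree, while this sketch only looks at the first word of each line. The names `CAPTURES`, `classify_line` and `find_chunk_starts` are made up for this illustration.

```python
# Keyword groups mirroring the three captures of the CHUNK_QUERY above.
CAPTURES = {
    "module": ("defmodule", "defprotocol", "defimpl"),
    "function": ("def", "defmacro", "defmacrop", "defp"),
    "comment": ("@moduledoc", "@typedoc", "@doc"),
}


def classify_line(line):
    """Return the capture name whose keyword starts this line, or None."""
    words = line.split()
    first_word = words[0] if words else ""
    for capture, keywords in CAPTURES.items():
        if first_word in keywords:
            return capture
    return None


def find_chunk_starts(source):
    """List (line number, capture name) for every line that starts a chunk."""
    starts = []
    for number, line in enumerate(source.splitlines(), start=1):
        capture = classify_line(line)
        if capture is not None:
            starts.append((number, capture))
    return starts


if __name__ == "__main__":
    elixir_source = '''defmodule Greeter do
  @moduledoc "Says hello."

  @doc "Greets a person."
  def hello(name), do: "Hello, #{name}!"

  defp secret, do: :hidden
end
'''
    for number, capture in find_chunk_starts(elixir_source):
        print(number, capture)
```

A word-based check like this stumbles over multi-line strings or keywords inside comments, which is exactly the kind of problem a real parser such as Tree-sitter avoids.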

Test drive

Let's take this for a spin in our RAG system scripts from episode one of this series.

The scripts used are in the bitcrowd rag_time repo on GitHub.

Just as a refresher: the idea is to have a RAG system for your team's codebase using LLMs locally without exchanging any data with third parties like OpenAI and the like. It includes a conversational AI built with Chainlit, so that members of the team can "chat" with the LLM about the codebase, for instance to get information about the domain or where to find things for the ticket they are working on.

For testing purposes we will use our RAG system on a popular open source Elixir package, the Phoenix Framework.

Get the RAG ready

First we need to get our local RAG system ready for operating on an Elixir codebase. It needs to know:

Where is the code?
Which programming language is it?
Which suffixes do the source code files have?

We provide this information via environment variables in a .env file:

OLLAMA_MODEL="llama3:8b"
CODEBASE_PATH="./phoenix"
CODEBASE_LANGUAGE="elixir"
CODE_SUFFIXES=".ex, .exs"

We just cloned the current state of the Phoenix Git repository right next to our RAG code. We also keep using Meta's Llama3 model, and instruct the document loader to look at Elixir files.
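How the scripts turn these variables into loader settings is not shown here, so the following is only a sketch of one plausible way to do it. The function name `read_code_settings` and the fallback values are assumptions. It expects the variables to already be present in the environment, for example after a tool has loaded the .env file.

```python
import os


def read_code_settings(env=os.environ):
    """Collect the codebase settings from a mapping of environment variables."""
    raw_suffixes = env.get("CODE_SUFFIXES", "")
    # ".ex, .exs" becomes [".ex", ".exs"]; empty entries are dropped.
    suffixes = [s.strip() for s in raw_suffixes.split(",") if s.strip()]
    return {
        "path": env.get("CODEBASE_PATH", "."),
        "language": env.get("CODEBASE_LANGUAGE", ""),
        "suffixes": suffixes,
    }


if __name__ == "__main__":
    example_env = {
        "CODEBASE_PATH": "./phoenix",
        "CODEBASE_LANGUAGE": "elixir",
        "CODE_SUFFIXES": ".ex, .exs",
    }
    print(read_code_settings(example_env))
```

Stripping whitespace around each suffix matters here, because the value is written with a space after the comma.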

info

At the time of testing, our PR on LangChain was not released yet. So we pointed to our fork's local code for the langchain, langchain-community and langchain-text-splitters Python packages from the requirements.txt file.

For the sake of simplicity we assume the project documentation to be in Markdown and hard-coded this information into our code ingestion logic.

With that, we can set up our vector database for the Phoenix codebase:

python ingest-code.py
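Conceptually, the vector database stores one vector per chunk and later returns the chunks whose vectors are closest to the vector of a question. The toy sketch below shows that idea with word counts and cosine similarity. It is an illustration only: the real system uses an embedding model and a proper vector store, and the names `embed`, `cosine` and `retrieve` are our own.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=1):
    """Return the k chunks most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vector, embed(c)), reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    chunks = [
        "def recycle conn headers",
        "def build_conn",
        "defmodule Phoenix.Router",
    ]
    print(retrieve("recycle conn", chunks))
```

A real embedding model also places related terms near each other, while this word-count version only rewards exact word overlap.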

Then we start the chat bot:

chainlit run main.py

Chatbot: Hi, Welcome to Granny RAG. Ask me anything about your code!

Now we are ready to have a conversation about the codebase.

Ask questions

Let’s take an original PR from Phoenix to evaluate the quality of our solution, similar to what we did with Ruby in the previous post.

The pull request fixes a small issue in the recycle/1 function of the Phoenix.ConnTest module at phoenix/lib/phoenix/test/conn_test.ex:

Screenshot of a pull request

Currently, if you set remote_ip during a test when building a conn, calling a dispatch call e.g get will run the conn through recycle which will lose this information. This PR changes that so it is maintained similar to host.

We reset our clone of the Phoenix repository (and our vector database) to the state right before the PR was merged and then ask the RAG system for help with the issue:

Understand the problem

We: Please help me understand the following briefing:
Currently, if you set `remote_ip` during a test when building a `conn`,
calling > a dispatch call, e.g `get`, will run the `conn` through `recycle`
which will > lose this information. Instead, we want `the remote_ip`
maintained, similar to `host`.
Can you see why remote_ip is not preserved in test cases where recycle is used?

Screenshot of briefing question

Chatbot: The briefing explains that when running a test with Phoenix's Conn and Endpoint frameworks, the remote_ip information is lost during the test. This happens because of the way Phoenix's Conn.recycle/1 function works. Here's what's happening:

In your tests, you create a new conn with a specific IP address (e.g., remote_ip = "192.168.0.100").

You then dispatch an HTTP request to this connection using something like get("/").

When you do this, Phoenix's Conn framework calls the Conn.recycle/1 function behind the scenes.

This recycle function is designed to clean up and reset the connection for subsequent requests.

However, as part of this cleanup process, it loses the previously set remote_ip information.

The reason why host is preserved but not remote_ip is that Phoenix's Conn framework has a special handling for the host attribute. When you set the host attribute, it gets stored in the connection's internal state. This means that when you recycle the connection, the host information is still retained.

On the other hand, remote_ip is not preserved because it's not explicitly stored in the connection's internal state. Instead, it's just a temporary value set during the test execution. To maintain the remote_ip information across tests, you can create a custom Conn module that overrides the default behavior of Conn.recycle/1. This would involve storing the original remote_ip value and restoring it after recycling the connection.

In summary, remote_ip is lost in test cases because Phoenix's Conn.recycle/1 function doesn't explicitly preserve this information. To keep remote_ip intact, you'll need to create a custom Conn module that overrides the default behavior of recycle.

Sources: /Users/joshuabauer/code/rag_time/phoenix/test/phoenix/test/conn_test.exs, […]

Screenshot of briefing answer

Implement a solution

This would give us a hint on how to solve the issue in our app if we were "using" the Phoenix framework. But let's say we want to fix the issue "for good" in the framework itself. We now know the Conn.recycle/1 function is what we're interested in. So we look at the listed sources, search for the recycle/1 function in the code and find ConnTest.recycle/1 as the test-conn equivalent to Conn.recycle/1. This is pretty much already the "custom Conn module" the RAG was referring to, and it's only used for tests. After all, we only want to change the behavior for tests. So let's ask the RAG what to do with it:

We: Given the following code from phoenix/lib/phoenix/test/conn_test.ex, what would I need to change to persist conn.remote_ip in the same way as conn.host?
def recycle(conn, headers \\ ~w(accept accept-language authorization)) do
  build_conn()
  |> Map.put(:host, conn.host)
  |> Plug.Test.recycle_cookies(conn)
  |> Plug.Test.put_peer_data(Plug.Conn.get_peer_data(conn))
  |> copy_headers(conn.req_headers, headers)
end

Chatbot: A Phoenix ConnTest question!

To persist conn.remote_ip in the same way as conn.host, you would need to add a line to the recycle/2 function to copy the remote IP address.

Here's how:
def recycle(conn, headers \\ ~w(accept accept-language authorization)) do
  build_conn()
  |> Map.put(:host, conn.host)
  |> Map.put(:remote_ip, conn.remote_ip)  # Add this line
  |> Plug.Test.recycle_cookies(conn)
  |> Plug.Test.put_peer_data(Plug.Conn.get_peer_data(conn))
  |> copy_headers(conn.req_headers, headers)
end
Now, when you recycle a connection, both conn.host and conn.remote_ip will be preserved.

Sources: […]

Screenshot of dialogue about implementation

Compare solution

Looking at the PR's file changes, this is exactly what the person came up with:

screenshot of the PR file changes on GitHub

So the conversation with our Elixir RAG was quite helpful for guiding us through the code and finding an adequate solution to the problem.

Try it yourself!

It is really easy! Just clone our repo, follow the README and tell the script where to find your codebase:

CODEBASE_PATH="./path-to-my-elixir-codebase"
CODEBASE_LANGUAGE="elixir"
CODE_SUFFIXES=".ex, .exs"

We kept the scripts basic, so that they are easy to understand and extend. Depending on your codebase, the results might not always be perfect, but they are often surprisingly good.

Outlook

In this post we saw how we can extend a simple off-the-shelf system to better fit the needs of our dev team. We enabled our RAG system to read and understand Elixir code! Text splitting and chunking is just one possible example of where to start when it comes to adjusting a RAG system for your specific needs. What we got is already quite useful, but it's definitely still lacking precision.

We will explore possibilities for further improvements and fine tuning in the next episodes of this blog post series.

Or, if you canʼt wait, give the team at bitcrowd a shout via granny-rag@bitcrowd.net or book a consulting call here.
