Ask a question to IRS.gov

What is RAG?

RAG stands for Retrieval-Augmented Generation, a technique for improving the answers of large language models (LLMs). LLMs generate text based on an input. Most models are trained on a general dataset, so an LLM cannot know anything beyond what it was trained on. There are two main solutions to this issue: train the model further on new data (fine-tuning), or feed the knowledge to the LLM as input (RAG). In RAG, the LLM (the generator) is augmented by retrieving data from a knowledge base.

RAG can be summarized in 4 steps:

Vector embed the content
Vector embed the user query
Get the closest vectors to the user query
Feed the LLM the related content with the user’s query
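
As a rough illustration of these four steps, here is a minimal Python sketch. It is not the code used in this project: the embed function is a toy bag-of-words counter standing in for a real embedding model, and the sample pages, function names, and prompt wording are all made up for the example.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, index: dict[str, Counter], k: int = 2) -> list[str]:
    """Steps 2 and 3: embed the query, then return the k closest documents."""
    query_vec = embed(query)
    ranked = sorted(
        index,
        key=lambda doc: cosine_similarity(query_vec, index[doc]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Step 4: combine the retrieved content with the user's query."""
    joined = "\n".join(f"- {doc}" for doc in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"


# Hypothetical sample content, one string per page.
pages = [
    "Flood insurance covers damage caused by rising water.",
    "Surety bonds include notary bonds and contractor bonds.",
    "Workers compensation covers employees injured on the job.",
]
index = {page: embed(page) for page in pages}  # Step 1: embed the content

question = "What surety bonds are there?"
prompt = build_prompt(question, retrieve(question, index, k=1))
```

The resulting prompt is what would be sent to the LLM. In a real setup the embedding and nearest-vector search are handled by the vector database rather than by hand.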

We will be implementing RAG in Drupal in order to “talk” to the content inside. This can be very helpful for LLM site assistants that need lots of context in order to be useful. It can also be useful for general semantic search, since that is essentially the first half of RAG.

Limitations

Since we have opted for a fully self-hosted implementation, we are limited in our choice of LLMs. The GPU in our current server can support at most an 8B (8 billion parameter) model.
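
A back-of-the-envelope calculation shows why model size is tied to GPU memory. The sketch below only counts memory for the weights, using the 10GB card named in the Implementation section. The precision levels (16, 8, and 4 bits per weight) are assumptions for illustration, since this write-up does not state which quantization was used.

```python
def weight_memory_gb(parameters: float, bits_per_weight: int) -> float:
    """Approximate memory for the model weights alone, in decimal gigabytes."""
    return parameters * bits_per_weight / 8 / 1e9


PARAMETERS = 8e9  # an 8B model
VRAM_GB = 10      # RTX 3080 10GB

# Assumed precisions; the actual quantization used is not stated in this post.
for bits in (16, 8, 4):
    needed = weight_memory_gb(PARAMETERS, bits)
    headroom = VRAM_GB - needed
    print(f"{bits}-bit weights: {needed:.0f} GB needed, {headroom:.0f} GB headroom")
```

Under these assumptions an 8B model needs about 16 GB at 16-bit precision, which exceeds the card, and about 8 GB or 4 GB at 8-bit or 4-bit precision. Whatever headroom remains also has to hold the context (KV cache) and runtime overhead, which this estimate ignores.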

To properly timebox ourselves, we stuck to a simple implementation of content embedding without any form of chunking (breaking the content into semantically related pieces).

Implementation

To get full RAG this project implements:

LLM Host (Ollama) I have self-hosted Ollama on a server with a dedicated GPU (Nvidia RTX 3080 10GB).

ChromaDB (ChromaDB) There is no official PHP API for accessing ChromaDB, so I will be using their recommended Python API and giving Drupal access through a REST API built with FastAPI (a Python framework for building REST APIs).

Drupal AI Contrib Module (Drupal AI) This is the official Drupal module that combines several important AI modules for other plugins to use.

Custom Drupal Module Our custom Drupal module ties the above items together to fully implement RAG:

Each page’s text gets sent to ChromaDB to be embedded and stored.
User queries get embedded and compared to documents in ChromaDB to get related content.
The user query is fed to the LLM along with the contextual documents.
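
The last step can be sketched as follows. This is an illustrative Python version, not the project's Drupal (PHP) code. It builds a request for Ollama's /api/generate endpoint on its default local port. The model name, the prompt wording, and the function names are assumptions made up for this example.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local address
MODEL_NAME = "llama3:8b"  # hypothetical; this post does not name the model used


def format_rag_prompt(question: str, documents: list[str]) -> str:
    """Place the retrieved documents ahead of the user's question."""
    context = "\n\n".join(
        f"Document {number}:\n{text}"
        for number, text in enumerate(documents, start=1)
    )
    return (
        "Answer the question using only the documents below.\n\n"
        f"{context}\n\nQuestion: {question}"
    )


def build_generate_request(question: str, documents: list[str]) -> urllib.request.Request:
    """Build (but do not send) the HTTP request for a single, non-streamed answer."""
    payload = {
        "model": MODEL_NAME,
        "prompt": format_rag_prompt(question, documents),
        "stream": False,  # ask for one JSON object instead of a token stream
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question: str, documents: list[str]) -> str:
    """Send the request to a running Ollama server and return the generated text."""
    with urllib.request.urlopen(build_generate_request(question, documents)) as response:
        return json.loads(response.read())["response"]
```

Calling ask requires a running Ollama server with the named model pulled. The documents argument would be the page texts returned by the ChromaDB query in the previous step.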

Testing

Our implementation of RAG performs exceptionally well given the limitations of our system. To judge its quality, I will rate the LLM’s answers on 2 criteria (on a scale of 0 - 5):

Quality: Does the response clearly answer the question?
Hallucination: Does the response make up the answer? (A higher score means more made-up content.)

Question #1

What types of insurance policies do you offer?

This is a very broad question where several pages could be relevant context.

ANSWER: Main Street America insurance provides a variety of insurance policies to meet the needs of our clients. In addition to personal auto coverage, we also offer homeowners, renters, commercial, and life insurance options. Our independent insurance agents can assist you in choosing the best combination of coverage and service for your unique situation. Some of the coverages we provide include property, liability, workers’ compensation, umbrella, disability income, long-term care, health, dental, vision, and annuities. Contact us today to learn more about our policies and find an agent near you.

RATINGS:

Quality: 4 - We are immediately given the list of possible insurance policies. It reads a bit like an advertisement instead of an answer.
Hallucination: 2 - Toward the end of the response we get some made-up coverages. For example, I cannot find annuities in any of the context documents.

Question #2

What types of surety bonds are there?

This question dives into specific content on a single page and is meant to test how well the LLM parses its context.

ANSWER:

Notary Public Bond: Required for notaries to legally notarize documents in some states.
Business Services Bond: Also known as a third-party dishonesty bond, this is necessary for businesses that perform services inside other people’s properties or premises.
ERISA Bond: Mandatory for all employee benefit plans under the Employee Retirement Income Security Act (ERISA). Protects against losses caused by fraud or dishonesty on part of plan officials or fiduciaries.
License and Permit Bonds: Required for various types of licenses or permits in some cities, counties or states.
Contractor Bonds: Usually needed for contractors working on a new job in a different city or state to ensure compliance with the local laws.
Probate Bond: May be required by a court for individuals appointed as an administrator, conservator, or guardian of minors or incapacitated adults.

RATINGS:

Quality: 5 - No faults here: a direct answer describing each bond.
Hallucination: 3 - The response was pretty hit-and-miss here, with some bonds either made up or unrelated.

Question #3

Are drunk workers covered by my workers comp?

Another very specific question. It is answered directly on one page, and the answer is NO: it is not covered.

ANSWER #1: Workers compensation insurance typically covers employees who are injured on the job, including those who become impaired by alcohol or drugs…

RATINGS (answer #1):

Quality: 2 - The answer was very general, as if the model didn’t know, and it was completely wrong.
Hallucination: 5 - I have left an alternate response here to show that sometimes (it seems about 50/50) it will give the correct answer.

Question #4

My house was marked as a flood zone, should I get flood insurance?

ANSWER: Yes, if your house has been designated as being in a flood zone, you should definitely consider getting flood insurance…

RATINGS (0-5):

Quality: 5 - Bravo! An immediate answer followed up by justification and information from the content.
Hallucination: 0 - Answers have been quite consistent.

Conclusion

Embedding is critical

How content is embedded plays a critical role in good RAG. In our current implementation each page’s entire content is sent to ChromaDB as a single document. This is dead simple to implement but degrades the relevance of the retrieved content: a whole page likely covers several topics within a subject, many of which are unrelated to a user’s query about one specific topic. The solution is to break the page contents into semantically related pieces (called chunking). This also helps with the information-overload issue described next, as there is less context to parse.
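
To give a feel for what chunking could look like, here is a minimal Python sketch. It is not part of the current implementation. It uses paragraph boundaries and a character budget as a cheap stand-in for truly semantic splitting, and the function name and the max_chars default are arbitrary choices for the example.

```python
def chunk_page(text: str, max_chars: int = 500) -> list[str]:
    """Group consecutive paragraphs into chunks, aiming for at most max_chars each.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if not current or len(candidate) <= max_chars:
            # Still within budget (or starting a new chunk): keep growing it.
            current = candidate
        else:
            # Adding this paragraph would exceed the budget: close the chunk.
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be stored as its own document, ideally with the source page recorded alongside it so that answers can still be traced back to a page.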

Context amount needs to be tailored to the LLM

In our implementation I have limited the context fed to the LLM to 5 documents. That is, I have limited it to 5 WHOLE pages. These pages are highly relevant, but the LLM still appears to have difficulty extracting good information from all this content. When experimenting with question #2 (surety bonds), I found that when I tightened the relevance cutoff until the surety bonds page was the only context, answers were consistently perfect with no hallucination. I believe the solution to this (aside from chunking) would be something like a relevance difference filter.
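
One possible reading of a relevance difference filter is sketched below in Python: keep only the documents whose distance to the query is close to that of the best match, instead of always passing a fixed number of pages. This is an interpretation, not something that has been built or tested here. The function name, the max_gap threshold, and the sample distances are invented for illustration.

```python
def filter_by_relevance_gap(
    results: list[tuple[str, float]], max_gap: float = 0.1
) -> list[str]:
    """Keep documents whose distance is within max_gap of the best match.

    results holds (document, distance) pairs, where a smaller distance
    means the document is closer to the query.
    """
    if not results:
        return []
    ranked = sorted(results, key=lambda pair: pair[1])
    best_distance = ranked[0][1]
    return [doc for doc, distance in ranked if distance - best_distance <= max_gap]


# Made-up distances for a surety bonds query.
results = [
    ("contractor page", 0.45),
    ("surety bonds page", 0.20),
    ("home page", 0.50),
]
only_best = filter_by_relevance_gap(results, max_gap=0.1)
```

With these sample numbers only the surety bonds page survives, which mirrors the single-page context that gave the best answers for question #2. A suitable threshold would have to be found by experiment.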

Quality budget RAG is very possible!

The current (very basic) implementation already works quite well with a mid-range LLM (8 billion parameters) on a relatively inexpensive machine. With the changes above implemented, I believe perfection can be achieved here!

Extra Notes

In future iterations I would like to switch from Ollama to LM Studio for its ability to use any Hugging Face model. There are several models that seem to be fine-tuned for RAG purposes, none of which are available in Ollama.