What is Retrieval Augmented Generation (RAG) for LLMs?
Retrieval-augmented generation (RAG) for large language models (LLMs) aims to improve prediction quality by using an external datastore at inference time to build a richer prompt that includes some combination of context, history, and recent or relevant knowledge. RAG LLMs can outperform LLMs without retrieval by a large margin with far fewer parameters; they can update their knowledge by replacing their retrieval corpora; and they can provide citations that let users easily verify and evaluate the predictions.
RAG integrates information retrieval (or searching) into LLM text generation. It uses the user's input prompt to retrieve external "context" information from a data store, then combines that information with the user-entered prompt to build a richer prompt containing context that would otherwise not have been available to the LLM. Some examples of context information used by RAG include:
real-time context (the weather, your location, etc.);
user-specific information (orders the user has made at this website, actions the user has taken on the website, the user's status, etc.);
relevant factual information (documents not included in the LLM's training data, either because they are private or because they were updated after the LLM was trained).
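The three kinds of context above all end up in the same place: a richer prompt assembled at inference time. The sketch below illustrates this assembly step; the retrieval functions and their return values are hypothetical stand-ins for calls to real data sources (a weather API, an operational database, a vector database).

```python
# Minimal sketch of RAG prompt construction. All three retrieval
# functions are hypothetical placeholders, not a real API.

def get_realtime_context(user_id: str) -> str:
    # e.g., a weather/location service lookup (hypothetical data)
    return "Weather: sunny, 22C. Location: Dublin."

def get_user_context(user_id: str) -> str:
    # e.g., recent orders fetched from an operational data store (hypothetical data)
    return "Recent orders: #1042 (running shoes), #1043 (rain jacket)."

def get_relevant_documents(query: str) -> list:
    # e.g., similarity search against a vector database (hypothetical data)
    return ["Return policy: items may be returned within 30 days."]

def build_rag_prompt(user_id: str, user_prompt: str) -> str:
    # Combine all retrieved context with the user's original prompt.
    context_parts = [get_realtime_context(user_id), get_user_context(user_id)]
    context_parts.extend(get_relevant_documents(user_prompt))
    context = "\n".join(context_parts)
    return "Context:\n" + context + "\n\nUser question: " + user_prompt

print(build_rag_prompt("u123", "Can I return my jacket?"))
```

The final string, not the bare user prompt, is what gets sent to the LLM, which is why the model can answer using information it was never trained on.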
Why is there a need for RAG LLMs?
Pre-trained LLMs (foundation models) do not learn over time, often hallucinate, and may leak private data from the training corpus. To overcome these limitations, there has been growing interest in retrieval-augmented generation, which incorporates a vector database and/or feature store with an LLM to provide context to prompts; such systems are known as RAG LLMs.
What data systems are used for RAG LLMs?
Vector databases are used to retrieve relevant documents using similarity search. Vector databases can be standalone or embedded with the LLM application (e.g., Chroma embedded vector database). When structured (tabular) data is needed, an operational data store, such as a feature store, is typically used. Weaviate is a popular vector database and Hopsworks is a popular feature store; both provide time-unlimited free tiers.
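At its core, the similarity search a vector database performs compares the embedding of the query against the embeddings of stored documents and returns the closest matches. The toy example below shows the idea with made-up 3-dimensional embeddings and cosine similarity; real systems use embedding models producing hundreds or thousands of dimensions and approximate nearest-neighbor indexes.

```python
import math

# Toy corpus: each entry is (embedding, document text).
# The 3-dimensional "embeddings" are invented for illustration.
documents = {
    "doc_weather": ([0.9, 0.1, 0.0], "Tomorrow will be sunny."),
    "doc_returns": ([0.0, 0.2, 0.9], "Items may be returned within 30 days."),
    "doc_shipping": ([0.1, 0.9, 0.2], "Standard shipping takes 3-5 days."),
}

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def nearest_document(query_embedding):
    # Return the text of the document whose embedding is most similar
    # to the query embedding -- the essence of vector-database retrieval.
    return max(
        documents.values(),
        key=lambda item: cosine_similarity(query_embedding, item[0]),
    )[1]

# A query embedding close to the "returns" document:
print(nearest_document([0.05, 0.1, 0.8]))
# -> Items may be returned within 30 days.
```

In a real deployment the query embedding would come from running the user's prompt through the same embedding model used to index the documents.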
How does RAG compare with Fine-Tuning?
Fine-tuning takes a pre-trained LLM and further trains the model on a smaller dataset, often with data not previously used to train the LLM, to improve the LLM’s performance for a particular task.
In contrast, you should use RAG when you are able to augment your LLM prompt with data that was not known to your LLM at the time of training, such as real-time data, personal (user) data, or context information useful for the prompt.
Challenges with RAG
One of the challenges in using a vector database is the stochastic nature of similarity search: how do you know the document you retrieve is actually relevant to the prompt?
In contrast, using structured data in RAG, as in the example figure above, is quite straightforward. In that figure, the user has logged onto the website, so we have the user's ID. With the user's ID, we can retrieve the customer's recent investments as a list of strings that we prepend to the prompt. Looking up a document in the vector database that is relevant to the user's concern about how "rising interest rates affect property" is more challenging. If you query the vector database with "rising interest rates affect", it has a good chance of returning a relevant document; but if you include "property" in the query, it may return a document about property, which is not the context we want if we would like to steer the conversation towards other investment opportunities.
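The structured-data half of this example can be sketched in a few lines. The lookup below is a plain dictionary standing in for a feature store or operational database, and the user ID and investment data are invented; the point is that a primary-key lookup is deterministic, with none of the relevance uncertainty of similarity search.

```python
# Hypothetical structured data keyed by user ID; in practice this
# would be a feature store or operational database lookup.
RECENT_INVESTMENTS = {
    "user_42": ["Government bonds", "REIT index fund", "Tech ETF"],
}

def build_prompt(user_id: str, user_prompt: str) -> str:
    # Deterministic primary-key retrieval: given the logged-in user's ID,
    # fetch their recent investments and prepend them to the prompt.
    investments = RECENT_INVESTMENTS.get(user_id, [])
    context = "Customer's recent investments: " + ", ".join(investments)
    return context + "\n\n" + user_prompt

print(build_prompt("user_42", "How do rising interest rates affect property?"))
```

With this context prepended, the LLM can ground its answer in what the customer actually holds, rather than guessing which documents (if any) happen to be semantically close to the query.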