The Tech Buffet #6: Why Your RAG is Not Reliable in Production
And how you should tune it properly
Hello there, I'm Ahmed. I write The Tech Buffet to share weekly practical tips on productionizing ML systems. Join the family (+400) and subscribe!
With the rise of LLMs, the Retrieval Augmented Generation (RAG) framework has also gained popularity, because it makes it possible to build question-answering systems over custom data.
We've all seen those demos of chatbots conversing with PDFs or emails.
While these systems are certainly impressive, they might not be reliable in production without tweaking and experimentation.
In this issue, we explore the problems behind the RAG framework and go over some tips to improve its performance.
These findings are based on my experience as an ML engineer who's still learning about this tech.
RAG in a nutshell
Let's get the basics right first.
Here's how RAG works:
It first takes an input question and retrieves relevant document chunks from an external database. Then it passes those chunks as context in a prompt to help an LLM generate an augmented answer.
That's basically saying:
"Hey LLM, here's my question, and here are some pieces of text to help you understand the problem. Give me an answer."
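To make that "Hey LLM" message concrete, here is a minimal sketch of how the question and the retrieved chunks could be glued into one prompt. The function name `build_prompt`, the instruction wording, and the sample chunks are all my own assumptions for illustration, not a template from any particular library.

```python
from __future__ import annotations


def build_prompt(question: str, chunks: list[str]) -> str:
    """Combine the user question and the retrieved chunks into one prompt string."""
    # Number each chunk so it is easy to see which piece of text the model was given.
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below. "
        "If the context is not enough, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical chunks, standing in for whatever the retriever returned.
prompt = build_prompt(
    "What is the refund window?",
    [
        "Refunds are accepted within 30 days of purchase.",
        "Shipping takes 3 to 5 business days.",
    ],
)
print(prompt)
```

The resulting string is what actually gets sent to the LLM, so the wording of the instruction and the order of the chunks are both things you can tune.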
You should not be fooled by the simplicity of this diagram.
In fact, RAG hides a certain complexity and involves the following components behind the scenes:
Loaders to parse external data in different formats: PDFs, websites, Doc files, etc.
Splitters to chunk the raw data into smaller pieces of text
An embedding model to convert the chunks into vectors
A vector database to store the vectors and query them
A prompt to combine the question and the retrieved documents
An LLM to generate the answer
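To see how these components fit together, here is a toy sketch in plain Python. Everything in it is a stand-in I made up for illustration: the hard-coded `raw_document` string plays the role of a loader's output, `embed` is a simple word-count vector rather than a trained embedding model, `InMemoryVectorStore` is a list instead of a real vector database, and `fake_llm` does not call any model.

```python
import math
import re
from collections import Counter


def split_text(text: str, chunk_size: int = 12) -> list:
    """Splitter: cut raw text into chunks of at most `chunk_size` words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def embed(text: str) -> Counter:
    """Toy embedding: a sparse word-count vector (a real system uses a trained model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class InMemoryVectorStore:
    """Stand-in for a vector database: keeps (vector, chunk) pairs in a list."""

    def __init__(self):
        self.items = []

    def add(self, chunks):
        for chunk in chunks:
            self.items.append((embed(chunk), chunk))

    def query(self, question, k=2):
        # Rank every stored chunk by similarity to the question and keep the top k.
        q = embed(question)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[0]), reverse=True)
        return [chunk for _, chunk in ranked[:k]]


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call; it only reports how much text it received."""
    return f"(model answer based on a prompt of {len(prompt)} characters)"


# Indexing: load -> split -> embed -> store
raw_document = (
    "Refunds are accepted within 30 days of purchase. "
    "Shipping takes 3 to 5 business days. "
    "Support is available by email on weekdays."
)
store = InMemoryVectorStore()
store.add(split_text(raw_document, chunk_size=8))

# Retrieval and generation: embed the question -> fetch chunks -> prompt -> LLM
question = "Within how many days are refunds accepted?"
retrieved = store.query(question, k=1)
prompt = f"Context:\n{retrieved[0]}\n\nQuestion: {question}\nAnswer:"
print(retrieved)
print(fake_llm(prompt))
```

Even in this toy version, the chunk size, the embedding, the similarity measure, and the number of retrieved chunks `k` are all separate knobs, and each one can change which text ends up in front of the LLM.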
If you like diagrams, here's another for you.
A tad more complex but it illustrates the indexing and retrieval processes.
Phew!
Don't worry though, you can still prototype your RAG very quickly.
Frameworks like LangChain abstract away most of the steps involved in building a RAG system, which makes prototyping easy.
How easy is that? 5-line-of-code easy.
Of course, there's always a problem with the apparent simplicity of such code snippets: depending on your use case, they don't always work as-is and need very careful tuning.
They're great as quickstarts but they're certainly not a reliable solution for an industrialized application.
The problems with RAG
If you start building RAG systems with little to no tuning, you may be surprised.
Here's what I noticed during the first weeks of using LangChain and building RAGs.