The Tech Buffet #22: Why You Should Consider Weaviate As Your Ultimate Vector Database
Some insights from a 9-month project building RAGs for pharma
Hello everyone, Ahmed here. I write The Tech Buffet to deliver practical insights in machine learning and system design. This is based on my experience as an ML engineer, blogger and open source contributor.
In today’s issue, I explain my motivations for selecting Weaviate as a production-ready vector database to power RAG systems.
A vector database is a critical component in Retrieval Augmented Generation (RAG) applications: besides storing the vectors and the metadata, it efficiently computes similarity measures and powers semantic search.
But picking the right vector database for production-ready RAG applications is no easy task: it requires a close evaluation of cost, query performance, and scalability. This matters even more when you process millions of vectors or need real-time responses.
I'm currently working at a pharma company and I've gone through this decision process. So, in this issue, I'll share with you why I think Weaviate is a good option for managing your vector data and building RAGs, especially if you're a developer with no particular experience in LLMs.
To make this tutorial easy to follow, I'll first load some data from a public API, index it in Weaviate, and experiment with some of its cool features.
Here's the agenda
Load research papers from the Papers With Code public API
Setup Weaviate locally
Index the data in batches
Overview of Weaviate search functionalities:
vector search
keyword search
hybrid search
re-ranking
advanced filtering
Generative search: turning your database into a RAG system
1—Load research papers from the Papers With Code public API 🔌
To experiment with Weaviate features, we first need to ingest some data. Many datasets are available out there.
The one we'll experiment with today consists of research papers about Large Language Models.
We'll extract this dataset from the public API of paperswithcode.
To collect this dataset, you need to run the following function:
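Here's a minimal sketch of such a function, assuming the public /papers/ endpoint of the Papers With Code API and a simple pagination loop (the query string and field handling are illustrative, not the exact code I used):

import requests

def fetch_papers(query: str = "large language models", items_per_page: int = 100) -> list[dict]:
    """Page through the Papers With Code API and return the raw paper records."""
    papers, page = [], 1
    while True:
        response = requests.get(
            "https://paperswithcode.com/api/v1/papers/",
            params={"q": query, "page": page, "items_per_page": items_per_page},
            timeout=30,
        )
        response.raise_for_status()
        payload = response.json()
        papers.extend(payload["results"])
        if payload.get("next") is None:  # no more pages left
            break
        page += 1
    return papers

papers = fetch_papers()
print(len(papers))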
This dataset contains ~11K paper abstracts with the following attributes and metadata:
2—Setup Weaviate locally ⚙️
Weaviate is an open-source, AI-native vector database that helps developers create intuitive and reliable AI-powered applications.
You can use Weaviate in different ways:
Via a docker-compose locally
Via Kubernetes, by deploying it on a cluster
Via Weaviate Cloud Services (WCS), the managed offering
At my company, we deployed Weaviate on Kubernetes but in this tutorial, we’ll use docker-compose for demo purposes.
The setup is straightforward.
Create an empty weaviate_data folder to store Weaviate data on the host
Create a docker-compose file with this configuration (a sketch follows the configuration items below):
Enabled modules: text2vec-palm,reranker-cohere,generative-palm
AUTHENTICATION_APIKEY_ALLOWED_KEYS: 'admin'
AUTHENTICATION_APIKEY_USERS: 'ahmed'
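Here's a minimal docker-compose.yml sketch reflecting that configuration (the image tag and port mappings are assumptions; adjust them to the Weaviate version you run):

version: '3.4'
services:
  weaviate:
    image: semitechnologies/weaviate:1.24.1   # pick the release you need
    ports:
      - 8080:8080     # REST API
      - 50051:50051   # gRPC, used by the v4 Python client
    volumes:
      - ./weaviate_data:/var/lib/weaviate     # persist data in the host folder created above
    environment:
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      ENABLE_MODULES: 'text2vec-palm,reranker-cohere,generative-palm'
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'false'
      AUTHENTICATION_APIKEY_ENABLED: 'true'
      AUTHENTICATION_APIKEY_ALLOWED_KEYS: 'admin'
      AUTHENTICATION_APIKEY_USERS: 'ahmed'
      CLUSTER_HOSTNAME: 'node1'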
Run this command:
docker-compose up
To check that Weaviate started successfully, you can hit this URL
After launching Weaviate, you must create a client to interact with it.
This client can have additional credentials to connect to other third-party cloud services like OpenAI, VertexAI, or Cohere.
Why would a vector database need these connections?
Weaviate can do more than store your vectors and compute similarities.
It can embed the data while indexing it
It can perform document reranking after retrieval
It can perform a RAG operation by answering questions based on search results
and more…
In this tutorial, I'll connect Weaviate to VertexAI to power text embedding and generation and connect it to the Cohere API for the reranking task.
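Here's a sketch of that connection with a recent v4 Python client (the header names follow the PaLM and Cohere module conventions and the environment variables are placeholders; check the docs for your Weaviate version):

import os

import weaviate
from weaviate.auth import AuthApiKey

# Connect to the local instance started with docker-compose and forward
# the third-party credentials Weaviate needs for its modules.
client = weaviate.connect_to_local(
    auth_credentials=AuthApiKey("admin"),  # key listed in AUTHENTICATION_APIKEY_ALLOWED_KEYS
    headers={
        "X-Palm-Api-Key": os.environ["PALM_API_KEY"],      # VertexAI: embeddings + generation
        "X-Cohere-Api-Key": os.environ["COHERE_API_KEY"],  # Cohere: reranking
    },
)
print(client.is_ready())  # True if the instance is reachable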
3—Index the data in batches 🗂️
Before indexing the data into Weaviate, we must create a collection and define the schema.
This is done via the create method, which defines:
The collection name: MLPapers
The properties (or fields)
The vectorizer that'll be used to embed the data while indexing it
The generator that'll be used to generate answers based on retrieved documents (To perform RAG directly in Weaviate)
Here’s the full code:
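A minimal sketch with the v4 Python client (the property names title, abstract, and url, as well as the GCP project ID, are assumptions based on the dataset described above):

from weaviate.classes.config import Configure, DataType, Property

client.collections.create(
    name="MLPapers",
    properties=[
        Property(name="title", data_type=DataType.TEXT),
        Property(name="abstract", data_type=DataType.TEXT),
        Property(name="url", data_type=DataType.TEXT, skip_vectorization=True),
    ],
    # Embeds objects (and later the queries) with VertexAI at indexing/search time.
    vectorizer_config=Configure.Vectorizer.text2vec_palm(project_id="your-gcp-project-id"),
    # Generates answers from retrieved documents (RAG directly in Weaviate).
    generative_config=Configure.Generative.palm(project_id="your-gcp-project-id"),
    # Reranks search results through the Cohere API.
    reranker_config=Configure.Reranker.cohere(),
)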
After the collection is created, you can index the data in batches. This is another useful functionality of Weaviate that speeds up the indexing process.
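A minimal sketch of the batch insertion, assuming the papers list returned by the fetch function above and the property names defined in the collection:

# Retrieve a handle on the collection created above.
collection = client.collections.get("MLPapers")

# The dynamic batcher sizes and sends batches automatically while we insert.
with collection.batch.dynamic() as batch:
    for paper in papers:
        batch.add_object(
            properties={
                "title": paper["title"],
                "abstract": paper["abstract"],
                "url": paper.get("url_abs", ""),
            }
        )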
Once the indexing is complete, we can check the number of vectors using the aggregate method:
collection.aggregate.over_all(total_count=True)
# AggregateReturn(properties={}, total_count=10592)
4—Weaviate search functionalities 🔍
Weaviate has multiple search options. We’ll review some of them in this section.
1. Vector Search
The most obvious search method is based on vector similarity. Here's how to do it:
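A sketch with the v4 client (the query string is only an example):

from weaviate.classes.query import MetadataQuery

response = collection.query.near_text(
    query="fine-tuning of large language models",  # free-text query, embedded by the vectorizer
    limit=3,
    return_metadata=MetadataQuery(distance=True),  # also return the vector distance
)

for obj in response.objects:
    print(obj.properties["title"], obj.metadata.distance)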
Behind the scenes, Weaviate first embeds the text query using the vectorizer, then fetches the similar results from the collection.
The response contains a list of objects and each one has the following structure.
By inspecting the three retrieved documents, we notice that they match the query accurately.
2. Keyword Search
Weaviate handles the keyword search by using the BM25 algorithm.
Keyword search is a simple technique that relies on string matching only.
You can run a keyword search in Weaviate by calling the bm25 method:
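For instance, reusing the example query from the vector search sketch:

response = collection.query.bm25(
    query="fine-tuning of large language models",  # matched with BM25 against the text properties
    limit=3,
)

for obj in response.objects:
    print(obj.properties["title"])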
This gives the following results:
After reading these documents, I found them less relevant than the previous ones. They mention the words “fine-tuning” or “LLMs” but don’t necessarily discuss the process of model fine-tuning.
Keyword search is a useful functionality when you need to match specific terms that an embedding model doesn’t necessarily know or represent well (e.g. a drug nomenclature or an e-commerce product name).
3. Hybrid Search
One feature that I find particularly useful in Weaviate is hybrid search.
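As a quick preview, a hybrid query combines BM25 and vector scores in a single call; here's a sketch (alpha weighs the vector part, and 0.5 is just an illustrative value):

response = collection.query.hybrid(
    query="fine-tuning of large language models",
    alpha=0.5,   # 0 = pure keyword search (BM25), 1 = pure vector search
    limit=3,
)

for obj in response.objects:
    print(obj.properties["title"])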