LLM with Ollama and similarity search with Qdrant, a vector database

Vector database

I have been interested in vector databases. Unlike a relational database, where data is organized into tables with rows and columns, in a vector database, data is represented as vectors in a high-dimensional space.

This approach is particularly suited for machine learning algorithms that operate on vectors, such as similarity search or deep learning.

To illustrate this, let's consider a simple example: a vector database of textual documents.

Each document is represented by a vector in a high-dimensional space, where each dimension corresponds to a feature of the document (e.g., the frequency of a specific word). With this vector representation, it is possible to search for similar documents using vector similarity search algorithms, such as cosine distance.
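As a purely illustrative sketch (a toy bag-of-words model, far simpler than the LLM embeddings used later in this article), two short documents can be turned into word-frequency vectors and compared with cosine similarity:

package main

import (
	"fmt"
	"math"
	"strings"
)

// wordFrequencies builds a naive "bag of words" vector for a document,
// using the given vocabulary as the dimensions of the vector space.
func wordFrequencies(doc string, vocabulary []string) []float64 {
	counts := map[string]float64{}
	for _, word := range strings.Fields(strings.ToLower(doc)) {
		counts[word]++
	}
	vector := make([]float64, len(vocabulary))
	for i, word := range vocabulary {
		vector[i] = counts[word]
	}
	return vector
}

// cosineSimilarity returns a value close to 1 when two vectors point
// in the same direction, and close to 0 when they are unrelated.
func cosineSimilarity(a, b []float64) float64 {
	var dot, normA, normB float64
	for i := range a {
		dot += a[i] * b[i]
		normA += a[i] * a[i]
		normB += b[i] * b[i]
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}

func main() {
	vocabulary := []string{"thomas", "apples", "strawberries", "tree"}
	a := wordFrequencies("Thomas loves eating apples under a tree", vocabulary)
	b := wordFrequencies("Thomas prefers strawberries to apples", vocabulary)
	fmt.Printf("similarity: %.2f\n", cosineSimilarity(a, b))
}

Real embedding models replace these hand-built frequency vectors with learned, much higher-dimensional representations.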

About vector databases

To implement a vector database, one can choose from a variety of technologies. For instance, you can use a solution specialized in vector storage and retrieval, such as Faiss, Milvus, Qdrant, or Pinecone (in SaaS mode).

Alternatively, you can use a NoSQL database like MongoDB, or even PostgreSQL through the pgvector extension, both of which offer features for efficient storage and retrieval of vectors.

Note: pgvector is also available by default on AWS RDS.

In this article, I will use Qdrant, an open-source database written in Rust.

We will see how to represent data as vectors, how to build a vector database, and how to use it to search for similar data.

Using an LLM to Generate Vectors

You can generate your own vectors, or use a Large Language Model (LLM) to benefit from its training and its ability to generate large embedding vectors (4096 dimensions in our example) that reliably capture the given context.

Being a Go user, I naturally chose to use Ollama.ai, which provides a client package to interact with its API.

This tool, based on the famous llama.cpp, allows you to run an LLM locally on your machine. It is available for Linux, macOS, and Windows.

Thus, we can install it on our machine and retrieve a model:

$ ollama pull mixtral:8x7b

Here, I am fetching the Mixtral 8x7B model, which is a Mixture of Experts (MoE) model.

This means that at each layer, for each token, a router network selects two of the eight experts to process the token and combines their outputs additively.
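As a rough, purely illustrative sketch (the real router is a learned neural network; the expert count matches Mixtral's eight, but the scores below are invented), the top-2 routing described above could be pictured like this:

package main

import (
	"fmt"
	"sort"
)

// expertForward stands in for one expert network's computation on a token.
func expertForward(expert int) float64 {
	return float64(expert) // dummy output, for illustration only
}

func main() {
	// Invented router scores for one token across 8 experts.
	scores := []float64{0.05, 0.30, 0.02, 0.25, 0.10, 0.08, 0.15, 0.05}

	// Rank the experts by score and keep the two best ones.
	experts := make([]int, len(scores))
	for i := range experts {
		experts[i] = i
	}
	sort.Slice(experts, func(a, b int) bool { return scores[experts[a]] > scores[experts[b]] })
	top := experts[:2]

	// Combine the two experts' outputs additively, weighted by their scores.
	var combined float64
	for _, e := range top {
		combined += scores[e] * expertForward(e)
	}
	fmt.Printf("experts used: %v, combined output: %.2f\n", top, combined)
}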

We also launch our vector database in parallel using a container:

$ docker run --rm -p 6333:6333 -p 6334:6334 \
    -v $(pwd)/qdrant_storage:/qdrant/storage:z \
    qdrant/qdrant

Let's now instantiate an Ollama client in Go as well as a Qdrant client:

package main

import (
	"bufio"
	"context"
	"fmt"
	"log"
	"os"

	"github.com/jmorganca/ollama/api"
	pb "github.com/qdrant/go-client/qdrant"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// Shared context and a reusable boolean, both used by the snippets that follow
	ctx := context.Background()
	trueValue := true

	// Ollama client
	ollamaClient, err := api.ClientFromEnvironment()
	if err != nil {
		log.Fatalf("unable to create ollama client: %v\n", err)
	}

	// Qdrant client (gRPC port exposed by the container)
	conn, err := grpc.Dial("localhost:6334", grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("did not connect: %v", err)
	}
	defer conn.Close()

	pointsClient := pb.NewPointsClient(conn)
	collectionsClient := pb.NewCollectionsClient(conn)

	// To be continued...
}

Once the clients are initialized, you can create a new collection in the Qdrant database:

	_, err = collectionsClient.Create(ctx, &pb.CreateCollection{
		CollectionName: "test",
		VectorsConfig: &pb.VectorsConfig{
			Config: &pb.VectorsConfig_ParamsMap{
				ParamsMap: &pb.VectorParamsMap{
					Map: map[string]*pb.VectorParams{
						"": {
							Size:     4096,
							Distance: pb.Distance_Cosine,
							OnDisk:   &trueValue,
						},
					},
				},
			},
		},
	})
	if err != nil {
		log.Printf("unable to create collection: %v\n", err)
	}

Here we specify a single vector (under the empty name "") in this collection, but it is also possible to store multiple named vectors in the same collection.
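For example, here is a hypothetical variant of the collection creation above where each point carries two named vectors; the names "title" and "body" are invented for illustration and not used elsewhere in this article:

	// Hypothetical collection where each point stores two named vectors
	_, err = collectionsClient.Create(ctx, &pb.CreateCollection{
		CollectionName: "multi_vectors",
		VectorsConfig: &pb.VectorsConfig{
			Config: &pb.VectorsConfig_ParamsMap{
				ParamsMap: &pb.VectorParamsMap{
					Map: map[string]*pb.VectorParams{
						"title": {Size: 4096, Distance: pb.Distance_Cosine},
						"body":  {Size: 4096, Distance: pb.Distance_Cosine},
					},
				},
			},
		},
	})
	if err != nil {
		log.Printf("unable to create collection: %v\n", err)
	}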

The vector size of 4096 is specified because that is the dimensionality of the embedding vectors returned by the model we are going to use.

A small detail here: we specify the Cosine distance metric, which we will revisit later in this article.

For example, consider the following sentences:

documents := []string{
    "Thomas loves eating apples under the shade of a tree.",
    "Thomas has a sister, her name is Julie and she is 30 years old. She doesn't like apples but prefers strawberries.",
}

To converse with an LLM and perform a search based on our data, the principle is as follows:

  • 1 - We generate a 4096-dimensional embedding vector for each of these sentences (which we will call documents),
  • 2 - We store these vectors in our database,
  • 3 - When a prompt is entered, we generate a 4096-dimensional vector from this prompt as well,
  • 4 - We then perform a similarity search in our vector database,
  • 5 - We retrieve the most relevant documents,
  • 6 - We provide these documents to our LLM so that it has personalized context.

These steps may seem costly, but documents can be indexed on the fly.

Thus, for a user query, only the generation of the prompt's embedding and the similarity search need to be performed.

Document Indexing

In terms of code, this translates to:

	for i, document := range documents {
		fmt.Printf("Indexing: %q...\n", document)

		// Generate vectors
		response, err := ollamaClient.Embeddings(
			ctx,
			&api.EmbeddingRequest{
				Model:  "mixtral:8x7b",
				Prompt: document,
			},
		)
		if err != nil {
			log.Printf("unable to embed document %q: %v\n", document, err)
			continue // skip the upsert if the embedding could not be generated
		}

		// Insert vectors
		_, err = pointsClient.Upsert(
			ctx,
			&pb.UpsertPoints{
				CollectionName: "test",
				Points: []*pb.PointStruct{
					{
						Id: &pb.PointId{
							PointIdOptions: &pb.PointId_Num{Num: uint64(i)},
						},
						Payload: map[string]*pb.Value{
							"": {
								Kind: &pb.Value_StringValue{
									StringValue: document,
								},
							},
						},
						Vectors: &pb.Vectors{
							VectorsOptions: &pb.Vectors_Vector{
								Vector: &pb.Vector{
									Data: convertFloat64ToFloat32(response.Embedding),
								},
							},
						},
					},
				},
				Wait: &trueValue,
			},
		)
		if err != nil {
			log.Printf("unable to upsert points vectors for document %q: %v\n", document, err)
		}
	}

The generation of vectors is done very simply by using the Embeddings() API and specifying the model we want to use.

For vector insertion, we specify the collection in which we want to store our vectors, and the vector(s) to use.

During indexing, we also specify the document content under Payload, which will be returned to us during the similarity search.
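One aside: these snippets call a convertFloat64ToFloat32 helper that is not shown in the listings. Ollama returns embeddings as []float64 while the Qdrant client expects []float32, so a minimal version (my own sketch) could be:

// convertFloat64ToFloat32 converts an Ollama embedding into the
// float32 slice expected by the Qdrant client.
func convertFloat64ToFloat32(input []float64) []float32 {
	output := make([]float32, len(input))
	for i, value := range input {
		output[i] = float32(value)
	}
	return output
}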

Chat with the LLM

To perform the similarity search, we can write:

  userInput := bufio.NewScanner(os.Stdin)
  userInput.Scan()

  prompt := userInput.Text()

  // Generate vectors
  response, err := ollamaClient.Embeddings(
    ctx,
    &api.EmbeddingRequest{
      Model:  "mixtral:8x7b",
      Prompt: prompt,
    },
  )
  if err != nil {
    panic(err)
  }

  // Similarity search
  searchResult, err := pointsClient.Search(
    ctx,
    &pb.SearchPoints{
      CollectionName: "test",
      Vector:         convertFloat64ToFloat32(response.Embedding),
      Limit:          5,
      WithPayload: &pb.WithPayloadSelector{
        SelectorOptions: &pb.WithPayloadSelector_Include{
          Include: &pb.PayloadIncludeSelector{
            Fields: []string{""},
          },
        },
      },
    },
  )
  if err != nil {
    panic(err)
  }

As with the indexing step above, we generate a 4096-dimensional embedding for the given prompt.

We then perform a similarity search against our single named vector in the test collection.

We retrieve the most similar documents, ordered by similarity score, and then simply provide them to our LLM as context.

  messages := []api.Message{}

  for _, item := range searchResult.Result {
    messages = append(messages, api.Message{
      Role:    "assistant",
      Content: item.Payload[""].GetStringValue(),
    })
  }

  messages = append(
    messages, api.Message{
      Role:    "system",
      Content: "You are a technical assistant capable of answering questions based on the information provided. Respond to the question asked by the user. Do not add any additional note. Simply answer the question asked. Respond only based on the information provided. Respond in English.",
    },
  )

  // Conversation history, filled in once the streamed response is complete
  history := []api.Message{}

  var fullResponse = ""

  // LLM - Chat request
  if err := ollamaClient.Chat(
    ctx,
    &api.ChatRequest{
      Model:  "mixtral:8x7b",
      Stream: &trueValue,
      Messages: append(messages, api.Message{
          Role:    "user",
          Content: prompt,
        },
      ),
    },
    func(chatResponse api.ChatResponse) error {
      fmt.Print(chatResponse.Message.Content)
      fullResponse += chatResponse.Message.Content

      if chatResponse.Done {
        fmt.Printf("\n")

        history = append(history, api.Message{
            Role:    "user",
            Content: prompt,
          },
          api.Message{
            Role:    "assistant",
            Content: fullResponse,
          },
        )
      }
      return nil
    },
  ); err != nil {
    panic(err)
  }

Simply put, this allows us to provide context to our chat request with the LLM and display its response on the standard output.

Similarity Search Algorithms

Qdrant currently allows the use of the following distance metrics: Cosine, Euclid (Euclidean distance), Dot (dot product), and Manhattan.

Here, we have used the Cosine metric.

Here are some use cases for each of them:

  • Cosine: Content recommendation based on similarity between user profiles and content items.
  • Euclid: Anomaly detection in geospatial data by calculating the distance between data points and their center of gravity.
  • Dot: Collaborative filtering in recommendation systems by measuring similarity between user preferences and content items.
  • Manhattan: Navigation routes in cities by calculating the distance between start and end points using only horizontal and vertical directions.

Cosine similarity is particularly well suited to measuring similarity between text embedding vectors, which is essential when working with LLMs.

These models represent the meaning of words and phrases as vectors in a multidimensional space. By comparing the angle between these vectors rather than their Euclidean distance, cosine similarity better captures the semantic similarity between expressions.
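As a toy illustration (my own sketch, unrelated to the embeddings above), two vectors pointing in the same direction are treated as identical by cosine similarity even though their magnitudes, and therefore their Euclidean distance, differ:

package main

import (
	"fmt"
	"math"
)

// cosine measures only the angle between two vectors.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// euclidean measures the straight-line distance, which also depends on magnitude.
func euclidean(a, b []float64) float64 {
	var sum float64
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
	}
	return math.Sqrt(sum)
}

func main() {
	a := []float64{1, 2, 3}
	b := []float64{2, 4, 6} // same direction as a, twice the magnitude

	fmt.Printf("cosine:    %.2f\n", cosine(a, b))    // 1.00: identical direction
	fmt.Printf("euclidean: %.2f\n", euclidean(a, b)) // ~3.74: far apart by distance
}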

Conclusion

In this article, we explored vector databases and their relevance in the context of Large Language Models (LLMs).

By combining LLMs' ability to generate information-rich text embeddings with Qdrant's similarity search capabilities, developers can build intelligent, context-aware applications that effectively leverage the semantic information contained in textual data.

This approach opens the door to many potential applications in areas such as personalized content recommendation, advanced virtual assistance, and much more.

If this content has piqued your interest, you now have a piece of Go code to start experimenting with!

Credits

Photo by Joshua Sortino on Unsplash