Similarity Search

What is similarity search for vector embeddings?

Vector embeddings (or embeddings or vectors) are compressed representations of data such as text, images, and audio. Vector similarity search (or similarity search for embeddings) finds the “top K” most similar vectors to a query vector in a vector database. In order to be able to search for items in a vector database, you need to first insert vector embeddings for items, then the items will be indexed by the vector database.

What is similarity search used for?

Similarity search is used to build:

single modality similarity search where a user supplies an Image/video/audio that is run through an embedding model to create a vector with which a similarity search is performed on the vector database to find similar images/video/audio. Here the similarity search is for the same modality (e.g., use an image to find images);
Two-modality similarity search where a user enters a search string (a query) or the last video they watched, and we augment that with the user’s history and preferences and context information like what’s trending and with all that data, we use an embedding model to create a vector. We then look up items (a different modality from the user search) in a vector database. Two-modality similarity search is enabled by training models using the two-tower model, where you have samples where a user query (and its embedding) is linked to an item the user clicked on (or negatively - the user didn’t click on it).
Prompt engineering for generative AI tools (LLMs) where you take a user’s query and retrieve similar text passages stored in a vector database, and add those text passages as “prompts” to your query. This will better instruct the LLM how to generate a more relevant, informed response. For example, if you retrieve recent information from your vector database (not available when the LLM aws trained), the LLM can give you a correct answer, even though it didn’t know that data at the time of training;
Anomaly detection where you supply a vector and search for outliers in the vector database (outliers are very different from all other vectors in the vector DB);
Deduplication and record matching where similarity search can find very similar (but maybe not 100% identical) items. This can be useful for detecting plagiarism, for example.

What indexing algorithms are used for similarity search?

Approximate k-nearest neighbors (ANN) is commonly used to return the k nearest vectors - exact matching is too computationally expensive. Approximate approaches find good enough matches without exhaustively checking all the possible matches. Your ANN algorithms should be configurable to enable you to tradeoff recall (percentage of results with true top-k nearest neighbors), latency, throughput, and vector insertion time.

Similarity Search

What is similarity search for vector embeddings?

What is similarity search used for?

What indexing algorithms are used for similarity search?

Interested for more?

Wait! Before you go…