GenAI comes to Hopsworks with Vector Similarity Search

Consolidate your Data for AI in a Single Platform

March 20, 2024

8 min read

Kenneth Mak

Software Engineer

Hopsworks

Jim Dowling

CEO and Co-Founder

Hopsworks

MLOps

TL;DR

Hopsworks has added support for approximate nearest neighbor (ANN) indexing and vector similarity search for vector embeddings stored in its feature store. Now, when you create a feature group, you can indicate which feature(s) are vector embeddings to be indexed for vector similarity search. This is a new capability for feature stores that lets users combine feature data (and its data validation support) with vector similarity search in a single data platform. For LLMs, you can now store both the data for RAG (retrieval-augmented generation) using vector similarity search and the instruction datasets for fine-tuning in a single feature group. Similarly, for personalized recommendation systems (retrieval and ranking), you can combine vector similarity search for candidate retrieval with the feature data used for ranking in a single feature group. Adding ANN indexes to feature groups in Hopsworks is part of our mission of making Hopsworks the world’s best data-for-AI platform.

Introduction

Vector similarity search has gained wide adoption in recent years with the growth of vector databases. The retail website that shows you items of clothing similar to the one you're viewing, and the personalized recommendation system that recommends products based on your recent browsing or purchasing history, are both examples of vector similarity search. From a database perspective, if you have a table containing rows of items, you can take one item, perform vector similarity search on the table, and get back the rows of items that are semantically similar to your item. To perform vector similarity search, you need an embedding model that takes your item data and compresses it into a vector embedding. What’s different and new about vector embeddings is that they retain semantic information about an item even after compression.
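To make the idea concrete, here is a minimal sketch of exact (brute-force) similarity search over a small table of items. It is an illustration only: the item rows, the three-dimensional embedding values, and the helper names `cosine_similarity` and `find_neighbors_exact` are all hypothetical, and a real system would use embeddings produced by a model and an ANN index rather than scanning every row.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def find_neighbors_exact(query, rows, k):
    """Return the k rows whose embedding is most similar to the query.

    This scans every row (exact search); an ANN index trades a little
    accuracy for much lower latency on large tables.
    """
    ranked = sorted(
        rows,
        key=lambda row: cosine_similarity(query, row["embedding"]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical toy embeddings, not produced by a real embedding model.
items = [
    {"id": 1, "title": "red summer dress", "embedding": [0.9, 0.1, 0.0]},
    {"id": 2, "title": "crimson evening gown", "embedding": [0.8, 0.2, 0.1]},
    {"id": 3, "title": "hiking boots", "embedding": [0.0, 0.1, 0.9]},
]

query_embedding = [0.85, 0.15, 0.05]
neighbors = find_neighbors_exact(query_embedding, items, k=2)
print([row["title"] for row in neighbors])
```

The two dresses are returned because their vectors point in nearly the same direction as the query, while the boots point elsewhere.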

Vector similarity search applications span many domains, from RAG for LLMs, to recommendation systems, to image similarity search and beyond. As such, vector similarity search has become an important capability in operational machine learning systems, where semantically related information must be retrieved at low latency. While vector similarity search originated with vector databases, other databases have recently added support for it, including Postgres/pgvector, Neo4j, OpenSearch, Elastic, DataStax, and more. In this article, we discuss the addition of ANN indexes to feature groups in Hopsworks, enabling vector similarity search over feature groups stored in Hopsworks.

Vector Similarity Search for Feature Stores

Building ML systems is hard and production systems have traditionally had high operational cost. For example, personalized recommender systems based on the retrieval-and-ranking architecture have traditionally included a vector database, to generate candidates, a feature store, to enrich the candidates with features before ranking the candidates, and model-serving infrastructure to host the ranking model. That is a lot of infrastructure to operate.

With its new vector similarity search capability, Hopsworks can now provide, in a single platform, all the ML infrastructure needed to support use cases such as the retrieval-and-ranking recommender architecture, as well as other use cases such as RAG and fine-tuning.

Extending Feature Groups with ANN Indexes

***Figure 1:*** *Feature Group Shared Schema*

Hopsworks provides a Feature Group API for writing DataFrames (Pandas, Polars, or Spark) transparently to the offline store, the online store, or both. The Feature Group client has internal batch and stream APIs for routing the data to the backend stores.

The code snippet below shows how to create a Feature Group from a DataFrame in which the content column contains vector embedding data that should be indexed for vector similarity search. Vector embeddings are only supported in online-enabled Feature Groups. Under the hood, the DataFrame is written to Kafka and synchronized to the backend stores: RonDB (online store), OpenSearch (vector DB), and Apache Hudi or Delta Lake (offline store). Hopsworks transparently creates a unified schema for all of these stores, manages the lifecycle (creation/deletion) of the backing tables/indexes, and optimizes their layout for query performance.

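import hopsworks
from hsfs import embedding

# Connect to Hopsworks and get a handle to the feature store
project = hopsworks.login()
fs = project.get_feature_store()

df = ...  # placeholder: Pandas/PySpark DataFrame with an embedding column named "content"

# Declare "content" as an embedding to be indexed; the second argument is its dimension
emb = embedding.EmbeddingIndex()
emb.add_embedding("content", len(df["content"][0]))

expectation_suite = ...  # placeholder: Great Expectations data validation rules
version = 1  # example feature group version

news_fg = fs.create_feature_group(
    name="news_fg",
    description="News data, indexed for similarity search",
    embedding_index=emb,
    primary_key=["content"],
    version=version,
    expectation_suite=expectation_suite,
    online_enabled=True,
)

# Write the DataFrame to the feature group
news_fg.insert(df)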

Now, when you write your vector embedding data to a Feature Group, you get additional benefits above and beyond what is found in existing vector databases:

transparent data validation on data ingestion, through declarative support for Great Expectations data validation rules,
time travel support for feature groups, crucial for exact reproduction of training datasets using only metadata,
training dataset creation using vector embeddings (this is particularly important for time-series data, where temporal joins are needed to create point-in-time correct training data),
efficient retrieval of vector embeddings along with feature data using Hopsworks Feature Query Service.

The upshot of these improvements is that you can treat your vector embeddings like any other data source for AI in the Hopsworks Feature Store. You get all the benefits of a feature store and a vector database in a single platform.
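The temporal join mentioned in the list above can be sketched in a few lines. The example below is only an illustration of what "point-in-time correct" means, not how Hopsworks implements it: for each label event it picks the most recent feature value recorded at or before the event time, so no information from the future leaks into the training row. The data, column names, and the helper `point_in_time_join` are hypothetical.

```python
from bisect import bisect_right

# Hypothetical feature history for one article: (event_time, view_cnt),
# sorted by time. ISO-8601 date strings sort correctly as plain strings.
feature_history = [
    ("2024-01-01", 10),
    ("2024-01-10", 250),
    ("2024-01-20", 900),
]

# Hypothetical label events: (event_time, clicked).
label_events = [("2024-01-05", 0), ("2024-01-15", 1), ("2024-01-25", 1)]


def point_in_time_join(labels, history):
    """Attach to each label the latest feature value known at label time."""
    times = [t for t, _ in history]
    rows = []
    for label_time, label in labels:
        # Index of the last feature row with event_time <= label_time.
        idx = bisect_right(times, label_time) - 1
        view_cnt = history[idx][1] if idx >= 0 else None
        rows.append(
            {"event_time": label_time, "view_cnt": view_cnt, "clicked": label}
        )
    return rows


training_rows = point_in_time_join(label_events, feature_history)
for row in training_rows:
    print(row)
```

A naive join on the latest feature value would assign a view count of 900 to every label, including the one from January 5, which is exactly the leakage a point-in-time join avoids.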

Vector Similarity Search API

The vector similarity search API is designed for ease of use, so that developers can integrate similarity search into their applications with a single call. Users pass a target embedding and a nearest-neighbor count k to a feature group, and the search returns the k rows of features whose embedding vectors are approximately closest to the target. You can also provide a filter that is pushed down to the vector database, filtering out unwanted rows.

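# `model` is the embedding model that produced the stored embeddings;
# it is assumed to expose an encode() method (as SentenceTransformer models do)
search_query = "News articles about Lionel Messi"

# Similarity search returning the k=3 nearest neighbors
news_fg.find_neighbors(model.encode(search_query), k=3)

# Similarity search with a push-down filter
news_fg.find_neighbors(model.encode(search_query), k=3, filter=news_fg.newstype == "sports")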
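Why does it matter that the filter is pushed down rather than applied to the results afterwards? The sketch below contrasts the two strategies on toy data. It is a simplified illustration under stated assumptions: the rows, the two-dimensional embeddings, the dot product as the similarity measure, and the function names `search_post_filter` and `search_pushdown` are all hypothetical and say nothing about how the vector database evaluates filters internally.

```python
def dot(a, b):
    """Dot product, used here as a simple similarity score."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query, rows, k):
    """Exact top-k rows by similarity to the query."""
    return sorted(rows, key=lambda r: dot(query, r["embedding"]), reverse=True)[:k]


def search_post_filter(query, rows, k, predicate):
    # Rank all rows first, then drop the ones that fail the filter.
    return [r for r in top_k(query, rows, k) if predicate(r)]


def search_pushdown(query, rows, k, predicate):
    # Apply the filter inside the search, so only matching rows compete.
    return top_k(query, [r for r in rows if predicate(r)], k)


def is_sports(row):
    return row["newstype"] == "sports"


# Hypothetical articles with toy embeddings.
articles = [
    {"id": 1, "newstype": "politics", "embedding": [1.0, 0.0]},
    {"id": 2, "newstype": "politics", "embedding": [0.9, 0.1]},
    {"id": 3, "newstype": "sports", "embedding": [0.8, 0.2]},
    {"id": 4, "newstype": "sports", "embedding": [0.1, 0.9]},
]
query = [1.0, 0.0]

post = search_post_filter(query, articles, 2, is_sports)
pushed = search_pushdown(query, articles, 2, is_sports)
print([r["id"] for r in post], [r["id"] for r in pushed])
```

With post-filtering, the two closest rows are both politics articles, so the caller asks for two results and gets none. With the filter pushed down, the search still returns two sports articles.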

To create training datasets from historical embeddings with point-in-time correctness, a feature view provides methods for creating a training dataset with or without a time split.

# Create a training dataset with a time-series split
X_train, X_test, y_train, y_test = fv.train_test_split(
    start_time="20240101", end_time="20240131"
)

To retrieve feature data at a specific time in the past from the feature group, users can utilize the offline read API to perform time travel.

time_in_past = "19580206"
df = news_fg.as_of(time_in_past).read()

Personalized Recommendations with Retrieval and Ranking

We have a complete personalized recommendations example, based on the Retrieval and Ranking architecture, available in our tutorials to help get you started with using vector similarity search.

In the retrieval and ranking architecture, the first phase retrieves the top k nearest items, and the second phase reranks them. Reranking commonly requires extra features from other sources, which in practice means an additional step to fetch those features from other feature groups in the online feature store. Hopsworks provides a simple read API for this purpose: users can create a feature view by joining multiple feature groups and fetch all the required features by calling fv.find_neighbors. In the example below, view_cnt from another feature group is also returned in the result.

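# Feature group holding view counts
views_fg = fs.create_feature_group(
    name="news_views",
    primary_key=["id1"],
    version=version,
    online_enabled=True,
)

# Feature view joining the news features with the view counts
fv = fs.create_feature_view(
    name="news_cnt",
    version=version,
    query=news_fg.select(["date", "heading", "newstype"]).join(
        views_fg.select(["view_cnt"])
    ),
)

# Nearest neighbors, enriched with view_cnt from the joined feature group
fv.find_neighbors(model.encode(search_query), k=5)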
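For readers new to the pattern, the two phases can be sketched end to end in plain Python. This is a conceptual illustration only: the rows, the toy embeddings, the `view_counts` lookup, and the choice to rank purely by view count are hypothetical stand-ins for a real embedding index, a joined feature group, and a trained ranking model.

```python
def dot(a, b):
    """Dot product, used here as a simple similarity score."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(query, rows, k):
    """Phase 1: fetch the k candidates closest to the query embedding."""
    return sorted(rows, key=lambda r: dot(query, r["embedding"]), reverse=True)[:k]


def enrich(candidates, views):
    """Add view_cnt from a second table, keyed by article id."""
    return [{**c, "view_cnt": views.get(c["id"], 0)} for c in candidates]


def rank(candidates):
    """Phase 2: order candidates by a score (here simply view_cnt)."""
    return sorted(candidates, key=lambda c: c["view_cnt"], reverse=True)


# Hypothetical news rows and a separate view-count table.
news = [
    {"id": 1, "heading": "Messi scores twice", "embedding": [0.9, 0.1]},
    {"id": 2, "heading": "Messi signs new deal", "embedding": [0.8, 0.3]},
    {"id": 3, "heading": "Election results", "embedding": [0.1, 0.9]},
]
view_counts = {1: 120, 2: 4500, 3: 9000}

query = [1.0, 0.0]
candidates = retrieve(query, news, k=2)
ranked = rank(enrich(candidates, view_counts))
print([(c["heading"], c["view_cnt"]) for c in ranked])
```

The election article has the most views but is never ranked, because retrieval already excluded it as semantically unrelated to the query. Returning the joined features together with the neighbors, as the feature view call above does, removes the need for a separate enrichment step in application code.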

Summary

In this blog, we have introduced the enhanced vector similarity search capabilities in Hopsworks. Explore the notebook example, demonstrating how to use Hopsworks for implementing a news search application. Users can search for news using natural language in the application, powered by the new Hopsworks vector database.
