Scheduled upgrade on April 4, 08:00 UTC

Kindly note that during the maintenance window, app.hopsworks.ai will not be accessible.

April 4, 2025

App Status

Back to Blog

Dhananjay Mukhedkar

Software Engineer

Let's keep in touch!

Subscribe to our newsletter and receive the latest product updates, upcoming events, and industry news.

More Blogs

Hopsworks AI Lakehouse Now Supports NVIDIA NIM Microservices

How we secure your data with Hopsworks

Migrating from AWS to a European Cloud - How We Cut Costs by 62%

The 10 Fallacies of MLOps

Hopsworks AI Lakehouse: The Power of Integrated MLOps Components

Article updated on

How to use external data stores as an offline feature store in Hopsworks with Connector API

September 15, 2022

9 min

Read

Dhananjay Mukhedkar

Software Engineer

Hopsworks

Data Engineering

Feature Store

TL;DR

Hopsworks is a feature store that can use an existing data lake/warehouses/lakehouse as its offline store for feature groups. We call feature groups that are “mounted” from external data sources external feature groups. Hopsworks feature store supports external features from many different data sources, providing the engine to join these features together. In this blog, we introduce Hopsworks Connector API that is used to mount a table in an external data source as an external feature group in Hopsworks.

Introduction

Enterprise data in the cloud is stored in a plethora of data storage services, from data lake based services like AWS S3, ADLS, and GCS, to data warehouses like Redshift, BigQuery, Snowflake, and Delta Lake - to name just a few. In this blog post, we explore how we can easily connect Hopsworks to commonly used data stores and use them as an offline store for features in Hopsworks. In effect, Hopsworks can be a virtual feature store, where you leverage your existing data lake or warehouse as an offline store, and Hopsworks manages both the metadata and online features.

Before we dive into the Connector API, let's briefly understand what an external feature group is.

A feature group is a table of features (columns) that are updated by a feature pipeline program. Feature stores provide two different APIs for reading features:

an Offline API to read large volumes of historical feature data for training or batch inference;
an Online API to read precomputed features at low latency for real-time inference.

Given these requirements, most feature stores are implemented as dual database systems, with one column-oriented offline store implementing the Offline API and a row-oriented online store implementing the Online API. The online store is a low-latency database that stores and serves the latest values of features. The offline store serves as a data warehouse to store historical values of features and requires low cost storage of large data volumes, efficient push down filter queries, efficient point-in-time Joins, and time-travel for features to be able to read historical values for features.

External feature groups enable the benefits of centralized feature management and reuse along with the storage of features on existing data platforms. Often, Enterprises have built feature pipelines on an existing data lake or data warehouse, and would like to reuse those features in a feature store without rewriting the feature pipelines. Also, many Enterprises do not like to introduce an additional data store, and would like to keep their features on their existing data platform. For these scenarios, external feature groups are an attractive proposition.

Connector API

The Connector API enables you to mount an existing table in your external data source (Snowflake, Redshift, Delta Lake, etc) as an offline feature group in Hopsworks. This means you can use its features exactly like any other offline feature, for example, joining different features together from different data sources, to create feature views, training data, and feature vectors for online models.

In the image below, we can see that Hopsworks Feature Store (HSFS) currently supports a plethora of data sources, including any JDBC-enabled source, Snowflake, Data Lake, Redshift, BigQuery, S3, ADLS, GCS, and Kafka.

Hopsworks only stores the metadata about such features, but not the actual data itself. This opens up the possibility to use your existing data storage and utilize it as an external table without the hassle of migrating the data into Hopsworks. If you do decide to store your offline features in Hopsworks, they are stored in Hudi tables in S3/GCS/ADLS.

The Connector API requires a storage connector, which securely stores the authentication information about how to connect to an external data store, to enable the feature store client library, HSFS, to retrieve data from the external table. Once a storage connector is created for the desired data source, it can seamlessly be used to create external feature groups from that data source. Storage connectors are also used in other applications like reading data for batch inference or streaming sources and also for creating training datasets.

We now take a look at an example of how to create an external feature group programmatically using the Connector API.

Create a storage connector

Our first step is to create a storage connector from the UI. In the screenshot below, we see an example of creating a storage connector for GCS. Detailed information about how to create a storage connector is found here.

***Figure 2.*** *Hopsworks UI - Storage Connector*

We retrieve the storage connector object using the feature store connection object as follows:

import hsfs
# Connect to the Hopsworks feature store
connection = hsfs.connection()
# Retrieve the metadata handle
fs = connection.get_feature_store()
# Retrieve the storage connector
connector = fs.get_storage_connector('connector_name')

Note, the above connector object contains only metadata for the external data source. There is no active connection yet to the external data source, so the call returns quickly.

Create external feature group

The method call to create an external feature group is create_external_feature_group. It takes name as mandatory parameter, for the name of the feature group being created. Depending on the type of the external data source, we can set either the path or query argument, as shown below.

Data Lake based external feature group

For data lake based external sources (AWS S3, GCS), the path to the data object to be used as an external feature group needs to be set. It can be set while creating the storage connector in the bucket field or setting it to the path argument, which is appended to the bucket path mentioned while creating the storage connector. Secondly, the format of the data needs to be set in data_format argument, which can typically be any spark data format (csv, parquet, orc, etc.).

external_fg = fs.create_external_feature_group(name="external_fg_name",
	version=1
	description="Physical shop sales features",
	data_format="parquet",
	storage_connector=connector,
	primary_key=['ss_store_sk'],
	event_time='sale_date'
)

SQL based external feature group

For external sources based on a data warehouse (e.g., JDBC, Snowflake, Redshift, BigQuery) the user provides a SQL string for retrieving data and even computing the features. Your SQL query can include operations such as projections, aggregations, and so on, including.

query = """
	SELECT TO_NUMERIC(ss_store_sk) AS ss_store_sk
    	, AVG(ss_net_profit) AS avg_ss_net_profit
    	, SUM(ss_net_profit) AS total_ss_net_profit
    	, AVG(ss_list_price) AS avg_ss_list_price
    	, AVG(ss_coupon_amt) AS avg_ss_coupon_amt
    	, sale_date
    	, ss_store_sk
	FROM STORE_SALES
	GROUP BY ss_store_sk, sales_date
     """

external_fg = fs.create_external_feature_group(name="external_fg_name",
	version=1
	description="Physical shop sales features",
	query=query,
	storage_connector=connector,
	primary_key=['ss_store_sk'],
	event_time='sale_date'
)

We can also optionally specify which columns should be used as primary key, and event time.

The create method creates the metadata object of the feature group but does not persist it. Upon calling save, the feature group metadata gets stored in Hopsworks. At this point, Hopsworks will also execute the SQL query and compute descriptive statistics and feature distributions over the data returned, saving the results as metadata.

# saves the feature group in Hopsworks
external_fg.save()

That's it! This creates an external feature group in Hopsworks while the actual data still lives only in the external storage!

Summary

The Connector API makes it very easy to connect to an external data source and create offline feature groups in Hopsworks. Hopsworks supports a large variety of external data sources like Snowflake, Data Lake, Redshift, BigQuery, S3, ADLS, GCS, and Kafka.

These external feature groups can be used like any other offline feature group, for example, to create feature views, batch inference data, and training data for models. This brings up the benefit of leveraging existing data storage and reading features seamlessly to Hopsworks as needed, without the cost and operational overhead of data migration.

For more information on the Connector API, checkout Hopsworks Documentation.

References

Interested for more?

🤖 Register for free on Hopsworks Serverless
🌐 Read about the open, disaggregated AI Lakehouse stack
📚 Get your early copy: O'Reilly's 'Building Machine Learning Systems' book
🛠️ Explore all Hopsworks Integrations
🧩 Get started with codes and examples
⚖️ Compare other Feature Stores with Hopsworks

More blogs

ML and AI applications are becoming increasingly demanding in terms of performance, we compare Redis to RonDB in terms of Scalability, Throughput and H.A.

AI/ML needs a Key-Value store, and Redis is not up to it

Seeing how Redis is a popular open-source feature store with features significantly similar to RonDB, we compared the innards of RonDB’s multithreading architecture to the commercial Redis products.

Mikael Ronström

Learn the best ways to enrich and evolve your models from stateless AI to Total Recall AI with the help of the python-centric Hopsworks Feature Store.

Beyond Brainless AI with a Feature Store

Evolve your models from stateless AI to Total Recall AI with the help of a Feature Store.

Jim Dowling

Hopsworks Online Feature Store: Fast Access to Feature Data for AI Applications

Read about how the Hopsworks Feature Store abstracts away the complexity of a dual database system, unifying feature access for online and batch applications.

Moritz Meister