Hopsworks 3.3 is now generally available. This version includes two new APIs to retrieve data for batch inference (built using DuckDB and an Arrow Flight server) and a REST API to retrieve data from the online feature store. Hopsworks 3.3 also improves support for Flink and adds Beam as an execution engine for writing feature pipelines.
Faster Feature Data with DuckDB and ArrowFlight server
Hopsworks 3.0 introduced new Python APIs to extend its support to moderate-sized data challenges, where Python and Pandas are the dominant technologies today. Until now, the Hopsworks backend used big data technologies (Spark or Hive) to create and read training data for Python clients. This release introduces a new service, based on DuckDB and Arrow Flight, that significantly reduces the time needed to create and read moderate-sized training data and batch inference data with Python clients. The new service aligns with the major change in Pandas 2.0: the introduction of Apache Arrow as a backend for Pandas data. For larger data volumes, Hopsworks will continue to use Spark as its default engine.
Online Feature Store Client Improvements
More Flexible Feature Retrieval with Spine DataFrames
Hopsworks 3.3 introduces a new way of creating training datasets and retrieving data for batch inference by providing the entity IDs, timestamps, and labels as a DataFrame. Before Hopsworks 3.3, entity IDs, timestamps, and labels needed to be materialized in a “label” feature group. With the Spine DataFrame, the entity IDs, timestamps, and labels can be provided as a DataFrame (Pandas or Spark) without the need to write them to feature groups. Hopsworks then takes care of retrieving the point-in-time correct feature values for the provided Spine DataFrame.
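To make "point-in-time correct" concrete, here is a minimal, standard-library sketch of the lookup that a spine drives. It is an illustration only, not the Hopsworks implementation or API: the names `feature_history`, `spine`, `point_in_time_value`, `join_spine`, and the `avg_spend` feature are all hypothetical. For each spine row, the sketch picks the most recent feature value whose timestamp is not later than the row's timestamp, so no future data leaks into a training row.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical feature history: entity ID -> list of (event_time, value), sorted by time.
feature_history = {
    "cust_1": [
        (datetime(2023, 1, 1), 10.0),
        (datetime(2023, 2, 1), 12.5),
        (datetime(2023, 3, 1), 9.0),
    ],
    "cust_2": [
        (datetime(2023, 1, 15), 3.0),
    ],
}

# Hypothetical spine: one row per (entity ID, timestamp, label).
spine = [
    {"customer_id": "cust_1", "event_time": datetime(2023, 2, 10), "label": 1},
    {"customer_id": "cust_1", "event_time": datetime(2022, 12, 1), "label": 0},
    {"customer_id": "cust_2", "event_time": datetime(2023, 4, 1), "label": 0},
]


def point_in_time_value(history, as_of):
    """Return the latest value with event_time <= as_of, or None if none exists."""
    times = [t for t, _ in history]
    idx = bisect_right(times, as_of)
    return history[idx - 1][1] if idx > 0 else None


def join_spine(spine_rows, history_by_entity):
    """Attach the point-in-time correct feature value to every spine row."""
    joined = []
    for row in spine_rows:
        history = history_by_entity.get(row["customer_id"], [])
        value = point_in_time_value(history, row["event_time"])
        joined.append({**row, "avg_spend": value})
    return joined


training_rows = join_spine(spine, feature_history)
```

In this sketch, the second spine row predates every recorded value for `cust_1`, so its feature is left empty rather than filled with a later value.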
Online external feature groups
Hopsworks supports external feature groups, whose data resides in external data warehouses (e.g. Snowflake, BigQuery, Redshift) or data lakes (e.g. S3, GCS, ADLS).
Hopsworks 3.3 extends the external feature group functionality so that these feature groups can also be made available online. The historical data for training and batch inference remains in the external data store, while the online feature data is synchronized to the Online Feature Store (RonDB) for fast retrieval during online inference.
Improved Flink and New Beam Support
Hopsworks 3.3 improves the support for Apache Flink and adds support for Apache Beam as frameworks for writing feature pipelines. Apache Flink is the de facto standard for real-time feature engineering pipelines that require low-latency processing and/or more sophisticated windowing and grouping operations on data streams.
Apache Beam is the interface for Google Dataflow pipelines. Users can use Google Dataflow and Apache Beam to create feature pipelines that write data to the Hopsworks feature store.
The Apache Flink and Beam integration is currently only supported by the Java version of the Hopsworks SDK. Our JavaDoc is now available online here: https://docs.hopsworks.ai/feature-store-api/3.3/javadoc/
Other Notable API Improvements
As of Hopsworks 3.3, inserting a Pandas DataFrame from a Python client is non-blocking by default. This is in contrast with previous releases, where the insert() method call blocked until the offline materialization job completed. The change reduces the time a client waits before Hopsworks acknowledges that the data has been inserted successfully.
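The difference between the two modes can be sketched with a thread pool from the standard library. This is a generic illustration of blocking versus non-blocking submission, not Hopsworks code: `materialize_offline`, the local `insert` function, and its `wait` flag are hypothetical stand-ins, and the sleep simulates a slow materialization job.

```python
import time
from concurrent.futures import ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=1)


def materialize_offline(rows):
    """Hypothetical stand-in for the offline materialization job."""
    time.sleep(0.5)  # simulate a slow job
    return len(rows)


def insert(rows, wait=False):
    """Schedule the job; return right away unless wait=True."""
    job = _executor.submit(materialize_offline, rows)
    if wait:
        job.result()  # blocking mode: return only after the job has finished
    return job


rows = [{"id": 1}, {"id": 2}]

start = time.monotonic()
job = insert(rows)  # non-blocking: returns once the job is scheduled
elapsed_non_blocking = time.monotonic() - start

rows_written = job.result()  # the caller can still wait later if it needs to
_executor.shutdown()
```

The caller regains control as soon as the job is scheduled, and can still wait on the returned handle when it needs the data to be fully materialized.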
Hopsworks 3.3 now supports medium and long blobs and text as valid data types for online features. This addition allows data scientists to store embeddings of size up to 4GB in the online feature store.
The `.save()` method for feature groups now accepts a list of Feature objects as an alternative to Pandas/Spark DataFrames. This new mode allows data scientists to more easily separate the feature group definition from the pipeline feeding the feature group.
The get_feature_vector() and get_feature_vectors() methods, which retrieve data from the online feature store, now accept a return_type parameter that specifies how the feature vectors are returned. Valid types are list, numpy and Pandas. For compatibility with previous versions, the default value is list, and the vector is returned as a Python list.
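A small wrapper shows how a client might pass return_type through. Only the method name get_feature_vector() and the return_type parameter come from the text above. The `entry` argument name, the lowercase strings `"numpy"` and `"pandas"`, the helper `fetch_vector`, and the `FakeFeatureView` stand-in are assumptions made so the sketch runs without a Hopsworks cluster.

```python
ALLOWED_RETURN_TYPES = {"list", "numpy", "pandas"}  # assumed spellings


def fetch_vector(feature_view, entry, return_type="list"):
    """Fetch one online feature vector in the requested format."""
    if return_type not in ALLOWED_RETURN_TYPES:
        raise ValueError(
            f"return_type must be one of {sorted(ALLOWED_RETURN_TYPES)}"
        )
    return feature_view.get_feature_vector(entry=entry, return_type=return_type)


class FakeFeatureView:
    """Hypothetical stand-in for a real feature view object."""

    def get_feature_vector(self, entry, return_type="list"):
        # This fake always returns a plain list; a real feature view
        # would convert the vector according to return_type.
        return [entry["customer_id"], 12.5]


vector = fetch_vector(FakeFeatureView(), {"customer_id": 42})
```

With a real feature view in place of the stand-in, the same wrapper would request a NumPy or Pandas result by passing a different return_type.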
Software stack upgrade
Some of the internal components of Hopsworks have been updated in version 3.3. As of Hopsworks 3.2, Hopsworks fully supports Ubuntu 22.04 and RedHat/Centos 8.x. Support for Ubuntu 18.04 and RedHat/Centos 7 has been removed in view of the end of life of the respective OS versions.
Hopsworks 3.3 comes with Kafka 3.4 and Zookeeper 3.7.1. Additionally, as of version 3.3 the Hopsworks Feature Store library is fully compatible with Pandas 2. Great Expectations has been updated to version 0.14.13 and KServe has been updated to version 0.10.0.