How Should You Manage Your Data?

Machine Learning (ML) is transforming businesses, enabling new revenue streams and significant cost savings for the companies that manage to put models in production. The few companies that have succeeded in operationalizing ML have one thing in common: they invest in infrastructure for managing their data for AI, so that Data Scientists and ML Engineers can focus on designing and deploying models on data they can rely on.

The Hopsworks Feature Store for ML manages your data throughout the entire ML lifecycle. With Hopsworks, you can manage the data needed to productionize models that create value and reduce costs for your business.

What is a Feature Store?

The Feature Store is a platform for managing machine learning features, including the feature engineering code and the feature data itself. The Feature Store enables the real-time computation of features and the storage of pre-computed features, ensuring that features used during training and serving are consistent, and providing high-performance access to features for training, batch predictions, and online model serving.

Similar to how the Data Warehouse centralizes data for analytics, the Feature Store is the central source of truth for data used in ML - it is a platform for discovering, documenting, and reusing features across many different models.
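
To make the consistency point concrete, the sketch below reads the same set of features both as a DataFrame for training or batch inference and as a single feature vector for online serving. It assumes a feature store handle fs obtained with the Hopsworks Python client (a connection example appears later in this document), a hypothetical feature view named transactions_fv, and method names that may vary slightly between client versions.

    # Assumes a feature store handle fs obtained with the Hopsworks Python client.
    # "transactions_fv" is a hypothetical feature view that selects a model's features.
    fv = fs.get_feature_view(name="transactions_fv", version=1)

    # Training and batch inference: read the selected features as a DataFrame.
    batch_df = fv.get_batch_data()

    # Online inference: read the same features as a precomputed feature vector
    # for a single entity, served from the online store.
    vector = fv.get_feature_vector(entry={"customer_id": 42})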

Feature Store Personas

ML Engineers

Need a Feature Store to operate both analytical and operational models that need to be fed with precomputed feature vectors. They also need to apply real-time transformations and log predictions/outcomes from online serving back to the Feature Store to help generate new training datasets.

Data Engineers

Need a Feature Store to store the output of feature pipelines that make (1) historical feature data available to train machine learning models and operate analytical models (batch inference), and (2) precomputed features available to operate operational models (online inference).

Data Scientists

Need a Feature Store to discover and explore available features, create batch or real-time aggregations, define feature transformations, and select features to build training data.

Data Architects

Need to secure and govern the feature data in the Feature Store and the feature pipelines, training pipelines, and inference pipelines that use the Feature Store.

Open and Pluggable Feature Store

Hopsworks includes an integrated data science and MLOps platform for developing features, training models, and serving models. However, many organizations already have existing data science, model serving, and MLOps platforms.

For those organizations, the Hopsworks Feature Store can be used as a standalone open platform that seamlessly integrates with existing feature engineering, model training, and model serving platforms - whether in the cloud or on-premises. You can keep feature data in your existing Data Warehouse and make it available on-demand (as external tables), and you can use Python/Scala/Java APIs or SQL to integrate your model-serving infrastructure with the Feature Store.
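
As a sketch of this standalone usage, the snippet below connects to a Hopsworks cluster from an external Python environment (for example, a Databricks or plain Jupyter notebook) and reads a feature group into a DataFrame. The host, project, API key, and feature group names are placeholders, and parameter names may vary with the client version.

    import hopsworks

    # Connect to a remote Hopsworks cluster from any Python environment.
    # Host, project, and API key below are placeholders.
    project = hopsworks.login(
        host="my-cluster.hopsworks.ai",
        project="fraud_detection",
        api_key_value="YOUR_API_KEY",
    )
    fs = project.get_feature_store()

    # Read an existing feature group into a DataFrame for use in an
    # external training or batch-inference pipeline.
    fg = fs.get_feature_group(name="card_transactions", version=1)
    df = fg.read()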

Figure 1. Hopsworks Standalone Platform

A Catalog for Enterprise Features

Enterprise users can use the Feature Catalog to discover what features are available, perform exploratory data analysis by inspecting feature statistics and previewing the feature data, and, with custom schematized tags, even identify the constraints under which a feature can be used. You can also design the type of Feature Catalog you want - a classical dev/staging/production setup, or a data mesh where different business units have their own Feature Catalogs but can still securely share features between teams using Hopsworks’ unique project-based multi-tenancy security model.

Hopsworks gives you tremendous flexibility to design your own governance policies using self-service schematized tags and access control policies. Examples of easily configurable policies include the following (a sketch of capturing such policies as schematized tags via the Python API follows the list):

  1. Which users or groups of users are allowed to read from or write to these features?
  2. Can the feature be used in a particular geographic region or industry?
  3. Is this feature available for online model inference and/or batch inference?
  4. Does this feature contain personally identifiable information?
  5. Is there an existing production feature with the same name as my new feature, or is anybody currently developing a feature with that name or description?
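
As a sketch of how such policies can be captured in practice, the snippet below attaches a hypothetical schematized tag named "privacy" to a feature group, recording answers to policies 2, 3, and 4 above. Tag schemas must first be defined by an administrator, and exact method names may vary between client versions.

    # Assumes a feature store handle fs and an admin-defined tag schema
    # named "privacy" (both hypothetical).
    fg = fs.get_feature_group(name="card_transactions", version=1)

    fg.add_tag("privacy", {
        "contains_pii": True,             # policy 4: PII present
        "allowed_regions": ["EU"],        # policy 2: geographic constraint
        "online_serving_allowed": False,  # policy 3: batch inference only
    })

    # Tags are searchable in the Feature Catalog and can be read back.
    print(fg.get_tag("privacy"))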

Feature Engineering 

Hopsworks enables you to write feature engineering logic in any Python, Spark or SQL environment. You can run your feature engineering pipelines in Hopsworks itself or on external platforms, such as Databricks, AWS EMR, Azure ML Studio, Jupyter Notebooks, Colab, or even Snowflake/Redshift/Synapse.
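
As a simple sketch of such a feature pipeline, the snippet below computes per-card aggregates with pandas and writes them to a feature group. The data path, feature group name, and keys are hypothetical, a feature store handle fs is assumed, and the exact feature-group creation method may differ between client versions.

    import pandas as pd

    # Hypothetical raw data source: one row per card transaction.
    raw = pd.read_parquet("s3://my-bucket/transactions/")  # placeholder path

    # Feature engineering in plain pandas: aggregate per card.
    features = (
        raw.groupby("cc_num")
           .agg(tx_count=("amount", "count"),
                avg_amount=("amount", "mean"),
                max_amount=("amount", "max"))
           .reset_index()
    )

    # Write the features to the Feature Store (assumes a feature store handle fs).
    fg = fs.get_or_create_feature_group(
        name="card_aggregates",   # hypothetical feature group
        version=1,
        primary_key=["cc_num"],
        online_enabled=True,      # also materialize to the online store
    )
    fg.insert(features)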

Hopsworks provides user-interface and API support for specifying data validation rules for features. You can create training data with point-in-time correctness for feature values, and define online transformation functions in Python, where the same Python function is used to consistently transform feature values for training data, batch inference, and feature vectors for online models.
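
The benefit of defining a transformation once is easiest to see in code. The sketch below is a plain-Python illustration of the principle rather than the Hopsworks transformation-function API itself: a single function is applied both when building training data and when transforming an online feature value, so there is no skew between training and serving. All names and values are hypothetical.

    import pandas as pd

    def min_max_scale(value, min_value, max_value):
        """Scale a numeric feature into [0, 1]; defined once, used everywhere."""
        return (value - min_value) / (max_value - min_value)

    # Statistics computed once over the training data (hypothetical values).
    amount_min, amount_max = 0.0, 10_000.0

    # 1) Building training data: apply the function to a whole column.
    train_df = pd.DataFrame({"avg_amount": [12.5, 250.0, 9_800.0]})
    train_df["avg_amount_scaled"] = train_df["avg_amount"].apply(
        min_max_scale, args=(amount_min, amount_max)
    )

    # 2) Online serving: apply the very same function to a single feature value,
    #    avoiding skew between training and inference.
    online_value = min_max_scale(250.0, amount_min, amount_max)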

Hopsworks also supports the freshest features of any available Feature Store. You can compute your features in streaming platforms (Spark Streaming or Apache Flink) and write those features to the Hopsworks Feature Store with just a few seconds of end-to-end latency, from data arrival to precomputed features being available for serving to applications.

Figure 2. Feature Store and Data Infrastructure

Security

The Hopsworks platform supports Single Sign-On (SSO) with Active Directory, OAuth-2, and LDAP. All data-in-transit inside Hopsworks is secured using two-way TLS and X.509 certificates. The Hopsworks REST API supports authentication with API keys and JWT over an HTTPS connection. Hopsworks can securely integrate with a Kubernetes cluster to offload feature engineering, model training, notebooks, and model serving (KServe).

Figure 3. Hopsworks ML Environment

Performance at Scale with RonDB

With RonDB, Hopsworks has the lowest-latency, highest-throughput Feature Store available on the market today. RonDB has its roots in NDBCluster, an open-source key-value store with a SQL API that powers the world’s highest-availability platforms (up to seven nines of availability) in the Telecoms industry, with over 5 billion users globally using NDBCluster daily.

RonDB provides sub-millisecond access to features and up to millions of concurrent reads/writes on modestly sized two-node high-availability clusters, while also scaling up to 144 nodes and petabytes of storage. Data in RonDB can be stored in-memory (for the lowest latency) or on-disk (for lower-cost storage). RonDB supports both a MySQL API (with widespread interoperability, authentication, and access control) and a native API (higher performance for C++ and Java clients).

Figure 4. RonDB - LATS database in the cloud

Historical Features in a Hudi Lakehouse or in your Data Warehouse

Historical feature values can be stored in Hudi tables in Hopsworks. Hudi provides ACID support for Parquet data on object storage. In Hopsworks, these cached historical features are stored in the high-performance HopsFS filesystem, backed by cloud-native object storage.

With HopsFS, you pay only object storage prices (data is stored in an S3 bucket or Azure Blob Storage container), but you get the benefits of an HDFS API, up to 100X higher metadata performance than cloud-native object stores, and up to 3.4X higher read performance.

If you prefer, you can also keep historical feature values in your existing Data Warehouse, registering its tables as on-demand feature groups that can be used just like cached features in Hopsworks. Like Hudi tables in Hopsworks, on-demand feature groups also support data validation, and Hopsworks can compute statistics over them for exploratory data analysis. Supported data warehouses include Snowflake, Redshift, Delta Lake, Synapse, BigQuery, and any JDBC-enabled database.
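
As a sketch, the snippet below registers a Snowflake table as an on-demand (external) feature group. The connector name, query, and keys are hypothetical, a feature store handle fs is assumed, and the creation method has been called create_on_demand_feature_group or create_external_feature_group depending on the client version.

    # Assumes a feature store handle fs and a Snowflake storage connector
    # already configured by an administrator (connector name is hypothetical).
    snowflake = fs.get_storage_connector("snowflake_prod")

    ext_fg = fs.create_external_feature_group(
        name="customer_profiles",   # hypothetical feature group
        version=1,
        query="SELECT customer_id, segment, lifetime_value FROM PROFILES",
        storage_connector=snowflake,
        primary_key=["customer_id"],
    )
    ext_fg.save()  # register the table; the data stays in the warehouse

    # Once registered, it can be read and joined like any cached feature group.
    df = ext_fg.read()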

Figure 5. Positioning Lifecycle with Data Warehouse

Real-Time Features with Spark Streaming or Apache Flink

Hopsworks enables the freshest features by supporting the computation of features using either Spark Streaming or Apache Flink. As live events from your systems arrive at a message bus (such as Kafka or Kinesis), a streaming application (that can run either on Hopsworks or on your platform of choice) computes aggregations and writes them to the Hopsworks Online Feature Store (RonDB).

Hopsworks scales to millions of concurrent writes per second from streaming platforms, and it can also transparently and consistently update the historical feature store.
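
As a sketch of this flow, the Spark Structured Streaming job below reads events from Kafka, computes ten-minute windowed aggregates per card, and continuously upserts the results into an online-enabled feature group. The topic, schema, and feature group names are hypothetical, a Spark session named spark and a feature store handle fs are assumed, and details such as output modes and checkpointing are omitted.

    from pyspark.sql import functions as F

    # Read live transaction events from Kafka (brokers and topic are placeholders,
    # and a Spark session named spark is assumed).
    events = (
        spark.readStream.format("kafka")
             .option("kafka.bootstrap.servers", "broker:9092")
             .option("subscribe", "transactions")
             .load()
    )

    # Parse the message value and compute ten-minute windowed aggregates per card.
    parsed = (
        events.selectExpr("CAST(value AS STRING) AS json")
              .select(F.from_json("json", "cc_num STRING, amount DOUBLE, ts TIMESTAMP").alias("t"))
              .select("t.*")
    )
    aggs = (
        parsed.withWatermark("ts", "10 minutes")
              .groupBy(F.window("ts", "10 minutes"), "cc_num")
              .agg(F.count("amount").alias("tx_count_10m"),
                   F.avg("amount").alias("avg_amount_10m"))
              .select("cc_num", "tx_count_10m", "avg_amount_10m")
    )

    # Continuously upsert the fresh aggregates into an online-enabled feature group
    # (assumes a feature store handle fs).
    fg = fs.get_feature_group("card_aggregates_10m", version=1)
    streaming_query = fg.insert_stream(aggs)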

MLOps and the Feature Store

“MLOps is not the beer, it’s the brewery”
- Jim Dowling

Hopsworks comes with built-in Airflow support. You can use Airflow or an external CI/CD platform, such as Jenkins or Azure Data Factory, to automate the computation of features for your feature store, as well as the creation of training data for model training. In fact, if you also use Hopsworks for model training and model serving, you can use Hopsworks as your end-to-end MLOps platform.
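
As a minimal sketch of such automation, the Airflow DAG below schedules a feature pipeline to run hourly; the run_feature_pipeline function is a hypothetical placeholder for your own feature engineering job.

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def run_feature_pipeline():
        # Placeholder: call your own feature engineering job here, e.g. a
        # Python script that computes features and inserts them into a
        # feature group in Hopsworks.
        ...

    with DAG(
        dag_id="hourly_feature_pipeline",  # hypothetical DAG name
        start_date=datetime(2023, 1, 1),
        schedule_interval="@hourly",
        catchup=False,
    ) as dag:
        PythonOperator(
            task_id="compute_features",
            python_callable=run_feature_pipeline,
        )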

Deploy Anywhere: Managed in the Cloud or On-Premises

Hopsworks is available wherever you have your data. The easiest way to get started is with the managed platform in the cloud, available today on AWS and Azure at www.hopsworks.ai. Hopsworks is also available as an open-source platform and as an Enterprise platform for on-premises data centers, with extra support for SSO and Kubernetes integration.