The Feature Store is a platform for managing machine learning features, including both the feature engineering code and the feature data itself. The Feature Store enables real-time computation of features and storage of pre-computed features, ensuring that features used during training and serving are consistent, and providing high-performance access to features for training, batch predictions, and online model serving.
Similar to how the Data Warehouse centralizes data for analytics, the Feature Store is the central source of truth for data used in ML - it is a platform for discovering, documenting, and reusing features across many different models.
Teams need a Feature Store to operate both analytical and operational models that are fed with precomputed feature vectors. They also need to apply real-time transformations and to log predictions and outcomes from online serving back to the Feature Store to help generate new training datasets.
They need a Feature Store to store the output of feature pipelines, making (1) historical feature data available to train machine learning models and operate analytical models (batch inference), and (2) precomputed features available to operate operational models (online inference).
They need a Feature Store to discover and explore available features, create batch or real-time aggregations, define feature transformations, and select features to build training data.
Finally, they need to secure and govern the feature data in the Feature Store, along with the feature pipelines, training pipelines, and inference pipelines that use it.
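The dual role described above — a full feature history for training and a latest-value lookup for online serving — can be sketched as a toy in-memory store. This is an illustrative sketch only, not the Hopsworks API; all class and method names here are made up.

```python
from collections import defaultdict

class MiniFeatureStore:
    """Toy sketch (not the Hopsworks API): an offline store that appends
    timestamped feature rows for training, plus an online store that keeps
    only the latest feature vector per entity key for low-latency serving."""

    def __init__(self):
        self.offline = defaultdict(list)   # feature_group -> list of rows
        self.online = defaultdict(dict)    # feature_group -> {key: latest row}

    def insert(self, feature_group, key, row, event_time):
        record = {"key": key, "event_time": event_time, **row}
        self.offline[feature_group].append(record)          # full history
        current = self.online[feature_group].get(key)
        if current is None or current["event_time"] <= event_time:
            self.online[feature_group][key] = record        # latest value only

    def get_feature_vector(self, feature_group, key):
        """Online inference: look up the freshest precomputed features."""
        return self.online[feature_group][key]

    def get_training_rows(self, feature_group):
        """Batch inference / training: read the full feature history."""
        return list(self.offline[feature_group])

store = MiniFeatureStore()
store.insert("cc_trans", key=42, row={"avg_amount": 10.0}, event_time=1)
store.insert("cc_trans", key=42, row={"avg_amount": 12.5}, event_time=2)
print(store.get_feature_vector("cc_trans", 42)["avg_amount"])  # 12.5
print(len(store.get_training_rows("cc_trans")))                # 2
```

The key design point is that both stores are written in a single insert, so training data and online feature vectors can never diverge.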
Hopsworks includes an integrated data science and MLOps platform for developing features, training models, and serving models. However, many organizations already have existing data science, model serving, and MLOps platforms.
For those organizations, the Hopsworks Feature Store can be used as a standalone open platform that seamlessly integrates with existing feature engineering, model training, and model serving platforms - whether in the cloud or on-premises. You can keep feature data in your existing Data Warehouse and make it available on-demand (as external tables), and you can use Python/Scala/Java APIs or SQL to integrate model-serving infrastructure with the Feature Store.
Enterprise users can use the Feature Catalog to discover what features are available, perform exploratory data analysis by inspecting feature statistics and previewing the feature data, and - with custom schematized tags - even identify the constraints under which a feature can be used. You can also design the Feature Catalog layout you want: a classical dev/staging/production setup, or a data mesh where different business units have their own Feature Catalogs but features can still be securely shared between teams using Hopsworks’ unique project-based multi-tenancy security model.
Hopsworks gives you tremendous flexibility to design your own governance policies using self-service schematized tags and access control policies.
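To make the idea of a "schematized" tag concrete, here is a toy sketch in plain Python. The tag schema and field names are invented for illustration; this is not the Hopsworks tag API. The point is that because tag values conform to a schema, governance policies can be checked mechanically rather than by convention.

```python
# Toy sketch: a schematized tag is a tag whose value must conform to a
# schema, so governance rules can be evaluated automatically.
# Field names below are illustrative, not a Hopsworks-defined schema.
PII_TAG_SCHEMA = {
    "contains_pii": bool,    # does the feature group hold personal data?
    "retention_days": int,   # how long raw values may be kept
    "allowed_use": str,      # e.g. "fraud-detection-only"
}

def validate_tag(tag: dict, schema: dict) -> list:
    """Return a list of schema violations (an empty list means valid)."""
    errors = []
    for field, expected_type in schema.items():
        if field not in tag:
            errors.append(f"missing field: {field}")
        elif not isinstance(tag[field], expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}")
    return errors

ok = validate_tag(
    {"contains_pii": True, "retention_days": 90,
     "allowed_use": "fraud-detection-only"},
    PII_TAG_SCHEMA,
)
bad = validate_tag({"contains_pii": "yes"}, PII_TAG_SCHEMA)
print(ok)   # [] - the tag conforms to the schema
print(bad)  # three violations: wrong type plus two missing fields
```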
Hopsworks enables you to write feature engineering logic in any Python, Spark or SQL environment. You can run your feature engineering pipelines in Hopsworks itself or on external platforms, such as Databricks, AWS EMR, Azure ML Studio, Jupyter Notebooks, Colab, or even Snowflake/Redshift/Synapse.
Hopsworks provides user-interface and API support for specifying data validation rules for features. You can create training data with point-in-time correctness for feature values, and define online transformation functions in Python, where the same Python function is used to consistently transform feature values for training data, batch inference, and feature vectors for online models.
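Point-in-time correctness means that for each labeled training event, only feature values known at or before the event's timestamp are joined in, so no future information leaks into training data. A minimal sketch of that join in plain Python (not the Hopsworks query API; entity IDs and values are made up):

```python
# Toy sketch of a point-in-time correct join: for each labeled event,
# pick the most recent feature value with timestamp <= the event's
# timestamp, so no future information leaks into the training data.
def point_in_time_join(label_events, feature_rows):
    """label_events: [(entity_id, event_ts, label)]
    feature_rows: [(entity_id, feature_ts, value)], any order."""
    training_rows = []
    for entity_id, event_ts, label in label_events:
        candidates = [
            (ts, value)
            for eid, ts, value in feature_rows
            if eid == entity_id and ts <= event_ts   # no future leakage
        ]
        # Take the latest valid value, or None if nothing was known yet.
        value = max(candidates)[1] if candidates else None
        training_rows.append((entity_id, value, label))
    return training_rows

features = [("u1", 10, 0.2), ("u1", 20, 0.7), ("u1", 30, 0.9)]
labels = [("u1", 25, 1), ("u1", 5, 0)]
print(point_in_time_join(labels, features))
# [('u1', 0.7, 1), ('u1', None, 0)]
```

Note that the label at timestamp 25 gets the value written at 20 (not the "better" value written at 30), which is exactly the leakage a naive latest-value join would introduce.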
Hopsworks also delivers the freshest features of any available Feature Store. You can compute your features on streaming platforms (Spark Streaming or Apache Flink) and write those features to the Hopsworks Feature Store with just a few seconds of end-to-end latency, from data arrival to precomputed features being available for serving to applications.
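The streaming pattern above - continuously updating a windowed aggregate and publishing it as a fresh online feature on every event - can be sketched in plain Python standing in for Spark Streaming/Flink. The feature names, window length, and event schema are invented for illustration.

```python
# Toy sketch of a streaming feature pipeline: maintain a sliding-window
# aggregate per card and publish it as an online feature on every event,
# mimicking second-level feature freshness.
from collections import defaultdict, deque

WINDOW_SECONDS = 600          # 10-minute sliding window (illustrative)
events_by_key = defaultdict(deque)
online_features = {}          # key -> freshest precomputed feature values

def on_event(card_id, amount, ts):
    window = events_by_key[card_id]
    window.append((ts, amount))
    # Evict events older than the window (keep events within WINDOW_SECONDS).
    while window and window[0][0] < ts - WINDOW_SECONDS:
        window.popleft()
    # Recompute and publish immediately: end-to-end latency is one call.
    online_features[card_id] = {
        "num_trans_10m": len(window),
        "sum_amount_10m": sum(a for _, a in window),
    }

on_event("card-1", 50.0, ts=0)
on_event("card-1", 25.0, ts=100)
on_event("card-1", 10.0, ts=700)   # the ts=0 event falls out of the window
print(online_features["card-1"])   # {'num_trans_10m': 2, 'sum_amount_10m': 35.0}
```

In a real pipeline the `online_features` write would be an insert into the Feature Store's online store rather than an in-process dict.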
The Hopsworks platform supports Single Sign-On (SSO) with Active Directory, OAuth2, and LDAP. All data-in-transit inside Hopsworks is secured using two-way TLS and X.509 certificates. The Hopsworks REST API supports authentication with API keys and JWT over an HTTPS connection. Hopsworks can securely integrate with a Kubernetes cluster to offload feature engineering, model training, notebooks, and model serving (KServe).
With RonDB, Hopsworks has the lowest-latency, highest-throughput Feature Store available on the market today. RonDB has its roots in NDBCluster, an open-source key-value store with a SQL API, which powers the world’s highest-availability platforms (up to 7 nines of availability) in the Telecoms industry, with over 5 billion users globally relying on NDBCluster daily.
RonDB provides sub-millisecond access to features and up to millions of concurrent reads/writes on modestly sized 2-node high-availability clusters, while also scaling up to 144 nodes and PBs of storage. Data in RonDB can be stored in-memory (for the lowest latency) or on-disk (for lower-cost storage). RonDB supports both a MySQL API (with widespread interoperability, authentication, and access control) and a native API (higher performance for C++ and Java clients).
Historical feature values can be stored in Hudi tables in Hopsworks. Hudi provides ACID support for Parquet data on object storage. In Hopsworks, these cached historical features are stored in our high-performance HopsFS filesystem, backed by cloud-native object storage.
With HopsFS, you pay only the object-storage price (data lives in an S3 bucket or Azure Blob Storage container), but get the benefits of an HDFS API, up to 100X higher metadata performance than cloud-native object stores, and up to 3.4X higher read performance.
If you prefer, you can also keep historical feature values in your existing Data Warehouse, registering its tables as on-demand feature groups that can be used just like cached features in Hopsworks. Similar to Hudi tables in Hopsworks, on-demand features also support data validation and Hopsworks can compute statistics over them for exploratory data analysis. Supported data warehouses include Snowflake, Redshift, Delta Lake, Synapse, BigQuery, and any JDBC-enabled database.
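The on-demand pattern amounts to registering a SQL query whose results are fetched from the warehouse at read time, with nothing copied into the Feature Store. A toy sketch below uses `sqlite3` as a stand-in for a JDBC-connected warehouse; the table, column, and function names are invented for illustration.

```python
# Toy sketch of an on-demand (external) feature group: feature values stay
# in the warehouse and are fetched with a SQL query at read time.
# sqlite3 stands in here for a JDBC-connected data warehouse.
import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.executescript("""
    CREATE TABLE customer_orders (customer_id TEXT, order_total REAL);
    INSERT INTO customer_orders VALUES ('c1', 40.0), ('c1', 60.0), ('c2', 10.0);
""")

# The "feature group" is just a registered query; no data is copied.
ON_DEMAND_QUERY = """
    SELECT customer_id,
           COUNT(*)         AS num_orders,
           AVG(order_total) AS avg_order_total
    FROM customer_orders
    GROUP BY customer_id
    ORDER BY customer_id
"""

def read_on_demand_features():
    """Executed on demand, e.g. when building training data or computing
    statistics for exploratory data analysis."""
    return warehouse.execute(ON_DEMAND_QUERY).fetchall()

print(read_on_demand_features())
# [('c1', 2, 50.0), ('c2', 1, 10.0)]
```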
"MLOps is not the beer, it's the brewery"
Hopsworks is available wherever you have your data. The easiest way to get started is with the managed platform in the cloud, available today on AWS and Azure at www.hopsworks.ai. Hopsworks is also available as an open-source platform and as an Enterprise platform for on-premises data centers, with extra support for SSO and Kubernetes integration.