Hopsworks Feature Store

The Feature Store is a data management system for machine learning features, covering both the feature engineering code and the feature data. It helps ensure that features used during training and serving are consistent, and that features are documented and reused across the enterprise.

1. What environment are you using?

On-Premise

Hopsworks Feature Store can be installed on your own infrastructure with the installer. The supported operating systems are Ubuntu/Debian and Red Hat/CentOS.

With Kubernetes

Hopsworks cannot currently be deployed on a Kubernetes cluster. You can, however, connect Enterprise Hopsworks to an existing Kubernetes cluster, where you can run Python jobs and Jupyter notebooks and serve models (KFServing). Hopsworks Community does not support Kubernetes integration.

On Google Cloud Platform

Hopsworks can be deployed as a self-managed cluster on virtual machines (VMs) inside an existing GCP project. You can either run the Hopsworks Cloud Installer tool that uses the GCP command-line utility to create VMs and install Hopsworks or you can create the VMs yourself and then run the Hopsworks Installer tool.

On Azure

Hopsworks can be deployed as either:

(1) a managed platform inside your organisation’s existing cloud account at www.hopsworks.ai. You need to connect your Azure account by creating a service principal.

(2) a self-managed cluster on virtual machines (VMs) inside your organisation’s existing Azure subscription. You can either run the Hopsworks Cloud Installer tool, which uses the az command-line utility to create VMs and install Hopsworks, or create the VMs, resource group, virtual network, and private DNS zone yourself and then run the Hopsworks Installer tool.

On AWS

Hopsworks can be deployed as either:

(1) a managed platform inside your organisation’s existing cloud account at www.hopsworks.ai. You need to connect your AWS cloud account by creating a cross-account role.

(2) a self-managed cluster on virtual machines (VMs) inside your organisation’s existing cloud account. You create EC2 VMs and run the Hopsworks Installer Tool.

2. What is your architecture?

Your Data Source
Your Feature Engineering
Hopsworks Feature Store
Your Data Science

Source: HDFS

Hopsworks Feature Store supports HDFS as a data source.

Source: Delta Lake

Hopsworks Feature Store supports Delta Lake as a data source.

Source: JDBC

Hopsworks Feature Store supports JDBC as a data source.

We support the following databases:

• MySQL
• Microsoft SQL Server
• PostgreSQL
• MongoDB
• Db2
• Redis

Source: Snowflake

Hopsworks Feature Store supports Snowflake as a data source.

Source: Azure Data Lake

Hopsworks Feature Store supports Azure Data Lake and Azure Blob Storage as data sources.

Source: Redshift

Hopsworks Feature Store supports Redshift as a data source. 

Source: Spark

Hopsworks Feature Store supports Spark as a data source.

We support the following databases:

• Elasticsearch
• Cassandra

Engineering: Hopsworks

You can do your feature engineering natively in the Hopsworks platform.

Engineering: SageMaker

Hopsworks supports feature engineering from AWS SageMaker.

Engineering: Azure ML

Hopsworks supports feature engineering from Azure ML. Contact us for more information.

Engineering: Databricks

Hopsworks supports feature engineering from Databricks.

Engineering: Kubeflow

Hopsworks supports feature engineering on Kubeflow. Contact us for more information.

Engineering: Other Python

Hopsworks supports other Python environments, such as Jupyter.

Engineering: Other Spark

Hopsworks supports other Spark environments, such as Cloudera.

Data Science: Hopsworks

If you don’t already have a Data Science platform, Hopsworks can be used to train and serve models, providing end-to-end security and governance, in a notebook friendly, Python-first platform.

Data Science: Databricks

Databricks users can use the Hopsworks Feature Store as a standalone service to manage their features for model training and model serving.

Data Science: SageMaker

AWS SageMaker users can use the Hopsworks Feature Store as a standalone service to manage their features for model training and model serving.

Data Science: Azure ML

Azure ML users can use the Hopsworks Feature Store as a standalone service to manage their features for model training and model serving.

Data Science: Dataiku

Dataiku users can use the Hopsworks Feature Store as a standalone service to manage their features for model training and model serving.

Data Science: Domino

Domino users can use the Hopsworks Feature Store as a standalone service to manage their features for model training and model serving.

Data Science: Kubeflow

Kubeflow users can use the Hopsworks Feature Store as a standalone service to manage their features for model training and model serving.

Data Science: Jupyter Notebooks

Jupyter users can use the Hopsworks Feature Store as a standalone service to manage their features for model training and model serving.

Core components

Online Feature Store

The online Feature Store stores the latest values of features and supports low latency access to features that should be highly available.

The Hopsworks Online Feature Store is built on NDB, the world’s highest-throughput, low-latency key-value store, which is also a full database with 99.999% availability.

It is highly available across availability zones within a region.

It can be replicated asynchronously between regions.
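
The idea of an online store that keeps only the latest value of each feature per key can be sketched in a few lines of Python. This is a toy, in-memory illustration and not how NDB or the Hopsworks API works; the class name LatestValueStore, its methods, and the feature names are all hypothetical.

```python
from datetime import datetime, timezone


class LatestValueStore:
    """Toy in-memory stand-in for an online store: one row per primary key."""

    def __init__(self):
        self._rows = {}  # primary key -> (event_time, feature dict)

    def upsert(self, key, event_time, features):
        # Overwrite the stored row only if the incoming one is at least as recent.
        current = self._rows.get(key)
        if current is None or event_time >= current[0]:
            self._rows[key] = (event_time, dict(features))

    def get_feature_vector(self, key, names):
        # Return the latest values in the order the model expects them.
        _, features = self._rows[key]
        return [features[name] for name in names]


store = LatestValueStore()
store.upsert(42, datetime(2021, 1, 1, tzinfo=timezone.utc),
             {"avg_spend": 10.0, "n_orders": 3})
store.upsert(42, datetime(2021, 1, 2, tzinfo=timezone.utc),
             {"avg_spend": 12.5, "n_orders": 4})
# A late-arriving, older event does not overwrite the newer row.
store.upsert(42, datetime(2020, 12, 31, tzinfo=timezone.utc),
             {"avg_spend": 1.0, "n_orders": 1})

vector = store.get_feature_vector(42, ["avg_spend", "n_orders"])
```

A serving application would look up the row by the entity's primary key and pass the resulting vector to the model; the availability and replication properties described above are what a real store adds on top of this lookup.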

Offline Feature Store

The offline Feature Store supports exploratory data analysis and large volumes of data for model training and for use by batch (analytical) applications for model scoring.

The Hopsworks Offline Feature Store is built on HopsFS, which provides higher performance than S3 and Azure Blob Storage (ABS) with no increase in cost (data is stored in an S3/ABS bucket in your account). It remains open, providing an HDFS API and open file formats such as Parquet, Hudi, Delta Lake, and ORC.

It can be configured to be highly available across availability zones within a region.

It can be replicated asynchronously between regions.

Feature Store API

We provide comprehensive API documentation and examples, enabling you to create, version and control your features using Python, Scala, or Java.

Other Feature Stores provide a SQL API for transforming your data into features and retrieving features from the Feature Store. Hopsworks Feature Store is Python-friendly, providing a Pandas-like API that makes complex operations simple, such as joining features together to create training data.

Our API has also been revised to apply what we have learnt from supporting production Feature Stores, so we support versioning of both feature schemas and feature values (time-travel). Versioning is key to enabling developers to update feature definitions without breaking existing feature pipelines.
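
As a rough illustration of the two ideas above (versioned feature definitions, and joining features on a shared key to build training data), here is a minimal plain-Python sketch. It does not use the Hopsworks API; FeatureRegistry, join_on_key, and the feature group names are hypothetical and exist only for this example.

```python
class FeatureRegistry:
    """Toy registry: feature groups are stored under (name, version)."""

    def __init__(self):
        self._groups = {}

    def register(self, name, version, key, rows):
        # Existing versions are immutable; a schema change gets a new version.
        if (name, version) in self._groups:
            raise ValueError(f"{name} v{version} already exists; register a new version")
        self._groups[(name, version)] = {"key": key, "rows": list(rows)}

    def get(self, name, version):
        return self._groups[(name, version)]


def join_on_key(left, right):
    """Inner-join two feature groups that share the same key column."""
    key = left["key"]
    right_by_key = {row[key]: row for row in right["rows"]}
    joined = []
    for row in left["rows"]:
        match = right_by_key.get(row[key])
        if match is not None:
            joined.append({**row, **match})
    return joined


registry = FeatureRegistry()
registry.register("customer_profile", 1, "customer_id",
                  [{"customer_id": 1, "age": 34}, {"customer_id": 2, "age": 51}])
# Version 2 changes the schema without touching pipelines that read version 1.
registry.register("customer_profile", 2, "customer_id",
                  [{"customer_id": 1, "age_bucket": "30-39"},
                   {"customer_id": 2, "age_bucket": "50-59"}])
registry.register("customer_orders", 1, "customer_id",
                  [{"customer_id": 1, "n_orders": 3}])

training_rows = join_on_key(registry.get("customer_profile", 1),
                            registry.get("customer_orders", 1))
```

Because each pipeline pins a (name, version) pair, a consumer of customer_profile version 1 keeps working after version 2 is registered, which is the property the versioning paragraph describes.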

Business Benefits

Decrease Time & Costs

The more features that are made available, the more features will be reused directly in many models.

Monitor model performance

Feature statistics help you quickly identify and correct problems with model performance.

Increase Revenue

Correct feature pipelines, with no training/inference feature engineering skew affecting model performance.

Governance and Security

Access control, custom metadata, and end-to-end security ensure that all actions are audited, from ingesting features to using them.

User Benefits

The Feature Store's benefits are broad and, while there is some overlap, they vary depending on the user: data scientist, data engineer, or ML engineer. The direct benefits for each role are listed below.

For Data Scientists

Reduce time preparing training data
60-80% of time is spent finding the data and preparing it for use as training data.
Go beyond feature discovery
See the code used to define the feature, preview the feature values, visualize the feature distribution, understand the statistical properties of the feature, see who’s the feature owner and the conditions for use of the feature.
Implement new features
Define the data source of a new feature, from either existing features or an external data source, and the pipeline for feature ingestion.
DevOps model as part of the development model
Make features automatically available for reuse by others, without any extra effort required to make features shareable.

For Data Engineers

Make data available for other teams
Enable data scientists to quickly access enterprise data stored in data lakes, data warehouses, and event buses, both to train models and to supply online and batch applications that make predictions, and set up a CI/CD framework for feature ingestion.
Provide support to other teams
Build the pipelines that ingest data from external sources, and help define software tests for the feature transformation logic and for the quality of data ingested into the feature store, identifying drift in feature ingestion pipelines before it pollutes the feature store.
Curate the feature store
Ensure that features move smoothly between development and production, and that only features that are in use are managed by the platform.
Upgrade features for new releases
Upgrade the schemas of features and revert a specific ingestion of feature values, without breaking existing feature pipelines.

For ML Engineers

Retrieve features with the online feature store
Deploy online models to production and retrieve features with low latency to build the feature vector that is sent to the model for prediction.
Scalable and highly available real-time feature engineering
Reuse the pipelines that process data supplied by website users or by external computer-generated sources (such as Internet-of-Things devices) to create training data for models, preventing skew between training and inference pipelines.
Retrieve training data statistics
Use API calls on the feature store to monitor models in production and identify data drift by comparing statistics on live traffic with training data statistics.
Debug operational models
Governance and metadata support enable debugging of end-to-end machine learning pipelines when problems with operational models are identified.
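
The drift check mentioned under "Retrieve training data statistics" can be sketched as a comparison between stored training statistics and statistics computed on live traffic. The sketch below is one simple heuristic under stated assumptions, not the Hopsworks API: the functions summarize and has_drifted, the sample values, and the threshold of 3 standard deviations are all hypothetical choices made for illustration.

```python
from statistics import mean, stdev


def summarize(values):
    """Compute the statistics that would be stored alongside the training data."""
    return {"mean": mean(values), "std": stdev(values)}


def has_drifted(training_stats, live_values, threshold=3.0):
    """Flag drift when the live mean is more than `threshold` training
    standard deviations away from the training mean."""
    live_mean = mean(live_values)
    if training_stats["std"] == 0:
        # A constant training feature drifts as soon as the live mean moves.
        return live_mean != training_stats["mean"]
    z = abs(live_mean - training_stats["mean"]) / training_stats["std"]
    return z > threshold


training_stats = summarize([10.0, 11.0, 9.0, 10.5, 9.5])
stable = has_drifted(training_stats, [10.2, 9.8, 10.1])    # live traffic close to training
shifted = has_drifted(training_stats, [25.0, 26.0, 24.5])  # live traffic far from training
```

In practice a monitoring job would run a check like this per feature on a schedule, and a production system would likely compare full distributions rather than means alone.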

Try now free

Hopsworks Feature Store is available on both AWS and Azure as a managed platform.
You can register for free, without having to enter payment details.