Hopsworks Product Capabilities

Feature Store On-Premises

Hopsworks allows you to manage all your data for machine learning in a feature store that integrates with data platforms such as Oracle, Cloudera, Teradata, DB2, SQL Server, Kafka, MongoDB, Elastic, Kubernetes clusters, and more. Hopsworks also provides compute (Jupyter, Python, Spark, Flink), storage (commodity disks or an existing S3-compatible object store), and model registry/serving capabilities.

Why Hopsworks On-Premises?

Feature Computation and Storage

Hopsworks provides its own compute and storage, giving you all of the building blocks for operational ML systems. Hopsworks can run your Spark/Flink/Python jobs for feature/training/inference pipelines and store your offline feature data, replicated, either on commodity servers or on an S3-compatible object store, with optional KServe model serving/registry support available.

Highest Performance

Hopsworks Online Feature Store is the highest-throughput, lowest-latency online feature store available today. It is built on RonDB, our cloud-native version of MySQL Cluster, which powers most of the world's network operator databases with 7 nines (99.99999%) of availability. It is the only online feature store that can handle personalized search/recommendation use cases with the strictest latency SLAs.

Python Native

Alongside its support for feature engineering in Spark and Flink, Hopsworks provides unique Python APIs for high-performance reading and writing of features from any Python environment, including Hopsworks itself, Kubernetes (Kubeflow Pipelines), Dataiku, Domino Data, or any Python-enabled data science platform.

Integration with Data Warehouses, Hadoop/Cloudera, S3-Compatible Object Stores, and Kubernetes

Hopsworks stores its offline feature groups as Hudi tables, either on the local disks of commodity servers or on an S3-compatible object store. Hopsworks provides an HDFS-compatible API to access the data through its own file system, HopsFS. When data is stored in an S3 object store, HopsFS acts as a fast cache in front of the object store.

Hopsworks supports external feature groups in its offline store with connectors for Oracle, S3 object storage, and any JDBC-enabled database. This means you can keep your existing feature pipelines that create tables in Oracle/DB2/etc. or Parquet tables on your S3 object store, and just mount them as external feature groups in Hopsworks.
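The key idea behind an external feature group is that only a query over the source table is registered in the feature store; the data itself stays where your existing pipelines put it. The following is a conceptual stand-in (not the Hopsworks API), using sqlite3 in place of a JDBC source so the sketch is runnable:

```python
# Conceptual sketch only: an "external feature group" is essentially a
# registered query over a table that remains in the source database --
# no feature data is copied into the feature store's own storage.
import sqlite3

# Stand-in for an existing table maintained by your current pipelines
# (Oracle/DB2/etc. in practice; in-memory sqlite3 here for illustration).
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE tx_features (customer_id INT, avg_spend REAL)")
src.executemany("INSERT INTO tx_features VALUES (?, ?)",
                [(1, 42.0), (2, 17.5)])

class ExternalFeatureGroup:
    """Stores only a query; rows are fetched on demand from the source."""
    def __init__(self, conn, query):
        self.conn, self.query = conn, query

    def read(self):
        return self.conn.execute(self.query).fetchall()

fg = ExternalFeatureGroup(
    src, "SELECT customer_id, avg_spend FROM tx_features")
print(fg.read())  # rows come straight from the source table
```

The class name and methods here are illustrative only; the actual Hopsworks connectors handle authentication, schemas, and pushdown against the real source systems.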

You can also write feature pipelines in Python/Spark/Flink that read from almost any data source and write to Hopsworks feature groups in Hudi. This is a lower-cost and higher-performance alternative to storing feature tables in your data warehouse. When you read offline feature data from Hudi tables, you read via the high-performance file system, HopsFS. In Python, we provide lightning-fast access to this data via our FlyingDuck service, which transfers data from HopsFS/S3 to Pandas clients in Arrow format (without serialization/deserialization) and uses DuckDB server-side to do push-down filtering and point-in-time correct joins across feature groups.
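A point-in-time correct join means that for each training/label row, you take the latest feature value whose timestamp is at or before that row's event timestamp, so no information from the future leaks into training data. The sketch below (plain Python, not the Hopsworks API; in Hopsworks this is done server-side by DuckDB over Hudi tables) illustrates the semantics:

```python
# Conceptual sketch: for each event row, pick the most recent feature
# row for the same key whose timestamp is <= the event timestamp.
def point_in_time_join(events, features, key="id", ts="ts"):
    """events: rows needing features; features: historical feature rows."""
    joined = []
    for ev in events:
        candidates = [
            f for f in features
            if f[key] == ev[key] and f[ts] <= ev[ts]
        ]
        latest = max(candidates, key=lambda f: f[ts], default=None)
        row = dict(ev)
        row["amount"] = latest["amount"] if latest else None
        joined.append(row)
    return joined

events = [{"id": 1, "ts": 10}, {"id": 1, "ts": 5}]
features = [
    {"id": 1, "ts": 3, "amount": 100},
    {"id": 1, "ts": 8, "amount": 250},  # in the future for the ts=5 event
]
print(point_in_time_join(events, features))
# The ts=10 event sees amount=250; the ts=5 event only sees amount=100.
```

This nested-loop version is quadratic and for illustration only; production systems implement the same semantics with sorted merges or ASOF-style joins.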

Installs on Virtual Machines or Bare-metal, Air-Gapped if needed

Hopsworks can be installed on Linux virtual machines (Ubuntu, RHEL/CentOS) on VMware, OpenStack, and other virtual machine platforms. You can also install on bare-metal servers for the highest performance.


If your servers are in an air-gapped environment, we can still give your developers a best-in-class Python development environment. Hopsworks can be integrated with a local Python PyPi and/or Conda server, so that your developers can easily install the latest approved libraries they need for their ML pipelines. Hopsworks also integrates with GitLab, Bitbucket, and Github to manage source code for your ML pipelines.

Hopsworks Cluster Management

Hopsworks engineers will manage the configuration, installation, and upgrade of your Hopsworks cluster securely inside your data center. You can configure your Hopsworks cluster(s) to support single sign-on (SSO), such as Active Directory, LDAP, or OAuth2. This way, members of your organization can simply be given a URL to the cluster, and they will authenticate via your existing SSO mechanisms.


© Hopsworks 2024. All rights reserved. Various trademarks held by their respective owners.
