Compare Hopsworks with Databricks

What is Hopsworks?

Hopsworks is a machine learning platform that offers a state-of-the-art feature store solution, making it one of the most feature-rich and versatile feature stores on the market. It provides the highest level of integrability with any other ecosystem, making it easy to use with a wide range of data sources. Additionally, Hopsworks offers Python APIs that are easy to use, providing developers with great flexibility. With its multitude of sources, Hopsworks allows for a seamless feature engineering workflow, making it easy for data scientists to generate training datasets from raw data. Hopsworks is ideal for businesses that require low-latency data processing and support for multiple data sources.

What is Databricks?

Databricks is a unified data analytics platform that allows businesses to build data pipelines and create collaborative workflows. While Databricks provides a range of capabilities, its feature store is lighter in terms of technical capacities compared to most of the other feature store solutions. The feature store can only ingest pre-computed data and does not support defining feature pipelines. While this can be limiting, Databricks is still highly versatile, making it a great option for businesses that require a more comprehensive data analytics platform.

How to Choose?

While Hopsworks provides a state-of-the-art feature store with a multitude of sources, Databricks provides a comprehensive data analytics platform with a lighter feature store. Businesses looking for a solution centered around a feature store with the highest level of integrability and support for multiple data sources should consider Hopsworks. In contrast, businesses looking for a more comprehensive data analytics platform that does includes a feature store but is not their main requirement should consider Databricks.

Feature Store Capabilities

Hopsworks

Databricks

Engineering

Feature Computation Engines

What frameworks/languages are supported to create features?

Any environment to run Python, Spark, Flink.

Spark on Databricks

Feature pipelines computed from multiple Data Sources

Some feature stores ingest only pre-computed data, while others support defining feature pipelines.

Yes, using any data sources supported by Python/Spark/Flink. A single data source for SQL for external feature groups.

Yes, using any data sources supported by Spark

Creating Training Data and Batch Inference Data

How is feature data returned in batches for training or batch inference?

Python/Spark job that returns Training Data or Batch Inference Data as either a DataFrame or Files (Parquet, TFRecord, CSV)

Spark Job returns Spark DataFrame

On-Demand Features

Is there support for computing features on data only available from clients at request-time?

Python UDFs

Python UDFs in MLFlow

Data types

What (Python) language-level data types are supported.

Most Spark and Pandas datatypes (including timestamps and arrays)

Most PySpark Data Types

Datatype for entity/primary keys

What (Python) language-level data types are supported by the feature store for defining primary keys for entities?

String, Int, Long, Date

String, Int, Long, Date

Versioning

Does the platform provide support for versioning of features or Feature Tables/Groups.

Feature Groups, Feature Views, Training Data

N/A - Semantic versioning using names

Data Validation

Is there support for validating data in feature pipelines before the features are written to the feature store?

Great Expectations for Python or Spark Feature Pipelines

N/A

Feature Testing and CI/CD

Best practices for testing and CI/CD for feature development in machine learning.

Supports industry standard DevOps processes, with Git, PyTest, and CI/CD services (Jenkins, Github Actions, etc)

The same testing practices as you use for PySpark on Databricks

Retrieving Feature Vectors from Online Store

What APIs are supported for reading a row of feature values from the online feature store?

Python or Java SDK

Python or REST API

Operations

Pipeline Orchestration

How are the feature/training/inference pipelines that use the feature store scheduled to run? What orchestration engines are supported?

Any Python or Spark Orchestration tool (Airflow, Dagster, AWS Lambda, etc)

Databricks Worfklow Orchestrator

Offline Feature Store

What data warehouse / lakehouse / object store is supported for storing offline feature data?

Hudi on HopsFS/S3 or External Tables (Snowflake, S3, GCS, JDBC, etc)

Delta Lake

Platform Support

What platforms is the feature store available on

AWS, Azure, GCP, On-Prem

AWS, Azure, GCP

Online Feature Store

What operational database is supported for storing online features?

RonDB

DynamoDB or MySQL (Aurora or RDS)

Batch Ingestion

How are features written to the offline feature store.

Python or Spark DataFrame API
‍(Online, Offline or Both)

Spark DataFrame API

Streaming Ingestion

Does the platform support computing features in a streaming application.

Spark Streaming and Flink API

N/A

Join Engine

A join engine can help achieve point-in-time correctness for training data.

Spark and DuckDB

Spark

Reuse Features

Does the platform support feature encoding (model-dependent transformations) after the data has been stored in a Feature Table/Group?

Transformations attached to Feature Views or SkLearn/Keras/PyTorch Transformation Pipelines

N/A

Feature Monitoring

Is there support for identifying (and alerting) when there are anomalous changes in a feature as it is updated over time?

Feature ingestion monitoring with Great Expectations and alerting (email or slack)

N/A

Backfill Features

Is there any additional support for specifying a job to fill up a feature table/group with feature values from data source(s) that contains historical data?

Repeated Parameterized Python or Spark Job

Batch Ingestion Spark Job

Ranking and Retrieval Architecture Support

If you are using the feature store to build a personalized recommendation or search system, what support is there for vector DB integration?

Out-of-the-box, with OpenSearch K-NN included. External Vector Databases can be integrated.

External Vector Databases can be integrated

Model Registry & Model Serving Support

Is there support for storing the models in a registry and for running the online inference pipelines in a model serving platform?

Yes, with KServe for Model Serving

Yes wiith MLFlow

Security & Governance

Access Control

What support is there in the platform for authenticating users and then definining policies.

Platform level access control and Project Membership RBAC Inside Projects

RBAC for Feature Tables

Custom metadata and search

What type of tags can be created - string-based or schematized tags? How is search performed?

Names, descriptions, keywords, schematized tags - with free-text search

Name, description and Tags

Provenance

What support is there for tracking the lineage of features - what raw data are they computed on, what training data or models are they used in?

Feature Groups, Feature Views, Training Data

N/A

If you would like a more detailed comparison and complete review of the above products feel free to contact us.

Hopsworks compared to

Capabilities

Hopsworks