The Journey from Star Schema to Snowflake Schema in the Feature Store

Add more features to your ML model with a Snowflake Schema
July 8, 2024
14 min read
Dhananjay Mukhedkar, Software Engineer, Hopsworks
Javier de la Rúa Martínez, Software Engineer, Hopsworks
Mikael Ronström, Head of Data, Hopsworks
Jim Dowling, CEO and Co-Founder, Hopsworks
Davit Bzhalava, Head of Data Science, Hopsworks

TL;DR

Features are the input data that determine the quality of the predictions produced by AI systems. In real-time AI systems, features can be (1) computed on-demand from user input and (2) retrieved as precomputed features from a feature store using entity ID(s) provided by the application or the user input. Precomputed features are low-latency and information-rich, as they provide history and context not available in the user input. In existing feature stores, the more entity IDs your real-time AI system can provide as input, the more precomputed features you can retrieve for different entities. In general, as you add features, your AI system makes marginal gains in predictive power, hopefully translating into competitive advantage. But where do the entity IDs come from? In interactive applications, you often have to ask the user for their booking ID. But what happens if, for usability reasons, you don’t want to ask the user to enter their bank account number and social security number?

For one use case, you could build and maintain a denormalized (flat) table in your operational database. But that requires building and operating ETL pipelines, prevents feature reuse, and doesn’t scale in operational databases. Until now, existing feature stores have only supported the Star Schema data model, where clients have to provide all the entity ID(s). In this article, we introduce the Snowflake Schema data model for feature stores and show how it helps you include more features to make better predictions, while keeping normalized tables (and reusing features). We also describe the database capabilities we built to support low-latency queries (a pushdown LEFT JOIN in RonDB). There are bonus performance tips on how to optimize your feature store using projection pushdown and predicate pushdown.

This article is part 3 in a 7-part series describing, in lay terms, concepts and results from a SIGMOD 2024 research paper on the Hopsworks Feature Store.

Other parts: 1 (Modular/Composable AI Systems), 2 (The Taxonomy for Data Transformations), 4 (Lakehouse for AI), 5 (From Lakehouse to AI Lakehouse), 6 (Real-Time AI Database), 7 (Reproducible Data).

More Features Make for Better Predictions

Mary is building a real-time credit card fraud prediction model that has a strict 50ms latency SLO (service level objective). Every credit card payment arrives at a server with minimal information: the credit card number, the transaction amount, the longitude and latitude of where the transaction took place, and whether it was made online or in the real world at a merchant's point of payment. That information is not sufficient to train a high-performance machine learning (ML) model to predict whether the incoming transaction is fraudulent or not. She learns that feature stores can provide her with low-latency access to precomputed features that could give her a model rich enough to perform well at fraud detection. The credit card transactions land in a Kafka cluster, and a streaming feature pipeline can be used to compute aggregated usage statistics over windows of time (last 15 minutes, hour, day, week). Anomalous patterns of credit card usage can be detected by the model - great. She could also use the location of the current credit card transaction and the location of the previous transaction to compute a feature that is predictive of a geographic attack, for example, if the card is used in Paris and then one hour later in New York.
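
As an aside, the windowed aggregations Mary needs are easy to express in a feature pipeline. Below is a minimal sketch in Pandas (the cc_num, ts, and amount column names are illustrative; a production streaming pipeline would compute these continuously in a framework such as Flink or Bytewax):

import pandas as pd

# Illustrative transactions; a streaming pipeline would consume these from Kafka.
df = pd.DataFrame({
    "cc_num": ["1234", "1234", "5678"],
    "ts": pd.to_datetime(["2024-07-01 10:00", "2024-07-01 10:30", "2024-07-01 10:15"]),
    "amount": [50.0, 900.0, 20.0],
})

# Per-card usage statistics over a sliding 1-hour window (repeat for 15m, 1d, 1w).
df = df.sort_values("ts").set_index("ts")
windowed_features_df = (
    df.groupby("cc_num")["amount"]
      .rolling("1h")
      .agg(["count", "sum", "mean"])
      .rename(columns={"count": "txn_count_1h", "sum": "amount_sum_1h",
                       "mean": "amount_mean_1h"})
      .reset_index()
)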

Star schema
Figure 1: Mary can’t use the cc_issuer_details or account_details features, as her feature store only supports a star schema data model. The credit card transaction does not include primary keys for cc_issuer_details or account_details, and it would introduce too much latency to resolve them from the existing operational database, so she leaves those features out of her fraud model.

Mary does some offline experiments that show there are features related to the account holder and the credit card issuer (recently, there has been an increase in single-use virtual credit cards) that would further improve the accuracy of her model. However, she encounters a problem. Existing feature stores only support a star schema data model. That means that, when the credit card transaction arrives at the online model, she needs to provide the primary keys for all features that are retrieved from the feature store. That was easy for the credit card features (and the merchant features, when the merchant identifier is included with the transaction), as she could use the credit card number to retrieve the precomputed features. The transaction did not, however, include an identifier for the card issuer or the account holder. She considered retrieving these two IDs with primary key lookups on two different tables in a Postgres database. However, operations were not happy about opening access to their production database, and the ~15ms p95 latency would have blown her SLO budget. So, she just went to production without the card issuer and account holder features. Not her fault.

So, how could Mary have included the card issuer and account holder features? If her feature store supported a Snowflake Schema (nested tables), not just a star schema, she could have retrieved them all.

Snowflake schema
Figure 2: Mary can now use the cc_issuer_details and account_details features, as her feature store supports the Snowflake Schema data model. This enables her to read these features from nested tables with a single pushdown LEFT JOIN query.

Snowflake Schema in Hopsworks

In Hopsworks, we call tables of features feature groups. Feature pipelines are programs (in Pandas, Polars, SQL, Spark, Flink, Feldera, Bytewax, Quix, etc.) that periodically (or continuously, for streaming feature pipelines) update the feature data stored in feature groups. Each feature group stores its historical feature data in a table format (Apache Hudi or Delta Lake) on object storage, and its low-latency feature data for online models in a table in RonDB. So, how do we define a data model in Hopsworks?
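
Before answering that, here is a minimal sketch of a feature pipeline writing to a feature group, assuming the hopsworks Python API (the feature group name is illustrative, and windowed_features_df is the dataframe from the earlier sketch):

import hopsworks

# Connect to Hopsworks and get a reference to the project's feature store.
project = hopsworks.login()
fs = project.get_feature_store()

# Create (or retrieve) an online-enabled feature group keyed on cc_num;
# online_enabled=True keeps the latest feature values in RonDB for serving.
aggregated_cc_transactions = fs.get_or_create_feature_group(
    name="aggregated_cc_transactions",
    version=1,
    primary_key=["cc_num"],
    online_enabled=True,
)

# Upsert the latest feature values computed by the pipeline.
aggregated_cc_transactions.insert(windowed_features_df)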

Each feature group in Hopsworks has one or more columns as its primary key (the entity identifier). The foreign keys (shown in figures 1 and 2) are just columns in a feature group (there are no foreign key constraints). When you want to create a model and select features from more than one feature group, you have to use our feature selection API. For the Star Schema data model from figure 1, the code looks like this:

selected_features = (
    credit_card_transactions.select_all()
    .join(aggregated_cc_transactions.select_all())
    .join(account_details.select_all())
    .join(merchant_details.select_all())
    .join(cc_issuer_details.select_all())
)

Here, for every child of the credit_card_transactions table in the Star Schema, we call nested join methods. All tables are children of the parent credit_card_transactions table. We pass the selected features to a feature view in Hopsworks, which is later used to create training data and to retrieve feature data for inference, ensuring no skew between training and inference.
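
For example, creating the feature view from the selected features might look like this (a minimal sketch; the feature view name is illustrative):

feature_view = fs.get_or_create_feature_view(
    name="cc_fraud_model",
    version=1,
    query=selected_features,  # the joined selection from above
)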

In online inference, when you want to retrieve features for your online model, you have to provide all four foreign key values from the credit_card_transactions feature group to retrieve your precomputed feature values using the feature view:

feature_vector = feature_view.get_feature_vector({
    "cc_num": "1234 5555 3333 8888",
    "issuer_id": 20440455,
    "merchant_id": 44208484,
    "account_id": 84403331,
})

For the Snowflake Schema data model from figure 2, the feature selection code looks a little different, as we have to build up the subtrees:

nested_selection = (
    aggregated_cc_transactions.select_all()
    .join(account_details.select_all())
    .join(cc_issuer_details.select_all())
)

selected_features = (
    credit_card_transactions.select_all()
    .join(nested_selection)
    .join(merchant_details.select_all())
)

First, we build the subtrees of nested tables using joins, and then we join the subtrees to their parent (the root credit_card_transactions node).

Now we have the benefit that, in online inference, we only need to pass two key values to retrieve the precomputed features:

feature_vector = feature_view.get_feature_vector({
    "cc_num": "1234 5555 3333 8888",
    "merchant_id": 44208484,
})

Internally, Hopsworks will perform two parallel queries: a primary key lookup on the merchant_details table in RonDB, and a pushdown LEFT JOIN to retrieve the other features from the aggregated_cc_transactions, cc_issuer_details, and account_details tables.
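
For intuition, the nested part of this retrieval corresponds conceptually to a single parameterized query like the sketch below (the SQL and names are illustrative, not RonDB's actual plan); the point is that the client supplies only cc_num, and the nested foreign keys issuer_id and account_id are resolved inside the database rather than via extra client round trips:

# Conceptual server-side query for the nested lookup (illustrative only).
NESTED_LOOKUP_SQL = """
SELECT t.*, i.*, a.*
FROM aggregated_cc_transactions t
LEFT JOIN cc_issuer_details i ON t.issuer_id = i.issuer_id
LEFT JOIN account_details a ON t.account_id = a.account_id
WHERE t.cc_num = ?
"""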

Pushdown LEFT JOIN Performance Analysis

RonDB implements a pushdown LEFT JOIN to reduce the latency of retrieving nested features from a Snowflake Schema data model. Figure 3 shows how we implement pushdown LEFT JOINs in our online feature store database, RonDB.

How we implemented pushdown LEFT JOINS in our online feature store database, RonDB.
Figure 3: In our SIGMOD paper, we show how we implement the pushdown LEFT JOIN, where the nested foreign keys are iteratively resolved in the database nodes, instead of in the API client (which would lead to higher latency). Star schema features are resolved with parallel primary key lookups to the tables, while nested features are resolved with a pushdown LEFT JOIN.

The Hopsworks feature store client issues parallel queries for all features from the different tables (feature groups) in RonDB. For child tables (star schema), primary key lookups are used to retrieve features from RonDB, while nested features are retrieved by issuing a pushdown LEFT JOIN to RonDB. For wide tables with many features, where only a small number of features are required, RonDB supports projection pushdown. For example, if you have a feature group (table) with hundreds of features, but only need a few of them for the current model, RonDB will only read and return the data for those features. Many databases and key-value stores used as feature stores, most notably Redis, do not support projection pushdown. Without projection pushdown, the entire row is returned and the filtering is performed in the client. For rows of 10s or 100s of KB, this could mean 100s of times more data being transferred than needed, negatively impacting latency and throughput (and potentially also cost).
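
In the Hopsworks API, projection pushdown falls out of the feature selection itself: if you select only the features the model needs, only those columns are read in RonDB. A minimal sketch (the feature names are illustrative):

# Select only the needed columns; RonDB reads and returns just these,
# instead of entire (potentially very wide) rows.
selected_features = (
    credit_card_transactions.select(["amount", "category"])
    .join(aggregated_cc_transactions.select(["txn_count_1h", "amount_sum_1h"]))
)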

In Figure 4 below, we also see the performance improvement of pushing down the LEFT JOIN to the database.

Figure 4: Pushdown LEFT JOINs reduce latency and improve throughput for feature stores.

We didn’t include any performance numbers in the paper for projection pushdown or predicate pushdown (both supported by RonDB), as these are not novel concepts in the database world - research papers favor novelty over known concepts. However, projection pushdown and predicate pushdown are must-have functionality for your feature store if you care about performance at scale.
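
Predicate pushdown is analogous: a filter attached to a selection is evaluated inside the database, so non-matching rows never leave RonDB. A sketch of a pushed-down filter in the Hopsworks query API (the feature and threshold are illustrative):

# The filter is evaluated in RonDB; rows that don't match are never
# transferred to the client.
selected_features = (
    credit_card_transactions.select_all()
    .filter(credit_card_transactions.amount > 100)
)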

Conclusions

This article explores the limitations of existing feature stores that only support Star Schema data models and introduces the Snowflake Schema data model, now supported by the Hopsworks Feature Store. The Snowflake Schema enhances online models by incorporating additional features through nested tables. We also showed how the pushdown LEFT JOIN in RonDB reduces latency and improves throughput for nested queries in Snowflake Schema data models.
