Feature Store: The Definitive Guide

What is a feature store?

A feature store is a data platform that supports the development and operation of machine learning systems by managing the storage and efficient querying of feature data. Machine learning systems can be real-time, batch, or stream processing systems, and the feature store is a general-purpose data platform that supports a multitude of write and read workloads, from batch and streaming writes to batch and point read queries, and even approximate nearest neighbour search. Feature stores also provide compute support to the ML pipelines that create and use features, including ensuring the consistent computation of features in different (offline and online) ML pipelines.

What is a feature and why do I need a specialized store for them?

A feature is a measurable property of some entity that has predictive power for a machine learning model. Feature data is used to train ML models and to make predictions in batch and online ML systems. Features can be computed either when they are needed or in advance, to be used later for training and inference. One of the advantages of storing features is that they can be easily discovered and reused in different models, reducing the cost and time required to build new machine learning systems. For real-time ML systems, the feature store provides history and context to (stateless) online models. Online models tend to have no local state, but the feature store can enrich the set of features available to the model by providing, for example, historical feature data about users (retrieved with the user’s ID) as well as contextual data, such as what’s trending. The feature store also reduces the time required to make online predictions, as these features do not need to be computed on-demand - they are precomputed.

How does the feature store relate to MLOps and ML systems?

In an MLOps platform, the feature store is the glue that ties together the different ML pipelines to make a complete ML system:

The main goals of MLOps are to decrease model iteration time, improve model performance, ensure governance of ML assets (features, models), and improve collaboration. By decomposing your ML system into separate feature, training, and inference (FTI) pipelines, your system becomes more modular: three pipelines that can be independently developed, tested, and operated. This architecture scales from one developer to teams that take responsibility for the different ML pipelines: data engineers and data scientists typically build and operate feature pipelines; data scientists build and operate training pipelines; and ML engineers build and operate inference pipelines. The feature store enables the FTI pipeline architecture, improving communication within and between data, ML, and operations teams.

What problems does a feature store solve?

The feature store solves many of the challenges that you typically face when you (1) deploy models to production, (2) scale the number of models you deploy to production, and (3) scale the size of your ML teams, including:

  1. Support for collaborative development of ML systems based on centralized, governed access to feature data, along with a unified architecture for ML systems as feature, training, and inference pipelines;
  2. Management of incremental datasets of feature data. You should be able to easily add new, update existing, and delete feature data using DataFrames. Feature data should be transparently and consistently replicated between the offline and online stores;
  3. Backfilling of feature data from data sources using a feature pipeline, and backfilling of training data using a training pipeline;
  4. History and context for stateless interactive (online) ML applications;
  5. Easy feature reuse, enabling developers to select existing features and reuse them for training and inference in an ML model;
  6. Support for diverse feature computation frameworks - including batch, streaming, and request-time computation - enabling ML systems to be built based on their feature freshness requirements;
  7. Validation of feature data on writes, and monitoring of new feature data for drift;
  8. A taxonomy of data transformations for machine learning based on the type of feature computed: (a) reusable features are computed by model-independent transformations, (b) features specific to one model are computed by model-dependent transformations, and (c) features computed with request-time data are computed by on-demand transformations. The feature store provides abstractions to prevent skew between data transformations performed in more than one ML pipeline;
  9. A point-in-time consistent query engine to create training data from historical time-series feature data, potentially spread over many tables, without future data leakage;
  10. A query engine to retrieve and join precomputed features at low latency for online inference using an entity key;
  11. A query engine to find similar feature values using embedding vectors.

The table below shows you how the feature store can help you with common ML deployment scenarios. 

Problems solved by a feature store

For simply putting ML in production, the feature store helps with managing incremental datasets, feature validation and monitoring, where to perform data transformations, and how to create point-in-time consistent training data. Real-time ML extends the production ML scenario with the need for history and context information for stateless online models, low latency retrieval of precomputed features, online similarity search, and the need for either stream processing or on-demand feature computation. For ML at large scale, there is also the challenge of enabling collaboration between teams of data engineers, data scientists, and ML engineers, as well as the reuse of features across many models.

Collaborative Development

Feature stores are the key data layer in an MLOps platform. The main goals of MLOps are to decrease model iteration time, improve model performance, ensure governance of ML assets (features, models), and improve collaboration. The feature store enables different teams to take responsibility for the different ML pipelines: data engineers and data scientists typically build and operate feature pipelines; data scientists build and operate training pipelines; and ML engineers build and operate inference pipelines.

They enable the sharing of ML assets and improved communication within and between teams. Whether teams are building batch machine learning systems or real-time machine learning systems, they can use shared language around feature, training, and inference pipelines to describe their responsibilities and interfaces.

A more detailed Feature Store Architecture is shown in the figure below.

The feature store's historical feature data is stored in an offline store (typically a columnar data store), the most recent feature data used by online models is stored in an online store (typically a row-oriented database or key-value store), and, if indexed embeddings are supported, they are stored in a vector database. Some feature stores provide the storage layer as part of the platform, while others have partially or fully pluggable storage layers.

The machine learning pipelines (feature pipelines, training pipelines, and inference pipelines) read and write features/labels from/to the feature store, and prediction logs are typically also stored there to support feature/model monitoring and debugging. Different data transformations (model-independent, model-dependent, and on-demand) are performed in the different ML pipelines, see the Taxonomy of Data Transformations for more details.

Incremental Datasets

Feature pipelines keep producing feature data as long as your ML system is running. Without a feature store, it is non-trivial to manage the mutable datasets updated by feature pipelines - as the datasets are stored in the different offline/online/vector-db stores. Each store has its own drivers, authentication and authorization support, and the synchronization of updates across all stores is challenging. 

Feature stores make the management of mutable datasets of features, called feature groups, easy by providing CRUD (create/read/update/delete) APIs. The following code snippet shows how to append, update, and delete feature data in a feature group using a Pandas DataFrame in Hopsworks. The updates are transparently synchronized across all of the underlying stores - the offline/online/vector-db stores.

import hopsworks

project = hopsworks.login()
fs = project.get_feature_store()

df = ...  # read from data source, then perform feature engineering

fg = fs.get_or_create_feature_group(name="query_terms_yearly",
                                    version=1,
                                    description="Count of search term by year",
                                    primary_key=['year', 'search_term'],
                                    partition_key=['year'],
                                    online_enabled=True)
fg.insert(df)  # insert or update
fg.commit_delete_record(df)  # delete

We can also update a feature group using a stream processing client (a streaming feature pipeline). The following code snippet uses PySpark structured streaming to update a feature group in Hopsworks. It computes the average amount spent per credit card over 10-minute windows, reading its input data as events from a Kafka cluster.

df_read = spark.readStream.format("kafka")...option("subscribe", KAFKA_TOPIC_NAME).load()

# Deserialize data from Kafka and create streaming query
df_deser = df_read.selectExpr(...).select(...)

# 10 minute window
windowed10mSignalDF = df_deser \
    .selectExpr(...) \
    .withWatermark(...) \
    .groupBy(window("datetime", "10 minutes"), "cc_num").agg(avg("amount")) \
    .select(...)

card_transactions_10m_agg = fs.get_feature_group("card_transactions_10m_agg", version=1)

query_10m = card_transactions_10m_agg.insert_stream(windowed10mSignalDF)

Some feature stores also support defining columns as embeddings that are indexed for similarity search. The following code snippet writes a DataFrame to a feature group in Hopsworks, and indexes the “embedding_body” column in the vector database. You need to create the vector embedding using a model, add it as a column to the DataFrame, and then write the DataFrame to Hopsworks.

import pandas as pd
from hsfs import embedding
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

df = ...  # read from data source, then perform feature engineering

embeddings_body = model.encode(df["Article"])
df["embedding_body"] = pd.Series(embeddings_body.tolist())

emb = embedding.EmbeddingIndex()
emb.add_embedding("embedding_body", len(df["embedding_body"][0]))

news_fg = fs.get_or_create_feature_group(
    name="news_fg",
    embedding_index=emb,
    primary_key=["id"],
    version=1,
    online_enabled=True
)
news_fg.insert(df)

Backfill Feature Data and Training Data

Backfilling is the process of recomputing datasets from raw, historical data. Backfilling feature data involves running a feature pipeline with historical data to populate the feature store. This requires users to provide a start_time and an end_time for the range of data to be backfilled, and the data source needs to support timestamps, e.g., Type 2 slowly changing dimensions in a data warehouse table.

The same feature pipeline used to backfill features should also process “live” data. You just point the feature pipeline at the data source and the range of data to backfill (e.g., backfill the daily partitions with all users for the last 180 days). Both batch and streaming feature pipelines should be able to backfill features. Backfilling features is important because you may have existing historical data that can be leveraged to create training data for a model. If you couldn’t backfill features, you would have to start logging features in your production system and wait until sufficient data had been collected before you could start training your model.
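A minimal sketch of such a pipeline is shown below, assuming the Hopsworks APIs used elsewhere in this guide; read_from_source and engineer_features are hypothetical helpers:

from datetime import datetime

def run_feature_pipeline(start_time: datetime, end_time: datetime):
    # the same code backfills historical data and processes new "live" data
    df = read_from_source("transactions", start_time, end_time)  # hypothetical
    df = engineer_features(df)  # hypothetical
    fg = fs.get_or_create_feature_group(name="transactions_fg",
                                        version=1,
                                        primary_key=["id"],
                                        event_time="ts",
                                        online_enabled=True)
    fg.insert(df)

# backfill the last 180 days, then schedule (e.g., daily) runs for live data
run_feature_pipeline(datetime(2023, 7, 1), datetime(2023, 12, 28))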

Point-in-Time Correct Training Data

If you want to create training data from time-series feature data without any future data leakage, you will need to perform a temporal join, sometimes called a point-in-time correct join.

For example, in the figure below, we can see that for the (red) label value, the correct feature values for Feature A and Feature B are 4 and 6, respectively. Data leakage would occur if we included feature values that are either the pink (future data leakage) or orange values (stale feature data). If you do not create point-in-time correct training data, your model may perform poorly and it will be very difficult to discover the root cause of the poor performance.

If your offline store supports AsOf joins, feature retrieval involves joining Feature A and Feature B from their respective tables as of the timestamp value for each row in the Label table. The SQL query to create training data is an “ASOF LEFT JOIN”, as this query enforces the invariant that for every row in your Label table there should be a row in your training dataset; if there are missing feature values for a join, NULL values are included (we can later impute the missing values in model-dependent transformations). If your offline store does not support AsOf joins, you can write alternative windowing code using state tables.
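To make the temporal join concrete, here is a minimal sketch using DuckDB (which supports ASOF joins) from Python; the tables and values are hypothetical:

import duckdb
import pandas as pd

# hypothetical time-series tables: an entity id, an event timestamp ts,
# and a label/feature value
labels = pd.DataFrame({"id": [1, 1],
                       "ts": pd.to_datetime(["2023-01-10", "2023-01-20"]),
                       "label": [0, 1]})
feature_a = pd.DataFrame({"id": [1, 1],
                          "ts": pd.to_datetime(["2023-01-05", "2023-01-15"]),
                          "a": [4, 7]})

# for every label row, take the most recent feature value at or before the
# label's timestamp; rows with no match get NULLs (to be imputed later)
training_df = duckdb.sql("""
    SELECT l.ts, l.label, a.a
    FROM labels l
    ASOF LEFT JOIN feature_a a
      ON l.id = a.id AND l.ts >= a.ts
""").df()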

As both AsOf left joins and windowing code result in complex SQL queries, many feature stores provide domain-specific language (DSL) support for executing the temporal query. The following code snippet, in Hopsworks, creates point-in-time consistent training data by first creating a feature view. The code (1) selects the columns to use as features and label(s) for the model, (2) creates a feature view with the selected columns, defining the label column(s), and (3) uses the feature view object to create a point-in-time correct snapshot of training data.

fg_loans = fs.get_feature_group(name="loans", version=1)
fg_applicants = fs.get_feature_group(name="applicants", version=1)

select = fg_loans.select_except(["issue_d", "id"]).join(
    fg_applicants.select_except(["earliest_cr_line", "id"]))

fv = fs.create_feature_view(name="loans_approvals",
                            version=1,
                            description="Loan applicant data",
                            labels=["loan_status"],
                            query=select)

X_train, X_test, y_train, y_test = fv.train_test_split(test_size=0.2)
# ...
model.fit(X_train, y_train)

The following code snippet, in Hopsworks, uses the feature view we just defined to create point-in-time consistent batch inference data. The model makes predictions using the DataFrame df containing the batch inference data.

from datetime import datetime

fv = fs.get_feature_view(name="loans_approvals", version=1)
df = fv.get_batch_data(start_time="2023-12-23 00:00", end_time=datetime.now())

predictions_df = model.predict(df)

History and Context for Online Models

Online models are often hosted in model-serving infrastructure or stateless (AI-enabled) applications. In many user-facing applications, the actions taken by users are “information poor”, but we would still like to use a trained model to make an intelligent decision. For example, in TikTok, a user click contains a limited amount of information - you could not build the world’s best real-time recommendation system using just a single user click as an input feature.

The solution is to use the user’s ID to retrieve precomputed features from the online store, containing the user's personal history as well as context features (such as what videos or searches are trending). The precomputed features returned enrich the features that can be computed from the user input, building a rich feature vector for complex ML models. For example, in TikTok, you can retrieve precomputed features about the 10 most recent videos you looked at: their category, how long you engaged with them, what’s trending, what your friends are looking at, and so on.

In many examples of online models, the entity is a simple user, product, or booking. However, you will often need more complex data models, and it is beneficial if your online store supports multi-part primary keys (see Uber talk).
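A minimal sketch of such a lookup, assuming a Hopsworks feature view whose underlying feature groups share a composite (user_id, video_id) primary key; the names are illustrative:

fv = fs.get_feature_view(name="video_recs", version=1)
vec = fv.get_feature_vector(
    entry={"user_id": 342, "video_id": 4911}  # multi-part primary key
)
prediction = model.predict([vec])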

Feature Reuse

A common problem faced by organizations when they build their first ML models is that there is a lot of bespoke tooling for extracting data from existing backend systems so that it can be used to train an ML model. Then, when it comes to productionizing the ML model, more data pipelines are needed to continually extract new data and compute features so that the model can make continual predictions on the new feature data.

However, after the first set of pipelines has been written for the first model, organizations soon notice that one or more features used in an earlier model are needed in a new model. Meta reported that in their feature store “most features are used by many models”, and that the most popular 100 features are reused in over 100 different models. However, for expediency, developers typically rewrite the data pipelines for the new model. Now you have different models re-computing the same feature(s) with different pipelines. This leads to waste and a less maintainable (non-DRY) code base.

The benefits of feature reuse with a feature store include higher quality features through increased usage and scrutiny, reduced storage costs, and fewer feature pipelines. In fact, the feature store decouples the number of models you run in production from the number of feature pipelines you have to maintain. Without a feature store, you typically write at least one feature pipeline per model. With a (large enough) feature store, you may not need to write any feature pipeline for your model if the features you need are already available there.
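As a sketch of what reuse looks like in practice (using the same Hopsworks APIs shown earlier; the feature group and column names are hypothetical), a new model can be assembled entirely from existing feature groups, with no new feature pipeline:

# both feature groups were already populated by pipelines built for
# earlier models, so the new model just selects and joins their features
fg_users = fs.get_feature_group(name="user_profiles", version=1)
fg_trends = fs.get_feature_group(name="trending_items", version=1)

selection = fg_users.select(["age", "avg_session_len"]).join(
    fg_trends.select(["category", "views_7d"]))

fv = fs.create_feature_view(name="new_model_features", version=1,
                            query=selection)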

Multiple Feature Computation Models

The feature pipeline typically does not need GPUs, may be a batch or streaming program, and may process small amounts of data with Pandas or Polars, or large amounts of data with a framework such as Spark or DBT/SQL. Streaming feature pipelines can be implemented in Python (Bytewax) or, more commonly, in distributed frameworks such as PySpark, with its micro-batch computation model, or Flink/Beam, with their lower latency per-event computation model.

The training pipeline is typically a Python program, as most ML frameworks are written in Python. It reads features and labels as input, trains a model and outputs the trained model (typically to a model registry). 

An inference pipeline then downloads a trained model and reads features as input (some may be computed from the user’s request, but most will be read as precomputed features from the feature store). Finally, it uses the features as input to the model to make predictions that are either returned to the client that requested them or stored in a data store (often called an inference store) for later retrieval.
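Here is a minimal sketch of a batch inference pipeline, assuming the Hopsworks model registry and the feature view defined earlier; the model name and file are hypothetical:

import joblib

mr = project.get_model_registry()
meta = mr.get_model("loan_model", version=1)  # hypothetical model
model = joblib.load(meta.download() + "/model.pkl")

fv = fs.get_feature_view(name="loans_approvals", version=1)
df = fv.get_batch_data()

predictions = model.predict(df)
# write predictions to an inference store / prediction log for monitoring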

Validate Feature Data and Monitor for Drift

Garbage in, garbage out is a well-known adage in the data world. Feature stores can provide support for validating feature data in feature pipelines. The following code snippet uses the Great Expectations library to define a data validation rule that is applied when feature data is written to a feature group in Hopsworks.

import great_expectations as ge
from great_expectations.core import ExpectationConfiguration

df = ...  # read from data source, then perform feature engineering

# define data validation rules in Great Expectations
ge_suite = ge.core.ExpectationSuite(
    expectation_suite_name="expectation_suite_101")

ge_suite.add_expectation(
    ExpectationConfiguration(
        expectation_type="expect_column_values_to_not_be_null",
        kwargs={"column": "search_term"}
    )
)

fg = fs.get_or_create_feature_group(name="query_terms_yearly",
                                    version=1,
                                    description="Count of search term by year",
                                    primary_key=['year', 'search_term'],
                                    partition_key=['year'],
                                    online_enabled=True,
                                    expectation_suite=ge_suite)
fg.insert(df)  # data validation rules executed in client before insertion

The data validation results can then be viewed in the feature store, as shown below. In Hopsworks, you can trigger alerts when data validation fails, and you can decide whether to allow or reject the insertion of data that fails validation.

Feature monitoring is another useful capability provided by many feature stores. Whether you build a batch ML system or an online ML system, you should be able to monitor the inference data for the system’s model to see if it is statistically significantly different from the model’s training data (data drift). If it is, you should alert users and ideally kick off retraining of the model using more recent training data.

Here is an example code snippet from Hopsworks for defining a feature monitoring rule for the feature “amount” in the model’s prediction log (available for both batch and online ML systems). A job runs once per day to compare the last week of inference data for the amount feature; if its mean deviates by more than 50% from the mean observed in the model’s training data, data drift is flagged and alerts are triggered.

# Compute statistics on a prediction log as a detection window
fg_mon = pred_log.create_feature_monitoring(
        "name", feature_name="amount", job_frequency="DAILY"
    ).with_detection_window(row_percentage=0.8, time_offset="1w")

# Compare feature statistics with a reference window - e.g., training data
fg_mon.with_reference_training_dataset(version=1).compare_on(
    metric="mean", threshold=50)

Taxonomy of Data Transformations

When data scientists and data engineers talk about data transformations, they are not talking about the same thing. This causes communication problems, and it also contributes to the bigger problem of feature reuse in feature stores. There are three different types of data transformations, and they belong in different ML pipelines.

Data transformation, as understood by data engineers, is a catch-all term that covers data cleansing, aggregations, and any changes to your data to make it consumable by BI or ML. These data transformations are called model-independent transformations, as they produce features that are reusable by many models.

In data science, data transformation is a more specific term that refers to encoding a variable (categorical or numerical) into a numerical format, scaling a numerical variable, or imputing a value for a variable, with the goal of improving the performance of ML model training. These data transformations are called model-dependent transformations, and they are specific to one model.

Finally, there are data transformations that can only be performed at runtime for online models, as they require parameters that are only available in the prediction request. These data transformations are called on-demand transformations, but they may also be needed in feature pipelines if you want to backfill feature data from historical data.

The feature store architecture diagram from earlier shows that model-independent transformations are only performed in feature pipelines (whether batch or streaming pipelines). However, model-dependent transformations are performed in both training and inference pipelines, and on-demand transformations are applied in both feature and online inference pipelines. You need to ensure that equivalent transformations are performed in each pair of pipelines - if there is skew between the transformations, you will have model performance bugs that are very hard to identify and debug. Feature stores help prevent this problem of online-offline skew. For example, model-dependent transformations can be performed in scikit-learn pipelines or in feature views in Hopsworks, ensuring consistent transformations in both training and inference pipelines. Similarly, on-demand transformations are version-controlled Python or Pandas user-defined functions (UDFs) in Hopsworks that are applied in both feature and online inference pipelines. The sketch below illustrates the three types of transformation.
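Here is a minimal, framework-free sketch of the taxonomy in plain Python; the function, column, and parameter names are hypothetical:

import pandas as pd

# model-independent transformation: runs once in the feature pipeline;
# its output is reusable by many models
def avg_spend_per_week(tx_df: pd.DataFrame) -> pd.DataFrame:
    return (tx_df.groupby(["cc_num", pd.Grouper(key="ts", freq="W")])
                 ["amount"].mean().reset_index(name="avg_spend"))

# model-dependent transformation: applied identically in the training and
# inference pipelines of one model (here, scaling with statistics computed
# on that model's training dataset)
def scale_amount(amount: float, train_mean: float, train_std: float) -> float:
    return (amount - train_mean) / train_std

# on-demand transformation: needs request-time parameters; the same function
# can also run in a feature pipeline to backfill from historical events
def time_since_last_transaction(request_ts, last_tx_ts) -> float:
    return (request_ts - last_tx_ts).total_seconds()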

Query Engine for Point-in-Time Consistent Feature Data for Training

Feature stores can use existing columnar data stores and data processing engines, such as Spark, to create point-in-time correct training data. However, as of December 2023, Spark, BigQuery, Snowflake, and Redshift do not support the ASOF LEFT JOIN query that is used to create training data from feature groups, so feature stores have to implement stateful windowed approaches on these platforms instead.

The other main performance bottleneck with many current data warehouses is that they provide query interfaces to Python with either a JDBC or ODBC API. These are row-oriented protocols, so data from the offline store needs to be pivoted from columnar format to row-oriented format, and then back to column-oriented format in Pandas (Arrow is now the backing data format for Pandas 2.0+).

In open-source, reproducible benchmarks by KTH, Karolinska, and Hopsworks, a specialized DuckDB/ArrowFlight feature query engine that returns Pandas DataFrames to Python clients in training and batch inference pipelines achieved throughput improvements of 10-45X over JDBC/ODBC-based query engines, as shown in the table below.

Query Engine for Low Latency Feature Data for Online Inference

The online feature store is typically built on existing low latency row-oriented data stores. These could be key-value stores, such as Redis or DynamoDB, or a key-value store with a SQL API, such as RonDB for Hopsworks.

The process of building the feature vector for an online model also involves more than just retrieving precomputed features from the online feature store using an entity ID. Some features may be passed directly as request parameters, and some features may be computed on-demand, using either request parameters or data from some 3rd party API that is only available at runtime. These on-demand transformations may even need historical feature values (inference helper columns) to be computed.

In the code snippet below, we can see how an online inference pipeline takes request parameters in the predict method, computes an on-demand feature, retrieves precomputed features using the request supplied id, and builds the final feature vector used to make the prediction with the model.

def loc_diff(event_ts, cur_loc):
    # grid_loc is an (undefined here) helper that computes the on-demand feature
    return grid_loc(event_ts, cur_loc)

def predict(id, event_ts, cur_loc, amount):
    f1 = loc_diff(event_ts, cur_loc)  # on-demand feature from request parameters
    df = feature_view.get_feature_vector(
        entry={"id": id},  # precomputed features retrieved by entity key
        passed_features={"f1": f1, "amount": amount}
    )
    return model.predict(df)

In the figure below, we can see important system properties for online feature stores. If you are building your online AI application on top of an online feature store, it should have LATS properties (low Latency, high Availability, high Throughput, and scalable Storage), and it should also support fresh features (through streaming feature pipelines).

Some other important technical and performance considerations here for the online store are:

  • Projection pushdown can massively reduce network traffic and latency. When you have popular features in feature groups with lots of columns, your model may only require a few features. Projection pushdown only returns the features you need (see the sketch after this list). Without projection pushdown (e.g., most key-value stores), the entire row is returned and the filtering is performed in the client. For rows of 10s of KB, this could mean 100s of times more data is transferred than needed, negatively impacting latency and throughput (and potentially also cost).
  • Your feature store should support a normalized data model, not just a star schema. For example, if your user provides a booking reference number that is used as the entity ID, can your online store also return features for the user and products referenced in the booking, or does either the user or application have to provide the user ID and product ID? For high performance, your online store should support pushdown LEFT JOINs to reduce the number of database round trips for building features from multiple feature groups.
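To make the projection pushdown point concrete, here is a minimal sketch using the Hopsworks APIs shown earlier; the feature group and column names are hypothetical, and whether the pushdown happens in the server depends on the online store:

# select only the features the model needs from a wide feature group, so
# that only those columns are fetched and returned at serving time
fg = fs.get_feature_group(name="user_activity", version=1)
fv = fs.create_feature_view(name="ranking_model", version=1,
                            query=fg.select(["clicks_7d", "watch_time_1h"]))
vec = fv.get_feature_vector(entry={"user_id": 42})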

Query Engine to Find Similar Feature Data Using Embeddings

Real-time ML systems often use similarity search as a core functionality. For example, personalized recommendation engines typically use similarity search to generate candidates for recommendation, and then use a feature store to retrieve features for the candidates, before a ranking model personalizes the candidates for the user.

The example code snippet below is from Hopsworks, and shows how you can search for similar rows in a feature group with the text “Happy news for today” in the embedding_body column.

news_desc = "Happy news for today"
df = news_fg.find_neighbors(model.encode(news_desc), k=3)
# df now contains the k=3 rows whose embedding_body vectors are most
# similar to the embedding of news_desc

Do I need a feature store?

Feature stores have historically been part of big data ML platforms, such as Uber’s Michelangelo, that manage the entire ML workflow: from specifying feature logic, to creating and operating feature pipelines, training pipelines, and inference pipelines.

More recent open-source feature stores provide open APIs enabling easy integration with existing ML pipelines written in Python, Spark, Flink, or SQL. Serverless feature stores further reduce the barriers to adoption for smaller teams. The key features needed by most teams include APIs for consistent reading/writing of point-in-time correct feature data, monitoring of features, feature discovery and reuse, and the versioning and tracking of feature data over time. Basically, feature stores are needed for MLOps and governance. Do you need GitHub to manage your source code? No, but it helps. Similarly, do you need a feature store to manage your features for ML? No, but it helps.

What is the difference between a feature store and a vector database?

Both feature stores and vector databases are data platforms used by machine learning systems. The feature store stores feature data and provides query APIs for the efficient reading of large volumes of feature data (for model training and batch inference) and the low latency retrieval of feature vectors (for online inference). In contrast, a vector database provides a query API to find similar vectors using approximate nearest neighbour (ANN) search.

The indexing and data models used by feature stores and vector databases are very different. The feature store has two data stores. The offline store, typically a data warehouse/lakehouse, is a columnar database with indexes that improve query performance, such as (file) partitioning based on a partition column, skip indexes (skip files when reading data using file statistics), and bloom filters (which files to skip when looking for a row). The online store is a row-oriented database with indexes that improve query performance, such as a hash index to look up a row, a tree index (such as a B-tree) for efficient range queries and row lookups, and a log-structured merge-tree (for improved write performance). In contrast, the vector database stores its data in a vector index that supports ANN search, such as FAISS (Facebook AI Similarity Search) or ScaNN by Google.

Is there an Integrated Feature Store and Vector Database?

Hopsworks is a feature store with an integrated vector database. You store tables of feature data in feature groups, and you can index a column that contains embeddings in a built-in vector database. This means you can search for rows of similar features using embeddings and ANN search. Hopsworks also supports filtering, so you can search for similar rows while providing conditions on what type of data to return (e.g., only users whose age > 18), as sketched below.
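A minimal sketch of a filtered similarity search, assuming the filter parameter to find_neighbors in the Hopsworks API and a hypothetical category column on the news_fg feature group from earlier:

news_desc = "Happy news for today"
emb = model.encode(news_desc)

# only rows satisfying the filter are candidates for the ANN search results
df = news_fg.find_neighbors(emb, k=3, filter=news_fg.category == "sports")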

