
Backfill features

What does backfilling features mean?

Backfilling is the process of recomputing datasets from raw, historical data. For reusable features in a feature store, backfilling involves running a feature pipeline with historical data to populate the feature store.

Implementation Notes

The same feature pipeline used to backfill features should also process “live” data. To backfill, you point the feature pipeline at the data source and specify the range of data to process (e.g., backfill the daily partitions with all users for the last 180 days). Both batch and streaming feature pipelines should be able to backfill features.
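As a rough illustration of this idea, the sketch below runs one pipeline function over a range of daily partitions, so a backfill and a daily "live" run differ only in the date range passed in. Everything here is a hypothetical stand-in: `run_feature_pipeline`, `compute_features`, and the in-memory `SOURCE` and `STORE` dictionaries are assumptions made for the example, not the API of any particular feature store.

```python
from datetime import date, timedelta


def daily_partitions(start: date, end: date) -> list[date]:
    """Return every daily partition in the inclusive range [start, end]."""
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]


def compute_features(events: list[dict]) -> dict[str, int]:
    """Hypothetical feature logic: count click events per user."""
    counts: dict[str, int] = {}
    for event in events:
        counts[event["user_id"]] = counts.get(event["user_id"], 0) + 1
    return counts


def run_feature_pipeline(read_partition, write_features, start: date, end: date) -> None:
    """Apply the same feature logic to each daily partition in the range."""
    for day in daily_partitions(start, end):
        write_features(day, compute_features(read_partition(day)))


# In-memory stand-ins for the raw data source and the feature store (assumptions).
SOURCE = {
    date(2024, 1, 1): [{"user_id": "a"}, {"user_id": "a"}, {"user_id": "b"}],
    date(2024, 1, 2): [{"user_id": "b"}],
}
STORE: dict[date, dict[str, int]] = {}


def read_partition(day: date) -> list[dict]:
    return SOURCE.get(day, [])


def write_features(day: date, features: dict[str, int]) -> None:
    # Overwriting the partition keeps repeated runs from duplicating rows.
    STORE[day] = features


# Backfill: process a historical range of partitions.
run_feature_pipeline(read_partition, write_features, date(2024, 1, 1), date(2024, 1, 2))

# "Live" run: the same function, with the range narrowed to a single day.
run_feature_pipeline(read_partition, write_features, date(2024, 1, 2), date(2024, 1, 2))
```

Keeping the date range as the only difference between the two runs means the feature logic cannot drift between backfilled and freshly computed features.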

Why is backfilling features useful?

Backfilling features is important because you may have existing historical data that can be leveraged to create training data for a model. Without backfilling, you would have to start logging features in your production system and wait until sufficient data had been collected before you could start training your model.

Example of backfilling features

The historical data for backfilling could be user clicks on a website, purchases on an ecommerce site, or any data that is systematically collected, curated, and typically stored in a lakehouse or data warehouse. The data model of the data you are backfilling should include a timestamp for each event/row, so that you can specify the time range to backfill.
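A minimal sketch of selecting a backfill range by event timestamp might look like the following. The column name `event_time`, the helper `select_backfill_range`, and the sample rows are all assumptions made for illustration; in practice the filter would usually be pushed down to the lakehouse or data warehouse as a query predicate.

```python
from datetime import datetime, timezone


def select_backfill_range(rows: list[dict], start: datetime, end: datetime) -> list[dict]:
    """Keep rows whose event_time falls in the half-open interval [start, end)."""
    return [row for row in rows if start <= row["event_time"] < end]


# Hypothetical click events, each carrying an event timestamp.
clicks = [
    {"user_id": "a", "event_time": datetime(2024, 1, 1, 9, 30, tzinfo=timezone.utc)},
    {"user_id": "b", "event_time": datetime(2024, 1, 2, 0, 0, tzinfo=timezone.utc)},
    {"user_id": "a", "event_time": datetime(2024, 1, 3, 18, 5, tzinfo=timezone.utc)},
]

# Select only the events to backfill for 1 and 2 January 2024.
to_backfill = select_backfill_range(
    clicks,
    start=datetime(2024, 1, 1, tzinfo=timezone.utc),
    end=datetime(2024, 1, 3, tzinfo=timezone.utc),
)
```

A half-open interval is used here so that consecutive backfill ranges can be chained without processing the boundary event twice.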

