Data Partitioning

What is data partitioning and how does it relate to feature stores?

When you create a feature group, you can select one or more features (columns) as the partition key. Rows that share the same partition key values are then stored together in the same directory. Partitioning can speed up queries made through a feature store’s Offline API: when a query filters on partition key values, only the partitions that match those values are read. For example, in Hopsworks, the offline store uses Hive-style partitioning to store feature group partitions in directories. If you only want to create training data for users in the location “USA”, you can pass a filter to your feature view, and only data in the feature group’s subdirectory for “USA” will be read, skipping the rest of the data in the feature group:

# If 'location' is a partition_key in its feature group, then the query
# will be pushed down and only read data for users in "USA"
training_data = feature_view.training_data().filter("location", "USA")
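
To make the directory layout concrete, the following is a minimal sketch of Hive-style partitioning using only the Python standard library. It is not the Hopsworks implementation: the helper names `write_partitioned` and `read_partition`, the CSV file format, and the sample rows are all assumptions made for illustration. It shows how rows end up in `location=<value>` subdirectories and how a filter on the partition key needs to open only one of them.

```python
import csv
import tempfile
from pathlib import Path


def write_partitioned(rows, root, partition_key):
    """Write rows to Hive-style directories: <root>/<key>=<value>/part-0.csv."""
    root = Path(root)
    by_value = {}
    for row in rows:
        by_value.setdefault(row[partition_key], []).append(row)
    for value, group in by_value.items():
        part_dir = root / f"{partition_key}={value}"
        part_dir.mkdir(parents=True, exist_ok=True)
        # The partition column is encoded in the directory name, not in the file.
        columns = [c for c in group[0] if c != partition_key]
        with open(part_dir / "part-0.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=columns, extrasaction="ignore")
            writer.writeheader()
            writer.writerows(group)


def read_partition(root, partition_key, value):
    """Read only the directory for one partition value; return (rows, files_read)."""
    part_dir = Path(root) / f"{partition_key}={value}"
    rows, files_read = [], 0
    for path in sorted(part_dir.glob("*.csv")):
        files_read += 1
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                row[partition_key] = value  # restore the column from the path
                rows.append(row)
    return rows, files_read


# Hypothetical sample data: three users in two locations.
users = [
    {"user_id": 1, "location": "USA", "spend": 10.0},
    {"user_id": 2, "location": "SWE", "spend": 7.5},
    {"user_id": 3, "location": "USA", "spend": 3.2},
]

store_root = tempfile.mkdtemp()
write_partitioned(users, store_root, "location")

# Only the "location=USA" directory is opened; "location=SWE" is never touched.
usa_rows, files_read = read_partition(store_root, "location", "USA")
```

In this sketch the filter is resolved by building a directory path, so the cost of the read depends on the size of the selected partition and not on the size of the whole feature group. Query engines that understand Hive-style layouts apply the same idea, usually called partition pruning.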

Data modeling for feature stores involves organizing your entities and features into feature groups. In data warehousing, dimensional modeling is a data modeling technique that identifies entities and then decomposes your data into “facts” and “dimensions” related to those entities.

© Hopsworks 2024. All rights reserved. Various trademarks held by their respective owners.