Federated Data with the Hopsworks Feature Query Service

Python clients can read from Snowflake, BigQuery, Hopsworks, and more
February 14, 2024
14 min read

Ralfs Zangis, Software Engineer, Hopsworks
Steffen Grohsschmiedt, Head of Cloud, Hopsworks

TL;DR

The Hopsworks Feature Query Service offers an efficient and performant way to join features across multiple platforms and cloud providers, such as Snowflake, BigQuery and Hopsworks, without data duplication. This tutorial walks through the steps for joining features across Snowflake, BigQuery and the Hopsworks Feature Store and shows how to present them in a unified view.

Introduction

It is common for organizations to have their features spread across multiple platforms and cloud providers, because different teams with different needs develop the solutions that suit them best. Having features spread across multiple platforms usually means you have to ingest and duplicate them into a feature store in order to share and reuse them effectively. In Hopsworks, we solve this problem with the Hopsworks Feature Query Service, powered by ArrowFlight and DuckDB, which efficiently queries and joins features from multiple platforms such as Snowflake, BigQuery and Hopsworks without any data duplication.

Figure 1. Federated Queries with the Hopsworks Feature Query Service

If you are interested in how the Hopsworks Feature Query Service works, you can find the details in our earlier blog post: Faster reading from the Lakehouse to Python with DuckDB/ArrowFlight.

Prerequisites

To follow this tutorial you can sign up for the Hopsworks Free Tier or use your own Hopsworks installation. You also need access to Snowflake and BigQuery, both of which offer free trials: the Snowflake Free Trial and the Google Cloud Free Tier.

If you choose to use your own Hopsworks, you should have an instance of Hopsworks version 3.5 or above and be a Data Owner or Author of a project. Furthermore, the Hopsworks Feature Query Service has to be enabled when configuring the Hopsworks cluster; this can only be done during cluster creation.

The tutorial 

The aim of this tutorial is to create a unified view of features for the 100 most popular GitHub projects by joining public datasets on Snowflake (GitHub Archive), BigQuery (deps.dev) and Hopsworks. We will create feature groups for each of these sources and then combine them in a unified view exposing all features together, regardless of their source. We then use the view to create training data for a model predicting the code coverage of GitHub projects.

Setting up the Snowflake connection

Hopsworks manages the connection to Snowflake through storage connectors. Therefore, the following steps show you how to retrieve the connection information from Snowflake and provide it to your Hopsworks project.

Gain access to the GitHub Archive dataset on Snowflake

Start by logging in to your Snowflake account. If you don’t have a Snowflake account yet, you can register for a free trial on their homepage.

Go to the GitHub Archive dataset and add it to your account:

Get connection information for Snowflake 

In the Snowflake web user interface, go to Admin -> Accounts and hover your mouse pointer to the right of the account id until the Current URL appears. This is your connection URL:

In the Snowflake web user interface, go to Admin -> Warehouses and take note of the name of the warehouse you want to use:

Define the connection to Snowflake using a Hopsworks storage connector

Next, in Hopsworks select a project where you want to connect to Snowflake and create a new storage connector for Snowflake. Enter the following details and save the connector:

  • Name: Snowflake
  • Protocol: Snowflake
  • Connection URL: The connection URL you looked up earlier
  • User: Your Snowflake user name
  • Password: Your Snowflake user password
  • Warehouse: The name of the warehouse you looked up earlier
  • Database: GITHUB_ARCHIVE
  • Schema: CYBERSYN

After entering the connection information the connector should look something like this:

Setting up the BigQuery connection

Hopsworks manages the connection to BigQuery through storage connectors. Therefore, the following steps show you how to retrieve the connection information from Google Cloud and provide it to your Hopsworks project.

Gain access to the deps.dev dataset on BigQuery

Start by logging in to your Google Cloud account. If you don’t have a Google Cloud account yet, you can sign up for the Google Cloud Free Tier.

We will be using the public deps.dev dataset which is available to all BigQuery users:

Get connection information for BigQuery

In the Google Cloud web interface, open the project selector in the top bar and take note of the ID of the project you want to use:

In the Google Cloud web interface, go to BigQuery -> BigQuery Studio and select Create dataset using the three dots on the project you want to use. Fill in the dataset ID materialization_us and select location type Multi-region, US. Note that you have to use the multi-region US as the dataset we are using is located there.
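
If you prefer to create the materialization dataset programmatically instead of through the web UI, the snippet below is a minimal sketch using the google-cloud-bigquery client library. The project ID is a placeholder, and it assumes the library is installed and you are already authenticated against your project:

from google.cloud import bigquery

# Placeholder project ID; replace with the project you noted above
client = bigquery.Client(project="your-project-id")

# Create the materialization dataset in the US multi-region,
# matching the location of the public deps.dev dataset
dataset = bigquery.Dataset("your-project-id.materialization_us")
dataset.location = "US"
client.create_dataset(dataset, exists_ok=True)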

In the Google Cloud web interface, go to IAM -> Service accounts, select the service account you want to use, and open Manage keys via the three-dot menu. Create and download a JSON key file for your service account:

Note that the service account you use needs permissions to execute BigQuery jobs as well as permissions to write to the materialization_us dataset created earlier. This can be achieved by assigning the roles BigQuery Data Owner, BigQuery Job User, BigQuery Read Session User and Service Usage Consumer to your service account. Alternatively, you may define more restrictive permissions manually.

Define the connection to BigQuery using a Hopsworks storage connector

Next, in Hopsworks select a project where you want to connect to BigQuery and create a new storage connector for BigQuery. Enter the following details and save the connector:

  • Name: BigQuery
  • Protocol: Google BigQuery
  • Parent Project: The project ID you looked up earlier
  • Authentication JSON Keyfile: Upload the JSON key file you created earlier
  • BigQuery Project: The project ID you looked up earlier
  • BigQuery Dataset: bigquery-public-data
  • BigQuery Table: deps_dev_v1
  • Materialization Dataset: The ID of the materialization dataset you created earlier, e.g. materialization_us

After entering the connection information the connector should look something like this:

Create a Feature View joining a Feature Group with External Feature Groups in Snowflake and BigQuery

We will now use a Python notebook in JupyterLab to create and query a federated feature view that extracts the 100 most popular GitHub repositories from the Snowflake dataset and enriches them with features from BigQuery and Hopsworks.

Prepare Jupyter

To be able to run the Jupyter notebook, we first install JupyterLab and the Hopsworks Python library, then start JupyterLab:

pip install jupyterlab hopsworks==3.7.0
jupyter lab

Connect to Hopsworks

In Jupyter, create a new Python 3 notebook and copy the following code into the first cell and execute it to connect to Hopsworks and retrieve the storage connectors:

import hopsworks
from hsfs.feature import Feature


project = hopsworks.login() # See Login API for using your own Hopsworks
feature_store = project.get_feature_store()

snowflake = feature_store.get_storage_connector("Snowflake")
bigquery = feature_store.get_storage_connector("BigQuery") 
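
As an optional sanity check, you can print a few attributes of the connector you just retrieved to confirm it carries the expected configuration. The attribute names below are assumptions based on the hsfs connector classes and may differ between client versions:

# Optional sanity check: confirm the Snowflake connector points at the expected
# database, schema and warehouse (attribute names assumed; adjust if your
# hsfs version exposes them differently)
print(snowflake.database, snowflake.schema, snowflake.warehouse)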

Create an External Feature Group on Snowflake

We now create an external feature group querying the GitHub Archive dataset on Snowflake to return the 100 repositories that got the most stars during the 365 days before November 13, 2023.

query_str = """
WITH latest_repo_name AS (
    SELECT repo_name,
           repo_id
    FROM cybersyn.github_repos
    QUALIFY ROW_NUMBER() OVER (PARTITION BY repo_id ORDER BY first_seen DESC) = 1
)
SELECT LOWER(repo.repo_name) as repo_name,
       SUM(stars.count) AS sum_stars
FROM cybersyn.github_stars AS stars
JOIN latest_repo_name AS repo
    ON (repo.repo_id = stars.repo_id)
WHERE stars.date >= DATEADD('day', -365, DATE('2023-11-13'))
GROUP BY repo.repo_name, repo.repo_id
ORDER BY sum_stars DESC NULLS LAST
LIMIT 100;"""

features = [
    Feature(name="repo_name",type="string"),
    Feature(name="sum_stars",type="int")
]

github_most_starts_fg = feature_store.create_external_feature_group(
    name="github_most_starts",
    version=1,
    description="The Github repositories that got the most stars last year",
    primary_key=['repo_name'],
    query=query_str,
    storage_connector=snowflake,
    features=features
)

github_most_starts_fg.save()

After creating the external feature group on Snowflake, we are now able to query it in our notebook utilizing the Hopsworks Feature Query Service:

github_most_starts_df = github_most_starts_fg.read()
github_most_starts_df.head()

The result will be returned in a matter of seconds:

Create an External Feature Group on BigQuery

We now create an external feature group on BigQuery containing information about the licenses, number of forks and open issues from the deps.dev dataset. To keep costs down, we restrict the query to the 100 repositories from the github_most_starts feature group:

repos_quoted = github_most_starts_df['repo_name'].map(lambda r: f"'{r}'").tolist()

query_str = f"""
SELECT
  Name as repo_name, Licenses as licenses, ForksCount as forks_count, OpenIssuesCount as open_issues_count
FROM
  `bigquery-public-data.deps_dev_v1.Projects`
WHERE
  TIMESTAMP_TRUNC(SnapshotAt, DAY) = TIMESTAMP("2023-11-13")
  AND
  Type = 'GITHUB'
  AND Name IN ({','.join(repos_quoted)})
 """

features = [
    Feature(name="repo_name",type="string"),
    Feature(name="licenses",type="string"),
    Feature(name="forks_count",type="int"),
    Feature(name="open_issues_count",type="int")
]

github_info_fg = feature_store.create_external_feature_group(
    name="github_info",
    version=1,
    description="Information about Github project licenses, forks count and open issues count",
    primary_key=['repo_name'],
    query=query_str,
    storage_connector=bigquery,
    features=features
)

github_info_fg.save()

After creating the external feature group on BigQuery, we can now query it in our notebook utilizing the Hopsworks Feature Query Service:

github_info_df = github_info_fg.read()
github_info_df.head()

After a few seconds, you will get the result:

Create a Feature Group on Hopsworks

To show that the data from the datasets on Snowflake and BigQuery can be queried together with data on Hopsworks, we now generate a synthetic dataset for the code coverage of GitHub repositories and put it into a feature group on Hopsworks:

import random
import pandas as pd

repos = github_most_starts_df['repo_name'].tolist()

numbers = [random.uniform(0, 1) for _ in range(len(repos))]
coverage_df = pd.DataFrame(list(zip(repos, numbers)),
                           columns=['repo_name', 'code_coverage'])

coverage_fg = feature_store.create_feature_group(name="github_coverage",
    version=1,
    primary_key=['repo_name'],
)

coverage_fg.insert(coverage_df, write_options={"wait_for_job": True})

After creating the feature group, we can look at it:

coverage_fg.select_all().show(5)

Create a Feature View joining all Feature Groups together

We now join the two external feature groups on Snowflake and BigQuery with the feature group in Hopsworks into a single feature view and mark the feature code_coverage as our label to be able to create training data in the next step:

query = (
    github_most_starts_fg.select_all()
    .join(github_info_fg.select_all(), join_type='left')
    .join(coverage_fg.select_all(), join_type='left')
)

feature_view = feature_store.create_feature_view(
    name='github_all_info',
    version=1,
    query=query,
    labels=['code_coverage']
)
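
The feature view is stored as metadata in Hopsworks, so in a later notebook or pipeline you can retrieve it by name and version instead of recreating it; a short sketch using the names defined above:

# Retrieve the existing feature view in a new session instead of recreating it
feature_view = feature_store.get_feature_view(name='github_all_info', version=1)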

We can query the feature view in the same way we query any other feature view, regardless of the data being spread across Snowflake, BigQuery and Hopsworks. The data will be queried directly from its source and joined using the Hopsworks Feature Query Service before being returned to Python:

data = feature_view.get_batch_data()
data.head()

The result is returned in a matter of seconds, even though we queried three different platforms:

Create the training data from the Feature View

Finally, we can use the feature view to create training data that could be used to train a model predicting the code coverage of the GitHub repositories:

X_train, X_test, Y_train, Y_test = feature_view.train_test_split(test_size=0.2)
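
To illustrate how this training data could be used, the sketch below fits a simple scikit-learn regressor on the numeric features. It assumes scikit-learn is installed, keeps only the numeric columns (dropping repo_name and licenses rather than encoding them), and is purely illustrative since the code coverage labels were generated randomly above:

from sklearn.ensemble import RandomForestRegressor

# Keep only the numeric features for this simple example; verify the column
# names on your data with X_train.columns, as they follow the feature
# definitions above
numeric_cols = ['sum_stars', 'forks_count', 'open_issues_count']
X_train_num = X_train[numeric_cols].fillna(0)
X_test_num = X_test[numeric_cols].fillna(0)

# Fit and evaluate a simple regressor; the labels are synthetic random numbers,
# so the score is only meant to show the workflow
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train_num, Y_train.values.ravel())
print("R^2 on the test set:", model.score(X_test_num, Y_test.values.ravel()))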

Summary

In this blog post, we have shown how to use the Hopsworks Feature Query Service to join features across Snowflake, BigQuery and Hopsworks with low latency and no data duplication. This gives users a unified interface to their features, even though they were originally created on whichever platform suits each producing team best.

The full source code for this blog post is available on GitHub. You will need a Hopsworks 3.7+ cluster and access to Snowflake and BigQuery to complete this tutorial.

