What are pandas UDFs?
Pandas UDFs (User-Defined Functions) are functions that allow users to perform feature engineering (or any custom transformations) on a Pandas DataFrame using PySpark. Pandas UDFs enable high performance feature engineering using Python functions on data stored in a PySpark DataFrame.
PySpark provides built-in functions for common feature engineering operations, such as filtering, grouping, and aggregating data. However, if you need to write your own custom feature engineering code, you can instead write a Pandas UDF. PySpark will convert your Spark DataFrame into a Pandas DataFrame, apply the user-defined function on the Pandas DataFrame, and then convert the output back into a PySpark DataFrame - without any serialization/deserialization costs as PyArrow is used transfer data between PySpark and Pandas.
How do I write a pandas UDF?
You write your UDF and add the pandas_udf decorator provided by PySpark. The decorator allows you to specify the input and output types of the UDF and provides other configuration options. Once the UDF is defined, it can be applied to a PySpark DataFrame using the pandas_udf function.
The case for pandas UDFs for feature engineering:
- Performance: Pandas UDFs enable vectorized computation, which leads to improved performance compared to Python UDFs. Vectorized operations in Pandas are implemented using low-level C or Cython code, which significantly reduces the overhead associated with Python's interpreted nature.
- Flexibility: Pandas UDFs can be used in various environments, including Spark feature pipelines, Spark batch inference, and Python-based online inference pipelines (as supported by on-demand features in the Hopsworks feature store). This versatility makes it easier to integrate Pandas UDFs across different stages of a machine learning pipeline.
- Familiarity: For data scientists and engineers who are already familiar with the Pandas library, working with Pandas UDFs is a natural extension of their existing skill set. This reduces the learning curve and makes it easier to implement custom feature transformations.
- Expressiveness: Pandas provides a rich set of data manipulation functions, allowing users to implement complex feature transformations with just a few lines of code. This expressiveness can simplify the feature engineering process and reduce the time spent on implementing custom functions.