Pandas UDFs (User-Defined Functions) are Python functions that PySpark applies to a Spark DataFrame by passing the data to them as Pandas objects (Series or DataFrames). Because they operate on whole batches of rows with vectorized Pandas operations rather than on one row at a time, Pandas UDFs enable high-performance feature engineering (or any custom transformation) on data stored in a PySpark DataFrame.
PySpark provides built-in functions for common feature engineering operations, such as filtering, grouping, and aggregating data. However, if you need custom feature engineering logic that the built-in functions do not cover, you can write a Pandas UDF. PySpark splits the Spark DataFrame into batches, converts each batch into Pandas Series or DataFrames, applies the user-defined function to it, and then converts the output back into a PySpark DataFrame. Data is transferred between Spark and Pandas using Apache Arrow (PyArrow), a columnar format that makes serialization/deserialization far cheaper than with row-at-a-time Python UDFs, although not entirely free.
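You write your UDF as a Python function and add the pandas_udf decorator provided by PySpark. The decorator lets you specify the return type of the UDF and accepts a few other configuration options, while the Pandas input and output types (for example, Series to Series) are declared with Python type hints in Spark 3.0 and later. Once the UDF is defined, you apply it to a PySpark DataFrame by calling it on one or more columns inside a DataFrame operation such as select or withColumn.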
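As a minimal sketch of the workflow described above, the example below defines a Series-to-Series Pandas UDF and applies it with withColumn. The function name fahrenheit_to_celsius, the helper run_on_spark, the column names, and the sample rows are all hypothetical and chosen only for illustration. It assumes Spark 3.0 or later with PyArrow installed.

```python
import pandas as pd


def fahrenheit_to_celsius(temps: pd.Series) -> pd.Series:
    """Vectorized conversion: receives a whole batch of values as a Series."""
    return (temps - 32) * 5.0 / 9.0


def run_on_spark() -> list:
    """Apply the function above as a Pandas UDF on a tiny local DataFrame."""
    # PySpark is imported lazily so the Pandas logic above can be
    # unit-tested without a Spark installation.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf

    # Equivalent to decorating the function with @pandas_udf("double").
    # The Series -> Series type hints tell Spark which kind of Pandas UDF this is.
    to_celsius = pandas_udf(fahrenheit_to_celsius, returnType="double")

    spark = (
        SparkSession.builder.master("local[1]")
        .appName("pandas-udf-sketch")
        .getOrCreate()
    )
    # Hypothetical sample data: (sensor_id, temperature in Fahrenheit).
    df = spark.createDataFrame([(1, 32.0), (2, 212.0)], ["sensor_id", "temp_f"])

    # The UDF is applied by calling it on a column inside withColumn.
    result = df.withColumn("temp_c", to_celsius(df["temp_f"]))
    rows = result.orderBy("sensor_id").collect()
    spark.stop()
    return rows
```

Keeping the transformation as a plain Pandas function and wrapping it separately is one possible design choice: the same logic can be tested on an ordinary Pandas Series, and the decorator form can be used instead when that separation is not needed.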