DataFrames are widely used for data transformation and analysis. Historically, they have been created from files in CSV, JSON, or Parquet formats. As DataFrames have become more widely adopted in data processing pipelines, they have acquired new capabilities for reading from and writing to databases and data lakes.
In 2023, with the introduction of Pandas 2.0, Apache Arrow became the dominant standard for both the in-memory representation and the over-the-wire transfer format for data in DataFrames. Data can now be moved at little cost between DataFrame frameworks such as Pandas, Polars, and even distributed DataFrames such as PySpark. Network protocols such as Arrow Flight and ADBC enable data to flow from columnar datastores, like data warehouses, directly to Pandas clients without serialization/deserialization or conversion between row-oriented and column-oriented formats (as required by JDBC/ODBC APIs).
In this talk, we will explore the importance of Apache Arrow in the evolution of data for DataFrames, from CSV and Parquet file formats to table formats such as Apache Hudi, Apache Iceberg, and Delta Lake. We will examine the significant performance benefits of using Apache Arrow end-to-end when integrating Python clients with data lakes and warehouses. We will also look ahead to work we have been doing in the open-source Hopsworks platform on network-hosted (serverless), incremental tables for DataFrames, backed by Arrow end-to-end.