A Data Lakehouse is a modern data architecture that combines the benefits of data lakes and data warehouses. It consists of an open table format, such as Apache Hudi, Delta Lake, or Apache Iceberg, together with a set of programs that perform transactional updates and database management operations on the files stored in that format.
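The transactional layer described above typically works by recording each transaction as a numbered commit file in a log directory, which readers replay to reconstruct the current table state (Delta Lake's `_delta_log` directory follows this pattern). The sketch below illustrates the idea with plain Python and JSON; the function names, the `op`/`path` action fields, and the log layout are simplified illustrations, not the actual on-disk protocol of any of these formats.

```python
import json
from pathlib import Path

# Illustrative sketch: a table is defined by a log of numbered JSON commit
# files; each commit lists "add" and "remove" actions on data files.
# Readers replay the log in order to find the live set of data files.

def commit(log_dir: Path, actions: list[dict]) -> int:
    """Record a transaction as the next numbered commit file in the log."""
    log_dir.mkdir(parents=True, exist_ok=True)
    version = len(list(log_dir.glob("*.json")))
    # Zero-padded names keep lexicographic and numeric order identical.
    commit_file = log_dir / f"{version:020d}.json"
    # Creating the commit file is the atomic "commit point" of the transaction.
    commit_file.write_text("\n".join(json.dumps(a) for a in actions))
    return version

def current_files(log_dir: Path) -> set[str]:
    """Replay all commits in order to compute the current set of data files."""
    files: set[str] = set()
    for commit_file in sorted(log_dir.glob("*.json")):
        for line in commit_file.read_text().splitlines():
            action = json.loads(line)
            if action["op"] == "add":
                files.add(action["path"])
            elif action["op"] == "remove":
                files.discard(action["path"])
    return files
```

Because the data files themselves are immutable, an update or compaction is expressed as a single commit that removes old files and adds new ones, which is what lets these formats offer snapshot isolation on top of cheap object storage.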
Lakehouses are often used to store the raw data from which features are computed. They provide a centralized platform for storing and processing data that is both flexible and cost-effective, and they support both batch and stream processing, making it easier to work with historical and real-time data alike.
Apache Hudi is supported by AWS, Hopsworks, and Onehouse. Delta Lake is supported by Databricks. Apache Iceberg is supported by Snowflake and AWS. Apache Spark is a popular data-parallel processing engine for performing ETL operations on Data Lakehouse tables.