The Data Lakehouse with Apache Iceberg

A Data Lakehouse is an open data management architecture that combines the cost-efficiency and flexibility of a Data Lake (object storage such as S3 or GCS) with the performance and ACID transactions of a Data Warehouse. Apache Iceberg is the leading open table format for this architecture.


🏗️ Why do we need Table Formats?

Data Lakes (S3) are just folders with files (Parquet/JSON).

  • Problem: No way to update a single row without rewriting the whole file.
  • Problem: Multiple users reading/writing at the same time can cause corruption.
  • Problem: No safe “Schema Evolution” (adding, renaming, or dropping a column without rewriting data).

🚀 Apache Iceberg Features

1. ACID Transactions

Multiple applications can read and write to the same table simultaneously without corrupting each other’s changes. Iceberg uses an “Optimistic Concurrency Control” model: each writer prepares its changes, then atomically swaps the table’s metadata pointer; if another writer committed first, the commit is retried against the new table state.

2. Time Travel

Every write creates a new “Snapshot.” You can query the table exactly as it looked at a specific point in history.

SELECT * FROM orders FOR SYSTEM_TIME AS OF '2023-10-01 12:00:00';
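Besides timestamps, a query can be pinned to a specific snapshot ID. A minimal sketch (the snapshot ID below is a hypothetical value; list the real ones via the table’s snapshots metadata table):

-- Inspect the table's snapshot history
SELECT * FROM prod.db.orders.snapshots;

-- Pin a query to one snapshot (ID is a placeholder)
SELECT * FROM prod.db.orders VERSION AS OF 10963874102873;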

🛠️ Implementation: Spark SQL (Upsert)

Iceberg brings full SQL DML support (INSERT, UPDATE, DELETE, MERGE) to data stored on S3. Here is how you create a table and perform a high-performance upsert (MERGE).

-- 1. Create a partitioned Iceberg table
CREATE TABLE prod.db.orders (
    id bigint,
    status string,
    updated_at timestamp
) 
USING iceberg
PARTITIONED BY (days(updated_at));

-- 2. Perform an Upsert (Merge)
MERGE INTO prod.db.orders AS target
USING (SELECT * FROM staging_orders) AS source
ON target.id = source.id
WHEN MATCHED THEN 
    UPDATE SET target.status = source.status, target.updated_at = source.updated_at
WHEN NOT MATCHED THEN 
    INSERT *;

3. Partition Evolution

In old systems (Hive), if you changed your partition from “Year” to “Day,” you had to rewrite the whole table. Iceberg can change partitioning on-the-fly without a rewrite.
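With Iceberg’s Spark SQL extensions enabled, a partition change is a metadata-only DDL statement; existing data files keep the old layout and only new writes use the new one. A sketch against the orders table above, moving from daily to hourly partitions:

-- Metadata-only change; no table rewrite
ALTER TABLE prod.db.orders
REPLACE PARTITION FIELD days(updated_at) WITH hours(updated_at);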

4. Hidden Partitioning

Iceberg tracks the relationship between a column and its partition transform automatically. You filter on the raw column and Iceberg prunes partitions for you; unlike Hive, you never add a redundant predicate on a separate partition column (e.g., both WHERE date = ... and WHERE year = ...).
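For the orders table above, a plain timestamp filter is enough for partition pruning; no derived day or year column appears in the query:

-- Iceberg maps this filter onto the days(updated_at) partitions automatically
SELECT count(*) FROM prod.db.orders
WHERE updated_at >= TIMESTAMP '2023-10-01 00:00:00';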


🛠️ Architecture: The Catalog

Iceberg tracks tables through a tree of metadata files rather than by listing file paths.

  1. Catalog: (e.g., AWS Glue, Nessie) Points to the latest “Metadata File.”
  2. Metadata File: Contains the schema, partition spec, and snapshot history, and points to “Manifest Lists.”
  3. Manifest List: Points to “Manifest Files.”
  4. Manifest File: Lists the actual data files (Parquet/ORC/Avro) along with per-file statistics used for pruning.
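A catalog is wired up through Spark configuration. A minimal sketch for an AWS Glue-backed catalog named prod (the warehouse path is a placeholder, and the exact keys depend on your Iceberg and AWS dependency versions):

spark.sql.catalog.prod = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.prod.catalog-impl = org.apache.iceberg.aws.glue.GlueCatalog
spark.sql.catalog.prod.warehouse = s3://my-bucket/warehouse
spark.sql.catalog.prod.io-impl = org.apache.iceberg.aws.s3.S3FileIO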

💡 Key Use Cases

  1. Compliance (GDPR): Efficiently deleting a single user’s data from a 100TB S3 bucket.
  2. Streaming Ingestion: Writing real-time data into Parquet files without creating thousands of small, unoptimized files (automatic compaction).
  3. Cross-Engine Compatibility: Read the same data from Spark, Trino, Flink, and Dremio without moving it.
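Use cases 1 and 2 map to short SQL statements. A sketch (the user ID is a placeholder, and rewrite_data_files is Iceberg’s Spark compaction procedure):

-- GDPR delete: only the data files containing matching rows are rewritten
DELETE FROM prod.db.orders WHERE id = 42;

-- Compact small files produced by streaming ingestion
CALL prod.system.rewrite_data_files(table => 'db.orders');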

📊 Iceberg vs. Delta Lake vs. Hudi

Feature  | Apache Iceberg        | Delta Lake          | Apache Hudi
Backing  | Netflix / Apple       | Databricks          | Uber
Focus    | Correctness & Scale   | Ease of Use (Spark) | Streaming Upserts
Openness | Very High (Community) | High                | High

💡 Engineering Takeaway

If you are building a modern data platform on S3/Azure Blob/GCS, Apache Iceberg is the foundation for making your lake behave like a professional database.