Open Table Formats
Introduction to Open Table Formats
To understand why Open Table Formats matter, you must understand the two major flaws of the previous generation of data architecture.
- The Data Warehouse Lock-in: If you loaded data into a proprietary warehouse (like Oracle or Teradata), the vendor converted your data into its own proprietary storage format. If you wanted to leave that vendor, or use a different tool to analyze the data, you first had to export everything back out through the vendor's engine. Your data was effectively held hostage.
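- The Data Lake Chaos: To escape lock-in, companies dumped their data as open-format Parquet files into Amazon S3 (the Data Lake). But S3 is just object storage: it stores files and knows nothing about tables. It offers no transactional guarantees across multiple files. If two writers changed the same dataset at the same time, or a reader arrived in the middle of a write, the result could be inconsistent or corrupted. You couldn’t reliably run a SQL UPDATE or DELETE command.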
The industry needed a solution that provided the transactional perfection of a Data Warehouse, but kept the data stored in open, vendor-neutral files on cheap cloud storage.
This solution is the Open Table Format. It is the architectural foundation of the modern Data Lakehouse.
What is an Open Table Format?
An Open Table Format (like Apache Iceberg, Delta Lake, or Apache Hudi) is not a query engine, and it is not a storage platform. It is a Metadata Abstraction Layer.
When you write an Iceberg table, two things are saved to your Amazon S3 bucket:
- The Data Files: Standard, open-format data files, typically Apache Parquet (Iceberg also supports ORC and Avro).
- The Metadata Files: A collection of JSON and Avro files that record exactly which data files belong to the table.
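The split between data files and metadata files can be illustrated with a small sketch. This is a toy model, not Iceberg's actual on-disk layout: the directory names, the `v1.metadata.json` file name, the `data_files` key, and the `write_table` and `read_table` helpers are all hypothetical, and JSON-lines files stand in for Parquet so the example needs only the standard library.

```python
import json
import tempfile
from pathlib import Path


def write_table(root: Path, batches: list[list[dict]]) -> Path:
    """Write each batch as a data file, then one metadata file listing them."""
    data_dir = root / "data"
    meta_dir = root / "metadata"
    data_dir.mkdir(parents=True)
    meta_dir.mkdir(parents=True)

    data_files = []
    for i, rows in enumerate(batches):
        # Stand-in for a Parquet file; JSON lines keeps the sketch stdlib-only.
        path = data_dir / f"part-{i:05d}.jsonl"
        path.write_text("\n".join(json.dumps(r) for r in rows))
        data_files.append(path.relative_to(root).as_posix())

    # The metadata file is the source of truth: only files listed here count.
    metadata_path = meta_dir / "v1.metadata.json"
    metadata_path.write_text(json.dumps({"data_files": data_files}))
    return metadata_path


def read_table(root: Path, metadata_path: Path) -> list[dict]:
    """Read only the files named in the metadata; never list the data directory."""
    metadata = json.loads(metadata_path.read_text())
    rows = []
    for rel in metadata["data_files"]:
        for line in (root / rel).read_text().splitlines():
            rows.append(json.loads(line))
    return rows


root = Path(tempfile.mkdtemp())
meta = write_table(root, [[{"id": 1}, {"id": 2}], [{"id": 3}]])

# A stray file in the data directory is not part of the table.
(root / "data" / "orphan.jsonl").write_text(json.dumps({"id": 999}))

rows = read_table(root, meta)
print(rows)
```

The orphan file sits in the same directory as the real data, but the reader never sees it because it resolves the table through the metadata rather than by listing storage.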
How it Solves the Problem
When an engine (like Dremio, Spark, or Snowflake) wants to query the table, it doesn’t blindly scan S3. It reads the Iceberg Metadata.
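- ACID Transactions: Because every read goes through the metadata, Iceberg can guarantee atomic commits. If an engine writes 100 new Parquet files, they stay invisible to readers while the write is in progress. Only when the write is 100% complete does Iceberg atomically swap the table's metadata pointer to a new version that includes the new files, so readers see either all 100 files or none of them.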
- Row-Level Updates/Deletes: If an analyst runs a DELETE command, Iceberg handles the mechanics of rewriting the affected Parquet files or recording a delete file, so the Data Lake supports the row-level changes you would expect from a relational database.
- Time Travel: Because the metadata keeps a historical log of every change, an analyst can query the table exactly as it existed 30 days ago, as long as the snapshots from that time have not been expired.
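The three behaviors in this list can be sketched with a toy in-memory model. It is an illustration under stated assumptions, not how Iceberg is implemented: `ToyTable`, `Snapshot`, and every method name are hypothetical, the "files" are Python lists, and deletes are modeled only as copy-on-write rewrites (the delete-file approach is left out).

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    data_files: tuple  # names of the immutable data files visible in this snapshot


@dataclass
class ToyTable:
    files: dict = field(default_factory=dict)      # file name -> list of rows
    snapshots: list = field(default_factory=list)  # full history, oldest first
    current: int = -1                              # index of the live snapshot

    def _commit(self, data_files):
        # The single assignment to self.current plays the role of the pointer swap.
        self.snapshots.append(Snapshot(len(self.snapshots), tuple(data_files)))
        self.current = len(self.snapshots) - 1

    def _live_files(self):
        return self.snapshots[self.current].data_files if self.current >= 0 else ()

    def append(self, new_files: dict):
        # Step 1: write the data files. No snapshot references them yet.
        self.files.update(new_files)
        # Step 2: publish them all at once with one new snapshot.
        self._commit(self._live_files() + tuple(new_files))

    def delete_where(self, predicate):
        # Copy-on-write: rewrite only the files that contain matching rows.
        kept = []
        for name in self._live_files():
            rows = self.files[name]
            if any(predicate(r) for r in rows):
                new_name = f"{name}.rewrite-{len(self.snapshots)}"
                self.files[new_name] = [r for r in rows if not predicate(r)]
                kept.append(new_name)
            else:
                kept.append(name)
        self._commit(kept)

    def scan(self, snapshot_id=None):
        # Time travel: read through an older snapshot instead of the current one.
        if snapshot_id is None:
            names = self._live_files()
        else:
            names = self.snapshots[snapshot_id].data_files
        return [row for name in names for row in self.files[name]]


table = ToyTable()
table.append({"a.parquet": [{"id": 1}, {"id": 2}], "b.parquet": [{"id": 3}]})
table.delete_where(lambda r: r["id"] == 2)

print(table.scan())               # current state, after the delete
print(table.scan(snapshot_id=0))  # the table as of the first commit
```

Old data files are never modified in place, which is what makes the earlier snapshot still readable after the delete. In a real table, expiring old snapshots is what eventually allows those files to be cleaned up.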
The Big Three: Iceberg, Delta, and Hudi
The Open Table Format wars began in the late 2010s, dominated by three major open-source projects.
- Apache Iceberg: Originally developed at Netflix. It is widely considered the most truly “open” standard, boasting the largest ecosystem of native integrations across AWS, Google Cloud, Snowflake, Dremio, and open-source engines. It was designed from the ground up for massive, petabyte-scale table metadata management.
- Delta Lake: Developed by Databricks. It is heavily tied to the Apache Spark ecosystem and optimized for the Databricks platform. It is exceptionally popular due to its seamless developer experience within Databricks.
- Apache Hudi: Originally developed at Uber. Hudi (Hadoop Upserts Deletes and Incrementals) was explicitly designed for massive streaming workloads and heavy real-time Upserts.
The Promise of Interoperability
The defining characteristic of these formats is the word Open.
Because the Iceberg specification is public and open-source, any vendor can build an engine that reads and writes it. An organization can use Apache Flink to ingest streaming data into an Iceberg table, use Apache Spark to run a massive batch transformation on that table, use Snowflake to run a highly secure executive dashboard on it, and use Dremio to federate it, all without moving or copying a single byte of data.
Conclusion
Open Table Formats represent the final decoupling of compute from storage. By providing a standardized, vendor-neutral layer of metadata on top of cloud object storage, they stripped proprietary Data Warehouses of their core advantage. They transformed the chaotic Data Lake into the highly reliable Data Lakehouse, ensuring that enterprises permanently retain ownership, flexibility, and control over their most valuable data assets.