Row-Level Deletes
Introduction to Row-Level Deletes
In the traditional Data Lake era (built on Apache Hive and raw Hadoop), data was treated as append-only. Because data was stored in large text or Parquet files that could not be modified in place, deleting a specific row meant rewriting every file that contained it, which made single-row deletes impractical at scale.
If a company stored 10 years of sales data in a massive folder partitioned by year, and they discovered a single fraudulent transaction in the 2018 folder, fixing it was a nightmare. The data engineer had to write a MapReduce job to read the entire 2018 dataset (terabytes of data), manually filter out the single bad row, completely delete the old 2018 folder, and write the new terabytes of data back to disk.
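The rewrite pattern described above can be sketched roughly as follows. This is only an illustration under stated assumptions: plain CSV files in a temporary local directory stand in for the 2018 partition, a single Python function stands in for the MapReduce job, and the name `rewrite_partition` and the file layout are hypothetical.

```python
import csv
import shutil
import tempfile
from pathlib import Path


def rewrite_partition(partition_dir: Path, is_bad_row) -> int:
    """Rewrite every file in a partition, dropping rows where is_bad_row(row) is true.

    Returns the number of rows read. Every row in the partition is scanned
    and rewritten, even when only one row is removed.
    """
    staging_dir = partition_dir.with_name(partition_dir.name + "_staging")
    staging_dir.mkdir()
    rows_read = 0
    for src in sorted(partition_dir.glob("*.csv")):
        with src.open(newline="") as fin, (staging_dir / src.name).open("w", newline="") as fout:
            reader = csv.DictReader(fin)
            writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
            writer.writeheader()
            for row in reader:
                rows_read += 1
                if not is_bad_row(row):
                    writer.writerow(row)
    shutil.rmtree(partition_dir)       # delete the old partition folder
    staging_dir.rename(partition_dir)  # swap the rewritten data into place
    return rows_read


# Build a hypothetical "year=2018" partition: 3 files with 4 rows each.
root = Path(tempfile.mkdtemp())
partition = root / "year=2018"
partition.mkdir()
for part in range(3):
    with (partition / f"part-{part}.csv").open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["txn_id", "amount"])
        for i in range(4):
            writer.writerow([part * 4 + i, 100])

# Removing the single transaction 7 still reads all 12 rows.
rows_read = rewrite_partition(partition, lambda row: row["txn_id"] == "7")
print(rows_read)
```

The cost is proportional to the size of the partition rather than to the number of rows deleted, which is the core problem the rest of this article addresses.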
The inability to easily execute Row-Level Deletes (and the related UPDATE and MERGE commands) was a major reason Data Lakes were considered inferior to traditional Data Warehouses.
The Compliance Catalyst: GDPR and CCPA
The inability to delete specific rows shifted from an engineering annoyance to an existential legal threat with the passing of strict privacy frameworks like the General Data Protection Regulation (GDPR) in Europe and the California Consumer Privacy Act (CCPA).
These laws introduced a right to erasure, often called the “Right to be Forgotten.” If a customer emails a company and demands that all their personal data be deleted, the company has a limited legal window (one month under GDPR and 45 days under CCPA, with extensions possible in some cases) to scrub that specific user’s data from every system, database, and data lake it owns.
If a company has petabytes of raw JSON logs sitting in an Amazon S3 Data Lake, and it has no native SQL mechanism to execute DELETE FROM raw_logs WHERE user_email = 'alex@example.com', it risks substantial regulatory fines. Organizations were forced to build expensive, brittle pipelines just to execute compliance deletes.
The Lakehouse Solution
The invention of Open Table Formats (Apache Iceberg, Delta Lake, and Apache Hudi) was heavily driven by the need to support native Row-Level Deletes directly on top of immutable cloud object storage.
By introducing an intelligent metadata layer above the raw Parquet files, these table formats finally allowed data lakes to behave like relational databases.
When a privacy officer executes DELETE FROM iceberg_table WHERE user_id = 99, the compute engine utilizes one of two underlying mechanisms:
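- Copy-on-Write (CoW): The engine finds the specific Parquet file containing user_id = 99, reads it into memory, drops that single row, and writes a brand new Parquet file back to S3, updating the Iceberg metadata to point to the new file.
- Merge-on-Read (MoR): For faster execution, the engine simply writes a tiny “Delete File” to S3 that says “Ignore the row containing user_id = 99.” All future queries automatically filter the user out at read time.
In both cases the superseded data files still physically contain the row until older snapshots are expired and, for MoR, the delete files are compacted into rewritten data files, so a compliance workflow usually needs to schedule that maintenance as well.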
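The difference between the two mechanisms can be illustrated with a toy in-memory model. This is a simplified sketch, not how Iceberg is implemented: `ToyTable`, its fields, and the file names are hypothetical, "files" are Python tuples, and the delete file is modeled as a list of (file name, row position) pairs.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """A toy table: immutable 'data files' plus metadata listing the live ones."""
    files: dict = field(default_factory=dict)         # file name -> tuple of rows
    live_files: list = field(default_factory=list)    # metadata: files in the current snapshot
    delete_files: list = field(default_factory=list)  # MoR: (file name, row position) pairs
    next_id: int = 0

    def add_file(self, rows) -> str:
        name = f"data-{self.next_id}.parquet"
        self.next_id += 1
        self.files[name] = tuple(rows)  # never modified after being written
        self.live_files.append(name)
        return name

    def delete_cow(self, predicate) -> int:
        """Copy-on-Write: rewrite each affected file without the matching rows."""
        rewritten = 0
        for name in list(self.live_files):
            rows = self.files[name]
            kept = [r for r in rows if not predicate(r)]
            if len(kept) == len(rows):
                continue                  # untouched files are not rewritten
            self.live_files.remove(name)  # metadata stops pointing at the old file
            if kept:
                self.add_file(kept)
            rewritten += 1
        return rewritten

    def delete_mor(self, predicate) -> int:
        """Merge-on-Read: record positions of matching rows; data files stay as-is."""
        marked = 0
        for name in self.live_files:
            for pos, row in enumerate(self.files[name]):
                if predicate(row) and (name, pos) not in self.delete_files:
                    self.delete_files.append((name, pos))
                    marked += 1
        return marked

    def scan(self) -> list:
        """Read live files, skipping rows listed in delete files."""
        deleted = set(self.delete_files)
        return [row
                for name in self.live_files
                for pos, row in enumerate(self.files[name])
                if (name, pos) not in deleted]


def build() -> ToyTable:
    table = ToyTable()
    table.add_file([{"user_id": 98}, {"user_id": 99}])
    table.add_file([{"user_id": 100}, {"user_id": 101}])
    return table


def is_user_99(row) -> bool:
    return row["user_id"] == 99


cow = build()
cow.delete_cow(is_user_99)  # rewrites data-0 as data-2, leaves data-1 alone

mor = build()
mor.delete_mor(is_user_99)  # records one delete entry, rewrites nothing
```

Both tables return the same rows from `scan()`, but the work lands in different places: CoW pays at write time by rewriting a whole file, while MoR pays a small cost on every read until the delete entries are compacted away.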
The Upsert (MERGE INTO)
The ability to execute Row-Level Deletes also unlocked one of the most important operations in modern data engineering: the Upsert (Update + Insert), executed via the SQL MERGE INTO command.
When streaming changes from a live operational database via Change Data Capture (CDC), the pipeline frequently receives updates for rows that already exist in the table.
Without row-level mutation capabilities, engineers had to write complex code to deduplicate data. With Iceberg, engineers simply run a MERGE INTO command. The engine automatically checks if the record already exists; if it does, it updates the existing row (using a delete + append). If it doesn’t exist, it inserts it as a new row.
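The matched/not-matched logic of an upsert can be sketched in a few lines. This is a toy illustration of the semantics only, assuming rows are dictionaries held in memory; `merge_into` is a hypothetical helper, not an Iceberg or engine API. It applies source rows in order, whereas real engines may reject a source batch in which several rows match the same target row, so CDC batches are commonly reduced to the latest change per key first.

```python
def merge_into(target: list, source: list, key: str):
    """Apply upsert semantics: update rows whose key matches, insert the rest.

    Loosely corresponds to a MERGE INTO statement with a
    WHEN MATCHED THEN UPDATE clause and a WHEN NOT MATCHED THEN INSERT clause.
    Returns a new list of rows and a small stats dictionary; the inputs are
    left unmodified.
    """
    index = {row[key]: pos for pos, row in enumerate(target)}
    merged = list(target)
    stats = {"updated": 0, "inserted": 0}
    for change in source:
        pos = index.get(change[key])
        if pos is not None:
            # Matched: replace the old version of the row (delete + append in spirit).
            merged[pos] = {**merged[pos], **change}
            stats["updated"] += 1
        else:
            # Not matched: insert as a new row.
            index[change[key]] = len(merged)
            merged.append(dict(change))
            stats["inserted"] += 1
    return merged, stats


customers = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com"},
]
cdc_batch = [
    {"id": 2, "email": "b.new@example.com"},  # existing row: update
    {"id": 3, "email": "c@example.com"},      # new row: insert
]

merged, stats = merge_into(customers, cdc_batch, key="id")
print(stats)
```

In this example one row is updated and one is inserted, and the original `customers` list is left untouched, mirroring the way a table format produces a new snapshot rather than editing files in place.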
Conclusion
Row-Level Deletes are a defining feature that elevated the Data Lake into the Data Lakehouse. By providing standard SQL DELETE, UPDATE, and MERGE capabilities over massive, immutable cloud storage files, Open Table Formats solved one of the most pressing regulatory and architectural challenges of the big data era. They allow organizations to meet GDPR deletion requirements and apply CDC streams in near real time without abandoning the scalability and cost-efficiency of the data lake.