Medallion Architecture
Introduction to the Medallion Architecture
As organizations scale their data ingestion and processing, managing the quality and structure of data becomes increasingly complex. Without a clear framework, data lakes often turn into “data swamps”: repositories of unorganized, untrusted, and difficult-to-query data. To combat this, the Medallion Architecture has emerged as a widely adopted design pattern for logically organizing data within a modern Lakehouse.
The Medallion Architecture (sometimes referred to as the multi-hop architecture) logically organizes data in a lakehouse into three distinct layers, or “medallions”: Bronze, Silver, and Gold. Each layer represents a progressively higher standard of data quality, structure, and business readiness.
By structuring data pipelines to flow through these sequential layers, data engineering teams can trace data lineage, reprocess data reliably, and serve different personas, from data scientists who want raw logs to business analysts who need highly curated aggregate tables.
The Three Layers: Bronze, Silver, and Gold
The Bronze Layer (Raw)
The Bronze layer is the landing zone. Its primary purpose is to capture and store data exactly as it was generated by the source systems, with zero transformations or data loss.
- Characteristics: Immutable, append-only, high volume, schema-on-read.
- Data Formats: Often stored in the format in which it was received (JSON, CSV, Avro, or raw Parquet). In a modern Iceberg lakehouse, tools like Kafka Connect or Flink can land this data directly into Iceberg tables.
- Purpose: The Bronze layer acts as a historical archive. If a downstream transformation logic error is discovered, engineers can always replay the pipeline starting from the Bronze layer because the raw, unadulterated truth is preserved forever.
- Metadata: It is common practice to append system metadata columns during ingestion, such as ingest_timestamp, source_system_id, or kafka_offset. This aids in debugging and auditing.
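The Bronze ingestion step described above can be sketched in plain Python. This is a minimal illustration, not a Kafka Connect or Flink configuration: the `to_bronze_record` helper, the in-memory `bronze_table` list, and the `orders_db` source name are hypothetical stand-ins, and only the three metadata column names come from the text.

```python
from datetime import datetime, timezone


def to_bronze_record(raw_payload: str, source_system_id: str, kafka_offset: int) -> dict:
    """Wrap an untouched source payload with ingestion metadata (hypothetical helper)."""
    return {
        "raw_payload": raw_payload,  # stored exactly as received, never parsed or altered
        "ingest_timestamp": datetime.now(timezone.utc).isoformat(),
        "source_system_id": source_system_id,
        "kafka_offset": kafka_offset,
    }


# In-memory stand-in for an append-only Bronze table.
bronze_table = []

# Two identical messages: Bronze keeps both, deduplication is a Silver concern.
messages = ['{"order_id": 1, "amount": "19.99"}', '{"order_id": 1, "amount": "19.99"}']
for offset, message in enumerate(messages):
    bronze_table.append(to_bronze_record(message, "orders_db", offset))
```

The payload is kept as an opaque string so that a later fix to parsing logic can be replayed against exactly what the source sent.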
The Silver Layer (Cleansed and Conformed)
The Silver layer is the enterprise repository. Data flowing from Bronze to Silver undergoes validation, cleansing, and standardization. It represents a “single source of truth” that is trusted, structured, and ready for exploration.
- Characteristics: Structured, strongly typed, deduplicated, and conformed.
- Operations:
- Data Type Casting: Converting strings to proper TIMESTAMP or DECIMAL types.
- Deduplication: Using Iceberg’s MERGE INTO (upserts) to handle late-arriving or duplicate records originating from source systems.
- Schema Enforcement: Rejecting or quarantining rows that do not conform to the expected schema.
- Standardization: Standardizing formats (e.g., converting all timestamps to UTC, standardizing currency codes).
- Purpose: This layer is typically consumed by data scientists for feature engineering, machine learning model training, and advanced ad-hoc analytics where raw granularity is required but messy data is not.
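The Silver operations listed above (casting, deduplication, schema enforcement, and UTC standardization) can be illustrated with a small pure-Python sketch. The column names (`order_id`, `amount`, `event_time`), the `to_silver` function, and the rule of keeping the latest row per key are assumptions made for the example; in a real pipeline this logic would usually be expressed in an engine such as Spark or Flink.

```python
from datetime import datetime, timezone
from decimal import Decimal, InvalidOperation


def to_silver(bronze_rows):
    """Cast, standardize, and deduplicate rows; quarantine rows that fail validation."""
    latest_by_key = {}
    quarantined = []
    for row in bronze_rows:
        try:
            event_time = datetime.fromisoformat(row["event_time"])
            if event_time.tzinfo is None:
                raise ValueError("timestamp has no UTC offset")
            clean = {
                "order_id": int(row["order_id"]),                   # string -> integer
                "amount": Decimal(row["amount"]),                   # string -> DECIMAL
                "event_time": event_time.astimezone(timezone.utc),  # normalize to UTC
            }
        except (KeyError, TypeError, ValueError, InvalidOperation) as error:
            # Schema enforcement: keep the bad row aside instead of dropping it silently.
            quarantined.append({"row": row, "reason": repr(error)})
            continue
        # Deduplication: keep the most recent version of each order_id.
        current = latest_by_key.get(clean["order_id"])
        if current is None or clean["event_time"] >= current["event_time"]:
            latest_by_key[clean["order_id"]] = clean
    return list(latest_by_key.values()), quarantined


bronze_rows = [
    {"order_id": "1", "amount": "19.99", "event_time": "2024-01-01T10:00:00+02:00"},
    {"order_id": "1", "amount": "19.99", "event_time": "2024-01-01T10:00:00+02:00"},  # duplicate
    {"order_id": "2", "amount": "not-a-number", "event_time": "2024-01-01T09:00:00+00:00"},
]
silver_rows, quarantined_rows = to_silver(bronze_rows)
```

Here the duplicate collapses into one typed row with its timestamp converted to UTC, and the row with an unparseable amount lands in the quarantine list together with the reason it was rejected.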
The Gold Layer (Business-Ready)
The Gold layer is the presentation layer. It is highly refined, aggregated, and modeled specifically to answer business questions.
- Characteristics: Highly aggregated, denormalized, read-optimized.
- Operations:
- Complex joins across multiple Silver tables.
- Aggregations (e.g., daily sales rollups, monthly active users).
- Business logic application (e.g., calculating net revenue, churn metrics).
- Purpose: Gold tables are built for consumption by Business Intelligence (BI) tools (Tableau, Power BI, Apache Superset) and business analysts. They are optimized for low-latency queries and strictly enforce business definitions.
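A Gold-style daily rollup can be sketched as follows. The two Silver inputs (`orders` and `refunds`), their column names, and the definition of net revenue as gross sales minus refunds are assumptions chosen for illustration; an actual Gold table would encode whatever definition the business has agreed on.

```python
from collections import defaultdict
from datetime import date
from decimal import Decimal


def daily_net_revenue(orders, refunds):
    """Combine two Silver inputs into one aggregated row per day."""
    totals = defaultdict(lambda: {"gross": Decimal("0"), "refunded": Decimal("0")})
    for order in orders:
        totals[order["order_date"]]["gross"] += order["amount"]
    for refund in refunds:
        totals[refund["refund_date"]]["refunded"] += refund["amount"]
    return [
        {
            "day": day,
            "gross": t["gross"],
            "refunded": t["refunded"],
            "net_revenue": t["gross"] - t["refunded"],  # assumed business definition
        }
        for day, t in sorted(totals.items())
    ]


orders = [
    {"order_date": date(2024, 1, 1), "amount": Decimal("20.00")},
    {"order_date": date(2024, 1, 1), "amount": Decimal("5.50")},
    {"order_date": date(2024, 1, 2), "amount": Decimal("10.00")},
]
refunds = [{"refund_date": date(2024, 1, 1), "amount": Decimal("5.50")}]

gold_daily_sales = daily_net_revenue(orders, refunds)
```

Keeping the net revenue formula in one place means every dashboard that reads the Gold table reports the same number.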
Implementing Medallion Architecture in an Iceberg Lakehouse
Apache Iceberg is well suited to power the Medallion Architecture due to its transactional guarantees (ACID compliance) and table format features.
Handling Streaming Upserts
Moving data from Bronze to Silver often requires handling Change Data Capture (CDC) streams. If an UPDATE event arrives from a source PostgreSQL database, the Silver layer must reflect this update. Iceberg’s support for row-level deletes (Merge-on-Read) allows data engineers to execute efficient MERGE INTO SQL commands to synchronize the Silver layer incrementally, without rewriting entire datasets.
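To make the upsert semantics concrete, the sketch below replays a batch of CDC events against an in-memory table keyed by primary key. It is a simulation of what a MERGE INTO statement achieves, not Iceberg or engine code: the event shape (`op`, `id`, `row`) and the `lsn` ordering field are assumptions made for the example.

```python
def apply_cdc_batch(silver, cdc_events):
    """Apply change events to a dict keyed by primary key, mimicking MERGE INTO semantics."""
    # Replay events in commit order so the last change to a key wins.
    for event in sorted(cdc_events, key=lambda e: e["lsn"]):
        key = event["id"]
        if event["op"] == "delete":
            silver.pop(key, None)       # matched and deleted at the source: remove the row
        else:
            silver[key] = event["row"]  # matched: update; not matched: insert
    return silver


silver_orders = {
    1: {"id": 1, "status": "new"},
    2: {"id": 2, "status": "new"},
}
cdc_events = [
    {"lsn": 12, "op": "update", "id": 1, "row": {"id": 1, "status": "shipped"}},
    {"lsn": 11, "op": "update", "id": 1, "row": {"id": 1, "status": "paid"}},  # arrived out of order
    {"lsn": 13, "op": "delete", "id": 2, "row": None},
    {"lsn": 14, "op": "insert", "id": 3, "row": {"id": 3, "status": "new"}},
]
apply_cdc_batch(silver_orders, cdc_events)
```

Only the three affected keys are touched, which is the property the incremental merge relies on. In an actual MERGE INTO the source batch typically has to be reduced to one row per key first, whereas this simulation simply replays the events one by one.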
Data as Code and Branching
When altering the transformation logic from Silver to Gold, organizations can use data branching, either through catalog-level branches in a catalog such as Project Nessie or through Iceberg’s own table-level branches, which are stored in table metadata and can therefore be used with other catalogs such as Apache Polaris. An engineer can create a branch of the Gold table, test the new transformation logic in isolation using Trino or Spark, validate the metrics, and then atomically merge the branch back into the main production branch.
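The branch, validate, and merge workflow can be pictured with a toy model in which each branch name is a pointer to an immutable set of rows. The `BranchedTable` class and its methods are hypothetical and do not correspond to the Nessie or Iceberg APIs; they only illustrate why the final merge can be atomic.

```python
class BranchedTable:
    """Toy model: each branch name points at an immutable tuple of rows."""

    def __init__(self, rows):
        self.branches = {"main": tuple(rows)}

    def create_branch(self, name, source="main"):
        # No data is copied; the new branch starts as a second pointer to the same rows.
        self.branches[name] = self.branches[source]

    def write(self, branch, rows):
        # Writes move only the named branch; other branches keep their old pointer.
        self.branches[branch] = tuple(rows)

    def fast_forward(self, target, source):
        # A single pointer swap: readers see either the old rows or the new ones, never a mix.
        self.branches[target] = self.branches[source]


gold = BranchedTable([("2024-01-01", 100)])
gold.create_branch("test_new_logic")
gold.write("test_new_logic", [("2024-01-01", 95)])  # output of the new transformation

main_before_merge = gold.branches["main"]  # production is still untouched here

# Validate on the branch before publishing (a stand-in for real metric checks).
if all(revenue >= 0 for _, revenue in gold.branches["test_new_logic"]):
    gold.fast_forward("main", "test_new_logic")
```

If validation fails, the branch is simply discarded and production readers never see the experimental output.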
Time Travel for Pipeline Reprocessing
If a bug is introduced in the Silver layer, Iceberg’s snapshot history lets engineers roll back the Silver table to a snapshot taken before the bug appeared, for example one from 24 hours ago. They can then replay the pipeline from the Bronze layer using the corrected transformation code.
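The rollback-then-replay sequence can be sketched with a toy snapshot log. The `SnapshotTable` class is a hypothetical in-memory model, not the Iceberg API: it assumes that every commit adds a timestamped snapshot and that rolling back only moves a pointer to an earlier snapshot.

```python
from datetime import datetime, timedelta, timezone


class SnapshotTable:
    """Toy model of a table whose commits are kept as timestamped snapshots."""

    def __init__(self):
        self.snapshots = []        # (committed_at, rows), oldest first
        self.current_index = None  # which snapshot readers currently see

    def commit(self, rows, committed_at):
        self.snapshots.append((committed_at, tuple(rows)))
        self.current_index = len(self.snapshots) - 1

    def current_rows(self):
        return self.snapshots[self.current_index][1]

    def rollback_to_timestamp(self, as_of):
        """Point the table at the last snapshot committed at or before as_of."""
        eligible = [i for i, (ts, _) in enumerate(self.snapshots) if ts <= as_of]
        if not eligible:
            raise ValueError("no snapshot exists at or before the requested time")
        self.current_index = eligible[-1]


now = datetime(2024, 1, 3, tzinfo=timezone.utc)
bronze_values = ["10", "20", "30"]  # raw strings preserved in Bronze

silver = SnapshotTable()
silver.commit([10, 20], now - timedelta(hours=48))    # good state
silver.commit([10, -999], now - timedelta(hours=2))   # buggy transformation ran here

silver.rollback_to_timestamp(now - timedelta(hours=24))
rows_after_rollback = silver.current_rows()

# Replay from Bronze with the corrected transformation.
silver.commit([int(value) for value in bronze_values], now)
```

The rollback discards nothing from Bronze, which is why the replay step can rebuild Silver from the complete raw history.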
Conclusion
The Medallion Architecture provides the necessary discipline to tame the inherent chaos of big data. By enforcing strict boundaries between raw ingestion, enterprise standardization, and business aggregation, teams can ensure their data is auditable, reproducible, and highly performant. When deployed on top of a modern Apache Iceberg lakehouse, the multi-hop architecture transitions from a conceptual best practice into a highly robust, automated, and scalable engineering reality.