Last-mile DE-ASSOC review: Spark SQL/DataFrames essentials, Delta Lake features (schema, time travel, merge), ETL patterns, and exam-style pickers. Includes code snippets, tables, and diagrams.
Use this for last‑mile review. Pair it with the Syllabus for coverage and Practice to validate speed/accuracy.
| Category | Examples | Notes |
|---|---|---|
| Transformations | select, filter, withColumn, groupBy, join | Lazy: build a plan |
| Actions | count, collect, show, write, foreach | Trigger execution |
Rule: If it returns a DataFrame, it’s usually a transformation. If it returns a value or writes, it’s usually an action.
| Type | What it implies | Examples |
|---|---|---|
| Narrow | No shuffle | select, filter, map |
| Wide | Shuffle | groupBy, join, distinct, orderBy |
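A quick way to see the difference in a notebook is EXPLAIN: a wide operation shows up as an Exchange (the shuffle) in the physical plan, a narrow one does not. A minimal sketch, assuming a hypothetical `events` table:

```sql
-- Narrow: scan + filter + project, no Exchange in the physical plan
EXPLAIN SELECT user_id, amount FROM events WHERE amount > 0;

-- Wide: the GROUP BY forces a shuffle, visible as "Exchange hashpartitioning(user_id, ...)"
EXPLAIN SELECT user_id, SUM(amount) AS total FROM events GROUP BY user_id;
```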
| Join | Keeps rows from | Common use |
|---|---|---|
| INNER | both sides match | dimensional enrichment |
| LEFT | left side always | “keep all facts” |
| FULL | both sides always | reconciliation |
| SEMI/ANTI | left rows with/without match | de-dup and “missing” detection |
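In Spark SQL the last row is spelled LEFT SEMI JOIN / LEFT ANTI JOIN. A sketch, assuming hypothetical `orders` and `customers` tables that share a `customer_id` key:

```sql
-- SEMI: orders whose customer exists; only left-side columns come back
SELECT o.*
FROM orders o
LEFT SEMI JOIN customers c ON o.customer_id = c.customer_id;

-- ANTI: orders pointing at a missing customer ("missing" detection)
SELECT o.*
FROM orders o
LEFT ANTI JOIN customers c ON o.customer_id = c.customer_id;
```

The other exam-favorite dedup tool is a window function, as in the snippet below.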
```sql
SELECT
  user_id,
  ts,
  amount,
  ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ts DESC) AS rn
FROM transactions;
```
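The usual follow-up is to keep only the newest row per key. A sketch that wraps the query above in a CTE (Databricks SQL also offers a QUALIFY clause for this, but the subquery form runs anywhere):

```sql
-- Latest transaction per user: rank within each user, keep rank 1
WITH ranked AS (
  SELECT
    user_id,
    ts,
    amount,
    ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ts DESC) AS rn
  FROM transactions
)
SELECT user_id, ts, amount
FROM ranked
WHERE rn = 1;
```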
Delta Lake adds ACID + schema enforcement + time travel to data lake storage.
```sql
CREATE TABLE IF NOT EXISTS silver_orders
USING DELTA
AS SELECT * FROM bronze_orders;
```
MERGE (upsert / CDC pattern)

```sql
MERGE INTO silver_orders t
USING staging_orders s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```
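CDC feeds often carry an operation flag, and Delta MERGE accepts extra, conditional WHEN MATCHED clauses, so deletes can be handled in the same statement. A sketch with hypothetical columns (`_op`, `amount`, `status`, `updated_at`); explicit column lists keep the flag itself out of the target table:

```sql
MERGE INTO silver_orders t
USING staging_orders s
ON t.order_id = s.order_id
WHEN MATCHED AND s._op = 'D' THEN DELETE          -- rows flagged as deleted upstream
WHEN MATCHED THEN UPDATE SET
  t.amount = s.amount, t.status = s.status, t.updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT (order_id, amount, status, updated_at)
  VALUES (s.order_id, s.amount, s.status, s.updated_at);
```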
| Concept | What it means | What you see |
|---|---|---|
| Schema enforcement | rejects incompatible writes | write fails |
| Schema evolution | updates table schema intentionally | new columns appear |
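In SQL, intentional evolution is usually an explicit ALTER (for DataFrame writes, the writer's `mergeSchema` option plays the same role); enforcement is simply the failed write you get without it. A sketch adding a hypothetical `coupon_code` column:

```sql
-- Before this, an append that includes coupon_code is rejected (enforcement);
-- after it, the column is part of the schema and such writes succeed (evolution).
ALTER TABLE silver_orders ADD COLUMNS (coupon_code STRING);
```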
```sql
DESCRIBE HISTORY silver_orders;

SELECT * FROM silver_orders VERSION AS OF 42;
```
Exam cue: time travel is for debugging/audit/rollback; it doesn’t replace good pipeline discipline.
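When a rollback genuinely is the right call, Delta has a dedicated command; a sketch reusing the version number from the query above:

```sql
-- Roll the table back to an earlier snapshot (TIMESTAMP AS OF works too)
RESTORE TABLE silver_orders TO VERSION AS OF 42;
```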
| You want… | Do this | Why |
|---|---|---|
| Faster pruning on common filters | Partition on low-to-medium cardinality columns | Partition pruning |
| Avoid too many tiny files | Don’t over-partition; use compaction/OPTIMIZE where appropriate | Small-file problem |
| Faster reads on frequent predicates | Use file compaction + data skipping (conceptually) | Fewer files + better locality |
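A sketch combining the rows above, assuming a hypothetical `silver_events` table with an `order_date` partition column; OPTIMIZE (optionally with ZORDER) is Databricks' compaction and data-skipping command:

```sql
-- Low-cardinality partition column -> partition pruning on date filters
CREATE TABLE IF NOT EXISTS silver_events (
  event_id   STRING,
  user_id    STRING,
  amount     DOUBLE,
  order_date DATE
)
USING DELTA
PARTITIONED BY (order_date);

-- Compact small files; ZORDER clusters by a frequent (non-partition) predicate column
OPTIMIZE silver_events ZORDER BY (user_id);
```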
```mermaid
flowchart LR
  B["Bronze (raw)"] --> S["Silver (cleaned)"]
  S --> G["Gold (business-ready)"]
```
Rules of thumb
- MERGE join condition wrong or not unique: when several source rows match the same target row, Delta aborts the merge instead of guessing which update wins, so make the ON clause a true key and dedupe the source if needed (see the sketch below).
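If the source can legitimately carry several rows per key, dedupe it first with the same window trick from earlier. A sketch, assuming staging_orders has `amount`, `status`, and `updated_at` columns and that these (plus `order_id`) make up the target schema:

```sql
MERGE INTO silver_orders t
USING (
  SELECT order_id, amount, status, updated_at        -- project away the helper rn column
  FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) AS rn
    FROM staging_orders
  ) ranked
  WHERE rn = 1                                        -- one source row per key avoids the multi-match error
) s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```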