Last-mile DE-PRO review: incremental pipelines, Structured Streaming (watermarks/checkpoints), Delta Live Tables concepts, performance-tuning decision points (shuffle/skew/file layout), and production troubleshooting heuristics.
Use this for last‑mile review. Pair it with the Syllabus for coverage and Practice to harden production instincts.
If two answers “work,” choose the one that is safer in production: idempotent, recoverable after failure, and simpler to operate.
MERGE (CDC)

```sql
MERGE INTO silver t
USING cdc s
ON t.id = s.id
WHEN MATCHED AND s.op = 'D' THEN DELETE
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```
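A CDC feed often carries several changes for the same key in one batch, and MERGE requires at most one source match per target row, so dedupe the source first. A minimal, Spark-free sketch of "keep the latest change per key" (the fields `id`, `seq`, `op` are illustrative, not from the source):

```python
def latest_per_key(cdc_batch):
    """Collapse a CDC batch to the most recent change per id.

    Each record is a dict with hypothetical fields:
    id (the key), seq (monotonically increasing change number), op ('I'/'U'/'D').
    """
    latest = {}
    for rec in cdc_batch:
        cur = latest.get(rec["id"])
        if cur is None or rec["seq"] > cur["seq"]:
            latest[rec["id"]] = rec
    return list(latest.values())

batch = [
    {"id": 1, "seq": 10, "op": "I"},
    {"id": 1, "seq": 11, "op": "U"},  # supersedes seq 10 for id 1
    {"id": 2, "seq": 12, "op": "D"},
]
deduped = latest_per_key(batch)
```

In Spark the same idea is usually expressed with a window over `id` ordered by the change sequence descending, keeping only the first row, before running the MERGE.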
Rules of thumb
- Keep the ON condition unique on the source side: if one target row matches multiple source rows, the MERGE fails at runtime.

Structured Streaming essentials

| Concept | Why it matters | Failure mode |
|---|---|---|
| Checkpoint | enables exactly-once style recovery for sinks | deleting/moving checkpoint breaks correctness |
| Watermark | bounds state and handles late data | missing watermark → unbounded state |
```python
(df
 .withWatermark("event_time", "10 minutes")
 .writeStream
 .format("delta")
 .option("checkpointLocation", "/chk/orders")
 .outputMode("append")
 .start("/delta/silver/orders"))
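Conceptually, a 10-minute watermark means the engine tracks the maximum event time it has seen and may discard events (and expire state) older than that maximum minus 10 minutes. A toy, Spark-free illustration of just the cutoff rule:

```python
from datetime import datetime, timedelta

WATERMARK = timedelta(minutes=10)

def filter_late(events):
    """Drop events that arrive later than the watermark allows.

    events: list of (event_time, payload) in arrival order. This mirrors
    only the cutoff rule; real Structured Streaming also uses the
    watermark to expire windowed aggregation state.
    """
    max_seen = None
    kept = []
    for event_time, payload in events:
        max_seen = event_time if max_seen is None else max(max_seen, event_time)
        if event_time >= max_seen - WATERMARK:
            kept.append((event_time, payload))
    return kept

t0 = datetime(2024, 1, 1, 12, 0)
kept = filter_late([
    (t0, "a"),
    (t0 + timedelta(minutes=30), "b"),
    (t0 + timedelta(minutes=5), "late"),  # 25 min behind max_seen: dropped
])
```

Without the watermark, `max_seen - WATERMARK` never advances past the data, which is exactly the "unbounded state" failure mode in the table above.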
Exam cues: a deleted or relocated checkpoint breaks exactly-once recovery, and a missing watermark lets aggregation state grow without bound.
DLT is about declarative pipelines with built-in operational structure:
```mermaid
flowchart LR
  BR["Bronze (ingest)"] --> SI["Silver (clean + dedupe)"]
  SI --> GO["Gold (metrics)"]
```
Operator mindset: treat expectations as guardrails; don’t silently pass bad data downstream.
| Symptom | Likely cause | Safe next step |
|---|---|---|
| Slow joins/aggregations | heavy shuffle | reduce data early; pick join strategy; tune partitions |
| One task runs forever | data skew | handle hot keys; split/skew hints (concept-level) |
| Lots of tiny files | write pattern | compaction/OPTIMIZE (concept-level) |
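For the skew row above, the concept-level fix is to find the hot keys and salt them so one partition does not carry an entire key. A Spark-free sketch of the salting idea (the threshold and salt count are arbitrary illustrations):

```python
import random

def salt_hot_keys(pairs, hot_threshold=1000, salts=8, rng=None):
    """Append a random salt to keys whose frequency exceeds hot_threshold.

    A hot key is spread across `salts` synthetic keys; the other side of
    the join must then be replicated across the same salt range. Cold
    keys get a fixed salt of 0 so they stay on one partition.
    """
    rng = rng or random.Random(0)
    counts = {}
    for key, _ in pairs:
        counts[key] = counts.get(key, 0) + 1
    salted = []
    for key, value in pairs:
        salt = rng.randrange(salts) if counts[key] > hot_threshold else 0
        salted.append(((key, salt), value))
    return salted

pairs = [("hot", i) for i in range(2000)] + [("cold", 0)]
salted = salt_hot_keys(pairs, hot_threshold=1000, salts=4)
```

The trade-off is extra replication on the small side of the join, which is why salting is reserved for demonstrably hot keys rather than applied everywhere.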