DE-PRO Cheatsheet — DLT, Streaming, Performance & Reliability on Databricks

Last-mile DE-PRO review: incremental pipelines, Structured Streaming (watermarks/checkpoints), Delta Live Tables concepts, performance tuning pickers (shuffle/skew/file layout), and production troubleshooting heuristics.

Use this for last‑mile review. Pair it with the Syllabus for coverage and Practice to harden production instincts.


1) Production pipeline lens (what DE-PRO is really testing)

If two answers “work,” choose the one that is:

  • Recoverable: checkpoints/state, idempotent writes, safe retries
  • Observable: clear metrics/logs, quality gates, lineage
  • Low blast radius: staged changes, small reversible steps

2) Incremental batch + CDC (the safest default patterns)

Upsert with MERGE (CDC)

-- Clause order matters: WHEN MATCHED clauses are evaluated top-down
MERGE INTO silver t
USING cdc s
ON t.id = s.id
WHEN MATCHED AND s.op = 'D' THEN DELETE    -- apply CDC deletes first
WHEN MATCHED THEN UPDATE SET *             -- then updates with latest values
WHEN NOT MATCHED THEN INSERT *;            -- new keys

Rules of thumb

  • Deduplicate the source so each target row matches at most one source row; Delta MERGE errors out otherwise.
  • Make pipelines idempotent: re-running the same input must not double-apply changes (see the dedupe sketch below).
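
A minimal PySpark sketch of that dedupe step, assuming a cdc_df DataFrame with an id key and a monotonically increasing seq change-sequence column (both names are illustrative):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Keep only the latest change per key so each target row matches at most one source row
w = Window.partitionBy("id").orderBy(F.col("seq").desc())
latest_cdc = (cdc_df
    .withColumn("rn", F.row_number().over(w))
    .filter("rn = 1")
    .drop("rn"))
latest_cdc.createOrReplaceTempView("cdc")  # becomes the source for the MERGE above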

3) Structured Streaming essentials (checkpointing, watermarks, late data)

The two most-tested concepts

Concept    | Why it matters                                 | Failure mode
Checkpoint | enables exactly-once-style recovery for sinks  | deleting/moving the checkpoint breaks correctness
Watermark  | bounds state and handles late data             | missing watermark → unbounded state

Streaming write (conceptual template)

(df
  .withWatermark("event_time", "10 minutes")     # bound state; tolerate 10 min of lateness
  .writeStream
  .format("delta")
  .option("checkpointLocation", "/chk/orders")   # never delete or move this between runs
  .outputMode("append")
  .start("/delta/silver/orders"))

Exam cues

  • Late-data policy changes outcomes: append mode with a watermark drops too-late rows, while update mode keeps revising aggregates until the watermark closes the window.
  • Triggers control latency/cost; stateful ops need careful tuning (see the sketch below).
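
A sketch combining watermark, output mode, and trigger, assuming a streaming events_df with an event_time timestamp column (names are illustrative). In append mode each window is emitted exactly once, after the watermark passes its end:

from pyspark.sql import functions as F

counts = (events_df
    .withWatermark("event_time", "10 minutes")                # rows later than 10 min are dropped
    .groupBy(F.window("event_time", "5 minutes"), "item_id")
    .count())

(counts.writeStream
    .format("delta")
    .option("checkpointLocation", "/chk/item_counts")
    .outputMode("append")                                     # emit each window once, when the watermark closes it
    .trigger(processingTime="1 minute")                       # the latency/cost knob
    .start("/delta/gold/item_counts"))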

4) Delta Live Tables (DLT) — what to remember

DLT is about declarative pipelines with built-in operational structure:

  • pipeline graph (table dependencies)
  • quality expectations
  • managed execution/monitoring
    Typical graph: Bronze (ingest) → Silver (clean + dedupe) → Gold (metrics)

Operator mindset: treat expectations as guardrails; don’t silently pass bad data downstream.
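
A minimal DLT Python sketch of that structure (it only runs inside a DLT pipeline; paths, table names, and columns are illustrative assumptions):

import dlt

@dlt.table(comment="Bronze: raw orders ingest")
def bronze_orders():
    # Auto Loader incremental ingest from cloud storage
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/raw/orders"))

@dlt.table(comment="Silver: clean + dedupe")
@dlt.expect_or_drop("valid_id", "order_id IS NOT NULL")  # guardrail: drop bad rows, keep metrics
def silver_orders():
    return (dlt.read_stream("bronze_orders")
            .withWatermark("event_time", "1 hour")
            .dropDuplicates(["order_id", "event_time"]))

expect (warn and keep), expect_or_drop (drop rows), and expect_or_fail (stop the update) are the three expectation severities; all of them record violation metrics.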


5) Performance pickers (shuffle, skew, file layout)

Shuffle vs skew diagnosis

Symptom                 | Likely cause  | Safe next step
Slow joins/aggregations | heavy shuffle | reduce data early; pick join strategy; tune partitions
One task runs forever   | data skew     | handle hot keys; split/skew hints (concept-level)
Lots of tiny files      | write pattern | compaction/OPTIMIZE (concept-level)
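
Concept-level knobs for the shuffle and skew rows as a PySpark sketch, assuming a large fact_df joined to a small dim_df (illustrative names); AQE's skew handling plus an explicit broadcast are the usual first moves:

from pyspark.sql.functions import broadcast

# Let AQE coalesce shuffle partitions and split skewed partitions at runtime
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Small side fits in executor memory → broadcast it and skip the shuffle entirely
joined = fact_df.join(broadcast(dim_df), "customer_id")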

File layout rules of thumb

  • Don’t over-partition (small file explosion).
  • Compact when needed; keep partition columns low/medium cardinality.
  • Use Z-order/data skipping where supported (concept-level).
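
A concept-level compaction sketch (Databricks SQL issued from Python; the table and column names are assumptions):

# Compact small files, then cluster data by a high-selectivity filter column
spark.sql("OPTIMIZE silver.orders ZORDER BY (customer_id)")

Z-order the columns you filter on, not the partition columns.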

6) Reliability + troubleshooting quick pickers

  • Streaming duplicates: checkpoint misuse, a non-idempotent sink, or the wrong output mode for stateful ops (idempotent-sink sketch below).
  • State grows forever: missing/incorrect watermark; unbounded aggregation.
  • MERGE “explodes” rows: source not unique on merge keys.
  • Reprocessing/backfill: prefer explicit versioning and safe re-runs over ad-hoc manual deletes.
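
One pattern covering the duplicates and reprocessing bullets: an idempotent sink via foreachBatch + MERGE, so a replayed micro-batch upserts instead of appending twice (a sketch; the path and id key are illustrative, and the target table is assumed to exist):

from delta.tables import DeltaTable

def upsert_batch(batch_df, batch_id):
    target = DeltaTable.forPath(spark, "/delta/silver/orders")  # assumes the target exists
    (target.alias("t")
        .merge(batch_df.dropDuplicates(["id"]).alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

(df.writeStream
    .foreachBatch(upsert_batch)          # micro-batch replays become no-op upserts
    .option("checkpointLocation", "/chk/orders_upsert")
    .start())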