DE-PRO Cheatsheet — DLT, Streaming, Performance & Reliability on Databricks

Last-mile DE-PRO review: incremental pipelines, Structured Streaming (watermarks/checkpoints), Delta Live Tables concepts, performance tuning pickers (shuffle/skew/file layout), and production troubleshooting heuristics.

Use this for last‑mile review. Pair it with the Syllabus for coverage and Practice to harden production instincts.


1) Production pipeline lens (what DE-PRO is really testing)

If two answers “work,” choose the one that is:

  • Recoverable: checkpoints/state, idempotent writes, safe retries
  • Observable: clear metrics/logs, quality gates, lineage
  • Low blast radius: staged changes, small reversible steps

2) Incremental batch + CDC (the safest default patterns)

Upsert with MERGE (CDC)

-- Clause order matters: WHEN MATCHED clauses are evaluated top-down
MERGE INTO silver t
USING cdc s
ON t.id = s.id
WHEN MATCHED AND s.op = 'D' THEN DELETE    -- apply CDC deletes first
WHEN MATCHED THEN UPDATE SET *             -- then updates with latest values
WHEN NOT MATCHED THEN INSERT *;            -- new keys

Rules of thumb

  • Deduplicate the source so each target row matches at most one source row; Delta MERGE errors out otherwise.
  • Make pipelines idempotent: re-running the same input must not double-apply changes (see the dedupe sketch below).
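
A minimal PySpark sketch of that dedupe step, assuming a cdc_df DataFrame with an id key and a monotonically increasing seq change-sequence column (both names are illustrative):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Keep only the latest change per key so each target row matches at most one source row
w = Window.partitionBy("id").orderBy(F.col("seq").desc())
latest_cdc = (cdc_df
    .withColumn("rn", F.row_number().over(w))
    .filter("rn = 1")
    .drop("rn"))
latest_cdc.createOrReplaceTempView("cdc")  # becomes the source for the MERGE above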

3) Structured Streaming essentials (checkpointing, watermarks, late data)

The two most-tested concepts

Concept    | Why it matters                                 | Failure mode
Checkpoint | enables exactly-once-style recovery for sinks  | deleting/moving the checkpoint breaks correctness
Watermark  | bounds state and handles late data             | missing watermark → unbounded state

Streaming write (conceptual template)

(df
  .withWatermark("event_time", "10 minutes")     # bound state; tolerate 10 min of lateness
  .writeStream
  .format("delta")
  .option("checkpointLocation", "/chk/orders")   # never delete or move this between runs
  .outputMode("append")
  .start("/delta/silver/orders"))

Exam cues

  • Late-data policy changes outcomes: append mode with a watermark drops too-late rows, while update mode keeps revising aggregates until the watermark closes the window.
  • Triggers control latency/cost; stateful ops need careful tuning (see the sketch below).
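
A sketch combining watermark, output mode, and trigger, assuming a streaming events_df with an event_time timestamp column (names are illustrative). In append mode each window is emitted exactly once, after the watermark passes its end:

from pyspark.sql import functions as F

counts = (events_df
    .withWatermark("event_time", "10 minutes")                # rows later than 10 min are dropped
    .groupBy(F.window("event_time", "5 minutes"), "item_id")
    .count())

(counts.writeStream
    .format("delta")
    .option("checkpointLocation", "/chk/item_counts")
    .outputMode("append")                                     # emit each window once, when the watermark closes it
    .trigger(processingTime="1 minute")                       # the latency/cost knob
    .start("/delta/gold/item_counts"))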

4) Delta Live Tables (DLT) — what to remember

DLT is about declarative pipelines with built-in operational structure:

  • pipeline graph (table dependencies)
  • quality expectations
  • managed execution/monitoring
    Typical graph: Bronze (ingest) → Silver (clean + dedupe) → Gold (metrics)

Operator mindset: treat expectations as guardrails; don’t silently pass bad data downstream.
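
A minimal DLT Python sketch of that structure (it only runs inside a DLT pipeline; paths, table names, and columns are illustrative assumptions):

import dlt

@dlt.table(comment="Bronze: raw orders ingest")
def bronze_orders():
    # Auto Loader incremental ingest from cloud storage
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/raw/orders"))

@dlt.table(comment="Silver: clean + dedupe")
@dlt.expect_or_drop("valid_id", "order_id IS NOT NULL")  # guardrail: drop bad rows, keep metrics
def silver_orders():
    return (dlt.read_stream("bronze_orders")
            .withWatermark("event_time", "1 hour")
            .dropDuplicates(["order_id", "event_time"]))

expect (warn and keep), expect_or_drop (drop rows), and expect_or_fail (stop the update) are the three expectation severities; all of them record violation metrics.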


5) Performance pickers (shuffle, skew, file layout)

Shuffle vs skew diagnosis

Symptom                 | Likely cause  | Safe next step
Slow joins/aggregations | heavy shuffle | reduce data early; pick join strategy; tune partitions
One task runs forever   | data skew     | handle hot keys; split/skew hints (concept-level)
Lots of tiny files      | write pattern | compaction/OPTIMIZE (concept-level)
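
Concept-level knobs for the shuffle and skew rows as a PySpark sketch, assuming a large fact_df joined to a small dim_df (illustrative names); AQE's skew handling plus an explicit broadcast are the usual first moves:

from pyspark.sql.functions import broadcast

# Let AQE coalesce shuffle partitions and split skewed partitions at runtime
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Small side fits in executor memory → broadcast it and skip the shuffle entirely
joined = fact_df.join(broadcast(dim_df), "customer_id")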

File layout rules of thumb

  • Don’t over-partition (small file explosion).
  • Compact when needed; keep partition columns low/medium cardinality.
  • Use Z-order/data skipping where supported (concept-level).
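
A concept-level compaction sketch (Databricks SQL issued from Python; the table and column names are assumptions):

# Compact small files, then cluster data by a high-selectivity filter column
spark.sql("OPTIMIZE silver.orders ZORDER BY (customer_id)")

Z-order the columns you filter on, not the partition columns.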

6) Reliability + troubleshooting quick pickers

  • Streaming duplicates: checkpoint misuse, a non-idempotent sink, or the wrong output mode for stateful ops (idempotent-sink sketch below).
  • State grows forever: missing/incorrect watermark; unbounded aggregation.
  • MERGE “explodes” rows: source not unique on merge keys.
  • Reprocessing/backfill: prefer explicit versioning and safe re-runs over ad-hoc manual deletes.
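
One pattern covering the duplicates and reprocessing bullets: an idempotent sink via foreachBatch + MERGE, so a replayed micro-batch upserts instead of appending twice (a sketch; the path and id key are illustrative, and the target table is assumed to exist):

from delta.tables import DeltaTable

def upsert_batch(batch_df, batch_id):
    target = DeltaTable.forPath(spark, "/delta/silver/orders")  # assumes the target exists
    (target.alias("t")
        .merge(batch_df.dropDuplicates(["id"]).alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

(df.writeStream
    .foreachBatch(upsert_batch)          # micro-batch replays become no-op upserts
    .option("checkpointLocation", "/chk/orders_upsert")
    .start())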