Use this syllabus as your source of truth for DE‑PRO. Work topic by topic, and drill practice questions after each section.
What’s covered
Topic 1: Incremental Batch Pipelines & CDC
Practice this topic →
1.1 Idempotent batch design and backfills
- Define idempotency for pipelines and identify patterns that make re-runs safe.
- Design backfill strategies that minimize blast radius and avoid corrupting curated tables.
- Choose append vs overwrite vs partition overwrite based on data lifecycle and recovery requirements (see the sketch after this list).
- Explain how watermarking/processing windows support incremental batch loads.
- Recognize when full refresh is appropriate vs unnecessary and expensive.
- Given a scenario, choose the safest remediation path after a partial failure (re-run, backfill, or rollback).
- Describe how to validate correctness after backfill (row counts, reconciliation, constraints).
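For example, a date-range backfill can be made idempotent with a partition overwrite plus a reconciliation check. A minimal PySpark sketch, assuming a Delta target partitioned by `event_date` and the ambient `spark` session of a Databricks job; the table names and date window are hypothetical:

```python
from pyspark.sql import functions as F

# Hypothetical tables: "raw.events" (source) and "curated.events_daily" (target),
# partitioned by event_date.
start, end = "2024-06-01", "2024-06-07"

backfill_df = spark.table("raw.events").where(F.col("event_date").between(start, end))

(
    backfill_df.write.format("delta")
    .mode("overwrite")
    # replaceWhere limits the overwrite to the targeted partitions, so re-runs are safe.
    .option("replaceWhere", f"event_date >= '{start}' AND event_date <= '{end}'")
    .saveAsTable("curated.events_daily")
)

# Post-backfill validation: reconcile row counts for the backfilled window.
src_count = backfill_df.count()
tgt_count = (
    spark.table("curated.events_daily")
    .where(F.col("event_date").between(start, end))
    .count()
)
assert src_count == tgt_count, f"reconciliation failed: {src_count} vs {tgt_count}"
```

Because only the targeted partitions are replaced, re-running the same window cannot append duplicates or disturb the rest of the table.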
1.2 CDC upserts with MERGE (correctness and pitfalls)
- Select MERGE as the correct operation for CDC/upsert workloads and describe its purpose at a high level.
- Identify why merge keys must uniquely identify rows on the source side; duplicate matches can fail the merge or multiply rows (sketched below).
- Design delete/update/insert logic for CDC patterns (tombstones, soft deletes) conceptually.
- Explain how late-arriving changes affect CDC pipelines and how to handle reprocessing.
- Recognize performance implications of frequent small merges and when compaction helps (concept-level).
- Given a scenario, choose an upsert strategy that balances correctness, performance, and recoverability.
- Describe how to test CDC pipelines with representative edge cases (out-of-order updates, duplicates).
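A minimal sketch of a MERGE-based CDC upsert, assuming a change feed with `customer_id`, `change_ts`, and an `op` flag ('I'/'U'/'D'); all table and column names are hypothetical. Deduplicating the source to the latest change per key is what keeps the merge keys unique:

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F, Window

# Keep only the most recent change per key so each target row matches at most one source row.
latest = Window.partitionBy("customer_id").orderBy(F.col("change_ts").desc())
changes = (
    spark.table("bronze.customers_cdc")
    .withColumn("rn", F.row_number().over(latest))
    .where("rn = 1")
    .drop("rn")
)

(
    DeltaTable.forName(spark, "silver.customers").alias("t")
    .merge(changes.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedDelete(condition="s.op = 'D'")            # tombstones / hard deletes
    .whenMatchedUpdateAll(condition="s.op <> 'D'")
    .whenNotMatchedInsertAll(condition="s.op <> 'D'")
    .execute()
)
```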
1.3 Multi-hop architecture and data quality gates
- Explain Bronze/Silver/Gold as a reliability pattern and map quality checks to the correct layer.
- Define data quality expectations and identify where to enforce them to prevent downstream corruption.
- Design a quarantine pattern for bad records that preserves observability and enables remediation (example after this list).
- Explain why stable business definitions belong in curated layers and should be versioned/managed.
- Given a scenario, choose the right place to deduplicate and enforce constraints.
- Describe how lineage and documentation reduce operational confusion in shared pipelines.
- Recognize the trade-off between strictness (fail fast) and availability (tolerate and quarantine).
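One way to implement a quarantine gate between Bronze and Silver, sketched with hypothetical table names and rules; bad rows are preserved with a reason instead of being silently dropped or failing the whole load:

```python
from pyspark.sql import functions as F

raw = spark.table("bronze.orders")

# Coalesce to False so rows with NULLs in the checked columns land in quarantine.
is_valid = F.coalesce(F.col("order_id").isNotNull() & (F.col("amount") >= 0), F.lit(False))
flagged = raw.withColumn("_is_valid", is_valid)

valid = flagged.where("_is_valid").drop("_is_valid")
invalid = (
    flagged.where("NOT _is_valid").drop("_is_valid")
    .withColumn(
        "quarantine_reason",
        F.when(F.col("order_id").isNull(), F.lit("missing order_id"))
         .otherwise(F.lit("invalid amount")),
    )
)

valid.write.format("delta").mode("append").saveAsTable("silver.orders")
invalid.write.format("delta").mode("append").saveAsTable("silver.orders_quarantine")
```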
Topic 2: Structured Streaming on Databricks
Practice this topic →
2.1 Core streaming mechanics (state, triggers, checkpointing)
- Explain the purpose of checkpointing and how it enables reliable streaming recovery.
- Differentiate trigger types conceptually (micro-batch vs continuous triggers where applicable) and their latency/cost implications.
- Describe stateful operations (aggregations, joins) and why state size must be bounded.
- Explain why deleting or moving checkpoints can break correctness and lead to duplicates or data loss.
- Recognize the role of output modes (append/update/complete) and which are valid for stateful operations.
- Given a scenario, choose the correct checkpointing strategy and storage location for recovery (see the sketch below).
- Describe safe restart behavior and what to verify after restart (throughput, lag, correctness).
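A minimal Structured Streaming sketch showing the checkpoint location and trigger choice; the table names and checkpoint path are hypothetical:

```python
# Incremental copy from a bronze table to a silver table. The checkpoint tracks
# progress, so a restart resumes exactly where the stream left off; deleting or
# moving it forfeits that guarantee.
query = (
    spark.readStream.table("bronze.events")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/checkpoints/silver_events")  # stable, dedicated path per sink
    .outputMode("append")
    .trigger(availableNow=True)  # drain available data, then stop; use processingTime="1 minute" for always-on
    .toTable("silver.events")
)
query.awaitTermination()
```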
2.2 Late data and watermarks
- Explain watermarks conceptually and how they bound state for event-time processing.
- Differentiate event time vs processing time and why it matters for streaming correctness.
- Describe how late data affects aggregations and what trade-offs exist (drop late vs update aggregates).
- Recognize scenarios where missing watermarks can create unbounded state and operational instability.
- Given a scenario, choose an appropriate watermark duration based on data delay characteristics and business requirements (illustrated below).
- Explain how to validate late-data handling using test cases and monitoring.
- Describe why watermark changes can change results and should be treated as a controlled change.
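A watermarked, windowed aggregation sketch; the 30-minute watermark and 10-minute window are illustrative values that would be tuned to the observed lateness of the data, and the table names are hypothetical:

```python
from pyspark.sql import functions as F

clicks = spark.readStream.table("bronze.clicks")  # hypothetical stream with an event_time column

windowed = (
    clicks
    # The watermark bounds state: windows older than (max event_time - 30 min)
    # are finalized, and rows arriving later than that are dropped.
    .withWatermark("event_time", "30 minutes")
    .groupBy(F.window("event_time", "10 minutes"), "page")
    .agg(F.count(F.lit(1)).alias("click_count"))
)

(
    windowed.writeStream
    .format("delta")
    .option("checkpointLocation", "/checkpoints/clicks_by_window")
    .outputMode("append")  # append emits each window once, after the watermark passes it
    .toTable("gold.clicks_by_window")
)
```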
2.3 Streaming sinks and idempotent writes
- Describe common sinks conceptually (Delta, message queues, external stores) and their correctness trade-offs.
- Explain why idempotent writes matter for streaming and how retries can create duplicates without safeguards.
- Recognize the relationship between checkpointing and exactly-once style behavior for certain sinks (concept-level).
- Identify when a foreachBatch pattern is appropriate for integrating with external systems (concept-level; sketched after this list).
- Given a scenario, choose a write strategy that is recoverable and observable.
- Explain what to monitor for streaming pipelines (throughput, processing latency, state size, errors).
- Describe how to handle poison messages without stalling an entire stream (quarantine patterns).
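A `foreachBatch` sketch that keeps writes idempotent by merging each micro-batch on a key, so a replayed batch after a failure updates the same rows instead of appending duplicates; table and column names are hypothetical:

```python
from delta.tables import DeltaTable

def upsert_batch(batch_df, batch_id):
    # Deduplicate within the micro-batch, then MERGE so retries are idempotent.
    deduped = batch_df.dropDuplicates(["order_id"])
    (
        DeltaTable.forName(spark, "silver.orders").alias("t")
        .merge(deduped.alias("s"), "t.order_id = s.order_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )

(
    spark.readStream.table("bronze.orders_updates")
    .writeStream
    .foreachBatch(upsert_batch)
    .option("checkpointLocation", "/checkpoints/orders_upsert")
    .start()
)
```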
Topic 3: Delta Live Tables (DLT) Pipelines
Practice this topic →
3.1 Declarative pipeline structure and dependencies
- Explain DLT’s value proposition at a high level (declarative pipelines with built-in ops).
- Describe how tables/views in a pipeline form a dependency graph and why ordering matters.
- Differentiate between batch and streaming DLT pipelines conceptually and when each is used.
- Recognize how pipeline changes should be staged to reduce risk (dev → staging → prod).
- Given a scenario, choose a DLT structure that separates raw ingest from curated outputs (see the sketch below).
- Explain why pipeline graphs improve observability and troubleshooting.
- Describe basic pipeline parameterization and environment separation patterns (concept-level).
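A minimal DLT sketch of a two-hop dependency, with a streaming bronze ingest feeding a curated silver table; the Auto Loader path and table names are hypothetical, and DLT infers the ordering from `dlt.read_stream()`:

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw orders, ingested as-is")
def orders_bronze():
    return (
        spark.readStream.format("cloudFiles")        # Auto Loader
        .option("cloudFiles.format", "json")
        .load("/landing/orders")                     # hypothetical landing path
    )

@dlt.table(comment="Cleaned, typed orders")
def orders_silver():
    return (
        dlt.read_stream("orders_bronze")             # declares the bronze -> silver dependency
        .where(F.col("order_id").isNotNull())
        .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
    )
```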
3.2 Expectations (data quality) and outcomes
- Define DLT expectations conceptually and explain how they enforce data quality rules.
- Differentiate fail-fast vs drop/quarantine behavior and how each affects downstream correctness (illustrated after this list).
- Design expectations for common quality constraints (not null, ranges, uniqueness) conceptually.
- Recognize why quality gates should be placed early in curated layers to prevent bad data propagation.
- Given a scenario, choose the right expectation behavior based on business tolerance and reliability goals.
- Explain why quality metrics should be monitored as first-class signals.
- Describe how changes to expectations should be managed to avoid surprising downstream consumers.
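A sketch contrasting the two expectation outcomes, drop versus fail-fast; the rules and table names are hypothetical:

```python
import dlt

@dlt.table(comment="Orders that passed the quality gates")
@dlt.expect_or_drop("non_negative_amount", "amount >= 0")        # tolerable: drop the row, record the metric
@dlt.expect_or_fail("order_id_present", "order_id IS NOT NULL")  # critical: fail the update fast
def orders_validated():
    return dlt.read("orders_silver")
```

Dropped rows still show up in the pipeline's quality metrics, which is why those metrics should be watched as first-class signals rather than ignored.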
3.3 DLT operations and troubleshooting
- Identify common DLT failure categories (upstream schema drift, permissions, data quality failures).
- Describe safe recovery strategies (rerun, backfill, rollback) for failed pipeline updates.
- Explain why lineage and pipeline graphs help isolate the root cause quickly.
- Recognize operational best practices: small changes, validation runs, and clear ownership.
- Given a scenario, choose the least risky corrective action that restores correctness.
- Explain how to monitor pipeline health and data freshness for SLA-style requirements.
- Describe why cost and performance should be tracked over time to prevent regressions.
Topic 4: Spark Performance & Table Optimization
Practice this topic →
4.1 Shuffle, skew, and join strategy
- Identify operations that cause shuffles and explain why shuffles are expensive (network + disk + coordination).
- Diagnose data skew symptoms (a few straggler tasks run far longer than the rest) and choose mitigation approaches conceptually.
- Explain broadcast joins conceptually and identify when broadcasting a small dimension is appropriate (sketched below).
- Describe why filtering early and selecting only needed columns reduces shuffle size and cost.
- Given a scenario, choose whether to tune code, data layout, or cluster sizing for best ROI.
- Explain adaptive execution at a high level and why it can improve query plans.
- Describe how to validate performance changes without breaking correctness (A/B timing + row checks).
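A sketch of early filtering, column pruning, and a broadcast join against a small dimension; table and column names are hypothetical:

```python
from pyspark.sql import functions as F
from pyspark.sql.functions import broadcast

facts = (
    spark.table("silver.sales")
    .where(F.col("sale_date") >= "2024-01-01")   # filter early to shrink shuffle input
    .select("sale_id", "store_id", "amount")     # keep only the columns the query needs
)
stores = spark.table("silver.stores").select("store_id", "region")  # small dimension

by_region = (
    facts.join(broadcast(stores), "store_id")    # broadcast the small side instead of shuffling the fact table
    .groupBy("region")
    .agg(F.sum("amount").alias("total_amount"))
)
```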
4.2 File layout, compaction, and table maintenance
- Recognize the small-file problem and why compaction can improve performance.
- Choose partition columns that enable pruning while avoiding high-cardinality partition explosions.
- Explain Z-order/data clustering at a conceptual level and why it helps common filter patterns.
- Describe maintenance operations at a high level (optimize/compaction, vacuum) and the risks of aggressive cleanup (see the example after this list).
- Given a scenario, choose whether to adjust write patterns or run maintenance to address performance issues.
- Explain why retention and cleanup settings interact with auditability and the ability to roll back via time travel.
- Describe why table maintenance should be scheduled and monitored like a production job.
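A sketch of a scheduled maintenance job; the table, Z-order column, and retention window are hypothetical and should be set from your own recovery and audit requirements:

```python
# Compact small files and cluster the table by a common filter column.
spark.sql("OPTIMIZE silver.events ZORDER BY (customer_id)")

# Remove unreferenced files older than 7 days. Shortening retention saves storage
# but also limits how far back time travel and rollback can reach.
spark.sql("VACUUM silver.events RETAIN 168 HOURS")
```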
4.3 Cluster sizing and workload isolation (concept-level)
- Differentiate scaling up vs scaling out and identify when each helps Spark workloads.
- Explain why concurrent workloads may need isolation to prevent noisy-neighbor performance degradation.
- Recognize when caching is beneficial vs when it wastes memory in batch pipelines (sketched below).
- Given a scenario, choose a safe scaling approach that reduces cost while meeting SLAs.
- Describe why autoscaling can help bursty workloads but needs observability and limits.
- Explain why driver/executor resource imbalance can cause job instability (concept-level).
- Identify safe operational practices: staged scaling changes and monitoring after changes.
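A sketch of caching an intermediate result that several outputs reuse, then releasing it; table names are hypothetical. If the result were used only once, the cache would just spend memory for nothing:

```python
# Hypothetical enrichment reused by two downstream aggregates.
enriched = spark.table("silver.sales").join(spark.table("silver.stores"), "store_id")
enriched.cache()

daily = enriched.groupBy("sale_date").count()
by_region = enriched.groupBy("region").count()

daily.write.format("delta").mode("overwrite").saveAsTable("gold.sales_daily_counts")
by_region.write.format("delta").mode("overwrite").saveAsTable("gold.sales_region_counts")

enriched.unpersist()  # return the memory once the reuse is done
```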
Topic 5: Reliability, Observability & Governance
Practice this topic →
5.1 Operational monitoring and incident response
- Identify the minimal monitoring set for pipelines (freshness, row counts, error rate, latency); a sketch follows this list.
- Explain why structured logs and metrics reduce MTTR during incident response.
- Describe a safe incident triage sequence: scope impact → validate inputs → isolate failing stage → remediate.
- Recognize common failure modes: schema drift, data quality violations, permission changes, and upstream outages.
- Given a scenario, choose the least risky corrective action that restores correctness.
- Explain why runbooks and ownership reduce repeated operational mistakes.
- Describe how to validate recovery (correctness checks, lag/backlog drain, SLA restoration).
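A minimal freshness-and-volume check that could run as a monitoring job; the table, `ingest_ts` column, and 6-hour threshold are hypothetical, and the comparison assumes `ingest_ts` and the driver clock share a time zone:

```python
from datetime import datetime, timedelta
from pyspark.sql import functions as F

row = (
    spark.table("gold.daily_revenue")
    .agg(F.count(F.lit(1)).alias("row_count"), F.max("ingest_ts").alias("last_load"))
    .first()
)

if row["row_count"] == 0 or row["last_load"] is None:
    raise RuntimeError("gold.daily_revenue is empty or has never been loaded")
if datetime.now() - row["last_load"] > timedelta(hours=6):
    raise RuntimeError(f"gold.daily_revenue is stale; last load at {row['last_load']}")
```

In production the failure would page an owner or open an incident rather than just raise, but the signals (freshness and volume) are the same.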
5.2 Security and governance awareness (production hygiene)
- Explain why least privilege and clear ownership are essential for shared production tables.
- Differentiate read vs write permissions and identify why write access should be tightly controlled.
- Recognize that shared curated tables require schema change governance to prevent breaking consumers.
- Describe the purpose of environment separation (dev/test/prod) and why it reduces accidental production impact.
- Given a scenario, choose a safer sharing strategy (views, controlled access) instead of copying data (example below).
- Explain why auditability matters for regulated datasets (lineage, history, approvals).
- Identify the operational risks of destructive operations and why guardrails are needed.
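A sketch of sharing through a governed view instead of copying data, assuming Unity Catalog-style GRANT syntax; the catalog, schema, view, and group names are hypothetical:

```python
# Expose only the aggregated columns consumers need, then grant read access to a group.
spark.sql("""
    CREATE OR REPLACE VIEW main.reporting.orders_summary_v AS
    SELECT order_date, region, SUM(amount) AS total_amount
    FROM main.silver.orders
    GROUP BY order_date, region
""")
spark.sql("GRANT SELECT ON VIEW main.reporting.orders_summary_v TO `data_analysts`")
```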
5.3 Change management for production pipelines
- Describe why small, reversible changes reduce blast radius in production.
- Explain a staged release approach for pipelines (test in lower envs, promote with verification).
- Recognize when to pause changes due to unstable platform signals (ongoing incidents, high backlog).
- Given a scenario, choose a rollback strategy (time travel, redeploy previous version, re-run with known inputs); see the sketch after this list.
- Explain why performance changes must be validated to avoid hidden regressions.
- Describe how to document pipeline contracts (schemas, SLAs, ownership) for stable operations.
- Identify anti-patterns: manual hotfixes without tracking, silent data drops, and unbounded retries.
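A sketch of a Delta time-travel rollback; the table name and version number are illustrative, and the right version comes from inspecting the history first:

```python
# Find the last known-good version, then restore to it.
spark.sql("DESCRIBE HISTORY silver.orders") \
    .select("version", "timestamp", "operation") \
    .show(10, truncate=False)

spark.sql("RESTORE TABLE silver.orders TO VERSION AS OF 42")  # illustrative version number
```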
Tip: After finishing a topic, take a 15–25 question drill focused on that area, then revisit weak objectives before moving on.