DE-PRO Syllabus — Learning Objectives by Topic

Blueprint-aligned learning objectives for Databricks Data Engineer Professional (DE-PRO), organized by topic with quick links to targeted practice.

Use this syllabus as your source of truth for DE-PRO. Work through it topic by topic, and take a question drill after each section.

What’s covered

Topic 1: Incremental Batch Pipelines & CDC

Practice this topic →

1.1 Idempotent batch design and backfills

  • Define idempotency for pipelines and identify patterns that make re-runs safe.
  • Design backfill strategies that minimize blast radius and avoid corrupting curated tables.
  • Choose append vs overwrite vs partition overwrite based on data lifecycle and recovery requirements (see the sketch after this list).
  • Explain how watermarking/processing windows support incremental batch loads.
  • Recognize when a full refresh is appropriate and when it is unnecessary and expensive.
  • Given a scenario, choose the safest remediation path after a partial failure (re-run, backfill, or rollback).
  • Describe how to validate correctness after backfill (row counts, reconciliation, constraints).
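
Backfills are safest when the re-run rewrites exactly the slice being repaired. Below is a minimal sketch of an idempotent partition backfill using Delta's replaceWhere overwrite, assuming a Databricks notebook where spark is predefined; the table names (sales_raw, sales_curated), key (order_id), and partition column (event_date) are hypothetical.

    from pyspark.sql import functions as F

    backfill_date = "2024-03-01"  # the partition being reprocessed

    daily = (
        spark.table("sales_raw")
        .where(F.col("event_date") == backfill_date)
        .dropDuplicates(["order_id"])  # makes a re-run safe against duplicate source rows
    )

    (
        daily.write.format("delta")
        .mode("overwrite")
        .option("replaceWhere", f"event_date = '{backfill_date}'")  # rewrite only this partition
        .saveAsTable("sales_curated")
    )

    # Validate after the backfill: the curated partition should reconcile with the deduplicated source.
    src = spark.table("sales_raw").where(F.col("event_date") == backfill_date).dropDuplicates(["order_id"]).count()
    dst = spark.table("sales_curated").where(F.col("event_date") == backfill_date).count()
    assert src == dst, f"Backfill mismatch for {backfill_date}: {src} source rows vs {dst} curated rows"

Because the write touches only one partition and the re-run produces the same result, repeating it after a partial failure does not corrupt the rest of the table.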

1.2 CDC upserts with MERGE (correctness and pitfalls)

  • Select MERGE as the correct operation for CDC/upsert workloads and describe its purpose at a high level (sketched after this list).
  • Identify why merge keys must be unique on the source side to prevent row multiplication.
  • Design delete/update/insert logic for CDC patterns (tombstones, soft deletes) conceptually.
  • Explain how late-arriving changes affect CDC pipelines and how to handle reprocessing.
  • Recognize performance implications of frequent small merges and when compaction helps (concept-level).
  • Given a scenario, choose an upsert strategy that balances correctness, performance, and recoverability.
  • Describe how to test CDC pipelines with representative edge cases (out-of-order updates, duplicates).
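
The sketch below shows the MERGE-based upsert shape referenced above, assuming a Databricks notebook and hypothetical tables customers (target) and customers_cdc (source) with columns customer_id, change_ts, and op; it deduplicates the source on the merge key first, since duplicate keys either fail the MERGE or multiply rows.

    from delta.tables import DeltaTable
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # Keep only the latest change per key: duplicate merge keys on the source side break MERGE semantics.
    latest = (
        spark.table("customers_cdc")
        .withColumn("rn", F.row_number().over(
            Window.partitionBy("customer_id").orderBy(F.col("change_ts").desc())))
        .where("rn = 1")
        .drop("rn")
    )

    target = DeltaTable.forName(spark, "customers")

    (
        target.alias("t")
        .merge(latest.alias("s"), "t.customer_id = s.customer_id")
        .whenMatchedDelete(condition="s.op = 'DELETE'")        # honor tombstones from the CDC feed
        .whenMatchedUpdateAll(condition="s.op != 'DELETE'")
        .whenNotMatchedInsertAll(condition="s.op != 'DELETE'")
        .execute()
    )

Testing this path with out-of-order and duplicate change events is what surfaces most CDC correctness bugs.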

1.3 Multi-hop architecture and data quality gates

  • Explain Bronze/Silver/Gold as a reliability pattern and map quality checks to the correct layer.
  • Define data quality expectations and identify where to enforce them to prevent downstream corruption.
  • Design a quarantine pattern for bad records that preserves observability and enables remediation (see the sketch after this list).
  • Explain why stable business definitions belong in curated layers and should be versioned/managed.
  • Given a scenario, choose the right place to deduplicate and enforce constraints.
  • Describe how lineage and documentation reduce operational confusion in shared pipelines.
  • Recognize the trade-off between strictness (fail fast) and availability (tolerate and quarantine).
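
As an illustration of the quarantine pattern, here is a minimal Silver-layer split, assuming hypothetical tables orders_bronze, orders_silver, and orders_quarantine and a null-safe validity rule on order_id and amount.

    from pyspark.sql import functions as F

    bronze = spark.table("orders_bronze")

    # Null-safe validity rule so every row lands in exactly one of the two outputs.
    is_valid = F.col("order_id").isNotNull() & (F.coalesce(F.col("amount"), F.lit(-1)) >= 0)

    good = bronze.where(is_valid).dropDuplicates(["order_id"])
    bad = (
        bronze.where(~is_valid)
        .withColumn("quarantined_at", F.current_timestamp())  # keep bad rows observable for remediation
    )

    good.write.format("delta").mode("append").saveAsTable("orders_silver")
    bad.write.format("delta").mode("append").saveAsTable("orders_quarantine")

This favors availability over strictness: the stream of good records keeps flowing while the quarantine table makes the failures visible.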

Topic 2: Structured Streaming on Databricks

Practice this topic →

2.1 Core streaming mechanics (state, triggers, checkpointing)

  • Explain the purpose of checkpointing and how it enables reliable streaming recovery (see the sketch after this list).
  • Differentiate trigger types conceptually (micro-batch vs continuous triggers where applicable) and their latency/cost implications.
  • Describe stateful operations (aggregations, joins) and why state size must be bounded.
  • Explain why deleting or moving checkpoints can break correctness and lead to duplicates or data loss.
  • Recognize the role of output modes (append/update/complete) and which are valid for stateful operations.
  • Given a scenario, choose the correct checkpointing strategy and storage location for recovery.
  • Describe safe restart behavior and what to verify after restart (throughput, lag, correctness).
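
The sketch below ties these mechanics together: an incremental read, an explicit trigger, and a durable checkpoint location. It assumes a Databricks notebook; the table names and checkpoint path are hypothetical.

    # Read incrementally from a Delta table and write micro-batches to another.
    events = spark.readStream.table("events_bronze")

    query = (
        events.writeStream
        .format("delta")
        .option("checkpointLocation", "/mnt/checkpoints/events_silver")  # recovery state lives here; do not delete or move it casually
        .outputMode("append")
        .trigger(availableNow=True)  # or processingTime="1 minute" for a steady micro-batch cadence
        .toTable("events_silver")
    )

After a restart the query resumes from the checkpoint; verifying lag and output row counts confirms the recovery was clean.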

2.2 Late data and watermarks

  • Explain watermarks conceptually and how they bound state for event-time processing (sketched after this list).
  • Differentiate event time vs processing time and why it matters for streaming correctness.
  • Describe how late data affects aggregations and what trade-offs exist (drop late vs update aggregates).
  • Recognize scenarios where missing watermarks can create unbounded state and operational instability.
  • Given a scenario, choose an appropriate watermark duration based on data delay characteristics and business requirements.
  • Explain how to validate late-data handling using test cases and monitoring.
  • Describe why watermark changes can change results and should be treated as a controlled change.
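
A minimal sketch of an event-time aggregation bounded by a watermark; the 30-minute tolerance, table names, and columns (event_time, page) are hypothetical.

    from pyspark.sql import functions as F

    clicks = spark.readStream.table("clicks_bronze")

    hourly = (
        clicks
        .withWatermark("event_time", "30 minutes")   # tolerate up to 30 minutes of lateness, then finalize state
        .groupBy(F.window("event_time", "1 hour"), "page")
        .count()
    )

    (
        hourly.writeStream
        .outputMode("append")  # append emits a window only after the watermark passes its end
        .option("checkpointLocation", "/mnt/checkpoints/clicks_hourly")
        .toTable("clicks_hourly")
    )

Widening or narrowing the watermark changes which late rows are counted, which is why it should be treated as a controlled change.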

2.3 Streaming sinks and idempotent writes

  • Describe common sinks conceptually (Delta, message queues, external stores) and their correctness trade-offs.
  • Explain why idempotent writes matter for streaming and how retries can create duplicates without safeguards.
  • Recognize the relationship between checkpointing and exactly-once style behavior for certain sinks (concept-level).
  • Identify when a foreachBatch pattern is appropriate for integrating with external systems (concept-level; see the sketch after this list).
  • Given a scenario, choose a write strategy that is recoverable and observable.
  • Explain what to monitor for streaming pipelines (throughput, processing latency, state size, errors).
  • Describe how to handle poison messages without stalling an entire stream (quarantine patterns).
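
The foreachBatch pattern mentioned above can be sketched as follows, assuming hypothetical tables readings_bronze (source) and device_readings (target); running MERGE inside the batch function makes retried micro-batches idempotent.

    from delta.tables import DeltaTable

    def upsert_batch(micro_batch_df, batch_id):
        # Re-processing the same micro-batch updates existing rows instead of appending duplicates.
        target = DeltaTable.forName(micro_batch_df.sparkSession, "device_readings")
        (
            target.alias("t")
            .merge(
                micro_batch_df.dropDuplicates(["device_id", "reading_ts"]).alias("s"),
                "t.device_id = s.device_id AND t.reading_ts = s.reading_ts",
            )
            .whenMatchedUpdateAll()
            .whenNotMatchedInsertAll()
            .execute()
        )

    (
        spark.readStream.table("readings_bronze")
        .writeStream
        .foreachBatch(upsert_batch)
        .option("checkpointLocation", "/mnt/checkpoints/device_readings")
        .start()
    )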

Topic 3: Delta Live Tables (DLT) Pipelines

Practice this topic →

3.1 Declarative pipeline structure and dependencies

  • Explain DLT’s value proposition at a high level (declarative pipelines with built-in ops).
  • Describe how tables/views in a pipeline form a dependency graph and why ordering matters (see the sketch after this list).
  • Differentiate between batch and streaming DLT pipelines conceptually and when each is used.
  • Recognize how pipeline changes should be staged to reduce risk (dev → staging → prod).
  • Given a scenario, choose a DLT structure that separates raw ingest from curated outputs.
  • Explain why pipeline graphs improve observability and troubleshooting.
  • Describe basic pipeline parameterization and environment separation patterns (concept-level).
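
A minimal sketch of the dependency-graph idea in a DLT pipeline notebook; the source path and table names are hypothetical, and DLT infers the ordering from the dlt.read_stream() reference.

    import dlt
    from pyspark.sql import functions as F

    @dlt.table(comment="Raw orders ingested incrementally with Auto Loader.")
    def orders_bronze():
        return (
            spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/mnt/landing/orders")
        )

    @dlt.table(comment="Cleaned orders; depends on orders_bronze via dlt.read_stream().")
    def orders_silver():
        return (
            dlt.read_stream("orders_bronze")
            .where(F.col("order_id").isNotNull())
        )

Keeping raw ingest and curated outputs as separate tables in the graph is what makes the pipeline easy to observe and troubleshoot.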

3.2 Expectations (data quality) and outcomes

  • Define DLT expectations conceptually and explain how they enforce data quality rules (sketched after this list).
  • Differentiate fail-fast vs drop/quarantine behavior and how each affects downstream correctness.
  • Design expectations for common quality constraints (not null, ranges, uniqueness) conceptually.
  • Recognize why quality gates should be placed early in curated layers to prevent bad data propagation.
  • Given a scenario, choose the right expectation behavior based on business tolerance and reliability goals.
  • Explain why quality metrics should be monitored as first-class signals.
  • Describe how changes to expectations should be managed to avoid surprising downstream consumers.
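
A minimal sketch of the three expectation behaviors, reusing the hypothetical orders_silver table from the previous sketch; the constraint names and rules are illustrative.

    import dlt

    @dlt.table(comment="Curated orders with quality gates applied before downstream use.")
    @dlt.expect("amount_in_range", "amount BETWEEN 0 AND 100000")   # record violations, keep the rows
    @dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")   # drop violating rows (counted in metrics)
    @dlt.expect_or_fail("valid_currency", "currency IS NOT NULL")   # fail the update on any violation
    def orders_gold():
        return dlt.read("orders_silver")

Which behavior to choose depends on whether the business can tolerate dropped rows or needs the pipeline to stop and be fixed.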

3.3 DLT operations and troubleshooting

  • Identify common DLT failure categories (upstream schema drift, permissions, data quality failures).
  • Describe safe recovery strategies (rerun, backfill, rollback) for failed pipeline updates.
  • Explain why lineage and pipeline graphs help isolate the root cause quickly.
  • Recognize operational best practices: small changes, validation runs, and clear ownership.
  • Given a scenario, choose the least risky corrective action that restores correctness.
  • Explain how to monitor pipeline health and data freshness for SLA-style requirements.
  • Describe why cost and performance should be tracked over time to prevent regressions.

Topic 4: Performance Tuning & Optimization

Practice this topic →

4.1 Shuffle, skew, and join strategy

  • Identify operations that cause shuffles and explain why shuffles are expensive (network + disk + coordination).
  • Diagnose data skew symptoms (one task straggles) and choose mitigation approaches conceptually.
  • Explain broadcast joins conceptually and identify when broadcasting a small dimension is appropriate (see the sketch after this list).
  • Describe why filtering early and selecting only needed columns reduces shuffle size and cost.
  • Given a scenario, choose whether to tune code, data layout, or cluster sizing for best ROI.
  • Explain adaptive execution at a high level and why it can improve query plans.
  • Describe how to validate performance changes without breaking correctness (A/B timing + row checks).
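
The broadcast-join and early-pruning ideas above can be sketched as follows, assuming hypothetical tables sales_fact (large) and store_dim (small):

    from pyspark.sql import functions as F

    fact = (
        spark.table("sales_fact")
        .where(F.col("sale_date") >= "2024-01-01")  # filter early to shrink what is scanned and shuffled
        .select("store_id", "amount")               # project only the columns the query needs
    )
    dim = spark.table("store_dim").select("store_id", "region")

    # Broadcasting the small dimension avoids shuffling the large fact table for the join.
    joined = fact.join(F.broadcast(dim), "store_id")
    by_region = joined.groupBy("region").agg(F.sum("amount").alias("total_amount"))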

4.2 File layout, compaction, and table maintenance

  • Recognize the small-file problem and why compaction can improve performance.
  • Choose partition columns that enable pruning while avoiding high-cardinality partition explosions.
  • Explain Z-order/data clustering at a conceptual level and why it helps common filter patterns.
  • Describe maintenance operations at a high level (optimize/compaction, vacuum) and the risks of aggressive cleanup (see the sketch after this list).
  • Given a scenario, choose whether to adjust write patterns or run maintenance to address performance issues.
  • Explain why retention and cleanup settings interact with auditability and the ability to roll back.
  • Describe why table maintenance should be scheduled and monitored like a production job.
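
A minimal sketch of routine Delta maintenance run from a scheduled job, assuming a hypothetical events_silver table with user_id as a common filter column:

    # Compact small files and co-locate a frequently filtered column.
    spark.sql("OPTIMIZE events_silver ZORDER BY (user_id)")

    # Remove files outside the retention window; shorter retention erodes time travel, rollback, and audit history.
    spark.sql("VACUUM events_silver RETAIN 168 HOURS")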

4.3 Cluster sizing and workload isolation (concept-level)

  • Differentiate scaling up vs scaling out and identify when each helps Spark workloads.
  • Explain why concurrent workloads may need isolation to prevent noisy-neighbor performance degradation.
  • Recognize when caching is beneficial vs when it wastes memory in batch pipelines.
  • Given a scenario, choose a safe scaling approach that reduces cost while meeting SLAs.
  • Describe why autoscaling can help bursty workloads but needs observability and limits.
  • Explain why driver/executor resource imbalance can cause job instability (concept-level).
  • Identify safe operational practices: staged scaling changes and monitoring after changes.

Topic 5: Reliability, Observability & Governance

Practice this topic →

5.1 Operational monitoring and incident response

  • Identify the minimal monitoring set for pipelines (freshness, row counts, error rate, latency); a sample check is sketched after this list.
  • Explain why structured logs and metrics reduce MTTR during incident response.
  • Describe a safe incident triage sequence: scope impact → validate inputs → isolate failing stage → remediate.
  • Recognize common failure modes: schema drift, data quality violations, permission changes, and upstream outages.
  • Given a scenario, choose the least risky corrective action that restores correctness.
  • Explain why runbooks and ownership reduce repeated operational mistakes.
  • Describe how to validate recovery (correctness checks, lag/backlog drain, SLA restoration).
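
As a concrete example of the minimal monitoring set, here is a sketch of a freshness and volume check that a scheduled job could run; the table, its ingest_ts column, and the one-hour threshold are hypothetical.

    from pyspark.sql import functions as F

    check = (
        spark.table("orders_silver")
        .agg(
            F.count("*").alias("row_count"),
            F.expr("max(ingest_ts) >= current_timestamp() - INTERVAL 1 HOUR").alias("is_fresh"),
        )
        .first()
    )

    # Fail loudly so the job's alerting surfaces the incident instead of letting staleness go unnoticed.
    if check["row_count"] == 0 or not check["is_fresh"]:
        raise RuntimeError(f"Monitoring check failed: {check.asDict()}")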

5.2 Security and governance awareness (production hygiene)

  • Explain why least privilege and clear ownership are essential for shared production tables.
  • Differentiate read vs write permissions and identify why write access should be tightly controlled.
  • Recognize that shared curated tables require schema change governance to prevent breaking consumers.
  • Describe the purpose of environment separation (dev/test/prod) and why it reduces accidental production impact.
  • Given a scenario, choose a safer sharing strategy (views, controlled access) instead of copying data (see the sketch after this list).
  • Explain why auditability matters for regulated datasets (lineage, history, approvals).
  • Identify the operational risks of destructive operations and why guardrails are needed.
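
A minimal sketch of sharing through a view with explicit read grants instead of copying data, assuming Unity Catalog three-level names; the catalog, schema, and group names are hypothetical.

    spark.sql("""
        CREATE VIEW IF NOT EXISTS analytics.reporting.orders_summary AS
        SELECT order_date, region, SUM(amount) AS total_amount
        FROM analytics.curated.orders
        GROUP BY order_date, region
    """)

    # Consumers get read-only access to the view; write access to the underlying table stays restricted.
    # (The group also needs USE CATALOG / USE SCHEMA privileges on the parent objects.)
    spark.sql("GRANT SELECT ON VIEW analytics.reporting.orders_summary TO `reporting-analysts`")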

5.3 Change management for production pipelines

  • Describe why small, reversible changes reduce blast radius in production.
  • Explain a staged release approach for pipelines (test in lower envs, promote with verification).
  • Recognize when to pause changes due to unstable platform signals (ongoing incidents, high backlog).
  • Given a scenario, choose a rollback strategy (time travel, redeploy previous version, re-run with known inputs); see the sketch after this list.
  • Explain why performance changes must be validated to avoid hidden regressions.
  • Describe how to document pipeline contracts (schemas, SLAs, ownership) for stable operations.
  • Identify anti-patterns: manual hotfixes without tracking, silent data drops, and unbounded retries.
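
As one example of a rollback path, here is a sketch of a Delta time-travel restore for a hypothetical orders_silver table; the version number is illustrative and should come from inspecting the history first.

    # Inspect recent commits to find the last known-good version.
    spark.sql("DESCRIBE HISTORY orders_silver").select("version", "timestamp", "operation").show(truncate=False)

    # Restore the table to the version that preceded the bad write.
    spark.sql("RESTORE TABLE orders_silver TO VERSION AS OF 42")

The RESTORE operation is itself recorded in the table history, so the rollback stays auditable and reversible.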

Tip: After finishing a topic, take a 15–25 question drill focused on that area, then revisit weak objectives before moving on.