Use this syllabus as your source of truth for DE‑PRO. Work topic by topic, and drill practice questions after each section.
What’s covered
Topic 1: Incremental Batch Pipelines & CDC
Practice this topic →
1.1 Idempotent batch design and backfills
- Define idempotency for pipelines and identify patterns that make re-runs safe.
- Design backfill strategies that minimize blast radius and avoid corrupting curated tables.
- Choose append vs overwrite vs partition overwrite based on data lifecycle and recovery requirements (see the sketch after this list).
- Explain how watermarking/processing windows support incremental batch loads.
- Recognize when full refresh is appropriate vs unnecessary and expensive.
- Given a scenario, choose the safest remediation path after a partial failure (re-run, backfill, or rollback).
- Describe how to validate correctness after backfill (row counts, reconciliation, constraints).
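For example, a date-range backfill can be made idempotent with a partition overwrite plus a reconciliation check. A minimal PySpark sketch, assuming a Delta target partitioned by `event_date` and the ambient `spark` session of a Databricks job; the table names and date window are hypothetical:

```python
from pyspark.sql import functions as F

# Hypothetical tables: "raw.events" (source) and "curated.events_daily" (target),
# partitioned by event_date.
start, end = "2024-06-01", "2024-06-07"

backfill_df = spark.table("raw.events").where(F.col("event_date").between(start, end))

(
    backfill_df.write.format("delta")
    .mode("overwrite")
    # replaceWhere limits the overwrite to the targeted partitions, so re-runs are safe.
    .option("replaceWhere", f"event_date >= '{start}' AND event_date <= '{end}'")
    .saveAsTable("curated.events_daily")
)

# Post-backfill validation: reconcile row counts for the backfilled window.
src_count = backfill_df.count()
tgt_count = (
    spark.table("curated.events_daily")
    .where(F.col("event_date").between(start, end))
    .count()
)
assert src_count == tgt_count, f"reconciliation failed: {src_count} vs {tgt_count}"
```

Because only the targeted partitions are replaced, re-running the same window cannot append duplicates or disturb the rest of the table.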
1.2 CDC upserts with MERGE (correctness and pitfalls)
- Select MERGE as the correct operation for CDC/upsert workloads and describe its purpose at a high level.
- Identify why merge keys must uniquely identify rows on the source side; duplicate matches can fail the merge or multiply rows (sketched below).
- Design delete/update/insert logic for CDC patterns (tombstones, soft deletes) conceptually.
- Explain how late-arriving changes affect CDC pipelines and how to handle reprocessing.
- Recognize performance implications of frequent small merges and when compaction helps (concept-level).
- Given a scenario, choose an upsert strategy that balances correctness, performance, and recoverability.
- Describe how to test CDC pipelines with representative edge cases (out-of-order updates, duplicates).
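A minimal sketch of a MERGE-based CDC upsert, assuming a change feed with `customer_id`, `change_ts`, and an `op` flag ('I'/'U'/'D'); all table and column names are hypothetical. Deduplicating the source to the latest change per key is what keeps the merge keys unique:

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F, Window

# Keep only the most recent change per key so each target row matches at most one source row.
latest = Window.partitionBy("customer_id").orderBy(F.col("change_ts").desc())
changes = (
    spark.table("bronze.customers_cdc")
    .withColumn("rn", F.row_number().over(latest))
    .where("rn = 1")
    .drop("rn")
)

(
    DeltaTable.forName(spark, "silver.customers").alias("t")
    .merge(changes.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedDelete(condition="s.op = 'D'")            # tombstones / hard deletes
    .whenMatchedUpdateAll(condition="s.op <> 'D'")
    .whenNotMatchedInsertAll(condition="s.op <> 'D'")
    .execute()
)
```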
1.3 Multi-hop architecture and data quality gates
- Explain Bronze/Silver/Gold as a reliability pattern and map quality checks to the correct layer.
- Define data quality expectations and identify where to enforce them to prevent downstream corruption.
- Design a quarantine pattern for bad records that preserves observability and enables remediation (example after this list).
- Explain why stable business definitions belong in curated layers and should be versioned/managed.
- Given a scenario, choose the right place to deduplicate and enforce constraints.
- Describe how lineage and documentation reduce operational confusion in shared pipelines.
- Recognize the trade-off between strictness (fail fast) and availability (tolerate and quarantine).
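One way to implement a quarantine gate between Bronze and Silver, sketched with hypothetical table names and rules; bad rows are preserved with a reason instead of being silently dropped or failing the whole load:

```python
from pyspark.sql import functions as F

raw = spark.table("bronze.orders")

# Coalesce to False so rows with NULLs in the checked columns land in quarantine.
is_valid = F.coalesce(F.col("order_id").isNotNull() & (F.col("amount") >= 0), F.lit(False))
flagged = raw.withColumn("_is_valid", is_valid)

valid = flagged.where("_is_valid").drop("_is_valid")
invalid = (
    flagged.where("NOT _is_valid").drop("_is_valid")
    .withColumn(
        "quarantine_reason",
        F.when(F.col("order_id").isNull(), F.lit("missing order_id"))
         .otherwise(F.lit("invalid amount")),
    )
)

valid.write.format("delta").mode("append").saveAsTable("silver.orders")
invalid.write.format("delta").mode("append").saveAsTable("silver.orders_quarantine")
```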
Topic 2: Structured Streaming on Databricks
Practice this topic →
2.1 Core streaming mechanics (state, triggers, checkpointing)
- Explain the purpose of checkpointing and how it enables reliable streaming recovery.
- Differentiate trigger types conceptually (micro-batch vs continuous triggers where applicable) and their latency/cost implications.
- Describe stateful operations (aggregations, joins) and why state size must be bounded.
- Explain why deleting or moving checkpoints can break correctness and lead to duplicates or data loss.
- Recognize the role of output modes (append/update/complete) and which are valid for stateful operations.
- Given a scenario, choose the correct checkpointing strategy and storage location for recovery (see the sketch below).
- Describe safe restart behavior and what to verify after restart (throughput, lag, correctness).
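A minimal Structured Streaming sketch showing the checkpoint location and trigger choice; the table names and checkpoint path are hypothetical:

```python
# Incremental copy from a bronze table to a silver table. The checkpoint tracks
# progress, so a restart resumes exactly where the stream left off; deleting or
# moving it forfeits that guarantee.
query = (
    spark.readStream.table("bronze.events")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/checkpoints/silver_events")  # stable, dedicated path per sink
    .outputMode("append")
    .trigger(availableNow=True)  # drain available data, then stop; use processingTime="1 minute" for always-on
    .toTable("silver.events")
)
query.awaitTermination()
```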
2.2 Late data and watermarks
- Explain watermarks conceptually and how they bound state for event-time processing.
- Differentiate event time vs processing time and why it matters for streaming correctness.
- Describe how late data affects aggregations and what trade-offs exist (drop late vs update aggregates).
- Recognize scenarios where missing watermarks can create unbounded state and operational instability.
- Given a scenario, choose an appropriate watermark duration based on data delay characteristics and business requirements (illustrated below).
- Explain how to validate late-data handling using test cases and monitoring.
- Describe why watermark changes can change results and should be treated as a controlled change.
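A watermarked, windowed aggregation sketch; the 30-minute watermark and 10-minute window are illustrative values that would be tuned to the observed lateness of the data, and the table names are hypothetical:

```python
from pyspark.sql import functions as F

clicks = spark.readStream.table("bronze.clicks")  # hypothetical stream with an event_time column

windowed = (
    clicks
    # The watermark bounds state: windows older than (max event_time - 30 min)
    # are finalized, and rows arriving later than that are dropped.
    .withWatermark("event_time", "30 minutes")
    .groupBy(F.window("event_time", "10 minutes"), "page")
    .agg(F.count(F.lit(1)).alias("click_count"))
)

(
    windowed.writeStream
    .format("delta")
    .option("checkpointLocation", "/checkpoints/clicks_by_window")
    .outputMode("append")  # append emits each window once, after the watermark passes it
    .toTable("gold.clicks_by_window")
)
```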
2.3 Streaming sinks and idempotent writes
- Describe common sinks conceptually (Delta, message queues, external stores) and their correctness trade-offs.
- Explain why idempotent writes matter for streaming and how retries can create duplicates without safeguards.
- Recognize the relationship between checkpointing and exactly-once style behavior for certain sinks (concept-level).
- Identify when a foreachBatch pattern is appropriate for integrating with external systems (concept-level; sketched after this list).
- Given a scenario, choose a write strategy that is recoverable and observable.
- Explain what to monitor for streaming pipelines (throughput, processing latency, state size, errors).
- Describe how to handle poison messages without stalling an entire stream (quarantine patterns).
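A `foreachBatch` sketch that keeps writes idempotent by merging each micro-batch on a key, so a replayed batch after a failure updates the same rows instead of appending duplicates; table and column names are hypothetical:

```python
from delta.tables import DeltaTable

def upsert_batch(batch_df, batch_id):
    # Deduplicate within the micro-batch, then MERGE so retries are idempotent.
    deduped = batch_df.dropDuplicates(["order_id"])
    (
        DeltaTable.forName(spark, "silver.orders").alias("t")
        .merge(deduped.alias("s"), "t.order_id = s.order_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )

(
    spark.readStream.table("bronze.orders_updates")
    .writeStream
    .foreachBatch(upsert_batch)
    .option("checkpointLocation", "/checkpoints/orders_upsert")
    .start()
)
```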
Topic 3: Delta Live Tables (DLT) Pipelines
Practice this topic →
3.1 Declarative pipeline structure and dependencies
- Explain DLT’s value proposition at a high level (declarative pipelines with built-in ops).
- Describe how tables/views in a pipeline form a dependency graph and why ordering matters.
- Differentiate between batch and streaming DLT pipelines conceptually and when each is used.
- Recognize how pipeline changes should be staged to reduce risk (dev → staging → prod).
- Given a scenario, choose a DLT structure that separates raw ingest from curated outputs (see the sketch below).
- Explain why pipeline graphs improve observability and troubleshooting.
- Describe basic pipeline parameterization and environment separation patterns (concept-level).
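A minimal DLT sketch of a two-hop dependency, with a streaming bronze ingest feeding a curated silver table; the Auto Loader path and table names are hypothetical, and DLT infers the ordering from `dlt.read_stream()`:

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw orders, ingested as-is")
def orders_bronze():
    return (
        spark.readStream.format("cloudFiles")        # Auto Loader
        .option("cloudFiles.format", "json")
        .load("/landing/orders")                     # hypothetical landing path
    )

@dlt.table(comment="Cleaned, typed orders")
def orders_silver():
    return (
        dlt.read_stream("orders_bronze")             # declares the bronze -> silver dependency
        .where(F.col("order_id").isNotNull())
        .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
    )
```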
3.2 Expectations (data quality) and outcomes
- Define DLT expectations conceptually and explain how they enforce data quality rules.
- Differentiate fail-fast vs drop/quarantine behavior and how each affects downstream correctness (illustrated after this list).
- Design expectations for common quality constraints (not null, ranges, uniqueness) conceptually.
- Recognize why quality gates should be placed early in curated layers to prevent bad data propagation.
- Given a scenario, choose the right expectation behavior based on business tolerance and reliability goals.
- Explain why quality metrics should be monitored as first-class signals.
- Describe how changes to expectations should be managed to avoid surprising downstream consumers.
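A sketch contrasting the two expectation outcomes, drop versus fail-fast; the rules and table names are hypothetical:

```python
import dlt

@dlt.table(comment="Orders that passed the quality gates")
@dlt.expect_or_drop("non_negative_amount", "amount >= 0")        # tolerable: drop the row, record the metric
@dlt.expect_or_fail("order_id_present", "order_id IS NOT NULL")  # critical: fail the update fast
def orders_validated():
    return dlt.read("orders_silver")
```

Dropped rows still show up in the pipeline's quality metrics, which is why those metrics should be watched as first-class signals rather than ignored.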
3.3 DLT operations and troubleshooting
- Identify common DLT failure categories (upstream schema drift, permissions, data quality failures).
- Describe safe recovery strategies (rerun, backfill, rollback) for failed pipeline updates.
- Explain why lineage and pipeline graphs help isolate the root cause quickly.
- Recognize operational best practices: small changes, validation runs, and clear ownership.
- Given a scenario, choose the least risky corrective action that restores correctness.
- Explain how to monitor pipeline health and data freshness for SLA-style requirements.
- Describe why cost and performance should be tracked over time to prevent regressions.
Topic 4: Spark Performance & Table Optimization
Practice this topic →
4.1 Shuffle, skew, and join strategy
- Identify operations that cause shuffles and explain why shuffles are expensive (network + disk + coordination).
- Diagnose data skew symptoms (a few straggler tasks run far longer than the rest) and choose mitigation approaches conceptually.
- Explain broadcast joins conceptually and identify when broadcasting a small dimension is appropriate (sketched below).
- Describe why filtering early and selecting only needed columns reduces shuffle size and cost.
- Given a scenario, choose whether to tune code, data layout, or cluster sizing for best ROI.
- Explain adaptive execution at a high level and why it can improve query plans.
- Describe how to validate performance changes without breaking correctness (A/B timing + row checks).
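A sketch of early filtering, column pruning, and a broadcast join against a small dimension; table and column names are hypothetical:

```python
from pyspark.sql import functions as F
from pyspark.sql.functions import broadcast

facts = (
    spark.table("silver.sales")
    .where(F.col("sale_date") >= "2024-01-01")   # filter early to shrink shuffle input
    .select("sale_id", "store_id", "amount")     # keep only the columns the query needs
)
stores = spark.table("silver.stores").select("store_id", "region")  # small dimension

by_region = (
    facts.join(broadcast(stores), "store_id")    # broadcast the small side instead of shuffling the fact table
    .groupBy("region")
    .agg(F.sum("amount").alias("total_amount"))
)
```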
4.2 File layout, compaction, and table maintenance
- Recognize the small-file problem and why compaction can improve performance.
- Choose partition columns that enable pruning while avoiding high-cardinality partition explosions.
- Explain Z-order/data clustering at a conceptual level and why it helps common filter patterns.
- Describe maintenance operations at a high level (optimize/compaction, vacuum) and the risks of aggressive cleanup (see the example after this list).
- Given a scenario, choose whether to adjust write patterns or run maintenance to address performance issues.
- Explain why retention and cleanup settings interact with auditability and the ability to roll back via time travel.
- Describe why table maintenance should be scheduled and monitored like a production job.
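A sketch of a scheduled maintenance job; the table, Z-order column, and retention window are hypothetical and should be set from your own recovery and audit requirements:

```python
# Compact small files and cluster the table by a common filter column.
spark.sql("OPTIMIZE silver.events ZORDER BY (customer_id)")

# Remove unreferenced files older than 7 days. Shortening retention saves storage
# but also limits how far back time travel and rollback can reach.
spark.sql("VACUUM silver.events RETAIN 168 HOURS")
```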
4.3 Cluster sizing and workload isolation (concept-level)
- Differentiate scaling up vs scaling out and identify when each helps Spark workloads.
- Explain why concurrent workloads may need isolation to prevent noisy-neighbor performance degradation.
- Recognize when caching is beneficial vs when it wastes memory in batch pipelines (sketched below).
- Given a scenario, choose a safe scaling approach that reduces cost while meeting SLAs.
- Describe why autoscaling can help bursty workloads but needs observability and limits.
- Explain why driver/executor resource imbalance can cause job instability (concept-level).
- Identify safe operational practices: staged scaling changes and monitoring after changes.
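A sketch of caching an intermediate result that several outputs reuse, then releasing it; table names are hypothetical. If the result were used only once, the cache would just spend memory for nothing:

```python
# Hypothetical enrichment reused by two downstream aggregates.
enriched = spark.table("silver.sales").join(spark.table("silver.stores"), "store_id")
enriched.cache()

daily = enriched.groupBy("sale_date").count()
by_region = enriched.groupBy("region").count()

daily.write.format("delta").mode("overwrite").saveAsTable("gold.sales_daily_counts")
by_region.write.format("delta").mode("overwrite").saveAsTable("gold.sales_region_counts")

enriched.unpersist()  # return the memory once the reuse is done
```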
Topic 5: Reliability, Observability & Governance
Practice this topic →
5.1 Operational monitoring and incident response
- Identify the minimal monitoring set for pipelines (freshness, row counts, error rate, latency); a sketch follows this list.
- Explain why structured logs and metrics reduce MTTR during incident response.
- Describe a safe incident triage sequence: scope impact → validate inputs → isolate failing stage → remediate.
- Recognize common failure modes: schema drift, data quality violations, permission changes, and upstream outages.
- Given a scenario, choose the least risky corrective action that restores correctness.
- Explain why runbooks and ownership reduce repeated operational mistakes.
- Describe how to validate recovery (correctness checks, lag/backlog drain, SLA restoration).
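A minimal freshness-and-volume check that could run as a monitoring job; the table, `ingest_ts` column, and 6-hour threshold are hypothetical, and the comparison assumes `ingest_ts` and the driver clock share a time zone:

```python
from datetime import datetime, timedelta
from pyspark.sql import functions as F

row = (
    spark.table("gold.daily_revenue")
    .agg(F.count(F.lit(1)).alias("row_count"), F.max("ingest_ts").alias("last_load"))
    .first()
)

if row["row_count"] == 0 or row["last_load"] is None:
    raise RuntimeError("gold.daily_revenue is empty or has never been loaded")
if datetime.now() - row["last_load"] > timedelta(hours=6):
    raise RuntimeError(f"gold.daily_revenue is stale; last load at {row['last_load']}")
```

In production the failure would page an owner or open an incident rather than just raise, but the signals (freshness and volume) are the same.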
5.2 Security and governance awareness (production hygiene)
- Explain why least privilege and clear ownership are essential for shared production tables.
- Differentiate read vs write permissions and identify why write access should be tightly controlled.
- Recognize that shared curated tables require schema change governance to prevent breaking consumers.
- Describe the purpose of environment separation (dev/test/prod) and why it reduces accidental production impact.
- Given a scenario, choose a safer sharing strategy (views, controlled access) instead of copying data (example below).
- Explain why auditability matters for regulated datasets (lineage, history, approvals).
- Identify the operational risks of destructive operations and why guardrails are needed.
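A sketch of sharing through a governed view instead of copying data, assuming Unity Catalog-style GRANT syntax; the catalog, schema, view, and group names are hypothetical:

```python
# Expose only the aggregated columns consumers need, then grant read access to a group.
spark.sql("""
    CREATE OR REPLACE VIEW main.reporting.orders_summary_v AS
    SELECT order_date, region, SUM(amount) AS total_amount
    FROM main.silver.orders
    GROUP BY order_date, region
""")
spark.sql("GRANT SELECT ON VIEW main.reporting.orders_summary_v TO `data_analysts`")
```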
5.3 Change management for production pipelines
- Describe why small, reversible changes reduce blast radius in production.
- Explain a staged release approach for pipelines (test in lower envs, promote with verification).
- Recognize when to pause changes due to unstable platform signals (ongoing incidents, high backlog).
- Given a scenario, choose a rollback strategy (time travel, redeploy previous version, re-run with known inputs); see the sketch after this list.
- Explain why performance changes must be validated to avoid hidden regressions.
- Describe how to document pipeline contracts (schemas, SLAs, ownership) for stable operations.
- Identify anti-patterns: manual hotfixes without tracking, silent data drops, and unbounded retries.
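A sketch of a Delta time-travel rollback; the table name and version number are illustrative, and the right version comes from inspecting the history first:

```python
# Find the last known-good version, then restore to it.
spark.sql("DESCRIBE HISTORY silver.orders") \
    .select("version", "timestamp", "operation") \
    .show(10, truncate=False)

spark.sql("RESTORE TABLE silver.orders TO VERSION AS OF 42")  # illustrative version number
```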
Tip: After finishing a topic, take a 15–25 question drill focused on that area, then revisit weak objectives before moving on.