Last-mile DE-ASSOC review: Spark SQL/DataFrames essentials, Delta Lake features (schema, time travel, merge), ETL patterns, and exam-style pickers. Includes code snippets, tables, and diagrams.
Use this for last‑mile review. Pair it with the Syllabus for coverage and Practice to validate speed/accuracy.
| Category | Examples | Notes |
|---|---|---|
| Transformations | select, filter, withColumn, groupBy, join | Lazy: build a plan |
| Actions | count, collect, show, write, foreach | Trigger execution |
Rule: If it returns a DataFrame, it’s usually a transformation. If it returns a value or writes, it’s usually an action.
| Type | What it implies | Examples |
|---|---|---|
| Narrow | No shuffle | select, filter, map |
| Wide | Shuffle | groupBy, join, distinct, orderBy |
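A quick way to see the difference in a notebook is EXPLAIN: a wide operation shows up as an Exchange (the shuffle) in the physical plan, a narrow one does not. A minimal sketch, assuming a hypothetical `events` table:

```sql
-- Narrow: scan + filter + project, no Exchange in the physical plan
EXPLAIN SELECT user_id, amount FROM events WHERE amount > 0;

-- Wide: the GROUP BY forces a shuffle, visible as "Exchange hashpartitioning(user_id, ...)"
EXPLAIN SELECT user_id, SUM(amount) AS total FROM events GROUP BY user_id;
```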
| Join | Keeps rows from | Common use |
|---|---|---|
| INNER | both sides match | dimensional enrichment |
| LEFT | left side always | “keep all facts” |
| FULL | both sides always | reconciliation |
| SEMI/ANTI | left rows with/without match | de-dup and “missing” detection |
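In Spark SQL the last row is spelled LEFT SEMI JOIN / LEFT ANTI JOIN. A sketch, assuming hypothetical `orders` and `customers` tables that share a `customer_id` key:

```sql
-- SEMI: orders whose customer exists; only left-side columns come back
SELECT o.*
FROM orders o
LEFT SEMI JOIN customers c ON o.customer_id = c.customer_id;

-- ANTI: orders pointing at a missing customer ("missing" detection)
SELECT o.*
FROM orders o
LEFT ANTI JOIN customers c ON o.customer_id = c.customer_id;
```

The other exam-favorite dedup tool is a window function, as in the snippet below.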
```sql
SELECT
  user_id,
  ts,
  amount,
  ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ts DESC) AS rn
FROM transactions;
```
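The usual follow-up is to keep only the newest row per key. A sketch that wraps the query above in a CTE (Databricks SQL also offers a QUALIFY clause for this, but the subquery form runs anywhere):

```sql
-- Latest transaction per user: rank within each user, keep rank 1
WITH ranked AS (
  SELECT
    user_id,
    ts,
    amount,
    ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ts DESC) AS rn
  FROM transactions
)
SELECT user_id, ts, amount
FROM ranked
WHERE rn = 1;
```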
Delta Lake adds ACID + schema enforcement + time travel to data lake storage.
```sql
CREATE TABLE IF NOT EXISTS silver_orders
USING DELTA
AS SELECT * FROM bronze_orders;
```
MERGE (upsert / CDC pattern)

```sql
MERGE INTO silver_orders t
USING staging_orders s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```
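CDC feeds often carry an operation flag, and Delta MERGE accepts extra, conditional WHEN MATCHED clauses, so deletes can be handled in the same statement. A sketch with hypothetical columns (`_op`, `amount`, `status`, `updated_at`); explicit column lists keep the flag itself out of the target table:

```sql
MERGE INTO silver_orders t
USING staging_orders s
ON t.order_id = s.order_id
WHEN MATCHED AND s._op = 'D' THEN DELETE          -- rows flagged as deleted upstream
WHEN MATCHED THEN UPDATE SET
  t.amount = s.amount, t.status = s.status, t.updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT (order_id, amount, status, updated_at)
  VALUES (s.order_id, s.amount, s.status, s.updated_at);
```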
| Concept | What it means | What you see |
|---|---|---|
| Schema enforcement | rejects incompatible writes | write fails |
| Schema evolution | updates table schema intentionally | new columns appear |
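In SQL, intentional evolution is usually an explicit ALTER (for DataFrame writes, the writer's `mergeSchema` option plays the same role); enforcement is simply the failed write you get without it. A sketch adding a hypothetical `coupon_code` column:

```sql
-- Before this, an append that includes coupon_code is rejected (enforcement);
-- after it, the column is part of the schema and such writes succeed (evolution).
ALTER TABLE silver_orders ADD COLUMNS (coupon_code STRING);
```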
```sql
DESCRIBE HISTORY silver_orders;

SELECT * FROM silver_orders VERSION AS OF 42;
```
Exam cue: time travel is for debugging/audit/rollback; it doesn’t replace good pipeline discipline.
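When a rollback genuinely is the right call, Delta has a dedicated command; a sketch reusing the version number from the query above:

```sql
-- Roll the table back to an earlier snapshot (TIMESTAMP AS OF works too)
RESTORE TABLE silver_orders TO VERSION AS OF 42;
```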
| You want… | Do this | Why |
|---|---|---|
| Faster pruning on common filters | Partition on low-to-medium cardinality columns | Partition pruning |
| Avoid too many tiny files | Don’t over-partition; use compaction/OPTIMIZE where appropriate | Small-file problem |
| Faster reads on frequent predicates | Use file compaction + data skipping (conceptually) | Fewer files + better locality |
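A sketch combining the rows above, assuming a hypothetical `silver_events` table with an `order_date` partition column; OPTIMIZE (optionally with ZORDER) is Databricks' compaction and data-skipping command:

```sql
-- Low-cardinality partition column -> partition pruning on date filters
CREATE TABLE IF NOT EXISTS silver_events (
  event_id   STRING,
  user_id    STRING,
  amount     DOUBLE,
  order_date DATE
)
USING DELTA
PARTITIONED BY (order_date);

-- Compact small files; ZORDER clusters by a frequent (non-partition) predicate column
OPTIMIZE silver_events ZORDER BY (user_id);
```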
```mermaid
flowchart LR
  B["Bronze (raw)"] --> S["Silver (cleaned)"]
  S --> G["Gold (business-ready)"]
```
Rules of thumb
- MERGE join condition wrong or not unique: when several source rows match the same target row, Delta aborts the merge instead of guessing which update wins, so make the ON clause a true key and dedupe the source if needed (see the sketch below).
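If the source can legitimately carry several rows per key, dedupe it first with the same window trick from earlier. A sketch, assuming staging_orders has `amount`, `status`, and `updated_at` columns and that these (plus `order_id`) make up the target schema:

```sql
MERGE INTO silver_orders t
USING (
  SELECT order_id, amount, status, updated_at        -- project away the helper rn column
  FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) AS rn
    FROM staging_orders
  ) ranked
  WHERE rn = 1                                        -- one source row per key avoids the multi-match error
) s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```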