Use this syllabus as your source of truth for CCAAK. Work topic-by-topic, and drill questions after each section.
What’s covered
Topic 1: Kafka Architecture & Core Concepts (Admin View)
1.1 Brokers, controllers, and cluster metadata
- Describe the role of a broker and how brokers host partition replicas.
- Explain the controller’s responsibility for partition leadership and cluster metadata at a high level.
- Differentiate between data plane traffic (produce/consume) and control plane operations (metadata, elections).
- Explain how clients use bootstrap servers and why multiple bootstrap brokers are recommended.
- Describe how controller instability can surface as client errors, leader changes, and operational alerts.
- Recognize ZooKeeper-mode vs KRaft-mode clusters conceptually (external ZooKeeper ensemble vs built-in metadata quorum).
- Given a scenario, identify which component is most likely responsible for a leadership/metadata symptom.
1.2 Topics, partitions, offsets, and ordering
- Explain how a topic is partitioned and why partitions are the unit of parallelism and ordering.
- Define an offset and describe how offsets represent position within a partition log.
- Describe ordering guarantees (per partition) and why ordering is not preserved across partitions.
- Explain how record keys affect partition selection and why key choice impacts load and ordering.
- Describe consumer groups and the one-consumer-per-partition rule within a group.
- Identify how adding partitions affects key→partition mapping and consumer parallelism.
- Given a scenario, choose a partitioning approach that balances parallelism with ordering requirements.
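The effect of adding partitions on key→partition mapping can be illustrated with a small sketch. It assumes a hash-then-modulo partitioner and uses zlib.crc32 as a stand-in hash; Kafka's default partitioner uses murmur2, so real partition numbers will differ. The function name partition_for and the sample keys are hypothetical.

```python
import zlib


def partition_for(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition by hashing it and taking the remainder.

    Assumption: zlib.crc32 is only a stand-in for Kafka's murmur2 hash.
    """
    return zlib.crc32(key) % num_partitions


# Hypothetical keys, mapped before and after growing a topic from 6 to 8 partitions.
keys = [f"customer-{i}".encode() for i in range(1000)]
before = {k: partition_for(k, 6) for k in keys}
after = {k: partition_for(k, 8) for k in keys}

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved} of {len(keys)} keys map to a different partition after 6 -> 8")
```

Because the same key can land on a different partition after the change, per-key ordering only holds among records written under the same partition count.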
1.3 Replication, ISR, and durability trade-offs
- Define replication factor and describe leader/follower replication at a high level.
- Explain the in-sync replica (ISR) concept and why it matters for durability and acknowledgements.
- Describe under-replicated partitions (URP) and what it implies about cluster health.
- Explain how producer acknowledgements (acks) and the topic's min ISR setting (min.insync.replicas) interact conceptually (durability vs availability).
- Describe what unclean leader election means and why it can lead to data loss.
- Identify the operational symptoms of ISR shrink (URP, increased risk, durability constraints).
- Given a scenario, choose safer durability defaults for production topics.
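The interaction between acks and min ISR can be sketched as a simple decision rule. This is a simplified model rather than broker code: it assumes min.insync.replicas is only enforced for acks=all, and the function names are hypothetical.

```python
def write_accepted(acks: str, isr_size: int, min_insync_replicas: int) -> bool:
    """Return True if a produce request would be accepted in this model.

    With acks="all" the write is rejected when the ISR has fewer members
    than min.insync.replicas. With acks="0" or "1" that check is not
    applied, so the write is accepted as long as a leader is available.
    """
    if isr_size < 1:
        return False  # no in-sync leader to write to
    if acks == "all":
        return isr_size >= min_insync_replicas
    return True


def tolerated_broker_losses(replication_factor: int, min_insync_replicas: int) -> int:
    """Replicas that can drop out of the ISR before acks=all writes are rejected."""
    return max(replication_factor - min_insync_replicas, 0)
```

For example, replication factor 3 with min ISR 2 tolerates one replica falling out of the ISR while still accepting acks=all writes; setting min ISR equal to the replication factor tolerates none.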
Topic 2: Broker Configuration & Cluster Setup
2.1 Listener configuration and client connectivity
- Explain the purpose of listeners and advertised listeners in broker networking.
- Identify common connectivity failures caused by incorrect advertised hostnames or ports.
- Describe listener security protocol mapping at a high level (PLAINTEXT, SSL, SASL_SSL).
- Explain the role of the inter-broker listener and why it must be consistent across brokers.
- Describe how DNS, load balancers, and NAT can affect advertised listener design.
- Given a scenario, diagnose why clients can connect to one broker but not others.
- Choose a listener design that supports internal vs external client access safely.
2.2 Storage, log directories, and retention basics
- Explain what log directories store and how partitions map to files on disk.
- Identify the operational risks of disk pressure and why free space must be protected.
- Describe retention and compaction at a high level and how they affect disk usage.
- Explain segment roll settings conceptually and why segment size/time impacts compaction and retention behavior.
- Describe the difference between broker defaults and topic-level overrides for retention/compaction.
- Given a scenario, choose the fastest safe mitigation for a disk pressure incident.
- Explain why extremely large records can create broker instability and how max message size settings help.
2.3 Capacity, performance, and scaling basics
- Explain how partitions affect throughput and why too many partitions can add overhead.
- Describe how replication factor affects write amplification and storage needs.
- Identify key bottlenecks for Kafka clusters (disk I/O, network, CPU) and the signals they produce.
- Explain why controller and metadata operations can become bottlenecks in very large clusters.
- Describe the purpose of quotas conceptually and when they are used to protect shared clusters.
- Given a scenario, choose a scale strategy: add brokers vs increase partitions vs tune producer/consumer behavior.
- Recognize common anti-patterns: oversized partitions, uncontrolled topic sprawl, and unbounded retention.
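How replication factor and retention drive storage needs can be shown with back-of-the-envelope arithmetic. The sketch below assumes time-based retention, a steady ingest rate, and no compression; the function name and the 30% headroom default are illustrative assumptions, not recommended values.

```python
def required_storage_gb(
    ingest_mb_per_s: float,
    retention_hours: float,
    replication_factor: int,
    headroom: float = 0.3,
) -> float:
    """Estimate cluster-wide disk needed for one workload, in GB.

    Every record is stored once per replica, so the raw volume is
    multiplied by the replication factor. Headroom keeps free space
    for traffic spikes and partition reassignment.
    """
    raw_mb = ingest_mb_per_s * 3600 * retention_hours
    replicated_mb = raw_mb * replication_factor
    return replicated_mb * (1 + headroom) / 1024


# Hypothetical workload: 10 MB/s, 24 hours of retention, replication factor 3.
estimate = required_storage_gb(10, 24, 3)
print(f"approx. {estimate:.0f} GB across the cluster")
```

Unbounded retention has no finite answer in this model, which is one reason it appears in the anti-pattern list.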
Topic 3: Topic Lifecycle & Data Management
3.1 Creating topics: partitions, replication factor, and placement
- Choose an appropriate partition count based on target consumer parallelism and expected throughput.
- Choose an appropriate replication factor based on fault tolerance requirements and broker count.
- Explain why replication factor cannot exceed available brokers and what happens when brokers are unavailable.
- Describe rack awareness conceptually and when it matters for failure domain separation.
- Explain the operational trade-offs of increasing partitions after a topic is in use.
- Given a scenario, choose a topic configuration that balances ordering, throughput, and resilience.
- Explain why many small topics can be operationally expensive and how consolidation decisions are made.
3.2 Topic configs: retention, compaction, and durability controls
- Differentiate cleanup policies (delete vs compact) and map them to event stream intent.
- Configure and reason about time/size retention settings and their impact on storage.
- Explain how compaction interacts with keys and why records without keys cannot be compacted.
- Describe min ISR as a durability control and the availability consequences of setting it too high.
- Explain the trade-offs of unclean leader election and when it should be avoided.
- Given a scenario, decide whether a topic should be a changelog (compact) or an event log (delete).
- Describe how topic-level overrides interact with broker defaults.
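The difference between a compacted changelog and a delete-policy event log is easier to see with a toy model of compaction. The sketch below keeps only the latest record per key. It ignores real-world details such as the active segment not being cleaned and tombstones being retained for a configurable period; the record layout and function name are hypothetical.

```python
from typing import Optional

# (offset, key, value); a value of None models a tombstone (delete marker).
Record = tuple[int, str, Optional[str]]


def compact(log: list[Record], drop_tombstones: bool = False) -> list[Record]:
    """Keep the highest-offset record for each key, in offset order."""
    latest: dict[str, Record] = {}
    for record in log:
        key = record[1]
        latest[key] = record  # later offsets overwrite earlier ones
    survivors = sorted(latest.values(), key=lambda r: r[0])
    if drop_tombstones:
        survivors = [r for r in survivors if r[2] is not None]
    return survivors


log = [
    (0, "user-1", "a"),
    (1, "user-2", "b"),
    (2, "user-1", "c"),
    (3, "user-2", None),  # tombstone for user-2
]
print(compact(log))
print(compact(log, drop_tombstones=True))
```

The model depends entirely on the key: without one there is nothing to decide which older record a newer record supersedes.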
3.3 Partition reassignment and leader balancing (operational intent)
- Explain why partition reassignment is performed (add brokers, balance load, decommission brokers).
- Describe the risk of moving large partitions and how throttling/stepwise moves reduce blast radius.
- Explain preferred leader election at a high level and why leader imbalance can occur.
- Identify the difference between balancing replicas vs balancing leaders and why both matter.
- Recognize when reassignment can temporarily increase network and disk load.
- Given a scenario, pick a safe sequence for broker decommissioning and partition movement.
- Describe the post-change verification steps: replication health, ISR, and client impact checks.
Topic 4: Security (TLS, SASL, ACLs) & Access Control
4.1 TLS and encryption in transit (admin level)
- Explain why TLS is used between clients and brokers and between brokers.
- Differentiate server authentication from mutual TLS (mTLS) at a conceptual level.
- Identify common TLS misconfigurations (truststore issues, hostname mismatch, wrong listener protocol).
- Describe certificate rotation risk and the importance of staged rollout and validation.
- Explain how TLS settings affect client connection troubleshooting (handshake failures).
- Given a scenario, choose a safe rollout approach for enabling TLS on a running cluster.
- Recognize which endpoints typically require TLS (brokers, Connect, Schema Registry) in secure deployments.
4.2 Authentication with SASL (high level) and common patterns
- Differentiate authentication (who) from authorization (what) in Kafka access control.
- Explain SASL’s role at a high level and recognize that mechanisms vary by environment.
- Identify how SASL configuration interacts with listeners and security protocols.
- Recognize common auth failures from the client perspective (invalid credentials, mechanism mismatch).
- Explain why separating admin, producer, and consumer identities reduces blast radius.
- Given a scenario, determine whether a failure is due to TLS, SASL authentication, or ACL authorization.
- Describe safe secret handling for credentials (no hardcoding, rotate, use secret managers).
4.3 Authorization with ACLs and least privilege
- Describe what ACLs control conceptually (topic read/write, group access, cluster actions).
- Identify the minimum permissions for a producer-only application vs consumer-only application.
- Explain why consumers often need both topic READ and group permissions to operate correctly.
- Recognize how wildcard ACLs increase risk and how to scope ACLs safely.
- Describe the super user concept at a high level and why super users should be tightly controlled.
- Given a scenario, select the ACL changes required to fix an authorization error without over-granting.
- Explain why auditing and logging access changes matters for compliance.
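The least-privilege objectives above can be sketched as a lookup of minimum permissions per role. This is a simplified model: it covers a plain producer and a group-based consumer only, leaves out extra permissions that transactional producers or admin tools may need, and all names are hypothetical.

```python
# Minimum (resource type, operation) pairs per role in this simplified model.
REQUIRED = {
    "producer": {("TOPIC", "WRITE")},
    "consumer": {("TOPIC", "READ"), ("GROUP", "READ")},
}


def missing_permissions(role: str, granted: set[tuple[str, str]]) -> list[tuple[str, str]]:
    """Return the permissions the role still needs, sorted for stable output."""
    return sorted(REQUIRED[role] - granted)


# A consumer that was granted topic READ but no group permission.
print(missing_permissions("consumer", {("TOPIC", "READ")}))
```

Granting only what this check reports as missing, scoped to the specific topic and group, fixes the authorization error without resorting to wildcard ACLs.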
Topic 5: Monitoring, Metrics & Observability
5.1 Health indicators: URP, offline partitions, and controller behavior
- Differentiate under-replicated partitions from offline partitions and explain severity differences.
- Identify the common causes of URP (broker down, disk/network issues, load).
- Explain why offline partitions indicate missing leadership and require immediate attention.
- Recognize symptoms of controller churn and why it destabilizes leadership and metadata operations.
- Describe the relationship between ISR shrink and durability constraints (min ISR vs availability).
- Given a scenario, choose the first diagnostic step for URP/offline partitions (logs, broker status, disk).
- Describe safe remediation sequencing: stabilize brokers, restore replication, then optimize.
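The severity difference between under-replicated and offline partitions can be expressed as a small classification rule. The sketch assumes partition metadata is already available as plain values; the PartitionState type and classify function are hypothetical, not part of any Kafka client API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PartitionState:
    topic: str
    partition: int
    leader: Optional[int]  # broker id of the leader, or None if there is none
    replicas: list[int]    # assigned replica broker ids
    isr: list[int]         # in-sync replica broker ids


def classify(p: PartitionState) -> str:
    """Offline (no leader) is checked first because it is the more severe state."""
    if p.leader is None:
        return "offline"
    if len(p.isr) < len(p.replicas):
        return "under-replicated"
    return "healthy"


print(classify(PartitionState("orders", 0, leader=1, replicas=[1, 2, 3], isr=[1, 2])))
```

An under-replicated partition is still served by its leader with reduced redundancy, while an offline partition cannot be produced to or consumed from at all.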
5.2 Throughput and latency monitoring (broker and client signals)
- Identify key broker-side signals for throughput/latency (request rates, request times, network, disk).
- Describe how disk I/O bottlenecks surface in lag, replication delays, and increased request latency.
- Explain why network saturation can cause replication lag and producer timeouts.
- Describe how client configs can amplify load (too many requests, tiny batches, aggressive fetches).
- Given a scenario, determine whether a bottleneck is broker CPU, disk, or network based on symptoms.
- Identify safe tuning levers: adjust quotas, reduce retention pressure, scale brokers, or shift workload.
- Explain why monitoring should include both broker metrics and client behavior (producer/consumer metrics).
5.3 Consumer group monitoring and lag diagnosis
- Define consumer lag and explain why it is a symptom, not a root cause.
- Identify common lag causes: insufficient partitions, slow processing, downstream dependencies, rebalances.
- Explain how frequent rebalances can look like lag spikes and unstable processing.
- Describe how max poll interval and session timeouts relate to consumer stability (high level).
- Given a scenario, choose a fix for lag: scale consumers, increase partitions, or optimize processing.
- Explain why consumer group metrics and offsets are essential for incident triage.
- Explain why backlogs can be ‘normal’ during planned maintenance and how to plan for catch-up.
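Consumer lag, as defined in this section, is the gap between a partition's log end offset and the group's committed offset. A minimal sketch, assuming both sets of offsets have already been fetched into dictionaries keyed by partition number; treating a partition with no committed offset as committed at 0 is a simplification made here.

```python
def consumer_lag(
    log_end_offsets: dict[int, int],
    committed_offsets: dict[int, int],
) -> dict[int, int]:
    """Per-partition lag: log end offset minus committed offset, floored at 0."""
    lag = {}
    for partition, end_offset in log_end_offsets.items():
        committed = committed_offsets.get(partition, 0)  # simplification: no commit = 0
        lag[partition] = max(end_offset - committed, 0)
    return lag


# Hypothetical offsets for a three-partition topic.
lag = consumer_lag({0: 1500, 1: 1480, 2: 900}, {0: 1500, 1: 1200})
print(lag, "total:", sum(lag.values()))
```

A single partition with much higher lag than the others points toward a hot key or a stuck consumer rather than a group that is too small overall.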
Topic 6: Troubleshooting & Incident Response
6.1 Client connectivity and authentication failures
- Diagnose common connectivity failures caused by wrong advertised listeners or DNS/route issues.
- Differentiate TLS handshake failures from SASL authentication failures based on symptoms.
- Identify authorization failures (ACLs) vs authentication failures and choose the right remediation.
- Explain how misaligned security protocol configuration across listeners can break inter-broker replication.
- Given a scenario, choose the fastest safe way to validate connectivity (test client, logs, listener checks).
- Describe safe incident practice: don’t disable security controls broadly to ‘make it work’.
- Explain why configuration changes should be staged and validated with a controlled client.
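Telling TLS, SASL, and ACL failures apart often starts with the exception class a Java client reports. The sketch below maps a few exception class names from org.apache.kafka.common.errors to the layer they usually indicate. It is a first-pass triage aid under that assumption; the function name and layer labels are hypothetical, and other clients may surface different error names.

```python
# Exception class names from org.apache.kafka.common.errors, mapped to a layer.
LAYER_BY_EXCEPTION = {
    "SslAuthenticationException": "tls",
    "SaslAuthenticationException": "sasl-authentication",
    "TopicAuthorizationException": "acl-authorization",
    "GroupAuthorizationException": "acl-authorization",
    "ClusterAuthorizationException": "acl-authorization",
}


def likely_layer(client_log_line: str) -> str:
    """Return the layer suggested by the first known exception name in the line."""
    for exception_name, layer in LAYER_BY_EXCEPTION.items():
        if exception_name in client_log_line:
            return layer
    return "unknown"


print(likely_layer("org.apache.kafka.common.errors.GroupAuthorizationException"))
```

An "unknown" result still needs the listener and network checks described above, since timeouts and unreachable advertised listeners do not produce any of these exceptions.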
6.2 Replication issues, ISR shrink, and leader election problems
- Identify the most common causes of ISR shrink and replication lag (disk, network, broker instability).
- Explain why increasing timeouts can mask symptoms but not fix root causes.
- Describe safe steps to recover replication health after a broker failure (bring broker back, reassign, verify).
- Explain why unclean leader election is risky and how it trades durability for availability.
- Given a scenario, choose the safest remediation path for URP and leadership churn.
- Describe verification steps after recovery: ISR size, leader distribution, client error rates, lag trends.
- Explain why incident response should prioritize stabilizing the cluster before tuning performance.
6.3 Disk pressure, retention incidents, and topic sprawl
- Identify early warning signals for disk pressure and why it quickly cascades into replication issues.
- Explain safe immediate mitigations (free space, reduce ingestion, temporarily adjust retention) and risks.
- Describe how retention and compaction affect disk usage differently and how to pick the right lever.
- Explain why deleting log segments is not a safe manual fix and what policies should be used instead.
- Given a scenario, choose a remediation plan that restores safety first, then optimizes long-term settings.
- Describe how to prevent recurrence: capacity planning, guardrails, quotas, topic lifecycle governance.
- Explain why topic sprawl increases operational load and how naming/ownership policies help.
Topic 7: Maintenance, Upgrades & Ecosystem Operations
7.1 Safe maintenance: rolling restarts and configuration changes
- Explain why rolling restarts reduce downtime and how they preserve availability.
- Describe a safe rolling restart sequence and what to verify between broker restarts.
- Recognize when maintenance should be postponed due to cluster health (URP, disk pressure, controller churn).
- Explain why config changes should be small, reversible, and paired with validation steps.
- Given a scenario, choose the safest next action during maintenance when errors appear (pause, verify, rollback).
- Describe how to communicate and plan for lag/backlog during maintenance windows.
- Explain why automation should include guardrails and health checks before proceeding.
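A rolling restart with guardrails can be sketched as a loop that refuses to proceed while the cluster is unhealthy. The restart and health-check steps are passed in as callables because how they are implemented depends on the environment; every name here is hypothetical, and under-replicated partition count is used as the only health signal for brevity.

```python
from typing import Callable, Iterable


def rolling_restart(
    broker_ids: Iterable[int],
    restart: Callable[[int], None],
    under_replicated_count: Callable[[], int],
    wait: Callable[[], None] = lambda: None,
    max_checks: int = 30,
) -> list[int]:
    """Restart brokers one at a time, gating each step on replication health."""
    restarted = []
    for broker_id in broker_ids:
        # Guardrail: never take a broker down while partitions are under-replicated.
        if under_replicated_count() > 0:
            raise RuntimeError(f"cluster unhealthy before restarting broker {broker_id}")
        restart(broker_id)
        restarted.append(broker_id)
        # Wait for replication to recover before moving to the next broker.
        for _ in range(max_checks):
            if under_replicated_count() == 0:
                break
            wait()
        else:
            raise RuntimeError(f"replication did not recover after broker {broker_id}")
    return restarted
```

Raising instead of continuing corresponds to the "pause, verify, rollback" choice: the automation stops and leaves the decision to an operator.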
7.2 Upgrades and compatibility (high level)
- Explain why upgrades are typically performed in a rolling fashion to reduce impact.
- Describe the importance of version compatibility between brokers and clients at a high level.
- Identify why testing in lower environments is necessary before production upgrades.
- Explain how to evaluate upgrade risk: feature flags, protocol compatibility, and operational changes.
- Given a scenario, choose a safer upgrade approach (staged rollout, verification, backout plan).
- Describe post-upgrade verification: controller stability, replication health, client error rates, throughput.
- Recognize when ecosystem components (Connect, Schema Registry) require coordinated upgrades.
7.3 Confluent ecosystem services (admin awareness)
- Describe the purpose of Schema Registry and why schema governance reduces breaking changes.
- Explain Kafka Connect’s role and operational considerations (distributed mode, tasks, error handling) at a high level.
- Describe how Control Center (or monitoring tools) support visibility into cluster health and consumer lag.
- Recognize common operational failure modes for Connect and Schema Registry (connectivity, auth, schema incompatibility).
- Given a scenario, decide whether an issue is likely broker-side or ecosystem-service-side.
- Describe why multi-component security must be consistent (TLS/SASL) across Kafka and supporting services.
- Explain how incident response should include verifying dependencies, not just broker health.
Tip: After finishing a topic, take a 15–25 question drill focused on that area, then revisit weak objectives before moving on.