Use this syllabus as your source of truth for the CCAC exam. Work topic by topic, and drill practice questions after each section.
What’s covered
Topic 1: Confluent Cloud Fundamentals & Resource Model
Practice this topic →
1.1 Organizations, environments, and clusters
- Describe Confluent Cloud’s resource hierarchy (organization → environments → clusters and services).
- Explain why environments are used to limit blast radius (dev/test/prod) and enforce access boundaries.
- Identify the services that can exist within an environment (Kafka clusters, Schema Registry, governance tools, connectors).
- Explain the difference between environment-level and cluster-level configuration/permissions at a high level.
- Given a scenario, choose an environment strategy that supports isolation, compliance, and team ownership.
- Describe common naming/tagging conventions that improve multi-team operations (env, owner, cost center).
- Recognize how multi-region and multi-cloud designs typically map to multiple clusters and/or environments.
1.2 Service accounts, API keys, and identity basics
- Define a service account and explain why it is preferred over shared human credentials for automation.
- Explain how API keys map to service accounts and how keys are used for client authentication (see the client configuration sketch after this list).
- Describe safe credential handling practices (secret managers, rotation, least privilege).
- Differentiate authentication errors from authorization errors based on client symptoms at a high level.
- Given a scenario, choose the correct identity approach for an application, connector, or CI/CD pipeline.
- Describe key rotation patterns that avoid downtime (dual keys during cutover).
- Explain why per-application identities reduce blast radius and improve auditability.
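For context, here is a minimal sketch of how a per-application identity typically appears in client configuration, assuming the confluent-kafka Python client and credentials injected from a secret manager into environment variables (the variable names and topic are placeholders):

```python
import os
from confluent_kafka import Producer

# The API key/secret belong to a dedicated service account and are injected at
# runtime (e.g., from a secret manager), never hard-coded or shared between apps.
conf = {
    "bootstrap.servers": os.environ["KAFKA_BOOTSTRAP"],  # Confluent Cloud bootstrap endpoint
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": os.environ["KAFKA_API_KEY"],        # the key maps to the service account
    "sasl.password": os.environ["KAFKA_API_SECRET"],
    "client.id": "orders-service",                       # per-application identity aids auditability
}

producer = Producer(conf)
producer.produce("orders", key=b"order-123", value=b'{"status":"created"}')
producer.flush(10)
```

With this setup, rotating a key is a secret update plus a restart; keeping a second key valid during the cutover avoids downtime.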
1.3 Confluent CLI and operational workflow (awareness)
- Recognize the purpose of Confluent CLI for managing resources from automation and scripts.
- Describe a safe workflow for changes: plan → apply → verify → rollback if needed.
- Identify operations that should be tracked with change control (networking changes, RBAC changes, cluster linking).
- Explain why “verify first” is a core operating principle (health checks before and after changes); see the verification sketch after this list.
- Given a scenario, choose when to use UI vs CLI vs IaC tooling for repeatable operations.
- Describe audit-friendly practices: unique identities, least privilege, and explicit approvals for high-risk actions.
- Identify common operational documentation artifacts (runbooks, ownership, escalation paths).
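As a concrete illustration of “verify first”, here is a minimal sketch using the Kafka Admin API from the confluent-kafka Python client; the same before/after checks can be made with the Confluent CLI or UI, and the endpoint/credentials are placeholders:

```python
import os
from confluent_kafka.admin import AdminClient

admin = AdminClient({
    "bootstrap.servers": os.environ["KAFKA_BOOTSTRAP"],
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": os.environ["KAFKA_API_KEY"],
    "sasl.password": os.environ["KAFKA_API_SECRET"],
})

def health_snapshot():
    # Capture basic cluster state (topics and partition counts) to compare
    # before and after a change.
    metadata = admin.list_topics(timeout=10)
    return {name: len(t.partitions) for name, t in metadata.topics.items()}

before = health_snapshot()
# ... apply the planned change here (UI, CLI, or IaC) ...
after = health_snapshot()
assert set(before) <= set(after), "topics disappeared after the change"
```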
Topic 2: Cluster Provisioning, Topics, and Client Connectivity
Practice this topic →
2.1 Cluster types, sizing, and placement (conceptual)
- Describe the purpose of choosing cluster region and cloud provider based on latency and residency requirements.
- Explain why capacity planning includes throughput, partition count, replication, and retention considerations at a high level.
- Identify how cluster choice impacts cost and operational constraints (limits/quotas awareness).
- Given a scenario, choose a cluster placement strategy to minimize latency for producers and consumers.
- Explain how multi-region availability requirements commonly translate into multi-cluster designs.
- Describe why partitioning strategy and consumer parallelism impact perceived cluster capacity.
- Recognize that cost control often starts with right-sizing, retention, and connector usage patterns.
2.2 Topic fundamentals and operational constraints
- Explain how partitions determine consumer parallelism and the unit of ordering (per partition).
- Describe how replication and durability interact with producer acknowledgement settings at a high level.
- Differentiate retention-based topics from compacted topics and map each to common use cases (event log vs changelog).
- Identify why large messages and unbounded retention can create operational risk.
- Given a scenario, choose topic settings that balance durability, cost, and consumer performance.
- Explain why increasing partitions after a topic is in use can change key distribution behavior and ordering expectations.
- Recognize basic topic configuration levers operators should understand (retention, compaction, max message size); a topic-creation sketch follows this list.
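The sketch below shows the main levers at topic-creation time, assuming the confluent-kafka AdminClient; values are illustrative, and Confluent Cloud fixes some settings (such as replication factor) per cluster type:

```python
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "..."})  # auth settings as in the earlier sketches

topic = NewTopic(
    "orders",
    num_partitions=6,        # upper bound on consumer parallelism; ordering holds per partition
    replication_factor=3,    # durability; Confluent Cloud typically manages this for you
    config={
        "retention.ms": str(7 * 24 * 60 * 60 * 1000),  # 7 days; a direct storage-cost lever
        "cleanup.policy": "delete",                    # use "compact" for changelog-style topics
        "max.message.bytes": "1048576",                # guardrail against oversized messages
    },
)

futures = admin.create_topics([topic])
futures["orders"].result()  # raises if creation failed
```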
2.3 Client connectivity and endpoint patterns
- Explain how clients connect to clusters and why DNS and routing matter in private connectivity scenarios.
- Identify common causes of connectivity failures (wrong endpoint, allowlist issues, private DNS misconfiguration).
- Differentiate connection failures from TLS/SASL/auth failures based on symptoms at a high level.
- Describe the operational meaning of “public endpoint” vs “private endpoint” access (exposure and routing).
- Given a scenario, choose the safest connectivity approach that meets compliance requirements.
- Explain why clients should be configured for retries/timeouts and how that affects perceived availability (see the producer settings sketch after this list).
- Recognize that private networking often requires coordination with cloud networking teams (VPC/VNet/DNS ownership).
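Here is a sketch of producer-side retry and timeout settings with the confluent-kafka client; the values are illustrative and should be tuned to the application's latency and durability needs:

```python
from confluent_kafka import Producer

conf = {
    "bootstrap.servers": "...",      # connection/auth settings as in the earlier sketch
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "...",
    "sasl.password": "...",
    "enable.idempotence": True,      # retries without introducing duplicate writes
    "retries": 5,
    "retry.backoff.ms": 500,
    "request.timeout.ms": 30000,     # per-request wait on the broker
    "message.timeout.ms": 120000,    # total delivery budget (delivery.timeout.ms in the Java client)
}

producer = Producer(conf)

def on_delivery(err, msg):
    # Surface delivery failures explicitly instead of assuming availability.
    if err is not None:
        print(f"delivery failed for {msg.topic()}[{msg.partition()}]: {err}")

producer.produce("orders", value=b"payload", on_delivery=on_delivery)
producer.flush(30)
```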
Topic 3: Networking & Private Connectivity
Practice this topic →
3.1 Private connectivity options (high level)
- Differentiate public internet access from private connectivity options conceptually.
- Explain PrivateLink/private service access at a high level and why it reduces public exposure.
- Describe VPC/VNet peering at a high level and identify common pitfalls (IP overlap, route propagation).
- Recognize that private connectivity often implies private DNS and split-horizon resolution.
- Given a scenario, choose the private connectivity option that best fits enterprise routing constraints.
- Explain why private connectivity affects tooling and troubleshooting (different endpoints, DNS, firewall rules).
- Identify the operational steps that typically require coordination across teams (cloud networking, security).
3.2 IP allowlists and controlled exposure
- Explain the purpose of IP allowlists as a control for public endpoints.
- Describe the operational risk of over-broad allowlists and how to reduce blast radius.
- Identify how NAT and egress IPs affect allowlist design for applications running in cloud environments.
- Given a scenario, determine why an application cannot connect due to allowlist constraints and select the fix.
- Explain why allowlists do not replace encryption and authentication controls (defense in depth).
- Describe change-management best practices for allowlist updates (staging, verification, rollback).
- Recognize how allowlists interact with managed connectors and third-party integrations.
3.3 DNS, routing, and troubleshooting connectivity
- Explain why private DNS is required to resolve private endpoints for Kafka brokers.
- Identify the most common DNS failure pattern: private endpoint created but clients still resolve public addresses.
- Explain how routing/firewall rules can block private traffic even when DNS resolves correctly.
- Given a scenario, choose the correct troubleshooting sequence (DNS → routing → TLS/auth → RBAC).
- Describe how to validate connectivity using a controlled test client inside the target network boundary (see the reachability sketch after this list).
- Explain why changing DNS can have broad impact and should be treated as a high-risk change.
- Recognize that multi-region private networking introduces additional complexity (cross-region routing and DNS).
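A minimal reachability sketch to run from a test host inside the target network boundary, using only the Python standard library (the hostname is a placeholder for your bootstrap endpoint). It separates “does DNS resolve, and to what?” from “can I open a TCP connection?”, mirroring the DNS → routing → TLS/auth order above:

```python
import socket

BOOTSTRAP_HOST = "pkc-xxxxx.region.provider.confluent.cloud"  # placeholder bootstrap hostname
BOOTSTRAP_PORT = 9092

# Step 1: DNS. With private connectivity this should return private addresses;
# resolving public IPs here is the classic split-horizon DNS miss.
addrs = {info[4][0] for info in socket.getaddrinfo(BOOTSTRAP_HOST, BOOTSTRAP_PORT, proto=socket.IPPROTO_TCP)}
print("resolved:", addrs)

# Step 2: routing/firewall. A TCP connect failure despite correct DNS points at
# routes, security groups, or firewalls rather than name resolution.
for addr in addrs:
    try:
        with socket.create_connection((addr, BOOTSTRAP_PORT), timeout=5):
            print(f"TCP connect OK to {addr}:{BOOTSTRAP_PORT}")
    except OSError as exc:
        print(f"TCP connect FAILED to {addr}:{BOOTSTRAP_PORT}: {exc}")

# TLS/SASL and authorization checks come after this, using a real Kafka client.
```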
Topic 4: Security, RBAC, and Governance
Practice this topic →
4.1 RBAC roles, scopes, and least privilege
- Explain the purpose of RBAC and why scope matters (org vs environment vs cluster).
- Differentiate common operator needs: read-only, operator, and admin capabilities (high level).
- Describe least-privilege principles for producers, consumers, connectors, and administrators.
- Identify common authorization failure symptoms and which role binding is likely missing (conceptual).
- Given a scenario, choose the minimum permissions required to accomplish an operational task safely.
- Explain why shared admin accounts increase blast radius and how to separate duties with role bindings.
- Describe governance guardrails: environment separation, role reviews, and key rotation policies.
4.2 Schema discipline and Stream Governance mindset
- Explain why schema discipline reduces breaking changes in shared topics.
- Differentiate schema compatibility modes at a conceptual level (backward/forward/full).
- Identify which schema changes are typically safe (additive fields with defaults) vs risky (removals/type changes); see the example after this list.
- Describe why governance is easier when applied consistently per environment (shared rules).
- Given a scenario, choose a governance approach to reduce consumer breakage (compatibility settings, approvals).
- Explain how catalog/lineage awareness supports impact analysis and operational debugging.
- Recognize that governance includes naming standards, ownership metadata, and lifecycle decisions.
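As a small illustration, here are two versions of a hypothetical Avro value schema written as Python dicts. Adding an optional field with a default is the classic safe, additive change under backward/full compatibility; removing an existing field or changing its type is the classic risky one:

```python
# Version 1 of a hypothetical "Order" value schema.
order_v1 = {
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "amount", "type": "double"},
    ],
}

# Version 2: additive change with a default. Old records can still be read
# (the default fills the gap), and existing consumers simply ignore the new field.
order_v2 = {
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "amount", "type": "double"},
        {"name": "currency", "type": "string", "default": "USD"},  # safe, additive
    ],
}

# Risky changes that commonly break consumers or fail compatibility checks:
#   - removing "amount"
#   - changing "amount" from double to string
#   - adding a new field with no default
```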
4.3 Credential hygiene and incident response
- Describe safe storage of API keys and secrets (secret managers, least access, rotation).
- Explain how to rotate keys safely with minimal downtime (overlapping validity).
- Identify incident steps when a key is suspected compromised (revoke/rotate, audit usage, narrow permissions).
- Differentiate a connectivity incident from an authorization incident and select immediate safe mitigations.
- Given a scenario, choose a safe containment action that reduces blast radius without breaking all traffic.
- Describe audit-friendly operations: unique identities, short-lived access where possible, and change logs.
- Recognize when to involve security/network teams (private connectivity, allowlists, policy changes).
Topic 5: Managed Connectors & Integrations
Practice this topic →
5.1 Connector fundamentals (sources, sinks, and configuration)
- Differentiate source connectors from sink connectors and match each to common integration needs.
- Describe connector configuration requirements at a high level (credentials, topics, converters/serialization); an illustrative config follows this list.
- Explain why managed connectors reduce ops burden but do not eliminate data, auth, or networking constraints.
- Identify common connector risks: throughput caps, destination throttling, and schema incompatibility.
- Given a scenario, choose a connector-based solution vs custom ingestion code.
- Describe safe connector deployment practices: least privilege credentials, staged rollout, and monitoring.
- Recognize how connector retries and error tolerance settings affect downstream correctness.
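As an illustration of the configuration categories above, here is the rough shape of a managed sink connector config written as a Python dict; exact property names vary by connector, so treat these as placeholders:

```python
# Hypothetical managed sink connector config -- property names vary per connector.
connector_config = {
    "name": "orders-s3-sink",
    "connector.class": "S3_SINK",               # which managed connector to run (illustrative)
    "topics": "orders",                         # source topics for a sink connector
    "tasks.max": "1",                           # parallelism; also a throughput/cost lever
    "kafka.api.key": "<service-account key>",   # dedicated, least-privilege identity
    "kafka.api.secret": "<from secret manager>",
    "input.data.format": "AVRO",                # must match producers / Schema Registry
    # Destination credentials, bucket/table names, flush sizes, error tolerance,
    # and dead-letter settings are connector-specific and go here.
}
```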
5.2 Troubleshooting connector failures (auth, network, data)
- Differentiate connector failures caused by authentication/authorization from those caused by networking.
- Identify data/serialization failures (schema mismatch, converter errors) and choose remediation steps.
- Explain why DNS and private networking issues commonly break connectors that write to or read from private destinations.
- Given a scenario, choose the first diagnostic step based on the observed error (task logs, status, metrics).
- Describe strategies to handle poison messages (dead-letter topics) conceptually.
- Recognize how destination backpressure can cause lag and retries without a full connector crash.
- Explain why secrets rotation and credential expiry can surface as sudden connector failures.
5.3 Governance and operational guardrails for connectors
- Describe how to standardize connector ownership and lifecycle (who owns failures and costs).
- Explain why connectors should run under dedicated service accounts with scoped permissions.
- Identify cost control levers for connectors (throughput, polling intervals, topic retention).
- Given a scenario, choose a connector strategy that reduces risk to shared clusters (quotas, limits, isolation).
- Explain why schema governance matters for sink connectors (downstream tables/contracts).
- Recognize how connector changes should follow change management and staged rollout patterns.
- Describe why monitoring should include both connector health and target system health.
Topic 6: Cluster Linking & Multi-Cluster Architectures
Practice this topic →
6.1 Cluster Linking concepts and use cases
- Describe what Cluster Linking provides conceptually (replication between clusters).
- Differentiate common use cases: disaster recovery, multi-region reads, and multi-cloud topologies.
- Explain why Cluster Linking is often preferred to custom replication pipelines for platform-managed replication.
- Identify operational considerations: monitoring link health, lag, and topic replication status.
- Given a scenario, decide whether Cluster Linking or application-level dual writes is the better approach.
- Describe how identity, networking, and permissions impact link setup and operations.
- Recognize that linking is not a full application failover plan; clients still need a switching strategy.
6.2 DR patterns and failover planning
- Describe active-passive vs active-active patterns at a conceptual level for streaming platforms.
- Explain the difference between replicating data and replicating applications/state.
- Identify which requirements drive DR design: RTO/RPO, compliance, latency, and cost.
- Given a scenario, choose a failover strategy that matches business requirements and operational realities.
- Describe how to test failover without causing data duplication or consumer confusion (planned drills).
- Recognize how schema governance and topic naming impact multi-cluster consistency.
- Explain why monitoring and runbooks are essential for DR readiness.
6.3 Multi-cluster operations and consistency
- Describe how to standardize environments and governance across clusters to reduce drift.
- Explain why access control and key management must be consistent across clusters and regions.
- Identify common failure modes: networking changes, permission changes, quota throttling, and link instability.
- Given a scenario, choose the smallest safe change to restore replication health (avoid broad changes).
- Describe how to manage costs in multi-cluster designs (retain only necessary topics, right-size).
- Recognize that private networking and DNS complexity increases with multi-region designs.
- Explain how to coordinate changes and incident response across teams and regions.
Topic 7: Monitoring, Troubleshooting, and Cost Controls
Practice this topic →
7.1 Operational signals and health checks
- Identify core operational signals for clusters (throughput, latency, error rates) conceptually.
- Explain why consumer lag is a symptom rather than a root cause, and list common underlying causes (slow processing, too few partitions, backpressure); see the lag-check sketch after this list.
- Describe common Confluent Cloud incident categories: auth failures, networking failures, connector failures, quota/limit issues.
- Given a scenario, choose a triage order that reduces time to resolution (scope, isolate, validate).
- Describe verification steps after changes: confirm connectivity, permissions, and stable throughput.
- Recognize when to escalate to networking/security teams (private connectivity, allowlists, policy).
- Explain why runbooks and ownership metadata reduce incident duration.
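For ad-hoc validation (platform metrics should be the first stop), here is a minimal lag-check sketch with the confluent-kafka client; the group and topic names are hypothetical:

```python
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "...",       # plus the usual SASL_SSL / API-key settings
    "group.id": "orders-processor",   # hypothetical consumer group
    "enable.auto.commit": False,
})

topic = "orders"
metadata = consumer.list_topics(topic, timeout=10)
partitions = [TopicPartition(topic, p) for p in metadata.topics[topic].partitions]

# Lag per partition = latest offset on the broker minus the group's committed offset.
for tp in consumer.committed(partitions, timeout=10):
    low, high = consumer.get_watermark_offsets(tp, timeout=10)
    lag = (high - tp.offset) if tp.offset >= 0 else (high - low)  # no commit yet => whole backlog
    print(f"{tp.topic}[{tp.partition}] lag={lag}")

consumer.close()
```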
7.2 Quotas, limits, and safe throttling behavior
- Explain why quotas/limits exist and how they protect shared platform stability.
- Identify symptoms of throttling or quota enforcement (increased errors, retries, lag).
- Describe safe approaches to reducing load: backoff, batching, scaling consumers, and right-sizing.
- Given a scenario, choose a remediation plan that reduces load without introducing data loss or excessive duplicates.
- Explain why aggressive retries can amplify incidents and how exponential backoff with jitter reduces retry storms (see the backoff sketch after this list).
- Recognize how connector throughput and destination throttling can mimic platform throttling symptoms.
- Describe the importance of communicating expected backlogs during mitigation and recovery.
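A sketch of exponential backoff with jitter around a retryable operation; `send_batch` and the limits are hypothetical, and the point is that spacing out retries prevents the synchronized retry storms described above:

```python
import random
import time

def with_backoff(operation, max_attempts=6, base_delay=0.5, max_delay=30.0):
    """Retry a throttled or failed operation with exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:  # in real code, catch only retryable/throttling errors
            if attempt == max_attempts:
                raise
            # Delay grows exponentially but is capped, and full jitter keeps many
            # clients from retrying in lockstep.
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** (attempt - 1)))
            print(f"attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            time.sleep(delay)

# Usage with a hypothetical producer call that may hit quota throttling:
# with_backoff(lambda: send_batch(records))
```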
7.3 Cost management and operational hygiene
- Identify major cost drivers: throughput, retention/storage, connectors, and multi-cluster replication.
- Explain how retention policy choices directly affect storage costs and operational risk.
- Describe how to prevent cost surprises: environment separation, ownership, budgets, and regular review.
- Given a scenario, choose the most cost-effective change that maintains required reliability and security.
- Explain why deleting data or reducing retention is high-risk and should be done with governance and approvals.
- Recognize that operational hygiene includes key rotation, permission review, and periodic connector audits.
- Describe how standardization (naming, ownership, schemas) reduces both cost and incident frequency.
Tip: After finishing a topic, take a 15–25 question drill focused on that area, then revisit weak objectives before moving on.