Use this syllabus as your source of truth for DEA-C01. Work through each domain in order and drill targeted sets after every task.
What’s covered
Domain 1: Data Ingestion and Transformation (34%)
Practice this topic →
Task 1.1 - Perform data ingestion
- Evaluate throughput and latency requirements and select an ingestion approach that meets service limits and operational needs.
- Differentiate streaming and batch ingestion patterns and choose based on frequency, historical backfill needs, and timeliness requirements.
- Design replayable ingestion pipelines (for backfills and reprocessing) and apply idempotency principles to avoid duplicate processing.
- Differentiate stateful and stateless ingestion transactions and understand implications for ordering, deduplication, and checkpoints.
- Ingest streaming data using services such as Amazon Kinesis, Amazon MSK, DynamoDB Streams, AWS DMS, AWS Glue, and Amazon Redshift, based on the source type and its constraints.
- Ingest batch data using services such as Amazon S3, AWS Glue, Amazon EMR, AWS DMS, Amazon Redshift, AWS Lambda, and Amazon AppFlow, based on the integration pattern.
- Configure batch ingestion options such as scheduled runs, incremental loads, partitioning, and checkpoints to support repeatability and performance.
- Consume data APIs safely by handling pagination, throttling, retries, and authentication while maintaining traceability.
- Set up schedulers using Amazon EventBridge, Apache Airflow (Amazon MWAA), or time-based schedules for ingestion jobs and AWS Glue crawlers.
- Use event triggers such as Amazon S3 Event Notifications and EventBridge rules to start ingestion and downstream processing.
- Invoke AWS Lambda from Amazon Kinesis and design fan-in/fan-out distribution patterns for streaming data pipelines.
- Implement secure connectivity (for example, IP allowlists) and manage throttling and rate limits for services such as DynamoDB, Amazon RDS, and Amazon Kinesis (see the producer sketch after this list).
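The throttling and idempotency points above are easiest to see in code. Below is a minimal Python (boto3) sketch, assuming a hypothetical `orders-stream` stream and an `order_id` field used as the partition key; only the records Kinesis rejects are retried, with capped exponential backoff.

```python
import json
import time

import boto3

kinesis = boto3.client("kinesis")


def put_batch_with_retry(records, stream_name="orders-stream", max_attempts=5):
    """Write a batch to Kinesis, retrying only the entries that were throttled."""
    entries = [
        {"Data": json.dumps(r).encode("utf-8"), "PartitionKey": str(r["order_id"])}
        for r in records
    ]
    for attempt in range(1, max_attempts + 1):
        response = kinesis.put_records(StreamName=stream_name, Records=entries)
        if response["FailedRecordCount"] == 0:
            return
        # Keep only the failed entries (typically throughput-exceeded errors).
        entries = [
            entry
            for entry, result in zip(entries, response["Records"])
            if "ErrorCode" in result
        ]
        time.sleep(min(2 ** attempt, 30))  # exponential backoff with a cap
    raise RuntimeError(f"{len(entries)} records still failing after {max_attempts} attempts")
```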
Task 1.2 - Transform and process data
- Design ETL pipeline steps based on business requirements, target schemas, and downstream analytics or operational needs.
- Select transformation strategies based on data volume, velocity, and variety (structured, semi-structured, unstructured) and operational constraints.
- Apply distributed computing concepts and use Apache Spark to process large datasets efficiently.
- Select intermediate data staging locations (for example, Amazon S3 or temporary tables) to support multi-step processing and recoverability.
- Optimize container usage for performance needs using Amazon EKS or Amazon ECS and select compute sizing and scaling strategies.
- Connect to data sources using JDBC/ODBC connectors and manage network access and credentials securely.
- Integrate data from multiple sources and handle joins, deduplication, normalization, and schema alignment across systems.
- Optimize costs while processing data by choosing the right compute and execution model (serverless vs provisioned, scaling, and purchasing options).
- Select and use transformation services such as Amazon EMR, AWS Glue, AWS Lambda, and Amazon Redshift based on workload requirements.
- Transform data between formats (for example, CSV to Parquet) and apply partitioning and compression for analytics performance (see the PySpark sketch after this list).
- Troubleshoot and debug common transformation failures and performance issues such as data skew, memory pressure, and small-file problems.
- Create data APIs to make data available to other systems using AWS services while managing schema/version changes.
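As referenced in the CSV-to-Parquet bullet, here is a minimal PySpark sketch. The bucket paths and the `event_timestamp`/`event_date` columns are placeholders; the same pattern runs on Amazon EMR or, with the GlueContext wrapper, as an AWS Glue Spark job.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

raw = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("s3://example-raw-bucket/sales/")
)

curated = raw.withColumn("event_date", F.to_date("event_timestamp"))

(
    curated.repartition("event_date")      # reduce small files per partition
    .write.mode("overwrite")
    .partitionBy("event_date")             # enables partition pruning for consumers
    .option("compression", "snappy")       # splittable, analytics-friendly compression
    .parquet("s3://example-curated-bucket/sales/")
)
```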
Task 1.3 - Orchestrate data pipelines
- Integrate AWS services to build end-to-end data pipelines with explicit dependencies, retries, and repeatable outcomes.
- Apply event-driven architecture to trigger pipeline steps using services such as Amazon EventBridge and Amazon S3 notifications.
- Configure pipelines to run on schedules or dependencies, including managing reruns and backfills.
- Choose serverless workflow patterns and determine when to use AWS Step Functions, AWS Glue workflows, or AWS Lambda orchestration.
- Build orchestration workflows using services such as AWS Lambda, Amazon MWAA, AWS Step Functions, AWS Glue workflows, and Amazon EventBridge.
- Implement notifications and alerting for pipeline events using Amazon SNS and Amazon SQS and integrate with failure handling.
- Design pipelines for performance, availability, scalability, resiliency, and fault tolerance using retries, idempotency, and isolation.
- Implement and maintain serverless workflows with correct timeouts, error handling, and state management (see the state machine sketch after this list).
- Define and monitor pipeline SLAs such as data freshness and latency, and configure alarms for breaches.
- Control parallelism and fan-out patterns safely using workflow constructs (for example, Step Functions Map/Parallel or Airflow task concurrency).
- Implement checkpointing and understand at-least-once vs exactly-once trade-offs across streaming and batch steps.
- Secure orchestration components with least-privilege roles and controlled network access for data services and endpoints.
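For the serverless-workflow bullet above, a minimal sketch of a Step Functions state machine with a timeout, retries, and a failure notification, created with boto3. The Glue job name, role ARN, and SNS topic ARN are placeholders.

```python
import json

import boto3

definition = {
    "Comment": "Run a Glue job with retries and a failure notification",
    "TimeoutSeconds": 3600,
    "StartAt": "RunGlueJob",
    "States": {
        "RunGlueJob": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "curate-sales"},
            "Retry": [
                {
                    "ErrorEquals": ["States.TaskFailed"],
                    "IntervalSeconds": 60,
                    "MaxAttempts": 2,
                    "BackoffRate": 2.0,
                }
            ],
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
            "End": True,
        },
        "NotifyFailure": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {
                "TopicArn": "arn:aws:sns:us-east-1:123456789012:pipeline-alerts",
                "Message.$": "$.Error",
            },
            "End": True,
        },
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="sales-pipeline",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/sfn-pipeline-role",
)
```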
Task 1.4 - Apply programming concepts
- Implement CI/CD for data pipelines, including automated testing, packaging, and safe deployment practices.
- Write SQL queries for data extraction and transformation, including joins and multi-step transformations that support pipeline requirements.
- Optimize SQL queries using techniques such as predicate pushdown, partition pruning, and avoiding expensive joins when possible.
- Use infrastructure as code (AWS CloudFormation or AWS CDK) to deploy repeatable data pipeline infrastructure.
- Apply distributed computing concepts to optimize data pipeline code and avoid bottlenecks in parallel processing.
- Use appropriate data structures and algorithms (for example, graph and tree structures) when modeling or traversing complex data relationships.
- Optimize code to reduce runtime and resource usage for ingestion and transformation, including efficient I/O and batching.
- Configure AWS Lambda functions for concurrency and performance (memory, timeout, concurrency limits) in data pipeline workloads (see the sketch after this list).
- Use Amazon Redshift stored procedures or SQL UDFs to encapsulate transformations and improve repeatability.
- Use Git commands and workflows (clone, branch, merge, tag) to manage pipeline code changes safely.
- Package and deploy serverless data pipelines using AWS SAM, including Lambda functions, Step Functions state machines, and DynamoDB tables.
- Mount and use storage volumes from within Lambda functions (for example, Amazon EFS) when required and understand constraints and best practices.
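A minimal boto3 sketch for the Lambda tuning bullet above, assuming a hypothetical `enrich-orders` function; the memory, timeout, and concurrency values are illustrative and should be sized from observed metrics.

```python
import boto3

lam = boto3.client("lambda")

# Right-size memory and timeout for the transformation workload.
lam.update_function_configuration(
    FunctionName="enrich-orders",
    MemorySize=1024,   # more memory also allocates more CPU to the function
    Timeout=120,       # seconds; keep below any upstream queue visibility timeout
)

# Cap concurrency so a burst of events cannot overwhelm a downstream database.
lam.put_function_concurrency(
    FunctionName="enrich-orders",
    ReservedConcurrentExecutions=20,
)
```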
Domain 2: Data Store Management (26%)
Practice this topic →
Task 2.1 - Choose a data store
- Compare storage platforms (object, file, relational, NoSQL, streaming) and explain how their characteristics affect data engineering designs.
- Select AWS data stores and configurations that meet performance demands such as throughput, concurrency, and latency.
- Choose data storage formats (for example, CSV, TXT, Parquet) and apply compression and partitioning aligned to access patterns.
- Align data storage choices with migration requirements, including cutover approaches, replication needs, and data movement constraints.
- Determine appropriate storage solutions for access patterns such as point lookups, scans, time-series, OLTP, OLAP, and streaming.
- Manage locks and concurrency controls (for example, in Amazon Redshift and Amazon RDS) and prevent contention in multi-user environments.
- Implement appropriate storage services for cost and performance requirements (for example, Amazon Redshift, Amazon EMR, AWS Lake Formation, Amazon RDS, DynamoDB, Amazon Kinesis Data Streams, Amazon MSK).
- Configure storage services for access pattern requirements (for example, Redshift distribution/sort design, DynamoDB partition key design, and S3 partition layout); see the DynamoDB sketch after this list.
- Apply Amazon S3 to appropriate use cases such as data lakes, staging layers, and durable storage of curated datasets.
- Integrate migration tools such as AWS Transfer Family into data movement and ingestion workflows.
- Implement data migration or remote access methods such as Amazon Redshift federated queries, materialized views, and Redshift Spectrum.
- Design hybrid and multi-store architectures with clear responsibilities (for example, lake for raw/curated data, warehouse for BI, stream for real-time ingestion).
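To make the access-pattern configuration bullet concrete, a minimal boto3 sketch that creates a DynamoDB table keyed for "all events for a device, ordered by time". The table and attribute names are hypothetical.

```python
import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.create_table(
    TableName="device-events",
    AttributeDefinitions=[
        {"AttributeName": "device_id", "AttributeType": "S"},
        {"AttributeName": "event_ts", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "device_id", "KeyType": "HASH"},   # partition key
        {"AttributeName": "event_ts", "KeyType": "RANGE"},   # sort key for time-range queries
    ],
    BillingMode="PAY_PER_REQUEST",  # on-demand capacity for spiky ingestion
)
```

The same reasoning applies to Redshift distribution/sort keys and S3 prefix layout: derive the physical design from the dominant query pattern, not the other way around.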
Task 2.2 - Understand data cataloging systems
- Explain the purpose of a data catalog and how it enables discovery, governance, and consistent schema usage across analytics services.
- Create and maintain a data catalog, including databases and tables, and keep metadata consistent as data changes.
- Classify data based on requirements and use metadata to support access controls and governance workflows.
- Identify key components of metadata and catalogs (schemas, partitions, tags, lineage) and how they are used by query engines.
- Build and reference catalogs using AWS Glue Data Catalog or Apache Hive metastore for consistent schema discovery.
- Discover schemas and populate data catalogs using AWS Glue crawlers and apply controls to manage schema drift (see the crawler sketch after this list).
- Synchronize partitions with a data catalog and ensure new partitions are available for query engines promptly.
- Create and manage source and target connections for cataloging using AWS Glue connections and appropriate network settings.
- Enable catalog-driven consumption for services such as Athena, Amazon EMR, and Redshift Spectrum while maintaining schema consistency.
- Troubleshoot catalog issues such as stale partitions, crawler misclassification, and insufficient permissions.
- Apply governance controls through catalog metadata using Lake Formation permissions and LF-tags.
- Version and document schemas and catalog changes to support reproducibility, collaboration, and audits.
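For the crawler and schema-drift bullet, a minimal boto3 sketch. The role ARN, database, and S3 path are placeholders; setting the schema change policy to LOG surfaces drift for review instead of silently rewriting the catalog schema.

```python
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="sales-raw-crawler",
    Role="arn:aws:iam::123456789012:role/glue-crawler-role",
    DatabaseName="sales_raw",
    Targets={"S3Targets": [{"Path": "s3://example-raw-bucket/sales/"}]},
    SchemaChangePolicy={
        "UpdateBehavior": "LOG",   # report drift instead of mutating the schema
        "DeleteBehavior": "LOG",
    },
    Schedule="cron(0 2 * * ? *)",  # nightly run to register new partitions
)

glue.start_crawler(Name="sales-raw-crawler")
```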
Task 2.3 - Manage the lifecycle of data
- Choose storage solutions that meet hot and cold data requirements and design tiered storage based on access frequency.
- Optimize storage cost across the data lifecycle using mechanisms such as storage class transitions, compression, and partitioning.
- Define retention policies and archiving strategies aligned to business requirements and legal obligations.
- Delete and expire data to meet business and legal requirements, including implementing automated expiry processes.
- Protect data with appropriate resiliency and availability using strategies such as versioning, replication, and backups.
- Perform load and unload operations to move data between Amazon S3 and Amazon Redshift using efficient patterns.
- Manage S3 Lifecycle policies to transition objects across storage tiers and enforce retention policies (see the lifecycle sketch after this list).
- Expire data when it reaches a specific age using S3 Lifecycle policies and validate outcomes against retention requirements.
- Manage S3 versioning and understand implications for restore scenarios, rollback, and ongoing storage cost.
- Use DynamoDB TTL to expire data and design applications that handle eventual deletions safely.
- Manage lifecycle policies for intermediate and derived datasets to avoid accumulating stale data and unnecessary costs.
- Apply governance controls to lifecycle operations, including change control, approvals, and audit logs for retention policy changes.
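A minimal boto3 sketch for the lifecycle bullet above: one rule that tiers, expires, and cleans up noncurrent versions under a prefix. The bucket, prefix, and day counts are illustrative and should come from your retention requirements.

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-data-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "raw-zone-tiering",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},  # delete after the retention window
                "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
            }
        ]
    },
)
```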
Task 2.4 - Design data models and schema evolution
- Apply data modeling concepts (normalized/denormalized, star/snowflake) and select a design based on query patterns and constraints.
- Model structured, semi-structured, and unstructured data and decide when to use schema-on-write versus schema-on-read.
- Design schemas for Amazon Redshift, DynamoDB, and lake-based tables (governed by Lake Formation) aligned to access patterns.
- Use indexing, partitioning, sort/distribution strategies, and compression to optimize performance and cost (see the Redshift DDL sketch after this list).
- Plan schema evolution techniques (additive changes, versioning, backfills) and avoid breaking downstream consumers.
- Address changes to the characteristics of data (volume, cardinality, skew) and update data models to preserve performance and correctness.
- Establish data lineage to ensure accuracy and trustworthiness of data and support auditability.
- Perform schema conversion when migrating databases using AWS Schema Conversion Tool (AWS SCT) and AWS DMS schema conversion.
- Manage schema drift with controlled discovery (for example, AWS Glue crawlers) and compatibility policies for consumers.
- Use lineage tools to track transformations and provenance (for example, Amazon SageMaker ML Lineage Tracking) and document pipeline metadata.
- Design backward and forward compatibility approaches for schema changes and validate with consumer contract testing.
- Document data models and schema evolution policies to support collaboration, governance, and predictable upgrades.
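For the distribution/sort-key and schema-evolution bullets, a minimal sketch using the Amazon Redshift Data API. The workgroup, database, table, and columns are placeholders; the second statement is additive (a nullable column), so existing consumers keep working.

```python
import boto3

rsd = boto3.client("redshift-data")

ddl = """
CREATE TABLE IF NOT EXISTS sales_fact (
    sale_id      BIGINT,
    customer_id  BIGINT,
    sale_date    DATE,
    amount       DECIMAL(12,2)
)
DISTSTYLE KEY
DISTKEY (customer_id)   -- co-locate rows that are joined on customer_id
SORTKEY (sale_date)     -- prune blocks for date-range queries
"""

rsd.execute_statement(
    WorkgroupName="analytics-wg",   # or ClusterIdentifier=... for a provisioned cluster
    Database="dev",
    Sql=ddl,
)

# Additive schema evolution: a new nullable column does not break existing consumers.
rsd.execute_statement(
    WorkgroupName="analytics-wg",
    Database="dev",
    Sql="ALTER TABLE sales_fact ADD COLUMN channel VARCHAR(32);",
)
```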
Domain 3: Data Operations and Support (22%)
Practice this topic →
Task 3.1 - Automate data processing by using AWS services
- Maintain and troubleshoot automated data processing workflows to ensure repeatable business outcomes.
- Use API calls and SDKs to automate data processing operations and integrate programmatic control into pipelines.
- Identify which AWS services accept scripting (for example, Amazon EMR, Amazon Redshift, AWS Glue) and integrate scripts into workflows safely.
- Orchestrate data pipelines using services such as Amazon MWAA and AWS Step Functions to coordinate processing, retries, and dependencies.
- Troubleshoot managed workflows (for example, Amazon MWAA environments) and identify common failure modes in orchestration and scheduling.
- Use AWS service features to process data (for example, Amazon EMR, Amazon Redshift, AWS Glue) and select the best service for the workload.
- Consume and maintain data APIs, including versioning and access controls that enable stable downstream integrations.
- Prepare data transformations using tools such as AWS Glue DataBrew and integrate outputs into downstream analytics.
- Query data using Amazon Athena and create repeatable datasets using views or CTAS patterns.
- Use AWS Lambda to automate data processing and connect event-driven triggers to processing steps.
- Manage events and schedulers using Amazon EventBridge and integrate triggers for batch and stream processing.
- Implement automation guardrails (idempotency, retries, backoff, and rate-limit handling) to prevent duplicate or runaway processing, as sketched below.
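A minimal sketch of the idempotency guardrail mentioned above: a DynamoDB conditional write claims each S3 object key exactly once, so duplicate event deliveries and reruns are skipped. The table name and event shape assume an S3 Event Notification trigger and are illustrative.

```python
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("processed-objects")


def already_processed(object_key: str) -> bool:
    """Return True if this key was handled before; otherwise claim it atomically."""
    try:
        table.put_item(
            Item={"object_key": object_key},
            ConditionExpression="attribute_not_exists(object_key)",
        )
        return False
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return True
        raise


def handler(event, context):
    # S3 Event Notification payload: one record per created object.
    for record in event["Records"]:
        key = record["s3"]["object"]["key"]
        if already_processed(key):
            continue  # duplicate delivery or rerun; skip safely
        # ... start the actual processing (Glue job, Step Functions execution, etc.)
```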
Task 3.2 - Analyze data by using AWS services
- Choose between provisioned and serverless analytics services based on concurrency needs, workload variability, cost, and operational overhead.
- Write SQL queries with joins, filters, and window functions to satisfy analysis requirements while controlling cost and runtime.
- Apply cleansing techniques appropriately and document assumptions to ensure repeatable and auditable analysis.
- Perform aggregations, rolling averages, grouping, and pivoting to create analysis-ready datasets.
- Use Amazon Athena to query data and create reusable assets such as views and CTAS outputs (see the CTAS sketch after this list).
- Use Athena notebooks with Apache Spark to explore data interactively and validate data assumptions.
- Visualize data using AWS services such as Amazon QuickSight and design dashboards aligned to business questions.
- Use AWS Glue DataBrew for profiling, visualizing, and preparing datasets during analysis.
- Verify and clean data using services and tools such as AWS Lambda, Athena, QuickSight, Jupyter notebooks, and Amazon SageMaker Data Wrangler.
- Optimize query performance and cost using file formats, partition pruning, statistics, and CTAS patterns where appropriate.
- Select the appropriate analysis engine (Athena vs Redshift vs EMR) based on latency needs, scale, governance controls, and integration requirements.
- Share analysis outputs safely by controlling access to derived datasets and applying least-privilege patterns for consumers.
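For the Athena CTAS bullet, a minimal boto3 sketch that materializes a partitioned, compressed Parquet dataset for reuse. Database, table, bucket, and workgroup names are placeholders; the WHERE clause illustrates pruning on the source table.

```python
import boto3

athena = boto3.client("athena")

ctas = """
CREATE TABLE curated.daily_sales
WITH (
    format = 'PARQUET',
    write_compression = 'SNAPPY',
    external_location = 's3://example-curated-bucket/daily_sales/',
    partitioned_by = ARRAY['sale_date']
) AS
SELECT customer_id,
       SUM(amount) AS total_amount,
       sale_date                          -- partition column must come last
FROM raw.sales
WHERE sale_date >= DATE '2024-01-01'
GROUP BY customer_id, sale_date
"""

athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={"Database": "curated"},
    WorkGroup="primary",
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
```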
Task 3.3 - Maintain and monitor data pipelines
- Implement application logging for data pipelines and define what operational data to capture (job metadata, row counts, failures).
- Log access to AWS services used in pipelines using AWS CloudTrail and integrate events into audit and incident workflows.
- Monitor pipeline health using Amazon CloudWatch metrics, logs, and alarms and define actionable alert thresholds (see the freshness alarm sketch after this list).
- Apply best practices for performance tuning and validate improvements using observable metrics and controlled changes.
- Deploy logging and monitoring solutions that enable traceability across ingestion, transformation, and consumption steps.
- Use notification services (Amazon SNS and Amazon SQS) to send alerts and integrate with operational runbooks.
- Troubleshoot performance issues using tools such as CloudWatch Logs Insights, Amazon OpenSearch Service, Athena, and EMR log analysis.
- Troubleshoot and maintain pipelines (for example, AWS Glue jobs and Amazon EMR steps), including dependency and configuration issues.
- Extract logs for audits and implement log retention and archival policies that meet compliance requirements.
- Use Amazon Macie findings to detect sensitive data exposure risks and integrate detection into governance workflows.
- Create dashboards for key pipeline SLIs/SLOs such as freshness, latency, error rates, and backlog.
- Automate operational responses for common failures (reruns, retries, backfills) with guardrails and approvals when required.
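A minimal boto3 sketch for the monitoring and SLA bullets: the pipeline publishes a freshness metric, and an alarm fires when the SLA is breached or the metric stops arriving. The namespace, values, and SNS topic ARN are placeholders.

```python
import datetime

import boto3

cloudwatch = boto3.client("cloudwatch")

# Emit minutes since the newest ingested record (a value the pipeline computes).
cloudwatch.put_metric_data(
    Namespace="DataPipeline/Sales",
    MetricData=[
        {
            "MetricName": "FreshnessMinutes",
            "Value": 42.0,
            "Unit": "None",
            "Timestamp": datetime.datetime.utcnow(),
        }
    ],
)

# Alarm if freshness exceeds 60 minutes for two consecutive 5-minute periods.
cloudwatch.put_metric_alarm(
    AlarmName="sales-freshness-breach",
    Namespace="DataPipeline/Sales",
    MetricName="FreshnessMinutes",
    Statistic="Maximum",
    Period=300,
    EvaluationPeriods=2,
    Threshold=60,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="breaching",   # no metric usually means the pipeline is not running
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:pipeline-alerts"],
)
```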
Task 3.4 - Ensure data quality
- Define data validation dimensions (completeness, consistency, accuracy, integrity) and select metrics for measuring them.
- Use data profiling techniques to discover anomalies and guide rule creation.
- Apply data sampling techniques to validate large datasets efficiently and detect drift in data quality.
- Detect and mitigate data skew that can degrade processing performance and downstream correctness.
- Run data quality checks during processing (for example, null checks and range checks) and decide when to fail fast versus quarantine records.
- Define data quality rules using services such as AWS Glue DataBrew and manage rules as versioned assets.
- Investigate data consistency issues using AWS Glue DataBrew profiling and targeted queries to locate root causes.
- Route invalid or suspicious records to a quarantine dataset and design remediation workflows for later correction (see the quarantine sketch after this list).
- Configure thresholds and alerting for data quality failures and integrate alerts into orchestration failure handling.
- Maintain a data quality scorecard over time and track regressions after pipeline changes.
- Detect schema drift and validate schema conformance early to prevent downstream breakages.
- Implement continuous improvement loops for recurring quality issues using root cause analysis, remediation, and prevention controls.
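For the fail-fast-versus-quarantine and quarantine-routing bullets, a minimal pure-Python sketch. The field names, 5% gate, and sample rows are illustrative; in a real pipeline the quarantined records would be written to a separate S3 prefix or table along with their rejection reason.

```python
REQUIRED_FIELDS = ("order_id", "customer_id", "amount")


def validate(row: dict):
    """Return (is_valid, reason) for a single record."""
    for field in REQUIRED_FIELDS:
        if row.get(field) in (None, ""):
            return False, f"missing {field}"        # completeness check
    try:
        amount = float(row["amount"])
    except (TypeError, ValueError):
        return False, "amount not numeric"          # integrity check
    if not (0 < amount < 1_000_000):
        return False, "amount out of range"         # range / accuracy check
    return True, None


def split_quarantine(rows):
    good, quarantined = [], []
    for row in rows:
        ok, reason = validate(row)
        if ok:
            good.append(row)
        else:
            quarantined.append({**row, "_dq_reason": reason})
    return good, quarantined


rows = [
    {"order_id": 1, "customer_id": 9, "amount": "25.50"},
    {"order_id": 2, "customer_id": 7, "amount": "10.00"},
]
good, bad = split_quarantine(rows)
if bad and len(bad) / len(rows) > 0.05:
    # Fail fast: too many bad rows usually means an upstream defect, not noise.
    raise RuntimeError(f"data quality gate failed: {len(bad)}/{len(rows)} quarantined")
# Otherwise: write `bad` to a quarantine location and continue with `good`.
```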
Domain 4: Data Security and Governance (18%)
Practice this topic →
Task 4.1 - Apply authentication mechanisms
- Apply VPC security networking concepts (subnets, routing, security groups) to secure data engineering workloads.
- Differentiate managed and unmanaged services and understand how that affects authentication responsibilities and control points.
- Compare authentication methods (password-based, certificate-based, and role-based) and choose mechanisms appropriate to the service and workload.
- Differentiate AWS managed policies and customer managed policies and determine when a custom policy is required.
- Update VPC security groups to allow only required traffic for data sources, processing jobs, and analytics endpoints.
- Create and update IAM groups and roles, VPC endpoints, and service integrations used by data pipelines and analytics workloads.
- Create and rotate credentials using AWS Secrets Manager and avoid embedding secrets in code or configuration.
- Set up IAM roles for access from services such as Lambda, API Gateway, the AWS CLI, and CloudFormation, including correct trust policies.
- Apply IAM policies to roles, endpoints, and services such as S3 Access Points and AWS PrivateLink endpoints.
- Configure private connectivity using VPC endpoints or PrivateLink to access AWS data services without traversing the public internet.
- Troubleshoot authentication issues (for example, invalid credentials or trust policy errors) and implement least-privilege fixes.
- Implement identity hygiene for pipelines, including short-lived credentials, federation, and role assumption patterns, as sketched below.
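A minimal boto3 sketch of the short-lived-credential pattern above: assume a scoped pipeline role with STS and use the resulting session for data-service calls. The role ARN and bucket name are placeholders.

```python
import boto3

sts = boto3.client("sts")

creds = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/etl-ingest-role",
    RoleSessionName="nightly-ingest",
    DurationSeconds=3600,   # short-lived by design
)["Credentials"]

# A session scoped to the assumed role; use it for the data-service calls.
session = boto3.Session(
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
s3 = session.client("s3")
s3.list_objects_v2(Bucket="example-raw-bucket", Prefix="sales/", MaxKeys=1)
```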
Task 4.2 - Apply authorization mechanisms
- Differentiate authorization methods (role-based, policy-based, tag-based, attribute-based) and map them to AWS services and use cases.
- Apply the principle of least privilege by scoping actions and resources to the minimum required for each pipeline component.
- Implement role-based access control (RBAC) patterns that match expected access patterns for producers, consumers, and administrators.
- Protect data from unauthorized access across services using IAM conditions, resource policies, tag-based controls, and network isolation.
- Create custom IAM policies when managed policies do not meet requirements, including using conditions for least privilege (see the policy sketch after this list).
- Store application and database credentials securely using AWS Secrets Manager or AWS Systems Manager Parameter Store.
- Provide database users, groups, and roles appropriate access in databases (for example, Amazon Redshift) aligned to separation of duties.
- Manage permissions through AWS Lake Formation for Amazon S3 data and integrated engines such as Redshift, EMR, and Athena.
- Use Lake Formation governance features such as LF-tags to scale permission management across many datasets.
- Implement cross-account access patterns safely using role assumption and resource sharing while maintaining governance boundaries.
- Troubleshoot authorization failures by distinguishing IAM policy issues from resource policy and Lake Formation permission issues.
- Audit and periodically review permissions, including identifying overly permissive policies and applying remediation.
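For the custom-policy bullet, a minimal boto3 sketch of a customer managed policy scoped to one bucket prefix and one KMS key, with a condition that only allows decryption through S3. All ARNs are placeholders.

```python
import json

import boto3

policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadCuratedSales",
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-curated-bucket/sales/*",
        },
        {
            "Sid": "DecryptWithDataKey",
            "Effect": "Allow",
            "Action": ["kms:Decrypt"],
            "Resource": "arn:aws:kms:us-east-1:123456789012:key/1111-2222",
            "Condition": {
                "StringEquals": {"kms:ViaService": "s3.us-east-1.amazonaws.com"}
            },
        },
    ],
}

iam = boto3.client("iam")
iam.create_policy(
    PolicyName="sales-curated-read",
    PolicyDocument=json.dumps(policy_document),
)
```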
Task 4.3 - Ensure data encryption and masking
- Select data encryption options available in AWS analytics services such as Amazon Redshift, Amazon EMR, and AWS Glue and apply encryption by default.
- Differentiate client-side encryption and server-side encryption and choose based on trust boundaries and operational requirements.
- Protect sensitive data by selecting appropriate anonymization, masking, tokenization, and salting approaches.
- Apply data masking and anonymization according to compliance laws and company policies and verify that controls are effective.
- Use AWS KMS keys to encrypt and decrypt data and manage key rotation and access grants safely.
- Enable encryption in transit (TLS) for data movement and service connections across ingestion, processing, and analytics layers.
- Configure encryption across AWS account boundaries (for example, cross-account KMS use) while maintaining least privilege.
- Configure Amazon S3 encryption (SSE-KMS), bucket policies, and default encryption for data lake storage (see the sketch after this list).
- Enable encryption at rest and in transit for Amazon Redshift and understand how encryption interacts with data loading and sharing.
- Apply encryption and security controls to streaming and migration integrations (for example, Kinesis, MSK, and DMS) at a high level.
- Implement column-level or field-level masking patterns to limit exposure of sensitive attributes in analytics outputs.
- Design audit-ready evidence for encryption and masking controls, including key usage logs and periodic validation checks.
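A minimal boto3 sketch for the S3 default-encryption bullet: enforce SSE-KMS with S3 Bucket Keys on a data lake bucket. The bucket name and KMS key ARN are placeholders.

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_encryption(
    Bucket="example-data-lake",
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": "arn:aws:kms:us-east-1:123456789012:key/1111-2222",
                },
                "BucketKeyEnabled": True,   # fewer KMS requests, lower cost
            }
        ]
    },
)
```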
Task 4.4 - Prepare logs for audit
- Define what application events to log for data pipelines (parameters, row counts, errors) while minimizing sensitive data exposure.
- Log access to AWS services used by data platforms and understand how CloudTrail events support investigations and audits.
- Design centralized logging for AWS logs, including storing logs in durable locations with retention and access controls.
- Use AWS CloudTrail to track API calls that affect data systems, including changes to permissions, configurations, and data access.
- Store application logs in Amazon CloudWatch Logs and configure retention, encryption, and controlled access (see the sketch after this list).
- Use AWS CloudTrail Lake for centralized logging queries and generate audit evidence from event queries.
- Analyze logs using services such as Athena, CloudWatch Logs Insights, and Amazon OpenSearch Service for audit and incident response needs.
- Integrate AWS services to process large volumes of log data (for example, using Amazon EMR when needed) and maintain scalability.
- Implement log integrity and immutability patterns (write-once storage, access separation) for audit readiness.
- Create dashboards and alarms for suspicious activity and policy violations using monitored audit signals.
- Implement separation of duties by controlling who can write, change, and read audit logs and audit configurations.
- Document audit procedures and evidence mapping to compliance requirements and keep logs accessible for required retention periods.
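For the CloudWatch Logs bullet above, a minimal boto3 sketch that sets an audit-driven retention period and runs a Logs Insights query for review. The log group name, retention period, and query are illustrative.

```python
import time

import boto3

logs = boto3.client("logs")

# Keep pipeline logs for one year to satisfy the retention requirement.
logs.put_retention_policy(
    logGroupName="/aws/glue/jobs/curate-sales",
    retentionInDays=365,
)

# Query the last 24 hours of errors for an audit or incident review.
query_id = logs.start_query(
    logGroupName="/aws/glue/jobs/curate-sales",
    startTime=int(time.time()) - 86400,
    endTime=int(time.time()),
    queryString="fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 50",
)["queryId"]

# Logs Insights queries are asynchronous; poll until the query finishes.
while True:
    result = logs.get_query_results(queryId=query_id)
    if result["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(2)
```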
Task 4.5 - Understand data privacy and governance
- Identify and protect personally identifiable information (PII) and other sensitive data types throughout ingestion, storage, processing, and consumption.
- Explain data sovereignty and how regional restrictions affect storage, processing, backups, and replication strategies.
- Grant permissions for data sharing (for example, Amazon Redshift data sharing) while controlling scope, consumers, and auditability.
- Implement PII identification using Amazon Macie and integrate findings with Lake Formation governance controls.
- Implement data privacy strategies that prevent backups or replications of data to disallowed AWS Regions.
- Manage configuration changes using AWS Config to detect, track, and remediate governance drift in data platforms.
- Apply governance frameworks using Lake Formation, tags, and catalog metadata to manage dataset ownership and access (see the Lake Formation sketch after this list).
- Implement data classification and labeling strategies to drive access controls, retention policies, and monitoring priorities.
- Define purpose limitation and consent-aware data usage policies (where applicable) and document intended uses for datasets.
- Maintain auditability of governance decisions by logging policy changes, approvals, and access reviews.
- Apply multi-account governance patterns at a high level (for example, Organizations guardrails and separation of duties) for data platforms.
- Monitor and remediate governance drift (permissions creep and misconfigurations) using automated checks and operational processes.
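A minimal boto3 sketch for the Lake Formation governance bullet: grant a consumer role column-scoped SELECT on a governed table instead of broad S3/IAM access. The role ARN, database, table, and column names are placeholders.

```python
import boto3

lakeformation = boto3.client("lakeformation")

lakeformation.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/bi-analyst"
    },
    Resource={
        "TableWithColumns": {
            "DatabaseName": "curated",
            "Name": "daily_sales",
            "ColumnNames": ["sale_date", "total_amount"],   # keep sensitive columns hidden
        }
    },
    Permissions=["SELECT"],
)
```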
Tip: After each task, write 5–10 “one-liner rules” from your misses (service selection + trade-offs).