Use this syllabus as your source of truth for DEA-C01. Work through each domain in order and drill targeted sets after every task.
What’s covered
Domain 1: Data Ingestion and Transformation (34%)
Practice this topic →
Task 1.1 - Perform data ingestion
- Evaluate throughput and latency requirements and select an ingestion approach that meets service limits and operational needs.
- Differentiate streaming and batch ingestion patterns and choose based on frequency, historical backfill needs, and timeliness requirements.
- Design replayable ingestion pipelines (for backfills and reprocessing) and apply idempotency principles to avoid duplicate processing.
- Differentiate stateful and stateless ingestion transactions and understand implications for ordering, deduplication, and checkpoints.
- Ingest streaming data using services such as Amazon Kinesis, Amazon MSK, DynamoDB Streams, AWS DMS, AWS Glue, and Amazon Redshift, based on the source type and its constraints.
- Ingest batch data using services such as Amazon S3, AWS Glue, Amazon EMR, AWS DMS, Amazon Redshift, AWS Lambda, and Amazon AppFlow, based on the integration pattern.
- Configure batch ingestion options such as scheduled runs, incremental loads, partitioning, and checkpoints to support repeatability and performance.
- Consume data APIs safely by handling pagination, throttling, retries, and authentication while maintaining traceability.
- Set up schedulers using Amazon EventBridge, Apache Airflow (Amazon MWAA), or time-based schedules for ingestion jobs and AWS Glue crawlers.
- Use event triggers such as Amazon S3 Event Notifications and EventBridge rules to start ingestion and downstream processing.
- Invoke AWS Lambda from Amazon Kinesis and design fan-in/fan-out distribution patterns for streaming data pipelines.
- Implement secure connectivity (for example, IP allowlists) and manage throttling and rate limits for services such as DynamoDB, Amazon RDS, and Amazon Kinesis (see the producer sketch after this list).
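The throttling and idempotency points above are easiest to see in code. Below is a minimal Python (boto3) sketch, assuming a hypothetical `orders-stream` stream and an `order_id` field used as the partition key; only the records Kinesis rejects are retried, with capped exponential backoff.

```python
import json
import time

import boto3

kinesis = boto3.client("kinesis")


def put_batch_with_retry(records, stream_name="orders-stream", max_attempts=5):
    """Write a batch to Kinesis, retrying only the entries that were throttled."""
    entries = [
        {"Data": json.dumps(r).encode("utf-8"), "PartitionKey": str(r["order_id"])}
        for r in records
    ]
    for attempt in range(1, max_attempts + 1):
        response = kinesis.put_records(StreamName=stream_name, Records=entries)
        if response["FailedRecordCount"] == 0:
            return
        # Keep only the failed entries (typically throughput-exceeded errors).
        entries = [
            entry
            for entry, result in zip(entries, response["Records"])
            if "ErrorCode" in result
        ]
        time.sleep(min(2 ** attempt, 30))  # exponential backoff with a cap
    raise RuntimeError(f"{len(entries)} records still failing after {max_attempts} attempts")
```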
Task 1.2 - Transform and process data
- Design ETL pipeline steps based on business requirements, target schemas, and downstream analytics or operational needs.
- Select transformation strategies based on data volume, velocity, and variety (structured, semi-structured, unstructured) and operational constraints.
- Apply distributed computing concepts and use Apache Spark to process large datasets efficiently.
- Select intermediate data staging locations (for example, Amazon S3 or temporary tables) to support multi-step processing and recoverability.
- Optimize container usage for performance needs using Amazon EKS or Amazon ECS and select compute sizing and scaling strategies.
- Connect to data sources using JDBC/ODBC connectors and manage network access and credentials securely.
- Integrate data from multiple sources and handle joins, deduplication, normalization, and schema alignment across systems.
- Optimize costs while processing data by choosing the right compute and execution model (serverless vs provisioned, scaling, and purchasing options).
- Select and use transformation services such as Amazon EMR, AWS Glue, AWS Lambda, and Amazon Redshift based on workload requirements.
- Transform data between formats (for example, CSV to Parquet) and apply partitioning and compression for analytics performance (see the PySpark sketch after this list).
- Troubleshoot and debug common transformation failures and performance issues such as data skew, memory pressure, and small-file problems.
- Create data APIs to make data available to other systems using AWS services while managing schema/version changes.
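As referenced in the CSV-to-Parquet bullet, here is a minimal PySpark sketch. The bucket paths and the `event_timestamp`/`event_date` columns are placeholders; the same pattern runs on Amazon EMR or, with the GlueContext wrapper, as an AWS Glue Spark job.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

raw = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("s3://example-raw-bucket/sales/")
)

curated = raw.withColumn("event_date", F.to_date("event_timestamp"))

(
    curated.repartition("event_date")      # reduce small files per partition
    .write.mode("overwrite")
    .partitionBy("event_date")             # enables partition pruning for consumers
    .option("compression", "snappy")       # splittable, analytics-friendly compression
    .parquet("s3://example-curated-bucket/sales/")
)
```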
Task 1.3 - Orchestrate data pipelines
- Integrate AWS services to build end-to-end data pipelines with explicit dependencies, retries, and repeatable outcomes.
- Apply event-driven architecture to trigger pipeline steps using services such as Amazon EventBridge and Amazon S3 notifications.
- Configure pipelines to run on schedules or dependencies, including managing reruns and backfills.
- Choose serverless workflow patterns and determine when to use AWS Step Functions, AWS Glue workflows, or AWS Lambda orchestration.
- Build orchestration workflows using services such as AWS Lambda, Amazon MWAA, AWS Step Functions, AWS Glue workflows, and Amazon EventBridge.
- Implement notifications and alerting for pipeline events using Amazon SNS and Amazon SQS and integrate with failure handling.
- Design pipelines for performance, availability, scalability, resiliency, and fault tolerance using retries, idempotency, and isolation.
- Implement and maintain serverless workflows with correct timeouts, error handling, and state management (see the state machine sketch after this list).
- Define and monitor pipeline SLAs such as data freshness and latency, and configure alarms for breaches.
- Control parallelism and fan-out patterns safely using workflow constructs (for example, Step Functions Map/Parallel or Airflow task concurrency).
- Implement checkpointing and understand at-least-once vs exactly-once trade-offs across streaming and batch steps.
- Secure orchestration components with least-privilege roles and controlled network access for data services and endpoints.
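For the serverless-workflow bullet above, a minimal sketch of a Step Functions state machine with a timeout, retries, and a failure notification, created with boto3. The Glue job name, role ARN, and SNS topic ARN are placeholders.

```python
import json

import boto3

definition = {
    "Comment": "Run a Glue job with retries and a failure notification",
    "TimeoutSeconds": 3600,
    "StartAt": "RunGlueJob",
    "States": {
        "RunGlueJob": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "curate-sales"},
            "Retry": [
                {
                    "ErrorEquals": ["States.TaskFailed"],
                    "IntervalSeconds": 60,
                    "MaxAttempts": 2,
                    "BackoffRate": 2.0,
                }
            ],
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
            "End": True,
        },
        "NotifyFailure": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {
                "TopicArn": "arn:aws:sns:us-east-1:123456789012:pipeline-alerts",
                "Message.$": "$.Error",
            },
            "End": True,
        },
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="sales-pipeline",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/sfn-pipeline-role",
)
```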
Task 1.4 - Apply programming concepts
- Implement CI/CD for data pipelines, including automated testing, packaging, and safe deployment practices.
- Write SQL queries for data extraction and transformation, including joins and multi-step transformations that support pipeline requirements.
- Optimize SQL queries using techniques such as predicate pushdown, partition pruning, and avoiding expensive joins when possible.
- Use infrastructure as code (AWS CloudFormation or AWS CDK) to deploy repeatable data pipeline infrastructure.
- Apply distributed computing concepts to optimize data pipeline code and avoid bottlenecks in parallel processing.
- Use appropriate data structures and algorithms (for example, graph and tree structures) when modeling or traversing complex data relationships.
- Optimize code to reduce runtime and resource usage for ingestion and transformation, including efficient I/O and batching.
- Configure AWS Lambda functions for concurrency and performance (memory, timeout, concurrency limits) in data pipeline workloads (see the sketch after this list).
- Use Amazon Redshift stored procedures or SQL UDFs to encapsulate transformations and improve repeatability.
- Use Git commands and workflows (clone, branch, merge, tag) to manage pipeline code changes safely.
- Package and deploy serverless data pipelines using AWS SAM, including Lambda functions, Step Functions state machines, and DynamoDB tables.
- Mount and use storage volumes from within Lambda functions (for example, Amazon EFS) when required and understand constraints and best practices.
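A minimal boto3 sketch for the Lambda tuning bullet above, assuming a hypothetical `enrich-orders` function; the memory, timeout, and concurrency values are illustrative and should be sized from observed metrics.

```python
import boto3

lam = boto3.client("lambda")

# Right-size memory and timeout for the transformation workload.
lam.update_function_configuration(
    FunctionName="enrich-orders",
    MemorySize=1024,   # more memory also allocates more CPU to the function
    Timeout=120,       # seconds; keep below any upstream queue visibility timeout
)

# Cap concurrency so a burst of events cannot overwhelm a downstream database.
lam.put_function_concurrency(
    FunctionName="enrich-orders",
    ReservedConcurrentExecutions=20,
)
```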
Domain 2: Data Store Management (26%)
Practice this topic →
Task 2.1 - Choose a data store
- Compare storage platforms (object, file, relational, NoSQL, streaming) and explain how their characteristics affect data engineering designs.
- Select AWS data stores and configurations that meet performance demands such as throughput, concurrency, and latency.
- Choose data storage formats (for example, CSV, TXT, Parquet) and apply compression and partitioning aligned to access patterns.
- Align data storage choices with migration requirements, including cutover approaches, replication needs, and data movement constraints.
- Determine appropriate storage solutions for access patterns such as point lookups, scans, time-series, OLTP, OLAP, and streaming.
- Manage locks and concurrency controls (for example, in Amazon Redshift and Amazon RDS) and prevent contention in multi-user environments.
- Implement appropriate storage services for cost and performance requirements (for example, Amazon Redshift, Amazon EMR, AWS Lake Formation, Amazon RDS, DynamoDB, Amazon Kinesis Data Streams, Amazon MSK).
- Configure storage services for access pattern requirements (for example, Redshift distribution/sort design, DynamoDB partition key design, and S3 partition layout); see the DynamoDB sketch after this list.
- Apply Amazon S3 to appropriate use cases such as data lakes, staging layers, and durable storage of curated datasets.
- Integrate migration tools such as AWS Transfer Family into data movement and ingestion workflows.
- Implement data migration or remote access methods such as Amazon Redshift federated queries, materialized views, and Redshift Spectrum.
- Design hybrid and multi-store architectures with clear responsibilities (for example, lake for raw/curated data, warehouse for BI, stream for real-time ingestion).
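To make the access-pattern configuration bullet concrete, a minimal boto3 sketch that creates a DynamoDB table keyed for "all events for a device, ordered by time". The table and attribute names are hypothetical.

```python
import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.create_table(
    TableName="device-events",
    AttributeDefinitions=[
        {"AttributeName": "device_id", "AttributeType": "S"},
        {"AttributeName": "event_ts", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "device_id", "KeyType": "HASH"},   # partition key
        {"AttributeName": "event_ts", "KeyType": "RANGE"},   # sort key for time-range queries
    ],
    BillingMode="PAY_PER_REQUEST",  # on-demand capacity for spiky ingestion
)
```

The same reasoning applies to Redshift distribution/sort keys and S3 prefix layout: derive the physical design from the dominant query pattern, not the other way around.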
Task 2.2 - Understand data cataloging systems
- Explain the purpose of a data catalog and how it enables discovery, governance, and consistent schema usage across analytics services.
- Create and maintain a data catalog, including databases and tables, and keep metadata consistent as data changes.
- Classify data based on requirements and use metadata to support access controls and governance workflows.
- Identify key components of metadata and catalogs (schemas, partitions, tags, lineage) and how they are used by query engines.
- Build and reference catalogs using AWS Glue Data Catalog or Apache Hive metastore for consistent schema discovery.
- Discover schemas and populate data catalogs using AWS Glue crawlers and apply controls to manage schema drift (see the crawler sketch after this list).
- Synchronize partitions with a data catalog and ensure new partitions are available for query engines promptly.
- Create and manage source and target connections for cataloging using AWS Glue connections and appropriate network settings.
- Enable catalog-driven consumption for services such as Athena, Amazon EMR, and Redshift Spectrum while maintaining schema consistency.
- Troubleshoot catalog issues such as stale partitions, crawler misclassification, and insufficient permissions.
- Apply governance controls through catalog metadata using Lake Formation permissions and LF-tags.
- Version and document schemas and catalog changes to support reproducibility, collaboration, and audits.
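For the crawler and schema-drift bullet, a minimal boto3 sketch. The role ARN, database, and S3 path are placeholders; setting the schema change policy to LOG surfaces drift for review instead of silently rewriting the catalog schema.

```python
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="sales-raw-crawler",
    Role="arn:aws:iam::123456789012:role/glue-crawler-role",
    DatabaseName="sales_raw",
    Targets={"S3Targets": [{"Path": "s3://example-raw-bucket/sales/"}]},
    SchemaChangePolicy={
        "UpdateBehavior": "LOG",   # report drift instead of mutating the schema
        "DeleteBehavior": "LOG",
    },
    Schedule="cron(0 2 * * ? *)",  # nightly run to register new partitions
)

glue.start_crawler(Name="sales-raw-crawler")
```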
Task 2.3 - Manage the lifecycle of data
- Choose storage solutions that meet hot and cold data requirements and design tiered storage based on access frequency.
- Optimize storage cost across the data lifecycle using mechanisms such as storage class transitions, compression, and partitioning.
- Define retention policies and archiving strategies aligned to business requirements and legal obligations.
- Delete and expire data to meet business and legal requirements, including implementing automated expiry processes.
- Protect data with appropriate resiliency and availability using strategies such as versioning, replication, and backups.
- Perform load and unload operations to move data between Amazon S3 and Amazon Redshift using efficient patterns.
- Manage S3 Lifecycle policies to transition objects across storage tiers and enforce retention policies (see the lifecycle sketch after this list).
- Expire data when it reaches a specific age using S3 Lifecycle policies and validate outcomes against retention requirements.
- Manage S3 versioning and understand implications for restore scenarios, rollback, and ongoing storage cost.
- Use DynamoDB TTL to expire data and design applications that handle eventual deletions safely.
- Manage lifecycle policies for intermediate and derived datasets to avoid accumulating stale data and unnecessary costs.
- Apply governance controls to lifecycle operations, including change control, approvals, and audit logs for retention policy changes.
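A minimal boto3 sketch for the lifecycle bullet above: one rule that tiers, expires, and cleans up noncurrent versions under a prefix. The bucket, prefix, and day counts are illustrative and should come from your retention requirements.

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-data-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "raw-zone-tiering",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},  # delete after the retention window
                "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
            }
        ]
    },
)
```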
Task 2.4 - Design data models and schema evolution
- Apply data modeling concepts (normalized/denormalized, star/snowflake) and select a design based on query patterns and constraints.
- Model structured, semi-structured, and unstructured data and decide when to use schema-on-write versus schema-on-read.
- Design schemas for Amazon Redshift, DynamoDB, and lake-based tables (governed by Lake Formation) aligned to access patterns.
- Use indexing, partitioning, sort/distribution strategies, and compression to optimize performance and cost (see the Redshift DDL sketch after this list).
- Plan schema evolution techniques (additive changes, versioning, backfills) and avoid breaking downstream consumers.
- Address changes to the characteristics of data (volume, cardinality, skew) and update data models to preserve performance and correctness.
- Establish data lineage to ensure accuracy and trustworthiness of data and support auditability.
- Perform schema conversion when migrating databases using AWS Schema Conversion Tool (AWS SCT) and AWS DMS schema conversion.
- Manage schema drift with controlled discovery (for example, AWS Glue crawlers) and compatibility policies for consumers.
- Use lineage tools to track transformations and provenance (for example, Amazon SageMaker ML Lineage Tracking) and document pipeline metadata.
- Design backward and forward compatibility approaches for schema changes and validate with consumer contract testing.
- Document data models and schema evolution policies to support collaboration, governance, and predictable upgrades.
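For the distribution/sort-key and schema-evolution bullets, a minimal sketch using the Amazon Redshift Data API. The workgroup, database, table, and columns are placeholders; the second statement is additive (a nullable column), so existing consumers keep working.

```python
import boto3

rsd = boto3.client("redshift-data")

ddl = """
CREATE TABLE IF NOT EXISTS sales_fact (
    sale_id      BIGINT,
    customer_id  BIGINT,
    sale_date    DATE,
    amount       DECIMAL(12,2)
)
DISTSTYLE KEY
DISTKEY (customer_id)   -- co-locate rows that are joined on customer_id
SORTKEY (sale_date)     -- prune blocks for date-range queries
"""

rsd.execute_statement(
    WorkgroupName="analytics-wg",   # or ClusterIdentifier=... for a provisioned cluster
    Database="dev",
    Sql=ddl,
)

# Additive schema evolution: a new nullable column does not break existing consumers.
rsd.execute_statement(
    WorkgroupName="analytics-wg",
    Database="dev",
    Sql="ALTER TABLE sales_fact ADD COLUMN channel VARCHAR(32);",
)
```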
Domain 3: Data Operations and Support (22%)
Practice this topic →
Task 3.1 - Automate data processing by using AWS services
- Maintain and troubleshoot automated data processing workflows to ensure repeatable business outcomes.
- Use API calls and SDKs to automate data processing operations and integrate programmatic control into pipelines.
- Identify which AWS services accept scripting (for example, Amazon EMR, Amazon Redshift, AWS Glue) and integrate scripts into workflows safely.
- Orchestrate data pipelines using services such as Amazon MWAA and AWS Step Functions to coordinate processing, retries, and dependencies.
- Troubleshoot managed workflows (for example, Amazon MWAA environments) and identify common failure modes in orchestration and scheduling.
- Use AWS service features to process data (for example, Amazon EMR, Amazon Redshift, AWS Glue) and select the best service for the workload.
- Consume and maintain data APIs, including versioning and access controls that enable stable downstream integrations.
- Prepare data transformations using tools such as AWS Glue DataBrew and integrate outputs into downstream analytics.
- Query data using Amazon Athena and create repeatable datasets using views or CTAS patterns.
- Use AWS Lambda to automate data processing and connect event-driven triggers to processing steps.
- Manage events and schedulers using Amazon EventBridge and integrate triggers for batch and stream processing.
- Implement automation guardrails (idempotency, retries, backoff, and rate-limit handling) to prevent duplicate or runaway processing, as sketched below.
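A minimal sketch of the idempotency guardrail mentioned above: a DynamoDB conditional write claims each S3 object key exactly once, so duplicate event deliveries and reruns are skipped. The table name and event shape assume an S3 Event Notification trigger and are illustrative.

```python
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("processed-objects")


def already_processed(object_key: str) -> bool:
    """Return True if this key was handled before; otherwise claim it atomically."""
    try:
        table.put_item(
            Item={"object_key": object_key},
            ConditionExpression="attribute_not_exists(object_key)",
        )
        return False
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return True
        raise


def handler(event, context):
    # S3 Event Notification payload: one record per created object.
    for record in event["Records"]:
        key = record["s3"]["object"]["key"]
        if already_processed(key):
            continue  # duplicate delivery or rerun; skip safely
        # ... start the actual processing (Glue job, Step Functions execution, etc.)
```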
Task 3.2 - Analyze data by using AWS services
- Choose between provisioned and serverless analytics services based on concurrency needs, workload variability, cost, and operational overhead.
- Write SQL queries with joins, filters, and window functions to satisfy analysis requirements while controlling cost and runtime.
- Apply cleansing techniques appropriately and document assumptions to ensure repeatable and auditable analysis.
- Perform aggregations, rolling averages, grouping, and pivoting to create analysis-ready datasets.
- Use Amazon Athena to query data and create reusable assets such as views and CTAS outputs (see the CTAS sketch after this list).
- Use Athena notebooks with Apache Spark to explore data interactively and validate data assumptions.
- Visualize data using AWS services such as Amazon QuickSight and design dashboards aligned to business questions.
- Use AWS Glue DataBrew for profiling, visualizing, and preparing datasets during analysis.
- Verify and clean data using services and tools such as AWS Lambda, Athena, QuickSight, Jupyter notebooks, and Amazon SageMaker Data Wrangler.
- Optimize query performance and cost using file formats, partition pruning, statistics, and CTAS patterns where appropriate.
- Select the appropriate analysis engine (Athena vs Redshift vs EMR) based on latency needs, scale, governance controls, and integration requirements.
- Share analysis outputs safely by controlling access to derived datasets and applying least-privilege patterns for consumers.
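For the Athena CTAS bullet, a minimal boto3 sketch that materializes a partitioned, compressed Parquet dataset for reuse. Database, table, bucket, and workgroup names are placeholders; the WHERE clause illustrates pruning on the source table.

```python
import boto3

athena = boto3.client("athena")

ctas = """
CREATE TABLE curated.daily_sales
WITH (
    format = 'PARQUET',
    write_compression = 'SNAPPY',
    external_location = 's3://example-curated-bucket/daily_sales/',
    partitioned_by = ARRAY['sale_date']
) AS
SELECT customer_id,
       SUM(amount) AS total_amount,
       sale_date                          -- partition column must come last
FROM raw.sales
WHERE sale_date >= DATE '2024-01-01'
GROUP BY customer_id, sale_date
"""

athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={"Database": "curated"},
    WorkGroup="primary",
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
```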
Task 3.3 - Maintain and monitor data pipelines
- Implement application logging for data pipelines and define what operational data to capture (job metadata, row counts, failures).
- Log access to AWS services used in pipelines using AWS CloudTrail and integrate events into audit and incident workflows.
- Monitor pipeline health using Amazon CloudWatch metrics, logs, and alarms and define actionable alert thresholds (see the freshness alarm sketch after this list).
- Apply best practices for performance tuning and validate improvements using observable metrics and controlled changes.
- Deploy logging and monitoring solutions that enable traceability across ingestion, transformation, and consumption steps.
- Use notification services (Amazon SNS and Amazon SQS) to send alerts and integrate with operational runbooks.
- Troubleshoot performance issues using tools such as CloudWatch Logs Insights, Amazon OpenSearch Service, Athena, and EMR log analysis.
- Troubleshoot and maintain pipelines (for example, AWS Glue jobs and Amazon EMR steps), including dependency and configuration issues.
- Extract logs for audits and implement log retention and archival policies that meet compliance requirements.
- Use Amazon Macie findings to detect sensitive data exposure risks and integrate detection into governance workflows.
- Create dashboards for key pipeline SLIs/SLOs such as freshness, latency, error rates, and backlog.
- Automate operational responses for common failures (reruns, retries, backfills) with guardrails and approvals when required.
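A minimal boto3 sketch for the monitoring and SLA bullets: the pipeline publishes a freshness metric, and an alarm fires when the SLA is breached or the metric stops arriving. The namespace, values, and SNS topic ARN are placeholders.

```python
import datetime

import boto3

cloudwatch = boto3.client("cloudwatch")

# Emit minutes since the newest ingested record (a value the pipeline computes).
cloudwatch.put_metric_data(
    Namespace="DataPipeline/Sales",
    MetricData=[
        {
            "MetricName": "FreshnessMinutes",
            "Value": 42.0,
            "Unit": "None",
            "Timestamp": datetime.datetime.utcnow(),
        }
    ],
)

# Alarm if freshness exceeds 60 minutes for two consecutive 5-minute periods.
cloudwatch.put_metric_alarm(
    AlarmName="sales-freshness-breach",
    Namespace="DataPipeline/Sales",
    MetricName="FreshnessMinutes",
    Statistic="Maximum",
    Period=300,
    EvaluationPeriods=2,
    Threshold=60,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="breaching",   # no metric usually means the pipeline is not running
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:pipeline-alerts"],
)
```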
Task 3.4 - Ensure data quality
- Define data validation dimensions (completeness, consistency, accuracy, integrity) and select metrics for measuring them.
- Use data profiling techniques to discover anomalies and guide rule creation.
- Apply data sampling techniques to validate large datasets efficiently and detect drift in data quality.
- Detect and mitigate data skew that can degrade processing performance and downstream correctness.
- Run data quality checks during processing (for example, null checks and range checks) and decide when to fail fast versus quarantine records.
- Define data quality rules using services such as AWS Glue DataBrew and manage rules as versioned assets.
- Investigate data consistency issues using AWS Glue DataBrew profiling and targeted queries to locate root causes.
- Route invalid or suspicious records to a quarantine dataset and design remediation workflows for later correction (see the quarantine sketch after this list).
- Configure thresholds and alerting for data quality failures and integrate alerts into orchestration failure handling.
- Maintain a data quality scorecard over time and track regressions after pipeline changes.
- Detect schema drift and validate schema conformance early to prevent downstream breakages.
- Implement continuous improvement loops for recurring quality issues using root cause analysis, remediation, and prevention controls.
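For the fail-fast-versus-quarantine and quarantine-routing bullets, a minimal pure-Python sketch. The field names, 5% gate, and sample rows are illustrative; in a real pipeline the quarantined records would be written to a separate S3 prefix or table along with their rejection reason.

```python
REQUIRED_FIELDS = ("order_id", "customer_id", "amount")


def validate(row: dict):
    """Return (is_valid, reason) for a single record."""
    for field in REQUIRED_FIELDS:
        if row.get(field) in (None, ""):
            return False, f"missing {field}"        # completeness check
    try:
        amount = float(row["amount"])
    except (TypeError, ValueError):
        return False, "amount not numeric"          # integrity check
    if not (0 < amount < 1_000_000):
        return False, "amount out of range"         # range / accuracy check
    return True, None


def split_quarantine(rows):
    good, quarantined = [], []
    for row in rows:
        ok, reason = validate(row)
        if ok:
            good.append(row)
        else:
            quarantined.append({**row, "_dq_reason": reason})
    return good, quarantined


rows = [
    {"order_id": 1, "customer_id": 9, "amount": "25.50"},
    {"order_id": 2, "customer_id": 7, "amount": "10.00"},
]
good, bad = split_quarantine(rows)
if bad and len(bad) / len(rows) > 0.05:
    # Fail fast: too many bad rows usually means an upstream defect, not noise.
    raise RuntimeError(f"data quality gate failed: {len(bad)}/{len(rows)} quarantined")
# Otherwise: write `bad` to a quarantine location and continue with `good`.
```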
Domain 4: Data Security and Governance (18%)
Practice this topic →
Task 4.1 - Apply authentication mechanisms
- Apply VPC security networking concepts (subnets, routing, security groups) to secure data engineering workloads.
- Differentiate managed and unmanaged services and understand how that affects authentication responsibilities and control points.
- Compare authentication methods (password-based, certificate-based, and role-based) and choose mechanisms appropriate to the service and workload.
- Differentiate AWS managed policies and customer managed policies and determine when a custom policy is required.
- Update VPC security groups to allow only required traffic for data sources, processing jobs, and analytics endpoints.
- Create and update IAM groups and roles, VPC endpoints, and service integrations used by data pipelines and analytics workloads.
- Create and rotate credentials using AWS Secrets Manager and avoid embedding secrets in code or configuration.
- Set up IAM roles for access from services such as Lambda, API Gateway, the AWS CLI, and CloudFormation, including correct trust policies.
- Apply IAM policies to roles, endpoints, and services such as S3 Access Points and AWS PrivateLink endpoints.
- Configure private connectivity using VPC endpoints or PrivateLink to access AWS data services without traversing the public internet.
- Troubleshoot authentication issues (for example, invalid credentials or trust policy errors) and implement least-privilege fixes.
- Implement identity hygiene for pipelines, including short-lived credentials, federation, and role assumption patterns, as sketched below.
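A minimal boto3 sketch of the short-lived-credential pattern above: assume a scoped pipeline role with STS and use the resulting session for data-service calls. The role ARN and bucket name are placeholders.

```python
import boto3

sts = boto3.client("sts")

creds = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/etl-ingest-role",
    RoleSessionName="nightly-ingest",
    DurationSeconds=3600,   # short-lived by design
)["Credentials"]

# A session scoped to the assumed role; use it for the data-service calls.
session = boto3.Session(
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
s3 = session.client("s3")
s3.list_objects_v2(Bucket="example-raw-bucket", Prefix="sales/", MaxKeys=1)
```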
Task 4.2 - Apply authorization mechanisms
- Differentiate authorization methods (role-based, policy-based, tag-based, attribute-based) and map them to AWS services and use cases.
- Apply the principle of least privilege by scoping actions and resources to the minimum required for each pipeline component.
- Implement role-based access control (RBAC) patterns that match expected access patterns for producers, consumers, and administrators.
- Protect data from unauthorized access across services using IAM conditions, resource policies, tag-based controls, and network isolation.
- Create custom IAM policies when managed policies do not meet requirements, including using conditions for least privilege (see the policy sketch after this list).
- Store application and database credentials securely using AWS Secrets Manager or AWS Systems Manager Parameter Store.
- Provide database users, groups, and roles appropriate access in databases (for example, Amazon Redshift) aligned to separation of duties.
- Manage permissions through AWS Lake Formation for Amazon S3 data and integrated engines such as Redshift, EMR, and Athena.
- Use Lake Formation governance features such as LF-tags to scale permission management across many datasets.
- Implement cross-account access patterns safely using role assumption and resource sharing while maintaining governance boundaries.
- Troubleshoot authorization failures by distinguishing IAM policy issues from resource policy and Lake Formation permission issues.
- Audit and periodically review permissions, including identifying overly permissive policies and applying remediation.
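For the custom-policy bullet, a minimal boto3 sketch of a customer managed policy scoped to one bucket prefix and one KMS key, with a condition that only allows decryption through S3. All ARNs are placeholders.

```python
import json

import boto3

policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadCuratedSales",
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-curated-bucket/sales/*",
        },
        {
            "Sid": "DecryptWithDataKey",
            "Effect": "Allow",
            "Action": ["kms:Decrypt"],
            "Resource": "arn:aws:kms:us-east-1:123456789012:key/1111-2222",
            "Condition": {
                "StringEquals": {"kms:ViaService": "s3.us-east-1.amazonaws.com"}
            },
        },
    ],
}

iam = boto3.client("iam")
iam.create_policy(
    PolicyName="sales-curated-read",
    PolicyDocument=json.dumps(policy_document),
)
```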
Task 4.3 - Ensure data encryption and masking
- Select data encryption options available in AWS analytics services such as Amazon Redshift, Amazon EMR, and AWS Glue and apply encryption by default.
- Differentiate client-side encryption and server-side encryption and choose based on trust boundaries and operational requirements.
- Protect sensitive data by selecting appropriate anonymization, masking, tokenization, and salting approaches.
- Apply data masking and anonymization according to compliance laws and company policies and verify that controls are effective.
- Use AWS KMS keys to encrypt and decrypt data and manage key rotation and access grants safely.
- Enable encryption in transit (TLS) for data movement and service connections across ingestion, processing, and analytics layers.
- Configure encryption across AWS account boundaries (for example, cross-account KMS use) while maintaining least privilege.
- Configure Amazon S3 encryption (SSE-KMS), bucket policies, and default encryption for data lake storage (see the sketch after this list).
- Enable encryption at rest and in transit for Amazon Redshift and understand how encryption interacts with data loading and sharing.
- Apply encryption and security controls to streaming and migration integrations (for example, Kinesis, MSK, and DMS) at a high level.
- Implement column-level or field-level masking patterns to limit exposure of sensitive attributes in analytics outputs.
- Design audit-ready evidence for encryption and masking controls, including key usage logs and periodic validation checks.
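A minimal boto3 sketch for the S3 default-encryption bullet: enforce SSE-KMS with S3 Bucket Keys on a data lake bucket. The bucket name and KMS key ARN are placeholders.

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_encryption(
    Bucket="example-data-lake",
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": "arn:aws:kms:us-east-1:123456789012:key/1111-2222",
                },
                "BucketKeyEnabled": True,   # fewer KMS requests, lower cost
            }
        ]
    },
)
```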
Task 4.4 - Prepare logs for audit
- Define what application events to log for data pipelines (parameters, row counts, errors) while minimizing sensitive data exposure.
- Log access to AWS services used by data platforms and understand how CloudTrail events support investigations and audits.
- Design centralized logging for AWS logs, including storing logs in durable locations with retention and access controls.
- Use AWS CloudTrail to track API calls that affect data systems, including changes to permissions, configurations, and data access.
- Store application logs in Amazon CloudWatch Logs and configure retention, encryption, and controlled access (see the sketch after this list).
- Use AWS CloudTrail Lake for centralized logging queries and generate audit evidence from event queries.
- Analyze logs using services such as Athena, CloudWatch Logs Insights, and Amazon OpenSearch Service for audit and incident response needs.
- Integrate AWS services to process large volumes of log data (for example, using Amazon EMR when needed) and maintain scalability.
- Implement log integrity and immutability patterns (write-once storage, access separation) for audit readiness.
- Create dashboards and alarms for suspicious activity and policy violations using monitored audit signals.
- Implement separation of duties by controlling who can write, change, and read audit logs and audit configurations.
- Document audit procedures and evidence mapping to compliance requirements and keep logs accessible for required retention periods.
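For the CloudWatch Logs bullet above, a minimal boto3 sketch that sets an audit-driven retention period and runs a Logs Insights query for review. The log group name, retention period, and query are illustrative.

```python
import time

import boto3

logs = boto3.client("logs")

# Keep pipeline logs for one year to satisfy the retention requirement.
logs.put_retention_policy(
    logGroupName="/aws/glue/jobs/curate-sales",
    retentionInDays=365,
)

# Query the last 24 hours of errors for an audit or incident review.
query_id = logs.start_query(
    logGroupName="/aws/glue/jobs/curate-sales",
    startTime=int(time.time()) - 86400,
    endTime=int(time.time()),
    queryString="fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 50",
)["queryId"]

# Logs Insights queries are asynchronous; poll until the query finishes.
while True:
    result = logs.get_query_results(queryId=query_id)
    if result["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(2)
```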
Task 4.5 - Understand data privacy and governance
- Identify and protect personally identifiable information (PII) and other sensitive data types throughout ingestion, storage, processing, and consumption.
- Explain data sovereignty and how regional restrictions affect storage, processing, backups, and replication strategies.
- Grant permissions for data sharing (for example, Amazon Redshift data sharing) while controlling scope, consumers, and auditability.
- Implement PII identification using Amazon Macie and integrate findings with Lake Formation governance controls.
- Implement data privacy strategies that prevent backups or replications of data to disallowed AWS Regions.
- Manage configuration changes using AWS Config to detect, track, and remediate governance drift in data platforms.
- Apply governance frameworks using Lake Formation, tags, and catalog metadata to manage dataset ownership and access (see the Lake Formation sketch after this list).
- Implement data classification and labeling strategies to drive access controls, retention policies, and monitoring priorities.
- Define purpose limitation and consent-aware data usage policies (where applicable) and document intended uses for datasets.
- Maintain auditability of governance decisions by logging policy changes, approvals, and access reviews.
- Apply multi-account governance patterns at a high level (for example, Organizations guardrails and separation of duties) for data platforms.
- Monitor and remediate governance drift (permissions creep and misconfigurations) using automated checks and operational processes.
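A minimal boto3 sketch for the Lake Formation governance bullet: grant a consumer role column-scoped SELECT on a governed table instead of broad S3/IAM access. The role ARN, database, table, and column names are placeholders.

```python
import boto3

lakeformation = boto3.client("lakeformation")

lakeformation.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/bi-analyst"
    },
    Resource={
        "TableWithColumns": {
            "DatabaseName": "curated",
            "Name": "daily_sales",
            "ColumnNames": ["sale_date", "total_amount"],   # keep sensitive columns hidden
        }
    },
    Permissions=["SELECT"],
)
```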
Tip: After each task, write 5–10 “one-liner rules” from your misses (service selection + trade-offs).