Use this syllabus as your source of truth for MLA-C01. Work through each domain in order and drill targeted sets after every task.
What’s covered
Domain 1: Data Preparation for Machine Learning (ML) (28%)
Practice this topic →
Task 1.1 - Ingest and store data
- Choose appropriate data formats for ML (for example, Apache Parquet, JSON, CSV, ORC, Avro, RecordIO) based on access patterns and downstream training requirements.
- Differentiate validated and non-validated data ingestion formats and identify when schema enforcement is required.
- Select core AWS data sources for ML workloads (for example, Amazon S3, Amazon EFS, Amazon FSx for NetApp ONTAP) and explain common trade-offs.
- Select AWS streaming data sources to ingest data (for example, Amazon Kinesis, Amazon Managed Service for Apache Flink, Amazon MSK for Apache Kafka) based on throughput, ordering, and latency constraints.
- Explain AWS storage options for ML datasets and artifacts and choose between them based on cost, performance, and data structure.
- Extract data from storage systems (for example, Amazon S3, Amazon EBS, Amazon EFS, Amazon RDS, Amazon DynamoDB) and identify AWS options that improve transfer or I/O performance (for example, S3 Transfer Acceleration, EBS Provisioned IOPS).
- Design a simple data landing strategy (raw → curated) and explain why consistent partitioning and file layout improve training performance.
- Merge data from multiple sources using appropriate approaches (for example, programming techniques, AWS Glue, Apache Spark) and validate schema compatibility.
- Ingest and explore data with Amazon SageMaker Data Wrangler and export prepared datasets for model training.
- Ingest features into Amazon SageMaker Feature Store and explain how a feature store supports feature reuse and consistency.
- Differentiate offline and online feature access patterns in SageMaker Feature Store and select the correct approach for training vs inference.
- Troubleshoot data ingestion and storage issues related to capacity and scalability (for example, throughput bottlenecks, partition hot keys, I/O limits).
- Implement dataset versioning and repeatable data extraction to support reproducibility across experiments and audits.
- Choose compression, object sizing, and partitioning strategies to balance read efficiency and storage cost for large datasets.
- Prepare datasets for the training environment by staging data to the appropriate storage resource and access method (for example, file-based access vs object-based access).
- Validate ingestion completeness and correctness by using basic checks (record counts, schema checks, and integrity checks) before handing data to transformation or training steps (see the minimal sketch after this list).
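A minimal sketch of the completeness and correctness checks above, assuming a Parquet file that has already been staged locally; the path, column names, and row-count threshold are illustrative placeholders:

```python
# Basic post-ingestion checks: record count, schema presence, and simple integrity gates.
import pyarrow.parquet as pq

EXPECTED_COLUMNS = {"transaction_id", "event_ts", "amount"}   # illustrative schema
MIN_EXPECTED_ROWS = 100_000                                   # illustrative threshold

table = pq.read_table("curated/transactions/date=2024-01-01/part-0000.parquet")

# 1. Record count check
assert table.num_rows >= MIN_EXPECTED_ROWS, f"Too few rows: {table.num_rows}"

# 2. Schema check: every expected column is present
missing = EXPECTED_COLUMNS - set(table.schema.names)
assert not missing, f"Missing columns: {missing}"

# 3. Integrity checks: no null or duplicate primary keys
df = table.to_pandas()
assert df["transaction_id"].notna().all(), "Null transaction_id values found"
assert not df["transaction_id"].duplicated().any(), "Duplicate transaction_id values found"

print(f"Ingestion checks passed: {table.num_rows} rows")
```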
Task 1.2 - Transform data and perform feature engineering
- Apply common data cleaning techniques (outlier detection and treatment, missing value imputation, combining, deduplication) and explain how they affect model performance.
- Select and apply feature scaling techniques (normalization, standardization) and identify when scaling is required by the chosen algorithm.
- Perform feature transformations such as binning, log transforms, and feature splitting and recognize when they reduce skew or improve signal.
- Choose categorical encoding techniques (one-hot, binary, label encoding) based on feature cardinality and model type.
- Apply tokenization and basic text preprocessing as part of feature engineering for NLP workloads.
- Use tools to explore, visualize, and transform data (for example, SageMaker Data Wrangler) and interpret profiling outputs to guide transformations.
- Transform data at scale using AWS tools (for example, AWS Glue jobs, Spark on Amazon EMR) and choose an approach based on volume and operational constraints.
- Use AWS Glue DataBrew for no-code profiling and transformations and identify when it is more efficient than custom ETL code.
- Transform streaming data using appropriate services (for example, AWS Lambda or Spark-based streaming) and handle late/out-of-order events at a high level.
- Create and manage features using SageMaker Feature Store to ensure consistency between training and inference.
- Design feature definitions and transformations to avoid training/serving skew, including using the same transformation logic in training and inference paths (see the pipeline sketch after this list).
- Validate transformation outputs by checking schema, value ranges, and null thresholds and documenting transformation assumptions.
- Choose data annotation and labeling approaches to create high-quality labeled datasets and identify quality controls (audit tasks, consensus labeling) at a high level.
- Label and validate data using AWS services (for example, SageMaker Ground Truth and Amazon Mechanical Turk) and select the best option based on scale and sensitivity.
- Version features and transformation code to enable reproducible training and consistent reprocessing when data changes.
- Explain how feature engineering choices affect interpretability, bias, and downstream monitoring (for example, how encoded features influence fairness analysis).
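One way to keep transformation logic identical across training and inference (the skew point above) is to fit a single preprocessing-plus-model pipeline and reuse the fitted artifact at serving time. A minimal scikit-learn sketch; the column names, file path, and model choice are illustrative:

```python
import pandas as pd
import joblib
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ["amount", "tenure_days"]        # illustrative column names
categorical_features = ["channel", "device_type"]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_features),                            # standardization
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),  # one-hot encoding
])

model = Pipeline([("preprocess", preprocess), ("clf", LogisticRegression(max_iter=1000))])

# Fit once on training data ...
train_df = pd.read_parquet("train.parquet")          # illustrative path
model.fit(train_df[numeric_features + categorical_features], train_df["label"])

# ... then persist the whole pipeline so inference applies exactly the same transformations.
joblib.dump(model, "model.joblib")
```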
Task 1.3 - Ensure data integrity and prepare data for modeling
- Validate data quality using AWS tools (for example, AWS Glue DataBrew and AWS Glue Data Quality) and interpret quality checks to identify issues before training.
- Define and interpret pre-training bias metrics for numeric, text, and image data (for example, class imbalance and difference in proportions of labels).
- Identify sources of bias in data (for example, selection bias and measurement bias) and explain how these biases can propagate into models.
- Select strategies to address class imbalance (for example, resampling and synthetic data generation) and evaluate potential trade-offs.
- Prepare datasets to reduce prediction bias using dataset splitting and shuffling strategies, including stratification when appropriate (see the splitting sketch after this list).
- Apply data augmentation techniques for numeric, text, and image data and recognize when augmentation can introduce unintended bias.
- Use Amazon SageMaker Clarify to identify and mitigate sources of bias and document findings for governance.
- Select techniques to encrypt data at rest and in transit and explain why encryption and key management are required for sensitive ML datasets.
- Classify data and apply anonymization or masking techniques to reduce exposure of personally identifiable information (PII) or protected health information (PHI).
- Evaluate compliance implications such as PII/PHI handling and data residency requirements and apply them to dataset storage and processing design.
- Prevent data leakage by designing correct train/validation/test split boundaries and ensuring that transformations do not use future information.
- Prepare data in the expected input format and packaging for training jobs (for example, sharded inputs, channel layout, and consistent schema).
- Configure data to load into model training resources (for example, Amazon EFS or Amazon FSx) and explain why file-based storage can be useful for certain training workloads.
- Establish dataset lineage and provenance (dataset versioning, transformation history) to support reproducibility and audits.
- Implement integrity checks between pipeline stages (for example, checksums, idempotent reprocessing, schema validation gates).
- Define governance requirements for data readiness (approvals, documentation) before data is used for training or fine-tuning.
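A minimal sketch of stratified splitting with shuffling, done before any transformation is fit; the dataset path and label column are illustrative:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_parquet("curated/dataset.parquet")      # illustrative path
X, y = df.drop(columns=["label"]), df["label"]

# Hold out a stratified test set first so class proportions are preserved.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, shuffle=True, random_state=42
)
# Split the remainder into train and validation sets.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, stratify=y_train, shuffle=True, random_state=42
)

# To avoid leakage, fit scalers, encoders, and imputers on X_train only,
# then apply the fitted transformers to X_val and X_test.
```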
Domain 2: ML Model Development (26%)
Practice this topic →
Task 2.1 - Choose a modeling approach
- Assess available data, label quality, and problem complexity to determine whether an ML solution is feasible and how it should be scoped.
- Translate business requirements into an ML problem type (classification, regression, clustering, NLP, or computer vision) and identify success metrics.
- Compare common ML algorithm families and match them to use cases based on data type, interpretability, and operational constraints.
- Consider interpretability and transparency requirements during model selection and choose approaches that meet stakeholder needs.
- Choose between building a custom model and using AWS AI services based on customization needs, time-to-value, and operational overhead.
- Map common business problems to AWS AI services (for example, Amazon Translate, Amazon Transcribe, Amazon Rekognition) when a managed service is the best fit.
- Identify when a foundation model approach (for example, Amazon Bedrock) is appropriate for generative AI tasks and recognize common limitations.
- Choose built-in algorithms, foundation models, and solution templates using options such as SageMaker JumpStart and Amazon Bedrock.
- Identify SageMaker built-in algorithms and recognize scenarios where built-in algorithms reduce development time compared to custom training.
- Evaluate trade-offs between model performance, training time, inference latency, and cost when selecting an approach.
- Select a modeling approach that meets cost constraints by considering dataset size, training compute needs, and inference request volume.
- Define a simple baseline solution and explain how baselines help determine whether a more complex model is justified (see the baseline sketch after this list).
- Evaluate privacy and compliance constraints that affect model or service selection, including data residency and sensitive data handling.
- Choose feature representations (for example, embeddings vs engineered features) as part of the modeling approach based on task type and constraints.
- Identify when human review or human-in-the-loop workflows are appropriate due to model risk or decision impact.
- Incorporate deployment requirements (real-time vs batch, model size, edge constraints) into model selection and solution design decisions.
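A quick way to establish the baseline mentioned above is a trivial model (majority class for classification, mean for regression); a candidate that cannot beat it does not justify extra complexity. A minimal scikit-learn sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced toy data purely for illustration.
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9, 0.1], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=0)

# Majority-class baseline: the floor any candidate model must clear.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
candidate = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("baseline macro-F1 :", f1_score(y_val, baseline.predict(X_val), average="macro"))
print("candidate macro-F1:", f1_score(y_val, candidate.predict(X_val), average="macro"))
```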
Task 2.2 - Train and refine models
- Explain core training loop concepts (epochs, steps, batch size) and relate them to convergence, training time, and model quality.
- Choose methods to reduce training time (for example, early stopping and distributed training) while maintaining model quality.
- Identify factors that influence model size and understand how model size impacts inference latency, cost, and deployability.
- Select methods to improve model performance, including feature selection, data improvements, and algorithm/hyperparameter changes.
- Apply regularization techniques (for example, dropout, weight decay, L1 and L2) and explain how they reduce overfitting.
- Compare hyperparameter tuning techniques (random search and Bayesian optimization) and select an approach appropriate to the search space and budget.
- Identify hyperparameters and their effects on model performance (for example, number of trees and depth in tree-based models, number of layers in neural networks).
- Use SageMaker built-in algorithms and common ML libraries to develop models efficiently.
- Use SageMaker script mode with supported frameworks (for example, TensorFlow and PyTorch) to run custom training code.
- Fine-tune pre-trained models using custom datasets (for example, with Amazon Bedrock or SageMaker JumpStart) when customization is required.
- Perform hyperparameter tuning using SageMaker automatic model tuning (AMT) and interpret tuning results to select candidate models (see the tuning sketch after this list).
- Integrate models that were built outside SageMaker into SageMaker training and hosting workflows.
- Prevent overfitting, underfitting, and catastrophic forgetting using appropriate techniques (regularization, feature selection, training strategies).
- Combine multiple models to improve performance using ensemble methods such as stacking and boosting, and identify when ensembles increase operational complexity.
- Reduce model size for deployment using techniques such as pruning, compression, feature selection changes, and data type changes, and evaluate accuracy trade-offs.
- Manage model versions for repeatability and audits using tools such as the SageMaker Model Registry.
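A minimal sketch of SageMaker automatic model tuning wrapped around a script-mode PyTorch estimator; the IAM role, S3 URIs, training script, framework versions, and metric regex are placeholders to adapt:

```python
from sagemaker.pytorch import PyTorch
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

# Script-mode estimator: train.py holds the custom training loop.
estimator = PyTorch(
    entry_point="train.py",                                          # placeholder script
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",    # placeholder role
    instance_count=1,
    instance_type="ml.m5.xlarge",
    framework_version="2.1",                                         # use a supported version
    py_version="py310",
)

# Tune learning rate and batch size against a metric the script prints to its logs.
tuner = HyperparameterTuner(
    estimator,
    objective_metric_name="validation:f1",
    metric_definitions=[{"Name": "validation:f1", "Regex": "val_f1=([0-9\\.]+)"}],
    hyperparameter_ranges={
        "lr": ContinuousParameter(1e-5, 1e-2),
        "batch-size": IntegerParameter(16, 128),
    },
    max_jobs=10,
    max_parallel_jobs=2,
)

tuner.fit({"train": "s3://my-bucket/train/", "validation": "s3://my-bucket/val/"})
print("Best training job:", tuner.best_training_job())
```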
Task 2.3 - Analyze model performance
- Select evaluation techniques and metrics appropriate to the task (for example, confusion matrix, F1, accuracy, precision, recall, RMSE, ROC, AUC) and explain why the metric choice matters (see the metrics sketch after this list).
- Interpret confusion matrices, ROC curves, and related visualizations to assess classification performance and threshold trade-offs.
- Create performance baselines and compare candidate models against baselines to evaluate improvement and regressions.
- Identify model overfitting and underfitting using training/validation results and apply corrective actions.
- Diagnose convergence issues during training and recognize symptoms such as unstable loss, divergence, and slow convergence.
- Use SageMaker Model Debugger to debug model convergence issues and analyze training telemetry.
- Detect model bias and evaluate fairness concerns by selecting and interpreting relevant metrics and slice-based analysis.
- Use SageMaker Clarify metrics to gain insights into ML training data and models, including bias detection and explainability outputs.
- Assess trade-offs between model performance, training time, and cost when selecting a model for production use.
- Design reproducible experiments by tracking datasets, code, parameters, and outputs and using AWS services to support repeatability.
- Compare the performance of a shadow variant to a production variant and decide when to promote, roll back, or iterate.
- Use SageMaker Clarify outputs to interpret model predictions and communicate findings to stakeholders.
- Run A/B testing for model variants in production and interpret results with appropriate statistical caution at a high level.
- Define acceptance criteria for model promotion, including accuracy thresholds, bias checks, and operational constraints.
- Perform error analysis by inspecting mispredictions and identifying systematic failure patterns across data segments.
- Document evaluation results, limitations, and known risks to support governance, audits, and future improvements.
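A minimal sketch of the core classification diagnostics above (confusion matrix, precision/recall/F1, ROC AUC) using scikit-learn on placeholder labels and scores:

```python
import numpy as np
from sklearn.metrics import (classification_report, confusion_matrix,
                             precision_recall_fscore_support, roc_auc_score)

# Placeholder labels and scores; in practice these come from a held-out evaluation set.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 0])
y_score = np.array([0.10, 0.40, 0.35, 0.80, 0.70, 0.20, 0.90, 0.55, 0.65, 0.05])
y_pred = (y_score >= 0.5).astype(int)

print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))   # rows = actual, cols = predicted
print(classification_report(y_true, y_pred, digits=3))           # per-class precision / recall / F1
print("ROC AUC:", roc_auc_score(y_true, y_score))                # threshold-independent ranking quality

# Raising the threshold trades recall for precision; recompute to see the effect.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, (y_score >= 0.7).astype(int), average="binary"
)
print(f"At threshold 0.7: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```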
Domain 3: Deployment and Orchestration of ML Workflows (22%)
Practice this topic →
Task 3.1 - Select deployment infrastructure based on existing architecture and requirements
- Select a model serving strategy (real time, serverless, asynchronous, batch inference) based on latency, throughput, and user experience requirements.
- Differentiate model and endpoint requirements across deployment options (serverless endpoints, real-time endpoints, asynchronous endpoints, batch inference) and choose the best fit for a scenario (see the deployment sketch after this list).
- Apply deployment best practices such as versioning, staged rollouts, and rollback strategies to reduce risk when deploying models.
- Choose AWS deployment services (for example, Amazon SageMaker) and identify when container platforms (Kubernetes, Amazon ECS, Amazon EKS) or AWS Lambda are more appropriate.
- Provision compute resources for training and inference in production and test environments (CPU, GPU) and select environments to meet performance requirements.
- Evaluate performance, cost, and latency trade-offs across model sizes, instance types, and endpoint configurations.
- Select the appropriate compute environment for training and inference based on requirements (GPU vs CPU, processor family, networking bandwidth).
- Choose appropriate containers for hosting (provided vs customized/BYOC) and identify when custom containers are required.
- Select multi-model or multi-container deployments to optimize cost and manage multiple models or pre/post-processing components.
- Choose a deployment target that fits existing architecture and requirements (for example, SageMaker endpoints, Kubernetes, Amazon ECS, Amazon EKS, AWS Lambda).
- Select the correct deployment orchestrator (for example, Apache Airflow or SageMaker Pipelines) based on workflow complexity and team operational preferences.
- Choose model deployment strategies (real-time vs batch) and understand implications for monitoring, scaling, and cost.
- Design integration patterns for model inference APIs, including synchronous request/response vs asynchronous patterns and streaming responses where applicable.
- Explain, at a high level, how tools such as SageMaker Neo optimize models for edge devices and identify the constraints that drive edge optimization.
- Package and version model artifacts and dependencies for deployment and ensure repeatable builds across environments.
- Plan network placement and data access for endpoints, including VPC integration requirements and secure access to downstream data sources.
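A minimal sketch contrasting a real-time endpoint with a serverless endpoint using the SageMaker Python SDK; the container image, model artifact, role, endpoint names, and sizing are placeholders, and in practice you would deploy one option or the other:

```python
from sagemaker.model import Model
from sagemaker.serverless import ServerlessInferenceConfig

model = Model(
    image_uri="<inference-container-image-uri>",                      # placeholder image
    model_data="s3://my-bucket/model/model.tar.gz",                   # placeholder artifact
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",     # placeholder role
)

# Option A: real-time endpoint on dedicated instances -- steady traffic, strict latency targets.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    endpoint_name="demo-realtime-endpoint",
)

# Option B: serverless endpoint -- spiky or infrequent traffic, pay per use, cold starts possible.
# predictor = model.deploy(
#     serverless_inference_config=ServerlessInferenceConfig(memory_size_in_mb=2048, max_concurrency=5),
#     endpoint_name="demo-serverless-endpoint",
# )
```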
Task 3.2 - Create and script infrastructure based on existing architecture and requirements
- Differentiate on-demand and provisioned resources and choose an approach based on workload predictability and performance requirements.
- Compare scaling policies (for example, target tracking, step scaling, scheduled scaling) and determine which policy best meets expected traffic patterns.
- Configure SageMaker endpoint auto scaling policies to meet scalability requirements based on demand or time-based schedules (see the scaling sketch after this list).
- Choose specific metrics for auto scaling (for example, model latency, CPU utilization, invocations per instance) and explain why metric selection affects stability and cost.
- Explain containerization concepts and identify AWS container services used for ML deployments (Amazon ECR, Amazon ECS, Amazon EKS).
- Build and maintain containers for ML workloads (including BYOC with SageMaker) and apply reproducible build and dependency management practices.
- Select infrastructure as code (IaC) options (AWS CloudFormation vs AWS CDK) and explain trade-offs for maintainability and reuse.
- Automate provisioning of compute resources and manage communication between stacks using CloudFormation or AWS CDK outputs and parameters.
- Apply best practices to build scalable and cost-effective ML solutions (for example, endpoint auto scaling and cost-aware compute choices such as Spot where appropriate).
- Deploy and host models using the SageMaker SDK and automate endpoint creation, update, and deletion workflows.
- Configure SageMaker endpoints within a VPC network and explain how subnets, security groups, and routing affect connectivity.
- Design for maintainability by separating environments (dev/test/prod) and parameterizing infrastructure for repeatable deployments.
- Implement cost optimization patterns for inference infrastructure, including right-sizing, scaling to match demand, and choosing appropriate endpoint types.
- Secure infrastructure provisioning with least-privilege roles, encryption defaults, and safe handling of secrets and artifacts.
- Troubleshoot scaling and performance issues for deployed models, including misconfigured scaling policies, throttling, and service quota constraints.
- Evaluate trade-offs between infrastructure simplicity and flexibility when choosing between managed endpoints and container orchestration platforms.
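A minimal sketch of attaching a target tracking scaling policy to a SageMaker endpoint variant through Application Auto Scaling; the endpoint name, capacity bounds, target value, and cooldowns are placeholders:

```python
import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "endpoint/demo-endpoint/variant/AllTraffic"     # placeholder endpoint/variant

# Register the variant's instance count as a scalable target with min/max bounds.
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Target tracking on invocations per instance: scale out quickly, scale in conservatively.
autoscaling.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,    # target invocations per instance per minute (illustrative)
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 300,
    },
)
```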
Task 3.3 - Use automated orchestration tools to set up continuous integration and continuous delivery (CI/CD) pipelines
- Describe CI/CD principles and explain how ML workflows add additional versioned assets (data, features, models) compared to traditional application CI/CD.
- Identify capabilities and quotas for AWS CodePipeline, AWS CodeBuild, and AWS CodeDeploy and design pipelines that operate within service limits.
- Configure CodePipeline stages and artifacts for ML workflows (build, test, train, evaluate, register, deploy) and explain the purpose of each stage.
- Configure and troubleshoot CodeBuild, CodeDeploy, and CodePipeline, including stage transitions and common failure modes.
- Use version control systems (for example, Git) and select a branching strategy (Gitflow or GitHub Flow) aligned to deployment cadence and risk.
- Explain how code repositories and pipelines work together to trigger builds, tests, and deployments.
- Automate and integrate data ingestion with orchestration services as part of an end-to-end ML workflow.
- Use deployment strategies and rollback actions (blue/green, canary, linear) to reduce risk when deploying new model versions.
- Use AWS services to automate orchestration for model building and deployment, including coordinating processing, training, and evaluation steps.
- Configure training and inference jobs using event-driven triggers (for example, Amazon EventBridge rules) and pipeline tools such as SageMaker Pipelines and CodePipeline.
- Create automated tests in CI/CD pipelines (unit, integration, end-to-end) and explain how tests improve reliability of ML deployments.
- Implement validation gates in pipelines (metric thresholds, bias checks, explainability requirements) before promoting a model to production (see the gate sketch after this list).
- Build and integrate mechanisms to retrain models in response to schedules, new data, or detected drift.
- Manage versioning for datasets, code, and models to support reproducibility and safe rollback in production.
- Design approval workflows and separation of duties in ML delivery pipelines to satisfy governance and compliance expectations.
- Troubleshoot pipeline issues related to IAM permissions, artifact locations, environment configuration, and dependency management.
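A minimal sketch of the kind of validation gate described above, written as a plain Python step that a pipeline stage (for example, CodeBuild or a SageMaker Pipelines processing step) could run before model registration; the metrics file layout and thresholds are illustrative:

```python
import json
import sys

# Promotion thresholds a candidate model must meet (illustrative values).
MIN_F1 = 0.85
MAX_BIAS_DPPL = 0.10   # e.g., allowed difference in positive prediction rates between groups

# evaluation.json is assumed to be produced by an earlier evaluation step.
with open("evaluation.json") as f:
    report = json.load(f)

f1 = report["metrics"]["f1"]
bias = report["bias"]["dppl"]

failures = []
if f1 < MIN_F1:
    failures.append(f"F1 {f1:.3f} below threshold {MIN_F1}")
if abs(bias) > MAX_BIAS_DPPL:
    failures.append(f"bias metric {bias:.3f} exceeds limit {MAX_BIAS_DPPL}")

if failures:
    print("Validation gate FAILED:", "; ".join(failures))
    sys.exit(1)   # non-zero exit fails the stage and blocks promotion

print("Validation gate passed; model can be registered.")
```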
Domain 4: ML Solution Monitoring, Maintenance, and Security (24%)
Practice this topic →
Task 4.1 - Monitor model inference
- Differentiate data drift and concept drift and explain why drift detection is required for production ML systems.
- Select techniques to monitor data quality and model performance for inference workloads and identify which signals indicate degradation.
- Set up model monitoring in production using SageMaker Model Monitor and establish baselines for comparison.
- Monitor workflows to detect anomalies or errors in data processing or model inference and design alerting strategies.
- Detect changes in the distribution of data that can affect model performance and identify how tooling such as SageMaker Clarify can assist (see the drift sketch after this list).
- Monitor model performance in production using A/B testing and interpret results to decide on promotion or rollback.
- Compare shadow deployments and A/B testing and choose the appropriate strategy for safe evaluation in production.
- Define monitoring thresholds, SLIs/SLOs, and triggers for automated actions such as rollback, alerting, or retraining.
- Design dashboards and reporting for ML inference quality, latency, and error rates and communicate monitoring results to stakeholders.
- Incorporate feedback loops to capture ground truth when available and use feedback for evaluation and retraining planning.
- Plan operational processes for monitoring incidents, including triage, root cause analysis, and post-incident remediation.
- Apply design principles from ML well-architected guidance that relate to monitoring, including automation, observability, and controlled change.
- Balance observability needs with privacy constraints by limiting exposure of raw sensitive inputs and controlling access to logs.
- Design retraining triggers based on monitoring signals, including drift, performance regression, and changes in business conditions.
- Detect and respond to anomalous or abusive inference traffic patterns that may indicate misuse or security issues.
- Maintain documentation of monitoring configurations, baselines, and changes to support audits and continuous improvement.
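A lightweight illustration of distribution-drift detection: compare a feature's training-time baseline against recent inference traffic with a two-sample KS test and publish the statistic as a CloudWatch metric that an alarm can watch. SageMaker Model Monitor provides a managed version of this pattern; the file names, feature, and namespace below are placeholders:

```python
import boto3
import numpy as np
from scipy.stats import ks_2samp

# Baseline captured at training time vs. values captured from recent inference traffic.
baseline = np.load("baseline_amount.npy")   # placeholder file
recent = np.load("recent_amount.npy")       # placeholder file

statistic, p_value = ks_2samp(baseline, recent)
print(f"KS statistic={statistic:.3f}  p-value={p_value:.4f}")

# Publish the drift statistic so a CloudWatch alarm can drive alerting or retraining triggers.
boto3.client("cloudwatch").put_metric_data(
    Namespace="MLMonitoring",               # placeholder namespace
    MetricData=[{
        "MetricName": "FeatureDriftKS",
        "Dimensions": [{"Name": "Feature", "Value": "amount"}],
        "Value": float(statistic),
    }],
)
```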
Task 4.2 - Monitor and optimize infrastructure and costs
- Identify key performance metrics for ML infrastructure (utilization, throughput, availability, scalability, fault tolerance) and map them to operational goals.
- Use Amazon CloudWatch metrics, logs, and alarms to troubleshoot latency and performance issues for ML endpoints and pipelines (see the alarm sketch after this list).
- Use CloudWatch Logs Insights to analyze logs and identify bottlenecks and error patterns in ML applications.
- Use observability tools (for example, AWS X-Ray and CloudWatch Lambda Insights) to trace requests and diagnose end-to-end latency.
- Create AWS CloudTrail trails to log, monitor, and audit activities that affect ML systems, including retraining and deployment actions.
- Build dashboards to monitor performance and cost metrics using tools such as CloudWatch dashboards and Amazon QuickSight.
- Monitor infrastructure events using Amazon EventBridge events and integrate event-driven notifications or remediation workflows.
- Differentiate instance families (general purpose, compute optimized, memory optimized, inference optimized) and select the right family for training and inference workloads.
- Right-size inference infrastructure using tools such as SageMaker Inference Recommender and AWS Compute Optimizer.
- Troubleshoot and resolve latency and scaling issues, including misconfigured auto scaling, throttling, and insufficient capacity.
- Manage service quotas and capacity constraints and identify when to request quota increases for production workloads.
- Implement cost tracking and allocation techniques (for example, resource tagging) to enable chargeback and visibility.
- Use cost analysis tools (AWS Cost Explorer and AWS Billing and Cost Management) to analyze spend and identify optimization opportunities.
- Set budgets and cost quotas using tools such as AWS Budgets and use alerts to prevent unexpected spend.
- Use AWS Trusted Advisor to identify cost and performance recommendations and prioritize fixes by impact.
- Optimize infrastructure costs using purchasing options (Spot Instances, On-Demand Instances, Reserved Instances, SageMaker Savings Plans) based on workload characteristics.
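A minimal sketch of a CloudWatch alarm on SageMaker endpoint model latency that notifies an SNS topic; the endpoint name, threshold, and topic ARN are placeholders:

```python
import boto3

boto3.client("cloudwatch").put_metric_alarm(
    AlarmName="demo-endpoint-high-latency",
    Namespace="AWS/SageMaker",
    MetricName="ModelLatency",                                   # reported in microseconds
    Dimensions=[
        {"Name": "EndpointName", "Value": "demo-endpoint"},      # placeholder endpoint
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=3,
    Threshold=500_000,                                           # 500 ms expressed in microseconds
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ml-alerts"],  # placeholder SNS topic
)
```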
Task 4.3 - Secure AWS resources
- Design IAM roles, policies, and groups to control access to AWS services used in ML systems and apply the principle of least privilege.
- Use resource policies (for example, Amazon S3 bucket policies) to restrict access to datasets, model artifacts, and other ML assets (see the policy sketch after this list).
- Configure IAM policies and roles for users and applications that interact with ML systems, including separate roles for training, processing, and inference.
- Secure SageMaker workloads using appropriate SageMaker security and compliance features and document controls for audits.
- Configure least-privilege access to ML artifacts (datasets, feature store, model registry, endpoints) and periodically review permissions.
- Control network access to ML resources using VPCs, subnets, security groups, and routing, and explain how isolation reduces blast radius.
- Build VPC network designs that securely isolate ML systems and support private connectivity to dependent AWS services.
- Secure data at rest and in transit for ML systems using encryption and key management practices (for example, using AWS KMS).
- Apply security best practices for CI/CD pipelines, including least-privilege build roles, secure artifact handling, and safe secret management.
- Monitor, audit, and log ML systems to ensure continued security and compliance, including capturing access and change events.
- Design logging retention and access controls so that audit logs support investigations without exposing sensitive data.
- Troubleshoot and debug security issues in ML systems, including AccessDenied errors, KMS permission issues, and network connectivity problems.
- Implement separation of duties and environment separation (dev/test/prod) to reduce risk and support compliance requirements.
- Secure and rotate secrets used by ML applications (for example, credentials and API keys) using appropriate AWS secret management services.
- Establish incident response practices for ML systems, including containment actions, credential rotation, and post-incident review.
- Apply governance practices such as tagging, policy enforcement, and periodic access reviews to maintain secure and compliant ML environments.
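A minimal sketch of a resource policy that limits a training role to read-only access on a curated dataset prefix and blocks non-TLS requests, applied with boto3; the bucket, prefix, and role ARN are placeholders:

```python
import json
import boto3

BUCKET = "ml-datasets-example"                                             # placeholder bucket
TRAINING_ROLE = "arn:aws:iam::123456789012:role/SageMakerTrainingRole"     # placeholder role

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {   # Allow the training role to read objects only under the curated/ prefix.
            "Sid": "AllowTrainingRoleReadCurated",
            "Effect": "Allow",
            "Principal": {"AWS": TRAINING_ROLE},
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::{BUCKET}/curated/*",
        },
        {   # Deny any request that does not use TLS.
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [f"arn:aws:s3:::{BUCKET}", f"arn:aws:s3:::{BUCKET}/*"],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        },
    ],
}

boto3.client("s3").put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))
```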
Tip: MLA-C01 is heavy on “best-fit” trade-offs. After each task, write 5–10 one-liner rules from your misses.