Keep this page open while drilling questions. MLA‑C01 rewards “production ML realism”: data quality gates, repeatability, safe deployments, drift monitoring, cost controls, and least-privilege security.
Quick facts (MLA-C01)
| Item | Value |
|---|---|
| Questions | 65 (multiple-choice + multiple-response) |
| Time | 130 minutes |
| Passing score | 720 (scaled 100–1000) |
| Cost | 150 USD |
| Domains | D1 28% • D2 26% • D3 22% • D4 24% |
Fast strategy (what the exam expects)
- If the question asks for the best-fit managed ML service, the answer is often SageMaker (Feature Store, Pipelines, Model Registry, managed endpoints).
- If the scenario is “data is messy,” think data quality checks, profiling, transformations, and feature consistency (train/serve).
- If the scenario is “accuracy dropped in prod,” think drift, monitoring baselines, A/B or shadow, and retraining triggers.
- If the scenario is “cost is spiking,” think right-sizing, endpoint type selection, auto scaling, Spot / Savings Plans, and budgets/tags.
- If there’s “security/compliance,” include least privilege IAM, encryption, VPC isolation, and audit logging.
- Read the last sentence first to capture constraints: latency, cost, ops effort, compliance, auditability.
Domain weights (how to allocate your time)
| Domain | Weight | Prep focus |
|---|---|---|
| Domain 1: Data Preparation for ML | 28% | Ingest/ETL, feature engineering, data quality and bias basics |
| Domain 2: ML Model Development | 26% | Model choice, training/tuning, evaluation, Clarify/Debugger/Registry |
| Domain 3: Deployment + Orchestration | 22% | Endpoint types, scaling, IaC, CI/CD for ML pipelines |
| Domain 4: Monitoring + Security | 24% | Drift/model monitor, infra monitoring + costs, security controls |
0) SageMaker service map (high yield)
| Capability | What it’s for | MLA‑C01 “why it matters” |
|---|---|---|
| SageMaker Data Wrangler | Data prep + feature engineering | Fast, repeatable transforms; reduces time-to-first-model |
| SageMaker Feature Store | Central feature storage | Avoid train/serve skew; feature reuse and governance |
| SageMaker Training | Managed training jobs | Repeatable, scalable training on AWS compute |
| SageMaker AMT | Hyperparameter tuning | Systematic search for better model configs |
| SageMaker Clarify | Bias + explainability | Responsible ML evidence + model understanding |
| SageMaker Model Debugger | Training diagnostics | Debug convergence and training instability |
| SageMaker Model Registry | Versioning + approvals | Auditability, rollback, safe promotion to prod |
| SageMaker Endpoints | Managed model serving | Real-time/serverless/async inference patterns |
| SageMaker Model Monitor | Monitoring workflows | Detect drift and quality issues in production |
| SageMaker Pipelines | ML workflow orchestration | Build-test-train-evaluate-register-deploy automation |
1) End-to-end ML on AWS (mental model)
```mermaid
flowchart LR
S["Sources"] --> I["Ingest"]
I --> T["Transform + Quality Checks"]
T --> F["Feature Engineering + Feature Store"]
F --> TR["Train + Tune"]
TR --> E["Evaluate + Bias/Explainability"]
E --> R["Register + Approve"]
R --> D["Deploy Endpoint or Batch"]
D --> M["Monitor Drift/Quality/Cost/Security"]
M -->|Triggers| RT["Retrain"]
RT --> TR
```
High-yield framing: MLA‑C01 is about the pipeline, not just the model.
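A minimal SageMaker Pipelines sketch that mirrors this flow (train, then register), assuming hypothetical bucket, role, and model-package-group names; your steps, images, and inputs will differ:

```python
# Sketch: a two-step SageMaker Pipeline (train -> register). All names/ARNs are assumptions.
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import TrainingStep
from sagemaker.workflow.step_collections import RegisterModel

role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # assumed role
session = sagemaker.Session()

estimator = Estimator(
    image_uri=sagemaker.image_uris.retrieve("xgboost", session.boto_region_name, version="1.7-1"),
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-ml-bucket/models/",  # assumed bucket
)

train_step = TrainingStep(
    name="TrainModel",
    estimator=estimator,
    inputs={"train": TrainingInput("s3://my-ml-bucket/features/train/", content_type="text/csv")},
)

register_step = RegisterModel(
    name="RegisterModel",
    estimator=estimator,
    model_data=train_step.properties.ModelArtifacts.S3ModelArtifacts,
    content_types=["text/csv"],
    response_types=["text/csv"],
    inference_instances=["ml.m5.large"],
    transform_instances=["ml.m5.large"],
    model_package_group_name="churn-models",  # assumed package group
)

pipeline = Pipeline(name="train-register-pipeline", steps=[train_step, register_step])
pipeline.upsert(role_arn=role)  # create or update the pipeline definition
pipeline.start()                # kick off an execution
```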
2) Domain 1 — Data preparation (28%)
| You need… | Typical best-fit | Why |
|---|---|---|
| Visual data prep + fast iteration | SageMaker Data Wrangler | Interactive + repeatable workflows |
| No/low-code transforms and profiling | AWS Glue DataBrew | Good for business-friendly prep |
| Scalable ETL jobs | AWS Glue / Spark | Production batch ETL at scale |
| Big Spark workloads (custom) | Amazon EMR | More control over Spark |
| Simple streaming transforms | AWS Lambda | Event-driven, lightweight |
| Streaming analytics | Managed Apache Flink | Stateful streaming at scale |
Data formats (what to pick)
| Format | Why it shows up | Trade-offs / notes |
|---|---|---|
| Parquet / ORC | Columnar analytics + efficient reads | Best for large tabular datasets |
| CSV / JSON | Interop + simplicity | Bigger + slower at scale |
| Avro | Schema evolution + streaming | Good for pipelines |
| RecordIO | ML-specific record formats | Useful with some training stacks |
Rule: choose formats based on access patterns (scan vs selective reads), schema evolution, and scale.
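As a sketch of that rule, converting a raw CSV drop into partitioned Parquet with AWS SDK for pandas (awswrangler); the bucket names and columns here are assumptions:

```python
# Sketch (assumed buckets/columns): raw CSV -> partitioned Parquet on S3,
# so downstream training/analytics do selective, columnar reads.
import awswrangler as wr
import pandas as pd

df = wr.s3.read_csv("s3://my-raw-bucket/events/2024/06/")   # read the raw CSV drop
df["event_date"] = pd.to_datetime(df["event_ts"]).dt.date   # derive a partition key

wr.s3.to_parquet(
    df=df,
    path="s3://my-curated-bucket/events/",  # curated zone of the data lake
    dataset=True,                           # Hive-style partition layout
    partition_cols=["event_date"],          # lets readers prune partitions
    compression="snappy",
)
```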
Data ingestion and storage (high yield)
- Amazon S3: default data lake for ML (durable, cheap, scalable).
- Amazon EFS / FSx: file-based access patterns; useful when training expects POSIX-like file semantics.
- Streaming ingestion: use Kinesis/managed streaming where low-latency data arrival matters.
Common best answers:
- Use AWS Glue / Spark on EMR for big ETL jobs.
- Use SageMaker Data Wrangler for fast interactive prep and repeatable transformations.
- Use SageMaker Feature Store to keep training/inference features consistent.
Feature Store: why it matters
- Avoid train/serve skew: the feature used in training is the same feature served to inference.
- Support feature reuse across teams and models.
- Enable governance: feature definitions and versions.
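A small sketch of the "same feature at train and serve time" idea, assuming a hypothetical feature group (`customer-features`) already populated by your ingestion job:

```python
# Sketch (assumed feature group / identifier): read the online feature record at
# inference time that matches what was ingested for training, avoiding skew.
import boto3

fs_runtime = boto3.client("sagemaker-featurestore-runtime")

record = fs_runtime.get_record(
    FeatureGroupName="customer-features",         # assumed feature group
    RecordIdentifierValueAsString="customer-42",  # same key used offline
)
features = {f["FeatureName"]: f["ValueAsString"] for f in record["Record"]}
print(features)  # pass these to the model exactly as engineered offline
```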
Data integrity + bias basics (often tested)
| Problem | What to do | Tooling you might name |
|---|---|---|
| Missing/invalid data | Add data quality checks + fail fast | Glue DataBrew / Glue Data Quality |
| Class imbalance | Resampling or synthetic data | (Conceptual) + Clarify for analysis |
| Bias sources | Identify selection/measurement bias | SageMaker Clarify (bias analysis) |
| Sensitive data | Classify + mask/anonymize + encrypt | KMS + access controls |
| Compliance constraints | Data residency + least privilege + audit logs | IAM + CloudTrail + region choices |
High-yield rule: don’t “fix” model issues before you verify data quality and leakage.
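A conceptual fail-fast quality gate in plain pandas (on AWS you would typically express similar rules with Glue Data Quality or a Processing step); the column names are assumptions:

```python
# Conceptual sketch: fail fast on bad data before launching any training job.
import pandas as pd

def quality_gate(df: pd.DataFrame) -> None:
    problems = []
    if df["customer_id"].isna().any():                       # missing identifiers
        problems.append("customer_id has nulls")
    if not df["age"].between(0, 120).all():                  # out-of-range values
        problems.append("age outside [0, 120]")
    majority = df["label"].value_counts(normalize=True).max()
    if majority > 0.95:                                      # severe class imbalance
        problems.append(f"majority class = {majority:.1%}")
    if problems:
        raise ValueError("Data quality gate failed: " + "; ".join(problems))

# quality_gate(training_df)  # run as a gate in the pipeline, not after training
```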
3) Domain 2 — Model development (26%)
Choosing an approach
| If you need… | Typical best-fit |
|---|---|
| A standard AI capability with minimal ML ops | AWS AI services (Translate/Transcribe/Rekognition, etc.) |
| A custom model with managed training + deployment | Amazon SageMaker |
| A foundation model / generative capability | Amazon Bedrock (when applicable) |
Rule: don’t overbuild. If an AWS managed AI service solves it, it usually wins on time-to-value and ops.
Training and tuning (high yield)
- Training loop terms: epoch, step, batch size.
- Speedups: early stopping, distributed training.
- Generalization controls: regularization (L1/L2, dropout, weight decay) + better data/features.
- Hyperparameter tuning: random search vs Bayesian optimization; in SageMaker, use Automatic Model Tuning (AMT).
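A hedged sketch of SageMaker Automatic Model Tuning with Bayesian search and early stopping, assuming an `estimator` (for example built-in XGBoost emitting `validation:auc`) and S3 paths defined elsewhere:

```python
# Sketch (assumed estimator/buckets): Bayesian hyperparameter search with AMT,
# early stopping trims clearly-losing jobs to save cost.
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

tuner = HyperparameterTuner(
    estimator=estimator,                     # SageMaker Estimator defined elsewhere
    objective_metric_name="validation:auc",  # metric the training job emits
    hyperparameter_ranges={
        "eta": ContinuousParameter(0.01, 0.3),
        "max_depth": IntegerParameter(3, 10),
    },
    strategy="Bayesian",                     # vs. "Random"
    max_jobs=20,
    max_parallel_jobs=4,
    early_stopping_type="Auto",
)
tuner.fit({
    "train": "s3://my-ml-bucket/features/train/",
    "validation": "s3://my-ml-bucket/features/val/",
})
```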
Metrics picker (what to choose)
| Task | Common metrics | What the exam tries to trick you on |
|---|---|---|
| Classification | Accuracy, precision, recall, F1, ROC-AUC | Class imbalance makes accuracy misleading |
| Regression | MAE/RMSE | Outliers and error cost (what matters more?) |
| Model selection | Metric + cost/latency | “Best” isn’t only accuracy |
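A quick scikit-learn illustration of the imbalance trap: a model that always predicts the majority class scores 95% accuracy but has zero recall:

```python
# Why accuracy misleads on imbalanced classes.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0] * 95 + [1] * 5   # 5% positive class
y_pred = [0] * 100            # "model" that always predicts the majority class

print(accuracy_score(y_true, y_pred))                      # 0.95 -- looks great
print(recall_score(y_true, y_pred))                        # 0.0  -- misses every positive
print(precision_score(y_true, y_pred, zero_division=0))    # 0.0
print(f1_score(y_true, y_pred, zero_division=0))           # 0.0
```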
Overfitting vs underfitting (signals)
| Symptom | Likely issue | Typical fix |
|---|---|---|
| Train ↑, validation ↓ | Overfitting | Regularization, simpler model, more data, better features |
| Both low | Underfitting | More expressive model, better features, tune hyperparameters |
Clarify vs Debugger vs Model Monitor (common confusion)
| Tool | What it helps with | When to name it |
|---|---|---|
| SageMaker Clarify | Bias + explainability | Fairness questions, “why did it predict X?” |
| SageMaker Model Debugger | Training diagnostics + convergence | Training instability, loss not decreasing, debugging training |
| SageMaker Model Monitor | Production monitoring workflows | Drift, data quality degradation, monitoring baselines |
Model Registry (repeatability + governance)
- Track: model artifacts, metrics, lineage, approvals.
- Enables safe promotion/rollback and audit-ready workflows.
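A sketch of flipping a registered model version to `Approved` with boto3 so a deployment pipeline can react; the model package group name is an assumption:

```python
# Sketch (assumed group name): approve the newest model package version.
import boto3

sm = boto3.client("sagemaker")

versions = sm.list_model_packages(
    ModelPackageGroupName="churn-models",   # assumed package group
    SortBy="CreationTime",
    SortOrder="Descending",
)["ModelPackageSummaryList"]

latest_arn = versions[0]["ModelPackageArn"]

# Flip the approval status; a pipeline or EventBridge rule can react to this change.
sm.update_model_package(
    ModelPackageArn=latest_arn,
    ModelApprovalStatus="Approved",
    ApprovalDescription="Passed evaluation thresholds and bias checks",
)
```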
4) Domain 3 — Deployment and orchestration (22%)
Endpoint types (must-know picker)
| Endpoint type | Best for | Typical constraint |
|---|---|---|
| Real-time | Steady, low-latency inference | Cost for always-on capacity |
| Serverless | Spiky traffic, scale-to-zero | Cold starts + limits |
| Asynchronous | Long inference time, bursty workloads | Event-style patterns + polling/callback |
| Batch inference | Scheduled/offline scoring | Not interactive |
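A sketch of the endpoint-type choice in the SageMaker Python SDK, assuming a `model` object built elsewhere; serverless shown, real-time left as a commented alternative:

```python
# Sketch (assumed model/endpoint names): serverless vs. real-time deployment.
from sagemaker.serverless import ServerlessInferenceConfig

# Serverless: spiky/low traffic, scale-to-zero, accepts cold starts.
predictor = model.deploy(
    serverless_inference_config=ServerlessInferenceConfig(
        memory_size_in_mb=2048,
        max_concurrency=5,
    ),
    endpoint_name="churn-serverless",
)

# Real-time: always-on capacity for steady low-latency traffic.
# predictor = model.deploy(initial_instance_count=1,
#                          instance_type="ml.m5.large",
#                          endpoint_name="churn-realtime")
```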
Scaling metrics (what to pick)
| Metric | Good when… | Watch out |
|---|---|---|
| Invocations per instance | Request volume drives load | Spiky traffic can cause oscillation |
| Latency | You have a latency SLO | Noisy metrics require smoothing |
| CPU/GPU utilization | Compute bound models | Not always correlated to request rate |
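A target-tracking auto scaling sketch on invocations per instance, assuming a hypothetical endpoint `churn-realtime` with variant `AllTraffic`:

```python
# Sketch (assumed endpoint/variant): target tracking on invocations per instance.
import boto3

aas = boto3.client("application-autoscaling")
resource_id = "endpoint/churn-realtime/variant/AllTraffic"

aas.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

aas.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 75.0,  # target invocations per instance (per minute)
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)
```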
Multi-model / multi-container (why they exist)
- Multi-model: multiple models behind one endpoint to reduce cost.
- Multi-container: pre/post-processing plus model serving, or multiple frameworks.
IaC + containers (exam patterns)
- IaC: CloudFormation or CDK for reproducible environments.
- Containers: build/publish to ECR, deploy via SageMaker, ECS, or EKS.
CI/CD for ML (what’s different)
You version and validate more than code:
- Code + data + features + model artifacts + evaluation reports
- Promotion gates: accuracy thresholds, bias checks, smoke tests, canary/shadow validation
Typical services: CodePipeline/CodeBuild/CodeDeploy, SageMaker Pipelines, EventBridge triggers.
```mermaid
flowchart LR
G["Git push"] --> CP["CodePipeline"]
CP --> CB["CodeBuild: tests + build"]
CB --> P["SageMaker Pipeline: process/train/eval"]
P --> Gate{"Meets<br/>thresholds?"}
Gate -->|yes| MR["Model Registry approve"]
Gate -->|no| Stop["Stop + report"]
MR --> Dep["Deploy (canary/shadow)"]
Dep --> Mon["Monitor + rollback triggers"]
```
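One way (a sketch, with assumed ARNs and rule name) to wire the "approve → deploy" hand-off: an EventBridge rule that matches Model Registry approval events and targets a deployment pipeline:

```python
# Sketch (assumed pipeline/role ARNs): react to model package approval events.
import json
import boto3

events = boto3.client("events")

events.put_rule(
    Name="on-model-approved",
    EventPattern=json.dumps({
        "source": ["aws.sagemaker"],
        "detail-type": ["SageMaker Model Package State Change"],
        "detail": {"ModelApprovalStatus": ["Approved"]},
    }),
)

events.put_targets(
    Rule="on-model-approved",
    Targets=[{
        "Id": "start-deploy-pipeline",
        "Arn": "arn:aws:codepipeline:us-east-1:123456789012:deploy-churn-model",  # assumed
        "RoleArn": "arn:aws:iam::123456789012:role/EventBridgeInvokePipeline",    # assumed
    }],
)
```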
5) Domain 4 — Monitoring, cost, and security (24%)
Monitoring and drift (high yield)
- Data drift: input distribution changed.
- Concept drift: relationship between input and label changed.
- Use baselines + ongoing checks; monitor latency/errors too.
Common services/patterns:
- SageMaker Model Monitor for monitoring workflows.
- A/B testing or shadow deployments for safe comparison.
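A hedged Model Monitor sketch: baseline the training data, then schedule hourly data-quality checks against captured endpoint traffic; role, S3 paths, and endpoint name are assumptions, and data capture must already be enabled on the endpoint:

```python
# Sketch (assumed role/paths/endpoint): baseline + hourly data-quality monitoring.
from sagemaker.model_monitor import DefaultModelMonitor, CronExpressionGenerator
from sagemaker.model_monitor.dataset_format import DatasetFormat

monitor = DefaultModelMonitor(
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # assumed role
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

# 1) Baseline from the training data: statistics + suggested constraints.
monitor.suggest_baseline(
    baseline_dataset="s3://my-ml-bucket/features/train/train.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://my-ml-bucket/monitoring/baseline/",
)

# 2) Hourly checks of captured endpoint traffic against that baseline.
monitor.create_monitoring_schedule(
    monitor_schedule_name="churn-data-quality",
    endpoint_input="churn-realtime",  # endpoint must have data capture enabled
    output_s3_uri="s3://my-ml-bucket/monitoring/reports/",
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
)
```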
Monitoring checklist (what to instrument)
- Inference quality: when ground truth is available later, compare predicted vs actual.
- Data quality: nulls, ranges, schema changes, category explosion.
- Distribution shift: feature histograms/summary stats vs baseline.
- Ops signals: p50/p95 latency, error rate, throttles, timeouts.
- Safety/security: anomalous traffic spikes, abuse patterns, permission failures.
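A sketch of one ops-signal alarm from the checklist above: p95 `ModelLatency` for an endpoint (SageMaker reports this metric in microseconds); endpoint, variant, and SNS topic names are assumptions:

```python
# Sketch (assumed endpoint/variant/SNS topic): alarm on p95 model latency.
import boto3

cw = boto3.client("cloudwatch")
cw.put_metric_alarm(
    AlarmName="churn-realtime-p95-latency",
    Namespace="AWS/SageMaker",
    MetricName="ModelLatency",
    Dimensions=[
        {"Name": "EndpointName", "Value": "churn-realtime"},
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    ExtendedStatistic="p95",
    Period=60,
    EvaluationPeriods=5,
    Threshold=200_000,  # 200 ms; ModelLatency is reported in microseconds
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ml-alerts"],  # assumed topic
)
```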
Infra + cost optimization (high yield)
| Theme | What to do |
|---|---|
| Observability | CloudWatch metrics/logs/alarms; Logs Insights; X-Ray for traces |
| Rightsizing | Pick instance family/size based on perf; use Inference Recommender + Compute Optimizer |
| Spend control | Tags + Cost Explorer + Budgets + Trusted Advisor |
| Purchasing options | Spot / Reserved / Savings Plans where the workload fits |
Cost levers (common “best answer” patterns)
- Choose the right inference mode first: batch (cheapest) → async → serverless → real-time (most always-on).
- Right-size and auto scale; don’t leave endpoints overprovisioned.
- Use Spot for fault-tolerant training/batch where interruptions are acceptable.
- Use Budgets + tags early (before the bills surprise you).
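To make the Spot bullet above concrete, a sketch of managed Spot training with checkpointing so interruptions resume rather than restart; the image, role, and bucket are assumed to be defined elsewhere:

```python
# Sketch (assumed image/role/bucket): managed Spot training with checkpointing.
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri=training_image,                            # defined elsewhere
    role=role,                                           # defined elsewhere
    instance_count=1,
    instance_type="ml.m5.xlarge",
    use_spot_instances=True,                             # use Spot capacity for training
    max_run=3600,                                        # max training seconds
    max_wait=7200,                                       # max total seconds incl. Spot waits (>= max_run)
    checkpoint_s3_uri="s3://my-ml-bucket/checkpoints/",  # resume point after interruption
    output_path="s3://my-ml-bucket/models/",
)
```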
Security defaults (high yield)
- Least privilege IAM for training jobs, pipelines, and endpoints.
- Encrypt at rest + in transit (KMS + TLS).
- VPC isolation (subnets + security groups) for ML resources when required.
- Audit trails (CloudTrail) + controlled access to logs and artifacts.
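To make least privilege concrete, a sketch of an inline policy for a training role that reads only the training prefix, writes only the model output prefix, and can use a single KMS key; all ARNs are placeholders:

```python
# Sketch (placeholder ARNs): a tight inline policy for a training role.
import json

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {   # read only the training data prefix
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::my-ml-bucket/features/train/*",
        },
        {   # write only the model output prefix
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            "Resource": "arn:aws:s3:::my-ml-bucket/models/*",
        },
        {   # use exactly one KMS key for encrypt/decrypt of those objects
            "Effect": "Allow",
            "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
            "Resource": "arn:aws:kms:us-east-1:123456789012:key/1111-2222",  # placeholder
        },
    ],
}
print(json.dumps(policy, indent=2))
```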
Common IAM/security “gotchas”
- Training role can read S3 but can’t decrypt KMS key (KMS key policy vs IAM policy mismatch).
- Endpoint role has broad S3 access (“*”) instead of a tight prefix.
- Secrets leak into logs/artifacts (build logs, notebooks, environment variables).
- No audit trail for model registry approvals or endpoint updates.
Next steps