Use this syllabus as your source of truth for ML‑PRO. Work through it topic by topic, and drill practice questions after each section.
What’s covered
Topic 1: Feature Pipelines & Training/Serving Consistency
Practice this topic →
1.1 Feature definitions, reuse, and lifecycle
- Explain why reusable features reduce duplicated logic and improve consistency across models (see the sketch after this list).
- Differentiate offline feature computation from online/serving feature access conceptually.
- Identify feature ownership and documentation practices that prevent misinterpretation.
- Recognize when to materialize features vs compute on demand (cost/latency trade-off awareness).
- Given a scenario, choose a feature strategy that supports multi-team reuse and controlled change.
- Describe why feature freshness and backfills must be managed like production pipelines.
- Explain why lineage from features to sources is required for audit and incident response.
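A minimal sketch of the reuse and materialize-vs-on-demand points above, assuming a pandas-based pipeline; all function, column, and path names are illustrative, not a specific feature-store API:

```python
# Sketch: one feature definition shared by the materialized (batch) path and the
# on-demand path, so every consumer gets identical logic. Names are illustrative.
import pandas as pd

def compute_customer_features(orders: pd.DataFrame) -> pd.DataFrame:
    """Single, documented source of truth for the feature logic."""
    feats = orders.groupby("customer_id").agg(
        order_count_30d=("order_id", "count"),
        avg_order_value_30d=("amount", "mean"),
    )
    return feats.reset_index()

def materialize(orders: pd.DataFrame, path: str) -> None:
    """Batch path: precompute on a schedule and store for reuse by many models."""
    compute_customer_features(orders).to_parquet(path, index=False)

def features_for_customer(recent_orders: pd.DataFrame) -> dict:
    """On-demand path: same logic applied to one entity's recent rows at request time."""
    return compute_customer_features(recent_orders).iloc[0].to_dict()
```

The materialized path trades storage and scheduled pipeline maintenance for cheap, low-latency reads; the on-demand path avoids storage but pays compute and latency on every request.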
1.2 Leakage prevention and data validation
- Identify common forms of leakage (future information, label leakage) and their symptoms.
- Choose split strategies that align with production prediction timing (especially for time series).
- Explain why transformations should be fit on training data only and applied consistently (see the sketch after this list).
- Describe why schema validation and data checks prevent silent feature drift.
- Given a scenario, select a leakage mitigation plan that preserves model usefulness and correctness.
- Explain why monitoring input distributions can detect feature drift early.
- Recognize that “too good to be true” evaluation results often signal leakage or contamination.
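A minimal sketch of the split and fit-on-train-only points, assuming scikit-learn and a time-ordered dataset; the file, column, and model choices are placeholders:

```python
# Sketch: time-aware validation plus a pipeline so preprocessing is fit on
# training folds only. File, column, and model choices are illustrative.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_parquet("training_data.parquet").sort_values("event_time")  # placeholder dataset
X, y = df[["f1", "f2", "f3"]], df["label"]

# Keeping the scaler inside the pipeline means it is re-fit on each training fold
# only and merely applied to validation rows -- no statistics leak from the future.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# TimeSeriesSplit keeps every validation row strictly after its training rows,
# matching production prediction timing.
scores = cross_val_score(pipeline, X, y, cv=TimeSeriesSplit(n_splits=5), scoring="roc_auc")
print("fold AUCs:", scores)
```

If these scores look dramatically better than any production baseline, treat that as a leakage signal to investigate before celebrating.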
1.3 Training/serving skew and mitigation
- Define training/serving skew and explain why it causes production performance collapse.
- Identify skew sources: inconsistent preprocessing, missing features, schema mismatch, and drift.
- Describe how shared preprocessing pipelines reduce skew risk (concept-level).
- Given a scenario, choose whether to fix upstream data, adjust features, or roll back the model.
- Explain why strict input schema contracts improve reliability for serving.
- Recognize that monitoring must include both model metrics and feature pipeline health.
- Describe how to validate skew hypotheses using logged inputs and offline replay (sketched below).
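A minimal replay sketch for the skew-validation objective, assuming logged serving inputs and the offline feature table can be joined on an entity id; all file names, columns, and the schema contract are assumptions:

```python
# Sketch: check the serving input contract, then replay logged inputs against the
# offline feature pipeline to test a skew hypothesis. Names are illustrative.
import pandas as pd

EXPECTED_SCHEMA = {"customer_id": "int64", "order_count_30d": "int64",
                   "avg_order_value_30d": "float64"}

def schema_violations(df: pd.DataFrame) -> list:
    """Return human-readable violations of the serving input contract."""
    problems = []
    for col, dtype in EXPECTED_SCHEMA.items():
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            problems.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    return problems

served = pd.read_parquet("logged_serving_inputs.parquet")   # what the model actually saw
offline = pd.read_parquet("offline_features.parquet")       # what training would have used
print(schema_violations(served))

# A high mismatch rate for the same feature computed by both paths points at
# preprocessing/schema skew rather than genuine drift.
joined = served.merge(offline, on="customer_id", suffixes=("_online", "_offline"))
mismatch = (joined["order_count_30d_online"] != joined["order_count_30d_offline"]).mean()
print(f"feature mismatch rate: {mismatch:.1%}")
```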
Topic 2: Reproducible Training & Experiment Management at Scale
Practice this topic →
2.1 Reproducibility: data, code, environment
- Explain why reproducing a model requires tracking data versions, code versions, and environment/dependencies.
- Identify why uncontrolled randomness can break reproducibility and how to mitigate it (seeds, deterministic ops; see the sketch after this list).
- Describe why storing training artifacts (preprocessors, encoders) is required for serving consistency.
- Given a scenario, choose what metadata must be captured to satisfy audit requirements.
- Explain why reproducibility reduces incident duration during regressions.
- Recognize the risk of embedding secrets in notebooks/jobs and prefer secure secret management.
- Describe how to validate reproducibility by rerunning a training job and matching key metrics.
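A minimal sketch of seed pinning and artifact capture, assuming scikit-learn and joblib; the stand-in data, version strings, and file names are placeholders:

```python
# Sketch: pin randomness, persist the fitted preprocessor with the model, and record
# the metadata a rerun or an auditor would need. Version strings are placeholders.
import json
import random
import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

SEED = 42
random.seed(SEED)
np.random.seed(SEED)

rng = np.random.RandomState(SEED)
X = rng.normal(size=(500, 4))                    # stand-in for the versioned training data
y = (X[:, 0] + X[:, 1] > 0).astype(int)

scaler = StandardScaler().fit(X)                 # fit once; serving must reuse this artifact
model = RandomForestClassifier(random_state=SEED).fit(scaler.transform(X), y)

joblib.dump(scaler, "scaler.joblib")
joblib.dump(model, "model.joblib")

with open("run_metadata.json", "w") as f:
    json.dump({"seed": SEED,
               "data_version": "<dataset snapshot id>",   # placeholder
               "code_version": "<git commit sha>",        # placeholder
               "params": model.get_params()}, f, default=str)
```

Rerunning the job with the same data version, code version, and seed should reproduce the key metrics within tolerance; if it does not, something untracked is influencing training.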
2.2 MLflow tracking at scale (patterns)
- Explain how MLflow runs, parameters, metrics, and artifacts support large-scale experimentation (see the sketch after this list).
- Organize experiments and tags to support comparisons across many teams and models.
- Describe how to compare runs fairly and avoid p-hacking/over-tuning pitfalls (concept-level).
- Given a scenario, choose an experiment structure that supports A/B tests and ablations.
- Explain why logging evaluation reports and data checks as artifacts supports governance.
- Recognize when artifact size and retention policies must be managed for cost control.
- Describe why sensitive artifacts must be handled with access control and redaction.
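A minimal MLflow tracking sketch; the experiment path, tag keys, and the toy model are illustrative conventions, not MLflow requirements:

```python
# Sketch: one tracked run with params, metrics, tags for cross-team comparison,
# and an evaluation report logged as an artifact.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

mlflow.set_experiment("/teams/risk/churn-model")          # naming convention, not required

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

with mlflow.start_run(run_name="baseline-lr"):
    mlflow.set_tags({"team": "risk", "dataset_version": "v3", "purpose": "baseline"})
    params = {"C": 0.5, "max_iter": 1000}
    mlflow.log_params(params)

    model = LogisticRegression(**params).fit(X_tr, y_tr)
    val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    mlflow.log_metric("val_auc", val_auc)

    # Evaluation reports and data checks stored as artifacts give reviewers and
    # auditors something concrete to inspect later.
    with open("eval_report.txt", "w") as f:
        f.write(f"val_auc={val_auc:.4f}\n")
    mlflow.log_artifact("eval_report.txt")
    mlflow.sklearn.log_model(model, "model")
```

Consistent tags (team, dataset version, purpose) are what keep run comparison workable once hundreds of runs from many teams accumulate.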
2.3 Tuning and validation under compute constraints
- Choose an appropriate tuning strategy that balances compute cost with expected gains (concept-level; see the sketch after this list).
- Explain why early stopping and efficient search reduce wasted compute (concept-level).
- Recognize why validation procedures must remain stable to compare experiments over time.
- Given a scenario, decide whether to invest in feature improvements or in further hyperparameter tuning.
- Explain why a final holdout test set (or robust evaluation set) is needed for honest assessment.
- Describe how to prevent data leakage in tuning workflows.
- Identify when distributed training considerations change pipeline design (awareness).
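A minimal sketch of compute-bounded tuning with a stable validation procedure and a once-only test set, assuming scikit-learn; the search space and budget are illustrative:

```python
# Sketch: a capped random search (n_iter bounds compute), a frozen CV procedure so
# experiments stay comparable over time, and a held-out test set touched only once.
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, RandomizedSearchCV, train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": loguniform(1e-3, 1e2)},    # illustrative search space
    n_iter=20,                                           # explicit compute budget
    cv=KFold(n_splits=5, shuffle=True, random_state=0),  # frozen validation procedure
    scoring="roc_auc",
    random_state=0,
)
search.fit(X_dev, y_dev)                                 # tuning never sees the test set

print("best params:", search.best_params_)
print("honest test AUC:", search.score(X_test, y_test))
```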
Topic 3: Model Registry, Governance & Release Management
Practice this topic →
3.1 Registry versions, lineage, and auditability
- Differentiate experimental runs from registry versions and explain why the registry is the release system.
- Explain why every registry version should link back to the training run, data, and code (lineage); a sketch follows this list.
- Describe why model input/output schema contracts should be validated before promotion.
- Given a scenario, choose metadata that must be captured for audit and rollback.
- Recognize why governance requires access control on who can register and promote models.
- Explain how stage transitions support controlled release and rollback.
- Describe why documentation of model intent and limitations prevents misuse.
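A minimal MLflow registry sketch showing how a version links back to its source run; the model name, dataset tag, and local sqlite store are assumptions made for the sketch:

```python
# Sketch: log a run, then register a model version whose lineage points back to
# that run. The registry needs a database-backed store; sqlite is enough locally.
import mlflow
import mlflow.sklearn
from mlflow.tracking import MlflowClient
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

mlflow.set_tracking_uri("sqlite:///mlflow.db")

X, y = make_classification(n_samples=500, random_state=0)
with mlflow.start_run() as run:
    mlflow.log_param("dataset_version", "v3")            # lineage recorded on the run
    mlflow.sklearn.log_model(LogisticRegression(max_iter=1000).fit(X, y), "model")

# The registry version (not the experimental run) is the releasable unit; its source
# run id carries the link back to data and code versions.
version = mlflow.register_model(f"runs:/{run.info.run_id}/model", "churn-classifier")

client = MlflowClient()
client.update_model_version(
    "churn-classifier", version.version,
    description="Intended for churn scoring; see source run for lineage and limitations.",
)
client.set_model_version_tag("churn-classifier", version.version, "dataset_version", "v3")
```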
3.2 Release workflows: approvals, staging, rollback
- Explain why approvals and review gates reduce production risk for high-impact models.
- Design a staged rollout plan (staging validation, canary, production promotion) at a conceptual level; a promotion-gate sketch follows this list.
- Recognize when rollback is the safest action (sudden performance drop, upstream schema change).
- Given a scenario, choose a rollback vs retrain vs fix-upstream decision based on observed signals.
- Describe how to maintain reproducibility during release (pin model version and feature definitions).
- Explain why automated tests for model contract and basic sanity checks reduce regressions.
- Identify anti-patterns: promoting experimental runs directly without governance.
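A minimal promotion-gate sketch: basic contract and sanity checks run against the candidate registry version before an approved promotion, using MLflow registered-model aliases (newer MLflow releases; older ones use stage transitions instead). The model name, candidate version, input width, and approval mechanism are placeholders:

```python
# Sketch: gate a promotion behind contract/sanity checks and an approval, then
# promote by re-pointing an alias. Rollback is re-pointing it at the prior version.
import mlflow
import numpy as np
from mlflow.tracking import MlflowClient

mlflow.set_tracking_uri("sqlite:///mlflow.db")
MODEL_NAME, CANDIDATE_VERSION = "churn-classifier", "2"   # placeholders

candidate = mlflow.pyfunc.load_model(f"models:/{MODEL_NAME}/{CANDIDATE_VERSION}")

# Contract/sanity check: the model accepts the agreed input shape and returns one
# prediction per row. A real gate would also replay a frozen regression set in staging.
sample = np.zeros((1, 20))                                # placeholder feature width
assert candidate.predict(sample).shape[0] == 1

approved = True   # stand-in for a recorded human approval / review gate

if approved:
    client = MlflowClient()
    client.set_registered_model_alias(MODEL_NAME, "production", CANDIDATE_VERSION)
    # To roll back later: point the "production" alias back at the previous version.
```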
3.3 Security and compliance (model lifecycle)
- Describe why access to models and features must be scoped by role and environment.
- Explain why sensitive training data should not leak into logs/artifacts and how to avoid it.
- Recognize the need for audit logs for promotions and changes in regulated settings (concept-level).
- Given a scenario, choose the safest approach for sharing models across teams (registry + permissions).
- Describe how to handle secrets for inference services safely (secret scopes/management).
- Explain why data retention policies affect auditability and rollback options.
- Recognize that governance is part of reliability: fewer unknown changes during incidents.
Topic 4: Deployment Patterns & Testing
Practice this topic →
4.1 Batch vs online serving decisions
- Differentiate batch inference from online inference and map each to latency/throughput requirements.
- Explain why batch inference is often cheaper at scale when low latency is not required.
- Recognize the trade-off between throughput and per-request latency for online serving.
- Given a scenario, choose the simplest deployment approach that meets requirements.
- Describe how schema contracts and preprocessing requirements shape deployment choices.
- Explain why idempotency and retries matter for batch scoring pipelines (sketched below).
- Identify when streaming inference patterns are needed (awareness).
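A minimal idempotent batch-scoring sketch, assuming a joblib model artifact and date-partitioned parquet output; the paths, columns, and version string are placeholders:

```python
# Sketch: batch scoring whose output is keyed by run date and overwritten on retry,
# so reruns are safe (idempotent) and every score carries its model version.
from pathlib import Path
import joblib
import pandas as pd

def score_batch(run_date: str, input_path: str, output_dir: str, model_path: str) -> None:
    model = joblib.load(model_path)                       # pinned model artifact
    df = pd.read_parquet(input_path)

    df["score"] = model.predict_proba(df[["f1", "f2", "f3"]])[:, 1]
    df["model_version"] = "churn-classifier/7"            # placeholder lineage tag

    out = Path(output_dir) / f"run_date={run_date}" / "scores.parquet"
    out.parent.mkdir(parents=True, exist_ok=True)
    # Writing to the same partition path on retry overwrites rather than duplicates.
    df[["customer_id", "score", "model_version"]].to_parquet(out, index=False)

score_batch("2024-06-01", "features/2024-06-01.parquet", "scores", "model.joblib")
```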
4.2 Testing and rollout safety
- Describe a basic model test pyramid: unit tests for transforms, contract tests, and end-to-end smoke tests (see the sketch after this list).
- Explain why canary/shadow deployments reduce risk for online inference (concept-level).
- Recognize why monitoring should be in place before enabling full production traffic.
- Given a scenario, choose a rollout plan that includes rollback criteria and verification steps.
- Describe how to validate feature availability and freshness before deployment.
- Explain why drift tests and regression test sets prevent silent quality degradation.
- Identify anti-patterns: deploying without monitoring, deploying without version pinning.
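A minimal pytest-style sketch of contract and sanity tests for a release candidate; the artifact path, feature list, and the skipped regression check are assumptions:

```python
# Sketch: contract and sanity tests that run before any rollout. The frozen
# regression-set check is left as a skip because that dataset is site-specific.
import joblib
import numpy as np
import pytest

EXPECTED_FEATURES = ["f1", "f2", "f3"]                    # the agreed input contract

@pytest.fixture(scope="module")
def model():
    return joblib.load("model.joblib")                    # pinned release candidate

def test_contract_accepts_expected_feature_count(model):
    X = np.zeros((1, len(EXPECTED_FEATURES)))
    assert model.predict(X).shape == (1,)

def test_scores_are_valid_probabilities(model):
    X = np.random.RandomState(0).normal(size=(100, len(EXPECTED_FEATURES)))
    p = model.predict_proba(X)[:, 1]
    assert np.all((p >= 0.0) & (p <= 1.0))

def test_no_regression_on_frozen_evaluation_set(model):
    # Would load a frozen, labeled regression set and compare quality against the
    # currently deployed model before allowing promotion.
    pytest.skip("requires the site-specific frozen regression dataset")
```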
4.3 Operational integration (CI/CD awareness)
- Explain why CI/CD for models requires versioned artifacts and repeatable training pipelines.
- Recognize that promotion should be a controlled action with approvals when required.
- Describe how to separate dev/test/prod environments for ML workflows to reduce accidental impact.
- Given a scenario, choose an automation approach that preserves auditability (logged promotions, artifacts).
- Explain why secrets should not be embedded in pipelines and must be injected securely.
- Describe how to coordinate feature pipeline changes with model releases to avoid serving skew.
- Identify how to handle rollback in automated workflows safely (pin the previous model version; sketched below).
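A minimal sketch of pinning and rollback in automation: promotion writes an explicit model version into a config file kept under version control, and rollback restores the previous pin. The file layout is an illustrative convention, not a specific CI/CD tool:

```python
# Sketch: promotion = writing a pinned model version into a committed config
# (auditable history); rollback = restoring the previously pinned version.
import json
from pathlib import Path

CONFIG = Path("deploy/prod_model.json")

def promote(model_name: str, version: str) -> None:
    previous = json.loads(CONFIG.read_text()) if CONFIG.exists() else None
    CONFIG.parent.mkdir(parents=True, exist_ok=True)
    CONFIG.write_text(json.dumps({
        "model_name": model_name,
        "version": version,        # the serving job loads exactly this version
        "previous": previous,      # kept so rollback requires no guesswork
    }, indent=2))

def rollback() -> None:
    current = json.loads(CONFIG.read_text())
    if current.get("previous"):
        CONFIG.write_text(json.dumps(current["previous"], indent=2))

promote("churn-classifier", "8")   # placeholder name and version
```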
Topic 5: Monitoring, Drift & Maintenance
Practice this topic →
5.1 Drift types and detection
- Differentiate data drift, concept drift, and label drift and identify what each implies.
- Explain why monitoring input feature distributions can detect upstream changes early (see the sketch after this list).
- Recognize that a sudden metric drop often indicates pipeline/schema breakage rather than gradual drift.
- Given a scenario, choose the correct first diagnostic step for a performance regression (data checks, feature skew, model change).
- Describe how to set and interpret alert thresholds responsibly to avoid noise.
- Explain why segmentation matters (monitor key cohorts rather than only global metrics).
- Recognize that quality monitoring requires ground-truth labels and that label delay affects alert design.
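A minimal drift-check sketch using a population stability index (PSI) over a single feature; the windows, bin count, and thresholds are common rules of thumb rather than standards:

```python
# Sketch: compare a feature's current serving distribution against its training
# reference with a population stability index (PSI).
import numpy as np

def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    ref = np.clip(reference, edges[0], edges[-1])         # out-of-range -> outer bins
    cur = np.clip(current, edges[0], edges[-1])
    ref_frac = np.histogram(ref, bins=edges)[0] / len(ref)
    cur_frac = np.histogram(cur, bins=edges)[0] / len(cur)
    ref_frac = np.clip(ref_frac, 1e-6, None)              # avoid log(0)
    cur_frac = np.clip(cur_frac, 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

rng = np.random.RandomState(0)
train_window = rng.normal(0.0, 1.0, 10_000)               # reference (training) window
live_window = rng.normal(0.3, 1.0, 10_000)                # current serving window, shifted

score = psi(train_window, live_window)
# Common rule of thumb: < 0.1 stable, 0.1-0.25 investigate, > 0.25 significant shift.
print(f"PSI = {score:.3f}")
```

Run the same check per key segment, not only globally, and alert on sustained shifts rather than single noisy windows.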
5.2 Retraining strategies and operational playbooks
- Differentiate scheduled retraining from drift-triggered retraining and choose when each makes sense (see the sketch after this list).
- Explain why retraining pipelines must be reproducible and governed like production code.
- Recognize when rollback is safer than retraining (upstream outage, schema change, missing features).
- Given a scenario, choose a maintenance plan: retrain, fix features, recalibrate, or roll back.
- Describe how to validate a retrained model before promotion (evaluation set, regression checks).
- Explain how to manage feature backfills and avoid training on corrupted historical data.
- Identify why cost controls matter for frequent retraining and large experiments.
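A minimal decision sketch separating sudden breakage (where retraining would bake corrupted data in) from gradual drift (where a governed retrain is the right response); the signal names and thresholds are illustrative:

```python
# Sketch: turn a few health signals into a maintenance action. Thresholds and
# signal names are illustrative, not a standard.
from dataclasses import dataclass

@dataclass
class HealthSignals:
    schema_violations: int        # e.g. missing or renamed feature columns today
    null_rate_spike: bool         # sudden jump in nulls for key features
    feature_psi: float            # drift score vs. the training reference
    metric_drop: float            # absolute drop in the monitored quality metric

def maintenance_action(s: HealthSignals) -> str:
    if s.schema_violations > 0 or s.null_rate_spike:
        # Sudden breakage: retraining on corrupted inputs would bake the problem in.
        return "fix upstream / roll back the model"
    if s.feature_psi > 0.25 or s.metric_drop > 0.05:
        return "trigger the governed retraining pipeline"
    return "no action; continue the scheduled retraining cadence"

print(maintenance_action(HealthSignals(0, False, 0.31, 0.02)))
```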
5.3 Observability and incident response for ML systems
- Identify key operational signals: request latency, error rate, throughput, and feature pipeline health.
- Explain why logging inputs/outputs (safely) helps debug production incidents.
- Describe a safe incident triage sequence for ML failures (scope impact, verify pipelines, roll back if needed).
- Given a scenario, select remediation that minimizes blast radius while restoring service.
- Recognize the importance of audit logs for production changes in regulated environments.
- Explain why separating responsibilities (feature owners vs model owners) improves response times.
- Identify anti-patterns: silent failures, no rollback plan, and unbounded retries.