GENAI-ASSOC Syllabus — Learning Objectives by Topic

Blueprint-aligned learning objectives for Databricks Generative AI Engineer Associate (GENAI-ASSOC), organized by topic with quick links to targeted practice.

What’s covered

Topic 1: GenAI Foundations (LLMs, Safety, and System Thinking)

Practice this topic →

1.1 LLM basics for engineers (practical)

  • Differentiate base models from instruction-tuned models at a high level and map each to common use cases.
  • Explain context windows conceptually and why prompt + retrieved context must fit within limits.
  • Describe hallucinations at a high level and why grounding with retrieved context reduces risk.
  • Recognize that determinism and repeatability depend on inference settings (temperature/top-p) conceptually.
  • Given a scenario, choose whether to use retrieval grounding or a pure prompt-only approach.
  • Explain why latency and cost are first-class constraints in production LLM systems.

1.2 Prompting patterns and failure modes

  • Differentiate system instructions, user prompts, and context and explain why separation matters.
  • Use few-shot examples conceptually and explain when examples improve reliability.
  • Recognize prompt injection risks and why untrusted text must be treated as data, not instructions.
  • Given a scenario, choose a prompt strategy that reduces ambiguity and improves grounded answers.
  • Explain why output formatting constraints (JSON, citations) require explicit instructions and validation.
  • Describe why prompt changes should be tested with regression sets to avoid silent quality regressions.

1.3 Safety, governance, and privacy awareness

  • Identify the risk of leaking sensitive data through prompts and outputs and describe mitigation strategies conceptually.
  • Explain why access control and tenant isolation are required for multi-tenant GenAI applications.
  • Recognize why logging and telemetry must redact or avoid storing PII and secrets.
  • Given a scenario, choose a safe approach to restricting retrieval to authorized documents only.
  • Explain why model outputs should be monitored for policy violations (concept-level).
  • Describe why governance is easier when metadata and ownership are standardized.

Topic 2: Data Preparation for RAG (Ingestion, Chunking, Metadata)

Practice this topic →

2.1 Preparing documents for retrieval

  • Explain why clean text extraction (removing boilerplate, normalizing whitespace) improves retrieval quality.
  • Describe the purpose of chunking and why chunks should be semantically coherent.
  • Recognize the trade-off between chunk size, recall, precision, and cost.
  • Given a scenario, choose a chunking strategy suitable for the document type (policies, manuals, FAQs).
  • Explain why deduplication and versioning of documents prevents conflicting answers.
  • Describe why storing source identifiers and timestamps supports audit and refresh workflows.

2.2 Metadata and access control for retrieval

  • Explain why metadata filters improve relevance and enforce security boundaries.
  • Identify common metadata fields: tenant, product, region, effective date, and document type.
  • Recognize that missing metadata can cause cross-tenant leakage in retrieval.
  • Given a scenario, choose a metadata schema that supports both relevance filtering and authorization checks.
  • Explain why document lifecycle (active vs deprecated) should be represented in metadata.
  • Describe how to validate that retrieval respects access policies (test cases, audit logs).

2.3 Refresh, backfill, and data quality for knowledge bases

  • Describe why knowledge bases need refresh workflows and how stale content degrades trust.
  • Explain a safe refresh strategy: ingest new versions, re-embed, re-index, then cut over.
  • Recognize the cost of re-embedding and why incremental updates are preferred when possible.
  • Given a scenario, choose a backfill plan that avoids duplicate chunks and conflicting versions.
  • Explain why quality checks should validate chunk counts, metadata completeness, and source coverage.
  • Describe why observability for ingestion and indexing jobs reduces silent failures.

Practice this topic →

3.1 Embeddings basics (practical choices)

  • Explain what embeddings represent at a high level and why they enable semantic search.
  • Describe why embedding models must be chosen consistently for a knowledge base (avoid mixing).
  • Recognize the trade-off between embedding quality, dimensionality, cost, and latency conceptually.
  • Given a scenario, choose when to embed documents vs embed structured fields separately.
  • Explain why storing embedding version metadata supports refresh and reproducibility.
  • Describe why normalization and text preprocessing must be consistent between docs and queries.

3.2 Indexing and retrieval strategy

  • Explain the purpose of a vector index and why indexing affects retrieval latency.
  • Describe hybrid retrieval at a conceptual level (keyword + vector) and when it helps.
  • Recognize the importance of metadata filters for relevance and security boundaries.
  • Given a scenario, choose a top-k and filtering strategy that balances recall, precision, and cost.
  • Explain why index refresh and consistency must be monitored (stale index risk).
  • Describe why evaluation should include retrieval quality, not only answer quality.

3.3 Common retrieval failure modes and fixes

  • Identify symptoms of poor chunking (retrieval returns irrelevant or incomplete context).
  • Explain why missing filters can retrieve wrong-tenant or wrong-version documents.
  • Recognize query mismatch and the need for query rewriting or expansion (concept-level).
  • Given a scenario, choose the most likely fix: re-chunk, add metadata filters, or adjust top-k.
  • Explain why embeddings quality can degrade if preprocessing differs between doc and query paths.
  • Describe why debugging should inspect retrieved chunks and citations, not only final answers.

Topic 4: RAG Orchestration & Evaluation

Practice this topic →

4.1 Prompt + context assembly and grounding

  • Explain why prompts should separate instructions from retrieved context to reduce injection risk.
  • Describe context window budgeting and why you must reserve space for system instructions and output format.
  • Recognize when to include citations or sources and why they improve trust.
  • Given a scenario, choose a context assembly strategy (top-k, reranking, filtering) that improves grounding.
  • Explain why you should avoid prompting the model to follow instructions inside retrieved content.
  • Describe why output validation is required for structured outputs (JSON schema checks).

4.2 Evaluation strategy (quality, safety, regression)

  • Differentiate retrieval evaluation from answer evaluation and explain why both are required.
  • Define a regression test set for prompts and RAG pipelines and explain why it prevents silent regressions.
  • Recognize the importance of human review for edge cases and high-impact outputs.
  • Given a scenario, choose evaluation criteria that match the use case (correctness, groundedness, format).
  • Explain why safety evaluation must include prompt injection and sensitive-data leakage tests.
  • Describe why evaluation should be automated where possible and tracked over time.

4.3 Iteration loop and production readiness

  • Describe a practical iteration loop: improve data/chunking → improve retrieval → refine prompts → evaluate.
  • Explain why changing only one major variable at a time makes debugging faster (isolate causes).
  • Recognize when to improve retrieval rather than prompts (most relevance issues are retrieval/data issues).
  • Given a scenario, choose the highest-ROI improvement action based on observed failures.
  • Explain why prompt and model versioning are required for reproducibility.
  • Describe why monitoring and rollback criteria must be defined before production launch.

Topic 5: Deployment, Monitoring, and Cost Controls

Practice this topic →

5.1 Serving patterns and latency trade-offs

  • Differentiate batch vs online inference for GenAI applications and map to latency requirements.
  • Explain why caching and reuse (embeddings, retrieval results) can reduce cost and latency.
  • Recognize that vector search latency and LLM latency are separate bottlenecks and must be monitored independently.
  • Given a scenario, choose a serving architecture that balances latency, cost, and security.
  • Explain why rate limiting and quotas prevent cost blow-ups and abuse in public-facing apps.
  • Describe why graceful degradation is important (fallback responses, partial results, safe refusals).

5.2 Monitoring and incident response for GenAI

  • Identify key telemetry: token usage, latency percentiles, retrieval hit rate, and error rates.
  • Explain why monitoring must include groundedness/citation behavior, not just uptime.
  • Recognize that prompt injection attempts should be monitored as security incidents.
  • Given a scenario, choose the safest immediate incident response action (restrict access, rollback prompt, disable retrieval).
  • Explain why logging must avoid storing sensitive user data and retrieved proprietary content.
  • Describe why a human review queue is useful for high-impact or uncertain outputs.

5.3 Governance and lifecycle management (MLflow awareness)

  • Explain why model/prompt/version tracking is required for reproducibility and auditability.
  • Describe how to track evaluation artifacts and datasets used for regression testing.
  • Recognize why access control must apply to both retrieval data and model endpoints.
  • Given a scenario, choose a governance approach that prevents unauthorized data exposure via retrieval.
  • Explain why refresh workflows (re-embed, re-index) should be treated like production pipelines with approvals.
  • Describe why cost controls require ongoing review (token budgets, index refresh costs, connector usage).