

Certified Generative AI Engineer Associate Exam Questions
Question 11 Single Choice
When developing an LLM application, it is essential to ensure that the data used for training adheres to licensing rules to prevent legal issues.
Which action is NOT a proper approach for avoiding legal risks?
Explanation
The correct answer is:
D. Contact the data curators directly after you’ve already started using the trained model to inform them.
Here's why informing curators only after use is improper:
Proactive approach: It's important to address licensing issues before using the trained model to avoid potential legal consequences.
Respect for intellectual property: Contacting data curators before using their data demonstrates respect for their intellectual property and shows that you are taking steps to comply with licensing rules.
Building relationships: Establishing contact with data curators can help build relationships and potentially lead to future collaborations.
Here's why the other options are proper approaches:
A. Contact the data curators directly before using the trained model to inform them: This is a proactive and responsible approach that demonstrates respect for intellectual property and helps avoid legal issues.
B. Use any data you have created yourself, as it is entirely original and you can assign the appropriate license: Data that you have created yourself is generally considered your intellectual property, and you can assign the appropriate license to it.
C. Only use data that is clearly labeled with an open license and make sure you comply with the terms of that license: Using data with open licenses ensures that you have the necessary permissions to use and distribute the data.
Question 12 Single Choice
A Generative AI Engineer has received business requirements for an external chatbot. The chatbot needs to understand the types of questions users ask and route them to the appropriate models for answers. For instance, one user might inquire about details for upcoming events, while another might ask about purchasing tickets for a specific event.
What is the most suitable workflow for this chatbot?
Explanation
Correct Answer: C. The chatbot should be designed as a multi-step LLM workflow. First, it should identify the type of question being asked, then route the query to the appropriate model. For questions about upcoming events, the query should be directed to a text-to-SQL model, while ticket purchasing inquiries should redirect the user to a payment platform.
Justification:
Implementing a multi-step Large Language Model (LLM) workflow enables the chatbot to effectively manage diverse user inquiries by:
Classifying User Intent: Initially determining the nature of the user's question (e.g., event details vs. ticket purchases).
Routing to Specialized Modules:
Event Details:
Utilize a text-to-SQL model to query a database for information on upcoming events.
Ticket Purchases:
Redirect users to the payment platform to facilitate ticket transactions.
Supporting Information:
Multi-Agent Systems: Designing the chatbot with specialized agents allows for task-specific processing, enhancing efficiency and accuracy. For example, one agent can handle event information retrieval, while another manages payment processes.
LLM Routing: Effectively directing user prompts to the appropriate models or systems ensures that each query is handled by the most suitable resource, improving response relevance and user satisfaction.
Analysis of Other Options:
A. The chatbot should focus solely on past event information.
Limitation: This approach neglects user inquiries about future events and ticket purchases, thereby limiting the chatbot's functionality and user engagement.
B. Two separate chatbots should be created to handle different types of user inquiries.
Limitation: Maintaining multiple chatbots can lead to increased complexity and resource requirements. Users might also experience confusion when determining which chatbot to engage with for their specific needs.
D. The chatbot should only handle payment processing.
Limitation: This restricts the chatbot's capabilities to a single function, failing to address informational queries about events, which are essential for comprehensive user support.
Conclusion:
Designing the chatbot as a multi-step LLM workflow ensures that it can intelligently classify and route user inquiries to the appropriate modules. This structure not only enhances the chatbot's versatility but also improves user satisfaction by providing accurate and relevant responses tailored to their specific needs.
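Illustrative Routing Sketch
The snippet below is a minimal Python sketch of the multi-step pattern described above; it is not part of the exam content. The intent classifier is a stub standing in for the first LLM call, and the handlers stand in for the text-to-SQL model and the payment redirect.

def classify_intent(question: str) -> str:
    # Stub for the first LLM step; in practice an LLM would be prompted to
    # return a label such as "event_info" or "ticket_purchase".
    return "ticket_purchase" if "ticket" in question.lower() else "event_info"

def answer_event_question(question: str) -> str:
    # Stand-in for the text-to-SQL model that would query an events database.
    return f"[events database lookup for: {question}]"

def route_question(question: str) -> str:
    intent = classify_intent(question)
    if intent == "event_info":
        return answer_event_question(question)
    # Ticket purchases are redirected rather than answered by a model.
    return "Please complete your purchase on the ticketing payment platform."

print(route_question("What events are coming up next month?"))
print(route_question("I want to buy two tickets for the jazz night."))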
Question 13 Single Choice
A team intends to deploy a code generation model to assist their software developers, ensuring support for multiple programming languages. The primary focus is on maintaining high quality in the generated code.
Which of the Databricks Foundation Model APIs or models available in the Marketplace would be the most suitable choice?
Explanation
Correct Answer: D. CodeLlama-34B
Justification:
CodeLlama-34B is a specialized large language model explicitly fine-tuned for code generation tasks across multiple programming languages. Its design focuses on generating high-quality code, making it particularly suitable for assisting software developers. The model's substantial parameter size (34 billion) enhances its ability to understand and generate complex code structures, thereby maintaining a high standard in the generated code.
Supporting Information:
Enhanced Coding Capabilities: CodeLlama-34B has been trained on extensive code-specific datasets, enabling it to generate code and provide natural language explanations about code. This training allows it to perform tasks such as code synthesis, code completion, and debugging assistance effectively.
Multi-Language Support: The model supports various programming languages, including Python, Java, C++, Bash, PHP, TypeScript, and C#, making it versatile for diverse development environments.
Performance: Among the CodeLlama models, the 34B version provides the best results for code generation and development tasks, offering a balance between performance and computational requirements.
Analysis of Other Options:
A. Llama2-70B:
Limitation: While Llama2-70B is a powerful language model, it is a general-purpose model not specifically fine-tuned for code generation tasks. Therefore, it may not perform as effectively in generating high-quality code compared to models like CodeLlama-34B that are specialized for such tasks.
B. BGE-large:
Limitation: BGE-large is not primarily designed for code generation. Its architecture and training data are oriented towards different applications, making it less suitable for generating high-quality code across multiple programming languages.
C. MPT-7B:
Limitation: MPT-7B is a smaller model with 7 billion parameters and is not specifically optimized for code generation tasks. Its smaller size may limit its ability to generate complex code structures effectively, resulting in lower quality code outputs.
Conclusion:
For a team aiming to deploy a code generation model that supports multiple programming languages and maintains high-quality code output, CodeLlama-34B is the most suitable choice. Its specialized training for coding tasks, multi-language support, and superior performance in code generation make it well-suited to assist software developers effectively.
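Illustrative Endpoint Call
The sketch below shows one way to send a code-generation prompt to a model served on Databricks Model Serving; it assumes CodeLlama-34B has already been deployed to an endpoint. The workspace URL, token, endpoint name, and payload fields are placeholders, and the exact request schema depends on how the endpoint was configured.

import requests

WORKSPACE_URL = "https://<your-workspace>.cloud.databricks.com"  # placeholder
ENDPOINT_NAME = "codellama-34b-instruct"                         # assumed endpoint name
TOKEN = "<personal-access-token>"                                # placeholder

payload = {
    "prompt": "Write a Python function that reverses a singly linked list.",
    "max_tokens": 256,
    "temperature": 0.1,  # low temperature favors deterministic code output
}

response = requests.post(
    f"{WORKSPACE_URL}/serving-endpoints/{ENDPOINT_NAME}/invocations",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
    timeout=60,
)
print(response.json())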
Question 14 Single Choice
A Generative AI Engineer is developing a system that retrieves news articles from 1918 based on a user's query and generates summaries. While the summaries are accurate, they often include unnecessary details about how the summary was generated, which is not desired.
What change can the engineer make to resolve this issue?
Explanation
Correct Answer: D. Provide few-shot examples to the model or adjust the user prompt to guide the system toward the desired output format.
Justification:
The issue of summaries including unnecessary details about their generation process can be effectively addressed through prompt engineering techniques:
Prompt Refinement: By explicitly instructing the model to exclude explanations about the summary generation process, you can guide it to produce cleaner outputs. For example, crafting a prompt such as, "Provide a concise summary of the following article without mentioning the summary generation process," sets clear expectations for the desired response.
Few-Shot Examples: Demonstrating the desired output format through examples within the prompt can further align the model's responses with your expectations. This involves providing sample inputs and the corresponding ideal outputs, enabling the model to learn the preferred structure and content.
Implementing these strategies can significantly enhance the relevance and clarity of the model's summaries by eliminating superfluous information.
Analysis of Other Options:
A. Split the LLM’s output using newline characters to remove the explanation portion of the summary:
Limitation: This approach is a post-processing workaround that may not consistently or accurately separate the undesired content, leading to potential loss of valuable information or incomplete summaries.
B. Adjust the chunk size of the news articles or try different embedding models:
Irrelevance: Modifying chunk sizes or embedding models pertains to how the input data is processed and represented but does not directly influence the model's tendency to include unnecessary details in its output.
C. Review the document ingestion process to confirm the news articles are being properly ingested:
Misalignment: Ensuring proper ingestion of articles is essential for data integrity but does not address the issue of extraneous information in the generated summaries.
Therefore, option D is the most effective solution, as it directly targets the root of the problem by refining the model's instructions and expectations through prompt engineering and the provision of few-shot examples.
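Illustrative Few-Shot Prompt
A minimal sketch of the prompt refinement described above; the instruction wording and the sample article are invented for illustration only.

FEW_SHOT_PROMPT = """Summarize the 1918 news article below in 2-3 sentences.
Output only the summary. Do not describe how the summary was produced.

Article: The city council voted on Tuesday to expand the tram network to the harbor district.
Summary: The city council approved extending the tram network to the harbor district, with work expected to begin in the spring.

Article: {article}
Summary:"""

def build_prompt(article_text: str) -> str:
    return FEW_SHOT_PROMPT.format(article=article_text)

print(build_prompt("Local volunteers organized relief kitchens during the influenza outbreak..."))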
Question 15 Single Choice
What is an effective way to preprocess prompts using custom code before sending them to a large language model (LLM)?
Explanation
Correct Answer:
Option D: Create an MLflow PyFunc model that includes a separate function for processing the prompts.
Justification for the Correct Answer (D)
Preprocessing prompts before sending them to an LLM can enhance response accuracy, reduce ambiguity, and improve performance. The most efficient and modular approach is to wrap the preprocessing logic inside an MLflow PyFunc model:
Encapsulates Preprocessing Logic
The MLflow PyFunc model allows the engineer to define a custom function that modifies prompts before sending them to the LLM.
This ensures consistency and standardization across all queries.
Keeps Preprocessing Separate from Model Execution
Instead of modifying the LLM itself, the PyFunc model acts as an intermediate layer to clean, reformat, or enrich prompts.
It enables easy updates to the preprocessing logic without retraining the LLM.
Enables Versioning and Experimentation
MLflow’s model registry allows tracking different preprocessing versions to compare performance improvements.
Engineers can experiment with different prompt formulations to improve LLM responses.
Example MLflow PyFunc Model for Preprocessing Prompts
import mlflow.pyfunc

class PreprocessPromptModel(mlflow.pyfunc.PythonModel):
    def predict(self, context, prompts):
        processed_prompts = [self.clean_prompt(p) for p in prompts]
        return processed_prompts

    def clean_prompt(self, prompt):
        # Example: remove extra spaces, standardize format
        return prompt.strip().lower()

# Save the model
mlflow.pyfunc.save_model(path="preprocess_model", python_model=PreprocessPromptModel())
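Once saved, the preprocessing model can be loaded and applied like any other MLflow model. A brief usage sketch is shown below; exact input handling (lists vs. pandas DataFrames) can vary between MLflow versions, so adjust as needed.

import mlflow.pyfunc

loaded = mlflow.pyfunc.load_model("preprocess_model")
print(loaded.predict(["   What Is Our REFUND Policy?  "]))
# Expected output: ['what is our refund policy?']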
Reference:
Databricks Documentation: MLflow PyFunc Models
Why Other Options Are Incorrect?
Option A: Directly alter the internal architecture of the LLM to incorporate preprocessing steps.
Why Incorrect?
Modifying an LLM’s architecture is complex, costly, and unnecessary for simple preprocessing tasks.
Preprocessing should be handled externally to avoid retraining or fine-tuning costs.
Option B: Avoid using custom code for preprocessing prompts, as the LLM has not been trained on preprocessed examples.
Why Incorrect?
Preprocessing improves prompt clarity and structure, which enhances LLM performance.
LLMs do not require retraining to benefit from well-formatted and structured prompts.
Option C: Instead of preprocessing prompts, focus on postprocessing the LLM outputs to ensure they meet desired outcomes.
Why Incorrect?
Both preprocessing and postprocessing are important.
Poorly structured prompts lead to poor-quality responses, making postprocessing less effective.
Final Answer:
D. Create an MLflow PyFunc model that includes a separate function for processing the prompts.
This ensures modular, scalable, and reproducible preprocessing before sending prompts to the LLM.
Question 16 Single Choice
A Generative AI Engineer is working with a provisioned throughput model serving endpoint within a RAG application. They want to track both incoming requests and outgoing responses for the endpoint. Currently, they are using a micro-service between the endpoint and the user interface to log the information to a remote server.
Which Databricks feature can they use to handle this logging task more efficiently?
Explanation
Correct Answer:
Option D: Inference Tables
Justification for the Correct Answer (D)
A provisioned throughput model serving endpoint in a Retrieval-Augmented Generation (RAG) application requires efficient tracking of both incoming requests and outgoing responses. The engineer is currently using a microservice to log data externally, but Databricks provides a built-in feature for tracking inference data efficiently: Inference Tables.
Inference Tables enable:
Automatic logging of requests and responses from Databricks Model Serving endpoints.
Storage of metadata such as latency, model input parameters, and generated outputs.
Easy retrieval and analysis of logged inference data using SQL or Databricks queries.
Better performance and lower operational overhead than manually logging through a microservice.
By using Inference Tables, the engineer can eliminate the need for a separate logging service while ensuring that all model interactions are logged efficiently within Databricks.
Reference:
Databricks Documentation: Inference Tables for Model Serving
Why Other Options Are Incorrect?
Option A: Vector Search
Why Incorrect?
Vector Search is used for retrieving semantically similar documents in RAG applications, not for logging model requests and responses.
It helps improve retrieval quality, but does not handle request-response tracking for model serving endpoints.
Option B: Lakeview
Why Incorrect?
Lakeview refers to Databricks' dashboarding capability (Lakeview dashboards), which is designed for building and sharing data visualizations.
It does not capture or log requests and responses from model serving endpoints, so it does not address the logging requirement.
Option C: DBSQL (Databricks SQL)
Why Incorrect?
DBSQL is used for querying structured data in Databricks but does not handle real-time logging of model requests and responses.
Inference Tables are specifically designed for tracking model serving interactions, making them the more efficient choice.
Final Answer:
D. Inference Tables
This is the best choice for efficiently tracking requests and responses for a provisioned throughput model serving endpoint within a RAG application.
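Illustrative Inference Table Query
Once inference tables are enabled for the endpoint, logged requests and responses can be queried directly. The sketch below assumes a Databricks notebook (where spark is predefined) and a hypothetical table name; the actual catalog, schema, and table are set when the endpoint's inference table is configured, and column names should be checked against the documented schema.

recent = spark.sql("""
    SELECT timestamp_ms, request, response, status_code
    FROM main.default.rag_endpoint_payload   -- hypothetical inference table name
    ORDER BY timestamp_ms DESC
    LIMIT 10
""")
recent.show(truncate=False)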
Question 17 Single Choice
A Generative AI Engineer at an electronics company has deployed a RAG (Retrieval-Augmented Generation) application that allows customers to ask questions about the company's products. However, users have reported that the responses sometimes provide information about irrelevant products.
What should the engineer do to improve the relevance of the responses?
Explanation
Correct Answer:
Option A: Evaluate the quality of the context being retrieved.
Justification for the Correct Answer (A)
Since users are receiving irrelevant product information, the primary issue likely lies in how the RAG system retrieves relevant documents before passing them to the LLM. Evaluating the quality of the retrieved context is the first and most effective step to improving response relevance.
Why Evaluating Retrieved Context is Crucial
Ensures Retrieval Accuracy
If incorrect or loosely related documents are retrieved, even a strong LLM cannot generate a relevant response.
Checking retrieval results manually helps identify if the system is selecting the right product information.
Optimizes Chunking Strategy
If chunks are too large, irrelevant details may be included.
If chunks are too small, important context may be missing.
Adjusting chunk size improves retrieval precision.
Improves Embedding Search and Filtering
If retrieval is pulling irrelevant documents, refining the vector search algorithm (e.g., by using better metadata filtering) may help.
Analyzing embedding similarity scores ensures higher confidence results are selected.
Allows Systematic Debugging
Testing retrieval performance separately from LLM responses helps diagnose issues before making unnecessary changes to the LLM or other components.
Reference:
Databricks: Improving RAG Performance
Why Other Options Are Incorrect?
Option B: Implement caching for commonly asked questions.
Why Incorrect?
Caching only speeds up responses but does not fix the retrieval problem.
If responses are incorrect due to poor retrieval, caching will just serve the same incorrect responses faster.
Option C: Switch to a different LLM to enhance the response generation.
Why Incorrect?
The issue is likely with document retrieval, not the LLM itself.
Even a better LLM cannot generate a relevant response if the wrong documents are retrieved.
Option D: Use a different algorithm for semantic similarity search.
Why Incorrect?
Changing the semantic search algorithm may help in some cases, but without first evaluating retrieval quality, this is premature.
A retrieval evaluation should be conducted before modifying search algorithms.
Final Answer:
A. Evaluate the quality of the context being retrieved.
This step systematically improves the RAG pipeline by ensuring the right product information is retrieved before passing it to the LLM, leading to more relevant responses.
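Illustrative Retrieval Check
A minimal sketch of a retrieval-quality check: run a small set of labeled queries through the retriever and measure how often the expected product document appears in the top-k results. The retrieve function and the labeled examples are hypothetical stand-ins for the application's own retriever and test set.

def retrieve(query: str, k: int = 5) -> list:
    # Stand-in for the application's vector search call; replace with the real retriever.
    return ["x200_warranty", "x200_specs"] if "X200" in query else ["a5_overview"]

labeled_queries = [
    {"query": "What is the warranty on the X200 headphones?", "expected_doc": "x200_warranty"},
    {"query": "Does the A5 vacuum ship with spare filters?", "expected_doc": "a5_accessories"},
]

def recall_at_k(queries, k: int = 5) -> float:
    hits = sum(1 for q in queries if q["expected_doc"] in retrieve(q["query"], k=k))
    return hits / len(queries)

print(f"Recall@5: {recall_at_k(labeled_queries):.2f}")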
Question 18 Single Choice
A Generative AI Engineer is tasked with designing an LLM-based application that fulfills a business requirement: answering employee HR-related questions by referencing HR PDF documentation.
Which set of high-level tasks should the engineer's system perform?
Explanation
Correct Answer:
Option D: Break the HR documentation into chunks and store them in a vector database. Use the employee's question to retrieve the most relevant chunks, and use the LLM to generate a response based on the retrieved documentation.
Justification for the Correct Answer (D)
A Retrieval-Augmented Generation (RAG) system is the best approach for an HR question-answering application, as it enables efficient document retrieval while keeping inference costs low. The optimal high-level tasks include:
Chunking HR Documentation:
HR PDFs are broken into smaller, semantically meaningful chunks (e.g., paragraphs or sections).
Chunking ensures better retrieval accuracy by allowing granular search.
Storing Chunks in a Vector Database:
Each chunk is embedded using an embedding model (e.g., OpenAI, Cohere, or Databricks Foundation Models).
The embeddings are stored in a vector database for similarity-based retrieval.
Retrieving Relevant Chunks Based on the Employee Query:
When an employee asks an HR-related question, the query is converted into an embedding.
A vector search retrieves the most relevant HR document chunks.
Using an LLM to Generate a Response:
The retrieved chunks provide factual grounding for the LLM-generated answer.
This approach ensures accurate responses without hallucination.
This method is efficient, scalable, and cost-effective, making it the industry standard for enterprise search applications like HR document queries.
Reference:
Databricks Documentation: Retrieval-Augmented Generation (RAG)
Vector Search for LLM Applications
Why Other Options Are Incorrect?
Option A: Compute averaged embeddings for each HR document, compare the embeddings to the user's query to identify the best document, and then pass the document along with the query into an LLM with a large context window to generate a response.
Why Incorrect?
Averaging embeddings across entire documents loses important contextual meaning.
Chunking and storing in a vector database provide much finer-grained retrieval, improving accuracy.
Sending an entire document to an LLM with a large context window is inefficient and costly.
Option B: Use an LLM to summarize the HR documents, then provide the summaries along with the user’s query to an LLM with a large context window to generate a reply.
Why Incorrect?
Summarizing entire HR documents loses critical details that may be important for specific queries.
The LLM needs factual grounding via retrieved chunks, not just pre-generated summaries.
Summarization does not scale well with dynamic queries, as different questions may require different details.
Option C: Build an interaction matrix using historical employee questions and HR documents. Apply ALS (Alternating Least Squares) to factorize the matrix and create embeddings. For new queries, calculate embeddings and use them to retrieve the best HR documentation, then use an LLM to generate a response.
Why Incorrect?
ALS is a collaborative filtering technique used for recommendation systems, not semantic search in RAG.
HR document retrieval requires a similarity search approach, not matrix factorization.
Vector-based semantic search is more accurate and scalable for retrieval.
Final Answer:
D. Break the HR documentation into chunks and store them in a vector database. Use the employee's question to retrieve the most relevant chunks, and use the LLM to generate a response based on the retrieved documentation.
This RAG-based approach is the most efficient and scalable solution for an HR document question-answering system.
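Illustrative Chunk-and-Retrieve Sketch
The toy example below walks through the retrieve-then-generate flow on three invented HR chunks. TF-IDF stands in for a real embedding model and an in-memory matrix stands in for the vector database; in production the chunks would come from parsed HR PDFs, the embeddings from a foundation model, and the retrieval from a vector search index.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

hr_chunks = [
    "Employees accrue 1.5 vacation days per month of service.",
    "Parental leave is 16 weeks at full pay for primary caregivers.",
    "Expense reports must be submitted within 30 days of purchase.",
]

vectorizer = TfidfVectorizer()
chunk_vectors = vectorizer.fit_transform(hr_chunks)  # stand-in for the vector database

question = "How much vacation do I earn each month?"
query_vector = vectorizer.transform([question])

scores = cosine_similarity(query_vector, chunk_vectors)[0]
best_chunk = hr_chunks[scores.argmax()]
print("Context passed to the LLM:", best_chunk)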
Question 19 Single Choice
A Generative AI Engineer has developed a RAG (Retrieval-Augmented Generation) application that helps employees retrieve answers from an internal knowledge base, such as Confluence pages or Google Drive. After receiving positive feedback from internal testers, the engineer now wants to formally assess the system’s performance and identify areas for improvement.
What is the best approach for the engineer to evaluate the system?
Explanation
Correct Answer:
B. Create a dataset to separately test the retrieval and generation components of the system. Utilize MLflow’s built-in evaluation metrics for this assessment.
Justification for the Correct Answer (B)
A Retrieval-Augmented Generation (RAG) system consists of two major components:
Retrieval Component: Fetches relevant knowledge base documents.
Generation Component: Uses the retrieved documents to generate a final response.
To formally assess the system’s performance and identify improvement areas, the best approach is to evaluate both retrieval and generation separately:
Evaluating Retrieval Performance
Use metrics such as Recall@K, NDCG (Normalized Discounted Cumulative Gain), and MRR (Mean Reciprocal Rank).
Ensures the most relevant documents are retrieved for the LLM.
Evaluating Generation Performance
Use MLflow’s built-in evaluation metrics, such as BLEU, ROUGE, or BERTScore, to compare generated answers with human-annotated responses.
Checks fluency, correctness, and relevance of the LLM-generated responses.
Benefits of This Approach
Separately testing retrieval and generation helps pinpoint weak areas.
Allows iterative improvements to retrieval accuracy and response quality.
MLflow provides a structured way to track model performance across versions.
Reference:
Databricks: Evaluating RAG Pipelines
Why Other Options Are Incorrect?
Option A: Use cosine similarity scores to comprehensively assess the quality of the final generated answers.
Why Incorrect?
Cosine similarity is mainly used for retrieval evaluation, not generation quality assessment.
It does not measure fluency, coherence, or factual correctness in LLM-generated responses.
Option C: Benchmark several LLMs using the same data and select the best-performing LLM for the application.
Why Incorrect?
While choosing the best LLM is important, it does not comprehensively evaluate the RAG system.
The retrieval quality is just as crucial as the LLM choice.
Testing both retrieval and generation separately is a better way to optimize performance.
Final Answer:
B. Create a dataset to separately test the retrieval and generation components of the system. Utilize MLflow’s built-in evaluation metrics for this assessment.
This approach ensures systematic evaluation and continuous improvement of the RAG pipeline.
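Illustrative Generation Evaluation with MLflow
A hedged sketch of scoring the generation component on a small labeled set with mlflow.evaluate. Parameter names, supported model types, and the extra packages required for some metrics vary across MLflow versions, so treat this as a starting point rather than a definitive recipe; the example question and answers are invented.

import mlflow
import pandas as pd

eval_df = pd.DataFrame({
    "inputs": ["How do I request access to the sales Confluence space?"],
    "predictions": ["Submit an access request through the IT portal and tag the space owner."],
    "ground_truth": ["Request access via the IT portal; the space owner approves it."],
})

with mlflow.start_run():
    results = mlflow.evaluate(
        data=eval_df,
        predictions="predictions",   # column holding the RAG system's answers
        targets="ground_truth",      # column holding reference answers
        model_type="question-answering",
    )
    print(results.metrics)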
Question 20 Single Choice
A Generative AI Engineer is designing a system to recommend the most suitable employee for newly defined projects. The employee is selected from a large pool of team members. The selection needs to consider the employee’s availability during the project timeline and how closely their profile aligns with the project’s requirements. Both the employee profiles and project scopes are composed of unstructured text.
What approach should the engineer take to design this system?
Explanation
Correct Answer:
Option D: Create a tool that finds available team members for the given project dates. Embed team member profiles in a vector store and use the project description to search and filter for the best-matched team members who are available.
Justification for the Correct Answer (D)
The problem requires a system that:
Filters employees based on availability (structured data).
Matches employees based on project scope similarity (unstructured text).
A vector search approach is best suited for matching unstructured text data such as employee profiles and project descriptions. By embedding team member profiles into a vector database, the system can perform semantic similarity searches based on the project scope.
This method allows the system to:
Efficiently compare complex text-based profiles and project scopes instead of relying on keyword matching.
Retrieve the best-matched candidates based on embeddings, ensuring semantic similarity in selection.
Ensure that only available team members are considered for selection.
Reference:
Databricks Documentation: Vector Search for Semantic Retrieval
Databricks Documentation: Implementing Semantic Search with Vector Databases
Why Other Options Are Incorrect?
Option A: Develop a tool that checks team member availability based on project dates. Store the project descriptions in a vector database and retrieve the best-matched team member profiles by comparing them to the project scope.
Why Incorrect?
This approach inverts the correct structure: it should be the employee profiles that are stored in the vector database, not the project descriptions.
Searching project descriptions to find employee profiles is less efficient, as the queries should come from the project scope and retrieve matching employee embeddings.
Option B: Create a tool to track team member availability according to project timelines and build another tool that uses an LLM to pull out key terms from the project descriptions. Then, go through the available team member profiles and match keywords to find the most suitable employee.
Why Incorrect?
Keyword matching is not effective for unstructured text comparisons because it misses contextual meaning and semantic relationships between words.
A vector-based approach is more suitable for finding the best match based on meaning rather than exact keywords.
Option C: Build a tool to identify available team members based on project dates, and another tool to calculate a similarity score between each team member’s profile and the project description. Go through all the team members and rank them by score to select the best match.
Why Incorrect?
While similarity scoring is useful, this option suggests a manual or brute-force approach of comparing each team member’s profile against the project description, which is computationally inefficient.
Using a vector database for fast retrieval is a more scalable and optimized approach.
Final Answer:
D. Create a tool that finds available team members for the given project dates. Embed team member profiles in a vector store and use the project description to search and filter for the best-matched team members who are available.
This approach optimally combines structured filtering (availability) with unstructured text similarity (vector search), ensuring an efficient and scalable solution.
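Illustrative Filter-then-Match Sketch
The toy example below combines the structured availability filter with an unstructured profile-to-scope match. TF-IDF stands in for a real embedding model, and the in-memory similarity search stands in for a vector store with metadata filtering; all names, dates, and profiles are invented.

from datetime import date
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

employees = [
    {"name": "Aisha", "profile": "Backend engineer: Python, Spark, streaming data pipelines",
     "available_from": date(2024, 7, 1)},
    {"name": "Marco", "profile": "Frontend developer: React, TypeScript, design systems",
     "available_from": date(2024, 9, 1)},
]
project = {"scope": "Build streaming data pipelines in Spark", "start": date(2024, 8, 1)}

# Structured filter: keep only team members free by the project start date.
available = [e for e in employees if e["available_from"] <= project["start"]]

# Unstructured match: rank the remaining profiles against the project scope.
vectorizer = TfidfVectorizer()
profile_vectors = vectorizer.fit_transform([e["profile"] for e in available])
scope_vector = vectorizer.transform([project["scope"]])
scores = cosine_similarity(scope_vector, profile_vectors)[0]

best_match = available[scores.argmax()]
print("Recommended team member:", best_match["name"])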



