

AWS Certified Machine Learning - Specialty - (MLS-C01) Exam Questions
Question 11 Single Choice
A data science team at your organization is tasked with creating a machine learning model to forecast the sale prices of houses using characteristics such as the home's square footage. However, approximately 10% of the entries in the modest-sized training dataset are missing the square footage attribute. Given the importance of model accuracy in your application, which approach should the team employ to handle missing values in the training data effectively?
Explanation

Evaluate the effectiveness of various imputation methods for missing data in relation to model accuracy and missing data percentage.
Correct Choice: Employ k-nearest neighbors (KNN) imputation to fill missing square footage.
KNN imputation maintains the sample's structure and can accurately predict missing values based on similarity to nearest neighbors. It's well-suited for datasets with complex relationships.
Incorrect Choice: Impute missing square footage using mean imputation before training.
Mean imputation can reduce variance and introduce bias, especially if data is not missing at random. It's simple but may not be accurate for skewed distributions.
Incorrect Choice: Apply linear regression to estimate missing square footage based on other features.
While linear regression can predict missing values, it assumes a linear relationship between features, which may not always hold, potentially leading to inaccurate imputations.
Incorrect Choice: Replace missing square footage with the dataset's mode.
Mode imputation is simplistic and might be used for categorical data; however, for numerical features like square footage, it disregards the variability and distribution within the data, potentially leading to misleading analysis and model performance.
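To make the correct choice concrete, here is a minimal sketch of KNN imputation using scikit-learn's KNNImputer; the column names, toy values, and the n_neighbors setting are illustrative assumptions, not part of the question.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy housing data (hypothetical columns); 'sqft' has missing entries.
df = pd.DataFrame({
    "sqft":     [1500, np.nan, 2200, 1800, np.nan, 2600],
    "bedrooms": [3, 2, 4, 3, 2, 5],
    "age":      [10, 25, 5, 15, 30, 2],
})

# Each missing value is estimated from the 2 most similar rows,
# where similarity is computed on the features that are present.
imputer = KNNImputer(n_neighbors=2)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(df_imputed)
```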
Question 12 Multiple Choice
A data scientist is training a deep learning model for image classification using a convolutional neural network (CNN). The model performs exceptionally well on the training data but significantly underperforms on new, unseen images. To minimize overfitting and improve the model's generalization to new data, which TWO of the following approaches should the data scientist take? (Select TWO)
Explanation

Focus on techniques that reduce model complexity and increase dataset variability to prevent overfitting.
Correct choice: Use early stopping
Early stopping halts training when performance on a validation set stops improving, preventing overfitting by not allowing the model to learn noise in the training data.
Correct choice: Use dropout regularization
Dropout randomly deactivates a subset of neurons during training, which helps in preventing the network from becoming too dependent on any single neuron and thus reduces overfitting.
Incorrect choice: Employ gradient checking
Gradient checking is used to verify the correctness of backpropagation implementation by comparing the gradients to numerical approximations, not directly addressing overfitting.
Incorrect choice: Use more layers in the network
Adding more layers can increase model complexity, potentially leading to overfitting, especially if there's not enough data to support the complexity.
Incorrect choice: Use more features in the training data
Adding too many features without proper selection or regularization can also lead to overfitting, as the model may pick up noise instead of the underlying pattern.
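As a concrete illustration of the two correct techniques, here is a minimal Keras sketch; the network shape, dropout rate, and patience value are illustrative assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Small CNN with dropout regularization.
model = keras.Sequential([
    keras.Input(shape=(64, 64, 3)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),   # randomly deactivates 50% of the units at each training step
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Early stopping: monitor validation accuracy and halt when it stops improving.
early_stop = keras.callbacks.EarlyStopping(monitor="val_accuracy",
                                           patience=5,
                                           restore_best_weights=True)

# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=200, callbacks=[early_stop])
```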
Question 13 Single Choice
As a data scientist involved in the development of a self-driving car system, your task is to implement a computer vision solution capable of categorizing every pixel in images captured by the car's cameras. The categories include identifying objects like people, buildings, roads, signs, and vehicles.
How would you implement a computer vision solution capable of classifying every pixel in images captured by the car's cameras?
Explanation

Evaluate each choice based on the requirements for pixel-level semantic segmentation of the car's camera images, considering the capabilities of the AWS services mentioned.
Correct choice: Fine-tune the SageMaker built-in semantic segmentation algorithm using a pre-trained ResNet50 backbone
The SageMaker semantic segmentation algorithm is designed for pixel-level classification, which directly addresses the requirement, and using a pre-trained ResNet50 backbone leverages transfer learning to improve performance. The algorithm offers a choice of three network architectures (Fully Convolutional Network (FCN), Pyramid Scene Parsing (PSP), and DeepLabV3), each with a choice of a ResNet50 or ResNet101 backbone, and these can be used to train custom pixel-level classification models. A configuration sketch follows this explanation.
Incorrect choice: Fine-tune the SageMaker built-in object detection algorithm using a pre-trained Faster R-CNN backbone
Object detection models provide bounding boxes around objects, but do not provide the required pixel-level classification. Semantic segmentation is a more appropriate approach.
Incorrect choice: Customize Amazon Rekognition Custom Labels to build a pixel-level image classification model for analyzing the car's camera images
Rekognition is primarily for object detection and classification, not semantic segmentation. It may not be the best fit for the pixel-level classification requirement.
Incorrect choice: Customize Amazon Rekognition to process the car's camera video streams
Rekognition is not optimized for the pixel-level classification task. Video processing may not be necessary if the requirement is to classify individual camera images.
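A sketch of how the built-in semantic segmentation algorithm might be configured with the SageMaker Python SDK; the bucket, IAM role, and instance type are placeholders, the hyperparameter values are illustrative, and the channel names shown are the ones typically used by this algorithm.

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
region = session.boto_region_name
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"   # placeholder

# Container image for the built-in semantic segmentation algorithm.
container = image_uris.retrieve("semantic-segmentation", region)

estimator = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",                                # placeholder
    output_path="s3://example-bucket/segmentation-output/",       # placeholder
    sagemaker_session=session,
)

# FCN decoder on a pre-trained ResNet50 backbone (transfer learning).
estimator.set_hyperparameters(
    algorithm="fcn",
    backbone="resnet-50",
    use_pretrained_model=True,
    num_classes=5,          # five object categories from the scenario (illustrative)
    epochs=30,
)

# estimator.fit({
#     "train": "s3://example-bucket/train/",
#     "validation": "s3://example-bucket/validation/",
#     "train_annotation": "s3://example-bucket/train_annotation/",
#     "validation_annotation": "s3://example-bucket/validation_annotation/",
# })
```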
Question 14 Single Choice
A financial services firm is leveraging Amazon SageMaker to develop machine learning models that predict market trends. Due to the sensitive nature of their data, the firm's policy prohibits direct internet access from their virtual private cloud (VPC) to ensure the security of their data. They require the ability to use SageMaker notebook instances for model development without exposing these instances to the internet. What approach should the firm take to securely utilize SageMaker notebooks within their VPC in compliance with their security policy?
Explanation

Mastering "Infrastructure Security" as detailed in the SageMaker developer guide is essential. Thoroughly reviewing its extensive content is crucial to excelling in this certification.
Correct Choice: Disable internet access for SageMaker, establish VPC endpoints, update security groups, and access SageMaker notebooks through PrivateLink.
This configuration secures SageMaker notebooks by disabling direct internet access and employing VPC endpoints, ensuring that traffic to the SageMaker notebook stays within the AWS network rather than traversing the public internet. AWS PrivateLink provides the private connectivity to SageMaker that bypasses the public internet. A sketch of creating the endpoints follows this explanation.
Incorrect Choice: Disable internet access for SageMaker, configure VPC peering, update security groups, and access SageMaker notebooks through peered VPC.
VPC peering is designed for network connectivity between two VPCs, not for providing direct access to SageMaker notebooks from on-premises or external environments.
Incorrect Choice: Disable internet access for SageMaker, establish VPC endpoints, update security groups, and access SageMaker notebooks through AWS Client VPN.
AWS Client VPN provides an encrypted connection from a client device, such as a laptop or mobile device, to the AWS network. However, the VPN tunnel itself is established over the public internet, so this option does not meet the firm's requirement to avoid internet exposure the way VPC endpoints with PrivateLink do.
Incorrect Choice: Disable internet access for SageMaker, create NAT Gateway, update route tables, and link to SageMaker notebooks.
A NAT Gateway provides outbound internet connectivity for resources in private subnets: instances can initiate connections to external services over the public internet, while unsolicited inbound connections are blocked. Because traffic still reaches the public internet, this contradicts the objective of keeping SageMaker notebook traffic off the internet entirely.
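A sketch of creating the interface VPC endpoints (PrivateLink) referenced in the correct choice, using boto3; the Region, VPC, subnet, and security group IDs are placeholders, and the endpoint service names should be confirmed for your Region.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")           # placeholder Region

vpc_id = "vpc-0123456789abcdef0"                              # placeholder
subnet_ids = ["subnet-0123456789abcdef0"]                     # placeholder
security_group_ids = ["sg-0123456789abcdef0"]                 # placeholder

# Interface endpoints (PrivateLink) for the SageMaker API, runtime, and
# notebook traffic (the notebook service name follows the
# aws.sagemaker.<region>.notebook pattern).
for service in ("com.amazonaws.us-east-1.sagemaker.api",
                "com.amazonaws.us-east-1.sagemaker.runtime",
                "aws.sagemaker.us-east-1.notebook"):
    ec2.create_vpc_endpoint(
        VpcEndpointType="Interface",
        VpcId=vpc_id,
        ServiceName=service,
        SubnetIds=subnet_ids,
        SecurityGroupIds=security_group_ids,
        PrivateDnsEnabled=True,
    )
```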
Question 15 Single Choice
A data scientist at a retail company is analyzing customer purchase patterns to segment them into distinct groups for targeted marketing campaigns. To achieve this, the scientist is employing k-Means clustering. What is the most effective method for selecting the optimal number of clusters (k) to accurately categorize customers into meaningful segments?
Explanation

Focus on services best suited for machine learning and cluster analysis, and consider how they'd specifically support choosing k.
Correct Choice: Apply the "elbow method" by analyzing a plot of the total within-cluster sum of squares (WSS) against the number of clusters (k).
The elbow method is a popular technique for determining k in k-Means clustering. Amazon SageMaker supports various machine learning algorithms, including k-means, making it suitable for implementing this method. The steps are to train multiple models with different values of k, preferably in parallel, and to choose the k at the "elbow" point where the decrease in the cost function levels off. Using k-means through Amazon SageMaker offers additional advantages such as scalable training and automatic model deployment, eliminating the need for infrastructure setup and management. A short sketch of the elbow computation follows this explanation.
Incorrect Choice: Implement the silhouette analysis using AWS Glue DataBrew for data preparation and feature engineering before clustering.
AWS Glue DataBrew is great for data preparation but doesn't directly assist in selecting the number of clusters. It's useful for cleaning and preparing your data before analysis.
Incorrect Choice: Apply Principal Component Analysis (PCA) using Amazon Redshift ML to automatically select the best k value based on data dimensionality reduction.
While PCA is useful for dimensionality reduction and can improve clustering performance, it doesn't directly help in choosing the optimal number of clusters. Amazon Redshift ML facilitates machine learning predictions in SQL but isn't tailored for determining k in clustering.
Incorrect Choice: Utilize AWS Lambda with Amazon QuickSight for dynamic scaling and computing the Gap Statistic to find the optimal number of clusters.
AWS Lambda and Amazon QuickSight are powerful for serverless computing and business intelligence, respectively, but they are not the right tools for computing the optimal number of clusters. The Gap Statistic method is valid, but it typically requires a more direct computational approach, such as scripting in Python or R within a machine learning framework. While the Gap Statistic provides a more principled statistical approach, comparing the observed clustering against a null reference distribution with no cluster structure, its greater computational demands can make it less cost-effective when cloud resources and budget are constraints.
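A minimal sketch of the elbow method itself, run locally with scikit-learn on synthetic data (the same loop over k could be executed as parallel SageMaker k-means training jobs); the data and the range of k are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic "customer" features purely for illustration.
X, _ = make_blobs(n_samples=500, centers=4, n_features=5, random_state=42)

# Total within-cluster sum of squares (WSS) for each candidate k.
wss = {}
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wss[k] = km.inertia_   # inertia_ is the WSS for this k

for k, value in wss.items():
    print(f"k={k:2d}  WSS={value:10.1f}")
# Plot k vs. WSS and pick the k at the "elbow", where the curve flattens.
```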
Question 16 Multiple Choice
As part of a major content digitization initiative, your team has been tasked with organizing a vast library of encyclopedia articles to enable efficient search and retrieval. The articles are currently stored in raw text format, without any pre-assigned topic labels or categories. To unlock the full value of this content, you need a way to automatically classify the articles into relevant topics with minimal manual effort. Which AWS services or tools would you recommend using to tackle this challenge? (SELECT TWO)
Explanation

To approach this question, the learner should review their knowledge of common NLP techniques like topic modeling, text classification, and supervised/unsupervised learning. They should evaluate the provided choices based on the specific capabilities of each AWS service and how well they align with the requirements of automatically assigning topics to a large corpus of unstructured text data.
Correct Choice: Use Amazon Comprehend's Latent Dirichlet Allocation (LDA) feature to automatically classify the encyclopedia articles.
Amazon Comprehend's LDA feature is specifically designed for unsupervised topic modeling of unstructured text data. It can automatically discover and assign topics to the encyclopedia articles with minimal human effort.
Correct Choice: Use the Amazon SageMaker Neural Topic Model (NTM) to automatically discover and assign topics to the encyclopedia articles.
The Amazon SageMaker NTM is a powerful unsupervised topic modeling algorithm that can automatically discover and assign topics to the encyclopedia articles. It can capture more complex topic relationships compared to traditional LDA.
Incorrect Choice: Employ Amazon Kendra for natural language search, allowing users to find articles by querying with natural language without pre-classifying them into topics.
While Amazon Kendra provides natural language search capabilities, it is not designed for automatically classifying the encyclopedia articles into topics. Kendra is more suitable for interactive search and retrieval, not unsupervised topic modeling.
Incorrect Choice: Use Amazon SageMaker Ground Truth to manually label a sample of the articles, then train a custom text classification model on the labeled data.
While this approach can work, it requires significant manual effort to label a sample of the articles. It may not be the most efficient solution for automatically classifying the entire encyclopedia dataset.
Incorrect Choice: Leverage Amazon Textract's document understanding capabilities to extract topics from the article text.
While Amazon Textract is a powerful service for extracting structured data from documents, it may not be the most suitable choice for automatically assigning topics to the unstructured encyclopedia articles. Textract is primarily designed for tasks like extracting text, tables, and forms from scanned documents, rather than performing advanced natural language processing required for unsupervised topic modeling. Other AWS services that are more focused on natural language understanding and unsupervised topic discovery would likely be a better fit for this specific use case.
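A sketch of starting Amazon Comprehend's topic modeling job with boto3, which corresponds to the first correct choice; the bucket, IAM role, job name, and number of topics are placeholders or illustrative assumptions.

```python
import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")   # placeholder Region

response = comprehend.start_topics_detection_job(
    JobName="encyclopedia-topics",                                  # placeholder
    NumberOfTopics=40,                                              # illustrative
    DataAccessRoleArn="arn:aws:iam::123456789012:role/ComprehendS3AccessRole",  # placeholder
    InputDataConfig={
        "S3Uri": "s3://example-bucket/articles/",                   # placeholder
        "InputFormat": "ONE_DOC_PER_FILE",                          # one article per text file
    },
    OutputDataConfig={
        "S3Uri": "s3://example-bucket/topic-output/",               # placeholder
    },
)
print(response["JobId"])   # topic assignments are written to the output S3 prefix
```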
Question 17 Single Choice
A healthcare analytics firm is leveraging Amazon SageMaker to train machine learning models on sensitive patient data. To comply with strict data privacy regulations, the training jobs are configured to run within a Virtual Private Cloud (VPC) that lacks direct internet access. What method should be employed to ensure these training jobs can securely access training data stored in an Amazon S3 bucket?
Explanation

Identify solutions that enable secure, private access to AWS services without requiring internet connectivity.
Correct Choice: Utilize VPC endpoints to allow direct, private connections to Amazon S3 from the VPC.
VPC endpoints facilitate secure and private communications between AWS services without traversing the public internet, making it the most suitable option for accessing S3 from a VPC without internet access. This aligns with AWS best practices for accessing S3 from private VPCs securely.
Incorrect Choice: Configure an Internet Gateway in the VPC for SageMaker to access S3 buckets over the internet.
Adding an Internet Gateway would provide internet access, violating the original premise of maintaining a VPC without internet connectivity for security reasons.
Incorrect Choice: Establish a NAT Gateway in each subnet where SageMaker training jobs are run to enable S3 access.
NAT Gateways are used to enable instances in a private subnet to initiate outbound traffic to the internet (or other AWS services), but they do not support inbound traffic and thus are not suitable for this scenario. They also indirectly imply accessing the internet, which contradicts the no-internet connectivity requirement.
Incorrect Choice: Use Direct Connect to establish a dedicated network connection from the VPC to S3 for secure data access.
AWS Direct Connect is used to establish a private connection between an on-premises network and AWS, not within AWS services. Moreover, it's more complex and not required for the scenario of accessing S3 from SageMaker within a VPC.
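A sketch of adding a gateway VPC endpoint for S3 with boto3 so that training jobs in the internet-free VPC can reach the training data; the Region and resource IDs are placeholders.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")        # placeholder Region

# Gateway endpoint for S3: adds routes so S3 traffic stays on the AWS network.
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",                          # placeholder
    ServiceName="com.amazonaws.us-east-1.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],                # placeholder
)

# The training job itself is then launched inside the VPC, for example by
# passing subnets and security groups to the SageMaker Estimator:
# Estimator(..., subnets=["subnet-0123456789abcdef0"],
#           security_group_ids=["sg-0123456789abcdef0"])
```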
Question 18 Single Choice
In a clinical trial dataset that encompasses a variety of features including Mean Arterial Pressure (MAP), it's observed that the features exhibit low correlation with one another. The dataset is almost complete, with less than 1% of the MAP values missing. Aside from a few outliers, the MAP data is relatively uniformly distributed, and all other features are fully accounted for. Given these characteristics, which approach should be adopted to manage the missing MAP data most effectively?
Explanation

To determine the most appropriate method for handling missing data, consider the distribution of the dataset and the influence of outliers on central tendency measures. Review machine learning topics related to data preprocessing, specifically imputation techniques and their impact on model accuracy and bias.
Correct Choice: Impute the missing MAP values using the median of the MAP data available, considering the distribution's characteristics and the presence of outliers.
Given the dataset's description, where the MAP data is mostly evenly distributed but includes some outliers, imputing missing values with the median is the most robust approach. The median is less sensitive to outliers than the mean, making it a better choice for datasets where outliers are present. Since the proportion of missing data is minimal (<1%), this approach allows for maintaining the integrity of the MAP feature without introducing significant bias.
Incorrect Choice: Impute the missing MAP values with the mean of the available MAP data, aiming for a central tendency measure.
While imputing missing values with the mean is a common practice, this method is more susceptible to distortion by outliers, especially in datasets where these are present. Given the mention of outliers in the MAP distribution, relying on the mean for imputation could skew the results, potentially leading to inaccurate representations of the MAP values.
Incorrect Choice: Exclude the MAP column entirely from the dataset due to the incompleteness of its data.
Dropping the MAP column entirely would eliminate a potentially valuable feature from the analysis, which is not advisable given that the majority of the MAP data is intact. This approach would disregard useful information that could contribute significantly to the outcomes of the clinical trial analysis.
Incorrect Choice: Fill in the missing MAP values with random noise to maintain the dataset's volume.
Introducing random noise to replace missing MAP values does not adhere to sound statistical principles and would likely degrade the quality of the dataset. This method does not contribute to an accurate or meaningful imputation of the missing data and could introduce additional variability into the analysis.
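A minimal sketch of median imputation with pandas; the column name and values, including the deliberate outlier, are illustrative.

```python
import numpy as np
import pandas as pd

# Toy trial data: a few MAP readings (mmHg) are missing; one outlier is present.
df = pd.DataFrame({"map_mmHg": [70, 85, 90, np.nan, 88, 250, 92, np.nan, 87, 91]})

# The median is robust to the outlier (250), unlike the mean.
median_map = df["map_mmHg"].median()
df["map_mmHg"] = df["map_mmHg"].fillna(median_map)

print(f"median used for imputation: {median_map}")
print(df)
```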
Question 19 Single Choice
A team is developing a "Universal Translator" application that can recognize spoken language, translate it into English, and then articulate the English translation audibly. Which sequence of AWS services should be implemented to achieve this functionality?
Explanation

Identify the services that specifically handle speech recognition, language translation, and speech synthesis in the correct order of operations.
Correct choice: Amazon Transcribe → Amazon Translate → Amazon Polly
This sequence accurately reflects the required steps: speech-to-text conversion, text translation, and text-to-speech synthesis, matching the workflow of a universal translator.
Correct choice: Amazon Transcribe to recognize and transcribe speech, Amazon Translate for translation to English, and Amazon Polly to synthesize the translated text into speech.
This choice details the process using service descriptions, clearly outlining the appropriate use of each service in the workflow.
Incorrect choice: Amazon Polly → Amazon Transcribe → Amazon Translate
This sequence begins with speech synthesis, which is not the starting point for a translation tool that first requires speech recognition.
Incorrect choice: Amazon Rekognition → Amazon Translate → Amazon Polly
Amazon Rekognition is designed for image and video analysis, not speech recognition, making it unsuitable for the initial step in a translation application.
Incorrect choice: AWS Lambda → Amazon Translate → Amazon Polly
Starting with AWS Lambda suggests custom code execution without specifying speech-to-text recognition, missing a critical initial step of converting spoken language into text.
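A sketch of chaining the three services with boto3; the bucket, input file, job name, and voice are placeholders, and note that Amazon Transcribe runs asynchronously, so the job is polled before the transcript is fetched.

```python
import json
import time
import urllib.request

import boto3

region = "us-east-1"                                     # placeholder
transcribe = boto3.client("transcribe", region_name=region)
translate = boto3.client("translate", region_name=region)
polly = boto3.client("polly", region_name=region)

# 1) Speech -> text (asynchronous transcription job).
job_name = "universal-translator-demo"                   # placeholder
transcribe.start_transcription_job(
    TranscriptionJobName=job_name,
    Media={"MediaFileUri": "s3://example-bucket/input/speech.mp3"},   # placeholder
    MediaFormat="mp3",
    IdentifyLanguage=True,                               # detect the spoken language
)
while True:
    job = transcribe.get_transcription_job(TranscriptionJobName=job_name)
    if job["TranscriptionJob"]["TranscriptionJobStatus"] in ("COMPLETED", "FAILED"):
        break
    time.sleep(5)
transcript_uri = job["TranscriptionJob"]["Transcript"]["TranscriptFileUri"]
with urllib.request.urlopen(transcript_uri) as f:
    transcript_text = json.load(f)["results"]["transcripts"][0]["transcript"]

# 2) Text -> English.
translated = translate.translate_text(
    Text=transcript_text,
    SourceLanguageCode="auto",
    TargetLanguageCode="en",
)["TranslatedText"]

# 3) English text -> speech.
audio = polly.synthesize_speech(Text=translated, OutputFormat="mp3", VoiceId="Joanna")
with open("translation.mp3", "wb") as out:
    out.write(audio["AudioStream"].read())
```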
Question 20 Single Choice
An AI developer is fine-tuning a deep learning model for image recognition tasks. During the training process, the model's performance is measured by its accuracy on a separate validation dataset after each training epoch. The model demonstrates consistent improvement in accuracy up to the 100th epoch. However, post-100th epoch, while the training accuracy still improves, the validation accuracy starts to decline. What is the most probable remediation for this divergence in accuracy trends between the training and validation sets?
Explanation

Focus on strategies that prevent overfitting and allow a model to generalize well to new data.
Correct Choice: Implement early stopping using Amazon SageMaker's automatic model tuning to halt training when validation accuracy decreases.
Early stopping halts training when the validation accuracy starts to drop, which indicates overfitting. In SageMaker automatic model tuning, you can enable early stopping so that training jobs whose validation metric (for example, validation accuracy) stops improving are halted. A sketch follows this explanation.
Incorrect Choice: Increase Amazon SageMaker's resource allocation to the model training job to process a larger validation set.
More resources may speed up training but won't prevent overfitting. This would be valid if computational constraints were causing training to be incomplete or slow.
Incorrect Choice: Adjust the learning rate in Amazon SageMaker's hyperparameter optimization to improve validation set performance.
Adjusting the learning rate might help convergence but won't address overfitting once it starts. This is valid earlier in training to find the optimal learning rate.
Incorrect Choice: Utilize Amazon S3's versioning feature to revert to the model state at the 100th epoch for subsequent training sessions.
Versioning allows reverting to a previous model state but doesn't prevent overfitting during training. It's valid for model version control and rollback, not as a proactive training strategy.
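A sketch of enabling early stopping in a SageMaker automatic model tuning job; it assumes `estimator` is an already configured SageMaker Estimator whose training logs emit a validation accuracy metric, and the metric regex and hyperparameter range are illustrative.

```python
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner

# `estimator` is assumed to be a configured SageMaker Estimator (see the
# Question 13 sketch for an example of constructing one).
tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:accuracy",
    objective_type="Maximize",
    hyperparameter_ranges={"learning_rate": ContinuousParameter(1e-5, 1e-2)},
    metric_definitions=[{"Name": "validation:accuracy",
                         "Regex": "val_accuracy: ([0-9\\.]+)"}],   # illustrative regex
    max_jobs=20,
    max_parallel_jobs=4,
    early_stopping_type="Auto",   # stop jobs whose validation metric stops improving
)

# tuner.fit({"train": "s3://example-bucket/train/",
#            "validation": "s3://example-bucket/validation/"})
```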



