Certified Machine Learning Associate Exam Questions

Total Questions: 278

Last Updated: Sep 2025

Question 1 Single Choice

How would you obtain summary statistics of a Spark DataFrame for comprehensive data analysis?

Question 2 Single Choice

A team is formulating guidelines on when to apply various metrics for evaluating classification models. They need to decide under what circumstances the F1 score should be favored over accuracy. The F1 score formula is given as follows:

F1 = 2 * (precision * recall) / (precision + recall)

What recommendations should the team incorporate into their guidelines?
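
To make the formula concrete, here is a minimal sketch that computes accuracy and F1 from confusion-matrix counts. The counts are hypothetical and chosen only to illustrate an imbalanced dataset (5 positives out of 100); the helper name `f1_and_accuracy` is likewise an assumption for illustration.

```python
def f1_and_accuracy(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * (precision * recall) / denom if denom else 0.0
    return accuracy, precision, recall, f1


# Hypothetical imbalanced case: 5 positives in 100 samples,
# the model finds 1 of them and raises 1 false alarm.
accuracy, precision, recall, f1 = f1_and_accuracy(tp=1, fp=1, fn=4, tn=94)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.3f}")
```

With these counts accuracy is 0.95 while F1 is about 0.286, which shows how accuracy can look strong on an imbalanced dataset even when most positives are missed.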

Question 3 Single Choice

Which of the following is an example of a distributed machine learning framework?

Question 4 Single Choice

A Data Scientist is using a feature store. In one of the feature tables she wants to replace missing values with each respective feature variable's median value.

A colleague suggests that the data scientist is throwing away valuable information by doing this. Which of the following approaches can they take to include as much information as possible in the feature set?

Choose only ONE best answer.
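
For illustration only, the sketch below shows one common way to keep the "was missing" signal alongside a median-imputed feature: add a binary indicator column per imputed feature. This is an assumption about a typical technique, not the answer key from the source; the function name `impute_with_indicator` and the sample values are hypothetical, and plain Python lists stand in for a feature table.

```python
from statistics import median


def impute_with_indicator(values):
    """Fill None entries with the median of the observed values and
    return the filled list plus a 0/1 flag marking which entries were filled."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    imputed = [fill if v is None else v for v in values]
    was_missing = [1 if v is None else 0 for v in values]
    return imputed, was_missing


# Hypothetical feature column with two missing entries.
income = [40.0, None, 55.0, 70.0, None]
imputed, was_missing = impute_with_indicator(income)
print(imputed)
print(was_missing)
```

The indicator column lets a downstream model distinguish a genuinely observed median-like value from one that was filled in.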

Question 5 Single Choice

In PySpark, the _________ library is provided, which makes it easy to integrate Python with Apache Spark.

Question 6 Single Choice

Which of the following describes the relationship between native Spark DataFrames and pandas API on Spark DataFrames?

Choose only ONE best answer.

The Pandas API on Spark DataFrames allows users to interact with Spark DataFrames using a syntax and set of functions similar to the familiar pandas library. This integration enables easier adoption of Spark for users who are already proficient in pandas, while still leveraging the distributed and scalable nature of Spark.

Key Features:

Built on Spark DataFrames:
- The Pandas API on Spark is implemented as a thin layer on top of the Spark DataFrame API: a pandas-on-Spark DataFrame is backed by a Spark DataFrame plus additional metadata, such as index information.
- It enables a pandas-like interface while maintaining the distributed capabilities of Spark.
Familiarity for Pandas Users:
- Provides a seamless transition for pandas users to work with Spark, leveraging distributed computing without having to learn a completely new syntax.
Distributed Environment:
- Like Spark DataFrames, the Pandas API on Spark operates in a distributed environment, ensuring scalability and performance for large datasets.
Integration:
- Makes it easier to integrate pandas-based workflows with Spark-based pipelines, reducing the learning curve for data professionals.

Explanation of Incorrect Options:

Option A:
Incorrect. The Pandas API on Spark is designed to work in a distributed environment, just like native Spark DataFrames, and is not limited to single-node processing.
Option B:
Incorrect. The Pandas API on Spark is not a separate implementation but rather built on top of Spark DataFrames, providing a pandas-like API for distributed data processing.
Option C:
Incorrect. Both are backed by immutable Spark DataFrames. Pandas-style assignments in the Pandas API on Spark produce a new underlying Spark DataFrame rather than modifying data in place.
Option D:
Incorrect. The performance of the Pandas API on Spark is heavily reliant on the underlying Spark operations. While there may be some overhead introduced by the additional API layer, it is generally designed to take advantage of Spark's scalability.

Conclusion:

The Pandas API on Spark DataFrames bridges the gap between pandas and Spark, offering a familiar interface for pandas users while leveraging Spark's distributed processing power. This functionality makes it easier to work with large datasets and integrate pandas-like workflows into Spark-based environments.
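
As a rough illustration of the relationship described above, the following sketch converts between the two DataFrame types. It assumes PySpark 3.2 or later (where `pyspark.pandas` and `DataFrame.pandas_api()` are available) and a working Spark runtime; the function name `pandas_on_spark_roundtrip` and the column names are hypothetical.

```python
def pandas_on_spark_roundtrip():
    """Build a pandas-on-Spark DataFrame, view it as a native Spark
    DataFrame, and convert it back (assumes PySpark >= 3.2)."""
    # Imported inside the function so that merely loading this snippet
    # does not require a Spark installation.
    import pyspark.pandas as ps

    psdf = ps.DataFrame({"x": [1, 2, 3]})
    psdf["y"] = psdf["x"] * 2      # pandas-style column assignment
    sdf = psdf.to_spark()          # native Spark DataFrame over the same data
    back = sdf.pandas_api()        # back to the pandas API on Spark
    return sdf.columns, back["y"].sum()
```

Calling `pandas_on_spark_roundtrip()` on a machine with Spark installed starts (or reuses) a Spark session, so both representations run on the same distributed engine.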

Question 7 Single Choice

Which of the following tools can be used to parallelize the hyperparameter tuning process for single-node machine learning models using a Spark cluster?

Choose only ONE best answer.

Question 8 Single Choice

How does Spark ML tackle a linear regression problem for an extraordinarily large dataset?

Which one of the options is correct?

Choose only ONE best answer.

Question 9 Single Choice

Binning is the process of converting numeric data into categorical data by grouping continuous data into discrete bins or intervals.
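
A minimal sketch of the binning idea described above, using only the standard library. The bin edges, labels, and ages are hypothetical values chosen for illustration, and `bin_value` is an illustrative helper, not a library function.

```python
from bisect import bisect_right


def bin_value(value, edges, labels):
    """Map a numeric value to a categorical label.

    `edges` are sorted cut points; a value equal to an edge falls into
    the bin that starts at that edge. `labels` needs len(edges) + 1 entries.
    """
    return labels[bisect_right(edges, value)]


# Hypothetical age bins: [..18), [18..35), [35..60), [60..]
edges = [18, 35, 60]
labels = ["under 18", "18-34", "35-59", "60+"]
ages = [5, 18, 34, 35, 72]
binned = [bin_value(a, edges, labels) for a in ages]
print(binned)
```

Each continuous age is replaced by the label of the interval it falls into, which is the numeric-to-categorical conversion the statement refers to.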

Question 10 Single Choice

A machine learning engineer attempts to scale an ML pipeline by distributing its single-node model tuning procedure. After broadcasting the entire training data onto each core, each core in the cluster is capable of training one model at once. As the tuning process is still sluggish, the engineer plans to enhance the parallelism from 4 to 8 cores to expedite the process. Unfortunately, the total memory in the cluster can't be increased.

Under which conditions would elevating the parallelism from 4 to 8 cores accelerate the tuning process?

Choose only ONE best answer.
