

Certified Associate Developer for Apache Spark Exam Questions
Question 1 Single Choice
Which of the following options describes the responsibility of the executors in Spark?
Explanation
The executors accept tasks from the driver, execute those tasks, and return results to the driver.
Correct!
The executors accept tasks from the cluster manager, execute those tasks, and return results to the driver.
No. The cluster manager's job is to manage computing resources in the cluster, not to distribute tasks among executors; that is the driver's job.
The executors accept tasks from the driver, execute those tasks, and return results to the cluster manager.
No, results get returned to the driver, not the cluster manager.
The executors accept jobs from the driver, analyze those jobs, and return results to the driver.
Wrong. The driver passes tasks to the executors, not jobs. A job usually contains multiple tasks, which are split among the executors. Also, executors do not merely analyze the tasks they receive from the driver; they execute them.
The executors accept jobs from the driver, plan those jobs, and return results to the cluster manager.
Incorrect. Spark generates logical and physical plans on the driver, not on the executors. Results get returned to the driver. Executors accept tasks from the driver (see also previous answer).
More info: Running Spark: an overview of Spark’s runtime architecture - Manning
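For illustration, here is a minimal PySpark sketch (assuming a running SparkSession named `spark`, which is not part of the question): the driver only builds a plan until an action is called; the action is then broken into tasks that run on the executors, and the results come back to the driver.

```python
# Minimal sketch, assuming a running SparkSession named `spark`.
df = spark.range(1_000_000)                    # only a plan on the driver, nothing runs yet
doubled = df.selectExpr("id * 2 AS doubled")   # still only a plan

# The action below makes the driver split the work into tasks, hand them to the
# executors, and gather the executors' results back on the driver.
result = doubled.selectExpr("sum(doubled)").collect()
print(result)
```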
Question 2 Single Choice
Which of the following describes the role of tasks in the Spark execution hierarchy?
Explanation
Stages with narrow dependencies can be grouped into one task.
Wrong, tasks with narrow dependencies can be grouped into one stage.
Tasks with wide dependencies can be grouped into one stage.
Wrong. A wide transformation causes a shuffle, which always marks the boundary of a stage, so you cannot bundle multiple tasks that have wide dependencies into a single stage.
Tasks are the second-smallest element in the execution hierarchy.
No, they are the smallest element in the execution hierarchy.
Within one task, the slots are the unit of work done for each partition of the data.
No, tasks are the unit of work done per partition. Slots help Spark parallelize work. An executor can have multiple slots which enable it to process multiple tasks in parallel.
More info: Spark Certification Study Guide - Part 1 (Core) | Raki Rahman
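As a rough illustration (a sketch, assuming a running SparkSession named `spark`): each stage runs one task per partition, and a wide transformation such as groupBy starts a new stage because it requires a shuffle.

```python
# Sketch, assuming a running SparkSession named `spark`.
df = spark.range(0, 100_000, numPartitions=8)
print(df.rdd.getNumPartitions())   # 8 -> the first stage runs 8 tasks, one per partition

# groupBy is a wide transformation: it shuffles the data and therefore starts a new stage.
counts = df.groupBy((df.id % 10).alias("bucket")).count()
counts.collect()                   # action: the Spark UI shows the job's stages and their tasks
```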
Question 3 Single Choice
Which of the following describes the role of the cluster manager?
Explanation
The cluster manager allocates resources to Spark applications and maintains the executor processes in client mode.
Correct. In client mode and cluster mode, the cluster manager is located on a node other than the client machine. From there it starts and stops executor processes on the cluster nodes as required by the Spark application running on the Spark driver.
The cluster manager allocates resources to Spark applications and maintains the executor processes in remote mode.
Wrong, there is no "remote" execution mode in Spark. Available execution modes are local, client, and cluster.
The cluster manager allocates resources to the DataFrame manager
Wrong, there is no "DataFrame manager" in Spark.
The cluster manager schedules tasks on the cluster in client mode.
No, in client mode, the Spark driver schedules tasks on the cluster – not the cluster manager.
The cluster manager schedules tasks on the cluster in local mode.
Wrong: In local mode, there is no "cluster". The Spark application is running on a single machine, not on a cluster of machines.
More info: Cluster Mode Overview - Spark 3.1.1 Documentation and Spark – The Definitive Guide, Chapter 15
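As a small, hedged illustration (assuming a running SparkSession named `spark`): an application can inspect which cluster manager it was submitted to and which deploy mode was requested.

```python
# Sketch, assuming a running SparkSession named `spark`.
sc = spark.sparkContext
print(sc.master)  # e.g. "yarn", "spark://host:7077", "k8s://..." or "local[*]" (local mode, no cluster manager)
print(sc.getConf().get("spark.submit.deployMode", "client"))  # "client" or "cluster"
```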
Question 4 Single Choice
Which of the following is the idea behind dynamic partition pruning in Spark?
Explanation
Dynamic partition pruning is intended to skip over the data you do not need in the results of a query.
Correct. Dynamic partition pruning provides an efficient way to selectively read data from files by skipping data that is irrelevant for the query. For example, if a query filters for rows with values greater than 12 in the purchases column, Spark only reads the rows that match this criterion from the underlying files. This method works best when the filter is applied to a nonpartitioned table (here, the one holding the purchases column) and the table whose data is to be pruned is partitioned.
Dynamic partition pruning reoptimizes query plans based on runtime statistics collected during query execution.
No – this is what adaptive query execution does, but not dynamic partition pruning.
Dynamic partition pruning concatenates columns of similar data types to optimize join performance.
Wrong, this answer does not make sense and has nothing to do with dynamic partition pruning.
Dynamic partition pruning reoptimizes physical plans based on data types and broadcast variables.
It is true that dynamic partition pruning works in joins using broadcast variables. This actually happens in both the logical optimization and the physical planning stage. However, data types do not play a role for the reoptimization.
Dynamic partition pruning performs wide transformations on disk instead of in memory.
This answer does not make sense. Dynamic partition pruning is meant to accelerate Spark – performing any transformation involving disk instead of memory resources would decelerate Spark and certainly achieve the opposite effect of what dynamic partition pruning is intended for.
More info: Dynamic Partition Pruning in Spark 3.0 - DZone Big Data and Learning Spark, 2nd Edition, Chapter 12
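To make this concrete, here is a hedged sketch (the table names sales and dates and the column names are hypothetical; sales is assumed to be partitioned by date_id): when a partitioned fact table is joined with a filtered, nonpartitioned dimension table, the scan of the partitioned table may show a dynamic pruning expression among its partition filters.

```python
# Sketch with hypothetical tables: `sales` is assumed to be partitioned by "date_id",
# `dates` is a small, nonpartitioned dimension table.
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")  # default since Spark 3.0

sales = spark.table("sales")
dates = spark.table("dates")

result = sales.join(dates, "date_id").where(dates.is_holiday)
result.explain()  # the scan of `sales` may show a dynamicpruningexpression(...) partition filter
```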
Question 5 Single Choice
Which of the following is one of the big performance advantages that Spark has over Hadoop?
Explanation
Spark achieves great performance by storing data in the DAG format, whereas Hadoop can only use parquet files.
Wrong, there is no "DAG format". DAG stands for "directed acyclic graph". The DAG is a means of representing computational steps in Spark. However, it is true that Hadoop does not use a DAG. The introduction of the DAG in Spark was a result of the limitations of Hadoop's MapReduce framework, in which data had to be written to and read from disk continuously. More info: Directed Acyclic Graph DAG in Apache Spark - DataFlair
Spark achieves great performance by storing data in the HDFS format, whereas Hadoop can only use parquet files.
No. Spark can certainly store data in HDFS (as well as other formats), but this is not a key performance advantage over Hadoop. Hadoop can use multiple file formats, not only parquet.
Spark achieves higher resiliency for queries since, different from Hadoop, it can be deployed on Kubernetes.
No, resiliency is not asked for in the question. The question is about performance improvements. Both Hadoop and Spark can be deployed on Kubernetes.
Spark achieves performance gains for developers by extending Hadoop's DataFrames with a user-friendly API.
No. DataFrames are a concept in Spark, but not in Hadoop.
More info: Hadoop vs. Spark: A Head-To-Head Comparison | Logz.io and Learning Spark, 2nd Edition, Chapter 1
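As a side note on the disk-versus-memory point above, a minimal sketch (assuming a running SparkSession named `spark`): Spark can keep an intermediate result cached in memory across several actions instead of writing it back to disk between steps, as Hadoop MapReduce does.

```python
# Sketch, assuming a running SparkSession named `spark`.
df = spark.range(10_000_000).selectExpr("id", "id % 7 AS bucket")

df.cache()                           # ask Spark to keep this intermediate result in memory
df.count()                           # the first action materializes the cache
df.groupBy("bucket").count().show()  # the second action reuses the in-memory data
```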
Question 6 Single Choice
Which of the following is the deepest level in Spark's execution hierarchy?
Explanation
The hierarchy is, from top to bottom: Job, Stage, Task.
Executors and slots facilitate the execution of tasks, but they are not directly part of the hierarchy. Executors are launched by the driver on worker nodes for the purpose of running a specific Spark application. Slots help Spark parallelize work. An executor can have multiple slots which enable it to process multiple tasks in parallel.
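As a quick way to see these boundaries (a sketch, assuming a running SparkSession named `spark`): an Exchange operator in the physical plan marks a shuffle and therefore a stage boundary within the job that an action will submit.

```python
# Sketch, assuming a running SparkSession named `spark`.
df = spark.range(1_000_000)
agg = df.groupBy((df.id % 100).alias("key")).count()

agg.explain()  # the Exchange operator in the plan marks the shuffle, i.e. a stage boundary
agg.count()    # this single action submits one job; the Spark UI lists its stages and tasks
```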
Question 7 Single Choice
Which of the following statements about garbage collection in Spark is incorrect?
Explanation
Manually persisting RDDs in Spark prevents them from being garbage collected.
This statement is incorrect, and thus the correct answer to the question. Spark's garbage collector will remove even persisted objects, albeit in an "LRU" fashion. LRU stands for least recently used. So, during a garbage collection run, the objects that were used the longest time ago will be garbage collected first.
See the linked StackOverflow post below for more information.
Serialized caching is a strategy to increase the performance of garbage collection.
This statement is correct. The more Java objects Spark needs to collect during garbage collection, the longer it takes. Storing a collection of many Java objects, such as a DataFrame with a complex schema, through serialization as a single byte array thus increases performance. This means that garbage collection takes less time on a serialized DataFrame than an unserialized DataFrame.
Optimizing garbage collection performance in Spark may limit caching ability.
This statement is correct. A full garbage collection run slows down a Spark application. When talking about "tuning" garbage collection, we mean reducing the number or duration of these slowdowns.
A full garbage collection run is triggered when the Old generation of the Java heap space is almost full. (If you are unfamiliar with this concept, check out the link to the Garbage Collection Tuning docs below.) Thus, one measure to avoid triggering a garbage collection run is to keep the Old generation share of the heap space from becoming almost full.
To achieve this, one may decrease its size. Objects larger than the Old generation space will then be discarded instead of being cached (stored) there, so they no longer help fill it up. This decreases the number of full garbage collection runs, increasing overall performance.
Inevitably, however, discarded objects will need to be recomputed when they are needed again. So, this mechanism only works well when a Spark application needs to reuse cached data as little as possible.
Garbage collection information can be accessed in the Spark UI's stage detail view.
This statement is correct. The task table in the Spark UI's stage detail view has a "GC Time" column, indicating the garbage collection time needed per task.
In Spark, using the G1 garbage collector is an alternative to using the default Parallel garbage collector.
This statement is correct. The G1 garbage collector, also known as garbage first garbage collector, is an alternative to the default Parallel garbage collector.
While the default Parallel garbage collector divides the heap into a few static regions, the G1 garbage collector divides the heap into many small regions that are created dynamically. The G1 garbage collector has certain advantages over the Parallel garbage collector which improve performance particularly for Spark workloads that require high throughput and low latency.
The G1 garbage collector is not enabled by default, and you need to explicitly pass an argument to Spark to enable it. For more information about the two garbage collectors, check out the Databricks article linked below.
More info:
- Would Spark unpersist the RDD itself when it realizes it won't be used anymore? - Stack Overflow
- Tuning Java Garbage Collection for Apache Spark Applications - The Databricks Blog
- Tuning - Spark 3.0.0 Documentation
- Dive into Spark memory - Blog | luminousmen
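As a sketch of how one might switch the executors to the G1 collector mentioned above (the application name is hypothetical; the JVM option has to be in place before the executor JVMs are launched, for example at SparkSession creation or via spark-submit):

```python
from pyspark.sql import SparkSession

# Sketch: switch the executor JVMs from the default Parallel GC to G1.
spark = (SparkSession.builder
         .appName("gc-tuning-sketch")                                # hypothetical application name
         .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC")  # enable the G1 garbage collector
         .getOrCreate())
```

The per-task garbage collection time can then be inspected in the "GC Time" column of the Spark UI's stage detail view, as mentioned above.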
Question 8 Single Choice
Which of the following describes characteristics of the Dataset API?
Explanation
The Dataset API is available in Scala, but it is not available in Python.
Correct. The Dataset API uses fixed typing and is typically used for object-oriented programming. It is available when Spark is used with the Scala programming language, but not for Python. In Python, you use the DataFrame API, which is based on the Dataset API.
The Dataset API does not provide compile-time type safety.
No – in fact, depending on the use case, the type safety that the Dataset API provides is an advantage.
The Dataset API does not support unstructured data.
Wrong, the Dataset API supports structured and unstructured data.
In Python, the Dataset API's schema is constructed via type hints.
No, this is not applicable since the Dataset API is not available in Python.
In Python, the Dataset API mainly resembles Pandas' DataFrame API.
Wrong. The Dataset API does not exist in Python; it is only available in Scala and Java.
More info: Learning Spark, 2nd Edition, Chapter 3, Datasets - Getting Started with Apache Spark on Databricks
Question 9 Single Choice
Which of the following describes the difference between client and cluster execution modes?
Explanation
In cluster mode, the driver runs on the master node, while in client mode, the driver runs on a virtual machine in the cloud.
This is wrong, since execution modes do not specify whether workloads run in the cloud or on-premises.
In cluster mode, each node will launch its own executor, while in client mode, executors will exclusively run on the client machine.
Wrong, since in both cases executors run on worker nodes.
In cluster mode, the driver runs on the edge node, while the client mode runs the driver in a worker node.
Wrong – in cluster mode, the driver runs on a worker node. In client mode, the driver runs on the client machine.
In client mode, the cluster manager runs on the same host as the driver, while in cluster mode, the cluster manager runs on a separate node.
No. In both modes, the cluster manager is typically on a separate node – not on the same host as the driver. It only runs on the same host as the driver in local execution mode.
More info: Learning Spark, 2nd Edition, Chapter 1, and Spark: The Definitive Guide, Chapter 15.
Question 10 Single Choice
Which of the following statements about executors is correct, assuming that one can consider each of the JVMs working as executors as a pool of task execution slots?
Explanation
Tasks run in parallel via slots.
Correct. Given the assumption, an executor then has one or more "slots", defined by the equation spark.executor.cores / spark.task.cpus. With the executor's resources divided into slots, each task takes up a slot and multiple tasks can be executed in parallel.
Slot is another name for executor.
No, a slot is part of an executor.
An executor runs on a single core.
No, an executor can occupy multiple cores. This is set by the spark.executor.cores option.
There has to be a greater number of slots than tasks.
No. Slots just process tasks. One could imagine a scenario where there was just a single slot for multiple tasks, processing one task at a time. Granted – this is the opposite of what Spark should be used for, which is distributed data processing over multiple cores and machines, performing many tasks in parallel.
There has to be a smaller number of executors than tasks.
No, there is no such requirement.
More info: Spark Architecture | Distributed Systems Architecture
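A small worked example of the slot arithmetic from the explanation above (the configuration values are hypothetical):

```python
# Hypothetical per-executor configuration values.
executor_cores = 4   # spark.executor.cores
task_cpus = 1        # spark.task.cpus (default: 1)

slots_per_executor = executor_cores // task_cpus
print(slots_per_executor)  # 4 -> this executor can run up to 4 tasks in parallel
```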



