
Certified Data Engineer Professional Exam Questions

316 Total Questions

Last Updated: SEP 2025

1st Try Guaranteed

Expert Verified

Question 11 Single Choice

The data engineering team has configured a job to process customers' right-to-be-forgotten requests (that is, to delete their data). All user data that needs to be deleted is stored in Delta Lake tables using default table settings.

The team has decided to process all deletions from the previous week as a batch job at 1am each Sunday. The total duration of this job is less than one hour. Every Monday at 3am, a batch job executes a series of VACUUM commands on all Delta Lake tables throughout the organization.

The compliance officer has recently learned about Delta Lake's time travel functionality. They are concerned that this might allow continued access to deleted data.

Assuming all delete logic is correctly implemented, which statement correctly addresses this concern?
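
For background on the mechanics this question tests, here is a minimal sketch of how such delete and VACUUM jobs might be expressed (table names, column names, and the literal values are placeholders, not taken from the question):

# `spark` is the active SparkSession (predefined in Databricks notebooks).

# Sunday 1 AM job: apply the delete requests (user IDs are placeholders).
spark.sql("DELETE FROM user_data WHERE user_id IN (1001, 1002)")

# Monday 3 AM job: VACUUM removes data files that are no longer referenced by
# the current table version AND are older than the retention threshold,
# which defaults to 7 days (168 hours).
spark.sql("VACUUM user_data")

# Until the invalidated files age past the retention threshold and a VACUUM
# runs, a time travel query like this one can still return the deleted rows.
spark.sql("SELECT * FROM user_data TIMESTAMP AS OF '2025-09-01'")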

Question 12 Single Choice

A junior data engineer has configured a workload that posts the following JSON to the Databricks REST API endpoint 2.0/jobs/create:

Assuming that all configurations and referenced resources are available, which statement correctly describes the result of executing this API request three times?
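
As a minimal sketch of how a request to this endpoint is typically issued (the workspace URL, token, and job settings below are placeholder assumptions, not the question's JSON):

# Hypothetical call to the Jobs 2.0 create endpoint.
import requests

host = "https://<databricks-instance>"      # placeholder workspace URL
token = "<personal-access-token>"           # placeholder credential

job_spec = {                                # placeholder settings
    "name": "example_job",
    "new_cluster": {"spark_version": "13.3.x-scala2.12", "num_workers": 2},
    "notebook_task": {"notebook_path": "/Repos/example/notebook"},
}

# Each successful POST to 2.0/jobs/create registers a separate job and
# returns a new job_id, even if the submitted settings are identical.
resp = requests.post(f"{host}/api/2.0/jobs/create",
                     headers={"Authorization": f"Bearer {token}"},
                     json=job_spec)
print(resp.json())   # e.g. {"job_id": 123}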

Question 13 Single Choice

An upstream system is emitting change data capture (CDC) logs that are being written to a cloud object storage directory. Each record in the log indicates the change type (insert, update, or delete) and the values for each field after the change. The source table has a primary key identified by the field pk_id.

For auditing purposes, the data governance team wishes to maintain a full record of all values that have ever been valid in the source system. For analytical purposes, only the most recent value for each record needs to be recorded. The Databricks job to ingest these records occurs once per hour, but each individual record may have changed multiple times over the course of an hour.

Which solution meets these requirements?
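
One common pattern for requirements like these is to append every CDC record for auditing and then merge only the newest change per key into a current-state table. A minimal sketch, in which the table names and the change_time and operation fields are assumptions rather than details from the question:

# `spark` is the active SparkSession (predefined in Databricks notebooks).
from delta.tables import DeltaTable
from pyspark.sql import functions as F, Window

# 1) Keep the full audit trail: append the hourly CDC batch as-is.
cdc_batch = spark.read.format("json").load("/mnt/raw/cdc/")   # placeholder path
cdc_batch.write.format("delta").mode("append").saveAsTable("bronze_cdc")

# 2) Reduce the batch to the most recent change per primary key.
latest = (cdc_batch
          .withColumn("rn", F.row_number().over(
              Window.partitionBy("pk_id").orderBy(F.col("change_time").desc())))
          .filter("rn = 1")
          .drop("rn"))

# 3) Merge into the current-state table, honoring deletes.
(DeltaTable.forName(spark, "silver_current").alias("t")
 .merge(latest.alias("s"), "t.pk_id = s.pk_id")
 .whenMatchedDelete(condition="s.operation = 'delete'")
 .whenMatchedUpdateAll(condition="s.operation != 'delete'")
 .whenNotMatchedInsertAll(condition="s.operation != 'delete'")
 .execute())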

Question 14 Single Choice

An hourly batch job is configured to ingest data files from a cloud object storage container, where each batch represents all records produced by the source system in a given hour. The batch job to process these records into the Lakehouse is sufficiently delayed to ensure no late-arriving data is missed.

The user_id field represents a unique key for the data, which has the following schema:

  • All new records are ingested into a table named account_history, which maintains a full record of all data in the same schema as the source.

  • The next table in the system is named account_current and is implemented as a Type 1 table, representing the most recent value for each unique user_id.

Assuming there are millions of user accounts and tens of thousands of records processed hourly, which implementation can be used to efficiently update the described account_current table as part of each hourly batch job?
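
A minimal sketch of a Type 1 upsert of the latest hourly records into account_current (the last_updated column and the one-hour filter are assumptions, since the schema is not shown):

# `spark` is the active SparkSession (predefined in Databricks notebooks).
from delta.tables import DeltaTable
from pyspark.sql import functions as F, Window

# Keep only the newest version of each user_id from the latest hourly batch.
batch_latest = (spark.table("account_history")
                .filter(F.col("last_updated") >= F.expr("current_timestamp() - INTERVAL 1 HOUR"))
                .withColumn("rn", F.row_number().over(
                    Window.partitionBy("user_id").orderBy(F.col("last_updated").desc())))
                .filter("rn = 1")
                .drop("rn"))

# Type 1: overwrite the matched row in place, insert previously unseen user_ids.
(DeltaTable.forName(spark, "account_current").alias("t")
 .merge(batch_latest.alias("s"), "t.user_id = s.user_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())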

Question 15 Single Choice

A table in the Lakehouse named customer_churn_params is used in churn prediction by the machine learning team. The table contains information about customers derived from a number of upstream sources. Currently, the data engineering team populates this table nightly by overwriting the table with the current valid values derived from upstream data sources.

The churn prediction model used by the ML team is fairly stable in production. The team is only interested in making predictions on records that have changed in the past 24 hours.

Which approach would simplify the identification of these changed records?
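
One mechanism relevant to this scenario is Delta Lake's Change Data Feed, which records row-level changes between table versions. A minimal sketch of enabling and reading it (the customer_id column and the timestamp literal are illustrative assumptions):

# `spark` is the active SparkSession (predefined in Databricks notebooks).

# Enable the Change Data Feed on the table (a one-time property change).
spark.sql("""
    ALTER TABLE customer_churn_params
    SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

# Read only the rows that changed since a given point in time.
changes = (spark.read.format("delta")
           .option("readChangeFeed", "true")
           .option("startingTimestamp", "2025-09-01 00:00:00")
           .table("customer_churn_params"))

# _change_type and _commit_timestamp are metadata columns added by the feed.
changes.select("customer_id", "_change_type", "_commit_timestamp").show()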

Question 16 Single Choice

A table is registered with the following SQL code:

Both users and orders are Delta Lake tables.

Which statement best describes the results of querying recent_orders?
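
As a purely hypothetical illustration of registering a view over Delta tables (this is not the question's SQL; all column names are assumptions), the general pattern looks like the following. A view stores only the query text, so each query against it is re-executed against the current versions of the underlying tables:

# `spark` is the active SparkSession (predefined in Databricks notebooks).
spark.sql("""
    CREATE VIEW IF NOT EXISTS recent_orders AS
    SELECT o.order_id, o.order_ts, u.user_name
    FROM orders AS o
    JOIN users AS u ON o.user_id = u.user_id
    WHERE o.order_ts >= date_sub(current_date(), 7)
""")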

Question 17 Single Choice

A production workload incrementally applies updates from an external Change Data Capture feed to a Delta Lake table as an always-on Structured Streaming job. When data was initially migrated for this table, OPTIMIZE was executed and most data files were resized to 1 GB. Auto Optimize and Auto Compaction were both turned on for the streaming production job. A recent review of data files shows that most data files are under 64 MB, although each partition in the table contains at least 1 GB of data and the total table size is over 10 TB.


Which of the following likely explains these smaller file sizes?
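
For context, a minimal sketch of the table properties that enable these features on a Delta table (the table name is a placeholder):

# `spark` is the active SparkSession (predefined in Databricks notebooks).
spark.sql("""
    ALTER TABLE cdc_target
    SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact'   = 'true'
    )
""")

# By contrast, a manual OPTIMIZE compacts files toward a larger target size.
spark.sql("OPTIMIZE cdc_target")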

Question 18 Single Choice

Which statement regarding stream-static joins and static Delta tables is correct?
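
A minimal sketch of a stream-static join, assuming placeholder table names and a customer_id join key:

# `spark` is the active SparkSession (predefined in Databricks notebooks).
streaming_orders = spark.readStream.table("orders_stream_source")

static_customers = spark.table("customers")     # static Delta table

# Each micro-batch joins the newly arrived streaming rows against the static
# table, so the join reflects the static Delta table as it exists when the
# micro-batch is processed.
joined = streaming_orders.join(static_customers, on="customer_id", how="inner")

query = (joined.writeStream
         .format("delta")
         .option("checkpointLocation", "/mnt/checkpoints/orders_enriched")
         .toTable("orders_enriched"))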

Question 19 Single Choice

A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Events are recorded once per minute per device.

The schema for the streaming DataFrame df is as follows:

Code block:

Choose the response that correctly fills in the blank within the code block to complete this task.
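
A minimal sketch of such a non-overlapping (tumbling) five-minute window aggregation, assuming column names time, device_id, temperature, and humidity rather than the question's actual schema:

from pyspark.sql import functions as F

agg = (df
       .withWatermark("time", "10 minutes")                 # optional lateness bound
       .groupBy(F.window("time", "5 minutes"), "device_id")
       .agg(F.avg("temperature").alias("avg_temperature"),
            F.avg("humidity").alias("avg_humidity")))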

Question 20 Single Choice

A data engineer has decided to nest a checkpoint directory to be shared by two separate streams. The proposed directory structure is displayed below:

Which statement describes whether this checkpoint directory structure is valid for the given scenario and why?
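
A minimal sketch of the conventional layout, in which each streaming query writes to its own checkpoint location (paths and table names are placeholders):

# `spark` is the active SparkSession (predefined in Databricks notebooks).
# Structured Streaming checkpoints cannot be shared between separate queries,
# so each writeStream gets a distinct checkpointLocation.

stream_a = (spark.readStream.table("source_a")
            .writeStream
            .option("checkpointLocation", "/mnt/checkpoints/stream_a")
            .toTable("sink_a"))

stream_b = (spark.readStream.table("source_b")
            .writeStream
            .option("checkpointLocation", "/mnt/checkpoints/stream_b")
            .toTable("sink_b"))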

Page: 2 / 32