Certified Data Engineer Professional Exam Questions

Total Questions: 316

Last Updated: Sep 2025

Question 1 Single Choice

An upstream system has been configured to pass the date for a given batch of data to the Databricks Jobs API as a parameter. The notebook to be scheduled will use this parameter to load data with the following code: df = spark.read.format("parquet").load(f"/mnt/source/{date}")


Which code block should be used to create the date Python variable used in the above code block?
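As background for this question, one common way to read a job parameter inside a notebook is the widgets utility. The sketch below assumes the parameter is named `date`. Because `dbutils` only exists inside a Databricks notebook, the `_StubWidgets` class and the sample value are hypothetical stand-ins so the snippet can run elsewhere; it is an illustration under those assumptions, not the exam's answer key.

```python
class _StubWidgets:
    """Hypothetical stand-in for dbutils.widgets outside Databricks."""

    def __init__(self):
        self._values = {}

    def text(self, name, default_value, label=None):
        # Register the widget; keep a value that was already supplied.
        self._values.setdefault(name, default_value)

    def get(self, name):
        return self._values[name]


try:
    widgets = dbutils.widgets  # only defined inside a Databricks notebook
except NameError:
    widgets = _StubWidgets()
    widgets._values["date"] = "2025-09-01"  # made-up value a job might pass

# Declare the widget with a default, then read the value supplied by the job.
widgets.text("date", "null")
date = widgets.get("date")

# The path the notebook would load from.
source_path = f"/mnt/source/{date}"
```

When the notebook runs as a job, a parameter with the same name as the widget overrides the default value.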

Question 2 Single Choice

The Databricks workspace administrator has configured interactive clusters for each of the data engineering groups. To control costs, clusters are set to terminate after 30 minutes of inactivity. Each user should be able to execute workloads against their assigned clusters at any time of the day.

Assuming users have been added to a workspace but not granted any permissions, which of the following describes the minimal permissions a user would need to start and attach to an already configured cluster?
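For orientation, cluster access is granted through three permission levels, each of which includes the abilities of the one below it. The sketch below is a simplified model of that hierarchy; the ability names (`attach`, `start`, and so on) and the helper `minimal_level` are hypothetical labels chosen for illustration, not API identifiers.

```python
# Simplified model of Databricks cluster permission levels.
# Ability names are illustrative labels, not real API fields.
CLUSTER_ABILITIES = {
    "CAN_ATTACH_TO": {"attach"},
    "CAN_RESTART": {"attach", "start", "restart", "terminate"},
    "CAN_MANAGE": {
        "attach", "start", "restart", "terminate",
        "edit", "manage_permissions",
    },
}

# Ordered from least to most privileged.
LEVELS_LOW_TO_HIGH = ["CAN_ATTACH_TO", "CAN_RESTART", "CAN_MANAGE"]


def minimal_level(required):
    """Return the least-privileged level that covers every required ability."""
    for level in LEVELS_LOW_TO_HIGH:
        if set(required) <= CLUSTER_ABILITIES[level]:
            return level
    return None


needed = minimal_level({"start", "attach"})
```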

Question 3 Single Choice

When scheduling Structured Streaming jobs for production, which configuration automatically recovers from query failures and keeps costs low?
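To make the relevant settings concrete, the sketch below shows a Jobs API style payload as a Python dictionary. The fields `max_concurrent_runs`, `max_retries`, and `new_cluster` are real Jobs API settings; the job name, notebook path, and the placeholder cluster values are hypothetical. It shows one plausible configuration for a streaming job, not a definitive recommendation.

```python
# Sketch of a job definition for a Structured Streaming workload.
# Names, paths, and placeholder values are hypothetical.
job_settings = {
    "name": "streaming-ingest",
    # Only one run of the stream should be active at a time.
    "max_concurrent_runs": 1,
    "tasks": [
        {
            "task_key": "run_stream",
            "notebook_task": {"notebook_path": "/Repos/example/stream"},
            # A job cluster is created for the run and released afterwards,
            # which is typically cheaper than an all-purpose cluster.
            "new_cluster": {
                "spark_version": "<runtime-version>",
                "node_type_id": "<node-type>",
                "num_workers": 2,
            },
            # -1 asks the scheduler to retry a failed run indefinitely.
            "max_retries": -1,
        }
    ],
}
```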

Question 4 Single Choice

The data engineering team has configured a Databricks SQL query and alert to monitor the values in a Delta Lake table. The recent_sensor_recordings table contains an identifying sensor_id alongside the timestamp and temperature for the most recent 5 minutes of recordings.

The below query is used to create the alert:

The query is set to refresh each minute and always completes in less than 10 seconds. The alert is set to trigger when mean(temperature) > 120. Notifications are set to be sent at most once per minute.


If this alert raises notifications for 3 consecutive minutes and then stops, which statement must be true?

Question 5 Single Choice

A junior developer complains that the code in their notebook isn't producing the correct results in the development environment. A shared screenshot reveals that while they're using a notebook versioned with Databricks Repos, they're using a personal branch that contains old logic. The desired branch named dev-2.3.9 is not available from the branch selection dropdown.
Which approach will allow this developer to review the current logic for this notebook?

Question 6 Single Choice

The security team is exploring whether the Databricks secrets module can be used to securely connect to an external database.

After initially testing the code with all credentials as plain strings, they upload the database password to the Databricks secrets scope and configure the proper permissions for the currently active user.

They then modify the code as follows (leaving all other variables unchanged):

What will happen when this code is executed?

Question 7 Single Choice

The data science team has created and logged a production model using MLflow. The following code correctly imports and applies the production model to output the predictions as a new DataFrame named preds with the schema:

The data science team would like the predictions saved to a Delta Lake table, with the ability to compare all predictions across time. Churn predictions will be made at most once per day.


Which code block accomplishes this task while minimizing potential compute costs?

Question 8 Single Choice

An upstream source writes Parquet data as hourly batches to directories named with the current date. A nightly batch job runs the following code to ingest all data from the previous day, as indicated by the date variable:

Assume that the fields customer_id and order_id serve as a composite key to uniquely identify each order.

If the upstream system is known to occasionally produce duplicate entries for a single order hours apart, which statement is correct?

Question 9 Single Choice

A junior member of the data engineering team is exploring the language interoperability of Databricks notebooks. The intended outcome of the code below is to register a view of all sales that occurred in countries on the continent of Africa, as listed in the geo_lookup table.

Before executing the code, running SHOW TABLES on the current database confirms that the database contains only two tables: geo_lookup and sales.

❖ Cmd 1 — Python Cell:

❖ Cmd 2 — SQL Cell:

Question 10 Single Choice

A Delta table of weather records is partitioned by date and has the below schema: date DATE, device_id INT, temp FLOAT, latitude FLOAT, longitude FLOAT

To find all the records from within the Arctic Circle, you execute a query with the below filter: latitude > 66.3

Which statement describes how the Delta engine identifies which files to load?
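As background, Delta Lake records per-file minimum and maximum column values in the transaction log, and a filter on a non-partition column such as latitude can use those statistics to skip files. The sketch below is a simplified pure-Python model of that idea; the `FileStats` class, the file paths, and the statistics are made up for illustration and do not reflect the real log format.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Made-up stand-in for the per-file statistics kept in the Delta log."""
    path: str
    min_latitude: float
    max_latitude: float


def files_to_scan(files, threshold):
    """Keep files that could contain a row with latitude > threshold."""
    # A file can be skipped when even its largest latitude fails the filter.
    return [f.path for f in files if f.max_latitude > threshold]


files = [
    FileStats("date=2025-01-01/part-0.parquet", -10.0, 45.2),
    FileStats("date=2025-01-01/part-1.parquet", 50.1, 71.9),
    FileStats("date=2025-01-02/part-0.parquet", 66.3, 66.3),
    FileStats("date=2025-01-02/part-1.parquet", 68.0, 80.5),
]

# The date partitions do not help here, since the filter is on latitude.
selected = files_to_scan(files, 66.3)
```

In this toy data, two of the four files are read, one from each date partition, which shows that pruning happens per file rather than per partition.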
