
Professional Data Engineer - Google Cloud Certified Exam Questions
Question 1 Single Choice
Your company utilizes WILDCARD tables to query data across multiple tables with similar names. However, the SQL statement is currently encountering an error, displayed as:
# Syntax error: Expected end of statement but got "-" at [4:11]
SELECT age
FROM bigquery-data.noaa_gsod.gsod
WHERE
age != 199
AND_TABLE_SUFFIX = '2929'
ORDER BY age DESC
Which table name will enable the SQL statement to function correctly?
Explanation

Click "Show Answer" to see the explanation here
D. `bigquery-data.noaa_gsod.gsod*`
In BigQuery wildcard tables, the asterisk (*) is used as a wildcard character to represent multiple tables with similar names. In the provided SQL statement, the syntax error ("Expected end of statement but got "-"") occurs because the table reference contains a hyphen and a wildcard but is not enclosed in backticks (`); the wildcard character must sit inside those backticks as part of the table name pattern.
Here's the corrected SQL statement (reconstructed from the question's query, with the table reference enclosed in backticks and a space restored before the _TABLE_SUFFIX pseudo-column):
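SELECT age
FROM `bigquery-data.noaa_gsod.gsod*`
WHERE
age != 199
AND _TABLE_SUFFIX = '2929'
ORDER BY age DESC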
In this corrected statement, the backticks (`) are used to enclose the table name pattern 'bigquery-data.noaa_gsod.gsod*', which includes the wildcard character (*) inside the backticks. This syntax enables the SQL statement to query all tables matching the pattern 'bigquery-data.noaa_gsod.gsod*'.
References:
https://cloud.google.com/bigquery/docs/querying-wildcard-tables
https://cloud.google.com/bigquery/docs/wildcard-table-reference
Question 2 Single Choice
You are working at RetailNova Corp., managing a BigQuery table that holds millions of rows of sales transactions, partitioned by date. This table is queried frequently—multiple times per minute—by various applications and users.
The queries compute aggregations such as AVG, MAX, and SUM, and they only need data from the past year, although the full historical data must be retained in the base table. The goal is to always return up-to-date results while also minimizing query costs, reducing maintenance overhead, and improving performance.
What is the best approach?
Explanation

Click "Show Answer" to see the explanation here
The correct answer is:
A. Create a materialized view that aggregates the base table with a filter for the last year of partitions.
Correct Option Explanation
A. Create a materialized view that aggregates the base table with a filter for the last year of partitions
Why it's correct:
A materialized view in BigQuery precomputes and stores the results of a query (such as AVG, SUM, and MAX aggregations) and is kept up to date automatically as the base table changes. When you filter the materialized view to only include data from the last year, BigQuery will only maintain and compute aggregates over this smaller, relevant dataset. (A SQL sketch follows this list.)
This:
Reduces query costs (since users query the precomputed results)
Improves performance (faster reads)
Requires minimal maintenance (auto-refresh managed by BigQuery)
Returns up-to-date results: queries combine the precomputed view with recent changes in the base table, and automatic refresh runs in the background (no more frequently than every 30 minutes by default).
Supports use case:
Frequent, repeated aggregate queries.
Queries only require recent data (past 1 year), but historical data must still be stored.
Low-latency results and cost-efficiency are needed.
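A minimal sketch of such a materialized view, assuming an illustrative base table retail.sales_transactions partitioned on transaction_date (all names and the date boundary are placeholders, not from the question):
CREATE MATERIALIZED VIEW retail.sales_last_year_agg AS
SELECT
  transaction_date,
  AVG(amount) AS avg_amount,
  MAX(amount) AS max_amount,
  SUM(amount) AS sum_amount
FROM retail.sales_transactions
WHERE transaction_date >= DATE '2024-01-01'  -- fixed last-year boundary; keeps the view definition deterministic
GROUP BY transaction_date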
Incorrect Option Justifications
B. Create a materialized view to aggregate the base table, and set a partition expiration on the base table to keep only the last year of data
Why it’s incorrect:
The requirement clearly states: “the full historical data must be retained”.
Setting a partition expiration to delete data older than one year violates this requirement and results in data loss.
C. Create a standard view that aggregates the base table and filters for the last year of partitions
Why it’s incorrect:
A standard view does not cache or store the result—it runs the underlying query each time it's accessed.
This leads to higher query costs and slower performance, especially with frequent queries.
Does not leverage materialized view optimizations such as incremental refresh.
D. Create a new table that stores the aggregated results from the last year of data, and run a scheduled query every hour to recreate it
Why it’s incorrect:
This adds maintenance overhead (scheduling, managing updates).
The data is only updated hourly, which may not meet real-time freshness requirements.
Materialized views automatically refresh incrementally with better performance and reduced complexity.
Conclusion:
A materialized view with a 1-year filter gives RetailNova the best balance of performance, cost-efficiency, and data freshness, while ensuring full data retention in the base table.
Question 3 Single Choice
At Invex Systems, a proprietary platform sends inventory data every 6 hours to a cloud-based ingestion service. Each transmission includes a payload with multiple fields and a timestamp. If a transmission issue is suspected, the system may re-send the same data.
As a data engineer, how can you efficiently deduplicate the incoming data?
Explanation

Click "Show Answer" to see the explanation here
Correct Answer: D. Maintain a table that stores the hash value along with metadata for each data record.
✅ Justification for Correct Option:
D. Maintain a table that stores the hash value along with metadata for each data record.
This is the most efficient and scalable approach for deduplication in streaming or batch pipelines when duplicate data is occasionally re-sent.
Hashing the entire content of a record ensures that identical records produce the same hash.
Storing the hash in a lookup table with metadata (e.g., timestamp, source ID) allows:
Fast lookup for duplicate detection.
Efficient storage (as hashes are fixed-size).
Flexibility to add additional deduplication context (e.g., source system, window).
This is a commonly used method in data pipelines such as those built with Cloud Dataflow, Apache Beam, or BigQuery for idempotent processing.
Official guidance from Google Cloud (BigQuery deduplication):
"To remove duplicate rows, you can use a hash function like FARM_FINGERPRINT() on the entire row and track previously seen hashes."
❌ Justifications for Incorrect Options:
A. Generate and assign a globally unique identifier (GUID) for each data record.
This defeats the purpose: a new GUID will be generated for each incoming record—even if it's a duplicate.
So true duplicates will appear different, making deduplication ineffective.
B. Calculate a hash for each data record and compare it against all previously stored data.
While similar in intent to option D, this is impractical at scale without an optimized structure.
"Comparing against all previously stored data" implies scanning entire datasets, which is inefficient and costly in cloud-scale systems.
C. Use a dedicated database where each data record is stored as a primary key and indexed.
Unless you already have a natural primary key, deduplication using arbitrary fields may not be effective.
Also, this approach can quickly become expensive and unscalable if the ingestion volume is high and the data lacks strong primary key semantics.
✅ Final Answer:
D. Maintain a table that stores the hash value along with metadata for each data record.
Question 4 Single Choice
At CloudWare Systems, you're using a production-grade Memorystore for Redis (Standard Tier) instance. As part of your disaster recovery (DR) planning, you need to test failover behavior realistically on this instance. The goal is to ensure no data loss during this failover test.
What is the best approach?
Explanation

Click "Show Answer" to see the explanation here
Answer: B. Create a Standard Tier Redis instance in a development environment, and initiate a manual failover using the force-data-loss data protection mode.
Why B is correct
Worst-case DR simulation
The force-data-loss mode skips the 30 MB replication-lag check and will promote the replica immediately, even if it's behind. That behavior closely mimics a catastrophic primary failure in a real site-outage scenario.
Zero impact on production
By running this on a dev-sized clone of your Standard Tier instance, you practice the exact gcloud redis instances failover … --data-protection-mode=force-data-loss command without risking live data or service uptime.
Why the other options are not suitable
A. Dev + limited-data-loss
Limited-data-loss will abort if the replica is more than 30 MB behind, so you aren’t guaranteed to see a successful failover under high lag. That doesn’t simulate a full-blown disaster scenario.
C. Increase replicas in Prod + force-data-loss
Running force-data-loss on production risks real data loss, and adding more replicas doesn’t mitigate that: any replica could be behind and you’d sever whichever one you promote.
D. Prod + limited-data-loss
Although safe, this only tests a best-case failover (it aborts under lag) and still touches production, risking service disruption if you mis-schedule or mis-configure the maintenance window.
References
Manual failover modes and their behaviors (Memorystore for Redis documentation)
Question 5 Single Choice
You're preparing data for your machine learning team to train a model using BigQueryML. The objective is to predict the price per square foot of real estate. The training data includes columns for price and square footage. However, the 'feature1' column contains null values due to missing data. To retain more data points, you aim to replace the nulls with zeros. Which query should you use?
Explanation

Click "Show Answer" to see the explanation here
A. SELECT * EXCEPT (feature1), IFNULL(feature1, 0) AS feature1_cleaned FROM training_data;
Replaces nulls: IFNULL(feature1, 0) directly replaces null values in 'feature1' with 0.
Preserves other data: SELECT * EXCEPT (feature1) selects all columns other than 'feature1', ensuring no existing data is lost.
Creates new column: AS feature1_cleaned creates a new column with the cleaned data, allowing you to compare the original and modified feature if needed.
Why the other options are less suitable:
B: This query calculates 'price_per_sqft' and excludes rows where 'feature1' is null. This loses potentially valuable data.
C: This is similar to B, additionally removing the 'feature1' column entirely, which might be important for later analysis.
D: This only selects rows where 'feature1' is not null, removing data points that could be useful after replacing the nulls.
Important Considerations
Why Impute? Missing data is common. Replacing nulls with zeros allows you to retain more data points, potentially improving model training. However, this assumes 'zero' is a meaningful value in your 'feature1' context.
Other Imputation Strategies: Consider alternatives like:
Mean/Median: Replacing with the average or middle value of the feature.
Predictive model: A more complex approach using other features to predict the missing values of 'feature1'.
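For example, the mean-imputation alternative mentioned above could be sketched like this (training_data and feature1 follow the question; the approach itself is illustrative, not part of the answer):
SELECT
  * EXCEPT (feature1),
  IFNULL(feature1, (SELECT AVG(feature1) FROM training_data)) AS feature1_cleaned
FROM training_data;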
Question 6 Single Choice
You're managing a Dataplex environment for DataSpring Inc., which includes both raw and curated zones. The data engineering team is uploading CSV and JSON files to a Cloud Storage bucket asset within the curated zone. However, these files are not being automatically discovered by Dataplex.
What should you do to ensure Dataplex can automatically discover these files?
Explanation

Click "Show Answer" to see the explanation here
The correct answer is:
B. Enable auto-discovery for the curated zone’s bucket asset.
Correct Option Explanation
B. Enable auto-discovery for the curated zone’s bucket asset
Why it’s correct:
In Dataplex, auto-discovery must be explicitly enabled per asset (like a Cloud Storage bucket or BigQuery dataset) to allow Dataplex to scan and catalog the data (e.g., CSV, JSON files).
Without enabling this setting, Dataplex will not automatically discover schema, metadata, or partitioning information—even if the files are present in the asset.
Once enabled, Dataplex will regularly crawl the bucket and automatically register discovered data assets and schemas in the Data Catalog, making them queryable and discoverable.
How to enable auto-discovery: You can configure this in the Dataplex UI, API, or gcloud:
gcloud dataplex assets update [ASSET_ID] \
  --project=[PROJECT_ID] \
  --lake=[LAKE_NAME] \
  --zone=[ZONE_NAME] \
  --resource-spec=type=STORAGE_BUCKET,name=[BUCKET_PATH] \
  --discovery-enabled
Incorrect Option Justifications
A. Move the JSON and CSV files into the raw zone instead
Why it’s incorrect:
Changing the zone doesn’t solve the issue—discovery depends on whether auto-discovery is enabled, not the zone type.
The curated zone can support auto-discovery just like the raw zone, as long as it's configured.
C. Use the bq command-line tool to load the JSON and CSV files into BigQuery tables
Why it’s incorrect:
Using the bq CLI imports the data into BigQuery, not Dataplex. This bypasses Dataplex’s discovery process and doesn't solve the problem of automatic discovery in the curated zone’s Cloud Storage asset.
D. Grant object-level access to the CSV and JSON files in Cloud Storage
Why it’s incorrect:
Object-level permissions are not required for Dataplex to perform discovery if the service account has read access at the bucket or folder level.
Discovery failure is typically due to discovery not being enabled, not missing access—especially if the asset was created by an authorized user.
Question 7 Multiple Choice
SecureTrust Analytics, your company operating in a tightly regulated industry, must enforce strict access controls to ensure that users only access the minimum data necessary for their responsibilities. You’re using Google BigQuery and need to apply this principle of least privilege effectively.
Which three of the following strategies would help enforce this requirement?
Explanation

Click "Show Answer" to see the explanation here
Correct Answers: B, D, and E
These are the most appropriate strategies to enforce the principle of least privilege in Google BigQuery.
Explanation for Correct Options
B. Control table-level access using IAM roles. ✅
IAM (Identity and Access Management) is the foundation for enforcing granular access control in BigQuery.
You can assign roles at the table or dataset level (e.g., roles/bigquery.dataViewer) so users only have access to the data necessary for their tasks. Custom roles can be created to fine-tune access permissions even further.
Official Documentation:
BigQuery access control with IAM
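As an illustration, table-level access can also be granted with BigQuery's SQL DCL statements; the project, dataset, table, and user below are hypothetical:
GRANT `roles/bigquery.dataViewer`
ON TABLE `my_project.sales.transactions`
TO "user:analyst@example.com";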
D. Limit BigQuery API access to only authorized users. ✅
Restricting access to the BigQuery API ensures that only trusted users or service accounts can execute queries or export data programmatically.
This limits the risk of unauthorized access or data exfiltration.
You can enforce this using IAM permissions and VPC Service Controls.
E. Organize data into separate tables or datasets to isolate access. ✅
Data partitioning by business unit, sensitivity, or purpose enables tighter access control.
IAM policies can then be applied independently to each dataset, ensuring users only access the datasets relevant to their role.
This is a core strategy for implementing least privilege in data governance.
Docs:
Best practices for data governance in BigQuery
Explanation for Incorrect Options
A. Disable write access on specific BigQuery tables.
While limiting write access helps maintain data integrity, it does not enforce least privilege for data access (read permissions).
Least privilege also involves controlling who can read or query the data.
C. Ensure all data remains encrypted at rest and in transit.
Encryption is important for data security, but it is a default behavior in BigQuery and does not control access to data.
It protects against unauthorized access at the infrastructure level, not user-level access.
Reference:
BigQuery encryption documentation
F. Use Cloud Audit Logs to track and identify potential access violations.
Cloud Audit Logs are critical for monitoring and compliance, but they provide visibility, not access control.
They support detecting violations, not preventing them.
Reference:
BigQuery audit logging
Final Answer: B, D, and E
These options directly support the principle of least privilege in Google BigQuery by enforcing fine-grained, role-based access control and data isolation.
Question 8 Single Choice
Your organization, DataLink Global, follows a multi-cloud strategy by storing data in both Google Cloud Storage and Amazon S3 buckets, with all data residing in US-based regions. You need to allow your teams to query the most up-to-date data from either cloud using BigQuery, but without granting direct access to the Cloud Storage or S3 buckets themselves.
What should you do?
Explanation

Click "Show Answer" to see the explanation here
Correct answer — A
Set up a BigQuery Omni connection to the Amazon S3 buckets, create BigLake tables over both Cloud Storage and S3 data, and let your teams query those tables from BigQuery.
Why this meets every requirement
Single query surface for both clouds
BigQuery Omni lets you run familiar BigQuery SQL against data that physically sits in Amazon S3 while keeping the data in place.
BigLake tables decouple bucket access
BigLake adds an access-delegation layer: users receive permissions on the BigLake table, not on the underlying Cloud Storage or S3 buckets, satisfying the “no direct bucket access” constraint.
Always up to date and no data movement
Because the tables point straight at the objects in Cloud Storage and S3, queries always reflect the latest files; there is no transfer lag or duplication.
Uniform governance and security
BigLake tables support column-level security, row-level security, and fine-grained IAM roles exactly as regular BigQuery tables do, so you manage policy once across both clouds.
Why the other options are less suitable
B – External tables instead of BigLake
BigQuery external tables work, but they don’t provide access delegation; each user (or a shared service account) still needs Storage IAM on the buckets. BigLake was created to remove that operational and security burden.
C & D – Copy data with Storage Transfer Service
Copying S3 data into Cloud Storage introduces extra cost, operational complexity, and delay; the copied data can be stale. These options also break the goal of query-in-place across clouds.
Recommendation
Create one BigLake table for each dataset—using a Cloud Storage URI in Google Cloud and an S3 URI (through a BigQuery Omni connection) in AWS. Grant your analysts BigQuery roles on those tables only, and they can run unified SQL queries without ever touching the buckets.
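A minimal DDL sketch for the Cloud Storage side (project, dataset, connection, bucket, and format are hypothetical; the S3 side follows the same pattern through a BigQuery Omni connection and an s3:// URI):
CREATE EXTERNAL TABLE `my_project.analytics.sales_gcs`
WITH CONNECTION `my_project.us.gcs_biglake_conn`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://datalink-global-sales/*.parquet']  -- bucket path is a placeholder
);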
Question 9 Single Choice
You are a data engineer at AutoNova Motors, and you've built a data pipeline using Google Cloud Pub/Sub to capture sensor anomalies from connected vehicles. A push subscription is configured to send these events to a custom HTTPS endpoint that you’ve developed to take immediate action when anomalies are detected.
However, you notice that your HTTPS endpoint is receiving an unusually high number of duplicate messages.
What is the most likely reason for this issue?
Explanation

Click "Show Answer" to see the explanation here
Correct Answer: D. Your custom HTTPS endpoint is not sending acknowledgments within the required acknowledgment deadline.
✅ Justification for Correct Option:
D. Your custom HTTPS endpoint is not sending acknowledgments within the required acknowledgment deadline.
This is the most accurate explanation for why duplicate messages are being received.
Cloud Pub/Sub guarantees at-least-once delivery, which means a message may be redelivered if:
The subscriber fails to acknowledge the message.
The subscriber does not respond within the acknowledgment deadline (default: 10 seconds).
An error occurs during delivery.
In push subscriptions, Pub/Sub waits for a successful HTTP 200 OK response from your endpoint within the acknowledgment deadline.
If it doesn't receive it in time, it assumes the message wasn't processed, and resends the message, causing duplicates.
Official documentation:
“Push endpoints must return an HTTP success status code (200–299) within the acknowledgment deadline to acknowledge the message. If they do not, the message is redelivered.”
Source: https://cloud.google.com/pubsub/docs/push#receiving_messages
❌ Justifications for Incorrect Options:
A. The payload size of the sensor event messages is too large.
Pub/Sub supports payloads up to 10 MB.
A large payload might cause processing delays, but it is not the root cause of duplication.
Duplication is due to acknowledgment behavior, not payload size directly.
Reference:
https://cloud.google.com/pubsub/quotas#resource_limits
B. The SSL certificate on your custom HTTPS endpoint is outdated.
An outdated or invalid certificate would result in delivery failures, not successful duplication.
If the SSL handshake fails, Pub/Sub will log errors, but won’t send duplicates because no delivery occurred.
C. There is a high volume of messages being published to the Pub/Sub topic.
High volume can increase throughput needs, but it does not by itself cause duplication.
Duplicates are due to failure to acknowledge messages, regardless of volume.
✅ Final Answer: D. Your custom HTTPS endpoint is not sending acknowledgments within the required acknowledgment deadline.
Question 10 Single Choice
You are working with Skyline Realty Corp, a large real estate company, and preparing 6 TB of property sales data for a machine learning use case. You plan to use SQL for data transformation and BigQuery ML to build the ML model. The model will be used to generate predictions on unprocessed raw data.
How should you design the workflow to avoid training-serving skew during predictions?
Explanation

Click "Show Answer" to see the explanation here
Correct answer — A
Embed preprocessing with the TRANSFORM clause.
In CREATE MODEL, you can specify all SQL feature-engineering steps inside a TRANSFORM( … ) block. BigQuery stores those transformations inside the model object.
Same logic is auto-applied at serving time.
When you later invoke ML.PREDICT or ML.EVALUATE, BigQuery ML automatically re-runs the embedded transformations on whatever raw rows you pass in, so you feed unprocessed data and still get features identical to training. This removes the possibility of training-serving skew.
Why the other options fall short
B. Running a saved query before prediction duplicates logic in two places; any drift between the query and the original TRANSFORM statement can re-introduce skew.
C. A view helps reuse SQL, but if you skip that view during prediction (as the option states) you still get mismatched features.
D. Doing preprocessing in Dataflow puts critical feature logic outside the model; you must replicate the code path for every inference job, again risking skew and extra maintenance.
Bottom line: putting preprocessing in a TRANSFORM clause lets BigQuery ML itself guarantee that training and serving see identical transformations, meeting Skyline Realty’s requirement with the least manual effort.
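A minimal sketch of the TRANSFORM pattern, assuming hypothetical table, column, and model names:
CREATE OR REPLACE MODEL realty.price_per_sqft_model
  TRANSFORM(
    ML.STANDARD_SCALER(square_feet) OVER () AS square_feet_scaled,  -- scaling stored inside the model
    IFNULL(lot_size, 0) AS lot_size_cleaned,                        -- null handling stored inside the model
    label
  )
  OPTIONS (model_type = 'linear_reg', input_label_cols = ['label'])
AS
SELECT
  square_feet,
  lot_size,
  price / square_feet AS label
FROM realty.sales_raw;
At prediction time, ML.PREDICT can then be passed the raw square_feet and lot_size columns and the stored transformations are applied automatically.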



