
Professional Data Engineer - Google Cloud Certified Exam Questions
Question 1 Single Choice
Your company utilizes WILDCARD tables to query data across multiple tables with similar names. However, the SQL statement is currently encountering an error, displayed as:
# Syntax error: Expected end of statement but got "-" at [4:11]
SELECT age
FROM bigquery-data.noaa_gsod.gsod
WHERE
age != 199
AND_TABLE_SUFFIX = '2929'
ORDER BY age DESC
Which table name will enable the SQL statement to function correctly?
Explanation

Click "Show Answer" to see the explanation here
D. `bigquery-data.noaa_gsod.gsod*`
In BigQuery wildcard tables, the asterisk (*) is used as a wildcard character to represent multiple tables with similar names. In the provided SQL statement, the syntax error ("Expected end of statement but got "-"") occurs because the table reference contains a hyphen and a wildcard but is not enclosed in backticks (`); the wildcard character must sit inside those backticks as part of the table name pattern.
Here's the corrected SQL statement (reconstructed from the question's query, with the table reference enclosed in backticks and a space restored before the _TABLE_SUFFIX pseudo-column):
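SELECT age
FROM `bigquery-data.noaa_gsod.gsod*`
WHERE
age != 199
AND _TABLE_SUFFIX = '2929'
ORDER BY age DESC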
In this corrected statement, the backticks (`) are used to enclose the table name pattern 'bigquery-data.noaa_gsod.gsod*', which includes the wildcard character (*) inside the backticks. This syntax enables the SQL statement to query all tables matching the pattern 'bigquery-data.noaa_gsod.gsod*'.
References:
https://cloud.google.com/bigquery/docs/querying-wildcard-tables
https://cloud.google.com/bigquery/docs/wildcard-table-reference
Question 2 Single Choice
You are working at RetailNova Corp., managing a BigQuery table that holds millions of rows of sales transactions, partitioned by date. This table is queried frequently—multiple times per minute—by various applications and users.
The queries compute aggregations such as AVG, MAX, and SUM, and they only need data from the past year, although the full historical data must be retained in the base table. The goal is to always return up-to-date results while also minimizing query costs, reducing maintenance overhead, and improving performance.
What is the best approach?
Explanation

Click "Show Answer" to see the explanation here
The correct answer is:
A. Create a materialized view that aggregates the base table with a filter for the last year of partitions.
Correct Option Explanation
A. Create a materialized view that aggregates the base table with a filter for the last year of partitions
Why it's correct:
A materialized view in BigQuery precomputes and stores the results of a query (such as AVG, SUM, and MAX aggregations) and is kept up to date automatically as the base table changes. When you filter the materialized view to only include data from the last year, BigQuery will only maintain and compute aggregates over this smaller, relevant dataset. (A SQL sketch follows this list.)
This:
Reduces query costs (since users query the precomputed results)
Improves performance (faster reads)
Requires minimal maintenance (auto-refresh managed by BigQuery)
Returns up-to-date results: queries combine the precomputed view with recent changes in the base table, and automatic refresh runs in the background (no more frequently than every 30 minutes by default).
Supports use case:
Frequent, repeated aggregate queries.
Queries only require recent data (past 1 year), but historical data must still be stored.
Low-latency results and cost-efficiency are needed.
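A minimal sketch of such a materialized view, assuming an illustrative base table retail.sales_transactions partitioned on transaction_date (all names and the date boundary are placeholders, not from the question):
CREATE MATERIALIZED VIEW retail.sales_last_year_agg AS
SELECT
  transaction_date,
  AVG(amount) AS avg_amount,
  MAX(amount) AS max_amount,
  SUM(amount) AS sum_amount
FROM retail.sales_transactions
WHERE transaction_date >= DATE '2024-01-01'  -- fixed last-year boundary; keeps the view definition deterministic
GROUP BY transaction_date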
Incorrect Option Justifications
B. Create a materialized view to aggregate the base table, and set a partition expiration on the base table to keep only the last year of data
Why it’s incorrect:
The requirement clearly states: “the full historical data must be retained”.
Setting a partition expiration to delete data older than one year violates this requirement and results in data loss.
C. Create a standard view that aggregates the base table and filters for the last year of partitions
Why it’s incorrect:
A standard view does not cache or store the result—it runs the underlying query each time it's accessed.
This leads to higher query costs and slower performance, especially with frequent queries.
Does not leverage materialized view optimizations such as incremental refresh.
D. Create a new table that stores the aggregated results from the last year of data, and run a scheduled query every hour to recreate it
Why it’s incorrect:
This adds maintenance overhead (scheduling, managing updates).
The data is only updated hourly, which may not meet real-time freshness requirements.
Materialized views automatically refresh incrementally with better performance and reduced complexity.
Conclusion:
A materialized view with a 1-year filter gives RetailNova the best balance of performance, cost-efficiency, and data freshness, while ensuring full data retention in the base table.
Question 3 Single Choice
At Invex Systems, a proprietary platform sends inventory data every 6 hours to a cloud-based ingestion service. Each transmission includes a payload with multiple fields and a timestamp. If a transmission issue is suspected, the system may re-send the same data.
As a data engineer, how can you efficiently deduplicate the incoming data?
Explanation

Click "Show Answer" to see the explanation here
Correct Answer: D. Maintain a table that stores the hash value along with metadata for each data record.
✅ Justification for Correct Option:
D. Maintain a table that stores the hash value along with metadata for each data record.
This is the most efficient and scalable approach for deduplication in streaming or batch pipelines when duplicate data is occasionally re-sent.
Hashing the entire content of a record ensures that identical records produce the same hash.
Storing the hash in a lookup table with metadata (e.g., timestamp, source ID) allows:
Fast lookup for duplicate detection.
Efficient storage (as hashes are fixed-size).
Flexibility to add additional deduplication context (e.g., source system, window).
This is a commonly used method in data pipelines such as those built with Cloud Dataflow, Apache Beam, or BigQuery for idempotent processing.
Official guidance from Google Cloud (BigQuery deduplication):
"To remove duplicate rows, you can use a hash function like FARM_FINGERPRINT() on the entire row and track previously seen hashes."
❌ Justifications for Incorrect Options:
A. Generate and assign a globally unique identifier (GUID) for each data record.
This defeats the purpose: a new GUID will be generated for each incoming record—even if it's a duplicate.
So true duplicates will appear different, making deduplication ineffective.
B. Calculate a hash for each data record and compare it against all previously stored data.
While similar in intent to option D, this is impractical at scale without an optimized structure.
"Comparing against all previously stored data" implies scanning entire datasets, which is inefficient and costly in cloud-scale systems.
C. Use a dedicated database where each data record is stored as a primary key and indexed.
Unless you already have a natural primary key, deduplication using arbitrary fields may not be effective.
Also, this approach can quickly become expensive and unscalable if the ingestion volume is high and the data lacks strong primary key semantics.
✅ Final Answer:
D. Maintain a table that stores the hash value along with metadata for each data record.
Question 4 Single Choice
At CloudWare Systems, you're using a production-grade Memorystore for Redis (Standard Tier) instance. As part of your disaster recovery (DR) planning, you need to test failover behavior realistically on this instance. The goal is to ensure no data loss during this failover test.
What is the best approach?
Explanation

Click "Show Answer" to see the explanation here
Answer: B. Create a Standard Tier Redis instance in a development environment, and initiate a manual failover using the force-data-loss data protection mode.
Why B is correct
Worst-case DR simulation
The force-data-loss mode skips the 30 MB replication-lag check and will promote the replica immediately, even if it's behind. That behavior closely mimics a catastrophic primary failure in a real site-outage scenario.
Zero impact on production
By running this on a dev-sized clone of your Standard Tier instance, you practice the exact gcloud redis instances failover … --data-protection-mode=force-data-loss command without risking live data or service uptime.
Why the other options are not suitable
A. Dev + limited-data-loss
Limited-data-loss will abort if the replica is more than 30 MB behind, so you aren’t guaranteed to see a successful failover under high lag. That doesn’t simulate a full-blown disaster scenario.
C. Increase replicas in Prod + force-data-loss
Running force-data-loss on production risks real data loss, and adding more replicas doesn’t mitigate that: any replica could be behind and you’d sever whichever one you promote.
D. Prod + limited-data-loss
Although safe, this only tests a best-case failover (it aborts under lag) and still touches production, risking service disruption if you mis-schedule or mis-configure the maintenance window.
References
Manual failover modes and their behaviors (Memorystore for Redis documentation)
Question 5 Single Choice
You're preparing data for your machine learning team to train a model using BigQueryML. The objective is to predict the price per square foot of real estate. The training data includes columns for price and square footage. However, the 'feature1' column contains null values due to missing data. To retain more data points, you aim to replace the nulls with zeros. Which query should you use?
Explanation

Click "Show Answer" to see the explanation here
A. SELECT * EXCEPT (feature1), IFNULL(feature1, 0) AS feature1_cleaned FROM training_data;
Replaces nulls: IFNULL(feature1, 0) directly replaces null values in 'feature1' with 0.
Preserves other data: SELECT * EXCEPT (feature1) selects all columns other than 'feature1', ensuring no existing data is lost.
Creates new column: AS feature1_cleaned creates a new column with the cleaned data, allowing you to compare the original and modified feature if needed.
Why the other options are less suitable:
B: This query calculates 'price_per_sqft' and excludes rows where 'feature1' is null. This loses potentially valuable data.
C: This is similar to B, additionally removing the 'feature1' column entirely, which might be important for later analysis.
D: This only selects rows where 'feature1' is not null, removing data points that could be useful after replacing the nulls.
Important Considerations
Why Impute? Missing data is common. Replacing nulls with zeros allows you to retain more data points, potentially improving model training. However, this assumes 'zero' is a meaningful value in your 'feature1' context.
Other Imputation Strategies: Consider alternatives like:
Mean/Median: Replacing with the average or middle value of the feature.
Predictive model: A more complex approach using other features to predict the missing values of 'feature1'.
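For example, the mean-imputation alternative mentioned above could be sketched like this (training_data and feature1 follow the question; the approach itself is illustrative, not part of the answer):
SELECT
  * EXCEPT (feature1),
  IFNULL(feature1, (SELECT AVG(feature1) FROM training_data)) AS feature1_cleaned
FROM training_data;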
Question 6 Single Choice
You're managing a Dataplex environment for DataSpring Inc., which includes both raw and curated zones. The data engineering team is uploading CSV and JSON files to a Cloud Storage bucket asset within the curated zone. However, these files are not being automatically discovered by Dataplex.
What should you do to ensure Dataplex can automatically discover these files?
Explanation

Click "Show Answer" to see the explanation here
The correct answer is:
B. Enable auto-discovery for the curated zone’s bucket asset.
Correct Option Explanation
B. Enable auto-discovery for the curated zone’s bucket asset
Why it’s correct:
In Dataplex, auto-discovery must be explicitly enabled per asset (like a Cloud Storage bucket or BigQuery dataset) to allow Dataplex to scan and catalog the data (e.g., CSV, JSON files).
Without enabling this setting, Dataplex will not automatically discover schema, metadata, or partitioning information—even if the files are present in the asset.
Once enabled, Dataplex will regularly crawl the bucket and automatically register discovered data assets and schemas in the Data Catalog, making them queryable and discoverable.
How to enable auto-discovery: You can configure this in the Dataplex UI, API, or gcloud:
gcloud dataplex assets update [ASSET_ID] \
  --project=[PROJECT_ID] \
  --lake=[LAKE_NAME] \
  --zone=[ZONE_NAME] \
  --resource-spec=type=STORAGE_BUCKET,name=[BUCKET_PATH] \
  --discovery-enabled
Incorrect Option Justifications
A. Move the JSON and CSV files into the raw zone instead
Why it’s incorrect:
Changing the zone doesn’t solve the issue—discovery depends on whether auto-discovery is enabled, not the zone type.
The curated zone can support auto-discovery just like the raw zone, as long as it's configured.
C. Use the bq command-line tool to load the JSON and CSV files into BigQuery tables
Why it’s incorrect:
Using the bq CLI imports the data into BigQuery, not Dataplex. This bypasses Dataplex’s discovery process and doesn't solve the problem of automatic discovery in the curated zone’s Cloud Storage asset.
D. Grant object-level access to the CSV and JSON files in Cloud Storage
Why it’s incorrect:
Object-level permissions are not required for Dataplex to perform discovery if the service account has read access at the bucket or folder level.
Discovery failure is typically due to discovery not being enabled, not missing access—especially if the asset was created by an authorized user.
Question 7 Multiple Choice
SecureTrust Analytics, your company operating in a tightly regulated industry, must enforce strict access controls to ensure that users only access the minimum data necessary for their responsibilities. You’re using Google BigQuery and need to apply this principle of least privilege effectively.
Which three of the following strategies would help enforce this requirement?
Explanation

Click "Show Answer" to see the explanation here
Correct Answers: B, D, and E
These are the most appropriate strategies to enforce the principle of least privilege in Google BigQuery.
Explanation for Correct Options
B. Control table-level access using IAM roles. ✅
IAM (Identity and Access Management) is the foundation for enforcing granular access control in BigQuery.
You can assign roles at the table or dataset level (e.g., roles/bigquery.dataViewer) so users only have access to the data necessary for their tasks. Custom roles can be created to fine-tune access permissions even further.
Official Documentation:
BigQuery access control with IAM
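As an illustration, table-level access can also be granted with BigQuery's SQL DCL statements; the project, dataset, table, and user below are hypothetical:
GRANT `roles/bigquery.dataViewer`
ON TABLE `my_project.sales.transactions`
TO "user:analyst@example.com";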
D. Limit BigQuery API access to only authorized users. ✅
Restricting access to the BigQuery API ensures that only trusted users or service accounts can execute queries or export data programmatically.
This limits the risk of unauthorized access or data exfiltration.
You can enforce this using IAM permissions and VPC Service Controls.
E. Organize data into separate tables or datasets to isolate access. ✅
Data partitioning by business unit, sensitivity, or purpose enables tighter access control.
IAM policies can then be applied independently to each dataset, ensuring users only access the datasets relevant to their role.
This is a core strategy for implementing least privilege in data governance.
Docs:
Best practices for data governance in BigQuery
Explanation for Incorrect Options
A. Disable write access on specific BigQuery tables.
While limiting write access helps maintain data integrity, it does not enforce least privilege for data access (read permissions).
Least privilege also involves controlling who can read or query the data.
C. Ensure all data remains encrypted at rest and in transit.
Encryption is important for data security, but it is a default behavior in BigQuery and does not control access to data.
It protects against unauthorized access at the infrastructure level, not user-level access.
Reference:
BigQuery encryption documentation
F. Use Cloud Audit Logs to track and identify potential access violations.
Cloud Audit Logs are critical for monitoring and compliance, but they provide visibility, not access control.
They support detecting violations, not preventing them.
Reference:
BigQuery audit logging
Final Answer: B, D, and E
These options directly support the principle of least privilege in Google BigQuery by enforcing fine-grained, role-based access control and data isolation.
Question 8 Single Choice
Your organization, DataLink Global, follows a multi-cloud strategy by storing data in both Google Cloud Storage and Amazon S3 buckets, with all data residing in US-based regions. You need to allow your teams to query the most up-to-date data from either cloud using BigQuery, but without granting direct access to the Cloud Storage or S3 buckets themselves.
What should you do?
Explanation

Click "Show Answer" to see the explanation here
Correct answer — A
Set up a BigQuery Omni connection to the Amazon S3 buckets, create BigLake tables over both Cloud Storage and S3 data, and let your teams query those tables from BigQuery.
Why this meets every requirement
Single query surface for both clouds
BigQuery Omni lets you run familiar BigQuery SQL against data that physically sits in Amazon S3 while keeping the data in place.
BigLake tables decouple bucket access
BigLake adds an access-delegation layer: users receive permissions on the BigLake table, not on the underlying Cloud Storage or S3 buckets, satisfying the “no direct bucket access” constraint.
Always up to date and no data movement
Because the tables point straight at the objects in Cloud Storage and S3, queries always reflect the latest files; there is no transfer lag or duplication.
Uniform governance and security
BigLake tables support column-level security, row-level security, and fine-grained IAM roles exactly as regular BigQuery tables do, so you manage policy once across both clouds.
Why the other options are less suitable
B – External tables instead of BigLake
BigQuery external tables work, but they don’t provide access delegation; each user (or a shared service account) still needs Storage IAM on the buckets. BigLake was created to remove that operational and security burden.
C & D – Copy data with Storage Transfer Service
Copying S3 data into Cloud Storage introduces extra cost, operational complexity, and delay; the copied data can be stale. These options also break the goal of query-in-place across clouds.
Recommendation
Create one BigLake table for each dataset—using a Cloud Storage URI in Google Cloud and an S3 URI (through a BigQuery Omni connection) in AWS. Grant your analysts BigQuery roles on those tables only, and they can run unified SQL queries without ever touching the buckets.
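A minimal DDL sketch for the Cloud Storage side (project, dataset, connection, bucket, and format are hypothetical; the S3 side follows the same pattern through a BigQuery Omni connection and an s3:// URI):
CREATE EXTERNAL TABLE `my_project.analytics.sales_gcs`
WITH CONNECTION `my_project.us.gcs_biglake_conn`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://datalink-global-sales/*.parquet']  -- bucket path is a placeholder
);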
Question 9 Single Choice
You are a data engineer at AutoNova Motors, and you've built a data pipeline using Google Cloud Pub/Sub to capture sensor anomalies from connected vehicles. A push subscription is configured to send these events to a custom HTTPS endpoint that you’ve developed to take immediate action when anomalies are detected.
However, you notice that your HTTPS endpoint is receiving an unusually high number of duplicate messages.
What is the most likely reason for this issue?
Explanation

Click "Show Answer" to see the explanation here
Correct Answer: D. Your custom HTTPS endpoint is not sending acknowledgments within the required acknowledgment deadline.
✅ Justification for Correct Option:
D. Your custom HTTPS endpoint is not sending acknowledgments within the required acknowledgment deadline.
This is the most accurate explanation for why duplicate messages are being received.
Cloud Pub/Sub guarantees at-least-once delivery, which means a message may be redelivered if:
The subscriber fails to acknowledge the message.
The subscriber does not respond within the acknowledgment deadline (default: 10 seconds).
An error occurs during delivery.
In push subscriptions, Pub/Sub waits for a successful HTTP 200 OK response from your endpoint within the acknowledgment deadline.
If it doesn't receive it in time, it assumes the message wasn't processed, and resends the message, causing duplicates.
Official documentation:
“Push endpoints must return an HTTP success status code (200–299) within the acknowledgment deadline to acknowledge the message. If they do not, the message is redelivered.”
Source: https://cloud.google.com/pubsub/docs/push#receiving_messages
❌ Justifications for Incorrect Options:
A. The payload size of the sensor event messages is too large.
Pub/Sub supports payloads up to 10 MB.
A large payload might cause processing delays, but it is not the root cause of duplication.
Duplication is due to acknowledgment behavior, not payload size directly.
Reference:
https://cloud.google.com/pubsub/quotas#resource_limits
B. The SSL certificate on your custom HTTPS endpoint is outdated.
An outdated or invalid certificate would result in delivery failures, not successful duplication.
If the SSL handshake fails, Pub/Sub will log errors, but won’t send duplicates because no delivery occurred.
C. There is a high volume of messages being published to the Pub/Sub topic.
High volume can increase throughput needs, but it does not by itself cause duplication.
Duplicates are due to failure to acknowledge messages, regardless of volume.
✅ Final Answer: D. Your custom HTTPS endpoint is not sending acknowledgments within the required acknowledgment deadline.
Question 10 Single Choice
You are working with Skyline Realty Corp, a large real estate company, and preparing 6 TB of property sales data for a machine learning use case. You plan to use SQL for data transformation and BigQuery ML to build the ML model. The model will be used to generate predictions on unprocessed raw data.
How should you design the workflow to avoid training-serving skew during predictions?
Explanation

Click "Show Answer" to see the explanation here
Correct answer — A
Embed preprocessing with the TRANSFORM clause.
In CREATE MODEL, you can specify all SQL feature-engineering steps inside a TRANSFORM( … ) block. BigQuery stores those transformations inside the model object.
Same logic is auto-applied at serving time.
When you later invoke ML.PREDICT or ML.EVALUATE, BigQuery ML automatically re-runs the embedded transformations on whatever raw rows you pass in, so you feed unprocessed data and still get features identical to training. This removes the possibility of training-serving skew.
Why the other options fall short
B. Running a saved query before prediction duplicates logic in two places; any drift between the query and the original TRANSFORM statement can re-introduce skew.
C. A view helps reuse SQL, but if you skip that view during prediction (as the option states) you still get mismatched features.
D. Doing preprocessing in Dataflow puts critical feature logic outside the model; you must replicate the code path for every inference job, again risking skew and extra maintenance.
Bottom line: putting preprocessing in a TRANSFORM clause lets BigQuery ML itself guarantee that training and serving see identical transformations, meeting Skyline Realty’s requirement with the least manual effort.
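A minimal sketch of the TRANSFORM pattern, assuming hypothetical table, column, and model names:
CREATE OR REPLACE MODEL realty.price_per_sqft_model
  TRANSFORM(
    ML.STANDARD_SCALER(square_feet) OVER () AS square_feet_scaled,  -- scaling stored inside the model
    IFNULL(lot_size, 0) AS lot_size_cleaned,                        -- null handling stored inside the model
    label
  )
  OPTIONS (model_type = 'linear_reg', input_label_cols = ['label'])
AS
SELECT
  square_feet,
  lot_size,
  price / square_feet AS label
FROM realty.sales_raw;
At prediction time, ML.PREDICT can then be passed the raw square_feet and lot_size columns and the stored transformations are applied automatically.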



