
Professional Data Engineer - Google Cloud Certified Exam Questions
Question 11 Single Choice
At WeatherSense Labs, you're developing a machine learning model to predict rainfall for a given day. Your dataset contains thousands of input features, and you want to explore ways to speed up model training while keeping the impact on accuracy minimal.
What approach should you consider?
Explanation

Click "Show Answer" to see the explanation here
B. Merge features that are highly interdependent into a single representative feature
Reducing thousands of raw inputs to a smaller set of composite or “aggregated” features (for example, by combining strongly correlated variables or by applying dimensionality-reduction techniques such as PCA) shrinks the feature space that the optimization algorithm must traverse. Fewer features mean fewer parameters, faster forward/back-prop passes, and quicker convergence—often with little or no loss in accuracy because the merged feature still captures most of the original information.
Why the other options are less suitable
A – Remove features strongly correlated with the target label
Those features are usually the most predictive; dropping them is likely to hurt accuracy more than it helps speed.
C – Average every three features
Arbitrary grouping can discard signal and introduce noise, often harming accuracy without any principled gain in efficiency.
D – Drop features with > 50 % nulls
Eliminating very sparse columns can be worthwhile, but the 50 % threshold is only a rule of thumb, and it does not address the many remaining, highly correlated features that dominate training time.
Merging or otherwise compressing interdependent features tackles the root cause—high dimensionality—while keeping most of the predictive power, delivering the greatest speed-up for minimal accuracy impact.
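To make the idea concrete, here is a minimal, hedged sketch (not part of the original explanation) of compressing a wide feature matrix with PCA before training; the synthetic dataset, the 95 % variance threshold, and the classifier are illustrative assumptions.
```python
# Minimal sketch: compress a wide, correlated feature space with PCA
# before fitting a simple classifier. The synthetic data stands in for
# the real rainfall dataset; sizes and thresholds are assumptions.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Thousands of features, many of them redundant or correlated.
X, y = make_classification(
    n_samples=2000, n_features=2000, n_informative=50, n_redundant=500,
    random_state=42,
)

# Keep enough principal components to retain ~95% of the variance;
# training then runs on far fewer inputs with little accuracy loss.
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),
    LogisticRegression(max_iter=1000),
)
pipeline.fit(X, y)
print("components kept:", pipeline.named_steps["pca"].n_components_)
```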
Question 12 Single Choice
NovaTech Manufacturing is streaming real-time sensor data from its production floor into Cloud Bigtable. However, the team has observed severe performance issues, particularly with queries used to power real-time dashboards.
To improve query performance, how should the row key design be modified?
Explanation

Click "Show Answer" to see the explanation here
Correct Answer: D. Use a row key structured as <sensor_id>#<timestamp>
✅ Justification for Correct Option (D)
Cloud Bigtable stores rows lexicographically sorted by row key, and row key design is critical for performance, especially for querying time-series sensor data.
Using a row key in the format: <sensor_id>#<timestamp>
offers the following benefits:
Efficient scans for a given sensor: You can easily retrieve all data for a single sensor by prefix-scanning with sensor_id.
Maintains time-ordering within each sensor’s data stream.
Enables efficient dashboard queries that typically look up data for one or more sensors within a time range.
Avoids hotspotting by naturally distributing rows if sensor IDs are diverse.
❌ Why the Other Options Are Incorrect
A. <timestamp>
Results in severe hotspotting because all incoming writes for the same time window target the same tablet node.
Poor distribution across nodes.
Not optimized for querying per sensor.
B. <sensor_id>
Lacks temporal ordering, so cannot efficiently scan recent events.
Not suitable for time-series queries like "last 10 minutes of data for sensor A".
Cannot efficiently filter by time range.
C. <timestamp>#<sensor_id>
Also leads to hotspotting, similar to Option A.
Primary sort order is by timestamp, which concentrates writes at the "tail".
Inefficient for queries by sensor.
✅ Summary
To optimize Cloud Bigtable for real-time dashboard queries of sensor data, design your row keys as:
→ <sensor_id>#<timestamp> (Option D)
This provides fast, scalable, and time-ordered access to per-sensor data without causing performance bottlenecks.
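As a hedged illustration (the project, instance, table, and column-family names below are assumptions, not given in the question), writing and prefix-scanning such row keys with the google-cloud-bigtable Python client could look like this.
```python
# Sketch of a <sensor_id>#<timestamp> row key plus a per-sensor scan.
import time
from google.cloud import bigtable

client = bigtable.Client(project="my-project")
table = client.instance("sensor-instance").table("sensor_readings")

# Write: the sensor ID leads, so each sensor's rows stay contiguous
# and time-ordered, and diverse sensor IDs spread writes across nodes.
row_key = f"sensor-1234#{int(time.time())}".encode()
row = table.direct_row(row_key)
row.set_cell("metrics", b"humidity", b"71")
row.commit()

# Read: the key range ["sensor-1234#", "sensor-1234$") covers every row
# for that sensor ('$' is the next ASCII character after '#').
for record in table.read_rows(start_key=b"sensor-1234#", end_key=b"sensor-1234$"):
    print(record.row_key)
```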
Question 13 Single Choice
You’re building a clothing recommendation system at StyleSense AI, which must adapt to changing fashion trends and user preferences over time. You've already implemented a streaming data pipeline that delivers new interaction data to the model as it becomes available.
How should you incorporate this new data into your model training strategy?
Explanation

Click "Show Answer" to see the explanation here
The correct answer is:
B. Continuously retrain the model using a combination of historical and new data.
Correct Option Explanation
B. Continuously retrain the model using a combination of historical and new data
Why it’s correct:
In recommendation systems that must adapt to dynamic patterns like changing fashion trends and user preferences, it’s critical to continuously retrain the model.
Using only new data (Option A) can cause the model to forget past patterns (catastrophic forgetting) and leaves it overly sensitive to short-term shifts in behavior.
Combining historical data with new streaming data ensures that:
The model retains knowledge of long-term trends
It adapts to recent behavior
It generalizes better, improving the user experience
This is especially important in domains like fashion where seasonal patterns and recency matter, but past trends still carry value.
Incorrect Option Justifications
A. Continuously retrain the model using only the new data
Why it’s incorrect:
This can lead to overfitting to recent data and loss of important historical patterns.
Makes the model unstable and prone to concept drift without context from the past.
C. Use the existing data for training, and treat the new data as the test set
Why it’s incorrect:
This approach does not incorporate the new data into training, which is essential for up-to-date recommendations.
New data is wasted as a test set instead of used to improve model adaptability.
D. Train the model using new data, and use the existing data as the test set
Why it’s incorrect:
Using older data as a test set for a model trained only on new data violates proper evaluation practices.
Test data should be representative of future data, not the past.
Also, training only on new data misses valuable long-term user patterns.
Conclusion:
For adaptive, real-time systems like clothing recommenders at StyleSense AI, the best practice is to retrain models with both historical and new data to ensure stability, adaptability, and accuracy over time.
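As a hedged sketch of this strategy (the DataFrames, column names, and classifier below are illustrative assumptions, not part of the question), each retraining cycle simply fits on the union of historical and newly streamed interactions.
```python
# Sketch: periodic retraining on historical plus freshly streamed data.
# loss="log_loss" assumes scikit-learn >= 1.1.
import pandas as pd
from sklearn.linear_model import SGDClassifier

def retrain(historical: pd.DataFrame, fresh: pd.DataFrame) -> SGDClassifier:
    # Blend long-term history with the newest interactions so the model
    # keeps seasonal patterns while adapting to recent behavior.
    combined = pd.concat([historical, fresh], ignore_index=True)
    X = combined.drop(columns=["clicked"])
    y = combined["clicked"]
    return SGDClassifier(loss="log_loss", random_state=0).fit(X, y)

# Tiny illustrative batches (hypothetical feature and label names).
historical = pd.DataFrame(
    {"price": [10, 80, 35, 60], "views": [5, 1, 3, 2], "clicked": [1, 0, 1, 0]}
)
fresh = pd.DataFrame({"price": [15, 90], "views": [4, 1], "clicked": [1, 0]})
model = retrain(historical, fresh)
```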
Question 14 Single Choice
At DataVista Corp, you've built a critical dashboard in Looker Studio 360 (formerly Google Data Studio) for your large internal team. The dashboard pulls data from BigQuery as its source. However, you've observed that the visualizations are not displaying data generated within the last hour.
What should you do to resolve this and ensure the dashboard shows the most up-to-date data?
Explanation

Click "Show Answer" to see the explanation here
The correct answer is:
A. Turn off data caching by modifying the report settings in Looker Studio.
Correct Option Explanation
A. Turn off data caching by modifying the report settings in Looker Studio
Why it’s correct:
Looker Studio 360 (formerly Google Data Studio) uses data caching to improve performance and reduce the number of queries sent to BigQuery.
By default, Looker Studio may cache data for up to 12 hours, meaning recent data (like data from the last hour) might not appear in visualizations.
You can resolve this by turning off or modifying the data freshness settings in the report:
Go to Resource > Manage added data sources > Edit.
In the BigQuery connector settings, find the “Caching” options.
Set “Data freshness” to a lower value (e.g., 15 minutes) or disable caching entirely.
This ensures that every time the dashboard loads, it pulls the latest data directly from BigQuery.
Incorrect Option Justifications
B. Disable caching settings directly in the BigQuery table configuration
Why it’s incorrect:
BigQuery itself doesn’t cache table data in a way that affects Looker Studio dashboard freshness.
Caching is managed in Looker Studio, not in the BigQuery table configuration.
C. Refresh your browser tab that displays the dashboard
Why it’s incorrect:
Refreshing the tab only reloads the report from Looker Studio’s cache, unless the data freshness setting has been lowered.
It won’t bypass Looker Studio's internal cache, so recent data still won’t appear.
D. Clear your browser history from the last hour and reload the dashboard
Why it’s incorrect:
Browser history/cache has no impact on Looker Studio’s data freshness or query behavior.
The issue is entirely related to server-side caching in Looker Studio, not client-side caching.
Conclusion:
To ensure your dashboard always shows the latest data, control the caching behavior directly in Looker Studio's data source settings.
Question 15 Single Choice
You have a dataset with two dimensions, X and Y, and each data point is shaded to represent its class. To accurately classify this data using a linear algorithm, you plan to introduce a synthetic feature. What should be the value of that feature?

Explanation

Click "Show Answer" to see the explanation here
Option A (X² + Y²) is indeed the most suitable choice for introducing a synthetic feature in this scenario. Here's a comprehensive explanation along with why other options might not be ideal:
Why X² + Y² is a Good Choice:
Enhancing Separability: Imagine the data points form a non-linear pattern in the X-Y plane, making linear separation difficult. X² + Y² is the squared distance of each point from the origin, so points lying on the same circle receive the same value. This creates a new dimension in which the classes can become linearly separable.
Circular or Elliptical Boundaries: If the classes exhibit a circular or elliptical separation boundary in the original space, X² + Y² is particularly effective. Squaring both dimensions accentuates the distance from the origin along those axes, potentially creating a clear linear separation line in the new feature space.
Why Other Options Might Not Be Ideal:
X² or Y² Alone: These options might only be helpful if the data exhibits a linear separation tendency along a single axis (X or Y). Squaring just one dimension wouldn't create a significant improvement in separability for most non-linear class boundaries.
cos(X): While introducing trigonometric functions can be a powerful technique for feature engineering, using just cos(X) in this case wouldn't necessarily improve the linear separability. The cosine function introduces periodicity, which might not align well with the existing class distribution.
Additional Considerations:
The effectiveness of X² + Y² as a synthetic feature depends on the specific distribution of your data points and the nature of the class separation boundary. Experimentation with different options might be necessary.
In some cases, more complex transformations beyond simple squaring might be required. Techniques like polynomial expansions or custom functions based on domain knowledge could be explored.
Overall, X² + Y² offers a good starting point for introducing a synthetic feature because it has the potential to create a new dimension where the classes are more linearly separable, especially for circular or elliptical class boundaries.
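The effect is easy to demonstrate. The following hedged sketch uses synthetic concentric-circle data (not the figure from the question) to show a linear classifier failing on raw X and Y but succeeding once X² + Y² is appended.
```python
# Sketch: adding X^2 + Y^2 makes circularly separated classes linearly
# separable. Data is generated with make_circles purely for illustration.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_circles(n_samples=1000, factor=0.4, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Linear model on raw X, Y only: roughly chance-level accuracy.
raw = LogisticRegression().fit(X_train, y_train)
print("raw features:", raw.score(X_test, y_test))

# Append X^2 + Y^2 as a third feature: the circular boundary becomes
# a simple threshold on the new dimension, so accuracy jumps.
def add_radius(X):
    return np.column_stack([X, (X ** 2).sum(axis=1)])

aug = LogisticRegression().fit(add_radius(X_train), y_train)
print("with X^2 + Y^2:", aug.score(add_radius(X_test), y_test))
```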
References:
https://medium.com/@sachinkun21/using-a-linear-model-to-deal-with-nonlinear-dataset-c6ed0f7f3f51
http://playground.tensorflow.org
https://developers.google.com/machine-learning/crash-course/feature-crosses/video-lecture
Question 16 Single Choice
You are migrating a large set of files from a public HTTPS source to a Cloud Storage bucket for NexaData Corp. Access to the files is secured using signed URLs, and you’ve prepared a TSV file listing all the URLs. You initiated the migration using Storage Transfer Service (STS).
The transfer ran successfully for a while but then failed, and job logs show HTTP 403 errors for the remaining files. You verified that nothing changed on the source system. You now need to resolve the issue and resume the migration.
What should you do?
Explanation

Click "Show Answer" to see the explanation here
Pick C — regenerate fresh, longer-lived signed URLs for the remaining files, create a new (or smaller split) TSV list, and restart Storage Transfer Service jobs
Why this solves the 403 failures
What really happened:
HTTP 403 from a signed-URL source almost always means the URL’s time-limited signature has expired. Signed URLs are only valid until the expiration timestamp you set when generating them; after that point every request is rejected with 403.
STS needs the URL to stay valid for the full runtime.
The Storage Transfer Service guides explicitly remind you to choose an expiration long enough for the job to finish (for example, the default eight-hour SAS token limit in another URL-based workflow).
Regenerating and resuming:
Re-sign only the objects that failed and build a new TSV (or break the file into smaller chunks).
Launch new or parallel STS jobs that reference those fresh URLs.
Because STS skips objects that already exist in the destination, no re-copy of completed files is needed, and the transfer continues where it left off.
Splitting the TSV lets several jobs run at once, shortening total duration so the new URLs won’t time out.
Why the other choices fall short
A — mount the bucket and run a shell script
Abandons the managed transfer service, adds VM costs, and re-implements the retry, parallelism, and integrity checks that STS already handles.
B — renew the TLS certificate
TLS certificates on the source host are unrelated to the signed-URL signature; a certificate problem would surface as a TLS or connection error, not as a 403 on every remaining object.
D — switch the checksum algorithm
Checksums matter only after a file has been downloaded; a 403 is an authorization error that occurs before any checksum comparison happens, so changing MD5 to SHA-256 won’t help.
In short: the job failed because the signed URLs timed out; issue fresh URLs with a longer expiry for the unfinished objects, feed them to new or parallel STS jobs, and the migration will complete successfully without re-copying data.
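As a hedged sketch of rebuilding the URL list (the sign_url() helper, object metadata, and file name are hypothetical placeholders), the remaining objects can be written into a new TSV in the documented TsvHttpData-1.0 URL-list format.
```python
# Sketch: regenerate the STS URL list for only the objects that failed.
import csv

def build_tsv(remaining_objects, sign_url, out_path="remaining_urls.tsv"):
    """remaining_objects: iterable of (object_name, size_bytes, md5_base64)."""
    with open(out_path, "w", newline="") as f:
        f.write("TsvHttpData-1.0\n")  # required first line of a URL list
        writer = csv.writer(f, delimiter="\t")
        for name, size, md5 in remaining_objects:
            # sign_url() stands in for however the source system issues
            # fresh, longer-lived signed URLs for each object.
            writer.writerow([sign_url(name), size, md5])
    return out_path
```
The new TSV (or several smaller splits of it) is then referenced by fresh Storage Transfer Service jobs; objects already present in the destination bucket are skipped.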
Question 17 Single Choice
Your company's customer_order table in BigQuery contains the order history for 10 million customers, with a table size of 10 PB. You're tasked with creating a dashboard for the support team to view order history. The dashboard includes two filters, country_name and username, both stored as string data types in the BigQuery table. However, applying filters to the dashboard's query results in slow performance.
SELECT date, order, status FROM customer_order
WHERE country = '<country_name>' AND username = '<username>'
How should you redesign the BigQuery table to facilitate faster access?
Explanation

Click "Show Answer" to see the explanation here
Best redesign — Option A: Cluster the table by country and username.
Why clustering is the right fit
BigQuery clustering works with STRING columns.
Clustering rewrites the table so that rows with similar values for the chosen columns are stored in the same data blocks. When a query filters on a clustered column (or a left-most subset of them), BigQuery can skip blocks that do not match, which sharply reduces the amount of data read and speeds up scans. Google’s documentation explicitly notes that clustering can be applied to STRING columns such as country and username.
Partitioning is impossible on these columns.
BigQuery only allows partitioning on DATE / DATETIME / TIMESTAMP columns, ingestion time, or an INTEGER range column. Because country and username are strings, they cannot be partition keys, ruling out the partition-based options.
Expected performance gain for the dashboard query.
Your support-team query always filters on both country and username. Defining the clustering columns in that order lets BigQuery prune to the exact blocks that hold the requested records, cutting scan time and cost for the 10 PB table.
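A minimal sketch of the redesign (the project and dataset names are assumptions) is to recreate the table with a CLUSTER BY clause via the BigQuery Python client.
```python
# Sketch: copy the table into a clustered version keyed on the two
# filter columns. Dashboard queries that filter on country and username
# can then prune non-matching blocks instead of scanning 10 PB.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")
ddl = """
CREATE TABLE `my-project.mydataset.customer_order_clustered`
CLUSTER BY country, username AS
SELECT * FROM `my-project.mydataset.customer_order`
"""
client.query(ddl).result()
```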
Why the other options don’t solve the problem
B — Partition by username, cluster by country.
Partitioning on a string column is not supported, so this design cannot be created. The same limitation applies to any attempt to partition on country or username.
C — Partition by both country and username.
BigQuery supports only one partition column per table, and that column must be of an allowed type (date/time or integer). Partitioning by two string columns is therefore impossible.
D — Partition by _PARTITIONTIME.
Ingestion-time partitioning helps when queries filter on load time; it offers no benefit when your predicates reference country and username. The query would still need to scan every partition, so the performance issue remains.
Additional Reference:
https://cloud.google.com/bigquery/docs/partitioned-tables#integer_range
Question 18 Single Choice
You're utilizing Google BigQuery as your data warehouse. Users have reported that a seemingly simple query runs exceptionally slowly, regardless of when they execute it:
SELECT country, state, city
FROM [myproject:mydataset.mytable]
GROUP BY country
Upon inspecting the query plan, you observe the following output in the Read section of Stage:1:
What is the most probable cause of the delay for this query?
Explanation

Click "Show Answer" to see the explanation here
Here's the breakdown of why option D is the most probable cause of the slow query performance:
Understanding the Problem
Slow Query: A simple query on a BigQuery table is unexpectedly slow.
Query Plan: The query plan reveals a high slot usage and a large amount of data processed in the Read stage. This suggests a disproportionate amount of work being done at the initial stage of the query.
Analyzing the Options
A. Concurrent Queries: This could introduce some slowdown, but it's less likely to be the primary cause if the issue persists regardless of when the query is executed. BigQuery is designed to handle multiple concurrent queries.
B. Excessive Partitions: While too many partitions can impact performance, it usually manifests in filtering issues, not a generally slow query.
C. NULL Values: NULL values can affect query processing, but it's unlikely the main culprit for such a significant slowdown on a simple query.
D. Data Skew: This is the most likely issue. When a significant portion of rows share the same value for a grouping column ('country' in this case), BigQuery can encounter difficulty parallelizing the query. This leads to a bottleneck, lots of data being processed by a few slots, and overall slow performance.
Why Data Skew is the Issue:
Uneven Work Distribution: Skewed data means some workers in BigQuery will process a significantly larger amount of data than others.
Reduced Parallelism: The efficiency of BigQuery comes from its ability to split work across many machines. Data skew hinders this parallelization.
Query Plan Clues: The high slot usage and large "Read" size in the query plan align with the behavior of a skewed query.
How to Verify and Mitigate
Check distribution: Examine the distribution of values in the country column. You likely have one or a few countries with much larger row counts.
Salting: Add a random element to the country value before grouping to improve distribution (see the sketch below).
Partitioning: If data is time-based, consider partitioning on a different column with better distribution.
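Here is a hedged sketch of the salting idea (the salt width of 16 and the COUNT aggregate are assumptions; the table name comes from the question, written in standard-SQL form), submitted through the BigQuery Python client.
```python
# Sketch: two-stage "salted" aggregation that spreads a skewed
# GROUP BY country across more workers.
from google.cloud import bigquery

query = """
WITH salted AS (
  SELECT country, CAST(FLOOR(RAND() * 16) AS INT64) AS salt
  FROM `myproject.mydataset.mytable`
),
partials AS (                    -- stage 1: aggregate per (country, salt)
  SELECT country, salt, COUNT(*) AS cnt
  FROM salted
  GROUP BY country, salt
)
SELECT country, SUM(cnt) AS cnt  -- stage 2: merge the salted partials
FROM partials
GROUP BY country
"""
client = bigquery.Client()
for row in client.query(query).result():
    print(row.country, row.cnt)
```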
References:
https://cloud.google.com/bigquery/query-plan-explanation
https://cloud.google.com/bigquery/docs/best-practices-performance-patterns
https://cloud.google.com/bigquery/docs/best-practices-performance-patterns#data_skew
Question 19 Single Choice
You possess an inventory of VM data stored in a BigQuery table called dataset.inventory_vm. To prepare the data for regular reporting in the most cost-effective manner, you need to exclude VM rows with fewer than 8 vCPUs. What action should you take?
Explanation

Click "Show Answer" to see the explanation here
A. Create a view with a filter to remove rows with fewer than 8 vCPUs, and utilize the UNNEST operator.
Justification:
Create a View: Views in BigQuery are logical queries that act as virtual tables. By creating a view, you can define a filter to exclude VM rows with fewer than 8 vCPUs without modifying the original data. This approach ensures data integrity and allows for easy reuse of the filtered dataset for regular reporting.
Utilize the UNNEST Operator: The UNNEST operator is used to flatten arrays or structs in BigQuery. While it may not be necessary for filtering rows based on vCPUs, it can be useful if the vCPUs data is stored in an array or struct format. If the vCPUs data is stored in a nested structure, you can use UNNEST to access and filter it appropriately.
Option B:
Creating a materialized view with a filter to eliminate rows with fewer than 8 vCPUs is a valid approach. Materialized views store the results of the query in a physical table, which can improve query performance for repeated queries. However, materialized views incur storage costs and may not be necessary for regular reporting unless there's a significant benefit in query performance.
Option C:
Establishing a view with a filter to discard rows with fewer than 8 vCPUs is similar to option A, but it doesn't mention the UNNEST operator. If the vCPUs data is stored in a nested structure, utilizing the UNNEST operator may be necessary to access and filter it correctly.
Option D:
Employing Dataflow to batch process the data and write the result to another BigQuery table is a valid approach but may be overkill for this specific requirement. Dataflow is typically used for more complex data processing tasks or when real-time processing is required. For simple filtering tasks like excluding rows based on a specific condition, using a SQL query and creating a view is a more straightforward and cost-effective solution.
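As a hedged sketch only (the nested machine_specs field and the view name are hypothetical, since the real schema of dataset.inventory_vm isn’t given), the filtered view with an UNNEST-based condition could be created like this.
```python
# Sketch: logical view that hides VM rows with fewer than 8 vCPUs.
from google.cloud import bigquery

client = bigquery.Client()
ddl = """
CREATE OR REPLACE VIEW `dataset.inventory_vm_min8vcpu` AS
SELECT vm.*
FROM `dataset.inventory_vm` AS vm
-- UNNEST is only needed if the vCPU count lives inside a nested array;
-- the machine_specs/vcpus fields below are hypothetical.
WHERE (SELECT MAX(spec.vcpus) FROM UNNEST(vm.machine_specs) AS spec) >= 8
"""
client.query(ddl).result()
```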
Question 20 Single Choice
You are overseeing the data lake infrastructure for DataNova Corp, which is built on BigQuery. The data ingestion pipelines pull messages from Pub/Sub and write the incoming data into BigQuery tables. After rolling out a new version of the ingestion pipelines, you notice that the daily data volume stored in BigQuery has surged by 50%, even though the data volume in Pub/Sub hasn't changed. Only certain BigQuery tables show a doubling in the size of their daily partitions.
How should you investigate and resolve the root cause of this increase?
Explanation

Click "Show Answer" to see the explanation here
Correct Answer: C.
1. Check the tables with increased data for duplicated entries.
2. Use BigQuery Audit Logs to trace job activity and retrieve relevant job IDs.
3. Use Cloud Monitoring to identify when each Dataflow job started and determine the associated code version.
4. If multiple pipeline versions are pushing to the same table, stop all except the most recent one.
Justification for Correct Option (C):
This option provides a comprehensive, non-destructive, and systematic approach to:
Investigate the root cause of the data volume spike,
Identify potential duplicate entries,
Trace pipeline behavior via logs, and
Resolve the issue by stopping unintended writes, without risking data loss.
Let’s break it down:
1. Check for duplicated entries
Doubling of partition sizes often indicates duplicate writes. Verifying for duplication is the first logical step.
2. Use BigQuery Audit Logs to trace job activity
BigQuery Audit Logs help identify who or what is writing to a table. They contain:
Job type (e.g., load, query)
Execution timestamp
Service account and job configuration
Relevant docs:
BigQuery Audit Logs
3. Use Cloud Monitoring to identify job start times and versions
Helps correlate pipeline start times with observed anomalies, and determine whether multiple Dataflow jobs (or older versions) are concurrently writing to the same table.
Docs:
Cloud Monitoring for Dataflow
4. Stop unintended concurrent pipelines
If multiple versions of pipelines are found writing to the same sink, stopping the older ones prevents further duplication, resolving the issue at the source.
This diagnoses and remediates the actual issue without reverting or deleting any data.
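A hedged sketch of steps 1 and 2 follows (the region, dataset, table, and event_id column names are assumptions); it pairs a duplicate check with the INFORMATION_SCHEMA.JOBS view as a convenient complement to the audit logs for finding which jobs wrote to the table.
```python
# Sketch: (1) look for duplicated entries in an affected daily partition,
# (2) list recent jobs that wrote to that table to trace the pipelines.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

duplicate_check = """
SELECT event_id, COUNT(*) AS copies
FROM `my-project.datalake.affected_table`
WHERE DATE(_PARTITIONTIME) = CURRENT_DATE()
GROUP BY event_id
HAVING COUNT(*) > 1
"""

recent_writers = """
SELECT job_id, user_email, creation_time
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 2 DAY)
  AND destination_table.table_id = 'affected_table'
ORDER BY creation_time DESC
"""

print(len(list(client.query(duplicate_check).result())), "duplicated keys")
for job in client.query(recent_writers).result():
    print(job.job_id, job.user_email, job.creation_time)
```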
Why the Other Options Are Incorrect:
Option A:
While checking for duplicates is valid, manually scheduling jobs to clean up duplicates is reactive, not a root cause fix.
Also, this does not explain why duplication started, or prevent recurrence.
Sharing the deduplication script doesn’t help if the pipeline issue persists.
Option B:
Reviewing pipeline code and logging is helpful, but:
No duplication check is mentioned.
Using time travel to restore tables is risky and not recommended as a first step, especially without confirming data corruption.
Time travel only works within 7 days and can be costly if misused.
BigQuery Time Travel
Option D:
Reverting the pipeline and using time travel is drastic.
It does not diagnose the issue, only rolls back.
Reprocessing data via Pub/Sub seek risks data loss or inconsistency, especially if ordering is not guaranteed or deduplication isn’t built in.
Final Answer: C



