

Associate Data Practitioner - Google Cloud Certified Exam Questions
Question 11 Single Choice
A genomics research company needs to store DNA sequence files that are write-once, rarely read, but must be immutable for regulatory compliance. Which Google Cloud storage option is most appropriate?
Explanation

Click "Show Answer" to see the explanation here
Option B is CORRECT. Cloud Storage with object versioning is the most appropriate solution for this scenario. Cloud Storage provides native immutability controls: retention policies (optionally locked with Bucket Lock) and object holds prevent modification or deletion of objects for a defined period, while Object Versioning preserves prior versions of any object that is overwritten. This makes it well suited to write-once, read-many (WORM) workloads for regulatory compliance, and it is cost-effective for rarely accessed data when combined with the Coldline or Archive storage classes. It also easily stores large binary files like DNA sequences.
Option A is INCORRECT. BigQuery is optimized for analytical queries, not for storing large binary files. While it can store structured data efficiently, it's not designed for raw file storage like DNA sequence files, which are typically large binary files with specialized formats.
Option C is INCORRECT. Cloud SQL is a relational database not designed for large binary file storage. It's optimized for transactional workloads and structured data, making it inappropriate for storing genomic sequence files that are rarely read but must be preserved immutably.
Option D is INCORRECT. Filestore is a managed file storage service primarily for applications requiring a file system interface, which is unnecessary and cost-ineffective for rarely accessed immutable data. It's designed for high-performance workloads that need a file system, not for long-term archival storage.
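As a rough illustration (not part of the official answer), the sketch below uses the google-cloud-storage Python client to configure an existing bucket for this kind of WORM workload. The bucket name, retention period, and storage class are hypothetical placeholders.

```python
# Minimal sketch: configure a bucket for WORM-style storage of sequence files.
# Assumes the google-cloud-storage library and a hypothetical, already-created bucket.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("genomics-sequences-archive")  # hypothetical name

# Keep prior versions if an object is ever overwritten or deleted.
bucket.versioning_enabled = True

# Retain every object for 7 years (illustrative compliance period, in seconds).
bucket.retention_period = 7 * 365 * 24 * 60 * 60

# Rarely read data: the Archive class minimizes storage cost.
bucket.storage_class = "ARCHIVE"

bucket.patch()  # apply the changes to the bucket

# For strict WORM compliance, the policy can be locked permanently
# (irreversible, so it is left commented out here):
# bucket.lock_retention_policy()
```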
Question 12 Single Choice
When choosing between Cloud SQL and BigQuery for a new application's database, which factor most strongly favors Cloud SQL?
Explanation

Click "Show Answer" to see the explanation here
Option B is CORRECT. The requirement for real-time transactional processing strongly favors Cloud SQL because it's designed as an OLTP (Online Transaction Processing) database that supports low-latency read and write operations, row-level updates, inserts, and deletes. It provides ACID compliance for transaction reliability and supports transactional consistency and locking mechanisms needed for real-time processing of multiple concurrent transactions.
Option A is INCORRECT. The need for complex analytical queries strongly favors BigQuery, not Cloud SQL. BigQuery is optimized for analytical workloads with its columnar storage and massive parallel processing capabilities that can scan terabytes in seconds.
Option C is INCORRECT. Storage of petabytes of historical data favors BigQuery rather than Cloud SQL. BigQuery can efficiently store and query petabytes of data, while Cloud SQL has practical limits far below petabyte scale for most workloads.
Option D is INCORRECT. Support for semi-structured data formats like JSON with nested fields and arrays is a strength of BigQuery, not Cloud SQL. While Cloud SQL can store JSON, BigQuery has native functions and capabilities specifically designed for querying and analyzing semi-structured data.
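For illustration only, the following sketch shows the kind of multi-statement ACID transaction that Cloud SQL handles well and BigQuery is not designed for. It assumes psycopg2 against a hypothetical Cloud SQL for PostgreSQL instance; the connection details and table names are placeholders.

```python
# Minimal sketch: an ACID transaction against a Cloud SQL for PostgreSQL instance.
# Assumes psycopg2 and a hypothetical inventory schema; connection details are placeholders.
import psycopg2

conn = psycopg2.connect(host="10.0.0.3", dbname="shop", user="app", password="...")

try:
    with conn:  # commits on success, rolls back on exception
        with conn.cursor() as cur:
            # Both statements succeed or fail together -- the OLTP guarantee
            # that favors Cloud SQL over BigQuery for this workload.
            cur.execute(
                "UPDATE inventory SET quantity = quantity - %s WHERE sku = %s",
                (1, "SKU-123"),
            )
            cur.execute(
                "INSERT INTO orders (sku, quantity) VALUES (%s, %s)",
                ("SKU-123", 1),
            )
finally:
    conn.close()
```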
Question 13 Single Choice
A manufacturing company needs to process sensor data in real-time from thousands of IoT devices for anomaly detection. Which ingestion approach is most suitable?
Explanation

Click "Show Answer" to see the explanation here
Option B is CORRECT. Streaming data through Pub/Sub to a processing pipeline is the most suitable approach for real-time IoT anomaly detection because Pub/Sub provides a scalable, reliable messaging service designed for event ingestion from thousands or millions of sources. It decouples data producers (IoT devices) from consumers (processing applications), can handle massive throughput with low latency, and integrates seamlessly with processing services like Dataflow for real-time analytics and anomaly detection.
Option A is INCORRECT. Batch processing with CSV files introduces too much latency for real-time anomaly detection. This approach would mean significant delays between when anomalies occur and when they're detected, defeating the purpose of real-time monitoring.
Option C is INCORRECT. BigQuery streaming inserts could work but would be more expensive and less flexible than a dedicated streaming solution for high-volume IoT data. While BigQuery supports streaming inserts, it's primarily an analytical database, not an event ingestion service, and would be costlier for this high-volume use case.
Option D is INCORRECT. Cloud SQL is not designed for high-throughput event ingestion and would struggle with thousands of concurrent connections from IoT devices. Periodic querying also introduces latency that conflicts with the real-time requirement for anomaly detection.
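A minimal sketch of the ingestion side, assuming the google-cloud-pubsub client library and hypothetical project, topic, and device names:

```python
# Minimal sketch: an IoT gateway publishing sensor readings to Pub/Sub.
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "sensor-readings")  # hypothetical names

reading = {
    "device_id": "sensor-042",
    "temperature_c": 81.7,
    "timestamp": "2024-01-01T00:00:00Z",
}

# publish() returns a future; the message ID is available once the publish succeeds.
future = publisher.publish(
    topic_path,
    data=json.dumps(reading).encode("utf-8"),
    device_id=reading["device_id"],  # attributes allow filtering without parsing the payload
)
print(future.result())
```

A streaming Dataflow pipeline subscribed to this topic could then score each reading for anomalies within seconds of arrival.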
Question 14 Single Choice
Which ETL tool would be most appropriate for transforming data using Apache Spark without managing infrastructure?
Explanation

Click "Show Answer" to see the explanation here
Option B is CORRECT. Dataproc Serverless allows you to run Apache Spark jobs without managing any infrastructure, making it the most appropriate choice for Spark-based transformations with minimal operational overhead. With Dataproc Serverless, you submit Spark batch workloads and Google Cloud automatically provisions and manages the Spark environment, handling scaling, dependencies, and infrastructure. You only pay for the resources used during job execution.
Option A is INCORRECT. Dataflow uses Apache Beam, not Spark, so it's not suitable for existing Spark code. While Dataflow is also serverless, it requires code written in the Beam programming model, which would mean rewriting existing Spark transformations.
Option C is INCORRECT. Cloud Data Fusion provides a visual interface for ETL but uses Dataproc clusters behind the scenes, which still requires some cluster configuration. It's not optimized for running custom Spark code without modifications.
Option D is INCORRECT. BigQuery uses SQL for transformations, not Apache Spark, so existing Spark code couldn't be directly utilized. While BigQuery is serverless, it doesn't support the Spark programming model or APIs.
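A minimal sketch of submitting an existing PySpark script as a Dataproc Serverless batch with the google-cloud-dataproc client; the project, region, bucket paths, and batch ID are assumptions for illustration.

```python
# Minimal sketch: run existing Spark code as a Dataproc Serverless batch.
from google.cloud import dataproc_v1

region = "us-central1"
client = dataproc_v1.BatchControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

batch = dataproc_v1.Batch(
    pyspark_batch=dataproc_v1.PySparkBatch(
        main_python_file_uri="gs://my-bucket/jobs/transform_products.py",  # existing Spark code
        args=["--input=gs://my-bucket/raw/", "--output=gs://my-bucket/clean/"],
    )
)

# Google Cloud provisions and scales the Spark environment; there is no cluster to manage.
operation = client.create_batch(
    parent=f"projects/my-project/locations/{region}",
    batch=batch,
    batch_id="product-transform-001",
)
print(operation.result().state)
```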
Question 15 Single Choice
A retail company needs to clean and standardize product data containing inconsistent formats across multiple fields. Which Google Cloud service provides the most user-friendly interface for this task?
Explanation

Click "Show Answer" to see the explanation here
Option A is CORRECT. Dataprep by Trifacta provides the most user-friendly interface for data cleaning and standardization tasks because it offers a visual, interactive environment specifically designed for data preparation. It includes intelligent data profiling that automatically identifies inconsistencies, outliers, and formatting issues, suggests transformations based on detected patterns in the data, and provides visual feedback on transformations without requiring coding skills.
Option B is INCORRECT. Dataflow requires coding in Java, Python, or using templates, making it less user-friendly for data cleaning tasks. While powerful for data processing, it lacks the visual interface and automatic pattern detection that makes Dataprep more accessible for standardization tasks.
Option C is INCORRECT. BigQuery requires SQL knowledge to perform data cleaning and standardization. While it can perform powerful transformations, it doesn't offer the interactive, visual data profiling and cleaning capabilities that Dataprep provides.
Option D is INCORRECT. Dataproc is a managed Hadoop and Spark service that requires coding in languages like Scala, Python, or Java. It's designed for large-scale data processing but doesn't provide the interactive, user-friendly interface needed for visual data cleaning and standardization.
Question 16 Single Choice
When loading data into BigQuery, which statement about schema auto-detection is correct?
Explanation

Click "Show Answer" to see the explanation here
Option B is CORRECT. When using schema auto-detection in BigQuery, the system samples a portion of the input data to infer the schema. BigQuery selects a representative file from the source and scans up to the first 500 rows to determine column names and types, rather than reading the entire dataset. This approach balances performance (not needing to scan all of the data) with accuracy (getting enough data to make good inferences).
Option A is INCORRECT. Schema auto-detection works with multiple file formats including CSV, JSON, Avro, and Parquet, not just CSV. BigQuery can automatically detect schemas from various structured and semi-structured formats.
Option C is INCORRECT. While the auto-detection is generally good, it doesn't always detect the optimal column types, especially with edge cases or inconsistent data. This is why manual schema definition is sometimes preferred for production workloads to ensure precise type control.
Option D is INCORRECT. BigQuery doesn't require a JSON schema file for auto-detection; in fact, auto-detection is used when a schema is not explicitly provided. The purpose of auto-detection is to avoid having to create a schema definition manually.
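A minimal sketch of a load job with auto-detection enabled, using the google-cloud-bigquery client; the Cloud Storage URI and table name are hypothetical.

```python
# Minimal sketch: load a CSV file into BigQuery with schema auto-detection.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    autodetect=True,                      # infer column names and types from a sample of rows
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,                  # treat the first row as a header
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/exports/orders.csv",  # hypothetical source file
    "my-project.sales.orders",            # hypothetical destination table
    job_config=job_config,
)
load_job.result()  # wait for the load to finish

table = client.get_table("my-project.sales.orders")
print([(field.name, field.field_type) for field in table.schema])  # inspect the inferred schema
```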
Question 17 Single Choice
A company needs to migrate time-series data from an on-premises Cassandra database to Google Cloud. Which Google Cloud database is most suitable for time-series data with high write throughput?
Explanation

Click "Show Answer" to see the explanation here
Option C is CORRECT. Bigtable is the most suitable Google Cloud database for time-series data with high write throughput because it's specifically optimized for time-series workloads with a design that excels at writing and reading sequential data. It provides consistent single-digit millisecond latency at scale, can handle millions of operations per second, and its architecture allows it to scale horizontally for virtually unlimited capacity. Bigtable is also a NoSQL database like Cassandra, making the migration path more straightforward.
Option A is INCORRECT. Cloud SQL is a relational database that doesn't scale as effectively for high-throughput time-series data. It has limitations on write throughput and total database size that make it unsuitable for high-volume time-series workloads typically stored in Cassandra.
Option B is INCORRECT. Firestore is designed for mobile and web applications with document-based data models, not optimized for time-series. While it can handle reasonable write throughput, it's not specialized for the sequential access patterns and extreme write throughput common in time-series workloads.
Option D is INCORRECT. Spanner offers global consistency and relational features but at a higher cost and complexity than required for most time-series workloads. While Spanner can scale horizontally, Bigtable's simpler data model is more aligned with time-series data needs and provides better cost-performance for this specific use case.
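A minimal sketch of a time-series write, assuming the google-cloud-bigtable client and a hypothetical instance, table, and row-key layout:

```python
# Minimal sketch: write one sensor reading per row into Bigtable.
import datetime
from google.cloud import bigtable

client = bigtable.Client(project="my-project")
table = client.instance("timeseries-instance").table("sensor_readings")

# Row-key design matters for time series: device ID first to avoid hotspotting,
# then a timestamp so readings for a device stay contiguous and scannable by range.
ts = datetime.datetime.now(datetime.timezone.utc)
row_key = f"sensor-042#{ts:%Y%m%d%H%M%S}".encode("utf-8")

row = table.direct_row(row_key)
row.set_cell("metrics", "temperature_c", "81.7".encode("utf-8"), timestamp=ts)
row.commit()
```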
Question 18 Single Choice
A marketing analyst needs to calculate a 7-day rolling average of daily sales using BigQuery. Which SQL feature is most appropriate?
Explanation

Click "Show Answer" to see the explanation here
Option B is CORRECT. Window functions in BigQuery SQL are specifically designed for calculating values across a set of rows related to the current row, making them perfect for computing rolling averages. With window functions, you can define a "window" of rows (in this case, 7 days) and perform aggregate calculations over that window for each row. The syntax would use the AVG() function with an OVER clause that defines the window frame, such as: AVG(daily_sales) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW).
Option A is INCORRECT. Common Table Expressions (CTEs) help organize complex queries but don't directly provide rolling calculation functionality. They're useful for query readability and breaking down complex logic into manageable pieces, but you'd still need window functions within the CTE to calculate rolling averages.
Option C is INCORRECT. The UNNEST function is used for working with arrays, not for time-based calculations. It expands arrays into rows but doesn't provide functionality for calculating values across multiple rows like rolling averages.
Option D is INCORRECT. The GROUP BY clause aggregates data into groups but doesn't maintain the row-by-row context needed for rolling calculations. It collapses rows into summary values rather than maintaining the sequential relationship required for rolling averages.
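A minimal sketch of the query, run through the google-cloud-bigquery client; the project, dataset, and column names are hypothetical, and the frame assumes one row per day.

```python
# Minimal sketch: 7-day rolling average of daily sales using a window function.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT
  sale_date,
  daily_sales,
  AVG(daily_sales) OVER (
    ORDER BY sale_date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW   -- current day plus the 6 prior days
  ) AS rolling_7_day_avg
FROM `my-project.sales.daily_totals`
ORDER BY sale_date
"""

for row in client.query(sql).result():
    print(row.sale_date, row.rolling_7_day_avg)
```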
Question 19 Single Choice
Which BigQuery ML function would you use to evaluate the performance of a classification model?
Explanation

Click "Show Answer" to see the explanation here
Option D is CORRECT. ML.EVALUATE is the primary function used to evaluate the performance of any BigQuery ML model, including classification models. It returns standard evaluation metrics appropriate to the model type. For classification models specifically, ML.EVALUATE returns metrics such as precision, recall, accuracy, F1 score, log loss, and ROC AUC, providing a comprehensive assessment of model performance.
Option A is INCORRECT. ML.PREDICT is used to generate predictions with a trained model, not to evaluate its performance. While you could theoretically use predictions to manually calculate evaluation metrics, ML.EVALUATE does this automatically and comprehensively.
Option B is INCORRECT. ML.ROC_CURVE generates points on the ROC curve for classification models, which is useful but more specific than the comprehensive evaluation provided by ML.EVALUATE. It focuses on just one aspect of model evaluation rather than providing a complete set of metrics.
Option C is INCORRECT. ML.TRAINING_INFO returns information about the training process itself, such as iteration details and loss values during training, but doesn't evaluate performance on validation or test data. It's about how the model was trained, not how well it performs on new data.
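A minimal sketch, assuming a hypothetical trained classifier and a held-out evaluation table, run through the google-cloud-bigquery client:

```python
# Minimal sketch: evaluate a BigQuery ML classification model with ML.EVALUATE.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT *
FROM ML.EVALUATE(
  MODEL `my-project.analytics.churn_classifier`,
  (SELECT * FROM `my-project.analytics.churn_holdout`)   -- held-out evaluation data
)
"""

for row in client.query(sql).result():
    # For classifiers this includes precision, recall, accuracy, f1_score, log_loss, roc_auc.
    print(dict(row))
```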
Question 20 Single Choice
When comparing Looker and Looker Studio for data visualization needs, which factor would most strongly favor Looker?
Explanation

Click "Show Answer" to see the explanation here
Option B is CORRECT. The requirement for version-controlled data definitions strongly favors Looker because it uses LookML, a modeling language that allows developers to define business metrics, dimensions, calculations, and relationships in version-controlled files. This approach provides several key advantages: centralized definitions ensure consistency across all reports; changes to definitions can be tracked, reviewed, and rolled back using Git version control; development environments can be separated from production; and complex business logic can be defined once and reused across the organization.
Option A is INCORRECT. The need for ad-hoc, one-time reports actually favors Looker Studio over Looker. Looker Studio is designed for quicker, simpler report creation without the upfront modeling investment that Looker requires, making it better suited for one-off analyses.
Option C is INCORRECT. Integration with personal Google accounts is a strength of Looker Studio, not Looker. Looker Studio integrates seamlessly with personal Google accounts, while Looker typically requires enterprise setup and administration.
Option D is INCORRECT. The free usage tier is available for Looker Studio, not Looker. Looker is an enterprise product with licensing costs, while Looker Studio offers a free tier for individual users, making this factor favor Looker Studio rather than Looker.



