

Associate Data Practitioner - Google Cloud Certified Exam Questions
Question 1 Single Choice
A manufacturing company needs to migrate 50TB of sensor data from on-premises servers to Google Cloud with minimal network impact. Which service is most appropriate?
A. Cloud Storage Transfer Service
B. Transfer Appliance
C. Database Migration Service
D. gsutil
Explanation

Click "Show Answer" to see the explanation here
Option B is CORRECT. Transfer Appliance is designed for large data transfers (tens to hundreds of terabytes) where network bandwidth is limited or costly. Google ships the appliance to your location, you load the data onto it locally (avoiding network transfer), and you ship it back to Google, which uploads the contents to Cloud Storage. This approach is ideal for the scenario's 50TB of sensor data, where minimizing network impact is an explicit requirement.
Option A is INCORRECT. Cloud Storage Transfer Service works over the network and would still consume significant bandwidth, which conflicts with the requirement to minimize network impact for this large data volume.
Option C is INCORRECT. Database Migration Service is designed specifically for migrating database content from one database system to another, not for raw sensor data or general file migration.
Option D is INCORRECT. The gsutil command-line utility transfers data over the network and would be inefficient for this volume of data, causing significant network traffic that the scenario aims to avoid.
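For a rough sense of why pushing 50TB over the wire conflicts with "minimal network impact", the back-of-the-envelope Python calculation below assumes a fully saturated 1 Gbps uplink (an illustrative figure, not from the scenario):

    # Time to move 50 TB over the network at an assumed, fully saturated 1 Gbps uplink.
    data_bits = 50 * 1e12 * 8        # 50 TB (decimal terabytes) expressed in bits
    uplink_bps = 1e9                 # assumed 1 Gbps of usable bandwidth
    days = data_bits / uplink_bps / 86400
    print(f"~{days:.1f} days at 1 Gbps")   # ~4.6 days; roughly 46 days at 100 Mbps

Even under that optimistic assumption the transfer monopolizes the link for days, which is exactly the network impact the Transfer Appliance avoids.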
Question 2 Single Choice
Which data transformation pattern is most appropriate when you need to perform heavy transformations on a large dataset and your target system is BigQuery?
A. ETL (Extract, Transform, Load)
B. ELT (Extract, Load, Transform)
C. ETLT (Extract, Transform, Load, Transform)
D. Batch processing
Explanation

Click "Show Answer" to see the explanation here
Option B is CORRECT. ELT is the most appropriate pattern when working with BigQuery as the target system for large datasets. In this approach, data is extracted from source systems and loaded directly into BigQuery without pre-transformation. Once in BigQuery, you can leverage its massive parallel processing capabilities to perform transformations using SQL, which is more efficient for large datasets than transforming before loading. BigQuery is specifically designed for this pattern, as it can handle the compute-intensive transformation work at scale.
Option A is INCORRECT. ETL would require separate processing resources before loading data into BigQuery, which is less efficient when BigQuery itself has powerful transformation capabilities. This approach doesn't take advantage of BigQuery's strengths.
Option C is INCORRECT. ETLT adds unnecessary complexity when BigQuery can handle all transformations at once after loading. The additional transformation step before loading is redundant when working with BigQuery.
Option D is INCORRECT. Batch processing refers to a processing method rather than a data transformation pattern. It describes how data is processed (in batches rather than continuously) but doesn't specify the sequence of extraction, transformation, and loading operations.
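As an illustration of the ELT flow, here is a minimal sketch using the google-cloud-bigquery Python client: raw files are loaded untouched into a staging table, and the transformation runs as SQL inside BigQuery. The project, dataset, bucket, and column names are placeholders.

    from google.cloud import bigquery

    client = bigquery.Client()  # uses Application Default Credentials

    # Extract + Load: land the raw CSV files in a staging table as-is.
    load_job = client.load_table_from_uri(
        "gs://example-bucket/raw/orders/*.csv",          # placeholder source URIs
        "example-project.staging.raw_orders",            # placeholder staging table
        job_config=bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.CSV,
            autodetect=True,
            skip_leading_rows=1,
            write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
        ),
    )
    load_job.result()  # wait for the load to complete

    # Transform: push the heavy work into BigQuery's engine with SQL.
    transform_sql = """
    CREATE OR REPLACE TABLE `example-project.analytics.orders_clean` AS
    SELECT
      order_id,
      LOWER(TRIM(customer_email)) AS customer_email,
      SAFE_CAST(order_total AS NUMERIC) AS order_total,
      DATE(SAFE_CAST(order_timestamp AS TIMESTAMP)) AS order_date
    FROM `example-project.staging.raw_orders`
    WHERE order_id IS NOT NULL
    """
    client.query(transform_sql).result()

The design point is that the transformation step consumes BigQuery slots rather than a separate processing cluster, which is what makes ELT attractive at this scale.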
Question 3 Single Choice
A data engineer needs to assess data quality for a customer dataset before processing. Which approach is most effective?
A. Run automated validation checks (null values, data types, value ranges, business rules) across the entire dataset
B. Manually review a small sample of the data
C. Rely on downstream systems to handle data quality issues
D. Convert all fields to strings
Explanation

Click "Show Answer" to see the explanation here
Option A is CORRECT. Automated validation checks provide a systematic and comprehensive approach to data quality assessment that can be applied to the entire dataset, not just samples. This includes checking for null values in required fields, verifying data types match expected formats, ensuring values fall within acceptable ranges, and validating adherence to business rules. This approach is scalable and repeatable, making it the most effective method for ensuring data quality before processing.
Option B is INCORRECT. Manual review of a small sample is time-consuming and prone to missing issues in larger datasets. It doesn't scale well and may miss patterns or issues that occur infrequently in the data.
Option C is INCORRECT. Relying on downstream systems to handle data quality issues is risky as it can lead to processing errors or incorrect results. It also pushes the problem downstream rather than addressing it at the source, potentially causing cascading issues.
Option D is INCORRECT. Converting all fields to strings masks underlying data quality issues rather than addressing them properly. This approach can cause problems with data analysis, lose important type information, and create additional work when the proper types need to be restored later.
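One way to express such automated checks is a single BigQuery SQL query driven from Python, as in the sketch below. The table, column names, and thresholds are placeholders; the rules shown (nulls, castability, ranges, format) mirror the categories listed above.

    from google.cloud import bigquery

    client = bigquery.Client()

    # Each COUNTIF expresses one validation rule; a non-zero count flags an issue.
    validation_sql = """
    SELECT
      COUNT(*)                                               AS total_rows,
      COUNTIF(customer_id IS NULL)                           AS null_customer_id,
      COUNTIF(SAFE_CAST(signup_date AS DATE) IS NULL)        AS unparseable_signup_date,
      COUNTIF(age < 0 OR age > 120)                          AS age_out_of_range,
      COUNTIF(NOT REGEXP_CONTAINS(email, r'^[^@]+@[^@]+$'))  AS malformed_email
    FROM `example-project.staging.customers`
    """
    row = list(client.query(validation_sql).result())[0]
    failures = {k: v for k, v in row.items() if k != "total_rows" and v > 0}
    if failures:
        raise ValueError(f"Data quality checks failed: {failures}")

Because the checks run as one query over the full table, they scale with the dataset and can be rerun automatically before every processing cycle.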
Question 4 Single Choice
Which data format would be most efficient for analytical queries on columnar data in BigQuery?
A. CSV
B. JSON
C. Parquet
D. Avro
Explanation

Click "Show Answer" to see the explanation here
Option C is CORRECT. Parquet is a columnar storage format that is specifically designed for efficient analytical processing. It stores data by column rather than by row, allowing BigQuery to read only the columns needed for a query rather than scanning entire rows. It includes built-in compression that works well with columnar data, preserves data types and schema information, and supports efficient predicate pushdown for filtering. These features make Parquet significantly more efficient than row-based formats for analytical workloads.
Option A is INCORRECT. CSV is a row-based format that requires BigQuery to scan entire rows even when only specific columns are needed for a query. It also lacks built-in compression and doesn't preserve data types, making it less efficient for analytical queries on large datasets.
Option B is INCORRECT. JSON is also a row-based format and tends to be verbose, increasing storage requirements and scan times. While it preserves data structure better than CSV, it's still not optimized for columnar analytical processing like Parquet is.
Option D is INCORRECT. While Avro is a good binary format that preserves schema, it uses row-based storage which is less efficient for analytical queries that typically access only a subset of columns. Avro is better suited for record-oriented processing rather than column-oriented analytics.
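The pyarrow sketch below illustrates the columnar idea outside BigQuery: a Parquet reader pulls only the requested columns and can consult per-column statistics in the file footer, whereas a CSV reader must scan every row in full. The file and column names are placeholders.

    import pyarrow.parquet as pq

    # Column pruning: only the two requested columns are read from the file;
    # the remaining columns are skipped entirely.
    table = pq.read_table(
        "sensor_readings.parquet",              # placeholder file
        columns=["device_id", "temperature"],   # placeholder column names
    )
    print(table.num_rows, table.schema)

    # The footer also stores schema and per-row-group column statistics,
    # which query engines use for predicate pushdown (skipping row groups).
    metadata = pq.ParquetFile("sensor_readings.parquet").metadata
    print(metadata.num_row_groups, metadata.row_group(0).column(0).statistics)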
Question 5 Single Choice
A retail company needs to load product catalog data from an on-premises MySQL database into BigQuery daily. Which service would be most efficient?
A. BigQuery Data Transfer Service
B. Database Migration Service
C. Dataflow with a JDBC connector
D. Manual exports and imports
Explanation

Click "Show Answer" to see the explanation here
Option A is CORRECT. BigQuery Data Transfer Service (DTS) is specifically designed to automate data movement into BigQuery from various sources on a scheduled basis. For a daily transfer of product catalog data, DTS provides the most efficient solution as it can be configured to automatically run the transfer job daily without manual intervention. It handles the extraction, transfer, and loading processes in an optimized way.
Option B is INCORRECT. Database Migration Service is designed for one-time migrations or continuous replication of entire databases, which is excessive for a daily catalog update. It's meant for migrating databases to Google Cloud, not for recurring data transfers to BigQuery.
Option C is INCORRECT. Dataflow with JDBC connector requires custom pipeline development and maintenance. While flexible, it introduces unnecessary complexity for a straightforward database-to-BigQuery transfer that can be handled by a specialized service like DTS.
Option D is INCORRECT. Manual exports and imports would require daily human intervention, making it inefficient and error-prone for a recurring task. This approach doesn't scale well and introduces opportunities for human error in the regular data transfer process.
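If a suitable DTS connector is available for the source, scheduling the daily run looks roughly like the sketch below, using the google-cloud-bigquery-datatransfer Python client. The data_source_id and params shown here are hypothetical placeholders; the exact values depend on the connector enabled in your project, so verify the available data sources before relying on this.

    from google.cloud import bigquery_datatransfer

    client = bigquery_datatransfer.DataTransferServiceClient()

    transfer_config = bigquery_datatransfer.TransferConfig(
        destination_dataset_id="retail_catalog",     # placeholder dataset
        display_name="Daily product catalog load",
        data_source_id="example_mysql_connector",    # hypothetical connector id
        params={
            "host": "onprem-mysql.example.internal", # placeholder connection details
            "database": "catalog",
            "table": "products",
        },
        schedule="every 24 hours",                   # daily run, no manual steps
    )

    created = client.create_transfer_config(
        parent=client.common_project_path("example-project"),  # placeholder project
        transfer_config=transfer_config,
    )
    print(f"Created transfer config: {created.name}")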
Question 6 Single Choice
Which Google Cloud service is most appropriate for cleaning and transforming data with a visual, code-free interface?
A. BigQuery
B. Dataproc
C. Cloud Data Fusion
D. Dataflow
Explanation

Click "Show Answer" to see the explanation here
Option C is CORRECT. Cloud Data Fusion is purpose-built for data integration with a visual, code-free interface. It provides a graphical environment where users can build data pipelines by dragging and dropping connectors, transformations, and other components. This makes it ideal for data cleaning and transformation tasks without requiring coding expertise. It includes pre-built connectors, transformations, and data validation capabilities specifically designed for ETL workflows.
Option A is INCORRECT. BigQuery is a data warehouse that requires SQL for transformations. While powerful for data analysis and transformation, it doesn't provide a visual, code-free interface for building data pipelines.
Option B is INCORRECT. Dataproc is a managed Spark and Hadoop service that requires coding in Spark, Scala, or Python. It's designed for data processing but doesn't offer a visual interface for building transformations without code.
Option D is INCORRECT. Dataflow is a stream and batch processing service that requires coding in Java, Python, or using templates. While it offers some pre-built templates, it doesn't provide a fully visual, drag-and-drop interface for creating custom data transformations without coding experience.
Question 7 Single Choice
A company is ingesting semi-structured JSON data with nested fields and arrays. Which Google Cloud storage solution is most appropriate?
A. Cloud SQL
B. Cloud Storage
C. BigQuery
D. Firestore
Explanation

Click "Show Answer" to see the explanation here
Option C is CORRECT. BigQuery is specifically designed to work well with semi-structured data like JSON with nested fields and arrays. It natively supports JSON ingestion and provides functions for parsing and querying nested and repeated fields without needing to flatten the data first. BigQuery's schema auto-detection can automatically identify nested structures in JSON data, and its SQL dialect includes functions specifically for working with arrays and nested records.
Option A is INCORRECT. Cloud SQL is a relational database that doesn't natively support nested JSON structures without additional processing. Storing JSON in Cloud SQL would typically require flattening the data or storing it as text and parsing it during queries, which is inefficient.
Option B is INCORRECT. Cloud Storage can store JSON files but doesn't provide query capabilities. It's good for storing the raw data but would require additional services to analyze and query the nested structures.
Option D is INCORRECT. Firestore is a NoSQL document database good for operational access patterns but not optimized for analytical queries on large volumes of semi-structured data. While it handles document structures well, it's designed for application data access rather than analytical processing.
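A short sketch of querying such data in place: nested RECORD fields are addressed with dot notation and repeated fields are expanded with UNNEST, with no flattening step beforehand. Table and field names are placeholders for a schema with a nested address record and a repeated orders array.

    from google.cloud import bigquery

    client = bigquery.Client()

    # Nested RECORD fields use dot notation; repeated fields (arrays) are
    # expanded on demand with UNNEST -- no pre-flattening of the JSON needed.
    sql = """
    SELECT
      c.customer_id,
      c.address.city,        -- nested record field
      o.sku,                 -- element of the repeated `orders` array
      o.amount
    FROM `example-project.sales.customers` AS c,
         UNNEST(c.orders) AS o
    WHERE c.address.country = 'DE'
    """
    for row in client.query(sql).result():
        print(row.customer_id, row.city, row.sku, row.amount)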
Question 8 Single Choice
When loading data into BigQuery from Cloud Storage, which file format would provide the fastest load times for a 500GB structured dataset?
A. CSV (gzip-compressed)
B. JSON (gzip-compressed)
C. Parquet
D. Avro
Explanation

Click "Show Answer" to see the explanation here
Option C is CORRECT. Parquet offers the fastest load times for large structured datasets into BigQuery because it's a columnar format that BigQuery can process very efficiently. It includes built-in compression, type information, and schema metadata. BigQuery can load Parquet files without needing to infer the schema, which speeds up the loading process. Parquet's structure allows for parallel processing during the load operation, making it ideal for large datasets like 500GB.
Option A is INCORRECT. CSV is a text format that BigQuery must parse row by row and, unless a schema is provided, also infer the schema for. A gzip-compressed CSV file additionally cannot be split, so it cannot be loaded in parallel. Compression reduces transfer size, but the format is still slower to process than a binary columnar format like Parquet for a dataset of this size.
Option B is INCORRECT. JSON with gzip compression faces similar issues to CSV but is even more verbose due to the repeated field names in each record. This verbosity increases the amount of data to process, making it slower than Parquet for large datasets.
Option D is INCORRECT. Avro is also a good choice and preserves schema information, but its row-based nature makes it slightly less efficient than Parquet for loading into BigQuery's columnar storage system. While better than text formats, it's still not as optimized as Parquet for BigQuery's architecture.
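A minimal load sketch with the google-cloud-bigquery Python client; bucket, project, dataset, and table names are placeholders. Note that no schema or autodetect option is needed because the Parquet files carry their own schema.

    from google.cloud import bigquery

    client = bigquery.Client()

    job = client.load_table_from_uri(
        "gs://example-bucket/exports/part-*.parquet",   # placeholder source URIs
        "example-project.warehouse.events",             # placeholder destination table
        job_config=bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.PARQUET,
            # No schema or autodetect needed: Parquet embeds its own schema,
            # and the compressed, columnar layout loads in parallel.
        ),
    )
    job.result()  # block until the load job finishes

    table = client.get_table("example-project.warehouse.events")
    print(f"Loaded {table.num_rows} rows")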
Question 9 Single Choice
Which storage class in Cloud Storage would be most cost-effective for log files that must be retained for compliance but are accessed only during annual audits?
A. Standard Storage
B. Nearline Storage
C. Coldline Storage
D. Archive Storage
Explanation

Click "Show Answer" to see the explanation here
Option D is CORRECT. Archive Storage is specifically designed for data retention and long-term storage of data that is rarely accessed (less than once per year), making it perfect for compliance logs that are only accessed during annual audits. It offers the lowest storage costs among all Google Cloud Storage classes, but with higher retrieval costs and minimum storage duration (365 days). Since the scenario specifies annual access patterns, the higher retrieval costs are acceptable given the significant savings on storage costs throughout the year.
Option A is INCORRECT. Standard Storage would be unnecessarily expensive for rarely accessed data. It's designed for frequently accessed data (hot storage) and has the highest storage costs among the options, making it cost-ineffective for files only accessed annually.
Option B is INCORRECT. Nearline Storage is optimized for data accessed less than once per month, which is still too frequent for this use case. While cheaper than Standard Storage, it's still more expensive than Archive Storage for data accessed only once per year.
Option C is INCORRECT. Coldline Storage is designed for data accessed less than once per quarter, which is still more frequent than the annual access pattern described. Though less expensive than Nearline, it's not as cost-effective as Archive Storage for the annual access pattern in this scenario.
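Two common ways to land data in the Archive class are sketched below with the google-cloud-storage Python client: a bucket whose default class is ARCHIVE, or a lifecycle rule that transitions objects after an age threshold. The bucket name, location, and 30-day threshold are placeholders.

    from google.cloud import storage

    client = storage.Client()

    # Option 1: a dedicated bucket whose default class is ARCHIVE, so
    # compliance logs are written straight into the cheapest storage class.
    bucket = client.bucket("example-compliance-logs")     # placeholder bucket name
    bucket.storage_class = "ARCHIVE"
    client.create_bucket(bucket, location="us-central1")  # placeholder location

    # Option 2 (alternative pattern): keep fresh logs in STANDARD and let a
    # lifecycle rule move them to ARCHIVE once they are 30 days old.
    bucket.add_lifecycle_set_storage_class_rule("ARCHIVE", age=30)
    bucket.patch()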
Question 10 Single Choice
A data engineer needs to transfer 5TB of data from an AWS S3 bucket to Google Cloud Storage. Which service should they use?
A. Storage Transfer Service
B. Transfer Appliance
C. gsutil
D. BigQuery Data Transfer Service
Explanation

Click "Show Answer" to see the explanation here
Option A is CORRECT. Storage Transfer Service is specifically designed for transferring data between cloud storage providers, including from AWS S3 to Google Cloud Storage. It provides managed, high-performance transfers directly between the cloud providers without needing to download and re-upload the data. For 5TB of data, this is the most efficient option as it handles authentication, scheduling, and monitoring of the transfer process automatically.
Option B is INCORRECT. Transfer Appliance is unnecessary for this amount of data when a direct cloud-to-cloud transfer is possible. Transfer Appliance is more appropriate for hundreds of terabytes or petabytes of data, or when network bandwidth is severely limited, which isn't indicated in this scenario.
Option C is INCORRECT. The gsutil command-line tool would require downloading the data to a local machine first and then uploading to GCS, which is inefficient for 5TB. This approach would consume significant bandwidth, time, and local storage resources.
Option D is INCORRECT. BigQuery Data Transfer Service is designed for loading data into BigQuery, not for general storage transfers between cloud providers. It's optimized for analytics data sources rather than general file transfers between cloud storage systems.
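A sketch of the same transfer with the google-cloud-storage-transfer Python client; the project, bucket names, and AWS credentials are placeholders, and in practice the credentials would come from a secret manager rather than literals.

    from google.cloud import storage_transfer

    client = storage_transfer.StorageTransferServiceClient()

    transfer_job = client.create_transfer_job(
        storage_transfer.CreateTransferJobRequest(
            transfer_job={
                "project_id": "example-project",              # placeholder project
                "description": "One-time S3 -> Cloud Storage copy",
                "status": storage_transfer.TransferJob.Status.ENABLED,
                "transfer_spec": {
                    "aws_s3_data_source": {
                        "bucket_name": "example-s3-bucket",   # placeholder source bucket
                        "aws_access_key": {
                            "access_key_id": "AWS_ACCESS_KEY_ID",          # placeholder
                            "secret_access_key": "AWS_SECRET_ACCESS_KEY",  # placeholder
                        },
                    },
                    "gcs_data_sink": {"bucket_name": "example-gcs-bucket"},  # placeholder sink
                },
            }
        )
    )

    # A job created without a schedule runs only when triggered explicitly.
    client.run_transfer_job(
        {"job_name": transfer_job.name, "project_id": "example-project"}
    )

The service copies objects directly between the two clouds, so nothing is downloaded to or re-uploaded from a local machine.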



