One of your encryption keys stored in Cloud Key Management Service (Cloud KMS) was exposed. You need to re-encrypt all of your CMEK-protected Cloud Storage data that used that key, and then delete the compromised key. You also want to reduce the risk of objects getting written without customer-managed encryption key (CMEK) protection in the future. What should you do?
To re-encrypt all of your CMEK-protected Cloud Storage data after a key has been exposed, and to ensure future writes are protected with a new key, creating a new Cloud KMS key and a new Cloud Storage bucket is the best approach. Here's why option C is the best choice:
Re-encryption of Data:
By creating a new Cloud Storage bucket and copying all objects from the old bucket to the new bucket while specifying the new Cloud KMS key, you ensure that all data is re-encrypted with the new key.
This process effectively re-encrypts the data, removing any dependency on the compromised key.
Ensuring CMEK Protection:
Creating a new bucket and setting the new CMEK as the default ensures that all future objects written to the bucket are automatically protected with the new key.
This reduces the risk of objects being written without CMEK protection.
Deletion of Compromised Key:
Once the data has been copied and re-encrypted, the old key can be safely deleted from Cloud KMS, eliminating the risk associated with the compromised key.
Steps to Implement:
Create a New Cloud KMS Key:
Create a new encryption key in Cloud KMS to replace the compromised key.
Create a New Cloud Storage Bucket:
Create a new Cloud Storage bucket and set the default CMEK to the new key.
Copy and Re-encrypt Data:
Use the gsutil tool to copy data from the old bucket to the new bucket. Because the new bucket's default CMEK is already set, every copied object is re-encrypted with the new key (you can also pass the key explicitly with -o "GSUtil:encryption_key=KEY_RESOURCE"):
gsutil -m cp -r gs://old-bucket/* gs://new-bucket/
Delete the Old Key:
After ensuring all data is copied and re-encrypted, delete the compromised key from Cloud KMS.
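As a concrete illustration, here is a hedged command-line sketch of these steps. All names (my-project, us-central1, my-keyring, new-key, old-key, new-bucket) are placeholders, and the key ring is assumed to already exist; substitute your own values.
# Create a replacement key on an existing key ring.
gcloud kms keys create new-key \
  --keyring=my-keyring \
  --location=us-central1 \
  --purpose=encryption
# Create the new bucket, authorize the Cloud Storage service agent on the new key, and set it as the bucket's default CMEK.
gsutil mb -l us-central1 gs://new-bucket
gsutil kms authorize -p my-project \
  -k projects/my-project/locations/us-central1/keyRings/my-keyring/cryptoKeys/new-key
gsutil kms encryption \
  -k projects/my-project/locations/us-central1/keyRings/my-keyring/cryptoKeys/new-key \
  gs://new-bucket
# After copying the objects (see the cp command in the step above) and verifying them, destroy the compromised key's versions (repeat per version).
gcloud kms keys versions destroy 1 \
  --key=old-key --keyring=my-keyring --location=us-central1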
Cloud KMS Documentation
Cloud Storage Encryption
Re-encrypting Data in Cloud Storage
You are running your BigQuery project in the on-demand billing model and are executing a change data capture (CDC) process that ingests data. The CDC process loads 1 GB of data every 10 minutes into a temporary table, and then performs a merge into a 10 TB target table. This process is very scan intensive and you want to explore options to enable a predictable cost model. You need to create a BigQuery reservation based on utilization information gathered from BigQuery Monitoring and apply the reservation to the CDC process. What should you do?
https://cloud.google.com/blog/products/data-analytics/manage-bigquery-costs-with-custom-quotas.
Here's why creating a BigQuery reservation for the project is the most suitable solution:
Project-Level Reservation: A BigQuery reservation is assigned to your project, so the reserved slots (processing capacity) are shared across all jobs and queries running within that project. Since your CDC process is a significant contributor to your BigQuery usage, reserving slots for the entire project ensures that the CDC process always has access to the capacity it needs, regardless of other activity in the project.
Predictable Cost Model: Reservations provide a fixed, predictable cost model. Instead of paying the on-demand rate per TB of data scanned by each query, you pay a fixed fee for the slot capacity you reserve. This eliminates the cost variability of on-demand billing, making it easier to budget and forecast your BigQuery expenses.
BigQuery Monitoring: You can use BigQuery Monitoring to analyze the historical usage patterns of your CDC process and other queries within your project. This information helps you determine the appropriate amount of slots to reserve, ensuring that you have enough capacity to handle your workload while optimizing costs.
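To quantify that utilization from the command line as well, here is a minimal sketch that averages slot usage per hour over the last week; it assumes your jobs run in the US region and that you can list jobs in the project.
bq query --use_legacy_sql=false '
SELECT
  TIMESTAMP_TRUNC(period_start, HOUR) AS usage_hour,
  SUM(period_slot_ms) / (1000 * 60 * 60) AS avg_slots_used
FROM `region-us`.INFORMATION_SCHEMA.JOBS_TIMELINE_BY_PROJECT
WHERE job_creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
GROUP BY usage_hour
ORDER BY usage_hour'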
Why other options are not suitable:
A. Create a BigQuery reservation for the job: BigQuery does not support reservations at the individual job level. Reservation assignments are made at the project, folder, or organization level.
B. Create a BigQuery reservation for the service account running the job: Reservation assignments cannot target individual users or service accounts; they can only be made to a project, folder, or organization.
C. Create a BigQuery reservation for the dataset: BigQuery does not support reservations at the dataset level.
By creating a BigQuery reservation for your project based on your utilization analysis, you can achieve a predictable cost model while ensuring that your CDC process and other queries have the necessary resources to run smoothly.
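Here is a hedged command-line sketch of creating and assigning the reservation. The administration project (admin-project), workload project (cdc-project), and the 100-slot baseline are illustrative; size the reservation from the utilization data gathered above, and note that depending on your pricing model you may also need an edition (for example --edition=ENTERPRISE) or a slot commitment in the administration project.
# Create the reservation in the administration project.
bq mk --project_id=admin-project --location=US \
  --reservation --slots=100 cdc_reservation
# Assign the CDC project's query jobs to the reservation.
bq mk --project_id=admin-project --location=US \
  --reservation_assignment \
  --reservation_id=admin-project:US.cdc_reservation \
  --assignee_type=PROJECT \
  --assignee_id=cdc-project \
  --job_type=QUERY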
You are migrating your on-premises data warehouse to BigQuery. As part of the migration, you want to facilitate cross-team collaboration to get the most value out of the organization's data. You need to design an architecture that would allow teams within the organization to securely publish, discover, and subscribe to read-only data in a self-service manner. You need to minimize costs while also maximizing data freshness. What should you do?
To provide a cost-effective storage and processing solution that allows data scientists to explore data similarly to using the on-premises HDFS cluster with SQL on the Hive query engine, deploying a Dataproc cluster is the best choice. Here's why:
Compatibility with Hive:
Dataproc is a fully managed Apache Spark and Hadoop service that provides native support for Hive, making it easy for data scientists to run SQL queries on the data as they would in an on-premises Hadoop environment.
This ensures that the transition to Google Cloud is smooth, with minimal changes required in the workflow.
Cost-Effective Storage:
Storing the ORC files in Cloud Storage is cost-effective and scalable, providing a reliable and durable storage solution that integrates seamlessly with Dataproc.
Cloud Storage allows you to store large datasets at a lower cost than keeping them on persistent disks in a long-running HDFS cluster, and the data remains available even when the cluster is deleted.
Hive Integration:
Dataproc supports running Hive directly, which is essential for data scientists familiar with SQL on the Hive query engine.
This setup enables the use of existing Hive queries and scripts without significant modifications.
Steps to Implement:
Copy ORC Files to Cloud Storage:
Transfer the ORC files from the on-premises HDFS cluster to Cloud Storage, ensuring they are organized in a similar directory structure.
Deploy Dataproc Cluster:
Set up a Dataproc cluster configured to run Hive. Ensure that the cluster has access to the ORC files stored in Cloud Storage.
Configure Hive:
Configure Hive on Dataproc to read from the ORC files in Cloud Storage. This can be done by setting up external tables in Hive that point to the Cloud Storage location.
Provide Access to Data Scientists:
Grant the data scientist team access to the Dataproc cluster and the necessary permissions to interact with the Hive tables.
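Here is a hedged sketch of these steps from the command line. The bucket, cluster, region, source path, and table schema (orc-bucket, hive-cluster, us-central1, clickstream with three example columns) are all placeholders.
# Copy the ORC files to Cloud Storage (from an edge node with gsutil, or with hadoop distcp / Storage Transfer Service for large volumes).
gsutil -m cp -r /data/warehouse/clickstream gs://orc-bucket/warehouse/
# Create a Dataproc cluster; Hive is part of the standard image.
gcloud dataproc clusters create hive-cluster \
  --region=us-central1 \
  --num-workers=2
# Define an external Hive table over the ORC files in Cloud Storage.
gcloud dataproc jobs submit hive \
  --cluster=hive-cluster \
  --region=us-central1 \
  --execute="CREATE EXTERNAL TABLE clickstream (user_id STRING, url STRING, ts TIMESTAMP) STORED AS ORC LOCATION 'gs://orc-bucket/warehouse/clickstream/';"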
Dataproc Documentation
Hive on Dataproc
Google Cloud Storage Documentation
A web server sends click events to a Pub/Sub topic as messages. The web server includes an eventTimestamp attribute in the messages, which is the time when the click occurred. You have a Dataflow streaming job that reads from this Pub/Sub topic through a subscription, applies some transformations, and writes the result to another Pub/Sub topic for use by the advertising department. The advertising department needs to receive each message within 30 seconds of the corresponding click occurrence, but they report receiving the messages late. Your Dataflow job's system lag is about 5 seconds, and the data freshness is about 40 seconds. Inspecting a few messages shows no more than 1 second of lag between their eventTimestamp and publishTime. What is the problem and what should you do?
To ensure that the advertising department receives messages within 30 seconds of the click occurrence, and given the current system lag and data freshness metrics, the issue likely lies in the processing capacity of the Dataflow job. Here's why option B is the best choice:
System Lag and Data Freshness:
The system lag of 5 seconds indicates that Dataflow itself is processing messages relatively quickly.
However, the data freshness of 40 seconds suggests a significant delay before processing begins, indicating a backlog.
Backlog in Pub/Sub Subscription:
A backlog occurs when the rate of incoming messages exceeds the rate at which the Dataflow job can process them, causing delays.
Optimizing the Dataflow Job:
To handle the incoming message rate, the Dataflow job needs to be optimized or scaled up by increasing the number of workers, ensuring it can keep up with the message inflow.
Steps to Implement:
Analyze the Dataflow Job:
Inspect the Dataflow job metrics to identify bottlenecks and inefficiencies.
Optimize Processing Logic:
Optimize the transformations and operations within the Dataflow pipeline to improve processing efficiency.
Increase Number of Workers:
Scale the Dataflow job by increasing the number of workers to handle the higher load, reducing the backlog.
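Here is a hedged sketch of scaling the job, assuming the pipeline is written with the Beam Python SDK; the script name, project, and job name (clickstream_pipeline.py, my-project, clickstream-enrichment) are placeholders, and --update replaces the running streaming job in place.
python clickstream_pipeline.py \
  --runner=DataflowRunner \
  --project=my-project \
  --region=us-central1 \
  --streaming \
  --update \
  --job_name=clickstream-enrichment \
  --autoscaling_algorithm=THROUGHPUT_BASED \
  --max_num_workers=20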
Dataflow Monitoring
Scaling Dataflow Jobs
You have a BigQuery dataset named "customers". All tables will be tagged by using a Data Catalog tag template named "gdpr". The template contains one mandatory field, "has sensitive data", with a boolean value. All employees must be able to do a simple search and find tables in the dataset that have either true or false in the "has sensitive data" field. However, only the Human Resources (HR) group should be able to see the data inside the tables for which "has sensitive data" is true. You give the all employees group the bigquery.metadataViewer and bigquery.connectionUser roles on the dataset. You want to minimize configuration overhead. What should you do next?
To ensure that all employees can search and find tables with GDPR tags while restricting data access to sensitive tables only to the HR group, follow these steps:
Data Catalog Tag Template:
Use Data Catalog to create a tag template named 'gdpr' with a boolean field 'has sensitive data'. Set the visibility to public so all employees can see the tags.
Roles and Permissions:
Assign the datacatalog.tagTemplateViewer role to the all employees group. This role allows users to view the tags and search for tables based on the 'has sensitive data' field.
Assign the bigquery.dataViewer role to the HR group specifically on tables that contain sensitive data. This ensures only HR can access the actual data in these tables.
Steps to Implement:
Create the GDPR Tag Template:
Define the tag template in Data Catalog with the necessary fields and set visibility to public.
Assign Roles:
Grant the datacatalog.tagTemplateViewer role to the all employees group for visibility into the tags.
Grant the bigquery.dataViewer role to the HR group on tables marked as having sensitive data.
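Here is a hedged command-line sketch of the two grants; the group addresses, template location, and table name (all-employees@example.com, hr@example.com, us-central1, customers.credit_cards) are placeholders.
# Let all employees view the gdpr tag template so they can search on its field values.
gcloud data-catalog tag-templates add-iam-policy-binding gdpr \
  --location=us-central1 \
  --member=group:all-employees@example.com \
  --role=roles/datacatalog.tagTemplateViewer
# Give only the HR group read access to the data in a table tagged as sensitive.
bq add-iam-policy-binding \
  --member=group:hr@example.com \
  --role=roles/bigquery.dataViewer \
  customers.credit_cards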
Data Catalog Documentation
Managing Access Control in BigQuery
IAM Roles in Data Catalog