Google Exam Professional Data Engineer Topic 3 Question 98 Discussion

Actual exam question for Google's Professional Data Engineer exam

Question #: 98
Topic #: 3

You are migrating your on-premises data warehouse to BigQuery. As part of the migration, you want to facilitate cross-team collaboration to get the most value out of the organization's data. You need to design an architecture that would allow teams within the organization to securely publish, discover, and subscribe to read-only data in a self-service manner. You need to minimize costs while also maximizing data freshness. What should you do?

A. Create authorized datasets to publish shared data in the subscribing team's project.

B. Create a new dataset for sharing in each individual team's project. Grant the subscribing team the bigquery.dataViewer role on the dataset.

C. Use BigQuery Data Transfer Service to copy datasets to a centralized BigQuery project for sharing.

D. Use Analytics Hub to facilitate data sharing.

Suggested Answer: C

To provide a cost-effective storage and processing solution that allows data scientists to explore data similarly to using the on-premises HDFS cluster with SQL on the Hive query engine, deploying a Dataproc cluster is the best choice. Here's why:

Compatibility with Hive:

Dataproc is a fully managed Apache Spark and Hadoop service that provides native support for Hive, making it easy for data scientists to run SQL queries on the data as they would in an on-premises Hadoop environment.

This ensures that the transition to Google Cloud is smooth, with minimal changes required in the workflow.

Cost-Effective Storage:

Storing the ORC files in Cloud Storage is cost-effective and scalable, providing a reliable and durable storage solution that integrates seamlessly with Dataproc.

Cloud Storage allows you to store large datasets at a lower cost compared to other storage options.

Hive Integration:

Dataproc supports running Hive directly, which is essential for data scientists familiar with SQL on the Hive query engine.

This setup enables the use of existing Hive queries and scripts without significant modifications.

Steps to Implement:

Copy ORC Files to Cloud Storage:

Transfer the ORC files from the on-premises HDFS cluster to Cloud Storage, ensuring they are organized in a similar directory structure.

Deploy Dataproc Cluster:

Set up a Dataproc cluster configured to run Hive. Ensure that the cluster has access to the ORC files stored in Cloud Storage.

Configure Hive:

Configure Hive on Dataproc to read from the ORC files in Cloud Storage. This can be done by setting up external tables in Hive that point to the Cloud Storage location.

Provide Access to Data Scientists:

Grant the data scientist team access to the Dataproc cluster and the necessary permissions to interact with the Hive tables.
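
The "Configure Hive" step above can be sketched as a small helper that generates the DDL for an external ORC table. This is only an illustration under stated assumptions: the table name `sales_events`, its columns, and the bucket path `gs://example-bucket/warehouse/sales_events/` are hypothetical placeholders, not values from the question.

```python
"""Sketch: generate Hive DDL for an external ORC table stored in Cloud Storage."""


def external_orc_table_ddl(table, columns, gcs_location):
    """Return a CREATE EXTERNAL TABLE statement for ORC files at gcs_location.

    table        -- Hive table name (hypothetical in the example below)
    columns      -- list of (name, hive_type) pairs (hypothetical schema)
    gcs_location -- gs:// URI of the directory that holds the ORC files
    """
    if not gcs_location.startswith("gs://"):
        raise ValueError("expected a gs:// URI, got: " + gcs_location)
    if not columns:
        raise ValueError("at least one column is required")
    # One backtick-quoted column definition per line.
    column_sql = ",\n".join(f"  `{name}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        f"{column_sql}\n"
        f")\n"
        f"STORED AS ORC\n"
        f"LOCATION '{gcs_location}'"
    )


if __name__ == "__main__":
    # All names below are placeholders chosen for illustration.
    ddl = external_orc_table_ddl(
        "sales_events",
        [("event_id", "BIGINT"), ("event_ts", "TIMESTAMP"), ("amount", "DOUBLE")],
        "gs://example-bucket/warehouse/sales_events/",
    )
    print(ddl + ";")
```

Because the table is external, dropping it in Hive removes only the metadata and leaves the ORC files in Cloud Storage untouched. The generated statement could then be run from a Hive client on the cluster, or submitted as a Dataproc Hive job.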

by Dottie at Oct 21, 2024, 07:36 AM

Oliva

5 months ago

Option D is the way to go. Analytics Hub is the data-sharing equivalent of a one-stop-shop. It's like having a personal shopper for your data needs!

upvoted 0 times

...

I think option A is the best choice. It allows for secure data sharing and minimizes costs.

upvoted 0 times

...

5 months ago

Option D seems like the best choice here. Analytics Hub is designed specifically for secure data sharing, and it allows teams to discover and subscribe to data in a self-service way.

upvoted 0 times

...