The code block displayed below contains an error. The code block should trigger Spark to cache DataFrame transactionsDf in executor memory where available, writing to disk where executor
memory is insufficient, in a fault-tolerant way. Find the error.
Code block:
transactionsDf.persist(StorageLevel.MEMORY_AND_DISK)
The storage level is inappropriate for fault-tolerant storage.
Correct. Typically, when thinking about fault tolerance and storage levels, you would want to store redundant copies of the dataset. This can be achieved by using a storage level such as
StorageLevel.MEMORY_AND_DISK_2.
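For illustration, a corrected code block could look like the sketch below (assuming the StorageLevel import shown; the replicated _2 level keeps a second copy of each partition on another node):

from pyspark import StorageLevel

# Cache in executor memory, spill to disk where memory is insufficient,
# and keep a second replicated copy for fault tolerance.
transactionsDf.persist(StorageLevel.MEMORY_AND_DISK_2)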
The code block uses the wrong command for caching.
Wrong. DataFrame.persist() is indeed the method to use here, since it supports passing a storage level; DataFrame.cache() does not accept a storage level argument.
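As a minimal sketch of the difference (assuming the same transactionsDf and the StorageLevel import):

transactionsDf.cache()                                  # no storage level can be passed; Spark's default is used
transactionsDf.persist(StorageLevel.MEMORY_AND_DISK_2)  # persist() accepts an explicit storage level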
Caching is not supported in Spark, data are always recomputed.
Incorrect. Caching is an important feature of Spark, since it can accelerate Spark programs to a great extent. Caching is often a good idea for datasets that need to be accessed
repeatedly.
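For example, a sketch of the typical pattern for a repeatedly accessed DataFrame (assuming several actions run on transactionsDf):

transactionsDf.cache()    # mark the DataFrame for caching
transactionsDf.count()    # the first action materializes the cache
transactionsDf.count()    # subsequent actions are served from the cached data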
Data caching capabilities can be accessed through the spark object, but not through the DataFrame API.
No. Caching is accessed through either DataFrame.cache() or DataFrame.persist().
The DataFrameWriter needs to be invoked.
Wrong. The DataFrameWriter can be accessed via DataFrame.write and is used to write data to external data stores, mostly on disk. Here, keywords such as 'cache' and 'executor
memory' point us away from external data stores. We aim to save data to memory to accelerate reads, since reading from disk is comparatively slow. The
DataFrameWriter does not write to memory, so we cannot use it here.
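For contrast, a DataFrameWriter call writes to an external store rather than to memory; a sketch (the parquet format and the path /tmp/transactions are purely illustrative):

transactionsDf.write.format("parquet").mode("overwrite").save("/tmp/transactions")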
More info: Best practices for caching in Spark SQL | by David Vrba | Towards Data Science