The code block displayed below contains an error. The code block should trigger Spark to cache DataFrame transactionsDf in executor memory where available, writing to disk where executor
memory is insufficient, in a fault-tolerant way. Find the error.
Code block:
transactionsDf.persist(StorageLevel.MEMORY_AND_DISK)
The storage level is inappropriate for fault-tolerant storage.
Correct. Typically, when thinking about fault tolerance and storage levels, you would want to store redundant copies of the dataset. This can be achieved by using a storage level such as
StorageLevel.MEMORY_AND_DISK_2.
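For illustration, a corrected code block could look like the sketch below (assuming the StorageLevel import shown; the replicated _2 level keeps a second copy of each partition on another node):

from pyspark import StorageLevel

# Cache in executor memory, spill to disk where memory is insufficient,
# and keep a second replicated copy for fault tolerance.
transactionsDf.persist(StorageLevel.MEMORY_AND_DISK_2)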
The code block uses the wrong command for caching.
Wrong. DataFrame.persist() is indeed the method to use here, since it supports passing a storage level; DataFrame.cache() does not accept a storage level argument.
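As a minimal sketch of the difference (assuming the same transactionsDf and the StorageLevel import):

transactionsDf.cache()                                  # no storage level can be passed; Spark's default is used
transactionsDf.persist(StorageLevel.MEMORY_AND_DISK_2)  # persist() accepts an explicit storage level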
Caching is not supported in Spark, data are always recomputed.
Incorrect. Caching is an important feature of Spark, since it can accelerate Spark programs to a great extent. Caching is often a good idea for datasets that need to be accessed
repeatedly.
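For example, a sketch of the typical pattern for a repeatedly accessed DataFrame (assuming several actions run on transactionsDf):

transactionsDf.cache()    # mark the DataFrame for caching
transactionsDf.count()    # the first action materializes the cache
transactionsDf.count()    # subsequent actions are served from the cached data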
Data caching capabilities can be accessed through the spark object, but not through the DataFrame API.
No. Caching is accessed through either DataFrame.cache() or DataFrame.persist().
The DataFrameWriter needs to be invoked.
Wrong. The DataFrameWriter can be accessed via DataFrame.write and is used to write data to external data stores, mostly on disk. Here, keywords such as 'cache' and 'executor
memory' point us away from external data stores. We aim to save data to memory to accelerate reads, since reading from disk is comparatively slow. The
DataFrameWriter does not write to memory, so we cannot use it here.
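For contrast, a DataFrameWriter call writes to an external store rather than to memory; a sketch (the parquet format and the path /tmp/transactions are purely illustrative):

transactionsDf.write.format("parquet").mode("overwrite").save("/tmp/transactions")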
More info: Best practices for caching in Spark SQL | by David Vrba | Towards Data Science