Which of the following statements about storage levels is incorrect?
MEMORY_AND_DISK replicates cached DataFrames both on memory and disk.
Correct, this statement is wrong. MEMORY_AND_DISK does not duplicate data across memory and disk: Spark stores the data in memory first and spills to disk only the partitions that do not fit.
DISK_ONLY will not use the worker node's memory.
Wrong, this statement is correct. DISK_ONLY keeps the data exclusively on the worker node's disk and does not use its memory.
In client mode, DataFrames cached with the MEMORY_ONLY_2 level will not be stored in the edge node's memory.
Wrong, this statement is correct. Spark has no provision to cache DataFrames in the driver (which runs on the edge node in client mode); MEMORY_ONLY_2 stores two replicas of the data, both in the executors' memory.
Caching can be undone using the DataFrame.unpersist() operator.
Wrong, this statement is correct. Caching, whether achieved via the DataFrame.cache() or DataFrame.persist() operators, can be undone using the DataFrame.unpersist() operator, which removes the DataFrame's cached blocks from the executors' memory and disk.
The cache operator on DataFrames is evaluated like a transformation.
Wrong, this statement is correct. DataFrame.cache() is evaluated like a transformation, through lazy evaluation. This means that calling DataFrame.cache() has no effect until a subsequent action, such as DataFrame.count(), is executed on the cached DataFrame.
More info: pyspark.sql.DataFrame.unpersist --- PySpark 3.1.2 documentation