Which of the following statements about reducing out-of-memory errors is incorrect?
Concatenating multiple string columns into a single column may guard against out-of-memory errors.
Exactly, this is the incorrect statement! Concatenating string columns does not reduce the size of the data; it merely structures it in a different way. This changes little about how Spark processes the data and certainly does not reduce out-of-memory errors.
Reducing partition size can help against out-of-memory errors.
No, this is not incorrect. Reducing partition size is a viable way to guard against out-of-memory errors, since executors need to load partitions into memory before processing them. If an executor does not have enough memory available to do that, it will throw an out-of-memory error. Decreasing partition size can therefore be very helpful for preventing that, as sketched below.
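Below is a minimal PySpark sketch of two common ways to end up with smaller partitions; the 32 MB limit and the partition count of 200 are illustrative assumptions, not recommendations.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-size-demo").getOrCreate()

# Lower the maximum number of bytes packed into a single partition when
# reading files (Spark's default is 128 MB); 32 MB here is just illustrative.
spark.conf.set("spark.sql.files.maxPartitionBytes", 32 * 1024 * 1024)

# Alternatively, spread an existing DataFrame over more partitions so that
# each individual partition holds less data.
df = spark.range(10_000_000)
df_small_partitions = df.repartition(200)
print(df_small_partitions.rdd.getNumPartitions())  # 200
```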
Decreasing the number of cores available to each executor can help against out-of-memory errors.
No, this is not incorrect. To process a partition, that partition needs to be loaded into the memory of an executor. Since every core in an executor can process a partition at the same time, potentially in parallel with other executors, memory on the machine hosting the executors can fill up quite quickly. Executor memory usage is therefore a concern, especially when multiple partitions are processed simultaneously. To strike a balance between performance and memory usage, decreasing the number of cores per executor may help against out-of-memory errors, as in the sketch below.
Setting a limit on the maximum size of serialized data returned to the driver may help prevent out-of-memory errors.
No, this is not incorrect. Commands like collect() trigger the transmission of potentially large amounts of data from the cluster to the driver, which can cause the driver to run out of memory. One strategy to avoid this is simply to be careful with such commands. Another is to set the spark.driver.maxResultSize parameter: if the data to be transmitted to the driver exceeds this threshold, Spark aborts the job and thereby prevents an out-of-memory error, as in the sketch below.
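A minimal sketch of the second strategy; the 1g limit shown is an illustrative assumption (it also matches Spark's documented default), and since it is a driver property it should be set before the session is created.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("max-result-size-demo")
    # Abort any job whose serialized results returned to the driver would
    # exceed 1 GB, instead of risking a driver out-of-memory error.
    .config("spark.driver.maxResultSize", "1g")
    .getOrCreate()
)

# A collect() whose total serialized result exceeds the limit now fails the
# job with an explicit error rather than exhausting driver memory.
small_sample = spark.range(10).collect()
```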
Limiting the amount of data being automatically broadcast in joins can help against out-of-memory errors.
No, this is not incorrect. As part of its internal optimization, Spark may choose to speed up operations by broadcasting (usually relatively small) tables to the executors. This broadcast happens via the driver, so all broadcast tables are loaded into the driver first. If these tables are relatively big, or multiple mid-size tables are being broadcast, this may lead to an out-of-memory error. The maximum table size for which Spark will consider broadcasting is set by the spark.sql.autoBroadcastJoinThreshold parameter, illustrated in the sketch below.
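A minimal sketch, assuming an existing SparkSession named spark; 10 MB is Spark's documented default threshold, and -1 disables automatic broadcasting entirely.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("broadcast-threshold-demo").getOrCreate()

# Only tables smaller than roughly 10 MB (Spark's default) are considered
# for automatic broadcast in joins.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10 * 1024 * 1024)

# Setting the threshold to -1 disables automatic broadcast joins entirely,
# which can help when even "small" tables are too large to hold in memory.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
```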
More info: Configuration - Spark 3.1.2 Documentation and Spark OOM Error --- Closeup. Does the following look familiar when... | by Amit Singh Rathore | The Startup | Medium