
Databricks Exam Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 Topic 3 Question 16 Discussion

Actual exam question for Databricks's Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 exam
Question #: 16
Topic #: 3
[All Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 Questions]

Which of the following code blocks prints out in how many rows the expression Inc. appears in the string-type column supplier of DataFrame itemsDf?

Suggested Answer: E

Correct code block:

accum = sc.accumulator(0)

def check_if_inc_in_supplier(row):
    if 'Inc.' in row['supplier']:
        accum.add(1)

itemsDf.foreach(check_if_inc_in_supplier)
print(accum.value)

To answer this question correctly, you need to know about both the DataFrame.foreach() method and accumulators.

When Spark runs the code, it executes it on the executors. The executors do not have any information about variables outside of their scope. This is why simply using a Python variable counter, like in the two examples that start with counter = 0, will not work. You need to tell the executors explicitly that counter is a special shared variable, an accumulator, which is managed by the driver and can be accessed by all executors for the purpose of adding to it.
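As a local, single-process illustration of the counting logic (the sample rows below are hypothetical stand-ins for itemsDf's rows; in Spark the same check runs on executors, which is why the counter must be an accumulator rather than a plain variable):

```python
# Hypothetical sample rows standing in for itemsDf's string-type
# 'supplier' column; not taken from the actual exam DataFrame.
rows = [
    {'supplier': 'Sioux Tools Inc.'},
    {'supplier': 'Fender & Sons'},
    {'supplier': 'YetiX Inc.'},
]

# Same membership check as check_if_inc_in_supplier, but with a
# plain counter -- safe here only because everything runs in one
# process, unlike a distributed foreach.
count = 0
for row in rows:
    if 'Inc.' in row['supplier']:
        count += 1

print(count)
```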

If you have used Pandas in the past, you might be familiar with the iterrows() method. Notice that there is no such method in PySpark.

The two examples that start with print do not work, since DataFrame.foreach() does not have a return value.
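A minimal plain-Python analogue of why those options fail (the helper below is a hypothetical stand-in for DataFrame.foreach(), not the PySpark API itself): a function that only performs side effects implicitly returns None, so printing its result prints None rather than a count.

```python
# Hypothetical stand-in for a side-effect-only method such as
# DataFrame.foreach(): it iterates but returns nothing, so Python
# gives it an implicit return value of None.
def side_effect_only(rows):
    for _ in rows:
        pass  # per-row side effects would happen here

result = side_effect_only([1, 2, 3])
print(result)  # None
```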

More info: pyspark.sql.DataFrame.foreach, PySpark 3.1.2 documentation

Static notebook | Dynamic notebook: See test 3, question 22 (Databricks import instructions)

