The code block displayed below contains an error. The code block should merge the rows of DataFrames transactionsDfMonday and transactionsDfTuesday into a new DataFrame, matching
column names and inserting null values where column names do not appear in both DataFrames. Find the error.
Sample of DataFrame transactionsDfMonday:
+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            5|     null| null|   null|        2|null|
|            6|        3|    2|     25|        2|null|
+-------------+---------+-----+-------+---------+----+
Sample of DataFrame transactionsDfTuesday:
+-------+-------------+---------+-----+
|storeId|transactionId|productId|value|
+-------+-------------+---------+-----+
|     25|            1|        1|    4|
|      2|            2|        2|    7|
|      3|            4|        2| null|
|   null|            5|        2| null|
+-------+-------------+---------+-----+
Code block:
sc.union([transactionsDfMonday, transactionsDfTuesday])
Correct code block:
transactionsDfMonday.unionByName(transactionsDfTuesday, True)
Output of correct code block:
+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            5|     null| null|   null|        2|null|
|            6|        3|    2|     25|        2|null|
|            1|     null|    4|     25|        1|null|
|            2|     null|    7|      2|        2|null|
|            4|     null| null|      3|        2|null|
|            5|     null| null|   null|        2|null|
+-------------+---------+-----+-------+---------+----+
For solving this question, you should be aware of the difference between the DataFrame.union() and DataFrame.unionByName() methods. The first matches columns independently of their
names, purely by position. The second matches columns by name, which is what the question asks for. It also has a useful optional argument, allowMissingColumns. This allows you to
merge DataFrames whose column sets differ - just like in this example.
sc stands for SparkContext and is automatically provided when executing code on Databricks. While sc.union() allows you to combine RDDs, it is not the right choice for combining DataFrames. A hint away
from sc.union() is given where the question talks about merging rows 'into a new DataFrame'.
concat is a method in pyspark.sql.functions. It is great for consolidating values from different columns, but has no place when trying to join rows of multiple DataFrames.
Finally, the join method is a contender here. However, the default join for that method is an inner join, which does not get us closer to the goal of matching the two DataFrames as instructed,
especially since with the default arguments no join condition is defined.
More info:
- pyspark.sql.DataFrame.unionByName --- PySpark 3.1.2 documentation
- pyspark.SparkContext.union --- PySpark 3.1.2 documentation
- pyspark.sql.functions.concat --- PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See Test 3, Question 45 (Databricks import instructions)