Which of the following code blocks creates a new DataFrame with 3 columns, productId, highest, and lowest, that shows the biggest and smallest values of column value per value in column
productId from DataFrame transactionsDf?
Sample of DataFrame transactionsDf:
+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
|            4|     null| null|      3|        2|null|
|            5|     null| null|   null|        2|null|
|            6|        3|    2|     25|        2|null|
+-------------+---------+-----+-------+---------+----+
transactionsDf.groupby('productId').agg(max('value').alias('highest'), min('value').alias('lowest'))
Correct. Combining groupby with agg is the standard pattern for computing aggregated values per group.
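For reference, a minimal runnable sketch of this pattern (using a reduced stand-in for the sample data above; max and min are imported from pyspark.sql.functions under aliases so they do not shadow Python's built-ins):

from pyspark.sql import SparkSession
from pyspark.sql.functions import max as sql_max, min as sql_min

spark = SparkSession.builder.getOrCreate()

# Reduced stand-in for the sample transactionsDf above (only the columns that matter here)
transactionsDf = spark.createDataFrame(
    [(1, 4), (2, 7), (3, None), (2, None), (2, None), (2, 2)],
    ['productId', 'value'],
)

# One output row per productId, with the largest and smallest value in each group
transactionsDf.groupby('productId') \
    .agg(sql_max('value').alias('highest'), sql_min('value').alias('lowest')) \
    .show()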
transactionsDf.groupby('productId').agg({'highest': max('value'), 'lowest': min('value')})
Wrong. While DataFrame.agg() accepts dictionaries, the dictionary in this code block is malformed. When you pass a dictionary, it must map the column name to the name of the aggregating function given as a string, e.g. {'value': 'max'} - the column name is the key and the aggregating function is the value.
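A corrected sketch of the dictionary form is shown below (reusing the transactionsDf from the first sketch). Note two limitations of this variant: the result column comes back named max(value) and has to be renamed afterwards, and since a dictionary cannot hold the key 'value' twice, max and min of the same column cannot both be requested in a single dictionary.

# Dictionary form: column name as key, aggregate function name (a string) as value
transactionsDf.groupby('productId') \
    .agg({'value': 'max'}) \
    .withColumnRenamed('max(value)', 'highest') \
    .show()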
transactionsDf.agg(max('value').alias('highest'), min('value').alias('lowest'))
Incorrect. While this is valid Spark syntax, it does not do what the question asks for. The question specifically asks for values to be aggregated per value in column productId, but that column is not considered here. Instead, max() and min() are calculated as if the entire DataFrame were a single group.
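To illustrate, with the sample data this variant collapses everything into a single row and has no productId column at all (same imports, aliases, and transactionsDf as in the first sketch):

# No groupby: the entire DataFrame is treated as one group, so exactly one row comes back
transactionsDf.agg(sql_max('value').alias('highest'), sql_min('value').alias('lowest')).show()
# +-------+------+
# |highest|lowest|
# +-------+------+
# |      7|     2|
# +-------+------+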
transactionsDf.max('value').min('value')
Wrong. There is no DataFrame.max() method in Spark, so this command will fail.
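As a side note, max() and min() do exist on the grouped object returned by groupby(), but not on the DataFrame itself, and the grouped variants offer no way to alias the result columns inside the call (a sketch reusing the transactionsDf from the first example):

# transactionsDf.max('value')   # AttributeError: DataFrame has no max() method
transactionsDf.groupby('productId').max('value').show()   # result column is named max(value)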
transactionsDf.groupby(col(productId)).agg(max(col(value)).alias('highest'), min(col(value)).alias('lowest'))
No. This would work if the column names were expressed as strings, but it will not run as written. Python interprets the unquoted column names productId and value as variable names, and since no such variables are defined, the code fails with a NameError before PySpark ever resolves the columns to aggregate.
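With the column names quoted, the same structure runs fine (a sketch reusing the transactionsDf and the sql_max/sql_min aliases from the first example; col also comes from pyspark.sql.functions):

from pyspark.sql.functions import col

# Quoted column names are resolved by col() instead of being looked up as Python variables
transactionsDf.groupby(col('productId')) \
    .agg(sql_max(col('value')).alias('highest'), sql_min(col('value')).alias('lowest')) \
    .show()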
More info: pyspark.sql.DataFrame.agg --- PySpark 3.1.2 documentation