
Databricks Exam Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 Topic 2 Question 7 Discussion

Actual exam question from the Databricks Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 exam
Question #: 7
Topic #: 2

Which of the following code blocks creates a new DataFrame with 3 columns, productId, highest, and lowest, that shows the biggest and smallest values of column value per value in column productId from DataFrame transactionsDf?

Sample of DataFrame transactionsDf:

+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
|            4|     null| null|      3|        2|null|
|            5|     null| null|   null|        2|null|
|            6|        3|    2|     25|        2|null|
+-------------+---------+-----+-------+---------+----+

Suggested Answer: D

transactionsDf.groupby('productId').agg(max('value').alias('highest'), min('value').alias('lowest'))

Correct. groupby followed by agg is a common pattern for investigating aggregated values per group. Note that max and min here refer to pyspark.sql.functions.max and pyspark.sql.functions.min, not the Python built-ins, so they need to be imported for this code block to run.
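
A minimal runnable sketch of this pattern, assuming a local SparkSession and recreating the sample data from the question:

from pyspark.sql import SparkSession
from pyspark.sql.functions import max, min  # shadow the Python built-ins

spark = SparkSession.builder.getOrCreate()

# Recreate the sample DataFrame shown above; an explicit DDL schema is used
# because column f contains only nulls and its type could not be inferred.
transactionsDf = spark.createDataFrame(
    [
        (1, 3, 4, 25, 1, None),
        (2, 6, 7, 2, 2, None),
        (3, 3, None, 25, 3, None),
        (4, None, None, 3, 2, None),
        (5, None, None, None, 2, None),
        (6, 3, 2, 25, 2, None),
    ],
    "transactionId INT, predError INT, value INT, storeId INT, productId INT, f STRING",
)

# One row per productId, with nulls ignored by max/min (row order may vary):
transactionsDf.groupby("productId").agg(
    max("value").alias("highest"),
    min("value").alias("lowest"),
).show()
# +---------+-------+------+
# |productId|highest|lowest|
# +---------+-------+------+
# |        1|      4|     4|
# |        2|      7|     2|
# |        3|   null|  null|
# +---------+-------+------+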

transactionsDf.groupby('productId').agg({'highest': max('value'), 'lowest': min('value')})

Wrong. While DataFrame.agg() accepts dictionaries, the syntax of the dictionary in this code block is wrong. In the dictionary form, the column name is the key and the name of the aggregating function, as a string, is the value, for example {'value': 'max'}.
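
A short sketch of the corrected dictionary form, reusing the transactionsDf defined above:

# Correct dict syntax: column name as key, function name (a string) as value.
transactionsDf.groupby("productId").agg({"value": "max"}).show()
# The result column is auto-named max(value), so it would still need renaming
# via withColumnRenamed to match the required schema. Also, because dictionary
# keys must be unique, this form cannot request both max and min of the same
# column in a single call, which is why the Column-based agg() is needed here.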

transactionsDf.agg(max('value').alias('highest'), min('value').alias('lowest'))

Incorrect. While this is valid Spark syntax, it does not achieve what the question asks for. The question specifically asks for values to be aggregated per value in column productId, but this column is not considered here. Instead, the max() and min() values are calculated as if the entire DataFrame were a single group.
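
For illustration, this is what the ungrouped variant returns on the sample data above: a single row for the whole DataFrame, with no productId column.

transactionsDf.agg(max("value").alias("highest"), min("value").alias("lowest")).show()
# +-------+------+
# |highest|lowest|
# +-------+------+
# |      7|     2|
# +-------+------+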

transactionsDf.max('value').min('value')

Wrong. There is no DataFrame.max() method in Spark, so this command will fail.
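
For reference, shortcut aggregation methods such as max() live on GroupedData rather than on DataFrame, so they are only available after a groupby() call, and even there max() returns a DataFrame, so chaining .min() onto it would fail as well. A minimal sketch:

# GroupedData.max() works, but auto-names the result column max(value):
transactionsDf.groupby("productId").max("value").show()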

transactionsDf.groupby(col(productId)).agg(max(col(value)).alias('highest'), min(col(value)).alias('lowest'))

No. While this would work if the column names were expressed as strings, it will not work as is. Python interprets the unquoted names productId and value as variables, which are undefined here, so the code block fails with a NameError and PySpark never learns which columns you want to aggregate.
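
A sketch of the repaired variant, with the names quoted so that col() resolves them as DataFrame columns (col must also be imported from pyspark.sql.functions):

from pyspark.sql.functions import col, max, min

transactionsDf.groupby(col("productId")).agg(
    max(col("value")).alias("highest"),
    min(col("value")).alias("lowest"),
).show()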

More info: pyspark.sql.DataFrame.agg (PySpark 3.1.2 documentation)

Static notebook | Dynamic notebook: See test 3, Question 32 (Databricks import instructions)

