Which of the following code blocks creates a new DataFrame with 3 columns, productId, highest, and lowest, that shows the biggest and smallest values of column value per value in column
productId from DataFrame transactionsDf?
Sample of DataFrame transactionsDf:
+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
|            4|     null| null|      3|        2|null|
|            5|     null| null|   null|        2|null|
|            6|        3|    2|     25|        2|null|
+-------------+---------+-----+-------+---------+----+
transactionsDf.groupby('productId').agg(max('value').alias('highest'), min('value').alias('lowest'))
Correct. Combining groupby with agg is the standard pattern for computing aggregated values per group.
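For reference, a minimal runnable sketch of this pattern (using a reduced stand-in for the sample data above; max and min are imported from pyspark.sql.functions under aliases so they do not shadow Python's built-ins):

from pyspark.sql import SparkSession
from pyspark.sql.functions import max as sql_max, min as sql_min

spark = SparkSession.builder.getOrCreate()

# Reduced stand-in for the sample transactionsDf above (only the columns that matter here)
transactionsDf = spark.createDataFrame(
    [(1, 4), (2, 7), (3, None), (2, None), (2, None), (2, 2)],
    ['productId', 'value'],
)

# One output row per productId, with the largest and smallest value in each group
transactionsDf.groupby('productId') \
    .agg(sql_max('value').alias('highest'), sql_min('value').alias('lowest')) \
    .show()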
transactionsDf.groupby('productId').agg({'highest': max('value'), 'lowest': min('value')})
Wrong. While DataFrame.agg() accepts dictionaries, the dictionary in this code block is malformed. When you pass a dictionary, it must map the column name to the name of the aggregating function given as a string, e.g. {'value': 'max'} - the column name is the key and the aggregating function is the value.
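A corrected sketch of the dictionary form is shown below (reusing the transactionsDf from the first sketch). Note two limitations of this variant: the result column comes back named max(value) and has to be renamed afterwards, and since a dictionary cannot hold the key 'value' twice, max and min of the same column cannot both be requested in a single dictionary.

# Dictionary form: column name as key, aggregate function name (a string) as value
transactionsDf.groupby('productId') \
    .agg({'value': 'max'}) \
    .withColumnRenamed('max(value)', 'highest') \
    .show()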
transactionsDf.agg(max('value').alias('highest'), min('value').alias('lowest'))
Incorrect. While this is valid Spark syntax, it does not do what the question asks for. The question specifically asks for values to be aggregated per value in column productId, but that column is not considered here. Instead, max() and min() are calculated as if the entire DataFrame were a single group.
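To illustrate, with the sample data this variant collapses everything into a single row and has no productId column at all (same imports, aliases, and transactionsDf as in the first sketch):

# No groupby: the entire DataFrame is treated as one group, so exactly one row comes back
transactionsDf.agg(sql_max('value').alias('highest'), sql_min('value').alias('lowest')).show()
# +-------+------+
# |highest|lowest|
# +-------+------+
# |      7|     2|
# +-------+------+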
transactionsDf.max('value').min('value')
Wrong. There is no DataFrame.max() method in Spark, so this command will fail.
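As a side note, max() and min() do exist on the grouped object returned by groupby(), but not on the DataFrame itself, and the grouped variants offer no way to alias the result columns inside the call (a sketch reusing the transactionsDf from the first example):

# transactionsDf.max('value')   # AttributeError: DataFrame has no max() method
transactionsDf.groupby('productId').max('value').show()   # result column is named max(value)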
transactionsDf.groupby(col(productId)).agg(max(col(value)).alias('highest'), min(col(value)).alias('lowest'))
No. This would work if the column names were expressed as strings, but it will not run as written. Python interprets the unquoted column names productId and value as variable names, and since no such variables are defined, the code fails with a NameError before PySpark ever resolves the columns to aggregate.
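With the column names quoted, the same structure runs fine (a sketch reusing the transactionsDf and the sql_max/sql_min aliases from the first example; col also comes from pyspark.sql.functions):

from pyspark.sql.functions import col

# Quoted column names are resolved by col() instead of being looked up as Python variables
transactionsDf.groupby(col('productId')) \
    .agg(sql_max(col('value')).alias('highest'), sql_min(col('value')).alias('lowest')) \
    .show()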
More info: pyspark.sql.DataFrame.agg --- PySpark 3.1.2 documentation