In which order should the code blocks shown below be run in order to create a DataFrame that shows the mean of column predError of DataFrame transactionsDf per column storeId and productId,
where productId should be either 2 or 3 and the returned DataFrame should be sorted in ascending order by column storeId, leaving out any nulls in that column?
DataFrame transactionsDf:
+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
|            4|     null| null|      3|        2|null|
|            5|     null| null|   null|        2|null|
|            6|        3|    2|     25|        2|null|
+-------------+---------+-----+-------+---------+----+
1. .mean("predError")
2. .groupBy("storeId")
3. .orderBy("storeId")
4. transactionsDf.filter(transactionsDf.storeId.isNotNull())
5. .pivot("productId", [2, 3])
Correct code block:
transactionsDf.filter(transactionsDf.storeId.isNotNull()).groupBy('storeId').pivot('productId', [2, 3]).mean('predError').orderBy('storeId')
Output of correct code block:
+-------+----+----+
|storeId|   2|   3|
+-------+----+----+
|      2| 6.0|null|
|      3|null|null|
|     25| 3.0| 3.0|
+-------+----+----+
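The following is a minimal, self-contained sketch (assuming a local PySpark installation and SparkSession) that rebuilds transactionsDf from the table above and runs the code blocks in the correct order, so you can reproduce the output yourself:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType

spark = SparkSession.builder.master("local[*]").getOrCreate()

schema = StructType([
    StructField("transactionId", IntegerType(), True),
    StructField("predError", IntegerType(), True),
    StructField("value", IntegerType(), True),
    StructField("storeId", IntegerType(), True),
    StructField("productId", IntegerType(), True),
    StructField("f", IntegerType(), True),
])
data = [
    (1, 3, 4, 25, 1, None),
    (2, 6, 7, 2, 2, None),
    (3, 3, None, 25, 3, None),
    (4, None, None, 3, 2, None),
    (5, None, None, None, 2, None),
    (6, 3, 2, 25, 2, None),
]
transactionsDf = spark.createDataFrame(data, schema)

# Code blocks in order 4 -> 2 -> 5 -> 1 -> 3:
result = (transactionsDf
          .filter(transactionsDf.storeId.isNotNull())  # 4: drop rows with a null storeId
          .groupBy("storeId")                          # 2: group by storeId
          .pivot("productId", [2, 3])                  # 5: one column per productId 2 and 3
          .mean("predError")                           # 1: mean of predError per group/pivot cell
          .orderBy("storeId"))                         # 3: ascending sort by storeId
result.show()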
This question is quite convoluted and requires you to think hard about the correct order of operations. The pivot method also makes an appearance - a method that you may not know all that much about (yet).
At the first position in all answers is code block 4, so the question is essentially just about the ordering of the remaining four code blocks.
The question states that the returned DataFrame should be sorted by column storeId. So, it makes sense to place code block 3, which contains the orderBy operator, at the very end of the code block. This leaves you with only two answer options.
Now, it is useful to know more about the context of pivot in PySpark. A common pattern is groupBy, pivot, and then another aggregating function, like mean. In the documentation linked below you
can see that pivot is a method of pyspark.sql.GroupedData - meaning that before pivoting, you have to use groupBy. The only answer option matching this requirement is the one in which code
block 2 (which includes groupBy) is stated before code block 5 (which includes pivot).
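To make the point concrete, here is a hedged illustration (checked against the PySpark 3.1.2 API referenced below, and reusing the transactionsDf built in the sketch above): pivot is defined on GroupedData, not on DataFrame, which is why groupBy must come before pivot in the chain.

grouped = transactionsDf.groupBy("storeId")    # returns a pyspark.sql.GroupedData object
print(type(grouped))                           # <class 'pyspark.sql.group.GroupedData'>
print(hasattr(transactionsDf, "pivot"))        # False - DataFrame has no pivot method
print(hasattr(grouped, "pivot"))               # True  - pivot is a GroupedData method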
More info: pyspark.sql.GroupedData.pivot --- PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3, Question 43 (Databricks import instructions)