Which of the following code blocks returns all unique values across all values in columns value and productId in DataFrame transactionsDf in a one-column DataFrame?
transactionsDf.select('value').union(transactionsDf.select('productId')).distinct()
Correct. This code block uses a common pattern for finding the unique values across multiple columns: union and distinct. In fact, it is so common that it is even mentioned in the Spark
documentation for the union command (link below).
transactionsDf.select('value', 'productId').distinct()
Wrong. This code block returns unique rows, but not unique values.
transactionsDf.agg({'value': 'collect_set', 'productId': 'collect_set'})
Incorrect. This code block will output a one-row, two-column DataFrame where each cell has an array of unique values in the respective column (even omitting any nulls).
transactionsDf.select(col('value'), col('productId')).agg({'*': 'count'})
No. This command will count the number of rows, but will not return unique values.
transactionsDf.select('value').join(transactionsDf.select('productId'), col('value')==col('productId'), 'outer')
Wrong. This command will perform an outer join of the value and productId columns. As such, it will return a two-column DataFrame. If you picked this answer, it might be a good idea for you to read
up on the difference between union and join, a link is posted below.
More info: pyspark.sql.DataFrame.union --- PySpark 3.1.2 documentation, sql - What is the difference between JOIN and UNION? - Stack Overflow
Static notebook | Dynamic notebook: See test 3, Question: 21 (Databricks import instructions)
Currently there are no comments in this discussion, be the first to comment!