Which of the following code blocks performs an inner join of DataFrames transactionsDf and itemsDf on columns productId and itemId, respectively, excluding columns value and storeId from
DataFrame transactionsDf and column attributes from DataFrame itemsDf?
This question offers a wide variety of answers to a seemingly simple task. That variety reflects the many ways a join can be expressed in PySpark. You also need to understand some SQL syntax to identify the correct answer here.
transactionsDf.createOrReplaceTempView('transactionsDf')
itemsDf.createOrReplaceTempView('itemsDf')
statement = '''
SELECT * FROM transactionsDf
INNER JOIN itemsDf
ON transactionsDf.productId==itemsDf.itemId
'''
spark.sql(statement).drop('value', 'storeId', 'attributes')
Correct - this answer uses SQL to perform the inner join and afterwards drops the unwanted columns. This is totally fine. If you are unfamiliar with triple quotes (''') in Python: they allow you to write a string that spans multiple lines.
transactionsDf
.drop(col('value'), col('storeId'))
.join(itemsDf.drop(col('attributes')), col('productId')==col('itemId'))
No, this answer option is a trap: DataFrame.drop() does not accept a list of Column objects. You could use transactionsDf.drop('value', 'storeId') instead.
transactionsDf.drop('value', 'storeId').join(itemsDf.drop('attributes'), 'transactionsDf.productId==itemsDf.itemId')
Incorrect - Spark does not evaluate 'transactionsDf.productId==itemsDf.itemId' as a valid join expression, because it is passed as a string: Spark treats a string argument to join() as the name of a single column to join on. This would work if it were a Column expression instead of a string.
transactionsDf.drop('value', 'storeId').join(itemsDf.select('attributes'), transactionsDf.productId==itemsDf.itemId)
Wrong - this statement incorrectly uses itemsDf.select instead of itemsDf.drop. select('attributes') keeps only the attributes column, which is the opposite of what is asked.
transactionsDf.createOrReplaceTempView('transactionsDf')
itemsDf.createOrReplaceTempView('itemsDf')
spark.sql('SELECT -value, -storeId FROM transactionsDf INNER JOIN itemsDf ON productId==itemId').drop('attributes')
No, here the SQL syntax is incorrect. Prefixing a column name with a minus sign (-columnName) does not exclude that column in Spark SQL.
More info: pyspark.sql.DataFrame.join --- PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3, Question 25 (Databricks import instructions)