Databricks Exam Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 Topic 3 Question 13 Discussion

Actual exam question from Databricks' Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 exam
Question #: 13
Topic #: 3
[All Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 Questions]

Which of the following code blocks performs an inner join of DataFrames transactionsDf and itemsDf on columns productId and itemId, respectively, excluding columns value and storeId from DataFrame transactionsDf and column attributes from DataFrame itemsDf?

Suggested Answer: E

This question offers a wide variety of answers for a seemingly simple task. That variety reflects the many ways a join can be expressed in PySpark, and you need to understand some SQL syntax to get to the correct answer here.

transactionsDf.createOrReplaceTempView('transactionsDf')
itemsDf.createOrReplaceTempView('itemsDf')

statement = '''
SELECT * FROM transactionsDf
INNER JOIN itemsDf
ON transactionsDf.productId==itemsDf.itemId
'''
spark.sql(statement).drop('value', 'storeId', 'attributes')

Correct - this answer uses SQL to perform the inner join and afterwards drops the unwanted columns, which is exactly what the question asks for. If you are unfamiliar with the triple-quote ''' in Python: it lets you write a string that spans multiple lines.
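For comparison, here is a minimal sketch of the same result expressed purely through the DataFrame API instead of SQL. The SparkSession setup and the sample rows are hypothetical; the exam only specifies the column names.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('join-sketch').getOrCreate()

# Hypothetical stand-in data; only the column names come from the question.
transactionsDf = spark.createDataFrame(
    [(1, 10.0, 3), (2, 20.0, 4)], ['productId', 'value', 'storeId'])
itemsDf = spark.createDataFrame(
    [(1, 'red'), (2, 'blue')], ['itemId', 'attributes'])

# Inner join on productId == itemId, then drop the unwanted columns,
# mirroring the SQL-based suggested answer.
joined = (transactionsDf
          .join(itemsDf, transactionsDf.productId == itemsDf.itemId, 'inner')
          .drop('value', 'storeId', 'attributes'))
joined.show()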

transactionsDf
.drop(col('value'), col('storeId'))
.join(itemsDf.drop(col('attributes')), col('productId')==col('itemId'))

No, this answer option is a trap: in the Spark versions this exam targets (3.0/3.1), DataFrame.drop() accepts multiple column names as strings or a single Column object, but not multiple Column objects at once. You could use transactionsDf.drop('value', 'storeId') instead.
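A minimal sketch of the difference, using hypothetical data (only the column names come from the question):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
transactionsDf = spark.createDataFrame(
    [(1, 10.0, 3)], ['productId', 'value', 'storeId'])

transactionsDf.drop('value', 'storeId')    # works: multiple names as strings
transactionsDf.drop(col('value'))          # works: a single Column object
# transactionsDf.drop(col('value'), col('storeId'))  # TypeError in Spark 3.0/3.1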

transactionsDf.drop('value', 'storeId').join(itemsDf.drop('attributes'), 'transactionsDf.productId==itemsDf.itemId')

Incorrect - Spark does not evaluate 'transactionsDf.productId==itemsDf.itemId' as a join expression. When join() receives a string, it treats it as the name of a column that must exist in both DataFrames, not as SQL. This option would work if the condition were passed as a Column expression instead of a string.
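A short sketch of the contrast, again with hypothetical data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
transactionsDf = spark.createDataFrame([(1, 10.0, 3)], ['productId', 'value', 'storeId'])
itemsDf = spark.createDataFrame([(1, 'red')], ['itemId', 'attributes'])

# A Column expression is a valid join condition:
transactionsDf.join(itemsDf, transactionsDf.productId == itemsDf.itemId)

# A plain string is interpreted as a column name present in both DataFrames,
# so the following raises an AnalysisException:
# transactionsDf.join(itemsDf, 'transactionsDf.productId==itemsDf.itemId')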

transactionsDf.drop('value', 'storeId').join(itemsDf.select('attributes'), transactionsDf.productId==itemsDf.itemId)

Wrong, this statement incorrectly uses itemsDf.select instead of itemsDf.drop: select('attributes') keeps only the attributes column and discards everything else, which is the opposite of what the question asks for.
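A one-line contrast, with a hypothetical itemsDf:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
itemsDf = spark.createDataFrame([(1, 'red')], ['itemId', 'attributes'])

itemsDf.select('attributes').columns   # ['attributes'] - keeps ONLY the listed column
itemsDf.drop('attributes').columns     # ['itemId']     - removes the listed column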

transactionsDf.createOrReplaceTempView('transactionsDf')
itemsDf.createOrReplaceTempView('itemsDf')

spark.sql('SELECT -value, -storeId FROM transactionsDf INNER JOIN itemsDf ON productId==itemId').drop('attributes')

No, here the SQL expression syntax is incorrect. Simply specifying -columnName does not drop a column; in Spark SQL the leading minus is arithmetic negation, so SELECT -value, -storeId returns only those two columns with their values negated rather than the full join result without them.
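A minimal check of that behaviour, using hypothetical data registered as a temp view:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
transactionsDf = spark.createDataFrame([(1, 10.0, 3)], ['productId', 'value', 'storeId'])
transactionsDf.createOrReplaceTempView('transactionsDf')

# The result has a single column holding -10.0: the value column
# has been negated, not excluded.
spark.sql('SELECT -value FROM transactionsDf').show()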

More info: pyspark.sql.DataFrame.join (PySpark 3.1.2 documentation)

Static notebook | Dynamic notebook: See test 3, Question 25 (Databricks import instructions)

