Databricks Exam Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 Topic 3 Question 52 Discussion

Actual exam question for Databricks's Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 exam
Question #: 52
Topic #: 3

The code block shown below should return a DataFrame with only columns from DataFrame transactionsDf for which there is a corresponding transactionId in DataFrame itemsDf. DataFrame itemsDf is very small and much smaller than DataFrame transactionsDf. The query should be executed in an optimized way. Choose the answer that correctly fills the blanks in the code block to accomplish this.

__1__.__2__(__3__, __4__, __5__)

Suggested Answer: C

Correct code block:

transactionsDf.join(broadcast(itemsDf), 'transactionId', 'left_semi')

This question is extremely difficult and far exceeds the difficulty of the questions in the actual exam.

A first indication of what is being asked here is the remark that 'the query should be executed in an optimized way'. You also have qualitative information about the sizes of itemsDf and transactionsDf. Given that itemsDf is 'very small' and that the execution should be optimized, you should consider instructing Spark to perform a broadcast join, which sends the 'very small' DataFrame itemsDf to all executors. You can explicitly suggest this to Spark by wrapping itemsDf in a broadcast() call. One answer option does not include broadcast(), so you can disregard it. Another answer option wraps broadcast() around transactionsDf, the bigger of the two DataFrames. This option does not make sense in the optimization context and can likewise be disregarded.

When thinking about broadcast(), you may also remember that it is a function in the pyspark.sql.functions module. One answer option, however, resolves to itemsDf.broadcast([...]). The DataFrame class has no broadcast() method, so this answer option can be eliminated as well.

Both remaining answer options resolve to transactionsDf.join([...]) in the first two gaps, so you now have to figure out the details of the join. You can pick between an outer join and a left semi join. An outer join would include columns from both DataFrames, whereas a left semi join only includes columns from the 'left' table, here transactionsDf, just as the question asks. So, the correct answer is the one that uses the left_semi join.


Contribute your Thoughts:

Paris
5 months ago
I think D is the correct answer as it focuses on the smaller DataFrame itemsDf.
upvoted 0 times
...
Kaycee
5 months ago
I see your point, but I still think A is the best option for performance reasons.
upvoted 0 times
...
Kimbery
5 months ago
I believe the correct answer is C as it uses 'broadcast' for optimization and 'transactionId' for joining.
upvoted 0 times
...
Wilburn
5 months ago
I disagree, I think the answer is B because it directly joins the two DataFrames.
upvoted 0 times
...
Kaycee
5 months ago
I think the answer is A because it uses 'broadcast' for the smaller DataFrame.
upvoted 0 times
...
Merilyn
6 months ago
That makes sense. Maybe I'll reconsider my choice.
upvoted 0 times
...
Skye
6 months ago
I think D is correct because we need to start with itemsDf and broadcast it.
upvoted 0 times
...
Merilyn
6 months ago
Why do you think D is the correct answer, Bob?
upvoted 0 times
...
Michal
6 months ago
I'm not sure, but I think it might be C.
upvoted 0 times
...
Skye
6 months ago
I disagree, I believe the correct answer is D.
upvoted 0 times
...
Merilyn
6 months ago
I think the answer is A.
upvoted 0 times
...
Layla
7 months ago
Ha! Leave it to *Tamar* to cut through the noise. C does seem like the most straightforward and efficient solution. I'm with you, let's lock that in and move on to the next question.
upvoted 0 times
...
Tamar
7 months ago
You guys are really overthinking this. I'm just going to go with C and call it a day. Broadcast is the way to go, and 'left_semi' is the perfect join type to get the desired result. *laughs* Sometimes the simplest answer is the right one, you know?
upvoted 0 times
...
Isaiah
7 months ago
Hmm, that's an interesting point. But I'm not sure the 'anti' join is the best approach here, since we want to keep the columns from 'transactionsDf', not just the rows that don't match 'itemsDf'. I think C is still the cleanest solution.
upvoted 0 times
...
Francisca
7 months ago
You know, I was thinking the same thing. But I wonder if B might also work, using the 'anti' join to get the opposite of the join condition. That could be another way to optimize the query if the 'itemsDf' only has a few matching rows in 'transactionsDf'.
upvoted 0 times
...
Clemencia
7 months ago
I agree, C seems like the best option. Broadcast is a great way to efficiently join smaller datasets with larger ones, especially when the smaller dataset can fit in memory. This will help avoid shuffling large amounts of data across the network.
upvoted 0 times
...
Rosendo
7 months ago
Hmm, this is an interesting question. The key here is to optimize the query by using the smaller DataFrame 'itemsDf' to filter the larger DataFrame 'transactionsDf'. I think the correct answer is C, since it uses the 'broadcast' function to optimize the join operation.
upvoted 0 times
...
