Databricks Exam Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 Topic 2 Question 61 Discussion

Actual exam question for Databricks's Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 exam
Question #: 61
Topic #: 2

The code block shown below should return a DataFrame with only columns from DataFrame transactionsDf for which there is a corresponding transactionId in DataFrame itemsDf. DataFrame itemsDf is very small and much smaller than DataFrame transactionsDf. The query should be executed in an optimized way. Choose the answer that correctly fills the blanks in the code block to accomplish this.

__1__.__2__(__3__, __4__, __5__)

Suggested Answer: C

Correct code block:

transactionsDf.join(broadcast(itemsDf), 'transactionId', 'left_semi')

This question is extremely difficult and far exceeds the difficulty of questions in the actual exam.

A first indication of what is being asked of you here is the remark that 'the query should be executed in an optimized way'. You also have qualitative information about the sizes of itemsDf and transactionsDf. Given that itemsDf is 'very small' and that the execution should be optimized, you should consider instructing Spark to perform a broadcast join, broadcasting the 'very small' DataFrame itemsDf to all executors. You can explicitly suggest this to Spark by wrapping itemsDf in a broadcast() call. One answer option does not include this call, so you can disregard it. Another answer option wraps broadcast() around transactionsDf, the bigger of the two DataFrames. This answer option does not make sense in the optimization context and can likewise be disregarded.

When thinking about broadcast(), you may also remember that it is a function in the pyspark.sql.functions module. One answer option, however, resolves to itemsDf.broadcast([...]). The DataFrame class has no broadcast() method, so this answer option can be eliminated as well.

Both remaining answer options resolve to transactionsDf.join([...]) in the first two gaps, so you now have to figure out the details of the join. You can pick between an outer and a left semi join. An outer join would include columns from both DataFrames, whereas a left semi join only includes columns from the 'left' table, here transactionsDf, just as the question asks. So, the correct answer is the one that uses the left_semi join.
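The difference between the two join types can also be modelled without Spark. The following is a plain-Python sketch, not Spark code: the helper functions left_semi_join and full_outer_join and the sample rows are hypothetical, and the outer join assumes each transactionId appears at most once per side.

```python
# Plain-Python model of the two join types; the data is made up for illustration.
transactions = [
    {"transactionId": 1, "value": 10.0},
    {"transactionId": 2, "value": 25.5},
    {"transactionId": 3, "value": 7.25},
]
items = [
    {"transactionId": 1, "itemName": "pen"},
    {"transactionId": 3, "itemName": "ink"},
    {"transactionId": 4, "itemName": "cap"},
]


def left_semi_join(left, right, key):
    """Keep left rows whose key appears in right; never add right-hand columns."""
    right_keys = {row[key] for row in right}
    return [dict(row) for row in left if row[key] in right_keys]


def full_outer_join(left, right, key):
    """Keep every key from either side; fields missing on one side become None.

    Assumes keys are unique within each side.
    """
    left_by_key = {row[key]: row for row in left}
    right_by_key = {row[key]: row for row in right}
    left_cols = [c for c in left[0] if c != key]
    right_cols = [c for c in right[0] if c != key]
    joined = []
    for k in sorted(left_by_key.keys() | right_by_key.keys()):
        row = {key: k}
        for c in left_cols:
            row[c] = left_by_key.get(k, {}).get(c)
        for c in right_cols:
            row[c] = right_by_key.get(k, {}).get(c)
        joined.append(row)
    return joined


semi = left_semi_join(transactions, items, "transactionId")
outer = full_outer_join(transactions, items, "transactionId")
```

In this model, semi carries only the columns of transactions, while every row of outer also carries itemName, which is why the outer join does not satisfy the question.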


Contribute your Thoughts:

Valentin
4 months ago
Wait, are we supposed to use the itemsDf or the transactionsDf as the main DataFrame? I'm a bit confused, but I'll go with B just to be safe.
upvoted 0 times
Meaghan
3 months ago
Great choice, B is the correct answer for this scenario.
upvoted 0 times
...
Maile
3 months ago
That makes sense, let's use transactionsDf for the join.
upvoted 0 times
...
Lizette
4 months ago
I agree, let's go with option B then.
upvoted 0 times
...
Kattie
4 months ago
I think we should use transactionsDf as the main DataFrame.
upvoted 0 times
...
...
Brinda
4 months ago
Ah, the age-old debate of 'join' vs 'broadcast'. I think C is the way to go here. Nice and concise, plus it aligns with the requirements.
upvoted 0 times
...
Micheal
5 months ago
Haha, this question is like a game of 'Spot the Difference'! I'm going with D. Seems like a straightforward solution to me.
upvoted 0 times
Elvis
4 months ago
I agree, D seems like the most straightforward option.
upvoted 0 times
...
Lera
4 months ago
I'm not sure, but D looks like the best choice.
upvoted 0 times
...
Claribel
4 months ago
I think D is the correct answer too.
upvoted 0 times
...
...
Vincent
5 months ago
I think D is the correct answer because it focuses on itemsDf first.
upvoted 0 times
...
Amie
5 months ago
Hmm, this is a tricky one. I'm leaning towards A since the question specifically mentions that the itemsDf DataFrame is much smaller, so using 'broadcast' makes sense to optimize the query.
upvoted 0 times
Magdalene
4 months ago
I see your point, but I still think A is the best choice based on the size of itemsDf mentioned in the question.
upvoted 0 times
...
Erick
4 months ago
But what about option E? It also uses broadcasting but with transactionsDf instead.
upvoted 0 times
...
Cyndy
4 months ago
I agree, broadcasting the smaller DataFrame is a good optimization technique.
upvoted 0 times
...
Sylvia
4 months ago
I think A is the correct answer because broadcasting the smaller DataFrame itemsDf will optimize the query.
upvoted 0 times
...
...
Goldie
5 months ago
But A includes broadcasting itemsDf which is smaller, so it should be more optimized.
upvoted 0 times
...
Antonio
5 months ago
I think the answer is C. Using 'left_semi' join ensures that we only keep the rows in the transactionsDf DataFrame that have a matching transactionId in the itemsDf DataFrame.
upvoted 0 times
Pamella
4 months ago
I agree, 'left_semi' join is the correct choice.
upvoted 0 times
...
Markus
5 months ago
I think the answer is C.
upvoted 0 times
...
...
Malcom
5 months ago
I disagree, I believe the answer is C.
upvoted 0 times
...
Goldie
5 months ago
I think the answer is A.
upvoted 0 times
...
