Databricks Exam Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 Topic 1 Question 59 Discussion

Actual exam question for Databricks's Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 exam
Question #: 59
Topic #: 1

The code block shown below should return a DataFrame with only columns from DataFrame transactionsDf for which there is a corresponding transactionId in DataFrame itemsDf. DataFrame itemsDf is very small and much smaller than DataFrame transactionsDf. The query should be executed in an optimized way. Choose the answer that correctly fills the blanks in the code block to accomplish this.

__1__.__2__(__3__, __4__, __5__)

Suggested Answer: C

Correct code block:

transactionsDf.join(broadcast(itemsDf), 'transactionId', 'left_semi')

This question is extremely difficult and far exceeds the difficulty of the questions in the actual exam.

A first indication of what is being asked here is the remark that 'the query should be executed in an optimized way'. You also have qualitative information about the sizes of itemsDf and transactionsDf. Given that itemsDf is 'very small' and that the execution should be optimized, you should consider instructing Spark to perform a broadcast join, broadcasting the 'very small' DataFrame itemsDf to all executors. You can explicitly suggest this to Spark by wrapping itemsDf in a broadcast() call. One answer option does not include broadcast() at all, so you can disregard it. Another answer option wraps broadcast() around transactionsDf, the bigger of the two DataFrames. That option does not make sense in an optimization context and can likewise be disregarded.

When thinking about broadcast(), you may also remember that it is a function in the pyspark.sql.functions module. One answer option, however, resolves to itemsDf.broadcast([...]). The DataFrame class has no broadcast() method, so this answer option can be eliminated as well.
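As a rough sketch of the point above, the snippet below shows where broadcast() comes from and how it wraps the small DataFrame. It assumes PySpark is installed and that transactionsDf and itemsDf are existing DataFrames that share a transactionId column. The helper names semi_join_with_broadcast and semi_join_with_hint are hypothetical and only serve the illustration. The second helper uses DataFrame.hint("broadcast"), which is a DataFrame-level way to request the same thing.

```python
def semi_join_with_broadcast(transactionsDf, itemsDf):
    """Keep rows (and only the columns) of transactionsDf that have a match in itemsDf."""
    # broadcast() is a function in pyspark.sql.functions, not a DataFrame method.
    # Imported here (an illustration choice) so this file loads without Spark installed.
    from pyspark.sql.functions import broadcast

    # Mark the small DataFrame for broadcasting; the large one stays on the left.
    return transactionsDf.join(broadcast(itemsDf), "transactionId", "left_semi")


def semi_join_with_hint(transactionsDf, itemsDf):
    """Same join, but the broadcast request is expressed as a hint on the DataFrame."""
    return transactionsDf.join(itemsDf.hint("broadcast"), "transactionId", "left_semi")
```

Calling .explain() on either result prints the physical plan, which is one way to check whether Spark actually chose a broadcast-based join. The exact plan text depends on the Spark version and configuration.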

Both remaining answer options resolve to transactionsDf.join([...]) in the first two gaps, so you now have to figure out the details of the join. You can pick between an outer join and a left semi join. An outer join would include columns from both DataFrames, whereas a left semi join only includes columns from the 'left' table, here transactionsDf, just as the question asks. So, the correct answer is the one that uses the left_semi join.
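To make the difference between the two join types concrete, here is a small model in plain Python (not Spark) that treats each DataFrame as a list of dicts. The function names, the sample rows, and the columns value and itemName are made up for illustration. The semi join builds a lookup set from the small side once, which is loosely the idea behind shipping a small table to every executor.

```python
def left_semi_join(left_rows, right_rows, key):
    """Return left rows that have at least one match on `key`; right columns are never added."""
    right_keys = {row[key] for row in right_rows}  # built once from the small side
    return [row for row in left_rows if row[key] in right_keys]


def full_outer_join(left_rows, right_rows, key):
    """Return all rows from both sides, with columns from both and None where data is missing."""
    all_cols = {c for row in left_rows for c in row} | {c for row in right_rows for c in row}
    blank = dict.fromkeys(all_cols)  # every column present, defaulting to None
    joined, matched_right = [], set()
    for left in left_rows:
        partners = [(i, r) for i, r in enumerate(right_rows) if r[key] == left[key]]
        if partners:
            for i, right in partners:
                matched_right.add(i)
                joined.append({**blank, **right, **left})
        else:
            joined.append({**blank, **left})
    # Right rows without any partner on the left are kept as well.
    for i, right in enumerate(right_rows):
        if i not in matched_right:
            joined.append({**blank, **right})
    return joined


# Hypothetical sample data standing in for transactionsDf and itemsDf.
transactions = [
    {"transactionId": 1, "value": 10},
    {"transactionId": 2, "value": 20},
    {"transactionId": 3, "value": 30},
]
items = [
    {"transactionId": 1, "itemName": "a"},
    {"transactionId": 3, "itemName": "b"},
    {"transactionId": 9, "itemName": "c"},
]

semi = left_semi_join(transactions, items, "transactionId")
outer = full_outer_join(transactions, items, "transactionId")
```

In this model, semi contains only the transactions with ids 1 and 3 and only the transaction columns, while every row of outer carries the columns of both inputs.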


Contribute your Thoughts:

Leota
5 months ago
Haha, I bet the exam writer is just trying to trip us up with that 'optimized way' part. I'm going to go with C - it seems the most straightforward and efficient solution.
upvoted 0 times
Jani
4 months ago
I'm not sure, but C does seem like the most efficient solution. Let's go with that.
upvoted 0 times
...
Beula
4 months ago
Yeah, I agree. It's better to keep it simple when it comes to optimizing queries.
upvoted 0 times
...
Jolene
4 months ago
I think C is the best choice too. It's always good to go with the straightforward option.
upvoted 0 times
...
Ashleigh
4 months ago
I'm not sure, but C does seem like the most efficient solution here.
upvoted 0 times
...
Herminia
4 months ago
Yeah, I agree. It's better to keep it simple when optimizing code.
upvoted 0 times
...
Deeanna
4 months ago
Yeah, C looks like the most efficient option. Let's go with that.
upvoted 0 times
...
Dianne
5 months ago
I think C is the best choice too. It seems like the most optimized way to get the desired result.
upvoted 0 times
...
Starr
5 months ago
I think C is the best choice too. It's always good to go with the straightforward option.
upvoted 0 times
...
...
Kristian
5 months ago
I agree with Danilo, A seems like the correct choice for optimized execution.
upvoted 0 times
...
Danilo
5 months ago
But A includes broadcasting itemsDf which is smaller, so it should be more optimized.
upvoted 0 times
...
Nieves
5 months ago
Hold on, why would we use itemsDf as the main DataFrame? Doesn't the question say we want the columns from transactionsDf? I'm leaning more towards C or E.
upvoted 0 times
Patrick
4 months ago
So, the correct answer would be C. transactionsDf join broadcast(itemsDf) on 'transactionId' using 'left_semi'.
upvoted 0 times
...
Thaddeus
4 months ago
I agree, we should use broadcast(itemsDf) to optimize the query.
upvoted 0 times
...
Veronique
4 months ago
But the question asks for columns from transactionsDf, so we need to join transactionsDf.
upvoted 0 times
...
Leslie
4 months ago
I think we should use itemsDf because it is much smaller than transactionsDf.
upvoted 0 times
...
...
Aliza
5 months ago
I disagree, I believe the answer is C.
upvoted 0 times
...
Caitlin
5 months ago
Ah, I see! The key here is to use the 'left_semi' join type, which will return only the rows from transactionsDf that have a matching transactionId in itemsDf. Definitely going with C on this one.
upvoted 0 times
Albert
5 months ago
C) 1. transactionsDf 2. join 3. broadcast(itemsDf) 4. 'transactionId' 5. 'left_semi'
upvoted 0 times
...
Renato
5 months ago
C) 1. transactionsDf 2. join 3. broadcast(itemsDf) 4. 'transactionId' 5. 'left_semi'
upvoted 0 times
...
...
Jacquelyne
6 months ago
I'm not sure about this one. The question says the query should be executed in an optimized way, so I'm guessing the answer has something to do with using broadcast. Maybe C or E?
upvoted 0 times
...
Blossom
6 months ago
Hmm, I think the answer is C. The question specifically mentions that DataFrame itemsDf is much smaller than transactionsDf, so using broadcast(itemsDf) would be more efficient than broadcasting the larger DataFrame.
upvoted 0 times
Kanisha
5 months ago
Yes, C makes sense because broadcasting the smaller DataFrame would be more efficient.
upvoted 0 times
...
Nidia
5 months ago
I think C is the correct answer.
upvoted 0 times
...
...
Danilo
6 months ago
I think the answer is A.
upvoted 0 times
...
