The code block shown below should return a DataFrame with only columns from DataFrame transactionsDf for which there is a corresponding transactionId in DataFrame itemsDf. DataFrame
itemsDf is very small and much smaller than DataFrame transactionsDf. The query should be executed in an optimized way. Choose the answer that correctly fills the blanks in the code block to
accomplish this.
__1__.__2__(__3__, __4__, __5__)
Correct code block:
transactionsDf.join(broadcast(itemsDf), 'transactionId', 'left_semi')
This Question: is extremely difficult and exceeds the difficulty of questions in the exam by far.
A first indication of what is asked from you here is the remark that 'the query should be executed in an optimized way'. You also have qualitative information about the size of itemsDf and
transactionsDf. Given that itemsDf is 'very small' and that the execution should be optimized, you should consider instructing Spark to perform a broadcast join, broadcasting the 'very small'
DataFrame itemsDf to all executors. You can explicitly suggest this to Spark via wrapping itemsDf into a broadcast() operator. One answer option does not include this operator, so you can disregard
it. Another answer option wraps the broadcast() operator around transactionsDf - the bigger of the two DataFrames. This answer option does not make sense in the optimization context and can
likewise be disregarded.
When thinking about the broadcast() operator, you may also remember that it is a method of pyspark.sql.functions. One answer option, however, resolves to itemsDf.broadcast([...]). The DataFrame
class has no broadcast() method, so this answer option can be eliminated as well.
All two remaining answer options resolve to transactionsDf.join([...]) in the first 2 gaps, so you will have to figure out the details of the join now. You can pick between an outer and a left semi join. An
outer join would include columns from both DataFrames, where a left semi join only includes columns from the 'left' table, here transactionsDf, just as asked for by the question. So, the correct
answer is the one that uses the left_semi join.
Paris
5 months agoKaycee
5 months agoKimbery
5 months agoWilburn
5 months agoKaycee
5 months agoMerilyn
6 months agoSkye
6 months agoMerilyn
6 months agoMichal
6 months agoSkye
6 months agoMerilyn
6 months agoLayla
7 months agoTamar
7 months agoIsaiah
7 months agoFrancisca
7 months agoMarsha
5 months agoTammara
6 months agoClemencia
7 months agoRosendo
7 months ago