
Databricks Exam Databricks Certified Associate Developer for Apache Spark 3.0 Topic 2 Question 50 Discussion

Actual exam question for Databricks's Databricks Certified Associate Developer for Apache Spark 3.0 exam
Question #: 50
Topic #: 2
[All Databricks Certified Associate Developer for Apache Spark 3.0 Questions]

The code block shown below should return a DataFrame with columns transactionsId, predError, value, and f from DataFrame transactionsDf. Choose the answer that correctly fills the blanks in the code block to accomplish this.

transactionsDf.__1__(__2__)
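For intuition, returning a fixed subset of columns is a plain projection. Since a live Spark session may not be at hand, here is a minimal plain-Python sketch of the same idea; the sample rows and all values below are hypothetical, not from the exam:

```python
# Plain-Python analogue of projecting a fixed set of columns from a
# DataFrame. Sample rows are hypothetical, for illustration only.
rows = [
    {"transactionsId": 1, "predError": 3, "value": 4, "f": 7, "storeId": 25},
    {"transactionsId": 2, "predError": 6, "value": 7, "f": 2, "storeId": 25},
]
wanted = ["transactionsId", "predError", "value", "f"]  # columns to keep
projected = [{k: row[k] for k in wanted} for row in rows]
print(projected[0])  # only the four requested columns survive
```

In PySpark the equivalent projection is done with the DataFrame's select method and the column names.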

Suggested Answer: A

transactionsDf.select('storeId').dropDuplicates().count()

Correct! After dropping all duplicates from column storeId, the remaining rows get counted, representing the number of unique values in the column.

transactionsDf.select(count('storeId')).dropDuplicates()

No. transactionsDf.select(count('storeId')) just returns a single-row DataFrame containing the number of non-null values in column storeId. dropDuplicates() has no effect in this context.

transactionsDf.dropDuplicates().agg(count('storeId'))

Incorrect. While transactionsDf.dropDuplicates() removes duplicate rows from transactionsDf, it does not do so based only on column storeId; it eliminates full-row duplicates instead.

transactionsDf.distinct().select('storeId').count()

Wrong. transactionsDf.distinct() identifies unique rows across all columns, not rows that are unique with respect to column storeId alone. This may leave duplicate values in that column, so the count does not represent the number of unique values in it.

transactionsDf.select(distinct('storeId')).count()

False. There is no distinct function in pyspark.sql.functions.
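The distinction the explanations above draw, counting distinct values of one column versus counting rows after full-row deduplication, can be sketched without Spark using plain Python sets (the sample rows below are hypothetical):

```python
# Hypothetical sample rows as (transactionId, storeId) pairs,
# including one exact full-row duplicate.
rows = [("t1", 25), ("t2", 25), ("t3", 3), ("t3", 3)]

# Analogue of select('storeId').dropDuplicates().count():
# deduplicate on storeId alone, then count.
unique_store_ids = len({store_id for _, store_id in rows})

# Analogue of distinct().select('storeId').count():
# deduplicate full rows first; storeId 25 still appears twice afterwards,
# so the count overstates the number of unique store IDs.
count_after_row_distinct = len([store_id for _, store_id in set(rows)])

print(unique_store_ids, count_after_row_distinct)  # 2 3
```

Only full-row deduplication collapses the duplicated ("t3", 3) row, which is why the order of distinct() and select() matters in the options above.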


Contribute your Thoughts:

Lezlie
1 month ago
I hear the exam is really DataFrame-ing difficult this year. You'll need to be pretty Spark-y to get through it!
upvoted 0 times
...
Dorothy
2 months ago
Option D, for sure! Using the col() function is the way to go. It makes the code more self-documenting and easier to maintain.
upvoted 0 times
...
Frederica
2 months ago
I'm torn between B and C. Both look like they'll do the job, but I'm leaning towards C because it's more explicit about the column names.
upvoted 0 times
...
Jacquelyne
2 months ago
Nah, I'm going with option E. Using the col() function to select the columns looks a bit more concise and modern.
upvoted 0 times
Arleen
20 days ago
I'm going with option E. Using the col() function to select the columns looks a bit more concise and modern.
upvoted 0 times
...
Jaime
1 month ago
I think option A is the correct one.
upvoted 0 times
...
...
Fairy
2 months ago
Hmm, I think option C looks good. Selecting the columns as a list in the select method seems straightforward and readable.
upvoted 0 times
Jules
18 days ago
Alright, I'll give option B a try then.
upvoted 0 times
...
Lindsey
24 days ago
I agree, option B seems like the best option.
upvoted 0 times
...
Nada
26 days ago
I disagree; I believe option B is the right choice: selecting columns directly.
upvoted 0 times
...
Shawna
1 month ago
I think option A is the correct one, using filter to select columns.
upvoted 0 times
...
...
Lemuel
3 months ago
But E doesn't make sense; we need to specify the columns individually, so B is the right choice.
upvoted 0 times
...
Charolette
3 months ago
I disagree, I believe the correct answer is E.
upvoted 0 times
...
Lemuel
3 months ago
I think the answer is B.
upvoted 0 times
...
