Welcome to Pass4Success


Databricks Exam Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 Topic 2 Question 50 Discussion

Actual exam question for Databricks's Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 exam
Question #: 50
Topic #: 2
[All Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 Questions]

Which of the following code blocks returns the number of unique values in column storeId of DataFrame transactionsDf?

Suggested Answer: A

transactionsDf.select('storeId').dropDuplicates().count()

Correct! Selecting column storeId first and then dropping duplicates leaves one row per distinct value; counting those rows gives the number of unique values in the column.

transactionsDf.select(count('storeId')).dropDuplicates()

No. transactionsDf.select(count('storeId')) returns a single-row DataFrame containing the number of non-null values in column storeId, not the number of unique values. Calling dropDuplicates() on that single-row result has no effect.

transactionsDf.dropDuplicates().agg(count('storeId'))

Incorrect. transactionsDf.dropDuplicates() removes duplicate rows from transactionsDf, but it considers all columns, not just storeId: only full-row duplicates are eliminated. Duplicate storeId values can therefore remain, so the subsequent count does not equal the number of unique values in the column.

transactionsDf.distinct().select('storeId').count()

Wrong. transactionsDf.distinct() keeps rows that are unique across all columns, not rows that are unique with respect to column storeId. Duplicate storeId values may survive, so the count after selecting the column does not necessarily represent the number of unique values in it.

transactionsDf.select(distinct('storeId')).count()

False. There is no distinct function in pyspark.sql.functions, so this code raises an error; distinct() exists only as a DataFrame method.
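The difference between deduplicating one column (option A) and deduplicating full rows can be checked without a Spark cluster. The sketch below is a plain-Python analogue of the two expressions; the column name storeId and the sample rows are illustrative, and the equivalent PySpark calls appear in the comments.

```python
# Illustrative sample table; storeId is the column of interest.
rows = [
    {"transactionId": 1, "storeId": 10},
    {"transactionId": 2, "storeId": 10},  # duplicate storeId, different row
    {"transactionId": 2, "storeId": 10},  # exact duplicate row
    {"transactionId": 3, "storeId": 25},
]

# Option A: transactionsDf.select('storeId').dropDuplicates().count()
# Project the column first, then deduplicate, then count.
store_ids = [r["storeId"] for r in rows]  # select('storeId')
n_unique = len(set(store_ids))            # dropDuplicates() + count() -> 2

# transactionsDf.dropDuplicates().agg(count('storeId')):
# deduplicating full rows removes only the exact duplicate row,
# so three rows remain and the storeId count is 3, not 2.
full_row_dedup = {tuple(sorted(r.items())) for r in rows}
n_after_full_dedup = len(full_row_dedup)  # 3

print(n_unique, n_after_full_dedup)  # 2 3
```

As an aside, pyspark.sql.functions also provides countDistinct, so transactionsDf.select(countDistinct('storeId')) would return the same number in a single aggregation.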


Contribute your Thoughts:

Huey
5 months ago
I see both points, but I think E) transactionsDf.distinct().select("storeId").count() could also be a viable option. By first getting distinct values and then selecting the column, we can still get the count of unique values.
Fredric
6 months ago
I disagree, I believe the correct answer is C) transactionsDf.select(distinct("storeId")).count(). The distinct function directly gives us unique values, so we just need to count them.
Julie
6 months ago
I think the answer is A) transactionsDf.select("storeId").dropDuplicates().count(). It makes sense to first select the column and then count the number of unique values.
Ben
6 months ago
I think E) transactionsDf.distinct().select('storeId').count() might be the right choice because it first gets distinct values and then counts the number of storeId.
Xuan
6 months ago
I'm not sure about the correct answer, but I think option D) looks promising as it drops duplicates from the whole DataFrame and then aggregates the count of storeId.
Cletus
6 months ago
I disagree, I believe the answer is C) transactionsDf.select(distinct('storeId')).count() because using distinct function directly on select will return the unique values.
Mirta
6 months ago
I think the correct answer is A) transactionsDf.select('storeId').dropDuplicates().count() because it first selects the column storeId, drops duplicates, and then counts the unique values.
