
Databricks Exam Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 Topic 3 Question 51 Discussion

Actual exam question for Databricks's Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 exam
Question #: 51
Topic #: 3

Which of the following code blocks reads in the two-partition parquet file stored at filePath, making sure all columns are included exactly once even though each partition has a different schema?

Schema of first partition:

root
 |-- transactionId: integer (nullable = true)
 |-- predError: integer (nullable = true)
 |-- value: integer (nullable = true)
 |-- storeId: integer (nullable = true)
 |-- productId: integer (nullable = true)
 |-- f: integer (nullable = true)

Schema of second partition:

root
 |-- transactionId: integer (nullable = true)
 |-- predError: integer (nullable = true)
 |-- value: integer (nullable = true)
 |-- storeId: integer (nullable = true)
 |-- rollId: integer (nullable = true)
 |-- f: integer (nullable = true)
 |-- tax_id: integer (nullable = false)

Suggested Answer: B

This is a tricky question that involves knowledge of both schema merging and how Spark reads parquet files.

spark.read.option('mergeSchema', 'true').parquet(filePath)

Correct. Spark's DataFrameReader mergeSchema option works well here, since the columns that appear in both partitions have matching data types. Note that mergeSchema would fail if one or more columns with the same name appeared in both partitions with different data types.
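
A minimal, self-contained sketch of this behavior (the /tmp/merge_demo path and the toy columns below are illustrative assumptions, not part of the exam question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
demo_path = "/tmp/merge_demo"

# First file: transactionId, productId
df1 = spark.createDataFrame([(1, 10)], ["transactionId", "productId"])
df1.write.mode("overwrite").parquet(demo_path)

# Second file, appended with a different schema: transactionId, tax_id
df2 = spark.createDataFrame([(2, 99)], ["transactionId", "tax_id"])
df2.write.mode("append").parquet(demo_path)

# mergeSchema unions the per-file schemas, so all of transactionId,
# productId and tax_id appear exactly once
merged = spark.read.option("mergeSchema", "true").parquet(demo_path)
merged.printSchema()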

spark.read.parquet(filePath)

Incorrect. While this would read in data from both partitions, only the schema of the parquet file that happens to be read first would be used, so columns that appear only in the second partition (e.g. tax_id) would be lost.
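
Reading the same demo directory from the sketch above without mergeSchema illustrates the loss (which columns survive depends on which file's footer Spark happens to sample):

plain = spark.read.parquet("/tmp/merge_demo")
print(plain.columns)  # only one file's columns, e.g. tax_id may be missing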

nx = 0
for file in dbutils.fs.ls(filePath):
    if not file.name.endswith('.parquet'):
        continue
    df_temp = spark.read.parquet(file.path)
    if nx == 0:
        df = df_temp
    else:
        df = df.union(df_temp)
    nx = nx + 1
df

Wrong. The key idea of this solution is the DataFrame.union() command. While union merges all the data, it matches columns by position rather than by name and requires that both inputs have exactly the same number of columns with identical data types, which is not the case here.
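
A short sketch of that failure mode, using toy frames that mirror the mismatched partitions; the unionByName variant shown at the end (available since Spark 3.1) is a by-name contrast, not one of the answer choices:

from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException

spark = SparkSession.builder.getOrCreate()

a = spark.createDataFrame([(1, 10)], ["transactionId", "productId"])
b = spark.createDataFrame([(2, 99, 5)], ["transactionId", "tax_id", "f"])

try:
    a.union(b)  # raises: the inputs have a different number of columns
except AnalysisException:
    print("union failed as expected")

# Spark 3.1+ has a by-name variant that tolerates missing columns:
a.unionByName(b, allowMissingColumns=True).printSchema()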

spark.read.parquet(filePath, mergeSchema='y')

False. While using the mergeSchema option is the correct way to solve this problem, and it can even be passed to DataFrameReader.parquet() as in this code block, the option accepts the value true as a boolean or as the string 'true'. 'y' is not a valid value.
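
For contrast, these spellings are accepted (a sketch, assuming spark and filePath as defined in the question):

df = spark.read.option("mergeSchema", "true").parquet(filePath)  # string value
df = spark.read.option("mergeSchema", True).parquet(filePath)    # boolean value
df = spark.read.parquet(filePath, mergeSchema=True)              # keyword form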

nx = 0
for file in dbutils.fs.ls(filePath):
    if not file.name.endswith('.parquet'):
        continue
    df_temp = spark.read.parquet(file.path)
    if nx == 0:
        df = df_temp
    else:
        df = df.join(df_temp, how='outer')
    nx = nx + 1
df

No. This provokes a full outer join. While the resulting DataFrame will have all columns of both partitions, columns that appear in both partitions will be duplicated, and the question says all columns included in the partitions should appear exactly once.
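
A small sketch of that duplication with toy frames (column names assumed for illustration; behavior as described in the explanation above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

a = spark.createDataFrame([(1, 10)], ["transactionId", "productId"])
b = spark.createDataFrame([(1, 99)], ["transactionId", "tax_id"])

joined = a.join(b, how="outer")  # no join keys given
print(joined.columns)  # ['transactionId', 'productId', 'transactionId', 'tax_id']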

More info: Merging different schemas in Apache Spark | by Thiago Cordon | Data Arena | Medium

Static notebook | Dynamic notebook: See test 3, Question 37 (Databricks import instructions)


Contribute your Thoughts:

Rikki
5 months ago
I'm not sure, but I think option E should work too.
upvoted 0 times
...
Kristofer
5 months ago
I disagree, I believe the correct answer is D.
upvoted 0 times
...
Helga
5 months ago
I think the answer is B.
upvoted 0 times
...
Leontine
6 months ago
Yes, that's a good point. Option E might lead to data duplication, so I think option B is still the best choice.
upvoted 0 times
...
Lennie
6 months ago
That makes sense, but won't joining the DataFrames in option E cause data duplication?
upvoted 0 times
...
Sue
6 months ago
I prefer option D) because it merges all the partitions into a single DataFrame.
upvoted 0 times
...
Leontine
6 months ago
I agree with Lennie, option B) ensures all columns are included exactly once even with different schemas
upvoted 0 times
...
Lennie
6 months ago
I think the answer is B) spark.read.option("mergeSchema", "true").parquet(filePath)
upvoted 0 times
...
