
Databricks Exam Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 Topic 2 Question 64 Discussion

Actual exam question for Databricks's Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 exam
Question #: 64
Topic #: 2

The code block displayed below contains one or more errors. The code block should load parquet files at location filePath into a DataFrame, only loading those files that have been modified before 2029-03-20 05:44:46. Spark should enforce a schema according to the schema shown below. Find the error.

Schema:

root
 |-- itemId: integer (nullable = true)
 |-- attributes: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- supplier: string (nullable = true)

Code block:

schema = StructType([
    StructType("itemId", IntegerType(), True),
    StructType("attributes", ArrayType(StringType(), True), True),
    StructType("supplier", StringType(), True)
])

spark.read.options("modifiedBefore", "2029-03-20T05:44:46").schema(schema).load(filePath)

Suggested Answer: D

Correct code block:

from pyspark.sql.types import StructType, StructField, IntegerType, ArrayType, StringType

schema = StructType([
    StructField('itemId', IntegerType(), True),
    StructField('attributes', ArrayType(StringType(), True), True),
    StructField('supplier', StringType(), True)
])

spark.read.options(modifiedBefore='2029-03-20T05:44:46').schema(schema).parquet(filePath)
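As a quick sanity check (a sketch, assuming a SparkSession named spark, a variable filePath, and the schema built above), printing the schema of the resulting DataFrame should reproduce the tree shown in the question:

df = spark.read.options(modifiedBefore='2029-03-20T05:44:46').schema(schema).parquet(filePath)
df.printSchema()
# Expected output, matching the schema in the question:
# root
#  |-- itemId: integer (nullable = true)
#  |-- attributes: array (nullable = true)
#  |    |-- element: string (containsNull = true)
#  |-- supplier: string (nullable = true)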

This question is more difficult than what you would encounter in the exam. In the exam, for this question type, only one error needs to be identified, not 'one or multiple' as in this question.

Columns in the schema definition use the wrong object type, the modification date threshold is specified incorrectly, and Spark cannot identify the file format.

Correct! Columns in the schema definition should use the StructField type. Building a schema from pyspark.sql.types, as here using classes like StructType and StructField, is one of multiple ways of expressing a schema in Spark; another way is shown in the sketch below. A StructType always contains a list of StructFields (see documentation linked below), so nesting StructType inside StructType as shown in the question is wrong.
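As a sketch (not part of the original answer), the same schema can also be expressed as a DDL-formatted string, which DataFrameReader.schema() accepts as well:

# Equivalent DDL string: no pyspark.sql.types classes needed
schema_ddl = 'itemId INT, attributes ARRAY<STRING>, supplier STRING'
spark.read.options(modifiedBefore='2029-03-20T05:44:46').schema(schema_ddl).parquet(filePath)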

The modification date threshold should be specified through a keyword argument, as in options(modifiedBefore='2029-03-20T05:44:46'), and not through two consecutive positional arguments as in the original code block: the plural DataFrameReader.options() accepts keyword arguments only, while it is the singular DataFrameReader.option() that takes a key and a value positionally (see documentation linked below).
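To illustrate (a sketch assuming the spark session, schema, and filePath from the corrected block), both of the following calls set the option correctly:

# Singular option(): one key and one value, passed positionally
spark.read.option('modifiedBefore', '2029-03-20T05:44:46').schema(schema).parquet(filePath)
# Plural options(): keyword arguments only
spark.read.options(modifiedBefore='2029-03-20T05:44:46').schema(schema).parquet(filePath)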

Spark cannot identify the file format, because the format has to be specified either through DataFrameReader.format(), as an argument to DataFrameReader.load(), or directly by calling, for example, DataFrameReader.parquet().
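In code, the three variants look like this (a sketch assuming spark and filePath exist; any one of them resolves this error):

# 1. Specify the format via DataFrameReader.format()
spark.read.format('parquet').load(filePath)
# 2. Pass the format as an argument to DataFrameReader.load()
spark.read.load(filePath, format='parquet')
# 3. Call the format-specific reader method directly
spark.read.parquet(filePath)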

Columns in the schema are unable to handle empty values and the modification date threshold is specified incorrectly.

No. If StructField were used for the columns instead of StructType (see above), the third argument would specify whether the column is nullable. The original schema shows that the columns should be nullable, and this is expressed correctly by the third argument being True in the code block's schema. It is correct, however, that the modification date threshold is specified incorrectly (see above).
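For illustration (a sketch): the nullable flag is the third argument of StructField, so a non-nullable column would differ only in that argument:

from pyspark.sql.types import StructField, StringType
StructField('supplier', StringType(), True)   # nullable, as in the question's schema
StructField('supplier', StringType(), False)  # non-nullable variant, for contrast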

The attributes array is specified incorrectly, Spark cannot identify the file format, and the syntax of the call to Spark's DataFrameReader is incorrect.

Wrong. The attributes array is specified correctly, following the syntax for ArrayType (see linked documentation below). It is correct that Spark cannot identify the file format (see the correct answer above). In addition, the DataFrameReader is invoked correctly through the SparkSession spark.
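For reference (a sketch), ArrayType takes an element type and a containsNull flag, which is exactly what the code block passes:

from pyspark.sql.types import ArrayType, StringType
ArrayType(StringType(), True)   # string elements that may contain nulls, as required here
ArrayType(StringType(), False)  # variant that forbids null elements, for contrast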

Columns in the schema definition use the wrong object type and the syntax of the call to Spark's DataFrameReader is incorrect.

Incorrect. The columns in the schema definition do use the wrong object type (see above), but the syntax of the call to Spark's DataFrameReader is correct.

The data type of the schema is incompatible with the schema() operator and the modification date threshold is specified incorrectly.

False. The data type of the schema is StructType, which is an accepted data type for the DataFrameReader.schema() method. It is correct, however, that the modification date threshold is specified incorrectly (see the correct answer above).


Contribute your Thoughts:

Delisa
1 month ago
Wait, am I the only one who noticed the modification date threshold is specified in the year 2029? Shouldn't that be 2023 or something?
upvoted 0 times
Latricia
1 month ago
This reminds me of that time I tried to load data into a DataFrame with a schema that was half-baked. Good times, good times.
upvoted 0 times
Beatriz
3 days ago
C) The data type of the schema is incompatible with the schema() operator and the modification date threshold is specified incorrectly.
upvoted 0 times
Temeka
7 days ago
B) Columns in the schema definition use the wrong object type and the syntax of the call to Spark's DataFrameReader is incorrect.
upvoted 0 times
Lajuana
13 days ago
A) The attributes array is specified incorrectly, Spark cannot identify the file format, and the syntax of the call to Spark's DataFrameReader is incorrect.
upvoted 0 times
Gerald
2 months ago
Yes, that could be causing the issue with loading the parquet files.
upvoted 0 times
Jospeh
2 months ago
I believe the modification date threshold is also specified incorrectly.
upvoted 0 times
Julio
2 months ago
Hmm, I think the data type of the schema is the issue here. Shouldn't it be 'StructField' instead of 'StructType'?
upvoted 0 times
Edelmira
2 months ago
Hold up, the attributes array is specified incorrectly. Shouldn't it be 'ArrayType(StringType(), True)'?
upvoted 0 times
Hillary
23 days ago
It seems like the columns in the schema definition are using the wrong object type.
upvoted 0 times
Danica
26 days ago
The code block has errors in the schema definition and the call to Spark's DataFrameReader.
upvoted 0 times
Salome
28 days ago
Yes, you're right. The attributes array should be 'ArrayType(StringType(), True)'.
upvoted 0 times
Tonette
2 months ago
The schema looks good, but the syntax of the call to Spark's DataFrameReader is off. Shouldn't it be .option('modifiedBefore', '2029-03-20T05:44:46') instead?
upvoted 0 times
Elise
14 days ago
D) Columns in the schema definition use the wrong object type, the modification date threshold is specified incorrectly, and Spark cannot identify the file format.
upvoted 0 times
Felix
17 days ago
C) The data type of the schema is incompatible with the schema() operator and the modification date threshold is specified incorrectly.
upvoted 0 times
Elizabeth
21 days ago
B) Columns in the schema definition use the wrong object type and the syntax of the call to Spark's DataFrameReader is incorrect.
upvoted 0 times
Rosalind
1 month ago
B) Columns in the schema definition use the wrong object type and the syntax of the call to Spark's DataFrameReader is incorrect.
upvoted 0 times
Martina
1 month ago
A) The attributes array is specified incorrectly, Spark cannot identify the file format, and the syntax of the call to Spark's DataFrameReader is incorrect.
upvoted 0 times
Zachary
2 months ago
A) The attributes array is specified incorrectly, Spark cannot identify the file format, and the syntax of the call to Spark's DataFrameReader is incorrect.
upvoted 0 times
Donte
2 months ago
I agree, the columns in the schema definition seem to be using the wrong object type.
upvoted 0 times
Gerald
2 months ago
I think the error is in the schema definition.
upvoted 0 times
