
Databricks Exam Databricks Certified Associate Developer for Apache Spark 3.0 Topic 3 Question 56 Discussion

Actual exam question from the Databricks Certified Associate Developer for Apache Spark 3.0 exam
Question #: 56
Topic #: 3

In which order should the code blocks shown below be run in order to read a JSON file from location jsonPath into a DataFrame and return only the rows that do not have the value 3 in column productId?

1. importedDf.createOrReplaceTempView("importedDf")

2. spark.sql("SELECT * FROM importedDf WHERE productId != 3")

3. spark.sql("FILTER * FROM importedDf WHERE productId != 3")

4. importedDf = spark.read.option("format", "json").path(jsonPath)

5. importedDf = spark.read.json(jsonPath)
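
For reference, a minimal sketch of how blocks 5, 1, and 2 would chain into a runnable read-then-filter sequence (assuming an active SparkSession named spark and that jsonPath points to a readable JSON file):

# Block 5: read the JSON file at jsonPath into a DataFrame
importedDf = spark.read.json(jsonPath)

# Block 1: register the DataFrame as a temporary view so it can be queried with SQL
importedDf.createOrReplaceTempView("importedDf")

# Block 2: keep only the rows whose productId is not 3
result = spark.sql("SELECT * FROM importedDf WHERE productId != 3")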

Suggested Answer: D

Correct code block:

from pyspark.sql.types import StructType, StructField, IntegerType, ArrayType, StringType

schema = StructType([
    StructField('itemId', IntegerType(), True),
    StructField('attributes', ArrayType(StringType(), True), True),
    StructField('supplier', StringType(), True)
])

spark.read.options(modifiedBefore='2029-03-20T05:44:46').schema(schema).parquet(filePath)

This question is more difficult than what you would encounter in the exam. In the exam, for this question type, only one error needs to be identified, not 'one or multiple' as in this question.

Columns in the schema definition use the wrong object type, the modification date threshold is specified incorrectly, and Spark cannot identify the file format.

Correct! Columns in the schema definition should use the StructField type. Building a schema from pyspark.sql.types, as here using classes like StructType and StructField, is one of multiple ways of expressing a schema in Spark. A StructType always contains a list of StructFields (see documentation linked below). So, nesting a StructType inside another StructType as shown in the question is wrong.

The modification date threshold should be specified by a keyword argument like options(modifiedBefore='2029-03-20T05:44:46') and not by two consecutive non-keyword arguments as in the original code block (see documentation linked below).
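
For illustration, a short sketch of passing modifiedBefore as a keyword argument (the timestamp is the one from the code block above):

# several options at once, each passed as a keyword argument
reader = spark.read.options(modifiedBefore='2029-03-20T05:44:46')

# the equivalent single-option form with an explicit key/value pair
reader = spark.read.option('modifiedBefore', '2029-03-20T05:44:46')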

Spark cannot identify the file format correctly because it has to be specified either by using DataFrameReader.format(), as an argument to DataFrameReader.load(), or directly by calling, for example, DataFrameReader.parquet().
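
A brief sketch of the three ways mentioned above to tell Spark the file format (filePath is assumed to point to a Parquet file):

# 1. specify the format via DataFrameReader.format(), then load()
df = spark.read.format("parquet").load(filePath)

# 2. pass the format as an argument to DataFrameReader.load()
df = spark.read.load(filePath, format="parquet")

# 3. call the format-specific reader method directly
df = spark.read.parquet(filePath)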

Columns in the schema are unable to handle empty values and the modification date threshold is specified incorrectly.

No. If StructField were used for the columns instead of StructType (see above), the third argument would specify whether the column is nullable. The original schema shows that the columns should be nullable, and this is specified correctly by the third argument being True in the schema in the code block.

It is correct, however, that the modification date threshold is specified incorrectly (see above).
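
As an illustration, a minimal sketch of the nullable flag as the third argument of StructField:

from pyspark.sql.types import StructField, IntegerType

# third argument True: the column may contain null values (as in the schema above)
nullable_item_id = StructField('itemId', IntegerType(), True)

# third argument False: the column is declared non-nullable
required_item_id = StructField('itemId', IntegerType(), False)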

The attributes array is specified incorrectly, Spark cannot identify the file format, and the syntax of the call to Spark's DataFrameReader is incorrect.

Wrong. The attributes array is specified correctly, following the syntax for ArrayType (see linked documentation below). That Spark cannot identify the file format is correct; see the correct answer above. In addition, the DataFrameReader is called correctly through the SparkSession spark.
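
For reference, a small sketch of the ArrayType syntax used for the attributes column; the inner True declares that array elements may be null, the outer True declares the column itself nullable:

from pyspark.sql.types import StructField, ArrayType, StringType

# an array of (possibly null) strings in a nullable column
attributes_field = StructField('attributes', ArrayType(StringType(), True), True)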

Columns in the schema definition use the wrong object type and the syntax of the call to Spark's DataFrameReader is incorrect.

Incorrect. It is true that the columns in the schema definition use the wrong object type (see above), but the syntax of the call to Spark's DataFrameReader is correct.

The data type of the schema is incompatible with the schema() operator and the modification date threshold is specified incorrectly.

False. The data type of the schema is StructType, which is an accepted data type for the DataFrameReader.schema() method. It is correct, however, that the modification date threshold is specified incorrectly (see correct answer above).
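
For completeness, a hedged sketch of the two schema representations that DataFrameReader.schema() accepts, a StructType object or a DDL-formatted string (column names match the schema above):

from pyspark.sql.types import StructType, StructField, IntegerType, ArrayType, StringType

# option 1: a StructType, as in the correct code block above
struct_schema = StructType([
    StructField('itemId', IntegerType(), True),
    StructField('attributes', ArrayType(StringType(), True), True),
    StructField('supplier', StringType(), True)
])
df = spark.read.schema(struct_schema).parquet(filePath)

# option 2: an equivalent DDL-formatted string
df = spark.read.schema("itemId INT, attributes ARRAY<STRING>, supplier STRING").parquet(filePath)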


Contribute your Thoughts:

Mel
2 months ago
I'm just gonna roll a dice and hope for the best. This exam is like a choose-your-own-adventure novel, but without the fun of getting eaten by a dragon.
upvoted 0 times
...
Noel
2 months ago
Option B, because who doesn't love a little SQL magic? Plus, it's shorter than the other options, which is always a plus in my book.
upvoted 0 times
Simona
1 day ago
User2: No, I believe it should be 5, 1, 2
upvoted 0 times
...
Delbert
6 days ago
User1: I think the order should be 5, 1, 3
upvoted 0 times
...
...
Johnetta
2 months ago
Hmm, I'm not sure. Option D also looks good, but I'm leaning towards A. Can't go wrong with the good old read, create view, and filter approach.
upvoted 0 times
Huey
22 days ago
Yeah, A makes sense. It's always good to stick to the basics when dealing with JSON files and DataFrames.
upvoted 0 times
...
Goldie
24 days ago
I agree, A seems like the right choice. It follows the standard process of reading the file, creating a view, and then filtering the data.
upvoted 0 times
...
Malcom
28 days ago
Yeah, option A makes sense. It's always good to stick to the basics when working with dataframes.
upvoted 0 times
...
Rikki
1 month ago
I agree, option A seems like the right choice. It follows the standard process of reading the file, creating a view, and then filtering the data based on the condition.
upvoted 0 times
...
Sabra
2 months ago
I think option A is the correct order. Start with reading the JSON file, then create a temporary view, and finally filter out the rows with productId not equal to 3.
upvoted 0 times
...
Malcom
2 months ago
I think A is the correct order. Start with reading the JSON file, then create a temp view, and finally filter out the rows with productId not equal to 3.
upvoted 0 times
...
...
Irene
2 months ago
I think option E is the way to go. We can directly read the JSON file using the `spark.read.json()` method and then use SQL to filter the rows.
upvoted 0 times
Terrilyn
2 months ago
Yes, I agree. Option E is the right order to read the JSON file and filter the rows based on the productId.
upvoted 0 times
...
Terrilyn
2 months ago
I think option E is correct. We should read the JSON file first and then filter the rows using SQL.
upvoted 0 times
...
...
Long
3 months ago
Option A seems correct. We need to read the JSON file first, then create a temporary view, and finally use SQL to filter the rows.
upvoted 0 times
...
Scarlet
3 months ago
But option E) makes more sense because we first need to read the JSON file before filtering the rows based on productId.
upvoted 0 times
...
Wendell
3 months ago
I disagree, I believe it should be option D) 4, 1, 3.
upvoted 0 times
...
Scarlet
3 months ago
I think the correct order is option E) 5, 1, 2.
upvoted 0 times
...
