Which of the following code blocks reads the parquet file stored at filePath into DataFrame itemsDf, using a valid schema for the sample of itemsDf shown below?
Sample of itemsDf:
+------+-----------------------------+-------------------+
|itemId|attributes                   |supplier           |
+------+-----------------------------+-------------------+
|1     |[blue, winter, cozy]         |Sports Company Inc.|
|2     |[red, summer, fresh, cooling]|YetiX              |
|3     |[green, summer, travel]      |Sports Company Inc.|
+------+-----------------------------+-------------------+
The challenge in this question comes from there being an array column in the schema. In addition, you should know how to pass a schema to the DataFrameReader that is invoked by spark.read.
The correct way to define an array of strings in a schema is ArrayType(StringType()). A schema can be passed to the DataFrameReader by simply chaining schema(structType) onto spark.read. Alternatively, you can also define a schema as a DDL-formatted string. For example, for the schema of itemsDf, the following string would make sense: itemId integer, attributes array<string>, supplier string.
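Putting this together, here is a minimal sketch of a valid read (assuming a SparkSession bound to spark and the variable filePath from the question; the schema variable name itemsDfSchema is made up for illustration):

from pyspark.sql.types import StructType, StructField, IntegerType, StringType, ArrayType

# Explicit schema: attributes is an array of strings.
itemsDfSchema = StructType([
    StructField("itemId", IntegerType()),
    StructField("attributes", ArrayType(StringType())),
    StructField("supplier", StringType()),
])

# Pass the schema to the DataFrameReader before triggering the read.
itemsDf = spark.read.schema(itemsDfSchema).parquet(filePath)

# Equivalent alternative: define the schema as a DDL-formatted string.
itemsDf = spark.read.schema(
    "itemId integer, attributes array<string>, supplier string"
).parquet(filePath)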
A thing to keep in mind is that in schema definitions you always need to instantiate the types, like so: StringType(). Just using StringType does not work in PySpark and will fail.
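A quick sketch of the difference (the exact error type may vary across PySpark versions):

from pyspark.sql.types import StructField, StringType

StructField("supplier", StringType())  # works: an instance of the type
StructField("supplier", StringType)    # fails: the class itself, not an instance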
Another concern with schemas is whether columns should be nullable, that is, allowed to contain null values. In the case at hand, this is not a concern, however, since the question just asks for a 'valid' schema. Both non-nullable and nullable column schemas would be valid here, since no null value appears in the DataFrame sample.
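For illustration, nullability is controlled by the third argument to StructField, which defaults to True (the variable names below are made up):

from pyspark.sql.types import StructType, StructField, IntegerType, StringType, ArrayType

# nullable=True is the default, so this matches the schema in the sketch above.
nullableSchema = StructType([
    StructField("itemId", IntegerType(), True),
    StructField("attributes", ArrayType(StringType()), True),
    StructField("supplier", StringType(), True),
])

# A non-nullable variant would also be a valid schema for the sample shown above.
nonNullableSchema = StructType([
    StructField("itemId", IntegerType(), False),
    StructField("attributes", ArrayType(StringType()), False),
    StructField("supplier", StringType(), False),
])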
More info: Learning Spark, 2nd Edition, Chapter 3