Which of the following code blocks reads the parquet file stored at filePath into DataFrame itemsDf, using a valid schema for the sample of itemsDf shown below?
Sample of itemsDf:
+------+-----------------------------+-------------------+
|itemId|attributes                   |supplier           |
+------+-----------------------------+-------------------+
|1     |[blue, winter, cozy]         |Sports Company Inc.|
|2     |[red, summer, fresh, cooling]|YetiX              |
|3     |[green, summer, travel]      |Sports Company Inc.|
+------+-----------------------------+-------------------+
The challenge in this question comes from there being an array column in the schema. In addition, you should know how to pass a schema to the DataFrameReader that is invoked by spark.read.
The correct way to define an array of strings in a schema is ArrayType(StringType()). A schema can be passed to the DataFrameReader by simply chaining schema(structType) onto spark.read. Alternatively, you can also define a schema as a DDL-formatted string. For example, for the schema of itemsDf, the following string would make sense: itemId integer, attributes array<string>, supplier string.
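Putting this together, here is a minimal sketch of a valid read (assuming a SparkSession bound to spark and the variable filePath from the question; the schema variable name itemsDfSchema is made up for illustration):

from pyspark.sql.types import StructType, StructField, IntegerType, StringType, ArrayType

# Explicit schema: attributes is an array of strings.
itemsDfSchema = StructType([
    StructField("itemId", IntegerType()),
    StructField("attributes", ArrayType(StringType())),
    StructField("supplier", StringType()),
])

# Pass the schema to the DataFrameReader before triggering the read.
itemsDf = spark.read.schema(itemsDfSchema).parquet(filePath)

# Equivalent alternative: define the schema as a DDL-formatted string.
itemsDf = spark.read.schema(
    "itemId integer, attributes array<string>, supplier string"
).parquet(filePath)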
A thing to keep in mind is that in schema definitions you always need to instantiate the types, like so: StringType(). Just using StringType does not work in PySpark and will fail.
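A quick sketch of the difference (the exact error type may vary across PySpark versions):

from pyspark.sql.types import StructField, StringType

StructField("supplier", StringType())  # works: an instance of the type
StructField("supplier", StringType)    # fails: the class itself, not an instance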
Another concern with schemas is whether columns should be nullable, that is, allowed to contain null values. In the case at hand, this is not a concern, however, since the question just asks for a 'valid' schema. Both non-nullable and nullable column schemas would be valid here, since no null value appears in the DataFrame sample.
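For illustration, nullability is controlled by the third argument to StructField, which defaults to True (the variable names below are made up):

from pyspark.sql.types import StructType, StructField, IntegerType, StringType, ArrayType

# nullable=True is the default, so this matches the schema in the sketch above.
nullableSchema = StructType([
    StructField("itemId", IntegerType(), True),
    StructField("attributes", ArrayType(StringType()), True),
    StructField("supplier", StringType(), True),
])

# A non-nullable variant would also be a valid schema for the sample shown above.
nonNullableSchema = StructType([
    StructField("itemId", IntegerType(), False),
    StructField("attributes", ArrayType(StringType()), False),
    StructField("supplier", StringType(), False),
])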
More info: Learning Spark, 2nd Edition, Chapter 3