A Data Engineer is building a simple data pipeline using Lakeflow Declarative Pipelines (LDP) in Databricks to ingest customer data. The raw customer data is stored in a cloud storage location in JSON format. The task is to create a Lakeflow Declarative Pipeline that reads the raw JSON data and writes it into a Delta table for further processing.
Which code snippet will correctly ingest the raw JSON data and create a Delta table using LDP?
A.
import dlt

@dlt.table
def raw_customers():
    return spark.read.format("csv").load("s3://my-bucket/raw-customers/")
B.
import dlt

@dlt.table
def raw_customers():
    return spark.read.json("s3://my-bucket/raw-customers/")
C.
import dlt

@dlt.table
def raw_customers():
    return spark.read.format("parquet").load("s3://my-bucket/raw-customers/")
D.
import dlt

@dlt.view
def raw_customers():
    return spark.format.json("s3://my-bucket/raw-customers/")
The correct choice is B. In Lakeflow Declarative Pipelines (LDP), a table is defined with the @dlt.table decorator, which persists the function's output as a managed Delta table. When ingesting raw JSON data, spark.read.json() (equivalently, spark.read.format("json").load()) is the standard approach: it reads the JSON-formatted files from the source, and Databricks automatically stores the result in Delta format. Option A reads the data as CSV, option C as Parquet, and option D uses @dlt.view (which does not persist a table) along with spark.format.json, which is not a valid Spark API.
Reference Source: Databricks Lakeflow Declarative Pipelines Developer Guide -- ''Create tables from raw JSON and Delta sources.''
When evaluating the Ganglia Metrics for a given cluster with 3 executor nodes, which indicator would signal proper utilization of the VM's resources?
Where in the Spark UI can one diagnose a performance problem induced by not leveraging predicate push-down?
The answer is the Query Detail screen, by interpreting the Physical Plan. Predicate push-down is an optimization that filters data at the source before loading it into memory or processing it further, which improves performance and reduces I/O by avoiding reads of unnecessary data. To benefit from it, use data sources and formats that support it, such as Delta Lake, Parquet, or JDBC, together with filter expressions that can actually be pushed down to the source.

To diagnose a performance problem caused by a missing push-down, open the Query Detail screen in the Spark UI, which shows information about a SQL query executed on a Spark cluster. It includes the Physical Plan, the plan Spark actually executed for the query. The Physical Plan lists the physical operators used, such as Scan, Filter, Project, or Aggregate, along with their input and output statistics (rows and bytes). By interpreting it, you can see whether the filter expressions were pushed down to the source and how much data each operator read or processed.

Verified Reference: [Databricks Certified Data Engineer Professional], under ''Spark Core'' section; Databricks Documentation, under ''Predicate pushdown'' section; Databricks Documentation, under ''Query detail page'' section.
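The effect described above can be illustrated with a plain-Python sketch (not Spark, and not a real data source); the row counts and predicate here are made up for illustration. The point is that a pushed-down predicate means only matching rows are ever materialized, while a full scan followed by an in-memory filter touches every row:

```python
# Illustrative data: 1000 rows, of which every 4th is in region "US".
rows = [{"id": i, "region": "US" if i % 4 == 0 else "EU"} for i in range(1000)]

def scan_then_filter(source, predicate):
    """No push-down: every row is read from the source, then filtered in memory."""
    read = list(source)                       # full scan: all rows materialized
    return len(read), [r for r in read if predicate(r)]

def scan_with_pushdown(source, predicate):
    """Push-down: the predicate is evaluated at the source, so only
    matching rows are ever materialized."""
    matched = [r for r in source if predicate(r)]
    return len(matched), matched              # only matching rows "read"

def pred(r):
    return r["region"] == "US"

read_full, out_full = scan_then_filter(rows, pred)
read_push, out_push = scan_with_pushdown(rows, pred)

assert out_full == out_push                   # same result either way
print(read_full, read_push)                   # prints: 1000 250
```

In a real Physical Plan, the same difference shows up as a Scan operator reporting far fewer output rows and bytes when PushedFilters are applied than when a separate Filter operator runs after a full scan.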
A task orchestrator has been configured to run two hourly tasks. First, an outside system writes Parquet data to a directory mounted at /mnt/raw_orders/. After this data is written, a Databricks job containing the following code is executed:
(spark.readStream
    .format("parquet")
    .load("/mnt/raw_orders/")
    .withWatermark("time", "2 hours")
    .dropDuplicates(["customer_id", "order_id"])
    .writeStream
    .trigger(once=True)
    .table("orders")
)
Assume that the fields customer_id and order_id serve as a composite key to uniquely identify each order, and that the time field indicates when the record was queued in the source system. If the upstream system is known to occasionally enqueue duplicate entries for a single order hours apart, which statement is correct?
Comprehensive and Detailed Explanation From Exact Extract:
Exact extract: ''dropDuplicates with watermark performs stateful deduplication on the keys within the watermark delay.''
Exact extract: ''Records older than the event-time watermark are considered late and may be dropped.''
Exact extract: ''trigger(once) processes all available data once and then stops.''
The 2-hour watermark bounds the deduplication state rather than guaranteeing global uniqueness. Duplicate orders enqueued within 2 hours of the first occurrence are matched against the retained state and removed. However, once the watermark advances more than 2 hours past a key's event time, its state is evicted; a duplicate re-enqueued after that (with a fresh, current time value, so it is not treated as late data) finds no matching state and is written again. Therefore the orders table may contain duplicate records for orders whose duplicate entries arrive more than 2 hours apart.
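The state-eviction behavior can be sketched in plain Python. This is a simplified model of the semantics, not the Spark implementation: event times are bare hour numbers, the watermark is simply max observed time minus 2, and state older than the watermark is evicted before each lookup.

```python
WATERMARK_HOURS = 2

def dedup_with_watermark(events):
    """events: list of (hour, customer_id, order_id) tuples in arrival order.
    Returns the events that survive watermark-bounded deduplication."""
    state = {}                      # (customer_id, order_id) -> event hour
    max_time = float("-inf")
    out = []
    for t, cust, order in events:
        max_time = max(max_time, t)
        watermark = max_time - WATERMARK_HOURS
        # Evict dedup state older than the watermark.
        state = {k: v for k, v in state.items() if v >= watermark}
        key = (cust, order)
        if key not in state:        # no state for this key -> record is emitted
            out.append((t, cust, order))
            state[key] = t
    return out

events = [
    (0, "c1", "o1"),   # first occurrence -> written
    (1, "c1", "o1"),   # duplicate 1h later, within the window -> dropped
    (5, "c1", "o1"),   # duplicate 5h later, state already evicted -> written again
]
print(dedup_with_watermark(events))   # [(0, 'c1', 'o1'), (5, 'c1', 'o1')]
```

The duplicate at hour 1 is removed, but the one at hour 5 survives, which is exactly why the orders table can still contain duplicates for entries enqueued more than 2 hours apart.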
===========
How are the operational aspects of Lakeflow Declarative Pipelines different from Spark Structured Streaming?
Comprehensive and Detailed Explanation From Exact Extract of Databricks Data Engineer Documents:
Databricks documentation explains that Lakeflow Declarative Pipelines build upon Structured Streaming but add higher-level orchestration and automation capabilities. They automatically manage dependencies, materialization, and recovery across multi-stage data flows without requiring external orchestration tools such as Airflow or Azure Data Factory. In contrast, Structured Streaming operates at a lower level, where developers must manually handle orchestration, retries, and dependencies between streaming jobs. Both support Delta Lake outputs and schema evolution; however, Lakeflow Declarative Pipelines simplify management by declaratively defining transformations and data quality expectations. Hence, the correct distinction is A --- automated orchestration and management in Lakeflow Declarative Pipelines.