Databricks Certified Data Engineer Professional: Resources


Overview

Each of the six sections below lists several skills measured by the Databricks Certified Data Engineer Professional exam. The resources are links to documentation and blog posts that cover those skills; I'll be working through them to plug any gaps in my knowledge before taking the exam.


Section 1: Databricks Tooling

Skill: Explain how Delta Lake uses the transaction log and cloud object storage to guarantee atomicity and durability
Resources:
- Storage configuration — Delta Lake Documentation
- What are ACID guarantees on Azure Databricks? – Azure Databricks | Microsoft Learn
- delta/PROTOCOL.md at master · delta-io/delta
Skill: Describe how Delta Lake's Optimistic Concurrency Control provides isolation, and which transactions might conflict
Resources:
- Concurrency control — Delta Lake Documentation
- Isolation levels and write conflicts on Azure Databricks – Azure Databricks | Microsoft Learn
Skill: Describe basic functionality of Delta clone
Resources:
- Clone a table on Azure Databricks – Azure Databricks | Microsoft Learn
Skill: Apply common Delta Lake indexing optimizations including partitioning, Z-order, bloom filters, and file sizes
Resources:
- Optimizations — Delta Lake Documentation
- When to partition tables on Azure Databricks – Azure Databricks | Microsoft Learn
- Adding and Deleting Partitions in Delta Lake tables | Delta Lake
- Bloom filter indexes – Azure Databricks | Microsoft Learn
- CREATE BLOOM FILTER INDEX – Azure Databricks – Databricks SQL | Microsoft Learn
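To make these indexing skills concrete, here is a minimal sketch of the statements involved, assuming a Databricks notebook where `spark` is already defined; the table and column names (`events`, `user_id`, `device_id`) are invented for illustration:

```python
# Hypothetical names; assumes a Databricks runtime with `spark` defined.

# Compact small files and co-locate rows by a commonly filtered column (Z-order).
spark.sql("OPTIMIZE events ZORDER BY (user_id)")

# Bloom filter index for selective point lookups (a Databricks-specific feature).
spark.sql("""
    CREATE BLOOMFILTER INDEX ON TABLE events
    FOR COLUMNS (device_id OPTIONS (fpp = 0.1, numItems = 50000000))
""")

# Influence the target data file size for future writes and OPTIMIZE runs.
spark.sql("ALTER TABLE events SET TBLPROPERTIES (delta.targetFileSize = '128mb')")
```

Z-ordering helps most on high-cardinality columns that queries filter on, which is exactly where Hive-style partitioning breaks down.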
Skill: Implement Delta tables optimized for the Databricks SQL service
Resources:
- Optimize data file layout – Azure Databricks | Microsoft Learn
- Optimization recommendations on Azure Databricks – Azure Databricks | Microsoft Learn
Skill: Contrast different strategies for partitioning data (e.g. identify proper partitioning columns to use)
Resources:
- When to partition tables on Azure Databricks – Azure Databricks | Microsoft Learn
- Best practices — Delta Lake Documentation
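As a sketch of the kind of choice the partitioning guidance discusses (all names invented): partition on a low-cardinality column that queries filter on, such as a date, rather than something high-cardinality like a customer ID; the linked Databricks guidance generally suggests not partitioning at all for smaller tables.

```python
# Illustrative only; assumes a Databricks runtime with `spark` defined.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales (
        order_id    BIGINT,
        customer_id BIGINT,
        amount      DOUBLE,
        order_date  DATE
    )
    USING DELTA
    PARTITIONED BY (order_date)   -- low cardinality, commonly used in WHERE clauses
""")
```

Partitioning by `customer_id` here would create one tiny directory per customer, which is the over-partitioning problem called out later in Section 2.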

Section 2: Data Processing (Batch processing, Incremental processing, and Optimization)

Skill: Describe and distinguish partition hints: coalesce, repartition, repartition by range, and rebalance
Resources:
- Hints – Azure Databricks – Databricks SQL | Microsoft Learn
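A quick sketch of the four hints side by side, assuming a Spark 3.x session and an invented `events` table:

```python
# COALESCE: merge to fewer partitions without a full shuffle.
spark.sql("SELECT /*+ COALESCE(8) */ * FROM events")

# REPARTITION: full shuffle to the given partition count and/or columns.
spark.sql("SELECT /*+ REPARTITION(200, user_id) */ * FROM events")

# REPARTITION_BY_RANGE: range-partitioned shuffle, useful before sorted writes.
spark.sql("SELECT /*+ REPARTITION_BY_RANGE(event_date) */ * FROM events")

# REBALANCE: let AQE even out partition sizes, splitting skewed partitions.
spark.sql("SELECT /*+ REBALANCE */ * FROM events")
```

The practical distinction to remember: COALESCE avoids a shuffle, REPARTITION forces one, and REBALANCE is best-effort and requires adaptive query execution.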
Skill: Articulate how to write PySpark DataFrames to disk while manually controlling the size of individual part-files
Resources:
- PySpark partitionBy() – Write to Disk Example – Spark By {Examples}
- Configure Delta Lake to control data file size – Azure Databricks | Microsoft Learn
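A minimal sketch of manual part-file control (paths and column names invented; `df` is an existing DataFrame in a Spark session):

```python
# Cap the rows written into any single part-file.
(df.write
   .format("delta")
   .partitionBy("event_date")              # one directory per date value
   .option("maxRecordsPerFile", 1_000_000) # upper bound on rows per part-file
   .mode("overwrite")
   .save("/mnt/lake/silver/events"))

# Alternatively, repartition first so each task writes one file per partition.
(df.repartition("event_date")
   .write.format("delta")
   .mode("overwrite")
   .save("/mnt/lake/silver/events"))
```

On Databricks, table properties such as `delta.targetFileSize` (from the linked doc) are usually preferable to hand-tuning, since auto-compaction handles this for you.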
Skill: Articulate multiple strategies for updating 1+ records in a Spark table
Resources:
- Table deletes, updates, and merges — Delta Lake Documentation
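Two of the common update strategies from the linked doc, sketched with invented table names; assumes a Databricks runtime (or `delta-spark`) with `spark` defined:

```python
from delta.tables import DeltaTable

# Strategy 1: MERGE, the general upsert pattern for 1+ changed records.
spark.sql("""
    MERGE INTO customers AS t
    USING updates AS s
    ON t.customer_id = s.customer_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

# Strategy 2: targeted programmatic UPDATE on a predicate.
DeltaTable.forName(spark, "customers").update(
    condition="customer_id = 42",
    set={"email": "'new@example.com'"},  # expressions, hence the nested quotes
)
```

A third strategy worth knowing for the exam is overwriting a whole partition with `replaceWhere`, which avoids a merge entirely when updates align with partition boundaries.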
Skill: Implement common design patterns unlocked by Structured Streaming and Delta Lake
Resources:
- Structured Streaming patterns on Azure Databricks – Azure Databricks | Microsoft Learn
- Delta table streaming reads and writes – Azure Databricks | Microsoft Learn
Skill: Explore and tune state information using stream-static joins and Delta Lake
Resources:
- Work with joins on Azure Databricks – Azure Databricks | Microsoft Learn
- Delta table streaming reads and writes – Azure Databricks | Microsoft Learn
Skill: Implement stream-static joins
Resources:
- As above
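A stream-static join in its simplest form, with invented table names; assumes a Databricks runtime where `spark` is defined:

```python
# Streaming side: an append-only bronze table.
streaming_orders = spark.readStream.table("bronze_orders")

# Static side: a Delta dimension table. With Delta, the latest snapshot of the
# static side is picked up at the start of each micro-batch.
customers = spark.read.table("dim_customers")

enriched = streaming_orders.join(customers, "customer_id", "left")

(enriched.writeStream
    .option("checkpointLocation", "/mnt/chk/enriched_orders")
    .trigger(availableNow=True)   # process available data, then stop
    .toTable("silver_orders"))
```

Note that a stream-static join is stateless: only stream-stream joins (and aggregations) accumulate state that needs watermark tuning.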
Skill: Implement necessary logic for deduplication using Spark Structured Streaming
Resources:
- Structured Streaming Programming Guide – Spark 3.5.3 Documentation
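The standard streaming deduplication pattern from the programming guide, sketched with invented names; assumes a Databricks runtime with `spark` defined:

```python
# Watermark bounds the dedup state; duplicates arriving within 30 minutes of
# each other (by event_time) are dropped, and older state is evicted.
deduped = (spark.readStream.table("bronze_events")
    .withWatermark("event_time", "30 minutes")
    .dropDuplicates(["event_id", "event_time"]))

(deduped.writeStream
    .option("checkpointLocation", "/mnt/chk/dedup")
    .toTable("silver_events"))
```

Spark 3.5 also added `dropDuplicatesWithinWatermark`, which dedupes on the key alone within the watermark delay; without a watermark, `dropDuplicates` keeps all keys in state forever, which is the classic unbounded-state pitfall.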
Skill: Enable CDF on Delta Lake tables and re-design data processing steps to process CDC output instead of incremental feed from normal Structured Streaming read
Resources:
- Change data feed — Delta Lake Documentation
- Simplify CDC with Delta Lake’s Data Feed | Databricks Blog
Skill: Leverage CDF to easily propagate deletes
Resources:
- Propagating Deletes: Managing Data Removal using D… – Databricks Community – 90978
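A sketch of the delete-propagation idea (table and column names invented; assumes a Databricks runtime with `spark` defined): enable CDF, read the change feed, filter to deletes, and apply them downstream.

```python
# One-time: turn on the change data feed for the upstream table.
spark.sql("ALTER TABLE silver_users SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")

# Read changes since a given table version; CDF adds _change_type,
# _commit_version, and _commit_timestamp columns.
changes = (spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 5)     # placeholder version
    .table("silver_users"))

# Propagate only the deletes to the downstream table.
changes.filter("_change_type = 'delete'").createOrReplaceTempView("pending_deletes")
spark.sql("DELETE FROM gold_users WHERE user_id IN (SELECT user_id FROM pending_deletes)")
```

In production this would typically be a streaming CDF read with a checkpoint rather than a batch read from a hard-coded version.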
Skill: Demonstrate how proper partitioning of data allows for simple archiving or deletion of data
Resources:
- partitioning documentation
Skill: Articulate how "smalls" (tiny files, scanning overhead, over-partitioning, etc.) induce performance problems in Spark queries
Resources:
- Configure Delta Lake to control data file size – Azure Databricks | Microsoft Learn

Section 3: Data Modeling

Skill: Describe the objective of data transformations during promotion from bronze to silver
Resources:
- What is a Medallion Architecture?
- What is the medallion lakehouse architecture? – Azure Databricks | Microsoft Learn
- Transform data – Azure Databricks | Microsoft Learn
Skill: Discuss how Change Data Feed (CDF) addresses past difficulties propagating updates and deletes within Lakehouse architecture
Resources:
- Change data feed — Delta Lake Documentation
Skill: Design a multiplex bronze table to avoid common pitfalls when trying to productionalize streaming workloads
Resources:
- Building CDC Pipelines with Databricks | Databricks Blog
Skill: Implement best practices when streaming data from multiplex bronze tables
Resources:
- A Data Engineer’s Guide to Optimized Streaming wit… – Databricks Community – 62969
- Advanced Streaming on Databricks — Multiplexing with Databricks Workflows | by Cody Austin Davis | Medium
Skill: Apply incremental processing, quality enforcement, and deduplication to process data from bronze to silver
Resources:
- What is the medallion lakehouse architecture? – Azure Databricks | Microsoft Learn
Skill: Make informed decisions about how to enforce data quality based on strengths and limitations of various approaches in Delta Lake
Resources:
- Data Quality Management With Databricks | Databricks
Skill: Implement tables avoiding issues caused by lack of foreign key constraints
Resources:
- Constraints on Azure Databricks – Azure Databricks | Microsoft Learn
- Constraints — Delta Lake Documentation
Skill: Add constraints to Delta Lake tables to prevent bad data from being written
Resources:
- As above
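The two enforced constraint types on Delta tables, sketched with invented names; assumes a Databricks runtime with `spark` defined:

```python
# NOT NULL: reject rows where the column is null.
spark.sql("ALTER TABLE silver_orders ALTER COLUMN order_id SET NOT NULL")

# CHECK: reject rows failing an arbitrary boolean expression.
spark.sql("""
    ALTER TABLE silver_orders
    ADD CONSTRAINT valid_amount CHECK (amount >= 0)
""")

# Writes violating either constraint now fail the whole transaction
# instead of silently landing bad data.
```

Worth remembering for the exam: primary and foreign key constraints on Databricks are informational only, which is why CHECK and NOT NULL carry the enforcement burden.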
Skill: Implement lookup tables and describe the trade-offs for normalized data models
Resources:
- Data modeling – Azure Databricks | Microsoft Learn
- Data Warehouse Modeling on Databricks | Databricks Blog
Skill: Diagram architectures and operations necessary to implement various Slowly Changing Dimension tables using Delta Lake with streaming and batch workloads
Skill: Implement SCD Type 0, 1, and 2 tables
Resources:
- DLT with SCD attribute?
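A deliberately simplified sketch of the SCD Type 2 merge step (all names invented; assumes a Databricks runtime with `spark` defined). This closes out the current row when a tracked attribute changes and inserts rows for brand-new keys:

```python
spark.sql("""
    MERGE INTO dim_customer AS t
    USING staged_updates AS s
    ON t.customer_id = s.customer_id AND t.is_current = true
    WHEN MATCHED AND t.address <> s.address THEN
      -- expire the old version of the row
      UPDATE SET t.is_current = false, t.end_date = s.effective_date
    WHEN NOT MATCHED THEN
      INSERT (customer_id, address, is_current, effective_date, end_date)
      VALUES (s.customer_id, s.address, true, s.effective_date, null)
""")
```

Caveat: a single MERGE cannot both expire an old row and insert its replacement for the same key, so full SCD2 implementations stage a union of the updates (the "mergeKey" trick in the SCD resources, or Delta Live Tables' `APPLY CHANGES` which handles this automatically). Type 1 is just `UPDATE SET *` on match; Type 0 simply never updates.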

Section 4: Security & Governance

Skill: Create dynamic views to perform data masking
Resources:
- Create a dynamic view – Azure Databricks | Microsoft Learn

Skill: Use dynamic views to control access to rows and columns
Resources:
- Same as above
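Both skills in one sketch (group, table, and column names invented; assumes Unity Catalog where the `is_member` function is available):

```python
spark.sql("""
    CREATE OR REPLACE VIEW customers_redacted AS
    SELECT
      customer_id,
      -- column masking: only members of a privileged group see real emails
      CASE WHEN is_member('pii_readers') THEN email
           ELSE '***REDACTED***' END AS email,
      country
    FROM customers
    -- row filtering: admins see everything, others only one region
    WHERE is_member('admins') OR country = 'GB'
""")
```

Access is then granted on the view rather than the underlying table, so consumers never need SELECT on `customers` itself.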

Section 5: Monitoring & Logging

Skill: Describe the elements in the Spark UI to aid in performance analysis, application debugging, and tuning of Spark applications
Resources:
- Debugging with the Apache Spark UI – Azure Databricks | Microsoft Learn
- Diagnose cost and performance issues using the Spark UI – Azure Databricks | Microsoft Learn
Skill: Inspect event timelines and metrics for stages and jobs performed on a cluster
Resources:
- Jobs timeline – Azure Databricks | Microsoft Learn
- Monitoring and observability for Databricks Jobs – Azure Databricks | Microsoft Learn
Skill: Draw conclusions from information presented in the Spark UI, Ganglia UI, and the Cluster UI to assess performance problems and debug failing applications
Resources:
- As above, and possibly Manage compute – Azure Databricks | Microsoft Learn
Skill: Design systems that control for cost and latency SLAs for production streaming jobs
Resources:
- Production considerations for Structured Streaming – Azure Databricks | Microsoft Learn
- Cost-Effective Streaming Data Pipelines | Databricks Blog
- Best practices for cost optimization – Azure Databricks | Microsoft Learn
Skill: Deploy and monitor streaming and batch jobs
Resources:
- Monitoring Structured Streaming queries on Azure Databricks – Azure Databricks | Microsoft Learn
- Run your first Structured Streaming workload | Databricks on AWS
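Beyond the UIs, streaming queries can be monitored programmatically; a small sketch with invented names, assuming a Databricks runtime with `spark` defined:

```python
query = (spark.readStream.table("bronze_events")
    .writeStream
    .option("checkpointLocation", "/mnt/chk/monitor_demo")
    .toTable("silver_events"))

# Metrics for the most recent micro-batch: input rows/sec, batch duration,
# state store sizes, and so on (None until the first batch completes).
print(query.lastProgress)

# Whether the query is actively processing or waiting for its next trigger.
print(query.status)
```

For production-grade observability the linked docs cover attaching a `StreamingQueryListener`, which pushes the same progress events to an external sink instead of polling.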

Section 6: Testing & Deployment

Skill: Adapt a notebook dependency pattern to use Python file dependencies
Resources:
- Install notebook dependencies – Azure Databricks | Microsoft Learn
- Run your first Structured Streaming workload – Azure Databricks | Microsoft Learn
Skill: Adapt Python code maintained as Wheels to direct imports using relative paths
Resources:
- Develop a Python wheel file using Databricks Asset Bundles – Azure Databricks | Microsoft Learn
- Use a Python wheel file in an Azure Databricks job – Azure Databricks | Microsoft Learn
Skill: Repair and rerun failed jobs
Resources:
- Schedule and orchestrate workflows – Azure Databricks | Microsoft Learn
- Troubleshoot and repair job failures – Azure Databricks | Microsoft Learn
- Repair a job run | Jobs API | REST API reference | Databricks on AWS
Skill: Create Jobs based on common use cases and patterns
Resources:
- Schedule and orchestrate workflows – Azure Databricks | Microsoft Learn

Skill: Create a multi-task job with multiple dependencies
Resources:
- Schedule and orchestrate workflows – Azure Databricks | Microsoft Learn

Skill: Configure the Databricks CLI and execute basic commands to interact with the workspace and clusters
Resources:
- What is the Databricks CLI? – Azure Databricks | Microsoft Learn

Skill: Execute commands from the CLI to deploy and monitor Databricks jobs
Resources:
- Databricks CLI commands – Azure Databricks | Microsoft Learn

Skill: Use REST API to clone a job, trigger a run, and export the run output
Resources:
- Trigger a new job run | Jobs API | REST API reference | Azure Databricks
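The clone / trigger / export flow against the Jobs 2.1 API, sketched in Python; the host URL, token, and job ID are placeholders, and `requests` must be installed:

```python
import requests

HOST = "https://adb-1234567890123456.7.azuredatabricks.net"   # placeholder workspace
HEADERS = {"Authorization": "Bearer <personal-access-token>"}  # placeholder token

# Clone a job: fetch its settings, tweak the name, create a new job from them.
settings = requests.get(f"{HOST}/api/2.1/jobs/get",
                        headers=HEADERS, params={"job_id": 123}).json()["settings"]
settings["name"] += " (clone)"
new_job_id = requests.post(f"{HOST}/api/2.1/jobs/create",
                           headers=HEADERS, json=settings).json()["job_id"]

# Trigger a run of the clone.
run_id = requests.post(f"{HOST}/api/2.1/jobs/run-now",
                       headers=HEADERS, json={"job_id": new_job_id}).json()["run_id"]

# Export the run output. Note: for multi-task jobs, get-output must be called
# with the run_id of an individual task run, not the parent job run.
output = requests.get(f"{HOST}/api/2.1/jobs/runs/get-output",
                      headers=HEADERS, params={"run_id": run_id}).json()
```

The same flow is available from the CLI (e.g. `databricks jobs run-now`), which is handy for the CLI-focused skills above.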
