Databricks Certified Data Engineer Professional: Resources
Overview
Each of the 6 sections below has several skills being measured for the Databricks Certified Data Engineer Professional. The resources are links to blogs and documentation that look to cover those skills; I’ll be working through these links to plug any gaps in knowledge before taking the exam.
- Exam page: Databricks Certified Data Engineer Professional | Databricks
- Databricks course: https://customer-academy.databricks.com/learn/courses/2268/advanced-data-engineering-with-databricks
Section 1: Databricks Tooling
Section 2: Data Processing (Batch processing, Incremental processing, and Optimization)
| Skill | Resources |
| --- | --- |
| Describe and distinguish partition hints: coalesce, repartition, repartition by range, and rebalance (see sketch below) | Hints – Azure Databricks – Databricks SQL \| Microsoft Learn |
| Articulate how to write PySpark DataFrames to disk while manually controlling the size of individual part-files (see sketch below) | PySpark partitionBy() – Write to Disk Example – Spark By {Examples}; Configure Delta Lake to control data file size – Azure Databricks \| Microsoft Learn |
| Articulate multiple strategies for updating 1+ records in a Spark table (see sketch below) | Table deletes, updates, and merges — Delta Lake Documentation |
| Implement common design patterns unlocked by Structured Streaming and Delta Lake | Structured Streaming patterns on Azure Databricks – Azure Databricks \| Microsoft Learn; Delta table streaming reads and writes – Azure Databricks \| Microsoft Learn |
| Explore and tune state information using stream-static joins and Delta Lake (see sketch below) | Work with joins on Azure Databricks – Azure Databricks \| Microsoft Learn; Delta table streaming reads and writes – Azure Databricks \| Microsoft Learn |
| Implement stream-static joins | As above |
| Implement necessary logic for deduplication using Spark Structured Streaming (see sketch below) | Structured Streaming Programming Guide – Spark 3.5.3 Documentation |
| Enable CDF on Delta Lake tables and redesign data processing steps to process CDC output instead of an incremental feed from a normal Structured Streaming read (see sketch below) | Change data feed — Delta Lake Documentation; Simplify CDC with Delta Lake’s Data Feed \| Databricks Blog |
| Leverage CDF to easily propagate deletes | Propagating Deletes: Managing Data Removal using D… – Databricks Community – 90978 |
| Demonstrate how proper partitioning of data allows for simple archiving or deletion of data (see sketch below) | Partitioning documentation |
| Articulate how “smalls” (tiny files, scanning overhead, over-partitioning, etc.) introduce performance problems into Spark queries | Configure Delta Lake to control data file size – Azure Databricks \| Microsoft Learn |
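The sketches below are my own minimal, untested illustrations of the skills in this table, not code from the linked resources; every table, column, and path name in them is invented for the example. First, the four partition hints side by side:

```python
# Partition hints, illustrated on a hypothetical `sales` table.
# On Databricks, `spark` already exists; getOrCreate() keeps this runnable elsewhere.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# COALESCE reduces the partition count without a full shuffle.
spark.sql("SELECT /*+ COALESCE(4) */ * FROM sales")

# REPARTITION shuffles to exactly N partitions, optionally by columns.
spark.sql("SELECT /*+ REPARTITION(16, region) */ * FROM sales")

# REPARTITION_BY_RANGE range-partitions rows by the given columns.
spark.sql("SELECT /*+ REPARTITION_BY_RANGE(16, order_date) */ * FROM sales")

# REBALANCE smooths out skewed partitions; AQE picks the final count.
spark.sql("SELECT /*+ REBALANCE(region) */ * FROM sales")
```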
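Controlling part-file sizes, assuming a hypothetical DataFrame `df` and Delta table `sales`:

```python
# `maxRecordsPerFile` caps rows per part-file at write time;
# `delta.targetFileSize` (Databricks) lets Delta tune file sizes toward a target.
(df.write
   .format("delta")
   .option("maxRecordsPerFile", 1_000_000)
   .mode("append")
   .saveAsTable("sales"))

spark.sql("ALTER TABLE sales SET TBLPROPERTIES ('delta.targetFileSize' = '128mb')")
```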
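Two of the documented strategies for updating records in a Delta table, sketched against a hypothetical `silver.customers` target and an `updates_df` source:

```python
from delta.tables import DeltaTable

# Strategy 1: plain SQL UPDATE for predicate-based changes.
spark.sql(
    "UPDATE silver.customers SET status = 'inactive' "
    "WHERE last_seen < '2023-01-01'"
)

# Strategy 2: MERGE for keyed upserts.
target = DeltaTable.forName(spark, "silver.customers")
(target.alias("t")
   .merge(updates_df.alias("s"), "t.customer_id = s.customer_id")
   .whenMatchedUpdateAll()
   .whenNotMatchedInsertAll()
   .execute())
```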
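A stream-static join sketch: with Delta on the static side, each micro-batch re-reads the latest version of the dimension table, so dimension updates flow through without restarting the stream.

```python
orders_stream = spark.readStream.table("bronze.orders")  # streaming side
products = spark.table("silver.dim_products")            # static Delta side

enriched = orders_stream.join(products, "product_id", "left")

(enriched.writeStream
   .option("checkpointLocation", "/tmp/checkpoints/orders_enriched")
   .trigger(availableNow=True)
   .toTable("silver.orders_enriched"))
```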
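Streaming deduplication following the programming guide's pattern: a watermark bounds how much state Spark retains, and `dropDuplicates` discards repeats inside that window (Spark 3.5 also adds `dropDuplicatesWithinWatermark`).

```python
deduped = (
    spark.readStream.table("bronze.events")
    .withWatermark("event_time", "30 minutes")   # bound state retention
    .dropDuplicates(["event_id", "event_time"])  # include the watermark column
)
```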
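Enabling CDF and consuming the change feed, deletes included, instead of a plain incremental read, again on the hypothetical `silver.customers` table:

```python
spark.sql(
    "ALTER TABLE silver.customers "
    "SET TBLPROPERTIES (delta.enableChangeDataFeed = true)"
)

# Stream change rows; add .option("startingVersion", ...) to replay history.
changes = (
    spark.readStream
    .option("readChangeFeed", "true")
    .table("silver.customers")
)

# Deletes propagate via the _change_type metadata column.
deletes = changes.filter("_change_type = 'delete'")
```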
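Finally, partition-aligned deletes and small-file compaction together, on a hypothetical date-partitioned `bronze.events` table:

```python
# Partitioning by date means archival deletes remove whole files
# rather than rewriting them.
(df.write
   .format("delta")
   .partitionBy("event_date")
   .saveAsTable("bronze.events"))

spark.sql("DELETE FROM bronze.events WHERE event_date < '2023-01-01'")

# OPTIMIZE compacts the tiny files that inflate scan overhead;
# ZORDER co-locates data by a common (non-partition) filter column.
spark.sql("OPTIMIZE bronze.events ZORDER BY (user_id)")
```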
Section 3: Data Modeling
| Skill | Resources |
| --- | --- |
| Describe the objective of data transformations during promotion from bronze to silver | What is a Medallion Architecture?; What is the medallion lakehouse architecture? – Azure Databricks \| Microsoft Learn; Transform data – Azure Databricks \| Microsoft Learn |
| Discuss how Change Data Feed (CDF) addresses past difficulties propagating updates and deletes within Lakehouse architecture | Change data feed — Delta Lake Documentation |
| Design a multiplex bronze table to avoid common pitfalls when trying to productionize streaming workloads (see sketch below) | Building CDC Pipelines with Databricks \| Databricks Blog |
| Implement best practices when streaming data from multiplex bronze tables | A Data Engineer’s Guide to Optimized Streaming wit… – Databricks Community – 62969; Advanced Streaming on Databricks — Multiplexing with Databricks Workflows \| by Cody Austin Davis \| Medium |
| Apply incremental processing, quality enforcement, and deduplication to process data from bronze to silver | What is the medallion lakehouse architecture? – Azure Databricks \| Microsoft Learn |
| Make informed decisions about how to enforce data quality based on the strengths and limitations of various approaches in Delta Lake | Data Quality Management With Databricks \| Databricks |
| Implement tables avoiding issues caused by lack of foreign key constraints (see sketch below) | Constraints on Azure Databricks – Azure Databricks \| Microsoft Learn; Constraints — Delta Lake Documentation |
| Add constraints to Delta Lake tables to prevent bad data from being written | As above |
| Implement lookup tables and describe the trade-offs for normalized data models | Data modeling – Azure Databricks \| Microsoft Learn; Data Warehouse Modeling on Databricks \| Databricks Blog |
| Diagram architectures and operations necessary to implement various Slowly Changing Dimension tables using Delta Lake with streaming and batch workloads | |
| Implement SCD Type 0, 1, and 2 tables (see sketch below) | DLT with SCD attribute? |
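A multiplex-bronze fan-out sketch: all topics land in one bronze table, and each silver stream filters out its own topic rather than maintaining one bronze table per source. Everything here (table names, the `topic`/`value` columns, the inline schema) is assumed for illustration:

```python
import pyspark.sql.functions as F

bronze = spark.readStream.table("bronze.multiplex")

orders = (
    bronze.filter("topic = 'orders'")
    .select(F.from_json(F.col("value").cast("string"),
                        "order_id BIGINT, amount DOUBLE, ts TIMESTAMP").alias("v"))
    .select("v.*")
)

(orders.writeStream
   .option("checkpointLocation", "/tmp/checkpoints/orders")
   .trigger(availableNow=True)
   .toTable("silver.orders"))
```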
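The two enforced constraint types in Delta Lake, NOT NULL and CHECK, shown on a hypothetical `silver.orders` table; writes that violate either fail outright:

```python
# Reject null keys and negative amounts at write time.
spark.sql("ALTER TABLE silver.orders ALTER COLUMN order_id SET NOT NULL")
spark.sql("ALTER TABLE silver.orders ADD CONSTRAINT valid_amount CHECK (amount >= 0)")
```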
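An SCD Type 2 sketch condensed from the pattern in the Delta Lake merge documentation: stage changed rows twice so a single MERGE both expires the old current row and inserts the new version. The dimension schema (`current`, `effective_date`, `end_date`) and the `updates_df` batch are assumptions:

```python
from delta.tables import DeltaTable

dim = DeltaTable.forName(spark, "silver.dim_customers")

# Rows whose address really changed need a brand-new current row,
# so stage them with a NULL merge key (never matches, so always inserts).
new_versions = (
    updates_df.alias("u")
    .join(dim.toDF().alias("d"), "customer_id")
    .where("d.current = true AND u.address <> d.address")
    .selectExpr("NULL AS merge_key", "u.*")
)

# All updates keyed normally: these expire matching current rows
# and insert genuinely new customers.
staged = new_versions.unionByName(
    updates_df.selectExpr("customer_id AS merge_key", "*")
)

(dim.alias("d")
   .merge(staged.alias("s"), "d.customer_id = s.merge_key")
   .whenMatchedUpdate(
       condition="d.current = true AND d.address <> s.address",
       set={"current": "false", "end_date": "s.effective_date"})
   .whenNotMatchedInsert(values={
       "customer_id": "s.customer_id",
       "address": "s.address",
       "current": "true",
       "effective_date": "s.effective_date",
       "end_date": "null"})
   .execute())
```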
Section 4: Security & Governance
| Skill | Resources |
| --- | --- |
| Create dynamic views to perform data masking (see sketch below) | Create a dynamic view – Azure Databricks \| Microsoft Learn |
| Use dynamic views to control access to rows and columns | As above |
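A dynamic-view sketch combining column masking and row filtering; the group names and tables are made up, and `is_account_group_member` is the Unity Catalog membership check (use `is_member` for workspace-local groups):

```python
spark.sql("""
  CREATE OR REPLACE VIEW silver.customers_masked AS
  SELECT
    customer_id,
    CASE WHEN is_account_group_member('pii_admins')
         THEN email ELSE 'REDACTED' END AS email,      -- column masking
    region,
    total_spend
  FROM silver.customers
  WHERE CASE WHEN is_account_group_member('managers')
             THEN TRUE ELSE total_spend <= 10000 END   -- row filtering
""")
```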