Databricks Certified Data Engineer Professional: Resources


Overview

Each of the six sections below lists several skills measured by the Databricks Certified Data Engineer Professional exam. The resources are links to documentation and blog posts that cover those skills; I'll be working through them to plug any gaps in my knowledge before taking the exam.


Section 1: Databricks Tooling

Skill: Explain how Delta Lake uses the transaction log and cloud object storage to guarantee atomicity and durability
Resources:
- Storage configuration — Delta Lake Documentation
- What are ACID guarantees on Azure Databricks? – Azure Databricks | Microsoft Learn
- delta/PROTOCOL.md at master · delta-io/delta
Skill: Describe how Delta Lake's Optimistic Concurrency Control provides isolation, and which transactions might conflict
Resources:
- Concurrency control — Delta Lake Documentation
- Isolation levels and write conflicts on Azure Databricks – Azure Databricks | Microsoft Learn
Skill: Describe basic functionality of Delta clone
Resources:
- Clone a table on Azure Databricks – Azure Databricks | Microsoft Learn
Skill: Apply common Delta Lake indexing optimizations including partitioning, Z-order, bloom filters, and file sizes
Resources:
- Optimizations — Delta Lake Documentation
- When to partition tables on Azure Databricks – Azure Databricks | Microsoft Learn
- Adding and Deleting Partitions in Delta Lake tables | Delta Lake
- Bloom filter indexes – Azure Databricks | Microsoft Learn
- CREATE BLOOM FILTER INDEX – Azure Databricks – Databricks SQL | Microsoft Learn
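To make these indexing skills concrete, here is a minimal sketch of the statements involved, assuming a Databricks notebook where `spark` is already defined; the table and column names (`events`, `user_id`, `device_id`) are invented for illustration:

```python
# Hypothetical names; assumes a Databricks runtime with `spark` defined.

# Compact small files and co-locate rows by a commonly filtered column (Z-order).
spark.sql("OPTIMIZE events ZORDER BY (user_id)")

# Bloom filter index for selective point lookups (a Databricks-specific feature).
spark.sql("""
    CREATE BLOOMFILTER INDEX ON TABLE events
    FOR COLUMNS (device_id OPTIONS (fpp = 0.1, numItems = 50000000))
""")

# Influence the target data file size for future writes and OPTIMIZE runs.
spark.sql("ALTER TABLE events SET TBLPROPERTIES (delta.targetFileSize = '128mb')")
```

Z-ordering helps most on high-cardinality columns that queries filter on, which is exactly where Hive-style partitioning breaks down.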
Skill: Implement Delta tables optimized for the Databricks SQL service
Resources:
- Optimize data file layout – Azure Databricks | Microsoft Learn
- Optimization recommendations on Azure Databricks – Azure Databricks | Microsoft Learn
Skill: Contrast different strategies for partitioning data (e.g. identify proper partitioning columns to use)
Resources:
- When to partition tables on Azure Databricks – Azure Databricks | Microsoft Learn
- Best practices — Delta Lake Documentation
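As a sketch of the kind of choice the partitioning guidance discusses (all names invented): partition on a low-cardinality column that queries filter on, such as a date, rather than something high-cardinality like a customer ID; the linked Databricks guidance generally suggests not partitioning at all for smaller tables.

```python
# Illustrative only; assumes a Databricks runtime with `spark` defined.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales (
        order_id    BIGINT,
        customer_id BIGINT,
        amount      DOUBLE,
        order_date  DATE
    )
    USING DELTA
    PARTITIONED BY (order_date)   -- low cardinality, commonly used in WHERE clauses
""")
```

Partitioning by `customer_id` here would create one tiny directory per customer, which is the over-partitioning problem called out later in Section 2.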

Section 2: Data Processing (Batch processing, Incremental processing, and Optimization)

Skill: Describe and distinguish partition hints: coalesce, repartition, repartition by range, and rebalance
Resources:
- Hints – Azure Databricks – Databricks SQL | Microsoft Learn
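A quick sketch of the four hints side by side, assuming a Spark 3.x session and an invented `events` table:

```python
# COALESCE: merge to fewer partitions without a full shuffle.
spark.sql("SELECT /*+ COALESCE(8) */ * FROM events")

# REPARTITION: full shuffle to the given partition count and/or columns.
spark.sql("SELECT /*+ REPARTITION(200, user_id) */ * FROM events")

# REPARTITION_BY_RANGE: range-partitioned shuffle, useful before sorted writes.
spark.sql("SELECT /*+ REPARTITION_BY_RANGE(event_date) */ * FROM events")

# REBALANCE: let AQE even out partition sizes, splitting skewed partitions.
spark.sql("SELECT /*+ REBALANCE */ * FROM events")
```

The practical distinction to remember: COALESCE avoids a shuffle, REPARTITION forces one, and REBALANCE is best-effort and requires adaptive query execution.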
Skill: Articulate how to write PySpark DataFrames to disk while manually controlling the size of individual part-files
Resources:
- PySpark partitionBy() – Write to Disk Example – Spark By {Examples}
- Configure Delta Lake to control data file size – Azure Databricks | Microsoft Learn
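A minimal sketch of manual part-file control (paths and column names invented; `df` is an existing DataFrame in a Spark session):

```python
# Cap the rows written into any single part-file.
(df.write
   .format("delta")
   .partitionBy("event_date")              # one directory per date value
   .option("maxRecordsPerFile", 1_000_000) # upper bound on rows per part-file
   .mode("overwrite")
   .save("/mnt/lake/silver/events"))

# Alternatively, repartition first so each task writes one file per partition.
(df.repartition("event_date")
   .write.format("delta")
   .mode("overwrite")
   .save("/mnt/lake/silver/events"))
```

On Databricks, table properties such as `delta.targetFileSize` (from the linked doc) are usually preferable to hand-tuning, since auto-compaction handles this for you.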
Skill: Articulate multiple strategies for updating 1+ records in a Spark table
Resources:
- Table deletes, updates, and merges — Delta Lake Documentation
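Two of the common update strategies from the linked doc, sketched with invented table names; assumes a Databricks runtime (or `delta-spark`) with `spark` defined:

```python
from delta.tables import DeltaTable

# Strategy 1: MERGE, the general upsert pattern for 1+ changed records.
spark.sql("""
    MERGE INTO customers AS t
    USING updates AS s
    ON t.customer_id = s.customer_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

# Strategy 2: targeted programmatic UPDATE on a predicate.
DeltaTable.forName(spark, "customers").update(
    condition="customer_id = 42",
    set={"email": "'new@example.com'"},  # expressions, hence the nested quotes
)
```

A third strategy worth knowing for the exam is overwriting a whole partition with `replaceWhere`, which avoids a merge entirely when updates align with partition boundaries.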
Skill: Implement common design patterns unlocked by Structured Streaming and Delta Lake
Resources:
- Structured Streaming patterns on Azure Databricks – Azure Databricks | Microsoft Learn
- Delta table streaming reads and writes – Azure Databricks | Microsoft Learn
Skill: Explore and tune state information using stream-static joins and Delta Lake
Resources:
- Work with joins on Azure Databricks – Azure Databricks | Microsoft Learn
- Delta table streaming reads and writes – Azure Databricks | Microsoft Learn
Skill: Implement stream-static joins
Resources:
- As above
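A stream-static join in its simplest form, with invented table names; assumes a Databricks runtime where `spark` is defined:

```python
# Streaming side: an append-only bronze table.
streaming_orders = spark.readStream.table("bronze_orders")

# Static side: a Delta dimension table. With Delta, the latest snapshot of the
# static side is picked up at the start of each micro-batch.
customers = spark.read.table("dim_customers")

enriched = streaming_orders.join(customers, "customer_id", "left")

(enriched.writeStream
    .option("checkpointLocation", "/mnt/chk/enriched_orders")
    .trigger(availableNow=True)   # process available data, then stop
    .toTable("silver_orders"))
```

Note that a stream-static join is stateless: only stream-stream joins (and aggregations) accumulate state that needs watermark tuning.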
Skill: Implement necessary logic for deduplication using Spark Structured Streaming
Resources:
- Structured Streaming Programming Guide – Spark 3.5.3 Documentation
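The standard streaming deduplication pattern from the programming guide, sketched with invented names; assumes a Databricks runtime with `spark` defined:

```python
# Watermark bounds the dedup state; duplicates arriving within 30 minutes of
# each other (by event_time) are dropped, and older state is evicted.
deduped = (spark.readStream.table("bronze_events")
    .withWatermark("event_time", "30 minutes")
    .dropDuplicates(["event_id", "event_time"]))

(deduped.writeStream
    .option("checkpointLocation", "/mnt/chk/dedup")
    .toTable("silver_events"))
```

Spark 3.5 also added `dropDuplicatesWithinWatermark`, which dedupes on the key alone within the watermark delay; without a watermark, `dropDuplicates` keeps all keys in state forever, which is the classic unbounded-state pitfall.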
Skill: Enable CDF on Delta Lake tables and re-design data processing steps to process CDC output instead of incremental feed from normal Structured Streaming read
Resources:
- Change data feed — Delta Lake Documentation
- Simplify CDC with Delta Lake’s Data Feed | Databricks Blog
Skill: Leverage CDF to easily propagate deletes
Resources:
- Propagating Deletes: Managing Data Removal using D… – Databricks Community – 90978
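A sketch of the delete-propagation idea (table and column names invented; assumes a Databricks runtime with `spark` defined): enable CDF, read the change feed, filter to deletes, and apply them downstream.

```python
# One-time: turn on the change data feed for the upstream table.
spark.sql("ALTER TABLE silver_users SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")

# Read changes since a given table version; CDF adds _change_type,
# _commit_version, and _commit_timestamp columns.
changes = (spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 5)     # placeholder version
    .table("silver_users"))

# Propagate only the deletes to the downstream table.
changes.filter("_change_type = 'delete'").createOrReplaceTempView("pending_deletes")
spark.sql("DELETE FROM gold_users WHERE user_id IN (SELECT user_id FROM pending_deletes)")
```

In production this would typically be a streaming CDF read with a checkpoint rather than a batch read from a hard-coded version.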
Skill: Demonstrate how proper partitioning of data allows for simple archiving or deletion of data
Resources:
- partitioning documentation
Skill: Articulate how "smalls" (tiny files, scanning overhead, over-partitioning, etc.) induce performance problems in Spark queries
Resources:
- Configure Delta Lake to control data file size – Azure Databricks | Microsoft Learn

Section 3: Data Modeling

Skill: Describe the objective of data transformations during promotion from bronze to silver
Resources:
- What is a Medallion Architecture?
- What is the medallion lakehouse architecture? – Azure Databricks | Microsoft Learn
- Transform data – Azure Databricks | Microsoft Learn
Skill: Discuss how Change Data Feed (CDF) addresses past difficulties propagating updates and deletes within Lakehouse architecture
Resources:
- Change data feed — Delta Lake Documentation
Skill: Design a multiplex bronze table to avoid common pitfalls when trying to productionalize streaming workloads
Resources:
- Building CDC Pipelines with Databricks | Databricks Blog
Skill: Implement best practices when streaming data from multiplex bronze tables
Resources:
- A Data Engineer’s Guide to Optimized Streaming wit… – Databricks Community – 62969
- Advanced Streaming on Databricks — Multiplexing with Databricks Workflows | by Cody Austin Davis | Medium
Skill: Apply incremental processing, quality enforcement, and deduplication to process data from bronze to silver
Resources:
- What is the medallion lakehouse architecture? – Azure Databricks | Microsoft Learn
Skill: Make informed decisions about how to enforce data quality based on strengths and limitations of various approaches in Delta Lake
Resources:
- Data Quality Management With Databricks | Databricks
Skill: Implement tables avoiding issues caused by lack of foreign key constraints
Resources:
- Constraints on Azure Databricks – Azure Databricks | Microsoft Learn
- Constraints — Delta Lake Documentation
Skill: Add constraints to Delta Lake tables to prevent bad data from being written
Resources:
- As above
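The two enforced constraint types on Delta tables, sketched with invented names; assumes a Databricks runtime with `spark` defined:

```python
# NOT NULL: reject rows where the column is null.
spark.sql("ALTER TABLE silver_orders ALTER COLUMN order_id SET NOT NULL")

# CHECK: reject rows failing an arbitrary boolean expression.
spark.sql("""
    ALTER TABLE silver_orders
    ADD CONSTRAINT valid_amount CHECK (amount >= 0)
""")

# Writes violating either constraint now fail the whole transaction
# instead of silently landing bad data.
```

Worth remembering for the exam: primary and foreign key constraints on Databricks are informational only, which is why CHECK and NOT NULL carry the enforcement burden.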
Skill: Implement lookup tables and describe the trade-offs for normalized data models
Resources:
- Data modeling – Azure Databricks | Microsoft Learn
- Data Warehouse Modeling on Databricks | Databricks Blog
Skill: Diagram architectures and operations necessary to implement various Slowly Changing Dimension tables using Delta Lake with streaming and batch workloads
Skill: Implement SCD Type 0, 1, and 2 tables
Resources:
- DLT with SCD attribute?
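A deliberately simplified sketch of the SCD Type 2 merge step (all names invented; assumes a Databricks runtime with `spark` defined). This closes out the current row when a tracked attribute changes and inserts rows for brand-new keys:

```python
spark.sql("""
    MERGE INTO dim_customer AS t
    USING staged_updates AS s
    ON t.customer_id = s.customer_id AND t.is_current = true
    WHEN MATCHED AND t.address <> s.address THEN
      -- expire the old version of the row
      UPDATE SET t.is_current = false, t.end_date = s.effective_date
    WHEN NOT MATCHED THEN
      INSERT (customer_id, address, is_current, effective_date, end_date)
      VALUES (s.customer_id, s.address, true, s.effective_date, null)
""")
```

Caveat: a single MERGE cannot both expire an old row and insert its replacement for the same key, so full SCD2 implementations stage a union of the updates (the "mergeKey" trick in the SCD resources, or Delta Live Tables' `APPLY CHANGES` which handles this automatically). Type 1 is just `UPDATE SET *` on match; Type 0 simply never updates.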

Section 4: Security & Governance

Skill: Create dynamic views to perform data masking
Resources:
- Create a dynamic view – Azure Databricks | Microsoft Learn

Skill: Use dynamic views to control access to rows and columns
Resources:
- Same as above
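Both skills in one sketch (group, table, and column names invented; assumes Unity Catalog where the `is_member` function is available):

```python
spark.sql("""
    CREATE OR REPLACE VIEW customers_redacted AS
    SELECT
      customer_id,
      -- column masking: only members of a privileged group see real emails
      CASE WHEN is_member('pii_readers') THEN email
           ELSE '***REDACTED***' END AS email,
      country
    FROM customers
    -- row filtering: admins see everything, others only one region
    WHERE is_member('admins') OR country = 'GB'
""")
```

Access is then granted on the view rather than the underlying table, so consumers never need SELECT on `customers` itself.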

Section 5: Monitoring & Logging

Skill: Describe the elements in the Spark UI to aid in performance analysis, application debugging, and tuning of Spark applications
Resources:
- Debugging with the Apache Spark UI – Azure Databricks | Microsoft Learn
- Diagnose cost and performance issues using the Spark UI – Azure Databricks | Microsoft Learn
Skill: Inspect event timelines and metrics for stages and jobs performed on a cluster
Resources:
- Jobs timeline – Azure Databricks | Microsoft Learn
- Monitoring and observability for Databricks Jobs – Azure Databricks | Microsoft Learn
Skill: Draw conclusions from information presented in the Spark UI, Ganglia UI, and the Cluster UI to assess performance problems and debug failing applications
Resources:
- As above, and possibly Manage compute – Azure Databricks | Microsoft Learn
Skill: Design systems that control for cost and latency SLAs for production streaming jobs
Resources:
- Production considerations for Structured Streaming – Azure Databricks | Microsoft Learn
- Cost-Effective Streaming Data Pipelines | Databricks Blog
- Best practices for cost optimization – Azure Databricks | Microsoft Learn
Skill: Deploy and monitor streaming and batch jobs
Resources:
- Monitoring Structured Streaming queries on Azure Databricks – Azure Databricks | Microsoft Learn
- Run your first Structured Streaming workload | Databricks on AWS
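Beyond the UIs, streaming queries can be monitored programmatically; a small sketch with invented names, assuming a Databricks runtime with `spark` defined:

```python
query = (spark.readStream.table("bronze_events")
    .writeStream
    .option("checkpointLocation", "/mnt/chk/monitor_demo")
    .toTable("silver_events"))

# Metrics for the most recent micro-batch: input rows/sec, batch duration,
# state store sizes, and so on (None until the first batch completes).
print(query.lastProgress)

# Whether the query is actively processing or waiting for its next trigger.
print(query.status)
```

For production-grade observability the linked docs cover attaching a `StreamingQueryListener`, which pushes the same progress events to an external sink instead of polling.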

Section 6: Testing & Deployment

Skill: Adapt a notebook dependency pattern to use Python file dependencies
Resources:
- Install notebook dependencies – Azure Databricks | Microsoft Learn
- Run your first Structured Streaming workload – Azure Databricks | Microsoft Learn
Skill: Adapt Python code maintained as Wheels to direct imports using relative paths
Resources:
- Develop a Python wheel file using Databricks Asset Bundles – Azure Databricks | Microsoft Learn
- Use a Python wheel file in an Azure Databricks job – Azure Databricks | Microsoft Learn
Skill: Repair and rerun failed jobs
Resources:
- Schedule and orchestrate workflows – Azure Databricks | Microsoft Learn
- Troubleshoot and repair job failures – Azure Databricks | Microsoft Learn
- Repair a job run | Jobs API | REST API reference | Databricks on AWS
Skill: Create Jobs based on common use cases and patterns
Resources:
- Schedule and orchestrate workflows – Azure Databricks | Microsoft Learn

Skill: Create a multi-task job with multiple dependencies
Resources:
- Schedule and orchestrate workflows – Azure Databricks | Microsoft Learn

Skill: Configure the Databricks CLI and execute basic commands to interact with the workspace and clusters
Resources:
- What is the Databricks CLI? – Azure Databricks | Microsoft Learn

Skill: Execute commands from the CLI to deploy and monitor Databricks jobs
Resources:
- Databricks CLI commands – Azure Databricks | Microsoft Learn

Skill: Use REST API to clone a job, trigger a run, and export the run output
Resources:
- Trigger a new job run | Jobs API | REST API reference | Azure Databricks
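The clone / trigger / export flow against the Jobs 2.1 API, sketched in Python; the host URL, token, and job ID are placeholders, and `requests` must be installed:

```python
import requests

HOST = "https://adb-1234567890123456.7.azuredatabricks.net"   # placeholder workspace
HEADERS = {"Authorization": "Bearer <personal-access-token>"}  # placeholder token

# Clone a job: fetch its settings, tweak the name, create a new job from them.
settings = requests.get(f"{HOST}/api/2.1/jobs/get",
                        headers=HEADERS, params={"job_id": 123}).json()["settings"]
settings["name"] += " (clone)"
new_job_id = requests.post(f"{HOST}/api/2.1/jobs/create",
                           headers=HEADERS, json=settings).json()["job_id"]

# Trigger a run of the clone.
run_id = requests.post(f"{HOST}/api/2.1/jobs/run-now",
                       headers=HEADERS, json={"job_id": new_job_id}).json()["run_id"]

# Export the run output. Note: for multi-task jobs, get-output must be called
# with the run_id of an individual task run, not the parent job run.
output = requests.get(f"{HOST}/api/2.1/jobs/runs/get-output",
                      headers=HEADERS, params={"run_id": run_id}).json()
```

The same flow is available from the CLI (e.g. `databricks jobs run-now`), which is handy for the CLI-focused skills above.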
