This repository contains code to run basic checks on Delta Lake tables and identify common problems related to partitioning and Z-ordering.
The repo contains three notebooks:
- Profiler: This is the main notebook; execute it to run the profiler. It takes `databaseName` as input through a widget.
- Functions: This notebook contains helper functions that are referenced in the Profiler notebook.
- DataGenerator: This notebook is used to generate the test tables required to test the Profiler's functionality.
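As a rough illustration of how a notebook can read `databaseName` from a widget, here is a minimal sketch. It assumes the standard Databricks `dbutils.widgets.text` and `dbutils.widgets.get` utilities; the helper name `read_database_name` is hypothetical and is not taken from the Profiler notebook.

```python
def read_database_name(dbutils, default=""):
    """Create the databaseName text widget (if needed) and return its value.

    `dbutils` is the object Databricks injects into notebooks; it is passed
    in explicitly here so the helper can also be exercised outside Databricks.
    This helper is an illustrative assumption, not the repository's code.
    """
    # Register a text widget named "databaseName" with a default value.
    dbutils.widgets.text("databaseName", default)
    # Read the current widget value and trim surrounding whitespace.
    name = dbutils.widgets.get("databaseName").strip()
    if not name:
        raise ValueError("The databaseName widget is empty")
    return name
```

Inside a Databricks notebook this would be called as `read_database_name(dbutils)`.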

Planned enhancements:
- Modify the Profiler to run checks on each table in parallel rather than sequentially.
- Add a check for high cardinality of ZORDER columns.
- Add the following checks for partitioning:
  - The partition column does not have null values.
  - The partition column does not have blank or empty values.
  - The partition column has low cardinality.
  - The partition column provides a uniform data distribution.
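The partitioning checks listed above, and the idea of running checks per table in parallel, could be sketched roughly as follows. This is not the Profiler's implementation: the function names, the thresholds, and the skew measure (largest partition compared with the mean partition size) are all assumptions made for illustration. The input is assumed to be a mapping of partition value to row count, such as one collected from a `GROUP BY` on the partition column.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

# Hypothetical thresholds; suitable values depend on the data.
MAX_DISTINCT_PARTITIONS = 1000
MAX_SKEW_RATIO = 5.0


def check_partition_column(counts,
                           max_distinct=MAX_DISTINCT_PARTITIONS,
                           max_skew=MAX_SKEW_RATIO):
    """Return a list of issues found for one partition column.

    counts: dict mapping each partition value to its row count.
    """
    if not counts:
        return ["no partitions found"]
    issues = []
    # Check 1: the partition column should not contain nulls.
    if None in counts:
        issues.append("null values")
    # Check 2: the partition column should not contain blank or empty strings.
    if any(isinstance(v, str) and v.strip() == "" for v in counts):
        issues.append("blank or empty values")
    # Check 3: the partition column should have low cardinality.
    if len(counts) > max_distinct:
        issues.append("high cardinality")
    # Check 4: rows should be spread fairly evenly across partitions.
    sizes = list(counts.values())
    if max(sizes) > max_skew * mean(sizes):
        issues.append("skewed distribution")
    return issues


def profile_tables(tables, max_workers=4):
    """Run the checks for every table in parallel threads.

    tables: dict mapping table name to its partition counts.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(check_partition_column, tables.values())
        return dict(zip(tables.keys(), results))
```

Threads are used here because, in a notebook, most of the time per table would be spent waiting on the cluster rather than computing on the driver; that is a design assumption, and a different concurrency model may suit the actual Profiler better.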