
Commit a17ac1e

jayacrylyoonhyejin authored and committed
docs(assertions): better explanations of collection mechanisms (#14419)
1 parent f80cb19 commit a17ac1e

File tree

5 files changed: +65 −62 lines changed

docs-website/sidebars.js

Lines changed: 1 addition & 1 deletion

```diff
@@ -96,7 +96,7 @@ module.exports = {
       className: "saasOnly",
     },
     {
-      label: "Assertion Notes (Troubleshooting & Documentation)",
+      label: "Adding Notes to Assertions",
       type: "doc",
       id: "docs/managed-datahub/observe/assertion-notes",
       className: "saasOnly",
```
docs/managed-datahub/observe/assertions.md

Lines changed: 4 additions & 7 deletions

```diff
@@ -2,8 +2,7 @@
 
 :::note Supported Data Platforms
 Currently we support monitoring data on Snowflake, Redshift, BigQuery, and Databricks as part of DataHub Cloud Observe.
-For other data platforms, DataHub Cloud Observe can monitor assertions against dataset metrics (such as volume, or column nullness) and dataset freshenss by using the ingested statistics for each asset.
-Column Value and Custom SQL Assertions are not currently supported for other data platforms.
+DataHub Cloud Observe can still monitor assertions for other data platforms against dataset metrics (such as row count, or column nullness) and dataset freshness by using the [ingested statistics](/metadata-ingestion/docs/dev_guides/sql_profiles.md).
 :::
 
 An assertion is **a data quality test that finds data that violates a specified rule.**
@@ -25,13 +24,11 @@ For DataHub-provided assertion runners, we can deploy an agent in your environme
 
 #### Bulk Creating Assertions
 
-You can bulk create Freshness and Volume [Smart Assertions](/docs/managed-datahub/observe/smart-assertions.md) (AI Anomaly Monitors) across several tables at once via the [Data Health Dashboard](/docs/managed-datahub/observe/data-health-dashboard.md):
-
-<div align="center"><iframe width="560" height="315" src="https://www.loom.com/embed/f6720541914645aab6b28cdff8695d9f?sid=58dff84d-bb88-4f02-b814-17fb4986ad1f" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen></iframe></div>
+You can bulk create Freshness and Volume [Smart Assertions](/docs/managed-datahub/observe/smart-assertions.md) (AI Anomaly Monitors) across several tables at once via the [Data Health Dashboard](/docs/managed-datahub/observe/data-health-dashboard.md).
 
 To bulk create column metric assertions on a given dataset, follow the steps under the **Anomaly Detection** section of [Column Assertion](https://docs.datahub.com/docs/managed-datahub/observe/column-assertions#anomaly-detection-with-smart-assertions-).
 
-### AI Anomaly Detection (Smart Assertions)
+### Detecting Anomalies Across Massive Data Landscapes
 
 There are many cases where either you do not have the time to figure out what a good rule for an assertion is, or strict rules simply do not suffice for your data validation needs. Traditional rule-based assertions can become inadequate when dealing with complex data patterns or large-scale operations.
 
@@ -45,7 +42,7 @@ Here are some typical situations where manual assertion rules fall short:
 
 - **Dynamic data environments** - When data patterns evolve over time, manually updating assertion rules becomes a maintenance burden that can lead to false positives or missed anomalies.
 
-### The Smart Assertion Solution
+### The AI Smart Assertion Solution
 
 In these scenarios, you may want to consider creating a [Smart Assertion](./smart-assertions.md) to let machine learning automatically detect the normal patterns in your data and alert you when anomalies occur. This approach allows for more flexible and adaptive data quality monitoring without the overhead of manual rule maintenance.
```
docs/managed-datahub/observe/column-assertions.md

Lines changed: 52 additions & 46 deletions

```diff
@@ -126,86 +126,92 @@ Once these are in place, you're ready to create your Column Assertions!
 
 ### Steps
 
-1. Navigate to the Table that you want to monitor
-2. Click the **Quality** tab
+#### 1. Navigate to the Table that you want to monitor
+
+#### 2. Click the **Quality** tab
 
 <p align="left">
   <img width="90%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/observe/freshness/profile-validation-tab.png"/>
 </p>
 
-3. Click **+ Create Assertion**
+#### 3. Click **+ Create Assertion**
+
+#### 4. Choose **'Column'**
 
 <p align="left">
   <img width="40%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/observe/column/assertion-builder-column-choose-type.png"/>
 </p>
 
-4. Choose **Column**
+#### 5. Configure the evaluation **schedule**.
 
-5. Configure the evaluation **schedule**. This is the frequency at which the assertion will be evaluated to produce a
-   pass or fail result, and the times when the column values will be checked.
+This is the frequency at which the assertion will be evaluated to produce a
+pass or fail result, and the times when the column values will be checked.
 
-6. Configure the **column assertion type**. You can choose from **Column Value** or **Column Metric**.
-   **Column Value** assertions are used to monitor the value of a specific column in a table, and ensure that every row
-   adheres to a specific condition. **Column Metric** assertions are used to compute a metric for that column, and then compare the value of that metric to your expectations.
+#### 6. Configure the **column assertion type**.
+
+You can choose from **Column Value** or **Column Metric**.
+**Column Value** assertions are used to monitor the value of a specific column in a table, and ensure that every row
+adheres to a specific condition. **Column Metric** assertions are used to compute a metric for that column, and then compare the value of that metric to your expectations.
 
 <p align="left">
   <img width="30%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/observe/column/assertion-builder-column-assertion-type.png"/>
 </p>
 
-7. Configure the **column selection**. This defines the column that should be monitored by the Column Assertion.
-   You can choose from any of the columns from the table listed in the dropdown.
+#### 7. Configure the **column selection**.
+
+This defines the column that should be monitored by the Column Assertion.
+You can choose from any of the columns from the table listed in the dropdown.
 
 <p align="left">
   <img width="30%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/observe/column/assertion-builder-column-field-selection.png"/>
 </p>
```
```diff
-8. Configure the **evaluation criteria**. This step varies based on the type of assertion you chose in the previous step.
+#### 8. Configure the **evaluation criteria**. This step varies based on the type of assertion you chose in the previous step.
 
-- **Column Value Assertions**: You will be able to choose from a set of operators that can be applied to the column
-  value. The options presented will vary based on the data type of the selected column. For example with numeric types, you
-  can check that the column value is greater than a specific value. For string types, you can check that the column value
-  matches a particular regex pattern. You will also be able to control the behavior of null values in the column. If the
-  **Allow Nulls** option is _disabled_, any null values encountered will be reported as a failure when evaluating the
-  assertion.
+- **Column Value Assertions**: You will be able to choose from a set of operators that can be applied to the column
+  value. The options presented will vary based on the data type of the selected column. For example with numeric types, you
+  can check that the column value is greater than a specific value. For string types, you can check that the column value
+  matches a particular regex pattern. You will also be able to control the behavior of null values in the column. If the
+  **Allow Nulls** option is _disabled_, any null values encountered will be reported as a failure when evaluating the
+  assertion. Note that Smart Assertions are not supported for Column Value Assertions today.
 
-- **Column Metric Assertions**: You will be able to choose from a list of common metrics and then specify the operator
-  and value to compare against. The list of metrics will vary based on the data type of the selected column. For example
-  with numeric types, you can choose to compute the average value of the column, and then assert that it is greater than a
-  specific number. For string types, you can choose to compute the max length of all column values, and then assert that it
-  is less than a specific number.
+- **Column Metric Assertions**: You will be able to choose from a list of common metrics and then specify the operator
+  and value to compare against. The list of metrics will vary based on the data type of the selected column. For example
+  with numeric types, you can choose to compute the average value of the column, and then assert that it is greater than a
+  specific number. For string types, you can choose to compute the max length of all column values, and then assert that it
+  is less than a specific number. You can also select the **Detect with AI** option to use Smart Assertions to detect anomalies in the column metric.
```
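For intuition, the Column Value semantics described above (a per-row operator check, the **Allow Nulls** toggle, and a failure count) can be sketched in plain Python. This is a hypothetical helper for illustration, not DataHub's actual evaluation code:

```python
import re

def evaluate_column_value_assertion(values, pattern, allow_nulls=False):
    """Evaluate a regex-based Column Value assertion over column values.

    Mirrors the semantics above: nulls fail unless Allow Nulls is enabled,
    and any non-matching value counts as an invalid row.
    Returns (passed, invalid_count).
    """
    invalid = 0
    for value in values:
        if value is None:
            if not allow_nulls:  # nulls are failures when Allow Nulls is disabled
                invalid += 1
            continue
        if not re.fullmatch(pattern, value):
            invalid += 1
    return invalid == 0, invalid

# Example: every value must look like an email address
email_pattern = r"[^@]+@[^@]+\.[^@]+"
print(evaluate_column_value_assertion(["a@example.com"], email_pattern))        # (True, 0)
print(evaluate_column_value_assertion(["not-an-email", None], email_pattern))   # (False, 2)
```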
```diff
-9. Configure the **row evaluation type**. This defines which rows in the table the Column Assertion should evaluate. You can choose
-   from the following options:
+#### 9. Configure the **row evaluation type**. This defines which rows in the table the Column Assertion should evaluate.
 
-- **All Table Rows**: Evaluate the Column Assertion against all rows in the table. This is the default option. Note that
-  this may not be desirable for large tables.
+- **All Table Rows**: Evaluate the Column Assertion against all rows in the table. This is the default option. Note that
+  this may not be desirable for large tables.
 
-- **Only Rows That Have Changed**: Evaluate the Column Assertion only against rows that have changed since the last
-  evaluation. If you choose this option, you will need to specify a **High Watermark Column** to help determine which rows
-  have changed. A **High Watermark Column** is a column that contains a constantly-incrementing value - a date, a time, or
-  another always-increasing number. When selected, a query will be issued to the table find only the rows which have changed since the last assertion run.
+- **Only Rows That Have Changed**: Evaluate the Column Assertion only against rows that have changed since the last
+  evaluation. If you choose this option, you will need to specify a **High Watermark Column** to help determine which rows
+  have changed. A **High Watermark Column** is a column that contains a constantly-incrementing value - a date, a time, or
+  another always-increasing number. When selected, a query will be issued to the table to find only the rows which have changed since the last assertion run.
 
 <p align="left">
   <img width="60%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/observe/column/assertion-builder-column-row-evaluation-type.png"/>
 </p>
```
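The **Only Rows That Have Changed** option can be pictured as remembering the highest watermark value seen so far and scanning only past it on each run. A minimal sqlite3 sketch, with an assumed `events` table and `event_time` watermark column (not DataHub's actual query):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, event_time INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(1, 100), (2, 200), (3, 300)])

def rows_changed_since(conn, watermark):
    # Scan only rows past the last high watermark, as in the
    # "Only Rows That Have Changed" evaluation type.
    cur = conn.execute(
        "SELECT id, event_time FROM events WHERE event_time > ?", (watermark,))
    rows = cur.fetchall()
    # Advance the watermark to the highest value seen in this run.
    new_watermark = max((t for _, t in rows), default=watermark)
    return rows, new_watermark

rows, wm = rows_changed_since(conn, 100)   # picks up rows 2 and 3
print(rows, wm)                            # [(2, 200), (3, 300)] 300
rows, wm = rows_changed_since(conn, wm)    # nothing new since then
print(rows, wm)                            # [] 300
```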
```diff
-10. (Optional) Click **Advanced** to further customize the Column Assertion. The options listed here will vary based on the
-    type of assertion you chose in the previous step.
+#### 10. (Optional) Click **Advanced** to further customize the Column Assertion.
+
+The options listed here will vary based on the type of assertion you chose in the previous step.
 
-- **Invalid Values Threshold**: For **Column Value** assertions, you can configure the number of invalid values
-  (i.e. rows) that are allowed to fail before the assertion is marked as failing. This is useful if you want to allow a limited number
-  of invalid values in the column. By default this is 0, meaning the assertion will fail if any rows have an invalid column value.
+- **Invalid Values Threshold**: For **Column Value** assertions, you can configure the number of invalid values
+  (i.e. rows) that are allowed to fail before the assertion is marked as failing. This is useful if you want to allow a limited number
+  of invalid values in the column. By default this is 0, meaning the assertion will fail if any rows have an invalid column value.
 
-- **Source**: For **Column Metric** assertions, you can choose the mechanism that will be used to obtain the column
-  metric. **Query** will issue a query to the dataset to compute the metric. **DataHub Dataset Profile** will use the
-  DataHub Dataset Profile metadata to compute the metric. Note that this option requires that dataset profiling
-  statistics are up-to-date as of the assertion run time.
+- **Source**: For **Column Metric** assertions, you can choose the mechanism that will be used to obtain the column
+  metric. **Query** will issue a query to the dataset to compute the metric, which can be more expensive than Information Schema.
+  **DataHub Dataset Profile** will use the DataHub Dataset Profile metadata to compute the metric. This is the cheapest option, but requires that Dataset Profiles are reported to DataHub. By default, Ingestion will report Dataset Profiles to DataHub, which can be infrequent. You can report Dataset Profiles via the DataHub APIs for more frequent and reliable data.
 
-- **Additional Filters**: You can choose to add additional filters to the query that will be used to evaluate the
-  assertion. This is useful if you want to limit the assertion to a subset of rows in the table. Note this option will not
-  be available if you choose **DataHub Dataset Profile** as the **source**.
+- **Additional Filters**: You can choose to add additional filters to the query that will be used to evaluate the
+  assertion. This is useful if you want to limit the assertion to a subset of rows in the table. Note this option will not
+  be available if you choose **DataHub Dataset Profile** as the **source**.
```
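The **Query** source with **Additional Filters** effectively computes the chosen metric over a filtered subset of rows. A sqlite3 sketch; the `purchases` table and `column_metric` helper are illustrative, not DataHub internals:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (amount REAL, region TEXT)")
conn.executemany("INSERT INTO purchases VALUES (?, ?)",
                 [(10.0, "US"), (30.0, "US"), (500.0, "EU")])

def column_metric(conn, metric_sql, table, where=None):
    # Build the metric query, appending any additional filters.
    query = f"SELECT {metric_sql} FROM {table}"
    if where:
        query += f" WHERE {where}"
    return conn.execute(query).fetchone()[0]

avg_all = column_metric(conn, "AVG(amount)", "purchases")
avg_us = column_metric(conn, "AVG(amount)", "purchases", where="region = 'US'")
print(avg_all, avg_us)  # 180.0 20.0

# Assertion: the average US purchase amount should stay below 50
assert avg_us < 50
```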
```diff
-11. Configure actions that should be taken when the Column Assertion passes or fails
+#### 11. Configure actions that should be taken when the Column Assertion passes or fails
 
 <p align="left">
   <img width="45%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/observe/shared/assertion-builder-actions.png"/>
@@ -217,7 +223,7 @@ Once these are in place, you're ready to create your Column Assertions!
 - **Resolve incident**: Automatically resolve any incidents that were raised due to failures in this Column Assertion. Note that
   any other incidents will not be impacted.
 
-12. Click **Next** and then **Save**.
+#### 12. Click **Next** and then **Save**.
 
 And that's it! DataHub will now begin to monitor your Column Assertion for the table.
```
```diff
@@ -276,7 +282,7 @@ Note that to create or delete Assertions and Monitors for a specific entity on D
 
 In order to create or update a Column Assertion, you can use the `upsertDatasetColumnAssertionMonitor` mutation.
 
-##### Examples
+#### Examples
 
 Creating a Field Values Column Assertion that runs every 8 hours:
```
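Under the hood, mutations such as `upsertDatasetColumnAssertionMonitor` are ordinary GraphQL POSTs against DataHub's GraphQL endpoint. The sketch below only assembles the request pieces and stays offline; the endpoint URL is a placeholder, the mutation body is elided, and the variables are illustrative, so consult the documented examples for the real input shape:

```python
import json

GRAPHQL_ENDPOINT = "https://your-datahub-instance/api/graphql"  # placeholder URL

def build_graphql_request(mutation: str, variables: dict, token: str) -> dict:
    """Assemble the pieces of a DataHub GraphQL HTTP request.

    Actually sending it (e.g. with urllib or requests) is left out so
    this sketch stays offline.
    """
    return {
        "url": GRAPHQL_ENDPOINT,
        "headers": {
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # personal access token
        },
        # GraphQL requests are a JSON body with "query" and "variables".
        "body": json.dumps({"query": mutation, "variables": variables}),
    }

# Mutation text and input fields elided; see the documented examples
# for the real upsertDatasetColumnAssertionMonitor payload.
request = build_graphql_request("mutation { ... }", {"input": {}}, token="<token>")
print(request["headers"]["Content-Type"])  # application/json
```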
docs/managed-datahub/observe/freshness-assertions.md

Lines changed: 5 additions & 5 deletions

```diff
@@ -188,14 +188,14 @@ _Check whether the table has changed in a specific window of time_
   <img width="40%" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/observe/freshness/assertion-builder-freshness-source-type.png"/>
 </p>
 
-- **Audit Log**: Check the Data Platform operational audit log to determine whether the table changed within the evaluation period.
-- **Information Schema**: Check the Data Platform system metadata tables to determine whether the table changed within the evaluation period.
+- **Audit Log**: Check the Data Platform operational audit log to determine whether the table changed within the evaluation period. This will filter out no-ops (e.g. `INSERT 0`). However, the Audit Log can be delayed by several hours depending on the Data Platform, and is also a little more costly on the warehouse than Information Schema.
+- **Information Schema**: Check the Data Platform system metadata tables to determine whether the table changed within the evaluation period. This is the optimal balance between cost and accuracy for most Data Platforms.
 - **Last Modified Column**: Check for the presence of rows using a "Last Modified Time" column, which should reflect the time at which a given row was last changed in the table, to
-  determine whether the table changed within the evaluation period.
+  determine whether the table changed within the evaluation period. This issues a query to the table, which can be more expensive than Information Schema.
 - **High Watermark Column**: Monitor changes to a continuously-increasing "high watermark" column value to determine whether a table
   has been changed. This option is particularly useful for tables that grow consistently with time, for example fact or event (e.g. click-stream) tables. It is not available
-  when using a fixed lookback period.
-- **DataHub Operation**: Use DataHub Operations to determine whether the table changed within the evaluation period.
+  when using a fixed lookback period. This issues a query to the table, which can be more expensive than Information Schema.
+- **DataHub Operation**: Use DataHub Operations to determine whether the table changed within the evaluation period. This is the cheapest option, but requires that Operations are reported to DataHub. By default, Ingestion will report Operations to DataHub, which can be infrequent. You can report Operations via the DataHub APIs for more frequent and reliable data.
 
 8. Configure actions that should be taken when the Freshness Assertion passes or fails
```
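The **Last Modified Column** source above boils down to asking whether any row's last-modified value falls inside the evaluation window. A sqlite3 sketch with hypothetical names (`orders`, `last_modified`), not DataHub's generated query:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, last_modified INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 1000), (2, 2000), (3, 3500)])

def table_changed_within(conn, window_start, window_end):
    # Freshness passes if at least one row was modified inside the window.
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM orders WHERE last_modified BETWEEN ? AND ?",
        (window_start, window_end)).fetchone()
    return count > 0

print(table_changed_within(conn, 3000, 4000))  # True: row 3 changed at 3500
print(table_changed_within(conn, 4000, 5000))  # False: no changes in window
```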
docs/managed-datahub/observe/volume-assertions.md

Lines changed: 3 additions & 3 deletions

```diff
@@ -114,15 +114,15 @@ source types vary by the platform, but generally fall into these categories:
 
 - **Information Schema**: A system Table that is exposed by the Data Warehouse which contains live information about the Databases
   and Tables stored inside the Data Warehouse, including their row count. It is usually efficient to check, but can in some cases be slightly delayed to update
-  once a change has been made to a table.
+  once a change has been made to a table. This is the optimal balance between cost and accuracy for most Data Platforms.
 
 - **Query**: A `COUNT(*)` query is used to retrieve the latest row count for a table, with optional SQL filters applied (depending on platform).
   This can be less efficient to check depending on the size of the table. This approach is more portable, as it does not involve
-  system warehouse tables, it is also easily portable across Data Warehouse and Data Lake providers.
+  system warehouse tables, and it is easily portable across Data Warehouse and Data Lake providers. It does issue a query to the table, however, which can be more expensive than Information Schema.
 
 - **DataHub Dataset Profile**: The DataHub Dataset Profile aspect is used to retrieve the latest row count information for a table.
   Using this option avoids contacting your data platform, and instead uses the DataHub Dataset Profile metadata to evaluate Volume Assertions.
-  Note if you have not configured an ingestion source through DataHub, then this may be the only option available.
+  Note if you have not configured a managed ingestion source through DataHub, then this may be the only option available. This is the cheapest option, but requires that Dataset Profiles are reported to DataHub. By default, Ingestion will report Dataset Profiles to DataHub, which can be infrequent. You can report Dataset Profiles via the DataHub APIs for more frequent and reliable data.
 
 Volume Assertions also have an off switch: they can be started or stopped at any time with the click of a button.
```
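The **Query** source above reduces to a `COUNT(*)` (optionally filtered) compared against the expected row-count range. A sqlite3 sketch; the `daily_clicks` table and `volume_assertion` helper are hypothetical illustrations of the idea:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_clicks (day TEXT, user_id INTEGER)")
conn.executemany("INSERT INTO daily_clicks VALUES (?, ?)",
                 [("2024-01-01", i) for i in range(120)])

def volume_assertion(conn, table, min_rows, max_rows, where=None):
    # COUNT(*) with an optional SQL filter, as the Query source does.
    query = f"SELECT COUNT(*) FROM {table}"
    if where:
        query += f" WHERE {where}"
    (count,) = conn.execute(query).fetchone()
    # Pass if the observed row count falls inside the expected range.
    return min_rows <= count <= max_rows, count

passed, count = volume_assertion(conn, "daily_clicks", 100, 1000)
print(passed, count)  # True 120
```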
0 commit comments