docs/managed-datahub/observe/assertions.md (4 additions, 7 deletions)
@@ -2,8 +2,7 @@
:::note Supported Data Platforms
Currently we support monitoring data on Snowflake, Redshift, BigQuery, and Databricks as part of DataHub Cloud Observe.
-For other data platforms, DataHub Cloud Observe can monitor assertions against dataset metrics (such as volume, or column nullness) and dataset freshenss by using the ingested statistics for each asset.
-Column Value and Custom SQL Assertions are not currently supported for other data platforms.
+DataHub Cloud Observe can still monitor assertions for other data platforms against dataset metrics (such as row count, or column nullness) and dataset freshness by using the [ingested statistics](/metadata-ingestion/docs/dev_guides/sql_profiles.md).
:::

An assertion is **a data quality test that finds data that violates a specified rule.**
@@ -25,13 +24,11 @@ For DataHub-provided assertion runners, we can deploy an agent in your environment

#### Bulk Creating Assertions

-You can bulk create Freshness and Volume [Smart Assertions](/docs/managed-datahub/observe/smart-assertions.md) (AI Anomaly Monitors) across several tables at once via the [Data Health Dashboard](/docs/managed-datahub/observe/data-health-dashboard.md):
+You can bulk create Freshness and Volume [Smart Assertions](/docs/managed-datahub/observe/smart-assertions.md) (AI Anomaly Monitors) across several tables at once via the [Data Health Dashboard](/docs/managed-datahub/observe/data-health-dashboard.md).

To bulk create column metric assertions on a given dataset, follow the steps under the **Anomaly Detection** section of [Column Assertion](https://docs.datahub.com/docs/managed-datahub/observe/column-assertions#anomaly-detection-with-smart-assertions-).

-### AI Anomaly Detection (Smart Assertions)
+### Detecting Anomalies Across Massive Data Landscapes

There are many cases where either you do not have the time to figure out what a good rule for an assertion is, or strict rules simply do not suffice for your data validation needs. Traditional rule-based assertions can become inadequate when dealing with complex data patterns or large-scale operations.
@@ -45,7 +42,7 @@ Here are some typical situations where manual assertion rules fall short:

- **Dynamic data environments** - When data patterns evolve over time, manually updating assertion rules becomes a maintenance burden that can lead to false positives or missed anomalies.

-### The Smart Assertion Solution
+### The AI Smart Assertion Solution

In these scenarios, you may want to consider creating a [Smart Assertion](./smart-assertions.md) to let machine learning automatically detect the normal patterns in your data and alert you when anomalies occur. This approach allows for more flexible and adaptive data quality monitoring without the overhead of manual rule maintenance.
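To make the idea concrete, here is a purely conceptual sketch of what anomaly-based monitoring does: learn a baseline from recent history instead of a fixed rule, then flag values that drift far from it. This is plain Python over made-up numbers, not DataHub's actual detection model, and the window size and tolerance are invented for illustration.

```python
# Conceptual illustration only -- not DataHub's anomaly model.
from statistics import mean, stdev

def is_anomalous(history: list, latest: float, tolerance: float = 3.0) -> bool:
    """Flag `latest` if it falls outside `tolerance` standard deviations of recent history."""
    if len(history) < 7:  # not enough history to learn a baseline yet
        return False
    baseline, spread = mean(history), stdev(history)
    return spread > 0 and abs(latest - baseline) > tolerance * spread

daily_row_counts = [10_120, 10_340, 9_980, 10_210, 10_400, 10_050, 10_290]
print(is_anomalous(daily_row_counts, latest=2_300))  # True -> worth alerting on
```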
-5. Configure the evaluation **schedule**. This is the frequency at which the assertion will be evaluated to produce a pass or fail result, and the times when the column values will be checked.
+This is the frequency at which the assertion will be evaluated to produce a pass or fail result, and the times when the column values will be checked.
-6. Configure the **column assertion type**. You can choose from **Column Value** or **Column Metric**.
-   **Column Value** assertions are used to monitor the value of a specific column in a table, and ensure that every row adheres to a specific condition. **Column Metric** assertions are used to compute a metric for that column, and then compare the value of that metric to your expectations.
+#### 6. Configure the **column assertion type**.
+
+You can choose from **Column Value** or **Column Metric**.
+**Column Value** assertions are used to monitor the value of a specific column in a table, and ensure that every row adheres to a specific condition. **Column Metric** assertions are used to compute a metric for that column, and then compare the value of that metric to your expectations.
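The following is a small illustrative sketch, in plain Python rather than DataHub's evaluator, of how the two types differ: a Column Value check judges every row, while a Column Metric check judges a single aggregate. The column name and thresholds are hypothetical.

```python
# Illustrative only: Column Value vs Column Metric, sketched over a list of rows.
rows = [{"amount": 12.5}, {"amount": None}, {"amount": 40.0}]

# Column Value: every row must satisfy the condition (here, amount is non-null and positive).
invalid_rows = [r for r in rows if r["amount"] is None or r["amount"] <= 0]
column_value_passes = len(invalid_rows) == 0  # False: one row has a null amount

# Column Metric: compute one aggregate for the column and compare it to an expectation.
null_fraction = sum(r["amount"] is None for r in rows) / len(rows)
column_metric_passes = null_fraction <= 0.10  # False: ~33% of values are null

print(column_value_passes, column_metric_passes)
```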
-8. Configure the **evaluation criteria**. This step varies based on the type of assertion you chose in the previous step.
+#### 8. Configure the **evaluation criteria**. This step varies based on the type of assertion you chose in the previous step.

-- **Column Value Assertions**: You will be able to choose from a set of operators that can be applied to the column value. The options presented will vary based on the data type of the selected column. For example with numeric types, you can check that the column value is greater than a specific value. For string types, you can check that the column value matches a particular regex pattern. You will also be able to control the behavior of null values in the column. If the **Allow Nulls** option is _disabled_, any null values encountered will be reported as a failure when evaluating the assertion.
+- **Column Value Assertions**: You will be able to choose from a set of operators that can be applied to the column value. The options presented will vary based on the data type of the selected column. For example with numeric types, you can check that the column value is greater than a specific value. For string types, you can check that the column value matches a particular regex pattern. You will also be able to control the behavior of null values in the column. If the **Allow Nulls** option is _disabled_, any null values encountered will be reported as a failure when evaluating the assertion. Note: Smart Assertions are not supported for Column Value Assertions today.

-- **Column Metric Assertions**: You will be able to choose from a list of common metrics and then specify the operator and value to compare against. The list of metrics will vary based on the data type of the selected column. For example with numeric types, you can choose to compute the average value of the column, and then assert that it is greater than a specific number. For string types, you can choose to compute the max length of all column values, and then assert that it is less than a specific number.
+- **Column Metric Assertions**: You will be able to choose from a list of common metrics and then specify the operator and value to compare against. The list of metrics will vary based on the data type of the selected column. For example with numeric types, you can choose to compute the average value of the column, and then assert that it is greater than a specific number. For string types, you can choose to compute the max length of all column values, and then assert that it is less than a specific number. You can also select the **Detect with AI** option to use Smart Assertions to detect anomalies in the column metric.
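As a rough illustration of the string-typed operators and the **Allow Nulls** behavior described above, the sketch below (plain Python, not DataHub's engine; the regex and sample values are invented) shows how a regex check treats non-matching and null values as failures when Allow Nulls is disabled.

```python
import re

EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")  # illustrative, not a strict validator
ALLOW_NULLS = False  # mirrors the "Allow Nulls" toggle being disabled

def row_passes(value):
    if value is None:
        return ALLOW_NULLS  # nulls count as failures when Allow Nulls is off
    return EMAIL_PATTERN.fullmatch(value) is not None

values = ["a@example.com", "not-an-email", None]
failures = [v for v in values if not row_passes(v)]
print(failures)  # ['not-an-email', None] -> 2 invalid values reported
```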
-9. Configure the **row evaluation type**. This defines which rows in the table the Column Assertion should evaluate. You can choose from the following options:
+#### 9. Configure the **row evaluation type**. This defines which rows in the table the Column Assertion should evaluate.

-- **All Table Rows**: Evaluate the Column Assertion against all rows in the table. This is the default option. Note that this may not be desirable for large tables.
+- **All Table Rows**: Evaluate the Column Assertion against all rows in the table. This is the default option. Note that this may not be desirable for large tables.

-- **Only Rows That Have Changed**: Evaluate the Column Assertion only against rows that have changed since the last evaluation. If you choose this option, you will need to specify a **High Watermark Column** to help determine which rows have changed. A **High Watermark Column** is a column that contains a constantly-incrementing value - a date, a time, or another always-increasing number. When selected, a query will be issued to the table find only the rows which have changed since the last assertion run.
+- **Only Rows That Have Changed**: Evaluate the Column Assertion only against rows that have changed since the last evaluation. If you choose this option, you will need to specify a **High Watermark Column** to help determine which rows have changed. A **High Watermark Column** is a column that contains a constantly-incrementing value - a date, a time, or another always-increasing number. When selected, a query will be issued to the table to find only the rows which have changed since the last assertion run.
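For intuition, here is a hedged sketch of the kind of query an **Only Rows That Have Changed** evaluation might issue; the schema, table, and column names are hypothetical placeholders, not anything DataHub generates verbatim.

```python
# Hedged illustration of incremental evaluation using a high watermark column.
last_watermark = "2024-06-01 00:00:00"  # persisted from the previous assertion run

incremental_query = f"""
    SELECT *
    FROM analytics.fct_orders              -- hypothetical table
    WHERE updated_at > '{last_watermark}'  -- hypothetical high watermark column
"""

# After this batch is evaluated, MAX(updated_at) from the returned rows becomes the
# new watermark, so the next run only scans rows written since this one.
print(incremental_query)
```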
-10. (Optional) Click **Advanced** to further customize the Column Assertion. The options listed here will vary based on the type of assertion you chose in the previous step.
+#### 10. (Optional) Click **Advanced** to further customize the Column Assertion.
+
+The options listed here will vary based on the type of assertion you chose in the previous step.
-- **Invalid Values Threshold**: For **Column Value** assertions, you can configure the number of invalid values (i.e. rows) that are allowed to fail before the assertion is marked as failing. This is useful if you want to allow a limited number of invalid values in the column. By default this is 0, meaning the assertion will fail if any rows have an invalid column value.
+- **Invalid Values Threshold**: For **Column Value** assertions, you can configure the number of invalid values (i.e. rows) that are allowed to fail before the assertion is marked as failing. This is useful if you want to allow a limited number of invalid values in the column. By default this is 0, meaning the assertion will fail if any rows have an invalid column value.
-- **Source**: For **Column Metric** assertions, you can choose the mechanism that will be used to obtain the column metric. **Query** will issue a query to the dataset to compute the metric. **DataHub Dataset Profile** will use the DataHub Dataset Profile metadata to compute the metric. Note that this option requires that dataset profiling statistics are up-to-date as of the assertion run time.
+- **Source**: For **Column Metric** assertions, you can choose the mechanism that will be used to obtain the column metric. **Query** will issue a query to the dataset to compute the metric. This issues a query to the table, which can be more expensive than Information Schema. **DataHub Dataset Profile** will use the DataHub Dataset Profile metadata to compute the metric. This is the cheapest option, but requires that Dataset Profiles are reported to DataHub. By default, Ingestion will report Dataset Profiles to DataHub, which can be infrequent. You can report Dataset Profiles via the DataHub APIs for more frequent and reliable data (see the sketch below).
-- **Additional Filters**: You can choose to add additional filters to the query that will be used to evaluate the assertion. This is useful if you want to limit the assertion to a subset of rows in the table. Note this option will not be available if you choose **DataHub Dataset Profile** as the **source**.
+- **Additional Filters**: You can choose to add additional filters to the query that will be used to evaluate the assertion. This is useful if you want to limit the assertion to a subset of rows in the table. Note this option will not be available if you choose **DataHub Dataset Profile** as the **source**.
-11. Configure actions that should be taken when the Column Assertion passes or fails
+#### 11. Configure actions that should be taken when the Column Assertion passes or fails
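Regarding the **Source** option above: one way to keep Dataset Profiles fresh is to push them yourself with the open-source DataHub Python SDK (`acryl-datahub`). The sketch below is a minimal example under that assumption; the server URL, token, dataset coordinates, and counts are placeholders.

```python
import time

from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetProfileClass

# Placeholder connection details for your DataHub instance.
emitter = DatahubRestEmitter(gms_server="https://your-account.acryl.io/gms", token="<token>")

profile = DatasetProfileClass(
    timestampMillis=int(time.time() * 1000),  # when this profile was observed
    rowCount=512_330,                         # placeholder row count
    columnCount=12,                           # placeholder column count
)

emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=make_dataset_urn(platform="snowflake", name="db.schema.table", env="PROD"),
        aspect=profile,
    )
)
```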
-- **Audit Log**: Check the Data Platform operational audit log to determine whether the table changed within the evaluation period.
-- **Information Schema**: Check the Data Platform system metadata tables to determine whether the table changed within the evaluation period.
+- **Audit Log**: Check the Data Platform operational audit log to determine whether the table changed within the evaluation period. This will filter out No-Ops (e.g. `INSERT 0`). However, the Audit Log can be delayed by several hours depending on the Data Platform. This is also a little more costly on the warehouse than Information Schema.
+- **Information Schema**: Check the Data Platform system metadata tables to determine whether the table changed within the evaluation period. This is the optimal balance between cost and accuracy for most Data Platforms.

- **Last Modified Column**: Check for the presence of rows using a "Last Modified Time" column, which should reflect the time at which a given row was last changed in the table, to
-determine whether the table changed within the evaluation period.
+determine whether the table changed within the evaluation period. This issues a query to the table, which can be more expensive than Information Schema.

- **High Watermark Column**: Monitor changes to a continuously-increasing "high watermark" column value to determine whether a table has been changed. This option is particularly useful for tables that grow consistently with time, for example fact or event (e.g. click-stream) tables. It is not available
-when using a fixed lookback period.
-- **DataHub Operation**: Use DataHub Operations to determine whether the table changed within the evaluation period.
+when using a fixed lookback period. This issues a query to the table, which can be more expensive than Information Schema.
+- **DataHub Operation**: Use DataHub Operations to determine whether the table changed within the evaluation period. This is the cheapest option, but requires that Operations are reported to DataHub. By default, Ingestion will report Operations to DataHub, which can be infrequent. You can report Operations via the DataHub APIs for more frequent and reliable data (a sketch follows below).
8. Configure actions that should be taken when the Freshness Assertion passes or fails
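Similarly, for the **DataHub Operation** change source, you can push Operations yourself with the DataHub Python SDK so freshness checks see timely write events. A minimal sketch, with placeholder connection details and dataset coordinates:

```python
import time

from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import OperationClass, OperationTypeClass

now_ms = int(time.time() * 1000)
operation = OperationClass(
    timestampMillis=now_ms,       # when this report is emitted
    lastUpdatedTimestamp=now_ms,  # when the table was actually modified
    operationType=OperationTypeClass.INSERT,
)

# Placeholder connection details for your DataHub instance.
DatahubRestEmitter(gms_server="https://your-account.acryl.io/gms", token="<token>").emit(
    MetadataChangeProposalWrapper(
        entityUrn=make_dataset_urn(platform="snowflake", name="db.schema.table", env="PROD"),
        aspect=operation,
    )
)
```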
docs/managed-datahub/observe/volume-assertions.md (3 additions, 3 deletions)
@@ -114,15 +114,15 @@ source types vary by the platform, but generally fall into these categories:

- **Information Schema**: A system Table that is exposed by the Data Warehouse which contains live information about the Databases and Tables stored inside the Data Warehouse, including their row count. It is usually efficient to check, but can in some cases be slightly delayed to update
-once a change has been made to a table.
+once a change has been made to a table. This is the optimal balance between cost and accuracy for most Data Platforms.

- **Query**: A `COUNT(*)` query is used to retrieve the latest row count for a table, with optional SQL filters applied (depending on platform). This can be less efficient to check depending on the size of the table. This approach is more portable, as it does not involve
-system warehouse tables, it is also easily portable across Data Warehouse and Data Lake providers.
+system warehouse tables, it is also easily portable across Data Warehouse and Data Lake providers. This issues a query to the table, which can be more expensive than Information Schema (see the illustrative queries below).

- **DataHub Dataset Profile**: The DataHub Dataset Profile aspect is used to retrieve the latest row count information for a table. Using this option avoids contacting your data platform, and instead uses the DataHub Dataset Profile metadata to evaluate Volume Assertions.
-Note if you have not configured an ingestion source through DataHub, then this may be the only option available.
+Note if you have not configured a managed ingestion source through DataHub, then this may be the only option available. This is the cheapest option, but requires that Dataset Profiles are reported to DataHub. By default, Ingestion will report Dataset Profiles to DataHub, which can be infrequent. You can report Dataset Profiles via the DataHub APIs for more frequent and reliable data.
Volume Assertions also have an off switch: they can be started or stopped at any time with the click of a button.
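For orientation, the snippet below sketches the kind of row-count lookup each warehouse-backed source might perform, written as Snowflake-flavored SQL strings in Python; the database, schema, and table names are placeholders, and the exact statements DataHub issues may differ.

```python
# Illustrative only: row-count lookups for the Information Schema and Query sources.
INFORMATION_SCHEMA_ROW_COUNT = """
    SELECT row_count
    FROM my_db.information_schema.tables
    WHERE table_schema = 'ANALYTICS' AND table_name = 'FCT_ORDERS'
"""

COUNT_STAR_ROW_COUNT = """
    SELECT COUNT(*)
    FROM my_db.analytics.fct_orders
    WHERE order_date >= DATEADD('day', -1, CURRENT_DATE)  -- optional filter
"""

# The DataHub Dataset Profile source skips the warehouse entirely and reads the row
# count from the dataset's most recent profile stored in DataHub.
print(INFORMATION_SCHEMA_ROW_COUNT, COUNT_STAR_ROW_COUNT)
```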