Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
101 changes: 101 additions & 0 deletions docs/concepts/audits.md
Original file line number Diff line number Diff line change
Expand Up @@ -722,3 +722,104 @@ MODEL (
)
);
```

### AUDIT_ONLY Models

In addition to traditional audits, SQLMesh provides a special model kind called `AUDIT_ONLY` for validating data relationships across multiple tables without materializing any results.

#### When to Use AUDIT_ONLY Models

Use `AUDIT_ONLY` models when you need to:
- Validate relationships between multiple tables (e.g., referential integrity)
- Run complex validation queries that don't belong to a single model
- Create validation logic that participates in the model DAG with proper dependencies
- Avoid creating unnecessary materialized tables just for validation

Unlike traditional audits that are scoped to a single model, `AUDIT_ONLY` models can depend on multiple models and validate relationships between them.

#### Creating AUDIT_ONLY Models

AUDIT_ONLY models are defined like regular models but with `kind AUDIT_ONLY`:

```sql
MODEL (
name data_quality.order_validation,
kind AUDIT_ONLY (
blocking TRUE, -- Fail pipeline if validation fails (default)
max_failing_rows 20 -- Number of sample rows to show in error (default: 10)
),
depends_on [orders, customers],
cron '@daily',
owner 'data_quality_team'
);

-- Query should return 0 rows for success
-- Any returned rows indicate validation failures
SELECT
o.order_id,
o.customer_id,
'Missing customer record' as issue_type
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL;
```

#### Key Differences from Regular Audits

| Feature | Traditional Audits | AUDIT_ONLY Models |
|---------|-------------------|-------------------|
| **Scope** | Single model | Multiple models |
| **Dependencies** | Implicit (via @this_model) | Explicit (via depends_on) |
| **Materialization** | N/A | Never materializes |
| **Location** | `audits/` directory or inline | `models/` directory |
| **Scheduling** | With parent model | Independent cron schedule |
| **DAG Participation** | Attached to model | Full model in DAG |

#### Configuration Options

AUDIT_ONLY models support these configuration options:

- **`blocking`** (default: `TRUE`): Whether validation failures should stop the pipeline
- **`max_failing_rows`** (default: `10`): Maximum number of failing rows to show in error messages

Example with non-blocking configuration:

```sql
MODEL (
name data_quality.revenue_anomalies,
kind AUDIT_ONLY (
blocking FALSE, -- Log warnings but don't stop pipeline
max_failing_rows 50 -- Show up to 50 failing rows
),
depends_on [revenue_by_day]
);

-- Detect revenue anomalies
WITH stats AS (
SELECT AVG(revenue) as avg_rev, STDDEV(revenue) as stddev_rev
FROM revenue_by_day
)
SELECT
day,
revenue,
'Anomaly: >3 standard deviations' as issue
FROM revenue_by_day
CROSS JOIN stats
WHERE revenue > avg_rev + (3 * stddev_rev)
OR revenue < 0;
```

#### How AUDIT_ONLY Models Work

1. **No Table Creation**: The model's query executes but doesn't create or update any tables
2. **Validation Logic**: The model fails if the query returns any rows (similar to audits)
3. **Error Reporting**: Shows a sample of failing rows in the error message
4. **Pipeline Integration**: Participates in plan/apply workflow with proper dependency ordering

#### Best Practices

1. **Use descriptive names**: Name your AUDIT_ONLY models clearly (e.g., `audit_order_integrity`, `validate_user_consistency`)
2. **Set appropriate blocking**: Use `blocking TRUE` for critical validations, `FALSE` for warnings
3. **Include context in output**: Return columns that help identify and debug issues
4. **Group related validations**: Consider combining related checks in a single AUDIT_ONLY model
5. **Document validation logic**: Use model descriptions to explain what's being validated and why
109 changes: 109 additions & 0 deletions docs/concepts/models/model_kinds.md
Original file line number Diff line number Diff line change
Expand Up @@ -860,6 +860,115 @@ SELECT DISTINCT
FROM db.employees;
```

## AUDIT_ONLY

The `AUDIT_ONLY` model kind is designed for data validation across multiple tables without materializing any results. These models execute validation queries and fail if any rows are returned, similar to [audits](../audits.md#audit_only-models) but with the ability to participate as full models in the DAG.

### Purpose

`AUDIT_ONLY` models are ideal for:
- Validating referential integrity between multiple tables
- Detecting data quality issues across different models
- Running complex validation queries that don't belong to a single model
- Avoiding unnecessary table materialization for validation purposes

### Configuration

The `AUDIT_ONLY` kind supports two configuration parameters:

- **`blocking`** (default: `TRUE`): Determines whether validation failures stop the pipeline
- **`max_failing_rows`** (default: `10`): Maximum number of failing rows to display in error messages

### Example: Referential Integrity Check

This example validates that all orders reference existing customers:

```sql linenums="1"
MODEL (
name data_quality.order_integrity,
kind AUDIT_ONLY (
blocking TRUE,
max_failing_rows 20
),
depends_on [orders, customers],
cron '@daily',
owner 'data_quality_team'
);

-- Query should return 0 rows for validation to pass
SELECT
o.order_id,
o.customer_id,
o.order_date,
'Missing customer record' as issue_type
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL;
```

### Example: Non-Blocking Anomaly Detection

This example detects revenue anomalies but doesn't stop the pipeline:

```sql linenums="1"
MODEL (
name data_quality.revenue_anomalies,
kind AUDIT_ONLY (
blocking FALSE, -- Log warnings but continue
max_failing_rows 100
),
depends_on [daily_revenue],
cron '@hourly'
);

WITH stats AS (
SELECT
AVG(revenue) as avg_revenue,
STDDEV(revenue) as stddev_revenue
FROM daily_revenue
WHERE revenue > 0
)
SELECT
date,
revenue,
CASE
WHEN revenue < 0 THEN 'Negative revenue'
WHEN revenue > avg_revenue + (5 * stddev_revenue) THEN 'Extreme outlier'
END as anomaly_type
FROM daily_revenue
CROSS JOIN stats
WHERE revenue < 0
OR revenue > avg_revenue + (5 * stddev_revenue);
```

### Behavior

1. **No Materialization**: AUDIT_ONLY models never create or update tables
2. **Validation Logic**: The model succeeds if the query returns 0 rows, fails otherwise
3. **Error Reporting**: When validation fails, shows a sample of failing rows (up to `max_failing_rows`)
4. **DAG Integration**: Fully participates in the model DAG with proper dependency tracking
5. **Scheduling**: Can be scheduled independently using cron expressions

### Best Practices

- **Naming Convention**: Use descriptive names like `audit_*` or `validate_*` to clearly indicate the model's purpose
- **Include Context**: Add columns that describe what validation failed for easier debugging
- **Optimize Performance**: These queries run during every plan/apply, so ensure they're efficient
- **Set Appropriate Blocking**: Use `blocking TRUE` for critical validations, `FALSE` for monitoring
- **Document Purpose**: Use the `description` field to explain what the validation checks

### Comparison with Traditional Audits

While both AUDIT_ONLY models and traditional audits validate data, they serve different purposes:

| Aspect | Traditional Audits | AUDIT_ONLY Models |
|--------|-------------------|-------------------|
| **Scope** | Single model | Multiple models |
| **Location** | `audits/` directory or inline | `models/` directory |
| **Dependencies** | Implicit via parent model | Explicit via `depends_on` |
| **Scheduling** | With parent model | Independent cron |
| **Use Case** | Validate model output | Validate cross-model relationships |

## SEED
The `SEED` model kind is used to specify [seed models](./seed_models.md) for using static CSV datasets in your SQLMesh project.

Expand Down
11 changes: 11 additions & 0 deletions docs/reference/model_configuration.md
Original file line number Diff line number Diff line change
Expand Up @@ -306,6 +306,17 @@ Configuration options for [`SCD_TYPE_2_BY_COLUMN` models](../concepts/models/mod

Python model kind `name` enum value: [ModelKindName.SCD_TYPE_2_BY_COLUMN](https://sqlmesh.readthedocs.io/en/stable/_readthedocs/html/sqlmesh/core/model/kind.html#ModelKindName)

### `AUDIT_ONLY` models

Configuration options for [`AUDIT_ONLY` models](../concepts/models/model_kinds.md#audit_only) (in addition to [general model properties](#general-model-properties)).

| Option | Description | Type | Required |
| ------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------- | :--: | :------: |
| `blocking` | If set to true, the pipeline will fail when the validation query returns any rows. If false, only warnings are logged. (Default: `True`) | bool | N |
| `max_failing_rows` | Maximum number of failing rows to display in error messages when a validation fails. (Default: `10`) | int | N |

Python model kind `name` enum value: [ModelKindName.AUDIT_ONLY](https://sqlmesh.readthedocs.io/en/stable/_readthedocs/html/sqlmesh/core/model/kind.html#ModelKindName)

### `SEED` models

Configuration options for [`SEED` models](../concepts/models/model_kinds.md#seed). `SEED` models do not support all the general properties supported by other models; they only support the properties listed in this table.
Expand Down
43 changes: 43 additions & 0 deletions examples/sushi/models/audit_duplicate_orders.sql
Original file line number Diff line number Diff line change
@@ -0,0 +1,43 @@
MODEL (
name sushi.audit_duplicate_orders,
kind AUDIT_ONLY (
blocking FALSE,
max_failing_rows 100
),
depends_on [sushi.orders],
cron '@hourly',
owner 'data_engineering',
tags ['validation', 'duplicates', 'data_quality'],
description 'Detects potential duplicate orders based on customer, waiter, and timing'
);

-- Find potential duplicate orders
-- Orders from the same customer to the same waiter within 5 minutes might be duplicates
WITH potential_duplicates AS (
SELECT
o1.id as order_id_1,
o2.id as order_id_2,
o1.customer_id,
o1.waiter_id,
o1.start_ts as order_1_time,
o2.start_ts as order_2_time,
ABS(o1.start_ts - o2.start_ts) as seconds_apart
FROM sushi.orders o1
INNER JOIN sushi.orders o2
ON o1.customer_id = o2.customer_id
AND o1.waiter_id = o2.waiter_id
AND o1.id < o2.id -- Avoid comparing order with itself and duplicating pairs
AND o1.event_date = o2.event_date -- Same day
WHERE ABS(o1.start_ts - o2.start_ts) <= 300 -- Within 5 minutes (300 seconds)
)
SELECT
order_id_1,
order_id_2,
customer_id,
waiter_id,
seconds_apart,
CONCAT('Orders ', order_id_1::TEXT, ' and ', order_id_2::TEXT,
' from customer ', customer_id::TEXT,
' are only ', seconds_apart::TEXT, ' seconds apart') as issue_description
FROM potential_duplicates
ORDER BY seconds_apart, order_id_1
25 changes: 25 additions & 0 deletions examples/sushi/models/audit_order_integrity.sql
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
MODEL (
name sushi.audit_order_integrity,
kind AUDIT_ONLY (
blocking FALSE, -- Set to non-blocking for example/demo purposes
max_failing_rows 20
),
depends_on [sushi.orders, sushi.customers],
cron '@daily',
owner 'data_quality_team',
tags ['validation', 'referential_integrity', 'critical'],
description 'Validates referential integrity between orders and customers tables'
);

-- Check for orders with non-existent customer IDs
-- This should return no rows if all orders have valid customers
SELECT
o.id as order_id,
o.customer_id,
o.event_date,
'Missing customer record' as issue_type,
CONCAT('Order ', o.id::TEXT, ' references non-existent customer ', o.customer_id::TEXT) as issue_description
FROM sushi.orders o
LEFT JOIN sushi.customers c
ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL
47 changes: 47 additions & 0 deletions examples/sushi/models/audit_waiter_revenue_anomalies.sql
Original file line number Diff line number Diff line change
@@ -0,0 +1,47 @@
MODEL (
name sushi.audit_waiter_revenue_anomalies,
kind AUDIT_ONLY (
blocking FALSE,
max_failing_rows 50
),
depends_on [sushi.waiter_revenue_by_day],
cron '@daily',
owner 'analytics_team',
tags ['validation', 'revenue', 'daily'],
description 'Detects anomalies in daily waiter revenue that may indicate data quality issues'
);

-- Detect anomalies in waiter daily revenue
-- Only flag extreme outliers (>5 std dev) or negative revenue
WITH revenue_stats AS (
SELECT
AVG(revenue) as avg_revenue,
STDDEV(revenue) as stddev_revenue
FROM sushi.waiter_revenue_by_day
WHERE revenue > 0 -- Exclude zeros from stats calculation
),
anomalies AS (
SELECT
w.waiter_id,
w.event_date,
w.revenue,
r.avg_revenue,
r.stddev_revenue,
CASE
WHEN w.revenue < 0 THEN 'Negative revenue'
WHEN w.revenue > r.avg_revenue + (5 * r.stddev_revenue) THEN 'Extremely high revenue (>5 std dev)'
END as anomaly_type
FROM sushi.waiter_revenue_by_day w
CROSS JOIN revenue_stats r
WHERE
w.revenue < 0
OR w.revenue > r.avg_revenue + (5 * r.stddev_revenue) -- Only flag extreme outliers
)
SELECT
waiter_id,
event_date,
revenue,
anomaly_type,
CONCAT('Waiter ', waiter_id::TEXT, ' has ', anomaly_type, ' on ', event_date::TEXT) as issue_description
FROM anomalies
ORDER BY event_date DESC, waiter_id
1 change: 1 addition & 0 deletions sqlmesh/core/dialect.py
Original file line number Diff line number Diff line change
Expand Up @@ -621,6 +621,7 @@ def parse(self: Parser) -> t.Optional[exp.Expression]:
ModelKindName.SCD_TYPE_2_BY_TIME,
ModelKindName.SCD_TYPE_2_BY_COLUMN,
ModelKindName.CUSTOM,
ModelKindName.AUDIT_ONLY,
) and self._match(TokenType.L_PAREN, advance=False):
props = self._parse_wrapped_csv(functools.partial(_parse_props, self))
else:
Expand Down
Loading