| Option | Description | Default if any |
|----------------|--------------------------------------------------------------------------------------------------------------------------------------|----------------|
| settings | A map/dictionary of "TABLE" settings to be used in DDL statements such as `CREATE TABLE` with this model | |
| query_settings | A map/dictionary of ClickHouse user-level settings to be used with `INSERT` or `DELETE` statements in conjunction with this model | |
| ttl | A TTL expression to be used with the table, given as a string | |
| indexes | A list of [data skipping indexes to create](/optimize/skipping-indexes). Check below for more information. | |
| sql_security | Allows you to specify which ClickHouse user to use when executing the view's underlying query. `SQL SECURITY` [has two legal values](/sql-reference/statements/create/view#sql_security): `definer` and `invoker`. | |
| definer | If `sql_security` is set to `definer`, you must specify an existing user or `CURRENT_USER` in the `definer` clause. | |
| projections | A list of [projections](/data-modeling/projections) to be created. Check below for more information. | |
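
For illustration, several of these options can be combined in a single model configuration. The sketch below is only an example: the engine, sorting key, setting values, and the `raw_events` model it selects from are placeholders to adapt to your project.

```sql
{{ config(
    materialized='table',
    engine='MergeTree()',
    order_by='event_time',
    -- placeholder values; adjust to your schema and workload
    settings={'allow_nullable_key': 1},
    query_settings={'max_execution_time': 3600},
    ttl='event_time + INTERVAL 90 DAY'
) }}

select
    event_time,
    user_id,
    event_type
from {{ ref('raw_events') }}
```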

#### About Data Skipping Indexes {#data-skipping-indexes}

These indexes are only available for the `table` materialization. A list of indexes can be added to the model configuration as follows:

```sql
{{ config(
materialized='table',
indexes=[{
'name': 'your_index_name',
'definition': 'your_column TYPE minmax GRANULARITY 2'
}]
) }}
```

#### About Projections {#projections}

Projections are added to the `table` and `distributed_table` materializations as a model setting. For distributed tables, the projection is applied to the `_local` tables, not to the distributed proxy table. For example:

```sql
{{ config(
materialized='table',
projections=[
{
'name': 'your_projection_name',
'query': 'SELECT department, avg(age) AS avg_age GROUP BY department'
}
]
) }}
```

### Supported table engines {#supported-table-engines}

| Option | Description |
|--------|-------------|
| codec | A string consisting of arguments passed to `CODEC()` in the column's DDL. For example: `codec: "Delta, ZSTD"` will be compiled as `CODEC(Delta, ZSTD)`. |
| ttl | A string consisting of a [TTL (time-to-live) expression](https://clickhouse.com/docs/guides/developer/ttl) that defines a TTL rule in the column's DDL. For example: `ttl: ts + INTERVAL 1 DAY` will be compiled as `TTL ts + INTERVAL 1 DAY`. |

#### Example of schema configuration {#example-of-schema-configuration}

```yaml
models:
  - name: your_model          # illustrative model name
    columns:
      - name: ts
        codec: ZSTD
        ttl: ts + INTERVAL 1 DAY
```

#### Adding complex types {#adding-complex-types}

dbt automatically determines the data type of each column by analyzing the SQL used to create the model. However, in some cases this process may not accurately determine the data type, leading to conflicts with the types specified in the contract `data_type` property. To address this, we recommend using the `CAST()` function in the model SQL to explicitly define the desired type. For example:

```sql
{{
config(
materialized="materialized_view",
engine="AggregatingMergeTree",
order_by=["event_type"],
)
}}

select
-- event_type may be inferred as a String but we may prefer LowCardinality(String):
CAST(event_type, 'LowCardinality(String)') as event_type,
-- countState() may be inferred as `AggregateFunction(count)` but we may prefer to change the type of the argument used:
CAST(countState(), 'AggregateFunction(count, UInt32)') as response_count,
-- maxSimpleState() may be inferred as `SimpleAggregateFunction(max, String)` but we may prefer to also change the type of the argument used:
CAST(maxSimpleState(event_type), 'SimpleAggregateFunction(max, LowCardinality(String))') as max_event_type
from {{ ref('user_events') }}
group by event_type
```
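
If you use [model contracts](https://docs.getdbt.com/docs/collaborate/govern/model-contracts), the `data_type` values declared for these columns should match the types produced by the `CAST()` calls above. A sketch of what the matching schema file might look like (the model name is hypothetical):

```yaml
models:
  - name: user_event_stats   # hypothetical name for the model above
    config:
      contract:
        enforced: true
    columns:
      - name: event_type
        data_type: LowCardinality(String)
      - name: response_count
        data_type: AggregateFunction(count, UInt32)
      - name: max_event_type
        data_type: SimpleAggregateFunction(max, LowCardinality(String))
```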

## Features {#features}

### Materialization: view {#materialization-view}
List of supported features:
- [x] Distributed table materialization (experimental)
- [x] Distributed incremental materialization (experimental)
- [x] Contracts
- [x] ClickHouse-specific column configurations (Codec, TTL...)
- [x] ClickHouse-specific table settings (indexes, projections...)

All features up to dbt-core 1.9 are supported. Support for the features introduced in dbt-core 1.10 will be added soon.

Execute `dbt debug` with the CLI tool to confirm whether dbt is able to connect to ClickHouse.

Go to the [guides page](/integrations/dbt/guides) to learn more about how to use dbt with ClickHouse.

### Testing and Deploying your models (CI/CD) {#testing-and-deploying-your-models-ci-cd}

There are many ways to test and deploy your dbt project. dbt has some suggestions for [best practice workflows](https://docs.getdbt.com/best-practices/best-practice-workflows#pro-tips-for-workflows) and [CI jobs](https://docs.getdbt.com/docs/deploy/ci-jobs). We discuss several strategies below, but keep in mind that they may need to be adapted to fit your specific use case.

#### CI/CD with simple data tests and unit tests {#ci-with-simple-data-tests-and-unit-tests}

One simple way to kick-start your CI pipeline is to run a ClickHouse cluster inside the job and then run your models against it. Before running the models, you can insert demo data into this cluster, for example with a [seed](https://docs.getdbt.com/reference/commands/seed) that contains a subset of your production data.

Once the data is inserted, you can then run your [data tests](https://docs.getdbt.com/docs/build/data-tests) and your [unit tests](https://docs.getdbt.com/docs/build/unit-tests).
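
As an illustration, a unit test is defined in a YAML file next to the model. The sketch below assumes a hypothetical `daily_event_counts` model that aggregates a `user_events` model:

```yaml
unit_tests:
  - name: test_daily_event_counts      # hypothetical test and model names
    model: daily_event_counts
    given:
      - input: ref('user_events')
        rows:
          - {user_id: 1, event_type: "click"}
          - {user_id: 2, event_type: "click"}
          - {user_id: 3, event_type: "view"}
    expect:
      rows:
        - {event_type: "click", event_count: 2}
        - {event_type: "view", event_count: 1}
```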

Your CD step can be as simple as running `dbt build` against your production ClickHouse cluster.
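
Putting it together, a minimal CI job might spin up a single-node ClickHouse, seed it, and run `dbt build`. The sketch below uses GitHub Actions; the image tag, Python version, and `ci` target are assumptions to adapt to your own profiles:

```yaml
jobs:
  dbt-ci:
    runs-on: ubuntu-latest
    services:
      clickhouse:
        image: clickhouse/clickhouse-server:latest
        ports:
          - 8123:8123
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install dbt-core dbt-clickhouse
      # Load demo data, then run models, data tests, and unit tests
      - run: dbt seed --target ci
      - run: dbt build --target ci
```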

#### More complete CI/CD stage: Use recent data, only test affected models {#more-complete-ci-stage}

One common strategy is to use [Slim CI](https://docs.getdbt.com/best-practices/best-practice-workflows#run-only-modified-models-to-test-changes-slim-ci) jobs, where only the refreshed models (and their downstream dependencies) are tested, using the artifacts from your production runs.

To keep your development environments in sync and avoid running your models against stale deployments, you can use [clone](https://docs.getdbt.com/reference/commands/clone) or even [defer](https://docs.getdbt.com/reference/node-selection/defer).

It's better to use a different ClickHouse cluster (a `staging` one) to handle the testing phase. That way you avoid impacting the performance of your production environment and the data it holds. You can keep a small subset of your production data there so you can run your models against it. There are different ways of handling this:
- If your data doesn't need to be very recent, you can load backups of your production data into the staging cluster.
- If you need more recent data, there are several strategies for loading it into the staging cluster. For example, you could use a refreshable materialized view together with `remoteSecure()` to insert the data daily; if the insert fails or there is data loss, you can quickly re-trigger it (see the sketch after this list).
- Another option is to use a cron job or a refreshable materialized view to write the data to object storage, and then set up a ClickPipe on staging to pull any new files as they arrive.
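
For illustration, a refreshable materialized view on the staging cluster could copy the last day of data from production once a day. This is only a sketch: the host, database, table, credentials, and filter column are placeholders.

```sql
-- Runs once a day on the staging cluster and copies recent rows from production
CREATE MATERIALIZED VIEW staging.user_events_refresh
REFRESH EVERY 1 DAY
TO staging.user_events
AS
SELECT *
FROM remoteSecure('prod-host:9440', 'default', 'user_events', 'readonly_user', '<password>')
WHERE event_time >= now() - INTERVAL 1 DAY;
```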

Doing your CI testing in an accessible cluster also lets you manually inspect the results. For example, you may want to access this environment using one of your BI tools.

Your CD step can reuse the artifacts from your last production deployment to update only the models that have changed, with something like `dbt build --select state:modified+ --state path/to/last/deploy/state.json`.

## Troubleshooting common issues {#troubleshooting-common-issues}

### Connections {#troubleshooting-connections}

If you encounter issues connecting to ClickHouse from dbt, make sure the following criteria are met:

- If you're not using the default table engine for the database, you must specify a table engine in your model
configuration.
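
For example, a minimal sketch of specifying the engine explicitly in the model configuration (the engine and sorting key below are placeholders):

```sql
{{ config(
    materialized='table',
    engine='ReplicatedMergeTree()',   -- placeholder: use the engine your database expects
    order_by='id'
) }}
```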

### Understanding long-running operations {#understanding-long-running-operations}

Some operations may take longer than expected because of the ClickHouse queries they run. To gain more insight into which queries are taking longer, increase the log level to `debug`; dbt will then print the time taken by each query. For example, append `--log-level debug` to the command.

## Limitations {#limitations}

The current ClickHouse adapter for dbt has several limitations users should be aware of: