| Option | Description | Default if any |
|--------|-------------|----------------|
| settings | A map/dictionary of "TABLE" settings to be used in DDL statements such as `CREATE TABLE` with this model ||
| query_settings | A map/dictionary of ClickHouse user-level settings to be used with `INSERT` or `DELETE` statements in conjunction with this model ||
| ttl | A TTL expression to be used with the table. The TTL expression is a string that specifies the TTL for the table. ||
| indexes | A list of [data skipping indexes](/optimize/skipping-indexes) to create. See [About data skipping indexes](#data-skipping-indexes) below for more information. ||
| sql_security | Allows you to specify which ClickHouse user to use when executing the view's underlying query. `SQL SECURITY` [has two legal values](/sql-reference/statements/create/view#sql_security): `definer` and `invoker`. ||
| definer | If `sql_security` is set to `definer`, you must specify an existing user or `CURRENT_USER` in the `definer` clause. ||
| projections | A list of [projections](/data-modeling/projections) to be created. Check [About projections](#projections) for details. ||
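
To illustrate how `settings` and `query_settings` are used together, here is a hedged sketch; the two settings shown (`allow_nullable_key` as a table-level setting and `mutations_sync` as a user-level setting) are arbitrary examples rather than recommendations:

```sql
{{ config(
    materialized='table',
    -- table-level settings, rendered into the CREATE TABLE statement:
    settings={'allow_nullable_key': 1},
    -- user-level settings, applied to the INSERT statements that populate this model:
    query_settings={'mutations_sync': 2}
) }}
```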
#### About data skipping indexes {#data-skipping-indexes}
Data skipping indexes are only available for the `table` materialization. To add a list of data skipping indexes to a table, use the `indexes` configuration:
```sql
{{ config(
    materialized='table',
    indexes=[{
        'name': 'your_index_name',
        'definition': 'your_column TYPE minmax GRANULARITY 2'
    }]
) }}
```
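
Each entry should correspond to an `INDEX` clause in the generated `CREATE TABLE` statement; assuming the adapter concatenates the two fields, the example above would compile to `INDEX your_index_name your_column TYPE minmax GRANULARITY 2`, which you can verify in the compiled SQL.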
#### About projections {#projections}
You can add [projections](/data-modeling/projections) to `table` and `distributed_table` materializations using the `projections` configuration:
```sql
{{ config(
    materialized='table',
    projections=[
        {
            'name': 'your_projection_name',
            'query': 'SELECT department, avg(age) AS avg_age GROUP BY department'
        }
    ]
) }}
```
**Note**: For distributed tables, the projection is applied to the `_local` tables, not to the distributed proxy table.
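
To confirm that a projection is actually being used, one option (standard ClickHouse tooling rather than anything adapter-specific) is to inspect the `projections` column of `system.query_log`:

```sql
-- Lists recent queries that were served by at least one projection.
SELECT query, projections
FROM system.query_log
WHERE type = 'QueryFinish' AND notEmpty(projections)
ORDER BY event_time DESC
LIMIT 10;
```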
| Option | Description |
|--------|-------------|
| codec | A string consisting of arguments passed to `CODEC()` in the column's DDL. For example: `codec: "Delta, ZSTD"` will be compiled as `CODEC(Delta, ZSTD)`. |
| ttl | A string consisting of a [TTL (time-to-live) expression](https://clickhouse.com/docs/guides/developer/ttl) that defines a TTL rule in the column's DDL. For example: `ttl: ts + INTERVAL 1 DAY` will be compiled as `TTL ts + INTERVAL 1 DAY`. |
#### Example of schema configuration {#example-of-schema-configuration}
```yaml
models:
  - name: your_model  # hypothetical model name; the middle of this example was elided in the diff
    columns:
      - name: ts
        codec: ZSTD
        ttl: ts + INTERVAL 1 DAY
```
#### Adding complex types {#adding-complex-types}
dbt automatically determines the data type of each column by analyzing the SQL used to create the model. However, in some cases this process may not accurately determine the data type, leading to conflicts with the types specified in the contract `data_type` property. To address this, we recommend using the `CAST()` function in the model SQL to explicitly define the desired type. For example:
```sql
{{
    config(
        materialized="materialized_view",
        engine="AggregatingMergeTree",
        order_by=["event_type"],
    )
}}

select
    -- event_type may be inferred as a String but we may prefer LowCardinality(String):
    CAST(event_type, 'LowCardinality(String)') as event_type,
    -- countState() may be inferred as `AggregateFunction(count)` but we may prefer to change the type of the argument used:
    CAST(countState(), 'AggregateFunction(count, UInt32)') as response_count,
    -- maxSimpleState() may be inferred as `SimpleAggregateFunction(max, String)` but we may prefer to also change the type of the argument used:
    CAST(maxSimpleState(event_type), 'SimpleAggregateFunction(max, LowCardinality(String))') as max_event_type
from {{ ref('events') }}  -- hypothetical source model; the tail of this example was truncated in the diff
group by event_type
```
### Testing and Deploying your models (CI/CD) {#testing-and-deploying-your-models-ci-cd}
There are many ways to test and deploy your dbt project. dbt offers suggestions for [best practice workflows](https://docs.getdbt.com/best-practices/best-practice-workflows#pro-tips-for-workflows) and [CI jobs](https://docs.getdbt.com/docs/deploy/ci-jobs). We discuss several strategies below, but keep in mind that they may need to be adapted substantially to fit your specific use case.
#### CI/CD with simple data tests and unit tests {#ci-with-simple-data-tests-and-unit-tests}
One simple way to kick-start your CI pipeline is to run a ClickHouse cluster inside your job and then run your models against it. Before running the models, you can insert demo data into this cluster, for example by using a [seed](https://docs.getdbt.com/reference/commands/seed) to populate the staging environment with a subset of your production data.
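
A minimal sketch of such a job, written in GitHub Actions syntax as an assumption (adapt it to your CI system); it also assumes your `profiles.yml` defines a `ci` target pointing at the job-local ClickHouse on `localhost:8123`:

```yaml
name: dbt CI
on: pull_request

jobs:
  dbt-ci:
    runs-on: ubuntu-latest
    services:
      clickhouse:
        image: clickhouse/clickhouse-server:24.3
        ports:
          - 8123:8123
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install dbt-core dbt-clickhouse
      # Load demo data (seeds), then build and test every model against the throwaway cluster.
      - run: dbt seed --target ci
      - run: dbt build --target ci
```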
Once the data is inserted, you can then run your [data tests](https://docs.getdbt.com/docs/build/data-tests) and your [unit tests](https://docs.getdbt.com/docs/build/unit-tests).
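
For example, a [singular data test](https://docs.getdbt.com/docs/build/data-tests) is just a SQL file in your `tests/` directory that fails if it returns any rows; the model and column names below are hypothetical:

```sql
-- tests/assert_no_null_event_type.sql
-- The test fails if any row of the (hypothetical) events model has a NULL event_type.
select *
from {{ ref('events') }}
where event_type is null
```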
Your CD step can be as simple as running `dbt build` against your production ClickHouse cluster.
#### A more complete CI/CD stage: use recent data and test only affected models {#more-complete-ci-stage}
One common strategy is to use [Slim CI](https://docs.getdbt.com/best-practices/best-practice-workflows#run-only-modified-models-to-test-changes-slim-ci) jobs, where only the modified models (and their up- and downstream dependencies) are re-deployed. This approach uses artifacts from your production runs (i.e., the [dbt manifest](https://docs.getdbt.com/reference/artifacts/manifest-json)) to reduce the run time of your project and ensure there is no schema drift across environments.
To keep your development environments in sync and avoid running your models against stale deployments, you can use [clone](https://docs.getdbt.com/reference/commands/clone) or even [defer](https://docs.getdbt.com/reference/node-selection/defer).
We recommend using a dedicated ClickHouse cluster or service for the testing environment (i.e., a staging environment) to avoid impacting the operation of your production environment. To ensure the testing environment is representative, it's important that you use a subset of your production data, as well as run dbt in a way that prevents schema drift between environments.
- If you don't need fresh data to test against, you can restore a backup of your production data into the staging environment.
- If you need fresh data to test against, you can use a combination of the [`remoteSecure()` table function](/sql-reference/table-functions/remote) and refreshable materialized views to insert at the desired frequency (see the sketch after this list). Another option is to use object storage as an intermediate store: periodically write data from your production service, then import it into the staging environment using the object storage table functions or ClickPipes (for continuous ingestion).
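
A minimal sketch of the refreshable materialized view approach; the host, database, table, and credentials are placeholders you would replace with your own:

```sql
-- Re-runs every hour, copying the most recent week of production data into staging.
CREATE MATERIALIZED VIEW staging.recent_events
REFRESH EVERY 1 HOUR
ENGINE = MergeTree
ORDER BY event_time
AS
SELECT *
FROM remoteSecure('prod-host.example.com:9440', 'prod_db.events', 'ci_reader', '<password>')
WHERE event_time > now() - INTERVAL 7 DAY;
```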
Using a dedicated environment for CI testing also allows you to perform manual testing without impacting your production environment. For example, you may want to point a BI tool to this environment for testing.
For deployment (i.e., the CD step), we recommend using the artifacts from your production deployments to update only the models that have changed. This requires setting up object storage (e.g., S3) as intermediate storage for your dbt artifacts. Once that is set up, you can run a command like `dbt build --select state:modified+ --state path/to/last/deploy/state.json` to selectively rebuild the minimum set of models needed based on what changed since the last production run.
## Troubleshooting common issues {#troubleshooting-common-issues}
### Connections {#troubleshooting-connections}
If you encounter issues connecting to ClickHouse from dbt, make sure the following criteria are met:
- If you're not using the default table engine for the database, you must specify a table engine in your model
Some operations may take longer than expected because of the underlying ClickHouse queries. To gain more insight into which queries are taking longer, increase the [log level](https://docs.getdbt.com/reference/global-configs/logs#log-level) to `debug`; this prints the time used by each query. For example, you can do this by appending `--log-level debug` to dbt commands.
## Limitations {#limitations}
The current ClickHouse adapter for dbt has several limitations users should be aware of: