Commit 6c128bd

Merge pull request #3148 from ClickHouse/add-dataflow-docs
Add Google Dataflow docs
2 parents 53f6ca2 + 3af28b3 commit 6c128bd

10 files changed: +308 -25 lines changed

Diff for: docs/en/integrations/data-ingestion/etl-tools/apache-beam.md

+38 -25
@@ -97,31 +97,44 @@ public class Main {
## Supported Data Types

| ClickHouse                         | Apache Beam                | Is Supported | Notes                                                                                                                                     |
|------------------------------------|----------------------------|--------------|-------------------------------------------------------------------------------------------------------------------------------------------|
| `TableSchema.TypeName.FLOAT32`     | `Schema.TypeName#FLOAT`    | ✅           |                                                                                                                                             |
| `TableSchema.TypeName.FLOAT64`     | `Schema.TypeName#DOUBLE`   | ✅           |                                                                                                                                             |
| `TableSchema.TypeName.INT8`        | `Schema.TypeName#BYTE`     | ✅           |                                                                                                                                             |
| `TableSchema.TypeName.INT16`       | `Schema.TypeName#INT16`    | ✅           |                                                                                                                                             |
| `TableSchema.TypeName.INT32`       | `Schema.TypeName#INT32`    | ✅           |                                                                                                                                             |
| `TableSchema.TypeName.INT64`       | `Schema.TypeName#INT64`    | ✅           |                                                                                                                                             |
| `TableSchema.TypeName.STRING`      | `Schema.TypeName#STRING`   | ✅           |                                                                                                                                             |
| `TableSchema.TypeName.UINT8`       | `Schema.TypeName#INT16`    | ✅           |                                                                                                                                             |
| `TableSchema.TypeName.UINT16`      | `Schema.TypeName#INT32`    | ✅           |                                                                                                                                             |
| `TableSchema.TypeName.UINT32`      | `Schema.TypeName#INT64`    | ✅           |                                                                                                                                             |
| `TableSchema.TypeName.UINT64`      | `Schema.TypeName#INT64`    | ✅           |                                                                                                                                             |
| `TableSchema.TypeName.DATE`        | `Schema.TypeName#DATETIME` | ✅           |                                                                                                                                             |
| `TableSchema.TypeName.DATETIME`    | `Schema.TypeName#DATETIME` | ✅           |                                                                                                                                             |
| `TableSchema.TypeName.ARRAY`       | `Schema.TypeName#ARRAY`    | ✅           |                                                                                                                                             |
| `TableSchema.TypeName.ENUM8`       | `Schema.TypeName#STRING`   | ✅           |                                                                                                                                             |
| `TableSchema.TypeName.ENUM16`      | `Schema.TypeName#STRING`   | ✅           |                                                                                                                                             |
| `TableSchema.TypeName.BOOL`        | `Schema.TypeName#BOOLEAN`  | ✅           |                                                                                                                                             |
| `TableSchema.TypeName.TUPLE`       | `Schema.TypeName#ROW`      | ✅           |                                                                                                                                             |
| `TableSchema.TypeName.FIXEDSTRING` | `FixedBytes`               | ✅           | `FixedBytes` is a `LogicalType` representing a fixed-length <br/> byte array located at <br/> `org.apache.beam.sdk.schemas.logicaltypes`  |
|                                    | `Schema.TypeName#DECIMAL`  | ❌           |                                                                                                                                             |
|                                    | `Schema.TypeName#MAP`      | ❌           |                                                                                                                                             |
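
To illustrate the mapping above, here is a minimal sketch of building a Beam `Row` whose fields correspond to ClickHouse `String`, `Int64`, and `DateTime` columns. The field names are illustrative, not taken from any real table:

```java
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.values.Row;
import org.joda.time.DateTime;

// Beam schema whose field types map to ClickHouse String, Int64, and
// DateTime (per the table above: STRING, INT64, and DATETIME).
Schema schema = Schema.builder()
    .addStringField("event_name")
    .addInt64Field("user_id")
    .addDateTimeField("created_at")
    .build();

// A single row conforming to that schema.
Row row = Row.withSchema(schema)
    .addValues("page_view", 42L, DateTime.now())
    .build();
```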
## ClickHouseIO.Write Parameters
You can adjust the `ClickHouseIO.Write` configuration with the following setter functions:

| Parameter Setter Function   | Argument Type               | Default Value                 | Description                                                      |
|-----------------------------|-----------------------------|-------------------------------|------------------------------------------------------------------|
| `withMaxInsertBlockSize`    | `(long maxInsertBlockSize)` | `1000000`                     | Maximum size of a block of rows to insert.                       |
| `withMaxRetries`            | `(int maxRetries)`          | `5`                           | Maximum number of retries for failed inserts.                    |
| `withMaxCumulativeBackoff`  | `(Duration maxBackoff)`     | `Duration.standardDays(1000)` | Maximum cumulative backoff duration for retries.                 |
| `withInitialBackoff`        | `(Duration initialBackoff)` | `Duration.standardSeconds(5)` | Initial backoff duration before the first retry.                 |
| `withInsertDistributedSync` | `(Boolean sync)`            | `true`                        | If true, synchronizes insert operations for distributed tables.  |
| `withInsertQuorum`          | `(Long quorum)`             | `null`                        | The number of replicas required to confirm an insert operation.  |
| `withInsertDeduplicate`     | `(Boolean deduplicate)`     | `true`                        | If true, deduplication is enabled for insert operations.         |
| `withTableSchema`           | `(TableSchema schema)`      | `null`                        | Schema of the target ClickHouse table.                           |
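
For example, a minimal sketch wiring several of these setters together (the JDBC URL and table name are placeholders for your own deployment):

```java
import org.apache.beam.sdk.io.clickhouse.ClickHouseIO;
import org.apache.beam.sdk.values.Row;
import org.joda.time.Duration;

// Placeholder connection details; substitute your own host, database, and table.
ClickHouseIO.Write<Row> write =
    ClickHouseIO.<Row>write("jdbc:clickhouse://localhost:8123/default", "target_table")
        .withMaxInsertBlockSize(500_000L)               // smaller blocks than the 1000000 default
        .withMaxRetries(3)                              // fewer retries than the default 5
        .withInitialBackoff(Duration.standardSeconds(10))
        .withInsertDeduplicate(true);
```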

## Limitations

@@ -0,0 +1,31 @@
---
sidebar_label: Integrating Dataflow with ClickHouse
slug: /en/integrations/google-dataflow/dataflow
sidebar_position: 1
description: Users can ingest data into ClickHouse using Google Dataflow
---

# Integrating Google Dataflow with ClickHouse

[Google Dataflow](https://cloud.google.com/dataflow) is a fully managed stream and batch data processing service. It supports pipelines written in Java or Python and is built on the Apache Beam SDK.

There are two main ways to use Google Dataflow with ClickHouse, both of which leverage the [`ClickHouseIO` Apache Beam connector](../../apache-beam):

## 1. Java Runner
The [Java Runner](./java-runner) allows users to implement custom Dataflow pipelines using the Apache Beam SDK's `ClickHouseIO` integration. This approach provides full flexibility and control over the pipeline logic, enabling users to tailor the ETL process to specific requirements. However, it requires knowledge of Java programming and familiarity with the Apache Beam framework.

### Key Features
- High degree of customization.
- Ideal for complex or advanced use cases.
- Requires coding and understanding of the Beam API.

## 2. Predefined Templates
ClickHouse offers [predefined templates](./templates) designed for specific use cases, such as importing data from BigQuery into ClickHouse. These templates are ready to use and simplify the integration process, making them an excellent choice for users who prefer a no-code solution.

### Key Features
- No Beam coding required.
- Quick and easy setup for simple use cases.
- Also suitable for users with minimal programming expertise.

Both approaches are fully compatible with Google Cloud and the ClickHouse ecosystem, offering flexibility depending on your technical expertise and project requirements.
@@ -0,0 +1,20 @@
---
sidebar_label: Java Runner
slug: /en/integrations/google-dataflow/java-runner
sidebar_position: 2
description: Users can ingest data into ClickHouse using Google Dataflow Java Runner
---

# Dataflow Java Runner

The Dataflow Java Runner lets you execute custom Apache Beam pipelines on Google Cloud's Dataflow service. This approach provides maximum flexibility and is well suited for advanced ETL workflows.

## How It Works

1. **Pipeline Implementation**
   To use the Java Runner, you need to implement your Beam pipeline using `ClickHouseIO`, our official Apache Beam connector. For code examples and instructions on how to use `ClickHouseIO`, please visit [ClickHouse Apache Beam](../../apache-beam); a minimal sketch also follows this list.

2. **Deployment**
   Once your pipeline is implemented and configured, you can deploy it to Dataflow using Google Cloud's deployment tools. Comprehensive deployment instructions are provided in the [Google Cloud Dataflow documentation - Java Pipeline](https://cloud.google.com/dataflow/docs/quickstarts/create-pipeline-java).
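
The following is a minimal, self-contained sketch of such a pipeline. The connection string, table name, schema, and field names are illustrative placeholders; pass `--runner=DataflowRunner` together with your project and region options to execute it on Dataflow rather than locally:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.clickhouse.ClickHouseIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.Row;

public class ClickHousePipeline {
  public static void main(String[] args) {
    // Options come from the command line, e.g. --runner=DataflowRunner
    // --project=<PROJECT> --region=<REGION> when deploying to Dataflow.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline pipeline = Pipeline.create(options);

    // Beam schema matching the target ClickHouse table (illustrative columns).
    Schema schema = Schema.builder()
        .addStringField("event_name")
        .addInt64Field("user_id")
        .build();

    pipeline
        .apply("CreateSampleRow",
            Create.of(Row.withSchema(schema).addValues("page_view", 42L).build())
                .withRowSchema(schema))
        .apply("WriteToClickHouse",
            // Placeholder JDBC URL and table name.
            ClickHouseIO.<Row>write("jdbc:clickhouse://localhost:8123/default", "events"));

    pipeline.run().waitUntilFinish();
  }
}
```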
**Note**: This approach assumes familiarity with the Beam framework and coding expertise. If you prefer a no-code solution, consider using [ClickHouse's predefined templates](./templates).
@@ -0,0 +1,30 @@
---
sidebar_label: Templates
slug: /en/integrations/google-dataflow/templates
sidebar_position: 3
description: Users can ingest data into ClickHouse using Google Dataflow Templates
---

# Google Dataflow Templates

Google Dataflow templates provide a convenient way to execute prebuilt, ready-to-use data pipelines without the need to write custom code. These templates are designed to simplify common data processing tasks and are built using [Apache Beam](https://beam.apache.org/), leveraging connectors like `ClickHouseIO` for seamless integration with ClickHouse databases. By running these templates on Google Dataflow, you can achieve highly scalable, distributed data processing with minimal effort.

## Why Use Dataflow Templates?

- **Ease of Use**: Templates eliminate the need for coding by offering preconfigured pipelines tailored to specific use cases.
- **Scalability**: Dataflow ensures your pipeline scales efficiently, handling large volumes of data with distributed processing.
- **Cost Efficiency**: Pay only for the resources you consume, with the ability to optimize pipeline execution costs.

## How to Run Dataflow Templates

As of today, the official ClickHouse template is available via the Google Cloud CLI or the Dataflow REST API.
For detailed step-by-step instructions, refer to the [Google Dataflow Run Pipeline From a Template Guide](https://cloud.google.com/dataflow/docs/templates/provided-templates).
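
As a rough sketch, launching a Dataflow Flex Template from the CLI generally looks like the command below. The job name, region, template location, and parameter names here are placeholders, not the actual values for the ClickHouse template; consult the guide above and the specific template's page for the real ones.

```bash
# Placeholder values throughout; substitute the template's documented
# GCS location and parameters before running.
gcloud dataflow flex-template run "bigquery-to-clickhouse-job" \
  --region=us-central1 \
  --template-file-gcs-location="gs://TEMPLATE_BUCKET/TEMPLATE_FILE" \
  --parameters=PARAM_NAME=PARAM_VALUE
```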

## List of ClickHouse Templates
* [BigQuery To ClickHouse](./templates/bigquery-to-clickhouse)
* [GCS To ClickHouse](https://github.com/ClickHouse/DataflowTemplates/issues/3) (coming soon!)
* [Pub Sub To ClickHouse](https://github.com/ClickHouse/DataflowTemplates/issues/4) (coming soon!)
