DB Pipes: Docs on sync control, resync, initial load and more #4037

Open
wants to merge 14 commits into
base: main
Choose a base branch
from
Original file line number Diff line number Diff line change
@@ -0,0 +1,50 @@
---
title: 'Controlling the Syncing of a Database ClickPipe'
description: 'Doc for controlling the sync of a database ClickPipe'
slug: /integrations/clickpipes/mysql/sync_control
sidebar_label: 'Controlling syncs'
---

import edit_sync_button from '@site/static/images/integrations/data-ingestion/clickpipes/postgres/edit_sync_button.png'
import create_sync_settings from '@site/static/images/integrations/data-ingestion/clickpipes/postgres/create_sync_settings.png'
import edit_sync_settings from '@site/static/images/integrations/data-ingestion/clickpipes/postgres/sync_settings_edit.png'
import cdc_syncs from '@site/static/images/integrations/data-ingestion/clickpipes/postgres/cdc_syncs.png'

This document describes how to control the sync of a database ClickPipe (Postgres, MySQL etc.) when the ClickPipe is in **CDC (Running) mode**.

## Overview {#overview-mysql-sync}

Database ClickPipes have an architecture that consists of two parallel processes - pulling from the source database and pushing to the target database. The pulling process is controlled by a sync configuration that defines how often the data should be pulled and how much data should be pulled at a time. By "at a time", we mean one batch - since the ClickPipe pulls and pushes data in batches.

Two main settings control the sync of a database ClickPipe. The ClickPipe starts pushing as soon as either of the settings below is reached.

### Sync interval {#interval-mysql-sync}
The sync interval of the pipe is the amount of time (in seconds) for which the ClickPipe will pull records from the source database. The time to push what we have to ClickHouse is not included in this interval.

The default is **1 minute**.
Sync interval can be set to any positive integer value, but it is recommended to keep it above 10 seconds.

### Pull batch size {#batch-size-mysql-sync}
The pull batch size is the number of records that the ClickPipe will pull from the source database in one batch. Records mean inserts, updates and deletes done on the tables that are part of the pipe.

The default is **100,000** records.
A safe maximum is 10 million.
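
Taken together, the sync interval and the pull batch size behave roughly like the sketch below. This is a minimal illustration, not the ClickPipes implementation: the function `should_push` and its parameter names are hypothetical, and the defaults simply mirror the documented values (60 seconds and 100,000 records).

```python
def should_push(elapsed_seconds: float, records_pulled: int,
                sync_interval_seconds: int = 60,
                pull_batch_size: int = 100_000) -> bool:
    """Return True once either sync setting has been reached.

    Hypothetical helper for illustration only; the real ClickPipe
    logic is internal and may differ in detail.
    """
    interval_reached = elapsed_seconds >= sync_interval_seconds
    batch_full = records_pulled >= pull_batch_size
    return interval_reached or batch_full


# A quiet source: the interval elapses before the batch fills up.
print(should_push(elapsed_seconds=60, records_pulled=1_200))
# A busy source: the batch fills up long before the interval elapses.
print(should_push(elapsed_seconds=5, records_pulled=100_000))
# Neither setting reached yet, so pulling continues.
print(should_push(elapsed_seconds=5, records_pulled=1_200))
```

On a busy source the batch size tends to be the setting that triggers the push, while on a quiet source it is usually the interval.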

### An exception: Long-running transactions on source {#transactions-pg-sync}
When a transaction is run on the source database, the ClickPipe waits until it receives the COMMIT of the transaction before it moves forward. This **overrides** both the sync interval and the pull batch size.

### Configuring sync settings {#configuring-mysql-sync}
You can set the sync interval and pull batch size when you create a ClickPipe or edit an existing one.
When creating a ClickPipe, these settings appear in the second step of the creation wizard, as shown below:
<img src={create_sync_settings} alt="Create sync settings" />

When editing an existing ClickPipe, you can head over to the **Settings** tab of the pipe, pause the pipe and then click on **Configure** here:
<img src={edit_sync_button} alt="Edit sync button" />

This will open a flyout with the sync settings, where you can change the sync interval and pull batch size:
<img src={edit_sync_settings} alt="Edit sync settings" />

### Monitoring sync control behaviour {#monitoring-mysql-sync}
You can see how long each batch takes in the **CDC Syncs** table in the **Metrics** tab of the ClickPipe. Note that the duration shown here includes the push time. If no rows are incoming, the ClickPipe waits, and that wait time is also included in the duration.

<img src={cdc_syncs} alt="CDC Syncs table" />
@@ -0,0 +1,49 @@
---
title: 'Parallel Snapshot In The MySQL ClickPipe'
description: 'Doc for explaining parallel snapshot in the MySQL ClickPipe'
slug: /integrations/data-ingestion/clickpipes/mysql/parallel_initial_load
sidebar_label: 'How parallel snapshot works'
---

import snapshot_params from '@site/static/images/integrations/data-ingestion/clickpipes/mysql/snapshot_params.png'
import partition_key from '@site/static/images/integrations/data-ingestion/clickpipes/mysql/partition_key.png'

This document explains how parallelized snapshot/initial load works in the MySQL ClickPipe and describes the snapshot parameters that can be used to control it.

## Overview {#overview-mysql-snapshot}

Initial load is the first phase of a CDC ClickPipe, where the ClickPipe syncs the historical data of the tables in the source database over to ClickHouse before starting CDC. Often, this is done in a single-threaded manner.
However, the MySQL ClickPipe can parallelize this process, which can significantly speed up the initial load.

### Partition key column {#key-mysql-snapshot}

Once we've enabled the feature flag, you should see the below setting in the ClickPipe table picker (both during creation and editing of a ClickPipe):
<img src={partition_key} alt="Partition key column" />

The MySQL ClickPipe uses a column on your source table to logically partition the source tables. This column is called the **partition key column**. It is used to divide the source table into partitions, which can then be processed in parallel by the ClickPipe.

:::warning
The partition key column must be indexed in the source table to see a good performance boost. This can be seen by running `SHOW INDEX FROM <table_name>` in MySQL.
:::

### Logical partitioning {#logical-partitioning-mysql-snapshot}

Let's talk about the below settings:

<img src={snapshot_params} alt="Snapshot parameters" />

#### Snapshot number of rows per partition {#numrows-mysql-snapshot}
This setting controls how many rows constitute a partition. The ClickPipe will read the source table in chunks of this size, and chunks will be processed in parallel based on the initial load parallelism set. The default value is 100,000 rows per partition.
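
As a rough illustration of how a table could be divided into partitions of this size, the sketch below splits a range of partition key values into contiguous chunks. It assumes an integer, gap-free partition key named `id`; the function `partition_ranges` is hypothetical, and the actual partitioning query the ClickPipe runs may differ.

```python
from typing import List, Tuple


def partition_ranges(min_key: int, max_key: int,
                     rows_per_partition: int = 100_000) -> List[Tuple[int, int]]:
    """Split [min_key, max_key] into inclusive ranges of at most rows_per_partition keys."""
    if rows_per_partition <= 0:
        raise ValueError("rows_per_partition must be positive")
    ranges = []
    start = min_key
    while start <= max_key:
        end = min(start + rows_per_partition - 1, max_key)
        ranges.append((start, end))
        start = end + 1
    return ranges


# A table with ids 1..250,000 yields three partitions with the default setting.
# The column name `id` is an assumption made for this example.
for low, high in partition_ranges(1, 250_000):
    print(f"WHERE id BETWEEN {low} AND {high}")
```

Each range corresponds to one partition that can be read independently of the others.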

#### Initial load parallelism {#parallelism-mysql-snapshot}
This setting controls how many partitions will be processed in parallel. The default value is 4, which means that the ClickPipe will read 4 partitions of the source table in parallel. This can be increased to speed up the initial load, but it is recommended to keep it to a reasonable value depending on your source instance specs to avoid overwhelming the source database. The ClickPipe will automatically adjust the number of partitions based on the size of the source table and the number of rows per partition.

#### Snapshot number of tables in parallel {#tables-parallel-mysql-snapshot}
This setting is not strictly part of parallel snapshot, but it controls how many tables are processed in parallel during the initial load. The default value is 1. Note that this multiplies with the initial load parallelism: with an initial load parallelism of 4 and 2 tables in parallel, the ClickPipe reads up to 8 partitions at the same time.
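
The multiplication described above can be written out as a short sketch. The function name is hypothetical; it only multiplies the two settings to give an upper bound on concurrent partition reads, which is useful when sizing the load on the source instance.

```python
def max_parallel_partition_reads(initial_load_parallelism: int = 4,
                                 tables_in_parallel: int = 1) -> int:
    """Upper bound on partitions read from the source at the same time."""
    return initial_load_parallelism * tables_in_parallel


# Documented defaults: parallelism of 4, one table at a time.
print(max_parallel_partition_reads())
# Two tables in parallel on top of a parallelism of 4.
print(max_parallel_partition_reads(tables_in_parallel=2))
```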

### Monitoring parallel snapshot in MySQL {#monitoring-parallel-mysql-snapshot}
You can run `SHOW PROCESSLIST` in MySQL to see the parallel snapshot in action. The ClickPipe creates multiple connections to the source database, each reading a different partition of the source table. If you see `SELECT` queries with different ranges, the ClickPipe is reading the source tables. You can also see the `COUNT(*)` query and the partitioning query here.

### Limitations {#limitations-parallel-mysql-snapshot}
- The snapshot parameters cannot be edited after pipe creation. If you want to change them, you will have to create a new ClickPipe.
- When adding tables to an existing ClickPipe, you cannot change the snapshot parameters. The ClickPipe will use the existing parameters for the new tables.
@@ -0,0 +1,52 @@
---
title: 'Pausing and Resuming a MySQL ClickPipe'
description: 'Pausing and Resuming a MySQL ClickPipe'
sidebar_label: 'Pause Table'
slug: /integrations/clickpipes/mysql/pause_and_resume
---

import Image from '@theme/IdealImage';
import pause_button from '@site/static/images/integrations/data-ingestion/clickpipes/postgres/pause_button.png'
import pause_dialog from '@site/static/images/integrations/data-ingestion/clickpipes/postgres/pause_dialog.png'
import pause_status from '@site/static/images/integrations/data-ingestion/clickpipes/postgres/pause_status.png'
import resume_button from '@site/static/images/integrations/data-ingestion/clickpipes/postgres/resume_button.png'
import resume_dialog from '@site/static/images/integrations/data-ingestion/clickpipes/postgres/resume_dialog.png'


There are scenarios where it would be useful to pause a MySQL ClickPipe. For example, you may want to run some analytics on existing data in a static state. Or, you might be performing upgrades on MySQL. Here is how you can pause and resume a MySQL ClickPipe.

## Steps to pause a MySQL ClickPipe {#pause-clickpipe-steps}

1. In the Data Sources tab, click on the MySQL ClickPipe you wish to pause.
2. Head over to the **Settings** tab.
3. Click on the **Pause** button.
<br/>

<Image img={pause_button} border size="md"/>

4. A dialog box should appear for confirmation. Click on Pause again.
<br/>

<Image img={pause_dialog} border size="md"/>

5. Head over to the **Metrics** tab.
6. In around 5 seconds (and also on page refresh), the status of the pipe should be **Paused**.
<br/>

<Image img={pause_status} border size="md"/>

## Steps to resume a MySQL ClickPipe {#resume-clickpipe-steps}
1. In the Data Sources tab, click on the MySQL ClickPipe you wish to resume. The status of the pipe should be **Paused** initially.
2. Head over to the **Settings** tab.
3. Click on the **Resume** button.
<br/>

<Image img={resume_button} border size="md"/>

4. A dialog box should appear for confirmation. Click on Resume again.
<br/>

<Image img={resume_dialog} border size="md"/>

5. Head over to the **Metrics** tab.
6. In around 5 seconds (and also on page refresh), the status of the pipe should be **Running**.
43 changes: 43 additions & 0 deletions docs/integrations/data-ingestion/clickpipes/mysql/resync.md
@@ -0,0 +1,43 @@
---
title: 'Resyncing a Database ClickPipe'
description: 'Doc for resyncing a database ClickPipe'
slug: /integrations/clickpipes/mysql/resync
sidebar_label: 'Resync ClickPipe'
---

import resync_button from '@site/static/images/integrations/data-ingestion/clickpipes/postgres/resync_button.png'

### What does Resync do? {#what-mysql-resync-do}

Resync involves the following operations in order:
1. The existing ClickPipe is dropped, and a new "resync" ClickPipe is kicked off. Thus, changes to source table structures will be picked up when you resync.
2. The resync ClickPipe creates (or replaces) a new set of destination tables which have the same names as the original tables except with a `_resync` suffix.
3. Initial load is performed on the `_resync` tables.
4. The `_resync` tables are then swapped with the original tables. Soft deleted rows are transferred from the original tables to the `_resync` tables before the swap.


All the settings of the original ClickPipe are retained in the resync ClickPipe. The statistics of the original ClickPipe are cleared in the UI.

### Use cases for resyncing a ClickPipe {#use-cases-mysql-resync}
Here are a few scenarios:

1. You may need to perform major schema changes on the source tables which would break the existing ClickPipe and you would need to restart. You can just click Resync after performing the changes.
2. Specifically for ClickHouse, you may need to change the ORDER BY keys on the target tables. You can Resync to re-populate data into the new table with the right sorting key.
3. The replication slot of the ClickPipe is invalidated: Resync creates a new ClickPipe and a new slot on the source database.

:::info
You can resync multiple times. However, please account for the load on the source database each time you resync,
since every resync involves an initial load with parallel threads.
:::

### Resync ClickPipe Guide {#guide-mysql-resync}
1. In the Data Sources tab, click on the MySQL ClickPipe you wish to resync.
2. Head over to the **Settings** tab.
3. Click on the **Resync** button.
<br/>
<img src={resync_button} alt="Resync button" />
4. A dialog box should appear for confirmation. Click on Resync again.
<br/>
5. Head over to the **Metrics** tab.
6. In around 5 seconds (and also on page refresh), the status of the pipe should be **Setup** or **Snapshot**.
7. The initial load of the resync can be monitored in the **Tables** tab - in the **Initial Load Stats** section.
@@ -0,0 +1,25 @@
---
title: 'Resyncing Specific Tables'
description: 'Resyncing specific tables in a MySQL ClickPipe'
slug: /integrations/clickpipes/mysql/table_resync
sidebar_label: 'Resync Table'
---

# Resyncing specific tables {#resync-tables}

There are scenarios where it is useful to re-sync specific tables of a pipe. Sample use cases include major schema changes on the source, or data re-modelling on ClickHouse.

While resyncing individual tables with a button click is a work-in-progress, this guide will share steps on how you can achieve this today in the MySQL ClickPipe.

### 1. Remove the table from the pipe {#removing-table}

You can do this by following the [table removal guide](./removing_tables).

### 2. Truncate or drop the table on ClickHouse {#truncate-drop-table}

This step is to avoid data duplication when we add this table again in the next step. You can do this by heading over to the **SQL Console** tab in ClickHouse Cloud and running a query.
Note that since PeerDB creates ReplacingMergeTree tables by default, this step can be skipped if your table is small enough that temporary duplicates are harmless.

### 3. Add the table to the ClickPipe again {#add-table-again}

You can do this by following the [table addition guide](./add_table).
@@ -0,0 +1,55 @@
---
title: 'Controlling the Syncing of a Database ClickPipe'
description: 'Doc for controlling the sync of a database ClickPipe'
slug: /integrations/clickpipes/postgres/sync_control
sidebar_label: 'Controlling syncs'
---

import edit_sync_button from '@site/static/images/integrations/data-ingestion/clickpipes/postgres/edit_sync_button.png'
import create_sync_settings from '@site/static/images/integrations/data-ingestion/clickpipes/postgres/create_sync_settings.png'
import edit_sync_settings from '@site/static/images/integrations/data-ingestion/clickpipes/postgres/sync_settings_edit.png'
import cdc_syncs from '@site/static/images/integrations/data-ingestion/clickpipes/postgres/cdc_syncs.png'

This document describes how to control the sync of a database ClickPipe (Postgres, MySQL etc.) when the ClickPipe is in **CDC (Running) mode**.

## Overview {#overview-pg-sync}

Database ClickPipes have an architecture that consists of two parallel processes - pulling from the source database and pushing to the target database. The pulling process is controlled by a sync configuration that defines how often the data should be pulled and how much data should be pulled at a time. By "at a time", we mean one batch - since the ClickPipe pulls and pushes data in batches.

Two main settings control the sync of a database ClickPipe. The ClickPipe starts pushing as soon as either of the settings below is reached.

### Sync interval {#interval-pg-sync}
The sync interval of the pipe is the amount of time (in seconds) for which the ClickPipe will pull records from the source database. The time to push what we have to ClickHouse is not included in this interval.

The default is **1 minute**.
Sync interval can be set to any positive integer value, but it is recommended to keep it above 10 seconds.

### Pull batch size {#batch-size-pg-sync}
The pull batch size is the number of records that the ClickPipe will pull from the source database in one batch. Records mean inserts, updates and deletes done on the tables that are part of the pipe.

The default is **100,000** records.
A safe maximum is 10 million.

### An exception: Long-running transactions on source {#transactions-pg-sync}
When a transaction is run on the source database, the ClickPipe waits until it receives the COMMIT of the transaction before it moves forward. This **overrides** both the sync interval and the pull batch size.

### Configuring sync settings {#configuring-pg-sync}
You can set the sync interval and pull batch size when you create a ClickPipe or edit an existing one.
When creating a ClickPipe, these settings appear in the second step of the creation wizard, as shown below:
<img src={create_sync_settings} alt="Create sync settings" />

When editing an existing ClickPipe, you can head over to the **Settings** tab of the pipe, pause the pipe and then click on **Configure** here:
<img src={edit_sync_button} alt="Edit sync button" />

This will open a flyout with the sync settings, where you can change the sync interval and pull batch size:
<img src={edit_sync_settings} alt="Edit sync settings" />

### Tweaking the sync settings to help with replication slot growth {#tweaking-pg-sync}
Let's talk about how to use these settings to handle a large replication slot of a CDC pipe.
The pushing time to ClickHouse does not scale linearly with the pulling time from the source database. This can be leveraged to reduce the size of a large replication slot.
By increasing both the sync interval and pull batch size, the ClickPipe will pull a whole lot of data from the source database in one go, and then push it to ClickHouse.
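
One simple way to picture why larger batches help is a toy model in which every push pays a fixed overhead plus a small per-record cost. This model and both of its cost constants are assumptions made purely for illustration; they are not measurements of ClickPipes or ClickHouse, and `total_push_seconds` is a hypothetical helper.

```python
import math


def total_push_seconds(total_records: int, pull_batch_size: int,
                       overhead_per_push_s: float = 10.0,
                       seconds_per_record: float = 0.00001) -> float:
    """Toy model: every push pays a fixed overhead plus a per-record cost.

    Both cost constants are made-up values for illustration, not measurements.
    """
    pushes = math.ceil(total_records / pull_batch_size)
    return pushes * overhead_per_push_s + total_records * seconds_per_record


backlog = 10_000_000  # hypothetical number of records waiting in the slot
small = total_push_seconds(backlog, pull_batch_size=100_000)
large = total_push_seconds(backlog, pull_batch_size=1_000_000)
print(f"100k batches: {small:.0f}s, 1M batches: {large:.0f}s")
```

Under these assumed costs, fewer and larger batches spend less total time pushing the same backlog, which is the effect the settings above aim to exploit.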

### Monitoring sync control behaviour {#monitoring-pg-sync}
You can see how long each batch takes in the **CDC Syncs** table in the **Metrics** tab of the ClickPipe. Note that the duration shown here includes the push time. If no rows are incoming, the ClickPipe waits, and that wait time is also included in the duration.

<img src={cdc_syncs} alt="CDC Syncs table" />