Commit e198c8a

more images
1 parent 6fd55a5 commit e198c8a

6 files changed: 29 additions & 55 deletions


docs/integrations/data-ingestion/dbms/dynamodb/index.md

Lines changed: 5 additions & 5 deletions
@@ -12,6 +12,7 @@ import ExperimentalBadge from '@theme/badges/ExperimentalBadge';
 import dynamodb_kinesis_stream from '@site/static/images/integrations/data-ingestion/dbms/dynamodb/dynamodb-kinesis-stream.png';
 import dynamodb_s3_export from '@site/static/images/integrations/data-ingestion/dbms/dynamodb/dynamodb-s3-export.png';
 import dynamodb_map_columns from '@site/static/images/integrations/data-ingestion/dbms/dynamodb/dynamodb-map-columns.png';
+import Image from '@theme/IdealImage';

 # CDC from DynamoDB to ClickHouse

@@ -31,14 +32,14 @@ Data will be ingested into a `ReplacingMergeTree`. This table engine is commonly
 First, you will want to enable a Kinesis stream on your DynamoDB table to capture changes in real-time. We want to do this before we create the snapshot to avoid missing any data.
 Find the AWS guide located [here](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/kds.html).

-<img src={dynamodb_kinesis_stream} alt="DynamoDB Kinesis Stream"/>
+<Image img={dynamodb_kinesis_stream} size="lg" alt="DynamoDB Kinesis Stream" />

 ## 2. Create the snapshot {#2-create-the-snapshot}

 Next, we will create a snapshot of the DynamoDB table. This can be achieved through an AWS export to S3. Find the AWS guide located [here](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/S3DataExport.HowItWorks.html).
 **You will want to do a "Full export" in the DynamoDB JSON format.**

-<img src={dynamodb_s3_export} alt="DynamoDB S3 Export"/>
+<Image img={dynamodb_s3_export} size="lg" alt="DynamoDB S3 Export"/>

 ## 3. Load the snapshot into ClickHouse {#3-load-the-snapshot-into-clickhouse}

@@ -91,7 +92,7 @@ CREATE TABLE IF NOT EXISTS "default"."destination" (
 "first_name" String,
 "age" Int8,
 "version" Int64
-)
+)
 ENGINE ReplacingMergeTree("version")
 ORDER BY id;
 ```
@@ -128,7 +129,7 @@ Now we can set up the Kinesis ClickPipe to capture real-time changes from the Ki
 - `ApproximateCreationDateTime`: `version`
 - Map other fields to the appropriate destination columns as shown below

-<img src={dynamodb_map_columns} alt="DynamoDB Map Columns"/>
+<Image img={dynamodb_map_columns} size="lg" alt="DynamoDB Map Columns"/>

 ## 5. Cleanup (optional) {#5-cleanup-optional}

@@ -139,4 +140,3 @@ DROP TABLE IF EXISTS "default"."snapshot";
 DROP TABLE IF EXISTS "default"."snapshot_clickpipes_error";
 DROP VIEW IF EXISTS "default"."snapshot_mv";
 ```
-
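
Aside on the destination table shown in the hunks above: since it is a `ReplacingMergeTree` ordered by `id` with `version` mapped from `ApproximateCreationDateTime`, duplicates collapse in the background and reads can force deduplication with `FINAL`. A minimal sketch, not part of the commit, using only the table and column names visible in the diff excerpt:

```sql
-- Background merges keep, per "id", the row with the highest "version"
-- (the Kinesis ApproximateCreationDateTime). FINAL applies the same
-- collapse at query time so the read sees only the latest state per key.
SELECT id, first_name, age, version
FROM "default"."destination" FINAL
LIMIT 10;
```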

docs/integrations/data-ingestion/dbms/postgresql/postgres-vs-clickhouse.md

Lines changed: 9 additions & 18 deletions
@@ -6,6 +6,7 @@ description: 'Page which explores the similarities and differences between Postg
 ---

 import postgresReplicas from '@site/static/images/integrations/data-ingestion/dbms/postgres-replicas.png';
+import Image from '@theme/IdealImage';

 ## Postgres vs ClickHouse: Equivalent and different concepts {#postgres-vs-clickhouse-equivalent-and-different-concepts}

@@ -33,22 +34,15 @@ ClickHouse uses ClickHouse Keeper (C++ ZooKeeper implementation, ZooKeeper can a

 The replication process in ClickHouse (1) starts when data is inserted into any replica. This data, in its raw insert form, is (2) written to disk along with its checksums. Once written, the replica (3) attempts to register this new data part in Keeper by allocating a unique block number and logging the new part's details. Other replicas, upon (4) detecting new entries in the replication log, (5) download the corresponding data part via an internal HTTP protocol, verifying it against the checksums listed in ZooKeeper. This method ensures that all replicas eventually hold consistent and up-to-date data despite varying processing speeds or potential delays. Moreover, the system is capable of handling multiple operations concurrently, optimizing data management processes, and allowing for system scalability and robustness against hardware discrepancies.

-<br />
-
-<img src={postgresReplicas}
-  class="image"
-  alt="NEEDS ALT"
-  style={{width: '500px'}} />
-
-<br />
+<Image img={postgresReplicas} size="md" alt="Eventual consistency"/>

-Note that ClickHouse Cloud uses a [cloud-optimized replication mechanism](https://clickhouse.com/blog/clickhouse-cloud-boosts-performance-with-sharedmergetree-and-lightweight-updates) adapted to its separation of storage and compute architecture. By storing data in shared object storage, data is automatically available for all compute nodes without the need to physically replicate data between nodes. Instead, Keeper is used to only share metadata (which data exists where in object storage) between compute nodes.
+Note that ClickHouse Cloud uses a [cloud-optimized replication mechanism](https://clickhouse.com/blog/clickhouse-cloud-boosts-performance-with-sharedmergetree-and-lightweight-updates) adapted to its separation of storage and compute architecture. By storing data in shared object storage, data is automatically available for all compute nodes without the need to physically replicate data between nodes. Instead, Keeper is used to only share metadata (which data exists where in object storage) between compute nodes.

 PostgreSQL employs a different replication strategy compared to ClickHouse, primarily using streaming replication, which involves a primary replica model where data is continuously streamed from the primary to one or more replica nodes. This type of replication ensures near real-time consistency and is synchronous or asynchronous, giving administrators control over the balance between availability and consistency. Unlike ClickHouse, PostgreSQL relies on a WAL (Write-Ahead Logging) with logical replication and decoding to stream data objects and changes between nodes. This approach in PostgreSQL is more straightforward but might not offer the same level of scalability and fault tolerance in highly distributed environments that ClickHouse achieves through its complex use of Keeper for distributed operations coordination and eventual consistency.

 ## User implications {#user-implications}

-In ClickHouse, the possibility of dirty reads - where users can write data to one replica and then read potentially unreplicated data from another—arises from its eventually consistent replication model managed via Keeper. This model emphasizes performance and scalability across distributed systems, allowing replicas to operate independently and sync asynchronously. As a result, newly inserted data might not be immediately visible across all replicas, depending on the replication lag and the time it takes for changes to propagate through the system.
+In ClickHouse, the possibility of dirty reads - where users can write data to one replica and then read potentially unreplicated data from another—arises from its eventually consistent replication model managed via Keeper. This model emphasizes performance and scalability across distributed systems, allowing replicas to operate independently and sync asynchronously. As a result, newly inserted data might not be immediately visible across all replicas, depending on the replication lag and the time it takes for changes to propagate through the system.

 Conversely, PostgreSQL's streaming replication model typically can prevent dirty reads by employing synchronous replication options where the primary waits for at least one replica to confirm the receipt of data before committing transactions. This ensures that once a transaction is committed, a guarantee exists that the data is available in another replica. In the event of primary failure, the replica will ensure queries see the committed data, thereby maintaining a stricter level of consistency.

@@ -88,27 +82,27 @@ In this case, users should ensure consistent node routing is performed based on

 ## Sequential consistency {#sequential-consistency}

-In exceptional cases, users may need sequential consistency.
+In exceptional cases, users may need sequential consistency.

-Sequential consistency in databases is where the operations on a database appear to be executed in some sequential order, and this order is consistent across all processes interacting with the database. This means that every operation appears to take effect instantaneously between its invocation and completion, and there is a single, agreed-upon order in which all operations are observed by any process.
+Sequential consistency in databases is where the operations on a database appear to be executed in some sequential order, and this order is consistent across all processes interacting with the database. This means that every operation appears to take effect instantaneously between its invocation and completion, and there is a single, agreed-upon order in which all operations are observed by any process.

 From a user's perspective this typically manifests itself as the need to write data into ClickHouse and when reading data, to guarantee that the latest inserted rows are returned.
 This can be achieved in several ways (in order of preference):

 1. **Read/Write to the same node** - If you are using native protocol, or a [session to do your write/read via HTTP](/interfaces/http#default-database), you should then be connected to the same replica: in this scenario you're reading directly from the node where you're writing, then your read will always be consistent.
-1. **Sync replicas manually** - If you write to one replica and read from another, you can use issue `SYSTEM SYNC REPLICA LIGHTWEIGHT` prior to reading.
+1. **Sync replicas manually** - If you write to one replica and read from another, you can use issue `SYSTEM SYNC REPLICA LIGHTWEIGHT` prior to reading.
 1. **Enable sequential consistency** - via the query setting [`select_sequential_consistency = 1`](/operations/settings/settings#select_sequential_consistency). In OSS, the setting `insert_quorum = 'auto'` must also be specified.

 <br />

 See [here](/cloud/reference/shared-merge-tree#consistency) for further details on enabling these settings.

-> Use of sequential consistency will place a greater load on ClickHouse Keeper. The result can
+> Use of sequential consistency will place a greater load on ClickHouse Keeper. The result can
 mean slower inserts and reads. SharedMergeTree, used in ClickHouse Cloud as the main table engine, sequential consistency [incurs less overhead and will scale better](/cloud/reference/shared-merge-tree#consistency). OSS users should use this approach cautiously and measure Keeper load.

 ## Transactional (ACID) support {#transactional-acid-support}

-Users migrating from PostgreSQL may be used to its robust support for ACID (Atomicity, Consistency, Isolation, Durability) properties, making it a reliable choice for transactional databases. Atomicity in PostgreSQL ensures that each transaction is treated as a single unit, which either completely succeeds or is entirely rolled back, preventing partial updates. Consistency is maintained by enforcing constraints, triggers, and rules that guarantee that all database transactions lead to a valid state. Isolation levels, from Read Committed to Serializable, are supported in PostgreSQL, allowing fine-tuned control over the visibility of changes made by concurrent transactions. Lastly, Durability is achieved through write-ahead logging (WAL), ensuring that once a transaction is committed, it remains so even in the event of a system failure.
+Users migrating from PostgreSQL may be used to its robust support for ACID (Atomicity, Consistency, Isolation, Durability) properties, making it a reliable choice for transactional databases. Atomicity in PostgreSQL ensures that each transaction is treated as a single unit, which either completely succeeds or is entirely rolled back, preventing partial updates. Consistency is maintained by enforcing constraints, triggers, and rules that guarantee that all database transactions lead to a valid state. Isolation levels, from Read Committed to Serializable, are supported in PostgreSQL, allowing fine-tuned control over the visibility of changes made by concurrent transactions. Lastly, Durability is achieved through write-ahead logging (WAL), ensuring that once a transaction is committed, it remains so even in the event of a system failure.

 These properties are common for OLTP databases that act as a source of truth.

@@ -125,6 +119,3 @@ PeerDB is now available natively in ClickHouse Cloud - Blazing-fast Postgres to
 [PeerDB](https://www.peerdb.io/) enables you to seamlessly replicate data from Postgres to ClickHouse. You can use this tool for
 1. continuous replication using CDC, allowing Postgres and ClickHouse to coexist—Postgres for OLTP and ClickHouse for OLAP; and
 2. migrating from Postgres to ClickHouse.
-
-
-
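
A hedged illustration of the sequential-consistency options listed in the hunk above (options 2 and 3). The statements and settings are the ones the doc names; the `default.events` table, its columns, and the inserted values are hypothetical:

```sql
-- default.events stands in for any ReplicatedMergeTree table with (id UInt64, ts DateTime).

-- Option 2: sync the replica you are about to read from, then read.
SYSTEM SYNC REPLICA default.events LIGHTWEIGHT;
SELECT count() FROM default.events;

-- Option 3: make the read wait for quorum-confirmed parts.
-- In OSS the insert side must also use a quorum, hence insert_quorum = 'auto'.
INSERT INTO default.events SETTINGS insert_quorum = 'auto' VALUES (1, now());
SELECT count() FROM default.events SETTINGS select_sequential_consistency = 1;
```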

docs/integrations/data-ingestion/redshift/index.md

Lines changed: 7 additions & 9 deletions
@@ -11,6 +11,7 @@ import pull from '@site/static/images/integrations/data-ingestion/redshift/pull.
 import pivot from '@site/static/images/integrations/data-ingestion/redshift/pivot.png';
 import s3_1 from '@site/static/images/integrations/data-ingestion/redshift/s3-1.png';
 import s3_2 from '@site/static/images/integrations/data-ingestion/redshift/s3-2.png';
+import Image from '@theme/IdealImage';

 # Migrating Data from Redshift to ClickHouse

@@ -34,7 +35,7 @@ import s3_2 from '@site/static/images/integrations/data-ingestion/redshift/s3-2.

 [Amazon Redshift](https://aws.amazon.com/redshift/) is a popular cloud data warehousing solution that is part of the Amazon Web Services offerings. This guide presents different approaches to migrating data from a Redshift instance to ClickHouse. We will cover three options:

-<img src={redshiftToClickhouse} class="image" alt="Redshift to ClickHouse Migration Options"/>
+<Image img={redshiftToClickhouse} size="lg" alt="Redshift to ClickHouse Migration Options" background="white"/>

 From the ClickHouse instance standpoint, you can either:

@@ -53,8 +54,7 @@ We used Redshift as a data source in this tutorial. However, the migration appro

 In the push scenario, the idea is to leverage a third-party tool or service (either custom code or an [ETL/ELT](https://en.wikipedia.org/wiki/Extract,_transform,_load#ETL_vs._ELT)) to send your data to your ClickHouse instance. For example, you can use a software like [Airbyte](https://www.airbyte.com/) to move data between your Redshift instance (as a source) and ClickHouse as a destination ([see our integration guide for Airbyte](/integrations/data-ingestion/etl-tools/airbyte-and-clickhouse.md))

-
-<img src={push} class="image" alt="PUSH Redshift to ClickHouse"/>
+<Image img={push} size="lg" alt="PUSH Redshift to ClickHouse" background="white"/>

 ### Pros {#pros}

@@ -72,8 +72,7 @@ In the push scenario, the idea is to leverage a third-party tool or service (eit

 In the pull scenario, the idea is to leverage the ClickHouse JDBC Bridge to connect to a Redshift cluster directly from a ClickHouse instance and perform `INSERT INTO ... SELECT` queries:

-
-<img src={pull} class="image" alt="PULL from Redshift to ClickHouse"/>
+<Image img={pull} size="lg" alt="PULL from Redshift to ClickHouse" background="white"/>

 ### Pros {#pros-1}

@@ -197,7 +196,7 @@ If you are using ClickHouse Cloud, you will need to run your ClickHouse JDBC Bri

 In this scenario, we export data to S3 in an intermediary pivot format and, in a second step, load the data from S3 into ClickHouse.

-<img src={pivot} class="image" alt="PIVOT from Redshift using S3"/>
+<Image img={pivot} size="lg" alt="PIVOT from Redshift using S3" background="white"/>

 ### Pros {#pros-2}

@@ -214,11 +213,11 @@ In this scenario, we export data to S3 in an intermediary pivot format and, in a

 1. Using Redshift's [UNLOAD](https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html) feature, export the data into a an existing private S3 bucket:

-<img src={s3_1} class="image" alt="UNLOAD from Redshift to S3"/>
+<Image img={s3_1} size="md" alt="UNLOAD from Redshift to S3" background="white"/>

 It will generate part files containing the raw data in S3

-<img src={s3_2} class="image" alt="Data in S3"/>
+<Image img={s3_2} size="md" alt="Data in S3" background="white"/>

 2. Create the table in ClickHouse:

@@ -261,4 +260,3 @@ In this scenario, we export data to S3 in an intermediary pivot format and, in a
 :::note
 This example used CSV as the pivot format. However, for production workloads we recommend Apache Parquet as the best option for large migrations since it comes with compression and can save some storage costs while reducing transfer times. (By default, each row group is compressed using SNAPPY). ClickHouse also leverages Parquet's column orientation to speed up data ingestion.
 :::
-
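
For the S3 pivot path documented above, a minimal sketch of the two steps: a Redshift `UNLOAD` followed by a ClickHouse insert via the `s3` table function. The bucket, prefix, IAM role, credentials, and table names are placeholders, and Parquet is used per the note's recommendation (the doc's own walkthrough uses CSV):

```sql
-- Step 1 (run in Redshift): export the source table to an existing private S3 bucket.
UNLOAD ('SELECT * FROM my_schema.my_table')
TO 's3://my-bucket/redshift-export/part_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftUnloadRole'
FORMAT AS PARQUET;

-- Step 2 (run in ClickHouse): load the exported part files from S3 into the destination table.
INSERT INTO my_destination_table
SELECT *
FROM s3(
    'https://my-bucket.s3.amazonaws.com/redshift-export/*',
    '<aws_access_key_id>',
    '<aws_secret_access_key>',
    'Parquet'
);
```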

docs/migrations/postgres/data-modeling-techniques.md

Lines changed: 3 additions & 10 deletions
@@ -7,6 +7,7 @@ keywords: ['postgres', 'postgresql', 'migrate', 'migration', 'data modeling']

 import postgres_partitions from '@site/static/images/migrations/postgres-partitions.png';
 import postgres_projections from '@site/static/images/migrations/postgres-projections.png';
+import Image from '@theme/IdealImage';

 > This is **Part 3** of a guide on migrating from PostgreSQL to ClickHouse. This content can be considered introductory, with the aim of helping users deploy an initial functional system that adheres to ClickHouse best practices. It avoids complex topics and will not result in a fully optimized schema; rather, it provides a solid foundation for users to build a production system and base their learning.

@@ -18,11 +19,7 @@ Postgres users will be familiar with the concept of table partitioning for enhan

 In ClickHouse, partitioning is specified on a table when it is initially defined via the `PARTITION BY` clause. This clause can contain a SQL expression on any columns, the results of which will define which partition a row is sent to.

-<br />
-
-<img src={postgres_partitions} class="image" alt="PostgreSQL partitions to ClickHouse partitions" style={{width: '600px'}} />
-
-<br />
+<Image img={postgres_partitions} size="md" alt="PostgreSQL partitions to ClickHouse partitions"/>

 The data parts are logically associated with each partition on disk and can be queried in isolation. For the example below, we partition the `posts` table by year using the expression `toYear(CreationDate)`. As rows are inserted into ClickHouse, this expression will be evaluated against each row and routed to the resulting partition if it exists (if the row is the first for a year, the partition will be created).

@@ -210,11 +207,7 @@ WHERE UserId = 8592047

 Projections are an appealing feature for new users as they are automatically maintained as data is inserted. Furthermore, queries can just be sent to a single table where the projections are exploited where possible to speed up the response time.

-<br />
-
-<img src={postgres_projections} class="image" alt="PostgreSQL projections in ClickHouse" style={{width: '600px'}} />
-
-<br />
+<Image img={postgres_projections} size="md" alt="PostgreSQL projections in ClickHouse"/>

 This is in contrast to materialized views, where the user has to select the appropriate optimized target table or rewrite their query, depending on the filters. This places greater emphasis on user applications and increases client-side complexity.
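
To make the partitioning hunk above concrete, a minimal sketch of a table partitioned by `toYear(CreationDate)` as the doc describes; the column set is a reduced, hypothetical version of the guide's `posts` table:

```sql
CREATE TABLE posts_by_year
(
    `Id` Int32,
    `Title` String,
    `CreationDate` DateTime
)
ENGINE = MergeTree
-- Each year's rows land in their own partition, created on first insert for that year.
PARTITION BY toYear(CreationDate)
ORDER BY Id;
```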
