
Commit 4b9dda3

ETL/CDC: Guidance, Layout
1 parent 2788f10 commit 4b9dda3

13 files changed: 303 additions, 26 deletions

docs/_include/links.md (4 additions, 0 deletions)

```diff
@@ -1,7 +1,10 @@
 [Amazon DynamoDB Streams]: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html
 [Amazon Kinesis Data Streams]: https://docs.aws.amazon.com/streams/latest/dev/introduction.html
+[Apache Airflow]: https://airflow.apache.org/
+[Astronomer]: https://www.astronomer.io/
 [AWS Database Migration Service (AWS DMS)]: https://aws.amazon.com/dms/
 [AWS DMS Integration with CrateDB]: https://cratedb-toolkit.readthedocs.io/io/dms/
+[AWS Lambda]: https://aws.amazon.com/lambda/
 [BM25]: https://en.wikipedia.org/wiki/Okapi_BM25
 [cloud-datashader-colab]: https://colab.research.google.com/github/crate/cratedb-examples/blob/amo/cloud-datashader/topic/timeseries/explore/cloud-datashader.ipynb
 [cloud-datashader-github]: https://github.com/crate/cratedb-examples/blob/amo/cloud-datashader/topic/timeseries/explore/cloud-datashader.ipynb
@@ -17,6 +20,7 @@
 [dask-weather-data-github]: https://github.com/crate/cratedb-examples/blob/main/topic/timeseries/dask-weather-data-import.ipynb
 [Datashader]: https://datashader.org/
 [Dynamic Database Schemas]: https://cratedb.com/product/features/dynamic-schemas
+[DynamoDB]: https://aws.amazon.com/dynamodb/
 [DynamoDB CDC Relay]: https://cratedb-toolkit.readthedocs.io/io/dynamodb/cdc.html
 [DynamoDB CDC Relay with AWS Lambda]: https://cratedb-toolkit.readthedocs.io/io/dynamodb/cdc-lambda.html
 [DynamoDB Table Loader]: https://cratedb-toolkit.readthedocs.io/io/dynamodb/loader.html
```

docs/_include/styles.html (7 additions, 0 deletions)

```diff
@@ -46,4 +46,11 @@
   height: 0;
 }
 
+/* On tiled link overview index pages, give ul/li elements more space */
+.ul-li-wide {
+  ul li {
+    margin-bottom: 1rem;
+  }
+}
+
 </style>
```

docs/connect/index.md (4 additions, 1 deletion)

```diff
@@ -3,6 +3,8 @@
 
 :::{include} /_include/links.md
 :::
+:::{include} /_include/styles.html
+:::
 
 :::::{grid}
 :padding: 0
@@ -71,6 +73,7 @@ protocol.
 :gutter: 2
 
 ::::{grid-item-card} {material-outlined}`link;2em` How to connect
+:class-body: ul-li-wide
 - {ref}`connect-configure`
 
 To connect to CrateDB, your application or driver needs to be configured
@@ -87,7 +90,7 @@ protocol.
 Database connectivity options and tools.
 ::::
 
-::::{grid-item-card} {material-outlined}`not_started;2em` How to use database drivers
+::::{grid-item-card} {material-outlined}`link;2em` How to connect
 - {ref}`connect-java`
 - {ref}`connect-javascript`
 - {ref}`connect-php`
```
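For context, and not part of this commit: the drivers listed in that card speak either CrateDB's HTTP protocol (default port 4200) or its PostgreSQL wire protocol (default port 5432). A minimal Python sketch of both connection styles, assuming a local CrateDB node with default settings; the `cratedb_pg_dsn` helper is illustrative, not part of any CrateDB API.

```python
# Sketch: two common ways to connect to CrateDB from Python.
# Assumes a local CrateDB instance with default ports; the helper
# function below is illustrative, not part of any CrateDB API.

def cratedb_pg_dsn(host="localhost", port=5432, user="crate", sslmode="disable"):
    """Build a libpq-style DSN for CrateDB's PostgreSQL wire protocol."""
    return f"host={host} port={port} user={user} sslmode={sslmode}"

# PostgreSQL wire protocol, e.g. with psycopg2 (needs a running server):
# import psycopg2
# conn = psycopg2.connect(cratedb_pg_dsn())

# HTTP protocol, e.g. with the `crate` client library:
# from crate import client
# conn = client.connect("http://localhost:4200", username="crate")
```

Either style yields a DB-API connection, so application code downstream of the connect call is largely the same.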

docs/ingest/cdc/index.md (86 additions, 10 deletions)

```diff
@@ -5,20 +5,96 @@
 :::
 
 :::{div}
-CrateDB provides many options to connect and integrate with third-party
+Options to connect and integrate CrateDB with third-party
 CDC applications, mostly using [CrateDB's PostgreSQL interface].
 CrateDB also provides native adapter components to leverage advanced
 features.
 
-This documentation section lists corresponding CDC applications and
-frameworks which can be used together with CrateDB, and outlines how
-to use them optimally.
+This documentation section lists CDC applications,
+frameworks, and solutions that can be used together with CrateDB,
+and outlines how to use them optimally.
 Please also take a look at support for {ref}`generic ETL <etl>` solutions.
 :::
 
-- {ref}`aws-dms`
-- {ref}`aws-dynamodb`
-- {ref}`aws-kinesis`
-- {ref}`debezium`
-- {ref}`mongodb`
-- {ref}`streamsets`
+
+## Connectors
+
+Native and specialized connectors for CrateDB, both managed and unmanaged.
+
+:::::{grid} 1
+:gutter: 2
+
+::::{grid-item-card} Amazon DynamoDB
+:link: aws-dynamodb
+:link-type: ref
+Load data from DynamoDB, a fully managed NoSQL database service provided by
+Amazon Web Services (AWS), which is designed for high-performance, scalable
+applications and offers key-value and document data structures.
+::::
+
+::::{grid-item-card} Amazon Kinesis
+:link: aws-kinesis
+:link-type: ref
+Load data from Amazon Kinesis Data Streams, a serverless streaming data service
+that simplifies the capture, processing, and storage of data streams at any scale.
+::::
+
+::::{grid-item-card} MongoDB
+:link: mongodb
+:link-type: ref
+Load data from MongoDB or MongoDB Atlas, a document database, self-hosted
+or multi-cloud.
+::::
+
+:::::
+
+
+## Platforms
+
+Support for data integration frameworks and platforms, both managed and unmanaged.
+
+:::::{grid} 1
+:gutter: 2
+
+::::{grid-item-card} AWS DMS
+:link: aws-dms
+:link-type: ref
+Use AWS Database Migration Service (AWS DMS), a managed migration and replication
+service that helps move your database and analytics workloads between different
+kinds of databases.
+::::
+
+::::{grid-item-card} Debezium
+:link: debezium
+:link-type: ref
+Use Debezium, an open source distributed platform for change data capture,
+to load data into CrateDB.
+It is used as a building block by a number of downstream third-party projects and products.
+::::
+
+::::{grid-item-card} Estuary
+:link: estuary
+:link-type: ref
+Use Estuary Flow, a managed, real-time, reliable change data capture (CDC) solution,
+to load data into CrateDB.
+It combines agentless CDC, zero-code pipelines, and enterprise-grade governance to
+simplify data integration.
+::::
+
+::::{grid-item-card} RisingWave
+:link: risingwave
+:link-type: ref
+Use RisingWave, a stream processing and management platform, to load data into CrateDB.
+It provides a Postgres-compatible SQL interface, like CrateDB, and a DataFrame-style
+Python interface. It is available on-premises and as a managed service.
+::::
+
+::::{grid-item-card} StreamSets
+:link: streamsets
+:link-type: ref
+Use the StreamSets Data Collector Engine to ingest and transform data from a variety
+of sources into CrateDB. It runs on-premises or in any cloud.
+::::
+
+:::::
+
```
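For context, and not part of this commit: records arriving from a DynamoDB stream use DynamoDB's typed attribute-value encoding (for example `{"S": "text"}` or `{"N": "42"}`), which a CDC relay must decode into plain values before inserting rows into CrateDB. The cratedb-toolkit components referenced above handle this; purely as an illustration of the shape of the problem, a minimal decoder sketch (not the toolkit's implementation):

```python
# Illustrative only: decode DynamoDB's typed attribute-value encoding
# into plain Python values, as a CDC relay must do before turning a
# stream record into a CrateDB row. Not the cratedb-toolkit implementation.

def decode_attr(value: dict):
    """Decode a single DynamoDB AttributeValue into a Python value."""
    (tag, payload), = value.items()
    if tag == "S":    # string
        return payload
    if tag == "N":    # number, transported as a string
        return int(payload) if payload.lstrip("-").isdigit() else float(payload)
    if tag == "BOOL":
        return payload
    if tag == "NULL":
        return None
    if tag == "L":    # list of AttributeValues
        return [decode_attr(item) for item in payload]
    if tag == "M":    # map of AttributeValues
        return {key: decode_attr(item) for key, item in payload.items()}
    raise ValueError(f"unhandled DynamoDB type tag: {tag}")

def decode_image(image: dict) -> dict:
    """Decode a NewImage/OldImage stream record into a flat dict."""
    return {key: decode_attr(value) for key, value in image.items()}
```

A decoded `M` (map) attribute maps naturally onto a CrateDB `OBJECT` column, which is why the document-shaped DynamoDB data mentioned in the card fits CrateDB's dynamic schemas well.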

docs/ingest/etl/index.md (182 additions, 3 deletions)

```diff
@@ -5,29 +5,207 @@
 
 :::{include} /_include/links.md
 :::
+:::{include} /_include/styles.html
+:::
 
 :::{div}
-CrateDB provides many options to connect and integrate with third-party
+Options to connect and integrate CrateDB with third-party
 ETL applications, mostly using [CrateDB's PostgreSQL interface].
 CrateDB also provides native adapter components to leverage advanced
 features.
 
-This documentation section lists corresponding ETL applications and
+This documentation section lists ETL applications and
 frameworks which can be used together with CrateDB, and outlines how
 to use them optimally.
 Please also take a look at support for {ref}`cdc` solutions.
 :::
 
 
+:::{rubric} Grouped by category
+:::
+
+:::::{grid} 1 2 2 2
+:margin: 4 4 0 0
+:padding: 0
+:gutter: 2
+:class-container: ul-li-wide
+
+
+::::{grid-item-card} {material-outlined}`air;2em` Dataflow / Pipeline / Code-first
+- {ref}`apache-airflow`
+
+  Apache Airflow is an open source software platform to programmatically author,
+  schedule, and monitor workflows. Pipelines are defined in Python, allowing for
+  dynamic pipeline generation and on-demand, code-driven pipeline invocation.
+
+- {ref}`apache-flink`
+
+  Apache Flink is a programming framework and distributed processing engine for
+  stateful computations over unbounded and bounded data streams, written in Java.
+
+- {ref}`apache-nifi`
+
+  Apache NiFi is a dataflow system based on the concepts of flow-based programming.
+  It supports powerful and scalable directed graphs of data routing, transformation,
+  and system mediation logic.
+
+- {ref}`dbt`
+
+  dbt is an SQL-first platform for transforming data in data warehouses using
+  Python and SQL. The data abstraction layer provided by dbt-core allows the
+  decoupling of the models on which reports and dashboards rely from the source data.
+
+- {ref}`kestra`
+
+  Kestra is an open source workflow automation and orchestration toolkit with a rich
+  plugin ecosystem. It enables users to automate and manage complex workflows in a
+  streamlined and efficient manner, defining them either declaratively or imperatively,
+  using any scripting language like Python, Bash, or JavaScript.
+
+- {ref}`meltano`
+
+  Meltano is a declarative code-first polyglot data integration engine adhering to
+  the Singer specification. Singer is a composable open source ETL framework and
+  specification, including powerful data extraction and consolidation elements.
+
++++
+Data pipeline programming frameworks and platforms.
+::::
+
+
+::::{grid-item-card} {material-outlined}`all_inclusive;2em` Low-code / No-code / Visual
+- {ref}`apache-hop`
+
+  Apache Hop aims to be the future of data integration. Visual development enables
+  developers to be more productive than they can be through code.
+
+- {ref}`estuary`
+
+  Estuary provides real-time data integration and modern ETL and ELT data pipelines
+  as a fully managed solution. Estuary Flow is a real-time, reliable change data
+  capture (CDC) solution.
+
+- {ref}`node-red`
+
+  Node-RED is an open-source programming tool for wiring together hardware devices,
+  APIs and online services within a low-code programming environment for event-driven
+  applications.
+
++++
+Visual data flow and integration frameworks and platforms.
+::::
+
+
+::::{grid-item-card} {material-outlined}`storage;2em` Databases
+- {ref}`aws-dms`
+
+  AWS DMS is a managed migration and replication service that helps move your
+  database and analytics workloads between different kinds of databases quickly,
+  securely, and with minimal downtime and zero data loss.
+
+- {ref}`aws-dynamodb`
+
+  DynamoDB is a fully managed NoSQL database service provided by Amazon Web Services (AWS).
+
+- {ref}`influxdb`
+
+  InfluxDB is a scalable datastore for metrics, events, and real-time analytics to
+  collect, process, transform, and store event and time series data.
+
+- {ref}`mongodb`
+
+  MongoDB is a document database designed for ease of application development and scaling.
+
+- {ref}`mysql`
+
+  MySQL and MariaDB are well-known free and open-source relational database management
+  systems (RDBMS), available as standalone and managed variants.
+
+- {ref}`sql-server`
+
+  Microsoft SQL Server Integration Services (SSIS) is a component of the Microsoft SQL
+  Server database software that can be used to perform a broad range of data migration tasks.
+
++++
+Load data from database systems.
+::::
+
+
+::::{grid-item-card} {material-outlined}`fast_forward;2em` Streams
+- {ref}`apache-kafka`
+
+  Apache Kafka is an open-source distributed event streaming platform
+  for high-performance data pipelines, streaming analytics, data integration,
+  and mission-critical applications.
+
+- {ref}`aws-kinesis`
+
+  Amazon Kinesis Data Streams is a serverless streaming data service that simplifies
+  the capture, processing, and storage of data streams at any scale, such as
+  application logs, website clickstreams, and IoT telemetry data, for machine
+  learning (ML), analytics, and other applications.
+
+- {ref}`risingwave`
+
+  RisingWave is a stream processing and management platform that allows configuring
+  data sources, views on that data, and destinations where results are materialized.
+  It provides both a Postgres-compatible SQL interface, like CrateDB, and a
+  DataFrame-style Python interface.
+  It delivers low-latency insights from real-time streams, database CDC, and
+  time-series data, bringing streaming and batch together.
+
+- {ref}`streamsets`
+
+  The StreamSets Data Collector is a lightweight and powerful engine that allows you
+  to build streaming, batch and change-data-capture (CDC) pipelines that can ingest
+  and transform data from a variety of sources.
+
++++
+Load data from streaming platforms.
+::::
+
+
+::::{grid-item-card} {material-outlined}`add_to_queue;2em` Serverless Compute
+
+- {ref}`azure-functions`
+
+  An Azure Function is a short-lived, serverless computation that is triggered by
+  external events. The trigger produces an input payload, which is delivered to
+  the Azure Function. The Azure Function then does computation with this payload
+  and subsequently outputs its result to other Azure Functions, computation
+  services, or storage services.
++++
+Use serverless compute units for custom import tasks.
+::::
+
+
+::::{grid-item-card} {material-outlined}`dataset;2em` Datasets
+
+- {ref}`apache-iceberg`
+
+  Apache Iceberg is an open table format for analytic datasets.
+
++++
+Load data from datasets and open table formats.
+::::
+
+
+:::::
+
+
+:::{rubric} Alphabetically sorted
+:::
+
+:::{div}
 - {ref}`apache-airflow`
 - {ref}`apache-flink`
 - {ref}`apache-hop`
 - {ref}`apache-iceberg`
 - {ref}`apache-kafka`
 - {ref}`apache-nifi`
-- {ref}`aws-dms`
 - {ref}`aws-dynamodb`
 - {ref}`aws-kinesis`
+- {ref}`aws-dms`
 - {ref}`azure-functions`
 - {ref}`dbt`
 - {ref}`estuary`
@@ -40,3 +218,4 @@ Please also take a look at support for {ref}`cdc` solutions.
 - {ref}`risingwave`
 - {ref}`sql-server`
 - {ref}`streamsets`
+:::
```
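For context, and not part of this commit: whichever ETL framework from the categories above is chosen, the load step into CrateDB typically ends in batched, parameterized inserts over one of its interfaces. A generic Python sketch of that batching step; the table name `readings` and its columns are illustrative, and the actual insert call is commented out because it needs a running CrateDB server.

```python
# Sketch of the generic "load" step of an ETL pipeline targeting CrateDB.
# The table and column names are illustrative assumptions.
from itertools import islice

def chunks(rows, size):
    """Yield successive batches of at most `size` rows from any iterable."""
    iterator = iter(rows)
    while batch := list(islice(iterator, size)):
        yield batch

def insert_statement(table, columns):
    """Build a parameterized single-row INSERT, suitable for executemany()."""
    placeholders = ", ".join(["?"] * len(columns))
    return f'INSERT INTO {table} ({", ".join(columns)}) VALUES ({placeholders})'

sql = insert_statement("readings", ["ts", "sensor", "value"])
# With the `crate` client library and a running CrateDB (illustrative):
# from crate import client
# conn = client.connect("http://localhost:4200")
# cursor = conn.cursor()
# for batch in chunks(rows, 5000):
#     cursor.executemany(sql, batch)
```

Batching keeps round trips low and matches how most of the listed frameworks hand rows to their database sinks.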

docs/ingest/index.md (1 addition, 0 deletions)

```diff
@@ -10,6 +10,7 @@ All data ingestion methods for CrateDB at a glance.
 :margin: 4 4 0 0
 :padding: 0
 :gutter: 2
+:class-container: ul-li-wide
 
 ::::{grid-item-card} {material-outlined}`file_upload;2em` Load data using CrateDB
 - {ref}`Import files <crate-reference:sql-copy-from>`
```

docs/ingest/telemetry/index.md (1 addition, 1 deletion)

```diff
@@ -4,7 +4,7 @@
 # Telemetry data
 
 :::{div}
-CrateDB integrations with metrics collection agents, brokers, and stores.
+CrateDB integrates with metrics collection agents, brokers, and stores.
 This documentation section lists applications and daemons which can
 be used together with CrateDB, and educates about how to use them optimally.
 
```
