
Commit aa17ca0

committed
Integrate/dlt: Pull README and usage guide from upstream repository
Includes:

- Overview about supported features
- Usage guide based on `dlt init`
1 parent 914fd22 commit aa17ca0

File tree

2 files changed: +209 −8 lines


docs/integrate/dlt/index.md

Lines changed: 112 additions & 8 deletions
@@ -75,32 +75,136 @@ pipeline = dlt.pipeline(
pipeline.run(source)
```

## Supported features

### Data loading

Data is loaded into CrateDB using the most efficient method, depending on the data source.

- For local files, the `psycopg2` library is used to load files directly into
  CrateDB tables using the `INSERT` command.
- For files in remote storage like S3 or Azure Blob Storage, CrateDB data
  loading functions are used to read the files and insert the data into tables.
### Datasets

Use `dataset_name="doc"` to address CrateDB's default schema `doc`.
When addressing other schemas, make sure they contain at least one table. [^create-schema]
### File formats

- The [SQL INSERT file format] is the preferred format for both direct loading and staging.
### Column types

The `cratedb` destination has a few specific deviations from the default SQL destinations.

- CrateDB does not support the `time` datatype. Time values will be loaded into a `text` column.
- CrateDB does not support the `binary` datatype. Binary values will be loaded into a `text` column.
- CrateDB can produce rounding errors under certain conditions when using the `float`/`double` datatypes.
  Make sure to use the `decimal` datatype if you can't afford rounding errors.
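The type fallbacks above can be pictured as a simple lookup. The mapping and helper below are hypothetical names for illustration only, not part of the `dlt-cratedb` API.

```python
# Hypothetical illustration of the type fallbacks described above;
# these names are not part of the dlt-cratedb API.
CRATEDB_TYPE_FALLBACKS = {
    "time": "text",    # CrateDB has no TIME datatype.
    "binary": "text",  # CrateDB has no BINARY datatype.
}

def effective_type(dlt_type: str) -> str:
    """Return the column type CrateDB will actually use."""
    return CRATEDB_TYPE_FALLBACKS.get(dlt_type, dlt_type)

print(effective_type("time"))     # → text
print(effective_type("decimal"))  # → decimal
```

All other datatypes pass through unchanged; only `time` and `binary` degrade to `text`.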
### Column hints

CrateDB supports the following [column hints].

- `primary_key` - marks the column as part of the primary key. Multiple columns can have this hint to create a composite primary key.
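For illustration, all columns carrying the `primary_key` hint end up in a single composite constraint. The helper below is a hypothetical sketch of that translation, not `dlt-cratedb` code; the column names are made up.

```python
def render_primary_key(columns: list[str]) -> str:
    """Render one composite PRIMARY KEY constraint from all columns
    carrying the `primary_key` hint (hypothetical helper)."""
    return f"PRIMARY KEY ({', '.join(columns)})"

# Two hinted columns yield a single composite key constraint.
print(render_primary_key(["player_id", "game_id"]))  # → PRIMARY KEY (player_id, game_id)
```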
### File staging

CrateDB supports Amazon S3, Google Cloud Storage, and Azure Blob Storage as file staging destinations.

`dlt` will upload CSV or JSONL files to the staging location and use CrateDB data loading functions
to load the data directly from the staged files.

Please refer to the filesystem documentation to learn how to configure credentials for the staging destinations.

- [AWS S3]
- [Azure Blob Storage]
- [Google Storage]

Invoke a pipeline with staging enabled.

```python
pipeline = dlt.pipeline(
    pipeline_name='chess_pipeline',
    destination='cratedb',
    staging='filesystem',  # add this to activate staging
    dataset_name='chess_data'
)
```
### dbt support

Integration with [dbt] is generally supported via [dbt-cratedb2], but not tested by us.

### dlt state sync

The CrateDB destination fully supports [dlt state sync].


## See also

:::{rubric} Examples
:::
::::{grid}

:::{grid-item-card} Usage guide: Load API data with dlt
:link: dlt-usage
:link-type: ref
Exercise a canonical `dlt init` example with CrateDB.
:::

:::{grid-item-card} Examples: Use dlt with CrateDB
:link: https://github.com/crate/cratedb-examples/tree/main/framework/dlt
:link-type: url
Executable code examples on GitHub that demonstrate how to use dlt with CrateDB.
:::

::::

:::{rubric} Resources
:::
::::{grid}

:::{grid-item-card} Package: `dlt-cratedb`
:link: https://pypi.org/project/dlt-cratedb/
:link-type: url
The dlt destination adapter for CrateDB is
based on the dlt PostgreSQL adapter.
:::

:::{grid-item-card} Package: `ingestr`
:link: ingestr
:link-type: ref
The ingestr data import/export application uses dlt as a workhorse.
:::

::::
:::{toctree}
:maxdepth: 1
:hidden:
Usage <usage>
:::


[^create-schema]: CrateDB does not support `CREATE SCHEMA` yet, see [CRATEDB-14601].
    By default, a schema appears not to exist unless it contains tables, yet it
    also can't be created explicitly. Schemas are currently created implicitly
    when tables exist in them.

[AWS S3]: https://dlthub.com/docs/dlt-ecosystem/destinations/filesystem#aws-s3
[Azure Blob Storage]: https://dlthub.com/docs/dlt-ecosystem/destinations/filesystem#azure-blob-storage
[column hints]: https://dlthub.com/docs/general-usage/schema#column-hint-rules
[CRATEDB-14601]: https://github.com/crate/crate/issues/14601
[databases supported by SQLAlchemy]: https://docs.sqlalchemy.org/en/20/dialects/
[dbt]: https://dlthub.com/docs/hub/features/transformations/dbt-transformations
[dbt-cratedb2]: https://pypi.org/project/dbt-cratedb2/
[dlt]: https://dlthub.com/
[dlt state sync]: https://dlthub.com/docs/general-usage/state#syncing-state-with-destination
[Google Storage]: https://dlthub.com/docs/dlt-ecosystem/destinations/filesystem#google-storage
[SQL INSERT file format]: https://dlthub.com/docs/dlt-ecosystem/file-formats/insert-format

docs/integrate/dlt/usage.md

Lines changed: 97 additions & 0 deletions
@@ -0,0 +1,97 @@
---
title: CrateDB
description: CrateDB `dlt` destination
keywords: [ cratedb, destination, data warehouse ]
---

(dlt-usage)=
# Load API data with dlt

:::{div} sd-text-muted
Exercise a canonical `dlt init` example with CrateDB.
:::
## Install the package

Install the dlt destination adapter for CrateDB.

```shell
pip install dlt-cratedb
```

## Initialize the dlt project

Start by initializing a new example `dlt` project.

```shell
export DESTINATION__CRATEDB__DESTINATION_TYPE=postgres
dlt init chess cratedb
```

The `dlt init` command initializes your pipeline with `chess` [^chess-source]
as the source and `cratedb` as the destination. It generates several files and directories.
## Edit the pipeline definition

The pipeline definition is stored in the Python file `chess_pipeline.py`.

- Because the dlt adapter currently only supports writing to the default `doc` schema
  of CrateDB [^create-schema], please replace `dataset_name="chess_players_games_data"`
  with `dataset_name="doc"` in the generated `chess_pipeline.py` file.

- To initialize the CrateDB destination adapter, insert the `import dlt_cratedb`
  statement at the top of the file. Otherwise, the destination will not be found,
  and you will receive a corresponding error [^not-initialized-error].
## Configure credentials

Next, set up the CrateDB credentials in the `.dlt/secrets.toml` file as shown below.
CrateDB is compatible with PostgreSQL and uses the `psycopg2` driver, like the
`postgres` destination.

```toml
[destination.cratedb.credentials]
host = "localhost"    # CrateDB server host.
port = 5432           # CrateDB PostgreSQL TCP protocol port, default is 5432.
username = "crate"    # CrateDB username, default is usually "crate".
password = "crate"    # CrateDB password, if any.
database = "crate"    # CrateDB only knows a single database, called `crate`.
connect_timeout = 15
```
Alternatively, you can pass a database connection string as shown below.

```toml
destination.cratedb.credentials="postgres://crate:crate@localhost:5432/"
```

Keep this line at the top of your TOML file, before any section starts.
Because CrateDB uses `psycopg2`, `postgres://` is the correct URL scheme.
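To double-check that such a connection string carries the intended parts, you can take it apart with Python's standard library. This is only a sanity-check sketch; it is unrelated to how dlt itself parses credentials.

```python
from urllib.parse import urlsplit

# Split the connection string into its parts; they should match the
# individual credential fields shown in `.dlt/secrets.toml` above.
url = urlsplit("postgres://crate:crate@localhost:5432/")
print(url.scheme)    # → postgres
print(url.username)  # → crate
print(url.hostname)  # → localhost
print(url.port)      # → 5432
```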
## Start CrateDB

Use Docker or Podman to run an instance of CrateDB for evaluation purposes.

```shell
docker run --rm --name=cratedb \
  --publish=4200:4200 --publish=5432:5432 \
  crate:latest '-Cdiscovery.type=single-node'
```
## Run the pipeline

```shell
python chess_pipeline.py
```

## Explore the data

```shell
crash -c 'SELECT * FROM players_profiles LIMIT 10;'
crash -c 'SELECT * FROM players_online_status LIMIT 10;'
```
[^chess-source]: The `chess` dlt source pulls publicly available data from
    the [Chess.com Published-Data API].
[^create-schema]: CrateDB does not support `CREATE SCHEMA` yet, see [CRATEDB-14601].
    By default, a schema appears not to exist unless it contains tables, yet it
    also can't be created explicitly. Schemas are currently created implicitly
    when tables exist in them.
[^not-initialized-error]: `UnknownDestinationModule: Destination "cratedb" is not one of the standard dlt destinations`

[Chess.com Published-Data API]: https://www.chess.com/news/view/published-data-api
[CRATEDB-14601]: https://github.com/crate/crate/issues/14601
