- `export PYTHONPATH=$(pwd)` in repo root
- `poetry shell` to enter the poetry venv
- `export CONFIG_PATH=...` to set the path to the site config, or use the `--config-file` flag. Available configs are in `configs/`
- Run `luigid` in another terminal when running luigi tasks to start the scheduler at http://localhost:8082 (a combined setup sketch follows this list)
- TODO: set dbt config
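A minimal setup sketch pulling the steps above together; the config file name is a placeholder for one of the files in `configs/`:

```bash
# Run from the repo root.
export PYTHONPATH=$(pwd)                       # make repo modules importable for luigi tasks
export CONFIG_PATH=configs/<your-site-config>  # placeholder; or use the --config-file flag instead
poetry shell                                   # enter the project's poetry venv

# In a second terminal: start the luigi central scheduler (UI at http://localhost:8082)
luigid
```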
Pipeline execution is a semi-manual procedure.
CKAN data downloading stage
- `./cli.sh download-datasets-info` - [luigi] Downloads CKAN dataset info as JSON files
- `./cli.sh create-raw-ckan-db-tables && ./cli.sh insert-datasets-info` - [luigi] Creates the `ckan_api.db` DuckDB file from the downloaded JSONs and copies the newly created db file to `artifacts/read_only/{site_short_name}/ckan_api_{datetime}.db`
- Manually copy the read-only DuckDB file to `artifacts/dbt/input/ckan_api.db`
- Run `dbt run --select "models/ckan_clean"` in `dbt/open_data_usage_dbt` using that directory's poetry venv - creates a new clean `artifacts/dbt/output/clean_ckan_api.db`
- Copy that file to `artifacts/read_only/{site_short_name}/clean_ckan_api_{datetime}.db` (the whole stage is sketched below)
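A sketch of the full CKAN stage as a shell session; the `<site_short_name>` and `<datetime>` parts of the copied file names are placeholders, and running dbt via `poetry run` inside its own directory is an assumption about that project's setup.

```bash
# [luigi] download dataset info and build the raw ckan_api.db
./cli.sh download-datasets-info
./cli.sh create-raw-ckan-db-tables && ./cli.sh insert-datasets-info

# manual hand-off of the read-only db to the dbt project
cp artifacts/read_only/<site_short_name>/ckan_api_<datetime>.db artifacts/dbt/input/ckan_api.db

# run the cleaning models with the dbt project's own poetry venv
(cd dbt/open_data_usage_dbt && poetry run dbt run --select "models/ckan_clean")

# archive the cleaned output as a read-only artifact
cp artifacts/dbt/output/clean_ckan_api.db \
   artifacts/read_only/<site_short_name>/clean_ckan_api_<datetime>.db
```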
Code search stage
- `./cli.sh download-sourcegraph-results` - [luigi] Downloads SourceGraph search results as JSON files based on `artifacts/dbt/output/clean_ckan_api.db`
- `./cli.sh insert-sourcegraph-results` - [luigi] Creates the `sourcegraph_api.db` DuckDB file from the downloaded JSONs and copies the newly created db file to `artifacts/read_only/{site_short_name}/sourcegraph_api_{datetime}.db` (see the sketch below)
- TODO: docs
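A minimal sketch of the code search stage, assuming the CKAN stage above has already produced `artifacts/dbt/output/clean_ckan_api.db`:

```bash
# [luigi] query SourceGraph based on the cleaned CKAN data
./cli.sh download-sourcegraph-results

# [luigi] build sourcegraph_api.db from the downloaded JSONs;
# the task copies it to artifacts/read_only/<site_short_name>/sourcegraph_api_<datetime>.db
./cli.sh insert-sourcegraph-results
```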
- The artifacts dir should be backed up to the blob bucket (the repo uses Google Cloud Storage).
- It is advised to back up the artifacts dir after you have completed a chunk of work using `./sync_artifacts.sh`
- You can create a bucket using the `./create_gcp_bucket.sh` script. It is recommended that the bucket has a soft delete period of at least 2 weeks. The `GCP_PROJECT` and `GCP_BUCKET` env vars have to be set (either via the shell or a `.env` file); a sketch follows below
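A sketch of the backup flow; the project and bucket names are placeholders:

```bash
# set once per shell, or put both variables into a .env file
export GCP_PROJECT=<your-gcp-project>
export GCP_BUCKET=<your-artifacts-bucket>

./create_gcp_bucket.sh   # one-time bucket creation (soft delete period of >= 2 weeks recommended)
./sync_artifacts.sh      # back up the artifacts dir after finishing a chunk of work
```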
- Luigi tasks which download data from the Internet sometimes need to be run multiple times because they fail due to network conditions
- There can be pickle read errors. It is best to delete the JSONs that the pickle file is built from and run the affected tasks individually (see the sketch below)
- dbt files are stored as a separate project due to a dependency conflict
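A hedged recovery sketch for a pickle read error; the JSON location and the task to re-run are assumptions and depend on which stage failed:

```bash
# remove the JSONs the broken pickle was built from (path is a placeholder)
rm artifacts/<dir-with-broken-jsons>/*.json

# re-run only the task that produces those JSONs, e.g. the dataset info download
./cli.sh download-datasets-info
```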