# PREM-on-FHIR
> Note: run the commands in this README from your host machine.

End-to-end pipeline to generate synthetic PREM data, load it into a FHIR server, land it in an analytics DB, transform it with dbt, and enrich free text with an NLP job. Dockerized where it matters, for a smooth dev experience.

TODO:
- add a general architecture diagram (and a C4 diagram if possible)
- make sure all scripts that use env vars read the same env files
- validate the docker run commands as well as the shell script parameters for the Synthea steps
- make sure the seed steps use the virtual env
- build a Makefile so that make seed-all wraps all the steps above
- add the Airbyte config
- add a Docker Compose image for dbt
- validate the usage of the .hf_cache model cache (the Docker image should be standalone)
- validate whether the NLP pipeline should use docker or docker compose
- create Makefiles
- refactor all commands in the README, preferring a cd .. before executing each command to avoid long paths
- create a post-init script that creates the Airbyte views once oltp-db and HAPI FHIR are up and the tables are available
- make the Synthea script leverage the parameters.json created after generation
## Quickstart

```shell
# 1) Start infra (analytics DB, HAPI FHIR, pgAdmin, nginx to serve NDJSON)
docker compose --profile all up -d

# 2) Seed synthetic data (Synthea + Questionnaire + QuestionnaireResponses)
cd ..
make seed-all   # if you use the provided Makefile
# ...or run the individual scripts shown below

# 3) Build stg/core models, run NLP, then build marts
make elt        # dbt stg+core -> NLP -> dbt marts
```
## Repository layout

```
├─ 00-setup/                            # infra only
│  ├─ analytics-db/                     # Postgres for analytics
│  ├─ hapi-fhir/                        # FHIR server (Java-based)
│  ├─ oltp-db/                          # OLTP Postgres
│  ├─ pgadmin/                          # pgAdmin client to monitor the analytics and OLTP DBs
│  ├─ synthea-files/                    # Synthea-generated files served by nginx
│  │  └─ nginx.conf                     # web server config (autoindex + NDJSON types)
│  └─ docker-compose.yml                # docker compose for the whole infrastructure
├─ 01-data-generation/
│  ├─ questionnaires/
│  │  ├─ input/                         # manually built FHIR resources: Questionnaire, ValueSet, CodeSystem
│  │  ├─ output/                        # FHIR bundles to upload
│  │  ├─ questionnaire_bundle_maker.py  # turns the manual FHIR resources into bundles
│  │  └─ upload_questionnaire.sh        # uploads the bundle to the HAPI FHIR server via API calls
│  ├─ questionnaire_responses/
│  │  ├─ input/                         # output folder of export_qr_header.py; holds QuestionnaireHeader.csv with all foreign keys
│  │  ├─ output/                        # QR bundles generated by qr_bundle_maker.py
│  │  ├─ export_qr_header.py            # retrieves identifiers (patients, practitioners, encounters, ...) that are mandatory in QuestionnaireResponse resources so they have a real context
│  │  ├─ qr_bundle_maker.py             # generates fake QRs from QuestionnaireHeader.csv
│  │  └─ upload_qr.sh                   # uploads the QR bundles to the HAPI FHIR server via API calls
│  └─ synthea/
│     ├─ Dockerfile                     # builds a Synthea runner
│     ├─ output/                        # generated NDJSON
│     └─ upload_synthea.sh              # bulk import via $import + polling
├─ 02-elt/
│  ├─ extract_load_config/              # Airbyte exports/screenshots
│  ├─ fhir_prem_to_decision/            # dbt project (stg/core/mart)
│  └─ nlp_pipeline/                     # NLP job (Dockerized, runs one-shot)
└─ 04-dashboard/                        # (placeholder) dashboard/app
```

## Prerequisites

- Docker ≥ 24 with Compose v2
- make
- Python 3.11 if you run the local scripts manually
- a local Airbyte instance:

```shell
abctl local install --disable-auth
```
## Start the infrastructure

```shell
cp .env.example .env
# edit values as needed
docker compose --profile all up -d
docker compose ps
```

This starts:
- analytics-db (Postgres)
- oltp-db (Postgres)
- hapi-fhir-server (FHIR server)
- pgadmin
- synthea-files (an nginx web server serving NDJSON from 00-setup/synthea-files for convenient bulk import)

Then execute ./00-setup/oltp-db/post-init/20-airbyte-views.sql directly against the OLTP DB to prepare the views for the Airbyte EL step.
## Makefile targets

```shell
make up           # infra up
make seed-all     # Synthea + Questionnaire + QR uploads
make dbt-stg-core # build stg/core
make nlp          # run NLP one-shot
make dbt-mart     # build marts
make elt          # stg/core -> NLP -> mart
make logs         # follow docker compose logs
make down         # stop containers
make nuke         # stop + remove volumes (⚠ wipes data)
```

| Target | Description |
|---|---|
| `make init` | Create the Python venv (if missing) and install dependencies. |
| `make venv` | Create the .venv virtualenv (Linux/Mac or Windows). |
| `make deps` | Install CLI requirements into the venv. |
| Target | Description |
|---|---|
| `make fhir-wait` | Poll until HAPI at $FHIR_BASE is ready (default: http://localhost:8080/fhir). |
| `make fhir-import` | Submit a bulk $import job using a single Parameters JSON (IMPORT_PARAMS). |
| `make fhir-import-many` | Submit multiple $import jobs in sequence (IMPORT_FILES). |

Config knobs:
- FHIR_BASE=http://host:port/fhir
- IMPORT_PARAMS=…
- IMPORT_FILES=…
- IMPORT_POLL (seconds between polls, default 30)
- IMPORT_TIMEOUT_MIN (total minutes, default 60)
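As an illustration of what the fhir-wait polling loop does, here is a minimal sketch in Python; the `/metadata` probe and the function names are assumptions for illustration, not the project's actual script:

```python
import time
import urllib.error
import urllib.request

FHIR_BASE = "http://localhost:8080/fhir"  # same default as the FHIR_BASE knob above

def is_ready(base: str) -> bool:
    """Single readiness probe: does the CapabilityStatement endpoint answer 200?"""
    try:
        with urllib.request.urlopen(f"{base}/metadata", timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def wait_for_fhir(base: str, poll_s: int = 30, timeout_min: int = 60,
                  probe=is_ready, sleep=time.sleep) -> bool:
    """Poll until the server answers or the timeout elapses
    (mirrors the IMPORT_POLL / IMPORT_TIMEOUT_MIN knobs)."""
    waited = 0
    while waited <= timeout_min * 60:
        if probe(base):
            return True
        sleep(poll_s)
        waited += poll_s
    return False
```

Injecting `probe` and `sleep` keeps the loop testable without a live server.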
| Target | Description |
|---|---|
| `make synthea-build` | Build the Synthea Docker image (SYN_TAG, default syntheadocker). |
| `make synthea-run` | Run Synthea inside Docker and export bulk FHIR data to SYN_OUT. |

Config knobs (can be set via env or .env):
- POPULATION (default 5)
- AGE_RANGE (default 18-100)
- KEEP_FILE (default keep_neuro.json)
- IMPORT_TIMEOUT_MIN (total minutes, default 60)
- EXTRA_ARGS (default enables bulk data export etc.)

Override knobs per run (or put them in .env):

```shell
POPULATION=25 AGE_RANGE=18-90 KEEP_FILE=keep_neuro.json EXTRA_ARGS="--exporter.fhir.bulk_data=true ..." make synthea-run
```

Then edit 01-data-generation/synthea/import-pass1.json for the following resource types:
- Location
- Organization
- Practitioner
- PractitionerRole

All those files are generated with a unique identifier in their file names, so you need to update the ids in import-pass1.json accordingly.

make fhir-import-many first imports all the files listed in import pass 1 (these resources must be inserted first, otherwise they cause issues with the resources in import pass 2).

You may get errors such as missing files (a file was not generated, or an id was replaced incorrectly in the previous step). Correct the issue and re-execute the script.
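Patching those ids by hand is error-prone. Here is a sketch of how it could be automated, assuming the bulk $import Parameters layout (an "input" parameter carrying "type"/"url" parts) and Synthea output files named `<ResourceType>.<run-id>.ndjson`; `patch_import_params` is a hypothetical helper, not part of the repo:

```python
def patch_import_params(params: dict, available: list) -> dict:
    """Point each $import 'input' entry at the file actually present on disk.

    Assumes a Parameters layout where parameter name 'input' has
    parts 'type' (valueCode) and 'url' (valueUrl), and file names shaped
    like '<ResourceType>.<run-id>.ndjson' (both are assumptions).
    """
    by_type = {}
    for name in available:  # e.g. os.listdir() of the Synthea output dir
        by_type.setdefault(name.split(".", 1)[0], name)
    for entry in params.get("parameter", []):
        if entry.get("name") != "input":
            continue
        parts = {p.get("name"): p for p in entry.get("part", [])}
        rtype = parts.get("type", {}).get("valueCode")
        if rtype in by_type and "url" in parts:
            # keep the base URL, swap only the file name
            base = parts["url"]["valueUrl"].rsplit("/", 1)[0]
            parts["url"]["valueUrl"] = f"{base}/{by_type[rtype]}"
    return params
```

This would be the kind of automation the "leverage parameters.json" TODO at the top points at.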
| Target | Description |
|---|---|
| `make bundle-questionnaires` | Build a transaction Bundle (Q_BUNDLE) from the JSON files in Q_IN_DIR. |
| `make post-questionnaires` | POST the Questionnaire bundle to $FHIR_BASE. |

Config knobs:
- Q_IN_DIR=…
- Q_BUNDLE=…
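For reference, wrapping resources into a FHIR transaction Bundle can be sketched like this (an illustrative stand-in, not the actual questionnaire_bundle_maker.py; the example id is made up):

```python
def make_transaction_bundle(resources: list) -> dict:
    """Wrap FHIR resources into one transaction Bundle.

    Each entry uses PUT ResourceType/id so re-posting the same bundle
    is idempotent, which is convenient when re-seeding.
    """
    entries = []
    for res in resources:
        rtype, rid = res["resourceType"], res["id"]
        entries.append({
            "resource": res,
            "request": {"method": "PUT", "url": f"{rtype}/{rid}"},
        })
    return {"resourceType": "Bundle", "type": "transaction", "entry": entries}
```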
| Target | Description |
|---|---|
| `make qr-export-headers` | Run SQL against the HAPI DB and export the header CSV (HDR_CSV). |
| `make qr-make-bundles` | Generate QuestionnaireResponse bundles (QR_OUT) from the header CSV. Supports NREQ/PPNQ modes. |
| `make post-qr-bundles` | POST the generated bundles to $FHIR_BASE. |

Config knobs (env or .env):
- QR_MODE = nreq | ppnq (default nreq)
- QR_CHUNK (default 250)
- QR_SEED (default 42)
- QR_LIKERT_DIST (NREQ weighting, e.g. 0.2,0.5,0.3)
PPNQ text generation:
- QR_DRY_RUN=1 (placeholders, default 1)
- QR_USE_LLM=1 (enable LLM; requires OPENAI_API_KEY)
- LLM_MODEL=gpt-4o-mini
- LLM_TEMPERATURE=0.6
- LLM_MAX_RETRIES=3

Advanced (PPNQ):
- NPS_DIST (e.g. 0:0.02,1:0.03,...,10:0.15): custom NPS score distribution (0–10).
- KEYWORD_RATE (default 0.35): probability per item to inject 1–2 optional keywords (from a small domain list or themes.yml) into the prompt as gentle guidance.
- STYLE_VARIANCE (default 0.7): scales the randomness used to pick a subtle style hint (e.g., "grateful", "concerned") consistent with the NPS bucket. Higher means more style variability.
- QR_VERBOSE=1 (extra logs)
Output files: $(QR_OUT)/{mode}_batch_bundle_###.json

make qr-export-headers runs the SQL against your HAPI DB and writes QuestionnaireResponse-Header.csv to $(HDR_CSV). (DB envs respected: DB_HOST/PORT/NAME/USER/PASS or OLTP_DB_*.)

make qr-make-bundles generates QR batch bundles from the header CSV into $(QR_OUT).
Controls (set as env before the command):
- QR_MODE=nreq|ppnq (default nreq)
- QR_SEED=42
- QR_CHUNK=250

NREQ weighting:
- QR_LIKERT_DIST=0.2,0.5,0.3

PPNQ text:
- Dry run: QR_DRY_RUN=1 (default)
- LLM mode: QR_USE_LLM=1 (needs OPENAI_API_KEY; optional LLM_MODEL, LLM_TEMPERATURE, LLM_MAX_RETRIES)
- Advanced: NPS_DIST, KEYWORD_RATE, STYLE_VARIANCE
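To make the distribution knobs concrete, here is a sketch of how QR_LIKERT_DIST and NPS_DIST style strings could be parsed and sampled; the function names are hypothetical and the real qr_bundle_maker.py may differ:

```python
import random

def parse_likert_dist(spec: str) -> list:
    """'0.2,0.5,0.3' -> weights for Likert answers 1..n, normalised to sum to 1."""
    weights = [float(w) for w in spec.split(",")]
    total = sum(weights)
    return [w / total for w in weights]

def parse_nps_dist(spec: str) -> dict:
    """'0:0.02,10:0.15' -> {score: probability} for NPS scores 0-10."""
    out = {}
    for pair in spec.split(","):
        score, weight = pair.split(":")
        out[int(score)] = float(weight)
    return out

def sample_likert(spec: str, rng: random.Random) -> int:
    """Draw one Likert answer (1-based) using the weighted distribution."""
    weights = parse_likert_dist(spec)
    return rng.choices(range(1, len(weights) + 1), weights=weights, k=1)[0]
```

Seeding the `random.Random` with QR_SEED is what makes runs reproducible.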
Examples:

```shell
# NREQ with weighted Likert distribution and fixed seed
QR_MODE=nreq QR_LIKERT_DIST=0.2,0.5,0.3 QR_SEED=7 make -e qr-make-bundles

# PPNQ dry-run placeholders
QR_MODE=ppnq QR_DRY_RUN=1 make -e qr-make-bundles

# PPNQ via LLM (verbose)
QR_MODE=ppnq QR_USE_LLM=1 LLM_MODEL=gpt-4o-mini QR_VERBOSE=1 make -e qr-make-bundles
```
To delete every QuestionnaireResponse:

```shell
curl -X DELETE "http://localhost:8080/fhir/QuestionnaireResponse?_lastUpdated=gt1900-01-01T00:00:00Z&_expunge=true" \
  -H "Accept: application/fhir+json"
```

| Target | Description |
|---|---|
| `make seed-all` | Run everything: setup, build/run Synthea, wait for FHIR, bulk import, bundle/post questionnaires, export headers, generate & post QR bundles. |
| Target | Description |
|---|---|
| `make clean` | Remove generated outputs (SYN_OUT, QR_OUT, HDR_OUT, curl logs). |
## Airbyte

It is important to use --disable-auth on the install to avoid having to configure API tokens etc. for this PoC within the Kubernetes Airbyte cluster.

Useful commands:

```shell
abctl local install --disable-auth
abctl local credentials
abctl local status
```

This step has to be run only if the Airbyte config is finalized and the first run has been done.
## dbt

```shell
# Install deps
docker compose run --rm dbt-run dbt deps

# Build stg + core
docker compose run --rm dbt-run dbt build -s stg+ core+

# Only compile
docker compose run --rm dbt-run dbt compile --select stg.*

# Only tests
docker compose run --rm dbt-run dbt test

# Full pipeline refresh
docker compose run --rm dbt-run dbt build --full-refresh --exclude 'tag:nlp'

# Full model refresh
docker compose run --rm dbt-run dbt run --full-refresh --exclude 'tag:nlp'

# Only changed models (after one prior run produced a manifest)
docker compose run --rm dbt-run dbt build --select state:modified+ --state target/

# Clean and reinstall packages
dbt clean && dbt deps
```
Handy dbt invocations (inside the container):
- Models only: dbt run
- Just tests: dbt test
- Full refresh (rebuild tables): dbt build --full-refresh
- Run a folder (e.g., core only): dbt run -s models/core/ (or) dbt run -s core
- Include parents/children: dbt build -s +core (with upstream) or dbt build -s core+ (with downstream)
- Faster local runs: dbt build --threads 6
- Only changed models (iterating): run once normally to create a state manifest, then dbt build --select state:modified+ --state target/
The NLP container pulls free-text from stg.nlp_prem_text, scores sentiment & themes, and upserts into stg.nlp_predictions_inbox.
The dbt marts assemble:
- mart.mart_prem_text_sentiment
- mart.mart_prem_theme_summary
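The job only scores rows that are pending: non-empty free text not yet scored for the current model_family/model_version. A sketch of that filter (illustrative, not the job's actual code; column names mirror the tables above):

```python
def pending_rows(texts, already_scored, model_family, model_version):
    """Filter stg.nlp_prem_text rows down to those not yet scored.

    `already_scored` mirrors the stg.nlp_predictions_inbox keys:
    (qr_id, item_linkid, model_family, model_version). Rows already
    present for the current model version are intentionally skipped.
    """
    done = set(already_scored)
    return [
        row for row in texts
        if row.get("text_raw")  # empty free text is never scored
        and (row["qr_id"], row["item_linkid"], model_family, model_version) not in done
    ]
```

Bumping the model version therefore re-scores everything, which is the intended upgrade path.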
Build the image:

```shell
DOCKER_BUILDKIT=1 docker build -t pof-prem-nlp:latest .
```

The first run downloads the models into a volume (not into the image):

```shell
docker run --rm \
  --env-file .env \
  -e PG_HOST=host.docker.internal \
  -v pof-prem_hfcache:/app/.hf_cache \
  pof-prem-nlp:latest \
  score --since 10y --limit 100 --verbose
```

Override the defaults if you want:

```shell
docker run --rm --env-file .env -e PG_HOST=host.docker.internal pof-prem-nlp:latest score --since 10y --limit 10000 --verbose
docker run --rm --env-file .env -e PG_HOST=host.docker.internal pof-prem-nlp:latest score --since 30d --limit 100
```
Run it locally instead:

```shell
python -m pipeline.cli score --since 10y --limit 200 --verbose
```

Or do a dry run to avoid writing:

```shell
python -m pipeline.cli score --since 10y --limit 50 --verbose --dry-run
```
Then build the marts:

```shell
docker compose run --rm dbt-run dbt build -s mart+
```

Troubleshooting:

- NLP says "No pending rows." Check that stg.nlp_prem_text has rows (non-empty text_raw) for the --since window. Also check stg.nlp_predictions_inbox: if (qr_id, item_linkid) were already scored for the current model_family/model_version, they are intentionally skipped.
- Bulk import (Synthea) fails. The uploader script exits non-zero on $import errors and prints the HAPI $bulkdata-status response. Inspect the log lines; fix any bad file path or content type.
- dbt constraints not created. The on-run-end hook runs only when the core models build. Verify you used build -s stg+ core+ (or a full build).
- Docker can't reach your host DB. Use PG_HOST=host.docker.internal (macOS/Windows). On Linux, expose the DB in compose and use the container DNS name instead.
- Use the .env.example → .env workflow; never commit real secrets.
- The NLP job flags potential PII (regex-based). If you display sample verbatims, mask or filter rows with pii_flag=true (add a column if you want to persist it).
- Keep your Hugging Face cache in a named volume so images don’t bloat.
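As an example of what a regex-based PII flag can look like (the patterns below are illustrative only; the job's real rules may differ):

```python
import re

# Illustrative patterns only, not the NLP job's actual rule set.
PII_PATTERNS = [
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),  # email addresses
    re.compile(r"\b(?:\+?\d[\d\s.-]{7,}\d)\b"),  # phone-like digit runs
]

def pii_flag(text: str) -> bool:
    """True when any pattern matches; callers should mask or drop flagged verbatims."""
    return any(p.search(text) for p in PII_PATTERNS)
```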
Source DB snapshot:

```shell
MSYS_NO_PATHCONV=1 docker exec pof-analytics-db pg_dump -U analytics_admin -d analytics -Fc -C -Z 9 -f /tmp/analytics.dump
docker cp pof-analytics-db:/tmp/analytics.dump ./analytics.dump

MSYS_NO_PATHCONV=1 docker exec pof-analytics-db pg_dump -U analytics_admin -d metabase -Fc -C -Z 9 -f /tmp/metabase.dump
docker cp pof-analytics-db:/tmp/metabase.dump ./metabase.dump

MSYS_NO_PATHCONV=1 docker exec pof-analytics-db pg_dumpall -U analytics_admin --globals-only -f /tmp/globals.sql
docker cp pof-analytics-db:/tmp/globals.sql ./globals.sql
```

The three files are generated in your current directory; use them in the next steps.
Target DB:

```shell
cat ./00-setup/analytics-db/snapshots/20250920/globals.sql | docker exec -i pof-analytics-db psql -U analytics_admin -d postgres -v ON_ERROR_STOP=1
docker exec pof-analytics-db dropdb -U analytics_admin --if-exists analytics
docker exec pof-analytics-db dropdb -U analytics_admin --if-exists metabase
cat ./00-setup/analytics-db/snapshots/20250920/analytics.dump | docker exec -i pof-analytics-db pg_restore -U analytics_admin -C -d postgres
cat ./00-setup/analytics-db/snapshots/20250920/metabase.dump | docker exec -i pof-analytics-db pg_restore -U analytics_admin -C -d postgres
```

Re-run the Metabase app-user init script inside the container:

```shell
MSYS_NO_PATHCONV=1 docker exec \
  -e POSTGRES_USER=analytics_admin \
  -e POSTGRES_DB=postgres \
  -e PGPASSWORD=analytics_admin \
  -e MB_DB_NAME=metabase \
  -e MB_DB_USER=metabase_app \
  -e MB_DB_PASS=change_me_strong \
  -it pof-analytics-db bash -lc "
    sed -i 's/\r$//' /docker-entrypoint-initdb.d/10_metabase_app.sh &&
    chmod +x /docker-entrypoint-initdb.d/10_metabase_app.sh &&
    /docker-entrypoint-initdb.d/10_metabase_app.sh
  "
```

Reset the Metabase admin password (executed in PowerShell):

```shell
docker exec -it pof-metabase java -jar /app/metabase.jar reset-password admin@example.com
```