VictorHueni/PREM-on-FHIR
PREM-on-FHIR

Introduction

End-to-end pipeline to generate synthetic PREM data, load it into a FHIR server, land it in an analytics DB, transform with dbt, and enrich free-text with an NLP job. Dockerized where it matters for a smooth dev experience.

To-Dos

  • Add a general architecture diagram (+ C4 diagram if possible)
  • Make sure all scripts using env vars use the same env files
  • Validate the docker run command as well as the shell script params for the Synthea steps
  • Make sure a virtual env is used for the seed steps
  • Build a Makefile so that make seed-all wraps all the steps above
  • Add Airbyte config
  • Add a Docker Compose service for dbt
  • Validate the usage of the .hf_cache model volume (the Docker image should be standalone)
  • Validate whether docker or docker compose is used for the NLP pipeline
  • Create Makefiles
  • Refactor all commands in the README, preferring cd .. before executing a command to avoid long paths
  • Add a post-init script that creates the Airbyte views once oltp-db and HAPI FHIR are up and the tables are available
  • Make the Synthea script leverage the parameters.json created after generation

Quickstart

# 1) Start infra (analytics DB, HAPI FHIR, pgAdmin, nginx to serve NDJSON)
docker compose --profile all up -d

# 2) Seed synthetic data (Synthea + Questionnaire + QuestionnaireResponses)
cd ..
make seed-all        # if you use the provided Makefile
# ...or run the individual scripts shown below

# 3) Build stg/core models, run NLP, then build marts
make elt             # dbt stg+core -> NLP -> dbt marts

Repo structure

/
├─ 00-setup/                       # infra only
│  ├─ analytics-db/                # Postgres for analytics
│  ├─ hapi-fhir/                   # FHIR server (java-based)
│  ├─ oltp-db/                     # OLTP Postgres
│  ├─ pgadmin/                     # pgAdmin client to monitor analytics and oltp db
│  ├─ synthea-files/               # synthea generated files served by nginx
│  │  └─ nginx.conf                # web server config (autoindex + NDJSON types)
│  └─ docker-compose.yml           # docker compose to setup the whole infrastructure
├─ 01-data-generation/
│  ├─ questionnaires/
│  │  ├─ input/                         # Manually built FHIR resources (Questionnaire, ValueSet, CodeSystem)
│  │  ├─ output/                        # FHIR bundles to upload
│  │  ├─ questionnaire_bundle_maker.py  # Python script that turns the manual FHIR resources into bundles
│  │  └─ upload_questionnaire.sh        # Shell script that uploads the bundle to the HAPI FHIR server via API calls
│  ├─ questionnaire_responses/
│  │  ├─ input/                    # Output folder of export_qr_header.py; contains QuestionnaireHeader.csv with all foreign keys
│  │  ├─ output/                   # Where QR bundles are generated by qr_bundle_maker.py
│  │  ├─ export_qr_header.py       # Retrieves identifiers (patients, practitioners, encounters, ...) that are mandatory in QuestionnaireResponse resources, so those resources get a real context
│  │  ├─ qr_bundle_maker.py        # Python script that generates fake QRs using QuestionnaireHeader.csv
│  │  └─ upload_qr.sh              # Shell script that uploads the QR bundles to the HAPI FHIR server via API calls
│  └─ synthea/
│     ├─ Dockerfile                # builds a Synthea runner
│     ├─ output/                   # generated NDJSON
│     └─ upload_synthea.sh         # bulk import via $import + polling
├─ 02-elt/
│  ├─ extract_load_config/         # Airbyte exports/screenshots
│  ├─ fhir_prem_to_decision/       # dbt project (stg/core/mart)
│  └─ nlp_pipeline/                # NLP job (Dockerized, runs one-shot)
└─ 04-dashboard/                   # (placeholder) dashboard/app
Prerequisites

  • Docker ≥ 24 with Compose v2
  • make
  • Python 3.11 if you run the local scripts manually
  • A local Airbyte instance: abctl local install --disable-auth

Copy the example environment file and adjust values:

cp .env.example .env
# edit values as needed

Step-by-Step Guide (except Airbyte)

1. Start the infrastructure

docker compose --profile all up -d
docker compose ps

This starts:

  • analytics-db (analytics Postgres)
  • oltp-db (OLTP Postgres)
  • hapi-fhir-server (FHIR server)
  • pgadmin
  • synthea-files (an nginx web server that serves NDJSON from 00-setup/synthea-files for convenient bulk import)

Then execute ./00-setup/oltp-db/post-init/20-airbyte-views.sql directly against the OLTP database (e.g. with psql or pgAdmin) to prepare the views for the Airbyte ETL.

End to End (Make Target Cheat Sheet)

make up            # infra up
make seed-all      # Synthea + Questionnaire + QR uploads
make dbt-stg-core  # build stg/core
make nlp           # run NLP one-shot
make dbt-mart      # build marts
make elt           # stg/core -> NLP -> mart
make logs          # follow docker compose logs
make down          # stop containers
make nuke          # stop + remove volumes (⚠ wipes data)

2. Generate & upload sample data

2.1 Setup

Target Description
make init Create Python venv (if missing) and install dependencies.
make venv Create .venv virtualenv (Linux/Mac or Windows).
make deps Install CLI requirements into the venv.

2.2 FHIR Server

Target Description
make fhir-wait Poll until HAPI at $FHIR_BASE is ready (default: http://localhost:8080/fhir).
make fhir-import Submit a bulk $import job using a single Parameters JSON (IMPORT_PARAMS).
make fhir-import-many Submit multiple $import jobs in sequence (IMPORT_FILES).

Config knobs FHIR_BASE=http://host:port/fhir IMPORT_PARAMS=… IMPORT_FILES=… IMPORT_POLL (seconds between polls, default 30) IMPORT_TIMEOUT_MIN (total minutes, default 60)
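An IMPORT_PARAMS file is a FHIR Parameters resource describing the bulk $import job. A minimal sketch, assuming HAPI's bulk import parameter names; the URLs are illustrative (the nginx container serves the NDJSON):

```json
{
  "resourceType": "Parameters",
  "parameter": [
    { "name": "inputFormat", "valueCode": "application/fhir+ndjson" },
    { "name": "inputSource", "valueUri": "http://synthea-files" },
    { "name": "storageDetail", "part": [ { "name": "type", "valueCode": "https" } ] },
    { "name": "input", "part": [
      { "name": "type", "valueCode": "Patient" },
      { "name": "url", "valueUri": "http://synthea-files/Patient.ndjson" }
    ] }
  ]
}
```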

2.3 Synthea

Generate NDJSON
Target Description
make synthea-build Build Synthea Docker image (SYN_TAG, default syntheadocker).
make synthea-run Run Synthea inside Docker and export bulk FHIR data to SYN_OUT.

Config knobs (can be set via env or .env): POPULATION (default 5) AGE_RANGE (default 18-100) KEEP_FILE (default keep_neuro.json) IMPORT_TIMEOUT_MIN (total minutes, default 60) EXTRA_ARGS (default enables bulk data export etc.)

Override knobs per run (or put in .env):

POPULATION=25 AGE_RANGE=18-90 KEEP_FILE=keep_neuro.json EXTRA_ARGS="--exporter.fhir.bulk_data=true ..." make synthea-run
Modify the parameter files

Edit 01-data-generation/synthea/import-pass1.json and update the entries for:

  • Location
  • Organization
  • Practitioner
  • PractitionerRole

These files are generated with a unique identifier in their file names; update the paths accordingly.

Import resources

make fhir-import-many: imports all files from import pass 1 first (these resources must exist before the resources in import pass 2 can reference them).

You may get errors such as missing files (not generated, or a wrong ID replacement in the previous step). Correct the issue and re-execute the script.

2.4 Questionnaire

Target Description
make bundle-questionnaires Build a transaction Bundle (Q_BUNDLE) from JSON files in Q_IN_DIR.
make post-questionnaires POST the Questionnaire bundle to $FHIR_BASE.

Config knobs: Q_IN_DIR=… Q_BUNDLE=…
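bundle-questionnaires essentially wraps each resource JSON from Q_IN_DIR in a transaction entry. A minimal sketch of that idea; the real logic lives in questionnaire_bundle_maker.py, and the function names and PUT/POST choice here are illustrative assumptions:

```python
import json
from pathlib import Path

def make_transaction_bundle(resources):
    """Wrap FHIR resources in a transaction Bundle."""
    entries = []
    for res in resources:
        rtype = res["resourceType"]
        rid = res.get("id")
        entries.append({
            "resource": res,
            # PUT keeps a stable id when the source resource has one, otherwise POST
            "request": {
                "method": "PUT" if rid else "POST",
                "url": f"{rtype}/{rid}" if rid else rtype,
            },
        })
    return {"resourceType": "Bundle", "type": "transaction", "entry": entries}

def bundle_from_dir(in_dir):
    """Load every *.json resource in in_dir and build one transaction Bundle."""
    resources = [json.loads(p.read_text()) for p in sorted(Path(in_dir).glob("*.json"))]
    return make_transaction_bundle(resources)
```

The resulting bundle can then be POSTed to $FHIR_BASE as one atomic transaction.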

2.5 QuestionnaireResponses (scaffolded from header CSV)

Target Description
make qr-export-headers Run SQL against HAPI DB and export header CSV (HDR_CSV).
make qr-make-bundles Generate QuestionnaireResponse bundles (QR_OUT) from the header CSV. Supports NREQ/PPNQ modes.
make post-qr-bundles POST generated bundles to $FHIR_BASE.

Config knobs (env or .env) QR_MODE = nreq | ppnq (default nreq)
QR_CHUNK (default 250)
QR_SEED (default 42)
QR_LIKERT_DIST (NREQ weighting, e.g. 0.2,0.5,0.3)

PPNQ text generation: QR_DRY_RUN=1 (placeholders, default 1)
QR_USE_LLM=1 (enable LLM; requires OPENAI_API_KEY)
LLM_MODEL=gpt-4o-mini
LLM_TEMPERATURE=0.6
LLM_MAX_RETRIES=3
Advanced (PPNQ):
  • NPS_DIST (e.g. 0:0.02,1:0.03,...,10:0.15) - custom NPS score distribution (0–10).
  • KEYWORD_RATE (default 0.35) - probability per item to inject up to 1–2 optional keywords (from a small domain list or themes.yml) into the prompt as gentle guidance.
  • STYLE_VARIANCE (default 0.7) - scales the randomness used to pick a subtle style hint (e.g., "grateful", "concerned") consistent with the NPS bucket. Higher → more style variability.
  • QR_VERBOSE=1 (extra logs)

Output files: $(QR_OUT)/{mode}_batch_bundle_###.json

2.5.1 Questionnaire Header

make qr-export-headers : runs the SQL against your HAPI DB and writes QuestionnaireResponse-Header.csv to $(HDR_CSV).
(DB envs respected: DB_HOST/PORT/NAME/USER/PASS or OLTP_DB_*.)

2.5.2 Questionnaire Response bundle maker

make qr-make-bundles : generate QR batch bundles from the header CSV → $(QR_OUT).

Controls (set as env before the command):

  • QR_MODE=nreq|ppnq (default nreq)
  • QR_SEED=42
  • QR_CHUNK=250

NREQ weighting:

  • QR_LIKERT_DIST=0.2,0.5,0.3
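QR_LIKERT_DIST weights the NREQ Likert answers. A small sketch of how such a weighted, seeded draw can work, assuming the comma-separated weights map, in order, to the answer options (the actual script's parsing may differ):

```python
import random

def parse_dist(spec):
    """Parse '0.2,0.5,0.3' into a list of float weights."""
    return [float(w) for w in spec.split(",")]

def sample_likert(options, dist_spec, n, seed=42):
    """Draw n Likert answers from `options` using the weights in dist_spec."""
    rng = random.Random(seed)  # fixed seed -> reproducible bundles (cf. QR_SEED)
    weights = parse_dist(dist_spec)
    if len(weights) != len(options):
        raise ValueError("one weight per answer option expected")
    return rng.choices(options, weights=weights, k=n)
```

For example, `sample_likert([1, 2, 3], "0.2,0.5,0.3", 1000)` skews answers toward the middle option.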

PPNQ text:

  • Dry run: QR_DRY_RUN=1 (default)
  • LLM mode: QR_USE_LLM=1 (needs OPENAI_API_KEY; optional LLM_MODEL, LLM_TEMPERATURE, LLM_MAX_RETRIES)
  • Advanced: NPS_DIST, KEYWORD_RATE, STYLE_VARIANCE

Examples:

# NREQ with weighted Likert distribution and fixed seed
QR_MODE=nreq QR_LIKERT_DIST=0.2,0.5,0.3 QR_SEED=7 make -e qr-make-bundles

# PPNQ dry-run placeholders
QR_MODE=ppnq QR_DRY_RUN=1 make -e qr-make-bundles

# PPNQ via LLM (verbose)
QR_MODE=ppnq QR_USE_LLM=1 LLM_MODEL=gpt-4o-mini QR_VERBOSE=1 make -e qr-make-bundles

Delete every QuestionnaireResponse (optional reset):

curl -X DELETE "http://localhost:8080/fhir/QuestionnaireResponse?_lastUpdated=gt1900-01-01T00:00:00Z&_expunge=true" \
  -H "Accept: application/fhir+json"

2.6 End to End

Target Description
make seed-all Run everything: setup, build/run Synthea, wait for FHIR, bulk import, bundle/post questionnaires, export headers, generate & post QR bundles.

2.7 Clean up

Target Description
make clean Remove generated outputs (SYN_OUT, QR_OUT, HDR_OUT, curl logs).

3. ELT

3.1 Airbyte (Extract/Load)

It is important to use --disable-auth on the install to avoid having to configure API tokens etc. for this PoC within the local Kubernetes Airbyte cluster.

Useful commands

abctl local install --disable-auth
abctl local credentials
abctl local status

3.2 DBT (Transform)

Run this step only once the Airbyte config is finalized and the first sync has completed.

3.2.1 Containerized DBT

# Install deps
docker compose run --rm dbt-run dbt deps


# Build stg + core
docker compose run --rm dbt-run dbt build -s stg+ core+
# Only compile
docker compose run --rm dbt-run dbt compile --select stg.*

# Only tests
docker compose run --rm dbt-run dbt test

# Full pipeline refresh
docker compose run --rm dbt-run dbt build --full-refresh --exclude 'tag:nlp'

# Full model refresh
docker compose run --rm dbt-run dbt run --full-refresh --exclude 'tag:nlp'

# Only changed models (after one prior run produced a manifest)
docker compose run --rm dbt-run dbt build --select state:modified+ --state target/

3.2.2 Common dbt manual workflows

  • Install/refresh packages: dbt clean && dbt deps
  • Models only: dbt run
  • Just tests: dbt test
  • Full refresh (rebuild tables): dbt build --full-refresh
  • Run a folder (e.g., core only): dbt run -s models/core/ (or dbt run -s core)
  • Include parents/children: dbt build -s +core (with upstream) or dbt build -s core+ (with downstream)
  • Faster local runs: dbt build --threads 6
  • Only changed models (iterating): run once normally to create a state manifest, then dbt build --select state:modified+ --state target/

4. NLP scoring (one-shot job)

The NLP container pulls free-text from stg.nlp_prem_text, scores sentiment & themes, and upserts into stg.nlp_predictions_inbox. The dbt marts assemble:

  • mart.mart_prem_text_sentiment
  • mart.mart_prem_theme_summary
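The "upsert and skip already scored" behavior can be pictured as a key-set difference: a row is pending only if its (qr_id, item_linkid) has not been scored by the current model version. A hedged sketch in Python (the actual job does this in SQL against stg.nlp_predictions_inbox):

```python
def pending_rows(texts, inbox, model_family, model_version):
    """Return text rows not yet scored by the given model.

    texts: dicts with qr_id, item_linkid, text_raw
    inbox: dicts with qr_id, item_linkid, model_family, model_version
    """
    scored = {
        (r["qr_id"], r["item_linkid"])
        for r in inbox
        if r["model_family"] == model_family and r["model_version"] == model_version
    }
    return [
        t for t in texts
        if t.get("text_raw")  # skip empty free-text
        and (t["qr_id"], t["item_linkid"]) not in scored
    ]
```

Bumping model_version makes every row pending again, which is why re-scoring after a model upgrade works without deleting the inbox.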

4.1 Image building and first run

DOCKER_BUILDKIT=1 docker build -t pof-prem-nlp:latest .

The first run downloads the models into a volume (not into the image):

docker run --rm \
  --env-file .env \
  -e PG_HOST=host.docker.internal \
  -v pof-prem_hfcache:/app/.hf_cache \
  pof-prem-nlp:latest \
  score --since 10y --limit 100 --verbose

4.2 Subsequent runs

Override the defaults if you want:

docker run --rm --env-file .env -e PG_HOST=host.docker.internal pof-prem-nlp:latest score --since 10y --limit 10000 --verbose

docker run --rm --env-file .env -e PG_HOST=host.docker.internal pof-prem-nlp:latest score --since 30d --limit 100

Or run locally without Docker:

python -m pipeline.cli score --since 10y --limit 200 --verbose

Or do a dry run to avoid writing:

python -m pipeline.cli score --since 10y --limit 50 --verbose --dry-run

Build the marts

docker compose run --rm dbt-run dbt build -s mart+

Troubleshooting

  • NLP says "No pending rows." Check that stg.nlp_prem_text has rows (non-empty text_raw) for the --since window. Also check stg.nlp_predictions_inbox: if (qr_id, item_linkid) pairs were already scored for the current model_family/model_version, they are intentionally skipped.

  • Bulk import (Synthea) fails The uploader script exits non-zero on $import errors and prints the HAPI $bulkdata-status response. Inspect the log lines; fix any bad file path or content type.

  • dbt constraints not created The on-run-end hook runs only when core models build. Verify you used build -s stg+ core+ (or full build).

  • Docker can’t reach your host DB Use PG_HOST=host.docker.internal (macOS/Windows). On Linux, expose the DB in compose and use the container DNS name instead.

Security & data hygiene (PoC-friendly)

  • Use the .env.example → .env workflow; never commit real secrets.
  • The NLP job flags potential PII (regex-based). If you display sample verbatims, mask or filter rows with pii_flag=true (add a column if you want to persist it).
  • Keep your Hugging Face cache in a named volume so images don’t bloat.
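A regex-based PII flag can be as simple as matching a few high-risk patterns. This sketch is illustrative only; the patterns and names are assumptions, not the NLP job's actual rules:

```python
import re

# Illustrative patterns only (assumption): emails, phone-like digit runs, AHV-style numbers
PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),       # email address
    re.compile(r"\+?\d[\d\s().-]{7,}\d"),         # phone-like number
    re.compile(r"\b756\.\d{4}\.\d{4}\.\d{2}\b"),  # Swiss AHV number format
]

def pii_flag(text):
    """Return True if any PII-like pattern matches the free text."""
    return any(p.search(text) for p in PII_PATTERNS)
```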

Analytics DB Snapshot

Source DB snapshot

MSYS_NO_PATHCONV=1 docker exec pof-analytics-db pg_dump -U analytics_admin -d analytics -Fc -C -Z 9 -f /tmp/analytics.dump
docker cp pof-analytics-db:/tmp/analytics.dump ./analytics.dump

MSYS_NO_PATHCONV=1 docker exec pof-analytics-db pg_dump -U analytics_admin -d metabase -Fc -C -Z 9 -f /tmp/metabase.dump
docker cp pof-analytics-db:/tmp/metabase.dump ./metabase.dump

MSYS_NO_PATHCONV=1 docker exec pof-analytics-db pg_dumpall -U analytics_admin --globals-only -f /tmp/globals.sql
docker cp pof-analytics-db:/tmp/globals.sql ./globals.sql

The three files are generated in your current directory; use them in the next steps.

Target DB

cat ./00-setup/analytics-db/snapshots/20250920/globals.sql | docker exec -i pof-analytics-db psql -U analytics_admin -d postgres -v ON_ERROR_STOP=1
docker exec pof-analytics-db dropdb -U analytics_admin --if-exists analytics
docker exec pof-analytics-db dropdb -U analytics_admin --if-exists metabase
cat ./00-setup/analytics-db/snapshots/20250920/analytics.dump | docker exec -i pof-analytics-db pg_restore -U analytics_admin -C -d postgres
cat ./00-setup/analytics-db/snapshots/20250920/metabase.dump | docker exec -i pof-analytics-db pg_restore -U analytics_admin -C -d postgres

Re-run the Metabase init script (from your host):

MSYS_NO_PATHCONV=1 docker exec \
  -e POSTGRES_USER=analytics_admin \
  -e POSTGRES_DB=postgres \
  -e PGPASSWORD=analytics_admin \
  -e MB_DB_NAME=metabase \
  -e MB_DB_USER=metabase_app \
  -e MB_DB_PASS=change_me_strong \
  -it pof-analytics-db bash -lc "sed -i 's/\r$//' /docker-entrypoint-initdb.d/10_metabase_app.sh && chmod +x /docker-entrypoint-initdb.d/10_metabase_app.sh && /docker-entrypoint-initdb.d/10_metabase_app.sh"

Reset the Metabase admin password (executed in PowerShell):

docker exec -it pof-metabase java -jar /app/metabase.jar reset-password admin@example.com
