VictorHueni/PREM-on-FHIR
PREM-on-FHIR

Introduction

End-to-end pipeline to generate synthetic PREM data, load it into a FHIR server, land it in an analytics DB, transform with dbt, and enrich free-text with an NLP job. Dockerized where it matters for a smooth dev experience.

To-Dos

  • Add a general architecture diagram (+ C4 diagram if possible)
  • Make sure all scripts using env vars use the same env files
  • Validate the docker run command as well as the shell script params for the Synthea steps
  • Make sure a virtual env is used for the seed steps
  • Build a Makefile so that make seed-all wraps all the steps above
  • Add Airbyte config
  • Add a Docker Compose service for dbt
  • Validate the usage of the .hf_cache model volume (the Docker image should be standalone)
  • Validate whether docker or docker compose is used for the NLP pipeline
  • Create Makefiles
  • Refactor all commands in the README, preferring cd .. before executing a command to avoid long paths
  • Add a post-init script that creates the Airbyte views once oltp-db and HAPI FHIR are up and the tables are available
  • Make the Synthea script leverage the parameters.json created after generation

Quickstart

# 1) Start infra (analytics DB, HAPI FHIR, pgAdmin, nginx to serve NDJSON)
docker compose --profile all up -d

# 2) Seed synthetic data (Synthea + Questionnaire + QuestionnaireResponses)
cd ..
make seed-all        # if you use the provided Makefile
# ...or run the individual scripts shown below

# 3) Build stg/core models, run NLP, then build marts
make elt             # dbt stg+core -> NLP -> dbt marts

Repo structure

/
├─ 00-setup/                       # infra only
│  ├─ analytics-db/                # Postgres for analytics
│  ├─ hapi-fhir/                   # FHIR server (java-based)
│  ├─ oltp-db/                     # OLTP Postgres
│  ├─ pgadmin/                     # pgAdmin client to monitor analytics and oltp db
│  ├─ synthea-files/               # synthea generated files served by nginx
│  │  └─ nginx.conf                # web server config (autoindex + NDJSON types)
│  └─ docker-compose.yml           # docker compose to setup the whole infrastructure
├─ 01-data-generation/
│  ├─ questionnaires/
│  │  ├─ input/                         # Manually built FHIR resources (Questionnaire, ValueSet, CodeSystem)
│  │  ├─ output/                        # FHIR bundles to upload
│  │  ├─ questionnaire_bundle_maker.py  # Python script that turns the manual FHIR resources into bundles
│  │  └─ upload_questionnaire.sh        # Shell script that uploads the bundle to the HAPI FHIR server via API calls
│  ├─ questionnaire_responses/
│  │  ├─ input/                    # Output folder of export_qr_header.py; contains QuestionnaireHeader.csv with all foreign keys
│  │  ├─ output/                   # Where QR bundles are generated by qr_bundle_maker.py
│  │  ├─ export_qr_header.py       # Retrieves identifiers (patients, practitioners, encounters, ...) that are mandatory in QuestionnaireResponse resources, so those resources get a real context
│  │  ├─ qr_bundle_maker.py        # Python script that generates fake QRs using QuestionnaireHeader.csv
│  │  └─ upload_qr.sh              # Shell script that uploads the QR bundles to the HAPI FHIR server via API calls
│  └─ synthea/
│     ├─ Dockerfile                # builds a Synthea runner
│     ├─ output/                   # generated NDJSON
│     └─ upload_synthea.sh         # bulk import via $import + polling
├─ 02-elt/
│  ├─ extract_load_config/         # Airbyte exports/screenshots
│  ├─ fhir_prem_to_decision/       # dbt project (stg/core/mart)
│  └─ nlp_pipeline/                # NLP job (Dockerized, runs one-shot)
└─ 04-dashboard/                   # (placeholder) dashboard/app
Prerequisites

  • Docker ≥ 24 with Compose v2
  • make
  • Python 3.11 if you run the local scripts manually
  • A local Airbyte instance: abctl local install --disable-auth

Copy the example environment file and adjust values:

cp .env.example .env
# edit values as needed

Step-by-Step Guide (except Airbyte)

1. Start the infrastructure

docker compose --profile all up -d
docker compose ps

This starts:

  • analytics-db (analytics Postgres)
  • oltp-db (OLTP Postgres)
  • hapi-fhir-server (FHIR server)
  • pgadmin
  • synthea-files (an nginx web server that serves NDJSON from 00-setup/synthea-files for convenient bulk import)

Then execute ./00-setup/oltp-db/post-init/20-airbyte-views.sql directly against the OLTP database (e.g. with psql or pgAdmin) to prepare the views for the Airbyte ETL.

End to End (Make Target Cheat Sheet)

make up            # infra up
make seed-all      # Synthea + Questionnaire + QR uploads
make dbt-stg-core  # build stg/core
make nlp           # run NLP one-shot
make dbt-mart      # build marts
make elt           # stg/core -> NLP -> mart
make logs          # follow docker compose logs
make down          # stop containers
make nuke          # stop + remove volumes (⚠ wipes data)

2. Generate & upload sample data

2.1 Setup

Target Description
make init Create Python venv (if missing) and install dependencies.
make venv Create .venv virtualenv (Linux/Mac or Windows).
make deps Install CLI requirements into the venv.

2.2 FHIR Server

Target Description
make fhir-wait Poll until HAPI at $FHIR_BASE is ready (default: http://localhost:8080/fhir).
make fhir-import Submit a bulk $import job using a single Parameters JSON (IMPORT_PARAMS).
make fhir-import-many Submit multiple $import jobs in sequence (IMPORT_FILES).

Config knobs FHIR_BASE=http://host:port/fhir IMPORT_PARAMS=… IMPORT_FILES=… IMPORT_POLL (seconds between polls, default 30) IMPORT_TIMEOUT_MIN (total minutes, default 60)
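An IMPORT_PARAMS file is a FHIR Parameters resource describing the bulk $import job. A minimal sketch, assuming HAPI's bulk import parameter names; the URLs are illustrative (the nginx container serves the NDJSON):

```json
{
  "resourceType": "Parameters",
  "parameter": [
    { "name": "inputFormat", "valueCode": "application/fhir+ndjson" },
    { "name": "inputSource", "valueUri": "http://synthea-files" },
    { "name": "storageDetail", "part": [ { "name": "type", "valueCode": "https" } ] },
    { "name": "input", "part": [
      { "name": "type", "valueCode": "Patient" },
      { "name": "url", "valueUri": "http://synthea-files/Patient.ndjson" }
    ] }
  ]
}
```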

2.3 Synthea

Generate NDJSON
Target Description
make synthea-build Build Synthea Docker image (SYN_TAG, default syntheadocker).
make synthea-run Run Synthea inside Docker and export bulk FHIR data to SYN_OUT.

Config knobs (can be set via env or .env): POPULATION (default 5) AGE_RANGE (default 18-100) KEEP_FILE (default keep_neuro.json) IMPORT_TIMEOUT_MIN (total minutes, default 60) EXTRA_ARGS (default enables bulk data export etc.)

Override knobs per run (or put in .env):

POPULATION=25 AGE_RANGE=18-90 KEEP_FILE=keep_neuro.json EXTRA_ARGS="--exporter.fhir.bulk_data=true ..." make synthea-run
Modify the parameter files

Edit 01-data-generation/synthea/import-pass1.json and update the entries for:

  • Location
  • Organization
  • Practitioner
  • PractitionerRole

These files are generated with a unique identifier in their file names; update the paths accordingly.

Import resources

make fhir-import-many: imports all files from import pass 1 first (these resources must exist before the resources in import pass 2 can reference them).

You may get errors such as missing files (not generated, or a wrong ID replacement in the previous step). Correct the issue and re-execute the script.

2.4 Questionnaire

Target Description
make bundle-questionnaires Build a transaction Bundle (Q_BUNDLE) from JSON files in Q_IN_DIR.
make post-questionnaires POST the Questionnaire bundle to $FHIR_BASE.

Config knobs: Q_IN_DIR=… Q_BUNDLE=…
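bundle-questionnaires essentially wraps each resource JSON from Q_IN_DIR in a transaction entry. A minimal sketch of that idea; the real logic lives in questionnaire_bundle_maker.py, and the function names and PUT/POST choice here are illustrative assumptions:

```python
import json
from pathlib import Path

def make_transaction_bundle(resources):
    """Wrap FHIR resources in a transaction Bundle."""
    entries = []
    for res in resources:
        rtype = res["resourceType"]
        rid = res.get("id")
        entries.append({
            "resource": res,
            # PUT keeps a stable id when the source resource has one, otherwise POST
            "request": {
                "method": "PUT" if rid else "POST",
                "url": f"{rtype}/{rid}" if rid else rtype,
            },
        })
    return {"resourceType": "Bundle", "type": "transaction", "entry": entries}

def bundle_from_dir(in_dir):
    """Load every *.json resource in in_dir and build one transaction Bundle."""
    resources = [json.loads(p.read_text()) for p in sorted(Path(in_dir).glob("*.json"))]
    return make_transaction_bundle(resources)
```

The resulting bundle can then be POSTed to $FHIR_BASE as one atomic transaction.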

2.5 QuestionnaireResponses (scaffolded from header CSV)

Target Description
make qr-export-headers Run SQL against HAPI DB and export header CSV (HDR_CSV).
make qr-make-bundles Generate QuestionnaireResponse bundles (QR_OUT) from the header CSV. Supports NREQ/PPNQ modes.
make post-qr-bundles POST generated bundles to $FHIR_BASE.

Config knobs (env or .env) QR_MODE = nreq | ppnq (default nreq)
QR_CHUNK (default 250)
QR_SEED (default 42)
QR_LIKERT_DIST (NREQ weighting, e.g. 0.2,0.5,0.3)

PPNQ text generation: QR_DRY_RUN=1 (placeholders, default 1)
QR_USE_LLM=1 (enable LLM; requires OPENAI_API_KEY)
LLM_MODEL=gpt-4o-mini
LLM_TEMPERATURE=0.6
LLM_MAX_RETRIES=3
Advanced (PPNQ):
  • NPS_DIST (e.g. 0:0.02,1:0.03,...,10:0.15) - custom NPS score distribution (0–10).
  • KEYWORD_RATE (default 0.35) - probability per item to inject up to 1–2 optional keywords (from a small domain list or themes.yml) into the prompt as gentle guidance.
  • STYLE_VARIANCE (default 0.7) - scales the randomness used to pick a subtle style hint (e.g., "grateful", "concerned") consistent with the NPS bucket. Higher → more style variability.
  • QR_VERBOSE=1 (extra logs)

Output files: $(QR_OUT)/{mode}_batch_bundle_###.json

2.5.1 Questionnaire Header

make qr-export-headers : runs the SQL against your HAPI DB and writes QuestionnaireResponse-Header.csv to $(HDR_CSV).
(DB envs respected: DB_HOST/PORT/NAME/USER/PASS or OLTP_DB_*.)

2.5.2 Questionnaire Response bundle maker

make qr-make-bundles : generate QR batch bundles from the header CSV → $(QR_OUT).

Controls (set as env before the command):

  • QR_MODE=nreq|ppnq (default nreq)
  • QR_SEED=42
  • QR_CHUNK=250

NREQ weighting:

  • QR_LIKERT_DIST=0.2,0.5,0.3
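QR_LIKERT_DIST weights the NREQ Likert answers. A small sketch of how such a weighted, seeded draw can work, assuming the comma-separated weights map, in order, to the answer options (the actual script's parsing may differ):

```python
import random

def parse_dist(spec):
    """Parse '0.2,0.5,0.3' into a list of float weights."""
    return [float(w) for w in spec.split(",")]

def sample_likert(options, dist_spec, n, seed=42):
    """Draw n Likert answers from `options` using the weights in dist_spec."""
    rng = random.Random(seed)  # fixed seed -> reproducible bundles (cf. QR_SEED)
    weights = parse_dist(dist_spec)
    if len(weights) != len(options):
        raise ValueError("one weight per answer option expected")
    return rng.choices(options, weights=weights, k=n)
```

For example, `sample_likert([1, 2, 3], "0.2,0.5,0.3", 1000)` skews answers toward the middle option.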

PPNQ text:

  • Dry run: QR_DRY_RUN=1 (default)
  • LLM mode: QR_USE_LLM=1 (needs OPENAI_API_KEY; optional LLM_MODEL, LLM_TEMPERATURE, LLM_MAX_RETRIES)
  • Advanced: NPS_DIST, KEYWORD_RATE, STYLE_VARIANCE

Examples:

# NREQ with weighted Likert distribution and fixed seed
QR_MODE=nreq QR_LIKERT_DIST=0.2,0.5,0.3 QR_SEED=7 make -e qr-make-bundles

# PPNQ dry-run placeholders
QR_MODE=ppnq QR_DRY_RUN=1 make -e qr-make-bundles

# PPNQ via LLM (verbose)
QR_MODE=ppnq QR_USE_LLM=1 LLM_MODEL=gpt-4o-mini QR_VERBOSE=1 make -e qr-make-bundles

Delete every QuestionnaireResponse (optional reset):

curl -X DELETE "http://localhost:8080/fhir/QuestionnaireResponse?_lastUpdated=gt1900-01-01T00:00:00Z&_expunge=true" \
  -H "Accept: application/fhir+json"

2.6 End to End

Target Description
make seed-all Run everything: setup, build/run Synthea, wait for FHIR, bulk import, bundle/post questionnaires, export headers, generate & post QR bundles.

2.7 Clean up

Target Description
make clean Remove generated outputs (SYN_OUT, QR_OUT, HDR_OUT, curl logs).

3. ELT

3.1 Airbyte (Extract/Load)

It is important to use --disable-auth on the install to avoid having to configure API tokens etc. for this PoC within the local Kubernetes Airbyte cluster.

Useful commands

abctl local install --disable-auth
abctl local credentials
abctl local status

3.2 DBT (Transform)

Run this step only once the Airbyte config is finalized and the first sync has completed.

3.2.1 Containerized DBT

# Install deps
docker compose run --rm dbt-run dbt deps


# Build stg + core
docker compose run --rm dbt-run dbt build -s stg+ core+
# Only compile
docker compose run --rm dbt-run dbt compile --select stg.*

# Only tests
docker compose run --rm dbt-run dbt test

# Full pipeline refresh
docker compose run --rm dbt-run dbt build --full-refresh --exclude 'tag:nlp'

# Full model refresh
docker compose run --rm dbt-run dbt run --full-refresh --exclude 'tag:nlp'

# Only changed models (after one prior run produced a manifest)
docker compose run --rm dbt-run dbt build --select state:modified+ --state target/

3.2.2 Common dbt manual workflows

  • Install/refresh packages: dbt clean && dbt deps
  • Models only: dbt run
  • Just tests: dbt test
  • Full refresh (rebuild tables): dbt build --full-refresh
  • Run a folder (e.g., core only): dbt run -s models/core/ (or dbt run -s core)
  • Include parents/children: dbt build -s +core (with upstream) or dbt build -s core+ (with downstream)
  • Faster local runs: dbt build --threads 6
  • Only changed models (iterating): run once normally to create a state manifest, then dbt build --select state:modified+ --state target/

4. NLP scoring (one-shot job)

The NLP container pulls free-text from stg.nlp_prem_text, scores sentiment & themes, and upserts into stg.nlp_predictions_inbox. The dbt marts assemble:

  • mart.mart_prem_text_sentiment
  • mart.mart_prem_theme_summary
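The "upsert and skip already scored" behavior can be pictured as a key-set difference: a row is pending only if its (qr_id, item_linkid) has not been scored by the current model version. A hedged sketch in Python (the actual job does this in SQL against stg.nlp_predictions_inbox):

```python
def pending_rows(texts, inbox, model_family, model_version):
    """Return text rows not yet scored by the given model.

    texts: dicts with qr_id, item_linkid, text_raw
    inbox: dicts with qr_id, item_linkid, model_family, model_version
    """
    scored = {
        (r["qr_id"], r["item_linkid"])
        for r in inbox
        if r["model_family"] == model_family and r["model_version"] == model_version
    }
    return [
        t for t in texts
        if t.get("text_raw")  # skip empty free-text
        and (t["qr_id"], t["item_linkid"]) not in scored
    ]
```

Bumping model_version makes every row pending again, which is why re-scoring after a model upgrade works without deleting the inbox.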

4.1 Image building and first run

DOCKER_BUILDKIT=1 docker build -t pof-prem-nlp:latest .

The first run downloads the models into a volume (not into the image):

docker run --rm \
  --env-file .env \
  -e PG_HOST=host.docker.internal \
  -v pof-prem_hfcache:/app/.hf_cache \
  pof-prem-nlp:latest \
  score --since 10y --limit 100 --verbose

4.2 Subsequent runs

Override the defaults if you want:

docker run --rm --env-file .env -e PG_HOST=host.docker.internal pof-prem-nlp:latest score --since 10y --limit 10000 --verbose

docker run --rm --env-file .env -e PG_HOST=host.docker.internal pof-prem-nlp:latest score --since 30d --limit 100

Or run locally without Docker:

python -m pipeline.cli score --since 10y --limit 200 --verbose

Or do a dry run to avoid writing:

python -m pipeline.cli score --since 10y --limit 50 --verbose --dry-run

Build the marts

docker compose run --rm dbt-run dbt build -s mart+

Troubleshooting

  • NLP says "No pending rows." Check that stg.nlp_prem_text has rows (non-empty text_raw) for the --since window. Also check stg.nlp_predictions_inbox: if (qr_id, item_linkid) pairs were already scored for the current model_family/model_version, they are intentionally skipped.

  • Bulk import (Synthea) fails The uploader script exits non-zero on $import errors and prints the HAPI $bulkdata-status response. Inspect the log lines; fix any bad file path or content type.

  • dbt constraints not created The on-run-end hook runs only when core models build. Verify you used build -s stg+ core+ (or full build).

  • Docker can’t reach your host DB Use PG_HOST=host.docker.internal (macOS/Windows). On Linux, expose the DB in compose and use the container DNS name instead.

Security & data hygiene (PoC-friendly)

  • Use the .env.example → .env workflow; never commit real secrets.
  • The NLP job flags potential PII (regex-based). If you display sample verbatims, mask or filter rows with pii_flag=true (add a column if you want to persist it).
  • Keep your Hugging Face cache in a named volume so images don’t bloat.
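A regex-based PII flag can be as simple as matching a few high-risk patterns. This sketch is illustrative only; the patterns and names are assumptions, not the NLP job's actual rules:

```python
import re

# Illustrative patterns only (assumption): emails, phone-like digit runs, AHV-style numbers
PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),       # email address
    re.compile(r"\+?\d[\d\s().-]{7,}\d"),         # phone-like number
    re.compile(r"\b756\.\d{4}\.\d{4}\.\d{2}\b"),  # Swiss AHV number format
]

def pii_flag(text):
    """Return True if any PII-like pattern matches the free text."""
    return any(p.search(text) for p in PII_PATTERNS)
```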

Analytics DB Snapshot

Source DB snapshot

MSYS_NO_PATHCONV=1 docker exec pof-analytics-db pg_dump -U analytics_admin -d analytics -Fc -C -Z 9 -f /tmp/analytics.dump
docker cp pof-analytics-db:/tmp/analytics.dump ./analytics.dump

MSYS_NO_PATHCONV=1 docker exec pof-analytics-db pg_dump -U analytics_admin -d metabase -Fc -C -Z 9 -f /tmp/metabase.dump
docker cp pof-analytics-db:/tmp/metabase.dump ./metabase.dump

MSYS_NO_PATHCONV=1 docker exec pof-analytics-db pg_dumpall -U analytics_admin --globals-only -f /tmp/globals.sql
docker cp pof-analytics-db:/tmp/globals.sql ./globals.sql

The three files are generated in your current directory; use them in the next steps.

Target DB

cat ./00-setup/analytics-db/snapshots/20250920/globals.sql | docker exec -i pof-analytics-db psql -U analytics_admin -d postgres -v ON_ERROR_STOP=1
docker exec pof-analytics-db dropdb -U analytics_admin --if-exists analytics
docker exec pof-analytics-db dropdb -U analytics_admin --if-exists metabase
cat ./00-setup/analytics-db/snapshots/20250920/analytics.dump | docker exec -i pof-analytics-db pg_restore -U analytics_admin -C -d postgres
cat ./00-setup/analytics-db/snapshots/20250920/metabase.dump | docker exec -i pof-analytics-db pg_restore -U analytics_admin -C -d postgres

Re-run the Metabase init script (from your host):

MSYS_NO_PATHCONV=1 docker exec \
  -e POSTGRES_USER=analytics_admin \
  -e POSTGRES_DB=postgres \
  -e PGPASSWORD=analytics_admin \
  -e MB_DB_NAME=metabase \
  -e MB_DB_USER=metabase_app \
  -e MB_DB_PASS=change_me_strong \
  -it pof-analytics-db bash -lc "sed -i 's/\r$//' /docker-entrypoint-initdb.d/10_metabase_app.sh && chmod +x /docker-entrypoint-initdb.d/10_metabase_app.sh && /docker-entrypoint-initdb.d/10_metabase_app.sh"

Reset the Metabase admin password (executed in PowerShell):

docker exec -it pof-metabase java -jar /app/metabase.jar reset-password admin@example.com
