
feat(domains): add demographic-driven domain services with json schema validation for llm tool support #119

Merged
ake2l merged 1 commit into development from feat/demographic-extension
Oct 12, 2025

@ake2l ake2l commented Oct 12, 2025

Add address, person, patient, and doctor services with schema validation, locale packs, determinism utilities, examples, and tests; make mypy and ruff clean.

Why

  • Provide first-class, deterministic domain services (address, person, patient, doctor) that compose locale data and demographic components.
  • Enforce contracts at boundaries via JSON Schemas and structured domain errors.
  • Ensure codebase is typing- and lint-clean to keep CI green and onboarding smooth.

What Changed

  • Domains
    • Added services and packages for address, person, patient, doctor backed by domain datasets.
    • Introduced locales module to load locale packs (people/address/doctor/patient) with auto delimiter detection and strict mode support.
    • Added common/demographics components: profile metadata and sampler composition; resolved supported component datasets.
    • Added facade to centralize orchestration (thin boundary).
  • Contracts & Validation
    • Added schema_registry with cached schema loading and validate_payload() for request/response enforcement.
    • Added domain JSON schemas for v1 requests/responses (person, patient, doctor, address).
    • Introduced DomainError with structured details and consistent to_dict() payload.
  • Determinism
    • New determinism helpers: canonicalization, canonical JSON, stable UUIDs, seed derivation/mixing, provenance hashing, frozen clock, and explicit RNG wrapper.
  • Data & Examples
    • Added demographic group CSVs and per-country component/profile metadata (DE, US, VN).
    • Added examples for all four domains (requests + responses).
  • Tooling & Hygiene
    • mypy: fixed inferred type narrowing and added a targeted type: ignore for jsonschema stubs and asdict narrowing.
    • ruff: cleaned unused/deprecated typing imports and ensured repo style.
    • Updated README.md and docs/README.md to reflect new capabilities.
    • Minor update to datamimic_ce/domains/utils/dataset_path.py for dataset path compliance.
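
For readers unfamiliar with the contract layer, the following is a rough sketch of how cached schema loading, validate_payload(), and DomainError.to_dict() could fit together. It is not the code from this PR: the schema name, the in-memory schema store, the error codes, and the function signatures are all assumptions, and a minimal "required" check stands in for the jsonschema library so the snippet runs with the standard library only.

```python
from __future__ import annotations

import json
from functools import lru_cache
from typing import Any

# Hypothetical in-memory stand-in for the schema files shipped with the domains.
_SCHEMA_SOURCES = {
    "person.request.v1": '{"type": "object", "required": ["locale", "seed"]}',
}


class DomainError(Exception):
    """Structured error; the field names are assumptions for illustration."""

    def __init__(self, code: str, message: str, details: dict[str, Any] | None = None) -> None:
        super().__init__(message)
        self.code = code
        self.message = message
        self.details = details or {}

    def to_dict(self) -> dict[str, Any]:
        # Same three keys for every error, so callers can rely on the shape.
        return {"code": self.code, "message": self.message, "details": self.details}


@lru_cache(maxsize=None)
def load_schema(name: str) -> dict[str, Any]:
    """Parse a schema once; later calls return the cached object."""
    try:
        return json.loads(_SCHEMA_SOURCES[name])
    except KeyError:
        raise DomainError("schema_not_found", f"Unknown schema: {name}", {"schema": name}) from None


def validate_payload(payload: Any, schema_name: str) -> None:
    """Simplified boundary check: object type plus required keys only."""
    schema = load_schema(schema_name)
    if not isinstance(payload, dict):
        raise DomainError("invalid_payload", "Payload must be an object", {"schema": schema_name})
    missing = sorted(key for key in schema.get("required", []) if key not in payload)
    if missing:
        raise DomainError(
            "invalid_payload",
            "Missing required fields",
            {"schema": schema_name, "missing": missing},
        )
```

A caller at the service boundary would wrap validate_payload() in a try/except and return error.to_dict() to the client, which keeps the error shape identical across all four domains.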

Tests

  • Added API-level tests covering:
    • Locale loading and dataset coverage.
    • Component resolution and strict mode behavior.
    • Group schema integrity, loader path compliance, and absence of JSON duplication.
    • Sampler composition/runtime and cross-entity context behavior.
    • Profile seed determinism and profile vs baseline distribution.
    • Service purity and property tests.
  • Deterministic by default (seeded, filesystem isolated where applicable).
  • Local runs:
    • uv run mypy datamimic_ce → Success: no issues
    • ruff check datamimic_ce --unsafe-fixes --fix → All checks passed
    • uv run pytest -q (please confirm locally; CI should enforce)

Risks & Roll-back

  • Risk: Missing domain_data files per-locale can raise errors in strict mode.
    • Mitigation: Clear error messages; examples provided; strict mode can be toggled via existing config.
  • Risk: Runtime validation costs for JSON Schema on hot paths.
    • Mitigation: Cached validators; can gate and/or limit to boundary checks.
  • Roll-back: Revert domain additions and schema registry changes in a single revert; typing/lint-only changes are isolated and safe.
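
To make the strict-mode risk concrete, here is a minimal sketch of a loader that detects the delimiter automatically and only raises on a missing file when strict mode is on. The function name load_locale_rows, the strict parameter, and the use of FileNotFoundError are assumptions for illustration; the actual locales module may detect delimiters differently and most likely raises a DomainError instead.

```python
from __future__ import annotations

import csv
from pathlib import Path


def load_locale_rows(path: Path, strict: bool = True) -> list[list[str]]:
    """Read a locale CSV, guessing its delimiter from the first 2 KiB."""
    if not path.exists():
        if strict:
            # Strict mode: a missing dataset is an error the caller must handle.
            raise FileNotFoundError(f"Missing locale dataset: {path}")
        # Lenient mode: treat a missing dataset as empty.
        return []

    text = path.read_text(encoding="utf-8")
    try:
        delimiter = csv.Sniffer().sniff(text[:2048], delimiters=",;|\t").delimiter
    except csv.Error:
        # Single-column or ambiguous files: fall back to a comma.
        delimiter = ","
    return [row for row in csv.reader(text.splitlines(), delimiter=delimiter) if row]
```

With strict=False a missing per-locale file yields an empty list, so callers can fall back to another dataset rather than failing the request.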

Docs/CLI

  • README/docs updated to link new domain services and examples.
  • No public CLI changes; commands remain stable.

Assumptions & Alternatives

  • Assumption: JSON Schema at runtime is acceptable. Alternative: migrate boundary validation to Pydantic models for tighter typing, generate JSON Schema from models for docs/contract sync.
  • Assumption: jsonschema stubs not required short-term. Alternative: add types-jsonschema to dev deps to drop the type: ignore.
  • Assumption: locale data shipped as CSV remains manageable. Alternatives: consolidate using a compact serialized format or lazy-load per component to reduce memory.

Performance Notes

  • CSV access is file-backed with simple line iteration; hotspot risk is low.
  • Determinism utilities use SHA-256 and minimal allocations; no O(N²) patterns detected in added paths.
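
As an illustration of the kind of determinism helpers described in this PR, the sketch below shows canonical JSON, SHA-256 seed derivation, a stable UUID, and an explicit RNG. All function names and signatures are hypothetical. The UUID helper uses the standard library's uuid5, which is SHA-1 based, purely for brevity; the PR does not say how its stable UUIDs are computed.

```python
import hashlib
import json
import random
import uuid
from typing import Any


def canonical_json(value: Any) -> str:
    """Serialize with sorted keys and no extra whitespace, so equal data gives equal text."""
    return json.dumps(value, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def derive_seed(root_seed: int, *labels: str) -> int:
    """Mix a root seed with labels via SHA-256 and keep the first 64 bits."""
    material = canonical_json([root_seed, *labels]).encode("utf-8")
    return int.from_bytes(hashlib.sha256(material).digest()[:8], "big")


def stable_uuid(namespace: str, payload: Any) -> uuid.UUID:
    """Name-based UUID: the same namespace and payload always give the same ID."""
    return uuid.uuid5(uuid.NAMESPACE_URL, f"{namespace}:{canonical_json(payload)}")


def make_rng(root_seed: int, *labels: str) -> random.Random:
    """Explicit RNG instance; nothing touches the global random state."""
    return random.Random(derive_seed(root_seed, *labels))
```

Deriving a separate seed per label (for example one per domain and locale) lets each service draw from its own stream, so adding a new service does not shift the values generated by existing ones.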

Compatibility

  • Additive feature set; no breaking changes to existing public APIs.
  • SemVer: minor version bump appropriate.

DoD Checklist

  • Types and style pass (mypy/ruff)
  • Tests added and deterministic
  • Public errors are human-readable; internal exceptions typed
  • Docs updated; examples runnable
  • No material performance regressions observed
  • No architecture violations (SOC/SPOT/DRY/KISS)

@ake2l ake2l self-assigned this Oct 12, 2025
@ake2l ake2l merged commit ef1f2b5 into development Oct 12, 2025
33 of 34 checks passed