diff --git a/Makefile b/Makefile
new file mode 100644
index 0000000..e14d22f
--- /dev/null
+++ b/Makefile
@@ -0,0 +1,14 @@
+.PHONY: test coverage typecheck lint
+
+test:
+	pytest -q
+
+coverage:
+	coverage run -m pytest tests/mcp
+	coverage report --include="datamimic_ce/mcp/*" --fail-under=90
+
+typecheck:
+	mypy --strict datamimic_ce/mcp
+
+lint:
+	pylint datamimic_ce/mcp
diff --git a/README.md b/README.md
index ff730c5..770aaef 100644
--- a/README.md
+++ b/README.md
@@ -14,29 +14,25 @@ Faker gives you *random* data.
 [![Maintainability](https://sonarcloud.io/api/project_badges/measure?project=rapiddweller_datamimic&metric=sqale_rating)](https://sonarcloud.io/summary/new_code?id=rapiddweller_datamimic)
 [![Python](https://img.shields.io/badge/Python-3.11%2B-blue.svg)](https://www.python.org/downloads/)
 [![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
+![MCP Ready](https://img.shields.io/badge/MCP-ready-8A2BE2.svg)
 
 ---
 
-## 🧠 What Problem DATAMIMIC Solves
+## ✨ Why DATAMIMIC?
 
-Typical data generators (like Faker) produce **isolated random values**.
-That’s fine for unit tests — but meaningless for system, analytics, or compliance testing.
-
-**Example:**
+Typical data generators produce **isolated random values**. That’s fine for unit tests — but meaningless for system, analytics, or compliance testing.
 
 ```python
-# Faker – broken relationships
+# Faker — broken relationships
 patient_name = fake.name()
 patient_age = fake.random_int(1, 99)
-conditions = [fake.word()]
-# "25-year-old with Alzheimer's" – nonsense data.
+conditions = [fake.word()]
+# "25-year-old with Alzheimer's" — nonsense data
 ```
 
-**DATAMIMIC – contextual realism**
-
 ```python
+# DATAMIMIC — contextual realism
 from datamimic_ce.domains.healthcare.services import PatientService
-
 patient = PatientService().generate()
 print(f"{patient.full_name}, {patient.age}, {patient.conditions}")
 # "Shirley Thompson, 72, ['Diabetes', 'Hypertension']"
@@ -52,10 +48,9 @@ Install and run:
 pip install datamimic-ce
 ```
 
-## Deterministic Data Generation
+### Deterministic Generation
 
-DATAMIMIC lets you generate the *same* data, every time across machines, environments, or CI pipelines.
-Seeds, clocks, and UUIDv5 namespaces ensure your synthetic datasets remain reproducible and traceable, no matter where or when they’re generated.
+DATAMIMIC produces the *same data for the same request*, across machines and CI runs. Seeds, clocks, and UUIDv5 namespaces enforce reproducibility.
 
 ```python
 from datamimic_ce.domains.facade import generate_domain
@@ -71,20 +66,66 @@ request = {
 
 response = generate_domain(request)
 print(response["items"][0]["id"])
+# Same input → same output
+```
+
+**Determinism Contract**
+
+* **Inputs:** `{seed, clock, uuidv5-namespace, request body}`
+* **Guarantees:** byte-identical payloads + stable `determinism_proof.content_hash`
+* **Scope:** all CE domains (see docs for domain-specific caveats)
+
+---
+
+## ⚡ MCP (Model Context Protocol)
+
+Run DATAMIMIC as an MCP server so Claude / Cursor (and agents) can call deterministic data tools.
+
+**Install**
+
+```bash
+pip install datamimic-ce[mcp]
+# Development
+pip install -e .[mcp]
+```
+
+**Run (SSE transport)**
+
+```bash
+export DATAMIMIC_MCP_HOST=127.0.0.1
+export DATAMIMIC_MCP_PORT=8765
+# Optional auth; clients must send the same token via Authorization: Bearer or X-API-Key
+export DATAMIMIC_MCP_API_KEY=changeme
+datamimic-mcp
 ```
 
-**Result:**
-`Same input → same output.`
+**In-proc example (determinism proof)**
 
-Behind the scenes, every deterministic request combines:
+```python
+import anyio, json
+from fastmcp.client import Client
+from datamimic_ce.mcp.models import GenerateArgs
+from datamimic_ce.mcp.server import create_server
+
+async def main():
+    args = GenerateArgs(domain="person", locale="en_US", seed=42, count=2)
+    payload = args.model_dump(mode="python")
+    async with Client(create_server()) as c:
+        a = await c.call_tool("generate", {"args": payload})
+        b = await c.call_tool("generate", {"args": payload})
+        print(json.loads(a[0].text)["determinism_proof"]["content_hash"]
+              == json.loads(b[0].text)["determinism_proof"]["content_hash"])  # True
+anyio.run(main)
+```
 
-* A **stable seed** (for idempotent randomness),
-* A **frozen clock** (for time-dependent values), and
-* A **UUIDv5 namespace** (for globally consistent identifiers).
+**Config keys**
 
-Together, they form a reproducibility contract. Ideal for CI/CD pipelines, agentic pipelines, and analytics verification.
+* `DATAMIMIC_MCP_HOST` (default `127.0.0.1`)
+* `DATAMIMIC_MCP_PORT` (default `8765`)
+* `DATAMIMIC_MCP_API_KEY` (unset = no auth)
+* Requests over cap (`count > 10_000`) are rejected with `422`.
 
-Agents can safely re-invoke the same generation call and receive byte-for-byte identical data.
+➡️ **Full guide, IDE configs (Claude/Cursor), transports, errors:** [`docs/mcp_quickstart.md`](docs/mcp_quickstart.md)
 
 ---
 
@@ -98,10 +139,10 @@ patient = PatientService().generate()
 print(patient.full_name, patient.conditions)
 ```
 
-* **PatientService** – Demographically realistic patients
-* **DoctorService** – Specialties match conditions
-* **HospitalService** – Realistic bed capacities and types
-* **MedicalRecordService** – Longitudinal health records
+* Demographically realistic patients
+* Doctor specialties match conditions
+* Hospital capacities and types
+* Longitudinal medical records
 
 ### 💰 Finance
 
@@ -111,31 +152,30 @@ account = BankAccountService().generate()
 print(account.account_number, account.balance)
 ```
 
-* Balances respect transactions
+* Balances respect transaction histories
 * Card/IBAN formats per locale
-* Distributions tuned for fraud analytics and reconciliation
+* Distributions tuned for fraud/reconciliation tests
 
-### 👤 Demographics
+### 🌍 Demographics
 
-* `PersonService` – Culturally consistent names, addresses, phone patterns
-* Locale packs for DE / US / VN, versioned and auditable
+* `PersonService` with locale packs (DE / US / VN), versioned and auditable
 
 ---
 
 ## 🔒 Deterministic by Design
 
-* **Frozen clocks** and **canonical hashing** → reproducible IDs
-* **Seeded random generators** → identical outputs across runs
-* **Schema validation** (XSD, JSONSchema) → structural integrity
-* **Provenance hashing** → audit-friendly lineage
+* **Frozen clocks** + **canonical hashing** → reproducible IDs
+* **Seeded RNG** → identical outputs across runs
+* **Schema validation** (XSD/JSONSchema) → structural integrity
+* **Provenance hashing** → audit-ready lineage
 
 📘 See [Developer Guide](docs/developer_guide.md)
 
 ---
 
-## 🧮 XML / Python Model Workflow
+## 🧮 XML / Python Parity
 
-Python-based generation:
+Python:
 
 ```python
 from random import Random
@@ -147,7 +187,7 @@ svc = PatientService(dataset="US", demographic_config=cfg, rng=Random(1337))
 print(svc.generate().to_dict())
 ```
 
-Equivalent XML model:
+Equivalent XML:
 
 ```xml
@@ -162,24 +202,7 @@ Equivalent XML model:
 
 ---
 
-## ⚖️ CE vs EE Comparison
-
-| Feature                                 | Community (CE) | Enterprise (EE) |
-| --------------------------------------- | -------------- | --------------- |
-| Deterministic domain generation         | ✅             | ✅              |
-| XML + Python pipelines                  | ✅             | ✅              |
-| Healthcare & Finance domains            | ✅             | ✅              |
-| Multi-user collaboration                | ❌             | ✅              |
-| Governance & lineage dashboards         | ❌             | ✅              |
-| ML engines (Mostly AI, Synthcity, ... ) | ❌             | ✅              |
-| RBAC & audit logging (HIPAA/GDPR/PCI)   | ❌             | ✅              |
-| Managed EDIFACT / SWIFT adapters        | ❌             | ✅              |
-
-👉 [Compare editions](https://datamimic.io) • [Book a strategy call](https://datamimic.io/contact)
-
----
-
-## 🧰 CLI & Automation
+## 🧰 CLI
 
 ```bash
 # Run instant healthcare demo
@@ -190,29 +213,38 @@ datamimic run ./healthcare-example/datamimic.xml
 datamimic version
 ```
 
+**Quality gates (repo):**
+
+```bash
+make typecheck   # mypy --strict
+make lint        # pylint (≥9.0 score target)
+make coverage    # target ≥ 90%
+```
+
 ---
 
 ## 🧭 Architecture Snapshot
 
-* **Core pipeline:** Determinism kit + domain services + schema validators
-* **Governance layer:** Group tables, linkage audits, provenance hashing
-* **Execution layer:** CLI, API, and XML runners
+* **Core pipeline:** Determinism kit • Domain services • Schema validators
+* **Governance layer:** Group tables • Linkage audits • Provenance hashing
+* **Execution layer:** CLI • API • XML runners • MCP server
 
 ---
 
-## 🌍 Industry Blueprints
-
-### Finance
+## ⚖️ CE vs EE
 
-* Simulate SWIFT / ISO 20022 flows
-* Replay hashed PCI transaction histories
-* Validate fraud and reconciliation pipelines
+| Feature                               | Community (CE) | Enterprise (EE) |
+| ------------------------------------- | -------------- | --------------- |
+| Deterministic domain generation       | ✅             | ✅              |
+| XML + Python pipelines                | ✅             | ✅              |
+| Healthcare & Finance domains          | ✅             | ✅              |
+| Multi-user collaboration              | ❌             | ✅              |
+| Governance & lineage dashboards       | ❌             | ✅              |
+| ML engines (Mostly AI, Synthcity, …)  | ❌             | ✅              |
+| RBAC & audit logging (HIPAA/GDPR/PCI) | ❌             | ✅              |
+| EDIFACT / SWIFT adapters              | ❌             | ✅              |
 
-### Healthcare
-
-* Generate deterministic patient journeys
-* Integrate HL7/FHIR/EDIFACT exchanges
-* Reproduce QA datasets for regression testing
+👉 [Compare editions](https://datamimic.io) • [Book a strategy call](https://datamimic.io/contact)
 
 ---