This repository contains the code used for Deliverable 3 of the A2-PRIVCOMP project, entitled "Validación de nuevos modelos de generación de datos sintéticos" ("Validation of new synthetic data generation models") and authored by the Vicomtech Foundation.
This repository provides tools to train and evaluate different synthetic data generation models. It allows:
- Loading and preprocessing real datasets.
- Training and evaluating multiple synthetic tabular data generative models.
- Tracking evaluation metrics and results using MLflow and MinIO.
- Reproducible evaluation workflows for synthetic tabular data research.
├── Notebooks # Contains Jupyter notebooks used to load, preprocess and split the source data for each used dataset.
├── Real Data # Stores the datasets used in the deliverable, organized into three subfolders
│ ├── Complete Data # Original real data in its raw format.
│ ├── Train Data # Data used for model training (generated via the notebooks in Notebooks).
│ └── Test Data # Data used to validate the synthetic data (also generated via the notebooks in Notebooks).
├── requirements # Includes multiple .txt files, each listing the dependencies needed to run the code and the different synthetic data generator models.
├── results # Includes a Jupyter notebook demonstrating how to retrieve registered metrics from MLflow and use them.
├── src # Contains the implementation code for synthetic data generation models and evaluation metrics and methods.
├── .env # Defines the env variables for data paths.
├── .env.minio # Defines the env variables for minio container. This must be correctly configured for models to access the necessary data.
├── .env.mlflow # Defines the env variables for mlflow container. This must be correctly configured for models to access the necessary data.
├── .gitignore
├── .pre-commit-config.yaml
├── docker-compose.yml # A docker compose file containing all the required specifications to correctly set up the containers.
├── Dockerfile_mlflow # A Dockerfile used as a base to correctly set up the MLflow container.
├── README.md
└── train_{model_name}_model.py # A collection of Python scripts that act as entry points for training the different models defined within them.

- Python 3.8+
- Python environment manager: some of the models in this repository have mutually exclusive dependencies and cannot coexist in the same Python environment, so plan to create a separate environment per model.
- Docker & Docker Compose: all metrics are tracked using an MLflow container (mlflow.org), which relies on MinIO (min.io) for backend storage.
NOTE
The containers do not need to run on the same machine used for training and evaluating the synthetic data generators. However, the training machine must be able to establish a connection to the MLflow container, even if MLflow is hosted remotely.
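Before starting a long training run, it can be useful to confirm from the training machine that the MLflow container is reachable. The following is a minimal sketch using only the standard library; the host and port are placeholder values, not values defined by this repository:

```python
import socket

def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: check a (hypothetical) remote MLflow host before training.
# mlflow_ok = can_reach("192.168.1.50", 5000)
```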
This repository includes a docker-compose.yml file that deploys two containers: MLflow and MinIO.

- MLflow is used to track models and log metrics for each training process.
- MinIO serves as the backend storage for these models and metrics.
For the containers to work properly, two .env files are required (one for each service). Both are included in the repository but must be completed with the correct values before starting the containers.
To launch the containers (once the .env files are filled in), run:

```
docker compose -f docker-compose.yml up
```

Adding the `-d` flag detaches the process once the containers are ready.
Before generating synthetic data, you must create a bucket named mlflow-bucket in MinIO. To do so:

- Open the MinIO web interface at port 9001 on the host machine where the containers were deployed. If running locally, go to http://localhost:9001; if running on another machine, replace "localhost" with that machine's IP address.
- Log in with the credentials defined in the .env.minio file.
- Create a bucket named mlflow-bucket.
MLflow will then use this bucket to store evaluation results and artifacts.
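For the training scripts to upload artifacts to this bucket, MLflow's S3-compatible client reads a few well-known environment variables. The sketch below shows the general shape; the endpoint and credential values are placeholders and should be taken from your .env.minio file:

```python
import os

# Placeholder values -- substitute the endpoint and credentials
# configured in your .env.minio file.
os.environ["MLFLOW_S3_ENDPOINT_URL"] = "http://localhost:9000"  # MinIO API port
os.environ["AWS_ACCESS_KEY_ID"] = "minio-access-key"
os.environ["AWS_SECRET_ACCESS_KEY"] = "minio-secret-key"

# Artifacts are stored under the bucket created above.
artifact_uri = "s3://mlflow-bucket"
```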
Using your chosen Python environment manager, create a new environment and install the requirements for the model you want to train.
If training a specific model, install the dependencies listed in its corresponding requirements file.
For example, to train the dp-npc model, install the corresponding requirements file (under the requirements folder):

```
pip install -r requirements_npc.txt
```
Once the environment is ready, open the appropriate .py file for the target model and perform the following steps:
- At the bottom of the file, specify which datasets should be used for training. Examples are already provided, and more than one dataset can be used per execution.
- The datasets must already be partitioned into train and test sets and stored correctly in the Real Data folder.
- Ensure the code includes a tracking_uri variable, set to the IP address and port of the MLflow container where metrics will be tracked.
- Model hyperparameters are set to their defaults; to change them, edit the class containing the parameters. Refer to the .py files stored in src/sdg_models to see the available hyperparameters for each model.
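As a rough illustration, the configuration block at the bottom of a training script might look like the following; every name and path here is hypothetical, not the repository's actual variables:

```python
# Hypothetical sketch of the configuration at the bottom of a
# train_{model_name}_model.py script; names and paths are illustrative.
datasets = [
    "Real Data/Train Data/dataset_a_train.csv",  # placeholder paths
    "Real Data/Train Data/dataset_b_train.csv",
]

mlflow_host = "localhost"  # IP of the machine running the MLflow container
mlflow_port = 5000         # default MLflow server port
tracking_uri = f"http://{mlflow_host}:{mlflow_port}"
```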
Once everything is set correctly, run the chosen .py file, for example:

```
python train_models_dp_npc.py
```
NOTE
Some models and datasets may require significantly more computational resources than others.
Results and metrics are tracked in MLflow and stored in MinIO. A Jupyter notebook is provided under results/ as an example of how to retrieve and visualize metrics for one dataset.
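As a loose illustration of what the results notebook does, the snippet below compares logged metrics across runs using a plain-Python record structure; the record layout and metric names are assumed for the example, not the actual MLflow return types:

```python
# Assumed run records, shaped loosely like MLflow run data.
runs = [
    {"model": "dp-npc", "metrics": {"fidelity": 0.91, "privacy": 0.88}},
    {"model": "ctgan", "metrics": {"fidelity": 0.87, "privacy": 0.93}},
]

def best_by(runs, metric):
    """Return the model name with the highest value for the given metric."""
    return max(runs, key=lambda r: r["metrics"][metric])["model"]

print(best_by(runs, "fidelity"))  # -> dp-npc
```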
Contributions, issues, and feature requests are welcome. Open a pull request or report a bug via the issues page.
This work has been funded by the A2-PRIVCOMP project through the Spanish Government's PRTR funds, part of the European Union NextGenerationEU recovery plan.
If you use this repository in your research, please cite:
Hernandez M, Osorio-Marulanda PA, Catalina M, Loinaz L, Epelde G and Aginako N (2025). Comprehensive evaluation framework for synthetic tabular data in health: fidelity, utility and privacy analysis of generative models with and without privacy guarantees. Front. Digit. Health 7:1576290. doi: 10.3389/fdgth.2025.1576290
