This repository contains the code and tutorial for "Twin-2K-500: A dataset for building digital twins of over 2,000 people based on their answers to over 500 questions". The Twin-2K-500 dataset is designed to support the benchmarking and advancement of LLM-based persona simulation methods.
- Dataset: Twin-2K-500
If you use this dataset, please cite:

```bibtex
@article{toubia2025twin2k500,
  title   = {Twin-2K-500: A dataset for building digital twins of over 2,000 people based on their answers to over 500 questions},
  author  = {Toubia, Olivier and Gui, George Z. and Peng, Tianyi and Merlau, Daniel J. and Li, Ang and Chen, Haozhe},
  journal = {arXiv preprint arXiv:2505.17479},
  year    = {2025}
}
```
Before getting started, we highly recommend reviewing our Documentation for detailed information about the dataset and tutorials for various use cases.
The digital twin simulation system creates virtual representations of individuals based on their survey responses and simulates their behavior in response to new survey questions. The system uses LLMs to generate realistic responses that maintain consistency with the original persona profiles.
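Conceptually, a single simulation step pairs a persona's survey history with a new question and asks an LLM to answer in character. A minimal sketch of the prompt-construction idea follows; the function and field names here are illustrative, not the repository's actual API:

```python
def build_twin_prompt(persona_text: str, question: str, options: list[str]) -> str:
    """Compose a persona-conditioned prompt for an LLM call.

    persona_text: the twin's original survey answers, rendered as text.
    question/options: the new survey item to simulate.
    """
    option_lines = "\n".join(f"- {o}" for o in options)
    return (
        "You are simulating the person described below. Answer the new "
        "question as they would, staying consistent with their profile.\n\n"
        f"=== PERSONA ===\n{persona_text}\n\n"
        f"=== QUESTION ===\n{question}\n"
        f"Options:\n{option_lines}\n"
        "Answer with exactly one option."
    )

# The resulting string would then be sent to an LLM (e.g., via a chat
# completions API) by the simulation code.
prompt = build_twin_prompt(
    "Age: 34. Reports exercising daily and avoiding sugary drinks.",
    "How often do you drink soda?",
    ["Never", "Sometimes", "Often"],
)
```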
```
.
├── text_simulation/                      # Main simulation code
│   ├── configs/                          # Configuration files
│   ├── text_personas/                    # Persona profile data
│   ├── text_questions/                   # Survey questions
│   ├── text_simulation_input/            # Combined input files
│   └── text_simulation_output/           # Simulation results
├── evaluation/                           # Evaluation code
├── notebooks/                            # Demo notebooks
│   ├── demo_simple_simulation.ipynb      # Quick start: simulate responses to new questions
│   └── demo_full_pipeline.ipynb          # Complete pipeline with evaluation (alternative to shell scripts)
├── scripts/                              # Utility scripts
├── data/                                 # Raw data
└── cache/                                # Cached data
```
- Persona Processing
  - `convert_persona_to_text.py`: Converts persona data to text format
  - `batch_convert_personas.py`: Batch-processes multiple personas
- Question Processing
  - `convert_question_json_to_text.py`: Converts question data to text format
- Simulation
  - `create_text_simulation_input.py`: Combines personas with questions
  - `run_LLM_simulations.py`: Runs the actual LLM simulations
  - `llm_helper.py`: Helper functions for LLM interactions
  - `postprocess_responses.py`: Processes and analyzes simulation results
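Taken together, these scripts form a linear pipeline: personas and questions are converted to text, crossed into simulation inputs, run through the LLM, and post-processed. A rough sketch of the combination step; the data shapes are assumptions for illustration, not the scripts' actual formats:

```python
def combine_inputs(personas: dict[str, str], questions: list[str]) -> list[dict]:
    """Cross each persona with each question, mirroring conceptually
    what create_text_simulation_input.py does."""
    return [
        {"persona_id": pid, "persona_text": ptext, "question": q}
        for pid, ptext in personas.items()
        for q in questions
    ]

inputs = combine_inputs(
    {"p001": "Enjoys hiking...", "p002": "Works night shifts..."},
    ["Do you prefer mornings or evenings?"],
)
# One record per (persona, question) pair: 2 personas x 1 question = 2 records.
```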
- Python 3.11.7 or higher
- Poetry for dependency management
- Clone the repository:

  ```
  git clone [repository-url]
  cd digital-twin-simulation
  ```

- Install dependencies using Poetry:

  ```
  poetry install
  ```

For a quick introduction to digital twin simulation, try our interactive demo notebook:

```
jupyter notebook notebooks/demo_simple_simulation.ipynb
```

This notebook demonstrates:
- Loading persona summaries directly from the Hugging Face dataset (no setup required!)
- Creating custom survey questions
- Simulating responses using GPT-4.1 mini
- Running batch simulations for multiple personas
- Automatic package installation and API key configuration
- Works seamlessly in both local environments and Google Colab
Perfect for researchers who want to quickly test new survey questions on digital twins without complex setup.
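The custom survey questions mentioned above are ordinary survey items expressed as data. A plausible in-notebook representation is sketched below; this exact schema is illustrative, and the notebook defines its own:

```python
# Hypothetical schema for a custom multiple-choice survey question.
custom_question = {
    "question_id": "q_custom_01",
    "text": "How likely are you to try a new product recommended by a friend?",
    "type": "likert",
    "options": [
        "Very unlikely", "Unlikely", "Neutral", "Likely", "Very likely",
    ],
}

def render_question(q: dict) -> str:
    """Render the question as plain text for inclusion in an LLM prompt."""
    opts = "\n".join(f"{i + 1}. {o}" for i, o in enumerate(q["options"]))
    return f"{q['text']}\n{opts}"
```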
For those who prefer Jupyter notebooks over shell scripts, we provide a complete pipeline walkthrough:
```
jupyter notebook notebooks/demo_full_pipeline.ipynb
```

This notebook covers the entire workflow from data preparation to evaluation, making it an excellent alternative to the shell-script approach described below.
To run the complete digital twin simulation pipeline:
- Prepare the Data: First, download the necessary dataset:

  ```
  poetry run python download_dataset.py
  ```

- Configure API Access: Set the `OPENAI_API_KEY` environment variable to enable LLM interactions. Create a file named `.env` in the project's root directory and add your API key as follows:

  ```
  OPENAI_API_KEY=your_actual_api_key_here
  ```

  Replace `your_actual_api_key_here` with your valid OpenAI API key.

- Run the Simulation Pipeline: Execute the main simulation pipeline using the provided shell scripts. You can run a small test with a limited number of personas or simulate all available personas.

  - For a small test run (e.g., 5 personas):

    ```
    ./scripts/run_pipeline.sh --max_personas=5
    ```

  - To run the simulation for all 2,058 personas:

    ```
    ./scripts/run_pipeline.sh
    ```

- Evaluate the Results: After running the simulations, evaluate the results using:

  ```
  ./scripts/run_evaluation_pipeline.sh
  ```
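The evaluation pipeline compares simulated answers against the twins' actual held-out responses. For a quick sanity check of your own, a simple exact-match accuracy over a results file might look like the sketch below; the JSONL field names are assumptions about the output format, so adapt them to the actual schema:

```python
import json

def exact_match_accuracy(path: str) -> float:
    """Fraction of simulated answers that exactly match the human answer.

    Assumes one JSON object per line with 'simulated_answer' and
    'human_answer' fields (hypothetical names).
    """
    total = correct = 0
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            total += 1
            correct += rec["simulated_answer"] == rec["human_answer"]
    return correct / total if total else 0.0
```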