This repository contains the code and tutorial for "Twin-2K-500: A dataset for building digital twins of over 2,000 people based on their answers to over 500 questions". The Twin-2K-500 dataset is designed to support the benchmarking and advancement of LLM-based persona simulation methods.
- Dataset: Twin-2K-500
If you use this dataset, please cite:

```bibtex
@article{toubia2025twin2k500,
  title   = {Twin-2K-500: A dataset for building digital twins of over 2,000 people based on their answers to over 500 questions},
  author  = {Toubia, Olivier and Gui, George Z. and Peng, Tianyi and Merlau, Daniel J. and Li, Ang and Chen, Haozhe},
  journal = {arXiv preprint arXiv:2505.17479},
  year    = {2025}
}
```
Before getting started, we highly recommend reviewing our Documentation for detailed information about the dataset and tutorials for various use cases.
The digital twin simulation system creates virtual representations of individuals based on their survey responses and simulates their behavior in response to new survey questions. The system uses LLMs to generate realistic responses that maintain consistency with the original persona profiles.
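Conceptually, a single simulation step pairs a persona's survey history with a new question and asks an LLM to answer in character. A minimal sketch of the prompt-construction idea follows; the function and field names here are illustrative, not the repository's actual API:

```python
def build_twin_prompt(persona_text: str, question: str, options: list[str]) -> str:
    """Compose a persona-conditioned prompt for an LLM call.

    persona_text: the twin's original survey answers, rendered as text.
    question/options: the new survey item to simulate.
    """
    option_lines = "\n".join(f"- {o}" for o in options)
    return (
        "You are simulating the person described below. Answer the new "
        "question as they would, staying consistent with their profile.\n\n"
        f"=== PERSONA ===\n{persona_text}\n\n"
        f"=== QUESTION ===\n{question}\n"
        f"Options:\n{option_lines}\n"
        "Answer with exactly one option."
    )

# The resulting string would then be sent to an LLM (e.g., via a chat
# completions API) by the simulation code.
prompt = build_twin_prompt(
    "Age: 34. Reports exercising daily and avoiding sugary drinks.",
    "How often do you drink soda?",
    ["Never", "Sometimes", "Often"],
)
```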
```
.
├── text_simulation/                      # Main simulation code
│   ├── configs/                          # Configuration files
│   ├── text_personas/                    # Persona profile data
│   ├── text_questions/                   # Survey questions
│   ├── text_simulation_input/            # Combined input files
│   └── text_simulation_output/           # Simulation results
├── evaluation/                           # Evaluation code
├── notebooks/                            # Demo notebooks
│   ├── demo_simple_simulation.ipynb      # Quick start: simulate responses to new questions
│   └── demo_full_pipeline.ipynb          # Complete pipeline with evaluation (alternative to shell scripts)
├── scripts/                              # Utility scripts
├── data/                                 # Raw data
└── cache/                                # Cached data
```
- Persona Processing
  - `convert_persona_to_text.py`: Converts persona data to text format
  - `batch_convert_personas.py`: Batch-processes multiple personas
- Question Processing
  - `convert_question_json_to_text.py`: Converts question data to text format
- Simulation
  - `create_text_simulation_input.py`: Combines personas with questions
  - `run_LLM_simulations.py`: Runs the actual LLM simulations
  - `llm_helper.py`: Helper functions for LLM interactions
  - `postprocess_responses.py`: Processes and analyzes simulation results
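Taken together, these scripts form a linear pipeline: personas and questions are converted to text, crossed into simulation inputs, run through the LLM, and post-processed. A rough sketch of the combination step; the data shapes are assumptions for illustration, not the scripts' actual formats:

```python
def combine_inputs(personas: dict[str, str], questions: list[str]) -> list[dict]:
    """Cross each persona with each question, mirroring conceptually
    what create_text_simulation_input.py does."""
    return [
        {"persona_id": pid, "persona_text": ptext, "question": q}
        for pid, ptext in personas.items()
        for q in questions
    ]

inputs = combine_inputs(
    {"p001": "Enjoys hiking...", "p002": "Works night shifts..."},
    ["Do you prefer mornings or evenings?"],
)
# One record per (persona, question) pair: 2 personas x 1 question = 2 records.
```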
- Python 3.11.7 or higher
- Poetry for dependency management
- Clone the repository:

  ```
  git clone [repository-url]
  cd digital-twin-simulation
  ```

- Install dependencies using Poetry:

  ```
  poetry install
  ```

For a quick introduction to digital twin simulation, try our interactive demo notebook:

```
jupyter notebook notebooks/demo_simple_simulation.ipynb
```

This notebook demonstrates:
- Loading persona summaries directly from the Hugging Face dataset (no setup required!)
- Creating custom survey questions
- Simulating responses using GPT-4.1 mini
- Running batch simulations for multiple personas
- Automatic package installation and API key configuration
- Works seamlessly in both local environments and Google Colab
Perfect for researchers who want to quickly test new survey questions on digital twins without complex setup.
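The custom survey questions mentioned above are ordinary survey items expressed as data. A plausible in-notebook representation is sketched below; this exact schema is illustrative, and the notebook defines its own:

```python
# Hypothetical schema for a custom multiple-choice survey question.
custom_question = {
    "question_id": "q_custom_01",
    "text": "How likely are you to try a new product recommended by a friend?",
    "type": "likert",
    "options": [
        "Very unlikely", "Unlikely", "Neutral", "Likely", "Very likely",
    ],
}

def render_question(q: dict) -> str:
    """Render the question as plain text for inclusion in an LLM prompt."""
    opts = "\n".join(f"{i + 1}. {o}" for i, o in enumerate(q["options"]))
    return f"{q['text']}\n{opts}"
```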
For those who prefer Jupyter notebooks over shell scripts, we provide a complete pipeline walkthrough:
```
jupyter notebook notebooks/demo_full_pipeline.ipynb
```

This notebook covers the entire workflow from data preparation to evaluation, making it an excellent alternative to the shell-script approach described below.
To run the complete digital twin simulation pipeline:
- Prepare the Data: First, download the necessary dataset:

  ```
  poetry run python download_dataset.py
  ```

- Configure API Access: Set the `OPENAI_API_KEY` environment variable to enable LLM interactions. Create a file named `.env` in the project's root directory and add your API key as follows:

  ```
  OPENAI_API_KEY=your_actual_api_key_here
  ```

  Replace `your_actual_api_key_here` with your valid OpenAI API key.

- Run the Simulation Pipeline: Execute the main simulation pipeline using the provided shell scripts. You can run a small test with a limited number of personas or simulate all available personas.

  - For a small test run (e.g., 5 personas):

    ```
    ./scripts/run_pipeline.sh --max_personas=5
    ```

  - To run the simulation for all 2,058 personas:

    ```
    ./scripts/run_pipeline.sh
    ```

- Evaluate the Results: After running the simulations, evaluate the results using:

  ```
  ./scripts/run_evaluation_pipeline.sh
  ```
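The evaluation pipeline compares simulated answers against the twins' actual held-out responses. For a quick sanity check of your own, a simple exact-match accuracy over a results file might look like the sketch below; the JSONL field names are assumptions about the output format, so adapt them to the actual schema:

```python
import json

def exact_match_accuracy(path: str) -> float:
    """Fraction of simulated answers that exactly match the human answer.

    Assumes one JSON object per line with 'simulated_answer' and
    'human_answer' fields (hypothetical names).
    """
    total = correct = 0
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            total += 1
            correct += rec["simulated_answer"] == rec["human_answer"]
    return correct / total if total else 0.0
```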