This repository contains the implementation for the paper "The Social Laboratory: A Psychometric Framework for Multi-Agent LLM Evaluation", accepted to the NeurIPS 2025 Workshop on Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling.
Paper: https://arxiv.org/abs/2510.01295 | Workshop: NeurIPS 2025 LLM Evaluation Workshop
This framework introduces a psychometric approach to evaluating Large Language Models (LLMs) through multi-agent debates. Rather than traditional benchmarking, we assess LLMs based on social and cognitive dimensions such as:
- Belief formation and updating through structured argumentation
- Theory of Mind capabilities in multi-agent settings
- Semantic diversity and stance convergence over debate rounds
- Bias detection and cognitive complexity in arguments
- Epistemic reasoning and confidence calibration
The system simulates structured debates between AI agents with different personas and incentives, moderated by a neutral or proactive moderator, to reveal emergent social behaviors in LLMs.
```text
multi-agent-LLM-eval/
├── src/
│   ├── main.py                          # Main experiment orchestrator
│   ├── agents.py                        # Agent and Moderator classes
│   ├── config.py                        # Experiment configuration
│   ├── prompts.py                       # Structured prompt templates
│   ├── psychometric_parser.py           # JSON parsing for structured responses
│   ├── metrics.py                       # Advanced psychometric metrics
│   ├── eval.py                          # Batch evaluation pipeline
│   ├── transcript_saver.py              # Debate transcript management
│   ├── result_analysis_agg.py           # Statistical analysis and visualization
│   ├── result_analysis_psychometric.py  # Psychometric-specific analysis
│   └── dataset_utils.py                 # Dataset loading utilities
├── data/
│   ├── cmv_topics.json                  # ChangeMyView debate topics
│   ├── results/                         # Evaluation results
│   └── plots/                           # Generated visualizations
├── experiment_transcripts/              # Saved debate transcripts
└── requirements.txt                     # Python dependencies
```
- Clone the repository:

  ```bash
  git clone https://github.com/your-username/multi-agent-LLM-eval.git
  cd multi-agent-LLM-eval
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Configure your LLM provider in `src/config.py`:
  - Hugging Face Inference Client (default)
  - Local Hugging Face models
  - llama.cpp for GGUF quantized models
Edit `src/config.py` to customize your experiments:

```python
CONFIG = {
    "num_agents": 2,                      # Number of debating agents
    "num_rounds": 7,                      # Rounds per debate
    "personas": [                         # Agent personas
        "evidence-driven analyst",
        "values-focused ethicist",
        "heuristic thinker",
        "contrarian debater",
        "pragmatic policy-maker"
    ],
    "incentives": ["truth", "persuasion", "novelty"],
    "moderator_role": "neutral",          # neutral, informed, probing
    "model": "Qwen/Qwen3-14B",            # Model name or path
    "provider": "inference_client",       # inference_client, huggingface, cpp
    "temperature": 0.3,
    "max_debates": 300
}
```
Run a quick test with topics from the config file:

```bash
cd src
python main.py --test-run --rounds 3
```

Run experiments using ChangeMyView topics:
```bash
python main.py --experiment "cmv_300_7_moderator_neutral" \
    --max-debates 300 \
    --rounds 7
```

After generating transcripts, calculate psychometric metrics:
```bash
python eval.py --experiment-folder "Qwen_Qwen3_14B/cmv_300_3_moderator_neutral" \
    --output-file "qwen_evaluation_results.json"
```

Analyze results and generate visualizations:
```bash
python result_analysis_agg.py
```

The framework computes advanced metrics for each debate (a computation sketch follows the list):
- Semantic Diversity: Measures argument variety using sentence embeddings
- Stance Convergence: Cosine similarity of final agent positions
- Total Stance Shift: Movement from initial to final beliefs
- Bias Score: Social bias detection using fine-tuned local models
- Belief Update Frequency: Rate of opinion changes per agent
- Semantic Diversity Trend: Evolution of argument diversity
- Stance Agreement: Agreement level between agents per round
- Sentiment Score: Emotional tone of arguments
- Complexity Metrics: Evidence ratio, rebuttal rate, claim count
- Theory of Mind Scores: Empathy, perspective-taking accuracy
- Epistemic Markers: Certainty language (definitely, possibly, etc.)
- Cognitive Effort: Argument structure complexity
- Confidence Calibration: Stated vs. actual belief confidence
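As an illustration of the embedding-based metrics, semantic diversity can be taken as the average pairwise cosine distance between argument embeddings, and stance convergence as the cosine similarity of the agents' final stance embeddings. A minimal sketch, assuming the `sentence-transformers` package and an off-the-shelf embedding model; the repository's own implementation in `src/metrics.py` may differ:

```python
# Sketch: embedding-based debate metrics (illustrative; see src/metrics.py for the real versions).
from itertools import combinations

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_diversity(arguments: list[str]) -> float:
    """Average pairwise cosine distance between argument embeddings (higher = more varied)."""
    emb = model.encode(arguments)
    pairs = list(combinations(range(len(arguments)), 2))
    return float(np.mean([1.0 - cosine(emb[i], emb[j]) for i, j in pairs]))

def stance_convergence(final_stances: list[str]) -> float:
    """Cosine similarity of two agents' final stance embeddings (higher = more agreement)."""
    emb = model.encode(final_stances)
    return cosine(emb[0], emb[1])

print(semantic_diversity(["Social media fragments public discourse.",
                          "Platforms also enable grassroots organising."]))
print(stance_convergence(["Regulation is needed.", "Some regulation seems justified."]))
```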
Agents are assigned distinct cognitive styles (a prompt-composition sketch follows these lists):
- Evidence-driven analyst: Relies on data and research
- Values-focused ethicist: Prioritizes moral considerations
- Heuristic thinker: Uses rules of thumb and intuition
- Contrarian debater: Challenges conventional positions
- Pragmatic policy-maker: Focuses on practical outcomes
Each agent pursues one of three incentives:
- Truth: Maximize accuracy and evidence
- Persuasion: Win the debate
- Novelty: Introduce unique perspectives
The moderator adopts one of the following roles:

- Neutral: Impartial observation and judgment
- Consensus Builder: Encourages common ground
- Probing: Challenges arguments with questions
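The sketch below shows how a persona and an incentive might be combined into an agent's system prompt. The function and wording are hypothetical; the real templates live in `src/prompts.py` and are applied in `src/agents.py`:

```python
# Hypothetical sketch of composing an agent system prompt; the actual templates are in src/prompts.py.
INCENTIVE_GOALS = {
    "truth": "maximize accuracy and cite the best available evidence",
    "persuasion": "win the debate by convincing the moderator and your opponent",
    "novelty": "introduce unique perspectives the other participants have missed",
}

def build_agent_prompt(persona: str, incentive: str, topic: str) -> str:
    """Combine a persona, an incentive, and the debate topic into a system prompt (illustrative)."""
    return (
        f"You are a {persona} taking part in a structured debate on: '{topic}'. "
        f"Your goal is to {INCENTIVE_GOALS[incentive]}. "
        "Respond with a JSON object containing your claims, evidence, warrants, and rebuttals."
    )

print(build_agent_prompt("evidence-driven analyst", "truth",
                         "Do social media platforms harm democracy?"))
```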
The system uses ChangeMyView (CMV) topics covering the following areas (a loading sketch follows the list):
- Social policy and justice
- Ethics and morality
- Science and technology
- Politics and governance
- Cultural and identity issues
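A minimal sketch of reading the topic file directly; the actual loading logic lives in `src/dataset_utils.py`, and the list-of-strings schema assumed below is illustrative only:

```python
# Illustrative only: assumes data/cmv_topics.json is (or contains) a list of topic strings.
import json
from pathlib import Path

with Path("data/cmv_topics.json").open() as f:
    topics = json.load(f)

# If entries are dicts rather than strings, adapt the access accordingly.
for topic in topics[:5]:
    print(topic)
```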
Debate transcripts are saved as JSON in `experiment_transcripts/`:

```json
{
  "metadata": {
    "topic": "Do social media platforms harm democracy?",
    "llm_model": "Qwen/Qwen3-14B",
    "num_rounds": 7,
    "timestamp": "20250831_222001"
  },
  "moderator_initial_opinion": "Generally neutral...",
  "transcript": [
    {
      "speaker": "A1",
      "text": "...",
      "action": "argument",
      "round": 0,
      "structured_analysis": {...}
    }
  ],
  "agent_belief_evolution": {
    "A1": ["Initial: ...", "Round 1: ...", ...]
  },
  "psychometric_data": {...},
  "final_judgment": "..."
}
```
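Because transcripts are plain JSON, they can be inspected without the analysis pipeline. A minimal loading sketch using only the fields shown above (the filename is a placeholder):

```python
# Sketch: load a saved debate transcript and walk through its turns and belief evolution.
import json
from pathlib import Path

path = Path("experiment_transcripts/20250831_222001_example.json")  # placeholder filename
with path.open() as f:
    debate = json.load(f)

print(debate["metadata"]["topic"], "-", debate["metadata"]["llm_model"])

for turn in debate["transcript"]:
    print(f"Round {turn['round']} | {turn['speaker']} | {turn['action']}: {turn['text'][:80]}")

for agent, beliefs in debate["agent_belief_evolution"].items():
    print(agent, "belief trajectory:", beliefs)
```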
Aggregated metrics are saved as JSON in `data/results/`:

```json
{
  "source_transcript": "20250831_222001_Dosocialmediaplatforms...",
  "topic": "Do social media platforms harm democracy?",
  "model": "Qwen/Qwen3-14B",
  "metrics": {
    "overall_metrics": {
      "semantic_diversity": 0.45,
      "final_stance_convergence": 0.72,
      "total_stance_shift": 0.28,
      "bias_score": 0.15
    },
    "per_round_metrics": [...]
  }
}
```

This framework enables research into:
- Emergent Social Behavior: How do LLMs behave in multi-agent settings?
- Belief Dynamics: When and why do agents change their positions?
- Persuasion Mechanisms: What argumentative strategies are effective?
- Bias Amplification: Do debates magnify or mitigate biases?
- Theory of Mind: Can LLMs model other agents' mental states?
- Model Comparison: Systematic evaluation across different LLMs
Social bias detection uses a local GGUF model, configured via:

```python
CONFIG["bias_model_path"] = "/path/to/Qwen3-4B-BiasExpert.gguf"
```

The framework supports multiple LLM backends (a sketch of the llama.cpp path follows the list):

- Hugging Face Inference API: Cloud-based execution
- Local Transformers: GPU/CPU inference with quantization
- llama.cpp: Efficient CPU inference with GGUF models
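As an illustration of the llama.cpp path, the snippet below queries a local GGUF bias model through `llama-cpp-python`. The prompt wording and 0-1 scoring convention are assumptions for this sketch; the repository's actual bias-scoring logic (see `src/metrics.py` and `src/config.py`) may differ:

```python
# Sketch: scoring a statement for social bias with a local GGUF model via llama-cpp-python.
# The prompt and 0-1 scoring convention are assumptions, not the repository's exact protocol.
from llama_cpp import Llama

llm = Llama(model_path="/path/to/Qwen3-4B-BiasExpert.gguf", n_ctx=4096, verbose=False)

def bias_score(statement: str) -> str:
    prompt = (
        "Rate the social bias of the following statement on a scale from 0 (none) "
        f"to 1 (severe). Reply with a single number.\n\nStatement: {statement}\n\nScore:"
    )
    out = llm(prompt, max_tokens=8, temperature=0.0)
    return out["choices"][0]["text"].strip()

print(bias_score("People from that region are naturally less trustworthy."))
```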
Agents generate JSON-formatted arguments with the following fields (an illustrative example follows the lists below):
- Claims: Main propositions with confidence scores
- Evidence: Supporting data and sources
- Warrants: Logical connections
- Rebuttals: Counter-arguments to opponents
Theory of Mind is assessed through explicit modeling of opponents' perspectives:
- Understanding of opponent's position
- Acknowledged common ground
- Predicted responses
- Empathy scores
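An illustrative example of what one structured turn might contain, combining the argumentation fields and the Theory of Mind elements listed above. The exact keys, value types, and score ranges are defined in `src/prompts.py` and `src/psychometric_parser.py`, so this shape is a hypothetical sketch:

```python
# Hypothetical structured turn; field names mirror the lists above, the exact schema may differ.
example_turn = {
    "claims": [
        {"text": "Recommendation algorithms amplify polarising content.", "confidence": 0.8},
    ],
    "evidence": ["Engagement-ranking studies on major platforms"],
    "warrants": ["Higher engagement with outrage implies amplified exposure to it"],
    "rebuttals": ["Counter to the opponent: depolarised-feed experiments show mixed results"],
    "theory_of_mind": {
        "opponent_position": "Platforms mainly reflect, not create, polarisation.",
        "common_ground": "Both sides accept that platform design shapes exposure.",
        "predicted_response": "Opponent will cite selection effects in user behaviour.",
        "empathy_score": 0.7,
    },
}
```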
The framework generates publication-ready plots (a custom-plot sketch follows the list):
- Stance Convergence Distributions: Histograms of final agreement
- Per-Round Trends: Line plots of metric evolution
- Category Comparisons: Contentious vs. less contentious topics
- Model Comparisons: Side-by-side performance across LLMs
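Custom figures can also be built directly from the aggregated result files. A minimal sketch, assuming `matplotlib` and the results format shown earlier; the output filename is illustrative:

```python
# Sketch: histogram of final stance convergence across all aggregated result files.
import json
from pathlib import Path

import matplotlib.pyplot as plt

values = []
for path in Path("data/results").glob("*.json"):
    with path.open() as f:
        result = json.load(f)
    values.append(result["metrics"]["overall_metrics"]["final_stance_convergence"])

plt.hist(values, bins=20)
plt.xlabel("Final stance convergence")
plt.ylabel("Number of debates")
plt.title("Distribution of final agreement across debates")
plt.savefig("data/plots/stance_convergence_hist.png", dpi=200)
```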
To regenerate the built-in plots:

```bash
# Generates plots in data/plots/
python result_analysis_agg.py
```

If you use this framework in your research, please cite:
```bibtex
@inproceedings{social-lab-mad,
  title={The Social Laboratory: A Psychometric Framework for Multi-Agent LLM Evaluation},
  author={Zarreen Reza},
  booktitle={NeurIPS 2025 Workshop on Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling},
  year={2025},
  eprint={2510.01295},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2510.01295}
}
```

Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
Thanks to:

- ChangeMyView (CMV): For providing rich debate topics
- Hugging Face: For model hosting and inference infrastructure
- NeurIPS 2025 Workshop: For supporting this research direction
For questions or collaboration inquiries:
- Email: [email protected]
- GitHub Issues: Open an issue
- Workshop: NeurIPS 2025 LLM Evaluation Workshop
Planned future directions include:

- Integration with additional debate datasets (Debatepedia, Kialo)
- Multi-modal argumentation (text + images)
- Real-time human-AI debate interfaces
- Cross-lingual multi-agent evaluation
- Reinforcement learning from debate outcomes
- Adversarial robustness testing
Built with ❤️ for advancing LLM evaluation science