Windows Agent Arena (WAA) 🪟 is a scalable OS platform for testing and benchmarking multi-modal AI agents.
An agent benchmark with tasks in a simulated software company.
GTA (Guess The Algorithm) Benchmark - A tool for testing AI reasoning capabilities.
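A guess-the-algorithm task can be framed as program induction: the harness holds a hidden rule, reveals a few input/output examples, and scores the model's predictions on held-out inputs. The sketch below is a hypothetical minimal harness illustrating that idea; it does not reflect GTA's actual task format or API.

```python
# Hypothetical guess-the-algorithm harness (not GTA's real API).
# The hidden rule is known only to the harness; the "model" must
# infer it from examples and is scored on unseen inputs.

HIDDEN_RULE = lambda x: 2 * x + 1  # the algorithm to be inferred

def make_task(train_inputs, holdout_inputs):
    examples = [(x, HIDDEN_RULE(x)) for x in train_inputs]   # shown to the model
    targets = [(x, HIDDEN_RULE(x)) for x in holdout_inputs]  # kept hidden
    return examples, targets

def score(predict, targets):
    """predict: callable standing in for the model's inferred rule."""
    correct = sum(1 for x, y in targets if predict(x) == y)
    return correct / len(targets)

examples, targets = make_task([1, 2, 3, 4], holdout_inputs=[10, 25, 100])
# A model that correctly induced "double and add one" scores 1.0:
print(score(lambda x: 2 * x + 1, targets))
```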
PlayBench is a platform that evaluates AI models by having them compete in various games and creative tasks. Unlike traditional benchmarks that focus on text generation quality or factual knowledge, PlayBench tests models on skills like strategic thinking, pattern recognition, and creative problem-solving.
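The core of a competitive evaluation like this is pitting two policies against each other in repeated rounds of a game and ranking them by outcome rather than by static answer quality. Below is a hypothetical, much-simplified sketch of that pattern using rock-paper-scissors; PlayBench's actual games, agents, and scoring are richer and are not shown here.

```python
# Sketch of head-to-head game evaluation (illustrative only, not
# PlayBench's API): two policies play repeated rounds and are
# compared by wins. A real harness would back each policy with an LLM.
import random

BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

def play_match(policy_a, policy_b, rounds=100):
    wins = {"a": 0, "b": 0}
    for _ in range(rounds):
        move_a, move_b = policy_a(), policy_b()
        if move_a != move_b:  # ties score for neither side
            wins["a" if BEATS[move_a] == move_b else "b"] += 1
    return wins

# Stand-in policies in place of model-backed players:
always_rock = lambda: "rock"
uniform_random = lambda: random.choice(list(BEATS))

print(play_match(always_rock, uniform_random))
```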
MindTrial: Evaluate and compare AI language models (LLMs) on text-based tasks with optional file/image attachments. Supports multiple providers (OpenAI, Google, Anthropic, DeepSeek), custom tasks in YAML, and HTML/CSV reports.
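Defining tasks in YAML means a task is just a small document the harness loads, validates, and grades against. The sketch below shows one plausible shape for such a task and an exact-match grader; the field names (`name`, `prompt`, `expected`) are illustrative assumptions, not MindTrial's documented schema.

```python
# Sketch of loading and grading a YAML-defined text task (field names
# are assumptions, not MindTrial's schema). Requires: pip install pyyaml
import yaml

TASK_YAML = """
name: capital-city
prompt: "What is the capital of France? Answer with one word."
expected: Paris
"""

def load_task(text):
    task = yaml.safe_load(text)
    for field in ("name", "prompt", "expected"):
        if field not in task:
            raise ValueError(f"task is missing required field: {field}")
    return task

def grade(task, model_answer):
    # Exact-match grading; real harnesses often normalize answers
    # or use a judge model instead.
    return model_answer.strip().lower() == task["expected"].strip().lower()

task = load_task(TASK_YAML)
print(task["name"], grade(task, "Paris"))
```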