Windows Agent Arena (WAA) 🪟 is a scalable OS platform for testing and benchmarking multi-modal AI agents.
An agent benchmark with tasks in a simulated software company.
GTA (Guess The Algorithm) Benchmark - A tool for testing AI reasoning capabilities.
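A guess-the-algorithm task can be framed as program induction: the harness holds a hidden rule, reveals a few input/output examples, and scores the model's predictions on held-out inputs. The sketch below is a hypothetical minimal harness illustrating that idea; it does not reflect GTA's actual task format or API.

```python
# Hypothetical guess-the-algorithm harness (not GTA's real API).
# The hidden rule is known only to the harness; the "model" must
# infer it from examples and is scored on unseen inputs.

HIDDEN_RULE = lambda x: 2 * x + 1  # the algorithm to be inferred

def make_task(train_inputs, holdout_inputs):
    examples = [(x, HIDDEN_RULE(x)) for x in train_inputs]   # shown to the model
    targets = [(x, HIDDEN_RULE(x)) for x in holdout_inputs]  # kept hidden
    return examples, targets

def score(predict, targets):
    """predict: callable standing in for the model's inferred rule."""
    correct = sum(1 for x, y in targets if predict(x) == y)
    return correct / len(targets)

examples, targets = make_task([1, 2, 3, 4], holdout_inputs=[10, 25, 100])
# A model that correctly induced "double and add one" scores 1.0:
print(score(lambda x: 2 * x + 1, targets))
```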
PlayBench is a platform that evaluates AI models by having them compete in various games and creative tasks. Unlike traditional benchmarks that focus on text generation quality or factual knowledge, PlayBench tests models on skills like strategic thinking, pattern recognition, and creative problem-solving.
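The core of a competitive evaluation like this is pitting two policies against each other in repeated rounds of a game and ranking them by outcome rather than by static answer quality. Below is a hypothetical, much-simplified sketch of that pattern using rock-paper-scissors; PlayBench's actual games, agents, and scoring are richer and are not shown here.

```python
# Sketch of head-to-head game evaluation (illustrative only, not
# PlayBench's API): two policies play repeated rounds and are
# compared by wins. A real harness would back each policy with an LLM.
import random

BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

def play_match(policy_a, policy_b, rounds=100):
    wins = {"a": 0, "b": 0}
    for _ in range(rounds):
        move_a, move_b = policy_a(), policy_b()
        if move_a != move_b:  # ties score for neither side
            wins["a" if BEATS[move_a] == move_b else "b"] += 1
    return wins

# Stand-in policies in place of model-backed players:
always_rock = lambda: "rock"
uniform_random = lambda: random.choice(list(BEATS))

print(play_match(always_rock, uniform_random))
```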
MindTrial: Evaluate and compare AI language models (LLMs) on text-based tasks with optional file/image attachments. Supports multiple providers (OpenAI, Google, Anthropic, DeepSeek), custom tasks in YAML, and HTML/CSV reports.
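Defining tasks in YAML means a task is just a small document the harness loads, validates, and grades against. The sketch below shows one plausible shape for such a task and an exact-match grader; the field names (`name`, `prompt`, `expected`) are illustrative assumptions, not MindTrial's documented schema.

```python
# Sketch of loading and grading a YAML-defined text task (field names
# are assumptions, not MindTrial's schema). Requires: pip install pyyaml
import yaml

TASK_YAML = """
name: capital-city
prompt: "What is the capital of France? Answer with one word."
expected: Paris
"""

def load_task(text):
    task = yaml.safe_load(text)
    for field in ("name", "prompt", "expected"):
        if field not in task:
            raise ValueError(f"task is missing required field: {field}")
    return task

def grade(task, model_answer):
    # Exact-match grading; real harnesses often normalize answers
    # or use a judge model instead.
    return model_answer.strip().lower() == task["expected"].strip().lower()

task = load_task(TASK_YAML)
print(task["name"], grade(task, "Paris"))
```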