Changelog

2025-04-14

We added GPT-4.1, GPT-4.5, Grok 3 beta, Grok 3 mini beta, Gemini 2.5 Pro, and Llama 4 Maverick to our leaderboard. Following our data cleaning process, we reviewed the labels of examples on which any of these models failed.

2025-02-26

We added Claude 3.7 Sonnet and Claude 3.7 Sonnet (Thinking Mode) to our leaderboard. Following our data cleaning process, we reviewed the labels of examples on which either of these models failed.

2025-02-13

We added Gemini 2.0 Pro and Qwen2.5-Max to our leaderboard and reviewed the labels of any examples on which either of these models failed that had not yet been reviewed.

2025-02-06

This is the original release of the benchmark.