We added GPT-4.1, GPT-4.5, Grok 3 beta, Grok 3 mini beta, Gemini 2.5 Pro, and Llama 4 Maverick to our leaderboard. Following our data cleaning process, we reviewed the labels of examples on which any of these models failed.
We added Claude 3.7 Sonnet and Claude 3.7 Sonnet (Thinking Mode) to our leaderboard. Following our data cleaning process, we reviewed the labels of examples on which either of these models failed.
We added Gemini 2.0 Pro and Qwen2.5-Max to our leaderboard and reviewed the labels of any examples on which either of these models failed that had not yet been reviewed.
This is the original release of the benchmark.