local_platinum_bench/CHANGELOG.md at main · IST-DASLab/local_platinum_bench

Changelog

2025-04-14

We added GPT-4.1, GPT-4.5, Grok 3 beta, Grok 3 mini beta, Gemini 2.5 Pro, and Llama 4 Maverick to our leaderboard. Following our data cleaning process, we reviewed the labels of examples on which any of these models failed.

2025-02-26

We added Claude 3.7 Sonnet and Claude 3.7 Sonnet (Thinking Mode) to our leaderboard. Following our data cleaning process, we reviewed the labels of examples on which either of these models failed.

2025-02-13

We added Gemini 2.0 Pro and Qwen2.5-Max to our leaderboard, and reviewed the labels of any examples on which either of these models failed that had not yet been reviewed.

2025-02-06

This is the original release of the benchmark.
