
🥳 MuBench: Assessment of Multilingual Capabilities of Large Language Models

📊 Dataset: https://huggingface.co/datasets/aialt/MuBench

MuBench is a meta-dataset for evaluating the multilingual capabilities of large language models (LLMs) across 61 languages and 3.9M aligned samples.
It provides a unified framework to assess understanding, reasoning, factual knowledge, and truthfulness in both single-language and code-switched settings.


🌍 Key Features

  • 61 languages covering over 60% of the world's native speakers
  • 12 core benchmarks across 6 ability dimensions
  • Cross-lingual alignment ensuring one-to-one comparability across languages
  • Code-switched variants for mixed-language evaluation
  • Rigorous data pipeline including translation, back-translation, semantic and cultural validation
  • Human evaluation of 34k samples across 17 languages
  • New metric: Multilingual Consistency (MLC), for analyzing cross-lingual performance stability

📚 Task Coverage

| Category | Representative Datasets |
|---|---|
| Natural Language Understanding | SNLI, MultiNLI, WinoGrande |
| Commonsense Reasoning | HellaSwag, StoryCloze |
| Knowledge-based QA | MMLU, MMLU-Pro |
| Academic & Technical Reasoning | ARC-Easy, ARC-Challenge, GPQA |
| Factual Recall | BMLAMA |
| Truthfulness | TruthfulQA |

🧾 Dataset Naming Convention

Each dataset file in MuBench follows the naming format:

{dataset}_{mode}_{lang}

where:

  • dataset ∈ {SNLIDataset, MNLIDataset, StoryClozeDataset, WinoGrandeDataset, MMLUDataset, MMLUProDataset, BMLAMADataset, HellaswagDataset, ARCEasyDataset, ARCChallengeDataset, GPQADataset}
  • mode specifies the evaluation variant:
    • en_template: English instruction prompt with localized content (improves model instruction-following consistency)
    • local_template: Fully localized prompt and content in the target language
    • lighteval: Reformatted for cloze-style evaluation harnesses
    • mix: Code-switched version mixing components from other languages
    • mix_lighteval: Code-switched version in cloze format

For mix and mix_lighteval, the suffix _[int] denotes the maximum number of non-English languages introduced in each sample:

  • Typically _2 for all datasets
  • _8 for bmlama, reflecting its multi-fact and high-entropy composition
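The naming scheme above can be sketched as a small helper. This is an illustrative sketch, not MuBench code: the function name, the validation, and the assumption that the `_[int]` suffix attaches at the end of the name are all ours.

```python
# Illustrative helper composing MuBench file names of the form
# {dataset}_{mode}_{lang}, with an optional trailing _[int] for
# code-switched modes. Names and checks are assumptions, not MuBench code.

DATASETS = {
    "SNLIDataset", "MNLIDataset", "StoryClozeDataset", "WinoGrandeDataset",
    "MMLUDataset", "MMLUProDataset", "BMLAMADataset", "HellaswagDataset",
    "ARCEasyDataset", "ARCChallengeDataset", "GPQADataset",
}
MODES = {"en_template", "local_template", "lighteval", "mix", "mix_lighteval"}


def mubench_name(dataset, mode, lang, max_langs=None):
    """Compose a MuBench file name per the documented convention."""
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset: {dataset}")
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    name = f"{dataset}_{mode}_{lang}"
    # Only the code-switched variants carry the max-languages suffix.
    if mode in {"mix", "mix_lighteval"} and max_langs is not None:
        name += f"_{max_langs}"
    return name


print(mubench_name("MMLUDataset", "en_template", "de"))
print(mubench_name("BMLAMADataset", "mix", "fr", max_langs=8))
```

Validating against the two fixed sets keeps typos in dataset or mode identifiers from silently producing file names that match nothing in the released data.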
