
🥳 MuBench: Assessment of Multilingual Capabilities of Large Language Models

📊 Dataset: https://huggingface.co/datasets/aialt/MuBench

MuBench is a meta-dataset for evaluating the multilingual capabilities of large language models (LLMs) across 61 languages and 3.9M aligned samples.
It provides a unified framework to assess understanding, reasoning, factual knowledge, and truthfulness in both single-language and code-switched settings.


🌍 Key Features

  • 61 languages covering over 60% of the world's native speakers
  • 12 core benchmarks across 6 ability dimensions
  • Cross-lingual alignment ensuring one-to-one comparability across languages
  • Code-switched variants for mixed-language evaluation
  • Rigorous data pipeline including translation, back-translation, semantic and cultural validation
  • Human evaluation of 34k samples across 17 languages
  • New metric: Multilingual Consistency (MLC), for analyzing cross-lingual performance stability

📚 Task Coverage

| Category | Representative Datasets |
|---|---|
| Natural Language Understanding | SNLI, MultiNLI, WinoGrande |
| Commonsense Reasoning | HellaSwag, StoryCloze |
| Knowledge-based QA | MMLU, MMLU-Pro |
| Academic & Technical Reasoning | ARC-Easy, ARC-Challenge, GPQA |
| Factual Recall | BMLAMA |
| Truthfulness | TruthfulQA |

🧾 Dataset Naming Convention

Each dataset file in MuBench follows the naming format:

{dataset}_{mode}_{lang}

where:

  • dataset ∈ {SNLIDataset, MNLIDataset, StoryClozeDataset, WinoGrandeDataset, MMLUDataset, MMLUProDataset, BMLAMADataset, HellaswagDataset, ARCEasyDataset, ARCChallengeDataset, GPQADataset}
  • mode specifies the evaluation variant:
    • en_template: English instruction prompt with localized content (improves model instruction-following consistency)
    • local_template: Fully localized prompt and content in the target language
    • lighteval: Reformatted for cloze-style evaluation harnesses
    • mix: Code-switched version mixing components from other languages
    • mix_lighteval: Code-switched version in cloze format

For mix and mix_lighteval, the suffix _[int] denotes the maximum number of non-English languages introduced in each sample:

  • Typically _2 for all datasets
  • _8 for bmlama, reflecting its multi-fact and high-entropy composition
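The naming scheme above can be sketched as a small helper. This is an illustrative sketch, not MuBench code: the function name, the validation, and the assumption that the `_[int]` suffix attaches at the end of the name are all ours.

```python
# Illustrative helper composing MuBench file names of the form
# {dataset}_{mode}_{lang}, with an optional trailing _[int] for
# code-switched modes. Names and checks are assumptions, not MuBench code.

DATASETS = {
    "SNLIDataset", "MNLIDataset", "StoryClozeDataset", "WinoGrandeDataset",
    "MMLUDataset", "MMLUProDataset", "BMLAMADataset", "HellaswagDataset",
    "ARCEasyDataset", "ARCChallengeDataset", "GPQADataset",
}
MODES = {"en_template", "local_template", "lighteval", "mix", "mix_lighteval"}


def mubench_name(dataset, mode, lang, max_langs=None):
    """Compose a MuBench file name per the documented convention."""
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset: {dataset}")
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    name = f"{dataset}_{mode}_{lang}"
    # Only the code-switched variants carry the max-languages suffix.
    if mode in {"mix", "mix_lighteval"} and max_langs is not None:
        name += f"_{max_langs}"
    return name


print(mubench_name("MMLUDataset", "en_template", "de"))
print(mubench_name("BMLAMADataset", "mix", "fr", max_langs=8))
```

Validating against the two fixed sets keeps typos in dataset or mode identifiers from silently producing file names that match nothing in the released data.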
