Simil-Eval is a comprehensive toolkit for evaluating Large Language Models (LLMs) using interpretability-based metrics. Unlike traditional evaluation suites that rely solely on exact matching or accuracy, Simil-Eval incorporates embedding-based similarity and surprisal (perplexity) to provide deeper insights into model behavior—particularly for low-resource languages.
This repository accompanies the paper Continued Pretraining and Interpretability-Based Evaluation for Low-Resource Languages: A Galician Case Study (Rodríguez et al., Findings of ACL 2025).
- Methodology & Metrics
- Setup
- Quick Start (Local Usage)
- Cluster Deployment (SLURM)
- Available Tasks, Datasets & Supported Metrics
- Exporting Results
- Citation
Simil-Eval evaluates LLMs from two complementary perspectives:
- Surprisal → intrinsic knowledge and linguistic competence (no generation required)
- Similarity → semantic quality of generated text
Surprisal-based metrics evaluate how expected a given text is according to a language model.
Surprisal $S(x)$ measures the negative log-probability of a token sequence:

$$S(x) = -\sum_{t} \log P(x_t \mid x_{<t})$$

Lower values mean that the model recognises the text as natural or expected; higher values indicate that the model is unfamiliar with the language, structure, or facts. This metric is particularly useful for linguistic acceptability and commonsense reasoning tasks.
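To make the definition concrete, here is a minimal sketch of computing $S(x)$ with Hugging Face `transformers` (an illustration, not Simil-Eval's own implementation; the model name is a placeholder):

```python
# Minimal sketch: total surprisal of a text under a causal LM.
# Illustrative only; model choice and details are assumptions, not the repo's code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

def surprisal(text: str) -> float:
    """S(x): summed negative log-probability (in nats) of the token sequence."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=ids the model returns the mean cross-entropy per predicted token
        mean_nll = model(input_ids=ids, labels=ids).loss.item()
    return mean_nll * (ids.shape[1] - 1)  # mean NLL x number of predictions
```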
A novel metric introduced in Simil-Eval explicitly compares correct vs. incorrect alternatives of the same fact:

$$\Delta S = S(x_{na}) - S(x_a)$$

where:

- $S(x_a)$: surprisal of the acceptable / correct text
- $S(x_{na})$: surprisal of the non-acceptable / incorrect text
Higher values in this metric indicate a stronger model preference for the correct option, whereas values near zero suggest weak discrimination.
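Assuming the difference form given above, the metric follows directly from two surprisal calls. This reuses the hypothetical `surprisal()` sketch from the previous section; the example sentences are invented:

```python
# dS = S(x_na) - S(x_a): positive when the model prefers the acceptable variant.
s_a  = surprisal("She has been working here since 2010.")          # acceptable
s_na = surprisal("She has been working here since during 2010.")   # non-acceptable
delta_s = s_na - s_a
print(f"dS = {delta_s:.2f}")  # near zero => weak discrimination
```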
Similarity metrics evaluate how close a model’s generated output is to a reference answer in semantic space.
- Cosine Similarity: Computes cosine similarity between sentence embeddings of the model generation and the reference answer. This metric is fast and model-agnostic (see the sketch after this list).
- MoverScore (Code, Paper): Uses embeddings from a BERT model to calculate the effort required to transform one text into another. Lower effort means higher similarity.
- BERTScore (Code, Paper): Uses embeddings from a BERT model to compute a refined version of cosine similarity.
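As a rough sketch of the first and third metrics (the encoder choice and settings are illustrative assumptions, not necessarily Simil-Eval's defaults):

```python
# Sketch: cosine similarity and BERTScore between a generation and a reference.
from sentence_transformers import SentenceTransformer, util
from bert_score import score as bert_score

generation = "Santiago de Compostela is the capital of Galicia."
reference  = "The capital of Galicia is Santiago de Compostela."

# Multilingual encoder chosen for illustration only
encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
embs = encoder.encode([generation, reference], convert_to_tensor=True)
cosine = util.cos_sim(embs[0], embs[1]).item()  # 1.0 = identical direction

P, R, F1 = bert_score([generation], [reference], lang="en")
print(f"cosine={cosine:.3f}  bertscore_f1={F1.item():.3f}")
```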
> **Note**
> Simil-Eval is extensively tested on SLURM clusters, but also runs efficiently on a single local machine.
- Python 3.9+
- A working `pip` installation
- A CUDA-enabled GPU (optional but strongly advised)
Clone the repository and install dependencies in a Python environment:

```bash
git clone https://github.com/proxectonos/simil-eval
sh install.sh
```

Create a file at `./configs/.env` with the following structure:

```bash
HF_TOKEN=your_huggingface_token_here
CACHE_DIR=./cache
```

- `HF_TOKEN`: Hugging Face access token (required for gated models)
- `CACHE_DIR`: directory for model and dataset caching
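For reference, this is roughly how such a `.env` file can be consumed; a sketch assuming `python-dotenv`, which may differ from how Simil-Eval actually loads it:

```python
# Sketch: reading configs/.env (assumes python-dotenv; illustrative only).
import os
from dotenv import load_dotenv

load_dotenv("./configs/.env")
hf_token  = os.environ["HF_TOKEN"]                  # needed for gated models
cache_dir = os.environ.get("CACHE_DIR", "./cache")  # model/dataset cache
```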
> **Note**
> This tool is extensively tested on a cluster managed by SLURM. It can be used in other environments, but some modifications may be necessary.
Example: evaluating Llama-2-7b on the Galician OpenBookQA with five few-shot examples.
```bash
source eval_env/bin/activate
python3 eval_similarity.py \
  --model "meta-llama/Llama-2-7b-hf" \
  --dataset openbookqa_gl \
  --language gl \
  --metrics cosine bertscore \
  --fewshot_num 5 \
  --create_examples \
  --generate_answers \
  --evaluate_similarity \
  --results_file ./results/llama_gl_eval.json
```

| Argument | Description |
|---|---|
| `--model` | Hugging Face model ID or local path |
| `--dataset` | Dataset identifier |
| `--language` | Language code (`gl`, `es`, `en`, `pt`, `cat`) |
| `--metrics` | Any of `cosine`, `moverscore`, `bertscore` |
| `--fewshot_num` | Number of in-context examples (0 = zero-shot) |
| `--create_examples` | Construct few-shot prompts |
| `--generate_answers` | Generate responses for the selected datasets |
| `--evaluate_similarity` | Compute similarity metrics |
| `--results_file` | Output JSON file |
Example: evaluating Llama-2-7b on the Catalan CoLA version (CatCoLA).
```bash
source eval_env/bin/activate
python3 eval_surprisal.py \
  --model "meta-llama/Llama-2-7b-hf" \
  --dataset catcola \
  --lang cat \
  --cache ./cache
```

Simil-Eval supports large-scale evaluations on SLURM-managed clusters.
For similarity evaluations:

- Navigate to `./launchers/`
- Edit `execute_eval_similarity.sh`:
  - `MODELS`: Hugging Face IDs or local checkpoints
  - `DATASETS`: dataset identifiers
  - `LANGUAGES`: `gl`, `cat`, `es`, `en`, `pt`
- Adjust the `#SBATCH` directives (GPUs, memory, time) in the `launch_sim_eval.sh` file.
- Launch: `sh execute_eval_similarity.sh`

For surprisal evaluations:

- Navigate to `./launchers/`
- Edit `execute_eval_surprisal.sh` to specify `MODELS`
- Adjust the `#SBATCH` directives (GPUs, memory, time) in the `launch_sur_eval.sh` file.
- Launch: `sh execute_eval_surprisal.sh`

Simil-Eval covers the following task families:
- Multiple Choice QA:
  - OpenBookQA: Four possible answers. A question is asked, and the correct answer must be selected.
  - VeritasQA: Between 4 and 10 possible options. A question is asked and a correct option must be chosen; multiple options may be correct. The task is similar to the original mc1 task described in its paper.
  - TruthfulQA: Similar to VeritasQA but focused on US-centric facts. The task is similar to the original mc1 task described in its paper.
- Reading Comprehension:
  - Belebele: A context with diverse information is provided, followed by a question with 4 options; the correct answer can be deduced from the context.
  - XStoryCloze: A context with diverse information is provided, followed by two options that could complete it. The model has to choose the logical continuation of the text.
- Linguistic Acceptability:
  - CoLA: Contains sentences labeled as linguistically acceptable (1) or unacceptable (0). This allows studying whether a model is more likely to generate acceptable or unacceptable texts by comparing the probabilities it assigns to each type of sentence.
- Generative Capabilities:
  - Calame: A text fragment is provided, and the task is to complete its last word. The dataset is designed so that the last word is unique, and the goal is to check whether the word generated by the model matches the reference word for each fragment.
- Physical Commonsense Reasoning:
  - Global PIQA: Each example consists of two candidate solutions, one correct and one incorrect. Determining the correct solution is designed to require physical commonsense reasoning, although the dataset allows for fairly flexible definitions of physical commonsense (e.g. knowledge of physical properties of objects, affordances, physical and temporal relations, and everyday activities). See the sketch after this list.
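To make the connection to the surprisal metric concrete, here is a minimal sketch of scoring a two-candidate task; it reuses the hypothetical `surprisal()` function sketched earlier and is not the repository's implementation:

```python
# Sketch: surprisal-based choice for two-candidate tasks such as Global PIQA.
# Reuses the surprisal() function from the earlier sketch; prompts are made up.
def choose(prompt: str, candidates: list[str]) -> int:
    """Return the index of the candidate the model finds least surprising."""
    scores = [surprisal(f"{prompt} {c}") for c in candidates]
    return min(range(len(scores)), key=scores.__getitem__)

best = choose(
    "To open a glass jar with a stuck lid,",
    ["run the lid under hot water first.",
     "paint the lid and wait for it to dry."],
)
print(best)  # expected: 0
```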
| Task | Cosine Similarity | MoverScore | BERTScore | Surprisal |
|---|---|---|---|---|
| OpenBookQA | ✔️ | ✔️ | ✔️ | |
| Belebele | ✔️ | ✔️ | ✔️ | |
| CoLA | | | | ✔️ |
| Calame | ✔️ | | | |
| VeritasQA | ✔️ | ✔️ | ✔️ | |
| XStoryCloze | ✔️ | ✔️ | ✔️ | |
| TruthfulQA | ✔️ | ✔️ | ✔️ | |
| Global PIQA | | | | ✔️ |
| Task | Galician | English | Catalan | Spanish | Portuguese |
|---|---|---|---|---|---|
| OpenBookQA | openbookqa_gl | openbookqa | openbookqa_ca | openbookqa_es | Private |
| Belebele | belebele_gl | belebele_eng_Latn | belebele_cat_Latn | belebele_spa_Latn | belebele_por_Latn |
| CoLA | galcola | glue_cola | CatCoLA | EsCoLA | |
| Calame | calame-gl | calame-pt | |||
| VeritasQA | veritasqa_gl | veritasqa_en | veritasqa_ca | veritasqa_es | |
| XStoryCloze | xtorycloze_gl | xstory_cloze_en | xstorycloze_ca | xstory_cloze_es | XStoryCloze_pt |
| TruthfulQA | truthfulqa_gl_gen | truthful_qa_gen | |||
| Global PIQA | global-piqa_gl | global-piqa_eng | global-piqa_cat | global-piqa_spa-spai | global-piqa_por-port |
Convert raw JSON outputs into Excel summaries.
- Load the evaluation environment.
- Execute the following commands, changing the `RESULTS_DIR` and `OUTPUT_DIR` variables in each file:

```bash
cd export/
python export_similarity_to_excel.py
python export_surprisal_to_excel.py
```

Outputs (located in `$OUTPUT_DIR/`):

- `similarity_summary.xlsx`
- `surprisal_summary.xlsx`
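If you want to post-process these summaries programmatically, a trivial sketch (the column layout of the files is not documented here, so treat the schema as unknown):

```python
# Sketch: load the exported summaries for further analysis (requires pandas + openpyxl).
import pandas as pd

sim = pd.read_excel("similarity_summary.xlsx")  # schema assumed, inspect with .head()
sur = pd.read_excel("surprisal_summary.xlsx")
print(sim.head())
```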
If you use Simil-Eval or the Galician datasets, please cite:
@inproceedings{rodriguez-etal-2025-continued,
title = "Continued Pretraining and Interpretability-Based Evaluation for Low-Resource Languages: A {G}alician Case Study",
author = "Rodr{\'i}guez, Pablo and
Su{\'a}rez, Silvia Paniagua and
Gamallo, Pablo and
Docio, Susana Sotelo",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-acl.240/",
doi = "10.18653/v1/2025.findings-acl.240",
pages = "4622--4637",
ISBN = "979-8-89176-256-5",
abstract = "Recent advances in Large Language Models (LLMs) have led to remarkable improvements in language understanding and text generation. However, challenges remain in enhancing their performance for underrepresented languages, ensuring continual learning without catastrophic forgetting, and developing robust evaluation methodologies. This work addresses these issues by investigating the impact of Continued Pretraining (CPT) on multilingual models and proposing a comprehensive evaluation framework for LLMs, focusing on the case of Galician language. Our first contribution explores CPT strategies for languages with limited representation in multilingual models. We analyze how CPT with Galician corpora improves text generation while assessing the trade-offs between linguistic enrichment and task-solving capabilities. Our findings show that CPT with small, high-quality corpora and diverse instructions enhances both task performance and linguistic quality. Our second contribution is a structured evaluation framework based on distinguishing task-based and language-based assessments, leveraging existing and newly developed benchmarks for Galician. Additionally, we contribute new Galician LLMs, datasets for evaluation and instructions, and an evaluation framework."
}