
Simil-Eval: A Multilingual Toolkit for Evaluating LLMs

Simil-Eval is a comprehensive toolkit for evaluating Large Language Models (LLMs) using interpretability-based metrics. Unlike traditional evaluation suites that rely solely on exact matching or accuracy, Simil-Eval incorporates embedding-based similarity and surprisal (perplexity) to provide deeper insights into model behavior—particularly for low-resource languages.

This repository accompanies the paper Continued Pretraining and Interpretability-Based Evaluation for Low-Resource Languages: A Galician Case Study (Rodríguez et al., Findings of ACL 2025).


Table of Contents

  1. Methodology & Metrics
  2. Setup
  3. Quick Start (Local Usage)
  4. Cluster Deployment (SLURM)
  5. Available Tasks, Datasets & Supported Metrics
  6. Exporting Results
  7. Citation

1. Methodology & Metrics

Simil-Eval evaluates LLMs from two complementary perspectives:

  • Surprisal → intrinsic knowledge and linguistic competence (no generation required)
  • Similarity → semantic quality of generated text

A. Surprisal Metrics

Surprisal-based metrics evaluate how expected a given text is according to a language model.

Surprisal (Code, Paper)

Surprisal S(x) measures the negative log-probability of a token sequence. Lower values mean the model recognises the text as natural or expected; higher values indicate that the model is unfamiliar with the language, structure, or facts. This metric is particularly useful for linguistic acceptability and commonsense reasoning tasks.
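
A minimal sketch of how S(x) can be computed with Hugging Face Transformers. The reduction to a summed log-probability is an assumption for illustration; Simil-Eval's exact implementation may differ.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def surprisal(text: str, model, tokenizer) -> float:
    """Total negative log-probability (in nats) of `text` under the model."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # With labels set, the model returns the mean cross-entropy per predicted token.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    n_predicted = inputs["input_ids"].shape[1] - 1  # the first token is never predicted
    return loss.item() * n_predicted  # undo the mean to get a summed surprisal

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
print(surprisal("O ceo é azul.", model, tokenizer))  # lower = more expected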

Difsur (Differential Surprisal)

A novel metric introduced in Simil-Eval to explicitly compare correct vs. incorrect alternatives of the same fact.

$$\text{difsur} = \frac{S(x_{na}) - S(x_a)}{\max\{S(x_a),\, S(x_{na})\}} \times 100$$

where:

  • $S(x_a)$: surprisal of the acceptable / correct text
  • $S(x_{na})$: surprisal of the non-acceptable / incorrect text

Higher values in this metric indicate a stronger model preference for the correct option, whereas values near zero suggest weak discrimination.
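
As a worked sketch, difsur reduces to a few lines of Python; the surprisal values below are made up purely for illustration.

def difsur(s_a: float, s_na: float) -> float:
    """Normalised gap between the surprisal of the incorrect (s_na)
    and the correct (s_a) alternative, expressed as a percentage."""
    return (s_na - s_a) / max(s_a, s_na) * 100

# Hypothetical values: S(x_a) = 42.0, S(x_na) = 58.5
print(difsur(42.0, 58.5))  # ≈ 28.2: the model clearly prefers the correct text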


B. Similarity Metrics

Similarity metrics evaluate how close a model’s generated output is to a reference answer in semantic space.

  • Cosine Similarity: Computes cosine similarity between sentence embeddings of the model generation and the reference answer. This metric is fast and model-agnostic (see the sketch after this list).

  • MoverScore (Code, Paper): Uses embeddings from a BERT model to calculate the effort required to transform one text into another; lower effort means higher similarity.

  • BERTScore (Code, Paper): Uses embeddings from a BERT model to compute a token-level refinement of cosine similarity.
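
A minimal sketch of the cosine-similarity path using sentence-transformers. The embedding model named here is an assumption, not necessarily the one Simil-Eval ships with.

from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
generated = "A auga ferve a cen graos."
reference = "A auga ferve a 100 °C."
emb = embedder.encode([generated, reference], convert_to_tensor=True)
score = util.cos_sim(emb[0], emb[1]).item()  # closer to 1.0 = closer in meaning
print(f"cosine similarity: {score:.3f}")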


2. Setup

Note
Simil-Eval is extensively tested on SLURM clusters, but it also runs on a single local machine; outside SLURM, some minor modifications may be necessary.

Prerequisites

  • Python 3.9+
  • A working pip installation
  • CUDA-enabled GPU recommended (optional but strongly advised)

Installation

Clone the repository and install the dependencies in a Python environment:

git clone https://github.com/proxectonos/simil-eval
cd simil-eval
sh install.sh

Environment variables configuration

Create a file at ./configs/.env with the following structure:

HF_TOKEN=your_huggingface_token_here
CACHE_DIR=./cache

  • HF_TOKEN: Hugging Face access token (required for gated models)
  • CACHE_DIR: directory for model and dataset caching
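
These variables can then be read at runtime. A minimal sketch, assuming python-dotenv; the toolkit's actual loading code may differ.

import os
from dotenv import load_dotenv

load_dotenv("./configs/.env")           # populate os.environ from the file
hf_token = os.environ["HF_TOKEN"]       # forwarded to Hugging Face for gated models
cache_dir = os.environ.get("CACHE_DIR", "./cache")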

3. Quick Start (Local Usage)


A. Similarity Evaluation (Generation)

Example: evaluating Llama-2-7B on the Galician OpenBookQA with five few-shot examples.

source eval_env/bin/activate
python3 eval_similarity.py \
  --model "meta-llama/Llama-2-7b-hf" \
  --dataset openbookqa_gl \
  --language gl \
  --metrics cosine bertscore \
  --fewshot_num 5 \
  --create_examples \
  --generate_answers \
  --evaluate_similarity \
  --results_file ./results/llama_gl_eval.json

Common Arguments

| Argument | Description |
|---|---|
| --model | Hugging Face model ID or local path |
| --dataset | Dataset identifier |
| --language | Language code (gl, es, en, pt, cat) |
| --metrics | cosine, moverscore, bertscore |
| --fewshot_num | Number of in-context examples (0 = zero-shot) |
| --create_examples | Construct few-shot prompts |
| --generate_answers | Generate responses for the selected datasets |
| --evaluate_similarity | Compute similarity metrics |
| --results_file | Output JSON file |

B. Surprisal Evaluation

Example: evaluating Llama-2-7B on the Catalan CoLA version (CatCoLA).

source eval_env/bin/activate
python3 eval_surprisal.py \
  --model "meta-llama/Llama-2-7b-hf" \
  --dataset catcola \
  --lang cat \
  --cache ./cache

4. Cluster Deployment (SLURM)

Simil-Eval supports large-scale evaluations on SLURM-managed clusters.

Similarity Jobs

  1. Navigate to ./launchers/
  2. Edit execute_eval_similarity.sh:
    • MODELS: Hugging Face IDs or local checkpoints
    • DATASETS: dataset identifiers
    • LANGUAGES: gl, cat, es, en, pt
  3. Adjust #SBATCH directives (GPUs, memory, time) in the launch_sim_eval.sh file (an illustrative header is sketched after this list).
  4. Launch:
sh execute_eval_similarity.sh
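
For orientation, a typical #SBATCH header looks like the sketch below; the values are placeholders and must match your cluster's partitions and limits, plus whatever launch_sim_eval.sh already defines.

#!/bin/bash
#SBATCH --job-name=simil-eval
#SBATCH --gres=gpu:1         # number/type of GPUs
#SBATCH --mem=64G            # CPU memory for the job
#SBATCH --time=12:00:00      # wall-clock limit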

Surprisal Jobs

  1. Navigate to ./launchers/
  2. Edit execute_eval_surprisal.sh to specify MODELS
  3. Adjust #SBATCH directives (GPUs, memory, time) in the launch_sur_eval.sh file.
  4. Launch:
sh execute_eval_surprisal.sh

5. Available Tasks, Datasets & Supported Metrics

Tasks and Available Datasets

  • Multiple Choice QA:
    • OpenBookQA: Four possible answers. A question is asked, and the correct answer must be selected.
    • VeritasQA: Between 4 and 10 possible options. A question is asked and the correct option must be chosen, although several options may be correct. The task is similar to the original mc1 task shown in its paper.
    • TruthfulQA: Similar to VeritasQA but focused on US-centric facts. The task is similar to the original mc1 task shown in its paper.
  • Reading Comprehension:
    • Belebele: A context is provided with diverse information, followed by a question with 4 options whose correct answer can be deduced from the context.
    • XStoryCloze: A context is provided with diverse information, followed by two options that can complete it. The model has to choose the logical option to continue the text.
  • Linguistic Acceptability:
    • CoLA: Contains sentences labeled as linguistically acceptable (1) or unacceptable (0). This allows for studying when a model is more likely to generate acceptable or unacceptable texts by comparing the probabilities it assigns to each type of sentence.
  • Generative Capabilities:
    • Calame: A text fragment is provided, and the task is to complete the last word. The dataset is designed so that the last word should be unique, and the goal is to check whether the word generated by the model matches the reference word in the dataset for each fragment.
  • Physical commonsense reasoning:
    • Global PIQA: Each example consists of two candidate solutions, one correct and one incorrect. Determining the correct solution is designed to require physical commonsense reasoning, under a fairly flexible definition of physical commonsense (e.g. knowledge of physical properties of objects, affordances, physical and temporal relations, and everyday activities).

Supported Metrics by Dataset

| Dataset | Cosine Similarity | MoverScore | BERTScore | Surprisal |
|---|---|---|---|---|
| OpenBookQA | ✔️ | ✔️ | ✔️ | |
| Belebele | ✔️ | ✔️ | ✔️ | |
| CoLA | | | | ✔️ |
| Calame | ✔️ | | | |
| VeritasQA | ✔️ | ✔️ | ✔️ | |
| XStoryCloze | ✔️ | ✔️ | ✔️ | |
| TruthfulQA | ✔️ | ✔️ | ✔️ | |
| Global PIQA | | | | ✔️ |

Datasets by Language

| Dataset | Galician | English | Catalan | Spanish | Portuguese |
|---|---|---|---|---|---|
| OpenBookQA | openbookqa_gl | openbookqa | openbookqa_ca | openbookqa_es | Private |
| Belebele | belebele_gl | belebele_eng_Latn | belebele_cat_Latn | belebele_spa_Latn | belebele_por_Latn |
| CoLA | galcola | glue_cola | CatCoLA | EsCoLA | |
| Calame | calame-gl | | | | calame-pt |
| VeritasQA | veritasqa_gl | veritasqa_en | veritasqa_ca | veritasqa_es | |
| XStoryCloze | xtorycloze_gl | xstory_cloze_en | xstorycloze_ca | xstory_cloze_es | XStoryCloze_pt |
| TruthfulQA | truthfulqa_gl_gen | truthful_qa_gen | | | |
| Global PIQA | global-piqa_gl | global-piqa_eng | global-piqa_cat | global-piqa_spa-spai | global-piqa_por-port |

6. Exporting Results

Convert raw JSON outputs into Excel summaries.

  1. Load the evaluation environment.
  2. Run the export scripts, after setting the RESULTS_DIR and OUTPUT_DIR variables in each file:
cd export/
python export_similarity_to_excel.py
python export_surprisal_to_excel.py

Outputs (located in $OUTPUT_DIR/):

  • similarity_summary.xlsx
  • surprisal_summary.xlsx
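
Under the hood, the conversion is essentially "read JSON records, write an Excel sheet". A rough, hypothetical illustration of that step, assuming a flat list of result dicts; prefer the bundled scripts for real runs, since they handle the toolkit's actual JSON schema.

import json
import pandas as pd  # requires openpyxl for .xlsx output

with open("./results/llama_gl_eval.json") as f:
    records = json.load(f)  # assumed: a flat list of per-example result dicts

pd.DataFrame(records).to_excel("similarity_summary.xlsx", index=False)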

7. Citation

If you use Simil-Eval or the Galician datasets, please cite:

@inproceedings{rodriguez-etal-2025-continued,
    title = "Continued Pretraining and Interpretability-Based Evaluation for Low-Resource Languages: A {G}alician Case Study",
    author = "Rodr{\'i}guez, Pablo  and
      Su{\'a}rez, Silvia Paniagua  and
      Gamallo, Pablo  and
      Docio, Susana Sotelo",
    editor = "Che, Wanxiang  and
      Nabende, Joyce  and
      Shutova, Ekaterina  and
      Pilehvar, Mohammad Taher",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-acl.240/",
    doi = "10.18653/v1/2025.findings-acl.240",
    pages = "4622--4637",
    ISBN = "979-8-89176-256-5",
    abstract = "Recent advances in Large Language Models (LLMs) have led to remarkable improvements in language understanding and text generation. However, challenges remain in enhancing their performance for underrepresented languages, ensuring continual learning without catastrophic forgetting, and developing robust evaluation methodologies. This work addresses these issues by investigating the impact of Continued Pretraining (CPT) on multilingual models and proposing a comprehensive evaluation framework for LLMs, focusing on the case of Galician language. Our first contribution explores CPT strategies for languages with limited representation in multilingual models. We analyze how CPT with Galician corpora improves text generation while assessing the trade-offs between linguistic enrichment and task-solving capabilities. Our findings show that CPT with small, high-quality corpora and diverse instructions enhances both task performance and linguistic quality. Our second contribution is a structured evaluation framework based on distinguishing task-based and language-based assessments, leveraging existing and newly developed benchmarks for Galician. Additionally, we contribute new Galician LLMs, datasets for evaluation and instructions, and an evaluation framework."
}
