EncodEval: Evaluating Pretrained Encoders Across NLP Tasks with Confidence-Aware Rankings

Overview

EncodEval is a lightweight evaluation framework for benchmarking general-purpose pretrained encoders on a diverse set of downstream NLP tasks:

Sequence Classification (SC)
Sequence Regression (SR)
Token Classification (TC)
Information Retrieval (IR)

This repository was used for evaluation in the paper EuroBERT: Scaling Multilingual Encoders for European Languages. If you're interested in training the EuroBERT model, please refer to the EuroBERT repository.

Installation

To install EncodEval directly via pip:

pip install git+https://github.com/hgissbkh/EncodEval.git

For development, clone the repository and install in editable mode:

git clone https://github.com/hgissbkh/EncodEval.git
cd EncodEval
pip install -e .

Running Evaluations

To run a task evaluation from the command line:

python main.py \
    --config_file <config_file_path> \
    --model_path <model_path>

This will generate a results.json file with instance-level scores. For analyzing and comparing system-level results, see the System Evaluation section below.
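
When several models need to be evaluated with the same configuration, the command above can be assembled programmatically. The following is a minimal sketch that relies only on the two flags shown above; the helper name build_eval_command and the model identifiers are hypothetical, and the script prints the commands rather than launching them.

```python
import shlex
import sys


def build_eval_command(config_file, model_path):
    """Return the argv list for one main.py run, using the two documented flags."""
    return [
        sys.executable, "main.py",
        "--config_file", config_file,
        "--model_path", model_path,
    ]


# Hypothetical model identifiers; replace with real checkpoints or local paths.
models = ["model1", "model2"]
config = "./configs/sequence_classification_tasks/example.yaml"

for model in models:
    cmd = build_eval_command(config, model)
    # Print the shell-quoted command for inspection.
    print(shlex.join(cmd))
    # To actually launch the run: subprocess.run(cmd, check=True)
```

Building the command as a list avoids shell-quoting problems when paths contain spaces.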

Task Evaluation Modules

Task-specific evaluation logic is implemented in encodeval/eval_tasks/. These modules handle both fine-tuning and evaluation.

Example usage in Python:

import configue

from encodeval.eval_tasks import EvalConfig, SequenceClassificationEval

# Load the "eval_config" section of the YAML configuration file
config_file = "./configs/sequence_classification_tasks/example.yaml"
eval_config: EvalConfig = configue.load(
    config_file,
    sub_path="eval_config",
)
evaluator = SequenceClassificationEval(eval_config)
evaluator.train()  # Fine-tune on the target task
evaluator.validate()  # Evaluate on the validation set
evaluator.test()  # Evaluate on the test set

Datasets

Dataset loading and preprocessing are managed in encodeval/datasets.py. To add a new dataset, implement the loading logic in this file.

Example (loading the XNLI dataset):

from encodeval.datasets import xnli
dataset = xnli()

Configuration Files

Examples of configuration files are available in the configs/ folder.

System Evaluation

To compare and rank models on a given task, use the get_results function. This will:

Run a hyperparameter search on the validation set when one is available and multiple configurations are provided (see results/toy/ for an example); otherwise, load the existing results.
Compute average scores across languages (average_scores).
Perform statistical testing and compute Borda counts for the rankings (system_ranking).

Example usage:

from encodeval.system_ranking import get_results

average_scores, system_ranking = get_results(
    base_path="./results/toy",
    models=["model1", "model2", "model3"], 
    task_type="SC",
    dataset="dataset_sc", 
    valid_langs=["en", "fr"],
)

print(average_scores)
print(system_ranking)

Note: Rankings are based on statistical significance at the 95% confidence level.
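
To make the averaging and ranking steps more concrete, here is a minimal sketch of one way such a procedure could work. It is not the repository's implementation. It assumes per-instance scores for each model, uses a paired bootstrap as the significance test (an assumption, since the test is not named here), and awards each model one Borda-style point for every model it significantly outperforms. The function names average_across_languages, significantly_better, and borda_ranking are hypothetical.

```python
import random
from statistics import mean


def average_across_languages(scores_by_lang):
    """Macro-average: mean of the per-language mean scores."""
    return mean(mean(scores) for scores in scores_by_lang.values())


def significantly_better(a, b, n_resamples=1000, alpha=0.05, seed=0):
    """Paired bootstrap: True if the lower bound of the (1 - alpha)
    confidence interval of mean(a - b) lies above zero."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    # Resample the paired differences with replacement and sort the means.
    means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    return lower > 0


def borda_ranking(instance_scores):
    """Give each model one point per model it significantly beats,
    then sort models by points in descending order."""
    points = {
        name: sum(
            significantly_better(scores, other_scores)
            for other, other_scores in instance_scores.items()
            if other != name
        )
        for name, scores in instance_scores.items()
    }
    return sorted(points.items(), key=lambda kv: kv[1], reverse=True)


# Toy per-instance scores (made up for illustration only).
toy = {
    "model1": [0.9, 0.8, 0.85, 0.95, 0.9, 0.88],
    "model2": [0.6, 0.5, 0.55, 0.65, 0.6, 0.58],
    "model3": [0.3, 0.2, 0.25, 0.35, 0.3, 0.28],
}
print(borda_ranking(toy))
# [('model1', 2), ('model2', 1), ('model3', 0)]

print(average_across_languages({"en": [1.0, 0.0], "fr": [1.0, 1.0]}))
# 0.75
```

With alpha=0.05 the test corresponds to the 95% confidence level mentioned in the note above. Models whose differences are not significant receive no point from each other, so they can end up tied.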

Citation

If you use this framework in your research, please consider citing:

@misc{boizard2025eurobertscalingmultilingualencoders,
  title={EuroBERT: Scaling Multilingual Encoders for European Languages}, 
  author={Nicolas Boizard and Hippolyte Gisserot-Boukhlef and Duarte M. Alves and André Martins and Ayoub Hammal and Caio Corro and Céline Hudelot and Emmanuel Malherbe and Etienne Malaboeuf and Fanny Jourdan and Gabriel Hautreux and João Alves and Kevin El-Haddad and Manuel Faysse and Maxime Peyrard and Nuno M. Guerreiro and Patrick Fernandes and Ricardo Rei and Pierre Colombo},
  year={2025},
  eprint={2503.05500},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2503.05500}
}
