TRIBBLE - TRanslating IBerian languages Based on Limited E-resources

This repository contains the code and resources for the TRIBBLE project, developed as part of the WMT24 Shared Task on Translation into Low-Resource Languages of Spain. Our work focuses on machine translation from Spanish into three low-resource languages: Aragonese (spa-arg), Aranese (spa-arn), and Asturian (spa-ast).

Overview

TRIBBLE is designed to address the challenge of limited e-resources for these endangered Iberian languages by adapting a multilingual model in a constrained translation setting. We leverage the distilled NLLB-200 model with 600M parameters, extending its vocabulary with two new language-specific tokens: arn_Latn for Aranese and arg_Latn for Aragonese. These tokens are initialized with weights from oci_Latn (Occitan) for Aranese and spa_Latn (Spanish) for Aragonese, reflecting linguistic similarities.
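
As an illustration, the vocabulary extension can be sketched with the Hugging Face transformers API (a minimal sketch, assuming the facebook/nllb-200-distilled-600M checkpoint and current tokenizer behaviour, not necessarily the exact code in this repository):

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Map each new language token to the existing token that seeds its weights.
donors = {"arn_Latn": "oci_Latn", "arg_Latn": "spa_Latn"}

tokenizer.add_special_tokens(
    {"additional_special_tokens": list(donors)},
    replace_additional_special_tokens=False,
)
model.resize_token_embeddings(len(tokenizer))

# Copy the donor embeddings into the newly added rows (NLLB-200 ties its
# input and output embeddings, so copying the input side is enough).
embeddings = model.get_input_embeddings().weight
with torch.no_grad():
    for new_token, donor in donors.items():
        new_id = tokenizer.convert_tokens_to_ids(new_token)
        embeddings[new_id] = embeddings[tokenizer.convert_tokens_to_ids(donor)].clone()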

The system is trained on carefully processed datasets for three language pairs: Spanish to Aragonese, Spanish to Aranese, and Spanish to Asturian. Our results are benchmarked against the Apertium baseline, showing comparable performance for Asturian and a remaining quality gap for Aragonese and Aranese.
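
Continuing the sketch above, inference with the extended model follows the usual NLLB-200 recipe: set the source language on the tokenizer and force the target-language token at the start of generation (before fine-tuning, a new token simply behaves like its donor):

# Translate Spanish into Aranese with the extended model.
tokenizer.src_lang = "spa_Latn"
inputs = tokenizer("Hola, ¿cómo estás?", return_tensors="pt")
output_ids = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("arn_Latn"),
    max_length=64,
)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])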

Key Results

Language Direction     BLEU (Apertium)   chrF (Apertium)   BLEU (TRIBBLE)   chrF (TRIBBLE)
Spanish → Aragonese    61.1              79.3              49.2             73.6
Spanish → Aranese      28.8              49.4              23.9             46.1
Spanish → Asturian     17.0              50.8              17.9             50.5

Repository Structure

  • data: Contains processed datasets used in training. final.csv holds the final processed language pairs.
  • filtering: Data acquisition, formatting, and filtering scripts, including downloads of FLORES, PILAR, OPUS, and Wikipedia data, filtering and combining steps, and score_blaser.py for quality evaluation.
  • graphics: Scripts to generate plots for language pair distributions.
  • harvesting: Data alignment and sentence processing tools, including scripts for sentence alignment and translation.
  • install: Installation script for the required components.
  • requirements.txt: Lists the dependencies needed to run the project.
  • runs: Contains scripts for training runs, including fine-tuning and parameter adjustments.
  • scripts: Main script for running sequence-to-sequence training.
  • utils: Utility scripts for deduplication, language detection, length ratio calculation, normalization, and translation.

Installation

Clone the repository and install the required dependencies:

git clone https://github.com/piotrmp/tribble.git
cd tribble
pip install -r requirements.txt
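
As a quick sanity check of the environment (assuming transformers is among the listed dependencies), loading the base tokenizer should succeed:

from transformers import AutoTokenizer

# Downloads the base checkpoint's tokenizer on first use.
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
print(len(tokenizer))  # base vocabulary size, before adding arn_Latn/arg_Latn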

Usage

  1. Data Preparation: Pre-process the data using the scripts in filtering and harvesting. Store the final pre-processed dataset as data/processed/final/final.csv.
  2. Model Training: Run training with the main script, scripts/run_peft_seq2seq.py.
  3. Evaluation: Score translation quality with score_blaser.py in filtering. The BLEU and chrF numbers reported above can be computed with standard MT evaluation tools (see the sketch after this list).
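
For reference, BLEU and chrF as reported in the results table can be computed with sacrebleu (a minimal sketch; the file names are placeholders, not paths from this repository):

import sacrebleu

# Hypothetical inputs: one sentence per line, system output and reference aligned.
with open("system_output.txt") as f:
    hypotheses = [line.strip() for line in f]
with open("reference.txt") as f:
    references = [[line.strip() for line in f]]  # a single reference stream

print(f"BLEU = {sacrebleu.corpus_bleu(hypotheses, references).score:.1f}")
print(f"chrF = {sacrebleu.corpus_chrf(hypotheses, references).score:.1f}")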

Citation

If you use this code or refer to the TRIBBLE project, please cite the paper as follows:

@InProceedings{kuzmin-EtAl:2024:WMT,
  author    = {Kuzmin, Igor  and  Przybyła, Piotr  and  McGill, Euan  and  Saggion, Horacio},
  title     = {TRIBBLE - TRanslating IBerian languages Based on Limited E-resources},
  booktitle = {Proceedings of the Ninth Conference on Machine Translation},
  month     = {November},
  year      = {2024},
  address   = {Miami, Florida, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {955--959},
  abstract  = {In this short overview paper, we describe our system submission for the language pairs Spanish to Aragonese (spa-arg), Spanish to Aranese (spa-arn), and Spanish to Asturian (spa-ast). We train a unified model for all language pairs in the constrained scenario. In addition, we add two language control tokens for Aragonese and Aranese Occitan, as there is already one present for Asturian. We take the distilled NLLB-200 model with 600M parameters and extend special tokens with 2 tokens that denote target languages (arn_Latn, arg_Latn) because Asturian was already presented in NLLB-200 model. We adapt the model by training on a special regime of data augmentation with both monolingual and bilingual training data for the language pairs in this challenge.},
  url       = {https://aclanthology.org/2024.wmt-1.94}
}
