This repository hosts an experimental codebase for the CLEF 2025 JOKER Track on computational wordplay. It focuses primarily on Task 2 (Pun Translation EN→FR) while remaining extensible to Task 1 (Humour-aware IR) and Task 3 (Onomastic Wordplay Translation). It provides:
- Supervised fine-tuning (SFT) pipeline with LoRA
- Alternative alignment / preference optimization (ARPO-style CPO/SimPO) training
- Structured JSON config system for reproducibility
- Batched inference + submission packaging
- Optional Unsloth acceleration & 4/8-bit loading
- Integrated on-the-fly COMET (translation quality) evaluation callback (optional)
| Task | Description | Example Challenge |
|---|---|---|
| Task 1 | Humour-aware Information Retrieval | Retrieve jokes relevant to a semantic query ("physics", "dating", etc.) preserving humorous intent. |
| Task 2 | Pun Translation (EN→FR) | Preserve dual meanings + humor: I used to be a banker but I lost interest → J'ai été banquier mais j'en ai perdu tout l'intérêt. |
| Task 3 | Onomastic Wordplay Translation | Maintain name-based wordplay (proper nouns, famous figures) while retaining pun plausibility. |
- Unified training interfaces: `src/sft.py` (supervised) and `src/arpo.py` (preference / constrained policy optimization style)
- LoRA integration (PEFT) + optional Unsloth fast adapters
- Configurable generation defaults saved alongside model artifacts
- Completion-only loss mode with response template masking
- Mid-training NMT quality probing via COMET (optional callback)
- Reproducible, declarative experiment configs (JSON)
- Submission inference helper that auto-zips predictions for Task 2
```
src/
  sft.py                       # Supervised fine-tuning entry
  arpo.py                      # Alignment / preference optimization training
  run_submission_inference.py  # Batch generation + packaging for submissions
  utils/                       # Collators, callbacks, metrics, seeding, IO
  scripts/                     # Prompt templates, helpers
configs/                       # Experiment JSON configs (SFT + ARPO)
data/                          # Place your local datasets (not tracked)
runs/                          # Shell scripts & run outputs
Experiments.ipynb              # Exploratory notebook
```
Create an environment (for example with `uv`, `venv`, or conda). Dependencies are standard: `transformers`, `trl`, `datasets`, `accelerate`, `peft`, `wandb`, `unsloth` (optional), `tqdm`, and `unbabel-comet` (only if using the COMET callback).
```bash
python -m venv .venv
source .venv/bin/activate
pip install -U pip
# (Optional) create requirements.txt later; for now install the minimal stack:
pip install transformers accelerate trl peft datasets wandb tqdm unsloth
# Optional metrics (only if using the NMT callback)
pip install unbabel-comet
```

Log in to Hugging Face and Weights & Biases if pushing to the Hub or logging:

```bash
huggingface-cli login
wandb login
```

Expected raw JSON lists for training / evaluation:
- SFT format (chat-style) example item:

```json
{
  "messages": [
    {"role": "user", "content": "Translate this English pun into French: I used to be a banker but I lost interest"},
    {"role": "assistant", "content": "J'ai été banquier mais j'en ai perdu tout l'intérêt"}
  ]
}
```

- (If using the NMT callback) the script will derive `instruction` + `target` fields during evaluation formatting when absent.

Place curated files under `data/` and reference them in a config (see below).
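Before training, it can help to sanity-check the data shape. A minimal sketch, assuming the chat-style format above (the `load_sft_items` helper and file path are illustrative, not part of the repo):

```python
import json

def load_sft_items(path: str) -> list[dict]:
    """Load a JSON list of chat-style SFT items and check each one is well-formed."""
    with open(path, encoding="utf-8") as f:
        items = json.load(f)
    assert isinstance(items, list), "expected a top-level JSON list"
    for i, item in enumerate(items):
        messages = item.get("messages")
        assert isinstance(messages, list) and messages, f"item {i}: missing 'messages'"
        for msg in messages:
            assert msg.get("role") in {"system", "user", "assistant"}, f"item {i}: bad role"
            assert isinstance(msg.get("content"), str), f"item {i}: bad content"
    return items

items = load_sft_items("data/task2/train.json")
print(f"{len(items)} items look well-formed")
```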
Each JSON file in `configs/` fully describes a run. Core fields:

| Key | Purpose |
|---|---|
| `train_file` / `eval_file` | Paths to JSON lists of examples |
| `model_name` | Base HF model (chat / instruct style) |
| `lora` | PEFT LoRA block (omit or set `null` to disable) |
| `generation_config` | Saved inference defaults (temperature, beams, etc.) |
| `max_tokens_count` / `max_length` | Sequence length control |
| `completion_only` | If true, masks loss to the assistant response only |
| `response_template` | String token prefix marking the assistant region |
| `use_nmt_callback` | Enable COMET evaluation mid-training |
| `trainer` | Training hyperparameters passed to the TRL / custom trainer |
| `seed` | Reproducibility |
| `output_dir` | Where checkpoints & tokenizer are written |
Minimal SFT config skeleton:

```json
{
  "train_file": "data/task2/train.json",
  "eval_file": "data/task2/dev.json",
  "model_name": "croissantllm/CroissantLLMChat-v0.1",
  "max_tokens_count": 512,
  "completion_only": true,
  "response_template": "<|im_start|>assistant",
  "lora": {"r": 32, "lora_alpha": 32, "lora_dropout": 0.05, "bias": "none", "target_modules": ["q_proj", "v_proj"]},
  "trainer": {"num_train_epochs": 1, "per_device_train_batch_size": 8, "gradient_accumulation_steps": 4, "learning_rate": 5e-5, "eval_strategy": "steps", "eval_steps": 50, "save_steps": 200, "report_to": "wandb", "push_to_hub": true, "hub_model_id": "user/project-sft-v1"},
  "seed": 3407,
  "output_dir": "models/project-sft-v1"
}
```

Launch training:

```bash
python src/sft.py train \
    --config_file configs/skommarkhos_croissantllmchat_v0.1_1b_sft_v1.json \
    --output_dir models/sft_run_1
```

Notes:
- Set `use_unsloth=True` for faster adapter training (8-bit/4-bit)
- `completion_only` rewrites the internal collator to focus loss on assistant spans (see the sketch below)
- The generation config is saved for downstream evaluation
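For intuition, "completion-only" loss sets the label for every token before (and including) the response template to the ignore index, so cross-entropy is computed over the assistant span only. A minimal, framework-agnostic sketch of that masking (the helper name is illustrative; TRL also ships a `DataCollatorForCompletionOnlyLM` with similar behaviour):

```python
IGNORE_INDEX = -100  # PyTorch cross-entropy ignores labels with this value

def mask_to_completion(input_ids: list[int], template_ids: list[int]) -> list[int]:
    """Return labels equal to input_ids, with everything up to and including
    the first occurrence of the response template masked out."""
    labels = list(input_ids)
    for start in range(len(input_ids) - len(template_ids) + 1):
        if input_ids[start:start + len(template_ids)] == template_ids:
            end = start + len(template_ids)
            labels[:end] = [IGNORE_INDEX] * end
            return labels
    # No template found: mask the whole example so it contributes no loss.
    return [IGNORE_INDEX] * len(labels)
```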
ARPO training mimics constrained policy optimization with a custom trainer (`CPOTrainer`). The invocation mirrors SFT:

```bash
python src/arpo.py train \
    --config_file configs/skommarkhos_croissantllmchat_v0.1_1b_arpo_v1.json \
    --output_dir models/arpo_run_1
```

Key deltas vs SFT:
- `loss_type` (e.g. `simpo`) inside `trainer` (sketched below)
- Separate prompt/completion length caps (`max_prompt_length`, `max_completion_length`)
- A lower learning rate is typical (`5e-7` in the example)
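SimPO, one of the available `loss_type` values, optimizes a reference-free, margin-shifted preference objective over length-normalized log-probabilities. A minimal PyTorch sketch (the `beta` / `gamma` names follow the SimPO paper and may not match this repo's config keys):

```python
import torch
import torch.nn.functional as F

def simpo_loss(
    chosen_logps: torch.Tensor,    # summed token log-probs of preferred completions, (B,)
    rejected_logps: torch.Tensor,  # summed token log-probs of dispreferred completions, (B,)
    chosen_lens: torch.Tensor,     # completion lengths in tokens, (B,)
    rejected_lens: torch.Tensor,
    beta: float = 2.0,             # reward scale
    gamma: float = 1.0,            # target reward margin
) -> torch.Tensor:
    # Length-normalized implicit rewards; SimPO needs no reference model.
    r_chosen = beta * chosen_logps / chosen_lens
    r_rejected = beta * rejected_logps / rejected_lens
    # Bradley-Terry preference loss with a margin.
    return -F.logsigmoid(r_chosen - r_rejected - gamma).mean()
```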
Enable by setting `"use_nmt_callback": true`. The callback:
- Derives instruction/target pairs if absent
- Generates translations using the saved `generation_config`
- Scores with COMET-22 (if `unbabel-comet` is installed)
- Logs metrics (to W&B if enabled)

It runs roughly twice per training run by dynamically spacing evaluation steps.
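The same metric can also be computed standalone, which is handy for post-hoc checks. A minimal sketch with the `unbabel-comet` package (the example triple reuses the banker pun; scoring a real run would use your system outputs):

```python
from comet import download_model, load_from_checkpoint

# Requires `pip install unbabel-comet`; downloads the COMET-22 checkpoint on first use.
model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))

data = [{
    "src": "I used to be a banker but I lost interest",
    "mt": "J'ai été banquier mais j'en ai perdu tout l'intérêt",
    "ref": "J'ai été banquier mais j'en ai perdu tout l'intérêt",
}]
output = model.predict(data, batch_size=8, gpus=0)  # gpus=0 runs on CPU
print(output.system_score)  # corpus-level score, roughly in [0, 1]
```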
```bash
python src/run_submission_inference.py main \
    --model_path models/sft_run_1 \
    --test_data data/task2/joker_pun_translation_2025_test.json \
    --output_dir submissions/task2 \
    --batch_size 32
```

Outputs:
- JSON with fields: `run_id`, `manual`, `id_en`, `en`, `fr`
- An auto-generated ZIP containing `prediction.json` (ready for upload)

Temperature / sampling settings are currently defined inline; tune them as desired inside `run_submission_inference.py`.
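The packaging step itself is small: write `prediction.json` and wrap it in a ZIP. A minimal sketch of the idea (the function name and example record are illustrative, not the repo's actual code):

```python
import json
import zipfile
from pathlib import Path

def package_submission(predictions: list[dict], output_dir: str) -> Path:
    """Write predictions to prediction.json and zip it for upload."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    json_path = out / "prediction.json"
    json_path.write_text(json.dumps(predictions, ensure_ascii=False, indent=2),
                         encoding="utf-8")
    zip_path = out / "prediction.zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(json_path, arcname="prediction.json")
    return zip_path

package_submission(
    [{"run_id": "team_sft_v1", "manual": 0, "id_en": "1",
      "en": "I used to be a banker but I lost interest",
      "fr": "J'ai été banquier mais j'en ai perdu tout l'intérêt"}],
    "submissions/task2",
)
```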
- Fixed seed in config (`seed`)
- Explicit tokenizer + special tokens saved to `output_dir`
- Generation parameters versioned alongside the model
- LoRA adapter weights are merged only if you export them explicitly (default: PEFT adapter format); see the sketch below
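To export merged weights rather than the default adapter format, standard PEFT calls suffice. A minimal sketch, assuming an SFT run saved under `models/sft_run_1` (paths are illustrative):

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("croissantllm/CroissantLLMChat-v0.1")
model = PeftModel.from_pretrained(base, "models/sft_run_1")  # adapter directory
merged = model.merge_and_unload()  # fold LoRA deltas into the base weights
merged.save_pretrained("models/sft_run_1-merged")
AutoTokenizer.from_pretrained("models/sft_run_1").save_pretrained("models/sft_run_1-merged")
```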
| Goal | Where to Modify |
|---|---|
| New metric | `src/utils/metrics.py` |
| Alternate reward / loss | `utils/cpo_trainer.py` (custom trainer) |
| Prompt template logic | `src/scripts/prompt.py` |
| Custom collator | `utils/collators.py` |
- Add retrieval baseline for Task 1 (BM25 + reranker)
- Add name-entity augmentation patterns for Task 3
- Publish structured requirements file & lightweight Dockerfile
- Add evaluation harness for BLEU / chrF / pun-preservation score
- Merge LoRA weights export utility script
This project is licensed under the Apache License 2.0.
Copyright 2025 Igor Kuzmin
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
If you use this codebase or derivatives in academic work, please cite:
```bibtex
@inproceedings{kuzmin2025joker,
  author    = {Igor Kuzmin},
  title     = {{CLEF} 2025 {JOKER} Track: No Pun Left Behind},
  booktitle = {CLEF 2025 Labs and Workshops, Notebook Papers},
  series    = {CEUR Workshop Proceedings},
  volume    = {4038},
  publisher = {CEUR-WS.org},
  year      = {2025},
  url       = {https://ceur-ws.org/Vol-4038/paper_225.pdf},
  issn      = {1613-0073},
  note      = {Paper 225}
}
```