This repository contains the official implementation of ssToken, accepted at ICLR 2026.
Supervised fine-tuning (SFT) quality is highly sensitive to token-level noise. Existing token selection methods typically rely on an extra reference model and loss-only criteria, which can miss semantically important tokens.
ssToken addresses these limitations with two complementary signals:
- Self-modulated selection (REL): uses retrospective excess loss between the current model and its history model, avoiding a separately trained reference model.
- Semantic-aware selection (attention score): estimates token importance from attention patterns, complementing loss-based filtering.
The final token score is computed as:

```
Score(x_i) = gamma * Normalize(REL(x_i)) + (1 - gamma) * AttnScore(x_i)
```

where `gamma` controls the trade-off between the optimization signal (REL) and the semantic signal (attention).
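As a concrete illustration, the scoring rule above can be sketched in plain Python. This sketch assumes min-max normalization for `Normalize` and treats `AttnScore` as already given per token; the actual normalization in the codebase may differ.

```python
def min_max_normalize(values):
    """Scale values to [0, 1]; constant inputs map to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def token_scores(rel, attn, gamma):
    """Score(x_i) = gamma * Normalize(REL(x_i)) + (1 - gamma) * AttnScore(x_i)."""
    rel_norm = min_max_normalize(rel)
    return [gamma * r + (1 - gamma) * a for r, a in zip(rel_norm, attn)]
```

For example, with `rel = [0.5, 1.5, 1.0]`, `attn = [0.2, 0.1, 0.7]`, and `gamma = 0.5`, the normalized REL is `[0.0, 1.0, 0.5]` and the combined scores are `[0.1, 0.55, 0.6]`.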
We recommend Python 3.10+ and a CUDA-compatible PyTorch installation.
```bash
pip install -r requirements.txt
```

Consistent with TokenCleaning, we adopt the same data preparation setup.
The data pool (50k samples) is constructed with DS2, a recent data curation pipeline that selects samples using LLM-generated quality ratings. For convenience, the 50k samples used can be accessed on Hugging Face via the link.
Before ssToken training, compute token losses for the base/history model:
```bash
bash bash_src/calculate_loss.sh <model_name_or_path> <train_data_json> <per_device_batch_size> <num_gpus>
```

In the current code release, `scripts/finetune.py` loads history losses from a placeholder path:

```python
ref_token_losses = torch.load("your_path_to_base_model_losses.pt")
```

Please set it to your generated loss file (e.g., under `results/loss/`).
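For intuition, the per-token losses computed here are standard causal-LM negative log-likelihoods, and the self-modulated REL signal is the gap between the current model's loss and the history model's loss on each token. A minimal pure-Python sketch, with toy per-token probabilities standing in for actual model forward passes (names are illustrative, not from the repository):

```python
import math

def per_token_losses(token_probs):
    """Per-token negative log-likelihood (cross-entropy with the target).

    token_probs: the probability each model assigned to the ground-truth
    next token -- a toy stand-in for a real forward pass with shifted labels.
    """
    return [-math.log(p) for p in token_probs]

def retrospective_excess_loss(current_probs, history_probs):
    """REL: current-model loss minus history-model loss, per token.

    Tokens the current model still finds hard relative to its own earlier
    checkpoint receive a large positive REL.
    """
    cur = per_token_losses(current_probs)
    hist = per_token_losses(history_probs)
    return [c - h for c, h in zip(cur, hist)]
```

In practice the history losses are computed once with `calculate_loss.sh` and cached to disk, so only the current model's losses are needed during training.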
```bash
bash run.sh <ratio> <data_prop> <run_name>
```

- `ratio`: `gamma` in the paper (balance between REL and attention).
- `data_prop`: token selection ratio `rho`.
- `run_name`: output run identifier.
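The selection ratio `rho` keeps only the top-scoring fraction of tokens for the SFT loss. A sketch of what such a selection mask might look like (the repository's actual masking logic may differ):

```python
def select_top_tokens(scores, rho):
    """Return a 0/1 mask keeping the top rho fraction of tokens by score.

    Masked-out tokens (0) would be excluded from the fine-tuning loss.
    """
    k = max(1, int(len(scores) * rho))
    # Indices of the k highest-scoring tokens.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    mask = [0] * len(scores)
    for i in top:
        mask[i] = 1
    return mask
```

For example, `select_top_tokens([0.1, 0.9, 0.5, 0.3], rho=0.5)` keeps the two highest-scoring tokens, yielding the mask `[0, 1, 1, 0]`.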
Merged model output is saved to:
```
./model/<base_model_name>/lora_merged_<run_name>/
```
```bash
bash eval.sh <model_path> <task_name_or_all>
bash prepare_eval_data.sh
bash eval_tydiqa.sh <model_path>
```

- `eval.sh` sets `HF_DATASETS_OFFLINE=1` by default; adjust if online dataset access is needed.
- For large models, tune batch size and GPU count according to available memory.
- The codebase includes implementations for random/default/token-selection regimes via `token_select_pattern`.
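As a rough illustration of how those regimes could dispatch on a pattern name (hypothetical helper; the repository wires this behavior through `token_select_pattern`, whose exact values and semantics may differ):

```python
import random

def token_mask(pattern, scores, rho, seed=0):
    """Build a 0/1 token mask for a given selection regime.

    'default' trains on all tokens, 'random' keeps a random rho fraction,
    and 'token-selection' keeps the top rho fraction by score.
    """
    n = len(scores)
    k = max(1, int(n * rho))
    if pattern == "default":
        return [1] * n
    if pattern == "random":
        rng = random.Random(seed)  # seeded for reproducibility
        keep = set(rng.sample(range(n), k))
    elif pattern == "token-selection":
        keep = set(sorted(range(n), key=lambda i: scores[i], reverse=True)[:k])
    else:
        raise ValueError(f"unknown pattern: {pattern}")
    return [1 if i in keep else 0 for i in range(n)]
```

The random regime serves as a natural baseline: it keeps the same token budget as score-based selection while ignoring the scores entirely.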
If you find this repository useful, please cite:

```bibtex
@article{qin2025sstoken,
  title={ssToken: Self-modulated and Semantic-aware Token Selection for LLM Fine-tuning},
  author={Qin, Xiaohan and Wang, Xiaoxing and Liao, Ning and Zhang, Cancheng and Zhang, Xiangdong and Feng, Mingquan and Wang, Jingzhi and Yan, Junchi},
  journal={arXiv preprint arXiv:2510.18250},
  year={2025}
}
```