This repository contains the official implementation of ssToken, accepted at ICLR 2026.
Supervised fine-tuning (SFT) quality is highly sensitive to token-level noise. Existing token selection methods typically rely on an extra reference model and loss-only criteria, which can miss semantically important tokens.
ssToken addresses these limitations with two complementary signals:
- Self-modulated selection (REL): uses retrospective excess loss between the current model and its history model, avoiding a separately trained reference model.
- Semantic-aware selection (attention score): estimates token importance from attention patterns, complementing loss-based filtering.
The final token score is computed as:

```
Score(x_i) = gamma * Normalize(REL(x_i)) + (1 - gamma) * AttnScore(x_i)
```

where `gamma` controls the trade-off between the optimization signal (REL) and the semantic signal (attention).
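As a concrete illustration, the scoring rule above can be sketched in plain Python. This sketch assumes min-max normalization for `Normalize` and treats `AttnScore` as already given per token; the actual normalization in the codebase may differ.

```python
def min_max_normalize(values):
    """Scale values to [0, 1]; constant inputs map to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def token_scores(rel, attn, gamma):
    """Score(x_i) = gamma * Normalize(REL(x_i)) + (1 - gamma) * AttnScore(x_i)."""
    rel_norm = min_max_normalize(rel)
    return [gamma * r + (1 - gamma) * a for r, a in zip(rel_norm, attn)]
```

For example, with `rel = [0.5, 1.5, 1.0]`, `attn = [0.2, 0.1, 0.7]`, and `gamma = 0.5`, the normalized REL is `[0.0, 1.0, 0.5]` and the combined scores are `[0.1, 0.55, 0.6]`.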
We recommend Python 3.10+ and a CUDA-compatible PyTorch installation.
```bash
pip install -r requirements.txt
```

Consistent with TokenCleaning, we adopt the same data preparation setup.
The data pool (50k samples) is constructed with DS2, a recent data curation pipeline that selects samples using LLM-generated quality ratings. For convenience, the 50k samples used can be accessed on Hugging Face via the link.
Before ssToken training, compute token losses for the base/history model:
```bash
bash bash_src/calculate_loss.sh <model_name_or_path> <train_data_json> <per_device_batch_size> <num_gpus>
```

In the current code release, `scripts/finetune.py` loads history losses from a placeholder path:

```python
ref_token_losses = torch.load("your_path_to_base_model_losses.pt")
```

Please set it to your generated loss file (e.g., under `results/loss/`).
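For intuition, the per-token losses computed here are standard causal-LM negative log-likelihoods, and the self-modulated REL signal is the gap between the current model's loss and the history model's loss on each token. A minimal pure-Python sketch, with toy per-token probabilities standing in for actual model forward passes (names are illustrative, not from the repository):

```python
import math

def per_token_losses(token_probs):
    """Per-token negative log-likelihood (cross-entropy with the target).

    token_probs: the probability each model assigned to the ground-truth
    next token -- a toy stand-in for a real forward pass with shifted labels.
    """
    return [-math.log(p) for p in token_probs]

def retrospective_excess_loss(current_probs, history_probs):
    """REL: current-model loss minus history-model loss, per token.

    Tokens the current model still finds hard relative to its own earlier
    checkpoint receive a large positive REL.
    """
    cur = per_token_losses(current_probs)
    hist = per_token_losses(history_probs)
    return [c - h for c, h in zip(cur, hist)]
```

In practice the history losses are computed once with `calculate_loss.sh` and cached to disk, so only the current model's losses are needed during training.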
```bash
bash run.sh <ratio> <data_prop> <run_name>
```

- `ratio`: `gamma` in the paper (balance between REL and attention).
- `data_prop`: token selection ratio `rho`.
- `run_name`: output run identifier.
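The selection ratio `rho` keeps only the top-scoring fraction of tokens for the SFT loss. A sketch of what such a selection mask might look like (the repository's actual masking logic may differ):

```python
def select_top_tokens(scores, rho):
    """Return a 0/1 mask keeping the top rho fraction of tokens by score.

    Masked-out tokens (0) would be excluded from the fine-tuning loss.
    """
    k = max(1, int(len(scores) * rho))
    # Indices of the k highest-scoring tokens.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    mask = [0] * len(scores)
    for i in top:
        mask[i] = 1
    return mask
```

For example, `select_top_tokens([0.1, 0.9, 0.5, 0.3], rho=0.5)` keeps the two highest-scoring tokens, yielding the mask `[0, 1, 1, 0]`.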
Merged model output is saved to:
```
./model/<base_model_name>/lora_merged_<run_name>/
```
```bash
bash eval.sh <model_path> <task_name_or_all>
bash prepare_eval_data.sh
bash eval_tydiqa.sh <model_path>
```

- `eval.sh` sets `HF_DATASETS_OFFLINE=1` by default; adjust if online dataset access is needed.
- For large models, tune batch size and GPU count according to available memory.
- The codebase includes implementations for random/default/token-selection regimes via `token_select_pattern`.
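As a rough illustration of how those regimes could dispatch on a pattern name (hypothetical helper; the repository wires this behavior through `token_select_pattern`, whose exact values and semantics may differ):

```python
import random

def token_mask(pattern, scores, rho, seed=0):
    """Build a 0/1 token mask for a given selection regime.

    'default' trains on all tokens, 'random' keeps a random rho fraction,
    and 'token-selection' keeps the top rho fraction by score.
    """
    n = len(scores)
    k = max(1, int(n * rho))
    if pattern == "default":
        return [1] * n
    if pattern == "random":
        rng = random.Random(seed)  # seeded for reproducibility
        keep = set(rng.sample(range(n), k))
    elif pattern == "token-selection":
        keep = set(sorted(range(n), key=lambda i: scores[i], reverse=True)[:k])
    else:
        raise ValueError(f"unknown pattern: {pattern}")
    return [1 if i in keep else 0 for i in range(n)]
```

The random regime serves as a natural baseline: it keeps the same token budget as score-based selection while ignoring the scores entirely.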
If you find this repository useful, please cite:

```bibtex
@article{qin2025sstoken,
  title={ssToken: Self-modulated and Semantic-aware Token Selection for LLM Fine-tuning},
  author={Qin, Xiaohan and Wang, Xiaoxing and Liao, Ning and Zhang, Cancheng and Zhang, Xiangdong and Feng, Mingquan and Wang, Jingzhi and Yan, Junchi},
  journal={arXiv preprint arXiv:2510.18250},
  year={2025}
}
```