
PASE: Phonologically Anchored Speech Enhancer

arXiv: https://arxiv.org/abs/2511.13300

🎉 This is the official implementation of our AAAI 2026 paper:

PASE: Leveraging the Phonological Prior of WavLM for Low-Hallucination Generative Speech Enhancement.

🔥 The pre-trained checkpoints will be released soon.

Inference

To run inference on audio files, use:

python -m inference.inference --input_dir <input_dir> --output_dir <output_dir> [options]
Argument     | Requirement / Default | Description
-------------|-----------------------|------------------------------------------------------------------
--input_dir  | required              | Path to the input directory containing audio files.
--output_dir | required              | Path to the output directory where enhanced files will be saved.
--device     | default: cuda:0       | Torch device to run inference on, e.g., cuda:0, cuda:1, or cpu.
--extension  | default: .wav         | Audio file extension to process.
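
For example, to enhance a folder of FLAC files on the first GPU (the directory names below are placeholders):

    python -m inference.inference --input_dir ./noisy --output_dir ./enhanced --device cuda:0 --extension .flac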

Training

Step 1: Training a single-stream vocoder

  • training script: train/train_vocoder.py

  • training configuration: configs/cfg_train_vocoder.yaml

    python -m train.train_vocoder -C configs/cfg_train_vocoder.yaml -D 0,1,2,3
  • inference script: inference/infer_vocoder.py

    python -m inference.infer_vocoder -C configs/cfg_infer.yaml -D 0

This step pre-trains a vocoder on the 24th-layer WavLM representations. The pre-trained single-stream vocoder is then reused in Step 2 to reconstruct waveforms, enabling the evaluation of DeWavLM's performance.
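
For reference, pulling intermediate-layer features out of WavLM typically looks like the sketch below. It assumes the Hugging Face transformers implementation of WavLM for illustration; it is not the extraction code this repository ships.

    # Illustrative sketch: extracting 24th-layer WavLM features.
    # Assumes the Hugging Face `transformers` WavLM and 16 kHz mono input.
    import torch
    from transformers import WavLMModel

    model = WavLMModel.from_pretrained("microsoft/wavlm-large").eval()

    wav = torch.randn(1, 16000)  # placeholder: 1 second of 16 kHz audio
    with torch.no_grad():
        out = model(wav, output_hidden_states=True)

    # hidden_states[0] is the CNN/embedding output, so index 24 is the
    # output of the 24th (final) Transformer layer of WavLM Large.
    layer24 = out.hidden_states[24]  # shape: (batch, frames, 1024)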

Step 2: Fine-tuning WavLM

  • training script: train/train_wavlm.py
  • training configuration: configs/cfg_train_wavlm.yaml
  • inference script: inference/infer_wavlm.py

(The usage is the same as in Step 1.)

This step fine-tunes WavLM into a denoised WavLM (DeWavLM) via knowledge distillation, referred to in the paper as denoising representation distillation (DRD).
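
The general shape of such representation distillation is sketched below: a frozen teacher WavLM encodes clean speech, a trainable student encodes the noisy mixture, and the student is trained to match the teacher's intermediate representations. The L1 loss and the choice of layers 1 and 24 here are assumptions for illustration, not the paper's exact recipe.

    # Illustrative sketch of denoising representation distillation (DRD).
    # Loss type and layer selection are assumptions; see the paper/configs.
    import torch
    import torch.nn.functional as F
    from transformers import WavLMModel

    teacher = WavLMModel.from_pretrained("microsoft/wavlm-large").eval()
    student = WavLMModel.from_pretrained("microsoft/wavlm-large")  # trainable
    for p in teacher.parameters():
        p.requires_grad_(False)

    def drd_loss(noisy_wav, clean_wav):
        # Teacher sees clean speech; student sees the noisy mixture.
        with torch.no_grad():
            t = teacher(clean_wav, output_hidden_states=True).hidden_states
        s = student(noisy_wav, output_hidden_states=True).hidden_states
        # Match the streams the vocoder later consumes (layers 1 and 24).
        return sum(F.l1_loss(s[i], t[i]) for i in (1, 24))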

Step 3: Training a dual-stream vocoder

  • training script: train/train_vocoder_dual.py
  • training configuration: configs/cfg_train_vocoder_dual.yaml
  • inference script: inference/infer_vocoder_dual.py

(The usage is the same as in Step 1.)

This step trains the final dual-stream vocoder, which takes the acoustic (1st-layer) and phonetic (24th-layer) DeWavLM representations as inputs and produces the final enhanced waveform.
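
As a rough picture of what "dual-stream" means here, the two DeWavLM feature streams can be projected and fused before decoding to a waveform. The fusion below (per-stream linear projections followed by concatenation) is a hypothetical sketch, not the repository's actual vocoder architecture.

    # Hypothetical fusion of the two DeWavLM streams (illustration only;
    # see train/train_vocoder_dual.py for the real architecture).
    import torch
    import torch.nn as nn

    class DualStreamFusion(nn.Module):
        def __init__(self, feat_dim=1024, hidden=512):
            super().__init__()
            self.proj_acoustic = nn.Linear(feat_dim, hidden)  # 1st-layer stream
            self.proj_phonetic = nn.Linear(feat_dim, hidden)  # 24th-layer stream
            self.mix = nn.Linear(2 * hidden, hidden)

        def forward(self, acoustic, phonetic):
            # acoustic, phonetic: (batch, frames, feat_dim) DeWavLM features
            h = torch.cat([self.proj_acoustic(acoustic),
                           self.proj_phonetic(phonetic)], dim=-1)
            return self.mix(h)  # fused features fed to the vocoder decoder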

Citation

If you find this work useful, please cite our paper:

@misc{rong2025pase,
      title={PASE: Leveraging the Phonological Prior of WavLM for Low-Hallucination Generative Speech Enhancement}, 
      author={Xiaobin Rong and Qinwen Hu and Mansur Yesilbursa and Kamil Wojcicki and Jing Lu},
      year={2025},
      eprint={2511.13300},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2511.13300}, 
}

Contact

Xiaobin Rong: [email protected]

Mansur Yesilbursa: [email protected]
