📝 Technical Report | 🌐 Project Page | 🤗 Hugging Face | 🤖 ModelScope
- Introduction
- Updates
- Key Features
- Evaluation
- Model & Benchmark Downloads
- Environment Preparation
- Example Usage
- SFT
- Citation
## Introduction

Ming-UniAudio is a novel framework that unifies speech understanding, generation, and editing. At its core is a unified continuous speech tokenizer that integrates semantic and acoustic features within an end-to-end model. Building on this unified continuous audio tokenizer, we developed a speech language model that balances generation and understanding capabilities. Leveraging this foundation model, which exhibits robust performance in both domains, we further trained a dedicated speech editing model built upon Ming-Lite-Omni. Crucially, Ming-UniAudio is the first to enable universal, free-form speech editing guided solely by natural language instructions, handling complex semantic and acoustic modifications without manual region specification.
## Updates

- 🔥 First unified continuous speech tokenizer for both understanding and generation tasks: MingTok-Audio
- 🔥 First speech LLM with a unified continuous tokenizer for both understanding and generation: Ming-UniAudio
- 🔥 First universal free-form speech editing model for various semantic and acoustic editing tasks without any timestamp condition: Ming-UniAudio-Edit
- 🔥 First benchmark for free-form speech editing: Ming-Freeform-Audio-Edit-Benchmark
- Support vLLM inference
- Technical Report
- ASR & TTS SFT recipes
- Streaming TTS
- Ming-UniAudio Blog
## Key Features

Compared with other audio-assisted LLMs, Ming-UniAudio offers the following key optimizations:

- Unified Continuous Speech Tokenizer: Ming-UniAudio introduces MingTok-Audio, a unified continuous speech tokenizer based on a VAE framework with a causal Transformer architecture. It is the first continuous speech tokenizer to effectively integrate semantic and acoustic features, and its hierarchical feature representations enable a closed-loop system with LLMs, making it suitable for both understanding and generation tasks.
- Unified Speech Language Model for Generation and Understanding: We pretrain an end-to-end unified speech language model with a single LLM backbone for both understanding and generation tasks, enhanced with a Diffusion Head to ensure high-quality speech synthesis (a purely conceptual sketch of this closed loop follows the list below).
- Instruction-Guided Free-Form Speech Editing: We introduce the first instruction-guided, free-form speech editing framework that supports comprehensive semantic and acoustic edits without requiring explicit edit regions, along with Ming-Freeform-Audio-Edit, the first open-source evaluation set for such tasks.
## Evaluation

Across a range of benchmarks, Ming-UniAudio delivers highly competitive results compared with industry-leading models of similar scale.
| System | Frame Rate (Hz) | SEED-ZH PESQ↑ | SEED-ZH SIM↑ | SEED-ZH STOI↑ | SEED-EN PESQ↑ | SEED-EN SIM↑ | SEED-EN STOI↑ |
|---|---|---|---|---|---|---|---|
| MiMo-Audio-Tokenizer | 25 | 2.71 | 0.89 | 0.93 | 2.43 | 0.85 | 0.92 |
| GLM4-Voice-Tokenizer | 12.5 | 1.06 | 0.33 | 0.61 | 1.05 | 0.12 | 0.60 |
| Baichuan-Audio-Tokenizer | 12.5 | 1.84 | 0.78 | 0.86 | 1.62 | 0.69 | 0.85 |
| XY-Tokenizer | 12.5 | 2.27 | 0.77 | 0.90 | 2.14 | 0.82 | 0.90 |
| Mimi | 75 | 2.05 | 0.73 | 0.89 | 2.01 | 0.77 | 0.89 |
| XCodec2.0 | 50 | 2.19 | 0.80 | 0.92 | 2.37 | 0.82 | 0.93 |
| BigCodec | 80 | 2.26 | 0.81 | 0.92 | 2.22 | 0.80 | 0.91 |
| MingTok-Audio (ours) | 50 | 4.21 | 0.96 | 0.98 | 4.04 | 0.96 | 0.98 |
| Task | Model | aishell2-ios | LS-clean | Hunan | Minnan | Guangyue | Chuanyu | Shanghai |
|---|---|---|---|---|---|---|---|---|
| Understanding ASR | Kimi-Audio | 2.56 | 1.28 | 31.93 | 80.28 | 41.49 | 6.69 | 60.64 |
| | Qwen2.5-Omni | 2.75 | 1.80 | 29.31 | 53.43 | 10.39 | 7.61 | 32.05 |
| | Qwen2-Audio | 2.92 | 1.60 | 25.88 | 123.78 | 7.59 | 7.77 | 31.73 |
| | Ming-UniAudio-16B-A3B (ours) | 2.84 | 1.62 | 9.80 | 16.50 | 5.51 | 5.46 | 14.65 |
| Task | Model | Speech-English WER | Speech-English NE-WER | Speech-English NE-FNR | Dialogue-English WER | Dialogue-English NE-WER | Dialogue-English NE-FNR | Speech-Mandarin WER | Speech-Mandarin NE-WER | Speech-Mandarin NE-FNR | Dialogue-Mandarin WER | Dialogue-Mandarin NE-WER | Dialogue-Mandarin NE-FNR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Understanding Context ASR | Qwen2-Audio | 11.49 | 27.27 | 35.08 | 13.99 | 33.02 | 32.92 | 9.92 | 24.10 | 30.02 | 7.00 | 22.76 | 26.17 |
| | Baichuan-Audio | 7.52 | 5.87 | 4.55 | 5.66 | 10.01 | 3.64 | 2.16 | 6.65 | 2.35 | 2.96 | 11.48 | 3.94 |
| | Kimi-Audio | 2.90 | 6.68 | 8.01 | 4.67 | 13.50 | 11.31 | 1.95 | 11.13 | 15.28 | 2.90 | 15.91 | 16.68 |
| | Baichuan-Omni-1.5 | 8.16 | 7.69 | 6.53 | 9.91 | 14.40 | 5.54 | 2.98 | 8.39 | 4.71 | 5.00 | 16.83 | 7.84 |
| | Qwen2.5-Omni-3B | 3.99 | 7.80 | 9.69 | 4.83 | 14.36 | 12.85 | 2.13 | 10.55 | 14.11 | 3.12 | 15.07 | 15.17 |
| | Qwen2.5-Omni-7B | 3.96 | 7.38 | 8.72 | 5.32 | 11.83 | 9.24 | 1.84 | 9.80 | 12.19 | 2.40 | 14.06 | 13.17 |
| | Ming-UniAudio-16B-A3B-Edit (ours) | 4.00 | 3.56 | 3.69 | 5.34 | 8.73 | 2.53 | 1.58 | 5.98 | 2.40 | 3.04 | 9.50 | 1.48 |
| Task | Model | Seed-zh WER(%) | Seed-zh SIM | Seed-en WER(%) | Seed-en SIM |
|---|---|---|---|---|---|
| Generation | Seed-TTS | 1.12 | 0.80 | 2.25 | 0.76 |
| | MiMo-Audio | 1.96 | - | 5.37 | - |
| | Qwen3-Omni-30B-A3B-Instruct | 1.07 | - | 1.39 | - |
| | Ming-Omni-Lite | 1.69 | 0.68 | 4.31 | 0.51 |
| | Ming-UniAudio-16B-A3B (ours) | 0.95 | 0.70 | 1.85 | 0.58 |
| Task | Model | WER(%) zh / en | ACC zh / en | SIM zh / en | RDE/RAE(%) zh / en | no-edit WER(%) zh / en |
|---|---|---|---|---|---|---|
| Deletion-basic | Ming-UniAudio-16B-A3B-Edit | 11.89 / 14.85 | 100 / 82.22 | 0.78 / 0.76 | - | 11.49 / 24.26 |
| Deletion | | 22.92 / 27.60 | 82.92 / 85 | 0.81 / 0.74 | - | 17.50 / 35.21 |
| Insertion-basic | | 3.42 / 6.63 | 80 / 71.43 | 0.83 / 0.79 | - | 3.52 / 17.70 |
| Insertion | | 3.89 / 7.592 | 79.31 / 62.31 | 0.83 / 0.79 | - | 4.10 / 18.84 |
| Substitution-basic | | 4.52 / 8.99 | 78.62 / 59.78 | 0.82 / 0.78 | - | 4.63 / 19.28 |
| Substitution | | 4.56 / 7.64 | 76.62 / 65.62 | 0.83 / 0.77 | - | 4.75 / 18.39 |
| Dialect Conversion | | 8.93 | 0.50 | 0.66 | - | - |
| Speed changing | | 5.88 / 17.53 | - | 0.66 / 0.57 | 6.36 / 5.92 (RDE) | - |
| Pitch changing | | 7.45 / 13.37 | - | 0.36 / 0.24 | - | - |
| Volume changing | | 1.71 / 1.35 | - | 0.86 / 0.80 | 14.9 / 11.7 (RAE) | - |
| Task | Model | Model Type | DNSMOS OVRL | DNSMOS SIG | DNSMOS BAK |
|---|---|---|---|---|---|
| Denoise | FullSubNet | specialized | 2.93 | 3.05 | 3.51 |
| | Inter-Subnet | specialized | 2.98 | 3.17 | 3.15 |
| | CDiffuSE | specialized | 2.84 | 3.37 | 3.52 |
| | SGMSE | specialized | 3.11 | 3.47 | 3.41 |
| | StoRM | specialized | 3.15 | 3.54 | 3.69 |
| | GenSE | specialized | 3.43 | 3.65 | 4.18 |
| | MiMo-Audio | general | 3.30 | 3.56 | 4.10 |
| | Ming-UniAudio-16B-A3B-Edit (ours) | general | 3.26 | 3.59 | 3.97 |
## Model & Benchmark Downloads

You can download our latest models and benchmark from both Hugging Face and ModelScope.
| Type | Model | Input modality | Output modality | Download |
|---|---|---|---|---|
| Tokenizer | MingTok-Audio | audio | audio | 🤗 HuggingFace 🤖 ModelScope |
| SpeechLLM | Ming-UniAudio-16B-A3B | audio | audio | 🤗 HuggingFace 🤖 ModelScope |
| SpeechLLM | Ming-UniAudio-16B-A3B-Edit | text, audio | text, audio | 🤗 HuggingFace 🤖 ModelScope |
| Benchmark | Ming-Freeform-Audio-Edit | - | - | 🤗 HuggingFace 🤖 ModelScope Eval tools |
```shell
pip install modelscope
modelscope download --model inclusionAI/Ming-UniAudio-16B-A3B --local_dir inclusionAI/Ming-UniAudio-16B-A3B --revision master
```
Note: This download process will take several minutes to several hours, depending on your network conditions.
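If you prefer Hugging Face, the Python sketch below uses `huggingface_hub.snapshot_download` as an alternative; it assumes the Hugging Face repo id matches the ModelScope one, so adjust `repo_id` and `local_dir` to your setup.

```python
# Hypothetical mirror of the ModelScope command above, using huggingface_hub.
# Assumption: the Hugging Face repo id is inclusionAI/Ming-UniAudio-16B-A3B.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="inclusionAI/Ming-UniAudio-16B-A3B",
    local_dir="inclusionAI/Ming-UniAudio-16B-A3B",
)
```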
## Environment Preparation

```shell
pip install -r requirements.txt
```

You can also set up the environment using Docker in two ways.
- Option 1: Pull from Docker Hub (Recommended)
```shell
# 1. Pull the pre-built image
docker pull yongjielv/ming_uniaudio:v1.1

# 2. Run the container
docker run -it --gpus all yongjielv/ming_uniaudio:v1.1 /bin/bash
```

- Option 2: Build from Source
```shell
# 1. Build the image
docker build -t ming-uniaudio:v1.1 -f ./docker/ming_uniaudio.dockerfile .

# 2. Run the container
docker run -it --gpus all ming-uniaudio:v1.1 /bin/bash
```

## Example Usage

We provide a step-by-step running example:
Step 1 - Download the source code

```shell
git clone https://github.com/inclusionAI/Ming-UniAudio
cd Ming-UniAudio
```
Step 2 - Download the Ming-UniAudio model weights and create a soft link to the source code directory

Download our model following Model & Benchmark Downloads, then:

```shell
mkdir inclusionAI
ln -s /path/to/inclusionAI/Ming-UniAudio-16B-A3B inclusionAI/Ming-UniAudio-16B-A3B
```

Step 3 - From the code directory, run the Ming-UniAudio model with the following command:
```shell
python3 cookbooks/test.py
```

For detailed usage, please refer to demo.ipynb.
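For a quick programmatic starting point, the sketch below shows one plausible way to load the weights with Hugging Face `transformers`; this is an assumption rather than the project's documented entry point, so treat `cookbooks/test.py` and `demo.ipynb` as the authoritative examples.

```python
# A minimal loading sketch, assuming the checkpoint ships custom modeling code
# that transformers can pick up via trust_remote_code. The supported inference
# pipelines (understanding, generation, editing) are shown in cookbooks/test.py.
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "inclusionAI/Ming-UniAudio-16B-A3B",  # local soft link created in Step 2
    trust_remote_code=True,
    device_map="auto",
)
print(type(model).__name__)
```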
Note: We tested the examples on NVIDIA H800-80GB / H20-96GB hardware with CUDA 12.4.
## SFT

We have open-sourced the Supervised Fine-Tuning (SFT) recipes for speech generation, which support both full-parameter and LoRA training. Please follow the recipes to start training.
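For orientation, here is a hedged sketch of what a LoRA adapter configuration could look like with the `peft` library; the targeted module names are placeholders, and the actual supported hyperparameters and launch scripts are defined in the SFT recipes.

```python
# Illustrative LoRA configuration using peft; target_modules names are assumptions
# and must be matched to the layer names actually used in the Ming-UniAudio recipes.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                       # adapter rank
    lora_alpha=32,              # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # hypothetical
    task_type="CAUSAL_LM",
)
```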
## Citation

If you find our work helpful, feel free to cite us:
```bibtex
@misc{yan2025minguniaudiospeechllmjoint,
title={Ming-UniAudio: Speech LLM for Joint Understanding, Generation and Editing with Unified Representation},
author={Canxiang Yan and Chunxiang Jin and Dawei Huang and Haibing Yu and Han Peng and Hui Zhan and Jie Gao and Jing Peng and Jingdong Chen and Jun Zhou and Kaimeng Ren and Ming Yang and Mingxue Yang and Qiang Xu and Qin Zhao and Ruijie Xiong and Shaoxiong Lin and Xuezhi Wang and Yi Yuan and Yifei Wu and Yongjie Lyu and Zhengyu He and Zhihao Qiu and Zhiqiang Fang and Ziyuan Huang},
year={2025},
eprint={2511.05516},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2511.05516},
}
```


