📝 Technical Report | 🌐 Project Page | 🤗 Hugging Face | 🤖 ModelScope
- Introduction
- Updates
- Key Features
- Evaluation
- Model & Benchmark Downloads
- Environment Preparation
- Example Usage
- SFT
- Citation
## Introduction

Ming-UniAudio is a novel framework that unifies speech understanding, generation, and editing. At its core is a unified continuous speech tokenizer that integrates semantic and acoustic features within an end-to-end model. Building on this unified continuous audio tokenizer, we developed a speech language model that balances generation and understanding capabilities. Leveraging this foundation model, which exhibits robust performance in both domains, we further trained a dedicated speech editing model built upon Ming-Lite-Omni. Crucially, Ming-UniAudio is the first to enable universal, free-form speech editing guided solely by natural language instructions, handling complex semantic and acoustic modifications without manual region specification.
## Updates

- 🔥 First unified continuous speech tokenizer for both understanding and generation tasks: MingTok-Audio
- 🔥 First speech LLM with a unified continuous tokenizer for both understanding and generation: Ming-UniAudio
- 🔥 First universal free-form speech editing model for various semantic and acoustic editing tasks without any timestamp condition: Ming-UniAudio-Edit
- 🔥 First benchmark for free-form speech editing: Ming-Freeform-Audio-Edit-Benchmark
- Support vLLM inference
- Technical Report
- ASR & TTS SFT recipes
- Streaming TTS
- Ming-UniAudio Blog
## Key Features

Compared with other audio-assisted LLMs, Ming-UniAudio offers the following key optimizations:

- Unified Continuous Speech Tokenizer: Ming-UniAudio introduces MingTok-Audio, a unified continuous speech tokenizer based on a VAE framework with a causal Transformer architecture. It is the first continuous speech tokenizer to effectively integrate semantic and acoustic features, and its hierarchical feature representations enable a closed-loop system with LLMs, making it suitable for both understanding and generation tasks.
- Unified Speech Language Model for Generation and Understanding: We pretrain an end-to-end unified speech language model with a single LLM backbone for both understanding and generation tasks, enhanced with a Diffusion Head to ensure high-quality speech synthesis (a purely conceptual sketch of this closed loop follows the list below).
- Instruction-Guided Free-Form Speech Editing: We introduce the first instruction-guided, free-form speech editing framework that supports comprehensive semantic and acoustic edits without requiring explicit edit regions, along with Ming-Freeform-Audio-Edit, the first open-source evaluation set for such tasks.
## Evaluation

Across a range of benchmarks, Ming-UniAudio delivers highly competitive results compared with industry-leading models of similar scale.
| System | Frame Rate (Hz) | SEED-ZH PESQ↑ | SEED-ZH SIM↑ | SEED-ZH STOI↑ | SEED-EN PESQ↑ | SEED-EN SIM↑ | SEED-EN STOI↑ |
|---|---|---|---|---|---|---|---|
| MiMo-Audio-Tokenizer | 25 | 2.71 | 0.89 | 0.93 | 2.43 | 0.85 | 0.92 |
| GLM4-Voice-Tokenizer | 12.5 | 1.06 | 0.33 | 0.61 | 1.05 | 0.12 | 0.60 |
| Baichuan-Audio-Tokenizer | 12.5 | 1.84 | 0.78 | 0.86 | 1.62 | 0.69 | 0.85 |
| XY-Tokenizer | 12.5 | 2.27 | 0.77 | 0.90 | 2.14 | 0.82 | 0.90 |
| Mimi | 75 | 2.05 | 0.73 | 0.89 | 2.01 | 0.77 | 0.89 |
| XCodec2.0 | 50 | 2.19 | 0.80 | 0.92 | 2.37 | 0.82 | 0.93 |
| BigCodec | 80 | 2.26 | 0.81 | 0.92 | 2.22 | 0.80 | 0.91 |
| MingTok-Audio (ours) | 50 | 4.21 | 0.96 | 0.98 | 4.04 | 0.96 | 0.98 |
| Task | Model | aishell2-ios | LS-clean | Hunan | Minnan | Guangyue | Chuanyu | Shanghai |
|---|---|---|---|---|---|---|---|---|
| Understanding ASR | Kimi-Audio | 2.56 | 1.28 | 31.93 | 80.28 | 41.49 | 6.69 | 60.64 |
| | Qwen2.5-Omni | 2.75 | 1.80 | 29.31 | 53.43 | 10.39 | 7.61 | 32.05 |
| | Qwen2-Audio | 2.92 | 1.60 | 25.88 | 123.78 | 7.59 | 7.77 | 31.73 |
| | Ming-UniAudio-16B-A3B (ours) | 2.84 | 1.62 | 9.80 | 16.50 | 5.51 | 5.46 | 14.65 |
| Task | Model | Speech-English WER | Speech-English NE-WER | Speech-English NE-FNR | Dialogue-English WER | Dialogue-English NE-WER | Dialogue-English NE-FNR | Speech-Mandarin WER | Speech-Mandarin NE-WER | Speech-Mandarin NE-FNR | Dialogue-Mandarin WER | Dialogue-Mandarin NE-WER | Dialogue-Mandarin NE-FNR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Understanding Context ASR | Qwen2-Audio | 11.49 | 27.27 | 35.08 | 13.99 | 33.02 | 32.92 | 9.92 | 24.10 | 30.02 | 7.00 | 22.76 | 26.17 |
| | Baichuan-Audio | 7.52 | 5.87 | 4.55 | 5.66 | 10.01 | 3.64 | 2.16 | 6.65 | 2.35 | 2.96 | 11.48 | 3.94 |
| | Kimi-Audio | 2.90 | 6.68 | 8.01 | 4.67 | 13.50 | 11.31 | 1.95 | 11.13 | 15.28 | 2.90 | 15.91 | 16.68 |
| | Baichuan-Omni-1.5 | 8.16 | 7.69 | 6.53 | 9.91 | 14.40 | 5.54 | 2.98 | 8.39 | 4.71 | 5.00 | 16.83 | 7.84 |
| | Qwen2.5-Omni-3B | 3.99 | 7.80 | 9.69 | 4.83 | 14.36 | 12.85 | 2.13 | 10.55 | 14.11 | 3.12 | 15.07 | 15.17 |
| | Qwen2.5-Omni-7B | 3.96 | 7.38 | 8.72 | 5.32 | 11.83 | 9.24 | 1.84 | 9.80 | 12.19 | 2.40 | 14.06 | 13.17 |
| | Ming-UniAudio-16B-A3B-Edit (ours) | 4.00 | 3.56 | 3.69 | 5.34 | 8.73 | 2.53 | 1.58 | 5.98 | 2.40 | 3.04 | 9.50 | 1.48 |
| Task | Model | Seed-zh WER(%) | Seed-zh SIM | Seed-en WER(%) | Seed-en SIM |
|---|---|---|---|---|---|
| Generation | Seed-TTS | 1.12 | 0.80 | 2.25 | 0.76 |
| | MiMo-Audio | 1.96 | - | 5.37 | - |
| | Qwen3-Omni-30B-A3B-Instruct | 1.07 | - | 1.39 | - |
| | Ming-Omni-Lite | 1.69 | 0.68 | 4.31 | 0.51 |
| | Ming-UniAudio-16B-A3B (ours) | 0.95 | 0.70 | 1.85 | 0.58 |
| Task | Model | WER(%) zh / en | ACC zh / en | SIM zh / en | RDE/RAE(%) zh / en | no-edit WER(%) zh / en |
|---|---|---|---|---|---|---|
| Deletion-basic | Ming-UniAudio-16B-A3B-Edit | 11.89 / 14.85 | 100 / 82.22 | 0.78 / 0.76 | - | 11.49 / 24.26 |
| Deletion | | 22.92 / 27.60 | 82.92 / 85 | 0.81 / 0.74 | - | 17.50 / 35.21 |
| Insertion-basic | | 3.42 / 6.63 | 80 / 71.43 | 0.83 / 0.79 | - | 3.52 / 17.70 |
| Insertion | | 3.89 / 7.592 | 79.31 / 62.31 | 0.83 / 0.79 | - | 4.10 / 18.84 |
| Substitution-basic | | 4.52 / 8.99 | 78.62 / 59.78 | 0.82 / 0.78 | - | 4.63 / 19.28 |
| Substitution | | 4.56 / 7.64 | 76.62 / 65.62 | 0.83 / 0.77 | - | 4.75 / 18.39 |
| Dialect Conversion | | 8.93 | 0.50 | 0.66 | - | - |
| Speed changing | | 5.88 / 17.53 | - | 0.66 / 0.57 | 6.36 / 5.92 (RDE) | - |
| Pitch changing | | 7.45 / 13.37 | - | 0.36 / 0.24 | - | - |
| Volume changing | | 1.71 / 1.35 | - | 0.86 / 0.80 | 14.9 / 11.7 (RAE) | - |
| Task | Model | Model Type | DNSMOS OVRL | DNSMOS SIG | DNSMOS BAK |
|---|---|---|---|---|---|
| Denoise | FullSubNet | specialized | 2.93 | 3.05 | 3.51 |
| | Inter-Subnet | specialized | 2.98 | 3.17 | 3.15 |
| | CDiffuSE | specialized | 2.84 | 3.37 | 3.52 |
| | SGMSE | specialized | 3.11 | 3.47 | 3.41 |
| | StoRM | specialized | 3.15 | 3.54 | 3.69 |
| | GenSE | specialized | 3.43 | 3.65 | 4.18 |
| | MiMo-Audio | general | 3.30 | 3.56 | 4.10 |
| | Ming-UniAudio-16B-A3B-Edit (ours) | general | 3.26 | 3.59 | 3.97 |
## Model & Benchmark Downloads

You can download our latest models and benchmark from both Hugging Face and ModelScope.
| Type | Model | Input modality | Output modality | Download |
|---|---|---|---|---|
| Tokenizer | MingTok-Audio | audio | audio | 🤗 HuggingFace 🤖 ModelScope |
| SpeechLLM | Ming-UniAudio-16B-A3B | audio | audio | 🤗 HuggingFace 🤖 ModelScope |
| SpeechLLM | Ming-UniAudio-16B-A3B-Edit | text, audio | text, audio | 🤗 HuggingFace 🤖 ModelScope |
| Benchmark | Ming-Freeform-Audio-Edit | - | - | 🤗 HuggingFace 🤖 ModelScope Eval tools |
```shell
pip install modelscope
modelscope download --model inclusionAI/Ming-UniAudio-16B-A3B --local_dir inclusionAI/Ming-UniAudio-16B-A3B --revision master
```
Note: This download process will take several minutes to several hours, depending on your network conditions.
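If you prefer Hugging Face, the Python sketch below uses `huggingface_hub.snapshot_download` as an alternative; it assumes the Hugging Face repo id matches the ModelScope one, so adjust `repo_id` and `local_dir` to your setup.

```python
# Hypothetical mirror of the ModelScope command above, using huggingface_hub.
# Assumption: the Hugging Face repo id is inclusionAI/Ming-UniAudio-16B-A3B.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="inclusionAI/Ming-UniAudio-16B-A3B",
    local_dir="inclusionAI/Ming-UniAudio-16B-A3B",
)
```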
## Environment Preparation

```shell
pip install -r requirements.txt
```

You can also set up the environment using Docker in two ways.
- Option 1: Pull from Docker Hub (Recommended)
```shell
# 1. Pull the pre-built image
docker pull yongjielv/ming_uniaudio:v1.1

# 2. Run the container
docker run -it --gpus all yongjielv/ming_uniaudio:v1.1 /bin/bash
```

- Option 2: Build from Source
```shell
# 1. Build the image
docker build -t ming-uniaudio:v1.1 -f ./docker/ming_uniaudio.dockerfile .

# 2. Run the container
docker run -it --gpus all ming-uniaudio:v1.1 /bin/bash
```

## Example Usage

We provide a step-by-step running example:
Step 1 - Download the source code

```shell
git clone https://github.com/inclusionAI/Ming-UniAudio
cd Ming-UniAudio
```
Step 2 - Download the Ming-UniAudio model weights and create a soft link to the source code directory

Download our model following Model & Benchmark Downloads, then:

```shell
mkdir inclusionAI
ln -s /path/to/inclusionAI/Ming-UniAudio-16B-A3B inclusionAI/Ming-UniAudio-16B-A3B
```

Step 3 - From the code directory, run the Ming-UniAudio model with the following command:
```shell
python3 cookbooks/test.py
```

For detailed usage, please refer to demo.ipynb.
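For a quick programmatic starting point, the sketch below shows one plausible way to load the weights with Hugging Face `transformers`; this is an assumption rather than the project's documented entry point, so treat `cookbooks/test.py` and `demo.ipynb` as the authoritative examples.

```python
# A minimal loading sketch, assuming the checkpoint ships custom modeling code
# that transformers can pick up via trust_remote_code. The supported inference
# pipelines (understanding, generation, editing) are shown in cookbooks/test.py.
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "inclusionAI/Ming-UniAudio-16B-A3B",  # local soft link created in Step 2
    trust_remote_code=True,
    device_map="auto",
)
print(type(model).__name__)
```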
Note: We tested the examples on NVIDIA H800-80GB / H20-96GB hardware with CUDA 12.4.
## SFT

We have open-sourced the Supervised Fine-Tuning (SFT) recipes for speech generation, which support both full-parameter and LoRA training. Please follow the recipes to start training.
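For orientation, here is a hedged sketch of what a LoRA adapter configuration could look like with the `peft` library; the targeted module names are placeholders, and the actual supported hyperparameters and launch scripts are defined in the SFT recipes.

```python
# Illustrative LoRA configuration using peft; target_modules names are assumptions
# and must be matched to the layer names actually used in the Ming-UniAudio recipes.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                       # adapter rank
    lora_alpha=32,              # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # hypothetical
    task_type="CAUSAL_LM",
)
```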
## Citation

If you find our work helpful, feel free to cite us:
```bibtex
@misc{yan2025minguniaudiospeechllmjoint,
title={Ming-UniAudio: Speech LLM for Joint Understanding, Generation and Editing with Unified Representation},
author={Canxiang Yan and Chunxiang Jin and Dawei Huang and Haibing Yu and Han Peng and Hui Zhan and Jie Gao and Jing Peng and Jingdong Chen and Jun Zhou and Kaimeng Ren and Ming Yang and Mingxue Yang and Qiang Xu and Qin Zhao and Ruijie Xiong and Shaoxiong Lin and Xuezhi Wang and Yi Yuan and Yifei Wu and Yongjie Lyu and Zhengyu He and Zhihao Qiu and Zhiqiang Fang and Ziyuan Huang},
year={2025},
eprint={2511.05516},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2511.05516},
}
```


