🚀 FastSparkTTS – Based on the SparkTTS model, this platform provides high-quality Chinese speech synthesis and voice cloning services. With an easy-to-use web interface, you can effortlessly create natural and realistic human voices to suit various scenarios.
- 🚀 Multiple Backend Acceleration Options: Supports acceleration strategies such as `vllm`, `sglang`, and `llama-cpp`
- 🎯 High Concurrency: Utilizes dynamic batching to significantly boost concurrent processing
- 🎛️ Full Parameter Control: Offers comprehensive adjustments for pitch, speech rate, timbre, temperature, and more
- 📱 Lightweight Deployment: Minimal dependencies, with rapid startup based on Flask and FastAPI
- 🎨 Clean Interface: Features a modern, standardized UI
- 🔊 Long Text Speech Synthesis: Capable of synthesizing extended texts while maintaining consistent voice timbre
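Long-text synthesis generally works by splitting the input into sentence-sized chunks and synthesizing each chunk with the same voice prompt, which is what keeps the timbre consistent across a long passage. Below is a minimal chunker sketch; the splitting rules and the `max_len` default are illustrative assumptions, not FastSparkTTS's actual implementation:

```python
import re

def chunk_text(text: str, max_len: int = 120) -> list[str]:
    """Split text into chunks of at most max_len characters,
    cutting only at sentence boundaries (., !, ?, and CJK equivalents).
    A single sentence longer than max_len becomes its own oversized chunk."""
    # Grab each sentence together with its trailing punctuation and whitespace.
    sentences = re.findall(r"[^.!?。！？]+[.!?。！？]?\s*", text)
    chunks: list[str] = []
    current = ""
    for sent in sentences:
        if current and len(current) + len(sent) > max_len:
            chunks.append(current)
            current = ""
        current += sent
    if current:
        chunks.append(current)
    return chunks
```

Each chunk is then fed to the model with the same reference audio, and the resulting waveforms are concatenated.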
- Python 3.10+
- Flask 2.0+
- fastapi
- `vllm`, `sglang`, or `llama-cpp`
```bash
pip install -r requirements.txt
```
(Install one as needed; if using torch for inference, you can skip this step)
- vLLM

  The vllm version must be greater than `0.7.2`:

  ```bash
  pip install vllm
  ```

  For more details, please refer to: https://github.com/vllm-project/vllm
- llama-cpp

  ```bash
  pip install llama-cpp-python
  ```

  Convert the LLM weights to gguf format, save the file as `model.gguf`, and place it in the `LLM` directory. You can refer to the following method for weight conversion; if quantization is needed, configure the parameters accordingly.

  ```bash
  git clone https://github.com/ggml-org/llama.cpp.git
  cd llama.cpp
  python convert_hf_to_gguf.py Spark-TTS-0.5B/LLM --outfile Spark-TTS-0.5B/LLM/model.gguf
  ```
- sglang

  ```bash
  pip install sglang
  ```

  For more details, please refer to: https://github.com/sgl-project/sglang
Weight download links: huggingface, modelscope
- Clone the project repository

  ```bash
  git clone https://github.com/HuiResearch/Fast-Spark-TTS.git
  cd Fast-Spark-TTS
  ```
- Start the SparkTTS API Service

  The engine can be chosen according to your environment; currently supported options are `torch`, `vllm`, `sglang`, and `llama-cpp`.

  ```bash
  python server.py \
    --model_path Spark-TTS-0.5B \
    --engine vllm \
    --llm_device cuda \
    --tokenizer_device cuda \
    --detokenizer_device cuda \
    --wav2vec_attn_implementation sdpa \
    --max_length 32768 \
    --llm_gpu_memory_utilization 0.6 \
    --host 0.0.0.0 \
    --port 8000
  ```
- Start the Web Interface

  ```bash
  python frontend.py
  ```
- Access via your browser

  http://localhost:8001
- Switch to the Speech Synthesis tab.
- Enter the text you wish to convert to speech.
- Adjust parameters such as gender, pitch, and speech rate.
- Click the Generate Speech button.
- Once generation is complete, play or download the audio.
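If you prefer to script the synthesis step rather than use the web UI, the request body can be assembled programmatically. The sketch below is purely illustrative: the field names and the endpoint path in the comment are assumptions, not the documented FastSparkTTS API; check `server.py` for the actual schema.

```python
import json

def build_tts_request(text: str, gender: str = "female",
                      pitch: str = "moderate", speed: str = "moderate") -> str:
    """Assemble a JSON body for a synthesis request.
    NOTE: field names here are hypothetical; consult server.py for the real API."""
    payload = {
        "text": text,       # the text to convert to speech
        "gender": gender,   # e.g. "male" / "female"
        "pitch": pitch,     # e.g. "low" / "moderate" / "high"
        "speed": speed,     # speech rate
    }
    return json.dumps(payload, ensure_ascii=False)

# Posting it (assumed endpoint, shown for illustration only):
# import urllib.request
# req = urllib.request.Request("http://localhost:8000/generate_voice",
#                              data=build_tts_request("你好").encode("utf-8"),
#                              headers={"Content-Type": "application/json"})
```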
- Switch to the Voice Cloning tab.
- Enter the target text.
- Upload the reference audio.
- Enter the corresponding text for the reference audio.
- Adjust the parameters.
- Click the Clone Voice button.
- Once cloning is complete, play or download the audio.
- Switch to the Character Cloning tab.
- Enter the target text.
- Choose your desired character.
- Adjust the parameters.
- Click the Character Cloning button.
- Once cloning is complete, play or download the audio.
Graphics Card: A800
Cloning speed is tested with `prompt_audio.wav`; inference is run in a loop five times and the average inference time (in seconds) is reported.
Test code reference: `speed_test.py`
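The averaging procedure described above can be sketched as a generic timing harness (this is not the actual `speed_test.py`; the `fn` argument stands in for whatever inference call is being measured):

```python
import time

def benchmark(fn, n_runs: int = 5, warmup: bool = True) -> float:
    """Run fn n_runs times and return the mean wall-clock time in seconds.
    With warmup=True, one extra untimed call absorbs first-call overhead,
    which matters for backends like vllm/sglang whose first inference is slow."""
    if warmup:
        fn()  # untimed warm-up call
    total = 0.0
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        total += time.perf_counter() - start
    return total / n_runs

# Example with a stand-in for the real inference call:
avg = benchmark(lambda: time.sleep(0.01), n_runs=5)
```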
After using vllm, most of the processing time is spent on the audio tokenizer and vocoder rather than the LLM. Optimization using ONNX might further improve performance.
| engine    | device | Avg Time (s) | Avg Time, warmed up (s) |
|-----------|--------|--------------|-------------------------|
| Official  | CPU    | 27.20        | 27.30                   |
| Official  | GPU    | 5.95         | 4.97                    |
| llama-cpp | CPU    | 11.32        | 11.09                   |
| vllm      | GPU    | 1.95         | 1.22                    |
| sglang    | GPU    | 3.41         | 0.76                    |
Usage instructions can be found in `inference.py`.
For API deployment and repeated inference calls, it is recommended to use asynchronous (async) methods.
Note: For backends like vllm and sglang, the first inference call might take longer, but subsequent calls will perform normally. For benchmarking, it is advised to warm up using the first data entry.
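The asynchronous pattern recommended above looks roughly like this. `fake_tts_call` is a placeholder that simulates a network round-trip; in practice it would be replaced by a real async HTTP request (e.g. via aiohttp or httpx) against the running service:

```python
import asyncio

async def fake_tts_call(text: str) -> bytes:
    """Stand-in for an async HTTP request to the TTS service."""
    await asyncio.sleep(0.01)          # simulated network/inference latency
    return f"audio:{text}".encode()

async def synthesize_batch(texts: list[str]) -> list[bytes]:
    # Dispatch all requests concurrently instead of one after another,
    # letting the server's dynamic batching group them into a single batch.
    return await asyncio.gather(*(fake_tts_call(t) for t in texts))

results = asyncio.run(synthesize_batch(["a", "b", "c"]))
```

Issuing the calls concurrently is what lets dynamic batching on the server side pay off; sequential blocking calls would serialize the work and forfeit most of the throughput gain.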
This project provides a zero-shot voice cloning TTS model intended for academic research, educational purposes, and lawful applications such as personalized speech synthesis, assistive technologies, and linguistic studies.
Please note:
- Do not use this model for unauthorized voice cloning, impersonation, fraud, scams, deepfakes, or any illegal activities.
- Ensure compliance with local laws, regulations, and ethical standards when using this model.
- The developers assume no responsibility for any misuse of this model.
This project advocates the responsible development and use of artificial intelligence and encourages the community to adhere to safety and ethical principles in AI research and applications.
This project is built upon Spark-TTS and is distributed under the same open-source license as SparkTTS. For details, please refer to the original SparkTTS License.