Commit 6f4bd54

feat(pitd, f0, docs): improve PITD and add hybrid F0 backend
BREAKING CHANGE: PITD scaler default changed from 2.0 to 1.0; PITD results prior to v0.9.0 are unreliable

feat(f0): add hybrid F0 backend with fallback; improve stability and reduce discontinuities

fix(pitd): fix overly flat PITD curves (issue #21); special thanks to @ma0shu for helping identify and diagnose this critical bug ❤

docs(readme): add v0.9.0+ warning; document scaler change; update troubleshooting
1 parent ab0c2e4 commit 6f4bd54

16 files changed

Lines changed: 811 additions & 295 deletions

README.en.md

Lines changed: 55 additions & 16 deletions
Original file line number | Diff line number | Diff line change
@@ -7,6 +7,15 @@
77
<a href="README.en.md"><img src="https://img.shields.io/badge/lang-English-blue.svg"></a>
88
</p>
99

10+
> [!WARNING]
11+
> 🚨 **Please Read Before Downloading** 🚨
12+
>
13+
> **It is strongly recommended to use v0.9.0 or later**. Earlier versions of the **PITD** expression parameter processing algorithm contain a [critical flaw](https://github.com/NewComer00/expressive/releases/tag/v0.9.0) that may result in **incorrect pitch curve generation**. To download the latest version, please visit the [Releases page](https://github.com/NewComer00/expressive/releases).
14+
>
15+
> For users migrating from an older version to `v0.9.0` or later, note that the default value of the **PITD Scaler** is now `1.0` (previously `2.0`). If you have an old configuration file, please set the **PITD Scaler** to `1.0`.
16+
>
17+
> **🎵 Thank you for using Expressive 🎵**
18+
1019
# Expressive
1120

1221
**Expressive** is a [DiffSinger](https://github.com/openvpi/diffsinger) expression parameter importer developed for [OpenUtau](https://github.com/stakira/OpenUtau). It aims to extract expression parameters from real human vocals and import them into the appropriate tracks of your project.
@@ -21,18 +30,17 @@ The current version supports importing the following expression parameters:
2130

2231
| **Working with OpenUtau** | **Data Viewer** |
2332
|:---:|:---:|
24-
| <img src="https://github.com/user-attachments/assets/268b44d4-528d-481e-acfb-3f7da7261c80" width="100%" /> | <img src="https://github.com/user-attachments/assets/91ddadee-62cd-4420-abf0-dd9177e8f935" width="100%" /> |
33+
| <img src="https://github.com/user-attachments/assets/d4e37337-50df-4d7d-8552-c4505dc73f20" width="100%" /> | <img src="https://github.com/user-attachments/assets/91ddadee-62cd-4420-abf0-dd9177e8f935" width="100%" /> |
2534

2635
</div>
2736

28-
> - *OpenUtau version from [keirokeer/OpenUtau-DiffSinger-Lunai](https://github.com/keirokeer/OpenUtau-DiffSinger-Lunai)*
29-
> - *Singer model from [yousa-ling-official-production/yousa-ling-diffsinger-v1](https://github.com/yousa-ling-official-production/yousa-ling-diffsinger-v1)*
37+
> - *Example from [`examples/明天会更好`](examples/明天会更好). Click to view details.*
3038
3139
> [!TIP]
3240
> <details>
3341
> <summary><b>👉 Click to expand the full voiced demo video 👈</b></summary>
3442
>
35-
> <p align="center"><video src="https://github.com/user-attachments/assets/4b5b7c15-947a-4f54-b80e-a14a9eefc86b"></video></p>
43+
> <p align="center"><video src="https://github.com/user-attachments/assets/89706eec-63f6-44f6-8ed7-1f1c73cb341e"></video></p>
3644
> <p align="center"><video src="https://github.com/user-attachments/assets/4076eb8b-07eb-48e6-bdec-4abeac6258c7"></video></p>
3745
>
3846
> </details>
@@ -47,6 +55,8 @@ By default, this application uses [rmvpe-onnx](https://github.com/newcomer00/rmv
4755

4856
The [swift-f0](https://github.com/lars76/swift-f0) and [CREPE](https://github.com/marl/crepe) pitch extraction backends are also available. The former runs on CPU only and is the fastest option, though its accuracy is modest. The latter is a classic algorithm in the field and runs more slowly. In a CUDA environment, the CREPE backend will automatically enable GPU acceleration.
4957

58+
A new experimental **hybrid** backend is also available. It combines the predictions of rmvpe-onnx and swift-f0, relying primarily on the rmvpe-onnx pitch extraction results. In voiced segments of the audio, if the confidence of rmvpe-onnx is low and the confidence of swift-f0 is high, the swift-f0 result is used as a correction, improving the overall accuracy of pitch extraction.
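
The fallback rule described for the hybrid backend can be sketched in a few lines of Python. This is only an illustration of the rule as worded in the README, not the project's actual implementation: the function name `hybrid_f0`, the `voiced` mask, and the `low_conf` / `high_conf` thresholds are all assumptions made for the example.

```python
from typing import List, Sequence


def hybrid_f0(
    f0_primary: Sequence[float],     # per-frame pitch from the primary backend (e.g. rmvpe-onnx)
    conf_primary: Sequence[float],   # per-frame confidence of the primary backend
    f0_fallback: Sequence[float],    # per-frame pitch from the fallback backend (e.g. swift-f0)
    conf_fallback: Sequence[float],  # per-frame confidence of the fallback backend
    voiced: Sequence[bool],          # hypothetical mask: True where the frame is voiced
    low_conf: float = 0.5,           # assumed value for "primary confidence is low"
    high_conf: float = 0.9,          # assumed value for "fallback confidence is high"
) -> List[float]:
    """Keep the primary pitch, but patch in the fallback pitch on voiced frames
    where the primary is unsure and the fallback is confident."""
    fused = []
    for f0_p, c_p, f0_f, c_f, is_voiced in zip(
        f0_primary, conf_primary, f0_fallback, conf_fallback, voiced
    ):
        use_fallback = is_voiced and c_p < low_conf and c_f >= high_conf
        fused.append(f0_f if use_fallback else f0_p)
    return fused


# Only the second frame satisfies all three conditions, so only it is replaced.
print(hybrid_f0(
    [220.0, 0.0, 230.0, 440.0],
    [0.95, 0.1, 0.2, 0.3],
    [221.0, 225.0, 231.0, 450.0],
    [0.9, 0.95, 0.5, 0.95],
    [True, True, True, False],
))
```

Because the primary result is the default and the fallback only overrides it under a strict condition, a rule of this shape can fill gaps without letting the faster but less accurate backend dominate.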
59+
5060
> \* On Windows, TensorFlow 2.10 is the last version that supports GPU acceleration, and Python 3.10 is the highest Python version supported by its `.whl` files.
5161
5262
## 📌 Use Case
@@ -297,24 +307,53 @@ Relaunching the application should restore normal functionality, and this issue
297307
#### Future Plan
298308
The NiceGUI framework has begun improving its drag-and-drop support and should resolve this in a future release.
299309

300-
### PITD expression curve is overall too flat
310+
---
311+
312+
### PITD expression curve is overly flat
301313

302314
#### Symptom
303-
The extracted PITD expression curve is too flat, with almost no significant variation overall. Pitch changes in the reference vocal are not reflected in the expression curve.
304315

305-
#### Possible Cause
306-
The two confidence thresholds in the PITD extractor are set **too high**, causing many pitch changes to be discarded.
316+
The extracted PITD expression curve is too flat, with almost no significant variation. Pitch changes in the reference vocal are not properly reflected in the curve.
307317

308-
#### Solution
309-
First try using the best-performing rmvpe-onnx backend (with default confidence thresholds). If the issue persists, try lowering both confidence thresholds. In general, the **Utau vocal** is relatively clean, so it is advisable to first adjust the confidence threshold for the **Reference vocal**.
318+
#### Possible Causes
319+
320+
1. In versions earlier than **v0.9.0**, there is an issue in the conversion between pitch and PITD values, which can cause the curve to appear overly flat.
321+
2. The two confidence thresholds in the PITD extractor are set **too high**, causing many pitch changes to be discarded.
322+
You can observe missing segments in the original pitch curve in [`expressive-viewer`](#viewer).
323+
324+
#### Solutions
310325

311-
### PITD expression curve has sudden jumps or spikes at certain positions
326+
1. Please upgrade to **v0.9.0 or later**.
327+
2. First try using the best-performing **rmvpe-onnx** or **hybrid** backend (with default confidence thresholds).
328+
If the issue persists, try lowering both confidence thresholds. You can use the pitch confidence curve in [`expressive-viewer`](#viewer) as a reference when tuning.
329+
In general, the **Utau vocal** is relatively clean, so it is recommended to adjust the confidence threshold for the **reference vocal** first.
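
To see why the two troubleshooting entries pull the thresholds in opposite directions, here is a minimal Python sketch of confidence-based masking. It assumes, purely for illustration, that frames whose confidence falls below the threshold are discarded; `mask_low_confidence` is a hypothetical helper, not a function from Expressive.

```python
from typing import List, Optional, Sequence


def mask_low_confidence(
    f0: Sequence[float],
    confidence: Sequence[float],
    threshold: float,
) -> List[Optional[float]]:
    """Return the pitch track with frames below the threshold replaced by None."""
    return [p if c >= threshold else None for p, c in zip(f0, confidence)]


f0 = [220.0, 221.0, 880.0, 222.0]   # the 880 Hz frame is a spurious octave jump
confidence = [0.9, 0.6, 0.3, 0.95]

# A high threshold drops the spike but also drops a legitimate frame (gaps, flat curve).
print(mask_low_confidence(f0, confidence, 0.8))
# A low threshold keeps every frame, including the spike.
print(mask_low_confidence(f0, confidence, 0.2))
```

Under this assumption, a threshold that is too high removes real pitch movement, while one that is too low lets misdetections through, which matches the two symptoms described in this section.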
330+
331+
#### Future Plans
332+
Incorporate semantic information into the PITD expression extraction algorithm.
333+
334+
---
335+
336+
### PITD expression curve has sudden jumps or spikes
312337

313338
#### Symptom
314-
The PITD expression curve changes too rapidly at certain positions, with very large jumps or spikes that clearly do not match natural vocal behavior.
315339

316-
#### Possible Cause
317-
The two confidence thresholds in the PITD extractor are set **too low**, causing erroneous detection results to be accepted.
340+
The PITD expression curve changes too abruptly at certain positions, with large jumps or spikes that do not match natural vocal behavior.
318341

319-
#### Solution
320-
First try using the best-performing rmvpe-onnx backend (with default confidence thresholds). If the issue persists, try increasing both confidence thresholds. In general, the **Utau vocal** is relatively clean, so it is advisable to first adjust the confidence threshold for the **Reference vocal**.
342+
#### Possible Causes
343+
344+
1. In versions earlier than **v0.9.0**, there is an issue in the conversion between pitch and PITD values.
345+
2. There is noise in the reference audio around the corresponding timestamps.
346+
You can observe abnormal spikes in the original pitch curve in [`expressive-viewer`](#viewer).
347+
3. The two confidence thresholds in the PITD extractor are set **too low**, causing incorrect detections to be accepted.
348+
This may also appear as spikes in the pitch curve in [`expressive-viewer`](#viewer).
349+
350+
#### Solutions
351+
352+
1. Please upgrade to **v0.9.0 or later**.
353+
2. Try denoising the reference audio using tools such as [UVR](https://github.com/Anjok07/ultimatevocalremovergui) or [MSST](https://github.com/SUC-DriverOld/MSST-WebUI).
354+
3. First try using the best-performing **rmvpe-onnx** or **hybrid** backend (with default confidence thresholds).
355+
If the issue persists, try increasing both confidence thresholds. You can use the pitch confidence curve in [`expressive-viewer`](#viewer) to guide your adjustments.
356+
In general, the **Utau vocal** is relatively clean, so it is recommended to adjust the confidence threshold for the **reference vocal** first.
357+
358+
#### Future Plans
359+
Incorporate semantic information into the PITD expression extraction algorithm.

README.md

Lines changed: 34 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -7,6 +7,15 @@
77
<a href="README.en.md"><img src="https://img.shields.io/badge/lang-English-blue.svg"></a>
88
</p>
99

10+
> [!WARNING]
11+
> 🚨 **Please Read Before Downloading** 🚨
12+
>
13+
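> **It is strongly recommended to use `v0.9.0` or later**. Earlier versions of the **PITD** expression parameter processing algorithm contain a [critical flaw](https://github.com/NewComer00/expressive/releases/tag/v0.9.0) that causes **pitch curves to be drawn incorrectly**. To download the latest version, visit the [Releases page](https://github.com/NewComer00/expressive/releases).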
14+
>
15+
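> For users migrating from an older version to `v0.9.0` or later: **the default value of the PITD Scaler is now `1.0`**, no longer `2.0`. If you have a configuration file from an older version, please **set the PITD Scaler to `1.0`**.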
16+
>
17+
> **🎵 Thank you for using Expressive 🎵**
18+
1019
# Expressive
1120

1221
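**Expressive** is a [DiffSinger](https://github.com/openvpi/diffsinger) expression parameter importer developed for [OpenUtau](https://github.com/stakira/OpenUtau). It aims to extract expression parameters from real human vocals and import them into the appropriate tracks of your project.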
@@ -21,18 +30,17 @@
2130

2231
| **Workflow** | **Data Visualization** |
2332
|:---:|:---:|
24-
| <img src="https://github.com/user-attachments/assets/268b44d4-528d-481e-acfb-3f7da7261c80" width="100%" /> | <img src="https://github.com/user-attachments/assets/91ddadee-62cd-4420-abf0-dd9177e8f935" width="100%" /> |
33+
| <img src="https://github.com/user-attachments/assets/d4e37337-50df-4d7d-8552-c4505dc73f20" width="100%" /> | <img src="https://github.com/user-attachments/assets/91ddadee-62cd-4420-abf0-dd9177e8f935" width="100%" /> |
2534

2635
</div>
2736

28-
> - *OpenUtau version from [keirokeer/OpenUtau-DiffSinger-Lunai](https://github.com/keirokeer/OpenUtau-DiffSinger-Lunai)*
29-
> - *Singer model from [yousa-ling-official-production/yousa-ling-diffsinger-v1](https://github.com/yousa-ling-official-production/yousa-ling-diffsinger-v1)*
37+
> - *Example from [`examples/明天会更好`](examples/明天会更好). Click to view details.*
3038
3139
> [!TIP]
3240
> <details>
3341
> <summary><b>👉 Click to expand the full demo video with audio 👈</b></summary>
3442
>
35-
> <p align="center"><video src="https://github.com/user-attachments/assets/4b5b7c15-947a-4f54-b80e-a14a9eefc86b"></video></p>
43+
> <p align="center"><video src="https://github.com/user-attachments/assets/89706eec-63f6-44f6-8ed7-1f1c73cb341e"></video></p>
3644
> <p align="center"><video src="https://github.com/user-attachments/assets/4076eb8b-07eb-48e6-bdec-4abeac6258c7"></video></p>
3745
>
3846
> </details>
@@ -47,6 +55,8 @@
4755

4856
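The [swift-f0](https://github.com/lars76/swift-f0) and [CREPE](https://github.com/marl/crepe) pitch extraction backends are also available. The former relies on the CPU only and is the fastest option, though its accuracy is modest. The latter is a classic algorithm in the field and runs more slowly. In a CUDA environment, the CREPE backend automatically enables GPU acceleration.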
4957

58+
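A new experimental **hybrid** backend has also been added. It fuses the predictions of rmvpe-onnx and swift-f0, relying primarily on the rmvpe-onnx pitch extraction results. In voiced segments of the audio, if the confidence of rmvpe-onnx is low and the confidence of swift-f0 is high, the swift-f0 result is used as a correction, improving the overall accuracy of pitch extraction.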
59+
5060
> \* On Windows, TensorFlow 2.10 is the last version that supports GPU acceleration, and Python 3.10 is the highest Python version supported by its `.whl` files.
5161
5262
## 📌 Use Case
@@ -302,24 +312,40 @@ graph TB;
302312
#### Future Plan
303313
The NiceGUI framework has begun improving its file drag-and-drop support, and a future release should resolve this issue.
304314

315+
---
316+
305317
### PITD expression curve is overly flat overall
306318

307319
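#### Symptom
308320
The extracted PITD expression curve is too flat, with almost no large variation overall. Pitch changes in the reference vocal are not reflected in the expression curve.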
309321

310322
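#### Possible Causes
311-
The two confidence thresholds in the PITD expression extractor are set **too high**, so many pitch changes are not accepted.
323+
1. In versions earlier than `v0.9.0`, the conversion between PITD expression curve values and pitch is flawed, which makes the PITD expression curve very flat overall.
324+
2. The two confidence thresholds in the PITD expression extractor are set **too high**, so many pitch changes are not accepted. In [`expressive-viewer`](#可视化工具viewer), you can see many gaps in the original pitch curve that should not be there.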
312325

313326
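#### Solutions
314-
First try the best-performing rmvpe-onnx backend (with default confidence thresholds). If the issue persists, try lowering both confidence thresholds. In general, the **Utau vocal** is relatively clean, so you can adjust the confidence threshold for the **reference vocal** first.
327+
1. Please download and install `v0.9.0` or later.
328+
2. First try the best-performing rmvpe-onnx or hybrid backend (with default confidence thresholds). If the issue persists, try lowering both confidence thresholds. You can use the pitch confidence curve in [`expressive-viewer`](#可视化工具viewer) as a guide when tuning. In general, the **Utau vocal** is relatively clean, so you can adjust the confidence threshold for the **reference vocal** first.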
329+
330+
#### Future Plans
331+
Incorporate semantic information into the PITD expression extraction algorithm.
332+
333+
---
315334

316335
### PITD expression curve changes too quickly at certain positions, with jumps or spikes
317336

318337
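#### Symptom
319338
The PITD expression curve changes too quickly at certain positions, with very large jumps or spikes that clearly do not match how a human voice behaves.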
320339

321340
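#### Possible Causes
322-
The two confidence thresholds in the PITD expression extractor are set **too low**, so erroneous detection results are accepted.
341+
1. In versions earlier than `v0.9.0`, the conversion between PITD expression curve values and pitch is flawed.
342+
2. There is noise in the reference audio near the corresponding timestamps. In [`expressive-viewer`](#可视化工具viewer), you can see many spikes in the original pitch curve that should not be there.
343+
3. The two confidence thresholds in the PITD expression extractor are set **too low**, so erroneous detection results are accepted. In [`expressive-viewer`](#可视化工具viewer), you can see many spikes in the original pitch curve that should not be there.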
323344

324345
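#### Solutions
325-
First try the best-performing rmvpe-onnx backend (with default confidence thresholds). If the issue persists, try increasing both confidence thresholds. In general, the **Utau vocal** is relatively clean, so you can adjust the confidence threshold for the **reference vocal** first.
346+
1. Please download and install `v0.9.0` or later.
347+
2. You can denoise the reference audio with tools such as [UVR](https://github.com/Anjok07/ultimatevocalremovergui) or [MSST](https://github.com/SUC-DriverOld/MSST-WebUI).
348+
3. First try the best-performing rmvpe-onnx or hybrid backend (with default confidence thresholds). If the issue persists, try increasing both confidence thresholds. You can use the pitch confidence curve in [`expressive-viewer`](#可视化工具viewer) as a guide when tuning. In general, the **Utau vocal** is relatively clean, so you can adjust the confidence threshold for the **reference vocal** first.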
349+
350+
#### Future Plans
351+
Incorporate semantic information into the PITD expression extraction algorithm.

examples/Прекрасное Далеко/expressive_config.json

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -18,21 +18,21 @@
1818
},
1919
"pitd": {
2020
"selected": true,
21-
"backend": "rmvpe-onnx",
21+
"backend": "hybrid",
2222
"confidence_utau": null,
2323
"confidence_ref": null,
2424
"align_radius": 1,
2525
"semitone_shift": 0,
26-
"smoothness": 4,
27-
"scaler": 2.2
26+
"smoothness": 2,
27+
"scaler": 1.0
2828
},
2929
"tenc": {
3030
"selected": true,
3131
"trim_silence": true,
3232
"align_radius": 1,
3333
"smoothness": 6,
3434
"scaler": 1.0,
35-
"bias": 10
35+
"bias": 15
3636
}
3737
}
3838
}

examples/テトリス/expressive_config.json

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -18,13 +18,13 @@
1818
},
1919
"pitd": {
2020
"selected": true,
21-
"backend": "rmvpe-onnx",
21+
"backend": "hybrid",
2222
"confidence_utau": null,
2323
"confidence_ref": null,
2424
"align_radius": 1,
2525
"semitone_shift": 0,
2626
"smoothness": 2,
27-
"scaler": 2.0
27+
"scaler": 1.0
2828
},
2929
"tenc": {
3030
"selected": true,

examples/明天会更好/expressive_config.json

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -18,21 +18,21 @@
1818
},
1919
"pitd": {
2020
"selected": true,
21-
"backend": "rmvpe-onnx",
21+
"backend": "hybrid",
2222
"confidence_utau": null,
2323
"confidence_ref": null,
2424
"align_radius": 1,
2525
"semitone_shift": 0,
2626
"smoothness": 2,
27-
"scaler": 2.0
27+
"scaler": 1.0
2828
},
2929
"tenc": {
3030
"selected": true,
3131
"trim_silence": true,
3232
"align_radius": 1,
3333
"smoothness": 6,
3434
"scaler": 1.2,
35-
"bias": 10
35+
"bias": 15
3636
}
3737
}
3838
}
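
The example configs above all move the PITD `scaler` to `1.0`, which is the same change the README asks users to make to older configuration files. A minimal Python sketch of that migration follows. It is not part of the commit: the helper `migrate_pitd_scaler` is hypothetical, and because the diffs do not show the full file layout, the sketch searches for any `"pitd"` object instead of assuming a path. The `"expressions"` parent key in the sample data is likewise made up for the demonstration.

```python
import json


def migrate_pitd_scaler(node, new_scaler=1.0):
    """Recursively set "scaler" inside every "pitd" object.

    Returns the number of "pitd" objects whose scaler was changed.
    """
    changed = 0
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "pitd" and isinstance(value, dict) and "scaler" in value:
                if value["scaler"] != new_scaler:
                    value["scaler"] = new_scaler
                    changed += 1
            else:
                changed += migrate_pitd_scaler(value, new_scaler)
    elif isinstance(node, list):
        for item in node:
            changed += migrate_pitd_scaler(item, new_scaler)
    return changed


# Sample data only; "expressions" is a hypothetical parent key.
old_config = json.loads(
    '{"expressions": {"pitd": {"backend": "rmvpe-onnx", "scaler": 2.0},'
    ' "tenc": {"scaler": 1.2}}}'
)
changed = migrate_pitd_scaler(old_config)
print(changed, json.dumps(old_config, indent=2))
```

To apply this to a real file, the same function could be wrapped with `json.load` and `json.dump`. Only the PITD scaler is touched; other expressions such as `tenc` keep their own `scaler` values.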
