Commit 596d386

Merge remote-tracking branch 'origin/main' into earnings_pc
2 parents d5e4d27 + 828a750 commit 596d386

27 files changed

+706
-76
lines changed

.github/workflows/tests.yml

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -75,14 +75,20 @@ jobs:
7575
pip install nemo-toolkit[asr,nlp]==1.23.0
7676
pip install nemo_text_processing
7777
pip install -r requirements/huggingface.txt
78+
pip install certifi  # needed to avoid certificate problems [CORAAL]
79+
export SSL_CERT_FILE=$(python -m certifi)
7880
python -m pip cache purge
81+
7982
8083
- name: Run all tests
8184
env:
8285
AWS_SECRET_KEY: ${{ secrets.AWS_SECRET_KEY }}
8386
AWS_ACCESS_KEY: ${{ secrets.AWS_ACCESS_KEY }}
8487
CLEAN_UP_TMP_PATH: 1
8588
run: |
89+
wget https://uit.stanford.edu/sites/default/files/2023/10/11/incommon-rsa-ca2.pem  # download the certificate manually [for CORAAL]
90+
sudo cp incommon-rsa-ca2.pem /usr/local/share/ca-certificates/incommon-rsa-server-ca-2.crt  # [cert for CORAAL]
91+
sudo update-ca-certificates  # [cert for CORAAL]
8692
set -o pipefail # this will make sure next line returns non-0 exit code if tests fail
8793
python -m pytest tests/ --junitxml=pytest.xml --ignore=tests/test_tts_sdp_end_to_end.py --cov-report=term-missing:skip-covered --cov=sdp --durations=30 -rs | tee pytest-coverage.txt
8894
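
The `export SSL_CERT_FILE=$(python -m certifi)` line above relies on OpenSSL-based clients reading their CA bundle from that environment variable. A minimal sketch of how one might confirm this from Python's standard `ssl` module follows; the helper name `resolved_cafile` and the temporary file standing in for the certifi bundle are assumptions made for illustration, not part of the workflow.

```python
import os
import ssl
import tempfile


def resolved_cafile(cert_path):
    """Return the CA file the ssl module resolves once the OpenSSL
    CA-file environment variable points at cert_path (None if it is not a file)."""
    # The variable is normally SSL_CERT_FILE; ask OpenSSL instead of hardcoding it.
    env_key = ssl.get_default_verify_paths().openssl_cafile_env
    previous = os.environ.get(env_key)
    os.environ[env_key] = cert_path
    try:
        return ssl.get_default_verify_paths().cafile
    finally:
        # Restore the caller's environment.
        if previous is None:
            del os.environ[env_key]
        else:
            os.environ[env_key] = previous


if __name__ == "__main__":
    # Hypothetical stand-in for the bundle path printed by `python -m certifi`.
    with tempfile.NamedTemporaryFile(suffix=".pem") as bundle:
        print(resolved_cafile(bundle.name))
```

If the variable points at a path that does not exist, the helper returns `None`, which is a quick way to spot a misconfigured bundle path.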

dataset_configs/english/coraal/config.yaml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -18,7 +18,7 @@ documentation: |
1818
This config performs the following data processing.
1919
2020
1. Downloads CORAAL data based on the
21-
`official file list <http://lingtools.uoregon.edu/coraal/coraal_download_list.txt>`_.
21+
`official file list <https://lingtools.uoregon.edu/coraal/coraal_download_list.txt>`_.
2222
There are a couple of errors in the links there, which are fixed in our code.
2323
2. Drops all utterances which contain only pauses. Set ``drop_pauses=False`` to undo.
2424
3. Groups all consecutive segments from the same speaker until 20 seconds duration

dataset_configs/english/hifitts2/config_22khz.yaml

Lines changed: 63 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,63 @@
1+
documentation: |
2+
HiFiTTS-2 22kHz
3+
###############
4+
5+
This config can be used to download the audio data for
6+
`HiFiTTS-2 22kHz <https://huggingface.co/datasets/nvidia/hifitts-2>`_
7+
8+
1. Downloads HiFiTTS-2 audio from LibriVox.
9+
2. Outputs a new manifest that excludes LibriVox audiobook chapters which could not be downloaded (e.g. because they
10+
were removed from the website).
11+
12+
**Required arguments**.
13+
14+
* **workspace_dir**: specify the workspace folder where all audio files and manifests will be stored.
15+
16+
Note that you can customize any part of this config either directly or from the command line.
17+
18+
**Output format**.
19+
20+
This config outputs 2 manifest files:
21+
22+
* ``${workspace_dir}/errors_22khz.json`` - entries from the input chapters file which failed to download from LibriVox.
23+
* ``${workspace_dir}/manifest_filtered_22khz.json`` - input manifest file without utterances from failed chapters.
24+
25+
processors_to_run: all
26+
workspace_dir: ???
27+
manifest_filename: manifest_22khz.json
28+
output_filename: manifest_filtered_22khz.json
29+
chapter_filename: chapters_22khz.json
30+
error_filename: errors_22khz.json
31+
audio_dir_name: audio_22khz
32+
chapter_audio_dir_name: chapters
33+
sample_rate: 22050
34+
delete_chapter_files: true
35+
exit_on_error: false
36+
use_dask: false
37+
max_workers: 8
38+
chunksize: 50
39+
40+
input_manifest_file: ${workspace_dir}/${manifest_filename}
41+
chapter_file: ${workspace_dir}/${chapter_filename}
42+
error_file: ${workspace_dir}/${error_filename}
43+
audio_dir: ${workspace_dir}/${audio_dir_name}
44+
chapter_dir: ${workspace_dir}/${chapter_audio_dir_name}
45+
final_manifest: ${workspace_dir}/${output_filename}
46+
47+
processors:
48+
- _target_: sdp.processors.DownloadHiFiTTS2
49+
audio_dir: ${audio_dir}
50+
chapter_dir: ${chapter_dir}
51+
sample_rate: ${sample_rate}
52+
delete_chapter_files: ${delete_chapter_files}
53+
exit_on_error: ${exit_on_error}
54+
input_manifest_file: ${chapter_file}
55+
output_manifest_file: ${error_file}
56+
use_dask: ${use_dask}
57+
max_workers: ${max_workers}
58+
chunksize: ${chunksize}
59+
60+
- _target_: sdp.processors.RemovedFailedChapters
61+
input_manifest_file: ${input_manifest_file}
62+
output_manifest_file: ${final_manifest}
63+
error_file: ${error_file}
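
The derived paths in the config above (`input_manifest_file`, `error_file`, and so on) are built from `${...}` references to other top-level keys. The sketch below imitates that expansion in plain Python so the resulting paths are easy to see. It is a simplified stand-in for the config library's own resolver, and the `workspace_dir` value is hypothetical.

```python
import re

_REF = re.compile(r"\$\{([^}]+)\}")


def resolve(config):
    """Expand ${key} references between top-level values of a flat dict.

    Simplification: referenced non-string values are converted to str.
    """
    def expand(value, seen):
        if not isinstance(value, str):
            return value

        def substitute(match):
            key = match.group(1)
            if key in seen:
                raise ValueError(f"circular reference: {key}")
            return str(expand(config[key], seen + (key,)))

        return _REF.sub(substitute, value)

    return {key: expand(value, (key,)) for key, value in config.items()}


if __name__ == "__main__":
    config = {
        "workspace_dir": "/data/hifitts2",  # hypothetical value for the required argument
        "manifest_filename": "manifest_22khz.json",
        "error_filename": "errors_22khz.json",
        "input_manifest_file": "${workspace_dir}/${manifest_filename}",
        "error_file": "${workspace_dir}/${error_filename}",
    }
    for key, value in resolve(config).items():
        print(f"{key}: {value}")
```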

dataset_configs/english/hifitts2/config_44khz.yaml

Lines changed: 64 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,64 @@
1+
documentation: |
2+
HiFiTTS-2 44kHz
3+
##################
4+
5+
This config can be used to download the audio data for
6+
`HiFiTTS-2 44kHz <https://huggingface.co/datasets/nvidia/hifitts-2>`_
7+
8+
9+
1. Downloads HiFiTTS-2 audio from LibriVox.
10+
2. Outputs a new manifest that excludes LibriVox audiobook chapters which could not be downloaded (e.g. because they
11+
were removed from the website).
12+
13+
**Required arguments**.
14+
15+
* **workspace_dir**: specify the workspace folder where all audio files and manifests will be stored.
16+
17+
Note that you can customize any part of this config either directly or from the command line.
18+
19+
**Output format**.
20+
21+
This config outputs 2 manifest files:
22+
23+
* ``${workspace_dir}/errors_44khz.json`` - entries from the input chapters file which failed to download from LibriVox.
24+
* ``${workspace_dir}/manifest_filtered_44khz.json`` - input manifest file without utterances from failed chapters.
25+
26+
processors_to_run: all
27+
workspace_dir: ???
28+
manifest_filename: manifest_44khz.json
29+
output_filename: manifest_filtered_44khz.json
30+
chapter_filename: chapters_44khz.json
31+
error_filename: errors_44khz.json
32+
audio_dir_name: audio_44khz
33+
chapter_audio_dir_name: chapters
34+
sample_rate: 44100
35+
delete_chapter_files: true
36+
exit_on_error: false
37+
use_dask: false
38+
max_workers: 8
39+
chunksize: 50
40+
41+
input_manifest_file: ${workspace_dir}/${manifest_filename}
42+
chapter_file: ${workspace_dir}/${chapter_filename}
43+
error_file: ${workspace_dir}/${error_filename}
44+
audio_dir: ${workspace_dir}/${audio_dir_name}
45+
chapter_dir: ${workspace_dir}/${chapter_audio_dir_name}
46+
final_manifest: ${workspace_dir}/${output_filename}
47+
48+
processors:
49+
- _target_: sdp.processors.DownloadHiFiTTS2
50+
audio_dir: ${audio_dir}
51+
chapter_dir: ${chapter_dir}
52+
sample_rate: ${sample_rate}
53+
delete_chapter_files: ${delete_chapter_files}
54+
exit_on_error: ${exit_on_error}
55+
input_manifest_file: ${chapter_file}
56+
output_manifest_file: ${error_file}
57+
use_dask: ${use_dask}
58+
max_workers: ${max_workers}
59+
chunksize: ${chunksize}
60+
61+
- _target_: sdp.processors.RemovedFailedChapters
62+
input_manifest_file: ${input_manifest_file}
63+
output_manifest_file: ${final_manifest}
64+
error_file: ${error_file}
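
The second processor in the config above takes the input manifest and the error file and writes a manifest without utterances from failed chapters. A rough sketch of that filtering idea follows, assuming JSON-lines manifests. The field name `chapter_id` and the function `remove_failed_chapters` are hypothetical and are not taken from the `RemovedFailedChapters` implementation.

```python
import json


def remove_failed_chapters(manifest_lines, error_lines, key="chapter_id"):
    """Drop manifest entries whose chapter appears among the error entries.

    Both inputs are iterables of JSON-lines strings; `key` is a hypothetical
    field name chosen for this illustration.
    """
    failed = {json.loads(line)[key] for line in error_lines if line.strip()}
    kept = []
    for line in manifest_lines:
        if not line.strip():
            continue  # skip blank lines
        entry = json.loads(line)
        if entry[key] not in failed:
            kept.append(entry)
    return kept


if __name__ == "__main__":
    manifest = [
        '{"audio_filepath": "a.flac", "chapter_id": "101"}',
        '{"audio_filepath": "b.flac", "chapter_id": "102"}',
        '{"audio_filepath": "c.flac", "chapter_id": "101"}',
    ]
    errors = ['{"chapter_id": "102", "error": "download failed"}']
    for entry in remove_failed_chapters(manifest, errors):
        print(json.dumps(entry))
```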

dataset_configs/english/hifitts2/config_bandwidth.yaml

Lines changed: 44 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,44 @@
1+
documentation: |
2+
HiFiTTS-2 Bandwidth Estimation
3+
##############################
4+
5+
This config contains the bandwidth estimation code used for HiFiTTS and HiFiTTS-2.
6+
This config can be used to estimate bandwidth for any dataset. For HiFiTTS-2 bandwidth
7+
was estimated using the first 30 seconds of every audiobook chapter, but the estimate is still
8+
reasonably accurate if run over a shorter duration or with individual utterances.
9+
10+
**Required arguments**.
11+
12+
* **workspace_dir**: The workspace folder where all audio files and manifests are stored.
13+
* **audio_dir_name**: Name of the folder in the workspace containing the audio files whose bandwidth will be estimated.
14+
* **input_manifest_filename**: Manifest file in workspace containing relative paths to audio.
15+
16+
**Output format**.
17+
18+
This config outputs a single manifest with the following field(s):
19+
20+
* **bandwidth (int)**: Estimated bandwidth of the audio file.
21+
22+
processors_to_run: all
23+
workspace_dir: ???
24+
audio_dir_name: ???
25+
input_manifest_filename: ???
26+
output_manifest_filename: manifest_bandwidth.json
27+
audio_key: audio_filepath
28+
use_dask: false
29+
max_workers: 1
30+
chunksize: 1
31+
32+
input_manifest_file: ${workspace_dir}/${input_manifest_filename}
33+
final_manifest: ${workspace_dir}/${output_manifest_filename}
34+
audio_dir: ${workspace_dir}/${audio_dir_name}
35+
36+
processors:
37+
- _target_: sdp.processors.EstimateBandwidth
38+
input_manifest_file: ${input_manifest_file}
39+
output_manifest_file: ${final_manifest}
40+
audio_dir: ${audio_dir}
41+
input_audio_key: ${audio_key}
42+
use_dask: ${use_dask}
43+
max_workers: ${max_workers}
44+
chunksize: ${chunksize}
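
To give a feel for what a bandwidth estimate measures, the sketch below finds the highest frequency whose spectral power stays within a threshold of the spectrum's peak. This is one common heuristic and is not claimed to be the algorithm used by `sdp.processors.EstimateBandwidth`; the function name, the -50 dB threshold, and the naive DFT (suitable only for very short inputs) are all assumptions.

```python
import cmath
import math


def estimate_bandwidth(samples, sample_rate, threshold_db=-50.0):
    """Return the highest frequency (Hz) whose power is within threshold_db of the peak.

    Illustrative only: uses a naive O(n^2) DFT, so keep `samples` short.
    """
    n = len(samples)
    power = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(samples)
        )
        power.append(abs(coeff) ** 2)
    peak = max(power)
    if peak == 0:
        return 0  # silent input
    floor = peak * 10 ** (threshold_db / 10)
    top_bin = max(k for k, p in enumerate(power) if p >= floor)
    return round(top_bin * sample_rate / n)


if __name__ == "__main__":
    rate, n = 8000, 256
    # Two tones at 1 kHz and 2 kHz; the estimate should track the higher one.
    signal = [
        math.sin(2 * math.pi * 1000 * t / rate)
        + 0.5 * math.sin(2 * math.pi * 2000 * t / rate)
        for t in range(n)
    ]
    print(estimate_bandwidth(signal, rate))
```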

docker/Dockerfile

Lines changed: 4 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -21,6 +21,9 @@ RUN apt-get update \
2121
# Update pip
2222
RUN pip install --upgrade pip
2323

24+
# Install typing-extensions manually
25+
RUN pip install typing-extensions
26+
2427
# Clone the NeMo SDP repository
2528
COPY . /src/NeMo-speech-data-processor
2629
RUN rm -rf /src/NeMo-speech-data-processor/.git
@@ -34,4 +37,4 @@ RUN find requirements/ -name "*.txt" -exec pip install -r {} \;
3437
WORKDIR /src/NeMo-speech-data-processor
3538

3639
# Set up entrypoint
37-
CMD ["bash"]
40+
CMD ["bash"]

docs/src/conf.py

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -189,3 +189,8 @@ def setup(app):
189189
]
190190
# nitpick_ignore_regex = [('py:class', '*')]
191191

192+
# Temporary: skip link checking for the CORAAL download list
193+
linkcheck_ignore = [
194+
r'https://lingtools\.uoregon\.edu/coraal/coraal_download_list\.txt',
195+
]
196+
# https://lingtools.uoregon.edu/coraal/coraal_download_list.txt
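
The entries in `linkcheck_ignore` are regular expressions, which is why the dots in the URL are escaped. The sketch below shows how such a pattern behaves, assuming the patterns are applied with `re.match`-style matching from the start of the URI; the helper `is_ignored` is hypothetical and only mimics that behavior.

```python
import re

linkcheck_ignore = [
    r'https://lingtools\.uoregon\.edu/coraal/coraal_download_list\.txt',
]


def is_ignored(uri, patterns=linkcheck_ignore):
    """Return True if any pattern matches at the start of the URI."""
    return any(re.match(pattern, uri) for pattern in patterns)


if __name__ == "__main__":
    print(is_ignored("https://lingtools.uoregon.edu/coraal/coraal_download_list.txt"))
    print(is_ignored("http://lingtools.uoregon.edu/coraal/coraal_download_list.txt"))
```

Under this assumption the pattern covers only the `https://` form of the link, so an `http://` variant would still be checked.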

docs/src/sdp/api.rst

Lines changed: 15 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -137,12 +137,24 @@ HuggingFace Datasets
137137
.. autodata:: sdp.processors.CreateInitialManifestHuggingFace
138138
:annotation:
139139

140+
140141
YTC Datasets
141142
''''''''''''
142143

143144
.. autodata:: sdp.processors.datasets.ytc.create_initial_manifest.CreateInitialManifestYTC
144145
:annotation:
145146

147+
148+
HiFiTTS-2
149+
''''''''''''''''''''
150+
151+
.. autodata:: sdp.processors.DownloadHiFiTTS2
152+
:annotation:
153+
154+
.. autodata:: sdp.processors.RemovedFailedChapters
155+
:annotation:
156+
157+
146158
Lhotse processors
147159
#################
148160

@@ -172,6 +184,9 @@ used in the downstream processing for additional enhancement or filtering.
172184
.. autodata:: sdp.processors.ASRTransformers
173185
:annotation:
174186

187+
.. autodata:: sdp.processors.EstimateBandwidth
188+
:annotation:
189+
175190
.. autodata:: sdp.processors.tts.pyannote.PyAnnoteDiarizationAndOverlapDetection
176191
:annotation:
177192

@@ -187,7 +202,6 @@ used in the downstream processing for additional enhancement or filtering.
187202
.. autodata:: sdp.processors.tts.metrics.BandwidthEstimationProcessor
188203
:annotation:
189204

190-
191205
Text-only processors
192206
####################
193207

docs/src/sdp/existing_configs.rst

Lines changed: 29 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -366,6 +366,13 @@ Armenian Toloka
366366
`config <https://github.com/NVIDIA/NeMo-speech-data-processor/blob/main/dataset_configs/armenian/toloka/pipeline_get_final_res.yaml>`__ |
367367
:doc:`documentation <config-docs/armenian/toloka/pipeline_get_final_res>`
368368

369+
.. toctree::
370+
:hidden:
371+
372+
config-docs/armenian/toloka/pipeline_start
373+
config-docs/armenian/toloka/pipeline_validate_answers
374+
config-docs/armenian/toloka/pipeline_get_final_res
375+
369376
YouTube Commons (YTC)
370377
~~~~~~~~~~~~~~~~~~~~~~
371378

@@ -377,8 +384,26 @@ YouTube Commons (YTC)
377384
.. toctree::
378385
:hidden:
379386

380-
config-docs/armenian/toloka/pipeline_start
381-
config-docs/armenian/toloka/pipeline_validate_answers
382-
config-docs/armenian/toloka/pipeline_get_final_res
383-
384387
config-docs/tts/ytc/config
388+
389+
HiFiTTS-2
390+
~~~~~~~~~~~~~~~~~~~~~~~
391+
392+
**Dataset link:** https://huggingface.co/datasets/nvidia/hifitts-2
393+
394+
* **22kHz**:
395+
`config <https://github.com/NVIDIA/NeMo-speech-data-processor/blob/main/dataset_configs/english/hifitts2/config_22khz.yaml>`__ |
396+
:doc:`documentation <config-docs/english/hifitts2/config_22khz>`
397+
* **44kHz**:
398+
`config <https://github.com/NVIDIA/NeMo-speech-data-processor/blob/main/dataset_configs/english/hifitts2/config_44khz.yaml>`__ |
399+
:doc:`documentation <config-docs/english/hifitts2/config_44khz>`
400+
* **Bandwidth Estimation**:
401+
`config <https://github.com/NVIDIA/NeMo-speech-data-processor/blob/main/dataset_configs/english/hifitts2/config_bandwidth.yaml>`__ |
402+
:doc:`documentation <config-docs/english/hifitts2/config_bandwidth>`
403+
404+
.. toctree::
405+
:hidden:
406+
407+
config-docs/english/hifitts2/config_22khz
408+
config-docs/english/hifitts2/config_44khz
409+
config-docs/english/hifitts2/config_bandwidth

requirements/main.txt

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -18,7 +18,7 @@ python-docx
1818
pydub
1919
dask
2020
distributed
21-
21+
jiwer>=3.1.0,<4.0.0
2222
# toloka-kit # Temporarily disabled due to Toloka's technical pause; keep as reference for past and future API support
2323
# for some processors, additionally https://github.com/NVIDIA/NeMo is required
2424
# for some processors, additionally nemo_text_processing is required
