Uses ffmpeg
to cut the audio track from mp4 file, performs speech recognition via Vosk API and Vosk model and returns text as result. A utility for calculating metrics based on the reference text is included.
- Python 3.6+
- ffmpeg
This application utilizes ffmpeg
to convert .mp4
to .wav
. You should have ffmpeg
package installed in your system to make mpeg-4 video to text conversion work.
pip install -r requirements.txt
Download a model for your language from https://alphacephei.com/vosk/models and put in into ./model
directory.
Example:
wget https://alphacephei.com/vosk/models/vosk-model-ru-0.22.zip
unzip vosk-model-ru-0.22.zip
mv vosk-model-ru-0.22 model
python transcript_mp4.py some_video.mp4
python wer.py <hypotesis_text_file> <reference_text_file>
We have an ideal transcript for this video in russian (./samples/ideal.txt
): https://vod-video.rbc.ru/archive/2021/12/02/den1118.folder/telecast_576p.mp4
We have also made a transcript with Vosk model for the same video (./samples/test.txt
).
So, we can run the calculation:
python wer.py samples/test.txt samples/ideal.txt
Result:
WER (Words Error Rate): 0.14775815217391305
MER (Match Error Rate): 0.14164767176815368
WIL (Word Information Lost): 0.22181904843819444
WIP (Word Information Preserved): 0.7781809515618056
Hits: 2636
Substitutions: 270
Deletions: 38
Insertions: 127