A robust, production-grade library designed to convert WebVTT to SRT, turning messy AI transcripts into clean, usable subtitles.
A post-processing tool designed to clean the output from OpenAI Whisper, YouTube Auto-Captions, and other AI transcription services.
Perfect for TTS pipelines, video dubbing, and dataset preparation.
AI tools like Whisper are incredible at speech recognition, but their raw VTT output is often chaotic. They frequently produce:
- The "Karaoke Effect": Words accumulating screen-by-screen (e.g., "Hello", "Hello world", "Hello world!").
- Micro-Glitches: Subtitle frames lasting milliseconds that are invisible to humans but break TTS/dubbing scripts.
- Metadata Clutter: Tags like
align:start,<c>,<b>or<i>that mess up text processing.
whisper-vtt2srt is the bridge between raw AI output and your production pipeline. It stabilizes and normalizes the text, making it safe for Text-to-Speech (TTS) generation, video players, and NLP tasks.
Test the conversion instantly in your browser (Client-Side / Secure). No installation required.
- 🎮 Online Playground
- 👀 See the Difference
- 🚀 Key Features
- 📦 Installation
- 📘 How to Use
- 🧠 How It Works
- 📆 Changelog
- 🤝 Contributing
- 📜 License
- 📚 Reference
🚧 Raw Input — (Typical output from YouTube/Whisper - with "Karaoke Effect")
Notice the accumulated text, repetitive lines, and internal tagging.
WEBVTT
Kind: captions
Language: en
00:00:00.640 --> 00:00:03.110 align:start position:0%
APIs<00:00:01.280><c> are</c><00:00:01.520><c> everywhere.</c><00:00:02.399><c> They</c><00:00:02.639><c> power</c><00:00:02.960><c> your</c>
00:00:03.110 --> 00:00:03.120 align:start position:0%
APIs are everywhere. They power your
00:00:03.120 --> 00:00:05.430 align:start position:0%
APIs are everywhere. They power your
apps,<00:00:03.600><c> your</c><00:00:03.840><c> payment</c><00:00:04.160><c> systems,</c><00:00:04.880><c> your</c><00:00:05.120><c> cloud</c>
00:00:05.430 --> 00:00:05.440 align:start position:0%
apps, your payment systems, your cloud
00:00:05.440 --> 00:00:07.829 align:start position:0%
apps, your payment systems, your cloud
services,<00:00:06.560><c> pretty</c><00:00:06.879><c> much</c><00:00:07.120><c> every</c><00:00:07.440><c> piece</c><00:00:07.680><c> of</c>
00:00:07.829 --> 00:00:07.839 align:start position:0%
services, pretty much every piece of
00:00:07.839 --> 00:00:10.470 align:start position:0%
services, pretty much every piece of✨ Cleaned Output — (Processed by whisper-vtt2srt)
Clean, stable, and ready for TTS input, YouTube, Netflix or standard players.
1
00:00:00,640 --> 00:00:03,110
APIs are everywhere. They power your
2
00:00:03,120 --> 00:00:05,430
apps, your payment systems, your cloud
3
00:00:05,440 --> 00:00:07,829
services, pretty much every piece of-
🛡️ Stabilization Strategy Intelligently detects and merges accumulating text blocks ("Karaoke Effect"), preventing the rapid flashing of partial sentences. Essential for generating smooth audio in TTS pipelines, video dubbing, and subtitles.
-
🎵 Sound Description Removal Automatically filters out non-speech elements like
[Music],[Applause], or[Laughter], ensuring your TTS voice doesn't try to read stage directions. -
🧹 Glitch Filtering Automatically removes subtitle blocks with insignificant duration (< 50ms) that can cause audio generation errors or player flickering.
-
✨ Smart Normalization Strips VTT-specific metadata (
align:start,position:0%), removes internal tags (<c>,<00:00:00>), and cleans up inconsistent whitespace ensuring pure text output. -
⚡ Zero Dependencies Built with pure Python standard library. Lightweight and easy to install in any environment (Linux, Windows, Docker).
-
🔧 Configurable Strictness Every cleaning step is optional. You enable exactly what your pipeline needs.
pip install whisper-vtt2srtProcess files directly from the command line:
# Convert a Single File
whisper-vtt2srt input.vtt
# Batch Convert a Folder
whisper-vtt2srt ./my_dataset
# Recursive Conversion (subfolders included)
whisper-vtt2srt ./my_dataset --recursive
# Handle Legacy Encodings (e.g., Latin-1)
whisper-vtt2srt input_latin.vtt --encoding ISO-8859-1
# Keep the "karaoke" effect (disable deduplication)
whisper-vtt2srt input.vtt --no-karaokeusage: whisper-vtt2srt [-h] [-r] [-e ENCODING] [--no-karaoke] [--keep-glitches] [--keep-formatting]
[--keep-metadata] [--merge-short-lines]
input [output]
Convert WebVTT to SRT with professional cleaning.
positional arguments:
input Input VTT file or directory
output Output SRT file or directory (optional)
options:
-h, --help show this help message and exit
-r, --recursive Recursively process directories
-e ENCODING, --encoding ENCODING
Input file encoding (default: utf-8)
--no-karaoke Disable anti-karaoke filter (keep accumulating text)
--keep-sound-descriptions
Keep sound descriptions like [Music] or [Applause]
--keep-glitches Keep short <50ms blocks
--keep-formatting Keep VTT tags (bold, italic, colors)
--keep-metadata Keep metadata tags (align:start, position:0%)
--merge-short-lines Aggressively merge short lines into single lines
--max-line-length MAX_LINE_LENGTH
Maximum line length allowed when merging short lines (default: 42, like YouTube/Netflix)
Easily integrate whisper-vtt2srt into your own Python pipelines.
The library exports a high-level Pipeline class for full control.
from whisper_vtt2srt import Pipeline
# 1. Initialize
pipeline = Pipeline()
# 2. Read input
with open("subs.vtt", "r", encoding="utf-8") as f:
raw_vtt_content = f.read()
# 3. Convert raw VTT content
srt_content = pipeline.convert(raw_vtt_content)
# 4. Use the clean SRT (e.g., send to TTS engine, save to file, render in player, etc.)
print(srt_content)You can customize the cleaning options if needed:
Just pass a
CleaningOptionsobject to thePipelineconstructor to toggle specific cleaning rules.
from whisper_vtt2srt import CleaningOptions, Pipeline
# Configure strictness
options = CleaningOptions(
remove_pixelation=True, # Fix Karaoke effect
remove_sound_descriptions=True, # Remove [Music], [Applause]
remove_glitches=True, # Remove <50ms blocks
simplify_formatting=True, # Strip tags like <c> or <b> and fix whitespace
remove_metadata=True, # Clean VTT positioning
merge_short_lines=False, # Aggressively merge short lines
max_line_length=42 # Max length constraint for merging
)
pipeline = Pipeline(options)- Parser (State Machine): Robustly reads messy VTT files, handling multi-line strings and irregular spacing.
- Deduplication Engine: Uses a sliding window to identify comparison patterns between blocks. If a block is just a prefix of the next one (common in AI streams), it is merged or removed to stabilize the text.
- Filter Layer: Applies duration checks and regex cleaning to ensure the final output is compliant with the SubRip (SRT) standard.
Project history and updates are tracked in CHANGELOG.md.
Contributions are welcome! We follow a strict SOLID architecture. See CONTRIBUTING.md for details.
MIT License - see LICENSE.
If you use this library in your research or project, please cite it as:
@software{whisper_vtt2srt,
author = {Jorcelino Junior},
title = {whisper-vtt2srt: A robust WebVTT to SRT converter for AI subtitles},
year = {2026},
url = {https://github.com/jorcelinojunior/whisper-vtt2srt}
}