This toolset allows you to download transcripts from the TWiT.tv network and process them into clean, structured Markdown files optimized for Large Language Model (LLM) context windows, specifically Google NotebookLM.
It handles fetching, cleaning, and chunking transcripts to stay within token and file size limits.
- Multi-Show Support: Download transcripts for any major TWiT network show (e.g., Intelligent Machines, This Week in Google, Windows Weekly, Security Now).
- Smart Caching: Caches list pages to minimize server load while refreshing recent pages to catch new episodes.
- Resume Capability: Skips already downloaded transcript files to save bandwidth.
- NotebookLM Ready: Converts HTML to Markdown and chunks output files to stay under 500,000 words / 200MB limits.
- Year-Based Splitting: Option to split transcript chunks by calendar year for better organization.
- Robust Logging: Includes a `--debug` flag to trace scraping logic.
- Security: Enforces strict relative URL validation and sanitizes Markdown links.
- Dual Implementation: Available in both Python (dependency-free) and Go (performant).
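The relative-URL validation mentioned above can be sketched roughly as follows. This is an illustrative check, not the tool's exact logic: it assumes only same-site, traversal-free paths should be accepted.

```python
from urllib.parse import urlparse

def is_safe_relative(href: str) -> bool:
    """Illustrative check: reject absolute URLs, protocol-relative links,
    and path traversal; accept only root-relative paths."""
    parsed = urlparse(href)
    if parsed.scheme or parsed.netloc:  # absolute or protocol-relative (//host/...)
        return False
    return href.startswith("/") and ".." not in href
```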
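The chunking idea behind the NotebookLM limit can be sketched like this. The function name and word budget below are illustrative assumptions, not the tool's actual API: episodes are accumulated into a chunk until adding the next one would exceed the word limit, at which point a new chunk starts.

```python
WORD_LIMIT = 500_000  # NotebookLM per-source word cap (see feature list)

def chunk_episodes(episodes, word_limit=WORD_LIMIT):
    """Group per-episode Markdown texts into chunks whose total
    word count stays under word_limit (hypothetical sketch)."""
    chunks, current, count = [], [], 0
    for text in episodes:
        words = len(text.split())
        if current and count + words > word_limit:
            chunks.append("\n\n".join(current))  # flush the full chunk
            current, count = [], 0
        current.append(text)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A single episode larger than the limit still becomes its own chunk here, which matches the practical constraint that an episode cannot be split mid-transcript without losing context.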
- `data/`: Stores downloaded HTML files and generated Markdown output.
- `fetch_transcripts.py`: Python scraper script.
- `process_transcripts.py`: Python processing/chunking script.
- `go/`: Go implementation source tree.
- `tests/`: Python unit tests.
The Python implementation relies on standard libraries only (`urllib`, `re`, `argparse`, `glob`). No `pip install` required.
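The caching and resume behavior described above follows a standard cache-then-fetch pattern, which the standard library supports directly. This is a minimal sketch assuming a hypothetical `cached_fetch` helper, not the script's actual function:

```python
from pathlib import Path
from urllib.request import urlopen

def cached_fetch(url: str, cache_path: Path) -> bytes:
    """Return the cached copy if present; otherwise download and cache it.
    Illustrative sketch, not fetch_transcripts.py's real API."""
    if cache_path.exists():
        return cache_path.read_bytes()  # resume: skip already-downloaded files
    data = urlopen(url).read()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_bytes(data)
    return data
```

Forcing a refresh (as `--refresh-list` does for index pages) amounts to deleting or bypassing the cached file before fetching.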
Download raw HTML transcripts from TWiT.tv:

```shell
# Download specific shows (e.g., Intelligent Machines, This Week in Google)
python3 fetch_transcripts.py --shows "Intelligent Machines" "This Week in Google"

# Use show codes for brevity
python3 fetch_transcripts.py --shows IM TWIG

# Download ALL supported shows
python3 fetch_transcripts.py --all

# Refresh list pages (force re-download index)
python3 fetch_transcripts.py --refresh-list
```

Convert HTML files to Markdown and chunk them:
```shell
# Process specific shows
python3 process_transcripts.py --prefixes IM TWIG

# Process ALL downloaded shows
python3 process_transcripts.py --all

# Split by year (useful for large archives) in addition to the size limits
python3 process_transcripts.py --all --by-year
```

Output files are saved in `data/`, e.g., `IM_Transcripts_2024_100_150.md`.
The Go implementation offers the same functionality with improved performance. Requires Go 1.19+.
```shell
cd go
go build -o fetch-transcripts ./cmd/fetch-transcripts
go build -o process-transcripts ./cmd/process-transcripts

# Fetch
./fetch-transcripts IM TWIG
./fetch-transcripts --all

# Process
./process-transcripts --by-year IM
./process-transcripts --all
```

Run the standard unittest suite:
```shell
python3 -m unittest discover tests
```

Run the Go test suite:

```shell
cd go
go test ./...
```

Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository.
- Create your feature branch (`git checkout -b feature/AmazingFeature`).
- Commit your changes (`git commit -m 'Add some AmazingFeature'`).
- Push to the branch (`git push origin feature/AmazingFeature`).
- Open a Pull Request.
Please ensure all tests pass before submitting.