TWiT Transcript Archiver

This toolset downloads transcripts from the TWiT.tv network and converts them into clean, structured Markdown files sized for Large Language Model (LLM) context windows, in particular Google NotebookLM.

It handles fetching, cleaning, and chunking transcripts to stay within token and file size limits.
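As a rough illustration of the chunking step, the sketch below greedily packs whole transcripts into chunks that stay under a word cap. It is not the code in process_transcripts.py: the names chunk_by_words and MAX_WORDS_PER_CHUNK are hypothetical, the 500,000-word default is taken from the NotebookLM limit quoted in this README, and the byte-size limit is not handled here.

```python
from typing import Iterable, List

# Assumed cap, taken from the word limit quoted in this README.
MAX_WORDS_PER_CHUNK = 500_000


def chunk_by_words(docs: Iterable[str], max_words: int = MAX_WORDS_PER_CHUNK) -> List[List[str]]:
    """Greedily pack whole documents into chunks of at most max_words words.

    A single document larger than max_words gets a chunk of its own
    rather than being split mid-transcript.
    """
    chunks: List[List[str]] = []
    current: List[str] = []
    current_words = 0
    for doc in docs:
        n = len(doc.split())
        # Start a new chunk when adding this document would exceed the cap.
        if current and current_words + n > max_words:
            chunks.append(current)
            current, current_words = [], 0
        current.append(doc)
        current_words += n
    if current:
        chunks.append(current)
    return chunks
```

Keeping each transcript whole means a chunk boundary never falls in the middle of an episode, at the cost of chunks that are sometimes well under the cap.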

Features

Multi-Show Support: Download transcripts for any major TWiT network show (e.g., Intelligent Machines, This Week in Google, Windows Weekly, Security Now).
Smart Caching: Caches list pages to minimize server load while refreshing recent pages to catch new episodes.
Resume Capability: Skips already downloaded transcript files to save bandwidth.
NotebookLM Ready: Converts HTML to Markdown and splits the output into files that stay under the 500,000-word and 200MB limits.
Year-Based Splitting: Option to split transcript chunks by calendar year for better organization.
Robust Logging: Includes a --debug flag to trace scraping logic.
Security: Enforces strict relative URL validation and sanitizes Markdown links.
Dual Implementation: Available in both Python (dependency-free) and Go (performant).
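The relative URL validation mentioned under Security could look something like the following. This is a sketch under assumptions, not the project's actual check: the function name is_safe_relative_url and the specific rules (rooted path only, no scheme, no host, no ".." segments) are illustrative choices. It uses only urllib.parse from the standard library.

```python
from urllib.parse import urlsplit


def is_safe_relative_url(href: str) -> bool:
    """Return True only for site-relative paths such as '/posts/transcripts/x'."""
    if not href or href != href.strip():
        return False
    # Backslashes and control characters are sometimes normalized by clients
    # in ways that change the target host, so reject them outright.
    if "\\" in href or any(ord(ch) < 0x20 for ch in href):
        return False
    try:
        parts = urlsplit(href)
    except ValueError:
        # Malformed input (for example a broken IPv6 host) is never safe.
        return False
    # An absolute or scheme-relative URL carries a scheme or a network location.
    if parts.scheme or parts.netloc:
        return False
    # Require a rooted path and forbid parent-directory traversal.
    if not parts.path.startswith("/"):
        return False
    return ".." not in parts.path.split("/")
```

Rejecting anything with a scheme or host keeps a scraped link from pointing the fetcher at a different site, and it also filters out javascript: style links before they reach the Markdown output.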

Directory Structure

data/: Stores downloaded HTML files and generated Markdown output.
fetch_transcripts.py: Python scraper script.
process_transcripts.py: Python processing/chunking script.
go/: Go implementation source tree.
tests/: Python unit tests.

Usage (Python)

The Python implementation uses only the standard library (urllib, re, argparse, glob), so no pip install is required.

1. Fetch Transcripts

Download raw HTML transcripts from TWiT.tv.

# Download specific shows (e.g., Intelligent Machines, This Week in Google)
python3 fetch_transcripts.py --shows "Intelligent Machines" "This Week in Google"

# Use show codes for brevity
python3 fetch_transcripts.py --shows IM TWIG

# Download ALL supported shows
python3 fetch_transcripts.py --all

# Refresh list pages (force re-download index)
python3 fetch_transcripts.py --refresh-list
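The resume and caching behavior described under Features can be pictured with two small checks. This is a hedged sketch: the helper names needs_download and cache_is_fresh, and the idea of an age threshold for list pages, are assumptions for illustration and may not match how fetch_transcripts.py decides what to re-fetch.

```python
import os
import time
from typing import Optional


def needs_download(path: str) -> bool:
    """A transcript is fetched only if no non-empty local copy exists."""
    return not (os.path.isfile(path) and os.path.getsize(path) > 0)


def cache_is_fresh(path: str, max_age_seconds: float, now: Optional[float] = None) -> bool:
    """A cached list page is reused only while it is younger than max_age_seconds."""
    if not os.path.isfile(path):
        return False
    # Allow the caller to inject the current time, which simplifies testing.
    now = time.time() if now is None else now
    return (now - os.path.getmtime(path)) <= max_age_seconds
```

Treating an empty file as missing means an interrupted download is retried on the next run instead of being skipped.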

2. Process Transcripts

Convert HTML files to Markdown and chunk them.

# Process specific shows
python3 process_transcripts.py --prefixes IM TWIG

# Process ALL downloaded shows
python3 process_transcripts.py --all

# Split by year in addition to the size limits (useful for large archives)
python3 process_transcripts.py --all --by-year

Output files will be saved in data/, e.g., IM_Transcripts_2024_100_150.md.
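Year-based splitting amounts to grouping transcripts by calendar year before the size limits are applied. The sketch below shows that grouping only. It assumes each transcript comes with a known air date; how process_transcripts.py actually determines the year is not stated here, and group_by_year is a hypothetical name.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, List, Tuple


def group_by_year(episodes: Iterable[Tuple[date, str]]) -> Dict[int, List[str]]:
    """Group (air_date, markdown) pairs by calendar year, oldest first within a year."""
    by_year: Dict[int, List[Tuple[date, str]]] = defaultdict(list)
    for air_date, text in episodes:
        by_year[air_date.year].append((air_date, text))
    # Sort years ascending, and sort each year's episodes by air date.
    return {
        year: [text for _, text in sorted(items, key=lambda item: item[0])]
        for year, items in sorted(by_year.items())
    }
```

Each year's list can then be passed to the size-based chunker, so no output file mixes episodes from two years.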

Usage (Go)

The Go implementation offers the same functionality with improved performance. Requires Go 1.19+.

1. Build

cd go
go build -o fetch-transcripts ./cmd/fetch-transcripts
go build -o process-transcripts ./cmd/process-transcripts

2. Run

# Fetch
./fetch-transcripts IM TWIG
./fetch-transcripts --all

# Process
./process-transcripts --by-year IM
./process-transcripts --all

Testing

Python

Run the standard unittest suite:

python3 -m unittest discover tests

Go

Run the Go test suite:

cd go
go test ./...

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository.
2. Create your feature branch (git checkout -b feature/AmazingFeature).
3. Commit your changes (git commit -m 'Add some AmazingFeature').
4. Push to the branch (git push origin feature/AmazingFeature).
5. Open a Pull Request.

Please ensure all tests pass before submitting.

