podcast-transcriber

A set-it-and-forget-it GCP Cloud Function for transcribing a podcast, built for the Arms Control Wonk Podcast Slack community with transcriptions by Deepgram.

Overview

main.py defines a Cloud Function that

Waits on an invocation from a Pub/Sub topic;
Fetches a podcast's RSS or Atom feed of episodes;
Selects up to three of the most recent episodes for which it hasn't already produced transcripts;
Submits those podcast episodes to Deepgram's automated speech recognition API for transcription;
Writes the Deepgram response and a processed transcript to Google Cloud Storage.

The Cloud Function is designed to be invoked on a regular schedule; the setup instructions below and the Makefile provide cron-like invocations by using Cloud Scheduler to publish to the Pub/Sub topic.
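As a rough illustration of the invocation step, a Pub/Sub-triggered background function in the Python runtime receives an event dict and a context object. The sketch below only decodes the message body; the name handle_pubsub is hypothetical and is not taken from main.py.

```python
import base64


def handle_pubsub(event, context):
    """Entry point sketch for a Pub/Sub-triggered background function.

    `event` is a dict whose optional "data" field holds the base64-encoded
    message body; `context` carries event metadata and is unused here.
    """
    payload = ""
    if event and "data" in event:
        payload = base64.b64decode(event["data"]).decode("utf-8")
    # Hypothetical: the real function would fetch the feed and transcribe
    # new episodes here. The payload is returned only for illustration.
    return payload
```

Since the Cloud Scheduler job exists only to trigger the function, the message body can be ignored; the handler works the same with an empty event.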

It avoids retranscribing episodes by checking whether a transcription artifact matching the episode's feed URI exists in GCS.
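The selection and deduplication logic described above could look roughly like the following. This is a minimal sketch under several assumptions: it treats the enclosure URL as the episode's feed URI, assumes the feed lists the most recent episodes first, parses only RSS 2.0 (not Atom), and uses a hypothetical hash-based artifact name. None of these details are confirmed by main.py.

```python
import hashlib
import xml.etree.ElementTree as ET


def artifact_name(episode_uri):
    # Hypothetical naming scheme: a stable hash of the episode's feed URI.
    return hashlib.sha256(episode_uri.encode("utf-8")).hexdigest() + ".json"


def episode_uris(rss_xml):
    """Return enclosure URLs from an RSS 2.0 document, in feed order."""
    root = ET.fromstring(rss_xml)
    uris = []
    for item in root.iter("item"):
        enclosure = item.find("enclosure")
        if enclosure is not None and enclosure.get("url"):
            uris.append(enclosure.get("url"))
    return uris


def select_untranscribed(uris, artifact_exists, limit=3):
    """Pick up to `limit` URIs whose transcription artifact does not exist yet."""
    selected = []
    for uri in uris:
        if len(selected) == limit:
            break
        if not artifact_exists(artifact_name(uri)):
            selected.append(uri)
    return selected
```

The artifact_exists callback keeps the sketch independent of any storage client. With the google-cloud-storage library, it could be something like lambda name: bucket.blob(name).exists().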

Usage

Requirements

gsutil and gcloud
Python 3.7
Deepgram account

Setup

Recommended: start with an empty GCP project for the resources created here (a Cloud Storage bucket, Pub/Sub topic, Cloud Scheduler job, and Cloud Function). Run gcloud config set project your-project-name to point the Google Cloud SDK tools at that project.

Create a Google Cloud Storage bucket.

Run make bucket.
Create a .env.yaml file with the required environment variables.

TARGET_FEED_URL : The URL of a podcast's Atom/RSS feed, e.g. "https://armscontrolwonk.libsyn.com/rss".

TRANSCRIPTIONS_BUCKET_NAME : The name of the Google Cloud Storage bucket created in step 1.

DEEPGRAM_API_KEY : Your personal Deepgram API key. This is a secret!
Example .env.yaml file
```
# Configuration
TARGET_FEED_URL: "https://armscontrolwonk.libsyn.com/rss"
TRANSCRIPTIONS_BUCKET_NAME: "transcriptions"

# Secrets
DEEPGRAM_API_KEY: "your_deepgram_secret_here"
```
Initialize the scheduling infrastructure (a Pub/Sub topic and a Cloud Scheduler job).

Run make cron-job.
Deploy the Cloud Function.

Run make deploy.

To work through the backlog of episodes in the feed, repeatedly run the created job: gcloud scheduler jobs run WeeklyJob.
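Assuming the Makefile passes .env.yaml to the deployment as an environment file, the three variables from the setup steps reach the function as ordinary environment variables. A minimal sketch of reading them and failing fast when one is missing follows; load_config is a hypothetical helper, not a function from main.py.

```python
import os

REQUIRED_VARS = ("TARGET_FEED_URL", "TRANSCRIPTIONS_BUCKET_NAME", "DEEPGRAM_API_KEY")


def load_config(environ=None):
    """Return the required settings, raising if any is missing or empty."""
    environ = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_VARS}
```

Checking all variables at startup surfaces a misconfigured deployment on the first invocation rather than partway through a transcription run.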

Customization

Scheduling

Set up a transcription schedule appropriate to your podcast; the default settings suit a podcast that publishes up to three episodes per week.

To check for updates more or less frequently, change the Cloud Scheduler job frequency. Default: once per week.

To transcribe more or fewer episodes per function execution, change the threshold in main.py#_main. Default: up to 3 episodes.

Deepgram and transcripts

The Deepgram request in main.py#_transcribe is tailored to the Arms Control Wonk podcast; if the speech in your podcast is faster or slower, you may want to decrease or increase the utt_split utterance threshold, respectively.

See Deepgram's API documentation and Python SDK for documentation of the available options.
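To make the utt_split option concrete, here is a sketch that builds a request against Deepgram's pre-recorded audio REST endpoint using only the standard library. It is a swap-in for illustration: the project itself uses the Python SDK, and the function name, the default utt_split value, and the other options shown are assumptions rather than what main.py#_transcribe sends.

```python
import json
import urllib.parse
import urllib.request


def build_deepgram_request(audio_url, api_key, utt_split=0.8):
    """Build (but do not send) a transcription request for a hosted audio file."""
    query = urllib.parse.urlencode({
        "punctuate": "true",
        "utterances": "true",    # ask for utterance-level segments
        "utt_split": utt_split,  # pause length, in seconds, that ends an utterance
    })
    return urllib.request.Request(
        "https://api.deepgram.com/v1/listen?" + query,
        data=json.dumps({"url": audio_url}).encode("utf-8"),
        headers={
            "Authorization": "Token " + api_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the result to urllib.request.urlopen would submit it. A smaller utt_split breaks utterances on shorter pauses, which is why faster speech may call for a lower value.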

Want transcripts in a different format? Change how main.py#_process formats utterances.
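As one possible starting point for a custom format, the sketch below turns the utterances in a Deepgram response into timestamped lines. It assumes the response was requested with utterances enabled, so that results.utterances holds entries with start and transcript fields; format_utterances is a hypothetical name and does not reproduce what main.py#_process does.

```python
def format_timestamp(seconds):
    """Render a number of seconds as HH:MM:SS, truncating fractions."""
    total = int(seconds)
    return "%02d:%02d:%02d" % (total // 3600, (total % 3600) // 60, total % 60)


def format_utterances(response):
    """Return one '[HH:MM:SS] text' line per utterance in the response."""
    utterances = response.get("results", {}).get("utterances", [])
    lines = []
    for utt in utterances:
        lines.append("[%s] %s" % (format_timestamp(utt["start"]), utt["transcript"]))
    return "\n".join(lines)
```

A response without utterances yields an empty string, so the caller can decide whether to write an empty transcript or skip the episode.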
