A set-it-and-forget-it GCP Cloud Function for transcribing a podcast, built for the Arms Control Wonk Podcast Slack community with transcriptions by Deepgram.
main.py defines a Cloud Function that:
- Waits on an invocation from a Pub/Sub topic;
- Fetches a podcast's RSS or Atom feed of episodes;
- Selects up to three of the most recent episodes for which it hasn't already produced transcripts;
- Submits those podcast episodes to Deepgram's automated speech recognition API for transcription;
- Writes the Deepgram response and a processed transcript to Google Cloud Storage.
That Cloud Function is designed to be invoked on a regular schedule; the setup instructions below and the Makefile provide cron-like invocations by using Cloud Scheduler to publish to the Pub/Sub topic.
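As a rough illustration of the Pub/Sub trigger, the sketch below shows the shape of a background function on the Python 3.7 runtime: it receives an event dict whose optional data field is base64-encoded, plus a context object. The names decode_pubsub_message and handle_invocation are hypothetical and are not taken from main.py.

```python
import base64


def decode_pubsub_message(event):
    """Return the text payload of a Pub/Sub event, or '' if it has none."""
    data = event.get("data")
    if not data:
        return ""
    # Pub/Sub delivers the message body base64-encoded.
    return base64.b64decode(data).decode("utf-8")


def handle_invocation(event, context):
    """Hypothetical entry point; the real one is defined in main.py."""
    payload = decode_pubsub_message(event)
    # For a scheduled run the message is only a trigger, so the payload
    # can be empty or ignored entirely.
    return "triggered with payload: {!r}".format(payload)
```

A scheduled job only needs to publish some message to the topic; the content of that message does not have to carry any meaning.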
The function avoids retranscribing episodes by checking whether a transcription artifact matching the episode's feed URI exists in GCS.
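The selection logic might look roughly like the sketch below. It assumes the feed lists episodes newest first and that each episode's feed URI maps to one object name in the bucket. The hashing scheme in artifact_name and the function names are assumptions made for illustration; main.py may name its artifacts differently.

```python
import hashlib


def artifact_name(episode_uri):
    # Hypothetical naming scheme: hash the feed URI into a safe object name.
    digest = hashlib.sha256(episode_uri.encode("utf-8")).hexdigest()
    return "transcripts/{}.json".format(digest)


def select_untranscribed(episode_uris, artifact_exists, limit=3):
    """Pick up to `limit` episodes, newest first, that lack an artifact.

    `episode_uris` is assumed to be ordered newest first.
    `artifact_exists` is any callable that takes an object name and
    returns True when that object is already stored.
    """
    selected = []
    for uri in episode_uris:
        if len(selected) == limit:
            break
        if not artifact_exists(artifact_name(uri)):
            selected.append(uri)
    return selected


# Usage with an in-memory stand-in for the bucket.
stored = {artifact_name("ep5")}
picked = select_untranscribed(["ep5", "ep4", "ep3", "ep2", "ep1"], stored.__contains__)
```

Against a real bucket, the existence check could be backed by the google-cloud-storage client, for example by calling exists() on a blob with that name.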
- gsutil and gcloud (the Google Cloud SDK tools)
- Python 3.7
- Deepgram account
Recommended: start with an empty GCP project for the resources created here (a Cloud Storage bucket, Pub/Sub topic, Cloud Scheduler job, and Cloud Function). Run gcloud config set project your-project-name to point the Google Cloud SDK tools at that project.
1. Create a Google Cloud Storage bucket.

   Run make bucket.

2. Create a .env.yaml file with the required environment variables.

   - TARGET_FEED_URL: The URL of a podcast's Atom/RSS feed, e.g. "https://armscontrolwonk.libsyn.com/rss".
   - TRANSCRIPTIONS_BUCKET_NAME: The name of the Google Cloud Storage bucket created in step 1.
   - DEEPGRAM_API_KEY: Your personal Deepgram API key. This is a secret!

   Example .env.yaml file:

       # Configuration
       TARGET_FEED_URL: "https://armscontrolwonk.libsyn.com/rss"
       TRANSCRIPTIONS_BUCKET_NAME: "transcriptions"

       # Secrets
       DEEPGRAM_API_KEY: "your_deepgram_secret_here"

3. Initialize the scheduling infrastructure (Pub/Sub topic and Cloud Scheduler job; documentation).

   Run make cron-job.

4. Deploy the Cloud Function.

   Run make deploy.
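Inside the function, the three settings from .env.yaml would be read as environment variables. A minimal sketch of validating them up front is shown below; load_config and REQUIRED_VARS are hypothetical names, and main.py may read its configuration differently.

```python
import os

# The three variables named in the setup steps.
REQUIRED_VARS = (
    "TARGET_FEED_URL",
    "TRANSCRIPTIONS_BUCKET_NAME",
    "DEEPGRAM_API_KEY",
)


def load_config(environ=None):
    """Return the required settings, failing fast if any is missing or empty."""
    environ = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_VARS}
```

Failing at startup with the names of the missing variables tends to be easier to debug than a KeyError partway through a transcription run.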
To work through the backlog of episodes in the feed, repeatedly run the created job: gcloud scheduler jobs run WeeklyJob.
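Since each execution handles at most three episodes by default, the number of manual runs needed to clear a backlog is a ceiling division. The helper below is a hypothetical illustration of that arithmetic, not part of main.py.

```python
def runs_needed(backlog_size, per_run=3):
    """Number of job runs needed to transcribe `backlog_size` episodes."""
    if per_run <= 0:
        raise ValueError("per_run must be positive")
    # Integer ceiling division: round up so a partial batch still counts.
    return -(-backlog_size // per_run)
```

For example, a feed with 100 untranscribed episodes would take 34 runs at the default of three episodes per run.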
Set up a transcription schedule appropriate to your podcast; the default settings suit a podcast with up to three episodes per week.
To check for updates more or less frequently, change the Cloud Scheduler job frequency. Default: once per week.
To transcribe more or fewer episodes per function execution, change the threshold in main.py#_main. Default: up to 3 episodes.
The Deepgram request in main.py#_transcribe is tailored to the Arms Control Wonk podcast; if the speech in your podcast is faster or slower, you may want to decrease or increase the utt_split utterance threshold, respectively.
See Deepgram's API documentation and Python SDK for documentation of the available options.
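To show where utt_split fits, the sketch below assembles a request for Deepgram's pre-recorded transcription endpoint without sending it. The option values are illustrative assumptions, not the ones used in main.py#_transcribe (which may go through the Python SDK instead), and build_transcription_request is a hypothetical helper.

```python
from urllib.parse import urlencode

DEEPGRAM_LISTEN_URL = "https://api.deepgram.com/v1/listen"


def build_transcription_request(audio_url, api_key, utt_split=0.8):
    """Describe a POST request asking Deepgram to transcribe a hosted audio file.

    `utt_split` is the pause length, in seconds, at which Deepgram starts
    a new utterance; lower it for fast speech, raise it for slow speech.
    """
    # Illustrative option set; adjust to the features your transcripts need.
    params = {"punctuate": "true", "utterances": "true", "utt_split": utt_split}
    return {
        "url": DEEPGRAM_LISTEN_URL + "?" + urlencode(params),
        "headers": {
            "Authorization": "Token " + api_key,
            "Content-Type": "application/json",
        },
        # Deepgram fetches the audio itself from this URL.
        "json": {"url": audio_url},
    }
```

The returned dict can be passed to an HTTP client of your choice; keeping request construction separate from sending makes the options easy to inspect and test.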
Want transcripts in a different format? Change how main.py#_process formats utterances.
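As one possible starting point, the sketch below turns the utterances in a Deepgram response into timestamped lines. It assumes the response was requested with utterances enabled, so that results.utterances holds entries with start and transcript fields; format_timestamp and format_utterances are hypothetical names, and the output format is only an example, not what main.py#_process produces.

```python
def format_timestamp(seconds):
    """Render a time offset in seconds as HH:MM:SS."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return "{:02d}:{:02d}:{:02d}".format(hours, minutes, secs)


def format_utterances(response):
    """Return one '[HH:MM:SS] text' line per utterance in the response."""
    utterances = response.get("results", {}).get("utterances", [])
    lines = []
    for utt in utterances:
        stamp = format_timestamp(utt["start"])
        lines.append("[{}] {}".format(stamp, utt["transcript"]))
    return "\n".join(lines)
```

Swapping the line template inside format_utterances is enough to produce other layouts, such as plain paragraphs without timestamps.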