AWS Transcribe automatically converts speech to text using Amazon's own trained machine-learning models.
Using it, I created a project to generate and synchronize subtitles for a given input video file.
This repo contains the Terraform templates needed to deploy the solution on AWS, as well as the code used by the Lambdas and the code used by the ECS task (on Fargate), built from a home-made Docker image uploaded to ECR.
- Put a video file as input in an S3 folder.
- Get the result as a `.srt` file.
- The user puts the input video file into the `app_bucket` under `inputs/`.
- This triggers the `input_to_sqs` Lambda, which sends the key path of the input file to the `sqs_input` queue.
- A message received in this queue triggers the `trigger_ecs_task` Lambda. The function will:
  - read and parse the message from the SQS queue,
  - trigger an ECS task, passing it the values (key path and Bucket name) fetched from SQS.
- The ECS task downloads the input file to its local filesystem, extracts the sound from it, and uploads the resulting `.mp3` under `tmp/` in the `app_bucket`.
- Once a message is put under `tmp/`, the `trigger_transcribe_job` Lambda starts the Transcribe job and sends it the key path of the extracted sound as well as the Bucket name.
- The Transcribe job starts with the arguments given to it (key path of the `.mp3` file and the Bucket name).
- Once the Transcribe job is done, its result is uploaded into the `transcribe_result_bucket`.
- This result needs to be parsed into the `.srt` format. This is the job of the `parse_transcribe_result` Lambda, which is triggered by a Bucket notification when a file is uploaded into the root of the `transcribe_result_bucket`.
- Finally, the parsed and synchronized `.srt` file for the uploaded input video is uploaded into the `transcribe_result_bucket` under `results/`.
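The core of the parsing step is converting the second-based timestamps found in the Transcribe JSON into the `HH:MM:SS,mmm` format SRT expects. A minimal sketch of that conversion (helper names are mine, not necessarily those used in the repo):

```python
def to_srt_timestamp(seconds: float) -> str:
    """Convert a time in seconds (as found in the Transcribe JSON,
    e.g. "start_time": "1.04") to the SRT format HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, rem = divmod(millis, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def srt_block(index: int, start: float, end: float, text: str) -> str:
    """Render a single numbered SRT subtitle block."""
    return (
        f"{index}\n"
        f"{to_srt_timestamp(start)} --> {to_srt_timestamp(end)}\n"
        f"{text}\n"
    )
```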
- `./code`

The code directory is composed of 3 sub-directories: `docker`, `lambdas` and `local`.
- `/lambdas`

This directory contains the Python code used by the AWS Lambdas.
- `/local`

This folder was my starting point and was used to validate my initial idea.
It contains the Python code to locally test the Transcribe job. It takes a video path as an input and calls the AWS API to retrieve the final `.srt` result.
To use it, export your AWS profile into the shell, create an S3 Bucket, fill up `config.json` and execute the Transcribe job with `python3 transcribe.py`.
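For reference, a locally testable sketch of the parameters such a Transcribe call takes (job name and bucket are placeholders; the repo's `transcribe.py` may structure this differently):

```python
def build_transcribe_job(job_name: str, bucket: str, key: str,
                         language_code: str = "en-US") -> dict:
    """Build the keyword arguments for boto3's
    transcribe.start_transcription_job()."""
    return {
        "TranscriptionJobName": job_name,
        "LanguageCode": language_code,
        "MediaFormat": key.rsplit(".", 1)[-1],  # e.g. "mp3"
        "Media": {"MediaFileUri": f"s3://{bucket}/{key}"},
        "OutputBucketName": bucket,
    }


# Usage (requires AWS credentials in the environment):
# import boto3
# boto3.client("transcribe").start_transcription_job(
#     **build_transcribe_job("subtitles-job", "my-bucket", "tmp/video.mp3"))
```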
- `/docker`

This part contains the Python code used by the ECS task to extract the sound from the video. The `Dockerfile` is used to build the Docker container, which needs to be pushed to the ECR repo.
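The sound-extraction step can be sketched as a thin wrapper around ffmpeg (assuming ffmpeg is what the image uses; the exact flags in the repo may differ):

```python
import subprocess


def ffmpeg_extract_cmd(video_path: str, mp3_path: str) -> list[str]:
    """Build the ffmpeg command that drops the video stream (-vn)
    and encodes the audio track as MP3."""
    return [
        "ffmpeg", "-y",           # overwrite the output file if it exists
        "-i", video_path,         # input video
        "-vn",                    # drop the video stream
        "-acodec", "libmp3lame",  # encode the audio as MP3
        mp3_path,
    ]


def extract_audio(video_path: str, mp3_path: str) -> None:
    """Run ffmpeg; raises CalledProcessError on failure."""
    subprocess.run(ffmpeg_extract_cmd(video_path, mp3_path), check=True)
```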
- `./infrastructure`

This directory contains all the necessary templates and resources to deploy the infrastructure on AWS.
- `/compositions`

Logical units of Terraform code. Each part defines some Terraform `module`s which call a group of Terraform `resource`s defined in `./infrastructure/resources`. For example, in `buckets` we can find the code used to deploy each S3 Bucket used by the solution. Since I want all the Buckets to be encrypted, I can re-use the same `module` structure I defined for all of them.
- `/ecs_definition`

JSON templates defining the ECS task definition. This template is populated by the `ecs_definition.tf` Terraform template file.
- `/policies`

All the policies used by the different components. These policies are templated using the same technique as the one used in the `ecs_definition` module.
- `/resources`

Terraform `resource`s logically grouped together and called from the `./compositions` part. For example, an S3 Bucket is defined as a group of `aws_s3_bucket`, `aws_s3_bucket_policy` and `aws_s3_bucket_public_access_block` Terraform resources.
- `cd infrastructure/compositions/terraform_backend`
- Comment out the S3 part of the `providers.tf` file from `terraform_backend`:
```hcl
terraform {
  required_version = ">= 0.12"
  // backend "s3" {
  // }
}
```
- `terraform init --backend-config=backend.config`
- `terraform plan`
- `terraform apply` (optionally `terraform apply --auto-approve`)
- Uncomment the S3 part:
```hcl
terraform {
  required_version = ">= 0.12"
  backend "s3" {
  }
}
```
- Run `terraform init --backend-config=backend.config` again and type `yes` to copy the local state into the deployed remote state Bucket.
- Remove any `.tfstate` or `.tfstate.backup` file from the current dir.
- `cd infrastructure/compositions/networking`
- `terraform init --backend-config=backend.config`
- `terraform plan`
- `terraform apply` (optionally `terraform apply --auto-approve`)
- Apply the same commands for `infrastructure/compositions/buckets` and `infrastructure/compositions/media_processing`.
- Build and upload to ECR the Docker image used by the ECS task:
With fish shell:
```shell
eval (aws ecr get-login --no-include-email --region <region>)
docker build -t ecr_media_processing .
docker tag ecr_media_processing:latest <account_id>.dkr.ecr.<region>.amazonaws.com/ecr_media_processing:latest
docker push <account_id>.dkr.ecr.<region>.amazonaws.com/ecr_media_processing:latest
```

However, for some very close pronunciation cases, the model may not be accurate enough (although it is constantly improving).
In the F.R.I.E.N.D.S extract I used as a test, Phoebe says:
> We went to a self-defense class

which is transcribed as:

> Way went to a self-defense class
Annoying as it is, this can easily be fixed by editing the resulting `.srt` file with a simple text editor.
- All-in with Lambda

My plan was to use only Lambda functions to do everything. I was quickly limited for the following reasons:
- I needed to locally download the input video to extract the sound from it, and the `/tmp` storage of a Lambda is limited to 512MB.
- There was a risk of too long a processing time, which means the Lambda could have timed out.

Because of these limitations, I decided to go for an ECS task running on Fargate.
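The handoff from the `trigger_ecs_task` Lambda to Fargate boils down to an `ecs.run_task` call whose container override injects the SQS values as environment variables. A sketch under that assumption (cluster, task definition, and variable names are placeholders, not necessarily the repo's):

```python
def build_run_task_kwargs(bucket: str, key: str) -> dict:
    """Build the kwargs for boto3's ecs.run_task(), passing the input
    file location to the container as environment variables."""
    return {
        "cluster": "media-processing",              # placeholder name
        "launchType": "FARGATE",
        "taskDefinition": "media-processing-task",  # placeholder name
        # networkConfiguration (subnets, security groups) is also
        # required for Fargate; omitted here for brevity.
        "overrides": {
            "containerOverrides": [{
                "name": "media-processing",         # placeholder name
                "environment": [
                    {"name": "BUCKET_NAME", "value": bucket},
                    {"name": "KEY_PATH", "value": key},
                ],
            }]
        },
    }
```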
- AWS Transcribe tmp file

Transcribe creates a `.write_access_check_file.temp` file at the root of the Bucket to which its end result will be uploaded. This means that the `parse_transcribe_result` Lambda would be triggered by the creation of this file and would try to parse it, resulting in an error (since the Lambda expects a `.json` file produced by the Transcribe job).
The solution was to trigger this Lambda only when a file uploaded to the root of the Bucket also ends with `.json` (using the `suffix` feature of S3 notifications).
- Transcribe and key path

My initial plan was to use only one Bucket for everything.
However, Transcribe does not allow specifying a key path under which to upload its end result (otherwise I would have used the already deployed `app_bucket` and uploaded the final result under something like `/results`). Only a Bucket can be specified in the Transcribe job. I could have used the `app_bucket` and uploaded the results at its root, but I think this breaks the logic of having a dir-like structure in this Bucket.
The solution I chose was to create another Bucket (`transcribe_result_bucket`) to hold the end result of the Transcribe job.
- ECS on Fargate is suitable for this use-case because:
  - I do not need to manage the underlying instances.
  - I have 10GB for the Docker layer and an additional 4GB for volume mounts, which is enough to download most input video files locally.
- The Lambda functions work inside private subnets and use VPC endpoints to reach the different services.
- Same for the ECS cluster, which instead uses a NAT Gateway to pull the Docker container from ECR.
- If you want to test the solution by yourself, I added `video.mp4`, which you can use as an input.
- The result from the Transcribe job can be found under `tmp_transcribe_result.json`.
- The parsed final result can be found under `result.srt`.
- Sharding with Kinesis.
- Have a frontend.
These solutions might be implemented in the future in a private repo.
