AWS Transcribe automatically converts speech to text using Amazon's own trained machine-learning models.
Using it, I created a project to generate and synchronize subtitles for a given input video file.
This repo contains the Terraform templates needed to deploy the solution on AWS, as well as the code used by the Lambdas and the code used by the ECS task (on Fargate), built from a home-made Docker image uploaded to ECR.
- Put a video file as input in an S3 folder.
- Get the result as a `.srt` file.
- The user puts the input video file into the `app_bucket` under `inputs/`.
- This triggers the `input_to_sqs` Lambda, which sends the key path of the input file to the `sqs_input` queue.
- A message received in this queue triggers the `trigger_ecs_task` Lambda. The function will:
  - read and parse the message from the SQS queue,
  - trigger an ECS task, passing it the values (key path and Bucket name) fetched from SQS.
- The ECS task downloads the input file to its local filesystem, extracts the sound from it, and uploads the resulting `.mp3` under `tmp/` in the `app_bucket`.
- Once a message is put under `tmp/`, the `trigger_transcribe_job` Lambda starts the Transcribe job and sends it the key path of the extracted sound as well as the Bucket name.
- The Transcribe job starts with the arguments given to it (key path of the `.mp3` file and the Bucket name).
- Once the Transcribe job is done, its result is uploaded into the `transcribe_result_bucket`.
- This result needs to be parsed into the `.srt` format. This is the job of the `parse_transcribe_result` Lambda, which is triggered by a Bucket notification when a file is uploaded into the root of the `transcribe_result_bucket`.
- Finally, the parsed and synchronized `.srt` file for the uploaded input video is uploaded into the `transcribe_result_bucket` under `results/`.
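The core of the parsing step is converting the second-based timestamps found in the Transcribe JSON into the `HH:MM:SS,mmm` format SRT expects. A minimal sketch of that conversion (helper names are mine, not necessarily those used in the repo):

```python
def to_srt_timestamp(seconds: float) -> str:
    """Convert a time in seconds (as found in the Transcribe JSON,
    e.g. "start_time": "1.04") to the SRT format HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, rem = divmod(millis, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def srt_block(index: int, start: float, end: float, text: str) -> str:
    """Render a single numbered SRT subtitle block."""
    return (
        f"{index}\n"
        f"{to_srt_timestamp(start)} --> {to_srt_timestamp(end)}\n"
        f"{text}\n"
    )
```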
- `./code`

The code directory is composed of 3 sub-directories: `docker`, `lambdas` and `local`.
- `/lambdas`

This directory contains the Python code used by the AWS Lambdas.
- `/local`

This folder was my starting point and was used to validate my initial idea.
It contains the Python code to locally test the Transcribe job. It takes a video path as an input and calls the AWS API to retrieve the final `.srt` result.
To use it, export your AWS profile into the shell, create an S3 Bucket, fill up `config.json` and execute the Transcribe job with `python3 transcribe.py`.
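For reference, a locally testable sketch of the parameters such a Transcribe call takes (job name and bucket are placeholders; the repo's `transcribe.py` may structure this differently):

```python
def build_transcribe_job(job_name: str, bucket: str, key: str,
                         language_code: str = "en-US") -> dict:
    """Build the keyword arguments for boto3's
    transcribe.start_transcription_job()."""
    return {
        "TranscriptionJobName": job_name,
        "LanguageCode": language_code,
        "MediaFormat": key.rsplit(".", 1)[-1],  # e.g. "mp3"
        "Media": {"MediaFileUri": f"s3://{bucket}/{key}"},
        "OutputBucketName": bucket,
    }


# Usage (requires AWS credentials in the environment):
# import boto3
# boto3.client("transcribe").start_transcription_job(
#     **build_transcribe_job("subtitles-job", "my-bucket", "tmp/video.mp3"))
```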
- `/docker`

This part contains the Python code used by the ECS task to extract the sound from the video. The `Dockerfile` is used to build the Docker container, which needs to be pushed to the ECR repo.
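The sound-extraction step can be sketched as a thin wrapper around ffmpeg (assuming ffmpeg is what the image uses; the exact flags in the repo may differ):

```python
import subprocess


def ffmpeg_extract_cmd(video_path: str, mp3_path: str) -> list[str]:
    """Build the ffmpeg command that drops the video stream (-vn)
    and encodes the audio track as MP3."""
    return [
        "ffmpeg", "-y",           # overwrite the output file if it exists
        "-i", video_path,         # input video
        "-vn",                    # drop the video stream
        "-acodec", "libmp3lame",  # encode the audio as MP3
        mp3_path,
    ]


def extract_audio(video_path: str, mp3_path: str) -> None:
    """Run ffmpeg; raises CalledProcessError on failure."""
    subprocess.run(ffmpeg_extract_cmd(video_path, mp3_path), check=True)
```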
- `./infrastructure`

This directory contains all the necessary templates and resources to deploy the infrastructure on AWS.
- `/compositions`

Logical units of Terraform code. Each part defines some Terraform `module`s which call a group of Terraform `resource`s defined in `./infrastructure/resources`. For example, in `buckets` we can find the code used to deploy each S3 Bucket used by the solution. Since I want all the Buckets to be encrypted, I can re-use the same `module` structure I defined for all of them.
- `/ecs_definition`

JSON templates defining the ECS task definition. This template is populated by the `ecs_definition.tf` Terraform template file.
- `/policies`

All the policies used by the different components. These policies are templated using the same technique as the one used in the `ecs_definition` module.
- `/resources`

Terraform `resource`s logically grouped together and called from the `./compositions` part. For example, an S3 Bucket is defined as a group of `aws_s3_bucket`, `aws_s3_bucket_policy` and `aws_s3_bucket_public_access_block` Terraform resources.
- `cd infrastructure/compositions/terraform_backend`
- Comment out the S3 part of the `providers.tf` file from `terraform_backend`:
```hcl
terraform {
  required_version = ">= 0.12"
  // backend "s3" {
  // }
}
```
- `terraform init --backend-config=backend.config`
- `terraform plan`
- `terraform apply` (optionally `terraform apply --auto-approve`)
- Uncomment the S3 part:
```hcl
terraform {
  required_version = ">= 0.12"
  backend "s3" {
  }
}
```
- Run `terraform init --backend-config=backend.config` again and type `yes` to copy the local state into the deployed remote state Bucket.
- Remove any `.tfstate` or `.tfstate.backup` file from the current dir.
- `cd infrastructure/compositions/networking`
- `terraform init --backend-config=backend.config`
- `terraform plan`
- `terraform apply` (optionally `terraform apply --auto-approve`)
- Apply the same commands for `infrastructure/compositions/buckets` and `infrastructure/compositions/media_processing`.
- Build and upload to ECR the Docker image used by the ECS task:
With fish shell:
```shell
eval (aws ecr get-login --no-include-email --region <region>)
docker build -t ecr_media_processing .
docker tag ecr_media_processing:latest <account_id>.dkr.ecr.<region>.amazonaws.com/ecr_media_processing:latest
docker push <account_id>.dkr.ecr.<region>.amazonaws.com/ecr_media_processing:latest
```

However, for some very close pronunciation cases, the model may not be accurate enough (although it is constantly improving).
In the F.R.I.E.N.D.S extract I used as a test, Phoebe says:
> We went to a self-defense class

which is transcribed as:

> Way went to a self-defense class
Annoying as it is, this can easily be fixed by editing the resulting `.srt` file with a simple text editor.
- All-in with Lambda

My plan was to use only Lambda functions to do everything. I was quickly limited for the following reasons:
- I needed to locally download the input video to extract the sound from it, and the `/tmp` storage of a Lambda is limited to 512MB.
- There was a risk of too long a processing time, which means the Lambda could have timed out.

Because of these limitations, I decided to go for an ECS task running on Fargate.
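The handoff from the `trigger_ecs_task` Lambda to Fargate boils down to an `ecs.run_task` call whose container override injects the SQS values as environment variables. A sketch under that assumption (cluster, task definition, and variable names are placeholders, not necessarily the repo's):

```python
def build_run_task_kwargs(bucket: str, key: str) -> dict:
    """Build the kwargs for boto3's ecs.run_task(), passing the input
    file location to the container as environment variables."""
    return {
        "cluster": "media-processing",              # placeholder name
        "launchType": "FARGATE",
        "taskDefinition": "media-processing-task",  # placeholder name
        # networkConfiguration (subnets, security groups) is also
        # required for Fargate; omitted here for brevity.
        "overrides": {
            "containerOverrides": [{
                "name": "media-processing",         # placeholder name
                "environment": [
                    {"name": "BUCKET_NAME", "value": bucket},
                    {"name": "KEY_PATH", "value": key},
                ],
            }]
        },
    }
```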
- AWS Transcribe tmp file

Transcribe creates a `.write_access_check_file.temp` file at the root of the Bucket to which its end result will be uploaded. This means that the `parse_transcribe_result` Lambda would be triggered by the creation of this file and would try to parse it, resulting in an error (since the Lambda expects a `.json` file produced by the Transcribe job).
The solution was to trigger this Lambda only when a file uploaded to the root of the Bucket also ends with `.json` (using the `suffix` feature of S3 notifications).
- Transcribe and key path

My initial plan was to use only one Bucket for everything.
However, Transcribe does not allow specifying a key path under which to upload its end result (otherwise I would have used the already deployed `app_bucket` and uploaded the final result under something like `/results`). Only a Bucket can be specified in the Transcribe job. I could have used the `app_bucket` and uploaded the results at its root, but I think this breaks the logic of having a dir-like structure in this Bucket.
The solution I chose was to create another Bucket (`transcribe_result_bucket`) to hold the end result of the Transcribe job.
- ECS on Fargate is suitable for this use-case because:
  - I do not need to manage the underlying instances.
  - I have 10GB for the Docker layer and an additional 4GB for volume mounts, which is enough to download most input video files locally.
- The Lambda functions work inside private subnets and use VPC endpoints to reach the different services.
- Same for the ECS cluster, which instead uses a NAT Gateway to pull the Docker container from ECR.
- If you want to test the solution by yourself, I added `video.mp4`, which you can use as an input.
- The result from the Transcribe job can be found under `tmp_transcribe_result.json`.
- The parsed final result can be found under `result.srt`.
- Sharding with Kinesis.
- Have a frontend.
These solutions might be implemented in the future in a private repo.
