
Machine Learning Model That Can Predict Sentences from Input Video (Without Audio)


CS550: Machine Learning - Final Project

Indian Institute of Technology, Bhilai

Advanced Lipreading (Sentence-Level)

AUTHOR: Omm Prakash Sahoo (12141190)


Table of Contents

  1. Introduction
  2. Dataset
  3. Data Augmentation
  4. Data Pipeline
  5. Deep Neural Network Architecture
  6. Model Training
  7. Model Performance
  8. Accuracy Analysis
  9. Cherry on Top
  10. App Development
  11. Tasks Done In Final Submission
  12. Conclusion
  13. Individual Contribution to the Project

Introduction

Lip reading is the fascinating task of predicting sentences from input video without using any audio information. In this final submission report, we provide an overview of the project, the dataset used, the data preprocessing steps, and the design and training of the deep neural network. We also give a brief overview of the app we developed.

Dataset

The project uses the GRID corpus, which contains 1000 short videos and corresponding alignment files for each of 34 speakers (18 men and 16 women), for a total of 34,000 videos and alignments.


Data Augmentation

Video Preprocessing

  • Extract 75 uniformly spaced frames from every video.
  • Convert each frame to grayscale.
  • Manually crop the mouth region from each frame.
  • Standardize the data (see the sketch below).
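
Under the assumption that the preprocessing is implemented with OpenCV and TensorFlow, the steps above might look like the following sketch. The crop coordinates are illustrative placeholders, not necessarily the window used in the repository.

```python
# Preprocessing sketch: 75 frames -> grayscale -> fixed mouth crop -> standardize.
# Assumes OpenCV and TensorFlow; the crop window is a placeholder.
import cv2
import tensorflow as tf

def load_video(path: str, num_frames: int = 75) -> tf.Tensor:
    cap = cv2.VideoCapture(path)
    frames = []
    for _ in range(num_frames):
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)   # grayscale
        mouth = gray[190:236, 80:220]                    # manual mouth crop (placeholder window)
        frames.append(mouth)
    cap.release()

    clip = tf.convert_to_tensor(frames, dtype=tf.float32)[..., tf.newaxis]
    # Standardize to zero mean and unit variance across the clip.
    mean = tf.math.reduce_mean(clip)
    std = tf.math.reduce_std(clip)
    return (clip - mean) / std
```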

Unique Character to Integer Encoding

  • Map each character to an integer.
  • Encode vocabularies to their respective indices.
  • Replace characters not in the vocabulary with an empty string.

Decoding

  • Decoding is done by returning the corresponding character for each integer value.
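
A minimal sketch of this encoding/decoding step, assuming TensorFlow's StringLookup layer; the exact vocabulary string is an assumption based on the GRID transcripts.

```python
# Character <-> integer mapping sketch. Assumes tf.keras; the vocabulary below
# is an assumed character set for the GRID transcripts.
import tensorflow as tf

vocab = [c for c in "abcdefghijklmnopqrstuvwxyz'?!123456789 "]

# Encode: each character maps to an integer; out-of-vocabulary characters map to "".
char_to_num = tf.keras.layers.StringLookup(vocabulary=vocab, oov_token="")
# Decode: the inverted lookup returns the corresponding character for each integer.
num_to_char = tf.keras.layers.StringLookup(
    vocabulary=char_to_num.get_vocabulary(), oov_token="", invert=True
)

encoded = char_to_num(tf.strings.unicode_split("bin blue at f two now", input_encoding="UTF-8"))
decoded = tf.strings.reduce_join(num_to_char(encoded)).numpy().decode("utf-8")
```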

Alignment Preprocessing

  • Alignments contain words corresponding to time stamps.
  • Words are encoded with the same character-to-integer mapping described above, treating silence as the space between words (see the sketch below).
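
Assuming GRID-style .align files with one "start end word" triple per line and silence marked as "sil", the alignment step might look like this sketch, which reuses the char_to_num lookup from above.

```python
# Alignment preprocessing sketch. Assumes "start end word" lines and a "sil"
# token for silence; reuses char_to_num from the encoding sketch above.
import tensorflow as tf

def load_alignments(path: str) -> tf.Tensor:
    words = []
    with open(path) as f:
        for line in f:
            _start, _end, word = line.split()
            if word != "sil":          # silence becomes the space between words
                words.append(word)
    sentence = " ".join(words)
    return char_to_num(tf.strings.unicode_split(sentence, input_encoding="UTF-8"))
```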

Data Pipeline

  • Select 500 random videos from each of the 34 folders.
  • Use 450 videos for training and 50 videos for validation.
  • Perform the specified preprocessing.
  • Add a prefetching step to the dataset for optimized performance.
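
A hedged sketch of such a pipeline with tf.data; the directory layout, batch size, and padded shapes are assumptions, and load_video / load_alignments are the helpers sketched above.

```python
# tf.data pipeline sketch: shuffle, preprocess, split, pad, batch, prefetch.
# Paths, batch size, and padded shapes are illustrative assumptions.
import tensorflow as tf

def load_sample(path):
    path = bytes.decode(path.numpy())
    speaker = path.split("/")[-2]
    name = path.split("/")[-1].split(".")[0]
    frames = load_video(path)
    labels = load_alignments(f"data/alignments/{speaker}/{name}.align")
    return frames, labels

def mappable(path):
    return tf.py_function(load_sample, [path], (tf.float32, tf.int64))

dataset = tf.data.Dataset.list_files("data/videos/*/*.mpg", shuffle=True)
dataset = dataset.map(mappable)

# 450 videos for training and 50 for validation, as described above.
train = dataset.take(450)
val = dataset.skip(450).take(50)

# Pad clips/labels to fixed shapes, batch, and prefetch so the GPU never waits on I/O.
train = train.padded_batch(2, padded_shapes=([75, None, None, None], [40])).prefetch(tf.data.AUTOTUNE)
val = val.padded_batch(2, padded_shapes=([75, None, None, None], [40])).prefetch(tf.data.AUTOTUNE)
```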

Deep Neural Network Architecture

After researching and training several models, we have selected the current best model for phase one of submission and for the final submission.

Model Architecture
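
The architecture diagram from the report is not reproduced here. As a rough, non-authoritative sketch, and assuming the Bi-GRU/LSTM feature extractor mentioned in the Conclusion sits on top of a 3D-convolutional front end, the model could be laid out as follows; layer counts and sizes are illustrative assumptions, not the exact configuration.

```python
# Plausible architecture sketch only; layer counts and sizes are assumptions.
import tensorflow as tf
from tensorflow.keras import layers

def build_model(vocab_size: int, frames: int = 75, height: int = 46, width: int = 140):
    return tf.keras.Sequential([
        tf.keras.Input(shape=(frames, height, width, 1)),
        # Spatio-temporal feature extraction over the mouth crops.
        layers.Conv3D(128, 3, padding="same", activation="relu"),
        layers.MaxPool3D((1, 2, 2)),
        layers.Conv3D(256, 3, padding="same", activation="relu"),
        layers.MaxPool3D((1, 2, 2)),
        layers.Conv3D(75, 3, padding="same", activation="relu"),
        layers.MaxPool3D((1, 2, 2)),
        # Keep the time axis; flatten the spatial dimensions for the recurrent layers.
        layers.TimeDistributed(layers.Flatten()),
        # Bidirectional recurrent layers model the character sequence.
        layers.Bidirectional(layers.LSTM(128, return_sequences=True)),
        layers.Dropout(0.5),
        layers.Bidirectional(layers.LSTM(128, return_sequences=True)),
        layers.Dropout(0.5),
        # One softmax per time step over the character vocabulary (+1 for the CTC blank).
        layers.Dense(vocab_size + 1, activation="softmax"),
    ])
```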

Model Training

  • Utilizing the CTC loss function.
  • Implementing a learning rate scheduler.
  • Periodically saving the model.
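
A training-setup sketch under the assumption that the model is the Keras model sketched above and the data comes from the tf.data pipeline sketch; the scheduler cut-off, decay factor, epoch count, and checkpoint path are illustrative, and the Keras-2-style backend CTC utility is assumed.

```python
# CTC loss + learning-rate scheduler + periodic checkpointing (illustrative values).
import tensorflow as tf

def ctc_loss(y_true, y_pred):
    # Standard CTC loss; lengths are taken from the padded tensors.
    batch_len = tf.cast(tf.shape(y_true)[0], dtype="int64")
    input_length = tf.cast(tf.shape(y_pred)[1], dtype="int64") * tf.ones((batch_len, 1), dtype="int64")
    label_length = tf.cast(tf.shape(y_true)[1], dtype="int64") * tf.ones((batch_len, 1), dtype="int64")
    return tf.keras.backend.ctc_batch_cost(y_true, y_pred, input_length, label_length)

def scheduler(epoch, lr):
    # Hold the learning rate for the first 30 epochs, then decay it.
    return lr if epoch < 30 else lr * 0.95

model = build_model(vocab_size=len(char_to_num.get_vocabulary()))
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss=ctc_loss)
model.fit(
    train,
    validation_data=val,
    epochs=100,
    callbacks=[
        tf.keras.callbacks.LearningRateScheduler(scheduler),
        # Periodically save the model weights so training can resume from a checkpoint.
        tf.keras.callbacks.ModelCheckpoint("checkpoints/model.weights.h5", save_weights_only=True),
    ],
)
```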

Model Performance

On epoch 1:

  • Original: bin white at t two now
  • Prediction: le e e e e eo

On epoch 50:

  • Original: bin blue by s six please
  • Prediction: bin blue by six please

On epoch 96:

  • Original: place green in d five soon
  • Prediction: place green in d five soon
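
Predictions like the ones above can be read off the network with greedy CTC decoding; a sketch assuming the Keras-2-style backend utility and the num_to_char lookup defined earlier.

```python
# Greedy CTC decoding sketch: collapse repeats, drop blanks, map integers back to characters.
import tensorflow as tf

def predict_sentence(model, clip):
    # clip: a single preprocessed (75, H, W, 1) video tensor; add a batch axis.
    y_pred = model.predict(clip[tf.newaxis, ...])
    decoded, _ = tf.keras.backend.ctc_decode(
        y_pred, input_length=[y_pred.shape[1]], greedy=True
    )
    chars = num_to_char(decoded[0][0])
    return tf.strings.reduce_join(chars).numpy().decode("utf-8")
```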

Accuracy Analysis

We used the standard Levenshtein-distance algorithm to evaluate the word- and sentence-level accuracy of both of our models.
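
The exact evaluation script is not reproduced here; a plain-Python sketch of the Levenshtein-based word error rate described above might look like this.

```python
# Word error rate via Levenshtein (edit) distance over word sequences.
def levenshtein(a, b) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    return levenshtein(ref, hyp) / max(len(ref), 1)

# The epoch-50 example above: one dropped word out of six -> WER of about 16.7%.
print(word_error_rate("bin blue by s six please", "bin blue by six please"))
```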

Accuracy of Model submitted for phase 1:

  • Mean Word Error Rate: 14.40%
  • Mean Sentence Error Rate: 6.06%
  • Word-level Accuracy: 85.6%
  • Sentence-level Accuracy: 93.94%

Accuracy of Model submitted for Final Submission:

  • Mean Word Error Rate: 1.77%
  • Mean Sentence Error Rate: 0.67%
  • Word-level Accuracy: 98.23%
  • Sentence-level Accuracy: 99.33%

Cherry on Top: Our model can predict sentences from a given video (i.e., without any audio) almost in real time when combined with our app.

How Did We Achieve Such Accuracy

We trained our model on 34 individuals with different skin tones, accents, and genders, and on sentences composed of highly varied words, making the model robust enough to predict on almost any video. Optimizations at several levels reduced training time from 12 hours for 500 videos to 9 hours for 1000 videos, roughly a 2.7x improvement in per-video training speed, and our encoding and decoding technique makes predictions fast enough to run on a given video almost in real time.

App Development

NeuroSync Lipscape: An app that synchronizes sentence predictions with your lip movement by harnessing the power of neural networks.

App Contents

Video Selection Bar


Using this bar, you can choose to predict a sentence on an existing video or upload* your own video to predict on.

Left Vertical Drawer

For now, it just showcases the app and its description. Later we plan to make it more functional, for example letting you choose which model to predict with, opening it up for integration with other apps, and having it act as a nav bar for the app. You can find it at the top-left part of the app.

Video Preview

This tab shows a preview of the video you have selected/uploaded*. To make sure we can process videos with any extension, we convert them to MP4 format, as sketched below. You can find it at the middle-left part of the app.
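
One way to do this normalization, assuming the ffmpeg CLI is available on the server; the file names are placeholders.

```python
# Re-encode any uploaded container/codec into an H.264 MP4 (assumes ffmpeg is installed).
import subprocess

def convert_to_mp4(src: str, dst: str = "converted.mp4") -> str:
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-c:v", "libx264", "-pix_fmt", "yuv420p", dst],
        check=True,
    )
    return dst
```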

Original Video

What Your Model Actually Takes as Input

This section lets users visualize the model input by animating the 75 preprocessed frames into a GIF at 10 frames per second, as sketched below. You can find it at the middle-right part of the app.
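
A sketch of that GIF-rendering step, assuming Pillow and NumPy are available and that `frames` is the standardized (75, H, W, 1) tensor from the preprocessing sketch.

```python
# Turn the 75 preprocessed frames into a 10 fps GIF for display (assumes Pillow and NumPy).
import numpy as np
from PIL import Image

def frames_to_gif(frames, path: str = "model_input.gif") -> str:
    arr = np.squeeze(np.asarray(frames), axis=-1)             # (75, H, W, 1) -> (75, H, W)
    arr = (arr - arr.min()) / (arr.max() - arr.min() + 1e-8)  # rescale for display
    arr = (arr * 255).astype(np.uint8)
    images = [Image.fromarray(f) for f in arr]
    # duration is per-frame display time in milliseconds; 100 ms = 10 frames per second.
    images[0].save(path, save_all=True, append_images=images[1:], duration=100, loop=0)
    return path
```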

Model Input Visualization

Output / Prediction

This section shows the raw model output as a tensor, along with the decoded sentence corresponding to that tensor. You can find it at the bottom-right part of the app.

Model Output

App UI

This is the complete UI of our current developed app:

App UI

Tasks Done In Final Submission

  • Trained the model after fine-tuning it on 34,000 videos (previously it was trained on only 1000).
  • Performed an accuracy analysis to measure the actual strength of the model.
  • Developed a full-stack app so the model can be used in real life by hearing-impaired or deaf people.
  • Explored the possibility of predicting on videos of any length (possibly hours long). Our proposed solution was to classify each frame, along with its neighbouring frames, into one of 44 unique phonemes, but the unavailability of a proper dataset was a major issue. We tried to build our own dataset and ended up recording a 10-minute video with a manual transcript, but due to our accents and the low quality of the data (touching or licking lips in between, among many other problems), we could not achieve accuracy near this model's (47%).
  • Many phonemes in English also look similar when pronounced, which was another major issue, and we could not completely decode the predicted phoneme sequence either.
  • For these reasons and more, we decided to focus on fine-tuning and perfecting the previous model, which we did successfully, and we plan to deploy the app after giving it some more finishing touches.

Conclusion

Our lip reading project successfully achieved advanced accuracy levels, reducing mean word and sentence error rates to 1.77% and 0.67%, respectively. The model, trained on a diverse dataset, demonstrated robustness and adaptability. Real-time sentence predictions were realized through the innovative "NeuroSync Lipscape" app, offering accessibility for the hearing-impaired. The unique character-to-integer encoding and data preprocessing techniques significantly contributed to model performance. The proposed Bi-GRU LSTM feature extractor, manual crop for mouth region extraction, and learning rate scheduler enhanced the model's capabilities. The app's user-friendly interface integrates video preview, model input visualization, and detailed predictions. Our collaborative efforts, including individual contributions and constant model monitoring, ensured project success. Overall, our work not only advances lip reading accuracy but also translates research into a practical, real-time solution for improved accessibility.

Project GitHub Link
