Indian Institute of Technology, Bhilai
- Project GitHub Link: https://github.com/Antoniocolapso/Lip-Reader
- Introduction
- Dataset
- Data Augmentation
- Data Pipeline
- Deep Neural Network Architecture
- Model Training
- Model Performance
- Accuracy Analysis
- Cherry on Top
- App Development
- Tasks Done In Final Submission
- Conclusion
- Individual Contribution to the Project
Lip Reading is a fascinating project that aims to predict sentences from input video data without using any audio information. In this final submission report, we provide an overview of the project, the dataset used, the data preprocessing steps, and the design and training of the deep neural network. Additionally, we present a brief overview of our developed app.
The project uses the GRID corpus, which includes 1,000 short videos and corresponding alignment files for each of 34 speakers (18 men and 16 women), totaling 34,000 videos and alignments.
Links:
- Research Paper: https://arxiv.org/abs/1611.01599
- Dataset: http://spandh.dcs.shef.ac.uk/gridcorpus
- Extract 75 uniform frames from every video.
- Convert each frame to grayscale.
- Manually crop the mouth region from each frame.
- Standardize the data.
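A minimal sketch of these preprocessing steps, assuming OpenCV is available; the crop window coordinates below are illustrative placeholders rather than the project's exact values.

```python
import cv2
import numpy as np

def load_video(path: str, num_frames: int = 75,
               crop=(190, 236, 80, 220)) -> np.ndarray:
    """Read a clip, grayscale it, crop the mouth region, and standardize it."""
    cap = cv2.VideoCapture(path)
    frames = []
    # GRID clips are 3 s at 25 fps, so each video yields 75 frames.
    while len(frames) < num_frames:
        ret, frame = cap.read()
        if not ret:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)   # grayscale
        y1, y2, x1, x2 = crop
        frames.append(gray[y1:y2, x1:x2])                # manual mouth crop
    cap.release()
    clip = np.stack(frames).astype(np.float32)[..., np.newaxis]
    return (clip - clip.mean()) / (clip.std() + 1e-8)    # standardize
```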
- Map each character to an integer.
- Encode vocabulary characters to their respective indices.
- Replace characters not in the vocabulary with an empty string.
- Decoding is done by returning the corresponding character for each integer value.
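A minimal sketch of this character-level encoding and decoding using Keras StringLookup layers; the exact vocabulary string is an assumption, and the empty string serves as the out-of-vocabulary token so unknown characters map to it.

```python
import tensorflow as tf

# Characters expected in GRID transcripts; "" is the out-of-vocabulary token.
vocab = list("abcdefghijklmnopqrstuvwxyz'?!123456789 ")
char_to_num = tf.keras.layers.StringLookup(vocabulary=vocab, oov_token="")
num_to_char = tf.keras.layers.StringLookup(
    vocabulary=char_to_num.get_vocabulary(), oov_token="", invert=True)

# Encoding: split a sentence into characters and map each one to an integer.
encoded = char_to_num(tf.strings.unicode_split("bin blue by six please", "UTF-8"))
# Decoding: map integers back to characters and join them into a sentence.
decoded = tf.strings.reduce_join(num_to_char(encoded)).numpy().decode()
```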
- Alignments contain words corresponding to time stamps.
- Words are encoded with the same character-to-integer mapping, treating silence tokens as the spaces between words.
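A minimal sketch of the alignment handling, assuming the GRID `.align` format in which each line is `start end word` and `sil` marks silence; it reuses `char_to_num` from the encoding sketch above.

```python
import tensorflow as tf

def load_alignments(path: str) -> tf.Tensor:
    """Return the transcript of one .align file as a sequence of integer codes."""
    words = []
    with open(path) as f:
        for line in f:
            parts = line.split()                     # "start end word"
            if len(parts) == 3 and parts[2] != "sil":
                words.append(parts[2])               # silence becomes the space between words
    sentence = " ".join(words)
    return char_to_num(tf.strings.unicode_split(sentence, "UTF-8"))
```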
- Select 500 random videos from each of the 34 folders.
- Use 450 videos for training and 50 videos for validation.
- Perform the specified preprocessing.
- Add a prefetching step to the dataset for optimized performance.
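A minimal sketch of this pipeline with `tf.data`, assuming a hypothetical `load_sample` helper that wraps the loaders above and returns `(frames, alignment)` for one video path; the folder path and batch size are illustrative.

```python
import tensorflow as tf

data = tf.data.Dataset.list_files("./data/s1/*.mpg")       # one speaker folder (illustrative)
data = data.shuffle(500, reshuffle_each_iteration=False)    # 500 random videos
data = data.map(lambda p: tf.py_function(load_sample, [p], (tf.float32, tf.int64)))

train = data.take(450)                                      # 450 videos for training
val = data.skip(450)                                        # 50 videos for validation

# Pad every batch to a common shape and prefetch to overlap loading with training.
pad = ([75, None, None, None], [None])
train = train.padded_batch(2, padded_shapes=pad).prefetch(tf.data.AUTOTUNE)
val = val.padded_batch(2, padded_shapes=pad).prefetch(tf.data.AUTOTUNE)
```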
After researching and training several models, we selected the current best model for both the phase-one submission and the final submission.
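For reference, a LipNet-style architecture sketch in Keras, following the referenced paper (stacked 3D convolutions, a bidirectional recurrent backbone, and a per-frame character softmax for CTC); the project's exact layer types, sizes, and counts may differ, and `char_to_num` comes from the encoding sketch above.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_model(vocab_size: int, frames: int = 75, height: int = 46, width: int = 140):
    return tf.keras.Sequential([
        layers.Input(shape=(frames, height, width, 1)),      # grayscale mouth crops
        layers.Conv3D(128, 3, padding="same", activation="relu"),
        layers.MaxPool3D((1, 2, 2)),
        layers.Conv3D(256, 3, padding="same", activation="relu"),
        layers.MaxPool3D((1, 2, 2)),
        layers.Conv3D(75, 3, padding="same", activation="relu"),
        layers.MaxPool3D((1, 2, 2)),
        layers.TimeDistributed(layers.Flatten()),
        layers.Bidirectional(layers.GRU(128, return_sequences=True)),
        layers.Dropout(0.5),
        layers.Bidirectional(layers.GRU(128, return_sequences=True)),
        layers.Dropout(0.5),
        layers.Dense(vocab_size + 1, activation="softmax"),  # +1 for the CTC blank
    ])

model = build_model(char_to_num.vocabulary_size())
```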
- Utilizing the CTC loss function.
- Implementing a learning rate scheduler.
- Periodically saving the model.
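A minimal sketch of this training setup, reusing `model`, `train`, and `val` from the sketches above; the learning-rate schedule and checkpoint path are illustrative, not the project's exact values.

```python
import tensorflow as tf

def ctc_loss(y_true, y_pred):
    """CTC loss over the padded batch (alignment-free sequence labelling)."""
    batch = tf.shape(y_true)[0]
    input_len = tf.fill([batch, 1], tf.shape(y_pred)[1])    # frames per sample
    label_len = tf.fill([batch, 1], tf.shape(y_true)[1])    # characters per sample
    return tf.keras.backend.ctc_batch_cost(y_true, y_pred, input_len, label_len)

def scheduler(epoch, lr):
    # Hold the initial learning rate for 30 epochs, then decay it exponentially.
    return lr if epoch < 30 else lr * tf.math.exp(-0.1)

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4), loss=ctc_loss)
model.fit(
    train,
    validation_data=val,
    epochs=100,
    callbacks=[
        tf.keras.callbacks.LearningRateScheduler(scheduler),
        tf.keras.callbacks.ModelCheckpoint(                  # periodic model saving
            "checkpoints/epoch_{epoch:02d}.weights.h5", save_weights_only=True),
    ],
)
```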
On epoch 1:
- Original: bin white at t two now
- Prediction: le e e e e eo
On epoch 50:
- Original: bin blue by s six please
- Prediction: bin blue by six please
On epoch 96:
- Original: place green in d five soon
- Prediction: place green in d five soon
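For reference, a minimal sketch of how such a prediction is produced: run the model on one preprocessed clip (here a hypothetical `sample_frames`), greedy-decode the CTC output, and map the integers back to characters with `num_to_char` from the encoding sketch.

```python
import tensorflow as tf

yhat = model.predict(tf.expand_dims(sample_frames, axis=0))   # shape (1, 75, vocab_size + 1)
decoded, _ = tf.keras.backend.ctc_decode(yhat, input_length=[yhat.shape[1]], greedy=True)
tokens = decoded[0][0]
tokens = tf.boolean_mask(tokens, tokens != -1)                # drop CTC padding
sentence = tf.strings.reduce_join(num_to_char(tokens)).numpy().decode()
print(sentence)   # e.g. "place green in d five soon"
```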
We used the standard Levenshtein-distance algorithm to evaluate word- and sentence-level accuracy for both of our models.
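A minimal sketch of this evaluation: word error rate is the Levenshtein edit distance between the reference and predicted word sequences divided by the reference length, and a sentence counts as an error unless the prediction matches exactly; the helper names are illustrative.

```python
def levenshtein(ref, hyp):
    """Edit distance (insertions, deletions, substitutions) between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution
        prev = curr
    return prev[-1]

def word_error_rate(reference: str, prediction: str) -> float:
    ref, hyp = reference.split(), prediction.split()
    return levenshtein(ref, hyp) / max(len(ref), 1)

def sentence_error(reference: str, prediction: str) -> int:
    return int(reference.strip() != prediction.strip())
```

Averaging these per-sample scores over the validation set gives the mean error rates reported below.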
Accuracy of Model submitted for phase 1:
- Mean Word Error Rate: 14.40%
- Mean Sentence Error Rate: 6.06%
- Word-level Accuracy: 85.6%
- Sentence-level Accuracy: 93.94%
Accuracy of Model submitted for Final Submission:
- Mean Word Error Rate: 1.77%
- Mean Sentence Error Rate: 0.67%
- Word-level Accuracy: 98.23%
- Sentence-level Accuracy: 99.33%
Cherry on Top: Combined with our app, our model can predict sentences from a given video (i.e., without any audio) in near real time.
We trained our model on 34 speakers with different skin tones, accents, and genders, speaking sentences composed of highly varied words, which makes the model robust enough to predict on almost any video. Our optimizations at different levels reduced training time from 12 hours for 500 videos to 9 hours for 1,000 videos (roughly 2.7x faster per video), and our encoding and decoding technique makes predictions fast enough to run over a given video in near real time.
NeuroSync Lipscape: An app that synchronizes sentence predictions with your lip movement by harnessing the power of neural networks.
Using this bar, you can choose to predict a sentence on an existing video or upload* your own video on which to predict a sentence.
For now, this section only showcases the app and its description. Later we will make it more functional, for example letting you choose which model to use for prediction, opening it up for integration with other apps, and having it act as a nav-bar for our app. You can find this at the top left of the app.
In this tab, there is a preview of the video that you have selected/uploaded*. To make sure we can process videos with any extension, we convert them to MP4 format. You can find this at the middle left of the app.
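A minimal sketch of that conversion step, assuming `ffmpeg` is installed on the host; the output path is illustrative.

```python
import subprocess

def to_mp4(src_path: str, dst_path: str = "preview.mp4") -> str:
    """Re-encode any uploaded/selected video to MP4 so the preview can play it."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src_path, "-vcodec", "libx264", dst_path],
        check=True,
    )
    return dst_path
```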
In this section, users can visualize the model input: the 75 preprocessed frames are animated as a GIF at 10 frames per second. You can find this at the middle right of the app.
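A minimal sketch of that visualization, assuming the `imageio` package: the frames are rescaled to 8-bit and written as a GIF at 10 frames per second (newer imageio releases take a per-frame `duration` instead of `fps`).

```python
import imageio
import numpy as np

def frames_to_gif(frames: np.ndarray, path: str = "model_input.gif") -> str:
    """Animate the 75 preprocessed frames so users can see exactly what the model sees."""
    scaled = (255 * (frames - frames.min()) / (np.ptp(frames) + 1e-8)).astype(np.uint8)
    imageio.mimsave(path, list(scaled.squeeze()), fps=10)
    return path
```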
In this section, we show the raw model output as a tensor, along with the decoded sentence corresponding to that tensor. You can find this at the bottom right of the app.
This is the complete UI of our currently developed app:
- Fine-tuned and trained the model on 34,000 videos (previously it was trained on only 1,000).
- Performed an accuracy analysis to measure the actual strength of the model.
- Developed a full-stack app to make the model usable in real life for hearing-impaired and deaf people.
- We explored the possibility of predicting on videos of any length (possibly very long, i.e., hours) and came up with a solution: classify each neighbouring frame into one of 44 unique phonemes. However, the unavailability of a proper dataset was a major issue, so we tried to build our own; we ended up recording a 10-minute video with manual transcription, but due to our accents and the low quality of the data (e.g., touching or licking lips in between, among other problems), we could not get close to this model's accuracy (we reached about 47%).
- In addition, many English phonemes sound similar when pronounced, which was another major issue, and we could not completely decode the predicted sequence either.
- For these reasons and more, we decided to focus on fine-tuning and perfecting the previous model, which we did successfully, and we plan to deploy the app after giving it some more finishing touches.
Our lip reading project successfully achieved advanced accuracy levels, reducing mean word and sentence error rates to 1.77% and 0.67%, respectively. The model, trained on a diverse dataset, demonstrated robustness and adaptability. Real-time sentence predictions were realized through the innovative "NeuroSync Lipscape" app, offering accessibility for the hearing-impaired. The unique character-to-integer encoding and data preprocessing techniques significantly contributed to model performance. The proposed Bi-GRU LSTM feature extractor, manual crop for mouth region extraction, and learning rate scheduler enhanced the model's capabilities. The app's user-friendly interface integrates video preview, model input visualization, and detailed predictions. Our collaborative efforts, including individual contributions and constant model monitoring, ensured project success. Overall, our work not only advances lip reading accuracy but also translates research into a practical, real-time solution for improved accessibility.