Speech Emotion Recognition

Speech Emotion Recognition (SER) is the task of inferring a speaker's emotional state from their speech signal.

This project develops a Speech Emotion Recognition system trained on audio samples from the EMOVO Corpus (Italian language).
The model is a multilayer perceptron (MLP) implemented in Keras.

The goal of this repository is to provide a starting point that can be extended with other datasets and other deep architectures.

Structure

├── data/                                    
│   └── EMOVO/                                // EMOVO dataset
├── docs/                                     // documentation
├── models/                                  
│   ├── mlp.py                                // Multilayer Perceptron 
│   └── trained_model.m                       // trained model
├── modules/                                 
│   ├── load_dataset.py                       // loading EMOVO
│   ├── data_augmengtation.py                 // data augmentation functions
│   └── data_preparation.py                   // feature extraction and data preparation for ML
├── out/
│   ├── classification_report.csv             // statistics of predictions vs. true values
│   ├── confusion_matrix.png                  // confusion matrix 
│   └── predictions.csv                       // raw predictions
├── preprocessing.py                          // loading and preprocessing the dataset 
├── train.py                                  // training the model
└── predict.py                                // making predictions

Requirements

Install dependencies: pip install -r requirements.txt

The dataset

EMOVO is the first emotional speech corpus for the Italian language. It contains recordings of 6 actors (3 male and 3 female), each of whom recited 14 sentences while simulating 6 emotional states (disgust, fear, anger, joy, surprise, sadness) plus the neutral state.
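
As a quick sanity check on the description above, the expected corpus size follows from the three counts, assuming one recording per (actor, sentence, state) combination. The sketch below also builds an integer label index of the kind a classifier needs. The English label names and their ordering are assumptions made for illustration, not necessarily the encoding used in this repository.

```python
# Illustrative sketch: label names and ordering are assumptions,
# not necessarily the encoding used by this repository.
ACTORS = 6
SENTENCES = 14
EMOTIONS = ["disgust", "fear", "anger", "joy", "surprise", "sadness", "neutral"]

# Assumes one recording per (actor, sentence, emotional state) combination.
expected_recordings = ACTORS * SENTENCES * len(EMOTIONS)

# Map each state to an integer class index and back.
label_to_index = {name: i for i, name in enumerate(EMOTIONS)}
index_to_label = {i: name for name, i in label_to_index.items()}

print(expected_recordings)  # 588
```

With only a few hundred original recordings, a number this small helps explain why the pipeline relies on data augmentation.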

Project pipeline

preprocessing.py:
Loads the dataset and extracts audio features with the librosa library. Data augmentation techniques are used to create synthetic samples and increase the number of audio samples.
train.py:
Trains the MLP classifier defined in models/mlp.py. The data are first prepared for input to the neural network by a dedicated function, data_preparation().
predict.py:
Makes predictions on data the model has never seen. Summaries of the predictions are written to the out/ folder.
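
The summaries in out/ compare predicted labels with true labels. A minimal sketch of that kind of comparison is given below, using only the Python standard library. It is not the code from predict.py; the function names are hypothetical and the labels are toy values invented for the example. Libraries such as scikit-learn provide equivalent confusion_matrix and classification_report functions.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count how often each (true_label, predicted_label) pair occurs."""
    return Counter(zip(y_true, y_pred))


def per_class_report(y_true, y_pred):
    """Return {label: (precision, recall)} derived from the confusion counts."""
    counts = confusion_counts(y_true, y_pred)
    labels = sorted(set(y_true) | set(y_pred))
    report = {}
    for label in labels:
        true_positives = counts[(label, label)]
        predicted = sum(counts[(t, label)] for t in labels)  # column total
        actual = sum(counts[(label, p)] for p in labels)     # row total
        precision = true_positives / predicted if predicted else 0.0
        recall = true_positives / actual if actual else 0.0
        report[label] = (precision, recall)
    return report


# Toy labels invented for illustration only (not results from this project).
y_true = ["joy", "joy", "anger", "fear", "anger", "fear"]
y_pred = ["joy", "anger", "anger", "fear", "anger", "joy"]

report = per_class_report(y_true, y_pred)
for label, (precision, recall) in report.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")
```

Precision is computed over everything predicted as a class and recall over everything that truly belongs to it, so a class can score well on one and poorly on the other.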

Notes

Despite the promising results, the work could be improved by increasing the number of audio samples available for training SER models. With more data, it would be possible to train deep neural networks that learn feature extraction automatically (e.g., CNNs and LSTMs).
A possible business application of an SER system was proposed in my master's thesis, "A Speech Emotion Recognition system to perform Sentiment Analysis in a business context".
