Project Master

This repository is a compilation of scripts that I used during my Master's project.

The goal of these scripts is to perform personality detection on fictional characters. The personality detection algorithms follow Y. Mehta's procedure (see Y. Mehta's GitHub).

In the Archives section, you can find my earlier attempts at personality detection. These include work I did in Google Colab, some tests I made with sentiment analysis (see the SenticNet website), and my attempts to build my own Keras models.

The files in the root directory and those in the Modified Code directory are the ones I used for the final version of my Master's project.

How to Use

Setup

Clone this repository with:

git clone git@github.com:JocelinPitt/ProjetMaster.git

If you want to create your own models, you should also clone Y. Mehta's repository:

git clone git@gitlab.com:ml-automated-personality-detection/personality.git

You should also install the required Python packages with:

pip install -r requirements.txt

Note that not all of the packages are needed for the final pipeline, but they are required if you want to run the archived files.

Prepare the data

First, download the data files from Zenodo using the links below. In the "Movies speeches by narrative role Dataset", you will find three types of files. DATA_full.csv contains my raw data, which needs to be linked with UCI's cast.html (see below). If you choose to work with these files, you can download cast.csv from Zenodo's "Multiple personality data set"; it is the same file, which I edited into CSV format. With these two files, you can run the Imdb.py file, which tries to match both data frames into a single one.

Because this matching is a very long process, you can skip it by downloading Df.pkl from Zenodo's "Movie speeches by narrative role". This pickled file is the data frame produced by the previous operation, so you can pass it directly to the To_Csv.py script. You can also skip that step and directly download the narrative role CSV files available on Zenodo (i.e. hero.csv, villain.csv, agent.csv).

Once this is done, clone Y. Mehta's GitHub repository and replace some of his Python scripts with the slightly modified versions that you'll find in the Modified Code directory. Then follow Y. Mehta's process to embed your CSV files. The modified code lets you pass a "MyData" argument to the LM_extractor.py calls:

python LM_extractor.py -dataset_type 'MyData' -token_length 512 -batch_size 32 -embed 'bert-base' -op_dir 'pkl_data'

You can also skip this step by downloading my embedded data on Zenodo's "Bert embedded speeches by narrative role".

Prepare the models

You can either use the pretrained models available in this repository and feed your embedded data directly to the Predict.py file, or train new models by following Y. Mehta's process. If you choose the latter, the MLP_LM_saved.py file allows you to save the models you train. To do so, create a directory called "checkpoint" before following Y. Mehta's procedure.
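The "checkpoint" directory can also be created from Python. The following is a minimal sketch, assuming the directory should sit in the folder from which the training script is launched; the helper name `ensure_checkpoint_dir` is hypothetical and is not part of this repository or of Y. Mehta's code.

```python
from pathlib import Path


def ensure_checkpoint_dir(base="."):
    """Create the 'checkpoint' directory under `base` if it is missing."""
    path = Path(base) / "checkpoint"
    # exist_ok=True makes the call safe to repeat between training runs.
    path.mkdir(parents=True, exist_ok=True)
    return path


if __name__ == "__main__":
    print(ensure_checkpoint_dir())
```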

To train your models on the essays files, which are the Pennebaker et al. / Mairesse et al. gold-standard training files for personality detection, first embed them with:

python LM_extractor.py -dataset_type 'essays' -token_length 512 -batch_size 32 -embed 'bert-base' -op_dir 'pkl_data'

Then run the MLP_LM_saved.py file to save the trained models in the "checkpoint" directory.

Python Scripts and data

Imdb.py

The goal of this file is to link two data frames: UCI's cast.html from their Movie Data Set (see UCI's Machine Learning Repository web page) and a data frame I built for another class project. The data frames are linked on the movie title and the fictional characters' names, which are not contained in the UCI data frame. To do so, this script uses the IMDbPY package.
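The title-matching part of this linking can be illustrated with a small sketch. It is not the code of Imdb.py: it does not call IMDbPY, it works on plain lists of dictionaries rather than data frames, and the keys `title`, `character`, and `actor`, as well as the sample rows, are hypothetical.

```python
import re


def normalize_title(title):
    """Lowercase a title and collapse punctuation and whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def link_by_title(speeches, cast):
    """Attach cast records to speech records that share a normalized title.

    Both arguments are lists of dicts with a 'title' key (hypothetical schema).
    """
    cast_by_title = {}
    for record in cast:
        key = normalize_title(record["title"])
        cast_by_title.setdefault(key, []).append(record)

    linked = []
    for row in speeches:
        for record in cast_by_title.get(normalize_title(row["title"]), []):
            # Keep the speech row and add the actor found in the cast data.
            linked.append({**row, "actor": record["actor"]})
    return linked


# Made-up sample rows, for illustration only.
speeches = [{"title": "Some Movie", "character": "Character A"}]
cast = [
    {"title": "some movie!", "actor": "Actor A"},
    {"title": "Another Movie", "actor": "Actor B"},
]
print(link_by_title(speeches, cast))
```

Normalizing both sides before comparing avoids missing a match because of case or punctuation differences. Titles that differ more substantially still need an explicit mapping.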

Predict.py

This file is a simple script that calls the trained models to predict the personality of the fictional characters present in both of the data frames mentioned above. It calls the Keras functions 'evaluate' and 'predict' and produces a confusion matrix. The script includes a variable called "path" that must be changed for it to work.
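For readers unfamiliar with the confusion matrix step, here is a minimal pure-Python sketch of how one can be built for a single trait from predicted probabilities. It does not reproduce Predict.py and does not call Keras; the 0.5 threshold, the function name, and the sample values are assumptions made for illustration.

```python
def confusion_matrix(y_true, y_prob, threshold=0.5):
    """Return [[TN, FP], [FN, TP]] for binary labels and probabilities."""
    matrix = [[0, 0], [0, 0]]
    for truth, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        # Rows index the true label, columns index the predicted label.
        matrix[truth][predicted] += 1
    return matrix


# Hypothetical labels (1 = trait present) and model outputs for one trait.
labels = [1, 0, 1, 0]
probabilities = [0.9, 0.2, 0.4, 0.7]
print(confusion_matrix(labels, probabilities))
```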

Replacement.py

This file contains a Python dictionary used to transform my personal data frame. These transformations help the Imdb.py script by changing the movie names to match IMDb's catalogue.
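As an illustration of the idea, a replacement dictionary of this kind can be applied as sketched below. The entries and the names `REPLACEMENTS` and `apply_replacements` are hypothetical; the actual mapping is the one defined in Replacement.py.

```python
# Hypothetical entries: the real dictionary lives in Replacement.py.
REPLACEMENTS = {
    "Example Movie, The": "The Example Movie",
    "Example Movie 2": "Example Movie II",
}


def apply_replacements(titles, replacements=REPLACEMENTS):
    """Return titles with known variants swapped for their catalogue form."""
    # dict.get falls back to the original title when no replacement is listed.
    return [replacements.get(title, title) for title in titles]


print(apply_replacements(["Example Movie, The", "Unlisted Movie"]))
```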

To_Csv.py

This file is a simple script that transforms a pandas data frame into a collection of CSV files containing only the speeches and the OCEAN targets in a y/n format. This step is required to pass those CSV files to Y. Mehta's BERT embedding function. The script produces one CSV file for each role present in the main data frame.
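The per-role split can be sketched as follows. To stay dependency-free, this sketch uses the standard library `csv` module on a list of dictionaries instead of pandas, so it is not the code of To_Csv.py. The column names (`role`, `speech`, and the five trait keys) and the boolean encoding of the targets are assumptions.

```python
import csv
from pathlib import Path

TRAITS = ["EXT", "NEU", "AGR", "CON", "OPN"]  # hypothetical column names


def write_role_csvs(rows, out_dir):
    """Write one <role>.csv per role with the speech and y/n trait flags."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Group the rows by narrative role.
    by_role = {}
    for row in rows:
        by_role.setdefault(row["role"], []).append(row)

    paths = []
    for role, role_rows in by_role.items():
        path = out_dir / f"{role}.csv"
        with path.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(["speech"] + TRAITS)
            for row in role_rows:
                # Convert each truthy/falsy target into 'y' or 'n'.
                flags = ["y" if row[trait] else "n" for trait in TRAITS]
                writer.writerow([row["speech"]] + flags)
        paths.append(path)
    return paths
```

Calling `write_role_csvs(rows, "csv_out")` on rows whose roles are "hero" and "villain" would produce `csv_out/hero.csv` and `csv_out/villain.csv`; the output directory name here is only an example.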

Data

Due to file size restrictions on GitHub, no data files are included in this repository. They can be downloaded from the Zenodo sandbox with the following links:
