This repository has been archived by the owner on Jan 27, 2022. It is now read-only.
Spam email classifier using scikit-learn on the public SpamAssassin corpus

griffin/LangTech_Spam

Language Technology Spam Classifier

This project is a binary spam classifier trained on the SpamAssassin dataset. It is built with tools such as NLTK and scikit-learn.

Usage

This project's dependencies are managed through Pipenv, which must be installed before use.

Install dependencies using:

pipenv install

Next, download and preprocess the training data with the data sub-command. It takes two arguments: a directory to download the data to, and a location to write the data vectors in CSV format. The following command downloads the data into the directory data and writes the data vectors to the current directory, '.'. It must be run from the spam_classifier subdirectory.

pipenv run python cli.py data data .
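The README does not describe how the data vectors are built, so the following is only a minimal sketch of what "data vectors in CSV format" could look like. It assumes plain bag-of-words counts and uses the standard library as a stand-in for the project's NLTK-based preprocessing. The function names, the max_features parameter, and the CSV layout (label in the first column, one count per vocabulary word after it) are all hypothetical.

```python
import csv
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_vocabulary(documents, max_features=1000):
    """Return the most frequent tokens across all documents (hypothetical helper)."""
    counts = Counter()
    for text in documents:
        counts.update(tokenize(text))
    return [word for word, _ in counts.most_common(max_features)]


def vectorize(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


def write_vectors(path, documents, labels, vocabulary):
    """Write one row per document: the label first, then the word counts.

    This layout is an assumption, not the project's documented format.
    """
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["label"] + vocabulary)
        for text, label in zip(documents, labels):
            writer.writerow([label] + vectorize(text, vocabulary))
```

With a vocabulary of ["free", "money"], the text "Free money, FREE prizes" becomes the vector [2, 1]: tokens are lowercased before counting, and words outside the vocabulary are ignored.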

After the data has been downloaded, you can train the chosen classifier, where nb is Naive Bayes, svm is Support Vector Machine, knn is K-Nearest Neighbors, and rf is Random Forest.

pipenv run python cli.py train train.csv class.pkl [nb|svm|knn|rf]
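As a rough illustration of the train step, the four classifier names could map onto scikit-learn estimators as shown here. This is a sketch under assumptions, not the project's actual code: the CLASSIFIERS mapping, the helper names, the choice of MultinomialNB and SVC, the default hyperparameters, and the CSV layout (header row, label in the first column, numeric features after it) are all hypothetical. The trained model is saved with pickle, which matches the class.pkl file name used above.

```python
import csv
import pickle

from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Hypothetical mapping from the CLI names to scikit-learn estimator classes.
CLASSIFIERS = {
    "nb": MultinomialNB,
    "svm": SVC,
    "knn": KNeighborsClassifier,
    "rf": RandomForestClassifier,
}


def load_vectors(path):
    """Read a CSV with a header row, the label in column 0, and numeric features after it.

    This layout is an assumption about the vector files.
    """
    with open(path, newline="") as handle:
        rows = list(csv.reader(handle))[1:]  # skip the header row
    labels = [row[0] for row in rows]
    features = [[float(value) for value in row[1:]] for row in rows]
    return features, labels


def train(features, labels, name):
    """Fit the estimator selected by name, using its default hyperparameters."""
    if name not in CLASSIFIERS:
        raise ValueError(f"unknown classifier: {name!r}")
    model = CLASSIFIERS[name]()
    model.fit(features, labels)
    return model


def save_model(model, path):
    """Serialize the fitted model so a later test run can load it."""
    with open(path, "wb") as handle:
        pickle.dump(model, handle)
```

A typical call sequence would be load_vectors("train.csv"), then train(features, labels, "nb"), then save_model(model, "class.pkl"). Only load pickle files you created yourself, since unpickling untrusted data can execute arbitrary code.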

Lastly, test the classifier with the following command. It prints the statistics and writes an error.csv listing the locations of the misclassifications.

pipenv run python cli.py test test.csv class.pkl error.csv
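The README does not say which statistics are reported or what error.csv contains, so the following sketch makes assumptions on both counts. It computes accuracy, precision, and recall with "spam" as the positive class, and it records each misclassification as a row index plus the expected and predicted labels. The function names, the label strings, and the column names are hypothetical.

```python
import csv


def confusion_counts(y_true, y_pred, positive="spam"):
    """Count true positives, false positives, false negatives, and true negatives."""
    tp = fp = fn = tn = 0
    for expected, predicted in zip(y_true, y_pred):
        if predicted == positive and expected == positive:
            tp += 1
        elif predicted == positive:
            fp += 1
        elif expected == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def summarize(y_true, y_pred, positive="spam"):
    """Return accuracy, precision, and recall, using 0.0 when a ratio is undefined."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def misclassified(y_true, y_pred):
    """List (row index, expected label, predicted label) for every wrong prediction."""
    return [
        (index, expected, predicted)
        for index, (expected, predicted) in enumerate(zip(y_true, y_pred))
        if expected != predicted
    ]


def write_errors(path, rows):
    """Write the misclassified rows to a CSV file (assumed column names)."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["index", "expected", "predicted"])
        writer.writerows(rows)
```

Recording the row index makes it possible to trace each error back to the corresponding line of test.csv for manual inspection.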
