Language Technology Spam Classifier

This project is a binary spam classifier trained on the SpamAssassin dataset. It is built with NLTK and scikit-learn.

Usage

This project's dependencies are managed through Pipenv, which must be installed before use.

Install dependencies using:

pipenv install

Next, download and preprocess the training data with the data sub-command. It takes two arguments: a directory to download the data into, and a location where the data vectors are written in CSV format. The following command downloads the data into the directory data and writes the data vectors to the current directory, '.'. It must be run from the spam_classifier subdirectory.

pipenv run python cli.py data data .
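To illustrate what a data vector in CSV format can look like, here is a minimal bag-of-words sketch. It is not the project's preprocessing code: the regex tokenizer stands in for NLTK, and the helper names (tokenize, build_vocabulary, vectorize, write_vectors) and the trailing label column are assumptions made for this example.

```python
import csv
import re
from collections import Counter

# Assumption: a simple regex tokenizer stands in for the project's NLTK pipeline.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return TOKEN_RE.findall(text.lower())


def build_vocabulary(documents):
    """Map every distinct token to a column index, in sorted order."""
    tokens = {tok for doc in documents for tok in tokenize(doc)}
    return {tok: i for i, tok in enumerate(sorted(tokens))}


def vectorize(text, vocabulary):
    """Return token counts for one document, one entry per vocabulary column."""
    counts = Counter(tokenize(text))
    row = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:  # tokens outside the vocabulary are ignored
            row[vocabulary[tok]] = n
    return row


def write_vectors(path, documents, labels, vocabulary):
    """Write one CSV row per document: the count vector, then the label.

    Assumption: the label goes in the last column; the real layout may differ.
    """
    header = sorted(vocabulary, key=vocabulary.get) + ["label"]
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        for doc, label in zip(documents, labels):
            writer.writerow(vectorize(doc, vocabulary) + [label])
```

The vocabulary would be built from the training emails only, so that test emails are vectorized against the same columns.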

After the data has been downloaded, train a classifier of your choice. The last argument selects the classifier: nb is Naive Bayes, svm is Support Vector Machine, knn is K-Nearest Neighbors, and rf is Random Forest.

pipenv run python cli.py train train.csv class.pkl [nb|svm|knn|rf]
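The classifier abbreviations map naturally onto scikit-learn estimators. The sketch below shows one way such a lookup could work. The CLASSIFIERS table, the build_classifier helper, and the specific estimator classes chosen (for example MultinomialNB for nb) are assumptions for illustration and are not taken from cli.py. scikit-learn is imported only when build_classifier is called.

```python
import importlib

# Hypothetical mapping from the CLI abbreviation to a scikit-learn estimator.
# The concrete classes are an assumption; the project may use other variants.
CLASSIFIERS = {
    "nb": ("sklearn.naive_bayes", "MultinomialNB"),
    "svm": ("sklearn.svm", "SVC"),
    "knn": ("sklearn.neighbors", "KNeighborsClassifier"),
    "rf": ("sklearn.ensemble", "RandomForestClassifier"),
}


def build_classifier(name):
    """Return an unfitted estimator with default settings for an abbreviation."""
    try:
        module_name, class_name = CLASSIFIERS[name]
    except KeyError:
        raise ValueError(
            f"unknown classifier {name!r}; expected one of {sorted(CLASSIFIERS)}"
        ) from None
    # Import lazily so the lookup table can be inspected without scikit-learn.
    module = importlib.import_module(module_name)
    return getattr(module, class_name)()
```

An estimator built this way could then be fitted on the training vectors and saved with pickle, which would match the class.pkl file name used in the command.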

Lastly, test the classifier with the following command. It prints the evaluation statistics and writes error.csv, which lists the locations of the misclassifications.

pipenv run python cli.py test test.csv class.pkl error.csv
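As a rough illustration of the kind of output the test step produces, the sketch below computes common binary-classification statistics and writes the misclassified positions to a CSV file. Which statistics the test sub-command actually reports, and the column layout of error.csv, are not documented here, so the metric set, the function names, and the index/expected/predicted columns are all assumptions.

```python
import csv


def evaluate(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for the positive (spam) class."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def misclassified(y_true, y_pred):
    """Return (index, expected, predicted) for every wrong prediction."""
    return [(i, t, p) for i, (t, p) in enumerate(zip(y_true, y_pred)) if t != p]


def write_errors(path, y_true, y_pred):
    """Write the misclassified rows to a CSV file (assumed column layout)."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["index", "expected", "predicted"])
        writer.writerows(misclassified(y_true, y_pred))
```

Recording the row index of each error makes it possible to look up the original email and inspect why it was misclassified.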

griffin/LangTech_Spam
