This project implements a movie recommendation system that uses several NLP techniques to match user preferences with movie descriptions. It supports TF-IDF, lemmatized TF-IDF, SVD-reduced TF-IDF, and SBERT embeddings to provide content-based recommendations from user input.
movie-recommendation-system/
├── cleaned/ # Preprocessed data files (ignored in git)
│ └── filtered_df.pkl # Filtered dataset
├── data/ # Raw data
│ └── tmdb_5000_movies.csv
├── models/ # Model implementations
│ ├── sbert.py # Sentence-BERT model
│ ├── tfidf.py # Basic TF-IDF model
│ ├── tfidf_lemmatized.py # Lemmatized TF-IDF
│ └── tfidf_svd.py # TF-IDF with SVD
├── outputs/ # Recommendation outputs (ignored in git)
├── client.py # CLI interface
├── preprocessing.py # Data preprocessing scripts
├── README.md # this file
├── demo.md # link to video demo
└── requirements.txt # Project dependencies
We used a publicly available Kaggle dataset, the TMDb 5000 Movie Dataset. It lists around 5,000 popular movies with plot overviews and related metadata, collected around 7 years ago. The dataset was generated using the TMDb API.
git clone https://github.com/lous-e/movie-recommendation-system
cd movie-recommendation-system
Create a new virtual environment using any tool you prefer. This example uses venv; the first pair of commands is for Windows, the second for macOS/Linux.
python -m venv venv
venv/Scripts/activate
python3 -m venv venv
source ./venv/bin/activate
pip install -r requirements.txt
python load_models.py
python client.py --desc "I like action movies set in space" --topn 5 --model tfidf --out recommendations
| Argument | Description | Type | Default |
|---|---|---|---|
| --desc | User input describing preference | str | Required |
| --topn | Number of top recommendations to return | int | 5 (Max 10) |
| --model | Model type for recommendations | str | tfidf |
| --out | Output file name (saved in outputs/) | str | output |
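As a rough illustration of how the arguments in the table above could be declared, here is a minimal `argparse` sketch. It is not the actual contents of `client.py`: the names `build_parser`, `bounded_topn`, `MODELS`, and `MAX_TOPN` are hypothetical, and rejecting `--topn` values above 10 (rather than clamping them) is an assumption.

```python
import argparse

# Model names as listed in this README.
MODELS = ["tfidf", "tfidf-lemmatized", "tfidf-svd", "sbert"]
MAX_TOPN = 10  # "Max 10" from the table; how the real CLI enforces it is assumed


def bounded_topn(value):
    """Parse --topn as an int and reject values outside 1..MAX_TOPN."""
    n = int(value)
    if not 1 <= n <= MAX_TOPN:
        raise argparse.ArgumentTypeError(
            f"--topn must be between 1 and {MAX_TOPN}"
        )
    return n


def build_parser():
    """Build a parser mirroring the documented arguments and defaults."""
    parser = argparse.ArgumentParser(description="Movie recommendation CLI (sketch)")
    parser.add_argument("--desc", type=str, required=True,
                        help="User input describing preference")
    parser.add_argument("--topn", type=bounded_topn, default=5,
                        help="Number of top recommendations to return")
    parser.add_argument("--model", type=str, choices=MODELS, default="tfidf",
                        help="Model type for recommendations")
    parser.add_argument("--out", type=str, default="output",
                        help="Output file name (saved in outputs/)")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["--desc", "I like action movies set in space"])
print(args.topn, args.model, args.out)
```

Using `choices` for `--model` means an unsupported model name fails at parse time with a usage message, before any model is loaded.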
Currently, the following models are supported:
- tfidf: Ranks movies by the similarity between the TF-IDF vectors of the query and each movie description, and returns the top-n in descending order.
- tfidf-lemmatized: Lemmatizes the words before computing TF-IDF.
- tfidf-svd: Applies SVD to the TF-IDF matrix to reduce dimensionality.
- sbert: Uses SBERT embeddings for semantic similarity matching.
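To make the `tfidf` idea concrete, the following is a small pure-Python sketch of TF-IDF ranking with cosine similarity. It is not the code in `models/tfidf.py`: the function names, the smoothed IDF formula, the use of cosine similarity, and the toy overviews are all assumptions chosen for illustration, and the project itself may rely on a library vectorizer instead.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def fit_idf(docs_tokens):
    """Compute a smoothed IDF weight for every term in the corpus."""
    n_docs = len(docs_tokens)
    doc_freq = Counter()
    for tokens in docs_tokens:
        doc_freq.update(set(tokens))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1.0
            for term, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Build an L2-normalised sparse TF-IDF vector (terms unseen in the corpus are dropped)."""
    counts = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def cosine(a, b):
    """Dot product of two L2-normalised sparse vectors, i.e. their cosine similarity."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def recommend(query, movies, topn=5):
    """Return the topn (title, similarity) pairs, most similar first."""
    titles = list(movies)
    docs = [tokenize(movies[title]) for title in titles]
    idf = fit_idf(docs)
    query_vec = tfidf_vector(tokenize(query), idf)
    scored = [(title, cosine(query_vec, tfidf_vector(doc, idf)))
              for title, doc in zip(titles, docs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Toy overviews invented for this sketch; they are not rows from the dataset.
toy_movies = {
    "Toy Movie A": "A crew travels through space to save Earth.",
    "Toy Movie B": "A detective investigates a murder in the city.",
    "Toy Movie C": "Pirates battle in space for a lost treasure.",
}
for title, score in recommend("space adventure with a crew", toy_movies, topn=3):
    print(f"{title} (Similarity: {score:.4f})")
```

Because the match is purely lexical, query words that never appear in any overview (here "adventure") contribute nothing to the score, which is one reason the `sbert` results in the sample output differ from the TF-IDF ones.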
The recommendations are saved in outputs/{out}.txt with details including movie title, similarity score, and overview.
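A possible way to write such a file is sketched below. The helper name `save_recommendations`, its parameters, and the exact line layout are hypothetical; only the `outputs/{out}.txt` location and the three fields (title, similarity score, overview) come from this README.

```python
from pathlib import Path


def save_recommendations(recs, out_name, out_dir="outputs"):
    """Write (title, score, overview) tuples to <out_dir>/<out_name>.txt and return the path."""
    path = Path(out_dir) / f"{out_name}.txt"
    path.parent.mkdir(parents=True, exist_ok=True)  # create outputs/ if missing
    lines = []
    for rank, (title, score, overview) in enumerate(recs, start=1):
        lines.append(f"{rank}. {title} (Similarity: {score:.4f})")
        lines.append(f"   {overview}")
        lines.append("")  # blank line between entries
    path.write_text("\n".join(lines), encoding="utf-8")
    return path
```

For example, `save_recommendations([("Toy Movie A", 0.5, "A toy overview.")], "recommendations")` would create `outputs/recommendations.txt` relative to the current working directory.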
Top-5 movie recommendations for the sample query "I like space adventure films":
- tfidf
  - The Kentucky Fried Movie (Similarity: 0.4654)
  - Space Pirate Captain Harlock (Similarity: 0.2361)
  - A Haunted House (Similarity: 0.2192)
  - Metallica: Through the Never (Similarity: 0.1830)
  - Lifeforce (Similarity: 0.1696)
- tfidf-lemmatized
  - The Kentucky Fried Movie (Similarity: 0.4654)
  - Space Pirate Captain Harlock (Similarity: 0.2361)
  - A Haunted House (Similarity: 0.2192)
  - Metallica: Through the Never (Similarity: 0.1830)
  - Lifeforce (Similarity: 0.1696)
- tfidf-svd
  - Lost in Space (Similarity: 0.4337)
  - Space Pirate Captain Harlock (Similarity: 0.4144)
  - Moonraker (Similarity: 0.3778)
  - Deck the Halls (Similarity: 0.3716)
  - The Kentucky Fried Movie (Similarity: 0.3600)
- sbert
  - Interstellar (Similarity: 0.4547)
  - You Only Live Twice (Similarity: 0.4534)
  - Sea Rex 3D: Journey to a Prehistoric World (Similarity: 0.4388)
  - My Big Fat Independent Movie (Similarity: 0.4096)
  - Galaxy Quest (Similarity: 0.4080)
- Model Improvements
  - Add collaborative filtering based on user ratings
  - Incorporate more advanced transformer models
  - Real-time movie data updates
- Deployment
  - Dockerization
  - Endpoints using FastAPI
  - Streamlit frontend
- Evaluation
Commented in PR!