Kwanjk/tmdb-supervised-learning

🎬 What Makes a Movie Timeless?

A Data + Machine Learning Analysis Using TMDB Metadata (200 Movies)

This project explores a surprisingly difficult question:

What makes a movie timeless, that is, a film that stays culturally relevant long after its release?

Using a dataset of 200 high-engagement films scraped directly from the TMDB API, I collected metadata, engineered features, and trained a supervised logistic regression model to analyze which factors most strongly predict whether a movie endures or fades.


📦 Repo Contents

✔ Jupyter Notebook: scraping TMDB, cleaning data, engineering features, training the ML model
✔ All visualizations referenced in the Medium article
✔ Final Medium write-up (linked below)
✔ Reproducible ML workflow for cultural longevity analysis


🔗 Medium Article: Final Submission

"What Makes a Movie Timeless? Using Data and Supervised Learning to Find Out"
👉 https://medium.com/@kwanjosh25/what-makes-a-movie-timeless-a-data-driven-look-using-tmdb-and-supervised-learning-9eff432fac6e

This README is designed to accompany and reinforce that article.


📂 Project Structure

| Path / File | Description |
| --- | --- |
| tmdb_timeless_movies.csv | 200-row cleaned dataset |
| movie_timeless_analysis.ipynb | Full scraping + ML notebook |
| visuals/ | Folder containing saved plots used in the article |
| README.md | This file |

🔑 Getting a TMDB API Key

This project uses the TMDB API to collect movie metadata.
To run the scraping portion of the notebook, you'll need your own API key.

1. Create a TMDB Account

Sign up here:
👉 https://www.themoviedb.org/signup

2. Generate an API Key

After creating an account:

  • Go to your profile
  • Click Settings → API
  • Under API Key (v3 Auth), click Request an API Key
  • Choose Developer
  • Fill out the required information
  • Your key will appear under API Key

3. Store Your Key

Create a .env file in the project root:

TMDB_API_KEY=your_key_here

4. Load It in Python

The notebook will automatically detect it using python-dotenv:

from dotenv import load_dotenv
import os

load_dotenv()
API_KEY = os.getenv("TMDB_API_KEY")

If the key is found, the notebook's TMDB requests can authenticate with it.
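If the key is missing, `os.getenv` silently returns `None` and the problem only surfaces later as a failed request. The guard below is a minimal sketch of failing early instead. `require_api_key` is a hypothetical helper for illustration and is not part of the notebook.

```python
import os


def require_api_key(env=None, name="TMDB_API_KEY"):
    """Return the API key from the environment or raise a clear error.

    Hypothetical helper: the notebook itself only calls os.getenv.
    """
    env = os.environ if env is None else env
    key = (env.get(name) or "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set. Add it to a .env file in the project root "
            "and call load_dotenv() before making TMDB requests."
        )
    return key
```

Calling `API_KEY = require_api_key()` right after `load_dotenv()` would stop the notebook at the first cell rather than partway through scraping.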


🎯 Project Goal

Identify non-obvious patterns in popular films that help explain why some movies become timeless while others fade, and explore how machine learning can support decisions such as:

  • film studio investment
  • streaming platform catalog curation
  • franchise strategy
  • marketing prioritization
  • long-term content licensing

📊 Dataset Overview

The dataset was created programmatically using the TMDB API, collecting metadata for 200 films, including:

Audience Metrics

  • vote_average
  • vote_count
  • popularity

Economic Metrics

  • budget
  • revenue

Content Features

  • one-hot encoded genre_ids
  • cast_size
  • production_companies
  • runtime

Temporal

  • release_year
  • movie_age

Target Label

  • timeless: manually assigned (0 or 1)

Class Breakdown

  • Total films: 200
  • Timeless = 26
  • Not timeless = 174

This strong imbalance influences model performance and evaluation.
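To make the imbalance concrete, the arithmetic below uses only the counts stated above. It shows how much accuracy a trivial model would reach by always predicting "not timeless", which is why accuracy alone is a weak metric for this dataset.

```python
# Class counts as reported in this README.
n_timeless = 26
n_not_timeless = 174
n_total = n_timeless + n_not_timeless

# Share of the positive (timeless) class.
timeless_share = n_timeless / n_total

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline_accuracy = n_not_timeless / n_total

print(f"timeless share: {timeless_share:.2%}")
print(f"majority-class baseline accuracy: {majority_baseline_accuracy:.2%}")
```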

🧹 Data Collection & Cleaning

✔ Scraping (Python + TMDB API)

Libraries used:

import requests
import pandas as pd

Scraping process:

  1. Fetch popular movie IDs page-by-page
  2. Query /movie/{id} to pull metadata
  3. Query /movie/{id}/credits for cast size
  4. Extract + normalize nested JSON fields
  5. Convert genre_ids to one-hot vectors
  6. Engineer features (movie_age, cast_size)
  7. Append results into a structured DataFrame
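Steps 1 to 3 can be sketched roughly as follows. This is an illustration under assumptions, not the notebook's code: it assumes the TMDB v3 base URL `https://api.themoviedb.org/3`, the `/movie/popular` listing endpoint, and authentication through the `api_key` query parameter. The function names (`fetch_json`, `collect_popular_ids`, `fetch_movie`) are hypothetical.

```python
import time

BASE_URL = "https://api.themoviedb.org/3"  # TMDB v3 base URL (assumed)


def fetch_json(path, api_key, **params):
    """GET a TMDB endpoint and return the parsed JSON body."""
    import requests  # imported lazily so the helpers below can be tested offline

    response = requests.get(
        f"{BASE_URL}{path}",
        params={"api_key": api_key, **params},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()


def collect_popular_ids(api_key, target=200, fetch=fetch_json):
    """Step 1: walk /movie/popular page by page until `target` unique IDs are found."""
    ids, page = [], 1
    while len(ids) < target:
        results = fetch("/movie/popular", api_key, page=page).get("results", [])
        if not results:  # ran out of pages
            break
        for movie in results:
            if movie["id"] not in ids:
                ids.append(movie["id"])
        page += 1
    return ids[:target]


def fetch_movie(movie_id, api_key, fetch=fetch_json, pause=0.0):
    """Steps 2-3: pull details and credits for one movie."""
    details = fetch(f"/movie/{movie_id}", api_key)
    credits = fetch(f"/movie/{movie_id}/credits", api_key)
    time.sleep(pause)  # optional politeness delay between movies
    return details, credits
```

Passing the fetch function as a parameter keeps the pagination logic testable without network access or a real key.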

Example record creation:

# Fragment from the scraping loop:
#   details = parsed JSON from /movie/{id}
#   credits = parsed JSON from /movie/{id}/credits
record = {
    "id": movie_id,
    "title": details.get("title"),
    "release_date": details.get("release_date"),
    # release_date looks like "YYYY-MM-DD"; keep None when it is missing or empty
    "release_year": int(details["release_date"][:4])
                    if details.get("release_date") else None,
    "budget": details.get("budget"),
    "revenue": details.get("revenue"),
    "runtime": details.get("runtime"),
    "popularity": details.get("popularity"),
    "vote_average": details.get("vote_average"),
    "vote_count": details.get("vote_count"),
    # /movie/{id} returns genres as objects; keep only their numeric IDs
    "genre_ids": [g["id"] for g in details.get("genres", [])],
    # stored as counts, not as lists of names
    "production_companies": len(details.get("production_companies", [])),
    "cast_size": len(credits.get("cast", [])),
}
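Steps 5 and 6 (one-hot genres and `movie_age`) could then be applied to a list of such records. The sketch below is an assumption about how those features might be derived: it measures `movie_age` in whole years up to a reference year and names the one-hot columns `genre_<id>`. Neither detail is confirmed by this README, and `add_engineered_features` is a hypothetical name.

```python
from datetime import date


def add_engineered_features(records, reference_year=None):
    """Add movie_age and one-hot genre columns to a list of record dicts.

    Assumptions (not confirmed by the notebook): movie_age is measured in
    whole years from release_year to `reference_year`, and one-hot columns
    are named "genre_<id>".
    """
    reference_year = reference_year or date.today().year
    # Every genre ID seen anywhere in the data becomes one column.
    all_genres = sorted({g for r in records for g in r.get("genre_ids", [])})

    rows = []
    for r in records:
        row = {k: v for k, v in r.items() if k != "genre_ids"}
        year = r.get("release_year")
        row["movie_age"] = reference_year - year if year else None
        present = set(r.get("genre_ids", []))
        for g in all_genres:
            row[f"genre_{g}"] = int(g in present)
        rows.append(row)
    return rows
```

The returned list of flat dicts can be passed straight to `pd.DataFrame(...)` to get the structured DataFrame from step 7.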

✔ Cleaning Steps

  • Filled missing numeric fields with 0
  • Normalized budget/revenue values
  • Handled SettingWithCopyWarning correctly
  • Removed malformed genre metadata
  • Verified no duplicates and validated row count = 200
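Three of these steps (zero-filling numeric fields, avoiding SettingWithCopyWarning, and the duplicate and row-count checks) can be sketched in pandas as below. The function name `clean_movies` and the use of `id` as the duplicate key are assumptions. The normalization and genre-cleanup steps are left out because this README does not describe them in enough detail.

```python
import pandas as pd


def clean_movies(df, expected_rows=None):
    """Apply a subset of the cleaning steps listed above to a raw movie DataFrame."""
    # Work on an explicit copy so later assignments never touch a view,
    # which is the usual way to avoid SettingWithCopyWarning.
    df = df.copy()

    # Fill missing numeric fields with 0.
    numeric_cols = df.select_dtypes(include="number").columns
    df[numeric_cols] = df[numeric_cols].fillna(0)

    # Drop repeated movies, keeping the first occurrence of each id.
    df = df.drop_duplicates(subset="id").reset_index(drop=True)

    # Optional sanity check on the final row count (200 in this project).
    if expected_rows is not None:
        assert len(df) == expected_rows, f"expected {expected_rows} rows, got {len(df)}"
    return df
```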

🧠 Modeling Approach

✔ Supervised Learning Model

Logistic Regression

✔ Why Logistic Regression?

  • The target (timeless) is binary
  • Coefficients provide interpretable insights
  • Handles imbalance with class weights
  • Works well with mixed numerical + one-hot features
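To illustrate the class-weight point, the arithmetic below assumes scikit-learn's `class_weight="balanced"` setting, which weights each class by `n_samples / (n_classes * class_count)`. The README does not say which weighting the notebook used. The counts are for the full dataset, so the weights on an actual training split would differ slightly.

```python
# Counts reported in the Dataset Overview (full dataset, before any split).
counts = {0: 174, 1: 26}  # 0 = not timeless, 1 = timeless
n_samples = sum(counts.values())
n_classes = len(counts)

# Same formula scikit-learn documents for class_weight="balanced":
# n_samples / (n_classes * count_of_class)
weights = {label: n_samples / (n_classes * n) for label, n in counts.items()}

for label, w in weights.items():
    print(f"class {label}: weight {w:.3f}")
```

Under this scheme each timeless film counts roughly 6.7 times as much as a non-timeless one in the loss (174 / 26), so both classes contribute equally in total.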

✔ Feature Groups

Audience Metrics

vote_average, vote_count, popularity

Economic

budget, revenue

Content

Genres (one-hot), runtime, cast_size, production_companies

Temporal

movie_age

✔ Evaluation Methods

  • Confusion matrix
  • Classification report
  • Manual inspection of misclassified samples
  • Re-running the notebook to ensure reproducibility
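The sketch below shows how these evaluation pieces fit together in scikit-learn. It runs on synthetic stand-in data that only copies the 174 / 26 class balance, so its output says nothing about the real films. The stratified split, the `StandardScaler` step, and `class_weight="balanced"` are assumptions rather than the notebook's confirmed settings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with the README's class balance (174 vs 26).
# These numbers are generated for illustration, not taken from the dataset.
rng = np.random.default_rng(0)
y = np.array([0] * 174 + [1] * 26)
X = rng.normal(size=(200, 4)) + y[:, None] * 1.0  # shift positives so there is signal

# Stratify so both splits keep roughly the same share of timeless films.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scale features, then fit a class-weighted logistic regression.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
print(cm)
print(
    classification_report(
        y_test, y_pred, target_names=["not timeless", "timeless"], zero_division=0
    )
)

# Indices of misclassified test rows, for the manual inspection step.
misclassified = np.flatnonzero(y_pred != y_test)
```

Fixing `random_state` in both the generator and the split is what makes a re-run of the notebook produce the same confusion matrix.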

🔍 Key Findings

  1. Highly Rated Movies Age Better: vote_average and other audience-engagement metrics were top predictors.

  2. Certain Genres Contribute to Longevity: Drama, Mystery, Fantasy, and Sci-Fi appeared frequently in timeless films.

  3. Age Matters: Older films were more likely than new releases to be labeled timeless.

  4. Budget Does Not Predict Longevity: High-budget films often fade; cultural impact matters more.

  5. Timeless Films Follow Consistent Narrative Archetypes: Genre combinations and story structures appear to influence longevity.

🧪 Validation

To ensure correctness:

  • Verified random samples against TMDB manually
  • Checked outliers (invalid budgets, missing release years)
  • Hand-reviewed five misclassified films
  • Compared ChatGPT code guidance with notebook output
  • Ran the notebook from scratch to validate reproducibility

📉 Limitations

  • Only 200 films; larger datasets may yield more robust patterns
  • Timelessness is subjective (but defined consistently)
  • Class imbalance affects precision for timeless films
  • TMDB popularity may skew toward newer releases
  • Lacks rich textual features like scripts or reviews

🧰 Technologies Used

  • Python
  • Pandas, NumPy
  • Matplotlib, Seaborn
  • Scikit-learn
  • Requests / TMDB API
  • Jupyter Notebook

This project demonstrates:

  • Real API scraping
  • Feature engineering
  • Binary classification modeling
  • Visual storytelling
  • Reproducible data science workflow

🙌 Credits

Dataset collected & curated by Joshua Kwan
Analysis, visualizations, and writing by Joshua Kwan
TMDB API used under fair-use academic guidelines.

About

Supervised learning project using TMDB API to predict movie ratings/revenue for INST414.
