This project explores a surprisingly difficult question:
What makes a movie timeless (a film that stays culturally relevant long after its release)?
Using a dataset of 200 high-engagement films scraped directly from the TMDB API, I collected metadata, engineered features, and trained a supervised logistic regression model to analyze which factors most strongly predict whether a movie endures or fades.
This repository contains:
- Jupyter Notebook: scraping TMDB, cleaning data, engineering features, training the ML model
- All visualizations referenced in the Medium article
- Final Medium write-up (linked below)
- Reproducible ML workflow for cultural longevity analysis
"What Makes a Movie Timeless? Using Data and Supervised Learning to Find Out"
https://medium.com/@kwanjosh25/what-makes-a-movie-timeless-a-data-driven-look-using-tmdb-and-supervised-learning-9eff432fac6e
This README is designed to accompany and reinforce that article.
| Path / File | Description |
|---|---|
| tmdb_timeless_movies.csv | 200-row cleaned dataset |
| movie_timeless_analysis.ipynb | Full scraping + ML notebook |
| visuals/ | Folder containing saved plots used in the article |
| README.md | This file |
This project uses the TMDB API to collect movie metadata.
To run the scraping portion of the notebook, you'll need your own API key.
Sign up here:
https://www.themoviedb.org/signup
After creating an account:
- Go to your profile
- Click Settings → API
- Under API Key (v3 Auth), click Request an API Key
- Choose Developer
- Fill out the required information
- Your key will appear under API Key
Create a .env file in the project root:

```
TMDB_API_KEY=your_key_here
```

The notebook will automatically detect it using python-dotenv:

```python
from dotenv import load_dotenv
import os

load_dotenv()  # reads variables from .env into the process environment
API_KEY = os.getenv("TMDB_API_KEY")  # None if the variable is not set
```

If the key is found, all TMDB requests will run properly.
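Because `os.getenv` returns `None` when the variable is missing, it can help to fail early instead of sending unauthenticated requests. The following is a minimal sketch; `require_api_key` is a hypothetical helper introduced here for illustration and is not part of the notebook.

```python
import os


def require_api_key(env=os.environ, name="TMDB_API_KEY"):
    """Return the API key from `env`, or raise a clear error if it is missing.

    Hypothetical helper for illustration; the notebook reads the key directly.
    """
    key = (env.get(name) or "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return key
```

Calling `require_api_key()` right after `load_dotenv()` would surface a missing key before any request is made.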
Identify non-obvious patterns in popular films that help explain why some movies become timeless while others fade, and explore how machine learning can support decisions such as:
- film studio investment
- streaming platform catalog curation
- franchise strategy
- marketing prioritization
- long-term content licensing
The dataset was created programmatically using the TMDB API, collecting metadata for 200 films, including:
- vote_average
- vote_count
- popularity
- budget
- revenue
- one-hot encoded genre_ids
- cast_size
- production_companies
- runtime
- release_year
- movie_age
- timeless (target label): manually assigned (0 or 1)
Class breakdown:
- Total films: 200
- Timeless: 26
- Not timeless: 174

This strong imbalance influences model performance and evaluation.
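To make the imbalance concrete, the arithmetic below uses only the class counts reported above. It computes the accuracy of a trivial model that always predicts "not timeless", plus the per-class weights given by the formula scikit-learn documents for `class_weight="balanced"`. Whether the notebook used exactly this weighting is an assumption.

```python
# Class counts reported above.
n_timeless = 26
n_not_timeless = 174
n_total = n_timeless + n_not_timeless

# Accuracy of a trivial model that always predicts "not timeless".
majority_baseline = n_not_timeless / n_total

# "Balanced" class weights: n_samples / (n_classes * class_count).
n_classes = 2
weight_timeless = n_total / (n_classes * n_timeless)
weight_not_timeless = n_total / (n_classes * n_not_timeless)

print(f"majority baseline accuracy: {majority_baseline:.2f}")
print(f"weight for timeless:        {weight_timeless:.2f}")
print(f"weight for not timeless:    {weight_not_timeless:.2f}")
```

The baseline comes out at 0.87, so raw accuracy alone says little here; precision and recall on the timeless class are more informative.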
Libraries used:

```python
import requests
import pandas as pd
```

Scraping process:
- Fetch popular movie IDs page-by-page
- Query /movie/{id} to pull metadata
- Query /movie/{id}/credits for cast size
- Extract + normalize nested JSON fields
- Convert genre_ids to one-hot vectors
- Engineer features (movie_age, cast_size)
- Append results into a structured DataFrame
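The first three steps above can be sketched as follows. This is an illustration under stated assumptions, not the notebook's code: the helper names (`make_requests_fetcher`, `popular_movie_ids`, `movie_with_credits`) are hypothetical, and the use of the `/movie/popular` endpoint for the "popular movie IDs" step is assumed. The network call is isolated in one function so the paging logic can be exercised without an API key.

```python
def make_requests_fetcher(api_key, base_url="https://api.themoviedb.org/3"):
    """Build a fetch(path, **params) function backed by the requests library."""
    import requests  # imported lazily so the logic below can be tested offline

    def fetch(path, **params):
        response = requests.get(
            f"{base_url}{path}",
            params={"api_key": api_key, **params},
            timeout=10,
        )
        response.raise_for_status()  # surface HTTP errors early
        return response.json()

    return fetch


def popular_movie_ids(fetch, pages):
    """Collect movie IDs from the first `pages` pages of /movie/popular."""
    ids = []
    for page in range(1, pages + 1):
        payload = fetch("/movie/popular", page=page)
        ids.extend(movie["id"] for movie in payload.get("results", []))
    return ids


def movie_with_credits(fetch, movie_id):
    """Return the (details, credits) JSON pair for one movie."""
    details = fetch(f"/movie/{movie_id}")
    credits = fetch(f"/movie/{movie_id}/credits")
    return details, credits
```

With a real key, `fetch = make_requests_fetcher(API_KEY)` followed by `popular_movie_ids(fetch, pages=...)` would produce the ID list that the remaining steps iterate over.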
Example record creation:

```python
# `details` and `credits` are the JSON responses from
# /movie/{id} and /movie/{id}/credits for the current `movie_id`.
release_date = details.get("release_date") or ""

record = {
    "id": movie_id,
    "title": details.get("title"),
    "release_date": details.get("release_date"),
    # TMDB dates look like "YYYY-MM-DD"; a missing or empty date yields None.
    "release_year": int(release_date[:4]) if release_date else None,
    "budget": details.get("budget"),
    "revenue": details.get("revenue"),
    "runtime": details.get("runtime"),
    "popularity": details.get("popularity"),
    "vote_average": details.get("vote_average"),
    "vote_count": details.get("vote_count"),
    "genre_ids": [g["id"] for g in details.get("genres", [])],
    # Stored as counts, not as lists of names.
    "production_companies": len(details.get("production_companies", [])),
    "cast_size": len(credits.get("cast", [])),
}
```

Data cleaning:
- Filled missing numeric fields with 0
- Normalized budget/revenue values
- Handled SettingWithCopyWarning correctly
- Removed malformed genre metadata
- Verified no duplicates and validated row count = 200
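Some of the cleaning steps above can be illustrated on toy data. This is a sketch under assumptions: the rows are made up, the column subset is arbitrary, and dropping duplicates before checking for them is a choice made for this example (the notebook only reports verifying that none existed). The normalization step is omitted because the source does not say how it was done.

```python
import pandas as pd

# Toy rows standing in for scraped records (values are made up).
raw = pd.DataFrame(
    [
        {"id": 1, "title": "A", "budget": 1_000_000, "revenue": None, "runtime": 120},
        {"id": 2, "title": "B", "budget": None, "revenue": 5_000_000, "runtime": None},
        {"id": 2, "title": "B", "budget": None, "revenue": 5_000_000, "runtime": None},
    ]
)

numeric_cols = ["budget", "revenue", "runtime"]

# Work on an explicit copy so later assignments do not trigger
# SettingWithCopyWarning on a slice of `raw`.
clean = raw.drop_duplicates(subset="id").copy()

# Fill missing numeric fields with 0.
clean[numeric_cols] = clean[numeric_cols].fillna(0)

# Sanity checks: unique IDs and the expected number of rows.
assert not clean["id"].duplicated().any()
assert len(clean) == 2
```

Taking the `.copy()` before assigning is one common way to avoid `SettingWithCopyWarning`; in the real dataset the row-count check would compare against 200.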
Logistic Regression
- The target (timeless) is binary
- Coefficients provide interpretable insights
- Handles imbalance with class weights
- Works well with mixed numerical + one-hot features
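A minimal sketch of this setup with scikit-learn is shown below. The data is synthetic (200 rows with 26 positives, mirroring the class counts only), the three feature names are reused purely as labels, and the scaling step and hyperparameters are assumptions, not the notebook's exact configuration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 200 rows, 3 numeric features, 26 positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.zeros(200, dtype=int)
y[:26] = 1
X[:26, 0] += 2.0  # make feature 0 informative for the positive class

feature_names = ["vote_average", "vote_count", "popularity"]

# Scale features so coefficient magnitudes are comparable, then fit a
# logistic regression that reweights classes inversely to their frequency.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(X, y)

# One coefficient per feature; the sign shows the direction of the effect.
coefs = model.named_steps["logisticregression"].coef_[0]
for name, coef in zip(feature_names, coefs):
    print(f"{name:>12}: {coef:+.2f}")
```

Standardizing first is a design choice that makes the coefficients roughly comparable across features on very different scales, such as budget and vote_average.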
Feature groups:
- Audience metrics: vote_average, vote_count, popularity
- Economic: budget, revenue
- Content: genres (one-hot), runtime, cast_size, production_companies
- Temporal: movie_age
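The sketch below shows one way a scraped record could be flattened into these feature groups. `build_features` and `genre_vocab` are hypothetical names, and the definition of `movie_age` as years elapsed since release is an assumption; the notebook may compute it differently.

```python
from datetime import date


def build_features(record, genre_vocab, current_year=None):
    """Flatten one scraped record into a dict of numeric model features."""
    current_year = current_year or date.today().year
    release_year = record.get("release_year")
    features = {
        # Audience metrics
        "vote_average": record.get("vote_average") or 0,
        "vote_count": record.get("vote_count") or 0,
        "popularity": record.get("popularity") or 0,
        # Economic
        "budget": record.get("budget") or 0,
        "revenue": record.get("revenue") or 0,
        # Content
        "runtime": record.get("runtime") or 0,
        "cast_size": record.get("cast_size") or 0,
        "production_companies": record.get("production_companies") or 0,
        # Temporal (assumed definition: years elapsed since release)
        "movie_age": current_year - release_year if release_year else 0,
    }
    # One column per known genre ID: 1 if the movie has it, else 0.
    movie_genres = set(record.get("genre_ids", []))
    for genre_id in genre_vocab:
        features[f"genre_{genre_id}"] = int(genre_id in movie_genres)
    return features
```

Passing the same `genre_vocab` for every record keeps the one-hot columns aligned across the whole dataset.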
Evaluation:
- Confusion matrix
- Classification report
- Manual inspection of misclassified samples
- Re-running the notebook to ensure reproducibility
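To show what the confusion matrix and classification report measure, here is a hand-rolled version on made-up labels. scikit-learn provides `confusion_matrix` and `classification_report` in `sklearn.metrics` for real use; the functions and the ten example labels below are illustrative only.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 means timeless."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def precision_recall(y_true, y_pred):
    """Precision and recall for the positive (timeless) class."""
    tn, fp, fn, tp = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up example: 10 films, 2 of them truly timeless.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0, 0, 0]

print(confusion_counts(y_true, y_pred))
print(precision_recall(y_true, y_pred))
```

In this toy case 7 of 10 predictions are correct, yet only one of the three films predicted timeless actually is, which is the kind of gap that accuracy hides on imbalanced data.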
Key findings:
- Highly rated movies age better: vote_average and metrics of audience engagement were top predictors.
- Certain genres contribute to longevity: Drama, Mystery, Fantasy, and Sci-Fi appeared frequently in timeless films.
- Age matters: older films outperform new releases in timelessness likelihood.
- Budget does not predict longevity: high-budget films often fade; cultural impact matters more.
- Timeless films follow consistent narrative archetypes: genre combinations and story structures appear to influence longevity.
To ensure correctness:
- Verified random samples against TMDB manually
- Checked outliers (invalid budgets, missing release years)
- Hand-reviewed five misclassified films
- Compared ChatGPT code guidance with notebook output
- Ran the notebook from scratch to validate reproducibility
Limitations:
- Only 200 films; larger datasets may yield more robust patterns
- Timelessness is subjective (but defined consistently)
- Class imbalance affects precision for timeless films
- TMDB popularity may skew toward newer releases
- Lacks rich textual features like scripts or reviews
Tech stack:
- Python
- Pandas, NumPy
- Matplotlib, Seaborn
- Scikit-learn
- Requests / TMDB API
- Jupyter Notebook
This project demonstrates:
- Real API scraping
- Feature engineering
- Binary classification modeling
- Visual storytelling
- Reproducible data science workflow
- Dataset collected & curated by Joshua Kwan
- Analysis, visualizations, and writing by Joshua Kwan
- TMDB API used under fair-use academic guidelines.