Ironhack Data Science & Machine Learning Bootcamp — Final Project
Isis Hassan | March 2026
Can a machine learning model tell Aries from Pisces just by reading their horoscope? This project investigates that question — and the answer turns out to say more about humans than the stars.
Daily horoscopes are a media staple deliberately written to be generic, so anyone can see themselves in them. This project uses that paradox as its foundation: across three objectives, it attempts to classify, analyse, and generate horoscope text using a full NLP and ML pipeline.
Objective 1 — Classification: Can a model predict the zodiac sign from horoscope text alone? (Baseline: 8.3% — random chance across 12 signs)
Objective 2 — Trends & Themes: Do patterns emerge across signs, elements, or time? Using sentiment analysis and semantic theme detection.
Objective 3 — Generation: Use a local LLM (LLaMA 3.2 via Ollama) to generate new daily horoscopes inspired by the tone and style of real ones.
| Task | Approach | Result |
|---|---|---|
| Classification | Soft Voting Ensemble (Logistic Regression + Linear SVM) | 14% test accuracy — beats random chance (8.3%) |
| Sentiment Analysis | twitter-roberta-base-sentiment-latest transformer | Horoscopes are more positive during your birthday month; positivity trends decline across the year |
| Theme Detection | Sentence Transformers (all-MiniLM-L6-v2) | Theme distributions are near-identical across signs, confirming deliberate genericism |
| Horoscope Generation | LLaMA 3.2 (local, via Ollama) with few-shot prompting | Full year of synthetic horoscopes generated for all 12 signs |
Takeaway: Daily horoscopes are engineered to be universal. Any signal we extract tells us more about how humans write than about astrology.
| Category | Tools |
|---|---|
| Data processing | pandas, numpy |
| NLP | scikit-learn (TF-IDF), NLTK, wordcloud |
| ML models | Logistic Regression, Linear SVM, Naive Bayes, Random Forest, Soft Voting Ensemble |
| Hyperparameter tuning | GridSearchCV |
| Transformers | transformers (HuggingFace), sentence-transformers |
| Dimensionality reduction | t-SNE |
| LLM / generation | Ollama (LLaMA 3.2), local inference |
| Visualisation | matplotlib, seaborn |
| Frontend | HTML/CSS (interactive horoscope picker) |
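To show how GridSearchCV plugs into this stack, here is a minimal sketch of tuning a TF-IDF + Logistic Regression pipeline. The parameter grid and toy corpus are illustrative, not the project's actual search space (which lives in `classification.ipynb`):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid — step-name prefixes ("tfidf__", "clf__") route each
# parameter to the right pipeline stage
grid = GridSearchCV(
    pipe,
    param_grid={
        "tfidf__ngram_range": [(1, 1), (1, 2)],
        "clf__C": [0.1, 1.0, 10.0],
    },
    cv=2,
)

# Invented toy corpus: two signs, three examples each
texts = [
    "Bold moves pay off today.", "Lead with confidence.", "Act on impulse now.",
    "Dream deeply tonight.", "Trust your intuition.", "Imagination guides you.",
]
labels = ["Aries"] * 3 + ["Pisces"] * 3
grid.fit(texts, labels)
```

After fitting, `grid.best_params_` holds the winning combination and `grid.best_estimator_` is a refit pipeline ready for prediction.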
```
FinalProject/
│
├── data/                       # Not tracked in git (see Data Sources below)
│   ├── data sources/
│   │   └── kaggle_source_1/    # Raw CSVs from Kaggle
│   ├── hindustan_times.csv
│   └── horoscope_com.csv
│
├── data_collection.ipynb       # Data ingestion & standardisation
├── eda.ipynb                   # Exploratory data analysis (TF-IDF, t-SNE, word clouds)
├── classification.ipynb        # ML pipeline, model comparison, ensemble
├── sentiment_analysis.ipynb    # Sentiment and theme exploration
├── text_generation.ipynb       # Horoscope generation
├── index.html                  # Interactive horoscope viewer (load your CSV)
│
├── requirements.txt
└── README.md
```
data_collection.ipynb — Ingests and standardises three Kaggle datasets (16,701 horoscopes total). Handles mistyped sign names, date format alignment, column restructuring, and sign anonymisation (replacing sign mentions with [Sign Name]).
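The anonymisation step can be sketched with a simple regex; the function name and pattern here are illustrative rather than the notebook's exact code:

```python
import re

ZODIAC_SIGNS = [
    "Aries", "Taurus", "Gemini", "Cancer", "Leo", "Virgo",
    "Libra", "Scorpio", "Sagittarius", "Capricorn", "Aquarius", "Pisces",
]

# Case-insensitive pattern matching any sign name as a whole word
SIGN_PATTERN = re.compile(r"\b(" + "|".join(ZODIAC_SIGNS) + r")\b", re.IGNORECASE)

def anonymise(text: str) -> str:
    """Replace every zodiac sign mention with the [Sign Name] placeholder."""
    return SIGN_PATTERN.sub("[Sign Name]", text)
```

This masking matters for the classification objective: without it, the model could trivially "predict" the sign by spotting its name in the text.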
eda.ipynb — Exploratory analysis using TF-IDF with bigrams, word clouds, t-SNE clustering, distinctive word heatmaps, transformer-based sentiment analysis, and semantic theme detection across signs and time.
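A minimal sketch of the TF-IDF → t-SNE projection, using the bigram setup described above on an invented six-document corpus (the real notebook runs this over thousands of horoscopes):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

docs = [
    "A surprise at work brings opportunity today.",
    "Romance is in the air; follow your heart.",
    "Financial caution pays off this afternoon.",
    "A friend offers advice worth hearing.",
    "Travel plans may shift; stay flexible.",
    "Your creativity shines in the evening.",
]

# Unigram + bigram TF-IDF features
vectors = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(docs)

# Project to 2-D for plotting; perplexity must be smaller than the sample count
embedding = TSNE(n_components=2, perplexity=2, random_state=42).fit_transform(
    vectors.toarray()
)
print(embedding.shape)  # (6, 2)
```

The resulting 2-D points can then be scattered with matplotlib, coloured by sign, to check visually whether signs form separable clusters.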
classification.ipynb — Full ML pipeline: TF-IDF vectorisation → model comparison (Naive Bayes, Logistic Regression, Linear SVM, Random Forest) → GridSearchCV tuning → Soft Voting Ensemble. Includes confusion matrices and iterative debugging of overfitting.
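One subtlety in this pipeline: soft voting averages `predict_proba` outputs, which `LinearSVC` does not provide. A common workaround, sketched below on an invented toy corpus, is to wrap the SVM in `CalibratedClassifierCV`; the notebook's actual hyperparameters may differ:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Soft voting needs class probabilities from every member, so the
# linear SVM is calibrated to expose predict_proba
ensemble = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("svm", CalibratedClassifierCV(LinearSVC(), cv=2)),
    ],
    voting="soft",
)
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), ensemble)

# Tiny illustrative corpus — the real notebook trains on 16,701 horoscopes
texts = [
    "Bold energy drives you forward today.",
    "Take the lead and charge ahead.",
    "Act fearlessly on that impulse.",
    "Dreamy intuition guides your choices.",
    "Trust your gentle, empathetic instincts.",
    "Let imagination carry you along.",
]
labels = ["Aries"] * 3 + ["Pisces"] * 3
model.fit(texts, labels)
probs = model.predict_proba(["A bold impulse leads you today."])
```

Soft voting tends to beat hard voting here because averaging probabilities lets a confident member outweigh an uncertain one, rather than counting one vote each.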
sentiment_analysis.ipynb — Trends and theme analysis across signs and time. Uses the twitter-roberta-base-sentiment-latest transformer to score sentiment per sign and month, revealing that horoscopes are more positive during a sign's birthday season. Theme detection is explored via two approaches: zero-shot classification (deberta-v3-xsmall) and the faster Sentence Transformers (all-MiniLM-L6-v2) with cosine similarity — demonstrating how theme definitions can be manipulated to produce very different outcomes.
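The cosine-similarity theme scoring can be sketched as follows. TF-IDF vectors stand in here for the all-MiniLM-L6-v2 embeddings, and the theme descriptions are invented, but the mechanics are the same: embed the horoscope and each theme description, then pick the theme with the highest cosine similarity.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative theme definitions — as noted above, changing these
# descriptions can swing the resulting theme distribution
themes = {
    "love": "romance relationships affection partner heart",
    "career": "work job promotion ambition success office",
    "money": "finance wealth savings spending investment",
}

def dominant_theme(horoscope: str) -> str:
    """Score a horoscope against each theme and return the best match."""
    corpus = [horoscope] + list(themes.values())
    vectors = TfidfVectorizer().fit_transform(corpus)
    sims = cosine_similarity(vectors[0], vectors[1:])[0]
    return list(themes)[sims.argmax()]
```

With sentence embeddings, the vectoriser is replaced by `SentenceTransformer.encode`, which captures semantic overlap even when the horoscope shares no literal words with the theme description.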
text_generation.ipynb — Few-shot horoscope generation using LLaMA 3.2 running locally via Ollama. Builds a prompt per sign and date using 2 randomly sampled real horoscopes as style examples, then generates a full year of daily horoscopes (4,380 entries across 12 signs). Requests are parallelised with ThreadPoolExecutor and checkpointed to CSV every 50 rows to handle interruptions on slow hardware.
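The checkpointing pattern can be sketched with the standard library alone. `generate_one` is a stub standing in for the real Ollama call, and the names are illustrative:

```python
import csv
from concurrent.futures import ThreadPoolExecutor, as_completed

CHECKPOINT_EVERY = 50

def generate_one(sign: str, date: str) -> str:
    """Stub for the real request to LLaMA 3.2 via the local Ollama server."""
    return f"Synthetic horoscope for {sign} on {date}."

def _flush(rows, out_path):
    """Rewrite the checkpoint CSV with everything generated so far."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["sign", "date", "text"])
        writer.writerows(rows)

def generate_all(jobs, out_path):
    """Generate horoscopes in parallel, checkpointing every CHECKPOINT_EVERY rows."""
    rows = []
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {pool.submit(generate_one, s, d): (s, d) for s, d in jobs}
        for done, future in enumerate(as_completed(futures), start=1):
            sign, date = futures[future]
            rows.append([sign, date, future.result()])
            if done % CHECKPOINT_EVERY == 0:
                _flush(rows, out_path)
    _flush(rows, out_path)
```

On an interruption, only the rows since the last checkpoint are lost; a restart can read the CSV back and skip (sign, date) pairs already generated.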
Data is excluded from this repository (see .gitignore). All source datasets are publicly available on Kaggle:
- `adxie12/horoscopes` — Globe horoscopes (scraped + 2025)
- `prasad22/daily-horoscope-dataset` — Hindustan Times
- `shahp7575/horoscopes` — horoscope.com
Prerequisites: Python 3.9+, virtual environment, Ollama installed locally (for generation only)
```shell
# Clone the repo
git clone https://github.com/your-username/FinalProject.git
cd FinalProject

# Create and activate virtual environment
python -m venv .env
source .env/bin/activate   # Mac/Linux
.env\Scripts\activate      # Windows

# Install dependencies
pip install -r requirements.txt

# For horoscope generation only — pull the model via Ollama
ollama pull llama3.2
```

Then download the Kaggle datasets and place them in `data/data sources/kaggle_source_1/`.
Run notebooks in order: data_collection → eda → classification → sentiment_analysis → text_generation.
- Transformer-based classification — swap TF-IDF for BERT embeddings to capture richer semantic signal
- Temporal theme analysis — explore weekend vs. weekday horoscope patterns
- Sign-aware generation — prompt the LLM with sign personality traits to produce more differentiated output
The project slide deck (Decoding_The_Zodiac.pdf) is included in this repository.
Project completed as part of the Ironhack Data Science & Machine Learning Bootcamp, March 2026.