Product teams at banking apps face a critical challenge: prioritizing which features to build and bugs to fix from thousands of user reviews. With limited engineering resources, teams need to identify high-impact opportunities that will maximize return on investment (ROI) in terms of user satisfaction, retention, and app ratings.
Current Pain Points:
- Volume overload: Banking apps receive 25,000+ reviews annually across multiple platforms
- Subjective prioritization: Feature requests are often prioritized based on gut feeling or loudest voices, not data
- Unclear ROI: No systematic way to measure which fixes will have the biggest impact on user satisfaction
- Delayed insights: Manual review analysis is slow, causing teams to miss time-sensitive issues
Business Impact:
- A 0.5 star rating increase can improve conversion rates by 20-30% (industry benchmark)
- Fixing high-frequency negative issues can reduce churn by 10-15%
- Addressing the "right" complaints can increase App Store visibility and organic downloads
Goal: Build a data-driven system to analyze app reviews and predict which product roadmap decisions will deliver the highest ROI based on:
- Issue frequency - How many users are affected?
- Sentiment intensity - How unhappy/happy are users?
- Impact correlation - Which issues correlate with low ratings and churn?
This project extracts 21+ features from banking app reviews (crashes, login issues, transfer problems, etc.) to identify and prioritize high-impact product improvements.
Phase 1 Complete: Feature extraction, EDA, and baseline modeling finished
Current Phase: Building ROI prediction model to prioritize product improvements
1. Feature Extraction (21 features)
- Pattern matching: 10 issue categories (crash, login, mobile deposit, performance, frustration, etc.)
- Sentiment analysis: 4 VADER scores (compound, positive, negative, neutral)
- Text statistics: 7 metrics (length, word count, caps ratio, punctuation)
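A minimal sketch of the pattern-matching and text-statistics extraction described above. The category keywords and feature names here are illustrative, not the project's exact lists, and the four VADER scores would be appended to the same dict:

```python
import re

# Illustrative keyword patterns for a few of the 10 issue categories
ISSUE_PATTERNS = {
    "has_crash": re.compile(r"\b(crash\w*|freez\w*|force.?clos\w*)\b", re.I),
    "has_login": re.compile(r"\b(log.?in|sign.?in|password|face.?id)\b", re.I),
    "has_transfer": re.compile(r"\b(transfer|zelle|wire)\b", re.I),
}

def extract_features(review: str) -> dict:
    """Binary issue flags plus simple text statistics for one review."""
    words = review.split()
    features = {name: int(bool(p.search(review))) for name, p in ISSUE_PATTERNS.items()}
    features.update({
        "length": len(review),
        "word_count": len(words),
        # Share of alphabetic characters written in caps ("shouting" signal)
        "caps_ratio": sum(c.isupper() for c in review) / max(sum(c.isalpha() for c in review), 1),
        "exclamation_count": review.count("!"),
    })
    return features

feats = extract_features("App CRASHES every time I log in!!")
```

The same function runs over every review to build the feature matrix the models below consume.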
2. Exploratory Data Analysis
- Analyzed 25,000 reviews from 5 banking apps (Chase, Citi, BofA, Capital One, Wells Fargo)
- Key findings:
- Top issues: Crashes (24%), Login (22%), Frustration (24%)
- Largest negative impact: Frustration (impact score: 997)
- Sentiment vs star rating correlation: 0.85+
- Identified data quality issues (VADER sarcasm detection, pattern overlap)
3. Baseline Classification Models
- Goal: Predict star ratings (1-5) from extracted features
- Models tested: Logistic Regression vs Random Forest
- Best model: Logistic Regression with class_weight='balanced'
- Accuracy: 50%
- Macro F1: 0.40
- Beats the naive baseline (35%) by 15 percentage points, a 43% relative improvement
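A sketch of the baseline setup on synthetic data (the real pipeline fits on the 21 extracted features; the class proportions below only mimic the imbalance described later):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
# Synthetic stand-in for the 21-feature matrix and 1-5 star labels
X = rng.normal(size=(1000, 21))
y = rng.choice([1, 2, 3, 4, 5], size=1000, p=[0.35, 0.1, 0.1, 0.1, 0.35])

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# class_weight='balanced' reweights each class by the inverse of its frequency,
# so the model cannot simply ignore the rare 2-4 star classes
model = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_train, y_train)
macro_f1 = f1_score(y_test, model.predict(X_test), average="macro")
```

Macro F1 averages the per-class F1 scores with equal weight, which is why it is the right headline metric here: it punishes a model that only predicts 1 and 5 stars.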
What Worked:
- Sentiment features most predictive (34-35% total importance)
- Feature engineering captured meaningful signals
- Logistic Regression outperformed Random Forest on macro F1 despite lower accuracy
- Model learned ordinal relationships (predicts adjacent ratings when wrong)
What Didn't Work:
- Only 50% accuracy - not production-ready for exact rating prediction
- Cannot reliably predict middle ratings (2-4 stars) due to class imbalance
- Random Forest achieved higher accuracy (61%) but only by predicting 1 and 5 stars
- VADER sentiment analysis fails on sarcasm, mixed reviews, context
Error Analysis (363 big errors analyzed):
- 60% of errors: Model predicts too high (actual 1-star → predicted 4-5 stars)
- Root cause: VADER scores misleadingly positive
- Example: "After 2 years of fighting this bank" → sentiment +0.97
- 30% of errors: Model predicts too low (actual 5-star → predicted 1-2 stars)
- Root cause: Mentions problems in passing even when overall positive
- 10% of errors: Middle ratings confused
1. Class Imbalance (Primary Bottleneck)
- Classes 2-4 only represent 9-11% of data each
- Model defaults to predicting majority classes (1 and 5)
- F1-scores for middle ratings: 0.18-0.24 (very poor)
2. Sentiment Analysis Quality
- VADER cannot detect sarcasm: "Not the best app!" → +0.89 sentiment
- Mixed reviews averaged: "works great BUT crashes" → positive score
- Context blindness: "I changed banks" in review about NEW bank → neutral
3. Feature Quality
- has_satisfaction captures both "satisfied" and "NOT satisfied"
- Pattern features don't distinguish "had crash" vs "has crash"
- No n-grams or contextual features
4. Evaluation Limitations
- Single train/test split (no cross-validation)
- No hyperparameter tuning
- Default model parameters used
Immediate Improvements (Low Effort, High Impact):
1. Fix pattern matching
- Split has_satisfaction into positive vs negative mentions
- Use fixed-width negative lookbehinds (Python's re module rejects variable-width alternatives like (?<!not |un|dis)): (?<!not )(?<!un)(?<!dis)satisf
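A quick sketch of the corrected pattern. Splitting the negated prefixes into separate fixed-width lookbehind assertions keeps Python's re module happy; the prefix list is illustrative and a real fix would cover more negators ("never", "n't", etc.):

```python
import re

# Each lookbehind is fixed-width, so re accepts the pattern
POSITIVE_SATISFACTION = re.compile(r"(?<!not )(?<!un)(?<!dis)satisf", re.I)

assert POSITIVE_SATISFACTION.search("Very satisfied with the update")
assert not POSITIVE_SATISFACTION.search("I am not satisfied")
assert not POSITIVE_SATISFACTION.search("Unsatisfied and dissatisfied users")
```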
2. Address class imbalance
- SMOTE (Synthetic Minority Over-sampling)
- More aggressive class weights: {1: 1, 2: 5, 3: 5, 4: 5, 5: 1}
3. Cross-validation
- 5-fold stratified CV to validate results
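A sketch of that validation step on synthetic stand-in data. Stratified folds keep the rare 2-4 star classes represented in every split, so the fold scores are comparable:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 21))          # stand-in for the 21 extracted features
y = rng.choice([1, 2, 3, 4, 5], size=500, p=[0.35, 0.1, 0.1, 0.1, 0.35])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
# Score on macro F1, the metric the single-split baseline was judged on
scores = cross_val_score(
    LogisticRegression(class_weight="balanced", max_iter=1000),
    X, y, cv=cv, scoring="f1_macro",
)
```

The spread of the five scores (not just their mean) shows whether the single-split 0.40 macro F1 was representative or lucky.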
Future Improvements (Higher Effort, Higher Impact):
4. Better sentiment analysis
- Replace VADER with BERT or RoBERTa fine-tuned on app reviews
- Aspect-based sentiment (crash sentiment vs overall sentiment)
- Could fix the ~60% of big errors caused by misleadingly positive VADER scores
5. Ordinal regression
- Treat ratings as ordered (1 < 2 < 3 < 4 < 5) instead of independent classes
- Penalize adjacent errors less
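One common way to sketch this is the Frank-and-Hall decomposition: train one binary classifier per threshold P(rating > k) and recover class probabilities from adjacent differences. This is an illustrative implementation, not the project's chosen method:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class OrdinalRating:
    """Ordinal model via cumulative binary classifiers P(rating > k)."""

    def __init__(self, labels=(1, 2, 3, 4, 5)):
        self.labels = labels
        self.models = {}

    def fit(self, X, y):
        for k in self.labels[:-1]:  # thresholds 1..4
            self.models[k] = LogisticRegression(max_iter=1000).fit(X, (y > k).astype(int))
        return self

    def predict(self, X):
        # P(y > k) for each threshold; adjacent differences give P(y == k).
        # Sketch only: independent classifiers are not forced to be monotone
        # across thresholds, so a production version would calibrate them.
        gt = np.column_stack([m.predict_proba(X)[:, 1] for m in self.models.values()])
        probs = np.column_stack([1 - gt[:, 0], gt[:, :-1] - gt[:, 1:], gt[:, -1]])
        return np.array(self.labels)[probs.argmax(axis=1)]

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 21))
y = rng.choice([1, 2, 3, 4, 5], size=600, p=[0.35, 0.1, 0.1, 0.1, 0.35])
preds = OrdinalRating().fit(X, y).predict(X)
```

Because each binary task is "above or below a cutoff", errors naturally concentrate on adjacent ratings, which matches how the baseline already behaves when it is wrong.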
6. Advanced features
- TF-IDF (top 100-500 words)
- N-grams for phrases ("face id", "customer service")
- Sentence-level sentiment
- App version, device type metadata
7. Better models
- XGBoost/LightGBM for better class imbalance handling
- Ensemble methods
ROI Prediction Model (Current Focus):
- Pivot from exact rating prediction to ROI prediction
- Use feature importance + frequency to identify high-impact improvements
- Answer: "If we fix X, how much will ratings improve?"
Sections below to be completed...