Product Roadmap ROI Predictor

Business Problem

Product teams at banking apps face a critical challenge: prioritizing which features to build and bugs to fix from thousands of user reviews. With limited engineering resources, teams need to identify high-impact opportunities that will maximize return on investment (ROI) in terms of user satisfaction, retention, and app ratings.

Current Pain Points:

  • Volume overload: Banking apps receive 25,000+ reviews annually across multiple platforms
  • Subjective prioritization: Feature requests are often prioritized based on gut feeling or loudest voices, not data
  • Unclear ROI: No systematic way to measure which fixes will have the biggest impact on user satisfaction
  • Delayed insights: Manual review analysis is slow, causing teams to miss time-sensitive issues

Business Impact:

  • A 0.5 star rating increase can improve conversion rates by 20-30% (industry benchmark)
  • Fixing high-frequency negative issues can reduce churn by 10-15%
  • Addressing the "right" complaints can increase App Store visibility and organic downloads

Goal: Build a data-driven system to analyze app reviews and predict which product roadmap decisions will deliver the highest ROI based on:

  1. Issue frequency - How many users are affected?
  2. Sentiment intensity - How unhappy/happy are users?
  3. Impact correlation - Which issues correlate with low ratings and churn?

This project extracts 21+ features from banking app reviews (crashes, login issues, transfer problems, etc.) to identify and prioritize high-impact product improvements.


Project Status

Phase 1 Complete: Feature extraction, EDA, and baseline modeling finished

Current Phase: Building ROI prediction model to prioritize product improvements


Current Progress

Completed Work

1. Feature Extraction (21 features)

  • Pattern matching: 10 issue categories (crash, login, mobile deposit, performance, frustration, etc.)
  • Sentiment analysis: 4 VADER scores (compound, positive, negative, neutral)
  • Text statistics: 7 metrics (length, word count, caps ratio, punctuation)
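
The pattern-matching and text-statistics features above could be computed roughly as follows. This is a minimal sketch using only the standard library: the regexes, the feature names other than the issue categories listed above, and the `extract_features` helper are illustrative assumptions, not the project's actual code, and the VADER scores are left out to keep the example dependency-free.

```python
import re

# Hypothetical patterns for three of the ten issue categories; the project's
# real regexes are not shown in this README.
ISSUE_PATTERNS = {
    "has_crash": re.compile(r"\bcrash(?:es|ed|ing)?\b", re.IGNORECASE),
    "has_login": re.compile(r"\b(?:log[\s-]?in|sign[\s-]?in|password)\b", re.IGNORECASE),
    "has_mobile_deposit": re.compile(r"\b(?:mobile\s+)?deposit(?:s|ed|ing)?\b", re.IGNORECASE),
}


def extract_features(review: str) -> dict:
    """Return binary issue flags plus simple text statistics for one review."""
    features = {
        name: int(bool(pattern.search(review)))
        for name, pattern in ISSUE_PATTERNS.items()
    }
    letters = [c for c in review if c.isalpha()]
    features["char_length"] = len(review)
    features["word_count"] = len(review.split())
    # Share of alphabetic characters that are uppercase (0.0 for empty text).
    features["caps_ratio"] = (
        sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
    )
    features["exclamation_count"] = review.count("!")
    return features


example = extract_features("App CRASHES every time I log in!!")
```

Each review becomes one row of numeric features, which is the shape the baseline classifiers below expect.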

2. Exploratory Data Analysis

  • Analyzed 25,000 reviews from 5 banking apps (Chase, Citi, BofA, Capital One, Wells Fargo)
  • Key findings:
    • Top issues: Crashes (24%), Frustration (24%), Login (22%)
    • Highest negative impact: Frustration (impact score: 997)
    • Sentiment vs star rating correlation: 0.85+
    • Identified data quality issues (VADER sarcasm detection, pattern overlap)

3. Baseline Classification Models

  • Goal: Predict star ratings (1-5) from extracted features
  • Models tested: Logistic Regression vs Random Forest
  • Best model: Logistic Regression with class_weight='balanced'
    • Accuracy: 50%
    • Macro F1: 0.40
    • Beats the naive baseline (35% accuracy) by 15 percentage points, a 43% relative improvement

Key Findings

What Worked:

  • Sentiment features were the most predictive (34-35% total importance)
  • Feature engineering captured meaningful signals
  • Logistic Regression was the better model despite lower accuracy (50% vs 61%), because it did not collapse to predicting only 1 and 5 stars as Random Forest did
  • Model learned ordinal relationships (predicts adjacent ratings when wrong)

What Didn't Work:

  • Only 50% accuracy - not production-ready for exact rating prediction
  • Cannot reliably predict middle ratings (2-4 stars) due to class imbalance
  • Random Forest achieved higher accuracy (61%) but only by predicting 1 and 5 stars
  • VADER sentiment analysis fails on sarcasm, mixed reviews, context
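
To see why higher accuracy does not imply a better model here, the sketch below compares accuracy and macro F1 on toy labels. The label lists are invented for illustration and are not the project's predictions; `macro_f1` and `accuracy` are small hand-written helpers rather than library calls.

```python
def accuracy(y_true, y_pred):
    """Fraction of exact matches."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1; a class with no true positives scores 0."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


LABELS = [1, 2, 3, 4, 5]

# Toy distribution: 1 and 5 stars dominate, middle ratings are rare.
y_true = [1] * 7 + [2] * 2 + [3] * 2 + [4] * 2 + [5] * 7

# A model that only ever predicts 1 or 5 stars.
extremes_only = [1] * 7 + [1, 1] + [1, 5] + [5, 5] + [5] * 7

# A model that attempts every class and misses some extreme ratings.
spread_out = [1, 1, 1, 1, 1, 2, 2] + [2, 3] + [3, 4] + [4, 3] + [5, 5, 5, 5, 5, 4, 4]

for name, preds in [("extremes_only", extremes_only), ("spread_out", spread_out)]:
    print(name, round(accuracy(y_true, preds), 2), round(macro_f1(y_true, preds, LABELS), 2))
```

In this toy case the extremes-only model wins on accuracy but loses clearly on macro F1, because three of its five per-class F1 scores are zero.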

Error Analysis (363 big errors analyzed):

  • 60% of errors: Model predicts too high (actual 1-star → predicted 4-5 stars)
    • Root cause: VADER scores misleadingly positive
    • Example: "After 2 years of fighting this bank" → sentiment +0.97
  • 30% of errors: Model predicts too low (actual 5-star → predicted 1-2 stars)
    • Root cause: Mentions problems in passing even when overall positive
  • 10% of errors: Middle ratings confused
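
A breakdown like the one above can be produced by bucketing each prediction by the direction and size of its miss. The sketch below assumes a "big error" means the prediction is off by at least 3 stars; this README does not state the threshold actually used, so `BIG_ERROR_GAP` and the bucket names are hypothetical.

```python
from collections import Counter

# Assumption: a "big error" is a miss of 3 or more stars.
BIG_ERROR_GAP = 3


def bucket_errors(y_true, y_pred, gap=BIG_ERROR_GAP):
    """Count predictions that are far too high, far too low, or off by less than `gap`."""
    counts = Counter()
    for actual, predicted in zip(y_true, y_pred):
        diff = predicted - actual
        if diff >= gap:
            counts["predicted_too_high"] += 1
        elif diff <= -gap:
            counts["predicted_too_low"] += 1
        elif diff != 0:
            counts["smaller_miss"] += 1
    return counts


# Toy labels, for illustration only.
buckets = bucket_errors([1, 1, 5, 3, 4], [5, 4, 1, 3, 5])
```

Reading the raw reviews inside each bucket is what surfaces root causes such as misleadingly positive VADER scores.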

Where Models Need Improvement

1. Class Imbalance (Primary Bottleneck)

  • Ratings 2-4 each represent only 9-11% of the data
  • Model defaults to predicting majority classes (1 and 5)
  • F1-scores for middle ratings: 0.18-0.24 (very poor)
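
For context, `class_weight='balanced'` in scikit-learn weights each class by `n_samples / (n_classes * class_count)`. The sketch below reproduces that formula in plain Python on an illustrative distribution (35% / 10% / 10% / 10% / 35%) chosen to resemble the imbalance described above; the counts are assumptions, not the project's exact data.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in sorted(counts.items())}


# Illustrative rating distribution, not the real dataset.
ratings = [1] * 350 + [2] * 100 + [3] * 100 + [4] * 100 + [5] * 350
weights = balanced_class_weights(ratings)

for rating, weight in weights.items():
    print(f"{rating} stars: weight {weight:.2f}")
```

Under this distribution each middle rating is weighted 3.5 times as heavily as a 1-star or 5-star review, which helps but evidently does not fully offset the imbalance.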

2. Sentiment Analysis Quality

  • VADER cannot detect sarcasm: "Not the best app!" → +0.89 sentiment
  • Mixed reviews averaged: "works great BUT crashes" → positive score
  • Context blindness: "I changed banks" in review about NEW bank → neutral

3. Feature Quality

  • has_satisfaction captures both "satisfied" and "NOT satisfied"
  • Pattern features don't distinguish "had crash" vs "has crash" (a past, resolved issue vs an ongoing one)
  • No n-grams or contextual features

4. Evaluation Limitations

  • Single train/test split (no cross-validation)
  • No hyperparameter tuning: models used default parameters apart from class_weight='balanced'
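
Stratified k-fold splitting addresses the single-split limitation while keeping the rare middle ratings present in every fold. The sketch below shows the idea in plain Python; `stratified_folds` is a hypothetical helper and the rating counts are illustrative. In practice scikit-learn's `StratifiedKFold` would be the more usual choice.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, k=5, seed=42):
    """Assign each sample index to one of k folds, preserving class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal this class's samples round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


# Illustrative rating distribution, not the real dataset.
ratings = [1] * 350 + [2] * 100 + [3] * 100 + [4] * 100 + [5] * 350
folds = stratified_folds(ratings, k=5)

for i, fold in enumerate(folds):
    print(i, dict(sorted(Counter(ratings[idx] for idx in fold).items())))
```

Each fold is held out once as the test set while the other four are used for training, and the reported metric is the mean across the five runs.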

Possible Next Attempts

Immediate Improvements (Low Effort, High Impact):

  1. Fix pattern matching

    • Split has_satisfaction into positive vs negative mentions
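    • Use negative lookbehind regex: (?<!not )(?<!un)(?<!dis)satisf (Python's re module requires fixed-width lookbehinds, so each prefix needs its own assertion)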
  2. Address class imbalance

    • SMOTE (Synthetic Minority Over-sampling)
    • More aggressive class weights: {1: 1, 2: 5, 3: 5, 4: 5, 5: 1}
  3. Cross-validation

    • 5-fold stratified CV to validate results
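
The pattern-matching fix in item 1 can be sketched as follows. Python's `re` module rejects a lookbehind whose alternatives differ in length, so the version here chains one fixed-width lookbehind per prefix. The feature names `has_satisfaction_pos` and `has_satisfaction_neg` are hypothetical.

```python
import re

# One fixed-width negative lookbehind per negating prefix.
POSITIVE_SATISFACTION = re.compile(r"(?<!not )(?<!un)(?<!dis)satisf", re.IGNORECASE)
# The negated forms are matched directly.
NEGATIVE_SATISFACTION = re.compile(r"(?:not |un|dis)satisf", re.IGNORECASE)


def satisfaction_flags(review: str) -> dict:
    """Split the single has_satisfaction flag into positive and negative mentions."""
    return {
        "has_satisfaction_pos": int(bool(POSITIVE_SATISFACTION.search(review))),
        "has_satisfaction_neg": int(bool(NEGATIVE_SATISFACTION.search(review))),
    }


for text in ["Very satisfied with the app", "I am not satisfied", "Totally dissatisfied"]:
    print(text, satisfaction_flags(text))
```

This only catches negation immediately before the word. A phrase such as "not very satisfied" would still be flagged as positive, which is one reason the n-gram and contextual features listed later remain worthwhile.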

Future Improvements (Higher Effort, Higher Impact):

  4. Better sentiment analysis

    • Replace VADER with BERT or RoBERTa fine-tuned on app reviews
    • Aspect-based sentiment (crash sentiment vs overall sentiment)
    • Could address the 60% of big errors caused by misleadingly positive VADER scores
  5. Ordinal regression

    • Treat ratings as ordered (1 < 2 < 3 < 4 < 5) instead of independent classes
    • Penalize adjacent errors less
  6. Advanced features

    • TF-IDF (top 100-500 words)
    • N-grams for phrases ("face id", "customer service")
    • Sentence-level sentiment
    • App version, device type metadata
  7. Better models

    • XGBoost/LightGBM for better class imbalance handling
    • Ensemble methods
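
On the ordinal point, a first step that needs no new model is to evaluate with metrics that already penalize adjacent errors less than distant ones. The sketch below is an evaluation aid, not an ordinal regression implementation; the helper names and toy predictions are illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of exact matches; treats every miss as equally bad."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average distance in stars between actual and predicted rating."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def within_one_accuracy(y_true, y_pred):
    """Fraction of predictions that are exact or off by a single star."""
    return sum(abs(t - p) <= 1 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels: both models get 3 of 5 exactly right.
y_true = [1, 2, 3, 4, 5]
near_misses = [1, 3, 3, 5, 5]   # wrong predictions are one star off
far_misses = [1, 5, 3, 1, 5]    # wrong predictions are three stars off

for name, preds in [("near_misses", near_misses), ("far_misses", far_misses)]:
    print(
        name,
        accuracy(y_true, preds),
        mean_absolute_error(y_true, preds),
        within_one_accuracy(y_true, preds),
    )
```

Accuracy cannot tell these two models apart, while mean absolute error and within-one accuracy both favor the one whose mistakes land on adjacent ratings.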

ROI Prediction Model (Current Focus):

  • Pivot from exact rating prediction to ROI prediction
  • Use feature importance + frequency to identify high-impact improvements
  • Answer the question: "If we fix X, how much will ratings improve?"
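
One simple way to combine frequency with impact is to multiply the share of reviews mentioning an issue by the rating gap between reviews with and without it. The sketch below is a hypothetical scoring scheme on toy data: it is not the formula behind the impact score reported in the EDA section, and it uses a rating gap in place of model feature importance.

```python
from statistics import mean


def impact_ranking(reviews, issue_flags):
    """Rank issues by frequency x (mean rating without issue - mean rating with issue)."""
    total = len(reviews)
    rows = []
    for flag in issue_flags:
        with_issue = [r["rating"] for r in reviews if r[flag]]
        without_issue = [r["rating"] for r in reviews if not r[flag]]
        if not with_issue or not without_issue:
            continue  # the gap is undefined if one group is empty
        frequency = len(with_issue) / total
        rating_gap = mean(without_issue) - mean(with_issue)
        rows.append((flag, frequency, rating_gap, frequency * rating_gap))
    return sorted(rows, key=lambda row: row[3], reverse=True)


# Toy reviews, for illustration only.
reviews = [
    {"rating": 1, "has_crash": 1, "has_login": 0},
    {"rating": 2, "has_crash": 1, "has_login": 0},
    {"rating": 1, "has_crash": 1, "has_login": 1},
    {"rating": 5, "has_crash": 0, "has_login": 0},
    {"rating": 4, "has_crash": 0, "has_login": 1},
    {"rating": 5, "has_crash": 0, "has_login": 0},
]
ranking = impact_ranking(reviews, ["has_crash", "has_login"])

for flag, frequency, gap, score in ranking:
    print(f"{flag}: frequency={frequency:.2f} gap={gap:.2f} score={score:.2f}")
```

A score like this is correlational: it shows which issues co-occur with low ratings, not how much ratings would rise if the issue were fixed, so it is better treated as a prioritization signal than as a forecast.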

Table of Contents


Sections below to be completed...