
Amazon ML Challenge 2025 - Smart Product Pricing Solution

A machine learning solution for the Amazon ML Challenge 2025, a national-level competition by Amazon. It predicts e-commerce product prices from text and image features using ensemble methods, optimized for the SMAPE metric.

🎯 Challenge Overview

Objective: Predict product prices based on catalog content and product images

Dataset:

  • Training: 75K products with prices
  • Test: 75K products for prediction

Evaluation: SMAPE (Symmetric Mean Absolute Percentage Error) - lower is better
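
For reference, a minimal implementation of the metric (the project's own SMAPE helper lives in src/utils.py and may differ in detail):

import numpy as np

def smape(y_true, y_pred):
    """Symmetric Mean Absolute Percentage Error, in percent (lower is better)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0   # prices are positive, so this never hits zero
    return float(np.mean(np.abs(y_pred - y_true) / denom) * 100)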

🚀 Quick Start

1. Setup

All dependencies are installed automatically. The project uses:

  • Python 3.11
  • scikit-learn, XGBoost (always available)
  • LightGBM, CatBoost (optional - may require system libraries)
  • pandas, numpy, nltk, pillow

Note: If LightGBM or CatBoost fail to load due to missing system libraries (libgomp.so.1), the system will automatically fall back to XGBoost only. The ensemble will adapt to use the available boosted models.
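
A minimal sketch of how such a fallback can be implemented (the project's actual handling may differ):

try:
    from lightgbm import LGBMRegressor           # optional dependency
    HAS_LIGHTGBM = True
except (ImportError, OSError):                   # OSError covers a missing libgomp.so.1
    HAS_LIGHTGBM = False

try:
    from catboost import CatBoostRegressor       # optional dependency
    HAS_CATBOOST = True
except (ImportError, OSError):
    HAS_CATBOOST = False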

2. Run Demo

python demo.py

3. Train Model (with your data)

# Text features only (faster)
python main.py --mode train --train-path dataset/train.csv

# With image features (better accuracy, slower)
python main.py --mode train --train-path dataset/train.csv --use-images

4. Generate Predictions

python main.py --mode predict --test-path dataset/test.csv

5. Complete Pipeline

# Train and predict in one command
python main.py --mode both

πŸ“ Project Structure

├── src/
│   ├── utils.py              # Helper functions (I/O, SMAPE, image download)
│   ├── feature_extraction.py # Text (TF-IDF, IPQ) & image features
│   ├── models.py             # ML models and ensemble
│   ├── train.py              # Training with cross-validation
│   └── predict.py            # Prediction generation
├── dataset/                  # Place train.csv and test.csv here
├── images/                   # Downloaded product images
├── models/                   # Saved trained models
├── output/                   # test_out.csv predictions
├── main.py                   # Main entry point
├── demo.py                   # Demo with sample data
└── Documentation.md          # Detailed methodology

🔬 Methodology

Feature Engineering

  1. Text Features

    • TF-IDF vectorization (3000-5000 features)
    • Item Pack Quantity (IPQ) extraction (value and unit; see the sketch after this list)
    • Text statistics (length, word count, keywords)
  2. Image Features (optional)

    • Color statistics (RGB mean/std)
    • Image dimensions
    • Basic visual patterns
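
A condensed sketch of the text features referenced in the list above; the regex pattern, the catalog-text column name (catalog_content), and the TF-IDF settings are illustrative, and the project's extractor in src/feature_extraction.py may differ:

import re
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

IPQ_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(count|pack|ct|oz|ounce|lb|g|kg|ml|l)\b", re.IGNORECASE)

def extract_ipq(text):
    """Return (value, unit) for the first pack-quantity match, or (1.0, 'unit') if none."""
    match = IPQ_PATTERN.search(text or "")
    if match:
        return float(match.group(1)), match.group(2).lower()
    return 1.0, "unit"

train_df = pd.read_csv("dataset/train.csv")
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2), stop_words="english")
X_text = vectorizer.fit_transform(train_df["catalog_content"])   # column name is an assumption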

Models

  • XGBoost: Gradient boosting (50% weight)
  • LightGBM: Fast gradient boosting (30% weight)
  • CatBoost: Categorical boosting (20% weight)
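
A sketch of the weighted blend, assuming each available model exposes a scikit-learn-style predict(); the weights mirror the split above and are renormalized when an optional model is unavailable:

import numpy as np

def ensemble_predict(models_and_weights, X):
    """models_and_weights: list of (fitted_model, weight) pairs for the available boosters."""
    weights = np.array([w for _, w in models_and_weights], dtype=float)
    weights /= weights.sum()                    # renormalize if LightGBM or CatBoost is missing
    predictions = np.column_stack([m.predict(X) for m, _ in models_and_weights])
    return predictions @ weights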

Training Strategy

  • 5-fold cross-validation
  • 80/20 train-validation split
  • SMAPE optimization
  • Hyperparameter tuning
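
A sketch of the cross-validation loop, reusing the smape() helper shown earlier; X and y are the feature matrix and price array, and the hyperparameters are illustrative rather than the tuned values in src/models.py:

import numpy as np
from sklearn.model_selection import KFold
from xgboost import XGBRegressor

def cross_validate(X, y, n_splits=5, seed=42):
    scores = []
    for train_idx, val_idx in KFold(n_splits=n_splits, shuffle=True, random_state=seed).split(X):
        model = XGBRegressor(n_estimators=500, learning_rate=0.05, max_depth=8)
        model.fit(X[train_idx], y[train_idx])
        scores.append(smape(y[val_idx], model.predict(X[val_idx])))
    return float(np.mean(scores))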

📊 Usage Examples

Basic Training

python main.py --mode train

Advanced Training

python main.py \
  --mode train \
  --train-path dataset/train.csv \
  --model-type ensemble \
  --use-images

Prediction Only

python main.py \
  --mode predict \
  --test-path dataset/test.csv \
  --output-path output/test_out.csv

Model Type Options

  • ensemble (default): Combines available boosted models
  • xgboost: XGBoost only
  • lightgbm: LightGBM only
  • catboost: CatBoost only

📈 Expected Performance

  • Cross-validation SMAPE: Varies by data
  • Validation SMAPE: Reported after training
  • Prediction Range: $0.01 - $XXX.XX
  • Output Format: CSV with columns: sample_id, price

🔧 How It Works

  1. Data Loading: Reads CSV files with product information
  2. Feature Extraction:
    • Extracts TF-IDF features from catalog text
    • Parses Item Pack Quantity (value and unit)
    • Optionally downloads and processes images
  3. Model Training:
    • Trains multiple models with cross-validation
    • Creates weighted ensemble
    • Saves models to disk
  4. Prediction (see the sketch after this list):
    • Loads trained models
    • Extracts features from test data
    • Generates price predictions
    • Saves to test_out.csv
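
A condensed sketch of this prediction flow; the saved-artifact file names and the catalog-text column are assumptions, and the real logic lives in src/predict.py:

import joblib
import pandas as pd

test_df = pd.read_csv("dataset/test.csv")
vectorizer = joblib.load("models/vectorizer.pkl")           # assumed artifact name
model = joblib.load("models/ensemble.pkl")                  # assumed artifact name

X_test = vectorizer.transform(test_df["catalog_content"])   # column name is an assumption
prices = model.predict(X_test).clip(min=0.01)               # enforce the $0.01 floor

pd.DataFrame({"sample_id": test_df["sample_id"], "price": prices.round(2)}) \
    .to_csv("output/test_out.csv", index=False)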

πŸ“ Output Format

The prediction output (test_out.csv) contains:

sample_id,price
217392,45.67
209156,23.45
...

🛠️ Customization

Adjust Model Hyperparameters

Edit src/models.py to modify:

  • Number of estimators
  • Learning rate
  • Max depth
  • Ensemble weights
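
A hypothetical example of the kinds of settings involved; the actual variable names in src/models.py may differ:

XGB_PARAMS = {
    "n_estimators": 800,      # number of boosting rounds
    "learning_rate": 0.03,    # lower is slower but often more accurate
    "max_depth": 8,           # tree depth
    "subsample": 0.8,         # row sampling per tree
}

ENSEMBLE_WEIGHTS = {"xgboost": 0.5, "lightgbm": 0.3, "catboost": 0.2}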

Change Feature Count

Edit src/feature_extraction.py:

TextFeatureExtractor(max_features=5000)  # Increase/decrease

Add New Features

Extend the TextFeatureExtractor or ImageFeatureExtractor classes, as sketched below.
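
A hypothetical sketch of a subclass that adds one extra text feature; the real TextFeatureExtractor API may differ:

from src.feature_extraction import TextFeatureExtractor

class KeywordTextFeatureExtractor(TextFeatureExtractor):
    PRICE_KEYWORDS = ("premium", "luxury", "professional")

    def keyword_count(self, text):
        """Extra feature: how often price-signalling keywords appear in the catalog text."""
        text = (text or "").lower()
        return sum(text.count(keyword) for keyword in self.PRICE_KEYWORDS)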

📋 Requirements Checklist

  • ✅ MIT/Apache 2.0 licensed models (< 8B parameters)
  • ✅ No external LLM APIs used
  • ✅ No external price lookup
  • ✅ SMAPE evaluation metric
  • ✅ Proper output format
  • ✅ Well-commented source code
  • ✅ Documentation included

πŸ› Troubleshooting

Issue: Image download fails

Solution: Image downloads can fail due to network or URL issues; train without the --use-images flag to fall back to the text-only model, which still performs well.

Issue: Out of memory

Solution: Reduce max_features in TextFeatureExtractor or use a single model instead of ensemble.

Issue: Missing train.csv or test.csv

Solution: Place your data files in the dataset/ directory with exact names train.csv and test.csv.

📚 Documentation

See Documentation.md for:

  • Detailed methodology
  • Experiments conducted
  • Model architecture
  • Results and conclusions
  • Future improvements

πŸ† Submission

  1. Train model: python main.py --mode train --train-path dataset/train.csv
  2. Generate predictions: python main.py --mode predict --test-path dataset/test.csv
  3. Submit output/test_out.csv to challenge portal
  4. Include this code and Documentation.md with submission

📄 License

This project uses only MIT/Apache 2.0 licensed libraries and complies with all Amazon ML Challenge 2025 rules.


Good luck with the challenge! 🚀
