A machine learning solution for predicting e-commerce product prices from text and image features using ensemble methods, optimized for the SMAPE metric.
Objective: Predict product prices based on catalog content and product images
Dataset:
- Training: 75K products with prices
- Test: 75K products for prediction
Evaluation: SMAPE (Symmetric Mean Absolute Percentage Error) - lower is better
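For reference, SMAPE averages 2·|predicted − actual| / (|actual| + |predicted|) over all samples and expresses it as a percentage (0 is perfect). A minimal sketch of the metric; src/utils.py provides the project's own helper, which may differ in detail:

```python
import numpy as np

def smape(y_true, y_pred):
    """Symmetric Mean Absolute Percentage Error, in percent (lower is better)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    # Treat samples where both values are zero as a perfect match
    ratio = np.where(denom == 0, 0.0, np.abs(y_pred - y_true) / denom)
    return 100.0 * ratio.mean()
```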
All dependencies are automatically installed. The project includes:
- Python 3.11
- scikit-learn, XGBoost (always available)
- LightGBM, CatBoost (optional - may require system libraries)
- pandas, numpy, nltk, pillow
Note: If LightGBM or CatBoost fail to load due to missing system libraries (libgomp.so.1), the system will automatically fall back to XGBoost only. The ensemble will adapt to use the available boosted models.
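A minimal sketch of how such a fallback can be implemented (illustrative only; the actual handling lives in src/models.py):

```python
# Optional boosters: degrade gracefully when system libraries (e.g. libgomp.so.1) are missing.
from xgboost import XGBRegressor  # always available in this environment

try:
    from lightgbm import LGBMRegressor
    HAS_LIGHTGBM = True
except (ImportError, OSError):
    HAS_LIGHTGBM = False

try:
    from catboost import CatBoostRegressor
    HAS_CATBOOST = True
except (ImportError, OSError):
    HAS_CATBOOST = False

# The ensemble is then built only from the models whose imports succeeded.
```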
Run the demo:

```
python demo.py
```

Train:

```
# Text features only (faster)
python main.py --mode train --train-path dataset/train.csv

# With image features (better accuracy, slower)
python main.py --mode train --train-path dataset/train.csv --use-images
```

Predict:

```
python main.py --mode predict --test-path dataset/test.csv
```

Train and predict in one command:

```
python main.py --mode both
```

Project structure:

```
├── src/
│   ├── utils.py                # Helper functions (I/O, SMAPE, image download)
│   ├── feature_extraction.py   # Text (TF-IDF, IPQ) & image features
│   ├── models.py               # ML models and ensemble
│   ├── train.py                # Training with cross-validation
│   └── predict.py              # Prediction generation
├── dataset/                    # Place train.csv and test.csv here
├── images/                     # Downloaded product images
├── models/                     # Saved trained models
├── output/                     # test_out.csv predictions
├── main.py                     # Main entry point
├── demo.py                     # Demo with sample data
└── Documentation.md            # Detailed methodology
```
Text features:
- TF-IDF vectorization (3000-5000 features)
- Item Pack Quantity (IPQ) extraction (value, unit)
- Text statistics (length, word count, keywords)

Image features (optional):
- Color statistics (RGB mean/std)
- Image dimensions
- Basic visual patterns
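An illustrative sketch of the text-feature side (the real TextFeatureExtractor in src/feature_extraction.py may differ; the IPQ regex and the exact statistics here are assumptions):

```python
import re
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical pattern for Item Pack Quantity, e.g. "500 ml", "2 kg", "12 count"
IPQ_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(ml|l|g|kg|oz|lb|count|pack)", re.IGNORECASE)

def extract_text_features(texts, max_features=5000):
    """Return a TF-IDF matrix plus simple IPQ and text-statistic columns."""
    tfidf = TfidfVectorizer(max_features=max_features, stop_words="english")
    tfidf_matrix = tfidf.fit_transform(texts)

    extras = []
    for text in texts:
        match = IPQ_PATTERN.search(text)
        ipq_value = float(match.group(1)) if match else 0.0
        extras.append([ipq_value, len(text), len(text.split())])
    return tfidf_matrix, np.array(extras), tfidf
```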
- XGBoost: Gradient boosting (50% weight)
- LightGBM: Fast gradient boosting (30% weight)
- CatBoost: Categorical boosting (20% weight)
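Conceptually, the ensemble is a weighted average of the individual predictions, with the weights renormalized over whichever boosters loaded successfully. A minimal sketch, not the project's exact class:

```python
import numpy as np

def ensemble_predict(models_and_weights, X):
    """Weighted average of per-model predictions.

    models_and_weights: list of (fitted_model, weight) pairs, e.g.
        [(xgb_model, 0.5), (lgbm_model, 0.3), (cat_model, 0.2)].
    """
    models, weights = zip(*models_and_weights)
    weights = np.asarray(weights, dtype=float)
    weights /= weights.sum()  # renormalize so missing models don't shrink predictions
    per_model = np.column_stack([m.predict(X) for m in models])
    return per_model @ weights
```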
- 5-fold cross-validation
- 80/20 train-validation split
- SMAPE optimization
- Hyperparameter tuning
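A sketch of 5-fold cross-validation scored with SMAPE; the XGBoost hyperparameters below are illustrative, not the project's tuned values:

```python
import numpy as np
from sklearn.model_selection import KFold
from xgboost import XGBRegressor

def smape(y_true, y_pred):
    """Same SMAPE definition as above."""
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    return 100.0 * np.mean(np.where(denom == 0, 0.0, np.abs(y_pred - y_true) / denom))

def cross_validate_smape(X, y, n_splits=5, seed=42):
    """Mean out-of-fold SMAPE for a single XGBoost regressor."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, val_idx in kf.split(X):
        model = XGBRegressor(n_estimators=500, learning_rate=0.05, max_depth=8)
        model.fit(X[train_idx], y[train_idx])
        scores.append(smape(y[val_idx], model.predict(X[val_idx])))
    return float(np.mean(scores))
```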
Basic training:

```
python main.py --mode train
```

Training with all options:

```
python main.py \
  --mode train \
  --train-path dataset/train.csv \
  --model-type ensemble \
  --use-images
```

Prediction:

```
python main.py \
  --mode predict \
  --test-path dataset/test.csv \
  --output-path output/test_out.csv
```

Model types (--model-type):
- ensemble (default): Combines available boosted models
- xgboost: XGBoost only
- lightgbm: LightGBM only
- catboost: CatBoost only
- Cross-validation SMAPE: Varies by data
- Validation SMAPE: Reported after training
- Prediction Range: $0.01 - $XXX.XX
- Output Format: CSV with columns: sample_id, price
- Data Loading: Reads CSV files with product information
- Feature Extraction:
  - Extracts TF-IDF features from catalog text
  - Parses Item Pack Quantity (value and unit)
  - Optionally downloads and processes images
- Model Training:
  - Trains multiple models with cross-validation
  - Creates weighted ensemble
  - Saves models to disk
- Prediction:
  - Loads trained models
  - Extracts features from test data
  - Generates price predictions
  - Saves to test_out.csv
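The prediction step amounts to roughly the following; a simplified sketch in which the saved-model path, the bundle layout, and the assumption that test.csv carries a sample_id column are all illustrative:

```python
import joblib
import pandas as pd

def run_prediction(test_csv="dataset/test.csv",
                   model_path="models/ensemble.pkl",      # hypothetical file name
                   output_path="output/test_out.csv"):
    """Load a saved model, featurize the test set, and write test_out.csv."""
    test_df = pd.read_csv(test_csv)
    bundle = joblib.load(model_path)                       # assumed: {"featurizer": ..., "model": ...}
    X_test = bundle["featurizer"].transform(test_df)
    prices = bundle["model"].predict(X_test).clip(min=0.01)  # keep prices positive
    pd.DataFrame({"sample_id": test_df["sample_id"], "price": prices}) \
        .to_csv(output_path, index=False)
```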
The prediction output (test_out.csv) contains:

```
sample_id,price
217392,45.67
209156,23.45
...
```
Edit src/models.py to modify:
- Number of estimators
- Learning rate
- Max depth
- Ensemble weights
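For example, settings of this kind could be grouped as below; the values are illustrative placeholders, not the project's tuned configuration:

```python
# Illustrative hyperparameters to adjust in src/models.py
XGB_PARAMS  = {"n_estimators": 800, "learning_rate": 0.05, "max_depth": 8, "subsample": 0.8}
LGBM_PARAMS = {"n_estimators": 800, "learning_rate": 0.05, "num_leaves": 64}
CAT_PARAMS  = {"iterations": 800, "learning_rate": 0.05, "depth": 8}

# Ensemble weights (XGBoost / LightGBM / CatBoost)
ENSEMBLE_WEIGHTS = {"xgboost": 0.5, "lightgbm": 0.3, "catboost": 0.2}
```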
Edit src/feature_extraction.py:

```python
TextFeatureExtractor(max_features=5000)  # Increase or decrease as needed
```

To add new features, extend the TextFeatureExtractor or ImageFeatureExtractor classes.
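A sketch of adding a feature by subclassing (the base-class interface assumed here, including an extract() method returning a dense NumPy array, may not match the project exactly):

```python
import numpy as np
from src.feature_extraction import TextFeatureExtractor

class KeywordAwareTextFeatureExtractor(TextFeatureExtractor):
    """Hypothetical extension adding a simple 'premium keyword' indicator column."""
    PREMIUM_WORDS = ("organic", "premium", "deluxe")

    def extract(self, texts):
        base = super().extract(texts)  # assumed to return a dense NumPy array
        flags = np.array(
            [[float(any(w in t.lower() for w in self.PREMIUM_WORDS))] for t in texts]
        )
        return np.hstack([base, flags])
```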
- ✅ MIT/Apache 2.0 licensed models (< 8B parameters)
- ✅ No external LLM APIs used
- ✅ No external price lookup
- ✅ SMAPE evaluation metric
- ✅ Proper output format
- ✅ Well-commented source code
- ✅ Documentation included
Image downloads are slow or fail:
Solution: Use the --use-images flag with caution; the text-only model already works well.

Training is slow or runs out of memory:
Solution: Reduce max_features in TextFeatureExtractor, or use a single model instead of the ensemble.

Data files are not found:
Solution: Place your data files in the dataset/ directory with the exact names train.csv and test.csv.
See Documentation.md for:
- Detailed methodology
- Experiments conducted
- Model architecture
- Results and conclusions
- Future improvements
- Train the model: python main.py --mode train --train-path dataset/train.csv
- Generate predictions: python main.py --mode predict --test-path dataset/test.csv
- Submit output/test_out.csv to the challenge portal
- Include this code and Documentation.md with the submission
This project uses only MIT/Apache 2.0 licensed libraries and complies with all Amazon ML Challenge 2025 rules.
Good luck with the challenge!