I've successfully created a complete Docker Selenium demonstration setup for your team. Here's everything that's ready to use:
- `small_dataset.csv` - 20 URLs with a 1:1 ratio (10 legitimate, 10 phishing)
- `create_small_dataset.py` - Script to create the balanced small dataset
- `docker_selenium_extractor.py` - Main feature extraction script
- `run_demo.py` - Automated complete demo runner
- `test_docker_selenium.py` - Quick connection tester
- `selenium_requirements.txt` - Python dependencies
- `DOCKER_SELENIUM_DEMO_README.md` - Complete documentation
- `DOCKER_SELENIUM_SETUP.md` - Technical setup guide
python run_demo.py

This handles everything automatically and is perfect for showing your team.
# 1. Start Docker Selenium
docker run -d -p 4444:4444 -p 7900:7900 --shm-size="2g" --name selenium-chrome selenium/standalone-chrome:latest
# 2. Test connection
python test_docker_selenium.py
# 3. Run extraction
python docker_selenium_extractor.py

What you'll see during the demo:
- Real-time browser activity at http://localhost:7900 (password: secret)
- Progress updates for each URL being processed
- Feature extraction status (success/failure)
- Performance metrics (response times)
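The connection-test step above can be sketched with a simple readiness check against the Selenium Grid's `/status` endpoint, which reports whether the grid can accept new sessions. This is an illustrative standalone version, not the exact contents of `test_docker_selenium.py`:

```python
import json
import urllib.request

GRID_URL = "http://localhost:4444"  # matches the docker run port mapping above

def grid_ready(base=GRID_URL, timeout=5):
    """Return True if the Selenium Grid /status endpoint reports ready."""
    try:
        with urllib.request.urlopen(f"{base}/status", timeout=timeout) as resp:
            status = json.load(resp)
        return bool(status.get("value", {}).get("ready"))
    except OSError:
        return False

# Once the grid reports ready, a remote browser session can be opened
# (requires the selenium package from selenium_requirements.txt):
#
#   from selenium import webdriver
#   driver = webdriver.Remote(command_executor=f"{GRID_URL}/wd/hub",
#                             options=webdriver.ChromeOptions())
#   driver.get("https://example.com")
#   driver.quit()
```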
Output files:
- `extracted_features.csv` - ~25 features per URL, ready for ML
- `extraction_report.json` - Detailed extraction log with statistics

Extracted features include:
- URL length, domain analysis, suspicious keywords
- Page load success, response times
- HTML element counts (forms, links, images)
- Security indicators (HTTPS, password fields)
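To make the feature categories above concrete, here is a hedged sketch of how such features might be computed. The keyword list and feature names are illustrative, not the exact ones in `docker_selenium_extractor.py`; the URL-only features need no network access, while the page features assume a live Selenium WebDriver session:

```python
from urllib.parse import urlparse

# Hypothetical keyword list -- the real script's list may differ.
SUSPICIOUS_KEYWORDS = ["login", "verify", "secure", "account", "update"]

def url_features(url):
    """Lexical features computed from the URL string alone."""
    parsed = urlparse(url)
    return {
        "url_length": len(url),
        "uses_https": int(parsed.scheme == "https"),
        "num_dots_in_domain": parsed.netloc.count("."),
        "has_suspicious_keyword": int(
            any(k in url.lower() for k in SUSPICIOUS_KEYWORDS)),
    }

def page_features(driver):
    """Content features from a loaded page (Selenium WebDriver assumed)."""
    from selenium.webdriver.common.by import By
    return {
        "num_forms": len(driver.find_elements(By.TAG_NAME, "form")),
        "num_links": len(driver.find_elements(By.TAG_NAME, "a")),
        "num_images": len(driver.find_elements(By.TAG_NAME, "img")),
        "has_password_field": int(bool(
            driver.find_elements(By.CSS_SELECTOR, "input[type=password]"))),
    }
```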
This setup is ideal for showing your team because:
- Small Scale - Only 20 URLs, so the demo runs quickly (~5 minutes)
- Visual - Team can watch browser automation in real-time
- Comprehensive - Shows both URL and web content features
- Educational - Clear logs showing what happens at each step
- Scalable - Easy to understand how it scales to thousands of URLs
Expected results:
- Total URLs processed: 20
- Successful extractions: ~16-18 (some phishing sites may be down)
- Failed extractions: ~2-4 (normal for phishing URLs)
- Success rate: ~80-90%
- Features per URL: ~25
- Demo duration: ~5 minutes
Discussion points for your team:
- Feature Engineering: See which features distinguish phishing vs legitimate sites
- Scalability: How this approach handles larger datasets
- Reliability: Graceful handling of failed page loads
- Performance: Response times and throughput considerations
- Ethics: Responsible web scraping practices
Next steps:
- Review extracted features in `extracted_features.csv`
- Analyze patterns between legitimate and phishing URLs
- Scale to full dataset using the same approach
- Integrate with ML pipeline for model training
- Customize features based on your specific needs
Everything is set up and tested. Your team can now:
- See exactly how web scraping works for phishing detection
- Understand the feature extraction process
- Watch browser automation in real-time
- Get hands-on experience with the tools
Just run `python run_demo.py` when you're ready to show your team! 🎉