🎯 Official Project for DSC Winter of Code 2026
A Python-based web scraper built using Playwright that automatically extracts all company names and logo URLs from the 5Paisa Stocks page. This project handles dynamic content loading via infinite scroll, validates logo URLs, removes duplicates, and exports clean data to Excel.
⚠️ IMPORTANT: The scraper opens a visible browser window (non-headless mode) to bypass anti-bot protection. Do not close the browser window manually! Let the script complete and it will close automatically. The scraping process may take 2-5 minutes depending on the number of stocks.
- ✅ Automated Infinite Scrolling: Dynamically loads all stock data by scrolling until no more content appears
- ✅ Company Name & Logo Extraction: Parses HTML to extract company names and logo image URLs
- ✅ Logo URL Validation: Validates each logo URL using HTTP HEAD/GET requests
- ✅ Duplicate Removal: Normalizes company names and removes duplicate entries
- ✅ Resume from Checkpoint: Automatically resumes scraping from the last saved checkpoint if interrupted
- ✅ Excel Export: Saves final data to a well-formatted Excel file with serial numbers
- ✅ Comprehensive Logging: Tracks progress, errors, and statistics in a detailed log file
- ✅ Polite Scraping: Uses random delays and a custom user-agent to avoid overwhelming the server
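The polite-scraping feature amounts to a randomized pause between requests. A minimal sketch (`polite_delay` is a hypothetical helper name, not taken from the scraper itself):

```python
import random
import time

def polite_delay(min_s: float = 1.0, max_s: float = 3.0) -> float:
    """Sleep for a random interval so request timing looks less bot-like
    and the server is not hit at a fixed rate."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay
```

Calling this between page interactions spreads requests out; the 1-3 second default range here is only an assumption.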
```
5paisa_stocks_scraper/
│
├── scraper/
│   └── run_scraper.py                      # Main Playwright scraper script
│
├── outputs/
│   └── all_stock_script_Nov15_2025.xlsx    # Final Excel output file
│
├── logs/
│   └── scraper.log                         # Log file for progress and errors
│
├── checkpoints/
│   └── progress.csv                        # For resumable scraping (created during run)
│
├── README.md                               # This documentation file
└── requirements.txt                        # Python dependencies
```
- Python 3.10+
- Playwright: Browser automation for dynamic content
- BeautifulSoup4: HTML parsing
- httpx: Async HTTP client for logo validation
- pandas: Data manipulation
- openpyxl: Excel file generation
- Python 3.10 or higher installed on your system
- Internet connection for scraping
Download the project folder or clone it to your local machine.
Navigate to the project folder and install the Python dependencies:

```bash
cd c:\Users\Mashr\Desktop\5paisa_stocks_scraper
pip install -r requirements.txt
```

Playwright requires browser binaries to be installed:

```bash
playwright install chromium
```

To start scraping all stock data from 5Paisa:

```bash
python scraper/run_scraper.py
```

- Browser Launch: Opens a Chromium browser in visible (non-headless) mode
- Navigate to Page: Loads https://www.5paisa.com/stocks/all
- Infinite Scroll: Scrolls down repeatedly until all stocks are loaded
- Data Extraction: Parses the HTML and extracts company names and logo URLs
- Data Cleaning: Removes duplicates based on normalized company names
- Logo Validation: Checks each logo URL for validity (HTTP status and content-type)
- Save to Excel: Exports final data to outputs/all_stock_script_Nov15_2025.xlsx
- Logging: Records all activity to logs/scraper.log
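The data-cleaning step (duplicate removal based on normalized names) can be sketched as follows; `normalize_name` and `dedupe` are illustrative names, not the scraper's actual functions:

```python
def normalize_name(name: str) -> str:
    # Lowercase and collapse whitespace so that, e.g.,
    # "TCS Limited" and "tcs  limited" compare equal.
    return " ".join(name.lower().split())

def dedupe(companies: list[dict]) -> list[dict]:
    seen: set[str] = set()
    unique = []
    for company in companies:
        key = normalize_name(company["company_name"])
        if key not in seen:        # keep only the first occurrence
            seen.add(key)
            unique.append(company)
    return unique
```

The real script may normalize more aggressively (stripping punctuation or corporate suffixes); this sketch shows only the core idea.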
If the scraper is interrupted (e.g., network issue, manual stop), it automatically saves progress to:
checkpoints/progress.csv
When you run the scraper again, it will:
- Detect the checkpoint file
- Resume from where it left off
- Skip re-scraping already collected data
To force a fresh start, simply delete checkpoints/progress.csv before running.
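A checkpoint mechanism like the one described can be sketched with the standard library alone; the exact column layout of progress.csv here is an assumption:

```python
import csv
import os

FIELDS = ["company_name", "logo_url"]  # assumed checkpoint columns

def save_checkpoint(rows: list[dict], path: str = "checkpoints/progress.csv") -> None:
    """Write collected rows so an interrupted run can resume later."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)

def load_checkpoint(path: str = "checkpoints/progress.csv") -> list[dict]:
    """Return previously saved rows, or an empty list for a fresh start."""
    if not os.path.exists(path):
        return []
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```

On startup the scraper would call `load_checkpoint()` and skip any company already present in the returned rows.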
For each logo URL, the scraper:
- Sends an HTTP HEAD request (faster, no content download)
- Falls back to GET request if HEAD fails
- Checks:
  - Status code is 200 (OK)
  - Content-Type header contains "image"
- Marks the logo as:
  - ✅ "Valid": Logo accessible and is an image
  - ❌ "Broken or Missing": Logo inaccessible or not an image
  - ⚠️ "Invalid (Status: XXX)": Other HTTP errors
The validation status is saved in the "notes" column in the Excel output.
The Excel file (outputs/all_stock_script_Nov15_2025.xlsx) contains:
| serial_no | company_name | logo_url | notes |
|---|---|---|---|
| 1 | Reliance Industries | https://example.com/logo1.png | Valid |
| 2 | TCS Limited | https://example.com/logo2.png | Valid |
| 3 | HDFC Bank | https://example.com/logo3.png | Broken or Missing |
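Producing this layout with pandas might look like the following sketch; the sample rows are the illustrative ones from the table above, and the export line assumes openpyxl is installed:

```python
import pandas as pd

rows = [
    {"company_name": "Reliance Industries",
     "logo_url": "https://example.com/logo1.png", "notes": "Valid"},
    {"company_name": "TCS Limited",
     "logo_url": "https://example.com/logo2.png", "notes": "Valid"},
]
df = pd.DataFrame(rows)
df.insert(0, "serial_no", range(1, len(df) + 1))   # 1-based serial numbers
# df.to_excel("outputs/all_stock_script_Nov15_2025.xlsx", index=False)
```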
All scraping activity is logged to:
logs/scraper.log
The log file includes:
- Start and end timestamps
- Total companies scraped
- Number of duplicates removed
- Number of invalid logos
- Total execution time
- Any errors or warnings
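A logging setup matching this description could be configured as below; the format string and `make_logger` name are assumptions, not the script's actual code:

```python
import logging
import os

def make_logger(path: str = "logs/scraper.log") -> logging.Logger:
    """Create a file logger that timestamps progress, warnings, and errors."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    logger = logging.getLogger("scraper")
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(path, encoding="utf-8")
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger
```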
- Limited Stock Listings on Page: The 5Paisa "All Stocks" page (https://www.5paisa.com/stocks/all) displays only a limited sample of companies (~30-40) on initial load, not every listed company as the title suggests. Viewing all stocks may require the site's search, filters, or other pages. The scraper extracts all companies visible on the page.
- Anti-Bot Protection: The 5Paisa website uses anti-bot protection that may block automated access. The scraper runs in non-headless (visible browser) mode to help bypass this. Do not close the browser window manually during scraping.
- Manual Intervention May Be Required: If the website shows a CAPTCHA or "Access Denied" message, you may need to:
  - Complete the CAPTCHA manually in the browser window that opens
  - Wait a few minutes and try again
  - Use a VPN if your IP has been temporarily blocked
- Website Structure Changes: If 5Paisa updates their HTML structure, the CSS selectors may need adjustment
- Rate Limiting: Excessive requests may trigger rate limiting; the scraper's polite delays minimize this risk
- Dynamic Content: Some stocks may load asynchronously; the scraper waits for network idle, but rare edge cases may occur
- Logo Validation Speed: Validating hundreds or thousands of URLs takes time; expect 1-3 seconds per logo
- Browser fails to launch — Solution: Run `playwright install chromium`
- Import errors on startup — Solution: Run `pip install -r requirements.txt`
- No data extracted — Solution: The website structure may have changed. Check `logs/scraper.log` for details. You may need to update the CSS selectors in the `extract_stock_data()` function.
- Page fails to load — Solution: Check your internet connection. The scraper will time out after 60 seconds on page load.
We welcome contributions from the community, especially participants of DSC Winter of Code 2026.
Please read the Contribution.md file for:
- Setup instructions
- Beginner-friendly issues
- Pull request guidelines
- Code of conduct
This project is intended for educational purposes only.
Users are responsible for ensuring compliance with the website's terms of service before scraping any data.
This project is created for educational and internship evaluation purposes.
- 5Paisa for providing publicly accessible stock data
- Playwright team for excellent browser automation tools
- Python community for amazing open-source libraries
For issues, questions, or suggestions, please contact via LinkedIn or GitHub.