This project was developed as part of my Bachelor’s thesis at the University of Passau, under the supervision of the Chair of Data Science. It implements a distributed web crawler for extracting structured product data from amazon.com. The system uses Playwright Go in combination with Camoufox as the headless browser.
The crawler begins with a predefined set of SEED_URLS and continuously discovers new product pages through search and category pages. URLs are queued in a PostgreSQL-backed job system to support scalable, distributed processing. Parsed product data is passed to a loosely coupled consumer module, with support for stdout and opensearch currently implemented.
The crawler depends on Camoufox, xvfb and other system libraries to be installed on the system. The easiest way to get started is to use the included Docker setup.
docker compose up -dAfterwards visit the OpenSearch Dashboard at http://localhost:5601.
This project was developed for academic research at the University of Passau – Chair of Data Science. It is provided for educational and research purposes only.
Please note: This crawler targets Amazon, which prohibits automated access in its Terms of Service. Use responsibly and at your own risk. The author and affiliated institution assume no liability for misuse.