Docker-based service that scrapes northdata.de and exposes the results through a REST API.
- Docker setup with memory limits
- Puppeteer with stealth plugin
- Express REST API
- In-memory queue for sequential processing
- Request blocking for api.rupt.dev
- Human-like behavior with slow typing
- Debug mode with visible browser window
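The in-memory queue for sequential processing listed above could work roughly as follows. This is a minimal sketch, not the project's actual code: the class name `SequentialQueue` and its `enqueue` method are hypothetical. Each job is chained onto the previous one, so only one scrape runs against the browser at a time.

```javascript
// Minimal sketch of an in-memory sequential queue (hypothetical names,
// not taken from the project's source).
class SequentialQueue {
  constructor() {
    this.tail = Promise.resolve(); // settles when the last queued job finishes
    this.pending = 0;              // jobs that are waiting or running
  }

  // Queue a job (a function returning a promise) and return its result.
  enqueue(job) {
    this.pending += 1;
    const run = this.tail.then(() => job());
    // Keep the chain alive even if a job fails, then update the counter.
    this.tail = run
      .catch(() => {})
      .finally(() => {
        this.pending -= 1;
      });
    return run;
  }
}
```

A route handler would then wrap each scrape in `queue.enqueue(() => scrape(query))`, where `scrape` stands in for whatever function drives the browser. A counter like `pending` is one way to provide the queue information that a health check reports.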
- Clone the repository and navigate to the directory
- Create a .env file from .env.example
- Add your northdata.de credentials to .env
With Docker:
docker-compose up --build
For development:
npm install
npm run dev
POST /search
Content-Type: application/json
Body: {"query": "Company Name"}
Returns HTML content of search results.
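A client call to this endpoint might look like the sketch below. The helper name `buildSearchRequest` is hypothetical, and the base URL assumes the service runs locally on the default port 3000.

```javascript
// Hypothetical helper: builds the fetch() arguments for POST /search.
function buildSearchRequest(baseUrl, query) {
  return {
    url: new URL("/search", baseUrl).toString(),
    options: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ query }),
    },
  };
}

// Usage (assumes Node 18+ for the global fetch and a locally running service):
// const { url, options } = buildSearchRequest("http://localhost:3000", "Company Name");
// const html = await (await fetch(url, options)).text();
```

Building the body with `JSON.stringify` avoids quoting mistakes when the company name contains quotes or non-ASCII characters.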
GET /suggest?query=Company
Returns JSON suggestions from northdata.de's suggestion API.
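Query strings need URL encoding when the company name contains spaces or special characters. The sketch below shows one way to build the request URL. `buildSuggestUrl` is a hypothetical helper, and the local base URL is an assumption.

```javascript
// Hypothetical helper: builds the URL for GET /suggest with an encoded query.
function buildSuggestUrl(baseUrl, query) {
  const url = new URL("/suggest", baseUrl);
  url.searchParams.set("query", query); // handles percent-encoding
  return url.toString();
}

// Usage (assumes Node 18+ and a service running on the default port 3000):
// const res = await fetch(buildSuggestUrl("http://localhost:3000", "Company"));
// const suggestions = await res.json();
```

The same approach applies to the `url` parameter of GET /page, where the value is itself a full URL and must be encoded.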
GET /page?url=https://www.northdata.de/...
Returns cleaned HTML content of a specific page, with:
- Only the main content section
- No JavaScript, CSS, links, images, or non-informational elements
- Minimal whitespace to reduce token count for downstream AI processing
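To illustrate the kind of cleanup described above, here is a simplified, regex-based sketch. It is an assumption about the general approach, not the service's implementation, which may well work on the DOM inside the browser instead. It covers only scripts, styles, images, comments, and whitespace, and `cleanHtml` is a hypothetical name.

```javascript
// Simplified sketch: strips non-informational markup and collapses whitespace.
// Regexes are not a robust HTML parser; this only illustrates the idea.
function cleanHtml(html) {
  return html
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, "") // drop JS and CSS blocks
    .replace(/<img\b[^>]*>/gi, "")                          // drop images
    .replace(/<!--[\s\S]*?-->/g, "")                        // drop HTML comments
    .replace(/\s+/g, " ")                                   // collapse whitespace runs
    .replace(/>\s+</g, "><")                                // remove gaps between tags
    .trim();
}
```

Removing whitespace between tags can change how inline elements render, which is acceptable when the output is fed to a language model rather than displayed.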
GET /health
Returns status of the service and queue information.
Key environment variables:
- PORT: Server port (default: 3000)
- NORTHDATA_USERNAME: northdata.de username
- NORTHDATA_PASSWORD: northdata.de password
- BROWSER_HEADLESS: Set to 'false' for debug mode
- TYPING_DELAY_MIN/MAX: Keystroke delay in ms
- WAIT_FOR_NETWORK_IDLE: Wait for network idle after navigation
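The sketch below shows how these variables might be read and how a randomized keystroke delay could be derived from them. Only the PORT default of 3000 comes from this README. The other fallbacks, the expansion of TYPING_DELAY_MIN/MAX into two separate variables, and the function names `loadConfig` and `keystrokeDelay` are assumptions for illustration.

```javascript
// Sketch: read settings from an env-like object such as process.env.
function loadConfig(env) {
  const toInt = (value, fallback) => {
    const n = Number.parseInt(value, 10);
    return Number.isNaN(n) ? fallback : n;
  };
  return {
    port: toInt(env.PORT, 3000),                       // default from the README
    username: env.NORTHDATA_USERNAME ?? "",
    password: env.NORTHDATA_PASSWORD ?? "",
    headless: env.BROWSER_HEADLESS !== "false",        // only 'false' shows the browser
    typingDelayMin: toInt(env.TYPING_DELAY_MIN, 50),   // assumed default
    typingDelayMax: toInt(env.TYPING_DELAY_MAX, 150),  // assumed default
    waitForNetworkIdle: env.WAIT_FOR_NETWORK_IDLE === "true", // assumed semantics
  };
}

// Pick a per-keystroke delay in [typingDelayMin, typingDelayMax] milliseconds.
function keystrokeDelay(config, random = Math.random) {
  const span = config.typingDelayMax - config.typingDelayMin;
  return config.typingDelayMin + Math.floor(random() * (span + 1));
}
```

Calling `loadConfig(process.env)` once at startup keeps parsing and defaults in one place. Passing `random` as a parameter makes the delay deterministic in tests.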
Set BROWSER_HEADLESS=false in .env to see browser interactions.
For Docker debugging (Linux only):
docker-compose -f docker-compose.yml -f docker-compose.debug.yml up