Status: Prototype
Environment: macOS on Apple Silicon (Dev) -> Linux (Prod)
Stack: Docker, Tor, Redis Stack (RediSearch), Python
READ THIS BEFORE RUNNING.
You are about to build a tool that indiscriminately crawls Tor hidden services (.onion sites).
- Liability: This crawler will eventually encounter illegal content (CSAM, illicit markets, malware). Downloading this content to your disk, even unintentionally via a crawler, may constitute a felony in your jurisdiction.
- OpSec: Running this on a personal laptop connects your identity (via your ISP) to the entry node of the Tor network. If your Tor configuration fails, your real IP will be exposed to the sites you are crawling.
- Malware: Hidden services are rife with browser exploits. While this crawler uses `requests` (not a browser engine), parsing hostile HTML still carries risks.
Proceed at your own risk.
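On the OpSec point above, one common leak is hostname resolution happening on your own machine instead of inside Tor. The sketch below builds a proxy mapping in the shape `requests` accepts. The helper name `tor_proxies` and the default host name `tor` are assumptions for illustration; the port 9050 comes from this README.

```python
def tor_proxies(host: str = "tor", port: int = 9050) -> dict[str, str]:
    """Build a proxies mapping for routing HTTP(S) traffic through Tor's SOCKS5 port.

    The host name "tor" is an assumed Docker service name; adjust it to
    whatever your compose file actually calls the Tor container.
    """
    # "socks5h" (with the trailing "h") asks the proxy to resolve hostnames.
    # Plain "socks5" resolves them locally, which fails for .onion names
    # and sends the lookup to your normal DNS resolver.
    url = f"socks5h://{host}:{port}"
    return {"http": url, "https": url}
```

With `requests` installed with SOCKS support (`pip install "requests[socks]"`), the mapping can be applied to a session via `session.proxies.update(tor_proxies())`. This does not by itself guarantee that nothing leaks; it only removes one known failure mode.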
This project simulates a production environment using Docker containers. To keep the setup isolated and portable, it does not use Homebrew services.
- `tor` Service: A minimal Alpine Linux container running Tor, exposing SOCKS5 on port 9050.
- `redis` Service: Runs `redis-stack-server`. This is not standard Redis; it includes the `RediSearch` module, allowing us to perform full-text search queries (`FT.SEARCH`) on the data we scrape.
- `crawler` Service: A Python worker that:
  - Fetches URLs from a Redis List (`frontier`).
  - Proxies traffic through the `tor` container.
  - Parses HTML and extracts text/links.
  - Stores data in Redis Hashes.
  - Updates the inverted search index automatically.
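The "parses HTML and extracts text/links" step can be sketched with the standard library alone. This is not the project's actual parser; `PageExtractor` and `extract` are hypothetical names, and limiting links to `.onion` hosts is an assumption about what the crawler should follow.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class PageExtractor(HTMLParser):
    """Collects visible text and absolute .onion links from one page."""

    SKIP_TAGS = {"script", "style"}  # their contents are not visible text

    def __init__(self, base_url: str):
        super().__init__(convert_charrefs=True)
        self.base_url = base_url
        self.text_parts: list[str] = []
        self.links: set[str] = set()
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
            return
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        try:
            # Resolve relative links against the page URL.
            absolute = urljoin(self.base_url, href)
            parts = urlsplit(absolute)
            host = parts.hostname or ""
        except ValueError:
            return  # malformed URL in hostile markup; ignore it
        if parts.scheme in ("http", "https") and host.endswith(".onion"):
            # Drop the fragment so "#top" variants do not create duplicates.
            self.links.add(absolute.split("#", 1)[0])

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract(base_url: str, html: str) -> tuple[str, set[str]]:
    """Return (visible text, set of absolute .onion links) for one page."""
    parser = PageExtractor(base_url)
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.links
```

In a worker loop, the returned links would be pushed onto the `frontier` list and the text stored in a hash; those Redis calls are left out here so the sketch runs without a server.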
- Docker Desktop for Mac (Apple Silicon)
- Git
- Clone and Enter:
  `git init  # (Add files provided in the setup instructions)`
- Build and Run:
  `docker-compose up --build -d`
- Monitor the Crawler: The crawler will start immediately. You can watch the logs to see what it is hitting.
  `docker-compose logs -f crawler`
- How to Search (The UI): We are using RedisInsight as the GUI instead of building a custom frontend.
  - Open your browser to `http://localhost:8001`.
  - Accept the EULA.
  - It should auto-detect the local Redis instance. If not, connect to `host: localhost`, `port: 6379`.
  - Click on the Workbench (CLI icon) or the Browser tool.

  Run a Search Query:
  `FT.SEARCH idx:onion "bitcoin"`

  Check Index Info:
  `FT.INFO idx:onion`
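The same search can be issued from Python instead of the Workbench. The sketch below only builds the command arguments, so it runs without a Redis server. The helper names are hypothetical, and the set of escaped characters is an assumption based on the punctuation RediSearch treats as separators or query syntax; check it against your RediSearch version.

```python
# Characters assumed to need a backslash escape inside a RediSearch query.
_SPECIAL = set(",.<>{}[]\"':;!@#$%^&*()-+=~|/\\")


def escape_term(term: str) -> str:
    """Backslash-escape punctuation so a term is matched literally."""
    return "".join("\\" + ch if ch in _SPECIAL else ch for ch in term)


def build_search_args(user_input: str, index: str = "idx:onion", limit: int = 10) -> list[str]:
    """Build the argument list for an FT.SEARCH call against `index`."""
    terms = [escape_term(t) for t in user_input.split()]
    # "*" matches every document; used when the input is empty.
    query = " ".join(terms) if terms else "*"
    return ["FT.SEARCH", index, query, "LIMIT", "0", str(limit)]
```

With the `redis` package installed, the result can be passed to `redis.Redis(host="localhost", port=6379).execute_command(*build_search_args("bitcoin"))`. Escaping matters here because characters such as `@`, `-`, and `|` otherwise change the meaning of the query.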
- Persistence: The `redis-data` volume maps to your host. In production, this needs to be a managed volume with backups.
- Network: A single Tor instance cannot handle high throughput. Production requires a load balancer (HAProxy) rotating requests across multiple Tor instances.
- Sanitization: The current HTML parser is naive. It does not strictly sanitize inputs before storage, so stored content could trigger XSS if it is ever rendered in a web browser.
- Rate Limiting: No politeness policy is implemented. `robots.txt` is rarely respected on Tor, but hardcoded delays are required to avoid DoSing small hidden services.
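Two of the gaps listed above, sanitization and rate limiting, can be sketched in a few lines. Neither is part of the current crawler; `escape_for_storage`, `HostThrottle`, and the 5-second default delay are illustrative assumptions. Escaping at render time is the more common practice, and escaping before storage also changes what the search index sees (for example `&` becomes `&amp;`), so treat the first helper as one option rather than a recommendation.

```python
import html
import time
from urllib.parse import urlsplit


def escape_for_storage(text: str) -> str:
    """Escape HTML metacharacters so the text is inert if rendered later."""
    return html.escape(text, quote=True)


class HostThrottle:
    """Enforces a minimum delay between requests to the same host."""

    def __init__(self, min_delay: float = 5.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self._clock = clock      # injectable so tests can avoid real waiting
        self._sleep = sleep
        self._last_hit: dict[str, float] = {}

    def wait(self, url: str) -> float:
        """Sleep if needed before fetching `url`; return the seconds slept."""
        host = urlsplit(url).hostname or ""
        now = self._clock()
        last = self._last_hit.get(host)
        pause = 0.0
        if last is not None:
            pause = max(0.0, self.min_delay - (now - last))
        if pause > 0:
            self._sleep(pause)
        # Record when the request is actually allowed to go out.
        self._last_hit[host] = now + pause
        return pause
```

A worker would call `throttle.wait(url)` immediately before each fetch. The delay is tracked per host, so crawling many different services is not slowed down by a single slow one.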
- "Connection Refused" on Crawler: The Tor container takes a few seconds to bootstrap to 100%. Restart the crawler container: `docker-compose restart crawler`.
- Disk Space: Crawling the web consumes space rapidly. Monitor the `./redis-data` folder size.
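As an alternative to restarting the crawler by hand, the worker could wait for the SOCKS port before it starts fetching. The sketch below is an assumption about how that might look, not existing project code. An open port only means Tor is listening; it does not prove that bootstrapping has reached 100%, so early requests may still fail and should be retried.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Poll until a TCP connection to (host, port) succeeds or `timeout` elapses.

    Returns True once a connection is accepted, False if the deadline passes.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # The connection is closed again as soon as the block exits.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Covers "connection refused", unreachable hosts, and timeouts.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

At startup the crawler might call `wait_for_port("tor", 9050)`, where `tor` is the assumed service name of the Tor container, and exit with an error if it returns `False`.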