
m-lally/dark-web-search


Onion Indexer (POC)

Status: Prototype
Environment: macOS on Apple Silicon (Dev) -> Linux (Prod)
Stack: Docker, Tor, Redis Stack (RediSearch), Python

⚠️ LEGAL & ETHICAL WARNING

READ THIS BEFORE RUNNING.

You are about to build a tool that indiscriminately crawls Tor hidden services (.onion sites).

  1. Liability: This crawler will eventually encounter illegal content (CSAM, illicit markets, malware). Downloading this content to your disk, even unintentionally via a crawler, may constitute a felony in your jurisdiction.
  2. OpSec: Running this on a personal laptop lets your ISP see that your connection reaches a Tor entry node, which ties your identity to Tor usage. If your Tor configuration fails, your real IP may be exposed to the sites you are crawling.
  3. Malware: Hidden services are rife with browser exploits. While this crawler uses requests (not a browser engine), parsing hostile HTML still carries risks.

Proceed at your own risk.

Architecture

This project simulates a production environment using Docker containers. To ensure isolation and portability, it does not rely on Homebrew services.

  1. tor Service: A minimal Alpine Linux container running Tor, exposing SOCKS5 on port 9050.
  2. redis Service: Runs redis-stack-server. This is not standard Redis; it includes the RediSearch module, allowing us to perform full-text search queries (FT.SEARCH) on the data we scrape.
  3. crawler Service: A Python worker that:
    • Fetches URLs from a Redis List (frontier).
    • Proxies traffic through the tor container.
    • Parses HTML and extracts text/links.
    • Stores data in Redis Hashes.
    • Updates the inverted search index automatically.
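
The crawler steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual worker: a `deque` and a `dict` stand in for the Redis List and Redis Hashes, the proxy hostname `tor` assumes the Compose service name, and `parse_page`, `fetch_via_tor`, and `crawl` are hypothetical names. The `socks5h` scheme is used so that hostname resolution happens inside Tor, which .onion addresses need.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumption: the Tor container is reachable as "tor" on the Compose network.
TOR_PROXIES = {"http": "socks5h://tor:9050", "https": "socks5h://tor:9050"}


class PageParser(HTMLParser):
    """Collects visible text and raw href values from an HTML document."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def parse_page(base_url, html):
    """Return (text, links); links are absolute and limited to .onion hosts."""
    parser = PageParser()
    parser.feed(html)
    parser.close()
    links = []
    for href in parser.links:
        absolute = urljoin(base_url, href)
        host = urlparse(absolute).hostname or ""
        if host.endswith(".onion"):
            links.append(absolute)
    return " ".join(parser.text_parts), links


def fetch_via_tor(url, timeout=30):
    """Fetch a page through the Tor SOCKS proxy (needs requests[socks])."""
    import requests  # imported lazily so the parsing code runs without it
    return requests.get(url, proxies=TOR_PROXIES, timeout=timeout).text


def crawl(frontier, store, fetch=fetch_via_tor, max_pages=10):
    """Pop URLs from the frontier, store parsed pages, enqueue new links."""
    seen = set()
    while frontier and len(seen) < max_pages:
        url = frontier.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            text, links = parse_page(url, fetch(url))
        except Exception:
            continue  # unreachable or malformed pages are skipped
        store[url] = {"url": url, "body": text}
        frontier.extend(link for link in links if link not in seen)
    return seen
```

Passing `fetch` as a parameter keeps the loop testable without a running Tor container; in the real stack the frontier and store would be Redis structures, and the field names `url` and `body` are placeholders.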

Prerequisites

  • Docker Desktop for Mac (Apple Silicon)
  • Git

Installation & Usage

  1. Initialize the repository:

    git init
    # (Add files provided in the setup instructions)
  2. Build and Run:

    docker-compose up --build -d
  3. Monitor the Crawler: The crawler starts immediately. Follow the logs to see which URLs it is fetching.

    docker-compose logs -f crawler
  4. How to Search (The UI): We are using RedisInsight as the GUI instead of building a custom frontend.

    1. Open your browser to http://localhost:8001.
    2. Accept the EULA.
    3. It should auto-detect the local Redis instance. If not, connect to host: localhost, port: 6379.
    4. Click on the Workbench (CLI icon) or the Browser tool.

    Run a Search Query:

    FT.SEARCH idx:onion "bitcoin"
    

    Check Index Info:

    FT.INFO idx:onion
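
    The same query can also be issued from Python instead of the Workbench. The sketch below assumes the redis-py client is installed and that Redis is exposed on localhost:6379 as described above; `escape_query_term`, `build_search_command`, and `search` are hypothetical helper names. The escaping is deliberately conservative: it backslash-escapes punctuation that RediSearch may otherwise treat as a separator or query operator, and leaves spaces alone so multi-word queries keep working.

```python
def escape_query_term(term):
    """Backslash-escape punctuation so it is matched literally by FT.SEARCH."""
    special = set(",.<>{}[]\"':;!@#$%^&*()-+=~|/\\")
    return "".join("\\" + ch if ch in special else ch for ch in term)


def build_search_command(index, term, limit=10):
    """Build the argument list for an FT.SEARCH call with a result limit."""
    return ["FT.SEARCH", index, escape_query_term(term), "LIMIT", "0", str(limit)]


def search(term, host="localhost", port=6379):
    """Run the query against the idx:onion index and return the raw reply."""
    import redis  # redis-py, imported lazily so the helpers work without it
    client = redis.Redis(host=host, port=port, decode_responses=True)
    return client.execute_command(*build_search_command("idx:onion", term))
```

    Calling `search("bitcoin")` sends the same command as the Workbench example, with an explicit `LIMIT 0 10`.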
    

Production Considerations (Why this isn't Prod-Ready)

  • Persistence: The redis-data volume maps to your host. In production, this needs to be a managed volume with backups.
  • Network: A single Tor instance cannot handle high throughput. Production requires a load balancer (HAProxy) rotating requests across multiple Tor instances.
  • Sanitization: The current HTML parser is naive. It does not sanitize scraped content before storing it, so any web frontend that renders the stored data without escaping would be exposed to XSS.
  • Rate Limiting: No politeness policy is implemented. robots.txt is rarely honored on Tor, but fixed per-host delays are still needed to avoid overloading (effectively DoS-ing) small hidden services.
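
Two of the gaps above, per-host delays and escaping on output, could be sketched as follows. This is a minimal illustration and not part of the project: `PolitenessGate` and `escape_for_display` are hypothetical names, the 5-second default is an arbitrary assumption, and the clock and sleep functions are injectable only to make the behavior easy to check.

```python
import html
import time
from urllib.parse import urlparse


class PolitenessGate:
    """Enforces a minimum delay between requests to the same host."""

    def __init__(self, min_delay=5.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self._clock = clock
        self._sleep = sleep
        self._last_hit = {}  # host -> time of the most recent request

    def wait(self, url):
        """Sleep if the host was hit too recently; return seconds waited."""
        host = urlparse(url).hostname or ""
        now = self._clock()
        last = self._last_hit.get(host)
        waited = 0.0
        if last is not None:
            waited = max(0.0, self.min_delay - (now - last))
            if waited:
                self._sleep(waited)
        self._last_hit[host] = now + waited
        return waited


def escape_for_display(text):
    """Escape scraped text before embedding it in an HTML page."""
    return html.escape(text, quote=True)
```

A crawler loop would call `gate.wait(url)` before each fetch. Escaping at display time, rather than at storage time, keeps the stored text searchable in its original form.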

Troubleshooting

  • "Connection Refused" on Crawler: The Tor container needs a few seconds to finish bootstrapping (reach 100%). If the crawler starts before then, restart it: docker-compose restart crawler.
  • Disk Space: Crawling the web consumes space rapidly. Monitor the ./redis-data folder size.
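
Instead of restarting the crawler by hand, the worker could wait for the Tor SOCKS port before it starts fetching. The sketch below is one possible approach, with `wait_for_port` as a hypothetical helper. Note the limitation: an open port only shows that the Tor process is listening, not that bootstrapping has reached 100%, so early requests may still fail and should be retried.

```python
import socket
import time


def wait_for_port(host, port, attempts=30, delay=2.0, timeout=3.0):
    """Return True once a TCP connection to host:port succeeds, else False."""
    for attempt in range(attempts):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            # Connection refused or timed out; pause before the next attempt.
            if attempt < attempts - 1:
                time.sleep(delay)
    return False
```

Inside the crawler container this would be called as `wait_for_port("tor", 9050)`, assuming the Tor service is named `tor` on the Compose network.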

About

A Dark Web Search Engine
