OSINTai is an AI-enhanced web crawler engineered for Open Source Intelligence (OSINT) professionals, cybersecurity researchers, and digital investigators. Combining asynchronous processing, intelligent proxy rotation, and LLM-based analysis, OSINTai automates comprehensive intelligence gathering from web sources with high efficiency and accuracy.
Key Capabilities:
- High-Performance Async Crawling with intelligent concurrency controls
- AI-Powered Content Analysis using Ollama LLMs for structured intelligence extraction
- Advanced Indicator Mining with automated entity recognition and risk assessment
- Graph-Based Intelligence Mapping with ACE-T compatible export formats
- Operational Security featuring proxy rotation, user-agent randomization, and stealth techniques
- Hunt Mode for targeted intelligence discovery with configurable search terms
- Near-Duplicate Detection using Simhash algorithms to eliminate redundant content
- Intelligent Scoring and prioritization based on indicator density and risk factors
- Asynchronous Architecture: Concurrent processing with configurable global (18) and per-host (4) concurrency limits
- Intelligent Proxy Management: Health-scored proxy rotation with automatic failover and performance tracking
- Adaptive Throttling: Per-host rate limiting prevents overwhelming target sites while maximizing throughput
- Resume Capability: Automatic checkpointing and state persistence for interrupted operations
- Memory Efficient: Optimized for large-scale crawls with minimal resource overhead
- LLM Content Analysis: Structured intelligence extraction using Ollama models (granite3.2:latest)
- Vector Embeddings: Semantic content embeddings for clustering and similarity analysis (nomic-embed-text:latest)
- Risk Intelligence: Automated identification of suspicious patterns, high-risk indicators, and threat signals
- Contextual Analysis: Deep content understanding with entity relationships and temporal analysis
- Comprehensive Indicator Mining: Extracts emails, phone numbers, cryptocurrency addresses, social media handles, domains, IP addresses, and custom patterns
- Hunt Mode: Targeted crawling for specific intelligence terms with configurable lead discovery limits
- Content Deduplication: Simhash-based near-duplicate detection to eliminate redundant information
- Intelligence Scoring: Automated prioritization based on indicator density, risk flags, and content relevance
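The indicator mining described above can be approximated with plain regular expressions. The patterns below are illustrative simplifications for a sketch, not the exact patterns OSINTai's extractor ships with:

```python
import re

# Illustrative patterns only -- production matching needs stricter validation.
PATTERNS = {
    "email": re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"),
    "btc": re.compile(r"\b(?:bc1|[13])[a-km-zA-HJ-NP-Z1-9]{25,39}\b"),
    "handle": re.compile(r"(?<!\w)@[A-Za-z0-9_]{2,30}"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

def mine_indicators(text: str) -> dict[str, list[str]]:
    """Return de-duplicated, sorted indicator matches per category."""
    return {name: sorted(set(rx.findall(text))) for name, rx in PATTERNS.items()}

sample = "Contact ops@example.com or @handler42, pay 1BvBMSEYstWetqTFn5Au4m4GFg7xJaNVN2, C2 at 10.0.0.5"
print(mine_indicators(sample))
```

The negative lookbehind on the handle pattern keeps the `@domain` part of email addresses from being double-counted as a social media handle.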
OSINTai employs a sophisticated multi-factor scoring system that combines traditional indicator mining with AI-powered risk assessment to prioritize pages by intelligence value:
- Emails: 2.0 points each (max 10 emails = 20 points)
- Phone Numbers: 1.5 points each (max 5 phones = 7.5 points)
- BTC Addresses: 3.0 points each (max 3 addresses = 9 points)
- ETH Addresses: 3.0 points each (max 3 addresses = 9 points)
- Social Media Handles: 1.0 point each (max 5 handles = 5 points)
- Risk Flags: 5.0 points each (unlimited - highest priority intelligence)
- Examples: "Hacking Campaigns", "Data Breaches", "Regulatory Changes"
- Actionable Leads: 3.0 points each (unlimited - valuable insights)
- Examples: Investigation recommendations, strategic intelligence
- Key Entities: 1.0 point each (max 10 entities = 10 points)
- Named entities like organizations, people, technologies
- Key Locations: 1.5 points each (max 5 locations = 7.5 points)
- Geographic intelligence and operational locations
- Keywords: 0.5 points each (max 20 keywords = 10 points)
- Topic relevance and content categorization
- High-Value Page: Risk flags (10 pts) + actionable leads (9 pts) + entities (8 pts) = 27+ points
- Medium-Value Page: Entities (6 pts) + locations (4.5 pts) + keywords (8 pts) = 18.5 points
- Low-Value Page: Basic indicators only (emails, phones) = 5-15 points
Pages are automatically ranked by total score, with AI-enhanced intelligence receiving the highest prioritization for OSINT analysis.
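The weighting scheme above can be expressed as a short routine. The field names and the `score_page` helper here are illustrative; the shipped logic lives in `scoring.py`:

```python
# Illustrative reimplementation of the documented weights and caps.
WEIGHTS = {
    "emails":           (2.0, 10),   # (points each, max counted)
    "phones":           (1.5, 5),
    "btc_addresses":    (3.0, 3),
    "eth_addresses":    (3.0, 3),
    "social_handles":   (1.0, 5),
    "risk_flags":       (5.0, None), # uncapped
    "actionable_leads": (3.0, None), # uncapped
    "key_entities":     (1.0, 10),
    "key_locations":    (1.5, 5),
    "keywords":         (0.5, 20),
}

def score_page(counts: dict[str, int]) -> float:
    """Sum weighted indicator counts, clamping capped categories."""
    total = 0.0
    for field, (points, cap) in WEIGHTS.items():
        n = counts.get(field, 0)
        if cap is not None:
            n = min(n, cap)
        total += n * points
    return total

# The "high-value page" example: 2 risk flags + 3 leads + 8 entities.
print(score_page({"risk_flags": 2, "actionable_leads": 3, "key_entities": 8}))  # → 27.0
```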
- Graph Export: ACE-T compatible JSONL format with nodes, edges, and relationship mapping
- Multi-Format Reports: Structured JSON, JSONL, and human-readable intelligence summaries
- Visualization Ready: Compatible with GraphXR, Neo4j, NetworkX, and D3.js for advanced analysis
- Comprehensive Metadata: Full crawl state, analysis results, timestamps, and provenance tracking
- User-Agent Rotation: Extensive randomization pool to avoid detection and fingerprinting
- Proxy Anonymization: Built-in proxy health management with automatic rotation
- Configurable Delays: Adaptive timing controls to respect site policies and avoid rate limiting
- Domain Filtering: Optional same-domain restriction for focused, controlled analysis
- Ethical Design: Built for authorized OSINT research with responsible data handling
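A health-scored proxy rotation like the one described can be sketched as follows. The class name, weights, and recovery floor are hypothetical; the actual logic lives in `proxy_pool.py`:

```python
import random

class ProxyPool:
    """Hypothetical sketch: health-scored rotation with automatic failover."""

    def __init__(self, proxies: list[str]):
        self.scores = {p: 1.0 for p in proxies}  # every proxy starts healthy

    def pick(self) -> str:
        # Weighted choice: healthier proxies are selected more often.
        proxies = list(self.scores)
        return random.choices(proxies, weights=[self.scores[p] for p in proxies])[0]

    def report(self, proxy: str, ok: bool) -> None:
        # Exponential moving average of success; the 0.05 floor lets
        # a failing proxy recover instead of being starved forever.
        prev = self.scores[proxy]
        self.scores[proxy] = max(0.05, 0.8 * prev + 0.2 * (1.0 if ok else 0.0))

pool = ProxyPool(["http://10.0.0.1:8080", "http://10.0.0.2:8080"])
pool.report("http://10.0.0.2:8080", ok=False)  # failure lowers its weight
print(pool.pick())
```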
- Python 3.10+
- Conda (recommended for environment management)
- Ollama (optional, for AI-powered features)
# 1. Clone the repository
git clone https://github.com/yourusername/OSINTai.git
cd OSINTai
# 2. Create conda environment
conda create -n osintai python=3.10 beautifulsoup4 requests httpx anyio lxml -y
conda activate osintai
# 3. Install Ollama (for AI features)
brew install ollama
ollama pull granite3.2:latest
ollama pull nomic-embed-text:latest
# 4. Verify installation
python run_osintai.py --help

Alternatively, install the dependencies with pip:

pip install beautifulsoup4 requests httpx anyio lxml

Basic crawl:

python run_osintai.py \
  --seed "https://example.com" \
  --depth 2 \
  --max 150 \
  --same-domain

AI-powered analysis:

python run_osintai.py \
  --seed "https://target-site.com" \
  --model "granite3.2:latest" \
  --embed-model "nomic-embed-text:latest" \
  --concurrency 12

Targeted hunt mode:

python run_osintai.py \
  --seed "https://investigation-target.com" \
  --hunt "invoice,wire transfer,bank,crypto,telegram,darkweb" \
  --hunt-max 60 \
  --depth 3

High-speed crawl without AI:

python run_osintai.py \
  --seed "https://large-site.com" \
  --no-ollama \
  --max 1000 \
  --concurrency 32 \
  --per-host 6

Crawl through rotating proxies:

python run_osintai.py \
  --seed "https://target.com" \
  --proxies "proxies.txt" \
  --concurrency 8 \
  --per-host 2

Create a seed_urls.txt file in the project root:

https://site1.com
https://site2.com
https://site3.com

Then run:

python run_osintai.py --seed "https://example.com"  # Will also check for seed_urls.txt

usage: run_osintai.py [-h] [--seed SEED] [--depth DEPTH] [--max MAX]
[--same-domain] [--concurrency CONCURRENCY]
[--per-host PER_HOST] [--ua UA] [--proxies PROXIES]
[--model MODEL] [--embed-model EMBED_MODEL]
[--no-ollama] [--hunt HUNT] [--hunt-max HUNT_MAX]
[--run-id RUN_ID]
OSINTai v3.3 FULL (async + proxy + dedupe + embeddings + hunt + graph export)
required arguments:
--seed SEED Seed URL (or use seed_urls.txt file)
optional arguments:
--depth DEPTH Max crawl depth (default: 2)
--max MAX Max URLs to crawl (default: 150)
--same-domain Only crawl same domain as seed
--concurrency CONCURRENCY Global concurrency limit (default: 18)
--per-host PER_HOST Per-host concurrency limit (default: 4)
--ua UA User agents file (default: user_agents.txt)
--proxies PROXIES Optional proxy list file
--model MODEL Ollama analysis model (default: granite3.2:latest)
--embed-model EMBED_MODEL Ollama embeddings model (default: nomic-embed-text:latest)
--no-ollama Disable LLM analysis and embeddings
--hunt HUNT Comma-separated hunt terms (optional)
--hunt-max HUNT_MAX Max lead URLs per page from hunt mode (default: 50)
--run-id RUN_ID Optional run ID override (default: auto-generated)
Each crawl generates a timestamped directory under data/runs/ with comprehensive intelligence data:
- urls_crawled.jsonl - Complete crawl log with HTTP status, timestamps, and metadata
- indicators.jsonl - Extracted intelligence indicators with context and confidence scores
- page_scores.jsonl - Intelligence-scored pages with analysis results and risk assessments
- crawl_state.json - Resume state for interrupted operations
- analysis/ - Individual page intelligence analysis in JSON format
- embeddings/ - Vector embeddings for semantic clustering and similarity analysis
- hunt.jsonl - Targeted hunt mode discoveries (when hunt terms specified)
- report.txt - Human-readable executive summary with key findings
- ranked_pages.json - Structured intelligence prioritization and scoring
- graph_nodes.jsonl - Graph nodes for network visualization and analysis
- graph_edges.jsonl - Graph relationships and connections
- pages_raw/ - Original HTML content for forensic analysis
- pages_text/ - Extracted text content for processing and review
data/runs/2026-01-16_143022_investigation_001/
├── urls_crawled.jsonl # Complete crawl audit trail
├── indicators.jsonl # Intelligence indicators
├── page_scores.jsonl # Intelligence scoring
├── crawl_state.json # Resume capability
├── analysis/ # AI analysis results
│ ├── abc123.analysis.json
│ └── def456.analysis.json
├── embeddings/ # Vector embeddings
│ ├── abc123.embed.json
│ └── def456.embed.json
├── pages_raw/ # Raw HTML archive
├── pages_text/ # Text extraction
├── hunt.jsonl # Hunt mode results
├── report.txt # Executive summary
├── ranked_pages.json # Structured intelligence
├── graph_nodes.jsonl # Graph visualization
└── graph_edges.jsonl # Graph relationships
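Because the outputs are JSONL (one JSON object per line), they can be consumed with the standard library alone. The snippet below assumes each record in page_scores.jsonl carries a numeric "score" field; adjust to the actual schema of your run:

```python
import json
from pathlib import Path

def top_pages(run_dir: str, n: int = 10) -> list[dict]:
    """Load page_scores.jsonl from a run directory and return the
    n highest-scoring records (assumes a numeric "score" field)."""
    path = Path(run_dir) / "page_scores.jsonl"
    records = [
        json.loads(line)
        for line in path.read_text().splitlines()
        if line.strip()
    ]
    return sorted(records, key=lambda r: r.get("score", 0.0), reverse=True)[:n]
```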
Purpose: Request header randomization to avoid detection
Format: One user agent string per line
Location: Project root
Purpose: IP rotation and anonymity
Format: One proxy URL per line (http://ip:port or ip:port)
Location: Any accessible file path
Purpose: Batch processing of multiple starting points
Format: One URL per line
Location: Project root (auto-detected)
Example:
https://target1.com/investigation
https://target2.com/research
https://target3.com/analysis
# Phase 1: Broad Discovery (High Speed)
python run_osintai.py \
--seed "https://target.com" \
--max 500 \
--no-ollama \
--concurrency 32
# Phase 2: Deep AI Analysis (Quality over Speed)
python run_osintai.py \
--seed "https://target.com" \
--max 100 \
--model "granite3.2:latest" \
--concurrency 8
# Phase 3: Targeted Intelligence Hunt
python run_osintai.py \
--seed "https://target.com" \
--hunt "malware,ransomware,exploit,credential" \
--hunt-max 100 \
  --depth 4

# Enterprise-scale deep crawl
python run_osintai.py \
  --seed "https://enterprise-site.com" \
  --max 2000 \
  --concurrency 24 \
  --per-host 4 \
  --depth 5 \
  --run-id "enterprise_audit_2026"

# Using different Ollama models
python run_osintai.py \
  --seed "https://target.com" \
  --model "llama2:13b" \
  --embed-model "all-minilm:33m"

# Maximum data retention for investigations
python run_osintai.py \
  --seed "https://evidence-site.com" \
  --max 50 \
  --depth 3 \
  --run-id "forensic_case_123"

OSINTai generates ACE-T compatible graph data for advanced network analysis and visualization:
- Pages: Web pages with intelligence scores and metadata
- Indicators: Extracted entities (emails, domains, IPs, etc.)
- Relationships: Connections between entities and content
{"id": "page:https://example.com/intel", "type": "page", "label": "Intelligence Page", "props": {"title": "Secret Intel", "score": 25.7, "risk_flags": ["suspicious"]}, "ts": 1705411200.0}
{"id": "email:investigator@agency.gov", "type": "email", "label": "investigator@agency.gov", "props": {"confidence": 0.95}, "ts": 1705411200.0}

Example edge:

{"src": "page:https://example.com/intel", "dst": "email:investigator@agency.gov", "type": "mentions_email", "props": {"context": "contact information"}, "ts": 1705411200.0}

- GraphXR: Direct JSONL import for real-time graph exploration
- Neo4j: Enterprise graph database with Cypher queries
- NetworkX: Python graph analysis and algorithmic processing
- D3.js: Custom web-based visualizations and dashboards
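Since graph_nodes.jsonl and graph_edges.jsonl are plain JSONL, loading them into any of these tools starts with a few lines of parsing. This sketch builds an in-memory node table and adjacency map using only the standard library, following the node/edge field names shown in the examples above:

```python
import json
from collections import defaultdict

def load_graph(nodes_path: str, edges_path: str):
    """Parse ACE-T style JSONL exports into a node table and adjacency map."""
    nodes = {}
    with open(nodes_path) as f:
        for line in f:
            if line.strip():
                rec = json.loads(line)
                nodes[rec["id"]] = rec  # keyed by node id, e.g. "page:https://..."
    adjacency = defaultdict(list)
    with open(edges_path) as f:
        for line in f:
            if line.strip():
                rec = json.loads(line)
                adjacency[rec["src"]].append((rec["dst"], rec["type"]))
    return nodes, adjacency
```

From here, handing `nodes` and `adjacency` to NetworkX or serializing for D3.js is straightforward.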
- Global Concurrency: Total simultaneous requests (recommended: 12-24)
- Per-Host Concurrency: Domain-specific limits (recommended: 3-6)
- Memory Scaling: Reduce concurrency for large crawls (>1000 URLs)
- Proxy Distribution: Spread load across multiple IP addresses
- Delay Configuration: Adjust timing based on target site sensitivity
- Timeout Management: Increase for slow networks or international targets
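The interplay of global and per-host limits can be sketched with nested asyncio semaphores. The fetch body below is a placeholder (the real HTTP client with proxy rotation lives in `fetcher.py`):

```python
import asyncio
from collections import defaultdict
from urllib.parse import urlparse

# Matching the documented defaults: 18 global, 4 per host.
GLOBAL_LIMIT, PER_HOST_LIMIT = 18, 4

global_sem = asyncio.Semaphore(GLOBAL_LIMIT)
host_sems: dict[str, asyncio.Semaphore] = defaultdict(
    lambda: asyncio.Semaphore(PER_HOST_LIMIT)
)

async def fetch(url: str) -> str:
    """Acquire a global slot, then a per-host slot, before fetching."""
    host = urlparse(url).netloc
    async with global_sem, host_sems[host]:
        await asyncio.sleep(0.01)  # stand-in for the real HTTP request
        return url

async def crawl(urls: list[str]) -> list[str]:
    return list(await asyncio.gather(*(fetch(u) for u in urls)))
```

Acquiring the global semaphore first means a slow host can never consume more than its per-host share of the global budget.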
- Model Selection: Balance accuracy vs. speed (granite3.2:latest recommended)
- Batch Processing: AI analysis scales with available Ollama resources
- Embedding Optimization: Vector storage requires ~700KB per analyzed page
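The stored embedding vectors can be compared with plain cosine similarity for clustering and nearest-neighbour lookups. This is the generic formula, not OSINTai-specific code:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(cosine([1.0, 0.0], [1.0, 0.0]))  # → 1.0
```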
Ollama Connection Failed
# Verify Ollama service
ollama serve
ollama list
# Test model availability
ollama run granite3.2:latest "test"
# Fallback to non-AI mode
python run_osintai.py --seed "https://example.com" --no-ollama

Proxy Configuration Issues
# Validate proxy file format
head -5 proxies.txt
# Test proxy connectivity
curl -x http://proxy-ip:port https://httpbin.org/ip
# Run without proxies
python run_osintai.py --seed "https://example.com"

Memory/Resource Constraints
# Reduce concurrency for large crawls
python run_osintai.py \
--seed "https://example.com" \
--concurrency 8 \
--per-host 2 \
  --max 500

Rate Limiting Detection
# Increase delays between requests
# Modify fetcher.py: min_delay_s, max_delay_s defaults
# Or use proxy rotation for distribution

# Enable verbose logging (modify cli.py)
import logging
logging.basicConfig(level=logging.DEBUG, format='%(asctime)s - %(levelname)s - %(message)s')

src/osintai/
├── cli.py # Command-line interface and orchestration
├── crawler.py # Async crawling engine with deduplication
├── fetcher.py # HTTP client with proxy rotation and retry logic
├── extractor.py # Content parsing and indicator extraction
├── analyzer.py # Intelligence scoring and risk assessment
├── ollama_api.py # LLM integration for analysis and embeddings
├── proxy_pool.py # Proxy health management and rotation
├── graph_export.py # Graph data serialization and export
├── hunt.py # Targeted term discovery and lead generation
├── normalize.py # URL processing and normalization utilities
├── dedupe.py # Simhash-based content deduplication
├── scoring.py # Page ranking and intelligence prioritization
├── report.py # Human-readable report generation
├── storage.py # File I/O and data persistence utilities
└── __init__.py # Package initialization
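dedupe.py is described above as Simhash-based; the underlying technique can be sketched generically (this is a textbook 64-bit simhash over whitespace tokens, not the module's actual code):

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """Generic simhash: each token votes its hash bits up or down;
    the fingerprint keeps the bits with a positive tally."""
    votes = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if votes[i] > 0)

def hamming(a: int, b: int) -> int:
    """Bit distance between two fingerprints; near-duplicates score low."""
    return bin(a ^ b).count("1")

a = simhash("the quick brown fox jumps over the lazy dog")
b = simhash("the quick brown fox jumped over the lazy dog")
print(hamming(a, b))
```

Pages whose fingerprints fall within a small Hamming distance of an already-seen page can be skipped as near-duplicates.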
- Initialization: Parse arguments, load configurations, initialize components
- Crawling: Async HTTP fetching with concurrency controls and proxy rotation
- Processing: Content extraction, indicator mining, deduplication
- Analysis: LLM-powered intelligence extraction and embedding generation
- Scoring: Risk assessment, prioritization, and intelligence value calculation
- Persistence: Structured data export and graph generation
- Reporting: Human-readable summaries and visualization data
- Anonymization: Proxy rotation and user-agent randomization
- Stealth Techniques: Adaptive delays and request patterns
- Data Sanitization: No sensitive information logging
- Provenance Tracking: Complete audit trail for intelligence chain of custody
- Legal Compliance: Authorized access to publicly available information only
- Terms Respect: Honor site policies, robots.txt, and service agreements
- Data Handling: Secure storage and responsible intelligence dissemination
- Attribution: Maintain source credibility and investigation integrity
- Authorization: Obtain proper permissions for sensitive investigations
- Transparency: Document methodologies and data sources
- Impact Assessment: Consider potential consequences of findings
- Community Standards: Adhere to OSINT professional ethics and best practices
# Fork and clone
git clone https://github.com/yourusername/OSINTai.git
cd OSINTai
# Create development environment
conda create -n osintai-dev python=3.10 -y
conda activate osintai-dev
pip install -r requirements.txt
# Install development tools
pip install black flake8 pytest mypy

- Formatting: Black code formatter with 120-character line length
- Linting: flake8 for code quality and style consistency
- Type Hints: Full type annotation coverage
- Testing: Comprehensive unit and integration tests
- Documentation: Detailed docstrings and inline comments
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-enhancement)
- Implement changes with tests
- Ensure all tests pass (pytest)
- Format code (black .)
- Lint code (flake8)
- Commit with clear messages
- Push and create pull request
MIT License - See LICENSE file for complete terms.
OSINTai is developed exclusively for ethical open source intelligence research and authorized security investigations. This tool must not be used for unauthorized surveillance, data collection, or any illegal activities.
Users bear full responsibility for compliance with applicable laws, regulations, and terms of service. The authors assume no liability for misuse or unauthorized application of this software.
Obtain proper authorization before conducting any intelligence operations.
- Complete Async Rewrite: httpx + asyncio for 10x+ performance gains
- Advanced Proxy System: Health-scored rotation with intelligent failover
- Simhash Deduplication: Near-duplicate detection for content efficiency
- Ollama API Integration: Native LLM analysis and vector embeddings
- Hunt Mode: Targeted intelligence discovery with configurable parameters
- ACE-T Graph Export: Professional visualization and network analysis
- Performance Optimization: Concurrent processing with memory efficiency
- Resume Capability: Automatic state persistence and crash recovery
- Modular Architecture: Clean separation in the src/osintai/ package
- Code Cleanup: Removed legacy components and unused files
- Enhanced Documentation: Comprehensive README with usage examples
- Synchronous crawling architecture
- Basic indicator extraction
- Subprocess Ollama integration
- Limited scalability and performance
- Bug Reports: GitHub Issues
- Feature Requests: GitHub Discussions
- Documentation: Comprehensive in-code docstrings and this README
- Community: OSINT professional forums and security research communities
Built for the OSINT community with contributions from security researchers, digital investigators, and open source intelligence professionals worldwide.
Use responsibly. Research ethically. Impact positively.
OSINTai v3.3 - Illuminating the shadows of open source intelligence.
