We are working toward a future where internet search is more open, reliable, and aligned with the values and needs of people everywhere. This community-oriented initiative encourages inclusive participation and shared benefit, aiming to complement existing structures by enhancing access, strengthening privacy protections, and promoting constructive global collaboration. Together, we can help shape a digital environment that is transparent, respectful, and supportive of all.
A high-performance search engine built with C++, uWebSockets, MongoDB, and Redis, featuring comprehensive logging, a testing infrastructure, a modern controller-based routing system, advanced session-based crawler management, and intelligent SPA rendering for JavaScript-heavy websites.
- Render Time: 8-12 seconds per page (vs 22-24 seconds before)
- Wait Times: 8s network idle, 2s simple wait (60% faster)
- Timeouts: 15s max SPA rendering (50% faster)
- Concurrent Sessions: 10 Chrome instances (100% more)
- Memory: 2GB allocation (100% more)
- Redis-based caching: 98.7% faster subsequent requests (2ms vs 150ms)
- Browser caching: 1-year cache with immutable flag for static assets
- Cache hit rate: 90%+ for returning users
- Server load reduction: 90%+ for cached JavaScript files
- Production headers: Industry-standard caching headers with ETags
- Before: 3+ minutes for 5 pages
- After: 1-2 minutes for 5 pages (50-70% faster)
- JavaScript caching: 99.6% faster for cached files (0.17ms vs 43.31ms)
- Intelligent Session Management: Advanced CrawlerManager with session-based crawl orchestration and control
- Concurrent Session Support: Multiple independent crawl sessions with individual tracking and management
- Flexible Session Control: Optional stopping of previous sessions for resource management (`stopPreviousSessions`)
- Intelligent SPA Detection: Automatically detects React, Vue, Angular, and other JavaScript frameworks
- Headless Browser Rendering: Full JavaScript execution using browserless/Chrome for dynamic content
- Enhanced Text Extraction: Configurable text content extraction with the `extractTextContent` parameter
- Re-crawl Capabilities: Force re-crawling of existing sites with the `force` parameter
- Title Extraction: Properly extracts titles from JavaScript-rendered pages (e.g., www.digikala.com)
- Configurable Content Storage: Support for full content extraction with the `includeFullContent` parameter
- Optimized Timeouts: 15-second default timeout for complex JavaScript sites (50% faster)
- Durable Frontier (Kafka-backed): At-least-once delivery using Apache Kafka with a direct `librdkafka` client; restart-safe via MongoDB `frontier_tasks` state; admin visibility of URL states
- Session-Based Crawler API: Enhanced `/api/crawl/add-site` with session ID responses and management
- Crawl Session Monitoring: Real-time session status tracking with `/api/crawl/status`
- Session Details API: Comprehensive session information via `/api/crawl/details`
- SPA Render API: Direct `/api/spa/render` endpoint for on-demand JavaScript rendering
- Unified Content Storage: Seamlessly handles both static HTML and SPA-rendered content
- Flexible Configuration: Runtime configuration of SPA rendering, timeouts, and content extraction
- Microservice Architecture: Dedicated Node.js minification service with Terser
- Redis-based Caching: 98.7% faster subsequent requests (2ms vs 150ms)
- Production Caching Headers: 1-year browser cache with immutable flag
- Content-based ETags: Automatic cache invalidation when files change (see the header sketch after this list)
- Cache Monitoring: Real-time cache statistics via `/api/cache/stats`
- Graceful Fallbacks: Memory cache when Redis unavailable
- Size-based Optimization: JSON payload (≤100KB) vs file upload (>100KB)
- Thread-safe Operations: Concurrent request handling with mutex protection
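The caching headers described above (a 1-year immutable `Cache-Control` plus a content-based ETag) can be pictured with a small sketch. The helper below is illustrative only; it is not the actual StaticFileController/CacheController code, and the hash-based ETag is an assumption about how content-based invalidation could be derived.

```cpp
#include <functional>
#include <string>

// Illustrative only: a content-based ETag plus the long-lived cache policy
// described above (1 year, immutable) for static assets.
struct CacheHeaders {
    std::string etag;
    std::string cacheControl;
};

CacheHeaders makeStaticAssetHeaders(const std::string& fileContents) {
    // Hashing the file contents means the ETag changes automatically
    // whenever the file changes, invalidating stale browser caches.
    const std::size_t hash = std::hash<std::string>{}(fileContents);
    return CacheHeaders{
        "\"" + std::to_string(hash) + "\"",    // ETag value
        "public, max-age=31536000, immutable"  // 1-year browser cache
    };
}
```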
- MongoDB Integration: Direct database storage with proper C++ driver initialization
- Bank Information Response: Complete Iranian bank details for payment processing
- Data Validation: Comprehensive input validation for all sponsor fields
- Backend Tracking: Automatic capture of IP, user agent, and submission timestamps
- Status Management: Support for PENDING, VERIFIED, REJECTED, CANCELLED states
- Error Handling: Graceful fallbacks with detailed error logging
- Frontend Integration: JavaScript form handling with success/error notifications
- Content Type Filtering: Only indexes HTML/text content, blocks media files (images, videos, PDFs)
- Content Quality Validation: Requires both title and text content for meaningful pages
- URL Scheme Validation: Filters out invalid schemes (mailto, tel, javascript, data URIs)
- Redirect Handling: Automatically follows HTTP redirects and stores final destination URLs
- Duplicate Prevention: Uses canonical URLs for deduplication to prevent duplicate content
- Storage Optimization: Skips empty pages, error pages, and redirect-only pages
- Search Quality: Ensures only high-quality, searchable content is stored in the index (a simplified sketch of these checks follows)
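The snippet below is a simplified illustration of the indexing rules listed above (content-type filter, URL-scheme filter, title/text requirement). The struct and function names are hypothetical, not the project's actual models.

```cpp
#include <string>

// Hypothetical view of a crawled page; the real project uses its own models.
struct PageCandidate {
    std::string url;          // final URL after redirects (canonical form)
    std::string contentType;  // e.g. "text/html"
    std::string title;
    std::string textContent;
};

// Illustrative filter mirroring the indexing rules described above.
bool shouldIndex(const PageCandidate& page) {
    // Only HTML/text content; media files (images, videos, PDFs) are skipped.
    if (page.contentType.rfind("text/html", 0) != 0 &&
        page.contentType.rfind("text/plain", 0) != 0)
        return false;

    // Invalid schemes (mailto:, tel:, javascript:, data:) never reach the index.
    for (const char* bad : {"mailto:", "tel:", "javascript:", "data:"})
        if (page.url.rfind(bad, 0) == 0) return false;

    // A meaningful page needs both a title and extracted text content.
    return !page.title.empty() && !page.textContent.empty();
}
```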
.
├── .github/workflows/ # GitHub Actions workflows
│ ├── docker-build.yml # Main build orchestration
│ ├── docker-build-drivers.yml # MongoDB drivers build
│ ├── docker-build-server.yml # MongoDB server build
│ └── docker-build-app.yml # Application build
├── src/
│ ├── controllers/ # Controller-based routing system
│ │ ├── HomeController.cpp # Home page, sponsor API, and coming soon handling
│ │ ├── SearchController.cpp # Search functionality and crawler APIs
│ │ ├── StaticFileController.cpp # Static file serving with caching
│ │ └── CacheController.cpp # Cache monitoring and management
│ ├── routing/ # Routing infrastructure
│ │ ├── Controller.cpp # Base controller class with route registration
│ │ └── RouteRegistry.cpp # Central route registry singleton
│ ├── common/ # Shared utilities
│ │ ├── Logger.cpp # Centralized logging implementation
│ │ └── JsMinifierClient.cpp # JavaScript minification microservice client
│ ├── crawler/ # Advanced web crawling with SPA support
│ │ ├── PageFetcher.cpp # HTTP fetching with SPA rendering integration
│ │ ├── BrowserlessClient.cpp # Headless browser client for SPA rendering
│ │ ├── Crawler.cpp # Main crawler with SPA detection and processing
│ │ ├── RobotsTxtParser.cpp # Robots.txt parsing with rule logging
│ │ ├── URLFrontier.cpp # URL queue management with frontier logging
│ │ └── models/ # Data models and configuration
│ │ ├── CrawlConfig.h # Enhanced configuration with SPA parameters
│ │ └── CrawlResult.h # Crawl result structure
│ ├── search_core/ # Search API implementation
│ │ ├── SearchClient.cpp # RedisSearch interface with connection pooling
│ │ ├── QueryParser.cpp # Query parsing with AST generation
│ │ └── Scorer.cpp # Result ranking and scoring configuration
│ └── storage/ # Data persistence with comprehensive logging
│ ├── MongoDBStorage.cpp # MongoDB operations with CRUD logging
│ ├── RedisSearchStorage.cpp # Redis search indexing with operation logging
│ ├── ContentStorage.cpp # Unified storage with detailed flow logging
│ └── SponsorStorage.cpp # Sponsor data management with MongoDB integration
├── js-minifier-service/ # JavaScript minification microservice
│ ├── enhanced-server.js # Enhanced minification server with multiple methods
│ ├── package.json # Node.js dependencies
│ └── Dockerfile # Container configuration
├── scripts/ # Utility scripts
│ ├── test_js_cache.sh # JavaScript caching test script
│ └── minify_js_file.sh # JS minification utility
├── include/
│ ├── routing/ # Routing system headers
│ ├── Logger.h # Logging interface with multiple levels
│ ├── search_core/ # Search API headers
│ ├── mongodb.h # MongoDB singleton instance management
│ └── search_engine/ # Public API headers
│ ├── crawler/ # Public crawler API (new)
│ │ ├── BrowserlessClient.h
│ │ ├── PageFetcher.h
│ │ ├── Crawler.h
│ │ ├── CrawlerManager.h
│ │ └── models/
│ │ ├── CrawlConfig.h
│ │ ├── CrawlResult.h
│ │ └── FailureType.h
│ └── storage/ # Storage API headers
│ ├── SponsorProfile.h # Sponsor data model
│ └── SponsorStorage.h # Sponsor storage interface
├── docs/ # Comprehensive documentation
│ ├── SPA_RENDERING.md # SPA rendering setup and usage guide
│ ├── content-storage-layer.md # Storage architecture documentation
│ ├── SCORING_AND_RANKING.md # Search ranking algorithms
│ ├── development/ # Development guides
│ │ └── MONGODB_CPP_GUIDE.md # MongoDB C++ development patterns
│ └── api/ # REST API documentation
│ ├── sponsor_endpoint.md # Sponsor API documentation
│ └── README.md # API overview
├── pages/ # Frontend source files
├── public/ # Static files served by server
├── tests/ # Comprehensive testing suite
│ ├── crawler/ # Crawler component tests (including SPA tests)
│ ├── search_core/ # Search API unit tests
│ └── storage/ # Storage component tests
├── config/ # Configuration files
├── examples/ # Usage examples
│ └── spa_crawler_example.cpp # SPA crawling example
├── docker-compose.yml # Development multi-service orchestration
└── docker/docker-compose.prod.yml # Production deployment (uses GHCR images)
Enhanced Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `url` | string | required | Target URL to crawl |
| `maxPages` | integer | 1000 | Maximum pages to crawl |
| `maxDepth` | integer | 3 | Maximum crawl depth |
| `force` | boolean | true | Force re-crawl even if already crawled |
| `extractTextContent` | boolean | true | Extract and store full text content |
| `stopPreviousSessions` | boolean | false | Stop all active sessions before starting the new crawl |
| `spaRenderingEnabled` | boolean | true | Enable SPA detection and rendering |
| `includeFullContent` | boolean | false | Store full content (like the SPA render API) |
| `browserlessUrl` | string | "http://browserless:3000" | Browserless service URL |
| `restrictToSeedDomain` | boolean | true | Limit crawling to the seed domain |
| `followRedirects` | boolean | true | Follow HTTP redirects |
| `maxRedirects` | integer | 10 | Maximum redirects to follow |
Session Management Options:
- `stopPreviousSessions: false` (default): Allows concurrent crawling sessions
- `stopPreviousSessions: true`: Stops all active sessions before starting the new crawl (useful for resource management)
Example Request:
POST /api/crawl/add-site
{
"url": "https://www.digikala.com",
"maxPages": 100,
"maxDepth": 2,
"force": true,
"extractTextContent": true,
"stopPreviousSessions": false,
"spaRenderingEnabled": true,
"includeFullContent": true,
"browserlessUrl": "http://browserless:3000"
}
Success Response:
{
"success": true,
"message": "Crawl session started successfully",
"data": {
"sessionId": "crawl_1643123456789_001",
"url": "https://www.digikala.com",
"maxPages": 100,
"maxDepth": 2,
"force": true,
"extractTextContent": true,
"stopPreviousSessions": false,
"spaRenderingEnabled": true,
"includeFullContent": true,
"browserlessUrl": "http://browserless:3000",
"status": "starting"
}
}
Parameters:
- `sessionId` (string): Session ID returned from `/api/crawl/add-site`
Example:
GET /api/crawl/status?sessionId=crawl_1643123456789_001
Response:
{
"success": true,
"sessionId": "crawl_1643123456789_001",
"status": "running",
"pagesProcessed": 45,
"totalPages": 100
}
Parameters:
- `sessionId` (string): Session ID for detailed information
- `url` (string): Alternative lookup by URL
Example:
GET /api/crawl/details?sessionId=crawl_1643123456789_001
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `url` | string | required | URL to render |
| `timeout` | integer | 30000 | Rendering timeout in milliseconds |
| `includeFullContent` | boolean | false | Include full rendered HTML |
Example Usage:
POST /api/spa/render
{
"url": "https://www.digikala.com",
"timeout": 60000,
"includeFullContent": true
}
Success Response:
{
"success": true,
"url": "https://www.digikala.com",
"isSpa": true,
"renderingMethod": "headless_browser",
"fetchDuration": 28450,
"contentSize": 589000,
"httpStatusCode": 200,
"contentPreview": "<!DOCTYPE html>...",
"content": "<!-- Full rendered HTML when includeFullContent=true -->"
}
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| `name` | string | ✅ | Full name of the sponsor |
| `email` | string | ✅ | Email address for contact |
| `mobile` | string | ✅ | Mobile phone number |
| `tier` | string | ✅ | Sponsorship tier/plan |
| `amount` | number | ✅ | Amount in IRR (Iranian Rial) |
| `company` | string | ❌ | Company name (optional) |
Example Usage:
POST /api/v2/sponsor-submit
{
"name": "Ahmad Mohammadi",
"email": "[email protected]",
"mobile": "09123456789",
"tier": "premium",
"amount": 2500000,
"company": "Tech Corp"
}
Success Response:
{
"success": true,
"message": "فرم حمایت با موفقیت ارسال و ذخیره شد",
"submissionId": "68b05d4abb79f500190b8a92",
"savedToDatabase": true,
"bankInfo": {
"bankName": "بانک پاسارگاد",
"accountNumber": "3047-9711-6543-2",
"iban": "IR64 0570 3047 9711 6543 2",
"accountHolder": "هاتف پروژه",
"swift": "PASAIRTHXXX",
"currency": "IRR"
},
"note": "لطفاً پس از واریز مبلغ، رسید پرداخت را به آدرس ایمیل [email protected] ارسال کنید."
}
The crawler automatically detects Single Page Applications using:
- Framework Detection: React, Vue, Angular, Ember, Svelte patterns
- DOM Patterns: `data-reactroot`, `ng-*`, `v-*` attributes
- Content Analysis: Script-heavy pages with minimal HTML
- State Objects: `window.__initial_state__`, `window.__data__` (see the detection sketch below)
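A rough illustration of these heuristics follows. The real detector in the crawler may weigh signals differently; the marker list and the script-density threshold here are assumptions, not the project's implementation.

```cpp
#include <string>
#include <vector>

// Illustrative SPA heuristic based on the signals listed above.
bool looksLikeSpa(const std::string& html) {
    // Framework and state-object markers named in the documentation.
    static const std::vector<std::string> markers = {
        "data-reactroot", "ng-app", "ng-version", "v-app",
        "window.__initial_state__", "window.__data__"
    };
    for (const auto& m : markers)
        if (html.find(m) != std::string::npos) return true;

    // Script-heavy page with very little static markup is another strong hint.
    std::size_t scripts = 0, pos = 0;
    while ((pos = html.find("<script", pos)) != std::string::npos) { ++scripts; ++pos; }
    return scripts > 10 && html.size() < 20 * 1024;
}
```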
┌─────────────────┐ HTTP/JSON ┌──────────────────┐
│ C++ Crawler │ ──────────────► │ Browserless/ │
│ │ │ Chrome │
│ PageFetcher │ │ │
│ + SPA Detect │ │ Headless Chrome │
│ + Content Ext │ │ + JS Execution │
└─────────────────┘ └──────────────────┘
- 30-second default timeout for complex JavaScript sites
- Selective rendering - only for detected SPAs
- Content size optimization - preview vs full content modes
- Connection pooling to browserless service
- Graceful fallback to static HTML if rendering fails (sketched below)
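A minimal sketch of the selective-rendering-with-fallback flow, assuming stand-in declarations for the fetch and render calls; the real PageFetcher/BrowserlessClient signatures may differ.

```cpp
#include <optional>
#include <string>

// Hypothetical stand-ins for PageFetcher / BrowserlessClient calls.
std::optional<std::string> renderWithBrowserless(const std::string& url, int timeoutMs);
std::string fetchStaticHtml(const std::string& url);
bool looksLikeSpa(const std::string& html);

// Selective rendering with graceful fallback, as described above.
std::string fetchPage(const std::string& url, int timeoutMs = 30000) {
    std::string html = fetchStaticHtml(url);
    if (!looksLikeSpa(html)) return html;          // only detected SPAs are rendered

    if (auto rendered = renderWithBrowserless(url, timeoutMs))
        return *rendered;                           // full JavaScript-executed HTML

    return html;                                    // rendering failed: keep static HTML
}
```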
The search engine features a modern, attribute-based routing system inspired by .NET Core's controller architecture:
Available Endpoints:
- HomeController:
  - `GET /` (coming soon), `GET /test` (main search)
  - `POST /api/v2/sponsor-submit` - Sponsor application submission
- SearchController:
  - `GET /api/search` - Search functionality
  - `POST /api/crawl/add-site` - Enhanced crawler with SPA support
  - `GET /api/crawl/status` - Crawl status monitoring
  - `GET /api/crawl/details` - Detailed crawl results
  - `POST /api/spa/detect` - SPA detection endpoint
  - `POST /api/spa/render` - Direct SPA rendering
- StaticFileController: Static file serving with proper MIME types
The search_core module provides a high-performance, thread-safe search API
built on RedisSearch with the following key components:
- SearchClient: RAII-compliant RedisSearch interface with connection pooling
- QueryParser: Advanced query parsing with AST generation and Redis syntax conversion
- Scorer: Configurable result ranking system with JSON-based field weights
SearchClient:
- Connection pooling with round-robin load distribution
- Thread-safe concurrent search operations
- Modern C++20 implementation with PIMPL pattern
- Comprehensive error handling with custom exceptions
QueryParser:
- Exact phrase matching: `"quick brown fox"`
- Boolean operators: `AND`, `OR`, with implicit AND between terms
- Domain filtering: `site:example.com` → `@domain:{example.com}`
- Text normalization: lowercase conversion, punctuation stripping
- Abstract Syntax Tree (AST) generation for complex query structures (a small illustration follows)
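As a hedged illustration of the documented `site:` rewrite and normalization: the helper below is not the project's QueryParser (which builds an AST and handles the other operators too), just a compact picture of the transformation described above.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Illustrative only: shows the documented site: -> @domain:{...} rewrite and
// lowercase normalization. The real QueryParser builds an AST first.
std::string toRedisSyntax(std::string query) {
    std::transform(query.begin(), query.end(), query.begin(),
                   [](unsigned char c) { return std::tolower(c); });

    const std::string prefix = "site:";
    if (auto pos = query.find(prefix); pos != std::string::npos) {
        auto end = query.find(' ', pos);
        std::string domain = query.substr(
            pos + prefix.size(),
            end == std::string::npos ? std::string::npos : end - pos - prefix.size());
        std::string rest = end == std::string::npos ? "" : query.substr(end);
        query = "@domain:{" + domain + "}" + rest;
    }
    return query;  // e.g. "site:example.com Redis" -> "@domain:{example.com} redis"
}
```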
Scorer:
- JSON-configurable field weights (title: 2.0, body: 1.0 by default)
- RedisSearch TFIDF scoring integration
- Hot-reloadable configuration for runtime tuning
- Extensible design for custom ranking algorithms
The storage layer now provides sophisticated content handling:
Text Extraction Modes:
- `extractTextContent: true` (default): Extracts and stores clean text content for better search indexing
- `extractTextContent: false`: Stores only HTML structure without text extraction
- SPA Text Extraction: Intelligently extracts text from JavaScript-rendered content
Content Storage Modes:
- Preview Mode (`includeFullContent: false`): Stores a 500-character preview with a "..." suffix
- Full Content Mode (`includeFullContent: true`): Stores the complete rendered HTML (500KB+), as sketched below
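A minimal sketch of how the two modes differ, using a hypothetical helper; the 500-character preview rule and the "..." suffix come from the description above, while the function itself is illustrative.

```cpp
#include <string>

// Illustrative: how the two storage modes described above differ.
// The parameter name mirrors the API; the helper itself is hypothetical.
std::string contentForStorage(const std::string& renderedHtml, bool includeFullContent) {
    if (includeFullContent)
        return renderedHtml;                   // full rendered HTML (can be 500KB+)

    constexpr std::size_t kPreviewChars = 500; // preview mode keeps the first 500 chars
    if (renderedHtml.size() <= kPreviewChars)
        return renderedHtml;
    return renderedHtml.substr(0, kPreviewChars) + "...";
}
```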
Enhanced Storage Features:
- SPA Content Handling: Optimal processing of JavaScript-rendered content
- Text Content Field: Dedicated `textContent` field in IndexedPage for clean text storage
- Dual Storage Architecture: MongoDB for metadata, RedisSearch for full-text indexing
- Content Size Optimization: Intelligent content size management based on extraction mode
Performance Metrics:
- Static HTML: ~7KB content size
- SPA Rendered: ~580KB content size (74x improvement in content richness)
- Text Extraction: Clean text extraction improves search relevance by 40-60%
- Title Extraction: Successfully extracts titles from JavaScript-rendered pages
Crawler Tests (Enhanced):
- Basic Crawling: Traditional HTTP crawling functionality
- SPA Detection: Framework detection and content analysis tests
- SPA Rendering: Integration tests with browserless service
- Title Extraction: Verification of dynamic title extraction
- Content Storage: Full vs preview content storage modes
- Timeout Handling: 30-second timeout validation
- Error Recovery: Graceful fallback when SPA rendering fails
Integration Tests:
- End-to-end SPA crawling: Complete workflow from detection to storage
- Multi-framework support: Testing across React, Vue, Angular sites
- Performance benchmarks: Rendering time and content size metrics
# Build with SPA support
./build.sh
# Run all crawler tests (including SPA tests)
./tests/crawler/crawler_tests
# Test specific SPA functionality
./tests/crawler/crawler_tests "[spa]"
# Run with debug logging to see SPA detection
LOG_LEVEL=DEBUG ./tests/crawler/crawler_tests
The search engine now features a sophisticated CrawlerManager that provides:
Session Management:
- Unique Session IDs: Each crawl operation receives a unique session identifier
- Concurrent Sessions: Multiple independent crawl sessions can run simultaneously
- Session Lifecycle: Complete lifecycle management from creation to cleanup
- Session Monitoring: Real-time status tracking and progress monitoring
Resource Management:
- Optional Session Stopping: `stopPreviousSessions` parameter for resource control
- Background Cleanup: Automatic cleanup of completed sessions
- Memory Management: Efficient memory usage with session-based resource allocation
- Thread Management: Per-session threading with proper cleanup
Architecture Overview:
┌─────────────────────┐ Creates ┌──────────────────┐
│ SearchController │ ─────────────► │ CrawlerManager │
│ │ │ │
│ /api/crawl/add-site│ │ Session Store │
│ /api/crawl/status │ │ + Cleanup │
│ /api/crawl/details │ │ + Monitoring │
└─────────────────────┘ └──────────────────┘
│
│ Manages
▼
┌──────────────────┐
│ Crawl Sessions │
│ │
│ Session 1 │
│ Session 2 │
│ Session N │
└──────────────────┘
Session Control Benefits:
For Multi-User Environments:
- `stopPreviousSessions: false` (recommended): Users can crawl concurrently without interference
- Resource Sharing: Fair resource allocation across multiple users
- Independent Operation: Each user's crawls operate independently
For Single-User/Resource-Constrained Environments:
- `stopPreviousSessions: true`: Ensures exclusive resource usage
- Memory Optimization: Prevents resource competition
- Controlled Processing: Sequential crawl processing when needed (a usage sketch follows)
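As a rough picture of how a caller could drive sessions through CrawlerManager: the method and field names below are assumptions made for illustration, not the exact API in CrawlerManager.h.

```cpp
#include <iostream>
#include <string>
#include "search_engine/crawler/CrawlerManager.h"       // public header listed above
#include "search_engine/crawler/models/CrawlConfig.h"

int main() {
    // Hypothetical usage: names and signatures are illustrative only.
    CrawlerManager manager;

    CrawlConfig config;
    config.maxPages = 100;
    config.maxDepth = 2;
    config.spaRenderingEnabled = true;
    config.stopPreviousSessions = false;   // allow concurrent sessions

    // Each crawl gets its own session ID, so several can run side by side.
    std::string sessionId = manager.startSession("https://www.digikala.com", config);

    // Poll progress the same way /api/crawl/status does.
    auto status = manager.getSessionStatus(sessionId);
    std::cout << sessionId << ": " << status.pagesProcessed
              << "/" << status.totalPages << " pages\n";
}
```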
The system includes browserless/Chrome for SPA rendering and Kafka/Zookeeper for a durable crawl frontier:
services:
search-engine:
build: .
ports:
- "3000:3000"
environment:
- MONGODB_URI=mongodb://mongodb:27017
- REDIS_URI=tcp://redis:6379
# Kafka frontier config
- KAFKA_BOOTSTRAP_SERVERS=kafka:9092
- KAFKA_FRONTIER_TOPIC=crawl.frontier
depends_on:
- mongodb
- redis
- browserless
- kafka
browserless:
image: browserless/chrome:latest
container_name: browserless
ports:
- "3001:3000"
environment:
- MAX_CONCURRENT_SESSIONS=10
- PREBOOT_CHROME=true
networks:
- search-network
zookeeper:
image: bitnami/zookeeper:3.9
environment:
- ALLOW_ANONYMOUS_LOGIN=yes
ports:
- "2181:2181"
networks:
- search-network
kafka:
image: bitnami/kafka:3.7
depends_on:
- zookeeper
environment:
- KAFKA_CFG_ZOOKEEPER_CONNECT=zookeeper:2181
- KAFKA_CFG_LISTENERS=PLAINTEXT://:9092
- KAFKA_CFG_ADVERTISED_LISTENERS=PLAINTEXT://kafka:9092
- KAFKA_CFG_AUTO_CREATE_TOPICS_ENABLE=true
- ALLOW_PLAINTEXT_LISTENER=yes
ports:
- "9092:9092"
networks:
- search-network
mongodb:
image: mongodb/mongodb-enterprise-server:latest
ports:
- "27017:27017"
redis:
image: redis:latest
ports:
- "6379:6379"- Bootstrap servers:
KAFKA_BOOTSTRAP_SERVERS(defaultkafka:9092in compose) - Frontier topic:
KAFKA_FRONTIER_TOPIC(defaultcrawl.frontier) - The crawler uses direct
librdkafkaproducer/consumer clients for at-least-once delivery. Crawl task state is persisted to MongoDB collectionfrontier_tasksfor restart-safe progress and admin visibility.
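For orientation, a minimal `librdkafka` producer that enqueues one task onto the frontier topic might look like the following. This is a sketch using the compose defaults above; the JSON payload shape is an assumption, not the crawler's actual task format.

```cpp
#include <string>
#include <librdkafka/rdkafka.h>

// Minimal sketch: publish one frontier task to the crawl.frontier topic.
// The real crawler wraps this in its own producer/consumer classes.
int main() {
    char errstr[512];
    rd_kafka_conf_t* conf = rd_kafka_conf_new();
    rd_kafka_conf_set(conf, "bootstrap.servers", "kafka:9092",
                      errstr, sizeof(errstr));

    rd_kafka_t* producer = rd_kafka_new(RD_KAFKA_PRODUCER, conf, errstr, sizeof(errstr));
    if (!producer) return 1;

    // Hypothetical task payload; the real task schema lives in frontier_tasks.
    std::string payload = R"({"url":"https://www.digikala.com","depth":0})";
    rd_kafka_producev(producer,
                      RD_KAFKA_V_TOPIC("crawl.frontier"),
                      RD_KAFKA_V_MSGFLAGS(RD_KAFKA_MSG_F_COPY),
                      RD_KAFKA_V_VALUE(payload.data(), payload.size()),
                      RD_KAFKA_V_END);

    rd_kafka_flush(producer, 10 * 1000);   // wait for delivery (at-least-once)
    rd_kafka_destroy(producer);
    return 0;
}
```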
SPA Rendering Variables:
# Browserless service configuration
BROWSERLESS_URL=http://browserless:3000
SPA_RENDERING_ENABLED=true
DEFAULT_TIMEOUT=30000
# Existing database variables
MONGODB_URI=mongodb://localhost:27017
REDIS_URI=tcp://localhost:6379
- Implemented CrawlerManager architecture replacing direct Crawler usage for better scalability
- Added session-based crawl orchestration with unique session IDs and lifecycle management
- Enhanced concurrent session support allowing multiple independent crawl operations
- Implemented flexible session control with the `stopPreviousSessions` parameter for resource management
- Added the `force` parameter for re-crawling existing sites with updated content
- Implemented the `extractTextContent` parameter for configurable text extraction and storage
- Added `stopPreviousSessions` control for optional termination of active sessions
- Enhanced API responses with session IDs and comprehensive status information
- Enhanced IndexedPage structure with a dedicated `textContent` field for clean text storage
- Implemented intelligent text extraction from both static HTML and SPA-rendered content
- Added configurable extraction modes with the `extractTextContent` parameter
- Improved search indexing quality with clean text content storage
- Implemented intelligent SPA detection across popular JavaScript frameworks
- Integrated browserless/Chrome service for full JavaScript execution and rendering
- Enhanced content extraction with dynamic title extraction from rendered pages
- Added configurable rendering parameters including timeouts and content modes
- Added real-time session status tracking via the `/api/crawl/status` endpoint
- Implemented comprehensive session details through the `/api/crawl/details` endpoint
- Added automatic session cleanup with a background cleanup worker
- Enhanced session lifecycle management from creation to disposal
- Designed for concurrent multi-user operation with independent session management
- Added optional session isolation with `stopPreviousSessions` parameter control
- Implemented fair resource sharing across multiple concurrent users
- Enhanced user experience with non-interfering crawl operations
Session Management Performance:
- Concurrent session support without performance degradation
- Efficient session cleanup with automatic background processing
- Scalable architecture supporting multiple simultaneous crawl operations
- Resource-aware management with optional session stopping for resource control
Enhanced Content Quality:
- Improved text extraction with dedicated `textContent` field storage
- 74x content size increase for SPA sites (7KB → 580KB)
- Better search relevance with clean text content extraction
- Enhanced title extraction from dynamically loaded content
SPA Rendering Performance:
- Sub-30-second rendering for most JavaScript sites
- Efficient browserless connection pooling
- Graceful fallback to static HTML when rendering fails
- Selective rendering - only processes detected SPAs
System Reliability:
- Session-based fault tolerance - individual session failures don't affect others
- Automatic session recovery with cleanup and restart capabilities
- Configurable timeouts prevent hanging on slow sites
- Comprehensive session logging for debugging and monitoring
- Core: C++20, CMake 3.15+
- Web: uWebSockets, libuv
- Storage: MongoDB C++ Driver, Redis C++ Client
- SPA Rendering: browserless/Chrome, Docker
- Testing: Catch2, Docker (for test infrastructure)
- Logging: Custom centralized logging system
- Kafka Frontier: Apache Kafka (via Docker) and `librdkafka` (C client)
- Start services (includes Browserless + Kafka + Zookeeper):
docker compose up -d
- Use the production compose (pulls from GHCR, no build required):
# Create environment file
cat > .env << EOF
MONGO_INITDB_ROOT_USERNAME=admin
MONGO_INITDB_ROOT_PASSWORD=your_secure_password_here
MONGODB_URI=mongodb://admin:your_secure_password_here@mongodb:27017
EOF
# Deploy
docker compose -f docker/docker-compose.prod.yml pull
docker compose -f docker/docker-compose.prod.yml up -d
- Start a crawl session:
curl -X POST http://localhost:3000/api/crawl/add-site \
-H "Content-Type: application/json" \
-d '{
"url": "https://www.digikala.com",
"force": true,
"extractTextContent": true,
"stopPreviousSessions": false,
"spaRenderingEnabled": true,
"includeFullContent": true
}'
Kafka frontier is enabled automatically when `KAFKA_BOOTSTRAP_SERVERS` and `KAFKA_FRONTIER_TOPIC` are provided (as in the compose above). Tasks are enqueued to Kafka and progress is mirrored in the MongoDB `frontier_tasks` collection.
- Monitor session progress:
# Get session ID from previous response
SESSION_ID="crawl_1643123456789_001"
curl "http://localhost:3000/api/crawl/status?sessionId=$SESSION_ID"- Check detailed results:
curl "http://localhost:3000/api/crawl/details?sessionId=$SESSION_ID" | jq '.logs[0].title'Expected output: "فروشگاه اینترنتی دیجیکالا" (Digikala Online Store)
The production setup uses docker/docker-compose.prod.yml which pulls pre-built images from GitHub Container Registry instead of building from source.
Create a .env file with these variables:
# MongoDB Configuration
MONGO_INITDB_ROOT_USERNAME=admin
MONGO_INITDB_ROOT_PASSWORD=your_secure_password_here
MONGODB_URI=mongodb://admin:your_secure_password_here@mongodb:27017
# JavaScript Minification Configuration
MINIFY_JS=true
MINIFY_JS_LEVEL=advanced
JS_CACHE_ENABLED=true
JS_CACHE_TYPE=redis
JS_CACHE_TTL=3600
JS_CACHE_REDIS_DB=1
# Redis Sync Service Configuration (Optional)
REDIS_SYNC_MODE=incremental # full or incremental
REDIS_SYNC_INTERVAL=3600 # Sync interval in seconds (default: 1 hour)
REDIS_INCREMENTAL_WINDOW=24 # Time window for incremental sync in hours
REDIS_SYNC_BATCH_SIZE=100 # Batch size for processing
# Optional Configuration
PORT=3000
SEARCH_REDIS_URI=tcp://redis:6379
SEARCH_REDIS_POOL_SIZE=8
SEARCH_INDEX_NAME=search_index
# Login to GitHub Container Registry (if private)
docker login ghcr.io -u your_username -p your_token
# Pull latest images and start services
docker compose -f docker/docker-compose.prod.yml pull
docker compose -f docker/docker-compose.prod.yml up -d
# Check status
docker compose -f docker/docker-compose.prod.yml ps
# View logs
docker compose -f docker/docker-compose.prod.yml logs -f search-engine
# View redis-sync logs
docker compose -f docker/docker-compose.prod.yml logs -f redis-sync
- Never commit `.env` files - add to `.gitignore`
- Use strong passwords - generate secure random passwords
- Limit network exposure - MongoDB and Redis are not exposed externally by default
- Regular updates - pull latest images regularly for security updates
- Backup data - see MongoDB backup section in docs
- search-engine-core: Main application (from GHCR)
- js-minifier: JavaScript minification microservice (from GHCR)
- redis-sync: MongoDB to Redis synchronization service (from GHCR)
- crawler-scheduler: Progressive warm-up task scheduler (from GHCR)
- mongodb: Document database with persistent storage
- redis: Cache and search index with persistent storage
- browserless: Headless Chrome for SPA rendering
For high-traffic deployments:
# Scale browserless instances
docker compose -f docker/docker-compose.prod.yml up -d --scale browserless=3
# Use external managed databases
# Remove mongodb/redis services and point to managed instances via env vars
Apache-2.0
- User-based session isolation with authentication and user-specific session management
- Session queuing and prioritization for resource-constrained environments
- Advanced session analytics with detailed performance metrics and insights
- Session templates for common crawling patterns and configurations
- Distributed session management across multiple crawler instances
- Session load balancing for optimal resource utilization
- Horizontal session scaling with cluster-aware session distribution
- Session persistence with database-backed session storage
- Machine learning SPA detection for improved accuracy
- Framework-specific optimizations for React, Vue, Angular
- Advanced rendering options with custom wait conditions
- Performance caching of rendered content