Enhance the session management system to support:
- Session queuing and prioritization for resource-constrained environments.
Crawl sessions should be processed based on priority, and retries should follow defined policies.
Motivation:
- Optimize resource usage in environments with limited crawling capacity.
- Ensure high-priority sessions are processed first.
- Support retry policies for failed URLs.
Acceptance Criteria:
- Sessions are queued in URLFrontier or an equivalent structure.
- Each session can have a priority (LOW, NORMAL, HIGH).
- Retry queue with exponential backoff is implemented.
- System handles resource limits gracefully.
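To make the queuing and priority criteria concrete, here is a minimal sketch of a priority-ordered session queue with a concurrency cap. It is not the existing URLFrontier or CrawlerManager API: `Priority`, `SessionQueue`, `max_active`, `submit`, `start_next`, and `finish` are hypothetical names chosen for illustration, and sessions are represented by plain string IDs.

```python
import heapq
import itertools
from enum import IntEnum
from typing import Optional


class Priority(IntEnum):
    """Priority levels from the acceptance criteria (numeric values are an assumption)."""
    LOW = 0
    NORMAL = 1
    HIGH = 2


class SessionQueue:
    """Hypothetical queue: highest priority first, FIFO within a priority level."""

    def __init__(self, max_active: int) -> None:
        self._heap = []                      # entries: (-priority, sequence, session_id)
        self._counter = itertools.count()    # tie-breaker preserves submission order
        self._max_active = max_active        # assumed resource limit: concurrent sessions
        self._active = set()

    def submit(self, session_id: str, priority: Priority = Priority.NORMAL) -> None:
        # heapq is a min-heap, so the priority is negated to pop HIGH first.
        heapq.heappush(self._heap, (-int(priority), next(self._counter), session_id))

    def start_next(self) -> Optional[str]:
        """Return the next session ID to run, or None if at capacity or empty."""
        if len(self._active) >= self._max_active or not self._heap:
            return None
        _, _, session_id = heapq.heappop(self._heap)
        self._active.add(session_id)
        return session_id

    def finish(self, session_id: str) -> None:
        """Release the capacity slot held by a finished session."""
        self._active.discard(session_id)

    def __len__(self) -> int:
        return len(self._heap)  # number of sessions still waiting
```

With `max_active=2` and sessions submitted as `a` (LOW), `b` (HIGH), `c` (NORMAL), `d` (HIGH), repeated calls to `start_next()` yield `b`, then `d`, then `None` until one of them is finished. Returning `None` instead of raising is one possible reading of "handles resource limits gracefully"; a real implementation might block, reject, or emit a metric instead.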
Suggested Tasks:
- Extend CrawlSession and CrawlerManager to support priorities.
- Implement queue data structure and session scheduling.
- Add retry and backoff logic for failed sessions.
- Update APIs and metrics to reflect session priorities.
- Write unit and integration tests for prioritization and queuing.
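The retry and backoff task could be sketched as follows. This is an illustration under stated assumptions, not a definitive design: `RetryPolicy`, `RetryQueue`, and all parameter names and defaults (`base_delay`, `factor`, `max_delay`, `max_attempts`) are hypothetical, and the caller passes the current time explicitly so the logic stays deterministic in unit tests. An item can be a failed URL or a session ID.

```python
import heapq
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class RetryPolicy:
    """Hypothetical policy; the defaults are placeholders, not project values."""
    base_delay: float = 1.0   # seconds before the first retry
    factor: float = 2.0       # multiplier applied per additional attempt
    max_delay: float = 60.0   # upper bound on any single delay
    max_attempts: int = 5     # retries allowed before giving up

    def delay_for(self, attempt: int) -> float:
        """Capped exponential delay before retry number `attempt` (1-based)."""
        return min(self.base_delay * self.factor ** (attempt - 1), self.max_delay)


class RetryQueue:
    """Holds failed items until their backoff delay has elapsed."""

    def __init__(self, policy: RetryPolicy) -> None:
        self._policy = policy
        self._heap = []   # entries: (ready_at, sequence, item, attempt)
        self._seq = 0     # tie-breaker for items that become ready at the same time

    def schedule(self, item: str, attempt: int, now: float) -> bool:
        """Queue a retry; return False once the attempt budget is exhausted."""
        if attempt > self._policy.max_attempts:
            return False
        ready_at = now + self._policy.delay_for(attempt)
        heapq.heappush(self._heap, (ready_at, self._seq, item, attempt))
        self._seq += 1
        return True

    def pop_ready(self, now: float) -> List[Tuple[str, int]]:
        """Remove and return every (item, attempt) whose delay has elapsed."""
        ready = []
        while self._heap and self._heap[0][0] <= now:
            _, _, item, attempt = heapq.heappop(self._heap)
            ready.append((item, attempt))
        return ready
```

Items returned by `pop_ready` would be resubmitted to the main session queue, and a `False` result from `schedule` is the point at which a permanent failure could be recorded in metrics. Adding random jitter to `delay_for` is a common refinement when many items fail at once, but it is omitted here to keep the delays predictable.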