This journal tracks significant feature additions, bug fixes, and architectural decisions in the crawl4ai project. It serves as both documentation and a historical record of the project's evolution.
Feature: Configurable content source for markdown generation
Changes Made:
- Added
content_source: str = "cleaned_html"parameter toMarkdownGenerationStrategyclass - Updated
DefaultMarkdownGeneratorto accept and pass the content source parameter - Renamed the
cleaned_htmlparameter toinput_htmlin thegenerate_markdownmethod - Modified
AsyncWebCrawler.aprocess_htmlto select the appropriate HTML source based on the generator's config - Added
preprocess_html_for_schemaimport inasync_webcrawler.py
Implementation Details:
- Added a new
content_sourceparameter to specify which HTML input to use for markdown generation - Options include: "cleaned_html" (default), "raw_html", and "fit_html"
- Used a dictionary dispatch pattern in
aprocess_htmlto select the appropriate HTML source - Added proper error handling with fallback to cleaned_html if content source selection fails
- Ensured backward compatibility by defaulting to "cleaned_html" option
Files Modified:
crawl4ai/markdown_generation_strategy.py: Added content_source parameter and updated the method signaturecrawl4ai/async_webcrawler.py: Added HTML source selection logic and updated imports
Examples:
- Created
docs/examples/content_source_example.pydemonstrating how to use the new parameter
Challenges:
- Maintaining backward compatibility while reorganizing the parameter flow
- Ensuring proper error handling for all content source options
- Making the change with minimal code modifications
Why This Feature: The content source selection feature allows users to choose which HTML content to use as input for markdown generation:
- "cleaned_html" - Uses the post-processed HTML after scraping strategy (original behavior)
- "raw_html" - Uses the original raw HTML directly from the web page
- "fit_html" - Uses the preprocessed HTML optimized for schema extraction
This feature provides greater flexibility in how users generate markdown, enabling them to:
- Capture more detailed content from the original HTML when needed
- Use schema-optimized HTML when working with structured data
- Choose the approach that best suits their specific use case
Feature: Comprehensive stress testing framework using arun_many and the dispatcher system to evaluate performance, concurrency handling, and identify potential issues under high-volume crawling scenarios.
Changes Made:
- Created a dedicated stress testing framework in the
benchmarking/(or similar) directory. - Implemented local test site generation (
SiteGenerator) with configurable heavy HTML pages. - Added basic memory usage tracking (
SimpleMemoryTracker) using platform-specific commands (avoidingpsutildependency for this specific test). - Utilized
CrawlerMonitorfromcrawl4aifor rich terminal UI and real-time monitoring of test progress and dispatcher activity. - Implemented detailed result summary saving (JSON) and memory sample logging (CSV).
- Developed
run_benchmark.pyto orchestrate tests with predefined configurations. - Created
run_all.shas a simple wrapper forrun_benchmark.py.
Implementation Details:
- Generates a local test site with configurable pages containing heavy text and image content.
- Uses Python's built-in
http.serverfor local serving, minimizing network variance. - Leverages
crawl4ai'sarun_manymethod for processing URLs. - Utilizes
MemoryAdaptiveDispatcherto manage concurrency via themax_sessionsparameter (note: memory adaptation features requirepsutil, not used bySimpleMemoryTracker). - Tracks memory usage via
SimpleMemoryTracker, recording samples throughout test execution to a CSV file. - Uses
CrawlerMonitor(which uses therichlibrary) for clear terminal visualization and progress reporting directly from the dispatcher. - Stores detailed final metrics in a JSON summary file.
Files Created/Updated:
stress_test_sdk.py: Main stress testing implementation usingarun_many.benchmark_report.py: (Assumed) Report generator for comparing test results.run_benchmark.py: Test runner script with predefined configurations.run_all.sh: Simple bash script wrapper forrun_benchmark.py.USAGE.md: Comprehensive documentation on usage and interpretation (updated).
Testing Approach:
- Creates a controlled, reproducible test environment with a local HTTP server.
- Processes URLs using
arun_many, allowing the dispatcher to manage concurrency up tomax_sessions. - Optionally logs per-batch summaries (when not in streaming mode) after processing chunks.
- Supports different test sizes via
run_benchmark.pyconfigurations. - Records memory samples via platform commands for basic trend analysis.
- Includes cleanup functionality for the test environment.
Challenges:
- Ensuring proper cleanup of HTTP server processes.
- Getting reliable memory tracking across platforms without adding heavy dependencies (
psutil) to this specific test script. - Designing
run_benchmark.pyto correctly pass arguments tostress_test_sdk.py.
Why This Feature:
The high volume stress testing solution addresses critical needs for ensuring Crawl4AI's arun_many reliability:
- Provides a reproducible way to evaluate performance under concurrent load.
- Allows testing the dispatcher's concurrency control (
max_session_permit) and queue management. - Enables performance tuning by observing throughput (
URLs/sec) under differentmax_sessionssettings. - Creates a controlled environment for testing
arun_manybehavior. - Supports continuous integration by providing deterministic test conditions for
arun_many.
Design Decisions:
- Chose local site generation for reproducibility and isolation from network issues.
- Utilized the built-in
CrawlerMonitorfor real-time feedback, leveraging itsrichintegration. - Implemented optional per-batch logging in
stress_test_sdk.py(when not streaming) to provide chunk-level summaries alongside the continuous monitor. - Adopted
arun_manywith aMemoryAdaptiveDispatcheras the core mechanism for parallel execution, reflecting the intended SDK usage. - Created
run_benchmark.pyto simplify running standard test configurations. - Used
SimpleMemoryTrackerto provide basic memory insights without requiringpsutilfor this particular test runner.
Future Enhancements to Consider:
- Create a separate test variant that does use
psutilto specifically stress the memory-adaptive features of the dispatcher. - Add support for generated JavaScript content.
- Add support for Docker-based testing with explicit memory limits.
- Enhance
benchmark_report.pyto provide more sophisticated analysis of performance and memory trends from the generated JSON/CSV files.
Changes Made:
- Corrected
run_benchmark.pyandstress_test_sdk.pyto use--max-sessionsinstead of the incorrect--workersparameter, accurately reflecting dispatcher configuration. - Updated
run_benchmark.pyargument handling to correctly pass all relevant custom parameters (including--stream,--monitor-mode, etc.) tostress_test_sdk.py. - (Assuming changes in
benchmark_report.py) Applied dark theme to benchmark reports for better readability. - (Assuming changes in
benchmark_report.py) Improved visualization code to eliminate matplotlib warnings. - Updated
run_benchmark.pyto provide clickablefile://links to generated reports in the terminal output. - Updated
USAGE.mdwith comprehensive parameter descriptions reflecting the final script arguments. - Updated
run_all.shwrapper to correctly invokerun_benchmark.pywith flexible arguments.
Details of Changes:
-
Parameter Correction (
--max-sessions):- Identified the fundamental misunderstanding where
--workerswas used incorrectly. - Refactored
stress_test_sdk.pyto accept--max-sessionsand configure theMemoryAdaptiveDispatcher'smax_session_permitaccordingly. - Updated
run_benchmark.pyargument parsing and command construction to use--max-sessions. - Updated
TEST_CONFIGSinrun_benchmark.pyto usemax_sessions.
- Identified the fundamental misunderstanding where
-
Argument Handling (
run_benchmark.py):- Improved logic to collect all command-line arguments provided to
run_benchmark.py. - Ensured all relevant arguments (like
--stream,--monitor-mode,--port,--use-rate-limiter, etc.) are correctly forwarded when callingstress_test_sdk.pyas a subprocess.
- Improved logic to collect all command-line arguments provided to
-
Dark Theme & Visualization Fixes (Assumed in
benchmark_report.py):- (Describes changes assumed to be made in the separate reporting script).
-
Clickable Links (
run_benchmark.py):- Added logic to find the latest HTML report and PNG chart in the
benchmark_reportsdirectory afterbenchmark_report.pyruns. - Used
pathlibto generate correctfile://URLs for terminal output.
- Added logic to find the latest HTML report and PNG chart in the
-
Documentation Improvements (
USAGE.md):- Rewrote sections to explain
arun_many, dispatchers, and--max-sessions. - Updated parameter tables for all scripts (
stress_test_sdk.py,run_benchmark.py). - Clarified the difference between batch and streaming modes and their effect on logging.
- Updated examples to use correct arguments.
- Rewrote sections to explain
Files Modified:
stress_test_sdk.py: Changed--workersto--max-sessions, added new arguments, usedarun_many.run_benchmark.py: Changed argument handling, updated configs, callsstress_test_sdk.py.run_all.sh: Updated to callrun_benchmark.pycorrectly.USAGE.md: Updated documentation extensively.benchmark_report.py: (Assumed modifications for dark theme and viz fixes).
Testing:
- Verified that
--max-sessionscorrectly limits concurrency via theCrawlerMonitoroutput. - Confirmed that custom arguments passed to
run_benchmark.pyare forwarded tostress_test_sdk.py. - Validated clickable links work in supporting terminals.
- Ensured documentation matches the final script parameters and behavior.
Why These Changes:
These refinements correct the fundamental approach of the stress test to align with crawl4ai's actual architecture and intended usage:
- Ensures the test evaluates the correct components (
arun_many,MemoryAdaptiveDispatcher). - Makes test configurations more accurate and flexible.
- Improves the usability of the testing framework through better argument handling and documentation.
Future Enhancements to Consider:
- Add support for generated JavaScript content to test JS rendering performance
- Implement more sophisticated memory analysis like generational garbage collection tracking
- Add support for Docker-based testing with memory limits to force OOM conditions
- Create visualization tools for analyzing memory usage patterns across test runs
- Add benchmark comparisons between different crawler versions or configurations
Changes Made:
- Fixed custom parameter handling in run_benchmark.py
- Applied dark theme to benchmark reports for better readability
- Improved visualization code to eliminate matplotlib warnings
- Added clickable links to generated reports in terminal output
- Enhanced documentation with comprehensive parameter descriptions
Details of Changes:
-
Custom Parameter Handling Fix
- Identified bug where custom URL count was being ignored in run_benchmark.py
- Rewrote argument handling to use a custom args dictionary
- Properly passed parameters to the test_simple_stress.py command
- Added better UI indication of custom parameters in use
-
Dark Theme Implementation
- Added complete dark theme to HTML benchmark reports
- Applied dark styling to all visualization components
- Used Nord-inspired color palette for charts and graphs
- Improved contrast and readability for data visualization
- Updated text colors and backgrounds for better eye comfort
-
Matplotlib Warning Fixes
- Resolved warnings related to improper use of set_xticklabels()
- Implemented correct x-axis positioning for bar charts
- Ensured proper alignment of bar labels and data points
- Updated plotting code to use modern matplotlib practices
-
Documentation Improvements
- Created comprehensive USAGE.md with detailed instructions
- Added parameter documentation for all scripts
- Included examples for all common use cases
- Provided detailed explanations for interpreting results
- Added troubleshooting guide for common issues
Files Modified:
tests/memory/run_benchmark.py: Fixed custom parameter handlingtests/memory/benchmark_report.py: Added dark theme and fixed visualization warningstests/memory/run_all.sh: Added clickable links to reportstests/memory/USAGE.md: Created comprehensive documentation
Testing:
- Verified that custom URL counts are now correctly used
- Confirmed dark theme is properly applied to all report elements
- Checked that matplotlib warnings are no longer appearing
- Validated clickable links to reports work in terminals that support them
Why These Changes: These improvements address several usability issues with the stress testing system:
- Better parameter handling ensures test configurations work as expected
- Dark theme reduces eye strain during extended test review sessions
- Fixing visualization warnings improves code quality and output clarity
- Enhanced documentation makes the system more accessible for future use
Future Enhancements:
- Add additional visualization options for different types of analysis
- Implement theme toggle to support both light and dark preferences
- Add export options for embedding reports in other documentation
- Create dedicated CI/CD integration templates for automated testing
Feature: MHTML snapshot capture of crawled pages
Changes Made:
- Added
capture_mhtml: bool = Falseparameter toCrawlerRunConfigclass - Added
mhtml: Optional[str] = Nonefield toCrawlResultmodel - Added
mhtml_data: Optional[str] = Nonefield toAsyncCrawlResponseclass - Implemented
capture_mhtml()method inAsyncPlaywrightCrawlerStrategyclass to capture MHTML via CDP - Modified the crawler to capture MHTML when enabled and pass it to the result
Implementation Details:
- MHTML capture uses Chrome DevTools Protocol (CDP) via Playwright's CDP session API
- The implementation waits for page to fully load before capturing MHTML content
- Enhanced waiting for JavaScript content with requestAnimationFrame for better JS content capture
- We ensure all browser resources are properly cleaned up after capture
Files Modified:
crawl4ai/models.py: Added the mhtml field to CrawlResultcrawl4ai/async_configs.py: Added capture_mhtml parameter to CrawlerRunConfigcrawl4ai/async_crawler_strategy.py: Implemented MHTML capture logiccrawl4ai/async_webcrawler.py: Added mapping from AsyncCrawlResponse.mhtml_data to CrawlResult.mhtml
Testing:
- Created comprehensive tests in
tests/20241401/test_mhtml.pycovering:- Capturing MHTML when enabled
- Ensuring mhtml is None when disabled explicitly
- Ensuring mhtml is None by default
- Capturing MHTML on JavaScript-enabled pages
Challenges:
- Had to improve page loading detection to ensure JavaScript content was fully rendered
- Tests needed to be run independently due to Playwright browser instance management
- Modified test expected content to match actual MHTML output
Why This Feature: The MHTML capture feature allows users to capture complete web pages including all resources (CSS, images, etc.) in a single file. This is valuable for:
- Offline viewing of captured pages
- Creating permanent snapshots of web content for archival
- Ensuring consistent content for later analysis, even if the original site changes
Future Enhancements to Consider:
- Add option to save MHTML to file
- Support for filtering what resources get included in MHTML
- Add support for specifying MHTML capture options
Feature: Comprehensive capturing of network requests/responses and browser console messages during crawling
Changes Made:
- Added
capture_network_requests: bool = Falseandcapture_console_messages: bool = Falseparameters toCrawlerRunConfigclass - Added
network_requests: Optional[List[Dict[str, Any]]] = Noneandconsole_messages: Optional[List[Dict[str, Any]]] = Nonefields to bothAsyncCrawlResponseandCrawlResultmodels - Implemented event listeners in
AsyncPlaywrightCrawlerStrategy._crawl_web()to capture browser network events and console messages - Added proper event listener cleanup in the finally block to prevent resource leaks
- Modified the crawler flow to pass captured data from AsyncCrawlResponse to CrawlResult
Implementation Details:
- Network capture uses Playwright event listeners (
request,response, andrequestfailed) to record all network activity - Console capture uses Playwright event listeners (
consoleandpageerror) to record console messages and errors - Each network event includes metadata like URL, headers, status, and timing information
- Each console message includes type, text content, and source location when available
- All captured events include timestamps for chronological analysis
- Error handling ensures even failed capture attempts won't crash the main crawling process
Files Modified:
crawl4ai/models.py: Added new fields to AsyncCrawlResponse and CrawlResultcrawl4ai/async_configs.py: Added new configuration parameters to CrawlerRunConfigcrawl4ai/async_crawler_strategy.py: Implemented capture logic using event listenerscrawl4ai/async_webcrawler.py: Added data transfer from AsyncCrawlResponse to CrawlResult
Documentation:
- Created detailed documentation in
docs/md_v2/advanced/network-console-capture.md - Added feature to site navigation in
mkdocs.yml - Updated CrawlResult documentation in
docs/md_v2/api/crawl-result.md - Created comprehensive example in
docs/examples/network_console_capture_example.py
Testing:
- Created
tests/general/test_network_console_capture.pywith tests for:- Verifying capture is disabled by default
- Testing network request capturing
- Testing console message capturing
- Ensuring both capture types can be enabled simultaneously
- Checking correct content is captured in expected formats
Challenges:
- Initial implementation had synchronous/asynchronous mismatches in event handlers
- Needed to fix type of property access vs. method calls in handlers
- Required careful cleanup of event listeners to prevent memory leaks
Why This Feature: The network and console capture feature provides deep visibility into web page activity, enabling:
- Debugging complex web applications by seeing all network requests and errors
- Security analysis to detect unexpected third-party requests and data flows
- Performance profiling to identify slow-loading resources
- API discovery in single-page applications
- Comprehensive analysis of web application behavior
Future Enhancements to Consider:
- Option to filter captured events by type, domain, or content
- Support for capturing response bodies (with size limits)
- Aggregate statistics calculation for performance metrics
- Integration with visualization tools for network waterfall analysis
- Exporting captures in HAR format for use with external tools