This directory contains tools for benchmarking GhostRoll performance and identifying bottlenecks.
`benchmark.py` runs comprehensive performance benchmarks on key GhostRoll operations:
- File Hashing (sequential and parallel) - SHA256 computation
- Database Queries - SQLite query performance with indexes
- Image Processing (sequential and parallel) - JPEG resizing and processing
- File Copying - File I/O operations
- File Scanning - Directory traversal performance
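For context, the sequential and parallel hashing paths that this benchmark exercises can be sketched as follows (the function names here are illustrative, not GhostRoll's actual API):

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def hash_file(path, chunk_size=65536):
    """Compute the SHA256 digest of a file, reading in 64 KiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        while chunk := fh.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def hash_files_parallel(paths, workers=4):
    """Hash many files concurrently; result order matches the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(hash_file, paths))
```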
`analyze_benchmark.py` analyzes benchmark results and provides:
- Identification of the slowest operations
- Throughput analysis
- Parallelization efficiency analysis
- Performance recommendations
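The "slowest operations" ranking amounts to sorting entries by total time. A minimal sketch; the entry schema (`"name"`, `"total_time"`) is an assumption and may differ from the actual `results.json` layout:

```python
def slowest_operations(results, top_n=3):
    """Rank benchmark entries by total time, descending.

    Each entry is assumed to look like {"name": str, "total_time": float};
    adapt the keys to the real benchmark output schema.
    """
    ranked = sorted(results, key=lambda r: r["total_time"], reverse=True)
    return [(r["name"], r["total_time"]) for r in ranked[:top_n]]
```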
```bash
# Basic benchmark run
python benchmark.py

# Save results to JSON file
python benchmark.py --output results.json

# Run with profiling (shows top functions by time)
python benchmark.py --profile

# Analyze benchmark results
python analyze_benchmark.py results.json
```

Example analysis output:

```
Benchmark Analysis - Bottleneck Identification
======================================================================

Slowest Operations (by total time):
----------------------------------------------------------------------
  1. image_processing_parallel    0.169s (10 ops,  59.21 ops/sec)
  2. image_processing             0.154s (10 ops,  64.86 ops/sec)
  3. file_hashing_parallel        0.033s (10 ops, 300.47 ops/sec)

Parallelization Analysis:
----------------------------------------------------------------------
image_processing:
  Sequential: 0.154s
  Parallel:   0.169s
  Speedup:    0.91x
  ⚠️ Parallel is SLOWER - overhead may be too high
```
Metrics reported:

- Total Time: Total time for all operations
- Throughput: Operations per second
- Mean/Median: Average and median per-operation time
- Speedup: Parallel vs. sequential speedup ratio

Warning signs:

- Slow Operations: Operations taking >100ms per item
- Low Throughput: <10 ops/sec for I/O operations
- Poor Parallelization: Parallel version slower than sequential
- Slow Database Queries: >10ms for simple queries (check indexes)
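The Throughput and Speedup metrics are simple ratios; a minimal sketch:

```python
def throughput(ops, total_time):
    """Operations per second; guards against a zero elapsed time."""
    return ops / total_time if total_time > 0 else float("inf")


def speedup(sequential_time, parallel_time):
    """Sequential time divided by parallel time; >1.0 means parallel wins."""
    return sequential_time / parallel_time
```

With the sample numbers above, `speedup(0.154, 0.169)` is about 0.91, i.e. the parallel image-processing run was slower than the sequential one.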
- Image Processing is Slow: This is expected - JPEG processing is CPU-intensive
  - Solution: Use parallel processing with more workers for large batches
  - Consider: Optimize PIL operations or use faster image libraries
- Parallel Overhead: For small workloads (<20 files), parallel overhead may exceed benefits
  - Solution: Only use parallel processing for larger batches
  - Consider: Adjust worker count based on workload size
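One way to act on this is a size-based switch, sketched below (the threshold of 20 and the helper name are illustrative, not values taken from GhostRoll):

```python
from concurrent.futures import ThreadPoolExecutor

PARALLEL_THRESHOLD = 20  # illustrative cutoff; tune it from benchmark results


def process_batch(process_one, items, workers=4):
    """Run sequentially for small batches, in parallel above the threshold."""
    if len(items) < PARALLEL_THRESHOLD:
        return [process_one(item) for item in items]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_one, items))
```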
- Database Queries Fast: With proper indexes, queries should be <1ms
  - If slow: Verify that indexes exist (see `ghostroll/db.py`)
Edit `benchmark.py` to adjust:
- Number of test files/images
- File sizes
- Number of workers
- Test data complexity
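These knobs typically live as module-level constants; the names and values below are purely illustrative, not the actual contents of `benchmark.py`:

```python
# Illustrative tuning knobs (assumed names, not benchmark.py's real ones):
NUM_TEST_FILES = 100        # number of synthetic test files/images generated
FILE_SIZE_BYTES = 1 << 20   # 1 MiB per synthetic test file
NUM_WORKERS = 4             # worker count for the parallel benchmark variants
```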
You can integrate benchmarks into CI/CD:

```bash
# Run benchmarks and check for regressions
python benchmark.py --output ci_results.json
python analyze_benchmark.py ci_results.json > benchmark_report.txt
```

Based on benchmark results:
- Use Parallel Processing for:
  - Large batches (>50 files)
  - CPU-bound operations (image processing)
  - I/O-bound operations (hashing, copying)
- Use Sequential Processing for:
  - Small batches (<20 files)
  - Operations with high overhead
  - When simplicity is preferred
- Database Optimization:
  - Ensure indexes exist (see `ghostroll/db.py`)
  - Use batch queries instead of one-by-one
  - Consider connection pooling for high concurrency
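A batched lookup can be sketched with a single `IN` query instead of one query per row (the `files` table and its columns are assumptions for illustration, not `ghostroll/db.py`'s real schema):

```python
import sqlite3


def fetch_hashes_batch(conn, paths):
    """Fetch path -> hash for many paths in one query, not one per path."""
    if not paths:
        return {}
    placeholders = ",".join("?" for _ in paths)
    rows = conn.execute(
        f"SELECT path, hash FROM files WHERE path IN ({placeholders})", paths
    ).fetchall()
    return dict(rows)
```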
- Image Processing:
  - Parallel processing helps significantly for large batches
  - Consider worker count = CPU cores for CPU-bound work
  - Use fewer workers for I/O-bound operations
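The worker-count guidance above can be condensed into a small heuristic (halving the core count for I/O-bound work is an illustrative choice, not a measured optimum):

```python
import os


def worker_count(cpu_bound):
    """One worker per core for CPU-bound work; fewer for disk-bound work,
    where too many workers just cause I/O contention."""
    cores = os.cpu_count() or 1
    return cores if cpu_bound else max(1, cores // 2)
```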