All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- sudoers file ownership: postinst script now sets correct ownership (root:root) to avoid sudoers permission error.
- Timing package: New `internal/timing` package with `Estimator` and `MovingAverage` for calculating stage weights and progress based on data transfer rates.
- Dynamic progress weight estimation: Stage weights (download vs convert split) are now calculated dynamically based on historical rates and image sizes, replacing the static 50/50 split.
- Stage timing metrics: New Prometheus metrics:
  - `libvirt_volume_provisioner_stage_duration_seconds` - histogram for download/convert stage durations
  - `libvirt_volume_provisioner_stage_throughput_bytes_per_second` - gauge for current throughput per stage
- Stuck progress at 55%: Progress reporting now works on first run with no historical data by using default rates (100 MB/s download, 200 MB/s convert).
- qemu-img convert output: Confirmed correct output target is device path, not stdout (test coverage added).
- Default rates: Updated to 100 MB/s download, 200 MB/s convert (was 300/500 MB/s).
- Progress algorithm: Uses `timing.Estimator` for weight calculation and time-based progress ticks during convert stage.
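
As a rough illustration of the dynamic weight split described above, the following minimal sketch divides progress between the two stages in proportion to their expected durations. The helper `stageWeights` and its signature are hypothetical and are not taken from `internal/timing`; it also assumes both stages process the same number of bytes, which may not hold for compressed images. The default rates are the ones quoted in this changelog.

```go
package main

import "fmt"

// Default rates quoted in the changelog, in bytes per second.
const (
	defaultDownloadRate = 100e6 // 100 MB/s
	defaultConvertRate  = 200e6 // 200 MB/s
)

// stageWeights (hypothetical helper) splits overall progress between the
// download and convert stages in proportion to their expected durations.
// A rate <= 0 means "no history yet" and falls back to the default rate.
func stageWeights(imageBytes, downloadRate, convertRate float64) (download, convert float64) {
	if downloadRate <= 0 {
		downloadRate = defaultDownloadRate
	}
	if convertRate <= 0 {
		convertRate = defaultConvertRate
	}
	tDownload := imageBytes / downloadRate
	tConvert := imageBytes / convertRate
	total := tDownload + tConvert
	if total <= 0 {
		return 0.5, 0.5 // degenerate input: keep the old static split
	}
	return tDownload / total, tConvert / total
}

func main() {
	// 2 GB image on a first run with no historical data.
	d, c := stageWeights(2e9, 0, 0)
	fmt.Printf("download=%.2f convert=%.2f\n", d, c)
}
```

With the default rates the download stage is expected to take twice as long as the convert stage, so it receives roughly two thirds of the progress range.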
- Database schema for stage rates: New `stage_rates` table to store historical performance data for progress estimation.
- Rate-based progress estimation: Collects download and conversion rates, uses historical averages with defaults (300 Mb/s download, 500 Mb/s convert) for accurate ETA reporting.
- LVM volume population failure: `qemu-img convert` failed on block devices with "Cannot grow device files" error. Fixed by streaming output directly to device file using native Go I/O, avoiding device resize issues.
- Progress reporting: Now uses estimated completion times based on rate data instead of hardcoded percentages.
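
The rate-based estimation in the entries above could look roughly like the sketch below: average the recorded rates, fall back to a default when there is no history, and turn remaining bytes into a duration. The function names and the fallback value used in `main` are illustrative assumptions, not the provisioner's actual code or schema.

```go
package main

import (
	"fmt"
	"time"
)

// averageRate returns the mean of the positive samples (bytes per second),
// or fallback when no usable history exists.
func averageRate(samples []float64, fallback float64) float64 {
	var sum float64
	var n int
	for _, s := range samples {
		if s > 0 {
			sum += s
			n++
		}
	}
	if n == 0 {
		return fallback
	}
	return sum / float64(n)
}

// estimateRemaining converts remaining bytes and a rate into a duration.
// It returns 0 when the inputs cannot produce a meaningful estimate.
func estimateRemaining(remainingBytes, rate float64) time.Duration {
	if rate <= 0 || remainingBytes <= 0 {
		return 0
	}
	return time.Duration(remainingBytes / rate * float64(time.Second))
}

func main() {
	history := []float64{} // no rows recorded yet (hypothetical input)
	rate := averageRate(history, 300e6)
	fmt.Println(estimateRemaining(600e6, rate))
}
```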
- Image cache mechanism: Three bugs caused cached images to be re-downloaded on every request:
  - The local file checksum computation after download was dead code: the fallback always set a non-empty cache key before reaching it, so the actual SHA256 of the downloaded file was never computed or stored.
  - The cache lookup key switched between `SHA256(image content)` (when a MinIO `.sha256` file was present) and `SHA256(URL)` (when absent), causing guaranteed cache misses whenever MinIO `.sha256` availability changed between runs.
  - Checksum files in `sha256sum(1)` format (`HASH filename`) were silently rejected by a strict 64-character length check, causing all such files to fall back to URL-hash keying even when a valid remote checksum was available.
- Cache design corrected: The cache key is now always `SHA256(URL)` (stable filesystem identifier). After each download the actual file checksum is computed and stored in the `.sha256` sentinel. On subsequent requests, if a remote checksum is available from MinIO it is compared against the stored value; a mismatch triggers a fresh download (stale cache eviction). Both raw-hash and `sha256sum(1)` format checksum files are now accepted.
- Prometheus metrics not served: The custom Prometheus registry created in `NewMetrics()` was discarded instead of being stored, so `/metrics` served an empty default registry. The registry is now stored on the `Metrics` struct and used by the `/metrics` HTTP handler.
- Image cache key fallback: When no `.sha256` file is available in MinIO, the provisioner now uses a SHA256 hash of the image URL as the cache key instead of the raw URL. This fixes a crash when using HTTPS endpoints with ports (e.g. `https://host:9000/...`) where the raw URL produced an invalid filesystem path.
- Enterprise Observability Stack: Comprehensive monitoring and debugging capabilities
- Enhanced Logging System: Configurable JSON/text logging with sampling, external aggregation (webhook/Loki), and structured error logging
- Advanced Metrics Collection: 20+ Prometheus metrics covering cache ratios, job lifecycle, image operations, and health status
- Multi-Exporter Tracing: OpenTelemetry support with OTLP, Jaeger, and Zipkin exporters, configurable sampling, and distributed context propagation
- Request Logging Middleware: HTTP request/response logging with correlation IDs and performance metrics
- External Log Hooks: Configurable webhook and Loki integration for centralized logging
- Health Status Monitoring: Comprehensive dependency and system health tracking
- Performance Optimization: Log sampling and configurable tracing to minimize overhead
- Dependencies: Updated Gin (v1.12.0), OpenTelemetry (v1.42.0), MinIO (v7.0.99), golangci-lint (v2.11.3)
- CI/CD: Streamlined workflow configuration, removed problematic build-release workflows
- Code Quality: Enhanced error handling, resource management, and type safety
- GitHub Actions Workflows: Resolved CI failures by removing incompatible workflow configurations
- Resource Management: Improved HTTP client usage and response body handling
- Type Safety: Enhanced error checking and nil pointer protection
- Enhanced Logging System: New configurable logging with JSON/text formats, log sampling, external aggregation support (webhook, Loki), and structured error logging
- Comprehensive Metrics: Added cache hit/miss ratios, job execution metrics, image download statistics, storage operation metrics, and health status indicators
- Multi-Exporter Tracing: Support for OTLP, Jaeger, and Zipkin exporters with configurable sampling rates and enhanced span coverage
- Request Logging Middleware: HTTP request/response logging with correlation IDs and performance metrics
- External Log Hooks: Configurable webhook and Loki integration for centralized logging
- Image Cache: Fixed cache key mismatch that caused images to be redownloaded on every provision request. The `getOrDownloadImage` function now consistently uses the SHA256 checksum as the cache key for both cache lookup and storage.
- GitHub Actions Workflow: Fixed YAML indentation issues in build-release.yml that prevented proper CI/CD execution
- Dependencies: Updated Gin (v1.11.0→v1.12.0), OpenTelemetry (v1.40.0→v1.42.0), MinIO (v7.0.98→v7.0.99), and other security patches
- golangci-lint: Updated to v2.11.3 with enhanced rules
- API Handler: Refactored to support enhanced metrics collection
- OTLP Endpoint Parsing: Fixed gRPC exporter "too many colons in address" error by parsing full URLs and extracting host:port
- Progress Percentages: Corrected provisioning progress calculation to advance properly (download: 10-40%, create: 45%, convert: 55-95%)
- Context Handling: Resolved contextcheck lint errors by updating CancelJob method signatures and implementations
- CHANGELOG: Added documentation for v0.5.1 release
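
For the OTLP endpoint parsing fix listed above, a minimal sketch of the idea follows: a full URL is reduced to the `host:port` form a gRPC dialer expects, while a bare `host:port` passes through unchanged. The helper name `otlpHostPort` and the example hostnames are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// otlpHostPort (hypothetical helper) returns the host:port part of an OTLP
// endpoint. Values without a scheme are assumed to be host:port already.
func otlpHostPort(endpoint string) (string, error) {
	if !strings.Contains(endpoint, "://") {
		return endpoint, nil
	}
	u, err := url.Parse(endpoint)
	if err != nil {
		return "", fmt.Errorf("parse OTLP endpoint: %w", err)
	}
	if u.Host == "" {
		return "", fmt.Errorf("OTLP endpoint %q has no host", endpoint)
	}
	return u.Host, nil
}

func main() {
	for _, ep := range []string{
		"http://collector.example:4317/v1/traces", // hypothetical collector URL
		"127.0.0.1:4317",
	} {
		hostPort, err := otlpHostPort(ep)
		fmt.Println(hostPort, err)
	}
}
```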
- Production Readiness: Comprehensive systemd service hardening with security best practices
- Database Backup: Automated daily database backups with systemd timer and retention policy
- Enhanced Testing: Expanded test coverage from 27.4% to 41.6% with 9 new test cases
- OpenTelemetry: Fixed tracing configuration (gRPC port 4317) and added connection validation
- Docker Host Image: Ubuntu 24.04 cloud image prepared for Docker host provisioning
- Repository Security: Complete removal of sensitive files from git history (devices.json, binaries, artifacts)
- Database Configuration: Fixed environment variable mismatch (DB_PATH vs DATABASE_PATH)
- MinIO TLS: Added InsecureSkipVerify support for self-signed certificates
- Makefile: Enhanced Debian package building with security-hardened systemd service
- CI/CD: Improved build reliability and Docker image publishing to GHCR
- Critical Bug: Database initialization failure due to incorrect environment variable
- Test Panics: Resolved nil pointer dereferences in test environment detection
- Git History: Cleaned repository of accidentally committed binaries, coverage reports, and sensitive data
- OTLP Tracing: Fixed gRPC endpoint configuration preventing trace export
- Linting: Resolved gosec G402 TLS warning and line length violations
- Systemd Hardening: Added NoNewPrivileges, PrivateTmp, ProtectHome, ProtectSystem, and ReadWritePaths restrictions
- Data Protection: Removed sensitive infrastructure data (devices.json) from repository history
- Binary Security: Eliminated committed binaries that could contain sensitive build information
- TLS Security: Proper handling of self-signed certificates with security annotations
- Test Quality: Significant improvement in test coverage and reliability
- Repository Health: Clean git history with only appropriate files tracked
- Build Reliability: Fixed CI/CD pipeline issues and improved error handling
- OpenTelemetry integration for distributed tracing with gRPC export
- Comprehensive logging with trace correlation using Logrus hooks
- Docker containerization with multi-stage builds and security hardening
- GitHub Actions CI/CD pipeline with automated testing and Docker publishing
- GitHub Container Registry (GHCR) integration for container distribution
- Enhanced API error handling with structured error responses
- Improved database schema with proper migrations and error handling
- Modernized Go module dependencies and build process
- Refactored job processing with improved concurrency and error recovery
- Race conditions in job processing and database operations
- Memory leaks in long-running job operations
- API response inconsistencies and missing error details
- Docker build issues with libvirt dependencies
- Added request validation and input sanitization
- Implemented proper authentication token handling
- Enhanced TLS certificate validation
- Added security headers and CORS protection
- New file-based image caching system that preserves QCOW2 compression, replacing libvirt RAW volume allocation
- `AllocateImageFile()` method for allocating compressed image cache paths without converting to RAW format
- Comprehensive test suite with 25 unit tests covering all cache operations and error paths
- Enhanced README with "Bigger Picture" section explaining the complete VM deployment workflow with diagrams
- Integer overflow validation in `CheckCache()` for secure file size conversions
- Security hardening of directory permissions (0o750) and file permissions (0o600)
- BREAKING: Image caching now stores QCOW2 images in compressed format instead of uncompressed RAW volumes
- Refactored `CheckCache()` to use direct filesystem lookups via checksum files instead of libvirt volume queries
- Updated `getOrDownloadImage()` in job manager to use file-based caching for better compression handling
- Improved cache directory creation and error handling with early directory initialization
- Enhanced documentation with compression preservation details and deployment workflow context
- Storage space efficiency: Compressed QCOW2 images now remain compressed in cache (was being expanded to RAW)
- Integer overflow vulnerability when converting file sizes in `CheckCache()` (G115 gosec)
- Directory permissions too permissive (0755 → 0o750) for security hardening
- File permissions in tests too permissive (0644 → 0o600)
- gosec G304 file inclusion warnings, addressed with targeted nolint directives
- gci import formatting issues throughout codebase
- Added explicit validation for negative file sizes before uint64 conversion
- Hardened directory permissions for cache directories (0o750)
- Hardened file permissions for sensitive files (0o600)
- Validated file inclusion paths in tests with security-conscious nolint directives
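
The negative-size validation mentioned above guards the kind of `int64` to `uint64` conversion that gosec G115 flags. A minimal sketch, with the helper name `sizeToUint64` chosen for illustration rather than taken from `CheckCache()`:

```go
package main

import (
	"errors"
	"fmt"
)

// errNegativeSize is returned when a reported file size is below zero.
var errNegativeSize = errors.New("negative file size")

// sizeToUint64 (hypothetical helper) converts a signed size, such as the
// value returned by os.FileInfo.Size, to uint64. Without the check, a
// negative value would wrap around to a very large unsigned number.
func sizeToUint64(size int64) (uint64, error) {
	if size < 0 {
		return 0, errNegativeSize
	}
	return uint64(size), nil
}

func main() {
	for _, s := range []int64{4096, -1} {
		v, err := sizeToUint64(s)
		fmt.Println(v, err)
	}
}
```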
- Expanded monitoring and alerting documentation
- Improved security in systemd service configuration with enhanced LVM access controls
- Updated systemd service with security best practices
- Enhanced code quality and testing infrastructure
- Code quality improvements and linting
- Resolved lint errors in TLS certificate tests
- GitHub Container Registry (GHCR) publishing to CI/CD workflow
- Modernized container images to latest versions
- Replaced Redis with Valkey
- Updated to latest PostgreSQL 18
- Fixed GHCR image tagging paths
- Fixed dev Docker image builds in CI/CD
- Removed static linking for libvirt builds
- Fixed CI workflow to use master branch only