Option to retry writes if on-disk checksums don't match#58
Option to retry writes if on-disk checksums don't match#58alchark wants to merge 10 commits intoyoctoproject:mainfrom
Conversation
If some data is still in the Python output buffer when the sync is called, it will not be taken into consideration by the OS sync and might get lost if the device is removed before all buffer contents reach the disk. Add an explicit flush before the sync to avoid it. Signed-off-by: Alexey Charkov <alchark@flipper.net>
Add a new `--checksum-retry` option to `bmaptool copy` that will retry writing each block (possibly multiple times) until the checksum of data on disk matches the expected one from the bmap file. This is useful for devices that have unreliable writes, such as some worn down USB card readers. Signed-off-by: Alexey Charkov <alchark@flipper.net>
There was a problem hiding this comment.
Pull request overview
This PR adds an opt-in “read-back verify and retry” capability to bmaptool copy, intended to catch cases where data written to a block device later reads back with a checksum mismatch. It also ensures the Python file buffer is flushed before issuing fsync().
Changes:
- Add
--checksum-retry[=N]CLI option and plumb the retry count into the copy implementation. - Extend
BmapCopy/BmapBdevCopyto (optionally) read back written ranges, compare against bmap range checksums, and retry writes on mismatch. - Flush the Python destination buffer before
fsync()to reduce “in flight” user-space buffered data during sync.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
src/bmaptool/CLI.py |
Adds --checksum-retry argument, validates it, and opens block devices read/write when needed. |
src/bmaptool/BmapCopy.py |
Implements post-write readback verification + retry loop, threads range metadata through the batching pipeline, and adds an explicit flush() before fsync(). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
src/bmaptool/BmapCopy.py
Outdated
| @@ -740,6 +895,11 @@ def copy(self, sync=True, verify=True): | |||
| blocks_written += end - start + 1 | |||
| bytes_written += len(buf) | |||
|
|
|||
| # Track buffers for this range (for retry if needed) | |||
| if self._checksum_retry: | |||
| range_buffers[(start, end)] = buf | |||
| current_range_blocks.add((start, end)) | |||
There was a problem hiding this comment.
When --checksum-retry is enabled, range_buffers stores every batch buffer for the current bmap range so it can be rewritten on mismatch. A single bmap range can be very large, so this can grow to (near) the full image size and cause excessive memory usage/OOM. To make retries safe, consider spooling the range data to a temporary file (or another bounded-storage approach) and replaying from there on mismatch, rather than holding all buffers in RAM.
There was a problem hiding this comment.
Spooling the range to disk would dramatically increase complexity, so I believe that would be an overkill, and can introduce issues of its own (e.g. in case of insufficient storage space, but plentiful RAM).
Ensure the docstring wording matches the logic of the function, i.e. the checksums are validated per bmap range (and not per block) Signed-off-by: Alexey Charkov <alchark@flipper.net> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…vided When the bmap file doesn't include range checksums, `--checksum-retry` becomes a no-op, so warn the user about it (at most once per file). Signed-off-by: Alexey Charkov <alchark@flipper.net>
This was a leftover from an earlier work in progress implementation, no longer needed in the current version, so remove it to avoid confusion. Signed-off-by: Alexey Charkov <alchark@flipper.net>
Add a new tests file runnable via `python3 -m unittest tests.test_checksum_retry` to verify the checksum-retry functionality. This includes tests for aligned and unaligned image sizes, correct handling of retries, and edge cases like single-byte image. Signed-off-by: Alexey Charkov <alchark@flipper.net>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
src/bmaptool/BmapCopy.py
Outdated
| # Check if we've moved to a new range - if so, verify the previous one | ||
| new_range = (range_first, range_last, range_chksum) | ||
| if self._checksum_retry and current_range and new_range != current_range: | ||
| # Verify the completed range before moving to the next one | ||
| prev_first, prev_last, prev_chksum = current_range | ||
| if prev_chksum and self._cs_type: | ||
| self._verify_range_with_retry( | ||
| prev_first, prev_last, prev_chksum, range_buffers | ||
| ) |
There was a problem hiding this comment.
The warning path for --checksum-retry when the bmap has no per-range checksums never triggers because callers only invoke _verify_range_with_retry() when range_chksum is truthy. As a result, --checksum-retry becomes a silent no-op for checksum-less bmaps (despite the CLI log message). Call _verify_range_with_retry() unconditionally when finishing a range (let it decide whether to warn/skip), or move the missing-checksum detection outside the if prev_chksum ... guard.
tests/test_checksum_retry.py
Outdated
| def test_checksum_retry_default_and_explicit_values(self): | ||
| """Test that default retry=1 and explicit retry=1 behave the same.""" | ||
| image_size = 8192 | ||
| image_path, bmap_path = _create_test_image_with_bmap(image_size) | ||
|
|
||
| try: | ||
| # Test with default retry (should be 1 or string "1") | ||
| f_image = TransRead.TransRead(image_path) | ||
| f_dest = tempfile.NamedTemporaryFile("w+b", delete=False) | ||
| dest_path = f_dest.name | ||
| f_bmap = open(bmap_path, "r") | ||
|
|
||
| # checksum_retry=None means user didn't specify it, should not retry | ||
| writer = BmapCopy.BmapCopy( | ||
| f_image, f_dest, f_bmap, image_size, checksum_retry=None | ||
| ) | ||
| writer.copy(sync=True, verify=True) |
There was a problem hiding this comment.
test_checksum_retry_default_and_explicit_values is mislabeled: it compares checksum_retry=None (disabled) vs checksum_retry=1 (enabled), not “default retry=1” vs “explicit retry=1”. Either update the test name/docstring to reflect the actual behavior being tested, or add a case that exercises the CLI-style default (flag present with implicit value 1).
tests/test_checksum_retry.py
Outdated
| import sys | ||
| import tempfile | ||
| import hashlib | ||
| from bmaptool import BmapCreate, BmapCopy, TransRead, BmapHelpers |
There was a problem hiding this comment.
This test module imports sys, hashlib, and BmapHelpers but does not use them. If pylint/linters run in CI, this will raise unused-import warnings/errors; please remove the unused imports or use them.
| import sys | |
| import tempfile | |
| import hashlib | |
| from bmaptool import BmapCreate, BmapCopy, TransRead, BmapHelpers | |
| import tempfile | |
| from bmaptool import BmapCreate, BmapCopy, TransRead |
…-retry If the checksum-retry feature is enabled, the destination may return a full last block when reading back even if the source image size is not a multiple of the block size, resulting in a checksum mismatch. Adjust the readback logic to only access as much data as the source image size to make checksums comparable, even if a full block is written/read back. Signed-off-by: Alexey Charkov <alchark@flipper.net>
…enabled This helper already checks if each range has a checksum, so call it unconditionally when checksum-retry is enabled. This simplifies the code and ensures that the user gets a warning if checksums are missing. Signed-off-by: Alexey Charkov <alchark@flipper.net>
These were left over from the file used as a reference when writing the new tests and are unused, so remove them to clean up the code. Signed-off-by: Alexey Charkov <alchark@flipper.net>
The docstring and function name of test_checksum_retry_default_and_explicit_values() were confusing, as the default value for checksum retry is disabled. Update the test and its description to explicitly compare the images produced with checksum retry vs. without checksum retry to make sure they are the same. Signed-off-by: Alexey Charkov <alchark@flipper.net>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| "checksum-retry requested but bmap file does not contain checksums; " | ||
| "skipping verification for blocks %d-%d and beyond" |
There was a problem hiding this comment.
The warning in _verify_range_with_retry() says the bmap file lacks checksums, but this path can also happen when no bmap file is provided at all (e.g., API use or --nobmap). Consider adjusting the message/logic to distinguish “no bmap” from “bmap without checksums” to avoid misleading users.
| "checksum-retry requested but bmap file does not contain checksums; " | |
| "skipping verification for blocks %d-%d and beyond" | |
| "checksum-retry requested but no checksum information is available " | |
| "(no bmap file or bmap without checksums); skipping verification " | |
| "for blocks %d-%d and beyond" |
| def test_checksum_retry_disabled_vs_enabled(self): | ||
| """Test that checksum_retry=None (disabled) vs checksum_retry=1 (enabled) produce identical results.""" | ||
| image_size = 8192 |
There was a problem hiding this comment.
The docstring/comment for this test says "checksum_retry=None (disabled - no verification performed)", but the test still calls copy(..., verify=True) (which performs in-copy checksum verification when the bmap provides checksums). Consider rewording to clarify this only disables post-write readback verification/retry.
| try: | ||
| # Test with checksum-retry enabled | ||
| f_image = TransRead.TransRead(image_path) | ||
| f_dest = tempfile.NamedTemporaryFile("w+b", delete=False) | ||
| dest_path = f_dest.name | ||
| f_bmap = open(bmap_path, "r") | ||
|
|
||
| writer = BmapCopy.BmapCopy( | ||
| f_image, f_dest, f_bmap, image_size, checksum_retry=1 | ||
| ) | ||
| # copy() verifies each bmap range's checksum during operation | ||
| writer.copy(sync=True, verify=True) | ||
|
|
||
| # Verify destination has correct size | ||
| dest_size = os.path.getsize(dest_path) | ||
| self.assertEqual(dest_size, image_size) | ||
|
|
||
| f_image.close() | ||
| f_dest.close() | ||
| f_bmap.close() | ||
| os.unlink(dest_path) | ||
| finally: |
There was a problem hiding this comment.
Several tests (e.g. this one) delete dest_path only on the success path. If an assertion or writer.copy() fails, the temporary destination file can be left behind. Consider moving destination cleanup into the finally block (or using self.addCleanup) to ensure cleanup on failures too.
| # Validate checksum_retry argument | ||
| checksum_retry = None | ||
| if args.checksum_retry is not None: | ||
| try: | ||
| checksum_retry = int(args.checksum_retry) | ||
| if checksum_retry < 1: | ||
| error_out("--checksum-retry argument must be a positive integer") | ||
| except ValueError: | ||
| error_out("--checksum-retry argument must be a valid integer") | ||
|
|
There was a problem hiding this comment.
--checksum-retry relies on bmap range checksums; when --nobmap is used the option becomes a no-op (and block devices may still be opened read-write). Consider rejecting --checksum-retry together with --nobmap (and/or when no bmap file is found) with a clear error message.
As highlighted in #56, sometimes successfully writing an image with
bmaptoolto a storage device doesn't mean all data safely reached the device, which may happen for a variety of reasons, such as:bmaptoolwill verify that the data it sends to the block device matches the source image and respective checksums, but that might still change by the time those bits hit the storage mediumbmaptooloperation finishes, but some data might still be in OS buffers from the time between whenbmaptoolcalls its sync() and when its own output buffer is fully flushed and given to the OS kernelPossibly others not mentioned here, too.
This pull request introduces an explicit flush of the Python output buffer before asking the OS kernel to sync data to disk to ensure no data is still "in flight" at the moment the sync is executed, and then adds an additional option to the
bmaptool copycommand to force explicit checking of data on disk against the checksums recorded in the bmap file, retrying writes of every block region that didn't match. This helps ensure the integrity of data on disk after thebmaptool copyoperation.