Skip to content

Option to retry writes if on-disk checksums don't match#58

Open
alchark wants to merge 10 commits intoyoctoproject:mainfrom
flipperdevices:checksum-retry
Open

Option to retry writes if on-disk checksums don't match#58
alchark wants to merge 10 commits intoyoctoproject:mainfrom
flipperdevices:checksum-retry

Conversation

@alchark
Copy link

@alchark alchark commented Mar 6, 2026

As highlighted in #56, sometimes successfully writing an image with bmaptool to a storage device doesn't mean all data safely reached the device, which may happen for a variety of reasons, such as:

  1. The storage device has transient hardware faults resulting in silent data corruption. bmaptool will verify that the data it sends to the block device matches the source image and respective checksums, but that might still change by the time those bits hit the storage medium
  2. The storage device gets removed immediately after the bmaptool operation finishes, but some data might still be in OS buffers from the time between when bmaptool calls its sync() and when its own output buffer is fully flushed and given to the OS kernel

Possibly others not mentioned here, too.

This pull request introduces an explicit flush of the Python output buffer before asking the OS kernel to sync data to disk to ensure no data is still "in flight" at the moment the sync is executed, and then adds an additional option to the bmaptool copy command to force explicit checking of data on disk against the checksums recorded in the bmap file, retrying writes of every block region that didn't match. This helps ensure the integrity of data on disk after the bmaptool copy operation.

alchark added 2 commits March 6, 2026 12:00
If some data is still in the Python output buffer when the sync is called,
it will not be taken into consideration by the OS sync and might get
lost if the device is removed before all buffer contents reach the disk.

Add an explicit flush before the sync to avoid it.

Signed-off-by: Alexey Charkov <alchark@flipper.net>
Add a new `--checksum-retry` option to `bmaptool copy` that will retry
writing each block (possibly multiple times) until the checksum of data
on disk matches the expected one from the bmap file.

This is useful for devices that have unreliable writes, such as some
worn down USB card readers.

Signed-off-by: Alexey Charkov <alchark@flipper.net>
Copilot AI review requested due to automatic review settings March 6, 2026 09:26
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds an opt-in “read-back verify and retry” capability to bmaptool copy, intended to catch cases where data written to a block device later reads back with a checksum mismatch. It also ensures the Python file buffer is flushed before issuing fsync().

Changes:

  • Add --checksum-retry[=N] CLI option and plumb the retry count into the copy implementation.
  • Extend BmapCopy/BmapBdevCopy to (optionally) read back written ranges, compare against bmap range checksums, and retry writes on mismatch.
  • Flush the Python destination buffer before fsync() to reduce “in flight” user-space buffered data during sync.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

File Description
src/bmaptool/CLI.py Adds --checksum-retry argument, validates it, and opens block devices read/write when needed.
src/bmaptool/BmapCopy.py Implements post-write readback verification + retry loop, threads range metadata through the batching pipeline, and adds an explicit flush() before fsync().

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +836 to +901
@@ -740,6 +895,11 @@ def copy(self, sync=True, verify=True):
blocks_written += end - start + 1
bytes_written += len(buf)

# Track buffers for this range (for retry if needed)
if self._checksum_retry:
range_buffers[(start, end)] = buf
current_range_blocks.add((start, end))
Copy link

Copilot AI Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When --checksum-retry is enabled, range_buffers stores every batch buffer for the current bmap range so it can be rewritten on mismatch. A single bmap range can be very large, so this can grow to (near) the full image size and cause excessive memory usage/OOM. To make retries safe, consider spooling the range data to a temporary file (or another bounded-storage approach) and replaying from there on mismatch, rather than holding all buffers in RAM.

Copilot uses AI. Check for mistakes.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spooling the range to disk would dramatically increase complexity, so I believe that would be an overkill, and can introduce issues of its own (e.g. in case of insufficient storage space, but plentiful RAM).

alchark and others added 4 commits March 6, 2026 13:57
Ensure the docstring wording matches the logic of the function, i.e. the checksums are validated per bmap range (and not per block)

Signed-off-by: Alexey Charkov <alchark@flipper.net>

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…vided

When the bmap file doesn't include range checksums, `--checksum-retry`
becomes a no-op, so warn the user about it (at most once per file).

Signed-off-by: Alexey Charkov <alchark@flipper.net>
This was a leftover from an earlier work in progress implementation,
no longer needed in the current version, so remove it to avoid confusion.

Signed-off-by: Alexey Charkov <alchark@flipper.net>
Add a new tests file runnable via
`python3 -m unittest tests.test_checksum_retry`
to verify the checksum-retry functionality. This includes tests for
aligned and unaligned image sizes, correct handling of retries, and edge
cases like single-byte image.

Signed-off-by: Alexey Charkov <alchark@flipper.net>
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +872 to +880
# Check if we've moved to a new range - if so, verify the previous one
new_range = (range_first, range_last, range_chksum)
if self._checksum_retry and current_range and new_range != current_range:
# Verify the completed range before moving to the next one
prev_first, prev_last, prev_chksum = current_range
if prev_chksum and self._cs_type:
self._verify_range_with_retry(
prev_first, prev_last, prev_chksum, range_buffers
)
Copy link

Copilot AI Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The warning path for --checksum-retry when the bmap has no per-range checksums never triggers because callers only invoke _verify_range_with_retry() when range_chksum is truthy. As a result, --checksum-retry becomes a silent no-op for checksum-less bmaps (despite the CLI log message). Call _verify_range_with_retry() unconditionally when finishing a range (let it decide whether to warn/skip), or move the missing-checksum detection outside the if prev_chksum ... guard.

Copilot uses AI. Check for mistakes.
Comment on lines +226 to +242
def test_checksum_retry_default_and_explicit_values(self):
"""Test that default retry=1 and explicit retry=1 behave the same."""
image_size = 8192
image_path, bmap_path = _create_test_image_with_bmap(image_size)

try:
# Test with default retry (should be 1 or string "1")
f_image = TransRead.TransRead(image_path)
f_dest = tempfile.NamedTemporaryFile("w+b", delete=False)
dest_path = f_dest.name
f_bmap = open(bmap_path, "r")

# checksum_retry=None means user didn't specify it, should not retry
writer = BmapCopy.BmapCopy(
f_image, f_dest, f_bmap, image_size, checksum_retry=None
)
writer.copy(sync=True, verify=True)
Copy link

Copilot AI Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_checksum_retry_default_and_explicit_values is mislabeled: it compares checksum_retry=None (disabled) vs checksum_retry=1 (enabled), not “default retry=1” vs “explicit retry=1”. Either update the test name/docstring to reflect the actual behavior being tested, or add a case that exercises the CLI-style default (flag present with implicit value 1).

Copilot uses AI. Check for mistakes.
Comment on lines +27 to +30
import sys
import tempfile
import hashlib
from bmaptool import BmapCreate, BmapCopy, TransRead, BmapHelpers
Copy link

Copilot AI Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test module imports sys, hashlib, and BmapHelpers but does not use them. If pylint/linters run in CI, this will raise unused-import warnings/errors; please remove the unused imports or use them.

Suggested change
import sys
import tempfile
import hashlib
from bmaptool import BmapCreate, BmapCopy, TransRead, BmapHelpers
import tempfile
from bmaptool import BmapCreate, BmapCopy, TransRead

Copilot uses AI. Check for mistakes.
alchark added 4 commits March 6, 2026 17:24
…-retry

If the checksum-retry feature is enabled, the destination may return a
full last block when reading back even if the source image size is not a
multiple of the block size, resulting in a checksum mismatch.

Adjust the readback logic to only access as much data as the source
image size to make checksums comparable, even if a full block is
written/read back.

Signed-off-by: Alexey Charkov <alchark@flipper.net>
…enabled

This helper already checks if each range has a checksum, so call it
unconditionally when checksum-retry is enabled. This simplifies the code
and ensures that the user gets a warning if checksums are missing.

Signed-off-by: Alexey Charkov <alchark@flipper.net>
These were left over from the file used as a reference when writing the
new tests and are unused, so remove them to clean up the code.

Signed-off-by: Alexey Charkov <alchark@flipper.net>
The docstring and function name of
test_checksum_retry_default_and_explicit_values()
were confusing, as the default value for checksum retry is disabled.

Update the test and its description to explicitly compare the images
produced with checksum retry vs. without checksum retry to make sure they
are the same.

Signed-off-by: Alexey Charkov <alchark@flipper.net>
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +784 to +785
"checksum-retry requested but bmap file does not contain checksums; "
"skipping verification for blocks %d-%d and beyond"
Copy link

Copilot AI Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The warning in _verify_range_with_retry() says the bmap file lacks checksums, but this path can also happen when no bmap file is provided at all (e.g., API use or --nobmap). Consider adjusting the message/logic to distinguish “no bmap” from “bmap without checksums” to avoid misleading users.

Suggested change
"checksum-retry requested but bmap file does not contain checksums; "
"skipping verification for blocks %d-%d and beyond"
"checksum-retry requested but no checksum information is available "
"(no bmap file or bmap without checksums); skipping verification "
"for blocks %d-%d and beyond"

Copilot uses AI. Check for mistakes.
Comment on lines +224 to +226
def test_checksum_retry_disabled_vs_enabled(self):
"""Test that checksum_retry=None (disabled) vs checksum_retry=1 (enabled) produce identical results."""
image_size = 8192
Copy link

Copilot AI Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring/comment for this test says "checksum_retry=None (disabled - no verification performed)", but the test still calls copy(..., verify=True) (which performs in-copy checksum verification when the bmap provides checksums). Consider rewording to clarify this only disables post-write readback verification/retry.

Copilot uses AI. Check for mistakes.
Comment on lines +169 to +190
try:
# Test with checksum-retry enabled
f_image = TransRead.TransRead(image_path)
f_dest = tempfile.NamedTemporaryFile("w+b", delete=False)
dest_path = f_dest.name
f_bmap = open(bmap_path, "r")

writer = BmapCopy.BmapCopy(
f_image, f_dest, f_bmap, image_size, checksum_retry=1
)
# copy() verifies each bmap range's checksum during operation
writer.copy(sync=True, verify=True)

# Verify destination has correct size
dest_size = os.path.getsize(dest_path)
self.assertEqual(dest_size, image_size)

f_image.close()
f_dest.close()
f_bmap.close()
os.unlink(dest_path)
finally:
Copy link

Copilot AI Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Several tests (e.g. this one) delete dest_path only on the success path. If an assertion or writer.copy() fails, the temporary destination file can be left behind. Consider moving destination cleanup into the finally block (or using self.addCleanup) to ensure cleanup on failures too.

Copilot uses AI. Check for mistakes.
Comment on lines +623 to +632
# Validate checksum_retry argument
checksum_retry = None
if args.checksum_retry is not None:
try:
checksum_retry = int(args.checksum_retry)
if checksum_retry < 1:
error_out("--checksum-retry argument must be a positive integer")
except ValueError:
error_out("--checksum-retry argument must be a valid integer")

Copy link

Copilot AI Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--checksum-retry relies on bmap range checksums; when --nobmap is used the option becomes a no-op (and block devices may still be opened read-write). Consider rejecting --checksum-retry together with --nobmap (and/or when no bmap file is found) with a clear error message.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants