lsm: chunk cloud reads of larger SST files by andrwng · Pull Request #30843 · redpanda-data/redpanda

andrwng · 2026-06-18T07:43:59Z

TODO

Backports Required

Release Notes

None

Pulls the DMA alignment logic out of disk_file_reader::read into a free function so the upcoming chunked reader can reuse it, one ss::file at a time. The short-read error now reports natural-byte coordinates (the requested offset and length) rather than the internal adjusted offset and array size, since the helper no longer has the reader's path for context. Same exception type and control flow -- only the diagnostic text changes.
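The alignment step described above can be sketched in isolation. This is a minimal illustration, not the PR's `aligned_dma_read`: the names `aligned_window`, `align_read_window`, and `is_short_read` are hypothetical, and the sketch assumes the alignment is a power of two and that a short read means the device returned fewer bytes than the caller's requested range needs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical result type, not taken from the PR.
struct aligned_window {
    uint64_t offset; // aligned-down start of the DMA read
    size_t length;   // length of the DMA read, a multiple of the alignment
    size_t skip;     // bytes to drop from the front of the DMA buffer
};

// Widen [offset, offset + len) to the enclosing alignment-sized blocks.
// Assumes `alignment` is a power of two (e.g. 512 or 4096).
inline aligned_window
align_read_window(uint64_t offset, size_t len, size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    const uint64_t mask = alignment - 1;
    const uint64_t start = offset & ~mask;
    const uint64_t end = (offset + len + mask) & ~mask;
    return {
      start,
      static_cast<size_t>(end - start),
      static_cast<size_t>(offset - start)};
}

// True if the bytes returned do not cover the caller's requested range.
// An error built from this check can report the requested offset and
// length rather than the adjusted window.
inline bool is_short_read(
  const aligned_window& w, size_t bytes_read, size_t requested_len) {
    return bytes_read < w.skip + requested_len;
}
```

For a 100-byte read at offset 5000 with 4096-byte alignment, the window starts at 4096, spans 4096 bytes, and the caller's data begins 904 bytes into the buffer.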

Copilot

Pull request overview

This PR updates the LSM cloud-cache data path to avoid full-object downloads for large SSTs by introducing a chunked, range-GET-based random-access reader backed by the cloud disk cache. It also extends the persistence interface to pass known SST sizes down to implementations and adds configuration + tests to validate chunked behavior and range streaming via the S3 imposter.

Changes:

Add lsm::io::chunked_remote_file_reader to read SSTs from object storage in end-aligned fixed-size chunks, hydrating chunks into the cache on demand with coalesced concurrent fetches.
Extend data_persistence::open_random_access_reader to accept file_size, and thread this through callers/implementations (disk, memory, cloud cache, and tests).
Add a tunable cloud_topics_metastore_sst_chunk_size, plus new unit tests and S3 imposter support for download_stream with Content-Length (including byte ranges).

Reviewed changes

Copilot reviewed 28 out of 28 changed files in this pull request and generated 5 comments.


File	Description
src/v/lsm/sst/tests/sst_test.cc	Update SST tests to pass `file_size` when opening random access readers.
src/v/lsm/sst/tests/iterator_test.cc	Update iterator tests to pass `file_size` to the reader open call.
src/v/lsm/io/tests/persistence_test.cc	Update persistence tests and mocks for new `open_random_access_reader(handle, file_size)` API; wire chunk-size binding in cloud-cache persistence test factory.
src/v/lsm/io/tests/chunked_remote_file_reader_test.cc	New tests covering open/read behavior, caching, and concurrency for chunked remote reader.
src/v/lsm/io/tests/BUILD	Add Bazel target for `chunked_remote_file_reader_test`.
src/v/lsm/io/persistence.h	Change `open_random_access_reader` interface to accept `file_size` and document rationale.
src/v/lsm/io/memory_persistence.cc	Adapt memory persistence implementation to new API (ignore `file_size`).
src/v/lsm/io/file_io.h	Add `aligned_dma_read` helper declaration for alignment-safe DMA reads.
src/v/lsm/io/file_io.cc	Implement `aligned_dma_read` and reuse it from `disk_file_reader::read`.
src/v/lsm/io/disk_persistence.cc	Adapt disk persistence implementation to new API (ignore `file_size`).
src/v/lsm/io/cloud_cache_persistence.h	Extend cloud-cache persistence factory to accept chunk-size binding and document chunked reads.
src/v/lsm/io/cloud_cache_persistence.cc	Switch cold SST reads to chunked range GETs; keep a fast-path for fully cached local files; add chunk-cache key prefixing; refactor retry budget helpers.
src/v/lsm/io/chunked_remote_file_reader.h	New chunked remote random-access reader API and design notes.
src/v/lsm/io/chunked_remote_file_reader.cc	New chunked reader implementation: end-aligned chunking, cache hydration, in-flight coalescing, parallel reads.
src/v/lsm/io/BUILD	Add library target for `chunked_remote_file_reader` and link it into cloud-cache persistence.
src/v/lsm/db/tests/impl_test.cc	Update persistence wrappers used in tests for new API.
src/v/lsm/db/tests/compaction_task_test.cc	Update fault-injecting persistence wrapper for new API.
src/v/lsm/db/table_cache.cc	Pass known SST `file_size` when opening random access readers.
src/v/lsm/block/tests/contents_test.cc	Update tests for new API (currently passes `0` as size).
src/v/config/configuration.h	Add tunable `cloud_topics_metastore_sst_chunk_size`.
src/v/config/configuration.cc	Define `cloud_topics_metastore_sst_chunk_size` property with defaults and bounds.
src/v/cloud_topics/read_replica/tests/db_utils.h	Pass chunk-size binding into cloud-cache persistence open.
src/v/cloud_topics/read_replica/snapshot_manager.cc	Pass chunk-size binding into cloud-cache persistence open.
src/v/cloud_topics/level_one/metastore/lsm/tests/replicated_db_test.cc	Pass chunk-size binding into cloud-cache persistence open.
src/v/cloud_topics/level_one/metastore/lsm/replicated_db.cc	Pass chunk-size binding into cloud-cache persistence open.
src/v/cloud_topics/level_one/domain/tests/db_domain_manager_test.cc	Pass chunk-size binding into cloud-cache persistence open.
src/v/cloud_io/tests/db_s3_imposter_test.cc	Add tests for `download_stream` whole-object and byte-range reads.
src/v/cloud_io/tests/db_s3_imposter_fixture.cc	Ensure small GET responses are inlined to include Content-Length for `download_stream` (range/whole).

vbotbuildovich · 2026-06-18T19:46:08Z

CI test results

test results on build#85993

test_status	test_class	test_method	test_arguments	test_kind	job_url	passed	reason	test_history
FLAKY(PASS)	TopicRecoveryTest	test_many_partitions	{"check_mode": "no_check", "cloud_storage_type": 1}	integration	https://buildkite.com/redpanda/redpanda/builds/85993#019edbfb-c194-477c-9d5f-166c409312c1	10/11	Test PASSES after retries.No significant increase in flaky rate(baseline=0.0011, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000)	https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TopicRecoveryTest&test_method=test_many_partitions

Our cloud client download_stream() requires a Content-Length on the response, but db_s3_imposter always streamed its GET responses chunked, so the HTTP server never set that header. An upcoming chunked SST reader reads via download_stream() and fails with "field not found" on the missing header. This commit inlines response bodies that fit in an sstring so seastar's HTTP server sets Content-Length automatically. Larger bodies still stream, so tests that need them must use download_object. Adds download_stream test cases, whole-object and byte-range, that reproduce the failure.

Adds the SST file reader that will back cloud_cache_data_persistence. It reads in fixed-size chunks hydrated into the cloud_io disk cache on demand. Chunks are aligned to the end of the file so the SST's footer/index land in the tail chunk and open() costs one range GET. open() probes the tail to confirm existence and warm the metadata for the first reads. When reading a given range, the chunks it covers are downloaded and then read concurrently, with concurrency bounded by the client pool capacity. Unlike the other file readers, which keep an ss::file handle alive for the lifetime of the reader, this one keeps a file handle open only while reads are in flight. Subsequent attempts to read chunks may require re-downloading, but I thought that preferable to keeping a potentially significant number of ss::files open.
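One way to read "aligned to the end of the file" is that chunk boundaries sit at `file_size - k * chunk_size`, so only the first chunk can be short. The sketch below maps a read range to the chunks it touches under that assumption. `byte_range` and `end_aligned_chunks` are hypothetical names, and the actual reader may lay out chunks differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Half-open byte range [begin, end). Hypothetical type for illustration.
struct byte_range {
    uint64_t begin;
    uint64_t end;
};

// Returns the end-aligned chunks that cover [offset, offset + len).
// Boundaries are assumed to sit at file_size - k * chunk_size, so the
// tail chunk is always full-sized when the file is at least one chunk long.
inline std::vector<byte_range> end_aligned_chunks(
  uint64_t file_size, uint64_t chunk_size, uint64_t offset, uint64_t len) {
    assert(chunk_size > 0 && len > 0 && offset + len <= file_size);
    // Size of the short first chunk, or 0 if the file divides evenly.
    const uint64_t head = file_size % chunk_size;
    // Round a position down to the nearest chunk boundary.
    auto floor_boundary = [&](uint64_t pos) -> uint64_t {
        return pos < head
                 ? 0
                 : head + ((pos - head) / chunk_size) * chunk_size;
    };
    std::vector<byte_range> out;
    const uint64_t last = offset + len;
    uint64_t begin = floor_boundary(offset);
    while (begin < last) {
        const uint64_t end = begin < head
                               ? head
                               : std::min(begin + chunk_size, file_size);
        out.push_back({begin, end});
        begin = end;
    }
    return out;
}
```

With a 100-byte file and 40-byte chunks, the layout is [0, 20), [20, 60), [60, 100), so a read of the last 10 bytes touches only the tail chunk.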

The LSM already knows each SST's size from the manifest. This forwards it into the persistence's open_random_access_reader so a cloud-backed implementation can compute chunk byte ranges at open time, without a separate HEAD request. This is a no-op for the existing implementations, which ignore the argument until chunked reads land.
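To illustrate why a known size removes the need for a HEAD request: with `file_size` in hand, the tail range can be computed locally and sent as a standard HTTP `Range` header, whose byte positions are inclusive. This is a sketch only. `tail_range_header` is a hypothetical helper, and how the PR actually builds its range requests is not shown in the source.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Builds the Range header value for the last `chunk_size` bytes of an
// object whose size is already known. Hypothetical helper for illustration.
inline std::string
tail_range_header(uint64_t file_size, uint64_t chunk_size) {
    assert(file_size > 0 && chunk_size > 0);
    // Clamp to the start of the object when it is smaller than one chunk.
    const uint64_t first = file_size > chunk_size ? file_size - chunk_size
                                                  : 0;
    // HTTP byte ranges are inclusive, so the last byte is file_size - 1.
    const uint64_t last = file_size - 1;
    return "bytes=" + std::to_string(first) + "-" + std::to_string(last);
}
```

For a 100-byte object and a 40-byte chunk this yields `bytes=60-99`. An object smaller than one chunk is requested whole.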

The metastore reads its LSM SST files from object storage in fixed-size byte ranges hydrated into the local cache. Expose the range size as a tunable so it can be tuned against the read mix without a rebuild: smaller chunks cut read/cache amplification for point reads, larger ones cut request count for scans. Defaults to 24 MiB, 1.5x the 16 MiB write buffer, so an L0 SST stays a single chunk while larger compacted SSTs are split for parallel reads.
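The trade-off can be made concrete with a little arithmetic. The helpers below are hypothetical and only count chunks and bound the bytes fetched for a read that falls inside a single chunk. They do not model request latency or cache behavior.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t MiB = 1024ULL * 1024ULL;

// Number of chunks needed to cover a file, and so the maximum number
// of range GETs for a full scan of it.
inline uint64_t chunk_count(uint64_t file_size, uint64_t chunk_size) {
    assert(chunk_size > 0);
    return (file_size + chunk_size - 1) / chunk_size;
}

// Upper bound on bytes downloaded to serve a read that falls inside
// one chunk: the whole chunk, or the whole file if it is smaller.
inline uint64_t
point_read_fetch_bytes(uint64_t file_size, uint64_t chunk_size) {
    return file_size < chunk_size ? file_size : chunk_size;
}
```

At the 24 MiB default, a 16 MiB L0 SST is one chunk and a 1 GiB SST is 43 chunks. Dropping the chunk size to 4 MiB would cap a point read at 4 MiB fetched but raise a full scan of the same 1 GiB SST to 256 requests.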

The cloud cache LSM persistence previously read SSTs whole. At scale, this meant that once the LSM grew to include some L3 or L4 objects (O(GiBs) each), readers could be hit with minutes of latency. Worse yet, a read often needs only a few MiB (footers, bloom filters, indexes) to serve the incoming metastore request. This adds the new chunked file reader to the cloud_cache_persistence, though it still attempts to read the full file from the cache if it exists (e.g. if the object was there from previous versions of Redpanda, or if it was written to the cache on the SST upload path). The default chunk size is 24 MiB, 1.5x the default SST size of L0, so that by default L0 files are read whole (empirically, downloading them whole didn't seem problematic).

Copilot AI review requested due to automatic review settings June 18, 2026 07:44

andrwng requested a review from a team as a code owner June 18, 2026 07:44

github-actions Bot added area/build area/redpanda labels Jun 18, 2026

Copilot started reviewing on behalf of andrwng June 18, 2026 07:44

Copilot AI reviewed Jun 18, 2026


andrwng force-pushed the lsm-sst-chunks branch 5 times, most recently from db4c659 to 56c0fa1 Compare June 18, 2026 17:59

andrwng force-pushed the lsm-sst-chunks branch 3 times, most recently from 15957a3 to dc6b496 Compare June 18, 2026 21:30

andrwng added 4 commits June 18, 2026 16:09

andrwng force-pushed the lsm-sst-chunks branch from dc6b496 to 9fa317b Compare June 18, 2026 23:13

andrwng force-pushed the lsm-sst-chunks branch from 9fa317b to 1a6fd7b Compare June 18, 2026 23:23


andrwng wants to merge 6 commits into redpanda-data:dev from andrwng:lsm-sst-chunks

3 participants
