lsm: chunk cloud reads of larger SST files#30843

Open
andrwng wants to merge 6 commits into redpanda-data:dev from andrwng:lsm-sst-chunks

@andrwng andrwng commented Jun 18, 2026

Contributor

TODO

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v26.1.x
  • v25.3.x
  • v25.2.x

Release Notes

  • None

Pulls the DMA alignment logic out of disk_file_reader::read into a free
function so the upcoming chunked reader can reuse it, one ss::file at a
time.

The short-read error now reports natural-byte coordinates (the requested
offset and length) rather than the internal adjusted offset and array
size, since the helper no longer has the reader's path for context. Same
exception type and control flow -- only the diagnostic text changes.
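
As a rough illustration of the alignment arithmetic such a helper has to do, here is a minimal sketch. The names `aligned_window`, `compute_aligned_window`, and `is_short_read` are hypothetical and are not the PR's `aligned_dma_read` signature; the sketch only assumes that the DMA alignment is a power of two.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical illustration only; not the PR's aligned_dma_read API.
struct aligned_window {
    uint64_t aligned_offset; // requested offset rounded down to the alignment
    size_t aligned_length;   // length rounded up so the window covers the request
    size_t skip;             // bytes to drop from the front of the aligned buffer
};

// Compute the aligned window that covers [offset, offset + length).
// `alignment` must be a nonzero power of two (for example 512 or 4096).
inline aligned_window
compute_aligned_window(uint64_t offset, size_t length, size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    const uint64_t mask = static_cast<uint64_t>(alignment) - 1;
    const uint64_t start = offset & ~mask;
    const uint64_t end = (offset + length + mask) & ~mask;
    return {
      start,
      static_cast<size_t>(end - start),
      static_cast<size_t>(offset - start)};
}

// A read is short if the aligned read returned fewer bytes than are needed
// to cover the caller's request. The check is phrased in terms of the
// caller's length, so an error can report the requested offset and length.
inline bool
is_short_read(const aligned_window& w, size_t bytes_read, size_t length) {
    return bytes_read < w.skip + length;
}
```

For example, a 100-byte request at offset 5000 with 4096-byte alignment maps to an aligned read of 4096 bytes at offset 4096, with the first 904 bytes discarded.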
Copilot AI review requested due to automatic review settings June 18, 2026 07:44
@andrwng andrwng requested a review from a team as a code owner June 18, 2026 07:44

Copilot AI left a comment

Contributor

Pull request overview

This PR updates the LSM cloud-cache data path to avoid full-object downloads for large SSTs by introducing a chunked, range-GET-based random-access reader backed by the cloud disk cache. It also extends the persistence interface to pass known SST sizes down to implementations and adds configuration + tests to validate chunked behavior and range streaming via the S3 imposter.

Changes:

  • Add lsm::io::chunked_remote_file_reader to read SSTs from object storage in end-aligned fixed-size chunks, hydrating chunks into the cache on demand with coalesced concurrent fetches.
  • Extend data_persistence::open_random_access_reader to accept file_size, and thread this through callers/implementations (disk, memory, cloud cache, and tests).
  • Add a tunable cloud_topics_metastore_sst_chunk_size, plus new unit tests and S3 imposter support for download_stream with Content-Length (including byte ranges).
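
The "coalesced concurrent fetches" idea in the first bullet can be sketched in a single-threaded model: the first request for a chunk starts the download, and later requests for the same chunk only register a waiter. The class `chunk_coalescer` and its methods are hypothetical names for illustration; the PR's reader is built on Seastar futures and may be structured quite differently.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical single-threaded model of in-flight fetch coalescing.
class chunk_coalescer {
public:
    using waiter = std::function<void(const std::string&)>;

    // Register interest in a chunk. Returns true if the caller is the first
    // requester and is therefore responsible for starting the download.
    bool request(uint64_t chunk_id, waiter w) {
        auto [it, inserted] = _waiters.try_emplace(chunk_id);
        it->second.push_back(std::move(w));
        return inserted;
    }

    // Deliver the downloaded bytes to every waiter registered for the chunk
    // and forget the in-flight entry.
    void complete(uint64_t chunk_id, const std::string& data) {
        auto it = _waiters.find(chunk_id);
        if (it == _waiters.end()) {
            return;
        }
        auto waiters = std::move(it->second);
        _waiters.erase(it);
        for (auto& w : waiters) {
            w(data);
        }
    }

private:
    std::map<uint64_t, std::vector<waiter>> _waiters;
};
```

Two overlapping reads of the same cold chunk then cost one range GET: only the caller that sees `request` return true issues it, and `complete` wakes both.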

Reviewed changes

Copilot reviewed 28 out of 28 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| src/v/lsm/sst/tests/sst_test.cc | Update SST tests to pass file_size when opening random access readers. |
| src/v/lsm/sst/tests/iterator_test.cc | Update iterator tests to pass file_size to the reader open call. |
| src/v/lsm/io/tests/persistence_test.cc | Update persistence tests and mocks for new open_random_access_reader(handle, file_size) API; wire chunk-size binding in cloud-cache persistence test factory. |
| src/v/lsm/io/tests/chunked_remote_file_reader_test.cc | New tests covering open/read behavior, caching, and concurrency for chunked remote reader. |
| src/v/lsm/io/tests/BUILD | Add Bazel target for chunked_remote_file_reader_test. |
| src/v/lsm/io/persistence.h | Change open_random_access_reader interface to accept file_size and document rationale. |
| src/v/lsm/io/memory_persistence.cc | Adapt memory persistence implementation to new API (ignore file_size). |
| src/v/lsm/io/file_io.h | Add aligned_dma_read helper declaration for alignment-safe DMA reads. |
| src/v/lsm/io/file_io.cc | Implement aligned_dma_read and reuse it from disk_file_reader::read. |
| src/v/lsm/io/disk_persistence.cc | Adapt disk persistence implementation to new API (ignore file_size). |
| src/v/lsm/io/cloud_cache_persistence.h | Extend cloud-cache persistence factory to accept chunk-size binding and document chunked reads. |
| src/v/lsm/io/cloud_cache_persistence.cc | Switch cold SST reads to chunked range GETs; keep a fast-path for fully cached local files; add chunk-cache key prefixing; refactor retry budget helpers. |
| src/v/lsm/io/chunked_remote_file_reader.h | New chunked remote random-access reader API and design notes. |
| src/v/lsm/io/chunked_remote_file_reader.cc | New chunked reader implementation: end-aligned chunking, cache hydration, in-flight coalescing, parallel reads. |
| src/v/lsm/io/BUILD | Add library target for chunked_remote_file_reader and link it into cloud-cache persistence. |
| src/v/lsm/db/tests/impl_test.cc | Update persistence wrappers used in tests for new API. |
| src/v/lsm/db/tests/compaction_task_test.cc | Update fault-injecting persistence wrapper for new API. |
| src/v/lsm/db/table_cache.cc | Pass known SST file_size when opening random access readers. |
| src/v/lsm/block/tests/contents_test.cc | Update tests for new API (currently passes 0 as size). |
| src/v/config/configuration.h | Add tunable cloud_topics_metastore_sst_chunk_size. |
| src/v/config/configuration.cc | Define cloud_topics_metastore_sst_chunk_size property with defaults and bounds. |
| src/v/cloud_topics/read_replica/tests/db_utils.h | Pass chunk-size binding into cloud-cache persistence open. |
| src/v/cloud_topics/read_replica/snapshot_manager.cc | Pass chunk-size binding into cloud-cache persistence open. |
| src/v/cloud_topics/level_one/metastore/lsm/tests/replicated_db_test.cc | Pass chunk-size binding into cloud-cache persistence open. |
| src/v/cloud_topics/level_one/metastore/lsm/replicated_db.cc | Pass chunk-size binding into cloud-cache persistence open. |
| src/v/cloud_topics/level_one/domain/tests/db_domain_manager_test.cc | Pass chunk-size binding into cloud-cache persistence open. |
| src/v/cloud_io/tests/db_s3_imposter_test.cc | Add tests for download_stream whole-object and byte-range reads. |
| src/v/cloud_io/tests/db_s3_imposter_fixture.cc | Ensure small GET responses are inlined to include Content-Length for download_stream (range/whole). |

Comment thread src/v/lsm/io/chunked_remote_file_reader.cc
Comment thread src/v/lsm/io/chunked_remote_file_reader.cc Outdated
Comment thread src/v/lsm/io/chunked_remote_file_reader.cc
Comment thread src/v/lsm/block/tests/contents_test.cc Outdated
Comment thread src/v/lsm/block/tests/contents_test.cc Outdated
@andrwng andrwng force-pushed the lsm-sst-chunks branch 5 times, most recently from db4c659 to 56c0fa1 Compare June 18, 2026 17:59
@vbotbuildovich

Collaborator

CI test results

test results on build#85993
| test_status | test_class | test_method | test_arguments | test_kind | job_url | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FLAKY(PASS) | TopicRecoveryTest | test_many_partitions | {"check_mode": "no_check", "cloud_storage_type": 1} | integration | https://buildkite.com/redpanda/redpanda/builds/85993#019edbfb-c194-477c-9d5f-166c409312c1 | 10/11 | Test PASSES after retries. No significant increase in flaky rate (baseline=0.0011, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TopicRecoveryTest&test_method=test_many_partitions |

@andrwng andrwng force-pushed the lsm-sst-chunks branch 3 times, most recently from 15957a3 to dc6b496 Compare June 18, 2026 21:30
andrwng added 4 commits June 18, 2026 16:09
Our cloud client download_stream() requires a Content-Length on the
response, but db_s3_imposter always streamed its GET responses chunked,
which makes the HTTP server not set it on the response.

An upcoming chunked SST reader reads via download_stream() and fails
with "field not found" on the missing header.

This commit inlines response bodies that fit in an sstring so seastar's
HTTP server sets Content-Length automatically. Larger bodies still
stream, so tests that rely on them must use download_object instead. Adds
download_stream test cases, whole-object and byte-range, that reproduce the
failure.

Adds the SST file reader that will back cloud_cache_data_persistence.
It reads in fixed-size chunks hydrated into the cloud_io disk cache on
demand.

Chunks are aligned to the end of the file so the SST's footer/index land
in the tail chunk and open() costs one range GET. open() probes the tail
to confirm existence and warm the metadata for the first reads.

When reading a given range, chunks are downloaded and subsequently read
concurrently. Concurrency is bounded by the client pool capacity.
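
To make the read path concrete, the sketch below only works out which end-aligned chunks a given read touches; actually fetching them concurrently (and bounding that by the client pool) is not shown. `chunk_index` and `chunks_for_read` are hypothetical helpers that assume the short chunk, if any, is at the head of the file.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Index of the end-aligned chunk containing `offset`. Chunk 0 is the
// (possibly short) head chunk. Hypothetical helper for illustration.
inline uint64_t
chunk_index(uint64_t offset, uint64_t file_size, uint64_t chunk_size) {
    assert(chunk_size > 0 && offset < file_size);
    const uint64_t head = file_size % chunk_size; // size of the short head chunk
    if (head == 0) {
        return offset / chunk_size;
    }
    return offset < head ? 0 : 1 + (offset - head) / chunk_size;
}

// Inclusive {first, last} chunk indexes covering [offset, offset + length).
inline std::pair<uint64_t, uint64_t> chunks_for_read(
  uint64_t offset, uint64_t length, uint64_t file_size, uint64_t chunk_size) {
    assert(length > 0 && offset + length <= file_size);
    return {
      chunk_index(offset, file_size, chunk_size),
      chunk_index(offset + length - 1, file_size, chunk_size)};
}
```

Each index in the returned span corresponds to one chunk that must be present in the cache (or downloaded) before the read can be served.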

Unlike the other file readers that keep an ss::file handle alive for the
duration of the reader, this opts to only keep the file handle open for
the duration of in-flight reads. Subsequent attempts to read chunks may
require re-downloading, but I thought that was preferable to keeping a
potentially significant number of ss::files open.

The LSM already knows each SST's size from the manifest. This forwards
it into the persistence's open_random_access_reader so a cloud-backed
implementation can compute chunk byte ranges at open time, without a
separate HEAD request. This is a no-op for the existing implementations,
which ignore the argument until chunked reads land.
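
As a small illustration of why a known size removes the need for a HEAD request: with `file_size` in hand, the byte range of the tail chunk can be computed locally and sent directly as an HTTP `Range` header value (which uses inclusive end offsets). `tail_range_header` is a hypothetical helper, not part of the PR.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Build the value of an HTTP Range header covering the final chunk of an
// object whose size is already known. HTTP byte ranges are inclusive on
// both ends. Hypothetical helper for illustration.
inline std::string
tail_range_header(uint64_t file_size, uint64_t chunk_size) {
    assert(file_size > 0 && chunk_size > 0);
    const uint64_t begin = file_size > chunk_size ? file_size - chunk_size : 0;
    return "bytes=" + std::to_string(begin) + "-"
           + std::to_string(file_size - 1);
}
```

For a 100-byte object and 30-byte chunks, the tail probe would request `bytes=70-99`.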

The metastore reads its LSM SST files from object storage in fixed-size
byte ranges hydrated into the local cache. Expose the range size as a
tunable so it can be tuned against the read mix without a rebuild:
smaller chunks cut read/cache amplification for point reads, larger ones
cut request count for scans.

Defaults to 24 MiB, 1.5x the 16 MiB write buffer, so an L0 SST stays a
single chunk while larger compacted SSTs are split for parallel reads.
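
The arithmetic behind the default can be checked directly. The constants below restate the numbers from the commit message; `chunk_count` is a hypothetical helper, and the 1 GiB SST is only an example size.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t MiB = 1024 * 1024;
constexpr uint64_t write_buffer_size = 16 * MiB;  // from the commit message
constexpr uint64_t default_chunk_size = 24 * MiB; // from the commit message

// 24 MiB is exactly 1.5x the 16 MiB write buffer.
static_assert(default_chunk_size * 2 == write_buffer_size * 3);

// Number of chunks needed to cover a file (hypothetical helper).
constexpr uint64_t chunk_count(uint64_t file_size, uint64_t chunk_size) {
    return (file_size + chunk_size - 1) / chunk_size;
}

// An SST the size of one write buffer fits in a single chunk...
static_assert(chunk_count(write_buffer_size, default_chunk_size) == 1);
// ...while an example 1 GiB compacted SST is split into 43 chunks.
static_assert(chunk_count(1024 * MiB, default_chunk_size) == 43);
```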

The cloud cache LSM persistence previously read SSTs whole. At
scale, this meant that once the LSM grew to include some L3 or L4
objects (O(GiBs) each), readers could be hit with minutes of latency.
Worse yet, reads often need only a few MiB
(footers, bloom filters, indexes) to actually serve the incoming
metastore request.

This adds the new chunked file reader to the cloud_cache_persistence,
though it still attempts to read the full file from the cache if it
exists (e.g. if the object was there from previous versions of Redpanda,
or if it was written to the cache on the SST upload path).

The default chunk size is 24 MiB, 1.5x the default SST size of L0, so
that by default L0 files are read whole (empirically, downloading them
whole did not seem problematic).
3 participants