lsm: chunk cloud reads of larger SST files #30843
Open
andrwng wants to merge 6 commits into
Pulls the DMA alignment logic out of disk_file_reader::read into a free function so the upcoming chunked reader can reuse it, one ss::file at a time. The short-read error now reports natural-byte coordinates (the requested offset and length) rather than the internal adjusted offset and array size, since the helper no longer has the reader's path for context. Same exception type and control flow -- only the diagnostic text changes.
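The alignment step described above can be sketched as follows. This is a minimal illustration, not the PR's `aligned_dma_read`: the names `aligned_range` and `align_read` are hypothetical, and the alignment is assumed to be a power of two. It widens a requested byte range to alignment boundaries and records how many leading bytes the caller would have to drop from the buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical illustration; not the helper added by this PR.
struct aligned_range {
    uint64_t offset; // start of the read, rounded down to the alignment
    size_t length;   // number of bytes to read, a multiple of the alignment
    size_t skip;     // leading bytes to drop to reach the requested offset
};

// Expands [offset, offset + length) outwards to alignment boundaries.
// `alignment` is assumed to be a power of two (for example 4096).
inline aligned_range
align_read(uint64_t offset, size_t length, size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    const uint64_t mask = alignment - 1;
    const uint64_t start = offset & ~mask;
    const uint64_t end = (offset + length + mask) & ~mask;
    return {
      start,
      static_cast<size_t>(end - start),
      static_cast<size_t>(offset - start)};
}
```

For example, a request for 100 bytes at offset 5000 with 4096-byte alignment becomes a 4096-byte read at offset 4096, with the first 904 bytes skipped. A short-read error raised around such a helper would report the requested 5000 and 100 rather than the adjusted values.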
Contributor
Pull request overview
This PR updates the LSM cloud-cache data path to avoid full-object downloads for large SSTs by introducing a chunked, range-GET-based random-access reader backed by the cloud disk cache. It also extends the persistence interface to pass known SST sizes down to implementations and adds configuration + tests to validate chunked behavior and range streaming via the S3 imposter.
Changes:
- Add `lsm::io::chunked_remote_file_reader` to read SSTs from object storage in end-aligned fixed-size chunks, hydrating chunks into the cache on demand with coalesced concurrent fetches.
- Extend `data_persistence::open_random_access_reader` to accept `file_size`, and thread this through callers/implementations (disk, memory, cloud cache, and tests).
- Add a tunable `cloud_topics_metastore_sst_chunk_size`, plus new unit tests and S3 imposter support for `download_stream` with Content-Length (including byte ranges).
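The "coalesced concurrent fetches" idea in the first bullet can be sketched roughly as below. This is an assumption-laden illustration: the PR's reader is built on Seastar, while this sketch substitutes standard-library futures, and `fetch_coalescer`, `get`, and `done` are hypothetical names. The point is only that callers asking for the same chunk while a fetch is outstanding share one result instead of issuing a second download.

```cpp
#include <cstdint>
#include <functional>
#include <future>
#include <map>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical illustration using std futures in place of Seastar.
class fetch_coalescer {
public:
    using fetch_fn = std::function<std::string(uint64_t)>;

    explicit fetch_coalescer(fetch_fn fn)
      : _fetch(std::move(fn)) {}

    // Returns a future shared by every caller that asks for the same
    // chunk before done() is called for it.
    std::shared_future<std::string> get(uint64_t chunk_index) {
        std::lock_guard<std::mutex> guard(_mu);
        auto it = _inflight.find(chunk_index);
        if (it != _inflight.end()) {
            return it->second;
        }
        // Deferred launch: the fetch runs when the first caller waits.
        auto fut
          = std::async(std::launch::deferred, _fetch, chunk_index).share();
        _inflight.emplace(chunk_index, fut);
        return fut;
    }

    // Forgets a chunk so that a later get() triggers a new fetch.
    void done(uint64_t chunk_index) {
        std::lock_guard<std::mutex> guard(_mu);
        _inflight.erase(chunk_index);
    }

private:
    fetch_fn _fetch;
    std::mutex _mu;
    std::map<uint64_t, std::shared_future<std::string>> _inflight;
};
```

Two `get(3)` calls made before `done(3)` invoke the fetch function once; a `get(3)` after `done(3)` invokes it again.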
Reviewed changes
Copilot reviewed 28 out of 28 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/v/lsm/sst/tests/sst_test.cc | Update SST tests to pass file_size when opening random access readers. |
| src/v/lsm/sst/tests/iterator_test.cc | Update iterator tests to pass file_size to the reader open call. |
| src/v/lsm/io/tests/persistence_test.cc | Update persistence tests and mocks for new open_random_access_reader(handle, file_size) API; wire chunk-size binding in cloud-cache persistence test factory. |
| src/v/lsm/io/tests/chunked_remote_file_reader_test.cc | New tests covering open/read behavior, caching, and concurrency for chunked remote reader. |
| src/v/lsm/io/tests/BUILD | Add Bazel target for chunked_remote_file_reader_test. |
| src/v/lsm/io/persistence.h | Change open_random_access_reader interface to accept file_size and document rationale. |
| src/v/lsm/io/memory_persistence.cc | Adapt memory persistence implementation to new API (ignore file_size). |
| src/v/lsm/io/file_io.h | Add aligned_dma_read helper declaration for alignment-safe DMA reads. |
| src/v/lsm/io/file_io.cc | Implement aligned_dma_read and reuse it from disk_file_reader::read. |
| src/v/lsm/io/disk_persistence.cc | Adapt disk persistence implementation to new API (ignore file_size). |
| src/v/lsm/io/cloud_cache_persistence.h | Extend cloud-cache persistence factory to accept chunk-size binding and document chunked reads. |
| src/v/lsm/io/cloud_cache_persistence.cc | Switch cold SST reads to chunked range GETs; keep a fast-path for fully cached local files; add chunk-cache key prefixing; refactor retry budget helpers. |
| src/v/lsm/io/chunked_remote_file_reader.h | New chunked remote random-access reader API and design notes. |
| src/v/lsm/io/chunked_remote_file_reader.cc | New chunked reader implementation: end-aligned chunking, cache hydration, in-flight coalescing, parallel reads. |
| src/v/lsm/io/BUILD | Add library target for chunked_remote_file_reader and link it into cloud-cache persistence. |
| src/v/lsm/db/tests/impl_test.cc | Update persistence wrappers used in tests for new API. |
| src/v/lsm/db/tests/compaction_task_test.cc | Update fault-injecting persistence wrapper for new API. |
| src/v/lsm/db/table_cache.cc | Pass known SST file_size when opening random access readers. |
| src/v/lsm/block/tests/contents_test.cc | Update tests for new API (currently passes 0 as size). |
| src/v/config/configuration.h | Add tunable cloud_topics_metastore_sst_chunk_size. |
| src/v/config/configuration.cc | Define cloud_topics_metastore_sst_chunk_size property with defaults and bounds. |
| src/v/cloud_topics/read_replica/tests/db_utils.h | Pass chunk-size binding into cloud-cache persistence open. |
| src/v/cloud_topics/read_replica/snapshot_manager.cc | Pass chunk-size binding into cloud-cache persistence open. |
| src/v/cloud_topics/level_one/metastore/lsm/tests/replicated_db_test.cc | Pass chunk-size binding into cloud-cache persistence open. |
| src/v/cloud_topics/level_one/metastore/lsm/replicated_db.cc | Pass chunk-size binding into cloud-cache persistence open. |
| src/v/cloud_topics/level_one/domain/tests/db_domain_manager_test.cc | Pass chunk-size binding into cloud-cache persistence open. |
| src/v/cloud_io/tests/db_s3_imposter_test.cc | Add tests for download_stream whole-object and byte-range reads. |
| src/v/cloud_io/tests/db_s3_imposter_fixture.cc | Ensure small GET responses are inlined to include Content-Length for download_stream (range/whole). |
Our cloud client download_stream() requires a Content-Length on the response, but db_s3_imposter always streamed its GET responses chunked, which prevents the HTTP server from setting that header. An upcoming chunked SST reader reads via download_stream() and fails with "field not found" on the missing header. This commit inlines response bodies that fit in an sstring so seastar's HTTP server sets Content-Length automatically. Larger bodies still stream, so tests that need them must use download_object. Adds download_stream test cases, whole-object and byte-range, that reproduce the failure.
Adds the SST file reader that will back cloud_cache_data_persistence. It reads in fixed-size chunks hydrated into the cloud_io disk cache on demand. Chunks are aligned to the end of the file so the SST's footer/index land in the tail chunk and open() costs one range GET. open() probes the tail to confirm existence and warm the metadata for the first reads. When reading a given range, the chunks covering it are downloaded and then read concurrently. Concurrency is bounded by the client pool capacity. Unlike the other file readers, which keep an ss::file handle alive for the lifetime of the reader, this one keeps a file handle open only for the duration of in-flight reads. Subsequent attempts to read chunks may require re-downloading, but I thought that was preferable to keeping a potentially significant number of ss::files open.
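One way to picture end-aligned chunking is the sketch below. It is an interpretation rather than the PR's code: it assumes chunk boundaries sit at `file_size - k * chunk_size`, so only the first chunk of a file can be short, and `byte_range` and `chunks_for_read` are hypothetical names. Given a read range, it returns the byte ranges of the chunks that would need to be present in the cache.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical type for illustration; not taken from the PR.
struct byte_range {
    uint64_t begin; // inclusive
    uint64_t end;   // exclusive
};

// Returns the chunks overlapping [offset, offset + length), in file order.
// Boundaries are assumed to sit at file_size - k * chunk_size, so the tail
// chunk always ends at file_size and only the head chunk may be short.
inline std::vector<byte_range> chunks_for_read(
  uint64_t file_size, uint64_t chunk_size, uint64_t offset, uint64_t length) {
    assert(chunk_size > 0);
    assert(length > 0 && offset + length <= file_size);
    // Size of the head chunk: the remainder, or a full chunk if none.
    const uint64_t rem = file_size % chunk_size;
    const uint64_t head = rem == 0 ? chunk_size : rem;
    // Maps a byte position to the index of the chunk containing it.
    auto index_of = [&](uint64_t pos) -> uint64_t {
        return pos < head ? 0 : 1 + (pos - head) / chunk_size;
    };
    std::vector<byte_range> out;
    const uint64_t last = index_of(offset + length - 1);
    for (uint64_t i = index_of(offset); i <= last; ++i) {
        const uint64_t begin = i == 0 ? 0 : head + (i - 1) * chunk_size;
        const uint64_t end = head + i * chunk_size;
        out.push_back({begin, end});
    }
    return out;
}
```

With a 100-byte file and 40-byte chunks, the chunks are [0, 20), [20, 60), and [60, 100). A read of the last 10 bytes, where a footer would live, touches only the tail chunk, which is why open() can get away with a single range GET.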
The LSM already knows each SST's size from the manifest. This forwards it into the persistence's open_random_access_reader so a cloud-backed implementation can compute chunk byte ranges at open time, without a separate HEAD request. This is a no-op for the existing implementations, which ignore the argument until chunked reads land.
The metastore reads its LSM SST files from object storage in fixed-size byte ranges hydrated into the local cache. Expose the range size as a tunable so it can be tuned against the read mix without a rebuild: smaller chunks cut read/cache amplification for point reads, larger ones cut request count for scans. Defaults to 24 MiB, 1.5x the 16 MiB write buffer, so an L0 SST stays a single chunk while larger compacted SSTs are split for parallel reads.
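The arithmetic behind the default can be checked with a small sketch. `chunk_count` is a hypothetical helper and the 1 GiB file size is an illustrative assumption, not a figure from this PR; the 16 MiB and 24 MiB values are the ones stated above.

```cpp
#include <cstdint>

constexpr uint64_t MiB = 1024 * 1024;

// Number of fixed-size chunks needed to cover a file (ceiling division).
constexpr uint64_t chunk_count(uint64_t file_size, uint64_t chunk_size) {
    return (file_size + chunk_size - 1) / chunk_size;
}

// A 16 MiB L0 SST fits in one 24 MiB chunk.
static_assert(chunk_count(16 * MiB, 24 * MiB) == 1, "L0 is a single chunk");
// A 1 GiB compacted SST (illustrative size) is split into 43 chunks.
static_assert(chunk_count(1024 * MiB, 24 * MiB) == 43, "1 GiB is 43 chunks");
```

Halving the chunk size to 12 MiB would double the request count for a full scan of the larger file while reducing the bytes fetched for a point read that touches a single chunk.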
The cloud cache LSM persistence was previously reading SSTs whole. At scale, this meant that when the LSM grew to include some L3 or L4 objects (O(GiBs) each), readers could be hit with minutes of latency. Worse yet, reads often need only a few MiB (footers, bloom filters, indexes) to actually serve the incoming metastore request. This adds the new chunked file reader to the cloud_cache_persistence, though it still attempts to read the full file from the cache if it exists (e.g. if the object was there from previous versions of Redpanda, or if it was written to the cache on the SST upload path). The default chunk size is 24 MiB, 1.5x the default SST size of L0, so that by default L0 files are read whole (empirically they didn't seem problematic being downloaded whole).
TODO
Backports Required
Release Notes