[CORE-16628]: Cloud Topics: HTT and more scale tests #30818
A CDT scale gate for the cloud_io reservation floor. A producer keeps a cloud topic warm at a moderate rate while a multi-reader consumer group re-reads it from offset 0 on a loop. The topic's data exceeds the cloud cache, so the reads keep missing and fetch L1 cold, competing with the produce-path L0 uploads for a small per-shard S3 pool. The test asserts produce stays healthy under that contention, protected by producer_upload's reserved floor, with self-confirming guards that the reads were genuinely cold (more bytes read than the cache holds) and the pool was genuinely contended (had waiters). This is a coarse regression gate, not a reservation-vs-passthrough A/B. Brokers are force-stopped at teardown: under sustained pool saturation the reconciler wedges on orphaned multipart uploads and a graceful shutdown hangs. That bug is tracked in CORE-16648, and the usual graceful shutdown will be reinstated once a fix lands.
stage_cloud_topics_cold_read + test_cloud_topics_cold_read: a cloud-topics analog of stage_tiered_storage_consuming that runs on a real Redpanda Cloud cluster at the sold tier (cloud topics is available there). Steady produce at max tier ingress + an RpkConsumer draining the backlog cold from oldest; asserts produce advances and the backlog drains. Backlog volume and drain timeout are calibration knobs.
Cloud topics serve cold reads by fetching L1 objects from object storage through a
per-shard S3 connection pool that the cloud_io scheduler arbitrates under its reservation
policy: the produce path's L0 uploads (producer_upload) keep a reserved floor even when
cold fetches (consumer_fetch) saturate the pool. This PR adds coverage that produce stays
healthy under cold-read pressure for cloud topics, as part of the tier-9 cloud-topics
scaleup (CORE-16628).
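The reserved-floor idea can be sketched as a toy admission check. This is an illustration only, not the cloud_io scheduler: the `ReservedPool` class and its methods are hypothetical, and the only values taken from this PR are the pool size of 8 and the producer_upload floor of 2.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ReservedPool:
    """Toy connection pool with per-class reserved floors (hypothetical model)."""

    capacity: int
    floors: Dict[str, int]                      # class name -> reserved connections
    in_use: Dict[str, int] = field(default_factory=dict)

    def _used(self, cls: str) -> int:
        return self.in_use.get(cls, 0)

    def try_acquire(self, cls: str) -> bool:
        total_used = sum(self.in_use.values())
        # Connections held back for *other* classes still below their floor.
        held_back = sum(
            max(0, floor - self._used(other))
            for other, floor in self.floors.items()
            if other != cls
        )
        if total_used + held_back >= self.capacity:
            return False                        # caller would wait in a real pool
        self.in_use[cls] = self._used(cls) + 1
        return True

    def release(self, cls: str) -> None:
        assert self._used(cls) > 0, "release without a matching acquire"
        self.in_use[cls] -= 1


pool = ReservedPool(capacity=8, floors={"producer_upload": 2})
# Cold fetches try to take everything; the floor caps them at capacity - 2.
consumer_granted = sum(pool.try_acquire("consumer_fetch") for _ in range(20))
# Uploads still get their reserved connections despite the saturation.
producer_granted = sum(pool.try_acquire("producer_upload") for _ in range(5))
```

In this model the cold-fetch class is refused once only the reserved connections remain, which is the property the tests in this PR check from the outside through produce throughput.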
CDT scale test (scale_tests/cloud_topics_cold_read_scale_test.py): a coarse regression
gate. A steady producer keeps a cloud topic warm while a multi-reader consumer group
re-reads it from offset 0 on a loop; the topic's data exceeds the cloud cache, so the
reads keep missing and fetch L1 cold, contending for the per-shard pool. The pool is
deliberately small (cloud_storage_max_connections=8, with the shipped reservation flooring
producer_upload at 2): the default 20-connection pool can't be saturated at CDT scale,
and the floor mechanism is scale-invariant. The test asserts produce holds
≥70% of its offered rate under that contention, with self-confirming guards that the reads
were genuinely cold (pulled back more than the cache) and the pool was genuinely
contended (had waiters). The precise reservation-vs-passthrough A/B lives in the
bench-runner tier-9 configs; the floor itself is unit-tested in
cloud_io/tests/scheduler_test.cc.
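The pass/fail logic described above can be sketched as follows. This is a minimal sketch, not the code in the scale test: the `ColdReadRun` fields and the `evaluate` helper are hypothetical names, and how each number is collected (metrics, producer status) is assumed rather than shown. Only the 70% threshold and the two guards come from the description.

```python
from dataclasses import dataclass
from typing import List

MIN_PRODUCE_RATIO = 0.70  # produce must hold >= 70% of its offered rate


@dataclass
class ColdReadRun:
    """Measurements from one run (hypothetical container)."""

    offered_bytes_per_s: float   # rate the producer was asked to sustain
    achieved_bytes_per_s: float  # rate the producer actually sustained
    bytes_read_back: int         # total bytes pulled by the consumer group
    cache_size_bytes: int        # size of the cloud cache
    pool_waiters_observed: int   # requests seen waiting on the S3 pool


def evaluate(run: ColdReadRun) -> List[str]:
    """Return failure reasons; an empty list means the gate passes."""
    failures = []
    # Guard 1: reads were genuinely cold (more bytes read than the cache holds).
    if run.bytes_read_back <= run.cache_size_bytes:
        failures.append("reads did not exceed the cache; cold path not proven")
    # Guard 2: the pool was genuinely contended.
    if run.pool_waiters_observed == 0:
        failures.append("no pool waiters; contention not proven")
    # The assertion proper: produce held up under that contention.
    if run.achieved_bytes_per_s < MIN_PRODUCE_RATIO * run.offered_bytes_per_s:
        failures.append("produce fell below 70% of the offered rate")
    return failures
```

Checking the guards alongside the produce assertion means a run that never stressed the pool fails loudly instead of passing vacuously.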
The scale test SIGKILLs the brokers at teardown: under sustained pool saturation the
reconciler wedges on orphaned multipart uploads and a graceful shutdown hangs
(CORE-16648). The produce assertion is already decided by then, so the test abandons the
cluster rather than block on the wedge; a code comment marks the force-stop for removal
once CORE-16648 is fixed.
High-throughput cloud stage (redpanda_cloud_tests/high_throughput_test.py::test_cloud_topics_cold_read): the
real-cloud analog of the tiered-storage consuming stage. Steady produce on a
storage.mode=cloud topic at max tier ingress while a large backlog (~4 min of ingress,
sized to exceed the batch cache) drains cold from object storage; asserts produce keeps
flowing and the backlog drains. Runs against a real Redpanda Cloud cluster and requires
cloud topics enabled on the tier — a throughput/functionality check at tier scale, not the
pool-saturation floor gate.
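The backlog sizing and the lower bound on drain time are simple arithmetic, sketched below. The function names and the example rates are hypothetical; the 4-minute backlog window is the only figure taken from the description, and the real backlog volume and drain timeout are calibration knobs per tier.

```python
def backlog_bytes(ingress_bytes_per_s: float, backlog_seconds: float = 240.0) -> int:
    """Backlog volume for roughly 4 minutes of produce at the tier's ingress rate."""
    return int(ingress_bytes_per_s * backlog_seconds)


def min_drain_seconds(
    backlog: int, consume_bytes_per_s: float, produce_bytes_per_s: float
) -> float:
    """Lower bound on drain time while produce keeps running.

    The backlog only shrinks by the amount consume outpaces produce.
    """
    net = consume_bytes_per_s - produce_bytes_per_s
    if net <= 0:
        return float("inf")  # the consumer never catches up
    return backlog / net


# Hypothetical tier: 100 MB/s ingress, consumer reading at 300 MB/s.
example_backlog = backlog_bytes(100e6)
example_drain = min_drain_seconds(example_backlog, 300e6, 100e6)
```

A drain timeout would need to sit comfortably above this bound, since cold fetches from object storage are unlikely to sustain the peak consume rate throughout.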
Testing: the scale test passes in CDT; the high-throughput stage runs against a
provisioned cloud cluster per tier.