[CORE-16225] 3/3: Cloud I/O Scheduler: Cluster config and metrics #30596

Merged: oleiman merged 5 commits into dev from ct/core-16225/cloud-io-sched-cfg-and-metrics on Jun 12, 2026

oleiman (Member) commented May 25, 2026

Sequence:

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v26.1.x
  • v25.3.x
  • v25.2.x

Release Notes

  • none

@oleiman oleiman self-assigned this May 25, 2026
@oleiman oleiman changed the title from "[CORE-16225] 1/3: Cloud I/O Scheduler: Cluster config and metrics" to "[CORE-16225] 3/3: Cloud I/O Scheduler: Cluster config and metrics" May 25, 2026
@oleiman oleiman force-pushed the ct/core-16225/cloud-io-sched-cfg-and-metrics branch from ce7b2b7 to dcd4e1c Compare May 25, 2026 00:38
oleiman (Member Author) commented May 25, 2026

/ci-repeat 1
debug
skip-rebase
tests/rptest/tests/cluster_config_test.py::ClusterConfigTest.test_valid_settings
tests/rptest/tests/cluster_config_test.py::ClusterConfigTest.test_rpk_export_import

vbotbuildovich (Collaborator) commented

Retry command for Build#84915

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/cluster_config_test.py::ClusterConfigTest.test_rpk_export_import

vbotbuildovich (Collaborator) commented May 25, 2026

CI test results

test results on build#84915

| test_status | test_class | test_method | test_arguments | test_kind | job_url | passed | reason | test_history |
|---|---|---|---|---|---|---|---|---|
| FAIL | ClusterConfigTest | test_rpk_export_import | null | integration | https://buildkite.com/redpanda/redpanda/builds/84915#019e5dc9-17c3-4d69-9b0b-2cf7f5464f30 | 0/11 | Test FAILS after retries. Significant increase in flaky rate (baseline=0.0000, p0=0.0000, reject_threshold=0.0100) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterConfigTest&test_method=test_rpk_export_import |

test results on build#84936

| test_status | test_class | test_method | test_arguments | test_kind | job_url | passed | reason | test_history |
|---|---|---|---|---|---|---|---|---|
| FLAKY(PASS) | ShadowLinkingReplicationTests | test_replication_with_failures | {"storage_mode": "local"} | integration | https://buildkite.com/redpanda/redpanda/builds/84936#019e632c-9295-4a85-b57a-22b21fa82708 | 19/21 | Test PASSES after retries. No significant increase in flaky rate (baseline=0.0073, p0=0.1359, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3917, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_replication_with_failures |
| FAIL | SIPartitionMovementTest | test_cross_shard | {"cloud_storage_type": 2, "num_to_upgrade": 2, "with_cloud_topics": false} | integration | https://buildkite.com/redpanda/redpanda/builds/84936#019e632e-565e-43e0-851e-23ccfc8a7e1e | 0/1 | | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=SIPartitionMovementTest&test_method=test_cross_shard |
| FAIL | SIPartitionMovementTest | test_cross_shard | {"cloud_storage_type": 1, "num_to_upgrade": 2, "with_cloud_topics": false} | integration | https://buildkite.com/redpanda/redpanda/builds/84936#019e632e-565f-41cd-8276-403a2a0a4b5d | 0/1 | | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=SIPartitionMovementTest&test_method=test_cross_shard |
| FAIL | SIPartitionMovementTest | test_shadow_indexing | {"cloud_storage_type": 2, "num_to_upgrade": 2, "with_cloud_topics": false} | integration | https://buildkite.com/redpanda/redpanda/builds/84936#019e632e-565d-48f9-b632-5256a466389f | 0/1 | | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=SIPartitionMovementTest&test_method=test_shadow_indexing |
| FAIL | SIPartitionMovementTest | test_shadow_indexing | {"cloud_storage_type": 1, "num_to_upgrade": 2, "with_cloud_topics": false} | integration | https://buildkite.com/redpanda/redpanda/builds/84936#019e632e-565e-43e0-851e-23ccfc8a7e1e | 0/1 | | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=SIPartitionMovementTest&test_method=test_shadow_indexing |
| FAIL | ReadReplicasUpgradeTest | test_upgrades | {"cloud_storage_type": 1} | integration | https://buildkite.com/redpanda/redpanda/builds/84936#019e632c-9291-4f8f-ab1e-da4fc09b6eb6 | 0/1 | | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ReadReplicasUpgradeTest&test_method=test_upgrades |
| FAIL | ReadReplicasUpgradeTest | test_upgrades | {"cloud_storage_type": 1} | integration | https://buildkite.com/redpanda/redpanda/builds/84936#019e632e-565f-41cd-8276-403a2a0a4b5d | 0/1 | | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ReadReplicasUpgradeTest&test_method=test_upgrades |

test results on build#84961

| test_status | test_class | test_method | test_arguments | test_kind | job_url | passed | reason | test_history |
|---|---|---|---|---|---|---|---|---|
| FLAKY(PASS) | ShadowLinkingRandomOpsTest | test_node_operations | {"failures": true, "workload_set": "cloud_combos"} | integration | https://buildkite.com/redpanda/redpanda/builds/84961#019e655d-a6cb-4705-b07e-a21fce7fbf67 | 10/11 | Test PASSES after retries. No significant increase in flaky rate (baseline=0.0027, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingRandomOpsTest&test_method=test_node_operations |

test results on build#85392

| test_status | test_class | test_method | test_arguments | test_kind | job_url | passed | reason | test_history |
|---|---|---|---|---|---|---|---|---|
| FLAKY(FAIL) | ShadowLinkingReplicationTests | test_replication_with_compaction | {"storage_mode": "cloud"} | integration | https://buildkite.com/redpanda/redpanda/builds/85392#019e9178-e4aa-48c6-baca-ec6375c1137a | 7/11 | Test FAILS after retries. Significant increase in flaky rate (baseline=0.0000, p0=0.0000, reject_threshold=0.0100) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_replication_with_compaction |
| FLAKY(FAIL) | ShadowLinkingReplicationTests | test_replication_with_compaction | {"storage_mode": "cloud"} | integration | https://buildkite.com/redpanda/redpanda/builds/85392#019e917c-88f7-4b7f-b445-5bd0e1fe365c | 5/11 | Test FAILS after retries. Significant increase in flaky rate (baseline=0.0000, p0=0.0000, reject_threshold=0.0100) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_replication_with_compaction |
| FLAKY(PASS) | ShadowLinkingReplicationTests | test_replication_with_failures | {"storage_mode": "tiered"} | integration | https://buildkite.com/redpanda/redpanda/builds/85392#019e9178-e4af-4a41-98af-a35bd6c206de | 10/11 | Test PASSES after retries. No significant increase in flaky rate (baseline=0.0045, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_replication_with_failures |
| FLAKY(PASS) | ShadowLinkingReplicationTests | test_with_restart | {"storage_mode": "cloud"} | integration | https://buildkite.com/redpanda/redpanda/builds/85392#019e917c-88fa-45be-8f4f-47c029872888 | 19/21 | Test PASSES after retries. No significant increase in flaky rate (baseline=0.0501, p0=0.6425, reject_threshold=0.0100. adj_baseline=0.1430, p1=0.1982, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_with_restart |
| FLAKY(PASS) | NodeWiseRecoveryTest | test_recovery_local_data_missing | {"wait_for_final_manifest_uploads": true} | integration | https://buildkite.com/redpanda/redpanda/builds/85392#019e9178-e4b1-4e96-8b17-fc1187980ecf | 10/11 | Test PASSES after retries. No significant increase in flaky rate (baseline=0.0096, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=NodeWiseRecoveryTest&test_method=test_recovery_local_data_missing |

test results on build#85565

| test_status | test_class | test_method | test_arguments | test_kind | job_url | passed | reason | test_history |
|---|---|---|---|---|---|---|---|---|
| FLAKY(PASS) | ShadowLinkingReplicationTests | test_auto_prefix_trimming | {"source_cluster_spec": {"cluster_type": "redpanda"}, "storage_mode": "cloud", "with_failures": false} | integration | https://buildkite.com/redpanda/redpanda/builds/85565#019eae7f-4331-4288-ae89-cf22630dcfa6 | 19/21 | Test PASSES after retries. No significant increase in flaky rate (baseline=0.0044, p0=0.0847, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3917, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_auto_prefix_trimming |

test results on build#85726

| test_status | test_class | test_method | test_arguments | test_kind | job_url | passed | reason | test_history |
|---|---|---|---|---|---|---|---|---|
| FAIL | DatalakeCustomPartitioningTest | test_many_partitions | {"catalog_type": "rest_jdbc", "cloud_storage_type": 1} | integration | https://buildkite.com/redpanda/redpanda/builds/85726#019ebcbd-69c2-4d92-95e9-19e3d49c84c6 | 0/1 | | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DatalakeCustomPartitioningTest&test_method=test_many_partitions |
| FLAKY(PASS) | NodesDecommissioningTest | test_recommissioning_do_not_stop_all_moves_node | {"cloud_topic": true} | integration | https://buildkite.com/redpanda/redpanda/builds/85726#019ebcb9-836e-491b-8901-96992144aacc | 10/11 | Test PASSES after retries. No significant increase in flaky rate (baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=NodesDecommissioningTest&test_method=test_recommissioning_do_not_stop_all_moves_node |
| FLAKY(PASS) | NodeWiseRecoveryTest | test_recovery_local_data_missing | {"wait_for_final_manifest_uploads": false} | integration | https://buildkite.com/redpanda/redpanda/builds/85726#019ebcbd-69c7-4dc8-b122-00a9f50ab122 | 10/11 | Test PASSES after retries. No significant increase in flaky rate (baseline=0.0021, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=NodeWiseRecoveryTest&test_method=test_recovery_local_data_missing |
| FLAKY(PASS) | NodeWiseRecoveryTest | test_recovery_local_data_missing | {"wait_for_final_manifest_uploads": true} | integration | https://buildkite.com/redpanda/redpanda/builds/85726#019ebcbd-69c7-4dc9-9c5e-b980039a9312 | 10/11 | Test PASSES after retries. No significant increase in flaky rate (baseline=0.0180, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=NodeWiseRecoveryTest&test_method=test_recovery_local_data_missing |

@oleiman oleiman force-pushed the ct/core-16225/cloud-io-sched-cfg-and-metrics branch from dcd4e1c to 32dbce5 Compare May 25, 2026 23:37
oleiman (Member Author) commented May 25, 2026

/ci-repeat 1

@oleiman oleiman force-pushed the ct/core-16225/cloud-io-sched-cfg-and-metrics branch 4 times, most recently from dc3a181 to 0e790ed Compare May 26, 2026 04:01
oleiman (Member Author) commented May 26, 2026

/ci-repeat 1

@oleiman oleiman force-pushed the ct/core-16225/cloud-io-sched-cfg-and-metrics branch from 0e790ed to 8ec10ac Compare May 26, 2026 07:11
oleiman (Member Author) commented May 26, 2026

/ci-repeat 1

vbotbuildovich (Collaborator) commented May 26, 2026

Retry command for Build#84936

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/read_replica_e2e_test.py::ReadReplicasUpgradeTest.test_upgrades@{"cloud_storage_type":1}
tests/rptest/tests/partition_movement_test.py::SIPartitionMovementTest.test_shadow_indexing@{"cloud_storage_type":2,"num_to_upgrade":2,"with_cloud_topics":false}
tests/rptest/tests/partition_movement_test.py::SIPartitionMovementTest.test_cross_shard@{"cloud_storage_type":1,"num_to_upgrade":2,"with_cloud_topics":false}
tests/rptest/tests/partition_movement_test.py::SIPartitionMovementTest.test_shadow_indexing@{"cloud_storage_type":1,"num_to_upgrade":2,"with_cloud_topics":false}
tests/rptest/tests/partition_movement_test.py::SIPartitionMovementTest.test_cross_shard@{"cloud_storage_type":2,"num_to_upgrade":2,"with_cloud_topics":false}

@oleiman oleiman force-pushed the ct/core-16225/cloud-io-sched-cfg-and-metrics branch from 8ec10ac to cb416c2 Compare May 26, 2026 16:34
oleiman (Member Author) commented May 26, 2026

/ci-repeat 1
skip-units
skip-rebase
tests/rptest/tests/read_replica_e2e_test.py::ReadReplicasUpgradeTest.test_upgrades@{"cloud_storage_type":1}
tests/rptest/tests/partition_movement_test.py::SIPartitionMovementTest.test_shadow_indexing@{"cloud_storage_type":2,"num_to_upgrade":2,"with_cloud_topics":false}
tests/rptest/tests/partition_movement_test.py::SIPartitionMovementTest.test_cross_shard@{"cloud_storage_type":1,"num_to_upgrade":2,"with_cloud_topics":false}
tests/rptest/tests/partition_movement_test.py::SIPartitionMovementTest.test_shadow_indexing@{"cloud_storage_type":1,"num_to_upgrade":2,"with_cloud_topics":false}
tests/rptest/tests/partition_movement_test.py::SIPartitionMovementTest.test_cross_shard@{"cloud_storage_type":2,"num_to_upgrade":2,"with_cloud_topics":false}

oleiman (Member Author) commented May 26, 2026

/ci-repeat 1
skip-redpanda-build
skip-rebase

@oleiman oleiman requested review from andrwng and Copilot May 26, 2026 22:26
@oleiman oleiman added the claude-review label May 27, 2026

Copilot AI (Contributor) left a comment

Pull request overview

This PR completes the Cloud I/O Scheduler rollout by adding cluster-level configuration and wiring the scheduler into the cloud client pool, along with metrics and unit tests. It introduces a cloud_io::scheduler admission gate (policy-driven, currently reservation-based) and tags cloud I/O operations with a cloud_io::group_id to enforce per-group concurrency behavior.

Changes:

  • Add cluster properties for selecting the scheduler policy and configuring per-group reservation targets; plumb this config into cloud_storage::configuration and down into cloud_storage_clients::client_pool.
  • Gate client-pool leases through the per-shard scheduler, and propagate group_id through cloud I/O APIs/callers to classify operations (producer uploads vs consumer fetch vs default).
  • Add scheduler/reservation-policy unit tests and a design note doc; update fixtures and mocks for the new API surfaces.

Reviewed changes

Copilot reviewed 47 out of 47 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| tests/rptest/services/redpanda.py | Adjusts test cluster config generation to avoid reservation defaults conflicting with small connection pools. |
| src/v/redpanda/tests/fixture.cc | Forces empty reservation config in fixtures to keep policy active without reserved lanes under small capacities. |
| src/v/redpanda/application_services.cc | Passes scheduler configuration into cloud storage client pool construction. |
| src/v/config/rjson_serialization.h | Adds JSON serialization support for cloud_io::policy_type. |
| src/v/config/rjson_serialization.cc | Implements JSON serialization for cloud_io::policy_type. |
| src/v/config/property.h | Registers cloud_io::policy_type as a string-typed config property. |
| src/v/config/convert.h | Adds YAML conversion for cloud_io::policy_type. |
| src/v/config/configuration.h | Declares new cluster properties for scheduler policy and reservation specs. |
| src/v/config/configuration.cc | Defines the new cluster properties, defaults, and validation for reservation spec shape. |
| src/v/config/BUILD | Adds dependency on cloud_io:scheduler_types for config library. |
| src/v/cluster/archival/tests/service_fixture.cc | Updates archival test fixture to configure the scheduler for small connection limits. |
| src/v/cloud_topics/level_zero/tests/l0_object_size_distribution_test.cc | Updates remote API mocks/calls for new group_id parameter. |
| src/v/cloud_topics/level_zero/reader/tests/mocks.h | Updates remote mock interface to accept group_id (with default in override helper). |
| src/v/cloud_topics/level_zero/reader/materialized_extent.cc | Tags L0 downloads as consumer_fetch. |
| src/v/cloud_topics/level_zero/batcher/tests/remote_mock.h | Updates batcher remote mocks/expectations for new group_id parameter. |
| src/v/cloud_topics/level_zero/batcher/batcher.cc | Tags L0 uploads as producer_upload. |
| src/v/cloud_topics/level_one/metastore/tests/garbage_collector_test.cc | Updates io::read_object overrides to accept group_id. |
| src/v/cloud_topics/level_one/frontend_reader/level_one_reader.cc | Tags L1 reads as consumer_fetch. |
| src/v/cloud_topics/level_one/common/file_io.h | Extends io implementation interface to accept group_id. |
| src/v/cloud_topics/level_one/common/file_io.cc | Plumbs group_id through to cloud download calls. |
| src/v/cloud_topics/level_one/common/fake_io.h | Updates fake IO interface to accept group_id. |
| src/v/cloud_topics/level_one/common/fake_io.cc | Implements updated fake IO signature (ignores group_id). |
| src/v/cloud_topics/level_one/common/BUILD | Adds dependency on cloud_io:scheduler_types. |
| src/v/cloud_topics/level_one/common/abstract_io.h | Extends abstract IO API (read_object, read_object_as_iobuf) with group_id. |
| src/v/cloud_topics/level_one/common/abstract_io.cc | Plumbs group_id through read_object_as_iobuf to read_object. |
| src/v/cloud_storage/configuration.h | Adds cloud_io::scheduler_config to cloud storage runtime configuration. |
| src/v/cloud_storage/configuration.cc | Builds scheduler config from cluster properties and guards against reservation sums exceeding pool capacity. |
| src/v/cloud_storage/BUILD | Adds dependency on cloud_io:scheduler_types. |
| src/v/cloud_storage_clients/tests/client_pool_builder.h | Adds ability to inject a scheduler config into test client pool builds. |
| src/v/cloud_storage_clients/client_pool.h | Adds scheduler integration and new acquire overloads that take a group_id. |
| src/v/cloud_storage_clients/client_pool.cc | Implements scheduler-gated lease acquisition, release, and borrow-path integration. |
| src/v/cloud_storage_clients/BUILD | Adds dependencies on cloud_io:scheduler and cloud_io:scheduler_types. |
| src/v/cloud_io/tests/scheduler_test.cc | Adds unit tests for scheduler wrapper behavior (passthrough + reservation integration). |
| src/v/cloud_io/tests/reservation_policy_test.cc | Adds detailed unit tests for reservation policy behavior and edge cases. |
| src/v/cloud_io/tests/BUILD | Registers new cloud_io gtests in Bazel build. |
| src/v/cloud_io/scheduler.h | Introduces scheduler wrapper API (admit/try_admit/release + observability). |
| src/v/cloud_io/scheduler.cc | Implements scheduler wrapper and passthrough policy + factory wiring. |
| src/v/cloud_io/scheduler-design.md | Adds design notes for the scheduler/reservation policy and config surface. |
| src/v/cloud_io/scheduler_types.h | Adds shared types: policy_type, group_id, parsing helpers, and scheduler config structs. |
| src/v/cloud_io/scheduler_policy.h | Adds scheduler policy abstract base class. |
| src/v/cloud_io/reservation_policy.h | Introduces reservation-based scheduler policy interface and metrics hooks. |
| src/v/cloud_io/reservation_policy.cc | Implements reservation policy admission/release logic and metrics registration. |
| src/v/cloud_io/reservation_policy_types.h | Adds internal structs for per-group state, waiters, and dwell/refill mechanics. |
| src/v/cloud_io/remote.h | Extends remote API methods to accept group_id for classification. |
| src/v/cloud_io/remote.cc | Plumbs group_id into client pool acquisition for download/upload paths. |
| src/v/cloud_io/remote_api.h | Extends remote_api virtual interface to accept group_id with defaults. |
| src/v/cloud_io/BUILD | Adds new cloud_io libraries for scheduler/policies/types and wires deps. |

Comment thread src/v/cluster/archival/tests/service_fixture.cc Outdated
Comment thread src/v/cloud_io/scheduler-design.md Outdated
@oleiman oleiman added claude-review and removed claude-review labels May 27, 2026
claude (Bot) commented Jun 9, 2026

Review — Cloud I/O Scheduler: cluster config and metrics

Nice, well-documented final step in the series. The config plumbing (enum_property + convert + rjson_serialize + property_type_name) is complete and consistent, the metrics wiring looks correct, and the graceful fallback in build_scheduler_config when target_reserved sum exceeds capacity is a thoughtful touch for rolling upgrades. The new scheduler integration tests are clear and cover the right behaviors. BUILD dep changes look right, and there's no config↔reservation_policy dependency cycle (config only pulls in :scheduler_types).

Notable behavior change (intended, but flagging)

cloud_io_scheduler_policy defaults to reservation, so this PR flips the default cloud-storage admission path from passthrough to reservation-based admission control for all deployments on upgrade. The default reservation ([2,2,2]=6) fits under the default cloud_storage_max_connections=20, and the fallback handles the case where an operator has shrunk connections below 6, so the rollout looks safe — just want to confirm enabling it by default is the deliberate intent for this milestone.
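
To make the capacity check concrete, here is a minimal Python sketch of the fallback described above. It models the behaviour only and is not the C++ build_scheduler_config: the signature, the dict representation, and the assumption that pool capacity equals cloud_storage_max_connections are all illustrative.

```python
# Hypothetical model of the fallback; not the actual C++ implementation.
DEFAULT_RESERVATION = {"producer_upload": 2, "consumer_fetch": 2, "default_group": 2}


def build_scheduler_config(policy, reservation, max_connections):
    """Return the (policy, reservation) pair to hand to the scheduler.

    Assumption: when the reserved slots do not fit in the pool, the
    reservation is dropped but the policy itself stays active.
    """
    if policy != "reservation":
        # The reservation list is only consulted for the reservation policy.
        return policy, {}
    if sum(reservation.values()) > max_connections:
        return policy, {}
    return policy, dict(reservation)
```

With the defaults, a pool of 20 connections keeps all three reserved lanes, while a pool of 5 yields an empty reservation with the policy still set to reservation.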

Findings (left inline)

  • service_fixture.cc: comment says "reserve one slot each for producer_upload and consumer_fetch" but the code reserves 2 for default_group — comment/code mismatch.
  • redpanda.py: the <6 override sets [1,1,1]=3, which still exceeds capacity for cloud_storage_max_connections of 1–2; in that case the C++ side silently drops the reservation, so the test gets no lanes despite intending to configure them.
  • scheduler_types.h: parse_target_spec_shape parses the value as size_t but it's stored as uint32_t, so values above UINT32_MAX truncate silently rather than being rejected (edge case).
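
The third finding comes down to a missing range check before narrowing. A minimal Python sketch of the stricter parse follows, assuming entries use the "group_name:slots" shape; parse_target_spec and its error messages are hypothetical and do not mirror parse_target_spec_shape.

```python
UINT32_MAX = 2**32 - 1


def parse_target_spec(entry):
    """Parse one "group_name:slots" entry, rejecting values that do not fit in uint32."""
    name, sep, raw = entry.partition(":")
    if not sep or not name:
        raise ValueError(f"expected 'group_name:slots', got {entry!r}")
    if not (raw.isascii() and raw.isdigit()):
        raise ValueError(f"slots must be a non-negative integer, got {raw!r}")
    slots = int(raw)
    if slots > UINT32_MAX:
        # Reject instead of letting a narrowing conversion truncate the value.
        raise ValueError(f"slots {slots} exceeds {UINT32_MAX}")
    return name, slots
```

The point of the sketch is the ordering: parse into a wide type first, compare against the narrow type's maximum, and only then store the value.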

Minor / non-blocking

  • cloud_io/BUILD: //src/v/base is now listed in both deps and implementation_deps for the scheduler target — redundant; and since scheduler.cc dropped base/vassert.h, it may no longer need a direct base dep at all.
  • Duplicate group names in cloud_io_scheduler_reservation (e.g. two producer_upload: entries) are silently last-write-wins; probably fine given the validator, but worth being aware of.
  • With disable_metrics defaulting to false, the reservation_policy<manual_clock> unit tests now register real metric groups on every construction. They rely on the metric_groups destructor (not stop()) to unregister; this is safe as long as no test keeps two policies alive simultaneously, which is currently the case.
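
If last-write-wins for duplicate group names ever becomes a problem, the stricter alternative is to reject duplicates at parse time. A Python sketch, where parse_reservation is a hypothetical helper and not part of the PR:

```python
def parse_reservation(entries, reject_duplicates=True):
    """Fold "group_name:slots" entries into a dict keyed by group name."""
    result = {}
    for entry in entries:
        name, _, raw = entry.partition(":")
        if reject_duplicates and name in result:
            raise ValueError(f"duplicate group {name!r}")
        # Without the check above, a later entry overwrites an earlier one.
        result[name] = int(raw)
    return result
```

Passing reject_duplicates=False reproduces the last-write-wins behaviour noted above, which makes the difference easy to compare in a test.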

Overall looks solid — the inline items are small. 👍

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 1 comment.

Comment thread src/v/cluster/archival/tests/service_fixture.cc
@oleiman oleiman force-pushed the ct/core-16225/cloud-io-sched-cfg-and-metrics branch from 8ff728c to beaaf86 Compare June 9, 2026 16:59
@oleiman oleiman marked this pull request as ready for review June 9, 2026 17:00
@oleiman oleiman requested a review from a team as a code owner June 9, 2026 17:00
@oleiman oleiman force-pushed the ct/core-16225/cloud-io-sched-cfg-and-metrics branch from beaaf86 to f364a76 Compare June 9, 2026 22:09
sjust-redpanda previously approved these changes Jun 11, 2026
Comment thread src/v/config/configuration.cc Outdated
oleiman (Member Author) commented Jun 12, 2026

/cdt
tests/rptest/scale_tests/many_partitions_test.py::ManyPartitionsTest.test_many_partitions_cloud_topics

@oleiman oleiman force-pushed the ct/core-16225/cloud-io-sched-cfg-and-metrics branch from f364a76 to 6a83dc1 Compare June 12, 2026 16:08
Add the metrics surface for reservation_policy.

Internal metrics:
- aggregate: available_slots, total_capacity, total_waiters,
  total_waiters_canceled
- per-group: in_flight, waiters, admit_total,
  admit_immediate_total, current_reserved, canceled_total

Public metrics (shard-aggregated):
- aggregate: available_slots, total_capacity
- per-group: in_flight, waiters

Also adds the required counters and accessors for tracking canceled waiters.
Lumped in with the metrics because it's trivial and related.

Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
@oleiman oleiman force-pushed the ct/core-16225/cloud-io-sched-cfg-and-metrics branch from 6a83dc1 to 7d0eaa9 Compare June 12, 2026 16:10
oleiman (Member Author) commented Jun 12, 2026

force push CR changes

force push rebase dev for merge conflict

oleiman added 4 commits June 12, 2026 09:18
Add two cluster properties:

- cloud_io_scheduler_policy: selector for cloud_io::scheduler's
  admission policy. needs_restart=true.
- cloud_io_scheduler_reservation: list of "group_name:slots"
  entries that supply per-group reservation targets to
  reservation_policy. Defaults to 2 slots each for producer_upload,
  consumer_fetch, and default_group as a starvation-prevention
  floor. Only consulted when cloud_io_scheduler_policy=reservation.

cloud_storage::configuration::get_config reads both properties at
startup and stages them into the scheduler_config field that
flows down to client_pool and scheduler.

Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
Replace the placeholder reservation case in scheduler::make_policy
with construction of reservation_policy from the reservation_policy_config
threaded through scheduler_config.

Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
Three integration tests that construct a real reservation_policy
through the scheduler shell:

- ReservationHasWaitersReflectsQueueState: has_waiters() flips with
  queue state across admit/release.
- ReservationStopDrainsAndRejectsAdmits: stop() resolves queued
  waiters with abort and subsequent admit/try_admit reject.
- ReservationReservationsRespectConfiguredTargets: the per-group
  reservation passed in via scheduler_config is observable in
  admission behavior.

Policy internals are exercised by reservation_policy_test.cc.

Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
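
For readers who want a feel for what the three tests above observe, here is a toy Python model of a reservation-style admission gate. It is a deliberately simplified assumption, not the PR's reservation_policy: the class name, the "reserved floor plus shared remainder" rule, and the FIFO wake-up are illustrative, and the real policy's dwell/refill mechanics and stop() handling are not modelled.

```python
class ReservationGate:
    """Toy gate: each group owns a reserved floor; the remaining slots are shared."""

    def __init__(self, capacity, reserved):
        if sum(reserved.values()) > capacity:
            raise ValueError("reservations exceed capacity")
        self.capacity = capacity
        self.reserved = dict(reserved)
        self.in_flight = {group: 0 for group in reserved}
        self.waiters = []  # queued group names, oldest first

    def _shared_in_use(self):
        # Slots held beyond each group's reserved floor come from the shared pool.
        return sum(max(0, n - self.reserved[g]) for g, n in self.in_flight.items())

    def try_admit(self, group):
        """Admit without queueing; return True on success."""
        shared_capacity = self.capacity - sum(self.reserved.values())
        under_floor = self.in_flight[group] < self.reserved[group]
        if under_floor or self._shared_in_use() < shared_capacity:
            self.in_flight[group] += 1
            return True
        return False

    def admit(self, group):
        """Admit now, or queue the request; return True if admitted immediately."""
        if self.try_admit(group):
            return True
        self.waiters.append(group)
        return False

    def release(self, group):
        """Free one slot and admit the oldest waiter that now fits."""
        self.in_flight[group] -= 1
        for i, waiting_group in enumerate(self.waiters):
            if self.try_admit(waiting_group):
                self.waiters.pop(i)
                break

    def has_waiters(self):
        return bool(self.waiters)
```

In this model, with a capacity of 4 and one reserved slot per group, producer_upload can take its own slot plus the single shared one, after which a consumer_fetch request is still admitted from its reserved lane and a third producer_upload request has to wait.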
This is needed to support higher throughput usage for Cloud Topics in
particular. The no-op passthrough policy is kept around as an escape hatch.

Signed-off-by: Oren Leiman <oren.leiman@redpanda.com>
@oleiman oleiman force-pushed the ct/core-16225/cloud-io-sched-cfg-and-metrics branch from 7d0eaa9 to 19f7cf9 Compare June 12, 2026 16:19
@sjust-redpanda sjust-redpanda self-requested a review June 12, 2026 16:24
vbotbuildovich (Collaborator) commented

Retry command for Build#85726

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/datalake/custom_partitioning_test.py::DatalakeCustomPartitioningTest.test_many_partitions@{"catalog_type":"rest_jdbc","cloud_storage_type":1}

oleiman (Member Author) commented Jun 12, 2026

/ci-repeat 1
release
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/datalake/custom_partitioning_test.py::DatalakeCustomPartitioningTest.test_many_partitions@{"catalog_type":"rest_jdbc","cloud_storage_type":1}

oleiman (Member Author) commented Jun 12, 2026

The custom partitioning test failure looks unrelated, based on Claude's investigation, but let's run it a few more times.

Lazin (Contributor) left a comment

The code looks good.
What's the point of having passthrough mode? Is it just an escape hatch or do we need to test with both reservation and passthrough?

oleiman (Member Author) commented Jun 12, 2026

> The code looks good. What's the point of having passthrough mode? Is it just an escape hatch or do we need to test with both reservation and passthrough?

@Lazin I'd say a bit of both. Primarily an escape hatch though.

@oleiman oleiman merged commit 96b4807 into dev Jun 12, 2026
21 checks passed
@oleiman oleiman deleted the ct/core-16225/cloud-io-sched-cfg-and-metrics branch June 12, 2026 18:48
Labels

area/build area/redpanda claude-review

5 participants