[Performance] Shared-memory command signaling for ParallelEnv and ring-buffer transport for MultiAsyncCollector#3854
Open
vmoens wants to merge 2 commits into
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/rl/3854

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs
There are 1 currently active SEVs. If your PR is affected, please view them below:

❌ 1 New Failure, 10 Pending, 1 Unrelated Failure
As of commit 5363e83 with merge base 26ece89:

NEW FAILURE - The following job has failed:

BROKEN TRUNK - The following job failed but were present on the merge base:
👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.
… transport for MultiAsyncCollector

Two IPC optimizations targeting per-step syscalls and per-batch copies:

1. ParallelEnv worker_wait="adaptive"|"spin" (default "block", unchanged): payload-free hot-path commands (step, step_and_maybe_reset) are written as opcodes into a shared-memory RawArray that workers spin-poll instead of being pickled and sent through a pipe (one syscall per worker per step). This mirrors the existing shm done-flags in the other direction. In adaptive mode workers spin for spin_for seconds then fall back to a blocking pipe wait, advertising a sleep state so the parent wakes them through the pipe; a short poll recheck covers the theoretical lost-wake window. Payload-carrying commands (resets, seeds, non-tensor data, RNN passthrough) keep using the pipe, and no-buffer mode falls back to "block" with a warning. ~30% step-throughput gain on 4 workers with a fast env.

2. MultiAsyncCollector buffer_depth=K (default 1, unchanged): workers copy each rollout into one of K rotating shared-memory slots and the queue message shrinks to (idx, slot); the main process yields zero-copy views instead of cloning every batch. A yielded batch stays valid until the same worker has collected K-1 further batches; K=2 covers the standard iteration pattern since the continue-before-yield handshake keeps workers at most one rollout ahead. First use of a slot ships the buffer ref through the queue (re-sent if the put times out). Rejected for MultiSyncCollector, replay_buffer mode and use_buffers=False.

Both knobs are exposed in configure_parallel and the Hydra configs (BatchedEnvConfig, MultiSyncCollectorConfig, MultiAsyncCollectorConfig), covered by tests (TestWorkerWait, TestBufferDepth) and parametrized benchmarks (test_parallel_worker_wait, test_async_buffer_depth).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
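The slot rotation behind buffer_depth=K can be sketched in plain Python. This is only an illustration under stated assumptions, not the collector's code: `SlotRing`, `write`, and `view` are hypothetical names, bytearrays stand in for shared-memory tensors, and the queue handshake is left out. It shows why a view handed to the consumer stays intact while up to K-1 further batches are written and is overwritten by the K-th.

```python
class SlotRing:
    """Hypothetical stand-in for K rotating shared-memory slots."""

    def __init__(self, depth: int, batch_nbytes: int) -> None:
        if depth < 1:
            raise ValueError("depth must be >= 1")
        self.depth = depth
        # One preallocated buffer per slot; never reallocated afterwards.
        self._slots = [bytearray(batch_nbytes) for _ in range(depth)]
        self._written = 0

    def write(self, payload: bytes) -> int:
        """Copy a batch in place into the next slot and return the slot index."""
        slot = self._written % self.depth
        # In-place copy: raises ValueError if the payload size differs.
        memoryview(self._slots[slot])[:] = payload
        self._written += 1
        return slot

    def view(self, slot: int) -> memoryview:
        """Zero-copy view of a slot (what the consumer would be handed)."""
        return memoryview(self._slots[slot])


ring = SlotRing(depth=2, batch_nbytes=4)
first = ring.view(ring.write(b"aaaa"))  # batch 0 lands in slot 0

ring.write(b"bbbb")                     # 1 further batch (K-1): slot 1
intact_after_one = bytes(first)         # slot 0 has not been touched

ring.write(b"cccc")                     # 2nd further batch wraps to slot 0
clobbered_after_two = bytes(first)      # the old view now shows new data
```

With depth=1 the very next write would already overwrite the view, which is consistent with the message pairing the zero-copy path with K>=2 for the standard iteration pattern.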
2f286cb to f4e30c3
…nning workers

The benchmark files share one pytest session; without an explicit close the "spin" case leaves three busy-waiting workers burning cores for the rest of the session, which can skew subsequent benchmarks.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
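Why the explicit close matters can be sketched with a toy worker that polls a shared opcode cell. This is a minimal illustration under assumptions, not the ParallelEnv implementation: a thread stands in for a worker process, a `threading.Event` stands in for the blocking pipe wait, and the opcode values and the names `worker_loop` and `send` are hypothetical. The worker only leaves its polling loop when it reads a close opcode, so the caller sends one in a `finally` block.

```python
import threading
import time
from multiprocessing import RawArray

# Hypothetical opcode values for this sketch.
OP_IDLE, OP_STEP, OP_CLOSE = 0, 1, 2


def worker_loop(cmds, idx, spin_for, wake, handled):
    """Spin on cmds[idx] for spin_for seconds, then fall back to blocking."""
    while True:
        deadline = time.monotonic() + spin_for
        while cmds[idx] == OP_IDLE:
            if time.monotonic() >= deadline:
                # Blocking fallback; the timeout acts as a periodic recheck
                # in case a wake-up is missed.
                wake.wait(timeout=0.005)
                wake.clear()
        op = cmds[idx]
        if op == OP_CLOSE:
            cmds[idx] = OP_IDLE
            return  # without this opcode the loop above never ends
        handled.append(op)
        cmds[idx] = OP_IDLE  # acknowledge only after the command is handled


def send(cmds, idx, op, wake):
    """Write an opcode once the previous one has been acknowledged."""
    while cmds[idx] != OP_IDLE:
        time.sleep(0)
    cmds[idx] = op
    wake.set()


cmds = RawArray("i", 1)  # one opcode cell per worker, zero-initialised
wake = threading.Event()
handled = []
worker = threading.Thread(
    target=worker_loop, args=(cmds, 0, 0.001, wake, handled)
)
worker.start()
try:
    for _ in range(3):
        send(cmds, 0, OP_STEP, wake)
finally:
    # Mirrors the explicit close: stop the poller even if the body fails.
    send(cmds, 0, OP_CLOSE, wake)
    worker.join(timeout=5.0)
```

In a pure spin variant (no blocking fallback) a worker that never receives the close opcode would keep a core busy for as long as the process lives, which is the effect the commit message describes for a shared pytest session.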
Benchmark Results: PR
| Benchmark | main ops | PR ops | Change |
|---|---|---|---|
| benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictReplayBuffer-ListStorage-RandomSampler-400] | 38.63 | 198.91 | +414.90% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictPrioritizedReplayBuffer-ListStorage-None-400] | 188.33 | 54.12 | -71.26% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-400] | 766.85 | 1,114 | +45.21% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-400] | 901.83 | 671.47 | -25.54% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] | 3,667 | 2,780 | -24.20% |
| benchmarks/test_objectives_benchmarks.py::test_dqn_speed[True-backward] | 825.82 | 994.58 | +20.44% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] | 3,016 | 3,622 | +20.10% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] | 3,284 | 2,680 | -18.40% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] | 2,932 | 3,449 | +17.63% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] | 2,681 | 3,105 | +15.79% |
| benchmarks/test_envs_benchmark.py::test_cat_frames_functional[16-same] | 20.12 | 22.67 | +12.71% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] | 2,045 | 2,274 | +11.19% |
| benchmarks/test_objectives_benchmarks.py::test_reinforce_speed[True-backward] | 115.14 | 126.40 | +9.78% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-400] | 508.85 | 556.74 | +9.41% |
| benchmarks/test_objectives_benchmarks.py::test_dqn_speed[True-None] | 1,671 | 1,823 | +9.10% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] | 2,106 | 2,296 | +9.02% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-sampler7-10000] | 768.74 | 837.53 | +8.95% |
| benchmarks/test_objectives_benchmarks.py::test_gae_speed[vec_generalized_advantage_estimate-True-32-512] | 29.36 | 31.90 | +8.65% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-400] | 468.68 | 507.97 | +8.38% |
| benchmarks/test_objectives_benchmarks.py::test_cql_speed[reduce-overhead-None] | 78.31 | 84.58 | +8.00% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-sampler6-10000] | 725.22 | 782.91 | +7.96% |
| benchmarks/test_objectives_benchmarks.py::test_ddpg_speed[True-backward] | 387.45 | 417.21 | +7.68% |
| benchmarks/test_collectors_benchmark.py::test_sync | 15.62 | 16.80 | +7.60% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] | 2,818 | 3,029 | +7.50% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] | 2,032 | 2,174 | +6.98% |
| benchmarks/test_envs_benchmark.py::test_simple | 1.7167 | 1.8182 | +5.91% |
| benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_storage_write_contiguous[100-img_shape1-atari] | 5,098 | 5,377 | +5.47% |
| benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_storage_write_lazystack[100-img_shape2-large_img] | 411.62 | 433.01 | +5.20% |
| benchmarks/test_objectives_benchmarks.py::test_ppo_speed[True-backward] | 109.55 | 115.20 | +5.15% |
| benchmarks/test_rnn_reset_backends_benchmark.py::test_rnn_rollout_with_intermediate_resets[b256-t128-i32-h512-cudnn-False-0-gru] | 1.4102 | 1.3397 | -5.00% |
| benchmarks/test_compressed_storage_benchmark.py::TestCompressedStorageBenchmark::test_tensor_to_bytestream_speed[safetensors] | 23,109 | 24,255 | +4.96% |
| benchmarks/test_rnn_reset_backends_benchmark.py::test_rnn_rollout_with_intermediate_resets[b256-t128-i32-h512-scan-False-0-lstm] | 1.9570 | 2.0537 | +4.94% |
| benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_collector_stack_then_write[100-img_shape2-large_img] | 178.92 | 170.14 | -4.91% |
| benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_collector_lazystack_then_write[100-img_shape2-large_img] | 393.67 | 412.80 | +4.86% |
| benchmarks/test_rnn_reset_backends_benchmark.py::test_rnn_rollout_with_intermediate_resets[b256-t128-i32-h512-scan-False-0-gru] | 2.9400 | 3.0818 | +4.82% |
| benchmarks/test_objectives_benchmarks.py::test_dqn_speed[reduce-overhead-None] | 1,794 | 1,870 | +4.22% |
| benchmarks/test_rnn_reset_backends_benchmark.py::test_rnn_rollout_with_intermediate_resets[b256-t128-i32-h512-cudnn-True-0-gru] | 1.4795 | 1.4185 | -4.12% |
| benchmarks/test_envs_benchmark.py::test_transformed | 0.8921 | 0.9196 | +3.09% |
| benchmarks/test_objectives_benchmarks.py::test_gae_speed[vec_generalized_advantage_estimate-True-1-512] | 652.81 | 672.76 | +3.06% |
| benchmarks/test_objectives_benchmarks.py::test_a2c_speed[True-None] | 286.20 | 294.93 | +3.05% |
| benchmarks/test_envs_benchmark.py::test_parallel | 0.9835 | 0.9539 | -3.01% |
| benchmarks/test_objectives_benchmarks.py::test_ppo_speed[True-None] | 258.19 | 265.80 | +2.95% |
| benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_storage_write_lazystack[100-img_shape1-atari] | 707.59 | 728.33 | +2.93% |
| benchmarks/test_rnn_reset_backends_benchmark.py::test_rnn_rollout_with_intermediate_resets[b256-t128-i32-h512-cudnn-False-0-lstm] | 0.8889 | 0.8635 | -2.86% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-True-False-False-True] | 37,417 | 38,475 | +2.83% |
| benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_storage_write_lazystack[200-img_shape3-large_batch] | 336.71 | 345.99 | +2.76% |
| benchmarks/test_rnn_reset_backends_benchmark.py::test_rnn_rollout_with_intermediate_resets[b256-t128-i32-h512-scan-True-0-gru] | 4.1645 | 4.2776 | +2.72% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] | 2,096 | 2,153 | +2.71% |
| benchmarks/test_objectives_benchmarks.py::test_gae_speed[vec_generalized_advantage_estimate-False-1-512] | 2,223 | 2,283 | +2.71% |
| benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_collector_lazystack_then_write[200-img_shape3-large_batch] | 314.05 | 321.93 | +2.51% |
| benchmarks/test_envs_benchmark.py::test_cat_frames_functional[16-constant] | 2,524 | 2,462 | -2.43% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] | 2,901 | 2,969 | +2.33% |
| benchmarks/test_objectives_benchmarks.py::test_ddpg_speed[reduce-overhead-None] | 708.92 | 725.32 | +2.31% |
| benchmarks/test_objectives_benchmarks.py::test_sac_speed[reduce-overhead-None] | 475.11 | 486.02 | +2.30% |
| benchmarks/test_objectives_benchmarks.py::test_redq_deprec_speed[reduce-overhead-None] | 278.80 | 285.20 | +2.29% |
| benchmarks/test_objectives_benchmarks.py::test_redq_speed[True-None] | 222.16 | 227.20 | +2.27% |
| benchmarks/test_collectors_benchmark.py::test_async | 17.69 | 18.09 | +2.27% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-10000-10000-100-True] | 25.12 | 25.69 | +2.26% |
| benchmarks/test_compressed_storage_benchmark.py::TestCompressedStorageBenchmark::test_tensor_to_bytestream_speed[numpy] | 382,320 | 373,723 | -2.25% |
| benchmarks/test_objectives_benchmarks.py::test_redq_speed[False-backward] | 55.86 | 54.67 | -2.13% |
| benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_storage_write_lazystack[50-img_shape0-small] | 4,284 | 4,374 | +2.11% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-400] | 558.85 | 570.33 | +2.05% |
| benchmarks/test_rnn_reset_backends_benchmark.py::test_rnn_rollout_with_intermediate_resets[b256-t128-i32-h512-cudnn-True-0-lstm] | 0.9652 | 0.9457 | -2.03% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-True-True-True-True] | 24,029 | 23,543 | -2.03% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-True-True-False-False] | 58,613 | 57,446 | -1.99% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-1000000-10000-100-False] | 48.57 | 49.53 | +1.98% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-True-False-True-False] | 32,851 | 32,205 | -1.97% |
| benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_collector_stack_then_write[50-img_shape0-small] | 870.98 | 888.01 | +1.96% |
| benchmarks/test_objectives_benchmarks.py::test_td3_speed[reduce-overhead-None] | 567.66 | 578.75 | +1.95% |
| benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_collector_lazystack_then_write[100-img_shape1-atari] | 657.18 | 669.99 | +1.95% |
| benchmarks/test_replaybuffer_benchmark.py::TestPrioritizedReplayBufferBenchmark::test_sampler_sample_scale[10000000-cpu] | 53.45 | 52.41 | -1.94% |
| benchmarks/test_objectives_benchmarks.py::test_dqn_speed[False-backward] | 510.48 | 520.39 | +1.94% |
| benchmarks/test_objectives_benchmarks.py::test_values[td0_return_estimate-False-False] | 7,895 | 7,742 | -1.93% |
| benchmarks/test_compressed_storage_benchmark.py::TestCompressedStorageBenchmark::test_tensor_to_bytestream_speed[pickle] | 12,138 | 11,904 | -1.93% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-False-True-False-False] | 64,654 | 65,875 | +1.89% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-True-True-False-False] | 78,530 | 77,061 | -1.87% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-True-False-True-True] | 21,983 | 22,370 | +1.76% |
| benchmarks/test_storage_write_benchmark.py::TestCollectorIntegrationBenchmark::test_collector_without_rb[100-img_shape0-atari] | 30.07 | 30.58 | +1.68% |
| benchmarks/test_objectives_benchmarks.py::test_iql_speed[reduce-overhead-None] | 116.47 | 118.40 | +1.66% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-False-True-False-True] | 30,677 | 31,155 | +1.56% |
| benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_storage_write_contiguous[200-img_shape3-large_batch] | 787.02 | 775.13 | -1.51% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-False-False-False-True] | 34,769 | 34,267 | -1.44% |
| benchmarks/test_objectives_benchmarks.py::test_td3_speed[False-backward] | 92.60 | 91.27 | -1.44% |
| benchmarks/test_objectives_benchmarks.py::test_cql_speed[True-None] | 84.46 | 85.67 | +1.43% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] | 2,955 | 2,997 | +1.43% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_iterate[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] | 171.13 | 173.46 | +1.36% |
| benchmarks/test_collectors_benchmark.py::test_single | 8.9407 | 9.0592 | +1.33% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-True-True-True-False] | 35,034 | 34,573 | -1.31% |
| benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_collector_stack_then_write[200-img_shape3-large_batch] | 142.34 | 144.19 | +1.30% |
| benchmarks/test_replaybuffer_benchmark.py::TestPrioritizedReplayBufferBenchmark::test_sampler_sample_scale[1000000-cpu] | 96.42 | 97.67 | +1.29% |
| benchmarks/test_objectives_benchmarks.py::test_td3_speed[False-None] | 124.92 | 123.32 | -1.28% |
| benchmarks/test_non_tensor_env_benchmark.py::test_non_tensor_env_rollout_speed[1000-parallel-buffers-True] | 0.5329 | 0.5396 | +1.25% |
| benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_storage_write_contiguous[100-img_shape2-large_img] | 581.04 | 574.01 | -1.21% |
| benchmarks/test_objectives_benchmarks.py::test_redq_speed[reduce-overhead-None] | 229.68 | 226.91 | -1.21% |
| benchmarks/test_objectives_benchmarks.py::test_values[vec_td_lambda_return_estimate-True-False] | 56.03 | 55.36 | -1.19% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-True-False-True-True] | 20,021 | 19,784 | -1.18% |
| benchmarks/test_objectives_benchmarks.py::test_sac_speed[False-None] | 123.15 | 124.58 | +1.16% |
| benchmarks/test_storage_write_benchmark.py::TestCollectorIntegrationBenchmark::test_collector_with_rb[200-img_shape1-large_batch] | 13.41 | 13.57 | +1.15% |
| benchmarks/test_objectives_benchmarks.py::test_reinforce_speed[False-None] | 211.90 | 209.46 | -1.15% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] | 171.38 | 169.41 | -1.15% |
| benchmarks/test_rnn_reset_backends_benchmark.py::test_rnn_rollout_with_intermediate_resets[b256-t128-i32-h512-scan-True-0-lstm] | 3.1144 | 3.1498 | +1.14% |
| benchmarks/test_storage_write_benchmark.py::TestCollectorIntegrationBenchmark::test_collector_without_rb[200-img_shape1-large_batch] | 15.20 | 15.37 | +1.13% |
| benchmarks/test_replaybuffer_benchmark.py::TestPrioritizedReplayBufferBenchmark::test_sample_mixed_devices[1000000-memmap_cpu_storage_cpu... | 81.97 | 82.87 | +1.10% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-True-True-False-True] | 42,365 | 41,902 | -1.09% |
| benchmarks/test_storage_write_benchmark.py::TestCollectorIntegrationBenchmark::test_collector_with_rb[100-img_shape0-atari] | 26.33 | 26.62 | +1.08% |
| benchmarks/test_objectives_benchmarks.py::test_gae_speed[generalized_advantage_estimate-False-1-512] | 111.60 | 110.40 | -1.07% |
| benchmarks/test_non_tensor_env_benchmark.py::test_non_tensor_env_rollout_speed[1000-single-False] | 1.6203 | 1.6370 | +1.03% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-False-False-True-True] | 19,829 | 20,032 | +1.02% |
| benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_storage_write_contiguous[50-img_shape0-small] | 7,313 | 7,383 | +0.95% |
| benchmarks/test_non_tensor_env_benchmark.py::test_non_tensor_env_rollout_speed[1000-serial-no-buffers-True] | 0.5992 | 0.6050 | +0.95% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-False-True-True-False] | 35,293 | 35,629 | +0.95% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-False-False-False-True] | 29,435 | 29,703 | +0.91% |
| benchmarks/test_objectives_benchmarks.py::test_redq_deprec_speed[True-None] | 272.66 | 275.05 | +0.88% |
| benchmarks/test_objectives_benchmarks.py::test_values[td_lambda_return_estimate-True-False] | 25.02 | 25.23 | +0.85% |
| benchmarks/test_objectives_benchmarks.py::test_ddpg_speed[False-backward] | 241.21 | 243.23 | +0.84% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-False-True-False-True] | 37,787 | 38,101 | +0.83% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-False-False-True-False] | 32,382 | 32,650 | +0.83% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-False-False-False-False] | 45,647 | 45,280 | -0.81% |
| benchmarks/test_objectives_benchmarks.py::test_cql_speed[True-backward] | 58.68 | 58.21 | -0.80% |
| benchmarks/test_objectives_benchmarks.py::test_dqn_speed[False-None] | 695.76 | 701.33 | +0.80% |
| ... | ... | ... | ... |

Showing 120 of 192 comparisons, sorted by absolute change.
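For readers checking the Change column by hand, the figures are consistent with a relative difference against main, (PR - main) / main, up to rounding of the displayed ops values. The sketch below assumes that formula and assumes higher ops is better; the `percent_change` helper and the 5% threshold variable are illustrative names, and the three rows are copied from the table above.

```python
def percent_change(main_ops: float, pr_ops: float) -> float:
    """Relative change of the PR against main, in percent (assumed formula)."""
    return (pr_ops - main_ops) / main_ops * 100.0


# (benchmark id, main ops, PR ops, Change as reported) copied from the table.
rows = [
    ("test_rb_populate[TensorDictPrioritizedReplayBuffer-ListStorage-None-400]",
     188.33, 54.12, -71.26),
    ("test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-400]",
     901.83, 671.47, -25.54),
    ("test_dqn_speed[True-backward]", 825.82, 994.58, 20.44),
]

THRESHOLD = 5.0  # percent; same cut-off the GPU summary line uses

computed = [percent_change(main, pr) for _, main, pr, _ in rows]
regressions = [name for (name, *_), c in zip(rows, computed) if c < -THRESHOLD]
improvements = [name for (name, *_), c in zip(rows, computed) if c > THRESHOLD]

for (name, _, _, reported), c in zip(rows, computed):
    print(f"{name}: computed {c:+.2f}%, reported {reported:+.2f}%")
```

Several of the largest swings are in replay-buffer benchmarks and move in opposite directions across parametrizations, so a per-row recomputation like this only checks the arithmetic, not whether a change is attributable to the PR.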
GPU
Compared 202 benchmarks. Regressions over 5%: 17. Improvements over 5%: 13.
| Benchmark | main ops | PR ops | Change |
|---|---|---|---|
| benchmarks/test_objectives_benchmarks.py::test_iql_speed[False-backward] | 68.76 | 34.77 | -49.43% |
| benchmarks/test_objectives_benchmarks.py::test_iql_speed[reduce-overhead-None] | 77.20 | 102.15 | +32.33% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-400] | 1,054 | 737.68 | -30.00% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] | 3,453 | 2,567 | -25.68% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] | 3,042 | 3,745 | +23.12% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] | 3,277 | 2,719 | -17.01% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] | 3,068 | 3,539 | +15.34% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictReplayBuffer-ListStorage-RandomSampler-400] | 45.75 | 39.04 | -14.66% |
| benchmarks/test_collectors_benchmark.py::test_single | 5.9351 | 6.7536 | +13.79% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-sampler6-10000] | 709.57 | 804.40 | +13.36% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-sampler7-10000] | 763.56 | 857.26 | +12.27% |
| benchmarks/test_collectors_benchmark.py::test_single_with_rb_pixels | 5.3024 | 4.6676 | -11.97% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] | 2,014 | 2,247 | +11.60% |
| benchmarks/test_objectives_benchmarks.py::test_dqn_speed[True-backward] | 895.17 | 988.36 | +10.41% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-400] | 496.03 | 452.56 | -8.76% |
| benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_storage_write_lazystack[100-img_shape2-large_img] | 444.10 | 410.59 | -7.55% |
| benchmarks/test_objectives_benchmarks.py::test_gae_speed[generalized_advantage_estimate-False-1-512] | 46.91 | 50.36 | +7.34% |
| benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_storage_write_contiguous[100-img_shape2-large_img] | 598.45 | 558.25 | -6.72% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] | 2,924 | 2,729 | -6.65% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-False-True-False-False] | 67,207 | 62,806 | -6.55% |
| benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_collector_lazystack_then_write[100-img_shape2-large_img] | 418.61 | 391.76 | -6.41% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-400] | 497.09 | 527.48 | +6.11% |
| benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_storage_write_contiguous[200-img_shape3-large_batch] | 784.39 | 738.47 | -5.85% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] | 2,299 | 2,166 | -5.77% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-False-True-False-True] | 38,946 | 36,702 | -5.76% |
| benchmarks/test_objectives_benchmarks.py::test_values[td1_return_estimate-False-False] | 19.64 | 20.71 | +5.48% |
| benchmarks/test_objectives_benchmarks.py::test_values[td_lambda_return_estimate-True-False] | 11.88 | 12.52 | +5.41% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-1000000-10000-100-False] | 50.26 | 47.57 | -5.35% |
| benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_storage_write_contiguous[100-img_shape1-atari] | 4,039 | 4,255 | +5.35% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-10000-10000-100-False] | 55.23 | 52.37 | -5.19% |
| benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_collector_stack_then_write[100-img_shape2-large_img] | 176.57 | 168.15 | -4.77% |
| benchmarks/test_objectives_benchmarks.py::test_a2c_speed[True-None] | 749.31 | 714.71 | -4.62% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-True-True-False-True] | 31,951 | 33,418 | +4.59% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictPrioritizedReplayBuffer-ListStorage-None-400] | 183.97 | 192.33 | +4.54% |
| benchmarks/test_compressed_storage_benchmark.py::TestCompressedStorageBenchmark::test_tensor_to_bytestream_speed[safetensors] | 23,301 | 24,350 | +4.50% |
| benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_collector_lazystack_then_write[50-img_shape0-small] | 3,389 | 3,541 | +4.48% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-True-True-False-True] | 43,335 | 41,396 | -4.48% |
| benchmarks/test_rnn_reset_backends_benchmark.py::test_rnn_rollout_with_intermediate_resets[b256-t128-i32-h512-scan-False-0-gru] | 21.94 | 22.87 | +4.25% |
| benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_collector_stack_then_write[200-img_shape3-large_batch] | 141.82 | 135.84 | -4.22% |
| benchmarks/test_envs_benchmark.py::test_simple | 1.2532 | 1.2006 | -4.19% |
| benchmarks/test_objectives_benchmarks.py::test_cql_speed[True-backward] | 211.78 | 220.43 | +4.08% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-400] | 738.61 | 709.24 | -3.98% |
| benchmarks/test_rnn_reset_backends_benchmark.py::test_rnn_rollout_with_intermediate_resets[b256-t128-i32-h512-scan-True-0-lstm] | 76.81 | 74.15 | -3.46% |
| benchmarks/test_objectives_benchmarks.py::test_values[generalized_advantage_estimate-True-True] | 47.59 | 49.23 | +3.43% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-False-False-True-True] | 20,738 | 20,053 | -3.31% |
| benchmarks/test_objectives_benchmarks.py::test_sac_speed[False-backward] | 77.69 | 80.10 | +3.11% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-True-True-True-True] | 24,039 | 23,293 | -3.11% |
| benchmarks/test_objectives_benchmarks.py::test_dqn_speed[True-None] | 1,948 | 1,888 | -3.06% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-100000-10000-100-False] | 53.61 | 52.00 | -3.00% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-True-False-True-True] | 20,230 | 19,655 | -2.84% |
| benchmarks/test_envs_benchmark.py::test_cat_frames_functional[16-constant] | 4,833 | 4,696 | -2.83% |
| benchmarks/test_objectives_benchmarks.py::test_ddpg_speed[True-backward] | 445.58 | 457.80 | +2.74% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-400] | 842.79 | 865.31 | +2.67% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-10000-10000-100-True] | 23.51 | 22.90 | -2.60% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-False-False-False-False] | 55,772 | 54,328 | -2.59% |
| benchmarks/test_objectives_benchmarks.py::test_iql_speed[True-backward] | 236.83 | 242.88 | +2.55% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-True-False-False-False] | 64,907 | 63,282 | -2.50% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-False-True-True-False] | 35,261 | 34,380 | -2.50% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-False-False-True-False] | 28,234 | 27,535 | -2.48% |
| benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_storage_write_lazystack[100-img_shape1-atari] | 721.97 | 704.28 | -2.45% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-False-True-False-True] | 31,044 | 30,344 | -2.26% |
| benchmarks/test_envs_benchmark.py::test_transformed | 0.7179 | 0.7026 | -2.13% |
| benchmarks/test_storage_write_benchmark.py::TestCollectorIntegrationBenchmark::test_collector_with_rb[200-img_shape1-large_batch] | 13.70 | 13.41 | -2.12% |
| benchmarks/test_objectives_benchmarks.py::test_td3_speed[True-None] | 737.95 | 753.50 | +2.11% |
| benchmarks/test_objectives_benchmarks.py::test_redq_deprec_speed[reduce-overhead-None] | 107.58 | 109.79 | +2.06% |
| benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_collector_lazystack_then_write[200-img_shape3-large_batch] | 307.73 | 313.96 | +2.02% |
| benchmarks/test_storage_write_benchmark.py::TestCollectorIntegrationBenchmark::test_collector_with_rb[100-img_shape0-atari] | 26.86 | 26.33 | -1.98% |
| benchmarks/test_objectives_benchmarks.py::test_ppo_speed[False-backward] | 127.02 | 129.54 | +1.98% |
| benchmarks/test_objectives_benchmarks.py::test_cql_speed[True-None] | 369.75 | 377.06 | +1.98% |
| benchmarks/test_objectives_benchmarks.py::test_gae_speed[vec_generalized_advantage_estimate-False-32-512] | 1,268 | 1,292 | +1.93% |
| benchmarks/test_replaybuffer_benchmark.py::TestPrioritizedReplayBufferBenchmark::test_sample_mixed_devices[1000000-cuda_storage_cpu_sampler] | 86.51 | 84.87 | -1.89% |
| benchmarks/test_envs_benchmark.py::test_parallel | 0.5321 | 0.5421 | +1.88% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-True-True-True-False] | 35,762 | 35,135 | -1.76% |
| benchmarks/test_objectives_benchmarks.py::test_td3_speed[False-None] | 112.88 | 114.86 | +1.75% |
| benchmarks/test_envs_benchmark.py::test_cat_frames_functional[4-constant] | 4,889 | 4,803 | -1.74% |
| benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_collector_stack_then_write[100-img_shape1-atari] | 277.02 | 272.23 | -1.73% |
| benchmarks/test_objectives_benchmarks.py::test_gae_speed[vec_generalized_advantage_estimate-True-1-512] | 1,261 | 1,283 | +1.73% |
| benchmarks/test_objectives_benchmarks.py::test_values[td0_return_estimate-False-False] | 11,562 | 11,750 | +1.62% |
| benchmarks/test_objectives_benchmarks.py::test_values[vec_td1_return_estimate-False-False] | 840.27 | 853.72 | +1.60% |
| benchmarks/test_storage_write_benchmark.py::TestCollectorIntegrationBenchmark::test_collector_with_rb_cuda[200-img_shape1-large_batch] | 8.5172 | 8.3828 | -1.58% |
| benchmarks/test_objectives_benchmarks.py::test_td3_speed[True-backward] | 372.17 | 378.03 | +1.57% |
| benchmarks/test_objectives_benchmarks.py::test_values[vec_generalized_advantage_estimate-True-True] | 289.00 | 293.51 | +1.56% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-True-True-False-False] | 77,601 | 76,392 | -1.56% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-False-True-True-True] | 21,330 | 21,002 | -1.53% |
| benchmarks/test_envs_benchmark.py::test_serial | 0.4229 | 0.4294 | +1.52% |
| benchmarks/test_non_tensor_env_benchmark.py::test_non_tensor_env_rollout_speed[1000-parallel-no-buffers-True] | 0.2185 | 0.2152 | -1.50% |
| benchmarks/test_objectives_benchmarks.py::test_ddpg_speed[False-backward] | 232.12 | 235.49 | +1.45% |
| benchmarks/test_objectives_benchmarks.py::test_reinforce_speed[False-None] | 395.26 | 400.98 | +1.45% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-False-False-False-True] | 35,518 | 35,012 | -1.43% |
| benchmarks/test_storage_write_benchmark.py::TestCollectorIntegrationBenchmark::test_collector_without_rb[200-img_shape1-large_batch] | 15.53 | 15.31 | -1.42% |
| benchmarks/test_envs_benchmark.py::test_cat_frames_functional[4-same] | 6.6167 | 6.7101 | +1.41% |
| benchmarks/test_storage_write_benchmark.py::TestCollectorIntegrationBenchmark::test_collector_without_rb_cuda[100-img_shape0-atari] | 17.79 | 17.54 | -1.39% |
| benchmarks/test_replaybuffer_benchmark.py::TestPrioritizedReplayBufferBenchmark::test_sample_mixed_devices[1000000-cuda_storage_cuda_samp... | 1,487 | 1,508 | +1.38% |
| benchmarks/test_objectives_benchmarks.py::test_ppo_speed[True-backward] | 331.28 | 335.75 | +1.35% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-False-False-True-False] | 32,525 | 32,093 | -1.33% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-True-False-True-True] | 22,610 | 22,311 | -1.32% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] | 3,525 | 3,571 | +1.32% |
| benchmarks/test_objectives_benchmarks.py::test_gae_speed[vec_generalized_advantage_estimate-False-1-512] | 1,342 | 1,325 | -1.31% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-False-False-False-True] | 29,037 | 28,659 | -1.30% |
| benchmarks/test_rnn_reset_backends_benchmark.py::test_rnn_rollout_with_intermediate_resets[b256-t128-i32-h512-scan-False-0-lstm] | 20.86 | 21.13 | +1.29% |
| benchmarks/test_objectives_benchmarks.py::test_a2c_speed[False-backward] | 145.56 | 147.41 | +1.27% |
| benchmarks/test_objectives_benchmarks.py::test_dqn_speed[False-None] | 635.06 | 643.04 | +1.26% |
| benchmarks/test_objectives_benchmarks.py::test_reinforce_speed[False-backward] | 268.62 | 271.94 | +1.23% |
| benchmarks/test_rnn_reset_backends_benchmark.py::test_rnn_rollout_with_intermediate_resets[b256-t128-i32-h512-scan-True-0-gru] | 47.23 | 47.81 | +1.22% |
| benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_storage_write_contiguous[50-img_shape0-small] | 6,144 | 6,071 | -1.18% |
| benchmarks/test_objectives_benchmarks.py::test_reinforce_speed[reduce-overhead-None] | 117.85 | 116.51 | -1.14% |
| benchmarks/test_replaybuffer_benchmark.py::TestPrioritizedReplayBufferBenchmark::test_sample_mixed_devices[1000000-memmap_cpu_storage_cud... | 978.04 | 966.90 | -1.14% |
| benchmarks/test_storage_write_benchmark.py::TestCollectorIntegrationBenchmark::test_collector_without_rb_cuda[200-img_shape1-large_batch] | 8.8255 | 8.7259 | -1.13% |
| benchmarks/test_objectives_benchmarks.py::test_iql_speed[False-None] | 99.17 | 100.28 | +1.12% |
| benchmarks/test_objectives_benchmarks.py::test_redq_deprec_speed[True-backward] | 258.56 | 261.41 | +1.10% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] | 2,371 | 2,397 | +1.09% |
| benchmarks/test_non_tensor_env_benchmark.py::test_non_tensor_env_rollout_speed[1000-parallel-buffers-True] | 0.5319 | 0.5377 | +1.09% |
| benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-False-True-True-True] | 18,942 | 18,742 | -1.05% |
| benchmarks/test_storage_write_benchmark.py::TestCollectorIntegrationBenchmark::test_collector_with_rb_cuda[100-img_shape0-atari] | 16.98 | 16.81 | -1.03% |
| benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] | 163.44 | 165.11 | +1.02% |
| benchmarks/test_objectives_benchmarks.py::test_reinforce_speed[True-backward] | 350.89 | 354.44 | +1.01% |
| benchmarks/test_objectives_benchmarks.py::test_ppo_speed[reduce-overhead-None] | 805.30 | 813.36 | +1.00% |
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-True-True-True-False] |
43,025 | 42,596 | -1.00% |
benchmarks/test_objectives_benchmarks.py::test_dqn_speed[reduce-overhead-None] |
1,919 | 1,938 | +0.98% |
benchmarks/test_objectives_benchmarks.py::test_sac_speed[False-None] |
112.28 | 113.33 | +0.94% |
| ... | ... | ... | Showing 120 of 202 comparisons, sorted by absolute change. |
Motivation

Microbenchmarking the IPC primitives used by TorchRL shows a large gap between kernel-mediated signaling and shared-memory polling (spawn ctx, torch 2.11). Primitives compared:

- `mp.Pipe`, small message
- `mp.Queue`, small message
- `mp.Event` pair
- `copy_` into pre-shared tensor + spin flag

`ParallelEnv` already spin-polls shm done-flags in the worker-to-parent direction, but the parent-to-worker direction still pays one pickled pipe message (a syscall) per worker per step.

`MultiAsyncCollector` clones every yielded batch on the main process because each worker has a single shared buffer that the next rollout overwrites.

Changes
1.
`ParallelEnv(worker_wait=..., spin_for=...)`

- `worker_wait="block"` (default): unchanged behavior.
- `worker_wait="adaptive"`: payload-free hot-path commands (`step`, `step_and_maybe_reset`) are published as opcodes in a shared-memory `RawArray` that workers spin-poll, symmetric with the existing done-flags. After `spin_for` seconds (default 1 ms) without a command, the worker advertises a sleep state and falls back to a blocking pipe wait; the parent then sends a wake message through the pipe alongside the opcode. A short poll-recheck interval covers the theoretical lost-wake race, so long policy forwards never burn CPU and latency degrades gracefully to today's pipe path.
- `worker_wait="spin"`: workers spin indefinitely (lowest latency, one busy core per worker).
- Payload-carrying commands keep using the pipe, including any `step` that carries data.
- `use_buffers=False` falls back to `"block"` with a warning; `SerialEnv` rejects the option.
- Exposed in `configure_parallel()` and `BatchedEnvConfig`.

Measured on 4 workers with a fast env (`CountingEnv`): 5755 -> 7439 batched steps/s (+29%); `"spin"` and `"adaptive"` perform equivalently, confirming the spin window covers the inter-step gap.

2.
`MultiAsyncCollector(buffer_depth=K)`

- `buffer_depth=1` (default; `None` -> 1): unchanged behavior, including the per-yield `clone()`.
- `buffer_depth=K > 1`: each worker copies its rollout into one of `K` rotating shared-memory slots; after a slot's first shipment the queue message shrinks to `(idx, slot)`, and the main process yields zero-copy views instead of cloning the full batch. The existing continue-before-yield handshake keeps a worker at most one rollout ahead, so the yielded view stays valid until the same worker has collected `K - 1` further batches: `K=2` covers the standard `for data in collector:` pattern.
- Calling `reset()` mid-iteration grants one extra rollout of lookahead, so `K=3` is recommended there (a warning is emitted with `K=2`).
- The first use of a slot ships the buffer ref through the queue, and the ref is re-sent if the `put` times out, so buffer refs cannot be lost on a full queue.
- Rejected for `MultiSyncCollector` (sync semantics conflict with collect-ahead), `replay_buffer` mode (buffers are bypassed entirely) and `use_buffers=False`.
- Exposed in `MultiSyncCollectorConfig` / `MultiAsyncCollectorConfig`.

Tests

- `test/envs/test_parallel.py::TestWorkerWait`: seeded rollout parity for `adaptive` (including a tiny `spin_for` that forces the sleep/wake path) and `spin` vs `block`, on flat and nested envs; non-tensor payload fallback; no-buffers fallback warning; validation errors; `configure_parallel`.
- `test/test_collectors.py::TestBufferDepth`: deterministic slot rotation and validity window via `data_ptr` aliasing; content parity depth-1 vs depth-2 with a deterministic policy on flat and nested-action envs; validation errors.
- Benchmark parametrizations `test_parallel_worker_wait` and `test_async_buffer_depth` added.
- `test_env_that_waits[MultiAsyncCollector-*]` fails identically with and without this change (pre-existing).

🤖 Generated with Claude Code
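For reviewers who want to reason about the `worker_wait="adaptive"` handshake in isolation, here is a minimal standalone sketch of a spin-then-block wait. It is not the code in this PR: the names `send_command`, `wait_for_command`, `demo`, the opcode and state constants, and the `recheck` interval are all hypothetical, and a thread stands in for the worker process to keep the example short. It only uses the standard library (`RawArray` for the opcode and sleep-state cells, `Pipe` for the wake message).

```python
import threading
import time
from multiprocessing import Pipe
from multiprocessing.sharedctypes import RawArray

NOOP, STEP = 0, 1          # hypothetical opcodes
SPINNING, SLEEPING = 0, 1  # hypothetical worker states


def send_command(cmd, state, idx, conn, op):
    """Parent side: publish the opcode, then wake the worker if it sleeps."""
    cmd[idx] = op
    if state[idx] == SLEEPING:
        conn.send(None)  # payload-free wake message


def wait_for_command(cmd, state, idx, conn, spin_for=1e-3, recheck=0.05):
    """Worker side: spin for `spin_for` seconds, then block on the pipe.

    Returns (opcode, slept), where `slept` tells which path delivered it.
    """
    deadline = time.perf_counter() + spin_for
    while time.perf_counter() < deadline:
        op = cmd[idx]
        if op != NOOP:
            cmd[idx] = NOOP
            return op, False
    state[idx] = SLEEPING  # advertise the sleep state before blocking
    try:
        while True:
            op = cmd[idx]  # recheck after advertising and after every wake
            if op != NOOP:
                cmd[idx] = NOOP
                return op, True
            # Bounded wait: a missed wake costs at most `recheck` seconds.
            if conn.poll(recheck):
                conn.recv()  # consume a (possibly stale) wake message
    finally:
        state[idx] = SPINNING


def demo(spin_for, delay):
    """Run one command through the handshake, with a thread as the worker."""
    cmd, state = RawArray("b", 1), RawArray("b", 1)  # zero-initialized
    parent_conn, worker_conn = Pipe()
    result = []
    worker = threading.Thread(
        target=lambda: result.append(
            wait_for_command(cmd, state, 0, worker_conn, spin_for)
        )
    )
    worker.start()
    time.sleep(delay)  # simulated policy forward on the parent side
    send_command(cmd, state, 0, parent_conn, STEP)
    worker.join()
    parent_conn.close()
    worker_conn.close()
    return result[0]
```

Calling `demo(spin_for=0.0, delay=0.1)` forces the sleep/wake path, while `demo(spin_for=5.0, delay=0.05)` delivers the opcode inside the spin window. The bounded `poll(recheck)` is what makes a missed wake cost a short delay rather than a hang, which is the role the poll-recheck interval plays in the description above.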