
[Performance] Shared-memory command signaling for ParallelEnv and ring-buffer transport for MultiAsyncCollector #3854

Open
vmoens wants to merge 2 commits into main from claude/vibrant-shannon-67d334

@vmoens vmoens commented Jun 11, 2026

Collaborator

Motivation

Microbenchmarking the IPC primitives used by TorchRL shows a large gap between kernel-mediated signaling and shared-memory polling (spawn ctx, torch 2.11):

| Mechanism | Roundtrip |
| --- | --- |
| shm flag, spin-polled | 0.1 us |
| mp.Pipe, small message | 18 us |
| mp.Queue, small message | 30 us |
| mp.Event pair | 40 us |
| 1 MiB fresh tensor through a pipe | 1800 us |
| 1 MiB copy_ into pre-shared tensor + spin flag | 40 us |

ParallelEnv already spin-polls shm done-flags on the worker-to-parent direction, but the parent-to-worker direction still pays one pickled pipe message (a syscall) per worker per step. MultiAsyncCollector clones every yielded batch on the main process because each worker has a single shared buffer that the next rollout overwrites.

Changes

1. ParallelEnv(worker_wait=..., spin_for=...)

  • worker_wait="block" (default): unchanged behavior.
  • worker_wait="adaptive": payload-free hot-path commands (step, step_and_maybe_reset) are published as opcodes in a shared-memory RawArray that workers spin-poll, symmetric with the existing done-flags. After spin_for seconds (default 1 ms) without a command, the worker advertises a sleep state and falls back to a blocking pipe wait; the parent then sends a wake message through the pipe alongside the opcode. A short poll-recheck interval covers the theoretical lost-wake race. As a result, long policy forwards never burn CPU, and latency degrades gracefully to today's pipe path.
  • worker_wait="spin": workers spin indefinitely (lowest latency, one busy core per worker).
  • Payload-carrying commands (resets, seeds, non-tensor data, RNN key passthrough) transparently keep using the pipe, including per-call fallback when step carries data. use_buffers=False falls back to "block" with a warning; SerialEnv rejects the option.
  • Exposed in configure_parallel() and BatchedEnvConfig.
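
The adaptive handshake described in the list above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's implementation: the opcode and state constants, the function names `wait_for_command` and `send_command`, and the `recheck` parameter are all hypothetical.

```python
import multiprocessing as mp
import time

# Hypothetical encodings; the PR's actual opcode and state values are not shown.
OP_NONE, OP_STEP = 0, 1
AWAKE, ASLEEP = 0, 1


def wait_for_command(opcodes, states, idx, conn, spin_for=1e-3, recheck=0.05):
    """Worker side: return (opcode, path), where path is "spin" or "pipe"."""
    deadline = time.perf_counter() + spin_for
    while time.perf_counter() < deadline:
        op = opcodes[idx]
        if op != OP_NONE:
            opcodes[idx] = OP_NONE  # consume the command
            return op, "spin"
    # Spin window expired: advertise the sleep state, then block on the pipe.
    states[idx] = ASLEEP
    while True:
        # The poll timeout acts as the recheck interval: if a wake message is
        # lost, the opcode slot is still inspected after `recheck` seconds.
        if conn.poll(recheck):
            conn.recv()  # drain the wake message
        op = opcodes[idx]
        if op != OP_NONE:
            states[idx] = AWAKE
            opcodes[idx] = OP_NONE
            return op, "pipe"


def send_command(opcodes, states, idx, conn, op):
    """Parent side: publish the opcode, and wake the worker only if it sleeps."""
    opcodes[idx] = op
    if states[idx] == ASLEEP:
        conn.send("wake")


opcodes = mp.RawArray("i", 1)  # one command slot per worker, zero-initialized
states = mp.RawArray("i", 1)   # one sleep flag per worker
worker_conn, parent_conn = mp.Pipe(duplex=False)

# Fast path: the command is already published when the worker starts spinning.
send_command(opcodes, states, 0, parent_conn, OP_STEP)
op, path = wait_for_command(opcodes, states, 0, worker_conn)
```

In this sketch a stale wake message is harmless: the next sleeping wait drains it, finds no opcode, and keeps polling.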

Measured on 4 workers with a fast env (CountingEnv): 5755 -> 7439 batched steps/s (+29%); "spin" and "adaptive" are equivalent, confirming the spin window covers the inter-step gap.
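
For a sense of scale, the two throughput figures above can be converted into per-step latency. The only inputs are the rates quoted in this description; the variable names are illustrative.

```python
before = 5755.0  # batched steps/s, first figure quoted above
after = 7439.0   # batched steps/s, second figure quoted above

# Relative throughput gain in percent.
gain_pct = (after / before - 1.0) * 100.0

# Wall-clock time per batched step, in microseconds.
before_us = 1e6 / before
after_us = 1e6 / after
saved_us = before_us - after_us

print(f"gain: {gain_pct:.1f}%")
print(f"per step: {before_us:.1f} us -> {after_us:.1f} us (saves {saved_us:.1f} us)")
```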

2. MultiAsyncCollector(buffer_depth=K)

  • Default (None -> 1): unchanged behavior, including the per-yield clone().
  • buffer_depth=K > 1: each worker copies its rollout into one of K rotating shared-memory slots; after a slot's first shipment the queue message shrinks to (idx, slot), and the main process yields zero-copy views instead of cloning the full batch. The existing continue-before-yield handshake keeps a worker at most one rollout ahead, so the yielded view stays valid until the same worker has collected K - 1 further batches. K=2 therefore covers the standard `for data in collector:` pattern. reset() mid-iteration grants one extra rollout of lookahead, so K=3 is recommended there (a warning is emitted with K=2).
  • The slot's first shipment is re-sent if the queue put times out, so buffer refs cannot be lost on a full queue.
  • Rejected with clear errors for MultiSyncCollector (sync semantics conflict with collect-ahead), replay_buffer mode (buffers are bypassed entirely) and use_buffers=False.
  • Exposed in MultiSyncCollectorConfig / MultiAsyncCollectorConfig.
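
The slot rotation and validity window described in the list above can be illustrated with a small stand-in. This is only a sketch: `SlotRing` is a hypothetical name, plain `bytearray` buffers stand in for the shared-memory tensors, and the queue handshake is omitted.

```python
class SlotRing:
    """K preallocated buffers reused round-robin (stand-in for shm slots)."""

    def __init__(self, depth, nbytes):
        if depth < 1:
            raise ValueError("depth must be >= 1")
        self.nbytes = nbytes
        self.slots = [bytearray(nbytes) for _ in range(depth)]
        self.count = 0

    def publish(self, payload):
        """Copy a rollout into the next slot in place and return its index."""
        if len(payload) != self.nbytes:
            raise ValueError("payload must fill the slot exactly")
        slot = self.count % len(self.slots)
        self.slots[slot][:] = payload  # same-length copy, no reallocation
        self.count += 1
        return slot

    def view(self, slot):
        """Zero-copy view of a slot, as the main process would yield it."""
        return memoryview(self.slots[slot])


ring = SlotRing(depth=2, nbytes=4)
first = ring.view(ring.publish(b"aaaa"))  # batch n, handed out as a view
ring.publish(b"bbbb")                     # worker runs one rollout ahead
still_valid = bytes(first)                # slot 0 untouched so far
ring.publish(b"cccc")                     # wraps around and reuses slot 0
overwritten = bytes(first)                # the old view now shows new data
```

With depth 2, the view survives one further batch and is overwritten by the second, which matches the K - 1 window stated above.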

Tests

  • test/envs/test_parallel.py::TestWorkerWait: seeded rollout parity for adaptive (including a tiny spin_for that forces the sleep/wake path) and spin vs block, on flat and nested envs; non-tensor payload fallback; no-buffers fallback warning; validation errors; configure_parallel.
  • test/test_collectors.py::TestBufferDepth: deterministic slot rotation and validity window via data_ptr aliasing; content parity depth-1 vs depth-2 with a deterministic policy on flat and nested-action envs; validation errors.
  • Benchmarks: test_parallel_worker_wait and test_async_buffer_depth parametrizations added.

test_env_that_waits[MultiAsyncCollector-*] fails identically with and without this change (pre-existing).

🤖 Generated with Claude Code

@meta-cla meta-cla Bot added the CLA Signed label Jun 11, 2026
@github-actions github-actions Bot added Performance Benchmarks rl/benchmark changes Collectors Trainers Integrations/torch_geometric Integrations and removed Performance labels Jun 11, 2026
pytorch-bot Bot commented Jun 11, 2026


🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/rl/3854

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 1 New Failure, 10 Pending, 1 Unrelated Failure

As of commit 5363e83 with merge base 26ece89:

NEW FAILURE - The following job has failed:

BROKEN TRUNK - The following job failed but was also failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

… transport for MultiAsyncCollector

Two IPC optimizations targeting per-step syscalls and per-batch copies:

1. ParallelEnv worker_wait="adaptive"|"spin" (default "block", unchanged):
   payload-free hot-path commands (step, step_and_maybe_reset) are written
   as opcodes into a shared-memory RawArray that workers spin-poll instead
   of being pickled and sent through a pipe (one syscall per worker per
   step). This mirrors the existing shm done-flags in the other direction.
   In adaptive mode workers spin for spin_for seconds then fall back to a
   blocking pipe wait, advertising a sleep state so the parent wakes them
   through the pipe; a short poll recheck covers the theoretical lost-wake
   window. Payload-carrying commands (resets, seeds, non-tensor data, RNN
   passthrough) keep using the pipe, and no-buffer mode falls back to
   "block" with a warning. ~30% step-throughput gain on 4 workers with a
   fast env.

2. MultiAsyncCollector buffer_depth=K (default 1, unchanged): workers copy
   each rollout into one of K rotating shared-memory slots and the queue
   message shrinks to (idx, slot); the main process yields zero-copy views
   instead of cloning every batch. A yielded batch stays valid until the
   same worker has collected K-1 further batches; K=2 covers the standard
   iteration pattern since the continue-before-yield handshake keeps
   workers at most one rollout ahead. First use of a slot ships the buffer
   ref through the queue (re-sent if the put times out). Rejected for
   MultiSyncCollector, replay_buffer mode and use_buffers=False.

Both knobs are exposed in configure_parallel and the Hydra configs
(BatchedEnvConfig, MultiSyncCollectorConfig, MultiAsyncCollectorConfig),
covered by tests (TestWorkerWait, TestBufferDepth) and parametrized
benchmarks (test_parallel_worker_wait, test_async_buffer_depth).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@vmoens vmoens force-pushed the claude/vibrant-shannon-67d334 branch from 2f286cb to f4e30c3 Compare June 11, 2026 16:30
@github-actions github-actions Bot added the Performance label Jun 11, 2026
…nning workers

The benchmark files share one pytest session; without an explicit close the
"spin" case leaves three busy-waiting workers burning cores for the rest of
the session, which can skew subsequent benchmarks.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@github-actions

Contributor

Benchmark Results: PR 5363e83c vs main 26ece893

Benchmark run: https://github.com/pytorch/rl/actions/runs/27362325079

Higher ops/sec is better. Tables are sorted by largest absolute change.
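
As a reading aid, the Change column and the sort order can be reproduced from the two ops/sec columns roughly as below. This is a sketch, not the benchmark bot's code: `pct_change` is a hypothetical helper, and the three rows are copied from the CPU table. Because the table shows rounded ops/sec, recomputed percentages can differ from the printed ones in the last digit.

```python
def pct_change(main_ops, pr_ops):
    """Relative change of the PR over main, in percent (higher is better)."""
    return (pr_ops - main_ops) / main_ops * 100.0


# (benchmark, main ops/sec, PR ops/sec), taken from the CPU table.
rows = [
    ("test_envs_benchmark.py::test_parallel", 0.9835, 0.9539),
    ("test_collectors_benchmark.py::test_async", 17.69, 18.09),
    ("test_collectors_benchmark.py::test_sync", 15.62, 16.80),
]

# Sort by the magnitude of the change, largest first, as the tables do.
ranked = sorted(rows, key=lambda r: abs(pct_change(r[1], r[2])), reverse=True)

for name, main_ops, pr_ops in ranked:
    print(f"{name}: {pct_change(main_ops, pr_ops):+.2f}%")
```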

CPU

Compared 192 benchmarks. Regressions over 5%: 5. Improvements over 5%: 25.

Benchmark main ops PR ops Change
benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictReplayBuffer-ListStorage-RandomSampler-400] 38.63 198.91 +414.90%
benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictPrioritizedReplayBuffer-ListStorage-None-400] 188.33 54.12 -71.26%
benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-400] 766.85 1,114 +45.21%
benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-400] 901.83 671.47 -25.54%
benchmarks/test_replaybuffer_benchmark.py::test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] 3,667 2,780 -24.20%
benchmarks/test_objectives_benchmarks.py::test_dqn_speed[True-backward] 825.82 994.58 +20.44%
benchmarks/test_replaybuffer_benchmark.py::test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] 3,016 3,622 +20.10%
benchmarks/test_replaybuffer_benchmark.py::test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] 3,284 2,680 -18.40%
benchmarks/test_replaybuffer_benchmark.py::test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] 2,932 3,449 +17.63%
benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] 2,681 3,105 +15.79%
benchmarks/test_envs_benchmark.py::test_cat_frames_functional[16-same] 20.12 22.67 +12.71%
benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] 2,045 2,274 +11.19%
benchmarks/test_objectives_benchmarks.py::test_reinforce_speed[True-backward] 115.14 126.40 +9.78%
benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-400] 508.85 556.74 +9.41%
benchmarks/test_objectives_benchmarks.py::test_dqn_speed[True-None] 1,671 1,823 +9.10%
benchmarks/test_replaybuffer_benchmark.py::test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] 2,106 2,296 +9.02%
benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-sampler7-10000] 768.74 837.53 +8.95%
benchmarks/test_objectives_benchmarks.py::test_gae_speed[vec_generalized_advantage_estimate-True-32-512] 29.36 31.90 +8.65%
benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-400] 468.68 507.97 +8.38%
benchmarks/test_objectives_benchmarks.py::test_cql_speed[reduce-overhead-None] 78.31 84.58 +8.00%
benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-sampler6-10000] 725.22 782.91 +7.96%
benchmarks/test_objectives_benchmarks.py::test_ddpg_speed[True-backward] 387.45 417.21 +7.68%
benchmarks/test_collectors_benchmark.py::test_sync 15.62 16.80 +7.60%
benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] 2,818 3,029 +7.50%
benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] 2,032 2,174 +6.98%
benchmarks/test_envs_benchmark.py::test_simple 1.7167 1.8182 +5.91%
benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_storage_write_contiguous[100-img_shape1-atari] 5,098 5,377 +5.47%
benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_storage_write_lazystack[100-img_shape2-large_img] 411.62 433.01 +5.20%
benchmarks/test_objectives_benchmarks.py::test_ppo_speed[True-backward] 109.55 115.20 +5.15%
benchmarks/test_rnn_reset_backends_benchmark.py::test_rnn_rollout_with_intermediate_resets[b256-t128-i32-h512-cudnn-False-0-gru] 1.4102 1.3397 -5.00%
benchmarks/test_compressed_storage_benchmark.py::TestCompressedStorageBenchmark::test_tensor_to_bytestream_speed[safetensors] 23,109 24,255 +4.96%
benchmarks/test_rnn_reset_backends_benchmark.py::test_rnn_rollout_with_intermediate_resets[b256-t128-i32-h512-scan-False-0-lstm] 1.9570 2.0537 +4.94%
benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_collector_stack_then_write[100-img_shape2-large_img] 178.92 170.14 -4.91%
benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_collector_lazystack_then_write[100-img_shape2-large_img] 393.67 412.80 +4.86%
benchmarks/test_rnn_reset_backends_benchmark.py::test_rnn_rollout_with_intermediate_resets[b256-t128-i32-h512-scan-False-0-gru] 2.9400 3.0818 +4.82%
benchmarks/test_objectives_benchmarks.py::test_dqn_speed[reduce-overhead-None] 1,794 1,870 +4.22%
benchmarks/test_rnn_reset_backends_benchmark.py::test_rnn_rollout_with_intermediate_resets[b256-t128-i32-h512-cudnn-True-0-gru] 1.4795 1.4185 -4.12%
benchmarks/test_envs_benchmark.py::test_transformed 0.8921 0.9196 +3.09%
benchmarks/test_objectives_benchmarks.py::test_gae_speed[vec_generalized_advantage_estimate-True-1-512] 652.81 672.76 +3.06%
benchmarks/test_objectives_benchmarks.py::test_a2c_speed[True-None] 286.20 294.93 +3.05%
benchmarks/test_envs_benchmark.py::test_parallel 0.9835 0.9539 -3.01%
benchmarks/test_objectives_benchmarks.py::test_ppo_speed[True-None] 258.19 265.80 +2.95%
benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_storage_write_lazystack[100-img_shape1-atari] 707.59 728.33 +2.93%
benchmarks/test_rnn_reset_backends_benchmark.py::test_rnn_rollout_with_intermediate_resets[b256-t128-i32-h512-cudnn-False-0-lstm] 0.8889 0.8635 -2.86%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-True-False-False-True] 37,417 38,475 +2.83%
benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_storage_write_lazystack[200-img_shape3-large_batch] 336.71 345.99 +2.76%
benchmarks/test_rnn_reset_backends_benchmark.py::test_rnn_rollout_with_intermediate_resets[b256-t128-i32-h512-scan-True-0-gru] 4.1645 4.2776 +2.72%
benchmarks/test_replaybuffer_benchmark.py::test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] 2,096 2,153 +2.71%
benchmarks/test_objectives_benchmarks.py::test_gae_speed[vec_generalized_advantage_estimate-False-1-512] 2,223 2,283 +2.71%
benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_collector_lazystack_then_write[200-img_shape3-large_batch] 314.05 321.93 +2.51%
benchmarks/test_envs_benchmark.py::test_cat_frames_functional[16-constant] 2,524 2,462 -2.43%
benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] 2,901 2,969 +2.33%
benchmarks/test_objectives_benchmarks.py::test_ddpg_speed[reduce-overhead-None] 708.92 725.32 +2.31%
benchmarks/test_objectives_benchmarks.py::test_sac_speed[reduce-overhead-None] 475.11 486.02 +2.30%
benchmarks/test_objectives_benchmarks.py::test_redq_deprec_speed[reduce-overhead-None] 278.80 285.20 +2.29%
benchmarks/test_objectives_benchmarks.py::test_redq_speed[True-None] 222.16 227.20 +2.27%
benchmarks/test_collectors_benchmark.py::test_async 17.69 18.09 +2.27%
benchmarks/test_replaybuffer_benchmark.py::test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-10000-10000-100-True] 25.12 25.69 +2.26%
benchmarks/test_compressed_storage_benchmark.py::TestCompressedStorageBenchmark::test_tensor_to_bytestream_speed[numpy] 382,320 373,723 -2.25%
benchmarks/test_objectives_benchmarks.py::test_redq_speed[False-backward] 55.86 54.67 -2.13%
benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_storage_write_lazystack[50-img_shape0-small] 4,284 4,374 +2.11%
benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-400] 558.85 570.33 +2.05%
benchmarks/test_rnn_reset_backends_benchmark.py::test_rnn_rollout_with_intermediate_resets[b256-t128-i32-h512-cudnn-True-0-lstm] 0.9652 0.9457 -2.03%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-True-True-True-True] 24,029 23,543 -2.03%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-True-True-False-False] 58,613 57,446 -1.99%
benchmarks/test_replaybuffer_benchmark.py::test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-1000000-10000-100-False] 48.57 49.53 +1.98%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-True-False-True-False] 32,851 32,205 -1.97%
benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_collector_stack_then_write[50-img_shape0-small] 870.98 888.01 +1.96%
benchmarks/test_objectives_benchmarks.py::test_td3_speed[reduce-overhead-None] 567.66 578.75 +1.95%
benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_collector_lazystack_then_write[100-img_shape1-atari] 657.18 669.99 +1.95%
benchmarks/test_replaybuffer_benchmark.py::TestPrioritizedReplayBufferBenchmark::test_sampler_sample_scale[10000000-cpu] 53.45 52.41 -1.94%
benchmarks/test_objectives_benchmarks.py::test_dqn_speed[False-backward] 510.48 520.39 +1.94%
benchmarks/test_objectives_benchmarks.py::test_values[td0_return_estimate-False-False] 7,895 7,742 -1.93%
benchmarks/test_compressed_storage_benchmark.py::TestCompressedStorageBenchmark::test_tensor_to_bytestream_speed[pickle] 12,138 11,904 -1.93%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-False-True-False-False] 64,654 65,875 +1.89%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-True-True-False-False] 78,530 77,061 -1.87%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-True-False-True-True] 21,983 22,370 +1.76%
benchmarks/test_storage_write_benchmark.py::TestCollectorIntegrationBenchmark::test_collector_without_rb[100-img_shape0-atari] 30.07 30.58 +1.68%
benchmarks/test_objectives_benchmarks.py::test_iql_speed[reduce-overhead-None] 116.47 118.40 +1.66%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-False-True-False-True] 30,677 31,155 +1.56%
benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_storage_write_contiguous[200-img_shape3-large_batch] 787.02 775.13 -1.51%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-False-False-False-True] 34,769 34,267 -1.44%
benchmarks/test_objectives_benchmarks.py::test_td3_speed[False-backward] 92.60 91.27 -1.44%
benchmarks/test_objectives_benchmarks.py::test_cql_speed[True-None] 84.46 85.67 +1.43%
benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] 2,955 2,997 +1.43%
benchmarks/test_replaybuffer_benchmark.py::test_rb_iterate[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] 171.13 173.46 +1.36%
benchmarks/test_collectors_benchmark.py::test_single 8.9407 9.0592 +1.33%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-True-True-True-False] 35,034 34,573 -1.31%
benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_collector_stack_then_write[200-img_shape3-large_batch] 142.34 144.19 +1.30%
benchmarks/test_replaybuffer_benchmark.py::TestPrioritizedReplayBufferBenchmark::test_sampler_sample_scale[1000000-cpu] 96.42 97.67 +1.29%
benchmarks/test_objectives_benchmarks.py::test_td3_speed[False-None] 124.92 123.32 -1.28%
benchmarks/test_non_tensor_env_benchmark.py::test_non_tensor_env_rollout_speed[1000-parallel-buffers-True] 0.5329 0.5396 +1.25%
benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_storage_write_contiguous[100-img_shape2-large_img] 581.04 574.01 -1.21%
benchmarks/test_objectives_benchmarks.py::test_redq_speed[reduce-overhead-None] 229.68 226.91 -1.21%
benchmarks/test_objectives_benchmarks.py::test_values[vec_td_lambda_return_estimate-True-False] 56.03 55.36 -1.19%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-True-False-True-True] 20,021 19,784 -1.18%
benchmarks/test_objectives_benchmarks.py::test_sac_speed[False-None] 123.15 124.58 +1.16%
benchmarks/test_storage_write_benchmark.py::TestCollectorIntegrationBenchmark::test_collector_with_rb[200-img_shape1-large_batch] 13.41 13.57 +1.15%
benchmarks/test_objectives_benchmarks.py::test_reinforce_speed[False-None] 211.90 209.46 -1.15%
benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] 171.38 169.41 -1.15%
benchmarks/test_rnn_reset_backends_benchmark.py::test_rnn_rollout_with_intermediate_resets[b256-t128-i32-h512-scan-True-0-lstm] 3.1144 3.1498 +1.14%
benchmarks/test_storage_write_benchmark.py::TestCollectorIntegrationBenchmark::test_collector_without_rb[200-img_shape1-large_batch] 15.20 15.37 +1.13%
benchmarks/test_replaybuffer_benchmark.py::TestPrioritizedReplayBufferBenchmark::test_sample_mixed_devices[1000000-memmap_cpu_storage_cpu... 81.97 82.87 +1.10%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-True-True-False-True] 42,365 41,902 -1.09%
benchmarks/test_storage_write_benchmark.py::TestCollectorIntegrationBenchmark::test_collector_with_rb[100-img_shape0-atari] 26.33 26.62 +1.08%
benchmarks/test_objectives_benchmarks.py::test_gae_speed[generalized_advantage_estimate-False-1-512] 111.60 110.40 -1.07%
benchmarks/test_non_tensor_env_benchmark.py::test_non_tensor_env_rollout_speed[1000-single-False] 1.6203 1.6370 +1.03%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-False-False-True-True] 19,829 20,032 +1.02%
benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_storage_write_contiguous[50-img_shape0-small] 7,313 7,383 +0.95%
benchmarks/test_non_tensor_env_benchmark.py::test_non_tensor_env_rollout_speed[1000-serial-no-buffers-True] 0.5992 0.6050 +0.95%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-False-True-True-False] 35,293 35,629 +0.95%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-False-False-False-True] 29,435 29,703 +0.91%
benchmarks/test_objectives_benchmarks.py::test_redq_deprec_speed[True-None] 272.66 275.05 +0.88%
benchmarks/test_objectives_benchmarks.py::test_values[td_lambda_return_estimate-True-False] 25.02 25.23 +0.85%
benchmarks/test_objectives_benchmarks.py::test_ddpg_speed[False-backward] 241.21 243.23 +0.84%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-False-True-False-True] 37,787 38,101 +0.83%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-False-False-True-False] 32,382 32,650 +0.83%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-False-False-False-False] 45,647 45,280 -0.81%
benchmarks/test_objectives_benchmarks.py::test_cql_speed[True-backward] 58.68 58.21 -0.80%
benchmarks/test_objectives_benchmarks.py::test_dqn_speed[False-None] 695.76 701.33 +0.80%
... ... ... Showing 120 of 192 comparisons, sorted by absolute change.

GPU

Compared 202 benchmarks. Regressions over 5%: 17. Improvements over 5%: 13.

Benchmark main ops PR ops Change
benchmarks/test_objectives_benchmarks.py::test_iql_speed[False-backward] 68.76 34.77 -49.43%
benchmarks/test_objectives_benchmarks.py::test_iql_speed[reduce-overhead-None] 77.20 102.15 +32.33%
benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-400] 1,054 737.68 -30.00%
benchmarks/test_replaybuffer_benchmark.py::test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] 3,453 2,567 -25.68%
benchmarks/test_replaybuffer_benchmark.py::test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] 3,042 3,745 +23.12%
benchmarks/test_replaybuffer_benchmark.py::test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] 3,277 2,719 -17.01%
benchmarks/test_replaybuffer_benchmark.py::test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] 3,068 3,539 +15.34%
benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictReplayBuffer-ListStorage-RandomSampler-400] 45.75 39.04 -14.66%
benchmarks/test_collectors_benchmark.py::test_single 5.9351 6.7536 +13.79%
benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-sampler6-10000] 709.57 804.40 +13.36%
benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-sampler7-10000] 763.56 857.26 +12.27%
benchmarks/test_collectors_benchmark.py::test_single_with_rb_pixels 5.3024 4.6676 -11.97%
benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] 2,014 2,247 +11.60%
benchmarks/test_objectives_benchmarks.py::test_dqn_speed[True-backward] 895.17 988.36 +10.41%
benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-400] 496.03 452.56 -8.76%
benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_storage_write_lazystack[100-img_shape2-large_img] 444.10 410.59 -7.55%
benchmarks/test_objectives_benchmarks.py::test_gae_speed[generalized_advantage_estimate-False-1-512] 46.91 50.36 +7.34%
benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_storage_write_contiguous[100-img_shape2-large_img] 598.45 558.25 -6.72%
benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] 2,924 2,729 -6.65%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-False-True-False-False] 67,207 62,806 -6.55%
benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_collector_lazystack_then_write[100-img_shape2-large_img] 418.61 391.76 -6.41%
benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-400] 497.09 527.48 +6.11%
benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_storage_write_contiguous[200-img_shape3-large_batch] 784.39 738.47 -5.85%
benchmarks/test_replaybuffer_benchmark.py::test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] 2,299 2,166 -5.77%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-False-True-False-True] 38,946 36,702 -5.76%
benchmarks/test_objectives_benchmarks.py::test_values[td1_return_estimate-False-False] 19.64 20.71 +5.48%
benchmarks/test_objectives_benchmarks.py::test_values[td_lambda_return_estimate-True-False] 11.88 12.52 +5.41%
benchmarks/test_replaybuffer_benchmark.py::test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-1000000-10000-100-False] 50.26 47.57 -5.35%
benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_storage_write_contiguous[100-img_shape1-atari] 4,039 4,255 +5.35%
benchmarks/test_replaybuffer_benchmark.py::test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-10000-10000-100-False] 55.23 52.37 -5.19%
benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_collector_stack_then_write[100-img_shape2-large_img] 176.57 168.15 -4.77%
benchmarks/test_objectives_benchmarks.py::test_a2c_speed[True-None] 749.31 714.71 -4.62%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-True-True-False-True] 31,951 33,418 +4.59%
benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictPrioritizedReplayBuffer-ListStorage-None-400] 183.97 192.33 +4.54%
benchmarks/test_compressed_storage_benchmark.py::TestCompressedStorageBenchmark::test_tensor_to_bytestream_speed[safetensors] 23,301 24,350 +4.50%
benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_collector_lazystack_then_write[50-img_shape0-small] 3,389 3,541 +4.48%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-True-True-False-True] 43,335 41,396 -4.48%
benchmarks/test_rnn_reset_backends_benchmark.py::test_rnn_rollout_with_intermediate_resets[b256-t128-i32-h512-scan-False-0-gru] 21.94 22.87 +4.25%
benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_collector_stack_then_write[200-img_shape3-large_batch] 141.82 135.84 -4.22%
benchmarks/test_envs_benchmark.py::test_simple 1.2532 1.2006 -4.19%
benchmarks/test_objectives_benchmarks.py::test_cql_speed[True-backward] 211.78 220.43 +4.08%
benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-400] 738.61 709.24 -3.98%
benchmarks/test_rnn_reset_backends_benchmark.py::test_rnn_rollout_with_intermediate_resets[b256-t128-i32-h512-scan-True-0-lstm] 76.81 74.15 -3.46%
benchmarks/test_objectives_benchmarks.py::test_values[generalized_advantage_estimate-True-True] 47.59 49.23 +3.43%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-False-False-True-True] 20,738 20,053 -3.31%
benchmarks/test_objectives_benchmarks.py::test_sac_speed[False-backward] 77.69 80.10 +3.11%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-True-True-True-True] 24,039 23,293 -3.11%
benchmarks/test_objectives_benchmarks.py::test_dqn_speed[True-None] 1,948 1,888 -3.06%
benchmarks/test_replaybuffer_benchmark.py::test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-100000-10000-100-False] 53.61 52.00 -3.00%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-True-False-True-True] 20,230 19,655 -2.84%
benchmarks/test_envs_benchmark.py::test_cat_frames_functional[16-constant] 4,833 4,696 -2.83%
benchmarks/test_objectives_benchmarks.py::test_ddpg_speed[True-backward] 445.58 457.80 +2.74%
benchmarks/test_replaybuffer_benchmark.py::test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-400] 842.79 865.31 +2.67%
benchmarks/test_replaybuffer_benchmark.py::test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-10000-10000-100-True] 23.51 22.90 -2.60%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-False-False-False-False] 55,772 54,328 -2.59%
benchmarks/test_objectives_benchmarks.py::test_iql_speed[True-backward] 236.83 242.88 +2.55%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-True-False-False-False] 64,907 63,282 -2.50%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-False-True-True-False] 35,261 34,380 -2.50%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-False-False-True-False] 28,234 27,535 -2.48%
benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_storage_write_lazystack[100-img_shape1-atari] 721.97 704.28 -2.45%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-False-True-False-True] 31,044 30,344 -2.26%
benchmarks/test_envs_benchmark.py::test_transformed 0.7179 0.7026 -2.13%
benchmarks/test_storage_write_benchmark.py::TestCollectorIntegrationBenchmark::test_collector_with_rb[200-img_shape1-large_batch] 13.70 13.41 -2.12%
benchmarks/test_objectives_benchmarks.py::test_td3_speed[True-None] 737.95 753.50 +2.11%
benchmarks/test_objectives_benchmarks.py::test_redq_deprec_speed[reduce-overhead-None] 107.58 109.79 +2.06%
benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_collector_lazystack_then_write[200-img_shape3-large_batch] 307.73 313.96 +2.02%
benchmarks/test_storage_write_benchmark.py::TestCollectorIntegrationBenchmark::test_collector_with_rb[100-img_shape0-atari] 26.86 26.33 -1.98%
benchmarks/test_objectives_benchmarks.py::test_ppo_speed[False-backward] 127.02 129.54 +1.98%
benchmarks/test_objectives_benchmarks.py::test_cql_speed[True-None] 369.75 377.06 +1.98%
benchmarks/test_objectives_benchmarks.py::test_gae_speed[vec_generalized_advantage_estimate-False-32-512] 1,268 1,292 +1.93%
benchmarks/test_replaybuffer_benchmark.py::TestPrioritizedReplayBufferBenchmark::test_sample_mixed_devices[1000000-cuda_storage_cpu_sampler] 86.51 84.87 -1.89%
benchmarks/test_envs_benchmark.py::test_parallel 0.5321 0.5421 +1.88%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-True-True-True-False] 35,762 35,135 -1.76%
benchmarks/test_objectives_benchmarks.py::test_td3_speed[False-None] 112.88 114.86 +1.75%
benchmarks/test_envs_benchmark.py::test_cat_frames_functional[4-constant] 4,889 4,803 -1.74%
benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_collector_stack_then_write[100-img_shape1-atari] 277.02 272.23 -1.73%
benchmarks/test_objectives_benchmarks.py::test_gae_speed[vec_generalized_advantage_estimate-True-1-512] 1,261 1,283 +1.73%
benchmarks/test_objectives_benchmarks.py::test_values[td0_return_estimate-False-False] 11,562 11,750 +1.62%
benchmarks/test_objectives_benchmarks.py::test_values[vec_td1_return_estimate-False-False] 840.27 853.72 +1.60%
benchmarks/test_storage_write_benchmark.py::TestCollectorIntegrationBenchmark::test_collector_with_rb_cuda[200-img_shape1-large_batch] 8.5172 8.3828 -1.58%
benchmarks/test_objectives_benchmarks.py::test_td3_speed[True-backward] 372.17 378.03 +1.57%
benchmarks/test_objectives_benchmarks.py::test_values[vec_generalized_advantage_estimate-True-True] 289.00 293.51 +1.56%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-True-True-False-False] 77,601 76,392 -1.56%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-False-True-True-True] 21,330 21,002 -1.53%
benchmarks/test_envs_benchmark.py::test_serial 0.4229 0.4294 +1.52%
benchmarks/test_non_tensor_env_benchmark.py::test_non_tensor_env_rollout_speed[1000-parallel-no-buffers-True] 0.2185 0.2152 -1.50%
benchmarks/test_objectives_benchmarks.py::test_ddpg_speed[False-backward] 232.12 235.49 +1.45%
benchmarks/test_objectives_benchmarks.py::test_reinforce_speed[False-None] 395.26 400.98 +1.45%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-False-False-False-True] 35,518 35,012 -1.43%
benchmarks/test_storage_write_benchmark.py::TestCollectorIntegrationBenchmark::test_collector_without_rb[200-img_shape1-large_batch] 15.53 15.31 -1.42%
benchmarks/test_envs_benchmark.py::test_cat_frames_functional[4-same] 6.6167 6.7101 +1.41%
benchmarks/test_storage_write_benchmark.py::TestCollectorIntegrationBenchmark::test_collector_without_rb_cuda[100-img_shape0-atari] 17.79 17.54 -1.39%
benchmarks/test_replaybuffer_benchmark.py::TestPrioritizedReplayBufferBenchmark::test_sample_mixed_devices[1000000-cuda_storage_cuda_samp... 1,487 1,508 +1.38%
benchmarks/test_objectives_benchmarks.py::test_ppo_speed[True-backward] 331.28 335.75 +1.35%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-False-False-True-False] 32,525 32,093 -1.33%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-True-False-True-True] 22,610 22,311 -1.32%
benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] 3,525 3,571 +1.32%
benchmarks/test_objectives_benchmarks.py::test_gae_speed[vec_generalized_advantage_estimate-False-1-512] 1,342 1,325 -1.31%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-False-False-False-True] 29,037 28,659 -1.30%
benchmarks/test_rnn_reset_backends_benchmark.py::test_rnn_rollout_with_intermediate_resets[b256-t128-i32-h512-scan-False-0-lstm] 20.86 21.13 +1.29%
benchmarks/test_objectives_benchmarks.py::test_a2c_speed[False-backward] 145.56 147.41 +1.27%
benchmarks/test_objectives_benchmarks.py::test_dqn_speed[False-None] 635.06 643.04 +1.26%
benchmarks/test_objectives_benchmarks.py::test_reinforce_speed[False-backward] 268.62 271.94 +1.23%
benchmarks/test_rnn_reset_backends_benchmark.py::test_rnn_rollout_with_intermediate_resets[b256-t128-i32-h512-scan-True-0-gru] 47.23 47.81 +1.22%
benchmarks/test_storage_write_benchmark.py::TestStorageWriteBenchmark::test_storage_write_contiguous[50-img_shape0-small] 6,144 6,071 -1.18%
benchmarks/test_objectives_benchmarks.py::test_reinforce_speed[reduce-overhead-None] 117.85 116.51 -1.14%
benchmarks/test_replaybuffer_benchmark.py::TestPrioritizedReplayBufferBenchmark::test_sample_mixed_devices[1000000-memmap_cpu_storage_cud... 978.04 966.90 -1.14%
benchmarks/test_storage_write_benchmark.py::TestCollectorIntegrationBenchmark::test_collector_without_rb_cuda[200-img_shape1-large_batch] 8.8255 8.7259 -1.13%
benchmarks/test_objectives_benchmarks.py::test_iql_speed[False-None] 99.17 100.28 +1.12%
benchmarks/test_objectives_benchmarks.py::test_redq_deprec_speed[True-backward] 258.56 261.41 +1.10%
benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] 2,371 2,397 +1.09%
benchmarks/test_non_tensor_env_benchmark.py::test_non_tensor_env_rollout_speed[1000-parallel-buffers-True] 0.5319 0.5377 +1.09%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[False-False-True-True-True] 18,942 18,742 -1.05%
benchmarks/test_storage_write_benchmark.py::TestCollectorIntegrationBenchmark::test_collector_with_rb_cuda[100-img_shape0-atari] 16.98 16.81 -1.03%
benchmarks/test_replaybuffer_benchmark.py::test_rb_sample[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] 163.44 165.11 +1.02%
benchmarks/test_objectives_benchmarks.py::test_reinforce_speed[True-backward] 350.89 354.44 +1.01%
benchmarks/test_objectives_benchmarks.py::test_ppo_speed[reduce-overhead-None] 805.30 813.36 +1.00%
benchmarks/test_envs_benchmark.py::test_step_mdp_speed[True-True-True-True-False] 43,025 42,596 -1.00%
benchmarks/test_objectives_benchmarks.py::test_dqn_speed[reduce-overhead-None] 1,919 1,938 +0.98%
benchmarks/test_objectives_benchmarks.py::test_sac_speed[False-None] 112.28 113.33 +0.94%
... ... ... Showing 120 of 202 comparisons, sorted by absolute change.

Labels

Benchmarks, CLA Signed, Collectors, Integrations/torch_geometric, Integrations, Performance, Trainers
