[Performance, WIP] Faster SAC #1958

Draft: wants to merge 1 commit into main

@vmoens (Contributor) commented Feb 23, 2024

@matteobettini there's a way to make vmap functional calls much, much faster!
We'll need to make sure this works across the board, but if it does, the speed-up could be 2x for many losses 🤯

pytorch-bot bot commented Feb 23, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/rl/1958

Note: Links to docs will display an error until the docs builds have been completed.

❌ 13 New Failures, 1 Unrelated Failure

As of commit 3923bc2 with merge base 492091a:

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following job failed but was also failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label Feb 23, 2024

$\color{#D29922}\textsf{\Large⚠\kern{0.2cm}\normalsize Warning}$ Result of CPU Benchmark Tests

Total Benchmarks: 89. Improved: $\large\color{#35bf28}11$. Worsened: $\large\color{#d91a1a}4$.

Name | Max | Mean | Ops | Ops on Repo HEAD | Change
test_single 63.3136ms 62.3617ms 16.0355 Ops/s 16.1287 Ops/s $\color{#d91a1a}-0.58\%$
test_sync 34.7507ms 33.9539ms 29.4517 Ops/s 30.2222 Ops/s $\color{#d91a1a}-2.55\%$
test_async 54.7522ms 31.8410ms 31.4061 Ops/s 30.3396 Ops/s $\color{#35bf28}+3.52\%$
test_simple 0.5147s 0.4492s 2.2260 Ops/s 2.3177 Ops/s $\color{#d91a1a}-3.96\%$
test_transformed 0.6481s 0.5928s 1.6870 Ops/s 1.7210 Ops/s $\color{#d91a1a}-1.98\%$
test_serial 1.5109s 1.4496s 0.6899 Ops/s 0.6929 Ops/s $\color{#d91a1a}-0.44\%$
test_parallel 1.4889s 1.4214s 0.7036 Ops/s 0.7154 Ops/s $\color{#d91a1a}-1.66\%$
test_step_mdp_speed[True-True-True-True-True] 0.2132ms 21.0614μs 47.4803 KOps/s 47.3464 KOps/s $\color{#35bf28}+0.28\%$
test_step_mdp_speed[True-True-True-True-False] 44.4630μs 13.0363μs 76.7086 KOps/s 76.7211 KOps/s $\color{#d91a1a}-0.02\%$
test_step_mdp_speed[True-True-True-False-True] 55.2930μs 12.4198μs 80.5165 KOps/s 80.6801 KOps/s $\color{#d91a1a}-0.20\%$
test_step_mdp_speed[True-True-True-False-False] 0.1192ms 7.7825μs 128.4930 KOps/s 133.9221 KOps/s $\color{#d91a1a}-4.05\%$
test_step_mdp_speed[True-True-False-True-True] 80.2100μs 22.4400μs 44.5633 KOps/s 44.5725 KOps/s $\color{#d91a1a}-0.02\%$
test_step_mdp_speed[True-True-False-True-False] 61.0650μs 14.1236μs 70.8034 KOps/s 70.2443 KOps/s $\color{#35bf28}+0.80\%$
test_step_mdp_speed[True-True-False-False-True] 40.4460μs 13.6073μs 73.4901 KOps/s 73.3795 KOps/s $\color{#35bf28}+0.15\%$
test_step_mdp_speed[True-True-False-False-False] 34.8850μs 8.6858μs 115.1299 KOps/s 115.3467 KOps/s $\color{#d91a1a}-0.19\%$
test_step_mdp_speed[True-False-True-True-True] 60.8140μs 23.7955μs 42.0248 KOps/s 42.0212 KOps/s $+0.01\%$
test_step_mdp_speed[True-False-True-True-False] 45.8460μs 15.4878μs 64.5668 KOps/s 64.1428 KOps/s $\color{#35bf28}+0.66\%$
test_step_mdp_speed[True-False-True-False-True] 65.3940μs 13.5961μs 73.5503 KOps/s 73.1392 KOps/s $\color{#35bf28}+0.56\%$
test_step_mdp_speed[True-False-True-False-False] 42.0490μs 8.7865μs 113.8110 KOps/s 116.0894 KOps/s $\color{#d91a1a}-1.96\%$
test_step_mdp_speed[True-False-False-True-True] 84.4580μs 25.0620μs 39.9011 KOps/s 40.2385 KOps/s $\color{#d91a1a}-0.84\%$
test_step_mdp_speed[True-False-False-True-False] 43.1100μs 16.5471μs 60.4337 KOps/s 60.7340 KOps/s $\color{#d91a1a}-0.49\%$
test_step_mdp_speed[True-False-False-False-True] 70.4620μs 14.7481μs 67.8053 KOps/s 68.1263 KOps/s $\color{#d91a1a}-0.47\%$
test_step_mdp_speed[True-False-False-False-False] 45.3440μs 9.8687μs 101.3304 KOps/s 102.4897 KOps/s $\color{#d91a1a}-1.13\%$
test_step_mdp_speed[False-True-True-True-True] 75.6410μs 23.8951μs 41.8495 KOps/s 42.0234 KOps/s $\color{#d91a1a}-0.41\%$
test_step_mdp_speed[False-True-True-True-False] 64.0300μs 15.5364μs 64.3648 KOps/s 63.7064 KOps/s $\color{#35bf28}+1.03\%$
test_step_mdp_speed[False-True-True-False-True] 57.6480μs 16.1735μs 61.8296 KOps/s 62.8367 KOps/s $\color{#d91a1a}-1.60\%$
test_step_mdp_speed[False-True-True-False-False] 41.4270μs 9.9448μs 100.5549 KOps/s 101.0372 KOps/s $\color{#d91a1a}-0.48\%$
test_step_mdp_speed[False-True-False-True-True] 49.2720μs 25.7035μs 38.9052 KOps/s 39.5274 KOps/s $\color{#d91a1a}-1.57\%$
test_step_mdp_speed[False-True-False-True-False] 72.6160μs 16.6261μs 60.1463 KOps/s 60.3409 KOps/s $\color{#d91a1a}-0.32\%$
test_step_mdp_speed[False-True-False-False-True] 69.3700μs 17.2078μs 58.1133 KOps/s 58.9463 KOps/s $\color{#d91a1a}-1.41\%$
test_step_mdp_speed[False-True-False-False-False] 55.7650μs 11.1837μs 89.4157 KOps/s 91.3082 KOps/s $\color{#d91a1a}-2.07\%$
test_step_mdp_speed[False-False-True-True-True] 88.9670μs 26.3782μs 37.9101 KOps/s 38.3077 KOps/s $\color{#d91a1a}-1.04\%$
test_step_mdp_speed[False-False-True-True-False] 43.9430μs 17.9520μs 55.7040 KOps/s 55.4219 KOps/s $\color{#35bf28}+0.51\%$
test_step_mdp_speed[False-False-True-False-True] 78.4170μs 17.0489μs 58.6548 KOps/s 58.8033 KOps/s $\color{#d91a1a}-0.25\%$
test_step_mdp_speed[False-False-True-False-False] 51.6270μs 11.2141μs 89.1736 KOps/s 91.2309 KOps/s $\color{#d91a1a}-2.26\%$
test_step_mdp_speed[False-False-False-True-True] 78.3670μs 27.3497μs 36.5635 KOps/s 36.9461 KOps/s $\color{#d91a1a}-1.04\%$
test_step_mdp_speed[False-False-False-True-False] 70.6230μs 19.0296μs 52.5496 KOps/s 52.9155 KOps/s $\color{#d91a1a}-0.69\%$
test_step_mdp_speed[False-False-False-False-True] 46.5670μs 18.0980μs 55.2547 KOps/s 55.5382 KOps/s $\color{#d91a1a}-0.51\%$
test_step_mdp_speed[False-False-False-False-False] 70.7620μs 12.2561μs 81.5923 KOps/s 83.1106 KOps/s $\color{#d91a1a}-1.83\%$
test_values[generalized_advantage_estimate-True-True] 12.1009ms 9.4415ms 105.9152 Ops/s 108.2171 Ops/s $\color{#d91a1a}-2.13\%$
test_values[vec_generalized_advantage_estimate-True-True] 38.8651ms 35.4655ms 28.1964 Ops/s 28.3041 Ops/s $\color{#d91a1a}-0.38\%$
test_values[td0_return_estimate-False-False] 0.2426ms 0.1905ms 5.2483 KOps/s 5.8080 KOps/s $\textbf{\color{#d91a1a}-9.64\%}$
test_values[td1_return_estimate-False-False] 26.1736ms 23.5475ms 42.4674 Ops/s 42.3752 Ops/s $\color{#35bf28}+0.22\%$
test_values[vec_td1_return_estimate-False-False] 38.3187ms 35.7191ms 27.9962 Ops/s 28.1771 Ops/s $\color{#d91a1a}-0.64\%$
test_values[td_lambda_return_estimate-True-False] 37.1314ms 33.7970ms 29.5884 Ops/s 29.9753 Ops/s $\color{#d91a1a}-1.29\%$
test_values[vec_td_lambda_return_estimate-True-False] 38.7340ms 35.6116ms 28.0807 Ops/s 28.2432 Ops/s $\color{#d91a1a}-0.58\%$
test_gae_speed[generalized_advantage_estimate-False-1-512] 8.3967ms 8.0123ms 124.8087 Ops/s 125.0718 Ops/s $\color{#d91a1a}-0.21\%$
test_gae_speed[vec_generalized_advantage_estimate-True-1-512] 2.1196ms 1.9108ms 523.3458 Ops/s 537.1060 Ops/s $\color{#d91a1a}-2.56\%$
test_gae_speed[vec_generalized_advantage_estimate-False-1-512] 0.5463ms 0.3515ms 2.8450 KOps/s 2.8874 KOps/s $\color{#d91a1a}-1.47\%$
test_gae_speed[vec_generalized_advantage_estimate-True-32-512] 41.7600ms 40.4430ms 24.7262 Ops/s 21.9625 Ops/s $\textbf{\color{#35bf28}+12.58\%}$
test_gae_speed[vec_generalized_advantage_estimate-False-32-512] 3.8006ms 3.0247ms 330.6157 Ops/s 329.9328 Ops/s $\color{#35bf28}+0.21\%$
test_dqn_speed 75.1670ms 1.5054ms 664.2842 Ops/s 700.4505 Ops/s $\textbf{\color{#d91a1a}-5.16\%}$
test_ddpg_speed 3.5141ms 2.7851ms 359.0581 Ops/s 355.6452 Ops/s $\color{#35bf28}+0.96\%$
test_sac_speed 7.9146ms 6.6538ms 150.2896 Ops/s 120.5107 Ops/s $\textbf{\color{#35bf28}+24.71\%}$
test_redq_speed 14.9687ms 13.4668ms 74.2566 Ops/s 75.6784 Ops/s $\color{#d91a1a}-1.88\%$
test_redq_deprec_speed 15.8664ms 13.8909ms 71.9895 Ops/s 75.3803 Ops/s $\color{#d91a1a}-4.50\%$
test_td3_speed 8.6492ms 8.3911ms 119.1735 Ops/s 119.2888 Ops/s $\color{#d91a1a}-0.10\%$
test_cql_speed 46.7520ms 37.6100ms 26.5886 Ops/s 27.0084 Ops/s $\color{#d91a1a}-1.55\%$
test_a2c_speed 9.0478ms 7.6515ms 130.6942 Ops/s 129.8356 Ops/s $\color{#35bf28}+0.66\%$
test_ppo_speed 8.8246ms 7.9917ms 125.1293 Ops/s 125.8436 Ops/s $\color{#d91a1a}-0.57\%$
test_reinforce_speed 7.0654ms 6.7056ms 149.1297 Ops/s 143.7766 Ops/s $\color{#35bf28}+3.72\%$
test_iql_speed 34.3134ms 32.7320ms 30.5512 Ops/s 30.0442 Ops/s $\color{#35bf28}+1.69\%$
test_rb_sample[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] 2.9270ms 2.2424ms 445.9490 Ops/s 418.4807 Ops/s $\textbf{\color{#35bf28}+6.56\%}$
test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] 0.6214ms 0.4961ms 2.0159 KOps/s 1.9791 KOps/s $\color{#35bf28}+1.86\%$
test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] 3.6479ms 0.4763ms 2.0993 KOps/s 2.0709 KOps/s $\color{#35bf28}+1.37\%$
test_rb_sample[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] 3.4090ms 2.1959ms 455.4040 Ops/s 434.1561 Ops/s $\color{#35bf28}+4.89\%$
test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] 0.8430ms 0.4910ms 2.0368 KOps/s 2.0072 KOps/s $\color{#35bf28}+1.47\%$
test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] 0.6379ms 0.4664ms 2.1440 KOps/s 2.0768 KOps/s $\color{#35bf28}+3.23\%$
test_rb_sample[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] 3.3908ms 2.3507ms 425.4044 Ops/s 420.0111 Ops/s $\color{#35bf28}+1.28\%$
test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] 0.9816ms 0.6084ms 1.6438 KOps/s 1.6223 KOps/s $\color{#35bf28}+1.32\%$
test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] 0.7899ms 0.5879ms 1.7009 KOps/s 1.6766 KOps/s $\color{#35bf28}+1.45\%$
test_rb_iterate[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] 2.4708ms 2.2274ms 448.9460 Ops/s 447.9141 Ops/s $\color{#35bf28}+0.23\%$
test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] 91.7842ms 0.5873ms 1.7027 KOps/s 1.9895 KOps/s $\textbf{\color{#d91a1a}-14.42\%}$
test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] 0.6263ms 0.4708ms 2.1239 KOps/s 1.6737 KOps/s $\textbf{\color{#35bf28}+26.90\%}$
test_rb_iterate[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] 3.8062ms 2.2424ms 445.9411 Ops/s 418.1765 Ops/s $\textbf{\color{#35bf28}+6.64\%}$
test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] 0.6603ms 0.4916ms 2.0342 KOps/s 1.9826 KOps/s $\color{#35bf28}+2.60\%$
test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] 3.7952ms 0.4744ms 2.1080 KOps/s 2.0397 KOps/s $\color{#35bf28}+3.35\%$
test_rb_iterate[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] 3.0084ms 2.3720ms 421.5805 Ops/s 402.3527 Ops/s $\color{#35bf28}+4.78\%$
test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] 0.8866ms 0.6123ms 1.6332 KOps/s 1.3287 KOps/s $\textbf{\color{#35bf28}+22.92\%}$
test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] 3.7089ms 0.5881ms 1.7004 KOps/s 1.6522 KOps/s $\color{#35bf28}+2.92\%$
test_rb_populate[TensorDictReplayBuffer-ListStorage-RandomSampler-400] 0.1070s 5.6916ms 175.6985 Ops/s 177.9815 Ops/s $\color{#d91a1a}-1.28\%$
test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-400] 14.3034ms 11.8043ms 84.7148 Ops/s 80.8783 Ops/s $\color{#35bf28}+4.74\%$
test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-400] 1.7175ms 1.0112ms 988.9444 Ops/s 917.7412 Ops/s $\textbf{\color{#35bf28}+7.76\%}$
test_rb_populate[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-400] 91.1024ms 7.1327ms 140.1988 Ops/s 132.7681 Ops/s $\textbf{\color{#35bf28}+5.60\%}$
test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-400] 14.4478ms 11.7311ms 85.2435 Ops/s 81.1391 Ops/s $\textbf{\color{#35bf28}+5.06\%}$
test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-400] 1.4960ms 1.0044ms 995.6560 Ops/s 963.6852 Ops/s $\color{#35bf28}+3.32\%$
test_rb_populate[TensorDictPrioritizedReplayBuffer-ListStorage-None-400] 87.8195ms 5.6324ms 177.5448 Ops/s 127.2357 Ops/s $\textbf{\color{#35bf28}+39.54\%}$
test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-400] 96.2808ms 13.7930ms 72.5005 Ops/s 78.5341 Ops/s $\textbf{\color{#d91a1a}-7.68\%}$
test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-400] 1.4209ms 1.3127ms 761.7985 Ops/s 695.8720 Ops/s $\textbf{\color{#35bf28}+9.47\%}$
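
The Change column appears to be the relative difference between the Ops measured on this branch and the Ops on repo HEAD, and the Ops column appears to be the inverse of the mean runtime. A minimal sketch under those assumptions reproduces the `test_sac_speed` row above. The helper names `relative_change` and `ops_from_mean` are hypothetical and not part of the benchmark tooling.

```python
def relative_change(ops_branch: float, ops_head: float) -> float:
    """Percentage change in throughput relative to repo HEAD (assumed definition)."""
    return (ops_branch / ops_head - 1.0) * 100.0


def ops_from_mean(mean_seconds: float) -> float:
    """Throughput in Ops/s implied by a mean runtime given in seconds."""
    return 1.0 / mean_seconds


# Values copied from the test_sac_speed row of the CPU table.
sac_change = relative_change(150.2896, 120.5107)
sac_ops = ops_from_mean(6.6538e-3)  # mean of 6.6538ms

print(f"{sac_change:+.2f}%")      # +24.71%
print(f"{sac_ops:.2f} Ops/s")     # 150.29 Ops/s
```

Both values agree with the table to the displayed precision, which supports reading the SAC gain as roughly 25% on CPU rather than the hoped-for 2x.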

$\color{#D29922}\textsf{\Large⚠\kern{0.2cm}\normalsize Warning}$ Result of GPU Benchmark Tests

Total Benchmarks: 92. Improved: $\large\color{#35bf28}4$. Worsened: $\large\color{#d91a1a}9$.

Name | Max | Mean | Ops | Ops on Repo HEAD | Change
test_single 0.1137s 0.1129s 8.8577 Ops/s 8.9874 Ops/s $\color{#d91a1a}-1.44\%$
test_sync 95.9671ms 95.5702ms 10.4635 Ops/s 10.4553 Ops/s $\color{#35bf28}+0.08\%$
test_async 0.1791s 90.5864ms 11.0392 Ops/s 10.9200 Ops/s $\color{#35bf28}+1.09\%$
test_single_pixels 0.2015s 0.1373s 7.2810 Ops/s 7.8124 Ops/s $\textbf{\color{#d91a1a}-6.80\%}$
test_sync_pixels 80.7839ms 79.5872ms 12.5648 Ops/s 12.6943 Ops/s $\color{#d91a1a}-1.02\%$
test_async_pixels 0.1406s 71.0142ms 14.0817 Ops/s 13.7751 Ops/s $\color{#35bf28}+2.23\%$
test_simple 0.8433s 0.8366s 1.1953 Ops/s 1.2306 Ops/s $\color{#d91a1a}-2.87\%$
test_transformed 1.0568s 1.0499s 0.9525 Ops/s 0.9701 Ops/s $\color{#d91a1a}-1.81\%$
test_serial 2.5563s 2.5140s 0.3978 Ops/s 0.4272 Ops/s $\textbf{\color{#d91a1a}-6.88\%}$
test_parallel 2.1236s 2.0925s 0.4779 Ops/s 0.4870 Ops/s $\color{#d91a1a}-1.87\%$
test_step_mdp_speed[True-True-True-True-True] 0.1114ms 33.5262μs 29.8274 KOps/s 30.3164 KOps/s $\color{#d91a1a}-1.61\%$
test_step_mdp_speed[True-True-True-True-False] 93.7710μs 19.8922μs 50.2710 KOps/s 49.9213 KOps/s $\color{#35bf28}+0.70\%$
test_step_mdp_speed[True-True-True-False-True] 36.7100μs 18.7915μs 53.2156 KOps/s 52.6518 KOps/s $\color{#35bf28}+1.07\%$
test_step_mdp_speed[True-True-True-False-False] 37.5500μs 11.2829μs 88.6294 KOps/s 89.0560 KOps/s $\color{#d91a1a}-0.48\%$
test_step_mdp_speed[True-True-False-True-True] 53.9700μs 34.6304μs 28.8764 KOps/s 28.6986 KOps/s $\color{#35bf28}+0.62\%$
test_step_mdp_speed[True-True-False-True-False] 40.8000μs 21.6169μs 46.2601 KOps/s 45.5288 KOps/s $\color{#35bf28}+1.61\%$
test_step_mdp_speed[True-True-False-False-True] 87.3210μs 20.5006μs 48.7791 KOps/s 47.9027 KOps/s $\color{#35bf28}+1.83\%$
test_step_mdp_speed[True-True-False-False-False] 32.2900μs 13.2425μs 75.5146 KOps/s 75.3388 KOps/s $\color{#35bf28}+0.23\%$
test_step_mdp_speed[True-False-True-True-True] 58.6600μs 36.5319μs 27.3733 KOps/s 26.7418 KOps/s $\color{#35bf28}+2.36\%$
test_step_mdp_speed[True-False-True-True-False] 46.0910μs 23.4353μs 42.6707 KOps/s 41.8743 KOps/s $\color{#35bf28}+1.90\%$
test_step_mdp_speed[True-False-True-False-True] 36.7200μs 20.4337μs 48.9388 KOps/s 48.5306 KOps/s $\color{#35bf28}+0.84\%$
test_step_mdp_speed[True-False-True-False-False] 28.3090μs 13.1617μs 75.9782 KOps/s 76.2806 KOps/s $\color{#d91a1a}-0.40\%$
test_step_mdp_speed[True-False-False-True-True] 0.1035ms 38.2838μs 26.1207 KOps/s 25.9683 KOps/s $\color{#35bf28}+0.59\%$
test_step_mdp_speed[True-False-False-True-False] 48.3610μs 25.3850μs 39.3933 KOps/s 39.3215 KOps/s $\color{#35bf28}+0.18\%$
test_step_mdp_speed[True-False-False-False-True] 44.3890μs 22.3896μs 44.6636 KOps/s 44.5489 KOps/s $\color{#35bf28}+0.26\%$
test_step_mdp_speed[True-False-False-False-False] 30.4710μs 15.1130μs 66.1684 KOps/s 66.1795 KOps/s $\color{#d91a1a}-0.02\%$
test_step_mdp_speed[False-True-True-True-True] 53.6200μs 36.3783μs 27.4890 KOps/s 26.9074 KOps/s $\color{#35bf28}+2.16\%$
test_step_mdp_speed[False-True-True-True-False] 52.1320μs 23.8126μs 41.9946 KOps/s 42.3032 KOps/s $\color{#d91a1a}-0.73\%$
test_step_mdp_speed[False-True-True-False-True] 46.4900μs 24.4557μs 40.8902 KOps/s 40.6136 KOps/s $\color{#35bf28}+0.68\%$
test_step_mdp_speed[False-True-True-False-False] 39.7600μs 15.0544μs 66.4256 KOps/s 64.8878 KOps/s $\color{#35bf28}+2.37\%$
test_step_mdp_speed[False-True-False-True-True] 63.3900μs 39.4098μs 25.3744 KOps/s 25.1848 KOps/s $\color{#35bf28}+0.75\%$
test_step_mdp_speed[False-True-False-True-False] 85.7500μs 25.4525μs 39.2889 KOps/s 38.7929 KOps/s $\color{#35bf28}+1.28\%$
test_step_mdp_speed[False-True-False-False-True] 43.6310μs 26.6649μs 37.5025 KOps/s 38.3021 KOps/s $\color{#d91a1a}-2.09\%$
test_step_mdp_speed[False-True-False-False-False] 40.9510μs 16.7527μs 59.6918 KOps/s 60.0428 KOps/s $\color{#d91a1a}-0.58\%$
test_step_mdp_speed[False-False-True-True-True] 64.6310μs 40.1907μs 24.8814 KOps/s 24.2912 KOps/s $\color{#35bf28}+2.43\%$
test_step_mdp_speed[False-False-True-True-False] 44.2100μs 27.3583μs 36.5520 KOps/s 36.1082 KOps/s $\color{#35bf28}+1.23\%$
test_step_mdp_speed[False-False-True-False-True] 41.3300μs 26.2644μs 38.0743 KOps/s 37.9869 KOps/s $\color{#35bf28}+0.23\%$
test_step_mdp_speed[False-False-True-False-False] 31.8010μs 16.7534μs 59.6895 KOps/s 58.1345 KOps/s $\color{#35bf28}+2.67\%$
test_step_mdp_speed[False-False-False-True-True] 69.1900μs 41.5118μs 24.0896 KOps/s 23.2268 KOps/s $\color{#35bf28}+3.71\%$
test_step_mdp_speed[False-False-False-True-False] 0.1048ms 29.1136μs 34.3483 KOps/s 33.6908 KOps/s $\color{#35bf28}+1.95\%$
test_step_mdp_speed[False-False-False-False-True] 48.1110μs 27.4936μs 36.3721 KOps/s 35.7889 KOps/s $\color{#35bf28}+1.63\%$
test_step_mdp_speed[False-False-False-False-False] 32.2610μs 18.6056μs 53.7473 KOps/s 53.0234 KOps/s $\color{#35bf28}+1.37\%$
test_values[generalized_advantage_estimate-True-True] 26.6774ms 25.7626ms 38.8160 Ops/s 41.1868 Ops/s $\textbf{\color{#d91a1a}-5.76\%}$
test_values[vec_generalized_advantage_estimate-True-True] 81.3992ms 3.1925ms 313.2324 Ops/s 298.9929 Ops/s $\color{#35bf28}+4.76\%$
test_values[td0_return_estimate-False-False] 0.1035ms 59.8672μs 16.7036 KOps/s 16.0368 KOps/s $\color{#35bf28}+4.16\%$
test_values[td1_return_estimate-False-False] 57.4423ms 56.7939ms 17.6075 Ops/s 19.2721 Ops/s $\textbf{\color{#d91a1a}-8.64\%}$
test_values[vec_td1_return_estimate-False-False] 2.0778ms 1.7639ms 566.9104 Ops/s 570.5980 Ops/s $\color{#d91a1a}-0.65\%$
test_values[td_lambda_return_estimate-True-False] 91.6125ms 90.5662ms 11.0416 Ops/s 11.2732 Ops/s $\color{#d91a1a}-2.05\%$
test_values[vec_td_lambda_return_estimate-True-False] 3.9813ms 1.7884ms 559.1715 Ops/s 562.2791 Ops/s $\color{#d91a1a}-0.55\%$
test_gae_speed[generalized_advantage_estimate-False-1-512] 25.6011ms 25.2348ms 39.6279 Ops/s 44.1631 Ops/s $\textbf{\color{#d91a1a}-10.27\%}$
test_gae_speed[vec_generalized_advantage_estimate-True-1-512] 0.8660ms 0.7310ms 1.3679 KOps/s 1.4629 KOps/s $\textbf{\color{#d91a1a}-6.49\%}$
test_gae_speed[vec_generalized_advantage_estimate-False-1-512] 0.7102ms 0.6554ms 1.5259 KOps/s 1.5645 KOps/s $\color{#d91a1a}-2.47\%$
test_gae_speed[vec_generalized_advantage_estimate-True-32-512] 1.5081ms 1.4559ms 686.8788 Ops/s 695.3938 Ops/s $\color{#d91a1a}-1.22\%$
test_gae_speed[vec_generalized_advantage_estimate-False-32-512] 0.9841ms 0.6982ms 1.4322 KOps/s 1.5075 KOps/s $\color{#d91a1a}-5.00\%$
test_dqn_speed 4.0253ms 1.4909ms 670.7410 Ops/s 632.8267 Ops/s $\textbf{\color{#35bf28}+5.99\%}$
test_ddpg_speed 3.9905ms 2.9209ms 342.3579 Ops/s 356.4531 Ops/s $\color{#d91a1a}-3.95\%$
test_sac_speed 6.3411ms 6.1225ms 163.3322 Ops/s 124.0976 Ops/s $\textbf{\color{#35bf28}+31.62\%}$
test_redq_speed 11.7863ms 10.7155ms 93.3230 Ops/s 95.7104 Ops/s $\color{#d91a1a}-2.49\%$
test_redq_deprec_speed 12.6486ms 11.8784ms 84.1863 Ops/s 88.1863 Ops/s $\color{#d91a1a}-4.54\%$
test_td3_speed 8.3257ms 8.2592ms 121.0765 Ops/s 123.7944 Ops/s $\color{#d91a1a}-2.20\%$
test_cql_speed 27.6618ms 25.9894ms 38.4772 Ops/s 38.4951 Ops/s $\color{#d91a1a}-0.05\%$
test_a2c_speed 5.9871ms 5.7081ms 175.1909 Ops/s 175.1570 Ops/s $\color{#35bf28}+0.02\%$
test_ppo_speed 6.4150ms 6.0078ms 166.4500 Ops/s 166.9402 Ops/s $\color{#d91a1a}-0.29\%$
test_reinforce_speed 4.9272ms 4.6860ms 213.3998 Ops/s 214.5077 Ops/s $\color{#d91a1a}-0.52\%$
test_iql_speed 20.8964ms 20.2761ms 49.3192 Ops/s 49.9575 Ops/s $\color{#d91a1a}-1.28\%$
test_rb_sample[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] 3.0078ms 2.9027ms 344.5053 Ops/s 343.3985 Ops/s $\color{#35bf28}+0.32\%$
test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] 0.7073ms 0.5531ms 1.8081 KOps/s 1.8271 KOps/s $\color{#d91a1a}-1.04\%$
test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] 4.2794ms 0.5281ms 1.8937 KOps/s 1.9013 KOps/s $\color{#d91a1a}-0.40\%$
test_rb_sample[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] 3.1642ms 2.9357ms 340.6312 Ops/s 341.7802 Ops/s $\color{#d91a1a}-0.34\%$
test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] 1.0728ms 0.5414ms 1.8470 KOps/s 1.8477 KOps/s $\color{#d91a1a}-0.04\%$
test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] 0.7078ms 0.5222ms 1.9149 KOps/s 1.9210 KOps/s $\color{#d91a1a}-0.32\%$
test_rb_sample[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] 3.1589ms 3.0538ms 327.4571 Ops/s 329.4825 Ops/s $\color{#d91a1a}-0.61\%$
test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] 0.8258ms 0.6733ms 1.4852 KOps/s 1.5126 KOps/s $\color{#d91a1a}-1.81\%$
test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] 4.3169ms 0.6582ms 1.5194 KOps/s 1.5566 KOps/s $\color{#d91a1a}-2.39\%$
test_rb_iterate[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] 3.0641ms 2.9172ms 342.7936 Ops/s 343.7550 Ops/s $\color{#d91a1a}-0.28\%$
test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] 0.7118ms 0.5484ms 1.8236 KOps/s 1.5691 KOps/s $\textbf{\color{#35bf28}+16.22\%}$
test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] 0.1054s 0.6192ms 1.6151 KOps/s 1.9200 KOps/s $\textbf{\color{#d91a1a}-15.88\%}$
test_rb_iterate[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] 3.1334ms 2.9522ms 338.7316 Ops/s 337.6772 Ops/s $\color{#35bf28}+0.31\%$
test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] 0.6861ms 0.5457ms 1.8325 KOps/s 1.8693 KOps/s $\color{#d91a1a}-1.97\%$
test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] 4.3423ms 0.5282ms 1.8931 KOps/s 1.9469 KOps/s $\color{#d91a1a}-2.76\%$
test_rb_iterate[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] 3.1860ms 3.0707ms 325.6637 Ops/s 329.4977 Ops/s $\color{#d91a1a}-1.16\%$
test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] 0.1080s 0.7741ms 1.2919 KOps/s 1.5061 KOps/s $\textbf{\color{#d91a1a}-14.22\%}$
test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] 0.8400ms 0.6601ms 1.5150 KOps/s 1.5561 KOps/s $\color{#d91a1a}-2.64\%$
test_rb_populate[TensorDictReplayBuffer-ListStorage-RandomSampler-400] 0.1078s 6.8764ms 145.4248 Ops/s 139.3695 Ops/s $\color{#35bf28}+4.34\%$
test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-400] 16.5830ms 14.3599ms 69.6386 Ops/s 69.4648 Ops/s $\color{#35bf28}+0.25\%$
test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-400] 2.1521ms 1.0835ms 922.8977 Ops/s 881.2340 Ops/s $\color{#35bf28}+4.73\%$
test_rb_populate[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-400] 0.1089s 8.9002ms 112.3568 Ops/s 115.3196 Ops/s $\color{#d91a1a}-2.57\%$
test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-400] 16.3843ms 14.3825ms 69.5292 Ops/s 69.6360 Ops/s $\color{#d91a1a}-0.15\%$
test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-400] 2.1838ms 1.0747ms 930.5148 Ops/s 800.7408 Ops/s $\textbf{\color{#35bf28}+16.21\%}$
test_rb_populate[TensorDictPrioritizedReplayBuffer-ListStorage-None-400] 0.1029s 7.1323ms 140.2081 Ops/s 141.3490 Ops/s $\color{#d91a1a}-0.81\%$
test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-400] 0.1138s 16.6857ms 59.9315 Ops/s 68.3529 Ops/s $\textbf{\color{#d91a1a}-12.32\%}$
test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-400] 1.5196ms 1.3893ms 719.7613 Ops/s 686.3659 Ops/s $\color{#35bf28}+4.87\%$
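
The "Improved" and "Worsened" totals in the summary line match the number of bolded entries in the table, which is consistent with a significance threshold of about 5% on the Change column. The sketch below applies that rule to a few GPU rows. The 5% threshold and the `classify` helper are inferred from the table for illustration, not taken from the benchmark workflow.

```python
THRESHOLD_PERCENT = 5.0  # assumption inferred from which rows are bolded


def classify(change_percent: float, threshold: float = THRESHOLD_PERCENT) -> str:
    """Label a benchmark by its relative change in Ops/s versus repo HEAD."""
    if change_percent > threshold:
        return "improved"
    if change_percent < -threshold:
        return "worsened"
    return "unchanged"


# Change values copied from the GPU table above.
gpu_changes = {
    "test_sac_speed": 31.62,
    "test_dqn_speed": 5.99,
    "test_ddpg_speed": -3.95,
    "test_serial": -6.88,
    "test_td3_speed": -2.20,
}

labels = {name: classify(change) for name, change in gpu_changes.items()}
for name, label in labels.items():
    print(f"{name}: {label}")
```

Under this reading, only `test_sac_speed` and `test_dqn_speed` among the loss benchmarks count as improved on GPU, so the other losses would still need the same treatment before a broad speed-up shows up here.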

@vmoens linked an issue Feb 25, 2024 that may be closed by this pull request
@vmoens (Contributor Author) commented Feb 25, 2024

Related issue: pytorch/pytorch#120572

@matteobettini (Contributor)

Nice! I am not super familiar with the code here, but I am happy to help benchmark this.

Let's keep in mind readability though. Ideally we would like a normal torch user to be able to read and understand the loss classes.

Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Long GPU idle times in loss forward pass
3 participants