Currently, tianshou sets the policy's mode in the trainer and `test_episode` function. The corresponding `training` attribute is then used to determine whether a stochastic policy should be evaluated deterministically, given that `policy.deterministic_eval` is `True`. This, however, is a misuse, as the `training` attribute primarily influences modules like dropout and batchnorm. It should always be `False` during data collection and only be `True` inside `policy.learn`.
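To make the distinction concrete, here is a minimal pure-Python mock (hypothetical, not tianshou or torch code) of the `training`-flag semantics: the flag gates stochastic layers such as dropout, which is why it must not double as a "deterministic evaluation" switch for the policy.

```python
import random

class MockDropout:
    """Pure-Python stand-in for torch.nn.Dropout, illustrating what the
    `training` attribute is actually for (hypothetical example)."""

    def __init__(self, p: float = 0.5):
        self.p = p
        self.training = True  # same default as torch modules

    def forward(self, xs: list[float]) -> list[float]:
        if not self.training:
            # eval mode: dropout is the identity function
            return list(xs)
        # train mode: zero each element with probability p, rescale survivors
        return [0.0 if random.random() < self.p else x / (1 - self.p) for x in xs]

drop = MockDropout(p=0.5)
drop.training = False  # eval mode, as it should be during data collection
assert drop.forward([1.0, 2.0]) == [1.0, 2.0]
```

Because `training` changes the forward pass itself, toggling it merely to switch between stochastic and deterministic action selection has unintended side effects on such layers.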
Max and I have implemented the following solution in #1123:
We introduced a new flag `is_within_training_step`, which is enabled by the training algorithm when within a training step, where a training step encompasses training data collection and policy updates. This flag is now used by algorithms to decide whether their `deterministic_eval` setting should indeed apply, instead of the torch training flag (which was abused!).
The policy's training/eval mode (which should control torch-level learning only) no longer needs to be set in user code in order to control collector behaviour (this didn't make sense!). The respective calls have been removed.
The policy should, in fact, always be in evaluation mode during data collection, as there is no reason to ever have gradient accumulation enabled for any type of rollout. We thus explicitly set the policy to evaluation mode in `Collector.collect`.
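The before/after decision logic can be sketched as follows. This is a simplified illustration, not the actual tianshou class: `is_within_training_step` is the flag from #1123, while the class and method names are invented for the example.

```python
class StochasticPolicy:
    """Illustrative sketch of the deterministic-eval decision (hypothetical)."""

    def __init__(self, deterministic_eval: bool = True):
        self.deterministic_eval = deterministic_eval
        self.training = False                  # torch flag: dropout/batchnorm only
        self.is_within_training_step = False   # new flag, set by the trainer

    def _use_deterministic_action_old(self) -> bool:
        # before: abused the torch training flag
        return self.deterministic_eval and not self.training

    def _use_deterministic_action_new(self) -> bool:
        # after: relies on the dedicated flag, so the torch flag stays free
        # to control dropout/batchnorm and can remain False (eval mode)
        # throughout all data collection
        return self.deterministic_eval and not self.is_within_training_step

policy = StochasticPolicy(deterministic_eval=True)
policy.is_within_training_step = True    # training rollout: sample stochastically
assert policy._use_deterministic_action_new() is False
policy.is_within_training_step = False   # test rollout: act deterministically
assert policy._use_deterministic_action_new() is True
```

Note that in the new logic the policy can stay in eval mode (`training = False`) at all times during collection without this forcing deterministic actions during training rollouts.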
Addresses #1122:
* We introduced a new flag `is_within_training_step` which is enabled by
the training algorithm when within a training step, where a training
step encompasses training data collection and policy updates. This flag
is now used by algorithms to decide whether their `deterministic_eval`
setting should indeed apply instead of the torch training flag (which
was abused!).
* The policy's training/eval mode (which should control torch-level
learning only) no longer needs to be set in user code in order to
control collector behaviour (this didn't make sense!). The respective
calls have been removed.
* The policy should, in fact, always be in evaluation mode when applying
data collection, as there is no reason to ever have gradient
accumulation enabled for any type of rollout. We thus specifically set
the policy to evaluation mode in `Collector.collect`. Further, it never
makes sense to compute gradients during collection, so the possibility
to pass `no_grad=False` was removed.
Further changes:
- Base class for collectors: `BaseCollector`
- New util context managers `in_eval_mode` and `in_train_mode` for torch
modules.
- `reset` of `Collectors` now returns `obs` and `info`.
- `no_grad` no longer accepted as kwarg of `collect`
- Removed deprecations of `0.5.1` (will likely not affect anyone) and
the unused `warnings` module.
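A plausible implementation of such a mode-switching context manager looks roughly like the following (a sketch only; the actual tianshou utilities may differ in detail). The key property is that the previous training/eval state is restored on exit, even if an exception is raised. The `StubModule` class is a hypothetical torch-like stand-in so the example runs without torch installed.

```python
from contextlib import contextmanager

@contextmanager
def in_eval_mode(module):
    """Temporarily switch a torch-style module to eval mode,
    restoring its previous training/eval state on exit (sketch)."""
    was_training = module.training
    module.train(False)
    try:
        yield module
    finally:
        module.train(was_training)

# Minimal torch-like stub (hypothetical) so the sketch is self-contained.
class StubModule:
    def __init__(self):
        self.training = True

    def train(self, mode: bool = True):
        self.training = mode
        return self

net = StubModule()
with in_eval_mode(net):
    assert net.training is False   # e.g. dropout disabled here
assert net.training is True        # previous mode restored afterwards
```

An `in_train_mode` counterpart is the mirror image, calling `module.train(True)` inside the `with` block. Using `try`/`finally` (rather than restoring the flag manually at each call site) is what makes the restoration exception-safe.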