Collector sampling with multiple environment does not seem to be unbiased with n_episodes #1042

Open
5 of 9 tasks
utkarshp opened this issue Feb 2, 2024 · 10 comments · May be fixed by #1127
Labels: algorithm enhancement (Not quite a new algorithm, but an enhancement to algo. functionality), question (Further information is requested)

Comments


utkarshp commented Feb 2, 2024

  • I have marked all applicable categories:
    • exception-raising bug
    • RL algorithm bug
    • documentation request (i.e. "X is missing from the documentation.")
    • new feature request
    • design request (i.e. "X should be changed to Y.")
  • I have visited the source website
  • I have searched through the issue tracker for duplicates
  • I have mentioned version numbers, operating system and environment, where applicable:
>>> import tianshou, gymnasium as gym, torch, numpy, sys
>>> print(tianshou.__version__, gym.__version__, torch.__version__, numpy.__version__, sys.version, sys.platform)
0.5.1 0.29.1 2.1.2 1.26.3 3.11.0 (main, Mar  1 2023, 18:26:19) [GCC 11.2.0] linux

I have observed that when Collector.collect is called with n_episode, it tends to sample more episodes from environments whose episodes take fewer steps to finish than from the others. There is some logic in the collect method that tries to compensate for this, but it does not seem to do enough. Here is the code that attempts to avoid the bias by removing some envs from the rotation:
tianshou/data/collector.py, lines 367-373 on the master branch:

                # remove surplus env id from ready_env_ids
                # to avoid bias in selecting environments
                if n_episode:
                    surplus_env_num = len(ready_env_ids) - (n_episode - episode_count)
                    if surplus_env_num > 0:
                        mask = np.ones_like(ready_env_ids, dtype=bool)
                        mask[env_ind_local[:surplus_env_num]] = False
                        ready_env_ids = ready_env_ids[mask]
                        self.data = self.data[mask]

Here is the problem: Suppose we have 50 envs and want to run 100 episodes. Ideally, I would want each env to be stepped through two episodes, regardless of episode lengths. Now suppose 30 of these 50 envs are slow and take 31 steps per episode, while the remaining 20 are fast and take 10 steps per episode. After 10 steps, we get episode_count = 20 => surplus_env_num < 0, and nothing changes. All envs are stepped through another episode each, giving episode_count = 40 => surplus_env_num < 0. The same thing happens again, and we finally get episode_count = 60 => surplus_env_num > 0. At this point we have already gathered 60 episodes in total from the fast envs (3 episodes each), while no episode from the slow envs has finished yet. In the end, we get 3 episodes each from the fast envs and only 1-2 episodes each from the slow envs.
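To make this concrete, here is a small standalone simulation of just the counting and surplus-removal logic above (no tianshou involved; the episode lengths are hard-coded to match the example):

```python
import numpy as np

# Toy reproduction of the episode-counting and surplus-removal logic only.
ep_len = np.array([31] * 30 + [10] * 20)    # 30 slow envs, 20 fast envs
n_episode = 100

ready = np.arange(50)                       # ready_env_ids
steps_in_ep = np.zeros(50, dtype=int)       # steps taken in the current episode
episodes_per_env = np.zeros(50, dtype=int)
episode_count = 0

while episode_count < n_episode:
    steps_in_ep[ready] += 1                 # every ready env takes one step
    done_local = np.where(steps_in_ep[ready] >= ep_len[ready])[0]
    if len(done_local) > 0:
        episode_count += len(done_local)
        episodes_per_env[ready[done_local]] += 1
        steps_in_ep[ready[done_local]] = 0  # finished envs reset immediately
        # surplus-removal logic copied from collector.py above
        surplus_env_num = len(ready) - (n_episode - episode_count)
        if surplus_env_num > 0:
            mask = np.ones_like(ready, dtype=bool)
            mask[done_local[:surplus_env_num]] = False
            ready = ready[mask]

print(episodes_per_env[:30])   # episodes contributed by the 30 slow envs
print(episodes_per_env[30:])   # episodes contributed by the 20 fast envs
```

The exact split depends on which finished envs happen to be dropped, but the fast envs end up contributing well over their fair share (2 each) of the 100 episodes.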

I made an attempt at fixing this as follows. Essentially, the change makes the envs that have finished their episode wait for those that have not, and only then resets all of them at once.

+    def _reset_env_to_next(self, gym_reset_kwargs: Optional[Dict[str, Any]] = None) -> None:
+        gym_reset_kwargs = gym_reset_kwargs if gym_reset_kwargs else {}
+        obs, info = self.env.reset(**gym_reset_kwargs)
+        if self.preprocess_fn:
+            processed_data = self.preprocess_fn(
+                obs=obs, info=info, env_id=np.arange(self.env_num)
+            )
+            obs = processed_data.get("obs", obs)
+            info = processed_data.get("info", info)
+        self.data = Batch(
+            obs={},
+            act={},
+            rew={},
+            terminated={},
+            truncated={},
+            done={},
+            obs_next={},
+            info={},
+            policy={}
+        )
+        self.data.info = info
+        self.data.obs_next = obs
+
     def collect(
             self,
             n_step: Optional[int] = None,
@@ -69,8 +92,10 @@
                     "which may cause extra transitions collected into the buffer."
                 )
             ready_env_ids = np.arange(self.env_num)
+            init_ready_env_ids = np.arange(self.env_num)
         elif n_episode is not None:
             assert n_episode > 0
+            init_ready_env_ids = np.arange(min(self.env_num, n_episode))
             ready_env_ids = np.arange(min(self.env_num, n_episode))
             self.data = self.data[:min(self.env_num, n_episode)]
         else:
@@ -170,24 +195,18 @@
                 episode_lens.append(ep_len[env_ind_local])
                 episode_rews.append(ep_rew[env_ind_local])
                 episode_start_indices.append(ep_idx[env_ind_local])
-                # now we copy obs_next to obs, but since there might be
-                # finished episodes, we have to reset finished envs first.
-                self._reset_env_with_ids(
-                    env_ind_local, env_ind_global, gym_reset_kwargs
-                )
-                for i in env_ind_local:
-                    self._reset_state(i)
+                unfinished_ind_local = np.where(~done)[0]
 
                 # remove surplus env id from ready_env_ids
                 # to avoid bias in selecting environments
                 if n_episode:
-                    surplus_env_num = len(ready_env_ids) - (n_episode - episode_count)
-                    if surplus_env_num > 0:
-                        mask = np.ones_like(ready_env_ids, dtype=bool)
-                        mask[env_ind_local[:surplus_env_num]] = False
-                        ready_env_ids = ready_env_ids[mask]
-                        self.data = self.data[mask]
-
+                    ready_env_ids = ready_env_ids[unfinished_ind_local]
+                    self.data = self.data[unfinished_ind_local]
+                    if len(unfinished_ind_local) == 0:
+                        self._reset_env_to_next(gym_reset_kwargs)
+                        ready_env_ids = init_ready_env_ids
+                        for i in ready_env_ids:
+                            self._reset_state(i)
             self.data.obs = self.data.obs_next
 
             if (n_step and step_count >= n_step) or \

@MischaPanch
Collaborator

Thank you for the detailed description, I'm gonna think about this tomorrow and give a proper answer.

But even now, the situation you describe seems unusual: during training you would usually parallelize different instantiations of the same environment type. Why would some of them be slower than the others? Do you have a specific use case in mind?

MischaPanch added the labels question (Further information is requested) and algorithm enhancement (Not quite a new algorithm, but an enhancement to algo. functionality) on Feb 6, 2024

utkarshp commented Feb 6, 2024

Actually, I thought I would give a deterministic example to make the issue clearer, but I now realize that it was not necessary and might have caused confusion.

In my specific situation, episode lengths can vary a lot between episodes. The environments are not inherently different, but if some envs get stuck with long episodes, this problem can arise.

I should have started with this example, my apologies. Hope it is clearer now.

@maxhuettenrauch
Collaborator

Hi @utkarshp, thanks for the issue. We recently had a discussion around this very topic, and currently I believe that if all of your envs are created equal (i.e., episode length only varies due to randomness), then you are simply not collecting enough episodes to capture the variance between short and long episodes of your task. Deliberately stopping the first envs and "handpicking" some of the longer runs might actually skew the statistics in the same way as collecting too few episodes and ending up with only short ones would. The solution would be to either reduce the number of envs so that more episodes are collected per env, or to increase the number of collected episodes.

Please let me know if you have other reasons for wanting each env to sample exactly two episodes. We will also perform some testing to see whether there are significant differences when episodes come only from a small subset of the available envs rather than equally from all envs.


utkarshp commented Feb 7, 2024

The only reason is time. When I am training, I collect lots of episodes from each env, and this is not such an issue. When testing, however, I want to test on the same test set every time. Furthermore, I have a baseline algorithm that also needs to run on the same test set so that I can compare it with the RL agent's performance. I have 100 pre-selected episodes (= seeds) that I want to run all tests on, and my foremost requirement is that each of these episodes be evaluated once, so that a fair comparison can be made with the baseline algorithm and with past iterations. Of course I plan on trying different test sets for robustness, but currently I am only looking for reproducible behavior on a fixed test set. Since there is no good way (that I know of) to remove envs that are done with their episodes from the rotation, I elected to divide the episodes equally among the envs, hoping that it would work. Is there a better way to achieve this?
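For reference, my current setup looks roughly like the following (a simplified sketch: `SeededTestEnv` is not my real code, and CartPole stands in for my actual task):

```python
import gymnasium as gym
import numpy as np
from tianshou.env import SubprocVectorEnv

def make_base_env():
    return gym.make("CartPole-v1")   # stand-in for my actual environment

class SeededTestEnv(gym.Wrapper):
    """Work through a fixed list of episode seeds, one per reset (sketch)."""

    def __init__(self, env: gym.Env, seeds: list[int]):
        super().__init__(env)
        self._seeds = seeds
        self._next = 0

    def reset(self, **kwargs):
        if self._next < len(self._seeds):
            kwargs["seed"] = self._seeds[self._next]
            self._next += 1
        # once the assigned seeds are used up, further resets are unseeded
        # and produce episodes I do not actually want in the statistics
        return self.env.reset(**kwargs)

# 100 pre-selected episode seeds, split evenly across 50 test envs (2 each)
seed_chunks = np.array_split(np.arange(100), 50)
env_fns = [
    lambda s=[int(x) for x in chunk]: SeededTestEnv(make_base_env(), s)
    for chunk in seed_chunks
]
test_envs = SubprocVectorEnv(env_fns)
```

On top of this I call collect(n_episode=100), which is where the imbalance described above bites.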

Anyway, each episode takes around 10-15 minutes. If I parallelize into 50 threads, I am done in 30 minutes for each test. Otherwise, it takes longer. I have tried with 20-25 envs (with 5-4 episodes each), and this is still an issue. I suppose I could also try 10 envs with 10 episodes each and reduce my testing frequency.

All in all, I still think that enforcing an equal number of episodes per env would be desirable behavior. But now I think perhaps this could be a feature request.


utkarshp commented Feb 7, 2024

A bit unrelated perhaps, but I guess my problem could also be solved through some sort of interprocess communication between my test envs. Currently I pass episode start seeds at test env init, so that each env can execute the episodes it is assigned. Once an env is done with its two episodes, the next reset and step produce garbage that I don't want. If the processes could communicate, the faster envs could take over load from the slower envs and things could still be made to work. While this seems more complicated and error-prone, it may be worth exploring in case there is a sufficient speedup.
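Something like this is what I have in mind (untested sketch, again with CartPole as a stand-in); the seeds live in a shared queue, so whichever env resets next simply pulls the next unused seed:

```python
import multiprocessing as mp
import gymnasium as gym

def make_base_env():
    return gym.make("CartPole-v1")   # stand-in for my actual environment

class SeedQueueEnv(gym.Wrapper):
    """On every reset, pull the next unused episode seed from a shared queue (sketch)."""

    def __init__(self, env: gym.Env, seed_queue):
        super().__init__(env)
        self._queue = seed_queue
        self.exhausted = False   # set once all pre-selected seeds are used up

    def reset(self, **kwargs):
        try:
            kwargs["seed"] = int(self._queue.get_nowait())
        except Exception:
            # no seeds left: this episode is surplus and should be ignored downstream
            self.exhausted = True
        return self.env.reset(**kwargs)

manager = mp.Manager()
seed_queue = manager.Queue()
for s in range(100):
    seed_queue.put(s)

env_fns = [lambda: SeedQueueEnv(make_base_env(), seed_queue) for _ in range(50)]
```

These env_fns would then go into a SubprocVectorEnv; whether the queue proxy survives the worker startup there is something I would still have to verify.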

@bordeauxred
Contributor

Hi @utkarshp,
thanks for raising this issue and your proposed solution.
Based on this, I will open a PR addressing the request later on.

Using the sequential execution where every env has to terminate within each batch is, as you rightly point out, unbiased. It might be computationally quite undesirable, though. Consider for example an env that takes t steps in 90% of cases and T steps otherwise. With 10 test envs and 10 runs each, the batch-sequential run time can approach 10T, whereas there is a good chance of finishing in roughly T + 9t in the asynchronous case.
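For intuition, here is a quick Monte Carlo comparison of the two schemes under that toy model (wall time measured in env steps, arbitrary values for t and T, ignoring all other overhead):

```python
import numpy as np

rng = np.random.default_rng(0)
t, T = 10, 1000                    # "short" and "long" episode lengths (toy values)
n_envs, runs_per_env = 10, 10
trials = 10_000

sync_time, async_time = [], []
for _ in range(trials):
    # episode lengths for each env's runs: T with probability 0.1, t otherwise
    lens = np.where(rng.random((n_envs, runs_per_env)) < 0.1, T, t)
    # batch-sequential: in every round, all envs wait for the slowest one
    sync_time.append(lens.max(axis=0).sum())
    # asynchronous: no barrier, each env just works through its own 10 runs
    async_time.append(lens.sum(axis=1).max())

print("batch-sequential wall time:", np.mean(sync_time))
print("asynchronous wall time:    ", np.mean(async_time))
```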

As your use case is reproducibility at test time rather than raw collection speed, the batch-sequential fashion looks alright, though.

On a related note, do you find substantially different behaviour when running each of the 100 test seeds once vs. performing 100 runs in a single test env (passing a seed only for the first reset, which also fixes the sequence of random numbers / starting states for the subsequent episodes)?


utkarshp commented Feb 7, 2024

Yes, I agree with the computational toll this would take. Perhaps the collector could be given an additional non-default argument (unbiased=False) that can be set to True to ensure this behavior?

I am not entirely sure I fully understand your question, but I will try to answer. With the current implementation, yes, I do find substantially different behavior in the following two cases:

  1. Pass 100 episode start states to a single test env.
  2. Split the 100 start states into sequences of, say, two each and pass them to 50 test envs.

In the second case, I find that some of the envs finish their allotted episodes and start new ones. So I get some unwanted episodes but miss out on some that I wanted.

Please let me know if this answers your question.

MischaPanch added this to "To do" in Overall Tianshou Status via automation on Feb 7, 2024
@bordeauxred
Contributor

Agreed, it will be a non-default argument.

Yes, I see how you would get different results in the two cases you describe.

What I meant is slightly different; let me try to be clearer.

By passing start states, you mean passing seeds?

Could you please elaborate a bit on the environment you are using (e.g., is it deterministic or stochastic, does the seed only affect the initial state distribution or something else as well, and how different are the starting states drawn from the initial state distribution)?

The two cases I meant:
1) Having one test env that is reset with a seed in the first call; env.reset() in the next (99) calls is invoked without an explicit seed argument, so you get a different initial state (and maybe something else in your task) each time.
2) Calling env.reset(seed) with 100 different seed values (parallel or sequential, it doesn't matter).

For a sufficiently large number of runs, both approaches should converge to the same evaluation statistics.
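In code, the two protocols look roughly like this (plain gymnasium loop with CartPole and random actions as stand-ins for your task and policy, just to make the distinction explicit):

```python
import gymnasium as gym
import numpy as np

def run_episode(env: gym.Env, seed=None) -> float:
    """Run one episode and return its return (random actions stand in for the policy)."""
    obs, info = env.reset(seed=seed)
    total, done = 0.0, False
    while not done:
        obs, rew, terminated, truncated, info = env.step(env.action_space.sample())
        total += rew
        done = terminated or truncated
    return total

env = gym.make("CartPole-v1")

# 1) seed only the first reset; the remaining 99 episodes follow the env's own RNG stream
rews_1 = [run_episode(env, seed=0)] + [run_episode(env) for _ in range(99)]

# 2) one explicit seed per episode (whether this runs in parallel or not does not matter)
rews_2 = [run_episode(env, seed=s) for s in range(100)]

print(np.mean(rews_1), np.mean(rews_2))
```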


utkarshp commented Feb 9, 2024

My env is deterministic, so taking the same action in the same state always leads to the same reward and state transition. Passing 100 start states is therefore equivalent to passing 100 seeds and reseeding with the next seed at each reset.

I agree that the two setups you describe should converge to substantially the same evaluation statistics. The problem in my case is that for any policy, the per-episode rewards have a really large standard deviation. Comparing different policies on different sets of episodes is then a challenge, since their confidence intervals tend to overlap. However, if I can show that on the same set of episodes one policy is better than the other in terms of mean reward, and that this holds for multiple such sets, then I have a concrete result about which policy is better.
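As a toy illustration of why evaluating on the same episode set helps (purely synthetic numbers, nothing to do with my actual task):

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-episode reward is dominated by the episode's "difficulty";
# the two policies differ only by a small constant offset.
n = 100
difficulty = rng.normal(0.0, 10.0, size=n)                 # large episode-to-episode variance
rew_a = difficulty + 1.0 + rng.normal(0.0, 1.0, size=n)    # policy A on episodes 1..100
rew_b = difficulty + rng.normal(0.0, 1.0, size=n)          # policy B on the SAME episodes

# Unpaired comparison: the gap is swamped by the episode-to-episode variance.
se_unpaired = np.sqrt(rew_a.var(ddof=1) / n + rew_b.var(ddof=1) / n)
# Paired comparison on the same episode set: the difficulty term cancels out.
diff = rew_a - rew_b
se_paired = diff.std(ddof=1) / np.sqrt(n)

print("mean difference:     ", diff.mean())
print("unpaired std. error: ", se_unpaired)   # roughly 1.4
print("paired std. error:   ", se_paired)     # roughly 0.14
```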

Does this make sense? Please do let me know if you think there is a flaw in my reasoning.

@MischaPanch
Collaborator

We'll have to do #1058 first. The Collector is too convoluted to be touched directly, so we'll improve the situation in multiple steps.
