This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

Support for different hardware configurations for different task roles of one distributed job. #5808

Open
siaimes opened this issue Oct 12, 2022 · 0 comments

Comments

@siaimes
Contributor

siaimes commented Oct 12, 2022

What would you like to be added:
Support for different hardware configurations for different task roles of one distributed job.

Why is this needed:
For complex learning tasks, the programs that run on each computer differ widely, and their requirements for CPU/GPU and RAM/GPU memory differ as well. At the same time, these computers need to communicate with each other to train jointly. Reinforcement learning is one example: the algorithm consists of several modules. The actor uses the GPU to generate data, the learner uses the GPU to train on that data, and the environment and MCTS use the CPU to generate data in parallel. These modules also involve complex data communication with one another.

Without this feature, how does the current module work:
Without it, reinforcement learning tasks like this cannot be run jointly across multiple computers.

Components that may involve changes:
Job protocol and related.

Move the vc setting down from the job level to the taskrole level:
[image]

Allow each taskrole to have a different skutype:
[image]
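
As a rough illustration of the two proposals above, here is a minimal Python sketch of a job whose task roles each carry their own resource request, an optional vc, and a skutype. It is not the actual job protocol schema: the class names, field names, SKU names (`gpu-small`, `gpu-large`, `cpu-only`), and the fallback rule for the vc are all assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class TaskRole:
    """Hypothetical per-role settings; not the real job protocol fields."""
    instances: int
    cpu: int
    memory_mb: int
    gpu: int = 0
    virtual_cluster: Optional[str] = None  # assumed per-role vc override
    sku_type: Optional[str] = None         # assumed per-role skutype


@dataclass
class Job:
    name: str
    default_virtual_cluster: str  # job-level vc, used when a role sets none
    task_roles: Dict[str, TaskRole]

    def placement(self) -> Dict[str, Tuple[str, Optional[str]]]:
        """Map each role to (vc, skutype), falling back to the job-level vc."""
        return {
            role_name: (
                role.virtual_cluster or self.default_virtual_cluster,
                role.sku_type,
            )
            for role_name, role in self.task_roles.items()
        }

    def total_gpus(self) -> int:
        """Sum GPUs over all instances of all roles."""
        return sum(r.instances * r.gpu for r in self.task_roles.values())


# The reinforcement learning example from the issue, with made-up numbers.
job = Job(
    name="rl-example",
    default_virtual_cluster="default",
    task_roles={
        "actor": TaskRole(instances=2, cpu=4, memory_mb=16384, gpu=1,
                          sku_type="gpu-small"),
        "learner": TaskRole(instances=1, cpu=8, memory_mb=65536, gpu=4,
                            virtual_cluster="training", sku_type="gpu-large"),
        "env_mcts": TaskRole(instances=8, cpu=16, memory_mb=32768,
                             sku_type="cpu-only"),
    },
)

for role_name, (vc, sku) in job.placement().items():
    print(f"{role_name}: vc={vc}, skutype={sku}")
```

In this sketch the learner is placed in its own vc with a larger GPU SKU, while the actor and the CPU-only environment/MCTS workers inherit the job-level vc. Whether a real implementation should keep a job-level default at all is a design question left open here.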
