discussion: continuous integration using GPUs #2184

Open
grlee77 opened this issue Jun 14, 2020 · 7 comments

Comments

@grlee77
Contributor

grlee77 commented Jun 14, 2020

Description

I opened this issue to discuss potential future solutions for continuous integration with GPUs, following last week's lab meeting.

Some initial searching found approaches based on Azure and/or Jenkins using either local or cloud hardware.

1.) I found this blog post from a researcher at UIUC explaining how they configured Azure to run CI on their local hardware.
https://cwpearson.github.io/post/20190520-self-hosted-azp/
https://github.com/cwpearson/azure-pipelines-agent

2.) It looks like CuPy and PyTorch both use Jenkins.
I think in their cases it runs on their own hardware, although there is also an EC2 plugin that would allow running on Amazon EC2 compute instances instead.
http://blog.innodatalabs.com/automating-ml-training-with-jenkins-pipelines/

Given that most PRs here would not involve GPU code, we would not want to run CI on the GPU instances unless specifically requested. CuPy has configured a bot so that when one of the maintainers types "Jenkins, test this please", the full CI starts.

If others are aware of additional potential solutions, please let me know.

@matthew-brett
Contributor

Buildbot is still healthy - I recently set up an instance for https://www.psychopy.org. It's relatively easy to configure. I'm happy to help if you're interested.

@grlee77
Contributor Author

grlee77 commented Jun 15, 2020

Thanks @matthew-brett. We do not need it immediately, but we are trying to determine its feasibility for future work.

Are you saying you have previously configured testing involving GPUs for other projects? Is the hardware used via EC2 or is NVIDIA hardware available by other means? We would probably want to start from something like an nvidia-docker container.

I am an occasional user of CI, but it is not a real area of expertise, so any guidance would be appreciated. Basically at this point we are trying to get some estimate of the level of effort it would take to set up and what the potential cost might be.

@matthew-brett
Contributor

No, sorry, I have not set up GPUs on Buildbot. I have set up build farms with actual machines, rather than VMs. That is a moderate amount of work, which we needed before Travis-CI and Azure could cover most of the variants we needed. The Buildbot instance I was working on recently was for PsychoPy, where they need to test machines attached to actual hardware like specific graphics cards, button boxes and so on. I was imagining that as one way you could test specific GPUs.

@leofang

leofang commented Jul 20, 2020

We had a local Azure agent set up to do this job, see NSLS-II/ptycho_gui#90. It is doable on a per-repo basis: just set up the connection between any GitHub repo and Azure Pipeline, add local GPU machines to the Azure runner pool, and run the Azure agents on those machines to listen to jobs. We also set up a GitHub Action to do the same thing: NSLS-II/ptycho_gui#91. (Those PRs just verified we can access our local GPU machine; I haven't found time to actually test the GPU capability, but it should be relatively trivial.)
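
For reference, the pipeline definition for this kind of setup ends up being quite short. A simplified, untested sketch (the pool name is just a placeholder for whatever the local machines are registered under, and the `gpu` pytest marker is hypothetical):

```yaml
# azure-pipelines.yml -- simplified, untested sketch
trigger:
  - master

pool:
  # self-hosted agent pool that the local GPU machines were added to
  # (placeholder name; use whatever the pool is actually called)
  name: LocalGPUPool

steps:
  - script: nvidia-smi
    displayName: Check that the agent sees the GPU
  - script: |
      pip install -e .[test]
      pytest -m gpu
    displayName: Run GPU-marked tests
```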

One thing I am unhappy about with this approach is that it really needs dedicated resources. Because of the recent lockdown, we had to reallocate our machine for other purposes, so we had to shut down the CI...

@grlee77
Contributor Author

grlee77 commented Jul 20, 2020

Thanks @leofang, that is helpful. In your configuration, does the GPU-based CI kick off automatically on each PR or can you configure it to have a bot run it only when requested? I think for DIPY's usage scenario, given that all code is currently CPU-only, we would only need to start the GPU-based CI when specifically requested on future PRs that involve additions/changes to GPU-related features.

I was thinking that using cloud resources rather than local hardware might be cheaper given the relatively low volume we would initially need, but haven't tried to come up with a concrete cost estimate at this point.

@leofang

leofang commented Jul 20, 2020

> does the GPU-based CI kick off automatically on each PR or can you configure it to have a bot run it only when requested?

Currently it runs every time the PR is updated, but yes, I think invoking it on demand (something like "Jenkins, test this please") is a more desirable behavior. I'll look into it.
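
For GitHub Actions, one way to get that behavior would be to trigger the workflow from PR comments instead of pushes. An untested sketch (the `gpu` runner label, the trigger phrase, and the `gpu` pytest marker are all placeholders):

```yaml
# .github/workflows/gpu-ci.yml -- untested sketch
name: GPU CI (on request)

on:
  issue_comment:
    types: [created]

jobs:
  gpu-tests:
    # issue_comment fires for both issues and PRs, so only react to PR comments
    # that contain the trigger phrase
    if: github.event.issue.pull_request && contains(github.event.comment.body, 'run gpu tests')
    runs-on: [self-hosted, gpu]
    steps:
      # note: on issue_comment this checks out the default branch;
      # checking out the PR head would need an extra step to resolve the PR ref
      - uses: actions/checkout@v2
      - run: nvidia-smi
      - run: |
          pip install -e .[test]
          pytest -m gpu
```

Restricting the trigger to maintainers would need an additional check on the comment author, but that's the general idea.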

> I was thinking that using cloud resources rather than local hardware might be cheaper given the relatively low volume we would initially need, but haven't tried to come up with a concrete cost estimate at this point.

The thing is, we don't even know which cloud CI services provide GPUs, so we have to use our own. I know RAPIDS has gpuCI, but I don't think it's exposed to general users. It would be nice to have a cloud-based option indeed (our use case is also low-volume).

@arokem
Contributor

arokem commented Aug 24, 2020

This neat video shows how to use a GPU on a self-hosted runner in CI using GitHub Actions: https://www.youtube.com/watch?v=rVq-SCNyxVc. Also, this blog post: https://dvc.org/blog/cml-self-hosted-runners-on-demand-with-gpus
