discussion: continuous integration using GPUs #2184

Open
grlee77 opened this issue Jun 14, 2020 · 7 comments

Comments

@grlee77
Contributor

grlee77 commented Jun 14, 2020

Description

I opened this issue to discuss potential future solutions for continuous integration with GPUs, following last week's lab meeting.

Some initial searching found approaches based on Azure and/or Jenkins using either local or cloud hardware.

1.) I found this blog post from a researcher at UIUC explaining how they configured Azure to run CI on their local hardware.
https://cwpearson.github.io/post/20190520-self-hosted-azp/
https://github.com/cwpearson/azure-pipelines-agent

2.) It looks like CuPy and PyTorch both use Jenkins.
I think in their cases it runs on their own hardware, although there is also an EC2 plugin that would allow running on Amazon EC2 compute instances instead.
http://blog.innodatalabs.com/automating-ml-training-with-jenkins-pipelines/

Given that most PRs here would not involve GPU code, we would not want to run CI on the GPU instances unless specifically requested. CuPy has configured a bot so that when one of the maintainers types "Jenkins, test this please", the full CI starts.

If others are aware of additional potential solutions, please let me know.

@matthew-brett
Contributor

Buildbot is still healthy - I recently set up an instance for https://www.psychopy.org. It's relatively easy to configure. I'm happy to help if you're interested.

@grlee77
Contributor Author

grlee77 commented Jun 15, 2020

Thanks @matthew-brett. We do not need it immediately, but we are trying to determine its feasibility for future work.

Are you saying you have previously configured testing involving GPUs for other projects? Is the hardware used via EC2 or is NVIDIA hardware available by other means? We would probably want to start from something like an nvidia-docker container.

I am an occasional user of CI, but it is not a real area of expertise, so any guidance would be appreciated. Basically at this point we are trying to get some estimate of the level of effort it would take to set up and what the potential cost might be.

@matthew-brett
Contributor

No, sorry, I have not set up GPUs on Buildbot. I have set up build farms with actual machines, rather than VMs. That is a moderate amount of work, which we needed before Travis-CI and Azure could cover most of the variants we needed. The Buildbot instance I was working on recently was for PsychoPy, where they need to test machines attached to actual hardware like specific graphics cards, button boxes and so on. I was imagining that as one way you could test specific GPUs.

@leofang

leofang commented Jul 20, 2020

We had a local Azure agent set up to do this job, see NSLS-II/ptycho_gui#90. It is doable on a per-repo basis: just set up the connection between any GitHub repo and Azure Pipeline, add local GPU machines to the Azure runner pool, and run the Azure agents on those machines to listen to jobs. We also set up a GitHub Action to do the same thing: NSLS-II/ptycho_gui#91. (Those PRs just verified we can access our local GPU machine; I haven't found time to actually test the GPU capability, but it should be relatively trivial.)
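
For reference, the pipeline definition for this kind of setup ends up being quite short. A simplified, untested sketch (the pool name is just a placeholder for whatever the local machines are registered under, and the `gpu` pytest marker is hypothetical):

```yaml
# azure-pipelines.yml -- simplified, untested sketch
trigger:
  - master

pool:
  # self-hosted agent pool that the local GPU machines were added to
  # (placeholder name; use whatever the pool is actually called)
  name: LocalGPUPool

steps:
  - script: nvidia-smi
    displayName: Check that the agent sees the GPU
  - script: |
      pip install -e .[test]
      pytest -m gpu
    displayName: Run GPU-marked tests
```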

One thing I am unhappy about with this approach is that it really needs dedicated resources. Because of the recent lockdown, we had to reallocate our machine for other purposes, so we had to shut down the CI...

@grlee77
Contributor Author

grlee77 commented Jul 20, 2020

Thanks @leofang, that is helpful. In your configuration, does the GPU-based CI kick off automatically on each PR or can you configure it to have a bot run it only when requested? I think for DIPY's usage scenario, given that all code is currently CPU-only, we would only need to start the GPU-based CI when specifically requested on future PRs that involve additions/changes to GPU-related features.

I was thinking that using cloud resources rather than local hardware might be cheaper given the relatively low volume we would initially need, but haven't tried to come up with a concrete cost estimate at this point.

@leofang

leofang commented Jul 20, 2020

> does the GPU-based CI kick off automatically on each PR or can you configure it to have a bot run it only when requested?

Currently it runs every time the PR is updated, but yes, I think invoking it on demand (something like "Jenkins, test this please") is a more desirable behavior. I'll look into it.
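
For GitHub Actions, one way to get that behavior would be to trigger the workflow from PR comments instead of pushes. An untested sketch (the `gpu` runner label, the trigger phrase, and the `gpu` pytest marker are all placeholders):

```yaml
# .github/workflows/gpu-ci.yml -- untested sketch
name: GPU CI (on request)

on:
  issue_comment:
    types: [created]

jobs:
  gpu-tests:
    # issue_comment fires for both issues and PRs, so only react to PR comments
    # that contain the trigger phrase
    if: github.event.issue.pull_request && contains(github.event.comment.body, 'run gpu tests')
    runs-on: [self-hosted, gpu]
    steps:
      # note: on issue_comment this checks out the default branch;
      # checking out the PR head would need an extra step to resolve the PR ref
      - uses: actions/checkout@v2
      - run: nvidia-smi
      - run: |
          pip install -e .[test]
          pytest -m gpu
```

Restricting the trigger to maintainers would need an additional check on the comment author, but that's the general idea.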

> I was thinking that using cloud resources rather than local hardware might be cheaper given the relatively low volume we would initially need, but haven't tried to come up with a concrete cost estimate at this point.

The thing is, we don't even know which cloud CI services provide GPUs, so we have to use our own. I know RAPIDS has gpuCI, but I don't think it's exposed to general users. It would be nice to have a cloud-based option indeed (our use case is also low-volume).

@arokem
Contributor

arokem commented Aug 24, 2020

This neat video shows how to use a GPU on a self-hosted runner in CI using GitHub Actions: https://www.youtube.com/watch?v=rVq-SCNyxVc. Also, this blog post: https://dvc.org/blog/cml-self-hosted-runners-on-demand-with-gpus
