
Add NVIDIA apex support and gradient checkpointing to reduce memory footprint #1090

Open · wants to merge 6 commits into master
Conversation

seovchinnikov

I've added NVIDIA apex support and a gradient checkpointing mechanism (https://pytorch.org/docs/stable/checkpoint.html) to reduce the memory footprint.

You can run it with --checkpointing --opt_level "O2" and an increased input crop size (I was able to run CycleGAN with crop sizes up to 896 on my RTX 2080). Checkpointing is only used for CycleGAN for now (this can be extended further).
Please note that this was tested on a PyTorch 1.7 nightly build; apex behaves unstably on older versions.
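The checkpointing mechanism linked above trades compute for memory: activations inside a wrapped block are freed after the forward pass and recomputed during backward. A minimal sketch of the idea, assuming only stock PyTorch (`TinyResnetGenerator` and the `use_checkpointing` flag are illustrative stand-ins, not the PR's actual code):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class TinyResnetGenerator(nn.Module):
    """Toy stand-in for a CycleGAN-style generator made of sequential blocks."""

    def __init__(self, use_checkpointing=False):
        super().__init__()
        self.use_checkpointing = use_checkpointing
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Conv2d(3, 3, 3, padding=1), nn.ReLU())
            for _ in range(4)
        )

    def forward(self, x):
        for block in self.blocks:
            if self.use_checkpointing and torch.is_grad_enabled():
                # Activations inside `block` are not stored; they are
                # recomputed during backward, reducing peak memory.
                x = checkpoint(block, x, use_reentrant=False)
            else:
                x = block(x)
        return x

# Gradients should be identical with and without checkpointing.
torch.manual_seed(0)
ckpt_net = TinyResnetGenerator(use_checkpointing=True)
torch.manual_seed(0)
plain_net = TinyResnetGenerator(use_checkpointing=False)

x = torch.randn(1, 3, 8, 8)
xa = x.clone().requires_grad_(True)
xb = x.clone().requires_grad_(True)
ckpt_net(xa).sum().backward()
plain_net(xb).sum().backward()
```

The `use_reentrant=False` argument requires a newer PyTorch; on older versions the plain `checkpoint(block, x)` form applies.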

@junyanz
Owner

junyanz commented Jul 11, 2020

Great feature! I am wondering if you can get the same results with and without apex and gradient checkpointing.

@seovchinnikov
Author

I think we should run the base tests and check the results against the baselines.

@vict0rsch

Note that amp is part of PyTorch as of 1.6: https://pytorch.org/docs/stable/notes/amp_examples.html#amp-examples
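For reference, the native `torch.cuda.amp` API replaces apex's `amp.initialize`/opt-level flags with an `autocast` context plus a `GradScaler`. A minimal sketch assuming only stock PyTorch ≥ 1.6 (the model and loss here are placeholders; on a CPU-only machine `enabled=False` makes both pieces no-ops):

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"  # autocast/GradScaler only take effect on CUDA

model = nn.Linear(16, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

for _ in range(3):
    x = torch.randn(8, 16, device=device)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(enabled=use_amp):
        # Forward pass runs in mixed precision when enabled.
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()  # scale loss to avoid fp16 gradient underflow
    scaler.step(optimizer)         # unscales gradients, then optimizer.step()
    scaler.update()                # adjusts the scale factor for the next step
```

Unlike apex, this needs no separate install and no `--opt_level` choice, which may simplify the PR's surface area.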
