This repository has been archived by the owner on Mar 12, 2021. It is now read-only.

Create a CUDA context #406

Draft · wants to merge 34 commits into master

Conversation

@DhairyaLGandhi (Member) commented Aug 29, 2019

Thanks to IRTools.jl, we can do some nifty things with Julia IR, like using a dynamo to walk through deeply nested IR and offload sensible ops to the GPU.

julia> c = Conv((3,3), 3 => 16, pad = (1,1), relu); # from Flux

julia> r = rand(Float32, 32, 32, 3, 100);

julia> cuda() do
           c(r)
       end # run on GPU

julia> a = rand(Float32, 5*10^4);

julia> b = rand(Float32, 5*10^4);

julia> cuda() do
           a + b
       end
50000-element Array{Float32,1}:
 0.9649581
 1.2122422
 0.423553
...

Notice that the return type is a normal Array, meaning that without much fiddling it is trivial to offload computation to the GPU and continue where you left off.
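Under the hood, the shape is roughly this (a minimal sketch only; the cuda wrapper body and the IdDict cache field are assumptions, not this PR's literal code):

struct CUDACtx
    array_bank::IdDict{Array,CuArray}   # caches host Array => device CuArray
end

# the dynamo walks f's IR, offloading array ops to the GPU and handing
# plain Arrays back to the caller
cuda(f) = CUDACtx(IdDict{Array,CuArray}())(f)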

There are a couple of caveats: not all functions behave nicely yet, and we need better test coverage. But I'm opening this now to get some review and a direction for the way forward.

cc @MikeInnes

ref https://github.com/JuliaGPU/CuArrays.jl/issues/303

@vchuravy (Member)

Thanks! What is driving the choice to use IRTools over Cassette? I would prefer the maintenance burden to rest with Cassette (i.e. me).

@DhairyaLGandhi (Member, Author)

The choice was made because IRTools gives slightly finer control over the IR. It's also conceptually simpler, so maintaining it should be easier too.

It was also fairly straightforward to write in less code, which makes it more readable. Mind you, I'm no Cassette pro, but it's definitely worth a discussion.

src/context.jl Outdated
function get_cached(array_bank, arr::Array{T,N})::CuArray{T,N} where {T,N}
  haskey(array_bank, arr) ?
    array_bank[arr] :
    (array_bank[arr] = CuArray(arr))
end
Collaborator:

We could probably write get_cached(cx, x) as cx[x].
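For illustration, that sugar might be spelled like so (the array_bank field name follows the snippet above; a sketch, not PR code):

# make the context indexable: cx[x] looks up or uploads x
Base.getindex(cx::CUDACtx, x::Array) = get_cached(cx.array_bank, x)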

Member Author:

Thoughts on using get!? I suppose the extra memory transfer would be a problem in this case.

Collaborator:

You could look at Base.@get!, which avoids evaluating the new value if it's not needed.
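For illustration, the function-argument form of get! achieves the same effect (standard Base API, not code from this PR):

get!(array_bank, arr, CuArray(arr))         # eager: CuArray(arr) is built and uploaded even on a cache hit
get!(() -> CuArray(arr), array_bank, arr)   # lazy: the closure runs only on a cache miss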

src/context.jl Outdated
(array_bank[arr] = CuArray(arr))
end

function (c::CUDACtx)(::typeof(broadcasted), f, args...)
Collaborator:

For broadcast I think we should leave the broadcast struct alone (so that CuArrays can't leak into the program), and instead do all conversion and computation in materialize.
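A rough sketch of that shape (assuming this PR's get_cached and cache helpers, with argument shapes approximated; nested Broadcasted args ignored for brevity):

using Base.Broadcast: Broadcasted, broadcasted, materialize

# the lazy Broadcasted container passes through untouched, so CuArrays never
# leak into user code; conversion and compute happen only at materialize time
function (c::CUDACtx)(::typeof(materialize), bc::Broadcasted)
    gpu_args = map(a -> a isa Array ? get_cached(c.array_bank, a) : a, bc.args)
    result = materialize(broadcasted(bc.f, gpu_args...))   # runs on the GPU
    return cache(c, result)                                # hand back a host-side handle
end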

Member Author:

Makes sense, I will dig into it.

src/context.jl Outdated
ir = IR(meta...)
ir == nothing && return

pr = Pipe(ir)
Collaborator:

You could replace this code with IRTools.recurse!(ir) (there's some info in the docs if needed).
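For reference, the recurse!-based dynamo from the IRTools docs looks roughly like this (sketched against this PR's CUDACtx):

using IRTools: @dynamo, IR, recurse!

@dynamo function (c::CUDACtx)(args...)
    ir = IR(args...)
    ir === nothing && return   # no IR available (e.g. intrinsics): fall back to a plain call
    recurse!(ir)               # rewrite every call site to go back through c
    return ir
end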

src/context.jl Outdated
@eval (c::CUDACtx)(::typeof($f), args...) = $f(args...)
end

noop_pass.((get_cached, NNlib.check_spdf,
Collaborator:

Probably best if these are macros. It'd be nice to add an @cuda macro or similar for the purpose of overloading CUDACtx.
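A hypothetical sketch of noop_pass as a macro (each listed function is called as-is, without the context recursing into it):

macro noop_pass(fs...)
    defs = [:( (c::CUDACtx)(::typeof($f), args...) = $f(args...) ) for f in fs]
    esc(Expr(:block, defs...))
end

@noop_pass get_cached NNlib.check_spdf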

src/context.jl Outdated
noop_pass.((get_cached, NNlib.check_spdf,
))

for f in names(NNlib)
Collaborator:

Better to do this explicitly per function.
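That is, keep an explicit list (or one handwritten method per function) rather than sweeping up everything exported from NNlib; something along these lines, with the function list purely illustrative:

for f in (:conv, :depthwiseconv, :maxpool, :meanpool, :softmax)
    @eval (c::CUDACtx)(::typeof(NNlib.$f), args...) = NNlib.$f(args...)
end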

test/context.jl Outdated
@testset "simple ops" begin
  W = rand(5, 5)
  b = rand(5)
  @test cuda(() -> W*b) ≈ W*b
Collaborator:

Good to check types here as well, e.g. that the output is still an Array.
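For example (sketch):

@test cuda(() -> W*b) isa Array{Float64,1}   # still a plain Array, not a CuArray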

Member Author:

Would it be worthwhile to have a way to switch off emptying the context? I'd like to be able to tell whether Arrays were in fact also allocated on the GPU; a crude way might be to check the types in the context dict after the fact.

@MikeInnes (Collaborator)

@vchuravy there probably isn't much in it, so if the lead maintainers of this package strongly prefer Cassette then I imagine it'd be OK to port it over.

Though as Dhairya points out, there are a couple of potential advantages to fine-grained control of the IR pass; the main one is that it's easier to cut out classes of functions we're not interested in, e.g. intrinsics or certain modules in Base, avoiding some redundant recompilation.

@@ -4,7 +4,7 @@ using CUDAapi, CUDAdrv, CUDAnative

 using GPUArrays

-export CuArray, CuVector, CuMatrix, CuVecOrMat, cu
+export CuArray, CuVector, CuMatrix, CuVecOrMat, cu, cuda
Member:

Seems like a pretty generic function to export (both cu and cuda are bound to confuse users). Why not something that implies its action, e.g. on_cuda? Or @cuda, cf. @async?

Collaborator:

I agree about not exporting this for now. In the longer term, if this is successful it should replace cu entirely (alongside all the other APIs, for most users), so a generic name seems appropriate.

I think cuda() do ... reads right, and provides an obvious space for options (cuda(device=2) do ...), but @cuda could work well too (especially in that it's a bit nicer for one-liners).

@maleadt (Member) commented Aug 29, 2019

Very interesting! Looking forward to giving this a spin, might open up some nice new ways of doing GPU computation.

I guess we'll need some way to assert GPU execution to actually test this?

@DhairyaLGandhi (Member, Author)

Yeah, for the tests I was thinking of just having a context which we can look into, to assert that the array is actually in there and corresponds to memory associated with the GPU.
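Crudely sketched (assuming a hypothetical cuda method that takes an explicit context):

ctx = CUDACtx(IdDict())
cuda(ctx) do
    rand(5, 5) * rand(5)
end
@test any(x -> x isa CuArray, values(ctx.array_bank))   # something actually landed on the GPU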

@vchuravy (Member) commented Sep 2, 2019

Grrml Gmail ate my reply:

Since CUDAnative will use Cassette and GPUifyLoops already does, I would strongly prefer having only one tool in the GPU ecosystem to do this. I would be interested in making the IRTools transforms/utility functions work with Cassette, which should be relatively straightforward.

end
end

@contextual :+ :- :* :/ sum similar materialize
Collaborator:

I think we should set these things up to explicitly call whatever lower-level bindings we have; it should show what it would look like if we got rid of CuArray altogether.
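For example, * on matrices might route straight to the CUBLAS binding (a rough sketch; the helper argument shapes are approximated from this PR):

function (c::CUDACtx)(::typeof(*), A::Matrix{Float32}, B::Matrix{Float32})
    dA, dB = get_cached(c.array_bank, A), get_cached(c.array_bank, B)
    # call the low-level wrapper directly instead of a CuArray * method
    cache(c, CuArrays.CUBLAS.gemm('N', 'N', dA, dB))
end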

end
end

@noop_pass get_cached NNlib.check_spdf
Collaborator:

Why do we need a noop for get_cached? That shouldn't ever be called in code that we're transforming, right?

using IRTools: meta, Pipe, finish, Variable, self
using MacroTools: @forward

import Base.Broadcast.broadcasted
Collaborator:

These imports are redundant now.


function cache(cx, x::CuArray{T,N})::Array{T,N} where {T,N}
  cpu = Array{T,N}(undef, ntuple(_ -> 0, N))
  cx[cpu] = x
  return cpu
end
Member Author:

In cases like BatchNorm, before any compute is hit, the dimension check sees the CPU placeholder, which has zero shape, and errors. Copying the data back also seems wasteful. Thoughts?

This does seem to make things work (including the backwards pass with Zygote on Flux models), but it's hitting some bad code paths currently.
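One possible middle ground, sketched: allocate the placeholder with the real shape but leave it undef, so dimension checks pass without copying any data back:

function cache(cx, x::CuArray{T,N})::Array{T,N} where {T,N}
  cpu = Array{T,N}(undef, size(x))   # correct shape, but no device-to-host copy
  cx[cpu] = x
  return cpu
end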
