Add GPU reverse mode to EnzymeExt #454

Open · wants to merge 1 commit into main from enz_rev_gpu

Conversation

wsmoses (Collaborator) commented Jan 24, 2024

@michel2323's PR, but opening it here so we have a place to discuss.

wsmoses (Collaborator, Author) commented Jan 24, 2024

wmoses@beast:~/git/Enzyme.jl/KernelAbstractions.jl (enz_rev_gpu) $ ../julia-1.10.0-rc2/bin/julia --project reverse_gpu.jl 
Custom rule GPU
TapeType = @NamedTuple{1, 2, 3::@NamedTuple{1::@NamedTuple{1::UInt32, 2::@NamedTuple{1, 2::@NamedTuple{1, 2::UInt64, 3::Bool, 4::UInt64}, 3, 4::Bool}, 3::@NamedTuple{1, 2, 3}, 4::Core.LLVMPtr{UInt8, 0}, 5::Bool, 6, 7::Bool, 8::UInt64, 9::UInt64, 10, 11::Bool}, 2, 3::Bool, 4::Bool, 5::Bool}, 4::@NamedTuple{1::@NamedTuple{1::UInt32, 2::@NamedTuple{1, 2::@NamedTuple{1, 2::UInt64, 3::Bool, 4::UInt64}, 3, 4::Bool}, 3::@NamedTuple{1, 2, 3}, 4::Core.LLVMPtr{UInt8, 0}, 5::Bool, 6, 7::Bool, 8::UInt64, 9::UInt64, 10, 11::Bool}, 2, 3::Bool, 4::Bool, 5::Bool}, 5::@NamedTuple{1::@NamedTuple{1::UInt32, 2::@NamedTuple{1, 2::@NamedTuple{1, 2::UInt64, 3::Bool, 4::UInt64}, 3, 4::Bool}, 3::Core.LLVMPtr{UInt8, 0}, 4::@NamedTuple{1, 2, 3}, 5::Bool, 6, 7::Bool, 8::UInt64, 9::UInt64}, 2, 3::Bool, 4::Bool, 5::Bool}, 6, 7, 8, 9, 10::Float64, 11::Float64}
kernels: Error During Test at /home/wmoses/git/Enzyme.jl/KernelAbstractions.jl/reverse_gpu.jl:31
  Got exception outside of a @test
  GPU compilation of MethodInstance for EnzymeExt.aug_fwd(::KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}}}, ::typeof(gpu_square!), ::Val{(false, false, true)}, ::Vector{@NamedTuple{1, 2, 3::@NamedTuple{1::@NamedTuple{1::UInt32, 2::@NamedTuple{1, 2::@NamedTuple{1, 2::UInt64, 3::Bool, 4::UInt64}, 3, 4::Bool}, 3::@NamedTuple{1, 2, 3}, 4::Core.LLVMPtr{UInt8, 0}, 5::Bool, 6, 7::Bool, 8::UInt64, 9::UInt64, 10, 11::Bool}, 2, 3::Bool, 4::Bool, 5::Bool}, 4::@NamedTuple{1::@NamedTuple{1::UInt32, 2::@NamedTuple{1, 2::@NamedTuple{1, 2::UInt64, 3::Bool, 4::UInt64}, 3, 4::Bool}, 3::@NamedTuple{1, 2, 3}, 4::Core.LLVMPtr{UInt8, 0}, 5::Bool, 6, 7::Bool, 8::UInt64, 9::UInt64, 10, 11::Bool}, 2, 3::Bool, 4::Bool, 5::Bool}, 5::@NamedTuple{1::@NamedTuple{1::UInt32, 2::@NamedTuple{1, 2::@NamedTuple{1, 2::UInt64, 3::Bool, 4::UInt64}, 3, 4::Bool}, 3::Core.LLVMPtr{UInt8, 0}, 4::@NamedTuple{1, 2, 3}, 5::Bool, 6, 7::Bool, 8::UInt64, 9::UInt64}, 2, 3::Bool, 4::Bool, 5::Bool}, 6, 7, 8, 9, 10::Float64, 11::Float64}}, ::Duplicated{CuDeviceVector{Float64, 1}}) failed
  KernelError: passing and using non-bitstype argument
  
  Argument 5 to your kernel function is of type Vector{@NamedTuple{1, 2, 3::@NamedTuple{1::@NamedTuple{1::UInt32, 2::@NamedTuple{1, 2::@NamedTuple{1, 2::UInt64, 3::Bool, 4::UInt64}, 3, 4::Bool}, 3::@NamedTuple{1, 2, 3}, 4::Core.LLVMPtr{UInt8, 0}, 5::Bool, 6, 7::Bool, 8::UInt64, 9::UInt64, 10, 11::Bool}, 2, 3::Bool, 4::Bool, 5::Bool}, 4::@NamedTuple{1::@NamedTuple{1::UInt32, 2::@NamedTuple{1, 2::@NamedTuple{1, 2::UInt64, 3::Bool, 4::UInt64}, 3, 4::Bool}, 3::@NamedTuple{1, 2, 3}, 4::Core.LLVMPtr{UInt8, 0}, 5::Bool, 6, 7::Bool, 8::UInt64, 9::UInt64, 10, 11::Bool}, 2, 3::Bool, 4::Bool, 5::Bool}, 5::@NamedTuple{1::@NamedTuple{1::UInt32, 2::@NamedTuple{1, 2::@NamedTuple{1, 2::UInt64, 3::Bool, 4::UInt64}, 3, 4::Bool}, 3::Core.LLVMPtr{UInt8, 0}, 4::@NamedTuple{1, 2, 3}, 5::Bool, 6, 7::Bool, 8::UInt64, 9::UInt64}, 2, 3::Bool, 4::Bool, 5::Bool}, 6, 7, 8, 9, 10::Float64, 11::Float64}}, which is not isbits:
  
  
  Stacktrace:
    [1] check_invocation(job::GPUCompiler.CompilerJob)
      @ GPUCompiler ~/.julia/packages/GPUCompiler/U36Ed/src/validation.jl:92
    [2] macro expansion
      @ ~/.julia/packages/GPUCompiler/U36Ed/src/driver.jl:123 [inlined]
    [3] macro expansion
      @ ~/.julia/packages/TimerOutputs/RsWnF/src/TimerOutput.jl:253 [inlined]
    [4] codegen(output::Symbol, job::GPUCompiler.CompilerJob; libraries::Bool, toplevel::Bool, optimize::Bool, cleanup::Bool, strip::Bool, validate::Bool, only_entry::Bool, parent_job::Nothing)
      @ GPUCompiler ~/.julia/packages/GPUCompiler/U36Ed/src/driver.jl:121
    [5] compile(target::Symbol, job::GPUCompiler.CompilerJob; libraries::Bool, toplevel::Bool, optimize::Bool, cleanup::Bool, strip::Bool, validate::Bool, only_entry::Bool)
      @ GPUCompiler ~/.julia/packages/GPUCompiler/U36Ed/src/driver.jl:106
    [6] compile
      @ ~/.julia/packages/GPUCompiler/U36Ed/src/driver.jl:98 [inlined]
    [7] #1075
      @ ~/.julia/packages/CUDA/YIj5X/src/compiler/compilation.jl:247 [inlined]
    [8] JuliaContext(f::CUDA.var"#1075#1077"{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}})
      @ GPUCompiler ~/.julia/packages/GPUCompiler/U36Ed/src/driver.jl:47
    [9] compile(job::GPUCompiler.CompilerJob)
      @ CUDA ~/.julia/packages/CUDA/YIj5X/src/compiler/compilation.jl:246
   [10] actual_compilation(cache::Dict{Any, CuFunction}, src::Core.MethodInstance, world::UInt64, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::typeof(CUDA.compile), linker::typeof(CUDA.link))
      @ GPUCompiler ~/.julia/packages/GPUCompiler/U36Ed/src/execution.jl:125
   [11] cached_compilation(cache::Dict{Any, CuFunction}, src::Core.MethodInstance, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::Function, linker::Function)
      @ GPUCompiler ~/.julia/packages/GPUCompiler/U36Ed/src/execution.jl:103
   [12] macro expansion
      @ ~/.julia/packages/CUDA/YIj5X/src/compiler/execution.jl:382 [inlined]
   [13] macro expansion
      @ ./lock.jl:267 [inlined]
   [14] cufunction(f::typeof(EnzymeExt.aug_fwd), tt::Type{Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}}}, typeof(gpu_square!), Val{(false, false, true)}, Vector{@NamedTuple{1, 2, 3::@NamedTuple{1::@NamedTuple{1::UInt32, 2::@NamedTuple{1, 2::@NamedTuple{1, 2::UInt64, 3::Bool, 4::UInt64}, 3, 4::Bool}, 3::@NamedTuple{1, 2, 3}, 4::Core.LLVMPtr{UInt8, 0}, 5::Bool, 6, 7::Bool, 8::UInt64, 9::UInt64, 10, 11::Bool}, 2, 3::Bool, 4::Bool, 5::Bool}, 4::@NamedTuple{1::@NamedTuple{1::UInt32, 2::@NamedTuple{1, 2::@NamedTuple{1, 2::UInt64, 3::Bool, 4::UInt64}, 3, 4::Bool}, 3::@NamedTuple{1, 2, 3}, 4::Core.LLVMPtr{UInt8, 0}, 5::Bool, 6, 7::Bool, 8::UInt64, 9::UInt64, 10, 11::Bool}, 2, 3::Bool, 4::Bool, 5::Bool}, 5::@NamedTuple{1::@NamedTuple{1::UInt32, 2::@NamedTuple{1, 2::@NamedTuple{1, 2::UInt64, 3::Bool, 4::UInt64}, 3, 4::Bool}, 3::Core.LLVMPtr{UInt8, 0}, 4::@NamedTuple{1, 2, 3}, 5::Bool, 6, 7::Bool, 8::UInt64, 9::UInt64}, 2, 3::Bool, 4::Bool, 5::Bool}, 6, 7, 8, 9, 10::Float64, 11::Float64}}, Duplicated{CuDeviceVector{Float64, 1}}}}; kwargs::@Kwargs{always_inline::Bool, maxthreads::Nothing})
      @ CUDA ~/.julia/packages/CUDA/YIj5X/src/compiler/execution.jl:377
   [15] macro expansion
      @ ~/.julia/packages/CUDA/YIj5X/src/compiler/execution.jl:104 [inlined]
   [16] (::KernelAbstractions.Kernel{CUDABackend, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, typeof(EnzymeExt.aug_fwd)})(::Function, ::Vararg{Any}; ndrange::Tuple{Int64}, workgroupsize::Nothing)
      @ CUDA.CUDAKernels ~/.julia/packages/CUDA/YIj5X/src/CUDAKernels.jl:118
   [17] #augmented_primal#12
      @ ~/git/Enzyme.jl/KernelAbstractions.jl/ext/EnzymeExt.jl:163
   [18] augmented_primal
      @ ~/git/Enzyme.jl/KernelAbstractions.jl/ext/EnzymeExt.jl:115 [inlined]
   [19] square_caller
      @ ~/git/Enzyme.jl/KernelAbstractions.jl/reverse_gpu.jl:14 [inlined]
   [20] square_caller
      @ ~/git/Enzyme.jl/KernelAbstractions.jl/reverse_gpu.jl:0 [inlined]
   [21] diffejulia_square_caller_3883_inner_1wrap
      @ ~/git/Enzyme.jl/KernelAbstractions.jl/reverse_gpu.jl:0
   [22] macro expansion
      @ Enzyme.Compiler ~/git/Enzyme.jl/src/compiler.jl:5306 [inlined]
   [23] enzyme_call
      @ Enzyme.Compiler ~/git/Enzyme.jl/src/compiler.jl:4984 [inlined]
   [24] CombinedAdjointThunk
      @ Enzyme.Compiler ~/git/Enzyme.jl/src/compiler.jl:4926 [inlined]
   [25] autodiff
      @ Enzyme ~/git/Enzyme.jl/src/Enzyme.jl:215 [inlined]
   [26] autodiff(::ReverseMode{false, FFIABI}, ::Const{typeof(square_caller)}, ::Duplicated{CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}}, ::Const{CUDABackend})
      @ Enzyme ~/git/Enzyme.jl/src/Enzyme.jl:238
   [27] autodiff
      @ ~/git/Enzyme.jl/src/Enzyme.jl:224 [inlined]
   [28] macro expansion
      @ ~/git/Enzyme.jl/KernelAbstractions.jl/reverse_gpu.jl:40 [inlined]
   [29] macro expansion
      @ ~/git/Enzyme.jl/julia-1.10.0-rc2/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
   [30] enzyme_testsuite(backend::Type{CUDABackend}, ArrayT::Type, supports_reverse::Bool)
      @ Main ~/git/Enzyme.jl/KernelAbstractions.jl/reverse_gpu.jl:32
   [31] top-level scope
      @ ~/git/Enzyme.jl/KernelAbstractions.jl/reverse_gpu.jl:64
   [32] include(mod::Module, _path::String)
      @ Base ./Base.jl:495
   [33] exec_options(opts::Base.JLOptions)
      @ Base ./client.jl:318
   [34] _start()
      @ Base ./client.jl:552
Test Summary: | Error  Total     Time
kernels       |     1      1  1m38.4s

ext/EnzymeExt.jl Outdated
@show TapeType


subtape = Array{TapeType}(undef, size(blocks(iterspace)))
wsmoses (Collaborator, Author):

This shouldn't be Array; it should be the device-specific array type.

Is there a way to get that, @vchuravy @michel2323?

vchuravy (Member):

KernelAbstractions.allocate
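
A minimal sketch of what that could look like at the quoted site, assuming a KernelAbstractions backend handle (here called backend) is available there; this is not the PR's exact code:

# Allocate the tape storage with the backend-generic KernelAbstractions
# allocator instead of a host Array, so it lands on the device:
subtape = KernelAbstractions.allocate(backend, TapeType, size(blocks(iterspace)))

michel2323 posts the CUDA-specific version of this call further down.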

Project.toml Outdated
@@ -6,6 +6,9 @@ version = "0.9.14"
[deps]
Adapt = "79e6a3ab-5dfb-504d-930d-738a2a938a0e"
Atomix = "a9b6321e-bd34-4604-b9c9-b65b8de01458"
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
Member:

Suggested change (delete this line):
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"

michel2323 (Collaborator):
@wsmoses Forward mode doesn't work anymore, which used to work when I started on this. I'm on the latest Enzyme#main; see the attached out_fwd.log.

(KernelAbstractions) pkg> st
Project KernelAbstractions v0.9.15
Status `~/.julia/dev/KernelAbstractions/Project.toml`
  [79e6a3ab] Adapt v4.0.1
  [a9b6321e] Atomix v0.1.0
  [052768ef] CUDA v5.2.0
  [7da242da] Enzyme v0.11.13 `~/.julia/dev/Enzyme`
  [1914dd2f] MacroTools v0.5.13
  [aea7be01] PrecompileTools v1.2.0
  [ae029012] Requires v1.3.0
  [90137ffa] StaticArrays v1.9.1
  [013be700] UnsafeAtomics v0.2.1
  [d80eeb9a] UnsafeAtomicsLLVM v0.1.3
  [7cc45869] Enzyme_jll v0.0.98+0 `../Enzyme_jll`
  [b77e0a4c] InteractiveUtils
  [37e2e46d] LinearAlgebra
  [2f01184e] SparseArrays v1.10.0
  [cf7118a7] UUIDs

@michel2323 michel2323 force-pushed the enz_rev_gpu branch 2 times, most recently from 8221520 to e584ed6 Compare January 25, 2024 17:35
@show TapeType


subtape = Array{TapeType}(undef, size(blocks(iterspace)))
wsmoses (Collaborator, Author):

@michel2323 can you have this use the device-specific allocator, as @vchuravy mentioned above?

function mul_caller(A, B, backend)
    kernel = mul!(backend)
    kernel(A, B, ndrange=size(A))
    KernelAbstractions.synchronize(backend)
wsmoses (Collaborator, Author):

@vchuravy is there a KA way to have the kernel run immediately rather than getting synced? That way this could be merged separately from adding KA async support.

vchuravy (Member):

I am not sure what you mean by that. CUDA etc. are all async, so the synchronize is needed in any case. You can leave it out and rely on stream ordering?

wsmoses (Collaborator, Author):

OK, in which case we need to add KA synchronize rules for forward and reverse. Forward is easy: just sync the regular and shadow kernels. Reverse needs to launch the reverse kernel, and the reverse kernel execution needs to sync it.
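
A rough sketch of the forward half of such a rule, assuming the Enzyme 0.11-era EnzymeRules.forward signature; the rule the PR later adds ("Add synchronize rule") may look different:

using EnzymeCore
using EnzymeCore: EnzymeRules
import KernelAbstractions

# Forward-mode rule for KernelAbstractions.synchronize: the primal and shadow
# kernels were enqueued on the same backend, so one synchronize waits on both.
function EnzymeRules.forward(func::Const{typeof(KernelAbstractions.synchronize)},
                             ::Type{<:Const}, backend::Const)
    func.val(backend.val)
    return nothing
end

The reverse half would, as described above, also need to launch the reverse kernel and then synchronize it; that part is omitted here.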

michel2323 (Collaborator):
I added the following allocate call:

subtape = allocate(CUDABackend(), TapeType, size(blocks(iterspace)))

Now with Enzyme@0.11.12 and the artifact I get:

╰─$ julia --project=. reverse_gpu.jl
kernels: Error During Test at /home/michel/.julia/dev/KernelAbstractions/test/reverse_gpu.jl:28
  Got exception outside of a @test
  AssertionError: value_type(lhs_v) == value_type(rhs_v)
  Stacktrace:
    [1] (::Enzyme.Compiler.var"#getparent#361"{LLVM.Function, LLVM.IntegerType, Int64, Dict{LLVM.PHIInst, LLVM.PHIInst}, Dict{LLVM.PHIInst, LLVM.PHIInst}, LLVM.PHIInst, LLVM.BitCastInst, LLVM.IRBuilder})(v::LLVM.SelectInst, offset::LLVM.ConstantInt, hasload::Bool)
      @ Enzyme.Compiler ~/.julia/packages/Enzyme/Dd2LU/src/compiler/optimize.jl:262
    [2] (::Enzyme.Compiler.var"#getparent#361"{LLVM.Function, LLVM.IntegerType, Int64, Dict{LLVM.PHIInst, LLVM.PHIInst}, Dict{LLVM.PHIInst, LLVM.PHIInst}, LLVM.PHIInst, LLVM.BitCastInst, LLVM.IRBuilder})(v::LLVM.BitCastInst, offset::LLVM.ConstantInt, hasload::Bool)
      @ Enzyme.Compiler ~/.julia/packages/Enzyme/Dd2LU/src/compiler/optimize.jl:223
    [3] nodecayed_phis!(mod::LLVM.Module)
      @ Enzyme.Compiler ~/.julia/packages/Enzyme/Dd2LU/src/compiler/optimize.jl:278
    [4] optimize!
      @ ~/.julia/packages/Enzyme/Dd2LU/src/compiler/optimize.jl:1334 [inlined]
    [5] nested_codegen!(mode::Enzyme.API.CDerivativeMode, mod::LLVM.Module, funcspec::Core.MethodInstance, world::UInt64)
      @ Enzyme.Compiler ~/.julia/packages/Enzyme/Dd2LU/src/compiler.jl:1416
    [6] enzyme_custom_common_rev(forward::Bool, B::LLVM.IRBuilder, orig::LLVM.CallInst, gutils::Enzyme.Compiler.GradientUtils, normalR::Ptr{Ptr{LLVM.API.LLVMOpaqueValue}}, shadowR::Ptr{Ptr{LLVM.API.LLVMOpaqueValue}}, tape::Nothing)
      @ Enzyme.Compiler ~/.julia/packages/Enzyme/Dd2LU/src/rules/customrules.jl:567
    [7] enzyme_custom_augfwd
      @ Enzyme.Compiler ~/.julia/packages/Enzyme/Dd2LU/src/rules/customrules.jl:886 [inlined]
    [8] (::Enzyme.Compiler.var"#212#213")(B::Ptr{LLVM.API.LLVMOpaqueBuilder}, OrigCI::Ptr{LLVM.API.LLVMOpaqueValue}, gutils::Ptr{Nothing}, normalR::Ptr{Ptr{LLVM.API.LLVMOpaqueValue}}, shadowR::Ptr{Ptr{LLVM.API.LLVMOpaqueValue}}, tapeR::Ptr{Ptr{LLVM.API.LLVMOpaqueValue}})
      @ Enzyme.Compiler ~/.julia/packages/Enzyme/Dd2LU/src/rules/llvmrules.jl:1139
    [9] EnzymeCreatePrimalAndGradient(logic::Enzyme.Logic, todiff::LLVM.Function, retType::Enzyme.API.CDIFFE_TYPE, constant_args::Vector{Enzyme.API.CDIFFE_TYPE}, TA::Enzyme.TypeAnalysis, returnValue::Bool, dretUsed::Bool, mode::Enzyme.API.CDerivativeMode, width::Int64, additionalArg::Ptr{Nothing}, forceAnonymousTape::Bool, typeInfo::Enzyme.FnTypeInfo, uncacheable_args::Vector{Bool}, augmented::Ptr{Nothing}, atomicAdd::Bool)
      @ Enzyme.API ~/.julia/packages/Enzyme/Dd2LU/src/api.jl:141
   [10] enzyme!(job::GPUCompiler.CompilerJob{Enzyme.Compiler.EnzymeTarget, Enzyme.Compiler.EnzymeCompilerParams}, mod::LLVM.Module, primalf::LLVM.Function, TT::Type, mode::Enzyme.API.CDerivativeMode, width::Int64, parallel::Bool, actualRetType::Type, wrap::Bool, modifiedBetween::Tuple{Bool, Bool, Bool}, returnPrimal::Bool, jlrules::Vector{String}, expectedTapeType::Type, loweredArgs::Set{Int64}, boxedArgs::Set{Int64})
      @ Enzyme.Compiler ~/.julia/packages/Enzyme/Dd2LU/src/compiler.jl:3124
   [11] codegen(output::Symbol, job::GPUCompiler.CompilerJob{Enzyme.Compiler.EnzymeTarget, Enzyme.Compiler.EnzymeCompilerParams}; libraries::Bool, deferred_codegen::Bool, optimize::Bool, toplevel::Bool, strip::Bool, validate::Bool, only_entry::Bool, parent_job::Nothing)
      @ Enzyme.Compiler ~/.julia/packages/Enzyme/Dd2LU/src/compiler.jl:4756
   [12] codegen
      @ Enzyme.Compiler ~/.julia/packages/Enzyme/Dd2LU/src/compiler.jl:4339 [inlined]
   [13] _thunk(job::GPUCompiler.CompilerJob{Enzyme.Compiler.EnzymeTarget, Enzyme.Compiler.EnzymeCompilerParams}, postopt::Bool) (repeats 2 times)
      @ Enzyme.Compiler ~/.julia/packages/Enzyme/Dd2LU/src/compiler.jl:5351
   [14] cached_compilation
      @ ~/.julia/packages/Enzyme/Dd2LU/src/compiler.jl:5385 [inlined]
   [15] (::Enzyme.Compiler.var"#506#507"{DataType, DataType, DataType, Enzyme.API.CDerivativeMode, Tuple{Bool, Bool, Bool}, Int64, Bool, Bool, UInt64, DataType})(ctx::LLVM.Context)
      @ Enzyme.Compiler ~/.julia/packages/Enzyme/Dd2LU/src/compiler.jl:5451
   [16] JuliaContext(f::Enzyme.Compiler.var"#506#507"{DataType, DataType, DataType, Enzyme.API.CDerivativeMode, Tuple{Bool, Bool, Bool}, Int64, Bool, Bool, UInt64, DataType})
      @ GPUCompiler ~/.julia/packages/GPUCompiler/U36Ed/src/driver.jl:47
   [17] #s1056#505
      @ ~/.julia/packages/Enzyme/Dd2LU/src/compiler.jl:5403 [inlined]
   [18] var"#s1056#505"(FA::Any, A::Any, TT::Any, Mode::Any, ModifiedBetween::Any, width::Any, ReturnPrimal::Any, ShadowInit::Any, World::Any, ABI::Any, ::Any, ::Type, ::Type, ::Type, tt::Any, ::Type, ::Type, ::Type, ::Type, ::Type, ::Any)
      @ Enzyme.Compiler ./none:0
   [19] (::Core.GeneratedFunctionStub)(::UInt64, ::LineNumberNode, ::Any, ::Vararg{Any})
      @ Core ./boot.jl:602
   [20] autodiff
      @ Enzyme ~/.julia/packages/Enzyme/Dd2LU/src/Enzyme.jl:209 [inlined]
   [21] autodiff(::ReverseMode{false, FFIABI}, ::Const{typeof(square_caller)}, ::Duplicated{CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}}, ::Const{CUDABackend})
      @ Enzyme ~/.julia/packages/Enzyme/Dd2LU/src/Enzyme.jl:238
   [22] autodiff
      @ ~/.julia/packages/Enzyme/Dd2LU/src/Enzyme.jl:224 [inlined]
   [23] macro expansion
      @ ~/.julia/dev/KernelAbstractions/test/reverse_gpu.jl:37 [inlined]
   [24] macro expansion
      @ ~/.julia/juliaup/julia-1.10.0+0.x64.linux.gnu/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
   [25] enzyme_testsuite(backend::Type{CUDABackend}, ArrayT::Type, supports_reverse::Bool)
      @ Main ~/.julia/dev/KernelAbstractions/test/reverse_gpu.jl:29
   [26] top-level scope
      @ ~/.julia/dev/KernelAbstractions/test/reverse_gpu.jl:64

With the latest Enzyme and Enzyme.jl I get the segfault below in the call to

TapeType = EnzymeCore.tape_type(job, ReverseSplitModified(ReverseSplitWithPrimal, ModifiedBetween), FT, Const, Const{ctxTy}, map(Core.Typeof, args2)...)

[32421] signal (11.1): Segmentation fault
in expression starting at /home/michel/.julia/dev/KernelAbstractions/test/reverse_gpu.jl:64
typekeyvalue_hash at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/jltypes.c:1622 [inlined]
lookup_typevalue at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/jltypes.c:1059
jl_inst_arg_tuple_type at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/jltypes.c:2157
jl_f_tuple at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/builtins.c:868 [inlined]
jl_f_tuple at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/builtins.c:863
absint at /home/michel/.julia/dev/Enzyme/src/absint.jl:116
abs_typeof at /home/michel/.julia/dev/Enzyme/src/absint.jl:213
unknown function (ip: 0x7f48e19f5043)
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
check_ir! at /home/michel/.julia/dev/Enzyme/src/compiler/validation.jl:500
check_ir! at /home/michel/.julia/dev/Enzyme/src/compiler/validation.jl:208
check_ir! at /home/michel/.julia/dev/Enzyme/src/compiler/validation.jl:178
check_ir at /home/michel/.julia/dev/Enzyme/src/compiler/validation.jl:157 [inlined]
#codegen#468 at /home/michel/.julia/dev/Enzyme/src/compiler.jl:4382
codegen at /home/michel/.julia/dev/Enzyme/src/compiler.jl:4346 [inlined]
#48 at /home/michel/.julia/dev/Enzyme/src/Enzyme.jl:672
JuliaContext at /home/michel/.julia/packages/GPUCompiler/U36Ed/src/driver.jl:47
tape_type at /home/michel/.julia/dev/Enzyme/src/Enzyme.jl:671 [inlined]
#augmented_primal#4 at /home/michel/.julia/dev/KernelAbstractions/ext/CUDAEnzymeExt.jl:57
augmented_primal at /home/michel/.julia/dev/KernelAbstractions/ext/CUDAEnzymeExt.jl:14 [inlined]
square_caller at /home/michel/.julia/dev/KernelAbstractions/test/reverse_gpu.jl:13 [inlined]
square_caller at /home/michel/.julia/dev/KernelAbstractions/test/reverse_gpu.jl:0 [inlined]
diffejulia_square_caller_3884_inner_1wrap at /home/michel/.julia/dev/KernelAbstractions/test/reverse_gpu.jl:0
macro expansion at /home/michel/.julia/dev/Enzyme/src/compiler.jl:5306 [inlined]
enzyme_call at /home/michel/.julia/dev/Enzyme/src/compiler.jl:4984 [inlined]
CombinedAdjointThunk at /home/michel/.julia/dev/Enzyme/src/compiler.jl:4926 [inlined]
autodiff at /home/michel/.julia/dev/Enzyme/src/Enzyme.jl:215 [inlined]
autodiff at /home/michel/.julia/dev/Enzyme/src/Enzyme.jl:238
unknown function (ip: 0x7f48e19edfba)
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
autodiff at /home/michel/.julia/dev/Enzyme/src/Enzyme.jl:224 [inlined]
macro expansion at /home/michel/.julia/dev/KernelAbstractions/test/reverse_gpu.jl:37 [inlined]
macro expansion at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/usr/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
enzyme_testsuite at /home/michel/.julia/dev/KernelAbstractions/test/reverse_gpu.jl:29
unknown function (ip: 0x7f49504d5c9f)
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
jl_apply at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/julia.h:1982 [inlined]
do_call at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/interpreter.c:126
eval_value at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/interpreter.c:223
eval_stmt_value at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/interpreter.c:174 [inlined]
eval_body at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/interpreter.c:617
jl_interpret_toplevel_thunk at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/interpreter.c:775
jl_toplevel_eval_flex at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/toplevel.c:934
jl_toplevel_eval_flex at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/toplevel.c:877
ijl_toplevel_eval_in at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/toplevel.c:985
eval at ./boot.jl:385 [inlined]
include_string at ./loading.jl:2070
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
_include at ./loading.jl:2130
include at ./Base.jl:495
jfptr_include_46343.1 at /home/michel/.julia/juliaup/julia-1.10.0+0.x64.linux.gnu/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
exec_options at ./client.jl:318
_start at ./client.jl:552
jfptr__start_82703.1 at /home/michel/.julia/juliaup/julia-1.10.0+0.x64.linux.gnu/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
jl_apply at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/julia.h:1982 [inlined]
true_main at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/jlapi.c:582
jl_repl_entrypoint at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/jlapi.c:731
main at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/cli/loader_exe.c:58
unknown function (ip: 0x7f4967759d8f)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x4010b8)
Allocations: 223384695 (Pool: 223135920; Big: 248775); GC: 129
[1]    32421 segmentation fault  julia --project=. reverse_gpu.jl

wsmoses (Collaborator, Author) commented Jan 29, 2024

You should update Enzyme to the latest release (0.11.14).

michel2323 (Collaborator) commented Feb 6, 2024

The reverse kernel uses autodiff_deferred_thunk, as opposed to forward mode, which uses autodiff_deferred. Indeed, there is no test for autodiff_deferred_thunk on CUDA in Enzyme.jl. Trying my luck, but I'm not sure I'll figure it out.

 kernels: Error During Test at /home/michel/.julia/dev/KernelAbstractions/test/reverse_gpu.jl:28
  Got exception outside of a @test
  InvalidIRError: compiling MethodInstance for CUDAEnzymeExt.aug_fwd(::KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}}}, ::typeof(gpu_square!), ::Val{(false, false, false)}, ::CuDeviceVector{Float64, 1}, ::Duplicated{CuDeviceVector{Float64, 1}}) resulted in invalid LLVM IR
  Reason: unsupported dynamic function invocation (call to autodiff_deferred_thunk(::EnzymeCore.ReverseModeSplit{ReturnPrimal, ReturnShadow, Width, ModifiedBetweenT, RABI}, ::Type{FA}, ::Type{A}, args...) where {FA<:Annotation, A<:Annotation, ReturnPrimal, ReturnShadow, Width, ModifiedBetweenT, RABI<:ABI} @ Enzyme ~/.julia/dev/Enzyme/src/Enzyme.jl:726)
  Stacktrace:
   [1] aug_fwd
     @ ~/.julia/dev/KernelAbstractions/ext/enzyme_utils.jl:7
  Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code with Cthulhu.jl
  Stacktrace:
    [1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, args::LLVM.Module)
      @ GPUCompiler ~/.julia/packages/GPUCompiler/U36Ed/src/validation.jl:147

michel2323 (Collaborator):
@vchuravy Cleaned up. Are we waiting for EnzymeAD/Enzyme.jl#1104 and JuliaGPU/CUDA.jl#2260?

@michel2323 michel2323 changed the title Enz rev gpu Add GPU reverse mode to EnzymeExt Mar 28, 2024
vchuravy (Member) commented Apr 9, 2024

Will need to change the compat entry

EnzymeCore = "0.6.4, 0.7"

to only "0.7.1".

wsmoses (Collaborator, Author) commented May 11, 2024

@michel2323 given that the prerequisites have landed, mind getting this over the finish line?

@michel2323 michel2323 force-pushed the enz_rev_gpu branch 2 times, most recently from 7c024a6 to ede2f76 Compare May 31, 2024 14:47
Fix examples tests with CUDA backend

Add synchronize rule
michel2323 (Collaborator) commented May 31, 2024

@wsmoses @vchuravy Cleaned up, with working tests (if CUDA is working). The last unresolved issue is active arguments to a kernel. The compiler cannot figure out the types of the actives here, so all actives are marked Any, which then leads to a wrong return type.

res = ntuple(Val(N)) do i

I tried to fix it, but I'm not sure there's a way. So for now, it gracefully errors with

error("Active kernel arguments not supported on GPU")
in the augmented forward run.
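
A minimal sketch of that guard (a hypothetical helper, not the PR's exact code), to be called before launching the augmented forward kernel:

using EnzymeCore: Active

# Reject Active kernel arguments up front: their concrete types cannot be
# recovered here, so the generated kernel would get a wrong return type.
function check_no_active_args(args...)
    any(arg -> arg isa Active, args) &&
        error("Active kernel arguments not supported on GPU")
    return nothing
end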
