
Enable custom AD primitives #1868

Open
FluxusMagna opened this issue Feb 5, 2023 · 9 comments

@FluxusMagna

Some functions have a known expression for calculating the gradient that is better than what 'naive' AD produces. Perhaps the most obvious one is the FFT, which is in essence just a matrix multiplication, but surely other cases exist too. If the expression to be differentiated contains such a function, we currently need to manually separate the differentiation of this component in order to apply our own differentiation scheme to it.
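
For concreteness, here is a minimal sketch of what that manual separation can look like today (pre, post, fft and fft_adjoint are hypothetical [n]f64 -> [n]f64 stand-ins, not existing library functions): the chain rule for post ∘ fft ∘ pre is spelled out by hand so the built-in vjp never sees fft.

def grad_f [n] (x: [n]f64) (y_bar: [n]f64) : [n]f64 =
  let u = pre x
  let v = fft u
  let v_bar = vjp post v y_bar   -- adjoint of the last stage, via built-in AD
  let u_bar = fft_adjoint v_bar  -- hand-written adjoint of the FFT stage
  in vjp pre x u_bar             -- adjoint of the first stage, via built-in AD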

I think it would be neat if we could instead provide the compiler with information about which expression should replace, for example, the vector-Jacobian product. One idea for how this could be done is through an attribute, like

def fft a = #[vjp(\_ a -> fft a)]  ???

where the expression in the attribute replaces vjp fft. This way, efficient gradient definitions for relevant functions can be provided by libraries, and the user won't have to think about it.

Similar to the #[unsafe] attribute, it could be disabled with some compiler option.

I have no idea how well this would fit in with the current AD-machinery, but I think it makes sense syntactically.

@athas
Member

athas commented Feb 5, 2023

> I have no idea how well this would fit in with the current AD-machinery, but I think it makes sense syntactically.

This would likely be easy: just don't inline functions with custom derivatives until after AD is done.

@FluxusMagna
Author

I suppose it might make a bit more sense to use the structure

def fft = #[vjp(\_ a -> fft a)] (\a -> ???)

instead, since the attribute describes a function, unless the previously stated structure can easily be handled.

@zfnmxt added the enhancement, compiler, and AD (Related to automatic differentiation) labels Feb 7, 2023
@zfnmxt self-assigned this Feb 7, 2023
@zfnmxt
Collaborator

zfnmxt commented Apr 16, 2023

@nhey is adding something similar to JAX's stop_gradient function (it just zeroes out, i.e. doesn't compute, the gradient of its argument) via an attribute. For example,

jvp2 (\x -> #[stop_gradient] x*x) x 1

returns 0 for all x.

This is in theory very simple to implement; for any statement with a stop_gradient attribute, AD only inserts the primal statement and doesn't do anything else. Unfortunately, this doesn't quite work for things like

jvp2 (\x -> #[stop_gradient] x) x 1

because the body of \x -> #[stop_gradient] x will have no statements in the IR and the attribute is forgotten. Even

jvp2 (\x -> #[stop_gradient] (id x)) x 1

doesn't fix it due to inlining. You can hack it to work by creating a stop_gradient function which doesn't inline the application:

def stop_gradient 't (x: t): t =
  #[noinline] (#[stop_gradient] (id x))
  
jvp2 (\x -> stop_gradient x) x 1

but then, post-AD, you've polluted things with non-inlined identity functions. So it seems like the right way to do this is to make the inlining passes aware of #[stop_gradient] (treating it like #[noinline]) and then, during AD, handle #[stop_gradient] as described and also remove it from the attribute set. Post-AD, you run the inlining passes again to clean things up.

@athas Is this the right way to do it?

I'm posting this on this issue because it seems like the basic implementation machinery is the same between the two features. You could even define stop_gradient x = #[vjp(\_ _ -> 0),jvp(\_ _ -> 0)] x (although the attribute system doesn't support expressions so I guess this would need to be done with identifiers).

@athas
Member

athas commented Apr 16, 2023

I can't think of any other way to do it. You might lose out on some pre-AD optimisation opportunities, but hopefully nothing significant - and it's difficult to see how passes such as fusion should propagate #[stop_gradient] anyway.

nhey added a commit to nhey/futhark that referenced this issue Apr 16, 2023
…ment.

Effectively stops the compiler from generating adjoint code for the
argument expression (as explained by @zfnmxt in diku-dk#1868). This is useful
for (amongst other things) implementing variational Bayesian methods
and hacking gradients together for non-differentiable functions
since it lets us treat any variable as a constant.

Co-authored-by: zfnmxt <zfnmxt@zfnmxt.com>
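
As an illustration of that last point, here is a minimal sketch (assuming the stop_gradient helper sketched by @zfnmxt above; none of this is code from the commit) of a straight-through-style gradient for the non-differentiable f64.round: the primal value is round x, but the derivative seen by AD is 1.

def round_ste (x: f64) : f64 =
  -- primal: x + (round x - x) = round x; only the bare x contributes a
  -- tangent/adjoint, so the gradient passes straight through as 1.
  x + stop_gradient (f64.round x - x)

-- jvp round_ste 2.4 1 should then give 1.0, while plain jvp f64.round 2.4 1 gives 0.0.
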
@samestep

I'm still learning about Futhark and haven't been able to find much about its AD from the docs, so excuse my ignorance here; are you folks using an approach that splits AD into forward-mode and transposition, as described in this POPL 2023 paper? In my (admittedly, limited) experience implementing AD systems, this is nice because it means that the user can simply specify a custom JVP for a function, and then the compiler can transpose that to produce a custom VJP that agrees with that JVP by construction. I think this is what JAX does, although I'm not entirely sure. Would this apply for the FFT example @FluxusMagna described? (I can imagine there existing an example where the JVP is not actually easier to specify than the VJP, although I have not encountered one yet.)

@zfnmxt
Collaborator

zfnmxt commented Dec 29, 2023

Futhark doesn't do reverse-mode AD via explicit transposition, no. There's very little on Futhark's forward-mode AD because it's so simple and just corresponds to the classic dual number formulation (see section 3.1.1 here). Reverse-mode AD in Futhark is discussed in our SC paper here as well as here.
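
For reference, a minimal sketch of that dual-number formulation (illustrative only; this is not how Futhark actually implements forward mode): each value carries a primal and a tangent component, and every primitive is lifted to propagate both.

type dual = {primal: f64, tangent: f64}

def dual_mul (a: dual) (b: dual) : dual =
  {primal = a.primal * b.primal,
   tangent = a.primal * b.tangent + a.tangent * b.primal}

def dual_sin (a: dual) : dual =
  {primal = f64.sin a.primal, tangent = f64.cos a.primal * a.tangent}

-- Forward-mode derivative of \x -> sin (x*x) at x, with seed tangent dx:
def jvp_sin_sq (x: f64) (dx: f64) : f64 =
  let xd = {primal = x, tangent = dx}
  let res = dual_sin (dual_mul xd xd)
  in res.tangent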

At any rate, the whole point of custom derivatives is to spit out better code than the AD transformation can. The transposition approach doesn't inherently ameliorate this problem---the resulting code is only as good as the transposition transformation is!

@samestep

samestep commented Dec 29, 2023

Makes sense! Yeah, I'm familiar with the contrast between the very simple dual number formulation for forward-mode and the much more complicated reverse-mode. Thanks for the links to those papers about your reverse-mode AD approach, I'll check those out!

I'm not sure I quite agree with your point about that being "the whole point of custom derivatives"; for instance, let's say you have a function which takes a polynomial and computes the roots of that polynomial. If you use some sort of iterative method to compute the polynomial roots, then it'd be inefficient to transform that code, and would be much better to use implicit differentiation instead. In this case, if I'm not mistaken, transposition of the custom implicitly differentiated JVP would still be much better than direct reverse-mode AD of the iterative method, no?
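
For concreteness, a minimal sketch of that implicit-differentiation JVP (find_root is a hypothetical iterative solver, and the code is only an illustration, not anything Futhark provides): for a root r of p(x) = sum c[i]*x^i, the implicit function theorem gives dr = -(sum dc[i]*r^i) / p'(r), so the tangent never differentiates through the iteration itself.

def poly_root_jvp [n] (c: [n]f64) (dc: [n]f64) : (f64, f64) =
  -- assumes n >= 1 and that r is a simple root (p'(r) nonzero)
  let r = find_root c  -- primal: hypothetical iterative root finder
  let dp_dc = f64.sum (map2 (\i dci -> dci * r ** f64.i64 i) (iota n) dc)
  let dp_dr = f64.sum (map (\i -> f64.i64 i * c[i] * r ** f64.i64 (i - 1))
                           (drop 1 (iota n)))
  in (r, -(dp_dc / dp_dr))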

@zfnmxt
Collaborator

zfnmxt commented Dec 29, 2023

That's a good point; there are definitely benefits to having a transposition transformation and, conceptually, the decomposition of reverse-mode into forward-mode + transposition is really nice (as well as having first-class linear map support in a language). And there surely are cases where a custom JVP is simple, but the VJP is still complex (because the JVP just happens to be something that's difficult to transpose) so you wouldn't want to write out the custom VJP.

@FluxusMagna
Author

FluxusMagna commented Jan 15, 2024

> I'm still learning about Futhark and haven't been able to find much about its AD from the docs, so excuse my ignorance here; are you folks using an approach that splits AD into forward-mode and transposition, as described in this POPL 2023 paper? In my (admittedly, limited) experience implementing AD systems, this is nice because it means that the user can simply specify a custom JVP for a function, and then the compiler can transpose that to produce a custom VJP that agrees with that JVP by construction. I think this is what JAX does, although I'm not entirely sure. Would this apply for the FFT example @FluxusMagna described? (I can imagine there existing an example where the JVP is not actually easier to specify than the VJP, although I have not encountered one yet.)

I think for the FFT case the JVP and VJP are actually the same, because the operation is equivalent to multiplication by a symmetric matrix. It's a very special case in that sense, though.
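
A minimal sketch of the general fact being used here (matvec is a hypothetical dense matrix-vector product standing in for the FFT): for a linear map f(x) = A x, the JVP is A dx and the VJP is transpose(A) y_bar, so the two coincide exactly when A is symmetric.

def matvec [n] (a: [n][n]f64) (x: [n]f64) : [n]f64 =
  map (\row -> f64.sum (map2 (*) row x)) a

-- For symmetric a, these compute the same thing:
def lin_jvp [n] (a: [n][n]f64) (dx: [n]f64) = matvec a dx
def lin_vjp [n] (a: [n][n]f64) (y_bar: [n]f64) = matvec (transpose a) y_bar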
