A good starting point would be to calculate the Jacobian row-wise using reverse mode inside a loop. Here is some pseudo-code:
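Something along these lines, where `N`, `f`, and the exact Enzyme declarations are placeholders rather than a drop-in implementation:

```c
#define N 100

// enzyme_dupnoneed marks a duplicated argument whose primal value is unused
extern int enzyme_dupnoneed;
void __enzyme_autodiff(void*, ...);

void f(double x[N], double out[N]);

void jacobian(double x[N], double out[N], double jac[N][N]) {
    double v[N] = {0};
    for (int i = 0; i < N; i++) {
        v[i] = 1;  // seed the adjoint of the i-th output
        // reverse mode accumulates row i of the Jacobian into jac[i],
        // so jac should be zero-initialized by the caller
        __enzyme_autodiff((void*)f, x, jac[i], enzyme_dupnoneed, out, v);
        v[i] = 0;  // reset the seed for the next row
    }
}
```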
---
Thanks @tgymnich! Some follow-up comments and questions:
---
You are right. In general, all shadows should have the same length as the values they are shadowing. For forward diff you can call `__enzyme_fwddiff`. Once that works, it should be pretty straightforward to go to batched forward mode.
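For instance, a minimal forward-mode call might look like this (the 100-element arrays and names here are illustrative):

```c
void __enzyme_fwddiff(void*, ...);

void f(double x[100], double out[100]);

// v shadows x and dout shadows out, each with the same length as its primal;
// dout receives the Jacobian-vector product J(x) * v
void jvp(double x[100], double v[100], double out[100], double dout[100]) {
    __enzyme_fwddiff((void*)f, x, v, out, dout);
}
```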
---
Thank you for clarifying; I realize that for reverse AD I would actually loop over the outputs. One last thing: I have never used an LLVM Fortran compiler, and I know of two options (flang, lfortran). Would either one be better than the other for using Enzyme, and are there some important pitfalls I should be aware of? I do know that LLVM support is less complete for Fortran than for C/C++.
---
One comment here: often use cases actually only need a Jacobian-vector or vector-Jacobian product, which can be done with a single call to either forward or reverse mode. There are several options for LLVM-based Fortran, to varying degrees of completion. Unless you are already in an lfortran environment, I usually point folks to either classic flang or the new MLIR-based flang, each of which can work but comes with different setup challenges and varying robustness of language support. The new flang is where most development is being done nowadays by the compiler devs, whereas classic flang is more robust (but has performance and other concerns).
---
@wsmoses Thanks!
---
@agoodm if you feel lost in LLVM Fortran + Enzyme, I am always happy to quickly jump on a call and help you get set up. |
---
EDIT: I fixed the problem I had below once I noticed that I was setting my temporary variable to the wrong type (`int`). Setting it to `double` fixed it.

So, an update: to help get myself fully up to speed, I am currently trying to implement a relatively simple example in C on my local machine first, since I could easily install clang, LLVM, and Enzyme via Homebrew. Here is the example:

```c
// test.c
extern double __enzyme_fwddiff(void*, double[100], double[100], double[100], double[100]);

void f(double x[100], double out[100]) {
    int prev = 0;  // the bug mentioned in the EDIT above: should be `double prev = 0;`
    for (int i = 0; i < 100; i++) {
        out[i] = x[i] + prev / x[i];
        prev = x[i];
    }
}

void jvpf(double x[100], double v[100], double out[100], double dout[100]) {
    __enzyme_fwddiff((void*)f, x, v, out, dout);
}
```

Then compiled as follows:
Now I am running it in Python via ctypes and comparing the output I get from JAX:

```python
import ctypes

import jax
import numpy as np

lib = ctypes.CDLL('/path/to/libtest.dylib')

# Enzyme JVP
def jvp(x, v):
    out = np.zeros(100)
    dout = np.zeros(100)
    args = []
    for a in [x, v, out, dout]:
        args.append(a.ctypes.data_as(ctypes.POINTER(ctypes.c_double)))
    lib.jvpf(*args)
    return dout

# Equivalent to f
@jax.jit
def f(x):
    prev = 0
    out = []
    for i in range(100):
        out.append(x[i] + prev / x[i])
        prev = x[i]
    out = jax.numpy.array(out)
    return out
```

When I run this I find that the Enzyme implementation is much faster than JAX even with the ctypes overhead included, but the output isn't quite right. For example:

```
In [5]: x = np.arange(1, 101, dtype='float64') ** 2
In [6]: v = np.ones(100)
In [7]: jvp(x, v)
Out[7]:
array([1. , 0.9375 , 0.95061728, 0.96484375, 0.9744 ,
0.98070988, 0.98500625, 0.98803711, 0.99024539, 0.9919 ,
0.99316987, 0.99416474, 0.99495816, 0.99560079, 0.9961284 ,
0.99656677, 0.9969349 , 0.99724699, 0.99751383, 0.99774375,
0.99794324, 0.99811744, 0.99827045, 0.99840555, 0.99852544,
0.99863231, 0.99872799, 0.99881397, 0.99889153, 0.99896173,
0.99902547, 0.99908352, 0.99913654, 0.99918509, 0.99922965,
0.99927067, 0.99930849, 0.99934345, 0.99937582, 0.99940586,
0.99943378, 0.99945978, 0.99948403, 0.99950668, 0.99952788,
0.99954773, 0.99956637, 0.99958387, 0.99960033, 0.99961584,
0.99963046, 0.99964426, 0.99965731, 0.99966965, 0.99968133,
0.99969241, 0.99970292, 0.9997129 , 0.99972238, 0.9997314 ,
0.99973999, 0.99974818, 0.99975598, 0.99976343, 0.99977054,
0.99977734, 0.99978383, 0.99979005, 0.999796 , 0.99980171,
0.99980718, 0.99981242, 0.99981745, 0.99982229, 0.99982693,
0.9998314 , 0.99983569, 0.99983982, 0.9998438 , 0.99984763,
0.99985132, 0.99985488, 0.99985832, 0.99986163, 0.99986483,
0.99986792, 0.9998709 , 0.99987379, 0.99987657, 0.99987927,
0.99988188, 0.99988441, 0.99988685, 0.99988922, 0.99989152,
0.99989374, 0.9998959 , 0.99989799, 0.99990002, 0.99990199])
In [8]: jax.jvp(f, [x], [v])[1]
Out[8]:
Array([1. , 1.1875 , 1.0617284, 1.0273438, 1.0144 , 1.0084877,
1.0054144, 1.0036621, 1.002591 , 1.0019 , 1.0014343, 1.0011091,
1.0008754, 1.0007029, 1.0005728, 1.000473 , 1.000395 , 1.0003334,
1.000284 , 1.0002438, 1.0002108, 1.0001836, 1.0001608, 1.0001416,
1.0001254, 1.0001116, 1.0000998, 1.0000895, 1.0000806, 1.0000728,
1.000066 , 1.0000601, 1.0000548, 1.0000502, 1.000046 , 1.0000423,
1.000039 , 1.000036 , 1.0000333, 1.0000309, 1.0000286, 1.0000267,
1.0000249, 1.0000232, 1.0000217, 1.0000203, 1.0000191, 1.0000179,
1.0000168, 1.0000159, 1.0000149, 1.0000141, 1.0000134, 1.0000126,
1.0000119, 1.0000113, 1.0000107, 1.0000101, 1.0000097, 1.0000092,
1.0000087, 1.0000083, 1.000008 , 1.0000076, 1.0000073, 1.0000069,
1.0000066, 1.0000063, 1.0000061, 1.0000058, 1.0000056, 1.0000054,
1.0000051, 1.0000049, 1.0000048, 1.0000045, 1.0000044, 1.0000042,
1.000004 , 1.0000039, 1.0000037, 1.0000036, 1.0000035, 1.0000033,
1.0000032, 1.0000031, 1.000003 , 1.0000029, 1.0000029, 1.0000027,
1.0000026, 1.0000025, 1.0000025, 1.0000024, 1.0000023, 1.0000023,
1.0000021, 1.0000021, 1.000002 , 1.000002 ], dtype=float32)
```

Here is the updated test.c:

```c
// test.c
extern double __enzyme_fwddiff(void*, double[100], double[100], double[100], double[100]);

void f(double x[100], double out[100]) {
    out[0] = x[0];
    for (int i = 1; i < 100; i++) {
        out[i] = x[i] + x[i-1] / x[i];
    }
}

void jvpf(double x[100], double v[100], double out[100], double dout[100]) {
    __enzyme_fwddiff((void*)f, x, v, out, dout);
}
```
---
So I managed to get forward mode fully working, including a function for generating the full Jacobian matrix:

```c
// test.c
int enzyme_dupnoneed;
void __enzyme_fwddiff(void*, ...);
void __enzyme_autodiff(void*, ...);

void f(double x[100], double out[100]) {
    double prev = 0;
    for (int i = 0; i < 100; i++) {
        out[i] = x[i] + prev / x[i];
        prev = x[i];
    }
}

void jvpf(double x[100], double v[100], double out[100], double dout[100]) {
    __enzyme_fwddiff((void*)f, x, v, enzyme_dupnoneed, out, dout);
}

void jacfwdf(double x[100], double out[100], double jac[100][100]) {
    double v[100] = {0};
    for (int i = 0; i < 100; i++) {
        v[i] = 1;
        __enzyme_fwddiff((void*)f, x, v, enzyme_dupnoneed, out, jac[i]);
        v[i] = 0;
    }
}
```

But when I try doing reverse mode AD, it fails:

```c
void vjpf(double x[100], double dx[100], double out[100], double v[100]) {
    __enzyme_autodiff((void*)f, x, dx, enzyme_dupnoneed, out, v);
}

void jacrevf(double x[100], double out[100], double jac[100][100]) {
    double v[100] = {0};
    for (int i = 0; i < 100; i++) {
        v[i] = 1;
        __enzyme_autodiff((void*)f, x, jac[i], enzyme_dupnoneed, out, v);
        v[i] = 0;
    }
}
```

The failure happens when running the enzyme plugin on the IR:

Is there also any information on how to use batch/vector mode? I see that there is a supported signature in the API for it. Thanks!
---
Use
---
Here is your example with vector mode:
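A minimal sketch of the call shape (the width of 4 and the 100-element arrays here are illustrative):

```c
extern int enzyme_width, enzyme_dupv, enzyme_dupnoneedv;
void __enzyme_fwddiff(void*, ...);

void f(double x[100], double out[100]);

// computes 4 JVPs in a single call: enzyme_width sets the batch width, and
// enzyme_dupv/enzyme_dupnoneedv pair each primal with a batch of shadows
// laid out sizeof(double) * 100 bytes apart
void jvp4(double x[100], double v[4][100], double out[100], double dout[4][100]) {
    __enzyme_fwddiff((void*)f, enzyme_width, 4,
                     enzyme_dupv, sizeof(double) * 100, x, v,
                     enzyme_dupnoneedv, sizeof(double) * 100, out, dout);
}
```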
For forward vector mode you need to pass the vector width at compile time using `enzyme_width`. Be mindful that the amount of generated IR/code in Enzyme currently is proportional to the vector width.
---
I can now run the reverse AD functions with those extra compiler flags added. The forward-mode vector example compiles fine and seems to be at least an order of magnitude faster than the looped version; however, the resulting Jacobian is all zeros. Any ideas?
---
Can you share your code? How does it differ from this: https://fwd.gymni.ch/y12WMv?
---
My code was basically the same as yours, but I didn't have a main function since I was testing it from inside Python through ctypes. So I added the main function and found that when running it as an executable, it printed out the correct values of the Jacobian. However, when I called the function externally in Python through ctypes I was still getting a zero Jacobian matrix. Then I looked at your version's signature, `void jacfwdvecf(double const* __restrict x, double* __restrict out, double* __restrict jac)`, and after matching it I now get the correct output when calling it through ctypes.

Note that ultimately for my use-case I will be calling this code through Python, so it is essential for me to get this working correctly through ctypes. The remaining downside is that after comparing the running times of both the vector and loop versions of the Jacobian calculation, I found that they were now identical. Given the added complexity of calling `__enzyme_fwddiff` in vector mode, I was hoping this would give me some additional performance gain, but I am guessing it will probably still be a substantial improvement over what I currently have regardless, and I could also try reworking my use-case to use VJPs/JVPs instead of the full Jacobian matrix.

I have also finally finished getting llvm/flang set up from source, so I should be ready to move on to trying this all in Fortran. I will let you know if I encounter any issues. Thanks for all the help!
---
High vector widths can cause a lot of register pressure. Have you tried using vector widths of 2, 4, 8, or 16?
---
I realized I made a slight mistake when comparing the running times for each method: with a vector width of 100 it's actually twice as slow as running the looped version (30 us vs 15 us). So I played with the vector width and realized that setting it to less than 100 in this case results in only as many rows of the Jacobian as the width being filled in. This was tricky, but I managed to get it working:

```c
void jacfwdvecf(double x[100], double out[100], double jac[100][100]) {
    double v[100][100] = {0};
    const int nblocks = 25;
    const int block_size = 4;
    for (int block = 0; block < nblocks; block++) {
        int chunk = block * block_size;
        for (int i = 0; i < block_size; i++) {
            v[i][i + chunk] = 1;
        }
        __enzyme_fwddiff((void*)f, enzyme_width, block_size,
                         enzyme_dupv, sizeof(double) * 100, x, v,
                         enzyme_dupnoneedv, sizeof(double) * 100, out, jac[chunk]);
        for (int i = 0; i < block_size; i++) {
            v[i][i + chunk] = 0;
        }
    }
}
```

Out of all the vector widths I tried, 4 gave me the best results at 10 us, which is now a solid improvement over the looped version (15 us). Though I wonder how the results would change if my example computed a denser Jacobian matrix.
---
Hi there, I have just recently learned about the Enzyme project. I have read through some of the presentations and think it could be a very promising tool for my use-case. One of the bits that caught my interest was the mention that it can work with any language that has LLVM support. Fortran is of particular interest to me because my use-case involves running a fairly old Fortran 77 codebase for a radiative transfer model. The inputs are arrays for multiple physical quantities that are defined over a set of atmospheric layers, so the subroutine signature looks something like this:

I wish to write another subroutine which computes the associated Jacobian matrix (nargs x nlayer rows by nchan columns). Right now we are manually calculating it via finite differences, but due to the large size of the Jacobian matrix this has a huge computational cost, since it requires running the model hundreds of times. I have read that some examples have been done with Fortran, but all the examples I could find in the documentation were scalar functions in C++, so I am not quite sure where to start. I am currently thinking it should look something like this:

Can I get some confirmation that I am taking the correct approach, and if not, what changes would I need to make this work?