Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FFTW3 FFT via FFI #23

Open
wants to merge 20 commits into
base: master
Choose a base branch
from
Open

FFTW3 FFT via FFI #23

wants to merge 20 commits into from

Conversation

h4rm
Copy link
Contributor

@h4rm h4rm commented Feb 6, 2014

Hey Francesco,

here is a branch ("fftw") with the fftw functions replaced by wrapper calls for the FFTW library. Why did I do that? For one thing I wanted 2D, 3D and nD FFT which are far better supported by FFTW. Also I know you like to have the best speed there is and FFTW is faster than the GSL implementation (in case of many FFTs).

I have done some benchmarks that you maybe also want to try.
Original version (4.5 sec):

git checkout master
time gsl-shell -e "test=matrix.cnew(256,1,|i,j|i-1); for i = 1,1e6 do num.fft(test) end"

New version (0.9 sec):

git checkout fftw
time gsl-shell -e "test=matrix.cnew(256,1,|i,j|i-1); test2,plan = num.fft(test); for i = 1,1e6 do num.fft_plan( plan,test,test2) end"

I just want to know your opinion about it and if I should proceed with this (documentation, tests, maybe 2D demos).

@h4rm
Copy link
Contributor Author

h4rm commented Feb 7, 2014

The only single thing that I don't understand here is the problem that after initialization of gsl-shell

the very first num.fft transform gives zero output and transforms the input to zero as well. All following calls work correctly. Any ideas on that?

@franko
Copy link
Owner

franko commented Feb 7, 2014

Hi Benjamin,

I was trying your branch and I stumbled on the same problem. It seems that I found the solution by reading the FFTW user's manual here: http://www.fftw.org/fftw3_doc/Complex-One_002dDimensional-DFTs.html#Complex-One_002dDimensional-DFTs.

You must create the plan before initializing the input, because FFTW_MEASURE overwrites the in/out arrays. (Technically, FFTW_ESTIMATE does not touch your arrays, but you should always create plans first just to be sure.)

and before they write:

The flags argument is usually either FFTW_MEASURE or FFTW_ESTIMATE. FFTW_MEASURE instructs FFTW to run and measure the execution time of several FFTs in order to find the best way to compute the transform of size n. This process takes some time (usually a few seconds), depending on your machine and on the size of the transform. FFTW_ESTIMATE, on the contrary, does not run any computation and just builds a reasonable plan that is probably sub-optimal. In short, if your program performs many transforms of the same size and initialization time is not important, use FFTW_MEASURE; otherwise use the estimate.

So in practice when you create a plan with FFTW_MEASURE the library does mess with your input even before you run the plan.

I've seens the problem disappear if you use FFTW_ESTIMATE instead. Probably a better solution is to keep a flag to know weather a given plan has been already measured or not. If it is not yet measured you should create a copy of the input and copy back the data after the creation of the plan.

I'm also wondering if they don't have lower level interface to make the MEASURE stuff explicit.

Francesco

@h4rm
Copy link
Contributor Author

h4rm commented Feb 7, 2014

Excellent points.
So I guess we should make a "standalone" version of the fft functions with FFTW_ESTIMATE which would then also operates with the same performance as the original gsl versions when being in a loop. In addition, extra "create plan" and "execute plan" wrappers for the individual 1d, 2d and nd versions could be offered for many transformations. This would when lead to maximum performance as well as as maximum usability for the sake of having more than just fft, fft2 and fftn - what do you think?
Shall I make the changes then?

@franko
Copy link
Owner

franko commented Feb 7, 2014

Well, of course we need to change something because right now it doesn't work :-)

I hesitate to use ESTIMATE because if you doesn't really gain in performance what the point of using FFTW? Right now I was reading the manual to figure out if there is a smarter approach to use MEASURE and yet do not ask the user to plan ahead of time.

@h4rm
Copy link
Contributor Author

h4rm commented Feb 7, 2014

Hehe, sure, just didn't want to work on in if you are reworking the code at the same time.

So my understanding is, for normal fft calls we use ESTIMATE (because we don't want the measuring overhead for just this call) but for "create plan", "execute plan" we use MEASURE. If we make that point clear in the documentation I guess it is not too much to ask from the user to choose one or the other if the performance increase is so vast.

@franko
Copy link
Owner

franko commented Feb 7, 2014

I was wondering about using the New Array Interface

http://www.fftw.org/fftw3_doc/New_002darray-Execute-Functions.html#New_002darray-Execute-Functions

which may be adapted to our case. We need also to look the the stride of the input matrix (can be > 1 even for column matrix). The stride is "m.tda" in the general case.

I was also wondering about using fftw_malloc for matrix data allocation to guarantee data alignment or otherwise give the flag FFTW_UNALIGNED.

My plan was: use fftw_malloc to allocate matrix data block, always. Use the new-array-interface by caching plans for the most common used array size & strides. May be when stride is greater that one we can give up and use ESTIMATE to optimize only the common case with stride = 1.

It is different from your approach in that we still does not ask the user to plan. Otherwise your approach seems reasonable but you are giving up very soon by not looking at all to the more advanced interfaces.

@h4rm
Copy link
Contributor Author

h4rm commented Feb 7, 2014

In the current implementation fftw_execute_dft (as a new array execute function) is already used when calling the fft_plan functions. So its not that I haven't thought of reusing the plans for different input/output (or am I misunderstanding something?).

Probably you are right and we should enforce allocating the matrix data via fftw_alloc to avoid the unallignement mess up. Do you consider using fftw_alloc for ALL matrix allocations in matrix.lua then?

About the caching, do you mean precaching or caching when the user is requesting the FFT of a certain size? Otherwise caching would be a nice idea to "hide" the plans all along from the user. Then we could always use MEASURE and be sure, that the fastest possible scheme is used. I am not sure what the allocation size of a normal plan is and if we would have memory issue if we allocated the first 2^n, n from 4 to 8 or 16 plans.

@franko
Copy link
Owner

franko commented Feb 7, 2014

In the current implementation fftw_execute_dft (as a new array execute function) is already used when calling the fft_plan functions. So its not that I haven't thought of reusing the plans for different input/output (or am I misunderstanding something?).

I was thinking to use the new array interface also for ordinary "one shot" fft and rfft calls.

Probably you are right and we should enforce allocating the matrix data via fftw_alloc to avoid the unallignement mess up. Do you consider using fftw_alloc for ALL matrix allocations in matrix.lua then?

Yes.

About the caching, do you mean precaching or caching when the user is requesting the FFT of a certain size? Otherwise caching would be a nice idea to "hide" the plans all along from the user.

Yes, this is my nice idea :-) I mean just caching, not precaching. The old FFT implementation was already doing that to avoid allocating a new workspace each time.

I am not sure what the allocation size of a normal plan is and if we would have memory issue if we allocated the first 2^n, n from 4 to 8 or 16 plans.

We don't to guess. We can just keep in memory the plan for the previous call and if does match just reuse the plan. otherwise we can keep something like a fixed number of plans in memory and overwrite the least recently used when a new is requested and the slots are all already used. This latter schema is significantly more complex. I advice for just caching the last used plan for each kind of transforms, I.e. r2c_1d, c2c_1d, r2c_2d etc. This schema is simple to implement and cover the most common case when a user transform many times arrays of the same size.

@h4rm
Copy link
Contributor Author

h4rm commented Feb 7, 2014

  • fftw_alloc - agreed and check.
  • caching: agreed to cache used plans and reuse if size and dimension fits. Also distinguish between forward and backward in addition to r2c, c2r, dft_2d, etc...
  • using fftw_execute_dft for all and remove plans and plan functions alltogether - check.

I can modify the code with regard to caching and fftw_execute_... if you like. Could you make the matrix.lua changes with the allocations?

…rdft, rdftinv which are called by the fft, fft2 and fftn functions. Added test file and removed old fft-tests which was the same as the demo file.
@h4rm
Copy link
Contributor Author

h4rm commented Feb 13, 2014

Hey Francesco,

I have rewritten the whole fftw interface and reduced it down to three fundamental functions dft, rdft and rdftinv.

The dft-function is the most general, handling only complex input. It checks if output is provided by the user, otherwise it allocates the output. Users then can decide if they want to live with the allocation overhead or provide their output (default is dft allocates the memory for the user).

To address the problem that input was nullified during "planning", I allocate extra input memory for this function but I use the original input when I execute the plans via fftw_execute_dft which is one of these new array functions. Now all the demos work from the beginning and we don't get the [0,0...] vectors when first launching the demo.

dft also caches the plans for same input rank (dimension) and both forward and backward direction.

There is some if/else overhead in the dft function to avoid creating a function for forward and backward (for the complex case). I don't think it is performance critical to have these operations there because they are marginal in contrast to the operations involved in a 100_100_100 FFT.

The caching makes the whole thing very fast, just with a very basic comparison to the python FFT implementation I get 3x speed increase. I am happy with it and I look forward to your comments.
If this is an acceptable option I would of course modify the docs accordingly.

@h4rm
Copy link
Contributor Author

h4rm commented Feb 13, 2014

Remark to the fact that the output location can be given as a paramter:

This significantly increases performance:

time gsl-shell -e "test=matrix.cnew(256,1,|i,j|i-1); for i = 1,1e6 do bla=num.fft(test) end"
gsl-shell -e   2.68s user 0.74s system 99% cpu 3.426 total

vs.

time gsl-shell -e "test=matrix.cnew(256,1,|i,j|i-1); test2=num.fft(test); for i = 1,1e6 do num.fft(test, test2) end"
gsl-shell -e   0.88s user 0.01s system 99% cpu 0.886 total

@franko
Copy link
Owner

franko commented Feb 13, 2014

Hi Benjamin,

that's looks great but unfortunately it seems there is a problem. On my computer if I do demo'fft1' from the interactive shell I get a SEGFAULT, not always but often. The same happens for fft2 and fft3.

I've seen that you didn't fix the FFTW_UNALIGNED vs fftw_malloc but it seems that this is not the cause of the SEGFAULT. I guess there is a subtle error in the allocation/free of some vector.

Francesco

@h4rm
Copy link
Contributor Author

h4rm commented Feb 13, 2014

Hey,

I will look into it. I did not fix the fftw_malloc yet because we wanted to
integrate it into the matrix.lua but maybe that is the source of the
problem.
For the sake of testing I will include it in the code and try to reproduce
the error on my machine.

I will come back to you.

On Thu, Feb 13, 2014 at 9:49 PM, Francesco notifications@github.com wrote:

Hi Benjamin,

that's looks great but unfortunately it seems there is a problem. On my
computer if I do demo'fft1' from the interactive shell I get a SEGFAULT,
not always but often. The same happens for fft2 and fft3.

I've seen that you didn't fix the FFTW_UNALIGNED vs fftw_malloc but it
seems that this is not the cause of the SEGFAULT. I guess there is a subtle
error in the allocation/free of some vector.

Francesco

Reply to this email directly or view it on GitHubhttps://github.com//pull/23#issuecomment-35024165
.

@franko
Copy link
Owner

franko commented Feb 15, 2014

Hi Benjamin,

I'm planning to merge the FFTW branch soon but some extra work is needed before it can be merged in the master branch.

The most serious thing to be done right now is that we ignore the "tda" field in the column matrices. This is ok in many cases because "tda" is one for matrices created as column matrices but is the most general case it can be > 1. For example this happens if your column matrix is a column view of a matrix with multiple columns.

This is actually the only thing that prevent me from merging with master.

Otherwise I would like to get rid of UNALIGNED and use fftw_malloc for all matrix data allocations but this can be done later.

Then, more on the cosmetic side, we can remove some cruft from the FFTW ffi c definitions. Define fftw_complex as gsl_complex (you can identiify them) and other minor things like that.

I hope I will find the time soon to implement these changes and merge the branch. I really like the work that you have done and the FFTW is really interesting in terms of speed.

Francesco

@h4rm
Copy link
Contributor Author

h4rm commented Feb 21, 2014

Hey Francesco,

I am happy you like it. So I looked into FFTW Advanced Interface which allows to give the stride as well. Now I am giving tda of the input vector as the inputstride and tda of the output vector as the output stride. I am also noting the stride in the plan saving so we dont use the wrong stride later for other transformations of same dimension but different stride.

It works on my machine but I get a segfault when exiting the shell after running the following code:

num.fft(matrix.vec{1,2,3,4+0i})
c=matrix.cnew(4,4,|i|1)
num.fft(c:slice(1,1,4,1))

Maybe you have an idea how to fix that. Anyway, with the advanced interface the tda problems should be solved.

@h4rm
Copy link
Contributor Author

h4rm commented Feb 21, 2014

So it is strange:

c=matrix.cnew(4,4,|i|1)
num.fft(c:slice(1,1,4,1), c:slice(1,1,4,1))

gives no error when exiting gsl-shell and shows cool inplace fft transformation while

c=matrix.cnew(4,4,|i|1)
num.fft(c:slice(1,1,4,1))

gives the segfault. Is the GC trying to free some memory that already has been freed? It seems to be a problem with the output.

@franko
Copy link
Owner

franko commented Feb 21, 2014

Hi Benjamin,

very good work, thank your for your persistence :-)

It seems I found the error. Here a fix for the test case:

diff --git a/fft.lua b/fft.lua
index b78d36c..0c756ef 100644
--- a/fft.lua
+++ b/fft.lua
@@ -47,7 +47,7 @@ local function dft(invec, dimlist, outvec, direction)
                --Or create a new plan
                else
                        --allcoate input for plan making but planning can invalidate input, therefore we use extra array
-                       local inputtest = ffi.new("fftw_complex[?]", size)
+                       local inputtest = ffi.new("fftw_complex[?]", size * invec.tda)

                        local howmany = 1
                        local idist,odist = 0,0

It is because the inputtest vector need to have the same memory layout of "invec" itself. If stride is > 1 you have to allocate a bigger one.

Please note that i didn't fix the problem in other similar functions.

@h4rm
Copy link
Contributor Author

h4rm commented Feb 25, 2014

Great, that solved it.
Now it's time for me to change the documentation. I'll commit the changes later this week.

@h4rm
Copy link
Contributor Author

h4rm commented Mar 7, 2014

While changing the documentation and the examples accordingly I came across the problem, that the input is being altered by the num.rfftinv function but not by the num.rfft function. They both call different functions (rdft and rdftinv) but the implementations are almost the same. I cannot figure out a reason why a call of fftw.execute_dft_c2r(plan, input, output) should do something with the input but a call to fftw.execute_dft_r2c - it's an odd asymmetric bug (besides the fact that I would not expect the input to be alterted because we are doing out-of-place transformation and we are calling the new-array interface).

@franko
Copy link
Owner

franko commented Mar 11, 2014

Hi Benjamin,

sorry for the late reply. I've studied the problem and I didn't find anything wrong. I was even able to reproduce the problem with a simple C program:

#include <fftw3.h>

int
main()
{
    double x1[256], x2[256];
    fftw_complex z[129], ztest[129];
    double intest[256];
    int n = 256, i;

    for (i = 0; i < n; i++) {
        x1[i] = (i < 86 ? 0 : (i < 172 ? 1 : 0));
    }

    fftw_plan plan_r2c = fftw_plan_many_dft_r2c(1, &n, 1,
                                      intest, NULL, 1, 0,
                                      z, NULL, 1, 0,
                                      FFTW_MEASURE | FFTW_UNALIGNED);

    fftw_execute_dft_r2c(plan_r2c, x1, z);

    for (i = 0; i < 129; i++) {
        printf("%g + %gi\n", z[i][0], z[i][1]);
    }

    printf("\n\n\n");

    fftw_plan plan_c2r = fftw_plan_many_dft_c2r(1, &n, 1,
                                      ztest, NULL, 1, 0,
                                      x2, NULL, 1, 0,
                                      FFTW_MEASURE | FFTW_UNALIGNED);

    fftw_execute_dft_c2r(plan_c2r, z, x2);

    for (i = 0; i < 129; i++) {
        printf("%g + %gi\n", z[i][0], z[i][1]);
    }

}

May be you can ask to the FFTW mailing list, probably they should be able to help us.

@h4rm
Copy link
Contributor Author

h4rm commented Mar 12, 2014

Okay, I got the same altered results on my machine which is very curious. I will indeed as the FFTW mailing list. Stay tuned.

@h4rm
Copy link
Contributor Author

h4rm commented Mar 12, 2014

Okay, so I got an answer, that was quick.
The answer was hidden in the documentation of the 1d c2r function (they did not bother to mention that in the more general case). The problem is the following:

"Second, the inverse transform (complex to real) has the side-effect of overwriting its input array, by default. Neither of these inconveniences should pose a serious problem for users, but it is important to be aware of them."

and also

"As noted above, the c2r transform destroys its input array even for out-of-place transforms. This can be prevented, if necessary, by including FFTW_PRESERVE_INPUT in the flags, with unfortunately some sacrifice in performance. This flag is also not currently supported for multi-dimensional real DFTs (next section)."

So either we mention it in the documentation OR we copy the input every time OR we implement the FFTW_PRESERVE_INPUT for 1D and mention the altered input for 2D+ functions.

It's speed vs. usability again ;)

@franko
Copy link
Owner

franko commented Mar 12, 2014

Wow! They didn't even mention that in the documentation, because I've read quite carefully the user manual.

Anyway, I propose to use the FLAG PRESERVE_INPUT for 1d and always make a copy of the input for the 2d+ case. This can be bad surprise for the users that can make them angry.

Look to our case: only the mailing list was able to clarify the issue and the user manual was incomplete. this is the worst you can do for the users!

@h4rm
Copy link
Contributor Author

h4rm commented May 27, 2014

Hey Francesco,

sorry for the delay, holidays, work travel, presentations and such delayed my additional modifications to the documentations a bit. I have now tidied up every remaining bit of the original FFT implementation and substituted with the new FFTW wrap. The comments should be self explanatory.

Personally I have the FFTN funning at my project and it performs very well. The demos are working for me and the documentation compiles with the proper formatting/LATEX formatting.

For me this was fun.

@franko
Copy link
Owner

franko commented May 27, 2014

Hi Benjamin,

that looks great, thank you for taking the time for finalizing the work.

I will just need the time to review the work but I should be able to merge the fftw branch. Just one question, do you have some predefined tests for the FFT routines ?

@h4rm
Copy link
Contributor Author

h4rm commented May 27, 2014

No problem.
Concerning the tests: I have tested each function for its capabilities of accepting input and rerouting it to the underlying FFTW functions. Since all functions are just wraps for the three major functions dft, rdft, rdftinv I am pretty sure they work.

However, if you want some basic test functions for the test section in order to easily test if the fft module works at all I can supply those right away.

BTW: Installed gsl-shell on our cluster and first had the problem that FFTW3 (libfftw3.so) was not installed as a shared library. I have put a note into the INSTALL file concerning this and I hope this will be sufficient. If not installed, a "module not found" error will point toward the reason anyway.

Cheers,
Ben

@franko
Copy link
Owner

franko commented Jun 2, 2014

Hi Benjamin,

I was willing to make some checks and then merge the fftw branch. I've got only one problem, on Windows the fftw crash quite often. It is actually enough to call demo("fft2") two or three times to have a crash.

For the monent I don't have any idea bout that problem. On Linux everything seems fine. Do you have any idea ? Some basic tests that can help to point out the problem ?

@h4rm
Copy link
Contributor Author

h4rm commented Jun 3, 2014

I would like to replicate the bug in my Windows-VM but before I do that I would like to cross check with you, how you compile gsl-shell on windows? Are you using cygwin, gcc, manually download AGG & FOX library, install blas, fftw, etc.. via cygwin interface and then change makepackages accordingly?

@franko
Copy link
Owner

franko commented Jun 3, 2014

I'm using on windows (32bit) the mingw toolchain. I compile myself everything but for the "fftw" library I took the binary package for windows from the fftw website.

Yet I think that it is simpler that I do the tests myself since I have everything ready on my windows box. I don't want to give you a lot of extra work. May be you can suggest me some basic tests I can perform or otherwise I can figure out, I can even begin testing everything methodically!

May be I can also try to compile myself the fftw library but I suspect that the problem does not came from the precompiled library.

franko added a commit that referenced this pull request Jan 20, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants