Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

device function performing FFT/DCT of data in shared memory (feature request / feasability) #173

Open
nschaeff opened this issue Apr 15, 2024 · 1 comment

Comments

@nschaeff
Copy link

nschaeff commented Apr 15, 2024

Hello,

For small (1D) transforms that are part of a more complex workflow, it would be interesting to be able to call VkFFT from a kernel on data already residing in shared memory.
Would that be possible to implement? If yes it is a feature that could interest me in the future.

For instance, within my cuda kernel that works on some data, I would call

__shared__ double my_data[N];
double* aux_data_needed_by_vkfft = pointer_returnd_by_vkfft_when_planning;

vkfft_perform_dft_inplace(my_data,  aux_data_needed_by_vkfft);

I assume the blocksize and data size must both be known at the planning stage for VkFFT to generate the function.
Alternatively, VkFFT could also return a string with the code of the device function, that could be included into another cuda program compiled at runtime with nvrtc.

[For instance, this would allow to do some filtering within a kernel:
generate some data -> FFT -> set some frequencies to zero -> IFFT -> further process data -> write to memory.]

@DTolm
Copy link
Owner

DTolm commented Apr 16, 2024

Hello,

I have been thinking about how kernel inlining (and callbacks) can be handled in VkFFT for some time, as the runtime generation nature of VkFFT should allow it to be implemented smoothly. Mostly what I need for this is to learn how separate linking is done in different APIs and decide on how the interface will look like. Speaking of the latter, I think VkFFT can be modified to generate codelets for any sequence for an arbitrary fixed number of threads, so this won't be a constraint, but handling the input/output format (in/to registers, in/to shared memory) can be more tricky. I will try to come back to this soon, after finishing the separate complex data layout for the next release.

Best regards,
Dmitrii

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants