Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alternatives for non-standard sub_group::shuffle_up/down/xor and yet-to-be-implemented atomic_fence #767

Closed
nmnobre opened this issue Jul 11, 2022 · 13 comments · May be fixed by #1031
Closed
Labels
discussion General discussion about something

Comments

@nmnobre
Copy link
Member

nmnobre commented Jul 11, 2022

Hi @illuhad,

I've been trying to compile a project written in SYCL which also uses a few DPC++ extensions.

  1. I'm aware sub_group::shuffle_up/down/xor are not part of the standard, but I was wondering if I could emulate their behaviour with standard features currently implemented in hipSYCL. Do you have any ideas you'd be happy to share?

  2. Since atomic_fence hasn't yet been implemented, I've been thinking of biting the bullet and switching my code to use mem_fence instead. In your experience, is this the best course of action or are there better alternatives?

Thank you very much for your time,
-Nuno

@nmnobre nmnobre added the discussion General discussion about something label Jul 11, 2022
@illuhad
Copy link
Collaborator

illuhad commented Jul 11, 2022

1.) Yes, e.g. shift_group_left/right: https://www.khronos.org/registry/SYCL/specs/sycl-2020/html/sycl-2020.html#_shift_left_and_shift_right

2.) We could just implement it/you could submit a PR, it shouldn't be too difficult :-) We currently only support relaxed memory order on GPUs anyway, so there would not be a difference to a mem_fence for the same memory_scope.

@nmnobre
Copy link
Member Author

nmnobre commented Aug 5, 2022

  1. Worked like a charm, thanks! 🙂
  2. How do we expect the current mem_fence implementation to work when targeting nvc++?

@illuhad
Copy link
Collaborator

illuhad commented Aug 6, 2022

It invokes one of the __threadfence() CUDA functions (depending on the fence scope argument to mem_fence). Here's the implementation detail: https://github.com/illuhad/hipSYCL/blob/c0808ade082d0e08ddfa603fd1cffb3db57910c7/include/hipSYCL/sycl/libkernel/detail/mem_fence.hpp

There's no difference between the nvc++ path and the clang CUDA path.

@nmnobre
Copy link
Member Author

nmnobre commented Aug 6, 2022

I'd have thought so, but it's actually invoking __hipsycl_if_target_host() which would do nothing, if not for the weird link time undefined reference to __builtin_is_device_code, presumably from if target (nv::target::is_host).

@illuhad
Copy link
Collaborator

illuhad commented Aug 6, 2022

It should only execute __hipsycl_if_target_host() for the host code - the device binary should be compiled using the __hipsycl_if_target_hiplike() path, which is defined here:
https://github.com/illuhad/hipSYCL/blob/3dc545fc77d558e0c9b6a2d5de989a62f17b8a30/include/hipSYCL/sycl/libkernel/backend.hpp#L135

Since HIPSYCL_LIBKERNEL_IS_DEVICE_PASS_CUDA is always set to 1 for nvc++ (see https://github.com/illuhad/hipSYCL/blob/3dc545fc77d558e0c9b6a2d5de989a62f17b8a30/include/hipSYCL/sycl/libkernel/cuda/cuda_backend.hpp#L54), it should use the definition of __hipsycl_if_target_hiplike() which resolves to __hipsycl_if_target_device(). This then is defined as if target (nv::target::is_device).

Is something there not working?
Can you maybe look at the preprocessed file to see if it preprocesses to the correct code? It wouldn't be the first time an nvc++ compiler bug around the if target mechanism messes things up...

@nmnobre
Copy link
Member Author

nmnobre commented Aug 6, 2022

My bad... __hipsycl_if_target_hiplike() does also resolve to an if target, I was looking at the wrong place.
They are both preprocessed correctly apparently:

template<access::fence_space, access::mode>
struct mem_fence_impl
{
  __attribute__((host)) __attribute__((device))
  static void mem_fence()
  {

    if target (nv::target::is_device) { __threadfence(); };


    ;





    
    if target (nv::target::is_host) {  };
  }

};

so it must be both of them giving me trouble with the builtin undefined reference....
I've tried versions 21.7 and 22.7 of the Nvidia HPC SDK with the same result: undefined reference to `__builtin_is_device_code'.

@illuhad
Copy link
Collaborator

illuhad commented Aug 7, 2022

Thanks, the following snippet reproduces independently of hipSYCL with nvc++ 22.2:

#include <nv/target>

void mem_fence() {
  if target(nv::target::is_device){
    __threadfence();
  }
  if target(nv::target::is_host){
    
  }
}

__global__ void invoke(){
  mem_fence();
}

int main(){
  invoke<<<1,1>>>();
}

(it happens with or without the __host__ __device__ attributes, so they don't seem to matter. Compile with nvc++ -cuda -o test test.cpp).
Does this also reproduce with your nvc++ versions? If so, this looks like a massive bug in the if target mechanism, and I will open a bug report with NVIDIA.

EDIT: Seems it does not like the empty host path. Removing it or putting code inside seems to resolve the issue.

@nmnobre
Copy link
Member Author

nmnobre commented Aug 8, 2022

Does this also reproduce with your nvc++ versions?

Yes, with both 21.7 and 22.7.

EDIT: Seems it does not like the empty host path. Removing it or putting code inside seems to resolve the issue.

Can also confirm your findings: emptying either if target path breaks compilation.

@illuhad
Copy link
Collaborator

illuhad commented Aug 8, 2022

Thank you, I have filed a bug report: https://forums.developer.nvidia.com/t/nvc-up-to-22-7-undefined-reference-to-builtin-is-device-code-for-empty-if-target-paths/223232
Hopefully it's fixed in one of the next nvc++ releases. I've checked if we can just put a semicolon in the empty branch (i.e., an empty statement) as a workaround on the hipSYCL side, but it seems that doesn't work either and it really needs actual code :/

@nmnobre
Copy link
Member Author

nmnobre commented Aug 8, 2022

Hopefully it's fixed in one of the next nvc++ releases. I've checked if we can just put a semicolon in the empty branch (i.e., an empty statement) as a workaround on the hipSYCL side, but it seems that doesn't work either and it really needs actual code :/

I also tried putting one empty statement, multiple empty statements, switching their order,... but to no avail. 🤪

@illuhad
Copy link
Collaborator

illuhad commented Aug 8, 2022

Yeah, my experience is that nvc++ bugs are quite throrough ;) But thanks for trying.

I guess as a hotfix we can just remove that empty statement for now: #798
But I don't know if similar patterns might not be present elsewhere.

@illuhad
Copy link
Collaborator

illuhad commented Aug 8, 2022

NVIDIA says the issue is known, and should be fixed in future releases: https://forums.developer.nvidia.com/t/nvc-up-to-22-7-undefined-reference-to-builtin-is-device-code-for-empty-if-target-paths/223232/2

@nmnobre
Copy link
Member Author

nmnobre commented Aug 8, 2022

Yeah, my experience is that nvc++ bugs are quite throrough ;) But thanks for trying.

If they are anything like nvfortran's... 😝

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discussion General discussion about something
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants