Replies: 7 comments 13 replies
-
Hi @csccva, AdaptiveCpp has two CPU compilation flows: a library-only OpenMP flow and a compiler-accelerated flow. Last time I checked, DPC++ still used the Intel OpenCL CPU runtime under the hood, which performs similar optimizations to our compiler-supported flow, but on top employs its own whole-function vectorizer, which we cannot ship similarly. In theory, our approach is compatible with LLVM's VPlan native path, which employs an outer-loop vectorizer, or with the out-of-tree Region Vectorizer (RV). You could test whether this is the deciding factor by forcing the Intel runs to not be vectorized. Additionally, in my experience, Intel's OpenCL CPU runtime tends to compile the kernels with more aggressive options, such as fast-math. But I don't remember how to get the runtime to tell you what its current compile-time options are...
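As a rough sketch, the two CPU flows mentioned above can be selected at compile time via AdaptiveCpp's target specification (target names as in current AdaptiveCpp conventions; check the documentation of your installed version):

```sh
# Library-only OpenMP flow: kernels run through a pure library implementation,
# relying entirely on the host compiler's optimizer.
acpp -O3 --acpp-targets=omp.library-only -o app_lib app.cpp

# Compiler-accelerated OpenMP flow: AdaptiveCpp's compiler support kicks in
# (mainly benefits kernels that use group barriers).
acpp -O3 --acpp-targets=omp.accelerated -o app_acc app.cpp
```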
-
Hello, I checked my setup.
Regarding the installation, I am not sure; I installed AdaptiveCpp using spack. Regarding oneAPI, I set the suggested environment variable to disable vectorization.
The kernel execution time is now similar to hipSYCL. Your suspicion about vectorization must be correct. Cristian
-
I changed the spack recipe and added the suggested option.
-
Some additional pointers:
Note that AdaptiveCpp can also use the exact same Intel OpenCL CPU runtime that DPC++ uses for CPU execution. For this, you need to build AdaptiveCpp with the OpenCL backend enabled. In practice, it is almost impossible for DPC++ to outperform AdaptiveCpp on CPU, since AdaptiveCpp can use the exact same paths and more, so you have more options to try. EDIT: Your code does not contain barriers, so it's unlikely that the compiler-accelerated CPU mode affects performance here.
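A hedged sketch of enabling the OpenCL backend when building AdaptiveCpp from source (the CMake option follows AdaptiveCpp's `WITH_<BACKEND>_BACKEND` naming convention; verify against your version's install documentation):

```sh
# Configure AdaptiveCpp with the OpenCL backend so it can target
# the Intel OpenCL CPU runtime at run time, then build and install.
cmake -S AdaptiveCpp -B build -DWITH_OPENCL_BACKEND=ON
cmake --build build --target install
```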
-
I did the installation via spack and tried different configurations. When I compile the code I get this:
There are some warnings; I assumed that they refer to the inner loop. I checked the libraries, and the libomp is the correct one:
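For reference, one common way to check which OpenMP runtime a binary actually resolves at load time (the binary name `app` is a placeholder):

```sh
# List the shared libraries the dynamic linker resolves and filter for OpenMP;
# the printed path shows which libomp/libgomp the binary will actually load.
ldd ./app | grep -i omp
```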
I will try on LUMI tomorrow. Cristian |
-
This is with current AdaptiveCpp on 2x AMD EPYC 7713, which is probably similar to your hardware. I cannot reproduce your performance observation:
-
Depending on the exact system topology, it might also be feasible to first zero the buffers with a parallel kernel, so that the memory pages are first-touched by the threads that will later work on them.
Host pointer here means passing a pointer to the buffer constructor.
-
Hello,
I am testing some SYCL codes on a machine with an AMD EPYC 7H12 64-core processor and NVIDIA GPUs. When using the NVIDIA GPU, AdaptiveCpp and oneAPI give very similar results:
and
But when I try to run the code using a CPU core, there is a significant difference:
vs.
I compile the code using:
Below is the code:
Is there any way to improve the AdaptiveCpp CPU performance?