running Corrfunc with nthreads>1 on cluster and some strange results #197

Open
zxzhai opened this issue Oct 5, 2019 · 14 comments
@zxzhai

zxzhai commented Oct 5, 2019

General information

Hi, I installed Corrfunc on a cluster and ran some simple tests with DDsmu_mocks. When I specify nthreads>1, the resulting pair counts are nthreads times the single-thread result, and the runtime is also nthreads times longer. This is very strange; it looks as if each thread runs consecutively and processes the full set of points itself, instead of splitting the work between the threads.

I also tested the same code on my laptop and on another cluster, and there was no problem: the results for different nthreads are identical and the runtime is (roughly) nthreads times faster. This implies that the problem only exists on this particular cluster, but I don't understand what, if anything, about the configuration of this cluster affects the code.

Have any of the developers run into something similar, and do you have any suggestions?

Thanks!

  • Corrfunc version: 2.3.1
  • platform: linux
  • installation method (pip/source/other?): pip
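
For illustration (this is not the exact script I ran), a rough sketch of the kind of call that shows the problem, assuming the documented DDsmu_mocks signature; comparing the returned npairs for different nthreads exposes the issue:

import numpy as np
from Corrfunc.mocks import DDsmu_mocks

# Fake survey-like points; the real test used my own catalogue.
rng = np.random.default_rng(1)
N = 100_000
ra = rng.uniform(0.0, 360.0, N)        # degrees
dec = rng.uniform(-30.0, 30.0, N)      # degrees
cz = rng.uniform(10000.0, 30000.0, N)  # km/s

sbins = np.linspace(0.1, 20.0, 11)     # s bin edges

for nthreads in (1, 4):
    res = DDsmu_mocks(autocorr=1, cosmology=1, nthreads=nthreads,
                      mu_max=1.0, nmu_bins=10, binfile=sbins,
                      RA1=ra, DEC1=dec, CZ1=cz)
    # On the problematic cluster, npairs with nthreads=4 comes out
    # 4x larger than with nthreads=1; elsewhere the two agree.
    print(nthreads, res['npairs'].sum())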

@lgarrison
Collaborator

Thanks for the report! That's pretty strange; I certainly haven't seen Corrfunc do that before. Does the problem occur only for DDsmu_mocks, or for other estimators as well?

Could you export OMP_DISPLAY_ENV=TRUE as an environment variable in your shell and then run your script? That should print some diagnostic information about the OpenMP setup.
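
If it's more convenient, the same report can be triggered from inside Python, provided the variable is set before the compiled Corrfunc extensions (and hence the OpenMP runtime) are loaded; a minimal sketch:

import os
# Must be set before the OpenMP runtime initialises, i.e. before the
# compiled extensions are imported; the report is printed when the
# runtime starts (at import or on the first call, depending on the runtime).
os.environ["OMP_DISPLAY_ENV"] = "TRUE"

import Corrfunc
# ... run the DDsmu_mocks test as usual ...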

@manodeep
Owner

manodeep commented Oct 6, 2019

Thanks @zxzhai for the report. Could you please follow what @lgarrison suggested above? It seems that OpenMP might need to be explicitly enabled at runtime.

When you installed Corrfunc with pip on that cluster, were all the required modules loaded explicitly? Otherwise, the Corrfunc install might have proceeded with the compiler supplied with the OS, and that might not come with OpenMP support.

@zxzhai
Author

zxzhai commented Oct 7, 2019

Hi @lgarrison and @manodeep, thanks for the suggestions!

I did two tests and this is what I got:

I tested DDrppi_mocks from Corrfunc.mocks, and it shows the same problem on that particular cluster.

When I export OMP_DISPLAY_ENV=TRUE and rerun the code, it gives me the following information:

OPENMP DISPLAY ENVIRONMENT BEGIN
_OPENMP = '201511'
OMP_DYNAMIC = 'FALSE'
OMP_NESTED = 'FALSE'
OMP_NUM_THREADS = '24'
OMP_SCHEDULE = 'DYNAMIC'
OMP_PROC_BIND = 'FALSE'
OMP_PLACES = ''
OMP_STACKSIZE = '0'
OMP_WAIT_POLICY = 'PASSIVE'
OMP_THREAD_LIMIT = '4294967295'
OMP_MAX_ACTIVE_LEVELS = '2147483647'
OMP_CANCELLATION = 'FALSE'
OMP_DEFAULT_DEVICE = '0'
OMP_MAX_TASK_PRIORITY = '0'
OPENMP DISPLAY ENVIRONMENT END

OPENMP DISPLAY ENVIRONMENT BEGIN
_OPENMP='201611'
[host] OMP_CANCELLATION='FALSE'
[host] OMP_DEFAULT_DEVICE='0'
[host] OMP_DISPLAY_ENV='TRUE'
[host] OMP_DYNAMIC='FALSE'
[host] OMP_MAX_ACTIVE_LEVELS='2147483647'
[host] OMP_MAX_TASK_PRIORITY='0'
[host] OMP_NESTED='FALSE'
[host] OMP_NUM_THREADS: value is not defined
[host] OMP_PLACES: value is not defined
[host] OMP_PROC_BIND='false'
[host] OMP_SCHEDULE='static'
[host] OMP_STACKSIZE='4M'
[host] OMP_THREAD_LIMIT='2147483647'
[host] OMP_WAIT_POLICY='PASSIVE'
OPENMP DISPLAY ENVIRONMENT END

There seems to be some inconsistency: the two _OPENMP values are different (201511 vs 201611). I did the same thing on the other computer that has no problem, and there the two _OPENMP values are the same (both 201511). So I suspect that this might be the reason. I will check to see if I can fix it.

@lgarrison
Collaborator

I agree it looks like two different OpenMP libraries are getting loaded at runtime, possibly GNU and Intel (libgomp.so and libiomp5.so), or maybe just two different versions of the same library (i.e. one extension has the RPATH set and one doesn't, so they find different versions). I've also seen this with Anaconda Python, because the Anaconda python executable has the RPATH set to the Anaconda lib directory, which often contains libgomp.so or libiomp5.so, or both! But the Corrfunc compiler doesn't know about that, so it may compile against a different OpenMP library than Anaconda's, and then OpenMP is resolved to Anaconda's at runtime. I think this is supposed to be okay, but it clearly can be harmful if certain OpenMP features are promised at compile time that can't be resolved at runtime. Whether that sort of confusion can also cause multiple OpenMP runtimes to get loaded, I'm not sure.

Static linking of OpenMP by other extensions or executables is another way multiple runtimes could get initialized. But I'm not sure if that would cause duplicate pair counts... although I don't understand how multiple dynamic runtimes would cause that either!
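
For what it's worth, one quick way to see which OpenMP runtime(s) the installed Corrfunc extensions actually resolve at load time is to run ldd over the compiled modules. A rough sketch (the installed file layout and the .so suffix are assumptions and may differ by platform):

import glob
import os
import subprocess

import Corrfunc

# The compiled extensions are assumed to live inside the installed package.
pkg_dir = os.path.dirname(Corrfunc.__file__)
for ext in glob.glob(os.path.join(pkg_dir, "**", "*.so"), recursive=True):
    out = subprocess.run(["ldd", ext], capture_output=True, text=True).stdout
    omp_libs = [line.strip() for line in out.splitlines()
                if "gomp" in line or "iomp" in line or "libomp" in line]
    print(ext)
    print("\n".join(omp_libs) if omp_libs else "  (no OpenMP runtime linked)")

Seeing more than one OpenMP runtime (e.g. both libgomp.so and libiomp5.so) across the extensions would be consistent with the picture above.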

I think I'd recommend what @manodeep suggested: try uninstalling Corrfunc, making sure all your compiler modules are loaded, and then reinstall. If pip doesn't work, try building from source where you can specify the correct compiler manually. And maybe try both inside and outside of Anaconda Python (if you're using that).

@zxzhai
Author

zxzhai commented Oct 8, 2019

Thanks for the suggestions. I think I've solved this problem, but I don't completely understand why. What I learned is: install Corrfunc from source rather than with pip.

My .bashrc points to another gcc installation used by a different code (not Corrfunc). I had to switch off all of that setup, so that the OpenMP library is just the system default. After that I reinstalled Corrfunc from source and the problem seems to be solved. One place to check is the "CC :=" line in the common.mk file, in case other people run into the same problem in the future.

The part I don't understand is that now that the problem is solved (the output doesn't depend on nthreads and the speed scaling is fine), exporting OMP_DISPLAY_ENV=TRUE still shows two inconsistent sections. So it looks like the different OpenMP versions (or different versions of the same library) don't affect the result, at least in this scenario; the previous error was caused by something else, still unknown, perhaps depending on how Python uses OpenMP and at which step the library is called.

@lgarrison
Collaborator

Thanks for reporting back! This is all really good to know. It's something of a relief that the multiple OpenMP versions aren't clashing, because I don't know how they would have been executing the same parallel region. I think the "inconsistent" OpenMP libraries could easily be coming from different Python packages that were compiled with different RPATH or static linking, which should all be safe.

I'm happy to help if you'd like to dig into the other compiler to try to figure out how it caused this behavior, but otherwise feel free to close the issue.

@samotracio

Hi,
Just sharing that I am facing similar issues here with a very different Fortran code and OpenMP. For some reason, the entire parallel DO loop is executed in each thread when nthreads>1, leading to a compute time equal to nthreads x single_thread_time. This only happens on a virtual cluster when OMP_SCHEDULE is set to something other than the default value of "static", or when a scheduling other than static is specified directly on the DO loop. If OMP_SCHEDULE=auto or static, then it works fine.
For reference, I also see exactly the same "conflicting" _OPENMP preprocessor versions. The problem only appears on a virtual server running on a 2-CPU machine; on other single-CPU machines the issue does not happen under any circumstance, and on my laptop everything runs fine.
For me, static scheduling seems to work, but I will keep investigating and report back here if something useful surfaces.

@manodeep
Owner

manodeep commented Oct 9, 2019

@zxzhai There is no real difference between pip install and git clone + install. Under the hood, the compiler is checked and set as appropriate for the underlying OS (gcc for Linux, clang for OSX).

@samotracio Your investigation seems quite relevant. The Corrfunc scheduling is always specified as dynamic, and from @zxzhai's report, the runtime OpenMP seems to be configured for static scheduling.

@Christopher-Bradshaw
Contributor

I'm also having, perhaps related, perhaps totally different, problems with parallelization. In my case, setting nthreads appears to have no effect on runtime, and checking CPU usage via htop I never appear to be using more than 1 thread.

I've added logging to make sure that numthreads is properly getting passed through and that _OPENMP is defined (e.g. the omp_set_num_threads(numthreads); call here is reached). If I write a toy script using OpenMP, that works, so I'm not sure it is a toolchain problem.

My system:

  • Debian
  • Corrfunc at master, compiled using both GCC and clang
  • Called through python (not sure if that changes anything)
  • Don't have anaconda
OPENMP DISPLAY ENVIRONMENT BEGIN      
  _OPENMP = '201511'                  
  OMP_DYNAMIC = 'FALSE'               
  OMP_NESTED = 'FALSE'                
  OMP_NUM_THREADS = '4'               
  OMP_SCHEDULE = 'DYNAMIC'            
  OMP_PROC_BIND = 'FALSE'             
  OMP_PLACES = ''                     
  OMP_STACKSIZE = '0'                 
  OMP_WAIT_POLICY = 'PASSIVE'         
  OMP_THREAD_LIMIT = '4294967295'     
  OMP_MAX_ACTIVE_LEVELS = '2147483647'
  OMP_CANCELLATION = 'FALSE'          
  OMP_DEFAULT_DEVICE = '0'            
  OMP_MAX_TASK_PRIORITY = '0'         
OPENMP DISPLAY ENVIRONMENT END        

I'll keep looking, but any suggestions would be appreciated!
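
A minimal timing check along these lines makes the symptom easy to quantify (a sketch only; the Corrfunc.theory.DD call follows the documented signature, but keyword names may vary between versions):

import time
import numpy as np
from Corrfunc.theory import DD

# Random points in the unit cube; non-periodic so no boxsize is needed.
rng = np.random.default_rng(42)
N = 200_000
x, y, z = rng.random((3, N))
bins = np.linspace(0.01, 0.1, 11)

for nthreads in (1, 2, 4):
    t0 = time.perf_counter()
    res = DD(autocorr=1, nthreads=nthreads, binfile=bins,
             X1=x, Y1=y, Z1=z, periodic=False)
    dt = time.perf_counter() - t0
    # With working OpenMP, npairs is identical for every nthreads and
    # the wall time should drop roughly as 1/nthreads.
    print(nthreads, dt, res['npairs'].sum())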

@lgarrison
Collaborator

These problems smack of core affinity issues; i.e. the process affinity mask is set to only execute on one core. This can arise when using OMP_PROC_BIND, which does not appear to be the case here. Another way is when executing on a cluster using Slurm, LSF, or another queue system that launches your job or shell with a specific affinity mask (processes inherit their parent's affinity mask). Sometimes there are "resource binding" flags in the job allocation request that can affect affinity. @Christopher-Bradshaw, are you running on a cluster?

Even if not, a "rogue" Python package could be setting the affinity. Numpy does this, but only when using OMP_PROC_BIND, I think.

Regardless, I would try tracking the core affinity, starting at the C level inside Corrfunc to check if the affinity is actually restricted. Here is a sample program I have used in the past for this purpose:

#define _GNU_SOURCE   /* needed for sched_getaffinity and the CPU_* macros on glibc */
#include <omp.h>
#include <stdio.h>
#include <sched.h>
#include <assert.h>
#include <stdlib.h>

int main(void){
    // First report the CPU affinity bitmask.
    cpu_set_t mask;
    int cpusetsize = sizeof(cpu_set_t);

    assert(sched_getaffinity(0, cpusetsize, &mask) == 0);

    int naff = CPU_COUNT_S(cpusetsize, &mask);

    printf("Core affinities (%d total): ", naff);
    for (int i = 0; i < CPU_SETSIZE; i++) {
        if(CPU_ISSET(i, &mask))
            printf("%d ", i);
    }
    printf("\n");

    int maxthreads = omp_get_max_threads();
    int nprocs = omp_get_num_procs();

    printf("omp_get_max_threads(): %d\n", maxthreads);
    printf("omp_get_num_procs(): %d\n", nprocs);

    return 0;
}

omp_get_max_threads() and omp_get_num_procs() are useful because the latter will be less than the former if the affinity was restricted at program startup.

If the affinity is actually restricted, then try going one level higher, into Python:

import psutil
print('Python affinity #:', len(psutil.Process().cpu_affinity()))

Place that print statement in a few strategic spots in the code, e.g. at startup, after imports, before running Corrfunc, and after running Corrfunc. See if anything changes.
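
For example, something along these lines (a sketch; psutil.Process().cpu_affinity() is available on Linux):

import psutil

def report(stage):
    # cpu_affinity() lists the logical CPUs this process is allowed to run on
    print(f"[{stage}] Python affinity #: {len(psutil.Process().cpu_affinity())}")

report("startup")

import numpy as np
import Corrfunc
report("after imports")

# ... load/generate the data here ...
report("before Corrfunc")
# results = Corrfunc.theory.DD(...)   # your actual pair-counting call
report("after Corrfunc")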

@lgarrison
Collaborator

I forgot to mention that to check the affinity of the shell, in Bash one can use:

taskset -c -p $$

where $$ expands to the PID of the current shell.

@Christopher-Bradshaw
Contributor

Thanks a lot for the suggestions, I'll give them a try now. I am not running on a cluster, just my local desktop.

@lgarrison
Collaborator

Another possibility totally unrelated to OpenMP: Corrfunc threads over cell pairs, so if the problem is extremely clustered such that a single cell pair dominates the runtime (e.g. the autocorrelation of a single massive cell), then you will see a burst of multi-threaded activity at the beginning followed by a long period of a single thread running. You can alleviate this somewhat by specifying a larger max_cells_per_dim and possibly bin_refine_factors too. You can use verbose to see what grid size Corrfunc is using.
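
Roughly like this (a sketch; verbose and max_cells_per_dim follow the keyword names in the Corrfunc docstrings, but check your version's documentation for the exact spelling):

import numpy as np
from Corrfunc.theory import DD

rng = np.random.default_rng(0)
x, y, z = rng.random((3, 100_000))
bins = np.linspace(0.01, 0.1, 11)

results = DD(autocorr=1, nthreads=4, binfile=bins,
             X1=x, Y1=y, Z1=z, periodic=False,
             verbose=True,            # report the grid size Corrfunc chooses
             max_cells_per_dim=250)   # allow a finer grid than the default
# The *bin_refine_factor keywords (see the docstring) can be raised as well
# to subdivide the cells further.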

@manodeep
Owner

I just encountered this issue while running on an interactive node via the slurm queue. The solution was to specify --cpus-per-task as the maximum number of threads I was planning to use. Once I specified that, the taskset command showed the correct cpu affinities. For instance, with --cpus-per-task 4 in my job, I get the following:

    [~ @john1] taskset -c -p $$
    pid 116541's current affinity list: 16,18,20,30

Before I added --cpus-per-task, I was submitting with --ntasks 4; the taskset command always showed one entry, and I could not get Corrfunc to run on multiple threads. (In hindsight, that makes sense -- I was requesting four tasks, each with one CPU assigned to it.)

Not the solution required here, but it might solve one class of OpenMP issues.
