Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change OpenMP scheduling to from dynamic to static #264

Open
manodeep opened this issue Nov 17, 2021 · 12 comments
Open

Change OpenMP scheduling to from dynamic to static #264

manodeep opened this issue Nov 17, 2021 · 12 comments

Comments

@manodeep
Copy link
Owner

General information

  • Corrfunc version: 2.4.0
  • platform: linux
  • installation method (pip/source/other?): from source with python -m pip install -e .

Issue description

OpenMP install with gcc/9.2.0 produces incorrect results when invoked from python. This was originally reported in #197, and then rediscovered in #258

Expected behaviour

The results with OpenMP should be correct when invoked through python

Actual behaviour

The pair counts are nthreads times larger when using the python wrappers. The results are correct when directly using the C static library.

What have you tried so far?

Compiled without OpenMP makes this problem go away

Changing the OpenMP scheduling to static makes the problem less severe - ie, the pair counts are off by a little bit rather than nthreads times. As an added benefit, the code also appears to run marginally faster

Minimal failing example

After installing with the changes in #258, the following fails:

python -m Corrfunc.tests 

@manodeep
Copy link
Owner Author

@lgarrison I was thinking of implementing this change asap - what do you think?

@lgarrison
Copy link
Collaborator

Not sure... I'm worried that static scheduling will hurt the performance for more clustered cases. I'm also not sure "hiding" the problem with static scheduling is a good idea, even as a stop-gap measure, because we don't really understand all the ways this problem can manifest. I think keeping the problem highly visible and obviously wrong, as it is now, might be preferable.

@manodeep
Copy link
Owner Author

Yeah that's the choice - would it be better to be disastrously wrong or just slightly wrong? Disastrously wrong has a greater chance of people noticing but for those that may not notice, the results will be wrong

I will close this then and solve/detect the race condition as we are progressing on #258

@lgarrison
Copy link
Collaborator

Speaking of detecting the race condition, maybe it's worth trying the GCC 9.2 thread sanitizer on the Python extension? The memory/address sanitizers were always hard to run because of all the false positives from Python, but perhaps the thread sanitizer will behave better?

@manodeep
Copy link
Owner Author

Ohh good idea - I will try that later today

@manodeep manodeep reopened this Nov 17, 2021
@lgarrison
Copy link
Collaborator

Another idea: maybe worth trying OMP_NESTED=true and OMP_NESTED=false? I don't think we're (intentionally) using any nested parallelism, but if somehow we are, it could mess with the thread IDs...

An even simpler version of this train of thought: can we print the thread IDs and check that they are unique?

@manodeep
Copy link
Owner Author

I tried omp_set_nested(0) but that did not change anything. Good idea with printing the thread ids - let me try that and I will report back

@manodeep
Copy link
Owner Author

Nothing stands out with the printed threads. However, setting OMP_DISPLAY_ENV=verbose does show two OPENMP environments from python but only 1 from C

output from C test -> make tests_bin that passes
======================================

OPENMP DISPLAY ENVIRONMENT BEGIN
  _OPENMP = '201511'
  OMP_DYNAMIC = 'FALSE'
  OMP_NESTED = 'FALSE'
  OMP_NUM_THREADS = '4'
  OMP_SCHEDULE = 'DYNAMIC'
  OMP_PROC_BIND = 'FALSE'
  OMP_PLACES = ''
  OMP_STACKSIZE = '0'
  OMP_WAIT_POLICY = 'PASSIVE'
  OMP_THREAD_LIMIT = '4294967295'
  OMP_MAX_ACTIVE_LEVELS = '2147483647'
  OMP_CANCELLATION = 'FALSE'
  OMP_DEFAULT_DEVICE = '0'
  OMP_MAX_TASK_PRIORITY = '0'
  OMP_DISPLAY_AFFINITY = 'FALSE'
  OMP_AFFINITY_FORMAT = 'level %L thread %i affinity %A'
  GOMP_CPU_AFFINITY = ''
  GOMP_STACKSIZE = '0'
  GOMP_SPINCOUNT = '300000'
OPENMP DISPLAY ENVIRONMENT END

output from python test -> python -m Corrfunc.tests that fails
=============================================

OPENMP DISPLAY ENVIRONMENT BEGIN
  _OPENMP = '201511'
  OMP_DYNAMIC = 'FALSE'
  OMP_NESTED = 'FALSE'
  OMP_NUM_THREADS = '4'
  OMP_SCHEDULE = 'DYNAMIC'
  OMP_PROC_BIND = 'FALSE'
  OMP_PLACES = ''
  OMP_STACKSIZE = '0'
  OMP_WAIT_POLICY = 'PASSIVE'
  OMP_THREAD_LIMIT = '4294967295'
  OMP_MAX_ACTIVE_LEVELS = '2147483647'
  OMP_CANCELLATION = 'FALSE'
  OMP_DEFAULT_DEVICE = '0'
  OMP_MAX_TASK_PRIORITY = '0'
  OMP_DISPLAY_AFFINITY = 'FALSE'
  OMP_AFFINITY_FORMAT = 'level %L thread %i affinity %A'
  GOMP_CPU_AFFINITY = ''
  GOMP_STACKSIZE = '0'
  GOMP_SPINCOUNT = '300000'
OPENMP DISPLAY ENVIRONMENT END


OPENMP DISPLAY ENVIRONMENT BEGIN
   _OPENMP='201611'
  [host] KMP_ABORT_DELAY='0'
  [host] KMP_ADAPTIVE_LOCK_PROPS='1,1024'
  [host] KMP_ALIGN_ALLOC='64'
  [host] KMP_ALL_THREADPRIVATE='144'
  [host] KMP_ATOMIC_MODE='2'
  [host] KMP_BLOCKTIME='200'
  [host] KMP_CPUINFO_FILE: value is not defined
  [host] KMP_DETERMINISTIC_REDUCTION='FALSE'
  [host] KMP_DEVICE_THREAD_LIMIT='2147483647'
  [host] KMP_DISP_HAND_THREAD='FALSE'
  [host] KMP_DISP_NUM_BUFFERS='7'
  [host] KMP_DUPLICATE_LIB_OK='FALSE'
  [host] KMP_ENABLE_TASK_THROTTLING='TRUE'
  [host] KMP_FORCE_REDUCTION: value is not defined
  [host] KMP_FOREIGN_THREADS_THREADPRIVATE='TRUE'
  [host] KMP_FORKJOIN_BARRIER='2,2'
  [host] KMP_FORKJOIN_BARRIER_PATTERN='hyper,hyper'
  [host] KMP_FORKJOIN_FRAMES='TRUE'
  [host] KMP_FORKJOIN_FRAMES_MODE='3'
  [host] KMP_GTID_MODE='3'
  [host] KMP_HANDLE_SIGNALS='FALSE'
  [host] KMP_HOT_TEAMS_MAX_LEVEL='1'
  [host] KMP_HOT_TEAMS_MODE='0'
  [host] KMP_INIT_AT_FORK='TRUE'
  [host] KMP_ITT_PREPARE_DELAY='0'
  [host] KMP_LIBRARY='throughput'
  [host] KMP_LOCK_KIND='queuing'
  [host] KMP_MALLOC_POOL_INCR='1M'
  [host] KMP_MWAIT_HINTS='0'
  [host] KMP_NUM_LOCKS_IN_BLOCK='1'
  [host] KMP_PLAIN_BARRIER='2,2'
  [host] KMP_PLAIN_BARRIER_PATTERN='hyper,hyper'
  [host] KMP_REDUCTION_BARRIER='1,1'
  [host] KMP_REDUCTION_BARRIER_PATTERN='hyper,hyper'
  [host] KMP_SCHEDULE='static,balanced;guided,iterative'
  [host] KMP_SETTINGS='FALSE'
  [host] KMP_SPIN_BACKOFF_PARAMS='4096,100'
  [host] KMP_STACKOFFSET='64'
  [host] KMP_STACKPAD='0'
  [host] KMP_STACKSIZE='8M'
  [host] KMP_STORAGE_MAP='FALSE'
  [host] KMP_TASKING='2'
  [host] KMP_TASKLOOP_MIN_TASKS='0'
  [host] KMP_TASK_STEALING_CONSTRAINT='1'
  [host] KMP_TEAMS_THREAD_LIMIT='36'
  [host] KMP_TOPOLOGY_METHOD='all'
  [host] KMP_USER_LEVEL_MWAIT='FALSE'
  [host] KMP_USE_YIELD='1'
  [host] KMP_VERSION='FALSE'
  [host] KMP_WARNINGS='TRUE'
  [host] OMP_AFFINITY_FORMAT='OMP: pid %P tid %i thread %n bound to OS proc set {%A}'
  [host] OMP_ALLOCATOR='omp_default_mem_alloc'
  [host] OMP_CANCELLATION='FALSE'
  [host] OMP_DEBUG='disabled'
  [host] OMP_DEFAULT_DEVICE='0'
  [host] OMP_DISPLAY_AFFINITY='FALSE'
  [host] OMP_DISPLAY_ENV='VERBOSE'
  [host] OMP_DYNAMIC='FALSE'
  [host] OMP_MAX_ACTIVE_LEVELS='2147483647'
  [host] OMP_MAX_TASK_PRIORITY='0'
  [host] OMP_NESTED='FALSE'
  [host] OMP_NUM_THREADS: value is not defined
  [host] OMP_PLACES: value is not defined
  [host] OMP_PROC_BIND='false'
  [host] OMP_SCHEDULE='static'
  [host] OMP_STACKSIZE='8M'
  [host] OMP_TARGET_OFFLOAD=DEFAULT
  [host] OMP_THREAD_LIMIT='2147483647'
  [host] OMP_TOOL='enabled'
  [host] OMP_TOOL_LIBRARIES: value is not defined
  [host] OMP_WAIT_POLICY='PASSIVE'
  [host] KMP_AFFINITY='noverbose,warnings,respect,granularity=core,none'
OPENMP DISPLAY ENVIRONMENT END

Another weird thing that I noted is that even though the make tests_bin and the python -m Corrfunc.tests have the same number of bins, the total number of cell-pairs created is different -- 359833 cell pairs when called from python, and 360051 from C (through make). This could be why the static scheduling only fails mildly, rather than spectacularly as with dynamic scheduling.

I have also discovered that there is a race condition with numdone that results in numdone > num_cell_pairs. This race condition does not change even if I change the numdone update from atomic to critical.

JEEZ

@manodeep
Copy link
Owner Author

Another weird thing that I noted is that even though the make tests_bin and the python -m Corrfunc.tests have the same number of bins, the total number of cell-pairs created is different -- 359833 cell pairs when called from python, and 360051 from C (through make). This could be why the static scheduling only fails mildly, rather than spectacularly as with dynamic scheduling.

Solved this one - the C tests do not enable_min_sep_opt whereas the python tests do. When I set options.enable_min_sep_opt = 1 within test_periodic_lin.c, then I get the 359833 cell-pairs.

@manodeep
Copy link
Owner Author

I did some further reading (can't find the reference - the name looked familiar - so someone in the open-source python ecosystem) and it is possible that the OpenMP issues are occurring because OpenMP is linked while the static library is generated (i.e., libcountpairs.a etc) and those static libraries are then again linked into the dynamic C extension shared library. Might be worthwhile to create a dynamic library for the Corrfunc pair-counters in the first place (or use a dynamic library when linking to the C extensions).

@manodeep
Copy link
Owner Author

manodeep commented Aug 30, 2023

Creating shared libraries (instead of static libraries) did not resolve the issue of multiple OpenMP runtime libraries (libgomp from compiling with GCC and libiomp5 from mkl within numpy). However, setting "export MKL_THREADING_LAYER=GNU" resolved the problem entirely.

Source - https://github.com/joblib/threadpoolctl/blob/master/multiple_openmp.md
(This isn't the article I was referring to in the previous comment)

@manodeep
Copy link
Owner Author

Setting that environment variable with os.environ does not work. Though I am not sure I understand why that is, since the documentation says all child processes should inherit the new environment variable. I added a small snippet of code with DD to test whether the OMP runtime environment was working correctly:

#if defined(_OPENMP)
#include <omp.h>

int test_omp_duplicate_runtime_lib_DOUBLE(const int64_t N);

int test_omp_duplicate_runtime_lib_DOUBLE(const int64_t N)
{
    int64_t correct_sum=0.0;
    for(int64_t i=0;i<N;i++) {
        correct_sum += i;
    }

    int64_t omp_sum=0.0;
#pragma omp parallel for schedule(dynamic) reduction(+:omp_sum)
    for(int64_t i=0;i<N;i++) {
        omp_sum += i;
    }
    if(omp_sum != correct_sum) {
        fprintf(stderr,"Basic OMP reduction is causing errors. Expected sum = %"PRId64" sum with OpenMP + reduction = %"PRId64"\n",
                correct_sum, omp_sum);
        return EXIT_FAILURE;
    }
    fprintf(stderr,"Correct result returned\n");
    return EXIT_SUCCESS;
}
#endif

For the case with duplicate OMP libraries, this test code errors with schedule(dynamic) but seems to work fine with schedule(static).

Separately, when setting MKL_THREADING_LAYER=GNU outside within the terminal before invoking Corrfunc (python call_correlation_functions.py) still seems to produce slightly incorrect (but not wildly incorrect) results for the final 2-3 bins.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants