Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Out of Memory Issues Caused by in intents #25005

Open
ShreyasKhandekar opened this issue May 7, 2024 · 2 comments
Open

Out of Memory Issues Caused by in intents #25005

ShreyasKhandekar opened this issue May 7, 2024 · 2 comments

Comments

@ShreyasKhandekar
Copy link
Contributor

ShreyasKhandekar commented May 7, 2024

Summary of Problem

in intents on functions can cause out of memory issues that crash chapel programs.

Description:
This issue is prompted by a bug reported by one of the Arkouda devs here: Bears-R-Us/arkouda#3089.

In the case where it was reported, we had the following line in the Arkouda source code:

var temp = makeDistArray(a);

The makeDistArray function is part of the tryCreateArray interface which throws an ArrayOomError error when the allocation fails because we're out of memory.

However, the Arkouda issue above reported that this interface was not working as expected and was causing a server crash even when setting CHPL_GASNET_SEGMENT=fast (which is the configuration in which this interface for throwing on OOM works best.)

Here the makeDistArray(a) has the following function header:

proc makeDistArray(in a: [?D] ?etype) throws where [irrelevant clause]

Now because of this in intent, the array is copied when the function is called, this invokes our chpl__initCopy function in the ChapelArray.chpl module.

pragma "init copy fn"
proc chpl__initCopy(const ref rhs: [], definedConst: bool) {
pragma "no copy"
var lhs = chpl__coerceCopy(rhs.type, rhs, definedConst);
return lhs;
}

This function does NOT use the tryCreateArray interface and therefore the Arkouda server crashes when the program runs out of memory here.

We should have a discussion about whether this initCopy should be made to use the throwing interface so that the problem can be caught instead of just crashing.

Is this a blocking issue with no known work-arounds?
The original issue was resolved by using a different version of makeDistArray which avoided the array being copied using the in intent.
Thanks a lot to @jeremiah-corrado for suggesting the workaround and helping resolve this issue!

Steps to Reproduce

I haven't found time to come up with a small reproducer which does not involve Arkouda.
But with Arkouda we can do the following in gasnet config

Launch arkouda server:

./arkouda_server -nl 6 --logLevel=LogLevel.DEBUG --ServerDaemon.daemonTypes=ServerDaemonType.DEFAULT,ServerDaemonType.METRICS

(TODO, investigate single locale reproducer)

Run the following commands in the python client:

import arkouda as ak
ak.connect()
pd = ak.randint(131, 131, 1393000000)
ak.argsort(pd)

Now the array size is a tricky argument.
In order to get the failure here the array size needs to be:

  1. Large enough that the machine runs out of memory
  2. Small enough such that the overMemLimit function in Arkouda estimates that it will take less memory that it actually does
    i. This function is how we checked for OOM errors before the tryCreateArray interface and it's estimates aren't always accurate.
  3. Large enough such that the machine runs out of memory at the chpl__initCopy step (i.e. before the makeDistArray function body is actually entered)

On my mac this size was 493000000 but on ChapDL it was 1393000000.

Configuration Information

Reveal
  • Output of chpl --version:
chpl version 2.1.0 pre-release (7e181b1d05)
  built with LLVM version 15.0.4
  available LLVM targets: nvptx64, nvptx, x86-64, x86, amdgcn, r600
Copyright 2020-2024 Hewlett Packard Enterprise Development LP
Copyright 2004-2019 Cray Inc.
(See LICENSE file for more details)
  • Output of $CHPL_HOME/util/printchplenv --anonymize:
CHPL_TARGET_PLATFORM: linux64
CHPL_TARGET_COMPILER: llvm
CHPL_TARGET_ARCH: x86_64
CHPL_TARGET_CPU: none *
CHPL_LOCALE_MODEL: flat
CHPL_COMM: gasnet *
  CHPL_COMM_SUBSTRATE: udp
  CHPL_GASNET_SEGMENT: fast *
CHPL_TASKS: qthreads
CHPL_LAUNCHER: amudprun
CHPL_TIMERS: generic
CHPL_UNWIND: none
CHPL_MEM: jemalloc
CHPL_ATOMICS: cstdlib
  CHPL_NETWORK_ATOMICS: none
CHPL_GMP: bundled
CHPL_HWLOC: bundled
CHPL_RE2: bundled
CHPL_LLVM: system
CHPL_AUX_FILESYS: none

-- gcc --version

gcc (SUSE Linux) 7.5.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
@mppf
Copy link
Member

mppf commented May 7, 2024

We should have a discussion about whether this initCopy should be made to use the throwing interface so that the problem can be caught instead of just crashing.

Is this the main point of this issue? I see the bug and compiler labels but this feels more like a feature request to me. Anyway, if this is the main point, maybe we can give it a different title to focus on that?

The language has a bunch of features that can request a copy of an array. We try to optimize these to some extent, so an in formal won't necessarily cause a copy. (Of course, it did in the problematic case here). There are quite a few more of these than just in intent (var x = y; and returning by value come to mind). I can't imagine removing these (besides, they are stable), and at the end of the day, any language that allows allocating memory will also allow one to write programs that run out of memory. But, we can make tools and warnings and linters etc.

@ShreyasKhandekar
Copy link
Contributor Author

Ah, Github added the Bug label on its own when I filed it. I changed it to feature request.
I think I agree with you that, in general, people can always write programs that run out of memory. I think this issue is just some fodder to think about it in terms of 'should we allow programs to recover from such errors instead of just crash?' and 'What are the cases where we would want to support this?'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants