weird behavior with verbose=True #224

kstoreyf · 2020-06-30T07:56:53Z

General information

Corrfunc version: 2.3.2
platform: linux
installation method (pip/source/other?): pip

Issue description

I have a strange issue running Corrfunc on my cluster, that I can't reproduce on another cluster. I admit it's quite weird but can't figure it out.
When I submit a Corrfunc job to my cluster, it sometimes crashes partway and hangs indefinitely. Specifically, it hangs when verbose=True, and when I loop over a high enough number of mocks. The same code works fine on an interactive node.

To be clear:
verbose=True, Nmocks=100: hangs
verbose=False, Nmocks=100: works
verbose=True, Nmocks=10: works

This seems like some sort of memory issue, given the Nmocks trend. I also had a similar issue when messing with the code and building from source, and I found the issue to be some printlines I had added in C that I think were causing a memory leak.

Expected behavior

The script completes successfully.

Actual behavior

The job crashes hangs indefinitely.

What have you tried so far?

Running on an interactive node, which works; running on a different cluster's queue system, which works. Also building the code from source.

Minimal failing example

import Corrfunc

Nmocks = 100
verbose = True #fails when true! works when False.

N = 300
L = 750
r_edges = np.arange(40.0, 210.0, 10.0)
nthreads = 1

for n in range(Nmocks):

        # Generate cubic mock
        x = np.random.rand(N)*float(L)
        y = np.random.rand(N)*float(L)
        z = np.random.rand(N)*float(L)

        res = Corrfunc.theory.xi(L, nthreads, r_edges, x, y, z, verbose=verbose)

manodeep · 2020-06-30T08:14:45Z

Thanks for the report. Could this be related to #197 ?

lgarrison · 2020-06-30T15:36:19Z

This is pretty weird! Could you share the job submission script that you're using to run this example? I'm wondering if some kind of stderr/stdout buffer could be filling up (not that I've ever seen that before).

Do you know how many iterations the 100 mock example gets through before it hangs? Does it always hang on the same iteration?

Also, this isn't on the Flatiron cluster by any chance, is it? If it were, I could run the example myself.

kstoreyf · 2020-06-30T17:26:36Z

i suspected something like a full buffer! yes this example actually hangs at 13 iterations, independent of Nmocks or Nbins.

it's on the sirocco cluster at nyu - actually thanks for pointing out #197 because @zxzhai might be running on the same cluster? zhongxu, would be great if you don't mind testing this on sirocco.

here's the job submission script:

#PBS -N cf_min
#PBS -l walltime=40:00:00
#PBS -l nodes=1:ppn=1
#PBS -l mem=100MB #changing this doesn't seem to do anything, strangely 
#PBS -o output/$PBS_JOBNAME.out
#PBS -j oe

cd $PBS_O_WORKDIR
python cf_minimal.py

lgarrison · 2020-06-30T17:44:54Z

So if you put a Python print statement in the loop, it prints 13 times? Do you get 13 chunks of verbose Corrfunc output too? The Corrfunc output comes from the C level, so it would be good to check that print statements from both the C and Python level are making it through. I'm wondering if the wurlitzer package we use to capture output from the C extension could be causing trouble somehow?

…

On Tue, Jun 30, 2020 at 1:26 PM Kate Storey-Fisher ***@***.***> wrote: i suspected something like a full buffer! yes this example actually hangs at 13 iterations, independent of Nmocks or Nbins. it's on the sirocco cluster at nyu - actually thanks for pointing out #197 <#197> because @zxzhai <https://github.com/zxzhai> might be running on the same cluster? zhongxu, would be great if you don't mind testing this on sirocco. here's the job submission script: #PBS -N cf_min#PBS -l walltime=40:00:00#PBS -l nodes=1:ppn=1#PBS -l mem=100MB #changing this doesn't seem to do anything, strangely #PBS -o output/$PBS_JOBNAME.out#PBS -j oe cd $PBS_O_WORKDIR python cf_minimal.py — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#224 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABLA7SY5473BLAWRRND4UJLRZIN5XANCNFSM4OL6VKCQ> .

-- Lehman Garrison <lgarrison@flatironinstitute.org> Flatiron Research Fellow, Cosmology X Data Science Group Center for Computational Astrophysics, Flatiron Institute lgarrison.github.io

kstoreyf · 2020-06-30T18:01:14Z

so the other strange/frustrating thing is that when it hangs, i have to cancel the job, and then the output file is empty (or if i print things about the nodes, it has that, but no printlines from the code itself). i figured out the 13 by appending to file each iteration. maybe i could also redirect the output to a file or something?

when i run on interactive node, i see both python and c-level Corrfunc output.

oh when i was building from source earlier i actually did have some trouble with wurlitzer's syspipes....

lgarrison · 2020-06-30T18:17:33Z

If it's convenient for you to run Corrfunc from source, could you try a few wurlitzer-related things? Specifically, in utils.py, there's this snippet in sys_pipes():

Corrfunc/Corrfunc/utils.py

Lines 1044 to 1055 in 75b8ac9

    
           kwargs = {'stdout':None if sys.stdout.isatty() else sys.stdout, 
        
                     'stderr':None if sys.stderr.isatty() else sys.stderr } 
        
           # Redirection might break for any number of reasons, like 
        
           # stdout/err already being closed/redirected.  We probably 
        
           # prefer not to crash in that case and instead continue 
        
           # without any redirection. 
        
           try: 
        
               with wurlitzer.pipes(**kwargs): 
        
                   yield 
        
           except: 
        
               yield

Could you print the values of sys.stdout.isatty() and sys.stderr.isatty()?

Could you also try adding raise in the except clause? Maybe there's an exception being masked.

kstoreyf · 2020-06-30T20:08:40Z

could you clarify the syntax for adding raise? want to make sure i'm doing it correctly.

a) on the interactive node, both sys.stdout.isatty() and sys.stderr.isatty() are true.
b) when running Nmocks=10 and verbose=True, both are false.
c) when running Nmocks=20 (>13) and verbose=False, both are false.
d) when running Nmocks=20 and verbose=True, can't see the output!

for case b), verbose output from the C code is printed - but all before the python output, even though it should be interspersed between the python for the realizations (as happens properly in case a)).

lgarrison · 2020-06-30T20:15:31Z

Okay, it sounds like the TTY detection is working properly. The interspersed output is likely just due to buffering and is not unusual. Adding flush=True as an argument to print() would probably fix this, but it's probably not related to the real problem. For the exception, you can just do "raise" as the first line in the "except" block, and Python will figure out to re-raise the exception that was caught by the current block.

…

On Tue, Jun 30, 2020 at 4:08 PM Kate Storey-Fisher ***@***.***> wrote: could you clarify the syntax for adding raise? want to make sure i'm doing it correctly. a) on the interactive node, both sys.stdout.isatty() and sys.stderr.isatty() are true. b) when running Nmocks=10 and verbose=True, both are false. c) when running Nmocks=20 (>13) and verbose=False, both are false. d) when running Nmocks=20 and verbose=True, can't see the output! for case b), verbose output from the C code is printed - but all before the python output, even though it should be interspersed between the python for the realizations (as happens properly in case a)). — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#224 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABLA7S6VEMQE3EXUEJFEEYLRZJA5LANCNFSM4OL6VKCQ> .

-- Lehman Garrison <lgarrison@flatironinstitute.org> Flatiron Research Fellow, Cosmology X Data Science Group Center for Computational Astrophysics, Flatiron Institute lgarrison.github.io

kstoreyf · 2020-06-30T20:41:40Z

ok thanks that's what i was doing, it's not changing anything when i add the "raise" line. in the cases i can see the output, it's not reaching the except block (can see from adding printlines).

and yes i've seen the interspersing before, probably not the issue here.

lgarrison · 2020-06-30T21:15:22Z

Okay, not seeing a problem yet. The following is a little paranoid, but easy to check. Could you pip install psutil and place the following in the Wurlitzer context: p = psutil.Process() print(p.num_fds()) Aside from that, there are a few places in the code where the verbose flag could, in theory, lead to an early exit, like this one: https://github.com/manodeep/Corrfunc/blob/master/theory/xi/countpairs_xi_impl.c.src#L115-L129 Or if we had a bad printf formatting, that could lead to out-of-bounds writes and undefined behavior. There's also the progress bar stuff. I'm not familiar enough with that part of the code to know whether it's a risk.

…

On Tue, Jun 30, 2020 at 4:41 PM Kate Storey-Fisher ***@***.***> wrote: ok thanks that's what i was doing, it's not changing anything when i add the "raise" line. in the cases i can see the output, it's not reaching the except block (can see from adding printlines). and yes i've seen the interspersing before, probably not the issue here. — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#224 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABLA7S7V3FFJRB6OO4UCYHTRZJEZHANCNFSM4OL6VKCQ> .

-- Lehman Garrison <lgarrison@flatironinstitute.org> Flatiron Research Fellow, Cosmology X Data Science Group Center for Computational Astrophysics, Flatiron Institute lgarrison.github.io

kstoreyf · 2020-06-30T22:04:16Z

assuming you mean the wurlitzer context you quoted above. for the submitted job Nmocks=14 verbose=True, output of num_fds() is:

n=0: 14
n=1: 15
...
n=11: 25
n=12: 15 (suspiciously close to when it fails on 13)
n=13: 16

if i more mocks with verbose=False, it keeps increasing then jumps back to 16 at n=31, and so on (with slightly diff mins and maxs).

i see this is the number of file descriptors opened, not sure what this means tho!

manodeep · 2020-06-30T22:25:03Z

Within the C code, everything is printed to stderr, except for the final result that might be written out from a command-line executable. That means nothing is printed out to stdout if you are using the python interface.

I am surprised that num_fds is not constant. Doesn't that mean there is resource leak occurring somewhere?

manodeep · 2020-06-30T22:33:26Z

The progressbar code should be okay (just looked through the code), but it has string handling - so who knows!

lgarrison · 2020-06-30T23:01:17Z

I'm not sure about the number of fds. I guess it is a leak, but this is way under the default max number of fds, which I think is 1024. But maybe this cluster has a lower limit? I will have to look up the query syntax...

…

On Tue, Jun 30, 2020 at 6:33 PM Manodeep Sinha ***@***.***> wrote: The progressbar code should be okay (just looked through the code), but it has string handling - so who knows! — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#224 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABLA7S23FW33MQJAPO2J2CDRZJR4HANCNFSM4OL6VKCQ> .

-- Lehman Garrison <lgarrison@flatironinstitute.org> Flatiron Research Fellow, Cosmology X Data Science Group Center for Computational Astrophysics, Flatiron Institute lgarrison.github.io

kstoreyf · 2020-06-30T23:27:03Z

thanks both! let me know if you have other ideas i can test. for now i can just turn off verbose, but it took me a while to diagnose, so maybe having this here will help someone else!

lgarrison mentioned this issue Dec 21, 2021

script hangs when using unbuffered output #269

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

weird behavior with verbose=True #224

weird behavior with verbose=True #224

kstoreyf commented Jun 30, 2020 •

edited

manodeep commented Jun 30, 2020

lgarrison commented Jun 30, 2020

kstoreyf commented Jun 30, 2020

lgarrison commented Jun 30, 2020 via email

kstoreyf commented Jun 30, 2020

lgarrison commented Jun 30, 2020

kstoreyf commented Jun 30, 2020

lgarrison commented Jun 30, 2020 via email

kstoreyf commented Jun 30, 2020

lgarrison commented Jun 30, 2020 via email

kstoreyf commented Jun 30, 2020

manodeep commented Jun 30, 2020

manodeep commented Jun 30, 2020

lgarrison commented Jun 30, 2020 via email

kstoreyf commented Jun 30, 2020

weird behavior with verbose=True #224

weird behavior with verbose=True #224

Comments

kstoreyf commented Jun 30, 2020 • edited

General information

Issue description

Expected behavior

Actual behavior

What have you tried so far?

Minimal failing example

manodeep commented Jun 30, 2020

lgarrison commented Jun 30, 2020

kstoreyf commented Jun 30, 2020

lgarrison commented Jun 30, 2020 via email

kstoreyf commented Jun 30, 2020

lgarrison commented Jun 30, 2020

kstoreyf commented Jun 30, 2020

lgarrison commented Jun 30, 2020 via email

kstoreyf commented Jun 30, 2020

lgarrison commented Jun 30, 2020 via email

kstoreyf commented Jun 30, 2020

manodeep commented Jun 30, 2020

manodeep commented Jun 30, 2020

lgarrison commented Jun 30, 2020 via email

kstoreyf commented Jun 30, 2020

kstoreyf commented Jun 30, 2020 •

edited