weird behavior with verbose=True #224

Open
kstoreyf opened this issue Jun 30, 2020 · 15 comments

@kstoreyf

kstoreyf commented Jun 30, 2020

General information

  • Corrfunc version: 2.3.2
  • platform: linux
  • installation method (pip/source/other?): pip

Issue description

I have a strange issue running Corrfunc on my cluster that I can't reproduce on another cluster. I admit it's quite weird, but I can't figure it out.
When I submit a Corrfunc job to my cluster, it sometimes stalls partway through and hangs indefinitely. Specifically, it hangs when verbose=True and when I loop over a high enough number of mocks. The same code works fine on an interactive node.

To be clear:
verbose=True, Nmocks=100: hangs
verbose=False, Nmocks=100: works
verbose=True, Nmocks=10: works

This seems like some sort of memory issue, given the Nmocks trend. I also had a similar issue when messing with the code and building from source, and I found the issue to be some printlines I had added in C that I think were causing a memory leak.

Expected behavior

The script completes successfully.

Actual behavior

The job hangs indefinitely.

What have you tried so far?

Running on an interactive node, which works; running on a different cluster's queue system, which works. Also building the code from source.

Minimal failing example

import numpy as np
import Corrfunc

Nmocks = 100
verbose = True  # hangs when True, works when False

N = 300
L = 750
r_edges = np.arange(40.0, 210.0, 10.0)
nthreads = 1

for n in range(Nmocks):

        # Generate cubic mock
        x = np.random.rand(N)*float(L)
        y = np.random.rand(N)*float(L)
        z = np.random.rand(N)*float(L)

        res = Corrfunc.theory.xi(L, nthreads, r_edges, x, y, z, verbose=verbose)

@manodeep
Owner

Thanks for the report. Could this be related to #197 ?

@lgarrison
Collaborator

This is pretty weird! Could you share the job submission script that you're using to run this example? I'm wondering if some kind of stderr/stdout buffer could be filling up (not that I've ever seen that before).
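
To make the buffer hypothesis concrete, here is a minimal Python sketch. It shows generic OS behavior and is not Corrfunc or wurlitzer code; `pipe_capacity` is a name made up for this illustration. A pipe holds only a finite number of bytes, so a blocking write to a pipe that nobody reads waits forever once the pipe is full. The sketch puts the write end in non-blocking mode so that it reports the capacity instead of hanging:

```python
import os


def pipe_capacity(chunk_size=1024):
    """Fill a pipe that nobody reads and return how many bytes it accepted."""
    read_fd, write_fd = os.pipe()
    # Non-blocking mode: a full pipe raises BlockingIOError instead of hanging.
    os.set_blocking(write_fd, False)
    written = 0
    try:
        while True:
            written += os.write(write_fd, b"x" * chunk_size)
    except BlockingIOError:
        pass  # the pipe is full; a blocking writer would be stuck here
    finally:
        os.close(read_fd)
        os.close(write_fd)
    return written


if __name__ == "__main__":
    print("pipe accepted", pipe_capacity(), "bytes before filling up")
```

If something like this were happening here, the iteration at which the job stalls would depend on how much verbose output each call produces rather than on the total number of mocks.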

Do you know how many iterations the 100 mock example gets through before it hangs? Does it always hang on the same iteration?

Also, this isn't on the Flatiron cluster by any chance, is it? If it were, I could run the example myself.

@kstoreyf
Author

i suspected something like a full buffer! yes this example actually hangs at 13 iterations, independent of Nmocks or Nbins.

it's on the sirocco cluster at nyu - actually thanks for pointing out #197 because @zxzhai might be running on the same cluster? zhongxu, would be great if you don't mind testing this on sirocco.

here's the job submission script:

#PBS -N cf_min
#PBS -l walltime=40:00:00
#PBS -l nodes=1:ppn=1
#PBS -l mem=100MB #changing this doesn't seem to do anything, strangely 
#PBS -o output/$PBS_JOBNAME.out
#PBS -j oe

cd $PBS_O_WORKDIR
python cf_minimal.py

@lgarrison
Collaborator

lgarrison commented Jun 30, 2020 via email

@kstoreyf
Author

so the other strange/frustrating thing is that when it hangs, i have to cancel the job, and then the output file is empty (or if i print things about the nodes, it has that, but no printlines from the code itself). i figured out the 13 by appending to file each iteration. maybe i could also redirect the output to a file or something?
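
The append-to-file trick mentioned above can be sketched roughly like this (the function name `record_progress` and the log file name are hypothetical). Opening in append mode and forcing the data to disk after every iteration means the last completed iteration is still on record if the job has to be cancelled:

```python
import os


def record_progress(path, iteration):
    """Append one line per finished iteration and push it to disk right away."""
    with open(path, "a") as fh:
        fh.write(f"finished iteration {iteration}\n")
        fh.flush()             # empty Python's own buffer
        os.fsync(fh.fileno())  # ask the OS to write the data to disk
```

Calling `record_progress("progress.log", n)` as the last statement inside the loop over mocks leaves a file whose final line names the last iteration that completed.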

when i run on interactive node, i see both python and c-level Corrfunc output.

oh when i was building from source earlier i actually did have some trouble with wurlitzer's syspipes....

@lgarrison
Collaborator

If it's convenient for you to run Corrfunc from source, could you try a few wurlitzer-related things? Specifically, in utils.py, there's this snippet in sys_pipes():

Corrfunc/Corrfunc/utils.py

Lines 1044 to 1055 in 75b8ac9

kwargs = {'stdout':None if sys.stdout.isatty() else sys.stdout,
          'stderr':None if sys.stderr.isatty() else sys.stderr }
# Redirection might break for any number of reasons, like
# stdout/err already being closed/redirected. We probably
# prefer not to crash in that case and instead continue
# without any redirection.
try:
    with wurlitzer.pipes(**kwargs):
        yield
except:
    yield

Could you print the values of sys.stdout.isatty() and sys.stderr.isatty()?

Could you also try adding raise in the except clause? Maybe there's an exception being masked.
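
One way these two checks could look, as a sketch only: `sys_pipes_debug` and its `pipes` parameter are hypothetical stand-ins so that the example runs without wurlitzer installed. In utils.py the `with` line is `wurlitzer.pipes(**kwargs)` as quoted above.

```python
import sys
from contextlib import contextmanager, nullcontext


@contextmanager
def sys_pipes_debug(pipes=nullcontext):
    """Debug variant of the quoted snippet; `pipes` stands in for wurlitzer.pipes."""
    # Report whether each stream is attached to a terminal.
    print("stdout isatty:", sys.stdout.isatty(), file=sys.stderr)
    print("stderr isatty:", sys.stderr.isatty(), file=sys.stderr)
    try:
        with pipes():
            yield
    except Exception as exc:
        print("exception inside sys_pipes:", repr(exc), file=sys.stderr)
        raise  # a bare raise re-raises the active exception instead of hiding it
```

With the bare `raise` in place, any exception raised while setting up or using the redirection propagates to the caller with its original traceback.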

@kstoreyf
Author

could you clarify the syntax for adding raise? want to make sure i'm doing it correctly.

a) on the interactive node, both sys.stdout.isatty() and sys.stderr.isatty() are true.
b) when running Nmocks=10 and verbose=True, both are false.
c) when running Nmocks=20 (>13) and verbose=False, both are false.
d) when running Nmocks=20 and verbose=True, can't see the output!

for case b), verbose output from the C code is printed - but all before the python output, even though it should be interspersed between the python for the realizations (as happens properly in case a)).
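
The ordering seen in case b) is consistent with Python block-buffering stdout when it is not attached to a terminal, so Python's own lines are written out late. That is an assumption about the cause, not something confirmed in this thread. A small sketch of flushing after every message (the helper name `report` is hypothetical):

```python
import sys


def report(message, stream=None):
    """Print a message and flush immediately so it is not held in a buffer."""
    stream = sys.stdout if stream is None else stream
    print(message, file=stream, flush=True)
```

Using `report(f"mock {n} done")` in place of a plain `print` inside the loop should keep the Python lines in step with the other output if buffering is the reason for the reordering.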

@lgarrison
Collaborator

lgarrison commented Jun 30, 2020 via email

@kstoreyf
Author

ok thanks that's what i was doing, it's not changing anything when i add the "raise" line. in the cases i can see the output, it's not reaching the except block (can see from adding printlines).

and yes i've seen the interspersing before, probably not the issue here.

@lgarrison
Collaborator

lgarrison commented Jun 30, 2020 via email

@kstoreyf
Author

assuming you mean the wurlitzer context you quoted above. for the submitted job Nmocks=14 verbose=True, output of num_fds() is:

n=0: 14
n=1: 15
...
n=11: 25
n=12: 15 (suspiciously close to when it fails on 13)
n=13: 16

if i run more mocks with verbose=False, it keeps increasing then jumps back to 16 at n=31, and so on (with slightly diff mins and maxs).

i see this is the number of file descriptors opened, not sure what this means tho!
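
For context, a file descriptor is the integer handle the OS gives a process for each open file, pipe, or socket, so a count that climbs every iteration suggests something is being opened and not closed right away. Below is a standard-library sketch for Linux. `count_open_fds` and `pipe_leak_demo` are illustrative names, not part of Corrfunc, and this is not necessarily how the numbers above were obtained.

```python
import os


def count_open_fds():
    """Count this process's open file descriptors (Linux /proc, else /dev/fd)."""
    fd_dir = "/proc/self/fd" if os.path.isdir("/proc/self/fd") else "/dev/fd"
    return len(os.listdir(fd_dir))


def pipe_leak_demo():
    """Show that an unclosed pipe raises the count by two until it is closed."""
    before = count_open_fds()
    read_fd, write_fd = os.pipe()  # opens two descriptors
    while_open = count_open_fds()
    os.close(read_fd)
    os.close(write_fd)
    after_close = count_open_fds()
    return before, while_open, after_close
```

In `pipe_leak_demo`, the count while the pipe is open is two higher than before, and it returns to the starting value once both ends are closed.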

@manodeep
Owner

Within the C code, everything is printed to stderr, except for the final result that might be written out from a command-line executable. That means nothing is printed out to stdout if you are using the python interface.

I am surprised that num_fds is not constant. Doesn't that mean there is a resource leak occurring somewhere?

@manodeep
Owner

The progressbar code should be okay (just looked through the code), but it has string handling - so who knows!

@lgarrison
Collaborator

lgarrison commented Jun 30, 2020 via email

@kstoreyf
Author

thanks both! let me know if you have other ideas i can test. for now i can just turn off verbose, but it took me a while to diagnose, so maybe having this here will help someone else!
