ltrharvest will hang when run on files stored in NFS #923

Open
rmhubley opened this issue Oct 4, 2019 · 2 comments
rmhubley commented Oct 4, 2019

Problem description

gt ltrharvest can hang indefinitely in the rpc_wait_bit_killable state under certain circumstances, with no apparent timeout or failure.

Exact command line call triggering the problem

% cd /my-nfsv3-mounted-filesystem
% genometools-1.5.10/bin/gt suffixerator -db seq2.fa -indexname seq2 -tis -suf -lcp -des -ssp -sds -dna
% genometools-1.5.10/bin/gt ltrharvest -index seq2

*** Notice that it's taking way too long ***

% ps -elf | grep ltrharvest
0 D rhubley    4912 414105  0  80   0 -  5094 rpc_wa 08:48 pts/31   00:00:00 gt ltrharvest -index seq2

% cat /proc/4912/wchan
rpc_wait_bit_killable

% grep "State:" /proc/4912/status
State:	D (disk sleep)

% cat /proc/4912/syscall 
72 0x3 0x7 0x7fff83708d80 0x7fff83708160 0x0 0x0 0x7fff83708d00 0x7fd455bbb874

% grep 72 /usr/include/asm/unistd_64.h 
#define __NR_fcntl 72

% cat /proc/4912/stack
[<ffffffffa05b7bd4>] rpc_wait_bit_killable+0x24/0xb0 [sunrpc]
[<ffffffffa05b881a>] __rpc_execute+0x18a/0x4a0 [sunrpc]
[<ffffffffa05bc97b>] rpc_execute+0x6b/0xd0 [sunrpc]
[<ffffffffa05af4f0>] rpc_run_task+0x70/0x90 [sunrpc]
[<ffffffffa05af560>] rpc_call_sync+0x50/0xc0 [sunrpc]
[<ffffffffa052adf5>] nlmclnt_call+0xb5/0x330 [lockd]
[<ffffffffa052b5cf>] nlmclnt_proc+0x21f/0x830 [lockd]
[<ffffffffa05492c1>] nfs3_proc_lock+0x21/0x30 [nfsv3]
[<ffffffffa064fb00>] do_setlk+0x100/0x120 [nfs]
[<ffffffffa064fbd1>] nfs_lock+0xb1/0x1b0 [nfs]
[<ffffffff812734c1>] vfs_lock_file+0x21/0x40
[<ffffffff812737aa>] do_lock_file_wait.part.25+0x4a/0xf0
[<ffffffff81274def>] fcntl_setlk+0x14f/0x2f0
[<ffffffff812310c0>] SyS_fcntl+0x340/0x670
[<ffffffff8175579e>] system_call_fastpath+0x18/0xd8
[<ffffffffffffffff>] 0xffffffffffffffff
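
The /proc/4912/syscall line above can be read field by field: the first field is the syscall number (72, fcntl), followed by its arguments, here file descriptor 0x3 and command 0x7. On Linux x86_64 the value 7 is F_SETLKW, the blocking form of the lock request. A minimal sketch of that decoding follows; the helper names are hypothetical and not part of GenomeTools or any system tool.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>

/* Hypothetical helper: name the record-locking command passed as the
   second argument of fcntl(). */
static const char *fcntl_lock_cmd_name(unsigned long cmd)
{
  if (cmd == (unsigned long) F_GETLK)  return "F_GETLK";
  if (cmd == (unsigned long) F_SETLK)  return "F_SETLK";
  if (cmd == (unsigned long) F_SETLKW) return "F_SETLKW";
  return "other";
}

/* Hypothetical helper: parse the start of a /proc/<pid>/syscall line,
   "<nr> <arg0> <arg1> ...". For fcntl, arg0 is the file descriptor and
   arg1 is the command. Returns 0 on success, -1 on a malformed line. */
static int parse_syscall_line(const char *line, long *nr,
                              unsigned long *fd, unsigned long *cmd)
{
  assert(line != NULL && nr != NULL && fd != NULL && cmd != NULL);
  /* %lx accepts the optional "0x" prefix used in the proc file. */
  return sscanf(line, "%ld %lx %lx", nr, fd, cmd) == 3 ? 0 : -1;
}
```

Feeding the line from the report into parse_syscall_line() and passing the resulting command to fcntl_lock_cmd_name() identifies the call as a blocking lock request on descriptor 3, which matches the fcntl_setlk and nlmclnt_proc frames in the stack.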

If file locking isn't available, the process shouldn't hang. It should either die early (should locking be absolutely necessary) or time out after a reasonable amount of time. As it stands, this holds up pipelines when a user inadvertently runs on an unsupported filesystem.
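
One way to get the "die early or time out" behavior is to probe locking in a short-lived child process before relying on it. The stack above shows the wait happening inside the lock RPC itself, so a non-blocking F_SETLK in the main process might stall in the same place; a child can be abandoned instead. The sketch below is an assumption-laden illustration, not GenomeTools code: probe_file_locking() is a hypothetical helper, and it relies on the wait being killable (as the name rpc_wait_bit_killable suggests) so that SIGKILL ends the child.

```c
#include <assert.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper, not part of GenomeTools.
   Returns 1 if a whole-file advisory lock on `path` could be taken and
   released within timeout_ms; 0 if the file could not be opened or locked,
   or the attempt timed out; -1 if fork() or waitpid() failed. */
static int probe_file_locking(const char *path, long timeout_ms)
{
  const struct timespec step = { 0, 10 * 1000 * 1000 }; /* 10 ms */
  long waited_ms = 0;
  int status;
  pid_t pid;

  assert(path != NULL);
  pid = fork();
  if (pid < 0)
    return -1;

  if (pid == 0) {
    /* Child: a stalled lock request blocks only this process. */
    struct flock fl = { 0 };
    int fd = open(path, O_RDWR);
    if (fd < 0)
      _exit(2);
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET; /* l_start = l_len = 0 covers the whole file */
    if (fcntl(fd, F_SETLK, &fl) == -1)
      _exit(1);
    fl.l_type = F_UNLCK;
    (void) fcntl(fd, F_SETLK, &fl);
    _exit(0);
  }

  /* Parent: poll for the child and give up after timeout_ms. */
  for (;;) {
    pid_t r = waitpid(pid, &status, WNOHANG);
    if (r == pid)
      return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 1 : 0;
    if (r < 0)
      return -1;
    if (waited_ms >= timeout_ms) {
      /* Assumption: the child's wait is killable, so SIGKILL ends it. */
      kill(pid, SIGKILL);
      waitpid(pid, &status, 0);
      return 0;
    }
    nanosleep(&step, NULL);
    waited_ms += 10;
  }
}
```

A caller could run the probe once on the index file and, on a result of 0, either abort with a clear error message or continue without locking, depending on whether locking is considered essential.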

Example minimal input triggering the problem

Any input

What GenomeTools version are you reporting an issue for (as output by gt -version)?

genometools-1.5.10/bin/gt -version
/home/rhubley/src/genometools-1.5.10/bin/gt (GenomeTools) 1.5.10
Copyright (c) 2003-2016 G. Gremme, S. Steinbiss, S. Kurtz, and CONTRIBUTORS
Copyright (c) 2003-2016 Center for Bioinformatics, University of Hamburg
See LICENSE file or http://genometools.org/license.html for license details.

Used compiler: cc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-36.0.1)
Compile flags: -g -Wall -Wunused-parameter -pipe -fPIC -Wpointer-arith -Wno-unknown-pragmas -O3 -Werror

Did you compile GenomeTools from source? If so, please state the make parameters used.

make threads=yes cairo=no

What operating system (e.g. Ubuntu, Mac OS X), OS version (e.g. 15.10, 10.11) and platform (e.g. x86_64) are you using?

RedHat 7.6 - 4.1.12-124.24.3.el7uek.x86_64
NFS V3 with advisory locking through NLM

NOTE: Although I haven't tested it, I suspect NFS V4 fixes this problem.

satta (Member) commented Oct 4, 2019

The question is whether we really need the locking at all. AFAICS, explicit locking is done via fcntl() wrapped in gt_xflock_with_op(), which is only used in gt_fa_lock*(), which in turn is only used in GtMD5Tab.
Maybe there's an easier solution that avoids all the fragility of trying to lock over NFS (e.g., might a simple lockfile be sufficient?).

@gordon any opinions?
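
For illustration, the lockfile idea could look roughly like the sketch below. It relies on open() with O_CREAT | O_EXCL failing with EEXIST when the file already exists; according to the Linux open(2) manual page, O_EXCL is supported on NFS when using NFSv3 or later on kernel 2.6 or later. The function names, the retry policy, and the lockfile path are all hypothetical and not taken from GenomeTools, and the sketch does not handle stale lockfiles left behind by a crashed process.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: try to create `lockpath` exclusively, retrying up to
   max_tries times with delay_ms between attempts. Returns 0 once the
   lockfile has been created. Returns -1 with errno == EEXIST if another
   holder kept the lock for all attempts, or -1 with another errno on
   unrelated failures. Never blocks longer than max_tries * delay_ms. */
static int lockfile_acquire(const char *lockpath, int max_tries, long delay_ms)
{
  struct timespec delay;
  int i;

  assert(lockpath != NULL && max_tries > 0 && delay_ms >= 0);
  delay.tv_sec = delay_ms / 1000;
  delay.tv_nsec = (delay_ms % 1000) * 1000000L;

  for (i = 0; i < max_tries; i++) {
    /* O_EXCL makes creation fail if the file already exists. */
    int fd = open(lockpath, O_CREAT | O_EXCL | O_WRONLY, 0644);
    if (fd >= 0) {
      close(fd);
      return 0;
    }
    if (errno != EEXIST)
      return -1;
    if (i + 1 < max_tries)
      nanosleep(&delay, NULL);
  }
  errno = EEXIST;
  return -1;
}

/* Hypothetical helper: release the lock by removing the lockfile. */
static int lockfile_release(const char *lockpath)
{
  return unlink(lockpath);
}
```

Because every attempt is a plain open() call with a bounded number of retries, a caller always gets an answer after a predictable delay and can decide to fail or to proceed without the lock.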

rmhubley (Author) commented Oct 4, 2019

If the only reason you are locking is to ensure the correct calculation or validation of the MD5 checksum, it is probably not necessary at all. How likely is a race condition on the modification or use of these files? But, as you say, a simple lockfile would suffice.

EDIT: This could be a problem with the 7.6 kernel, as other NFS V3 systems here do not exhibit the blocking behavior. We are going to upgrade to at least 7.7 and see if that makes a difference. I would, however, suggest relaxing the locking behavior unless it's necessary. Our site may not be a rare case, and many problems have been reported with file locking over NFS prior to V4 (which many sites haven't upgraded to yet).
