Prefilter step died with easy-cluster #822

Open
ivandatasci opened this issue Mar 14, 2024 · 4 comments

ivandatasci commented Mar 14, 2024

Expected Behavior

easy-cluster should finish execution without errors

Current Behavior

mmseqs easy-cluster errors and crashes with:

Error: Prefilter step died
Error: Search died

Steps to Reproduce (for bugs)

a) Get the input sequences, which I have split into three files here to fit within GitHub's upload limit:

my_seqs.1of3.fasta.gz
my_seqs.2of3.fasta.gz
my_seqs.3of3.fasta.gz

b) Consolidate the 3 chunks:

zcat my_seqs.*.fasta.gz > /tmp/my_seqs.fasta
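
To confirm the consolidation worked, a quick record count can be run on the result (a hypothetical sanity check, not part of the original report):

```shell
# count_records FILE: number of FASTA records (header lines) in FILE
count_records() {
  grep -c '^>' "$1"
}
```

For example, `count_records /tmp/my_seqs.fasta` should print a number in the millions for this dataset.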

c) Execute and expose the bug:

/opt/mmseqs/bin/mmseqs easy-cluster \
/tmp/my_seqs.fasta /tmp/my_seqs/result /tmp/my_seqs/tmp \
--dbtype 2 --threads 8 --local-tmp /tmp \
--cluster-reassign -s 7.5 --cov-mode 0 -c 0.98 --cluster-mode 2 --min-seq-id 0.99 -v 1

The resulting error output is shown below.

MMseqs Output (for bugs)

/tmp/my_seqs/tmp/5280277461515018798/clu_tmp/18196956704942050314/nucleotide_clustering.sh: line 48:  4723 Segmentation fault      (core dumped) $RUNNER "$MMSEQS" prefilter "$QUERY" "$INPUT" "${TMP_PATH}/pref" ${PREFILTER_PAR}
Error: Prefilter step died
Error: Search died

Context

In my hands, this bug is exposed only when the number of nucleotide sequences is on the order of millions; for small sets (thousands) the execution completes uneventfully. I have tried the precompiled AVX2 and SSE4.1 builds, my own compilation of the latest release (15-6f452, Oct 31 2023), the latest GitHub version (f6c9880), and other variations. All attempts led to the exact same bug.

I have also tried three other input datasets. All four crash in the same way, and all four contain on the order of 3 to 4 million nucleotide sequences.

When I subset the sequences to about 200K sequences, easy-cluster runs to completion.
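
For reference, subsetting a FASTA file to a fixed number of records can be done with a short awk sketch like the following (a hypothetical helper, not the reporter's actual method):

```shell
# subset_fasta FILE N: print the first N records of a FASTA file
subset_fasta() {
  awk -v n="$2" '/^>/ { c++ } c <= n' "$1"
}
```

For example, `subset_fasta /tmp/my_seqs.fasta 200000 > /tmp/my_seqs.subset.fasta` produces a set small enough for easy-cluster to complete.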

Your Environment

I am running this on an AWS EC2 instance of type g4dn (128 GB RAM):

$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  32
  On-line CPU(s) list:   0-31
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
    CPU family:          6
    Model:               85
    Thread(s) per core:  2
    Core(s) per socket:  16
    Socket(s):           1
    Stepping:            7
    BogoMIPS:            4999.98
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht
                         syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf
                         tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt
                         tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch pti fsgsbase
                         tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke avx512_vnni
Virtualization features: 
  Hypervisor vendor:     KVM
  Virtualization type:   full
Caches (sum of all):     
  L1d:                   512 KiB (16 instances)
  L1i:                   512 KiB (16 instances)
  L2:                    16 MiB (16 instances)
  L3:                    35.8 MiB (1 instance)
NUMA:                    
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-31
Vulnerabilities:         
  Gather data sampling:  Unknown: Dependent on hypervisor status
  Itlb multihit:         KVM: Mitigation: VMX unsupported
  L1tf:                  Mitigation; PTE Inversion
  Mds:                   Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
  Meltdown:              Mitigation; PTI
  Mmio stale data:       Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
  Retbleed:              Vulnerable
  Spec rstack overflow:  Not affected
  Spec store bypass:     Vulnerable
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected
@milot-mirdita
Member

@martin-steinegger

* thread #5, stop reason = EXC_BAD_ACCESS (code=1, address=0x5a7684002)
    frame #0: 0x0000000100169b58 mmseqs`CacheFriendlyOperations<2u>::findDuplicates(this=0x0000600000c08090, output=0x00000005a72a2336, outputSize=580749, computeTotalScore=true) at CacheFriendlyOperations.cpp:229:50
   226 	                const unsigned int element = tmpElementBuffer[n].id;
   227 	                const unsigned int hashBinElement = element >> (MASK_0_5_BIT);
   228 	                output[doubleElementCount].id    = element;
-> 229 	                output[doubleElementCount].count = duplicateBitArray[hashBinElement];
   230 	                output[doubleElementCount].diagonal = tmpElementBuffer[n].diagonal;
   231

(lldb) p hashBinElement
(const unsigned int) 742456
(lldb) p duplicateBitArray
(unsigned char *) 0x00000005b8008000 ""
(lldb) p doubleElementCount
(size_t) 581514
(lldb) p duplicateBitArray
(unsigned char *) 0x00000005b8008000 ""
(lldb) p output[doubleElementCount]
error: Couldn't apply expression side effects : Couldn't dematerialize a result variable: couldn't read its memory
(lldb) p output
(CounterResult *) 0x00000005a72a2336
(lldb) p duplicateBitArray[hashBinElement]
(unsigned char) '\x01'

@milot-mirdita
Member

Also interesting: a lot of over-represented k-mers (same prefix/suffix?)

Query database size: 3083342 type: Nucleotide
Estimated memory consumption: 12G
Target database size: 1541671 type: Nucleotide
Index table k-mer threshold: 0 at k-mer size 15
Index table: counting k-mers
[=================================================================] 100.00% 1.54M 2m 38s 193ms
Index table: Masked residues: 141067
Index table: fill
[=================================================================] 100.00% 1.54M 1m 10s 152ms
Index statistics
Entries:          516344842
DB size:          11146 MB
Avg k-mer size:   0.480884
Top 10 k-mers
    GGGCTCAGGATTCTG	1282098
    CTGCTCTGGGCGCGT	1167098
    TGAGCTGGGCATGAG	1134437
    AAGTTCCTCACTCGG	1086133
    CTGTAAGCTGCTCGT	966085
    AGCTACATTGATCGC	943599
    CAGCGACACTGCTTG	913837
    CCTCGCACGCCTGAG	883990
    CCTCTGCACTCGCTG	827574
    GAGCTGGAAGCTGGT	791516
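
The k-mer skew above can also be checked outside MMseqs2 with a rough awk sketch (illustrative only; k-mers spanning FASTA line wraps are not counted, and memory use grows with the number of distinct k-mers):

```shell
# top_kmers FILE K N: print the N most frequent K-mers over the sequence lines of FILE
top_kmers() {
  awk -v k="$2" '!/^>/ {
      s = toupper($0)
      for (i = 1; i + k - 1 <= length(s); i++) c[substr(s, i, k)]++
    }
    END { for (m in c) print c[m], m }' "$1" | sort -rn | head -n "$3"
}
```

Running `top_kmers /tmp/my_seqs.fasta 15 10` should reproduce counts similar to the table above if the over-representation is real.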

@ivandatasci
Author

ivandatasci commented Mar 15, 2024

> Also interesting: a lot of over-represented k-mers (same prefix/suffix?)

@milot-mirdita

That is correct: these millions of sequences are derived from a small set of common ancestor sequences. In short, they are very similar to one another in some portions.

@milot-mirdita
Member

We have observed before that the prefilter can crash with many very similar sequences. We will have to investigate how to deal with this; we don't have a solution or workaround for now, though.
