Prefilter step died with easy-cluster #822
Comments
Also interesting: a lot of overrepresented k-mers (same prefix/suffix?)

That is correct: these millions of sequences are derived from a small set of common ancestor sequences. In short, they are very similar to one another in some portions.

We have observed before that it's possible to get the prefilter to crash with many very similar sequences. We will have to investigate how to deal with this; for now we don't have a solution or workaround.
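For anyone who wants to check their own input for the overrepresented k-mers mentioned above, here is a minimal sketch. It is not how MMseqs2 counts k-mers internally; the helper names `read_fasta` and `top_kmers` and the default `k=15` are assumptions made for illustration only.

```python
from collections import Counter


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def top_kmers(sequences, k=15, n=10):
    """Return the n most frequent k-mers across all sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        # Slide a window of width k over the sequence.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts.most_common(n)
```

Counting every k-mer of millions of sequences holds a large table in memory, so it is probably best run on a sample of the input, e.g. `top_kmers(seq for _, seq in read_fasta(open("sample.fasta")))`, where `sample.fasta` is a hypothetical file name.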
Expected Behavior
easy-cluster should finish execution without errors
Current Behavior
mmseqs easy-cluster errors and crashes with:
Steps to Reproduce (for bugs)
a) Get the input sequences, which I have split into 3 files to fit within GitHub's upload limits:
my_seqs.1of3.fasta.gz
my_seqs.2of3.fasta.gz
my_seqs.3of3.fasta.gz
b) Consolidate the 3 chunks:
c) Execute and expose the bug:
and the bug is shown below
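The original command for step b) is not preserved here. As a rough equivalent, and only as a sketch, the three gzipped chunks could be decompressed and concatenated in order like this. The function name `concat_gzipped` and the output file name `my_seqs.fasta` are assumptions, not taken from the report.

```python
import gzip
import shutil


def concat_gzipped(chunk_paths, out_path):
    """Decompress each gzipped chunk in order and append it to one file."""
    with open(out_path, "wb") as out:
        for path in chunk_paths:
            with gzip.open(path, "rb") as chunk:
                # Stream the data so large chunks are never fully in memory.
                shutil.copyfileobj(chunk, out)


# Example call (output name is hypothetical):
# concat_gzipped(
#     ["my_seqs.1of3.fasta.gz", "my_seqs.2of3.fasta.gz", "my_seqs.3of3.fasta.gz"],
#     "my_seqs.fasta",
# )
```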
MMseqs Output (for bugs)
Context
In my hands, this bug is exposed only when the number of nucleotide sequences is on the order of millions. For small sets (thousands) the execution completes uneventfully. I have tried the precompiled AVX2 and SSE4.1 versions, my own compilation of the latest release (15-6f452, Oct 31 2023), the latest GitHub version (f6c9880), and other variations. All attempts led to the exact same bug.
I have also tried three other input datasets. All four crash in the same way, and all four contain on the order of 3 to 4 million nucleotide sequences.
When I subset the sequences to about 200K sequences, easy-cluster runs to completion.
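The report does not say how the 200K subset was chosen. One simple way to produce such a subset, sketched here under the assumption that taking the first records is acceptable, is shown below. The names `take_records` and `write_subset` are hypothetical.

```python
def take_records(lines, n):
    """Yield the lines belonging to the first n FASTA records."""
    seen = 0
    for line in lines:
        if line.startswith(">"):
            seen += 1
            if seen > n:
                break
        # Skip anything that appears before the first header.
        if seen:
            yield line


def write_subset(in_path, out_path, n=200_000):
    """Copy the first n records of a FASTA file to a new file."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(take_records(src, n))
```

Repeating this with different values of `n` between 200K and the full set might help narrow down the size at which the crash first appears.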
Your Environment
I am running this on an AWS EC2 instance of type g4dn (128GB RAM). Here you go: