Is it reasonable to use LINCLUST with --Kmer-Per-Seq 2000? #831
Comments
There is not that much downside to increasing the kmer-per-seq to 2000. It will slow down linclust somewhat, but for only 2M entries it shouldn't really matter. You might want to use |
Does this mean that for a 2 kbp sequence it would take 400 k-mers, and for a 20 kbp sequence 4000 k-mers, if I set --kmer-per-seq-scale 0.2? My sequences have an average length of about 20 kbp, so theoretically --kmer-per-seq-scale 0.2 should yield results similar to or better than --kmer-per-seq 2000.
However, when I tested --kmer-per-seq-scale with values of 0.2, 1, and 20, the number of clusters did not differ significantly from running Linclust with default parameters. In each case the number of clusters was roughly twice that obtained with --kmer-per-seq 2000.
As an example, I examined the largest cluster (containing 19,000 sequences) obtained using --kmer-per-seq 2000 and inspected the length distribution of its sequences. The four most common sequence lengths together accounted for 14,000 sequences, as follows [seq count, seq length]:
Overall, from my test results, I'm unsure whether there is an issue with --kmer-per-seq-scale: setting it doesn't seem to increase clustering sensitivity as expected. Perhaps this is because I don't fully understand how it works, so I'd like to ask for your advice on this.
This is surprising. For a 2k long sequence it should generate 420 k-mers (20 base + 0.2 * 2000). Could you post the MMseqs2 terminal output of the different runs here? Maybe that would help us diagnose why |
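For reference, the arithmetic in the reply above can be sketched in a few lines of Python. This is only an illustration of the formula as stated in this thread (base k-mer count plus scale times sequence length); the function name `expected_kmers_per_seq` is hypothetical, and the actual MMseqs2 implementation may round or cap the value differently.

```python
def expected_kmers_per_seq(seq_len, kmer_per_seq=20, kmer_per_seq_scale=0.0):
    """Illustrative only: k-mers selected per sequence, following the
    formula quoted in this thread (base + scale * length).
    Assumption: truncation to an integer; MMseqs2 itself may differ."""
    return int(kmer_per_seq + kmer_per_seq_scale * seq_len)


if __name__ == "__main__":
    # Lengths discussed in this issue: 2 kbp and 20 kbp sequences.
    for length in (2_000, 20_000):
        n = expected_kmers_per_seq(length, kmer_per_seq_scale=0.2)
        print(f"{length} bp -> {n} k-mers")
    # prints:
    # 2000 bp -> 420 k-mers
    # 20000 bp -> 4020 k-mers
```

Under this reading, --kmer-per-seq-scale 0.2 on 20 kbp sequences would select about twice as many k-mers as a flat --kmer-per-seq 2000, which is why the reported cluster counts are unexpected.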
These are outputs of each run:
The cluster module needs too much memory: I have 2 million nucleotide sequences (about 30 GB) and 1 TB of memory, and a segmentation fault occurred.
So I am trying LINCLUST instead, but I also want better clustering performance. As I increased --kmer-per-seq, the number of clusters decreased until about --kmer-per-seq 2000 (my shortest sequences are 2000 bp). I think this may indicate that the clustering performance has improved.
I compared the results on a 3 GB test dataset between "Linclust --kmer-per-seq 2000" and "Cluster"; the number of clusters produced by the former is relatively close to that of the latter.
But I still wonder whether it makes sense to set --kmer-per-seq 2000, since the default is only 20.