Is it reasonable to use LINCLUST with --Kmer-Per-Seq 2000? #831

actledge · 2024-04-02T07:59:49Z

Since cluster module needs too much memory. (I have 2 million nucleotide seqs, about 30G, and 1T memory, and segment fault occurred).

I try to use LINCLUST instead. But I also want a better performance of clustering. I try to increase the --kmer-per-seq, the number of clusters decreased until about --kmer-per-seq 2000 (My shortest sequences are 2000bp). I think this may indicate that the clustering performance has improved.

I compared results in a 3G test dataset, between "Linclust --kmer-per-seq 2000" and "Cluster"， the number of cluster produced by former is relatively closed to the latter.

But I still wonder if it make sense to set --kmer-per-seq 2000, since the default is only 20.

milot-mirdita · 2024-04-03T02:59:35Z

There is not that much downside to increasing the kmer-per-seq to 2000. It will slow down linclust somewhat, but for only 2M entries it shouldn't really matter.

You might want to use --kmer-per-seq-scale though. This parameter adds additional k-mers linearly based on the sequence length. By default, this is set to 0.2 for nucleotide sequences.

actledge · 2024-04-03T07:43:25Z

Does this mean that for a sequence of 2kbp, it would take 400 k-mers, and for a sequence of 20kbp, it would take 4000 k-mers if I set --kmer-per-seq-scale 0.2? My sequences have an average length of about 20kbp, so theoretically, --kmer-per-seq-scale 0.2 might yield results similar to or better than --kmer-per-seq 2000? However, when I tested --kmer-per-seq-scale with values of 0.2, 1, and 20, I found that the number of clusters did not significantly differ from running Linclust with default parameter. The number of clusters was roughly twice as many as when using --kmer-per-seq 2000.

As an example, I examined the largest cluster (containing 19,000 sequences) obtained using --kmer-per-seq 2000 . I inspected the length distribution of sequences within this cluster. The cumulative total sequences counts of the top four sequence lengths was 14,000, as follows [seq count, seq length]:
1274 4413.0,
2985 4411.0,
3397 5350.0,
7109 4409.0.
I think most of these sequences should be highly similar, indicating that this clustering is relatively accurate.
I took some individual examples from this cluster to examine the results of --kmer-per-seq-scale and found that some nearly identical sequences (same length, 99% identity) were not clustered together when using --kmer-per-seq-scale. Meanwhile the largest cluster of --kmer-per-seq-scale 1 contains only 6341 seqs.
As a side note, I set the threshold to 95% identity and 85% coverage.

Overall, from my test results, I'm uncertain if there are issues with --kmer-per-seq-scale. Setting it doesn't seem to increase clustering sensitivity as expected. However, perhaps this is because I don't fully understand its principle, so I'm hoping to consult you on this matter.

There is not that much downside to increasing the kmer-per-seq to 2000. It will slow down linclust somewhat, but for only 2M entries it shouldn't really matter.

You might want to use --kmer-per-seq-scale though. This parameter adds additional k-mers linearly based on the sequence length. By default, this is set to 0.2 for nucleotide sequences.

milot-mirdita · 2024-04-03T09:28:08Z

This is surprising. For a 2k long sequence it should generate 420 k-mers (20 base + 0.2 * 2000).

Could you post the MMseqs2 terminal output of the different runs here? Maybe that would help us diagnose why --kmer-per-seq-scale underperforms.

actledge · 2024-04-03T09:44:34Z

These are outputs of each run:
linclust_default.txt
linclust_kmer2000_clustmode1.txt
linclust_kmerscale0.2_clustmode1.txt
linclust_kmerscale1_clustmode1.txt
linclust_kmerscale20_clustmode1.txt

This is surprising. For a 2k long sequence it should generate 420 k-mers (20 base + 0.2 * 2000).

Could you post the MMseqs2 terminal output of the different runs here? Maybe that would help us diagnose why --kmer-per-seq-scale underperforms.

hjruscheweyh mentioned this issue Apr 24, 2024

mmseqs as a replacement for cd-hit-est #836

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Is it reasonable to use LINCLUST with --Kmer-Per-Seq 2000? #831

Is it reasonable to use LINCLUST with --Kmer-Per-Seq 2000? #831

actledge commented Apr 2, 2024

milot-mirdita commented Apr 3, 2024

actledge commented Apr 3, 2024 •

edited

milot-mirdita commented Apr 3, 2024

actledge commented Apr 3, 2024

Is it reasonable to use LINCLUST with --Kmer-Per-Seq 2000? #831

Is it reasonable to use LINCLUST with --Kmer-Per-Seq 2000? #831

Comments

actledge commented Apr 2, 2024

milot-mirdita commented Apr 3, 2024

actledge commented Apr 3, 2024 • edited

milot-mirdita commented Apr 3, 2024

actledge commented Apr 3, 2024

actledge commented Apr 3, 2024 •

edited