Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Is it reasonable to use LINCLUST with --Kmer-Per-Seq 2000? #831

Open
actledge opened this issue Apr 2, 2024 · 4 comments
Open

Is it reasonable to use LINCLUST with --Kmer-Per-Seq 2000? #831

actledge opened this issue Apr 2, 2024 · 4 comments

Comments

@actledge
Copy link

actledge commented Apr 2, 2024

Since cluster module needs too much memory. (I have 2 million nucleotide seqs, about 30G, and 1T memory, and segment fault occurred).

I try to use LINCLUST instead. But I also want a better performance of clustering. I try to increase the --kmer-per-seq, the number of clusters decreased until about --kmer-per-seq 2000 (My shortest sequences are 2000bp). I think this may indicate that the clustering performance has improved.

I compared results in a 3G test dataset, between "Linclust --kmer-per-seq 2000" and "Cluster", the number of cluster produced by former is relatively closed to the latter.

But I still wonder if it make sense to set --kmer-per-seq 2000, since the default is only 20.

@milot-mirdita
Copy link
Member

There is not that much downside to increasing the kmer-per-seq to 2000. It will slow down linclust somewhat, but for only 2M entries it shouldn't really matter.

You might want to use --kmer-per-seq-scale though. This parameter adds additional k-mers linearly based on the sequence length. By default, this is set to 0.2 for nucleotide sequences.

@actledge
Copy link
Author

actledge commented Apr 3, 2024

Does this mean that for a sequence of 2kbp, it would take 400 k-mers, and for a sequence of 20kbp, it would take 4000 k-mers if I set --kmer-per-seq-scale 0.2? My sequences have an average length of about 20kbp, so theoretically, --kmer-per-seq-scale 0.2 might yield results similar to or better than --kmer-per-seq 2000? However, when I tested --kmer-per-seq-scale with values of 0.2, 1, and 20, I found that the number of clusters did not significantly differ from running Linclust with default parameter. The number of clusters was roughly twice as many as when using --kmer-per-seq 2000.

As an example, I examined the largest cluster (containing 19,000 sequences) obtained using --kmer-per-seq 2000 . I inspected the length distribution of sequences within this cluster. The cumulative total sequences counts of the top four sequence lengths was 14,000, as follows [seq count, seq length]:
1274 4413.0,
2985 4411.0,
3397 5350.0,
7109 4409.0.
I think most of these sequences should be highly similar, indicating that this clustering is relatively accurate.
I took some individual examples from this cluster to examine the results of --kmer-per-seq-scale and found that some nearly identical sequences (same length, 99% identity) were not clustered together when using --kmer-per-seq-scale. Meanwhile the largest cluster of --kmer-per-seq-scale 1 contains only 6341 seqs.
As a side note, I set the threshold to 95% identity and 85% coverage.

Overall, from my test results, I'm uncertain if there are issues with --kmer-per-seq-scale. Setting it doesn't seem to increase clustering sensitivity as expected. However, perhaps this is because I don't fully understand its principle, so I'm hoping to consult you on this matter.

There is not that much downside to increasing the kmer-per-seq to 2000. It will slow down linclust somewhat, but for only 2M entries it shouldn't really matter.

You might want to use --kmer-per-seq-scale though. This parameter adds additional k-mers linearly based on the sequence length. By default, this is set to 0.2 for nucleotide sequences.

@milot-mirdita
Copy link
Member

This is surprising. For a 2k long sequence it should generate 420 k-mers (20 base + 0.2 * 2000).

Could you post the MMseqs2 terminal output of the different runs here? Maybe that would help us diagnose why --kmer-per-seq-scale underperforms.

@actledge
Copy link
Author

actledge commented Apr 3, 2024

These are outputs of each run:
linclust_default.txt
linclust_kmer2000_clustmode1.txt
linclust_kmerscale0.2_clustmode1.txt
linclust_kmerscale1_clustmode1.txt
linclust_kmerscale20_clustmode1.txt

This is surprising. For a 2k long sequence it should generate 420 k-mers (20 base + 0.2 * 2000).

Could you post the MMseqs2 terminal output of the different runs here? Maybe that would help us diagnose why --kmer-per-seq-scale underperforms.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants