mash pre-cluster sketch size? #137

Open
jianshu93 opened this issue Jan 6, 2022 · 5 comments

Comments

@jianshu93

Dear dRep team,

The default Mash sketch size of 1000 is confusing to me:

def run_mash_on_genome_chunks(genome_chunks, mash_exe, sketch_folder, MASH_folder, logdir, **kwargs):
    dry = kwargs.get('dry', False)
    p = kwargs.get('processors', 6)
    MASH_s = kwargs.get('MASH_sketch', 1000)
    multi_round = kwargs.get('multiround_primary_clustering', True)

If you check Table 2 of the FastANI paper, a sketch size of 1000 agrees poorly with both traditional BLAST-based ANI and FastANI on nearly all datasets. At least 10^4, or 10^5, is needed for the pre-cluster ANI to be close to the FastANI or traditional ANI value. Even with 10^5 (Figure 1 (a)), below 80% Mash is still only an approximation and not close to the real ANI values. Is there a reason for using a sketch size of 1000, which works only for very distantly related genomes? For pre-clustering at any ANI value larger than 80%, 1000 is far from enough. It would be nice if sketch size and k-mer size options could be passed to Mash.

Thanks,

Jianshu
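
To put rough numbers on the sketch-size concern above, here is a minimal sketch (not dRep or Mash code). It assumes the Mash distance formula D = -(1/k) ln(2j / (1 + j)), Mash's default k-mer size of 21, and a simple binomial model for the number of shared hashes. The function names are made up for this illustration, and the model ignores genome size differences and incomplete genomes, so treat the output as an order-of-magnitude guide only.

```python
import math

def jaccard_from_ani(ani, k=21):
    """Invert the Mash distance formula: Jaccard index implied by a given ANI."""
    x = math.exp(-k * (1.0 - ani))  # x = 2j / (1 + j)
    return x / (2.0 - x)

def ani_from_jaccard(j, k=21):
    """Mash-style ANI estimate (1 - distance) from a Jaccard index."""
    if j <= 0:
        return 0.0  # no shared hashes: distance is reported as 1
    d = -math.log(2.0 * j / (1.0 + j)) / k
    return 1.0 - min(d, 1.0)

def ani_spread(true_ani, sketch_size, k=21):
    """Expected shared hashes and the ANI range at +/- 1 standard deviation.

    Assumption: shared hashes ~ Binomial(sketch_size, j). This is a rough model.
    """
    j = jaccard_from_ani(true_ani, k)
    expected = sketch_size * j
    sd = math.sqrt(sketch_size * j * (1.0 - j))
    low = ani_from_jaccard(max(expected - sd, 0.0) / sketch_size, k)
    high = ani_from_jaccard((expected + sd) / sketch_size, k)
    return expected, low, high

for s in (1000, 10000, 100000):
    expected, low, high = ani_spread(0.80, s)
    print(f"s={s}: ~{expected:.1f} shared hashes, ANI estimate {low:.3f} to {high:.3f}")
```

Under these assumptions, two genomes at 80% ANI share only a handful of the 1000 sketch hashes, so a difference of a few hashes moves the estimate by a percentage point or more; larger sketches narrow the range.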

@MrOlm
Owner

MrOlm commented Jan 6, 2022

Hi Jianshu,

  1. That sketch size is only used for Mash, not fastANI. In dRep, the goal of Mash is to provide a quick pre-clustering, so the accuracy doesn't matter very much. That small sketch size is chosen to make this first step as fast as possible, since speed is the goal of the primary clustering.

  2. You can adjust this value to be whatever you like using the -ms parameter.

Best,
Matt
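
As a concrete illustration of the -ms suggestion above, the following sketch only builds and prints a dRep command line; it does not run dRep. The -ms flag is the one named in this thread. The other flags (-g, -pa, -sa), the helper name and the file paths are assumptions based on my understanding of the dRep CLI, so please check them against `dRep dereplicate -h` before use.

```python
import shlex

def build_drep_command(work_dir, genomes, sketch_size=10000, p_ani=0.80, s_ani=0.85):
    """Assemble a dRep dereplicate command with a larger Mash sketch size.

    Hypothetical helper: only -ms is confirmed by the thread; -g, -pa and -sa
    are assumed to be the genome list, primary ANI and secondary ANI options.
    """
    return [
        "dRep", "dereplicate", work_dir,
        "-g", *genomes,
        "-ms", str(sketch_size),   # Mash sketch size (default 1000)
        "-pa", str(p_ani),         # assumed: primary (Mash) clustering threshold
        "-sa", str(s_ani),         # assumed: secondary (exact ANI) threshold
    ]

# Placeholder paths, for illustration only.
cmd = build_drep_command("drep_out", ["genomes/a.fasta", "genomes/b.fasta"])
print(shlex.join(cmd))
```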

@jianshu93
Author

Hello Matt,

Thanks for the quick response. What if I want to pre-cluster at 85% ANI and then run the exact ANI comparison at 90%? With a sketch size of 1000, the pre-cluster threshold will never really correspond to 85%, but to 88% or so (a small sketch size tends to underestimate ANI, so a nominal 85% ANI pre-cluster threshold can correspond to a larger true ANI value). Two genomes that are actually around 90% ANI can therefore end up in different clusters; the exact fastANI comparison is then never run on that pair, so the dereplication may not be what the user expects. Do you see my point? For dereplication at very high ANI, like 95%, there is no problem, because the pre-clustering will never reach that resolution. This only arises when we want to dereplicate at a lower ANI, like 90% or 85%.

Thanks,

Jianshu
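
The failure mode described above can be shown with a toy example. This is a deliberately simplified sketch: it uses single-linkage grouping rather than dRep's actual clustering, and the genome names and ANI values are invented for illustration. The point is only that a pair separated in the primary step is never passed to the secondary comparison.

```python
from itertools import combinations

def primary_clusters(genomes, est_ani, threshold):
    """Group genomes whose estimated ANI is >= threshold (single linkage)."""
    parent = {g: g for g in genomes}

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path compression
            g = parent[g]
        return g

    for a, b in combinations(genomes, 2):
        if est_ani[frozenset((a, b))] >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for g in genomes:
        clusters.setdefault(find(g), []).append(g)
    return sorted(clusters.values())

def secondary_pairs(clusters):
    """Pairs that would receive an exact ANI comparison: within-cluster only."""
    return [pair for cluster in clusters for pair in combinations(cluster, 2)]

genomes = ["A", "B", "C"]
# Hypothetical values: A and B are truly ~90% ANI, but the small sketch
# underestimates the pair at 84%.
small_sketch = {frozenset("AB"): 0.84, frozenset("AC"): 0.70, frozenset("BC"): 0.71}
large_sketch = {frozenset("AB"): 0.89, frozenset("AC"): 0.70, frozenset("BC"): 0.71}

for name, est in (("small sketch", small_sketch), ("large sketch", large_sketch)):
    clusters = primary_clusters(genomes, est, threshold=0.85)
    print(name, clusters, secondary_pairs(clusters))
```

With the underestimated value, A and B land in separate primary clusters and the A-B pair never reaches the exact comparison; with the closer estimate they are grouped and compared.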

@MrOlm
Owner

MrOlm commented Jan 6, 2022

Hi Jianshu,

Ah I see, I understand now. In that circumstance it would certainly make sense to increase the -ms parameter to a higher value, but I don't really want to change the default value, in order to keep the program run-time down.

Best,
Matt

@jianshu93
Author

jianshu93 commented Jan 6, 2022

Hi Matt,

True, in most cases users want to dereplicate at a higher ANI, so speed is more important. In my case I want to cluster at 85% ANI, so the pre-cluster threshold should be 80% or so. Even with a sketch size of 10^4, Mash is still much faster than FastANI, even though the overall process will take a long time. So yes, this is just a reminder that this can happen and that we should be cautious: users who want a lower pre-cluster ANI value should increase the sketch size. Does that sound reasonable? I got strange dereplication results at 85% compared to using FastANI only.

Thanks,

Jianshu

@MrOlm
Owner

MrOlm commented Jan 11, 2022

I see, this does make sense and does sound reasonable. I'll look into adding a warning like this during the next dRep update.
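
One way such a warning could look is sketched below. This is not dRep's implementation: the function name and both cutoffs are hypothetical and chosen only for illustration, and the real values would be up to the maintainer.

```python
import warnings

# Hypothetical cutoffs, for illustration only.
LOW_P_ANI = 0.90
MIN_SKETCH_FOR_LOW_P_ANI = 10000

def check_sketch_size(p_ani, mash_sketch):
    """Warn when a low primary ANI threshold is combined with a small sketch.

    Returns the warning message, or None if no warning is needed.
    """
    if p_ani < LOW_P_ANI and mash_sketch < MIN_SKETCH_FOR_LOW_P_ANI:
        msg = (
            f"Primary clustering at ANI {p_ani} with Mash sketch size "
            f"{mash_sketch} may split genomes that belong together; "
            "consider increasing the sketch size with -ms."
        )
        warnings.warn(msg)
        return msg
    return None

check_sketch_size(0.80, 1000)   # emits a warning
check_sketch_size(0.90, 1000)   # threshold not below the cutoff: no warning
```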
