`sourmash compare` runs out of memory on large comparisons #3134

yuzie0314 · 2024-04-30T03:49:03Z

Hi @ctb the sourmash author,

We are currently using your tool to find representative MAGs within a custom data set assembled from several deeply sequenced stool shotgun samples. These MAGs were all classified to the same family using GTDB-Tk reference genomes.
We have up to 14,500 genomes in this data set, and we want to compute a pairwise ANI matrix using the following command.

sourmash compare -p 8 -k 31 --ani -o ani_matrix.numpy --csv ani_matrix.csv cluster_mash/*.sig

However, after about an hour of processing we got an unexpected BrokenPipeError, so we started to wonder whether there is a limit on how large an ANI matrix sourmash compare can generate. I think this error comes from running out of memory; correct me if I am wrong.

P.S. We are using 16 cores and 32 GB of RAM on an AWS EC2 Linux instance.
We also saw the message Killedison for index 886 done in 9.36945 seconds, which might be another clue as to why this error happened.
The current version is sourmash v4.8.2.

ctb · 2024-04-30T13:47:12Z

hi @yuzie0314 - yes, the Killed means sourmash used too much memory.

there is a long issue #2299 about this. we are still in a bit of a confused state in terms of recommendations, but the gist of that issue is:

you should be able to use pairwise and cluster from the branchwater plugin to do your clustering very quickly and in very low memory.

@bluegenes any words of wisdom here?

yuzie0314 · 2024-05-13T08:50:48Z

Hi @ctb,
I learned that the pairwise and cluster commands might be useful, but I have some questions about their results. The main outputs of sourmash compare are a pairwise ANI matrix in CSV, plus a numpy pickle file and a label text file; can we use pairwise to generate these? Also, I am a little confused about the cluster command: what is the main cutoff/threshold/metric used to cluster genomes/signatures? Is the clustering based on ANI similarity alone, or on something else as well? Could you help us clarify this?

This is really good news for us,
since we want to improve the speed of our custom pipelines.
Yuzie

yuzie0314 · 2024-05-14T07:59:09Z

Hi @ctb the author,

I used the following command and generated a csv result.

sourmash scripts pairwise --ani \
        --cores 16 \
        --output ani_matrix.csv \
        --scaled 1000000 \
        --ksize 21 \
        sig.path \
        --write-all

The result is different from sourmash compare method:

Thanks for your help,
Yuzie

ctb · 2024-05-14T14:02:56Z

matrix vs CSV output

hi @yuzie0314, yep, the output of pairwise is a different format - this is because the numpy matrix format is not a sparse matrix, and for large comparisons/large collections it will have a lot of zeroes in it. In this case the CSV format is better, because it only stores pairs where there is a match.
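
To put rough numbers on the dense-versus-sparse point, here is a minimal sketch of the arithmetic. It assumes 8-byte float64 cells and counts only the output matrix itself, not the sketches held in memory or any per-process copies, so it is not a full account of where the memory goes.

```python
def dense_matrix_bytes(n_sketches: int, bytes_per_value: int = 8) -> int:
    """Bytes needed for a full n x n matrix (assumes float64 by default)."""
    return n_sketches * n_sketches * bytes_per_value


def n_unique_pairs(n_sketches: int) -> int:
    """Number of distinct unordered pairs, excluding self-comparisons."""
    return n_sketches * (n_sketches - 1) // 2


n = 14_500  # collection size mentioned at the top of this issue
print(f"dense matrix cells: {n * n:,}")
print(f"dense matrix size:  {dense_matrix_bytes(n) / 1e9:.2f} GB")
print(f"unique pairs:       {n_unique_pairs(n):,}")
```

A sparse CSV needs one row per matching pair, so its size depends on how many pairs actually match rather than on the square of the collection size.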

Please see sourmash-bio/sourmash_plugin_branchwater#198 for a script that converts the CSV file into a numpy matrix. I haven't tried it yet myself, I'm afraid, but if you run into problems please feel free to post here and we'll see what we can do!
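
As a rough illustration of what such a conversion involves (this is not the script from the linked issue), the sketch below fills a dense symmetric matrix from a sparse pairwise CSV. The column names `query_name`, `match_name`, and `average_containment_ani` are assumptions about the CSV header, and the example rows are made up; check them against your own file.

```python
import csv
import io

import numpy as np


def pairwise_csv_to_matrix(csv_file, column="average_containment_ani"):
    """Build a dense symmetric matrix from a sparse pairwise CSV.

    Column names (query_name, match_name, and `column`) are assumptions
    about the pairwise output; adjust them to match your file's header.
    """
    rows = list(csv.DictReader(csv_file))

    # Collect labels in first-seen order so the matrix order is reproducible.
    labels = []
    index = {}
    for row in rows:
        for name in (row["query_name"], row["match_name"]):
            if name not in index:
                index[name] = len(labels)
                labels.append(name)

    # Pairs absent from the CSV stay at 0.0; the diagonal is set to 1.0.
    matrix = np.zeros((len(labels), len(labels)), dtype=float)
    np.fill_diagonal(matrix, 1.0)
    for row in rows:
        i, j = index[row["query_name"]], index[row["match_name"]]
        value = float(row[column])
        matrix[i, j] = value
        matrix[j, i] = value
    return matrix, labels


# Made-up example data: A-B and B-C match, A-C has no row at all.
example = io.StringIO(
    "query_name,match_name,average_containment_ani\n"
    "A,B,0.97\n"
    "B,C,0.91\n"
)
matrix, labels = pairwise_csv_to_matrix(example)
print(labels)
print(matrix)
```

Note that any pair without a row in the CSV ends up as 0.0 in the matrix, and a sketch that never appears in the CSV is missing from the matrix entirely.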

how does `cluster` work?

here is the basic description from sourmash-bio/sourmash_plugin_branchwater#234 -

cluster uses rustworkx-core (which internally uses petgraph) to build a graph, adding edges between nodes when the similarity exceeds the user-defined threshold. It can work on any of the similarity columns output by pairwise or multisearch, and will add all nodes to the graph to preserve singleton 'clusters' in the output.

from the docs -

cluster takes a --similarity_column argument to specify which of the similarity columns to use, with the following choices: containment, max_containment, jaccard, average_containment_ani, maximum_containment_ani. All values should be input as fractions (e.g. 0.9 for 90%)

per @mr-eyes in sourmash-bio/sourmash_plugin_branchwater#252,

Community Detection: The current clustering algorithm is weakly_connected_component, @bluegenes tried it before with kSpider, and -as far as I remember- it did a great job in the ANI-based clustering of the GTDB-207. Here, I propose adopting community detection methods, which have been proven very useful in DBRetina, but I haven't tried them on DNA data.
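
The connected-component idea described above can be sketched in a few lines of Python. This is an illustration using a simple union-find, not the plugin's rustworkx-based code; the function name and inputs are hypothetical, and whether the real threshold comparison is inclusive is an assumption (this sketch uses `>=`).

```python
def cluster_by_threshold(names, pairs, threshold):
    """Group names into connected components.

    names: all sketch names, so that singletons are kept
    pairs: iterable of (name_a, name_b, similarity), similarity as a fraction
    threshold: minimum similarity for an edge (inclusive here, by assumption)
    """
    parent = {name: name for name in names}

    def find(x):
        # Walk to the root, compressing the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, similarity in pairs:
        if similarity >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(clusters.values(), key=lambda members: members[0])


# Made-up similarities, given as fractions (0.95 means 95%).
names = ["A", "B", "C", "D"]
pairs = [("A", "B", 0.97), ("B", "C", 0.96), ("C", "D", 0.80)]
print(cluster_by_threshold(names, pairs, threshold=0.95))
```

In this example A and C end up in the same cluster through B even though no direct A-C pair is listed, which is how connected components behave; D is kept as a singleton.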

ordination

we have a script here that can build an ordination view (MDS) from a matrix, and, with some small edits, should work on the output of pairwise - lmk if this is something you would be interested in.
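
For readers unfamiliar with MDS, here is a minimal sketch of classical (Torgerson) MDS in numpy, treating distance as 1 - similarity. It is not the script mentioned above, which may use a different MDS variant; the function name and the example matrix are made up for illustration.

```python
import numpy as np


def classical_mds(similarity, n_components=2):
    """Classical (Torgerson) MDS on distances defined as 1 - similarity."""
    similarity = np.asarray(similarity, dtype=float)
    distance = 1.0 - similarity
    n = distance.shape[0]

    # Double-center the squared distance matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (distance ** 2) @ centering

    # eigh returns eigenvalues in ascending order; take the largest ones.
    eigenvalues, eigenvectors = np.linalg.eigh(gram)
    order = np.argsort(eigenvalues)[::-1][:n_components]
    # Clip small negative eigenvalues that arise from non-Euclidean input.
    scale = np.sqrt(np.clip(eigenvalues[order], 0.0, None))
    return eigenvectors[:, order] * scale


# Made-up 3 x 3 similarity matrix with ones on the diagonal.
similarity = np.array([
    [1.00, 0.90, 0.70],
    [0.90, 1.00, 0.80],
    [0.70, 0.80, 1.00],
])
print(classical_mds(similarity, n_components=2))
```

Each row of the result is a coordinate for one sketch, ready for a scatter plot. A matrix in which missing pairs are stored as 0 will place those pairs at the maximum distance of 1.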

further questions :)

what's your end goal? then we can see if we can help you get there!

cluster visualization?
and/or cluster extraction?

and/or something else?

ctb · 2024-05-19T15:13:11Z

hi @yuzie0314 I got inspired by your question (and also by some of my own research needs ;)) and built a plugin that I think will help you - see https://github.com/sourmash-bio/sourmash_plugin_betterplot/.

Specifically:

the command sourmash scripts pairwise_to_compare will convert the CSV output of pairwise to standard sourmash compare format. So you can do fast comparisons with pairwise and then turn them into regular matrices.
you might also be interested in the --cut-line and cluster output functionality :)
I mean I'm also kind of happy with the mds commands...

If you have suggestions or requests for further functionality, please let me know! It's easy and fun to add new stuff to this plugin!

yuzie0314 · 2024-05-20T04:14:43Z

wow, the answers are a little complicated; I will need some time to test them in my environment, but thanks again for your contribution, really helpful @ctb.

I think the solutions you provided are worth exploring. I will update you once we have any questions or good news.

yuzie0314 · 2024-05-20T06:40:45Z

Hi @ctb,
I ran a quick test comparing sourmash compare with sourmash scripts pairwise --ani combined with sourmash scripts pairwise_to_compare, using four metagenomes.

The results from sourmash compare command alone:

The results from sourmash scripts pairwise --ani combined with sourmash scripts pairwise_to_compare:

As you can see, compare outputs the pairwise ANI results even when the ANI score is low, but the new combination of commands gives us only partial information, which makes the results from sourmash scripts pairwise a little hard to interpret. We also tried setting -t 0, expecting that sourmash would then report results for every comparison, but the results were the same as when we did not set any threshold.

The commands we used:
1. sourmash compare -p 8 --ani -o ani_matrix.numpy --csv ani_matrix.csv cluster_mash/*

2. sourmash scripts pairwise --ani \
        --cores 16 \
        --output ani_pairwise.csv \
        --scaled 1000000 \
        --ksize 21 \
        sig.path \
        --write-all \
        -t 0

3. sourmash scripts pairwise_to_compare ani_pairwise.csv -o ani_matrix.numpy

ctb · 2024-05-20T13:05:28Z

That's a great test! You should get identical results (although I will confess I have not tried it myself). I will try it out on my own set of data, but - I'm curious - why set --scaled 1000000 with pairwise? That would be my main guess as to why you're getting different results.
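
As a back-of-the-envelope illustration of why the scaled value matters: a scaled sketch keeps roughly one in every `scaled` distinct k-mers. The sketch below uses a hypothetical genome with about 3 million distinct k-mers; the numbers are illustrative, not taken from the data in this issue.

```python
def expected_sketch_size(n_distinct_kmers: int, scaled: int) -> float:
    """Expected number of hashes retained: about one per `scaled` k-mers."""
    return n_distinct_kmers / scaled


genome_kmers = 3_000_000  # hypothetical genome, ~3 million distinct k-mers
for scaled in (1_000, 10_000, 1_000_000):
    size = expected_sketch_size(genome_kmers, scaled)
    print(f"scaled={scaled:>9,}: about {size:,.0f} hashes")
```

With only a handful of hashes per sketch, any similarity estimate becomes very coarse, and two related genomes can easily share no hashes at all.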

yuzie0314 · 2024-05-21T07:54:48Z

ohh ya,
I only wanted to test whether adding --scaled 1000000 would change the results or not.
BTW, I also tested the command without this flag, and the results are still different from the original ones.
The rows and columns are in the same order, so we can clearly see that the ANI values from the pairwise command are much lower than those from the compare command.

Is there anything I am missing? Just let me know and I will test it in my environment.

The commands we used:
1. sourmash scripts pairwise --ani \
        --cores 16 \
        --output ani_pairwise.csv \
        --ksize 21 \
        sig.path \
        --write-all \
        -t 0

2. sourmash scripts pairwise_to_compare ani_pairwise.csv -o ani_matrix.numpy

ctb · 2024-05-21T13:51:58Z

I'll have to take a look. Have you compared the Jaccard index or containment matrices, rather than the ANI? I'm wondering if there's a difference in the ANI calculations - then Jaccard/containment would be the same, but ANI would be different. (Which would be a bug, just to be clear!)
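
To make the distinction concrete, here is a minimal sketch of how an ANI point estimate can be derived from containment or Jaccard. It assumes the commonly used forms ANI ≈ C^(1/k) and, for Jaccard, C ≈ 2J/(1+J); whether compare and pairwise both use exactly these formulas, and the same input measure, is an assumption and is the thing that would need checking.

```python
def containment_to_ani(containment: float, ksize: int) -> float:
    """Point estimate ANI ~= containment ** (1 / k) (assumed form)."""
    return containment ** (1.0 / ksize)


def jaccard_to_ani(jaccard: float, ksize: int) -> float:
    """Convert Jaccard to a containment-like value, then to ANI (assumed form)."""
    containment_like = 2.0 * jaccard / (1.0 + jaccard)
    return containment_like ** (1.0 / ksize)


# Made-up values: the same overlap gives different ANI for different k.
for ksize in (21, 31):
    print(f"k={ksize}: containment 0.5 -> ANI {containment_to_ani(0.5, ksize):.4f}")
    print(f"k={ksize}: jaccard 0.5     -> ANI {jaccard_to_ani(0.5, ksize):.4f}")
```

Under these assumptions the ANI depends on both the k-mer size and the underlying measure, so it is worth making sure both tools are run with the same settings. The commands posted in this thread use -k 31 for compare in the first post and --ksize 21 for pairwise.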

yuzie0314 · 2024-05-29T01:27:23Z

Sorry for the late reply; we were only focusing on the ANI comparison results and didn't check the other measures.
Is the containment metric similar to ANI? I ask because I learned from your documentation that there is a containment ANI column in the gather results.

ctb changed the title ~~BrokenPipeError: [Errno 32] Broken pipe happend in sourmash compare command~~ sourmash compare runs out of memory on large comparisons May 14, 2024

ctb mentioned this issue May 20, 2024

write automated tests for pairwise_to_compare sourmash-bio/sourmash_plugin_betterplot#24

Open
