Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fastANI is not working #96

Open
Biofarmer opened this issue Nov 7, 2020 · 89 comments
Open

fastANI is not working #96

Biofarmer opened this issue Nov 7, 2020 · 89 comments
Labels

Comments

@Biofarmer
Copy link

Biofarmer commented Nov 7, 2020

Hi Dr. Olm,

I am using version 2.6.2, when running dRep compare with --S_algorithm fastANI, there is an error:

Clustering Step 1. Parse Arguments
Clustering Step 2. Perform MASH (primary) clustering
2a. Run pair-wise MASH clustering
2b. Cluster pair-wise MASH clustering
3355 primary clusters made
Step 3. Perform secondary clustering
Running 8999390 fastANI comparisons- should take ~ 1200.5 min
Traceback (most recent call last):
File "/install/software/anaconda3.6.b/bin/dRep", line 33, in
controller.parseArguments(args)
File "/install/software/anaconda3.6.b/lib/python3.6/site-packages/drep/controller.py", line 146, in parseArguments
self.compare_operation(**vars(args))
File "/install/software/anaconda3.6.b/lib/python3.6/site-packages/drep/controller.py", line 91, in compare_operation
drep.d_workflows.compare_wrapper(kwargs['work_directory'],**kwargs)
File "/install/software/anaconda3.6.b/lib/python3.6/site-packages/drep/d_workflows.py", line 96, in compare_wrapper
drep.d_cluster.d_cluster_wrapper(wd, **kwargs)
File "/install/software/anaconda3.6.b/lib/python3.6/site-packages/drep/d_cluster.py", line 80, in d_cluster_wrapper
data_folder, wd=workDirectory, **kwargs)
File "/install/software/anaconda3.6.b/lib/python3.6/site-packages/drep/d_cluster.py", line 215, in cluster_genomes
ndb = compare_genomes(bdb, algorithm, data_folder, **kwargs)
File "/install/software/anaconda3.6.b/lib/python3.6/site-packages/drep/d_cluster.py", line 921, in compare_genomes
df = run_pairwise_fastANI(genome_list, working_data_folder, **kwargs)
File "/install/software/anaconda3.6.b/lib/python3.6/site-packages/drep/d_cluster.py", line 1096, in run_pairwise_fastANI
exe_loc = drep.get_exe('fastANI')
File "/install/software/anaconda3.6.b/lib/python3.6/site-packages/drep/init.py", line 100, in get_exe
assert False, "{0} isn't working- make sure its installed".format(name)
AssertionError: fastANI isn't working- make sure its installed

May I ask that should I install the fastANI separately? If yes, how can I make sure the dRep can call it? We already have FastANI 1.1 installed.

Thanks

@MrOlm
Copy link
Owner

MrOlm commented Nov 8, 2020

Yes, you need to install it separately and need to make sure it's in your system PATH. dRep bonus test --check_dependencies will tell you which programs dRep is able to confirm are working correctly.

@MrOlm MrOlm closed this as completed Nov 8, 2020
@Biofarmer
Copy link
Author

Hi Dr. Olm, thanks for reply. I just runt the code and see:

mash.................................... all good (location = /install/software/anaconda3.6.b/bin/mash)
nucmer.................................. all good (location = /install/software/anaconda3.6.b/bin/nucmer)
checkm.................................. all good (location = /install/software/anaconda3.6.b/bin/checkm)
ANIcalculator........................... !!! ERROR !!! (location = None)
prodigal................................ all good (location = /install/software/anaconda3.6.b/bin/prodigal)
centrifuge.............................. all good (location = /install/software/anaconda3.6.b/bin/centrifuge)
nsimscan................................ !!! ERROR !!! (location = None)
fastANI................................. !!! ERROR !!! (location = None)

May I ask that if I want to use fastANI for S_algorithm with dRep compare or dereplicate, do all the three "ERROR" stools need to be installed? For sure, fastANI yes. However about the other two? If yes, are the following links correct to download?
For ANIcalculator, is it here: https://ani.jgi.doe.gov/html/download.php?
For nsimscan, is it here: https://github.com/abadona/qsimscan ?

Many thanks for reply.

@MrOlm
Copy link
Owner

MrOlm commented Nov 9, 2020

Hello,

Nope, to use fastANI you only need to correct the fastANI option. The other two are used for other S_algorithms

@Biofarmer
Copy link
Author

Biofarmer commented Nov 9, 2020

Hi Dr. Olm, it is great to know that. I have only fixed the fastANI option, and now looks good.
mash.................................... all good (location = /install/software/anaconda3.6.b/bin/mash)
nucmer.................................. all good (location = /install/software/anaconda3.6.b/bin/nucmer)
checkm.................................. all good (location = /install/software/anaconda3.6.b/bin/checkm)
ANIcalculator........................... !!! ERROR !!! (location = None)
prodigal................................ all good (location = /install/software/anaconda3.6.b/bin/prodigal)
centrifuge.............................. all good (location = /install/software/anaconda3.6.b/bin/centrifuge)
nsimscan................................ !!! ERROR !!! (location = None)
fastANI................................. all good (location = /install/software/fastani/1.32/fastANI)

I have run dRep compare with fastANI, and find that the flag of fastANI '--minFraction' is 0, but the default is 0.2. May I ask that the '--minFraction' is indeed set to 0? or do I misunderstand anything?

11-09 17:51 DEBUG running cluster 1
11-09 17:51 DEBUG /install/software/fastani/1.32/fastANI --ql /data/Food/analysis/R0898_HYDROFish/allbins_ncbi/random_1_drep_20k/data/fastANI_files/tmp/genomeList --rl /data/Food/analysis/R0898_HYDROFish/allbins_ncbi/random_1_drep_20k/data/fastANI_files/tmp/genomeList -o /data/Food/analysis/R0898_HYDROFish/allbins_ncbi/random_1_drep_20k/data/fastANI_files/fastANI_out_tqcrxtgsrr --matrix -t 50 --minFraction 0 tqcrxtgsrr

Thanks

@MrOlm
Copy link
Owner

MrOlm commented Nov 9, 2020

Hello,

Yes, the --minFraction is not used at the stage of the program being referenced in the log. It's used at a later stage when the clustering happens. The default is indeed 0.2, not 0.

@Biofarmer
Copy link
Author

Hi, thanks for reply. It is good to have the default value used. I will see it when the running is finished.
Thanks again.

@Biofarmer
Copy link
Author

Hi Dr. Olm, I have run with drep compare with fastANI with 50 processors, and it shows in the log that 1200.5 min is needed as below.
11-09 17:51 INFO 3355 primary clusters made
11-09 17:51 INFO Step 3. Perform secondary clustering
11-09 17:51 INFO Running 8999390 fastANI comparisons- should take ~ 1200.5 min

However, it has been running already 45 h (2700 min), and I can see that the job is still running, however may I ask that the time indicated in the log is just estimated and the real time may be longer than that?

Thanks

@MrOlm
Copy link
Owner

MrOlm commented Nov 11, 2020

Hello,

Yes it is indeed just an estimate. Depending on random factors like genome size, the specifics of the genomes being compared, and the number of cores being used, the actual time can vary a bit. The parallelization also doesn't scale perfectly; 50 cores will take more than twice as long as 25 cores, but the simple formula used to estimate how long the job will take doesn't take this into account.

Best,
Matt

@Biofarmer
Copy link
Author

Hi Dr. Olm,
It is good that I have finished my run with 20,000 genomes, with -pa 0.9 -sa 0.95 -nc 0.30 -cm larger.

  1. However there is an error for a cluster:

Running 8999390 fastANI comparisons- should take ~ 1200.5 min
CRITICAL ERROR WITH SECONDARY CLUSTERING CODE hdzdcskeha; SKIPPING
CRITICAL ERROR WITH PRIMARY CLUSTER 592; TRYING AGAIN
CRITICAL ERROR WITH SECONDARY CLUSTERING CODE geyhqnsofz; SKIPPING
DOUBLE CRITICAL ERROR AGAIN WITH PRIMARY CLUSTER 592; SKIPPING

In the Cdb.csv, there is no cluster 592 record. May I ask why?

  1. For dRep compare Step 4. Analyze part, there are many lines as below in my error log:

Plotting MDS plot
/install/software/anaconda3.6.b/lib/python3.6/site-packages/sklearn/manifold/mds.py:130: RuntimeWarning : divide by zero encountered in double_scalars
old_stress = stress / dis
/install/software/anaconda3.6.b/lib/python3.6/site-packages/sklearn/manifold/mds.py:125: RuntimeWarning : invalid value encountered in double_scalars
if(old_stress - stress / dis) < eps:
/install/software/anaconda3.6.b/lib/python3.6/site-packages/sklearn/manifold/mds.py:130: RuntimeWarning : invalid value encountered in double_scalars
old_stress = stress / dis
/install/software/anaconda3.6.b/lib/python3.6/site-packages/sklearn/manifold/mds.py:125: RuntimeWarning : invalid value encountered in double_scalars
if(old_stress - stress / dis) < eps:
...

May I ask why? I think these issues are without any impact on the cluster, and the files in data_tables are totally fine for subsequence analysis, right?

  1. The value (0.30) I set with -nc is applied to --minFraction in fastANI? Because I need the alignment fraction (AF) for subsequent analysis. Where can I check the value used for --minFraction in fastANI?

Sorry about many questions and thanks for your time.

@MrOlm
Copy link
Owner

MrOlm commented Nov 13, 2020

Hello,

  1. This is because FastANI is a somewhat finicky program, and it didn't like something about one of those genomes. Try and run fastANI on those genomes outside of dRep to see what the problem is. dRep runs FastANI like below, where genomeList is a text file listing the genome locations.
$ fastANI --ql genomeList --rl genomeList -o fastANI_out_pqbxqtvbuy --matrix -t 8 --minFraction 0 pqbxqtvbuy

Alternatively you could run dRep again in debug mode (add -d), which will store all logs that FastANI creates to show what the error is (stored in log/cmd_lods/). I must warn you though, often fastANI crashes without an easily interpretable error, which makes it difficult to know what the problem is.

  1. Yeah don't worry about those; it's just matplotlib yelling about unimportant things that might be problems but really aren't.

  2. No; -nc is applied during the clustering step. You can see the AF values for all comparisons in the Ndb.csv file in the data_tables folder. During the clustering step if the AF is less than -nc, dRep will replace the actual ANI value with a 0. This 0 isn't stored in Ndb.csv so that you can see what the reported ANI is, when the clustering happens it will be treated as a 0.

Best,
Matt

@Biofarmer
Copy link
Author

Hi Dr. Olm,

Thanks for reply. May I further ask based on your reply?

  1. Where can I find the genomes that are listed in Cluster 592? I just see one genomeList lin tmp folder. Then I can run fastANI independently to see the problem.

  2. That's great, thanks.

  3. Based on my understanding, the AF used for secondary clustering is from -nc, right? When running fastANI, the --minFraction is set to 0, and the AF is then calculated based one the output of fastANI for each comparsion. Then dRep used -nc to do secondary clustering based on the calculated AF, right?

Thanks.

@MrOlm
Copy link
Owner

MrOlm commented Nov 16, 2020

Hello,

  1. It should just be the genomes that are in the datatable Bdb.csv (which contains all filtered genomes) but are not in Cdb.csv (which has the cluster assignments of all genomes).

  2. That is mostly correct. You are correct and AF is reported and calculated by fastANI with --minFraction set to 0. dRep uses ANI to do the secondary clustering, however, not AF. All genomes with an AF below the nc have their ANI set to 0 by dRep, though.

-Matt

@Biofarmer
Copy link
Author

Biofarmer commented Nov 17, 2020

Hi Dr. Olm,

Thank you for reply.

  1. Yes, but all the genomes are in the datatable Bdb.csv, right? How can I know which genomes that belong to Cluster 592? I think I only need to re-run fastANI on this cluster to check where the problem is, am I right?

  2. Thanks, it makes sense to me now.

Sorry may I introduce a new question:
3.1 Is the final number of clusters from column of "secondary_cluster" in Cdb? I have checked the unique cluster number from "secondary_cluster" is 4546, and the unique cluster from "primary_cluster" is 3354. I think this is expected to have higher cluster number from "secondary_cluster", right?
3.2 I run code with -pa 0.95 -sa 0.95 -nc 0.30 -cm larger for the same genomes. I am just curious that how many clusters from Mash with 0.95, which results in 4081 clusters. I understand the problem of Mash is to underestimate the distance between the incomplete genomes and split same-species genomes into multiple clusters, which means higher number of clusters from Mash itself. However, why the number I got with -pa 0.95 from Mash (the first step of dRep) is lower than the clusters I got from dRep with -sa 0.95? Or I cannot compare them directly like this?

Thanks

@MrOlm
Copy link
Owner

MrOlm commented Nov 17, 2020

Hello,

  1. The genomes that ARE in Bdb.csv and ARE NOT in Cdb.csv will be the ones where fastANI failed.

3.1) Yes, your understanding is correct.

3.2) I'm a little confused by the question but can explain a few things. The purpose of primary clustering (done with Mash, which is fast) is to reduce the number of comparisons that have to be done with fastANI (which is slow). The only genomes that are ever compared with FastANI are those that are in the same primary cluster by Mash; genomes in different primary clusters will never be in the same secondary clusters. Mash is not very accurate however (especially with incomplete genomes), so that's why by default dRep casts a wide net with Primary clustering, and then makes secondary clusters with fastANI.

If you'd like to compare Mash and fastANI directly, you'll need to run dRep once with the command -pa 0.95 --SkipMash and once with the command -sa 0.95 --SkipSecondary. I'll warn you though, this second command will be very slow (because it will not have the primary clustering speeding things up).

A bit more detail on this can be found here: https://drep.readthedocs.io/en/latest/choosing_parameters.html

-Matt

@Biofarmer
Copy link
Author

Biofarmer commented Nov 18, 2020

Hi Dr. Olm,

  1. sorry, I understood now, and found the cluster 592 only has one genome, and run the fastANI as below:
    fastANI --ql genomeList --rl genomeList -o fastANI_out_592 --matrix -t 50 --minFraction 0

It run without error:

Reference = [bin.7.fna]
Query = [bin.7.fna]
Kmer size = 16
Fragment length = 3000
Threads = 50
ANI output file = fastANI_out_592

INFO [thread 0], skch::main, Count of threads executing parallel_for : 50
INFO [thread 0], skch::Sketch::build, window size for minimizer sampling = 24
INFO [thread 18], skch::main, ready to exit the loop
INFO [thread 1], skch::main, ready to exit the loop
INFO [thread 24], skch::main, ready to exit the loop
INFO [thread 17], skch::main, ready to exit the loop
INFO [thread 47], skch::main, ready to exit the loop
INFO [thread 23], skch::main, ready to exit the loop
INFO [thread 41], skch::main, ready to exit the loop
INFO [thread 19], skch::main, ready to exit the loop
INFO [thread 33], skch::main, ready to exit the loop
INFO [thread 43], skch::main, ready to exit the loop
INFO [thread 39], skch::main, ready to exit the loop
INFO [thread 12], skch::main, ready to exit the loop
INFO [thread 2], skch::main, ready to exit the loop
INFO [thread 15], skch::main, ready to exit the loop
INFO [thread 46], skch::main, ready to exit the loop
INFO [thread 44], skch::main, ready to exit the loop
INFO [thread 21], skch::main, ready to exit the loop
INFO [thread 34], skch::main, ready to exit the loop
INFO [thread 7], skch::main, ready to exit the loop
INFO [thread 6], skch::main, ready to exit the loop
INFO [thread 22], skch::main, ready to exit the loop
INFO [thread 38], skch::main, ready to exit the loop
INFO [thread 30], skch::main, ready to exit the loop
INFO [thread 25], skch::main, ready to exit the loop
INFO [thread 3], skch::main, ready to exit the loop
INFO [thread 42], skch::main, ready to exit the loop
INFO [thread 4], skch::main, ready to exit the loop
INFO [thread 45], skch::main, ready to exit the loop
INFO [thread 29], skch::main, ready to exit the loop
INFO [thread 36], skch::main, ready to exit the loop
INFO [thread 14], skch::main, ready to exit the loop
INFO [thread 10], skch::main, ready to exit the loop
INFO [thread 9], skch::main, ready to exit the loop
INFO [thread 32], skch::main, ready to exit the loop
INFO [thread 5], skch::main, ready to exit the loop
INFO [thread 37], skch::main, ready to exit the loop
INFO [thread 28], skch::main, ready to exit the loop
INFO [thread 8], skch::main, ready to exit the loop
INFO [thread 31], skch::main, ready to exit the loop
INFO [thread 49], skch::main, ready to exit the loop
INFO [thread 13], skch::main, ready to exit the loop
INFO [thread 35], skch::main, ready to exit the loop
INFO [thread 16], skch::main, ready to exit the loop
INFO [thread 26], skch::main, ready to exit the loop
INFO [thread 48], skch::main, ready to exit the loop
INFO [thread 27], skch::main, ready to exit the loop
INFO [thread 40], skch::main, ready to exit the loop
INFO [thread 20], skch::main, ready to exit the loop
INFO [thread 11], skch::main, ready to exit the loop
INFO [thread 0], skch::Sketch::build, minimizers picked from reference = 118228
INFO [thread 0], skch::Sketch::index, unique minimizers = 115202
INFO [thread 0], skch::Sketch::computeFreqHist, Frequency histogram of minimizers = (1, 112667) ... (14, 1)
INFO [thread 0], skch::Sketch::computeFreqHist, consider all minimizers during lookup.
INFO [thread 0], skch::main, Time spent sketching the reference : 0.161255 sec
INFO [thread 0], skch::main, Time spent mapping fragments in query #1 : 0.0048895 sec
INFO [thread 0], skch::main, Time spent post mapping : 1.052e-06 sec
INFO [thread 0], skch::main, ready to exit the loop
INFO, skch::main, parallel_for execution finished

However, there is nothing in fastANI_out_592 output. I think there should be one comparison inside.

I run another cluster also with only one genome that was good with dRep, there is one comparison in fastANI_out, and the log is same as above. I am confused, is it due to the bin file itself? Could you have a look where the problem is?

Thanks

@MrOlm
Copy link
Owner

MrOlm commented Nov 19, 2020

...interesting. Are you sure that cluster 592 only has a single genome in it? The size of Bdb.csv (measured with the command cat Bdb.csv | wc -l) is only 1 line longer than Cdb.csv (assessed with the command cat Cdb.csv | wc -l)? The reason I ask is that secondary comparisons are not usually run on primary clusters that only have a single genome in them, so it's a bit surprising to me.

Thank you,
Matt

@Biofarmer
Copy link
Author

Biofarmer commented Nov 19, 2020

Hi Dr. Olm,

Yes, there is only one genome difference between Bdb.csv (200001) and Cdb.csv (20000). I run 20000 genomes together, plus the title line, that's why the Bdb.csv is with 200001 lines. Indeed, I have so many fastANI output with only one genome in /data/fastANI_files/ folder.

The top 10 lines from Cdb.csv:

genome | secondary_cluster | threshold | cluster_method | comparison_algorithm | primary_cluster
-- | -- | -- | -- | -- | --
GCF_000711905.1_ASM71190v1_genomic.fna | 1_0 | 0.05 | average | fastANI | 1
GCF_002002665.1_ASM200266v1_genomic.fna | 2_0 | 0.05 | average | fastANI | 2
GCF_902158755.1_SYNGOMJ08_genomic.fna | 3_0 | 0.05 | average | fastANI | 3
GCF_000008665.1_ASM866v1_genomic.fna | 4_1 | 0.05 | average | fastANI | 4
GCF_000734035.1_ASM73403v1_genomic.fna | 4_1 | 0.05 | average | fastANI | 4
GCF_001610875.1_ASM161087v1_genomic.fna | 5_0 | 0.05 | average | fastANI | 5
GCF_001654775.1_ASM165477v1_genomic.fna | 6_0 | 0.05 | average | fastANI | 6
GCF_002212725.1_ASM221272v1_genomic.fna | 7_0 | 0.05 | average | fastANI | 7
GCF_006740685.1_ASM674068v1_genomic.fna | 8_0 | 0.05 | average | fastANI | 8
GCF_014078535.1_ASM1407853v1_genomic.fna | 9_0 | 0.05 | average | fastANI | 9
GCF_013407165.1_ASM1340716v1_genomic.fna | 10_0 | 0.05 | average | fastANI | 10

The version of fastANI I used is 1.32, and the version of dRep is 2.6.2, and run dRep bonus test --check_dependencies

Loading work directory
Checking dependencies
mash.................................... all good        (location = /install/software/anaconda3.6.b/bin/mash)
nucmer.................................. all good        (location = /install/software/anaconda3.6.b/bin/nucmer)
checkm.................................. all good        (location = /install/software/anaconda3.6.b/bin/checkm)
ANIcalculator........................... !!! ERROR !!!   (location = None)
prodigal................................ all good        (location = /install/software/anaconda3.6.b/bin/prodigal)
centrifuge.............................. all good        (location = /install/software/anaconda3.6.b/bin/centrifuge)
nsimscan................................ !!! ERROR !!!   (location = None)
fastANI................................. all good        (location = /install/software/fastani/1.32/fastANI)

Sorry about the new problem, and thanks a million for your time.

@MrOlm
Copy link
Owner

MrOlm commented Nov 23, 2020

Hello,

OK interesting- thank you for pointing this out to me. You are indeed correct and apologies for not understanding how my own program works.

Unfortunately though I don't think I'll be able to help with this problem, since it seems to be an issue with FastANI. You can try and post an issue on the FastANI page (https://github.com/ParBLiSS/FastANI/issues), but they don't seem to be answered very often.

However, in future versions of dRep I will have it assign genomes that are in a primary cluster by themselves to a cluster without needing to run fastANI.

Best,
Matt

@Biofarmer
Copy link
Author

Hi Dr. Olm,

  1. So do you mean that secondary comparisons are run on primary clusters that only have a single genome in them, and what I have run and the output is correct, right?

  2. As for the problem I met with the cluster 592, I should ask on the FastANI page, right? I will do that. However, I think if only one genome is in this cluster, actually I do not need to run secondary comparison, and just take it as its cluster 592. Am I right?

Thanks

@MrOlm
Copy link
Owner

MrOlm commented Nov 23, 2020

Hello,

Yes, you are correct on both points. FastANI failing doesn't matter in this case, because no matter what that genome would be cluster 592.

-Matt

@Biofarmer
Copy link
Author

Hi Dr. Olm,

Thank you very much for reply fast and now it does make sense to me.

Thanks again.

@Biofarmer
Copy link
Author

Hi Dr. Olm,

FYI, I just searched the answer to my fastANI problem, and I found it here ParBLiSS/FastANI#21. I also changed the fragLen like 200, this genome worked with fastANI.

A quite small question about the command of fastANI in dRep, like below, the last 10 letters are just to distinguish the output of fastANI in fastANI_files folder, making no any effect on the process of fastANI itself, right?

$ fastANI --ql genomeList --rl genomeList -o fastANI_out_pqbxqtvbuy --matrix -t 8 --minFraction 0 pqbxqtvbuy

Thanks

@MrOlm
Copy link
Owner

MrOlm commented Nov 23, 2020

Interesting. Thank you for pointing this out to me.

Yes, the last 10 letters are randomly generated and just used to distinguish output files.

-Matt

@Biofarmer
Copy link
Author

Hi Dr. Olm,

Thank you for reply. Anyway, this is not a problem in the case of one genome in a cluster.
That's great, will run all my genomes.

Thanks again.

@Biofarmer
Copy link
Author

Biofarmer commented Dec 3, 2020

Hi Dr. Olm,

One question from the output of run 30,000 genome together, with

dRep compare output -p 50 -pa 0.90 -sa 0.95 -nc 0.30 -cm larger --S_algorithm fastANI -g genome_path.txt

There is no error reported for Step 1. Cluster

***************************************************
    ..:: dRep compare Step 1. Cluster ::..
***************************************************

Loading genomes from a list
Clustering Step 1. Parse Arguments
Clustering Step 2. Perform MASH (primary) clustering
2a. Run pair-wise MASH clustering
2b. Cluster pair-wise MASH clustering
4362 primary clusters made
Step 3. Perform secondary clustering
Running 21641432 fastANI comparisons- should take ~ 2887.0 min
Step 4. Return output
***************************************************
    ..:: dRep compare Step 2. Bonus ::..
***************************************************

However, in the Cdb, one genome is missing compared to Bdb (there are 30,000 genomes). I also checked the unique primary cluster of Cdb, it is correct with 4362. I also checked the number of comparison in Mdb, it is correct with 59,999, however, there is no record in Ndb.

  1. Do you know why? may be still the FastANI issue as discussed above. In addition, is it possible to check which cluster this genome is from?

  2. If I leave this genome out, is the output of current clustering in Cdb still reliable?

Thanks
Wang

@Biofarmer
Copy link
Author

Hi Dr. Olm,

I updated my question, can you look at it when you are available?
Thanks

@MrOlm
Copy link
Owner

MrOlm commented Dec 8, 2020

Hello,

This is an interesting problem and not something that I've encountered before. I don't see any way that the problem genome could impact anything else, so I believe the rest of the clustering output is still correct.

I think this is probably an issue with FastANI, though it's not clear to me what the issue is. The only way I can think of to tell which primary cluster the missing genome is in is to re-run the program with the --debug flag. This will create an additional file called CdbF.csv in the data_tables folder, which contains the clustering information for all genomes with primary clustering only. You can also run this additional step with the --SkipSecondary flag to make it run much faster if you'd like.

Best,
Matt

@Biofarmer
Copy link
Author

Hi Matt,

Thanks for reply. It is good that all rest clusters are fine.

Best,
Wang

@Biofarmer
Copy link
Author

Biofarmer commented Dec 11, 2020

Hi Matt,

When I run ~50,000 genomes, and there is an error:

File "/install/software/anaconda3.6.b/lib/python3.6/site-packages/pandas/core/reshape/reshape. py", line 124, in __init__
    raise ValueError('Unstacked DataFrame is too big, '
ValueError: Unstacked DataFrame is too big, causing int32 overflow

I think this is due to the large number of genomes, right? Is there any other way to solve this problem without reducing the genome number? In addition, I can shorten my genome names with half length.

Thanks
Wang

@MrOlm
Copy link
Owner

MrOlm commented Jun 28, 2021

Yes your understanding correctly. And yes you can still use the clustering as reliable even when warnings are present; I ignore the warnings in my own research

-Matt

@Biofarmer
Copy link
Author

Thanks a million, Matt. Best, Wang

@Biofarmer
Copy link
Author

Hi Matt, I have sent one email to your gmail obtained from your github homepage, which is about the paper https://www.nature.com/articles/s41467-017-02018-w, may I ask whether you know it? Thanks, Wang

@Biofarmer
Copy link
Author

Biofarmer commented Jul 21, 2021

Hi Matt,
I just found a failure to make plot in dRep (v2.6.2) compare Step 4. Analyze as below:

making plots 1, 2, 3, 4
Plotting primary dendrogram
Failed to make plot #1: Image size of 1000x607960 pixels is too large. It must be less than 2^16 in each direction.

I think this does not matter for the clustering results, right?
Thanks
Wang

@MrOlm
Copy link
Owner

MrOlm commented Jul 21, 2021

Yes, that will not impact clustering results.

In response to your previous question, yes I know about that paper. It uses methods from the pre-inStrain days. Do you have a question about it?

@Biofarmer
Copy link
Author

Hi Matt,
Thank you for your fast reply, that's great it does not impact the clustering results.

As for the other question, I have a question about the metadata for four samples, and sent it to mattolm@gmail.com, do you receive it?

Thanks
Wang

@Biofarmer
Copy link
Author

Biofarmer commented Jul 21, 2021

Yes, that will not impact clustering results.

In response to your previous question, yes I know about that paper. It uses methods from the pre-inStrain days. Do you have a question about it?

Sorry to follow this question, as indicating to fail to make plot for primary cluster, why I still find 'Primary_clustering_dendrogram.pdf' in the folder of 'figures'?

@MrOlm
Copy link
Owner

MrOlm commented Jul 21, 2021

The figure in that .pdf will just be broken; when the .pdf creation crashes in the middle of creation (as it did in this case) it still leaves some junk behind.

I don't believe I got the email. Maybe try re-sending to mattolm@stanford.edu

@Biofarmer
Copy link
Author

Biofarmer commented Jul 21, 2021

The figure in that .pdf will just be broken; when the .pdf creation crashes in the middle of creation (as it did in this case) it still leaves some junk behind.

I don't believe I got the email. Maybe try re-sending to mattolm@stanford.edu

So, the Primary_clustering_dendrogram.pdf in this case is not complete, right? and this failure to make plot is due to too large number of genomes?

Thanks

@Biofarmer
Copy link
Author

The figure in that .pdf will just be broken; when the .pdf creation crashes in the middle of creation (as it did in this case) it still leaves some junk behind.
I don't believe I got the email. Maybe try re-sending to mattolm@stanford.edu

So, the Primary_clustering_dendrogram.pdf in this case is not complete, right? and this failure to make plot is due to too large number of genomes?

Thanks

Hi Matt, do I understand correctly? Thanks, Wang

@MrOlm
Copy link
Owner

MrOlm commented Jul 23, 2021

I think the figure it not complete, but I'm not entirely sure what happens when matplotlib encounters that problem

@Biofarmer
Copy link
Author

I think the figure it not complete, but I'm not entirely sure what happens when matplotlib encounters that problem

Hi Matt, okay, thanks. I think this failure to make plot is due to too large number of genomes, right? When I run less genomes, this plot was plotted without failure.

@MrOlm
Copy link
Owner

MrOlm commented Jul 23, 2021

Yes exactly

@Biofarmer
Copy link
Author

Biofarmer commented Jul 23, 2021

Good to learn. Anyway, the unaffected clustering results are the most important. Many thanks.

@Biofarmer
Copy link
Author

Biofarmer commented Apr 29, 2022

Hi Matt, May I ask a question about the memory used? If I run 30000 genomes at once (~90G in size) with 40 threads, how much memory and storage (considering the intermediate files) has to be used approximately? Thanks

@brittanysuttner
Copy link

Hello, I think I found a bug in dRep related to running it with the -l option less than 3000. I am trying to derep a collection of viral genomes (many of them are less than 2k) so I run derep as such:
> dRep dereplicate --S_algorithm fastANI -nc .5 -l 1000 -d -sa 0.99 -N50W 0 -sizeW 1 --ignoreGenomeQuality --clusterAlg single --multiround_primary_clustering drep_pig_virus_sa99 -g genomes/*ffn

This results in the majority of my genomes failing the fastANI step and I think its becasue the default FragmentLength for fastANI is 3000 and that should be lowered when running dRep with -l 1000 (maybe FragmentLength should be set equal to whatever is set for -l if its less than 3000?). Or there should be an option in the dRep command to set options in the fastANI command.
Thanks!

@MrOlm
Copy link
Owner

MrOlm commented Sep 15, 2022

Hello,

Can you please confirm you’re running the most up-to-date versions of dRep and fastANI? I remember this being an issue that I believe I fixed

-Matt

@brittanysuttner
Copy link

Yes, I am running the most up-to-date versions *I think of dRep (v3.4.0) and fastANI (v1.33).
I installed dRep using pip install so it just installed that version by default.

@MrOlm
Copy link
Owner

MrOlm commented Sep 19, 2022

Ahhh OK I see. I used to have an option to set the fragment length, but it ended up being a problem with different version of inStrain. I will try and address this in the next dRep update- thank you for bringing it to my attention

-MO

@Gian77
Copy link

Gian77 commented Jan 10, 2023

Hello,
I know this is closed but I am having this weird output from dRep. dRep is telling me

...
...
  File "/mnt/home/benucci/anaconda2/envs/drep/lib/python3.9/site-packages/drep/d_cluster/compare_utils.py", line 298, in secondary_clustering
    ndb = compare_genomes(bdb, algorithm, data_folder, **kwargs)
  File "/mnt/home/benucci/anaconda2/envs/drep/lib/python3.9/site-packages/drep/d_cluster/compare_utils.py", line 367, in compare_genomes
    df = drep.d_cluster.external.run_pairwise_fastANI(genome_list, working_data_folder, **kwargs)
  File "/mnt/home/benucci/anaconda2/envs/drep/lib/python3.9/site-packages/drep/d_cluster/external.py", line 93, in run_pairwise_fastANI
    exe_loc = drep.get_exe('fastANI')
  File "/mnt/home/benucci/anaconda2/envs/drep/lib/python3.9/site-packages/drep/__init__.py", line 100, in get_exe
    raise ValueError("{0} isn't working- make sure its installed".format(name))
ValueError: fastANI isn't working- make sure its installed

but fstani is there

dendropy                  4.5.2              pyh3252c3a_0    bioconda
drep                      3.4.0              pyhdfd78af_0    bioconda
expat                     2.4.4                h295c915_0    anaconda
fastani                   1.33                 h0fdf51a_0    bioconda
fftw                      3.3.9                h27cfd23_1  

and

mash.................................... all good        (location = /mnt/home/benucci/anaconda2/envs/drep/bin/mash)
nucmer.................................. all good        (location = /mnt/home/benucci/anaconda2/envs/drep/bin/nucmer)
checkm.................................. all good        (location = /mnt/home/benucci/anaconda2/envs/drep/bin/checkm)
ANIcalculator........................... !!! ERROR !!!   (location = None)
prodigal................................ all good        (location = /mnt/home/benucci/anaconda2/envs/drep/bin/prodigal)
centrifuge.............................. !!! ERROR !!!   (location = None)
nsimscan................................ !!! ERROR !!!   (location = None)
fastANI................................. !!! ERROR !!!   (location = /mnt/home/benucci/anaconda2/envs/drep/bin/fastANI)

what is happening? Maybe doesnt like the fastANI version?
Thanks much!
G.

@MrOlm
Copy link
Owner

MrOlm commented Jan 10, 2023

Hi @Gian77 -

What happens when you try the command /mnt/home/benucci/anaconda2/envs/drep/bin/fastANI -h?

FastANI sometimes requires its own dependencies, which could be why this isn't working

-Matt

@Gian77
Copy link

Gian77 commented Jan 10, 2023

Hey @MrOlm

Thanks fro your fast answer on this.

This, what comes out, what's libgsl.so.25?

(drep) [benucci@dev-intel18 DAS_Tool_bins]$ /mnt/home/benucci/anaconda2/envs/drep/bin/fastANI -h
/mnt/home/benucci/anaconda2/envs/drep/bin/fastANI: error while loading shared libraries: libgsl.so.25: cannot open shared object file: No such file or directory

Gian

@MrOlm
Copy link
Owner

MrOlm commented Jan 10, 2023

Hi Gian,

This is what I suspected. This has something to do with C++ and how fastANI is written, which is something I don't fully understand.

Someone else was able to fix the problem here - #146

And a discussion of the problem on fastANI's gitHub can be found here - ParBLiSS/FastANI#96

Best,
Matt

@Gian77
Copy link

Gian77 commented Jan 10, 2023

Thanks much! I dug this out...

I was able to make fastANI to work by installing the right gsl libraries in the conda environemnt

conda install -c conda-forge gsl=2.7=he838d99_0

dRep worked after this fix.

Gian

@AlessioMilanese
Copy link

Following on the discussion.

Main problem: fastANI -h return 1, and I cannot run dRep.


I have fastANI installed and it works. I try to run:

$ fastANI -q test/2838728.3.fna -r test/2838728.3.fna -o test
$ cat test
test/2838728.3.fna      test/2838728.3.fna      100     353     357

But if I do:

$ fastANI -h
$ echo $?
1

Which basically means it return an error when you call it with -h.


dRep cannot proceed, even if fastANI is working, error:

    exe_loc = drep.get_exe('fastANI')
  File "/home/ec2-user/Software/miniconda3/envs/drep/lib/python3.7/site-packages/drep/__init__.py", line 100, in get_exe
    raise ValueError("{0} isn't working- make sure its installed".format(name))
ValueError: fastANI isn't working- make sure its installed

Tool version:

drep                      3.4.5              pyhdfd78af_0    bioconda
fastani                   1.1                  h4ef8376_0    bioconda

@MrOlm
Copy link
Owner

MrOlm commented Sep 27, 2023

Hi @AlessioMilanese -

Thanks for tracking down this issue and posting here- I very much appreciate it! I will address in the next dRep update (or please feel free to submit a pull request if you're inclined). In the meantime, it seems that updating fastANI to new versions fixes this fastANI behavior.

Thanks again!
Matt

@MrOlm MrOlm reopened this Sep 27, 2023
@AlessioMilanese
Copy link

Thanks for the fast response Matt. For some reasons I cannot install a newer version of FastANI, but I will figure it out.

I will address in the next dRep update (or please feel free to submit a pull request if you're inclined)

I'm not sure what is the best solution. I think I would add an option (like -I) to skip the part where you check if the tool is installed and working.

@MrOlm
Copy link
Owner

MrOlm commented Sep 29, 2023

Hi @AlessioMilanese - yeah that would be a fine solution, or (if possible) just doing a different check for fastANI that doesn't return a 0 error code

@MrOlm MrOlm added the bug label Mar 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants