Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue with SQMtools #840

Open
mradz19 opened this issue May 10, 2024 · 9 comments
Open

Issue with SQMtools #840

mradz19 opened this issue May 10, 2024 · 9 comments

Comments

@mradz19
Copy link

mradz19 commented May 10, 2024

Sorry another issue. I have loaded the project into R using loadSQM, but when trying to subset taxa I am getting an error which seems to be driven by an issue with the orf$table

water$orfs$table
Error in `[.data.frame`(x, i, j, with = FALSE, drop = FALSE) : 
  unused argument (with = FALSE)
@fpusan
Copy link
Collaborator

fpusan commented May 10, 2024

Any chance this is happening when resuming a previously saved R session?
Does it persist after you run library(data.table); library(SQMtools)?

@fpusan
Copy link
Collaborator

fpusan commented May 10, 2024

Yes, I can reproduce that exact error by

  1. loading a project in SQMtools
library(SQMtools)
SQM=loadSQM("mock1")
  1. Saving the session and quitting R
save.image("foo.RData")
quit(save="no")
  1. Restoring the session and trying to subset
library(SQMtools)
load("foo.RData")
x=subsetTax(SQM,"phylum","Pseudomonadota")
# Error in `[.data.frame`(x, , j, with = FALSE, drop = TRUE) : 
#  unused argument (with = FALSE)
  1. Issue dissapears if I manually load data.table
library(data.table)
x=subsetTax(SQM, "phylum", "Pseudomonadota") # this works

The question is that in

`[.generic.data.table` = function(x,i,j, drop = TRUE)
we rely on data.table:::`[.data.table` being part of our environment, but this does not happen automatically when loading SQMtools. Instead, it only seems to happen after running loadSQM

library(SQMtools)
sessionInfo()

# R version 4.3.2 (2023-10-31)

# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     

# other attached packages:
# [1] SQMtools_1.7.0

# loaded via a namespace (and not attached):
# [1] compiler_4.3.2

SQM = loadSQM("mock1")
sessionInfo()

# R version 4.3.2 (2023-10-31)

# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     

# other attached packages:
# [1] SQMtools_1.7.0

# loaded via a namespace (and not attached):
# [1] compiler_4.3.2    data.table_1.14.8

@mradz19
Copy link
Author

mradz19 commented May 14, 2024

Yes this was using an RDS file that had been transferred between R sessions. Doing it in the original session I am having no issues.

However I have noticed that in my bin_table most of the bins have NA in the completeness, contamination and strain het columns. I have never seen that before, what would be causing that data to be missing?

@fpusan
Copy link
Collaborator

fpusan commented May 17, 2024

Maybe some issue running checkm for those bins in particular?
Can you share the syslog file here? that should include the output from checkm

@mradz19
Copy link
Author

mradz19 commented May 20, 2024

Yep sure this is the syslog:
syslog.txt

I processed my SQM object the following way:

bacteria=subsetTax(project, 'superkingdom', 'Bacteria', trusted_functions_only = T)
goodORFs = rownames(bacteria$orfs$tax)[bacteria$orfs$tax[,'superkingdom']=='Bacteria']
b.good = subsetORFs(bacteria, goodORFs, tax_source='orfs', trusted_functions_only = T, ignore_unclassified_functions = T)

bin_table=b.good$bins$table

It was in the table output that I noticed most of the bins had NAs for completeness and contamination. Which I haven't had in the past.

Here is an example of the table:

bin_table.csv

@fpusan
Copy link
Collaborator

fpusan commented May 22, 2024

I suspect the bins with missing checkm info are coming from eukaryotes (which CheckM does not cover).
Can you confirm by looking at the SqueezeMeta taxonomy in those bins?

@mradz19
Copy link
Author

mradz19 commented Jun 6, 2024

I don't think that is the case as when I look at the bin_table.csv the tax for all of them is K_bacteria;...

@mradz19
Copy link
Author

mradz19 commented Jun 6, 2024

I checked the actual .fa.contigs.fa.tax files as well and the consensus tax all seem to be bacteria, for example:

nano metabat2.1187.fa.contigs.fa.tax
List of taxa (abundance >= 1%): k: Bacteria (99.8);p: Proteobacteria (98.2);c: Betaproteobacteria (90.5) Alphaproteobacteria (1.8);o: Burkholderiales (60.2) Hyphomicrobiales (1.4);f: Comamonadaceae (29.5);g: Polaromonas (2.4) Limnohabitans (1.0);s: Betaproteobacteri>
SqueezeMeta tax: k_Bacteria;p_Proteobacteria;c_Betaproteobacteria;o_Burkholderiales
Consensus: k_Bacteria;p_Proteobacteria;c_Betaproteobacteria;o_Burkholderiales;  Total size: 2686974     Disparity: 0.089        Consensus mode: s

However the completeness and contamination values for this bin are both NA

I did go ahead and do a separate install of checkM and ran it successfully on metabat2.1187.fa.contigs.fa:

 checkm lineage_wf -t 8 bins/ test/
[2024-06-06 11:45:32] INFO: CheckM v1.2.2
[2024-06-06 11:45:32] INFO: checkm lineage_wf -t 8 bins/ test/
[2024-06-06 11:45:32] INFO: CheckM data: data/
[2024-06-06 11:45:32] INFO: [CheckM - tree] Placing bins in reference genome tree.
[2024-06-06 11:45:32] INFO: Identifying marker genes in 1 bins with 8 threads:
    Finished processing 1 of 1 (100.00%) bins.
[2024-06-06 11:45:50] INFO: Saving HMM info to file.
[2024-06-06 11:45:50] INFO: Calculating genome statistics for 1 bins with 8 threads:
    Finished processing 1 of 1 (100.00%) bins.
[2024-06-06 11:45:50] INFO: Extracting marker genes to align.
[2024-06-06 11:45:50] INFO: Parsing HMM hits to marker genes:
    Finished parsing hits for 1 of 1 (100.00%) bins.
[2024-06-06 11:45:50] INFO: Extracting 43 HMMs with 8 threads:
    Finished extracting 43 of 43 (100.00%) HMMs.
[2024-06-06 11:45:50] INFO: Aligning 43 marker genes with 8 threads:
    Finished aligning 43 of 43 (100.00%) marker genes.
[2024-06-06 11:45:50] INFO: Reading marker alignment files.
[2024-06-06 11:45:50] INFO: Concatenating alignments.
[2024-06-06 11:45:50] INFO: Placing 1 bins into the genome tree with pplacer (be patient).
[2024-06-06 11:47:57] INFO: { Current stage: 0:02:25.649 || Total: 0:02:25.649 }
[2024-06-06 11:47:57] INFO: [CheckM - lineage_set] Inferring lineage-specific marker sets.
[2024-06-06 11:47:57] INFO: Reading HMM info from file.
[2024-06-06 11:47:57] INFO: Parsing HMM hits to marker genes:
    Finished parsing hits for 1 of 1 (100.00%) bins.
[2024-06-06 11:47:57] INFO: Determining marker sets for each genome bin.
    Finished processing 1 of 1 (100.00%) bins (current: metabat2.1187.fa.contigs).
[2024-06-06 11:47:58] INFO: Marker set written to: test/lineage.ms
[2024-06-06 11:47:58] INFO: { Current stage: 0:00:01.248 || Total: 0:02:26.898 }
[2024-06-06 11:47:58] INFO: [CheckM - analyze] Identifying marker genes in bins.
[2024-06-06 11:47:59] INFO: Identifying marker genes in 1 bins with 8 threads:
    Finished processing 1 of 1 (100.00%) bins.
[2024-06-06 11:49:23] INFO: Saving HMM info to file.
[2024-06-06 11:49:23] INFO: { Current stage: 0:01:25.023 || Total: 0:03:51.921 }
[2024-06-06 11:49:23] INFO: Parsing HMM hits to marker genes:
    Finished parsing hits for 1 of 1 (100.00%) bins.
[2024-06-06 11:49:24] INFO: Aligning marker genes with multiple hits in a single bin:
    Finished processing 1 of 1 (100.00%) bins.
[2024-06-06 11:49:24] INFO: { Current stage: 0:00:00.853 || Total: 0:03:52.775 }
[2024-06-06 11:49:24] INFO: Calculating genome statistics for 1 bins with 8 threads:
    Finished processing 1 of 1 (100.00%) bins.
[2024-06-06 11:49:25] INFO: { Current stage: 0:00:00.182 || Total: 0:03:52.958 }
[2024-06-06 11:49:25] INFO: [CheckM - qa] Tabulating genome statistics.
[2024-06-06 11:49:25] INFO: Calculating AAI between multi-copy marker genes.
[2024-06-06 11:49:25] INFO: Reading HMM info from file.
[2024-06-06 11:49:25] INFO: Parsing HMM hits to marker genes:
    Finished parsing hits for 1 of 1 (100.00%) bins.
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Bin Id                            Marker lineage          # genomes   # markers   # marker sets    0     1    2    3   4   5+   Completeness   Contamination   Strain heterogeneity  
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  metabat2.1187.fa.contigs   o__Burkholderiales (UID4000)      193         427           214        155   261   11   0   0   0       66.75            2.96              27.27          
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

As you can see the contamination and completeness was calculated correctly

@fpusan
Copy link
Collaborator

fpusan commented Jun 7, 2024

Yes, it seems that CheckM should have worked correctly for that one. My recommendation here would be to locate the contigs that are really from prokaryotes by loking at the bin table and running CheckM manually for those.

We are actually going to upgrade to CheckM2 for the next version of SqueezeMeta (which should come sometime in summer) and we have rewritten the corresponding SqueezeMeta scripts, so whatever is causing this bug will not be present (though of course new bugs might!).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants