Issue with SQMtools #840

mradz19 · 2024-05-10T05:10:16Z

Sorry another issue. I have loaded the project into R using loadSQM, but when trying to subset taxa I am getting an error which seems to be driven by an issue with the orf$table

water$orfs$table
Error in `[.data.frame`(x, i, j, with = FALSE, drop = FALSE) : 
  unused argument (with = FALSE)

The text was updated successfully, but these errors were encountered:

fpusan · 2024-05-10T07:17:12Z

Any chance this is happening when resuming a previously saved R session?
Does it persist after you run library(data.table); library(SQMtools)?

fpusan · 2024-05-10T08:13:56Z

Yes, I can reproduce that exact error by

loading a project in SQMtools

library(SQMtools)
SQM=loadSQM("mock1")

Saving the session and quitting R

save.image("foo.RData")
quit(save="no")

Restoring the session and trying to subset

library(SQMtools)
load("foo.RData")
x=subsetTax(SQM,"phylum","Pseudomonadota")
# Error in `[.data.frame`(x, , j, with = FALSE, drop = TRUE) : 
#  unused argument (with = FALSE)

Issue dissapears if I manually load data.table

library(data.table)
x=subsetTax(SQM, "phylum", "Pseudomonadota") # this works

The question is that in

SqueezeMeta/lib/SQMtools/R/genericTable.R

Line 55 in edf13b1

`[.generic.data.table` = function(x,i,j, drop = TRUE)

we rely on data.table:::`[.data.table` being part of our environment, but this does not happen automatically when loading SQMtools. Instead, it only seems to happen after running loadSQM

library(SQMtools)
sessionInfo()

# R version 4.3.2 (2023-10-31)

# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     

# other attached packages:
# [1] SQMtools_1.7.0

# loaded via a namespace (and not attached):
# [1] compiler_4.3.2

SQM = loadSQM("mock1")
sessionInfo()

# R version 4.3.2 (2023-10-31)

# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     

# other attached packages:
# [1] SQMtools_1.7.0

# loaded via a namespace (and not attached):
# [1] compiler_4.3.2    data.table_1.14.8

mradz19 · 2024-05-14T23:55:21Z

Yes this was using an RDS file that had been transferred between R sessions. Doing it in the original session I am having no issues.

However I have noticed that in my bin_table most of the bins have NA in the completeness, contamination and strain het columns. I have never seen that before, what would be causing that data to be missing?

fpusan · 2024-05-17T07:54:21Z

Maybe some issue running checkm for those bins in particular?
Can you share the syslog file here? that should include the output from checkm

mradz19 · 2024-05-20T01:34:05Z

Yep sure this is the syslog:
syslog.txt

I processed my SQM object the following way:

bacteria=subsetTax(project, 'superkingdom', 'Bacteria', trusted_functions_only = T)
goodORFs = rownames(bacteria$orfs$tax)[bacteria$orfs$tax[,'superkingdom']=='Bacteria']
b.good = subsetORFs(bacteria, goodORFs, tax_source='orfs', trusted_functions_only = T, ignore_unclassified_functions = T)

bin_table=b.good$bins$table

It was in the table output that I noticed most of the bins had NAs for completeness and contamination. Which I haven't had in the past.

Here is an example of the table:

bin_table.csv

fpusan · 2024-05-22T08:14:55Z

I suspect the bins with missing checkm info are coming from eukaryotes (which CheckM does not cover).
Can you confirm by looking at the SqueezeMeta taxonomy in those bins?

mradz19 · 2024-06-06T00:53:56Z

I don't think that is the case as when I look at the bin_table.csv the tax for all of them is K_bacteria;...

mradz19 · 2024-06-06T01:05:40Z

I checked the actual .fa.contigs.fa.tax files as well and the consensus tax all seem to be bacteria, for example:

nano metabat2.1187.fa.contigs.fa.tax
List of taxa (abundance >= 1%): k: Bacteria (99.8);p: Proteobacteria (98.2);c: Betaproteobacteria (90.5) Alphaproteobacteria (1.8);o: Burkholderiales (60.2) Hyphomicrobiales (1.4);f: Comamonadaceae (29.5);g: Polaromonas (2.4) Limnohabitans (1.0);s: Betaproteobacteri>
SqueezeMeta tax: k_Bacteria;p_Proteobacteria;c_Betaproteobacteria;o_Burkholderiales
Consensus: k_Bacteria;p_Proteobacteria;c_Betaproteobacteria;o_Burkholderiales;  Total size: 2686974     Disparity: 0.089        Consensus mode: s

However the completeness and contamination values for this bin are both NA

I did go ahead and do a separate install of checkM and ran it successfully on metabat2.1187.fa.contigs.fa:

 checkm lineage_wf -t 8 bins/ test/
[2024-06-06 11:45:32] INFO: CheckM v1.2.2
[2024-06-06 11:45:32] INFO: checkm lineage_wf -t 8 bins/ test/
[2024-06-06 11:45:32] INFO: CheckM data: data/
[2024-06-06 11:45:32] INFO: [CheckM - tree] Placing bins in reference genome tree.
[2024-06-06 11:45:32] INFO: Identifying marker genes in 1 bins with 8 threads:
    Finished processing 1 of 1 (100.00%) bins.
[2024-06-06 11:45:50] INFO: Saving HMM info to file.
[2024-06-06 11:45:50] INFO: Calculating genome statistics for 1 bins with 8 threads:
    Finished processing 1 of 1 (100.00%) bins.
[2024-06-06 11:45:50] INFO: Extracting marker genes to align.
[2024-06-06 11:45:50] INFO: Parsing HMM hits to marker genes:
    Finished parsing hits for 1 of 1 (100.00%) bins.
[2024-06-06 11:45:50] INFO: Extracting 43 HMMs with 8 threads:
    Finished extracting 43 of 43 (100.00%) HMMs.
[2024-06-06 11:45:50] INFO: Aligning 43 marker genes with 8 threads:
    Finished aligning 43 of 43 (100.00%) marker genes.
[2024-06-06 11:45:50] INFO: Reading marker alignment files.
[2024-06-06 11:45:50] INFO: Concatenating alignments.
[2024-06-06 11:45:50] INFO: Placing 1 bins into the genome tree with pplacer (be patient).
[2024-06-06 11:47:57] INFO: { Current stage: 0:02:25.649 || Total: 0:02:25.649 }
[2024-06-06 11:47:57] INFO: [CheckM - lineage_set] Inferring lineage-specific marker sets.
[2024-06-06 11:47:57] INFO: Reading HMM info from file.
[2024-06-06 11:47:57] INFO: Parsing HMM hits to marker genes:
    Finished parsing hits for 1 of 1 (100.00%) bins.
[2024-06-06 11:47:57] INFO: Determining marker sets for each genome bin.
    Finished processing 1 of 1 (100.00%) bins (current: metabat2.1187.fa.contigs).
[2024-06-06 11:47:58] INFO: Marker set written to: test/lineage.ms
[2024-06-06 11:47:58] INFO: { Current stage: 0:00:01.248 || Total: 0:02:26.898 }
[2024-06-06 11:47:58] INFO: [CheckM - analyze] Identifying marker genes in bins.
[2024-06-06 11:47:59] INFO: Identifying marker genes in 1 bins with 8 threads:
    Finished processing 1 of 1 (100.00%) bins.
[2024-06-06 11:49:23] INFO: Saving HMM info to file.
[2024-06-06 11:49:23] INFO: { Current stage: 0:01:25.023 || Total: 0:03:51.921 }
[2024-06-06 11:49:23] INFO: Parsing HMM hits to marker genes:
    Finished parsing hits for 1 of 1 (100.00%) bins.
[2024-06-06 11:49:24] INFO: Aligning marker genes with multiple hits in a single bin:
    Finished processing 1 of 1 (100.00%) bins.
[2024-06-06 11:49:24] INFO: { Current stage: 0:00:00.853 || Total: 0:03:52.775 }
[2024-06-06 11:49:24] INFO: Calculating genome statistics for 1 bins with 8 threads:
    Finished processing 1 of 1 (100.00%) bins.
[2024-06-06 11:49:25] INFO: { Current stage: 0:00:00.182 || Total: 0:03:52.958 }
[2024-06-06 11:49:25] INFO: [CheckM - qa] Tabulating genome statistics.
[2024-06-06 11:49:25] INFO: Calculating AAI between multi-copy marker genes.
[2024-06-06 11:49:25] INFO: Reading HMM info from file.
[2024-06-06 11:49:25] INFO: Parsing HMM hits to marker genes:
    Finished parsing hits for 1 of 1 (100.00%) bins.
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Bin Id                            Marker lineage          # genomes   # markers   # marker sets    0     1    2    3   4   5+   Completeness   Contamination   Strain heterogeneity  
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  metabat2.1187.fa.contigs   o__Burkholderiales (UID4000)      193         427           214        155   261   11   0   0   0       66.75            2.96              27.27          
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

As you can see the contamination and completeness was calculated correctly

fpusan · 2024-06-07T07:15:08Z

Yes, it seems that CheckM should have worked correctly for that one. My recommendation here would be to locate the contigs that are really from prokaryotes by loking at the bin table and running CheckM manually for those.

We are actually going to upgrade to CheckM2 for the next version of SqueezeMeta (which should come sometime in summer) and we have rewritten the corresponding SqueezeMeta scripts, so whatever is causing this bug will not be present (though of course new bugs might!).

fpusan added a commit that referenced this issue May 10, 2024

attach data.table without needing to run loadSQM, fixes #840

f9afb4e

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Issue with SQMtools #840

Issue with SQMtools #840

mradz19 commented May 10, 2024

fpusan commented May 10, 2024

fpusan commented May 10, 2024

mradz19 commented May 14, 2024

fpusan commented May 17, 2024

mradz19 commented May 20, 2024 •

edited

fpusan commented May 22, 2024

mradz19 commented Jun 6, 2024

mradz19 commented Jun 6, 2024 •

edited

fpusan commented Jun 7, 2024

Issue with SQMtools #840

Issue with SQMtools #840

Comments

mradz19 commented May 10, 2024

fpusan commented May 10, 2024

fpusan commented May 10, 2024

mradz19 commented May 14, 2024

fpusan commented May 17, 2024

mradz19 commented May 20, 2024 • edited

fpusan commented May 22, 2024

mradz19 commented Jun 6, 2024

mradz19 commented Jun 6, 2024 • edited

fpusan commented Jun 7, 2024

mradz19 commented May 20, 2024 •

edited

mradz19 commented Jun 6, 2024 •

edited