doco: GTDB R207 database inconsistency #3118

Open
wwood opened this issue Apr 16, 2024 · 3 comments
wwood commented Apr 16, 2024

Hi there,

I've been having some trouble getting R207 databases to work with sourmash tax metagenome. I'm using 4.8.8 from conda.

After running sketch, the instructions at https://sourmash.readthedocs.io/en/latest/tutorial-lemonade.html#id7 say

# use tax metagenome to classify the metagenome
sourmash tax metagenome -g SRR8859675.x.gtdb.csv \
    -t gtdb-rs207.taxonomy.sqldb -F human -r order

$ sourmash tax metagenome --gather-csv GCA_020052375.1_genomic.gather_gtdbrs207_reps.csv --taxonomy ../../output_sourmash/sourmash/data/gtdb-rs207.taxonomy.sqldb

There doesn't appear to be any *.sqldb available, now we should just use the taxonomy CSV?

OK, so

The lineage spreadsheet (for sourmash tax commands) is available at the species level and at the genome level.

"I only need the species reps" I think, so I'll just download the first one. But that fails:

== This is sourmash version 4.8.8. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loaded 1 gather results from 'GCA_020052375.1_genomic.gather_gtdbrs207_reps.csv'.
loaded results for 1 queries from 1 gather CSVs
of 1 gather results, lineage assignments for 1 results were missed.
The following are missing from the taxonomy information: GCF_000299365
Starting summarization up rank(s): strain, species, genus, family, order, class, phylum, superkingdom
Traceback (most recent call last):
  File "/home/woodcrob/e/sourmash-v4.8.8/bin/sourmash", line 11, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/woodcrob/e/sourmash-v4.8.8/lib/python3.12/site-packages/sourmash/__main__.py", line 20, in main
    retval = mainmethod(args)
             ^^^^^^^^^^^^^^^^
  File "/home/woodcrob/e/sourmash-v4.8.8/lib/python3.12/site-packages/sourmash/cli/tax/metagenome.py", line 150, in main
    return sourmash.tax.__main__.metagenome(args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/woodcrob/e/sourmash-v4.8.8/lib/python3.12/site-packages/sourmash/tax/__main__.py", line 203, in metagenome
    tax_utils.write_summary(
  File "/home/woodcrob/e/sourmash-v4.8.8/lib/python3.12/site-packages/sourmash/tax/tax_utils.py", line 1124, in write_summary
    header, summary = q_res.make_full_summary(
                      ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/woodcrob/e/sourmash-v4.8.8/lib/python3.12/site-packages/sourmash/tax/tax_utils.py", line 2581, in make_full_summary
    self.check_summarization()
  File "/home/woodcrob/e/sourmash-v4.8.8/lib/python3.12/site-packages/sourmash/tax/tax_utils.py", line 2542, in check_summarization
    raise ValueError("lineages not summarized yet.")
ValueError: lineages not summarized yet.

The genome one worked, so I got there in the end.

I'm a bit confused why the species one has ident entries along the lines of s__Escherichia_coli when sketch doesn't generate IDs of this type. Maybe I'm missing something.

Anyway, HTH,
ben

Contributor

ctb commented Apr 16, 2024

> There doesn't appear to be any *.sqldb available, now we should just use the taxonomy CSV?

There are some instructions above that need to be run - see "Let’s index the taxonomy database using SQLite, for faster access later on:".

sourmash tax prepare -t gtdb-rs207.taxonomy.csv \
    -o gtdb-rs207.taxonomy.sqldb -F sql

That having been said, you can use the taxonomy CSV too! It'll just take longer to load each time.
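To illustrate why an indexed SQLite file is quicker to query than a CSV that has to be re-parsed on every run, here is a minimal sketch using only the Python standard library. It is not sourmash's actual `.sqldb` schema: the `lineages` table and the `index_taxonomy` and `lookup` helpers are hypothetical names, and the only assumption about the CSV is that it has an `ident` column followed by rank columns.

```python
import csv
import sqlite3


def index_taxonomy(csv_path, db_path=":memory:"):
    """Load an ident -> lineage table into SQLite, keyed on ident.

    Illustrative only: this is NOT the schema that `sourmash tax prepare` writes.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS lineages (ident TEXT PRIMARY KEY, lineage TEXT)"
    )
    with open(csv_path, newline="") as fp:
        reader = csv.DictReader(fp)
        # Join every non-ident column into one semicolon-separated lineage string.
        rows = (
            (row["ident"], ";".join((v or "") for k, v in row.items() if k != "ident"))
            for row in reader
        )
        conn.executemany("INSERT OR REPLACE INTO lineages VALUES (?, ?)", rows)
    conn.commit()
    return conn


def lookup(conn, ident):
    """Return the lineage string for ident, or None if it is not in the table."""
    found = conn.execute(
        "SELECT lineage FROM lineages WHERE ident = ?", (ident,)
    ).fetchone()
    return found[0] if found else None
```

The CSV is parsed once when the table is built; later lookups go through the primary-key index instead of scanning the whole file again.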

> "I only need the species reps" I think, so I'll just download the first one.

Right! It needs to match the content of the database you're searching, which (in this case) is all of the GTDB genomes, not just the species-level representatives. We'll fix the tutorial to make this clear!
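One way to check that match before running `sourmash tax` is to compare the identifiers in the gather CSV against those in the taxonomy CSV. The sketch below is standard-library Python, not part of sourmash. It assumes the gather CSV has a `name` column whose first whitespace-separated token is the accession, that the taxonomy CSV has an `ident` column, and that version suffixes such as `.1` can be ignored (the error message above reports `GCF_000299365` without a version). `base_accession` and `missing_idents` are hypothetical helper names.

```python
import csv


def base_accession(name):
    """Strip the description and version suffix from a name or ident.

    'GCF_000299365.1 some genome' -> 'GCF_000299365'
    """
    ident = name.split(" ")[0]
    return ident.split(".")[0]


def missing_idents(gather_csv, taxonomy_csv):
    """Return gather identifiers that have no entry in the taxonomy CSV."""
    with open(taxonomy_csv, newline="") as fp:
        tax_idents = {base_accession(row["ident"]) for row in csv.DictReader(fp)}
    with open(gather_csv, newline="") as fp:
        gather_idents = {base_accession(row["name"]) for row in csv.DictReader(fp)}
    return sorted(gather_idents - tax_idents)
```

An empty list means every gather match has a lineage available. Anything returned suggests the taxonomy file does not cover the database that was searched.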

The download link is in the tutorial, under "We also want to download the accompanying taxonomy spreadsheet:"

> But that fails:

Well, and our error message certainly needs some help... we'll fix, thanks!

> I'm a bit confused why the species one has ident entries along the lines of s__Escherichia_coli when sketch doesn't generate IDs of this type. Maybe I'm missing something.

Oh dear, that does look incorrect to me - I wonder why we did that... I'll see if I can fix. Thank you very much for reporting all of this!
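A quick way to spot this kind of spreadsheet is to tally what the `ident` values look like. This is a rough sketch rather than anything sourmash provides: it assumes an `ident` column, treats anything matching `GCA_`/`GCF_` plus digits (with an optional version) as an accession, and treats anything starting with `s__` as a species name. `ident_styles` is a hypothetical helper name.

```python
import csv
import re
from collections import Counter

# Assumed accession shape: GCA_/GCF_ followed by digits and an optional .version
ACCESSION = re.compile(r"^GC[AF]_\d+(\.\d+)?$")


def ident_styles(taxonomy_csv):
    """Count how many idents look like accessions, species names, or neither."""
    counts = Counter()
    with open(taxonomy_csv, newline="") as fp:
        for row in csv.DictReader(fp):
            ident = row["ident"]
            if ACCESSION.match(ident):
                counts["accession"] += 1
            elif ident.startswith("s__"):
                counts["species_name"] += 1
            else:
                counts["other"] += 1
    return counts
```

A file whose idents are mostly `species_name` will not line up with gather results, which are keyed by genome accession.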

Contributor

ctb commented Apr 16, 2024

Fixing link to species database here: #3119

Author

wwood commented Apr 17, 2024

Thanks for the quick response @ctb - makes sense - fine by me to close this issue.

ctb added a commit that referenced this issue Apr 17, 2024
Per #3118, we linked the
wrong taxonomy spreadsheet! The one in there is an experimental
pangenome one. This PR fixes the links and adds better language.