Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

suitable for CDS and/or contig taxonomic assignment? #28

Open
AstrobioMike opened this issue May 22, 2023 · 2 comments
Open

suitable for CDS and/or contig taxonomic assignment? #28

AstrobioMike opened this issue May 22, 2023 · 2 comments

Comments

@AstrobioMike
Copy link

AstrobioMike commented May 22, 2023

Hey there, @shenwei356 :)

Thanks again for not only additional wonderful software, but excellent documentation as usual!

I'm looking for a better way to taxonomically classify predicted coding sequences and contigs from metagenomic assemblies (i currently use CAT with NCBI's nr).

I really want a combination of GTDB for bacteria/archaea, and then also be able to combine euks from NCBI, so your infrastructure enabling that sort of thing is really appealing to me 🙏

I see in issue #27 you note that KMCP is not suitable for long reads. Is your thinking similar for assembled contigs too?

And would you expect to have the same thoughts about taxonomically classifying predicted coding sequences (which might average around 800-1000 bases)?

If only one of those would be possible, I can imagine it might be reasonable to use it to infer the other. E.g., if contigs are do-able, then assigning all CDSs whatever their source contig tax was. And if CDSs are do-able, employing some consensus approach to assign to the contig the tax of its CDSs.

Sorry if i'm missing if you've covered this elsewhere already, and thanks for any of your thoughts!

@shenwei356
Copy link
Owner

Hi Mike, thanks for your interest in KMCP.

You can try splitting CDS/contigs when using kmcp search.

in=e.coli.cds.fasta

# search against GTDB part1
seqkit sliding -g -s 100 -W 300 $in \
    | seqkit seq -m 100 \
    | kmcp search -d gtdb.part_1.kmcp/ -o $in.kmcp@gtdb.part_1.tsv.gz

# search against GTDB part2
seqkit sliding -g -s 100 -W 300 $in \
    | seqkit seq -m 100 \
    | kmcp search -d gtdb.part_2.kmcp/ -o $in.kmcp@gtdb.part_2.tsv.gz

# search against other databases

# merge search results
kmcp merge $in.kmcp@*.tsv.gz -o $in.kmcp.tsv.gz

# profiling and output binning/classification results
kmcp profile -X taxdump/ -T taxid.map -m 5 \
    $in.kmcp.tsv.gz -o $in.profile -B $in.binning.gz

# ------------------------------

# collect taxids for each orginal query.
zcat $in.binning.gz | grep -v '@' | grep -v '#' \
    | csvtk replace -Ht -p '_sliding:\d+-\d+$' \
    | csvtk fold -Ht -f 1 -v 2 -s , \
    > $in.id2taxids.tsv

# compute LCA
taxonkit lca --data-dir taxdump/ -i 2 -D -s , $in.id2taxids.tsv \
    | cut -f 1,3 \
    > $in.id2taxids.lca.tsv

# retrieve lineages and count
taxonkit reformat --data-dir taxdump -I 2 $in.id2taxids.lca.tsv \
    | csvtk freq -Ht -f 3 -nr | csvtk pretty -Ht

Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;                         2602
Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli         1061
Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia fergusonii   96
Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia sp. E1V33    71

stats

$ seqkit  stats e.coli.cds.fasta
file              format  type  num_seqs    sum_len  min_len  avg_len  max_len
e.coli.cds.fasta  FASTA   DNA      4,315  4,025,874       27      933    7,077

$ wc -l $in.id2taxids.tsv
3830 e.coli.cds.fasta.id2taxids.tsv

Another example of Klebsiella pneumoniae CDS:

taxonkit reformat --data-dir taxdump -I 2 $in.id2taxids.lca.tsv \
    | csvtk freq -Ht -f 3 -nr | csvtk pretty -Ht
Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Klebsiella;Klebsiella pneumoniae   4723

file      format  type  num_seqs    sum_len  min_len  avg_len  max_len
kp.fasta  FASTA   DNA      5,316  4,696,377      114    883.4    9,492

4723 kp.fasta.id2taxids.tsv

@AstrobioMike
Copy link
Author

Awesome, thanks for plotting out a clear path! I looks forward to trying it out :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants