How to run RPS-BLAST+ and `cdd2cog` #1

aleimba · 2016-05-30T16:28:32Z

A master student from Brazil contacted me via email with questions how to run RPS-BLAST+ and cdd2cog.pl correctly. I'm copying the correspondence in here in case it is useful for someone else:

Hi, I am a master student of genetics at the Universidade Federal de Minas Gerais, Brasil. I was reading the cdd2goc description at
https://github.com/aleimba/bac-genomics-scripts/tree/master/cdd2cog#rps-blast

In the line referring to the use of RPS-Blast :

rpsblast -query protein.fasta -db Cog -out rps-blast.out -evalue 1e-2 -outfmt 6

I am confuse about the cog, which I highlighted. Is this another database we must download? If so, where could I find it, and to perform a search for protein sequences of a draft bacterial genome I assembled, how database should I get? Thank you in advance.

The text was updated successfully, but these errors were encountered:

aleimba · 2016-05-30T16:31:17Z

Hi Alexandre,

Cog is the preformatted RPS-BLAST+ database that can be found here: ftp://ftp.ncbi.nlm.nih.gov/pub/mmdb/cdd/little_endian/Cog_LE.tar.gz

You're right that could be clearer in the README.

You need to download all the files from the description and provide the paths to them with the respective options (or just put them into your working directory and use my provided examples in Usage).

The query (protein.fasta) is a multi-FASTA protein file, i.e. the proteins you want to assign COGs to. You can get such a file e.g. with my cds_extractor from an annotated bacterial genome file, EMBL or GENBANK. If you don't have that yet I recommend Prokka: http://www.vicbioinformatics.com/software.prokka.shtml

Best,
Andreas

aleimba · 2016-05-30T16:32:00Z

From Alexandre:

Thank you for your answer: I ran the command for RPS-blast

rpsblast -i c_prot.faa -d /home/lgmmicrorganismo/Programas_Analise/cdd2cog/banco_de_dados/ -o rps-blast.out -evalue 1e-2 -outfmt 6

and got this message:

[rpsblast] ERROR: Expectation value (E) [value] is bad or out of range [? to ?]

what e-value should I use? I post this message here only because this seems to be the only remaning issue toi the right execution.

aleimba · 2016-05-30T16:32:17Z

try 0.01:

rpsblast -query c_prot.faa -db /home/lgmmicrorganismo/Programas_Analise/cdd2cog/banco_de_dados/Cog -out rps-blast.out -evalue 0.01 -outfmt 6

HTH,
Andreas

aleimba · 2016-05-30T16:32:42Z

From Alexandre:

Ok, just one last question. After running cdd2cog on my protein I got this:

Overall assignment statistics:
~ Total query proteins categorized into COGs: 3190
~ Total COGs used for the query proteins [of 4873 overall]: 1868
~ Total number of assigned functional categories: 739
~ Total functional categories used for the query proteins [of 25 overall]: 20

I think too few proteins, 739 of 3190 had functional categories assigned. Can this be improved?

aleimba · 2016-05-30T16:34:12Z

Hi Alexandre,

sorry for replying that late.

Unfortunately, the rpsblast is not the most sensitive.

However, it's quite weird, that you only have so few functional categories. Normally the number of assigned functional categories should be higher than the proteins categorized into COGs, as many COGs are associated with several functional categories. What kind of a dataset are you using as queries? Must be proteins from very similar functional categories.

This is what I get if I run the pipeline with all CDS from an E. coli genome (4750 CDS):

Overall assignment statistics:
~ Total query proteins categorized into COGs: 3865
~ Total COGs used for the query proteins [of 4873 overall]: 2212
~ Total number of assigned functional categories: 4353
~ Total functional categories used for the query proteins [of 25 overall]: 22

You can use option -a of cdd2cog.pl to get all COG hits to each query protein (not just the best hit with the lowest e-value for each query). And then filter afterwards in the output file rps-blast_cog.txt as you see fit.

aleimba · 2016-05-30T16:34:32Z

From Alexandre:

Thank you for your answer. Regarding the dataset used as query, it is proteins from a bacteria. The proteins were extracted from the contigs using prodigal.

aleimba · 2016-05-30T16:35:19Z

From Alexandre:

In the command line

rpsblast+ -db < database> -query < query_sequence > -out <result.out > -evalue 1e-2 -outfmt '7 qseqid sseqid pident length mismatch gaopen qend sstard send bitscore qcovs'

What's the default value for the highlighted parameters?

aleimba · 2016-05-30T16:37:13Z

Hi Alexandre,

you can run either:
rpsblast -query protein.fasta -db Cog -out rps-blast.out -evalue 1e-2 -outfmt 6

or:
rpsblast -query protein.fasta -db Cog -out rps-blast.out -evalue 1e-2 -outfmt '7 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovs'

The strings 'qseqid' etc. are a custom format for the output, this way you can include 'qcovs' to have the 'Query Coverage Per Subject'. See rpsblast -help for further explanations.

But these are both only examples given in the cdd2cog.pl usage and you can adapt these commands as you want.

luciagrami · 2016-12-09T18:28:52Z

Hello, I found this discussion very useful, thanks. I would like to know, what do you expect from those proteins that could not be assigned to a COG, in my case i have several bacterial genomes, and in all cases, at least 1000 genes are not assigned. There are no conserved domains for them? Why can it be assigned as a hypothetical protein or function unknown?

aleimba · 2016-12-19T18:59:39Z

Hi @luciagrami,

the initial COG release was very strict in including orthologs, especially regarding good annotation. Thus, the coverage of bacterial proteins is not very high, especially for bacteria that didn't have closed genomes at that time. COGs can only be assigned via CDD hits, i.e. where a COG is actually associated.

There is a COG functional category with "function unknown" ([S]), but of course it is associated with certain COGs. Thus, for all your proteins without a COG classification or even without a CDD hit you can of course set it to "function unknown" by yourself.

There's a new COG release from 2014 which has higher coverage of bacterial genomes, but haven't integrated it yet (see #2).

gracielad · 2017-02-15T16:56:01Z

Hi Andreas,
I follow all the step from tutorial and this discussion to get the classification by COG, however my statistics was:
Overall assignment statistics:
~ Total query proteins categorized into COGs: 514490
~ Total COGs used for the query proteins [of 4873 overall]: 1
~ Total number of assigned functional categories: 0
~ Total functional categories used for the query proteins [of 25 overall]: 0

I don't have any classification, I annotated the genome of bacteria using PROKKA. I try two ways to get the rps-blast.out. I used the file .faa (6.696 CDS) generated from PROKKA and the cds_extractor.pl script to get a file from .gbk.

the command was: rpsblast -i PROKKA_02152017.faa -d Cog -o rps-blast.out -e 0.01 -m 6

Do you have any idea what's happening?

Thanks in advance!!

Graciela

aleimba · 2017-02-16T12:13:46Z

Hi @gracielad,
two problems. First, NCBI changed the CDD rpsblast output format. Probably a result of changing all the FASTA headers, great ... Anyway now fixed in 1b2388f (-> please download the new script version).

Second, you're using the legacy rpsblast instead of the recommended/newer + version. The cdd2cog workflow only works with the plus version (because of the proposed pre-formatted CDD rpsblast+ database). Get the BLAST+ binaries for your system here: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/.
On my system I currently have (note the +) :

DESCRIPTION
   Reverse Position Specific BLAST 2.5.0+

Please check your version with rpsblast -h on the command line (legacy help option is --help). Your rpsblast call should then actually be (note the different option flags after the '-'):

rpsblast -query PROKKA_02152017.faa -db Cog -out rps-blast.out -e 0.01 -outfmt 6

For a working example (positive control), here are the commands for a complete run with E. coli K-12 MG1655:

# get script and all needed data
wget https://raw.githubusercontent.com/aleimba/bac-genomics-scripts/master/cdd2cog/cdd2cog.pl

## CDD
wget ftp://ftp.ncbi.nlm.nih.gov/pub/mmdb/cdd/cddid.tbl.gz
gunzip cddid.tbl.gz
wget ftp://ftp.ncbi.nlm.nih.gov/pub/mmdb/cdd/little_endian/Cog_LE.tar.gz
tar xvfz Cog_LE.tar.gz

## COG
wget ftp://ftp.ncbi.nlm.nih.gov/pub/COG/COG/fun.txt
wget ftp://ftp.ncbi.nlm.nih.gov/pub/COG/COG/whog

## K-12 MG1655 protein multi-fasta
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_protein.faa.gz
gunzip GCF_000005845.2_ASM584v2_protein.faa.gz


# rpsblast+
rpsblast -query GCF_000005845.2_ASM584v2_protein.faa -db Cog -out rps-blast.out -evalue 1e-2 -outfmt 6

# cdd2cog
perl cdd2cog.pl -r rps-blast.out -c cddid.tbl -f fun.txt -w whog

And here is the corresponding output with MG1655:
Overall assignment statistics:
~ Total query proteins categorized into COGs: 3587
~ Total COGs used for the query proteins [of 4873 overall]: 2107
~ Total number of assigned functional categories: 4039
~ Total functional categories used for the query proteins [of 25 overall]: 21

Best,
Andreas

gracielad · 2017-02-23T17:26:44Z

Hi Andres,

It's work!! Thanks!!!

All the best!!

iquasere · 2018-09-28T16:07:40Z

Following this example rpsblast claims the arguments are wrong, and it works after I adjust the command to

rpsblast -i GCF_000005845.2_ASM584v2_protein.faa -d Cog -o rps-blast.out -m 8

which gets the job done, however the conversion to COGs with

perl cdd2cog.pl -r rps-blast.out -c cddid.tbl -f fun.txt -w whog

gets the result

Overall assignment statistics:
~ Total query proteins categorized into COGs: 4134
~ Total COGs used for the query proteins [of 4873 overall]: 1
~ Total number of assigned functional categories: 0
~ Total functional categories used for the query proteins [of 25 overall]: 0

which brings no information, since nothing was identified. It also outputs a lot of "Use of uninitialized value" which probably means the CDD's IDs are not being recognized. The rest of the commands were used as suggested.

utkinaira · 2018-11-26T06:28:10Z

I have exactly the same issue as @iquasere. Please let me know how to deal with it in case if you solved the problem.
Thanks!

iquasere · 2018-11-27T12:22:46Z

@utkinaira I did manage to find the answer here. Turns out the IDs of CDD changed format, changing the output of rps-blast

utkinaira · 2018-11-27T13:49:24Z

@iquasere Thank you so much, I'll try it!

aleimba added the question label May 30, 2016

iquasere mentioned this issue Oct 1, 2018

Can't convert CDDs IDs to COGs #10

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to run RPS-BLAST+ and `cdd2cog` #1

How to run RPS-BLAST+ and `cdd2cog` #1

aleimba commented May 30, 2016

aleimba commented May 30, 2016

aleimba commented May 30, 2016

aleimba commented May 30, 2016

aleimba commented May 30, 2016

aleimba commented May 30, 2016

aleimba commented May 30, 2016

aleimba commented May 30, 2016

aleimba commented May 30, 2016

luciagrami commented Dec 9, 2016

aleimba commented Dec 19, 2016

gracielad commented Feb 15, 2017

aleimba commented Feb 16, 2017

gracielad commented Feb 23, 2017

iquasere commented Sep 28, 2018 •

edited

utkinaira commented Nov 26, 2018

iquasere commented Nov 27, 2018

utkinaira commented Nov 27, 2018

How to run RPS-BLAST+ and cdd2cog #1

How to run RPS-BLAST+ and cdd2cog #1

Comments

aleimba commented May 30, 2016

aleimba commented May 30, 2016

aleimba commented May 30, 2016

aleimba commented May 30, 2016

aleimba commented May 30, 2016

aleimba commented May 30, 2016

aleimba commented May 30, 2016

aleimba commented May 30, 2016

aleimba commented May 30, 2016

luciagrami commented Dec 9, 2016

aleimba commented Dec 19, 2016

gracielad commented Feb 15, 2017

aleimba commented Feb 16, 2017

gracielad commented Feb 23, 2017

iquasere commented Sep 28, 2018 • edited

utkinaira commented Nov 26, 2018

iquasere commented Nov 27, 2018

utkinaira commented Nov 27, 2018

How to run RPS-BLAST+ and `cdd2cog` #1

How to run RPS-BLAST+ and `cdd2cog` #1

iquasere commented Sep 28, 2018 •

edited