Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to run RPS-BLAST+ and cdd2cog #1

Open
aleimba opened this issue May 30, 2016 · 17 comments
Open

How to run RPS-BLAST+ and cdd2cog #1

aleimba opened this issue May 30, 2016 · 17 comments
Labels

Comments

@aleimba
Copy link
Owner

aleimba commented May 30, 2016

A master student from Brazil contacted me via email with questions how to run RPS-BLAST+ and cdd2cog.pl correctly. I'm copying the correspondence in here in case it is useful for someone else:

Hi, I am a master student of genetics at the Universidade Federal de Minas Gerais, Brasil. I was reading the cdd2goc description at
https://github.com/aleimba/bac-genomics-scripts/tree/master/cdd2cog#rps-blast

In the line referring to the use of RPS-Blast :

rpsblast -query protein.fasta -db Cog -out rps-blast.out -evalue 1e-2 -outfmt 6

I am confuse about the cog, which I highlighted. Is this another database we must download? If so, where could I find it, and to perform a search for protein sequences of a draft bacterial genome I assembled, how database should I get? Thank you in advance.

@aleimba
Copy link
Owner Author

aleimba commented May 30, 2016

Hi Alexandre,

Cog is the preformatted RPS-BLAST+ database that can be found here: ftp://ftp.ncbi.nlm.nih.gov/pub/mmdb/cdd/little_endian/Cog_LE.tar.gz

You're right that could be clearer in the README.

You need to download all the files from the description and provide the paths to them with the respective options (or just put them into your working directory and use my provided examples in Usage).

The query (protein.fasta) is a multi-FASTA protein file, i.e. the proteins you want to assign COGs to. You can get such a file e.g. with my cds_extractor from an annotated bacterial genome file, EMBL or GENBANK. If you don't have that yet I recommend Prokka: http://www.vicbioinformatics.com/software.prokka.shtml

Best,
Andreas

@aleimba
Copy link
Owner Author

aleimba commented May 30, 2016

From Alexandre:

Thank you for your answer: I ran the command for RPS-blast

rpsblast -i c_prot.faa -d /home/lgmmicrorganismo/Programas_Analise/cdd2cog/banco_de_dados/ -o rps-blast.out -evalue 1e-2 -outfmt 6

and got this message:

[rpsblast] ERROR: Expectation value (E) [value] is bad or out of range [? to ?]

what e-value should I use? I post this message here only because this seems to be the only remaning issue toi the right execution.

@aleimba
Copy link
Owner Author

aleimba commented May 30, 2016

try 0.01:

rpsblast -query c_prot.faa -db /home/lgmmicrorganismo/Programas_Analise/cdd2cog/banco_de_dados/Cog -out rps-blast.out -evalue 0.01 -outfmt 6

HTH,
Andreas

@aleimba
Copy link
Owner Author

aleimba commented May 30, 2016

From Alexandre:

Ok, just one last question. After running cdd2cog on my protein I got this:

Overall assignment statistics:
~ Total query proteins categorized into COGs: 3190
~ Total COGs used for the query proteins [of 4873 overall]: 1868
~ Total number of assigned functional categories: 739
~ Total functional categories used for the query proteins [of 25 overall]: 20

I think too few proteins, 739 of 3190 had functional categories assigned. Can this be improved?

@aleimba
Copy link
Owner Author

aleimba commented May 30, 2016

Hi Alexandre,

sorry for replying that late.

Unfortunately, the rpsblast is not the most sensitive.

However, it's quite weird, that you only have so few functional categories. Normally the number of assigned functional categories should be higher than the proteins categorized into COGs, as many COGs are associated with several functional categories. What kind of a dataset are you using as queries? Must be proteins from very similar functional categories.

This is what I get if I run the pipeline with all CDS from an E. coli genome (4750 CDS):

Overall assignment statistics:
~ Total query proteins categorized into COGs: 3865
~ Total COGs used for the query proteins [of 4873 overall]: 2212
~ Total number of assigned functional categories: 4353
~ Total functional categories used for the query proteins [of 25 overall]: 22

You can use option -a of cdd2cog.pl to get all COG hits to each query protein (not just the best hit with the lowest e-value for each query). And then filter afterwards in the output file rps-blast_cog.txt as you see fit.

@aleimba
Copy link
Owner Author

aleimba commented May 30, 2016

From Alexandre:

Thank you for your answer. Regarding the dataset used as query, it is proteins from a bacteria. The proteins were extracted from the contigs using prodigal.

@aleimba
Copy link
Owner Author

aleimba commented May 30, 2016

From Alexandre:

In the command line

rpsblast+ -db < database> -query < query_sequence > -out <result.out > -evalue 1e-2 -outfmt '7 qseqid sseqid pident length mismatch gaopen qend sstard send bitscore qcovs'

What's the default value for the highlighted parameters?

@aleimba
Copy link
Owner Author

aleimba commented May 30, 2016

Hi Alexandre,

you can run either:
rpsblast -query protein.fasta -db Cog -out rps-blast.out -evalue 1e-2 -outfmt 6

or:
rpsblast -query protein.fasta -db Cog -out rps-blast.out -evalue 1e-2 -outfmt '7 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovs'

The strings 'qseqid' etc. are a custom format for the output, this way you can include 'qcovs' to have the 'Query Coverage Per Subject'. See rpsblast -help for further explanations.

But these are both only examples given in the cdd2cog.pl usage and you can adapt these commands as you want.

@luciagrami
Copy link

Hello, I found this discussion very useful, thanks. I would like to know, what do you expect from those proteins that could not be assigned to a COG, in my case i have several bacterial genomes, and in all cases, at least 1000 genes are not assigned. There are no conserved domains for them? Why can it be assigned as a hypothetical protein or function unknown?

@aleimba
Copy link
Owner Author

aleimba commented Dec 19, 2016

Hi @luciagrami,

the initial COG release was very strict in including orthologs, especially regarding good annotation. Thus, the coverage of bacterial proteins is not very high, especially for bacteria that didn't have closed genomes at that time. COGs can only be assigned via CDD hits, i.e. where a COG is actually associated.

There is a COG functional category with "function unknown" ([S]), but of course it is associated with certain COGs. Thus, for all your proteins without a COG classification or even without a CDD hit you can of course set it to "function unknown" by yourself.

There's a new COG release from 2014 which has higher coverage of bacterial genomes, but haven't integrated it yet (see #2).

@gracielad
Copy link

Hi Andreas,
I follow all the step from tutorial and this discussion to get the classification by COG, however my statistics was:
Overall assignment statistics:
~ Total query proteins categorized into COGs: 514490
~ Total COGs used for the query proteins [of 4873 overall]: 1
~ Total number of assigned functional categories: 0
~ Total functional categories used for the query proteins [of 25 overall]: 0

I don't have any classification, I annotated the genome of bacteria using PROKKA. I try two ways to get the rps-blast.out. I used the file .faa (6.696 CDS) generated from PROKKA and the cds_extractor.pl script to get a file from .gbk.

the command was: rpsblast -i PROKKA_02152017.faa -d Cog -o rps-blast.out -e 0.01 -m 6

Do you have any idea what's happening?

Thanks in advance!!

Graciela

@aleimba
Copy link
Owner Author

aleimba commented Feb 16, 2017

Hi @gracielad,
two problems. First, NCBI changed the CDD rpsblast output format. Probably a result of changing all the FASTA headers, great ... Anyway now fixed in 1b2388f (-> please download the new script version).

Second, you're using the legacy rpsblast instead of the recommended/newer + version. The cdd2cog workflow only works with the plus version (because of the proposed pre-formatted CDD rpsblast+ database). Get the BLAST+ binaries for your system here: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/.
On my system I currently have (note the +) :

DESCRIPTION
   Reverse Position Specific BLAST 2.5.0+

Please check your version with rpsblast -h on the command line (legacy help option is --help). Your rpsblast call should then actually be (note the different option flags after the '-'):

rpsblast -query PROKKA_02152017.faa -db Cog -out rps-blast.out -e 0.01 -outfmt 6

For a working example (positive control), here are the commands for a complete run with E. coli K-12 MG1655:

# get script and all needed data
wget https://raw.githubusercontent.com/aleimba/bac-genomics-scripts/master/cdd2cog/cdd2cog.pl

## CDD
wget ftp://ftp.ncbi.nlm.nih.gov/pub/mmdb/cdd/cddid.tbl.gz
gunzip cddid.tbl.gz
wget ftp://ftp.ncbi.nlm.nih.gov/pub/mmdb/cdd/little_endian/Cog_LE.tar.gz
tar xvfz Cog_LE.tar.gz

## COG
wget ftp://ftp.ncbi.nlm.nih.gov/pub/COG/COG/fun.txt
wget ftp://ftp.ncbi.nlm.nih.gov/pub/COG/COG/whog

## K-12 MG1655 protein multi-fasta
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_protein.faa.gz
gunzip GCF_000005845.2_ASM584v2_protein.faa.gz


# rpsblast+
rpsblast -query GCF_000005845.2_ASM584v2_protein.faa -db Cog -out rps-blast.out -evalue 1e-2 -outfmt 6

# cdd2cog
perl cdd2cog.pl -r rps-blast.out -c cddid.tbl -f fun.txt -w whog

And here is the corresponding output with MG1655:
Overall assignment statistics:
~ Total query proteins categorized into COGs: 3587
~ Total COGs used for the query proteins [of 4873 overall]: 2107
~ Total number of assigned functional categories: 4039
~ Total functional categories used for the query proteins [of 25 overall]: 21

Best,
Andreas

@gracielad
Copy link

Hi Andres,

It's work!! Thanks!!!

All the best!!

@iquasere
Copy link

iquasere commented Sep 28, 2018

Following this example rpsblast claims the arguments are wrong, and it works after I adjust the command to

rpsblast -i GCF_000005845.2_ASM584v2_protein.faa -d Cog -o rps-blast.out -m 8

which gets the job done, however the conversion to COGs with

perl cdd2cog.pl -r rps-blast.out -c cddid.tbl -f fun.txt -w whog

gets the result

Overall assignment statistics:
~ Total query proteins categorized into COGs: 4134
~ Total COGs used for the query proteins [of 4873 overall]: 1
~ Total number of assigned functional categories: 0
~ Total functional categories used for the query proteins [of 25 overall]: 0

which brings no information, since nothing was identified. It also outputs a lot of "Use of uninitialized value" which probably means the CDD's IDs are not being recognized. The rest of the commands were used as suggested.

@utkinaira
Copy link

I have exactly the same issue as @iquasere. Please let me know how to deal with it in case if you solved the problem.
Thanks!

@iquasere
Copy link

@utkinaira I did manage to find the answer here. Turns out the IDs of CDD changed format, changing the output of rps-blast

@utkinaira
Copy link

@iquasere Thank you so much, I'll try it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants