GitHub - Ecogenomics/Mannotator: Microbial annotation pipeline

This repository has been archived by the owner on Sep 30, 2022. It is now read-only.

Ecogenomics / Mannotator Public archive

Notifications You must be signed in to change notification settings
Fork 5
Star 4

Microbial annotation pipeline - in progress

Notifications

Name		Name	Last commit message	Last commit date
Latest commit History 144 Commits
bin		bin
lib		lib
misc		misc
tests		tests
vis		vis
.gitmodules		.gitmodules
README		README

Repository files navigation

------------------------------------------------
Mannotator
------------------------------------------------

Behold! The Mannotator walks among us!

------------------------------------------------
What is it?
------------------------------------------------

With such an exciting name, you may be forgiven for thinking that the mannotator is something other than a microbial sequence annotator, BUT, you'd be wrong… If you have a microbial assembly, then you can use this tool to annotate your contigs. This tool is not perfect, nor complete. Any and all suggestions are welcome!

"Annotate your contig" (currently) means:
* Call open reading frames (ORFs) on your contigs or complete genome using a consensus approach (e.g. RAST, Prodigal, others possible)
* Assign function (KEGG, COG/NOG, eggNOG, Gene Ontology (GO)) to these ORFs using a best blast hit approach


------------------------------------------------
What to do before I can use it?
------------------------------------------------

Step 1: Install Mannotator's dependencies. You'll need Perl and the following Perl modules:
   * Bioperl
   * Getopt::Euclid
   * URI::Escape

Step 1a: Install optional dependancies
  * Ruby
  * bioruby

Step 2: Prepare database files to run your annotations with. Mannotator has support for EMBL Uniref/COG/KEGG and NCBI Nr/Taxonomy

Step 2a: Download your protein database of choice, i.e. Uniref90 (http://www.ebi.ac.uk/uniprot/database/download.html) or Nr (ftp://ftp.ncbi.nih.gov/../blast/db/FASTA/nr.gz)

Step 2b: Uncompress it and format it with the BLAST formatdb utility

Step 2b: Use Mannotator's idMergeUniref.pl or idMergeNr.pl to produce a file of ID mappings, i.e. Uniref_mappings.txt or Nr_mappings.txt

Now you are ready to annotate your sequence using Ze Mannotator

------------------------------------------------
How do I use it?
------------------------------------------------

NOTE: These instructions are made for the install on our Eury server. They can easily be modified to work elsewhere though…

Step 1: Ensure all you contig names (fasta IDs) are computer friendly. Avoid the following characters in fasta headers (always a good idea anyway…). Also, make sure that all the fasta headers are unique!

Avoid these characters:

| 
space
\
&
*
(
)
&

Step 2. Submit your sequences for annotation on the RAST server at http://rast.nmpdr.org/. When it is over, export your annotations as a GFF file. The RAST annotation may last several days, so in the meantime, start step 3.

Step 3. Find ORFs in your sequences using Prodigal (on eury)

$ prodigal -i IN_FILE -o OUT_FILE.gff3 -f gff

For metagenomic contigs, try:

$ prodigal -i IN_FILE -o OUT_FILE.gff3 -f gff -p meta

Step 4. Copy the Prodigal and RAST gff3 files and the multiple fasta file containing your contigs into a working directory.

Step 5. Unleash the mannotator!

Here is a quick rundown what is being done by the mannotator and how to run it.
   
   1.
      Combine multiple gff3 files into one and extract fasta sequences of un-annotated ORFs
   2.
      Use Blastx against the Uniref90 or Nr databases to get best similarity for each each ORF fasta sequence
   3.
      Use the Uniref obtained in the previous step to assign KEGG and COG/NOG IDs to each of the un-annotated fasta files. If you used Nr, then you get GI (genbank identifiers) which are used to look up the putative function and taxonomy of the proteic sequence.
   4.
      Mix all the information together and return a single gff3 file called 'mannotatored.gff3' that contains all your annotations. You can visualise the annotations contained in this file using Artemis or GBrowse.

Running the Mannotator will take a while, so perhaps you should use nohup or screen. The order of the GFF files is IMPORTANT. You should put the RAST file first, followed by the prodigal file.

The most basic command to get the mannotator running is:

$ mannotator -g rast.gff3,prodigal.gff3 -c contigs.fa

which (on the Luca server) load the correct balst database and annotation mapping file.
However specifying a custom location for both of these files is also possible:

$ mannotator -g rast.gff3,prodigal.gff3 -c contigs.fa -p <uniref_path> -i <Uniref_mappings_path>

$ mannotator -g rast.gff3,prodigal.gff3 -c contigs.fa -p <nr_path> -i <Nr_mappings_path>

If you ran your data against both Uniref (for function assignment) and Nr (for taxonomy assignment), you might want to recombine your GFF output files into a single file using the combineGffAnnotations.pl script.


Where the file for <uniref_path> can be found at ACE:

Luca: /srv/whitlam/bio/db/uniprot/uniref90/uniref90.fasta
Eury: /Volumes/Biodata/BLAST_databases/UniProt/UniRef90/uniref90.fasta

And the file for <Uniref_mappings_path>:

Luca: /srv/whitlam/bio/db/mannotator/Uniref201106_mappings.txt
Eury: /Volumes/Biodata/ANN_mappings/ANN_mappings.txt

On Luca you no longer need to specify these two file locations as they are automatically used by default

The pipeline will output a file called “mannotatored.gff3”

FIN

------------------------------------------------
More info on the scripts themselves
------------------------------------------------

The mannotator can also be run as individual scripts...

The first is called combineGffOrfs. It will take any number of gff3 files and merge their ORFs into one combined file. Also, it is currently programmed to extract the fasta sequences for ORFs which have not been given a RAST/FIG annotation. This script will run as a stand alone application. The order of the gff3 files is IMPORTANT. You need to supply them in the order in which you trust them.

$ combineGffOrfs -gffs|g gff_file1[,gff_file2[, ... ]] -contigs|c contigs_file

The second script is called blast2Ann. It will take a Uniref90 blastx result and add KEGG and COG/NOG annotations to your contigs. Again, it can be run as a standalone.

$ blast2ann -u2a|u ANN_file -gff3|g gff3_file

The ANN_file is a special text file which links Uniref90 IDs to KEGG and COG/NOG IDs.
The format of the ANN_mappings file is as follows:

UniProtID^[;Ontology_term=COG_ID:COG-NOGID[;Ontology_term=COG_DESC:COG-NOG text description]][;Ontology_term=KEGG_ID:KEGG entryID[;Ontology_term=KEG_PATHWAY:KEGG pathway ID][;Ontology_term=KEGG_ONTOLOGY:KEGG ko ID]]

Some sample lines:

B8NAL8^;Ontology_term=KEGG_ID:afv:AFLA_042430
B9LMU9^;Ontology_term=KEGG_ID:hla:Hlac_1092;Ontology_term=KEGG_PATHWAY:path:hla02010;Ontology_term=KEGG_ONTOLOGY:ko:K01999
Q2S9X8^;Ontology_term=COG_ID:COG2969;Ontology_term=COG_DESC:Stringent starvation protein B;Ontology_term=KEGG_ID:hch:HCH_10003;Ontology_term=KEGG_ONTOLOGY:ko:K03600
B2TZD5^;Ontology_term=KEGG_ID:sbc:SbBS512_E3074;Ontology_term=KEGG_ONTOLOGY:ko:K01146
Q21RE3^;Ontology_term=COG_ID:COG1881;Ontology_term=COG_DESC:Phospholipid-binding protein;Ontology_term=KEGG_ID:rfr:Rfer_3962;Ontology_term=KEGG_ONTOLOGY:ko:K06910

Use this file however you please…

------------------------------------------------
Usage and command-line options
------------------------------------------------


Usage:
           mannotator -g[ffs] | --gffs <gff>... -c[ontigs] | --contigs <contig> [options]
           mannotator --help
           mannotator --version

Required arguments:
    -g[ffs] <gff>... | --gffs <gff>...
        A comma or space separated list of gff3 formatted files

    -c[ontigs] <contig> | --contigs <contig>
        A fasta formated file containing contigs

Options:
    -p[rotdb] <protdb> | --protein-database <protdb>
        Specify a custom location for a blast database of sequences Default:
        /srv/whitlam/bio/db/uniprot/uniref90/uniref90.fasta

    -i[2a] <mapping> | --i2a <mapping>
        Specify a custom location for the database accession number mapping
        file Default:
        /srv/whitlam/bio/db/mannotator/Uniref201106_mappings.txt

    -m[in_len] <minlen> | --minimum-length <minlen>
        Remove ORFs less than this length (in bp). Default: 100 bp

    -d | --seq-embed
        Embed sequences into GFF files (useful to view the annotation in
        Artemis)

    -f[latfile] | --flat-file
         Optionally create multiple genbank files for your contigs

    -b[last_prg] <blastprog> | --blast-program <blastprog>
        The type of blast to run. Default: blastx

    -e[value] <evalue> | --evalue <evalue>
        E-value cutoff when running blast. Default: 1e-08

    -t[hreads] <threads> | --threads <threads>
        Run blast on multiple threads. Default: 1

    -o[ut] <outfile> | --outfile <outfile>
        Filename for the final gff3 file. Default: mannotatored.gff3

    -k[eep] | --keep-tmp
        Keep all tmp files and directories created

    -x | --keep-blast
        Keep only the blast results and delete all other tmp files

    -s[ims] | --blast-sims
        Use a precomputed blast output file

    -n[ew_blast] | --blast+
        Use the blast+ toolkit, default is to use blastall

    -one | --one
        Skip step one of the mannotator

    -two | --two
        Skip step two of the mannotator

    -three | --three
        skip step three of the mannotator

    -four | --four
        Skip step four of the mannotator

    -five | --five
        Skip step five of the mannotator