Skip to content

Latest commit

 

History

History
158 lines (128 loc) · 11.9 KB

models.md

File metadata and controls

158 lines (128 loc) · 11.9 KB

Ribovore model information


Sequence and structure-based ribosomal RNA alignments included with Ribovore

Ribovore includes models built from 18 different alignments. Seven of these derive from Rfam, and 11 were created during the course of Ribovore development.

These alignments were created using 1 of 2 strategies:

Build strategy 1: alignment and consensus secondary structure was determined using the CRW conversion strategy described in the SSU-ALIGN 0.1 user's guide (section 5) using tools/scripts available on GitHub. Sequence and structure information in these alignments derive from the CRW website/database (reference below).

Build strategy 2: initial version of the alignment and consensus secondary structure and model was determined using build strategy 1, but that model was used to create a new alignment and model using the Rfam model building pipeline (as of Rfam release 12.0), as follows:

  • the model from build strategy 1 was used to search the Rfam database sequence database Rfamseq (a large database consisting of much of the GenBank nucleotide database)
  • all resulting hits were filtered by removing highly similar sequences
  • the suriviving hits were aligned to the model from build strategy 1
  • the cmbuild program of Infernal was used with the --refine option to refine to get a final alignment and a new model was built from it

As you can see in the table below, the 7 Rfam alignments and 3 of the other alignments were created with build strategy 2. The other 7 alignments that used build strategy 1 have very few (< 5) sequences in them and should and will be updated in future versions. Models built from these 7 build strategy 1 alignments are less trustworthy and generally less useful than the other models built from build strategy 2 alignments.


Covariance model (CM) files that include a single CM and profile HMM

alignment file name alignment build strategy model files built from alignment # seqs model length Rfam accession Rfam DB release
SSU_rRNA_archaea.RF01959.stk 2 rt.SSU_rRNA_archaea.enone.cm, ra.SSU_rRNA_archaea.edf.cm 86 1477 RF01959 12.2
SSU_rRNA_bacteria.RF00177.stk 2 rt.SSU_rRNA_bacteria.enone.cm, ra.SSU_rRNA_bacteria.edf.cm 99 1533 RF00177 12.2
SSU_rRNA_eukarya.RF01960.stk 2 rt.SSU_rRNA_eukarya.enone.cm, ra.SSU_rRNA_eukarya.edf.cm 91 1851 RF01960 12.2
SSU_rRNA_microsporidia.RF02542.stk 2 rt.SSU_rRNA_microsporidia.enone.cm, ra.SSU_rRNA_microsporidia.edf.cm 46 1312 RF02542 12.2
LSU_rRNA_archaea.RF02540.stk 2 rt.LSU_rRNA_archaea.enone.cm, ra.LSU_rRNA_archaea.edf.cm 91 2990 RF02540 12.2
LSU_rRNA_bacteria.RF02541.stk 2 rt.LSU_rRNA_bacteria.enone.cm, ra.LSU_rRNA_bacteria.edf.cm 102 2925 RF02541 12.2
LSU_rRNA_eukarya.RF02543.stk 2 rt.LSU_rRNA_eukarya.enone.cm, ra.LSU_rRNA_eukarya.edf.cm 88 3401 RF02543 12.2
SSU_rRNA_mitochondria_metazoa.stk 2 rt.SSU_rRNA_mitochondria_metazoa.enone.cm, ra.SSU_rRNA_mitochondria_metazoa.edf.cm 83 954 - -
SSU_rRNA_mitochondria_amoeba.stk 1 rt.SSU_rRNA_mitochondria_amoeba.enone.cm, ra.SSU_rRNA_mitochondria_amoeba.edf.cm 2 1861 - -
SSU_rRNA_mitochondria_chlorophyta.stk 1 rt.SSU_rRNA_mitochondria_chlorophyta.enone.cm, ra.SSU_rRNA_mitochondria_chlorophyta.edf.cm 2 1200 - -
SSU_rRNA_mitochondria_fungi.stk 1 rt.SSU_rRNA_mitochondria_fungi.enone.cm, ra.SSU_rRNA_mitochondria_fungi.edf.cm 4 1603 - -
SSU_rRNA_mitochondria_kinetoplast.stk 1 rt.SSU_rRNA_mitochondria_kinetoplast.enone.cm, ra.SSU_rRNA_mitochondria_kinetoplast.edf.cm 3 624 - -
SSU_rRNA_mitochondria_plant.stk 1 rt.SSU_rRNA_mitochondria_plant.enone.cm, ra.SSU_rRNA_mitochondria_plant.edf.cm 4 1951 - -
SSU_rRNA_mitochondria_protist.stk 1 rt.SSU_rRNA_mitochondria_protist.enone.cm, ra.SSU_rRNA_mitochondria_protist.edf.cm 2 1677 - -
SSU_rRNA_chloroplast.stk 2 rt.SSU_rRNA_chloroplast.enone.cm, ra.SSU_rRNA_chloroplast.edf.cm 94 1488 - -
SSU_rRNA_chloroplast_pilostyles.stk 1 rt.SSU_rRNA_chloroplast_pilostyles.enone.cm, ra.SSU_rRNA_chloroplast_pilostyles.edf.cm 1 1531 - -
SSU_rRNA_cyanobacteria.stk 2 rt.SSU_rRNA_cyanobacteria.enone.cm, ra.SSU_rRNA_cyanobacteria.edf.cm 49 1487 - -
SSU_rRNA_apicoplast.stk 1 rt.SSU_rRNA_apicoplast.enone.cm, ra.SSU_rRNA_apicoplast.edf.cm 3 1463 - -

All files listed in columns 1 and 3 can be found in the ribovore/models directory ($RIBOSCRIPTSDIR/models following installation).


ribotyper versus riboaligner models

The model files that begin with rt. contain ribotyper models and those that begin with ra. contain riboaligner models. These models were built differently. All ribotyper models were built using the cmbuild program from Infernal version 1.1.3 with command line options --p7ml --enone using the alignment files listed in the table above. All riboaligner models were built using the cmbuild program from Infernal version 1.1.2 with default parameters (no command line options) using the aligment files listed in the table above. The riboaligner models were built with cmbuild's entropy weighting feature that controls the average entropy per model position~\cite{Karplus98,Nawrocki09b}, and the ribotyper models were built with this feature turned off. Additionally the ribotyper models were built such that the profile HMM used for filtering was built to be maximally similar to the CM (the --p7ml option). These options were selected because they increased classification accuracy on our internal testing for ribotyper. Default cmbuild built models are provided for riboaligner because those models were more accurate on prediction alignment endpoints correctly in our testing.


ribotyper.cm a multi-model CM library file

The ribotyper.cm file is a CM library of all models that begin with rt in the above table. This file is used in the first stage of ribotyper to classify sequences.


Getting model statistics using Infernal's cmstat program

The program cmstat that is installed as part of the Infernal package with Ribovore installation can be used to output information on the model or models in CM file. For example, below is the output of cmstat on the ribotyper.cm file:

> $RIBOINFERNALDIR/cmstat $RIBOINSTALLDIR/models/ribotyper.cm
# cmstat :: display summary statistics for CMs
# INFERNAL 1.1.4 (Dec 2020)
# Copyright (C) 2020 Howard Hughes Medical Institute.
# Freely distributed under the BSD open source license.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
#                                                                                                               rel entropy
#                                                                                                              ------------
# idx   name                                   accession      nseq  eff_nseq   clen      W   bps  bifs  model     cm    hmm
# ----  -------------------------------------  ---------  --------  --------  -----  -----  ----  ----  -----  -----  -----
     1  SSU_rRNA_archaea                       RF01959          86     86.00   1477   2998   457    30     cm  1.496  1.315
     2  SSU_rRNA_bacteria                      RF00177          99     99.00   1533   1866   462    31     cm  1.415  1.231
     3  SSU_rRNA_eukarya                       RF01960          91     91.00   1851   2879   447    30     cm  1.004  0.888
     4  SSU_rRNA_microsporidia                 RF02542          46     46.00   1312   1974   366    26     cm  1.231  1.083
     5  SSU_rRNA_chloroplast                   -                94     94.00   1488   2288   446    31     cm  1.602  1.514
     6  SSU_rRNA_mitochondria_metazoa          -                83     83.00    954   1406   254    20     cm  1.089  0.971
     7  SSU_rRNA_cyanobacteria                 -                49     49.00   1487   1576   445    31     cm  1.748  1.668
     8  LSU_rRNA_archaea                       RF02540          91     91.00   2990   6270   786    68     cm  1.323  1.133
     9  LSU_rRNA_bacteria                      RF02541         102    102.00   2925   5920   846    70     cm  1.352  1.153
    10  LSU_rRNA_eukarya                       RF02543          88     88.00   3401   8019   872    71     cm  1.122  0.994
    11  SSU_rRNA_apicoplast                    -                 3      3.00   1463   1685   398    28     cm  0.926  0.721
    12  SSU_rRNA_chloroplast_pilostyles        -                 1      1.00   1531   1557   440    30     cm  0.656  0.399
    13  SSU_rRNA_mitochondria_amoeba           -                 2      2.00   1861   2004   311    25     cm  0.725  0.593
    14  SSU_rRNA_mitochondria_chlorophyta      -                 2      2.00   1200   1549   224    19     cm  0.674  0.509
    15  SSU_rRNA_mitochondria_fungi            -                 4      4.00   1603   2455   334    26     cm  0.908  0.764
    16  SSU_rRNA_mitochondria_kinetoplast      -                 3      3.00    624    652    68     5     cm  0.978  0.910
    17  SSU_rRNA_mitochondria_plant            -                 4      4.00   1951   2211   446    29     cm  1.353  1.256
    18  SSU_rRNA_mitochondria_protist          -                 2      2.00   1677   2051   318    24     cm  0.732  0.582

For more information on cmstat see the Infernal user's guide


CRW database reference

Cannone J.J., Subramanian S., Schnare M.N., Collett J.R., D'Souza L.M., Du Y., Feng B., Lin N., Madabusi L.V., MÜller K.M., Pande N., Shang Z., Yu N., and Gutell R.R. (2002). The Comparative RNA Web (CRW) Site: An Online Database of Comparative Sequence and Structure Information for Ribosomal, Intron, and Other RNAs. BioMed Central Bioinformatics, 3:2. [Correction: BioMed Central Bioinformatics. 3:15.]


Questions, comments or feature requests? Send a mail to eric.nawrocki@nih.gov.