Eric Nawrocki edited this page Oct 23, 2018 · 1 revision

cmcalibrate is slow and requires a lot of memory. This page is meant to help users having difficulty with cmcalibrate.

Calibrate only for cmsearch and cmscan

First, cmcalibrate is only required if you are going to use your CM file with cmsearch or cmscan. Otherwise, there is no reason to run cmcalibrate.

cmcalibrate usage

To calibrate the CM file RF00001.cm, do:

$ cmcalibrate RF00001.cm

You should see output like this:

# cmcalibrate :: fit exponential tails for CM E-values
# INFERNAL 1.1.2 (July 2016)
# Copyright (C) 2016 Howard Hughes Medical Institute.
# Freely distributed under a BSD open source license.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# CM file:                                     RF00001.cm
# number of worker threads:                    32
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
#
# Calibrating CM(s):
#
#                        predicted                                                   actual
#                       running time              percent complete                running time
# model name            (hr:min:sec)  [........25........50........75..........]  (hr:min:sec)
# --------------------  ------------  ------------------------------------------  ------------
  5S_rRNA                   00:01:53  [========================================]      00:03:44
#
# Calibration summary statistics:
#
#                           exponential tail fit mu        exponential tail fit lambda         total number of hits
#                       -------------------------------  -------------------------------  -------------------------------
# model name            glc cyk glc ins loc cyk loc ins  glc cyk glc ins loc cyk loc ins  glc cyk glc ins loc cyk loc ins
# --------------------  ------- ------- ------- -------  ------- ------- ------- -------  ------- ------- ------- -------
  5S_rRNA                 -6.37   -1.61    0.62    3.39    0.410   0.425   0.677   0.595    17589   17573  339357  213632
#
# CPU time: 5466.31u 30.45s 01:31:36.76 Elapsed: 00:03:44.48
[ok]

cmcalibrate options

To see the available command-line options for cmcalibrate, do:

$ cmcalibrate -h
# cmcalibrate :: fit exponential tails for CM E-values
# INFERNAL 1.1.2 (July 2016)
# Copyright (C) 2016 Howard Hughes Medical Institute.
# Freely distributed under a BSD open source license.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Usage: cmcalibrate [-options] <cmfile>

Basic options:
  -h     : show brief help on version and usage
  -L <x> : set random seq length to search in Mb to <x>  [1.6]  (0.01<=x<=160.)

Options for predicting running time and memory requirements:
  --forecast      : don't do calibration, predict running time and exit
  --nforecast <n> : w/--forecast, predict time with <n> processors (maybe for MPI)
  --memreq        : don't do calibration, print required memory and exit
  --noforecast    : do calibration, but skip running time prediction

Options controlling exponential tail fits:
  --gtailn <n> : fit the top <n> hits/Mb in histogram for glocal modes [df: 250]
  --ltailn <n> : fit the top <n> hits/Mb in histogram for  local modes [df: 750]
  --tailp <x>  : set fraction of histogram tail to fit to exp tail to <x>

Optional output files:
  --hfile <f>  : save fitted score histogram(s) to file <f>
  --sfile <f>  : save survival plot to file <f>
  --qqfile <f> : save Q-Q plot for score histograms to file <f>
  --ffile <f>  : save lambdas for different tail fit probs to file <f>
  --xfile <f>  : save scores in fit tail to file <f>

Other options:
  --seed <n>  : set RNG seed to <n> (if 0: one-time arbitrary seed)
  --beta <x>  : set tail loss prob for query dependent banding (QDB) to <x>
  --nonbanded : do not use QDB
  --nonull3   : turn OFF the NULL3 post hoc additional null model
  --random    : use GC content of random null background model of CM
  --gc <f>    : use GC content distribution from file <f>
  --cpu <n>   : number of parallel CPU workers to use for multithreads

Estimating cmcalibrate running time with the --forecast option

To get an estimate of the running time required for cmcalibrate on a CM file RF00001.cm, do:

$ cmcalibrate --forecast RF00001.cm

You should see output like this:

# cmcalibrate :: fit exponential tails for CM E-values
# INFERNAL 1.1.2 (July 2016)
# Copyright (C) 2016 Howard Hughes Medical Institute.
# Freely distributed under a BSD open source license.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# CM file:                                     RF00001.cm
# forecast mode (no calibration):              on
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
#
# Forecasting running time for CM calibration(s) on 32 cpus:
#
#                          predicted
#                       running time
# model name            (hr:min:sec)
# --------------------  ------------
  5S_rRNA                   00:01:53
#
# CPU time: 0.41u 0.01s 00:00:00.42 Elapsed: 00:00:00.45
[ok]

Note that the forecast lists the running time for 32 CPUs. This should match the number of cores on the machine you are using, because cmcalibrate uses all cores by default. To forecast the running time for <n> cores, use the --nforecast <n> option, like:

$ cmcalibrate --forecast --nforecast 8 RF00001.cm

When you perform the calibration, you can specify the number of cores that cmcalibrate will use with the --cpu <n> option. You may want to use fewer cores if the required memory (see below) is too high. As a special case, to run on a single core, specify --cpu 0.
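To compare forecasts for several core counts in one go, the --nforecast command above can be wrapped in a small loop. This is a minimal sketch, not part of Infernal: the function name forecast_by_cores is hypothetical, and the awk filter assumes the forecast output has the layout shown above (one non-comment line per model, holding the model name and the predicted time).

```shell
# Hypothetical helper (not an Infernal command): forecast calibration time
# for one CM file at several core counts.
# Usage: forecast_by_cores <cmfile> <n1> [<n2> ...]
forecast_by_cores() {
  cmfile=$1
  shift
  for n in "$@"; do
    # Keep only the per-model lines: they do not start with '#' and have
    # exactly two fields (model name, predicted hr:min:sec).
    cmcalibrate --forecast --nforecast "$n" "$cmfile" |
      awk -v n="$n" '!/^#/ && NF == 2 { print n, $1, $2 }'
  done
}
```

For example, `forecast_by_cores RF00001.cm 4 8 16` prints one line per core count and model, giving the core count, the model name, and the predicted running time.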

Estimating memory required for cmcalibrate with the --memreq option

To get an estimate of required memory for cmcalibrate, do:

$ cmcalibrate --memreq RF00001.cm

and you should see output like:

# cmcalibrate :: fit exponential tails for CM E-values
# INFERNAL 1.1.2 (July 2016)
# Copyright (C) 2016 Howard Hughes Medical Institute.
# Freely distributed under a BSD open source license.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# CM file:                                     RF00001.cm
# memory-requirement mode (no calibration):    on
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# Predicting required memory for calibration:
#
#                         total Mb    total Mb
#                       single CPU     32 CPUs
# --------------------  ----------  ----------
  5S_rRNA                     77.4      2424.5
#
# To enforce a single CPU be used, use the '--cpu 0' option.
# To enforce <n> CPUs be used, use '--cpu <n>'.
# By default (if '--cpu' is not used), 32 CPUs will be used.
#
# CPU time: 0.00u 0.00s 00:00:00.00 Elapsed: 00:00:00.00
[ok]

This means that running cmcalibrate with the default 32 cores would require roughly 2424.5 Mb of RAM. With --cpu 0, which forces a single core, it would require only about 77.4 Mb of RAM. With --cpu 4, it would require roughly 320 Mb of RAM.

These are rough estimates. As a rule of thumb that is usually safe, I make sure that twice as much memory as --memreq reports is available when I run cmcalibrate.
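The rule of thumb above can be applied directly to the --memreq output. The sketch below assumes the output layout shown above, where the last column of each model line is the total Mb for the default number of CPUs. The name memreq_with_headroom is hypothetical and is not an Infernal command.

```shell
# Hypothetical helper (not an Infernal command): read 'cmcalibrate --memreq'
# output on stdin and print, for each model, twice the multi-CPU estimate in Mb.
memreq_with_headroom() {
  # Model lines do not start with '#' and have three fields:
  # model name, single-CPU Mb, multi-CPU Mb. The '[ok]' line has one field.
  awk '!/^#/ && NF == 3 { printf "%s %.1f\n", $1, 2 * $NF }'
}
```

It would be used as `cmcalibrate --memreq RF00001.cm | memreq_with_headroom`. For the 5S_rRNA output above, the multi-CPU estimate of 2424.5 Mb doubles to 4849.0 Mb.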

There is no way to reduce the required memory for calibration below the amount reported for a single CPU.

Reducing cmcalibrate running time with the -L option

The -L <f> option controls the total length of random sequence that cmcalibrate searches, where <f> is in Mb. The default value for <f> is 1.6 (Mb). Smaller values of <f> will result in quicker searches but less accurate E-value statistics. The default of 1.6 was chosen as a good compromise between cmcalibrate running time and the accuracy of the resulting E-values. I do not recommend using <f> values less than 0.4. With low values of <f>, you may get error messages like "Not enough hits to fit exponential tail" and the calibration will fail. This is because the calibration needs a minimum number of high-scoring hits to fit an exponential tail, and reducing the search size from 1.6 down to 0.4 (for example) may mean that this minimum is not reached. Unfortunately, this 'not enough hits' error is more likely to happen with large models.

You can use -L <f> in combination with the --forecast option to see how changing <f> affects the forecasted running time, like:

$ cmcalibrate -L 0.4 --forecast RF00001.cm
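Before trying a smaller -L value, it can help to check it against the limits mentioned on this page: the allowed range from the -h output (0.01 to 160) and the 0.4 lower bound recommended above. The following is a minimal sketch with a hypothetical function name, check_L. The labels it prints are my own, not cmcalibrate messages.

```shell
# Hypothetical helper (not an Infernal command): classify a candidate -L value in Mb.
#   invalid : outside the range cmcalibrate accepts (0.01 <= x <= 160)
#   risky   : accepted, but below the 0.4 Mb recommended on this page
#   ok      : 0.4 Mb or more
check_L() {
  awk -v x="$1" 'BEGIN {
    if (x < 0.01 || x > 160) { print "invalid"; exit }
    if (x < 0.4)             { print "risky";   exit }
    print "ok"
  }'
}
```

For example, `check_L 0.2` prints `risky`, while `check_L 1.6` prints `ok`. A value flagged as risky can still be tried, but the 'not enough hits' failure becomes more likely.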

Reducing cmcalibrate running time with the --beta option

The --beta <x> option can also speed up cmcalibrate if you set <x> to a value higher than the default of 1E-15. The --beta option controls the width of the bands used during the DP search (see Nawrocki, Eddy, 2007: http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.0030056 for more info). The <x> value for --beta is the amount of probability mass allowed outside the bands, so allowing more probability loss makes the bands tighter, the DP faster, and thus the calibration faster. For example, setting <x> to 1E-4 (0.0001) instead of the default will accelerate the search.

You can use --beta <x> in combination with the --forecast option to see how changing <x> affects the forecasted running time, like:

$ cmcalibrate --beta 1E-4 --forecast RF00001.cm

or in combination with -L too:

$ cmcalibrate --beta 1E-4 -L 0.4 --forecast RF00001.cm
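When comparing forecasts for different -L and --beta settings, it can be easier to work with seconds than with hr:min:sec strings. This is a minimal sketch of that conversion. The name hms_to_seconds is a hypothetical helper, not part of Infernal.

```shell
# Hypothetical helper (not an Infernal command): convert a forecast time
# in hr:min:sec form to a number of seconds.
hms_to_seconds() {
  echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}
```

For example, `hms_to_seconds 00:01:53` prints `113`, so two forecasts can be compared as plain numbers.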

Skipping the forecast stage for large models

The first step of cmcalibrate is to run a short simulation to estimate the total running time that the full calibration will take. This 'short' simulation can take a long time for large models. To skip it, use the --noforecast option.