cmsearch

Langue: en

Version: 335033 (ubuntu - 24/10/10)

Section: 1 (Commandes utilisateur)

NAME

cmsearch - search a sequence database for RNAs homologous to a CM

SYNOPSIS

cmsearch [options] cmfile seqfile

DESCRIPTION

cmsearch uses the covariance model (CM) in cmfile to search for homologous RNAs in seqfile, and outputs high-scoring alignments.

Currently, the sequence file must be in FASTA format.

CMs are profiles of RNA consensus sequence and secondary structure. A CM file is produced by the cmbuild program, from a given RNA sequence alignment of known consensus structure. CM files can be calibrated prior to running cmsearch with the cmcalibrate program. Searches with calibrated CM files will include E-values and will use appropriate filter thresholds for acceleration. It is strongly recommended to calibrate your CM files before using cmsearch. CM calibration is described in more detail below and in chapters 5 and 6 of the User's Guide.

cmsearch output consists of alignments of all hits in the database sorted by decreasing score per sequence and per strand. That is, all hits for the same sequence and the same (Watson or Crick) strand are sorted, but hits across sequences or strands are not sorted.

The threshold for reporting scores is different depending on whether the CM file has been calibrated or not. If the CM file has been calibrated, the default reporting threshold is an E-value of 1.0. This is the threshold at which 1 hit is expected by chance. It is possible to manually set the threshold to bit score <x> using the -T <x> option as described below, or to E-value <x> using the -E <x> option. The -E option will only work if the CM file has been calibrated.

RNA homology search with CMs is slow. To speed it up, cmsearch by default uses two rounds of filters with faster algorithms to prune the database prior to searching with the slow CM algorithm. The first round of filtering is faster but less strict than the second round. First, the full database is searched with the first round filter, then any hits that survive the first round are searched with the second round filter. Finally any hits that survive the first and second round of filtering are searched with the final round search strategy. During the filter rounds, hits are padded with a short stretch of residues on either side prior to searching with the subsequent round. The exact number of residues is dependent on the size of the model being searched with.

The first round of filtering is performed with an HMM. If the CM file is calibrated, the threshold for the HMM filter will be automatically chosen as an appropriate one as determined in cmcalibrate. The minimum threshold that will automatically be chosen is the threshold that will allow a predicted fraction (0.02 by default, changeable to <x> with --fil-Smin-hmm <x> ) of the database to survive the filter. The maximum threshold that will automatically be chosen is the threshold that will allow a predicted fraction (0.5 by default, changeable to <x> with --fil-Smax-hmm <x> ) of the database to survive the filter. If the threshold from cmcalibrate is greater than this maximum fraction, the HMM filter will be turned off and not used. To ensure that the HMM filter is never turned off and always uses a threshold that gives this maximum fraction you must use the --fil-A-hmm option. If the model is not calibrated, the default HMM filter threshold is 3.0 bits. The HMM filter threshold can be manually set to bit score <x> using the --fil-T-hmm <x> option as described below, or to E-value <x> using the --fil-E-hmm <x> option, or to the bit score that will allow a predicted database fraction of <x> to survive the filter using the --fil-S-hmm <x> option. The --fil-E-hmm and --fil-S-hmm options will only work if the CM file has been calibrated. The HMM filter can be turned off with the --fil-no-hmm option.

The second round of filtering is performed with the CM CYK algorithm (not an HMM) using query-dependent banding (QDB) for acceleration. Briefly, QDB precalculates regions of the dynamic programming matrix that have negligible probability based on the query CM's transition probabilities. During search, these regions of the matrix are ignored to make searches faster. For more information on QDB see (Nawrocki and Eddy, PLoS Computational Biology 3(3): e56). The beta paramater is the amount of probability mass considered negligible during band calculation, lower values of beta yield greater speedups but also a greater chance of missing the optimal alignment. The default beta is 1E-10: determined empirically as a good tradeoff between sensitivity and speed, though this value can be changed with the --fil-beta <x> option. If the CM file has been calibrated, the QDB filter threshold will be automatic set to an appropriate value using an ad-hoc procedure (see the User's Guide). If the CM file has not been calibrated, the default QDB filter threshold is 0.0 bits. The QDB filter threshold can be manually set to bit score <x> using the --fil-T-qdb <x> option as described below, or to E-value <x> using the --fil-E-qdb <x> option. The --fil-E-qdb option will only work if the CM file has been calibrated. The QDB filter can be turned off with the --fil-no-qdb option.

Another way to accelerate cmsearch is to run it in parallel with MPI on multiple computers. To do this, use the --mpi option and run cmsearch inside a MPI wrapper program such as mpirun. For example: mpirun C cmsearch --mpi [other options] cmfile seqfile. The cmsearch program must have been compiled in MPI mode for this to work. See the Installation section of the User's Guide for more information.

The --forecast <n> option will estimate how long a search will take for your cmfile and seqfile on <n> processors. Unless you plan on running cmsearch in MPI mode, <n> should be set as 1.

Another technique for accelerated CM homology search with HMM filters is the construction and use of a "rigorous filter" HMM which was developed by Zasha Weinberg and Larry Ruzzo. All hits above a certain CM bit score threshold are guaranteed to survive the HMM filtering step. Their implementation of rigorous filters has been included in previous versions of Infernal, but not in the current version. For more information see the User's Guide.

OUTPUT

By default, cmsearch outputs the alignments of search hits that score above the final search round threshold. The format of this output is described in the "Tutorial" section of the User's Guide. This format has purposefully not been changed from the 0.x versions of Infernal so as not to break existing parsers. However, it can be augmented with a line of output that marks non-compensatory (negative scoring) basepairs with an 'x' by using the -x option. Alternatively, only negative scoring non-canonical basepairs (those other than A:U, U:A, C:G, G:C, U:G, and G:U) are marked if the -v option is enabled. These two options were added to facilitate quick analysis of the secondary structure of hits by eye. Additionally, the -p option can be used to annotate the posterior probability of each aligned residue in the hit alignments as described below.

The --tabfile <f> outputs a tabular representation of the hits found by cmsearch to the file <f>. Each non-# prefixed line of this file corresponds to a hit, and each such line has 9 fields: "model name" the name of the CM used for the search, "target name" the name of the target sequence the hit was found in, "target coord - start": the start position of the hit in the target sequence, "target coord - stop": the end position of hit in the target sequence, "query coord - start": the start position of the hit in the query model, "query coord - stop": the end position of hit in the query sequence, "bit sc": the bit score of the hit, "E-value": the E-value of the hit (if available, "-" if not), and "GC" the percentage of G and C residues in the hit within the target sequence. cmsearch tab files can be used as input to the Easel miniapp esl-sfetch (included in the easel/miniapp/ subdirectory of infernal) with the -C -f --tabfile options to extract all the hits from the target database file to a new FASTA file. This file can then be aligned to a CM with cmalign.

OPTIONS

-h
Print brief help; includes version number and summary of all options, including expert options.
-o <f>
Save the high-scoring alignments of hits to a file <f>. The default is to write them to standard output.
-g <f>
Turn on the 'glocal' alignment algorithm, local with respect to the target database, and global with respect to the model. By default, the local alignment algorithm is used which is local with respect to both the target sequence and the model. In local mode, the alignment to span two or more subsequences if necessary (e.g. if the structures of the query model and target sequence are only partially shared), allowing certain large insertions and deletions in the structure to be penalized differently than normal indels. Local mode performs better on empirical benchmarks and is significantly more sensitive for remote homology detection. Empirically, glocal searches return many fewer hits than local searches, so glocal may be desired for some applications.
-p
Append posterior probabilities to alignments of hits. For more information on posterior probabilities see the description of the -p option in the manual page for cmalign.
-x
Annotate negative scoring basepairs and basepairs that include a gap in the left or right half of the pair (but not both) with x's in the alignments of hits. The x's appear above the structural annotation in the alignment output. Basepairs without x's above them are compensatory with respect to the model. Compensatory mutations are good evidence for structural homology.
-v
Very similar to -x, but only mark negative scoring basepairs that are non-canonical basepairs (not an A:U, U:A, C:G, G:C, G:U or U:G), and mark them with a 'v' instead of an 'x' in the output.
-Z <x>
Calculate E-values as if the target database size was <x> megabases (Mb). Ignore the actual size of the database. This option is only valid if the CM file has been calibrated. Warning: the predictions for timings and survival fractions will be calculated as if the database was of size <x> Mb, which means they will be inaccurate.
--toponly
Only search the top (Watson) strand of the sequences in seqfile. By default, both strands are searched.
--bottomonly
Only search the bottom (Crick) strand of the sequences in seqfile. By default, both strands are searched.
--forecast <n>
Predict the running time of the search with provided files and options and exit, DO NOT perform the search. This option is only available with calibrated CM files. The predictions should be used as rough estimates and can be fairly inaccurate, especially for highly biased target databases (for example 80% AT genomes). The value for <n> is the number of processors the search will be run on, so <n> equal to 1 is appropriate unless you will run cmsearch in parallel with MPI.
--informat <s>
Assert that the input seqfile is in format <s>. Do not run Babelfish format autodection. This increases the reliability of the program somewhat, because the Babelfish can make mistakes; particularly recommended for unattended, high-throughput runs of Infernal. <s> is case-insensitive. Acceptable formats are: FASTA, EMBL, UNIPROT, GENBANK, and DDBJ. <s> is case-insensitive.
--mxsize <x>
Set the maximum allowable DP matrix size to <x> megabytes. By default this size is 2,048 Mb. This should be large enough for the vast majority of alignments, however if it is not cmsearch will exit prematurely and report an error message that the matrix exceeded it's maximum allowable size. In this case, the --mxsize can be used to raise the limit.
--devhelp
Print help, as with -h, but also include undocumented developer options. These options are not listed below, are under development or experimental, and are not guaranteed to even work correctly. Use developer options at your own risk. The only resources for understanding what they actually do are the brief one-line description printed when --devhelp is enabled, and the source code.
--mpi
Run as an MPI parallel program. This option will only be available if Infernal has been configured and built with the "--enable-mpi" flag (see User's Guide for details).

EXPERT OPTIONS

--inside
Use the Inside algorithm for the final round of searching. This is true by default.
--cyk
Use the CYK algorithm for the final round of searching.
--forward
Search only with an HMM. This is much faster but less sensitive than a CM search. Use the Forward algorithm for the HMM search.
--viterbi
Search only with an HMM. This is much faster but less sensitive than a CM search. Use the Viterbi algorithm for the HMM search.
-E <x>
Set the E-value cutoff for the per-sequence/strand ranked hit list to <x>, where <x> is a positive real number. Hits with E-values better than (less than) or equal to this threshold will be shown. This option is only available if the CM file has been calibrated. This threshold is relevant only to the final round of searching performed after all filters have been used, not to the filter rounds themselves.
-T <x>
Set the bit score cutoff for the per-sequence ranked hit list to <x>, where <x> is a positive real number. Hits with bit scores better than (greater than) this threshold will be shown. This threshold is relevant only to the final round of searching performed after all filters have been used, not to the filter rounds themselves.
--nc
Set the bit score cutoff as the NC cutoff value used by Rfam curators as the noise cutoff score. This is the highest scoring hit found by this model during Rfam curation that the Rfam curators defined as a noise (false positive) sequence. The NC cutoff is defined as <x> bits in the original Stockholm alignment the model was built from with a line: #=GF NC <x> positioned before the sequence alignment. If such a line existed in the alignment provided to cmbuild then the --nc option will be available in cmsearch. If no such line existed when cmbuild was run, then using the --nc option to cmsearch will cause the program to print an error message and exit.
--ga
Set the bit score cutoff as the GA cutoff value used by Rfam curators as the gathering threshold. The GA cutoff is defined in a stockholm file used to build the model in the same way as the NC cutoff (see above), but with a line: #=GF GA <x>
--tc
Set the bit score cutoff as the TC cutoff value used by Rfam curators as the trusted cutoff. The TC cutoff is defined in the stockholm file used to build the model in the same way as the NC cutoff (see above), but with a line: #=GF TC <x>
--no-qdb
Do not use query-dependent banding (QDB) for the final round of search. By default, QDB is used in the final round of search with beta = 1E-15, after all filtering is finished.
--beta <x>
For query-dependent banding (QDB) during the final round of search, set the beta parameter to <x> where <x> is any positive real number less than 1.0. Beta is the probability mass considered negligible during band calculation. The default beta for the final round of search is 1E-15.
--hbanded
Use HMM bands to accelerate the final round of search. Constraints for the CM search are derived from posterior probabilities from an HMM. This is an experimental option and it is not recommended for use unless you know exactly what you're doing.
--tau <x>
Set the tail loss probability during HMM band calculation to <x>. This is the amount of probability mass within the HMM posterior probabilities that is considered negligible. The default value is 1E-7. In general, higher values will result in greater acceleration, but increase the chance of missing the optimal alignment due to the HMM bands. This option only makes sense in combination with --hbanded
--fil-no-hmm
Turn the HMM filter off.
--fil-no-qdb
Turn the QDB filter off.
--fil-beta
For the QDB filter, set the beta parameter to <x> where <x> is any positive real number less than 1.0. Beta is the probability mass considered negligible during band calculation. The default beta for the QDB filter round of search is 1E-10.
--fil-T-qdb <x>
Set the bit score cutoff for the QDB filter round to <x>, where <x> is a positive real number. Hits with bit scores better than (greater than) this threshold will survive the QDB filter and be passed to the final round.
--fil-T-hmm <x>
Set the bit score cutoff for the HMM filter round to <x>, where <x> is a positive real number. Hits with bit scores better than (greater than) this threshold will survive the HMM filter and be passed to the next round, either a QDB filter round, or if the QDB filter is disabled, to the final round of search.
--fil-E-qdb <x>
Set the E-value cutoff for the QDB filter round. <x>, where <x> is a positive real number. Hits with E-values better than (less than) or equal to this threshold will survive and be passed to the final round. This option is only available if the CM file has been calibrated.
--fil-E-hmm <x>
Set the E-value cutoff for the HMM filter round. <x>, where <x> is a positive real number. Hits with E-values better than (less than) or equal to this threshold will survive and be passed to the next round, either a QDB filter round, or if the QDB filter is disable, to the final round of search. This option is only available if the CM file has been calibrated.
--fil-S-hmm <x>
Set the bit score cutoff for the HMM filter round as the score that will allow a predicted <x> fraction of the database to survive the HMM filter round, where <x> is a positive real number between 0 and 1.
--fil-Smax-hmm <x>
When using automatically calibrated HMM thresholds for a CM file calibrated with cmcalibrate, set the maximum HMM filter threshold as the score that will allow a predicted <x> fraction of the database to survive the filter. If the automatic threshold from cmcalibrate exceeds this value, turn the HMM filter off and do not use it for the search. By default, this option is ON with the default value of 0.5 used for <x>. To modify the behavior of this option so it does not turn off the HMM filter if exceeded use the --fil-A-hmm option described below.
--fil-Smin-hmm <x>
When using automatically calibrated HMM thresholds for a CM file calibrated with cmcalibrate, set the minimum HMM filter threshold as the score that will allow a predicted <x> fraction of the database to survive the filter. By default, this option is ON with the default value of 0.02 used for <x>. Setting <x> lower will only accelerate the majority of searches by a small amount.
--fil-A-hmm
Always enforce the maximum HMM filter threshold of <x> from --fil-Smax-hmm <x>. That is, never turn off the HMM filter, or set its threshold above the score that will allow a predicted <x> fraction of the database to survive. This option is OFF by default.
--hmm-W <n>
Set the HMM window size W (maximum size of a hit) to <n>. This option only works in combination with --forward or --viterbi. By default, W is calculated automatically, but this automatic calculation is time consuming for large models.
--hmm-cW <x>
Set the HMM window size W (maximum size of a hit) as <x> times the consensus length of the CM. The consensus length (clen) of the CM can be determined using the cmstat program. This option only works in combination with --forward or --viterbi. By default, W is calculated automatically, but this automatic calculation is time consuming for large models. To find potential full length hits to the model <x> should be greater than 1.0, but values above 2.0 are probably wasteful.
--noalign
Do not calculate and print alignments of each hit, only print locations and scores.
--aln-hbanded
Use HMM bands to accelerate alignment during the hit alignment stage.
--aln-optacc
Calculate alignments of hits from final round of search using the optimal accuracy algorithm which computes the alignment that maximizes the summed posterior probability of all aligned residues given the model, which can be different from the highest scoring one.
--tabfile <f>
Create a new output file <f> and print tabular results to it. The format of the tabular results is listed in the OUTPUT section. The tabular results can be more easily parsed by scripts than the default cmsearch output. The esl-sfetch miniapp included in the easel/miniapps/ subdirectory of infernal has a --tabfile option that allows it to read cmsearch tab files and fetch the hits reported within them from the target database into a new sequence file.
--gcfile <f>
Create a new output file <f> and print statistics of the GC content of the sequences in seqfile to it. The sequences are partitioned into 100 nt non-overlapping windows, and the GC percentage of each window is calculated. A normalized histogram of those GC percentages is then printed to <f> This file can be generated even if cmsearch is run with --forecast and no search is performed.
--rna
Output the hit alignments as RNA sequences alignments. This is true by default.
--dna
Output the hit alignments as DNA sequence alignments.

SEE ALSO

For complete documentation, see the User's Guide (Userguide.pdf) that came with the distribution; or see the Infernal web page, http://infernal.janelia.org/.

 Copyright (C) 2009 HHMI Janelia Farm Research Campus.
 Freely distributed under the GNU General Public License (GPLv3).
 
See the file COPYING that came with the source for details on redistribution conditions.

AUTHOR

 Eric Nawrocki, Diana Kolbe, and Sean Eddy
 HHMI Janelia Farm Research Campus
 19700 Helix Drive
 Ashburn VA 20147
 http://selab.janelia.org/