Bio::Assembly::IO::tigr.3pm

Language: en

Version: 2010-05-19 (ubuntu - 24/10/10)

Section: 3 (Library functions)

NAME

Bio::Assembly::IO::tigr - Driver to read and write assembly files in the TIGR Assembler v2 default format.

SYNOPSIS

     # Building an input stream
     use Bio::Assembly::IO;
 
     # Assembly loading methods
     my $asmio = Bio::Assembly::IO->new( -file   => 'SGC0-424.tasm',
                                         -format => 'tigr' );
     my $scaffold = $asmio->next_assembly;
 
     # Do some things on contigs...
 
     # Assembly writing methods
     my $outasm = Bio::Assembly::IO->new( -file   => ">SGC0-modified.tasm",
                                          -format => 'tigr' );
     $outasm->write_assembly( -scaffold => $scaffold,
                              -singlets => 1 );
 
 

DESCRIPTION

This package reads assembly information from, and writes it to, files in the default TIGR Assembler v2 format. The files are lassie-formatted and often have the .tasm extension. The module is meant to be used as a driver for Bio::Assembly::IO input/output.

Implementation

Assemblies are loaded into Bio::Assembly::Scaffold objects composed of Bio::Assembly::Contig and Bio::Assembly::Singlet objects. Since the tasm files provide the aligned reads and the gapped contig consensus, only aligned/gapped sequences are added to the different BioPerl objects.

Additional assembly information is stored as features. Contig objects have SeqFeature information associated with the primary_tag:

     _main_contig_feature:$contig_id -> misc contig information
     _quality_clipping:$read_id      -> quality clipping position
 
 

Read objects have sub_seqFeature information associated with the primary_tag:

     _main_read_feature:$read_id     -> misc read information
 
 

Singlets are considered by TIGR Assembler as contigs of one sequence and are represented here with features having these primary_tag values:

     _main_contig_feature:$contig_id
     _quality_clipping:$read_primary_id
     _main_read_feature:$read_primary_id
     _aligned_coord:$read_primary_id
 
 

THE TIGR TASM LASSIE FORMAT

Description

In the TIGR tasm lassie format, contigs are separated by a line containing a single pipe character "|", whereas the reads in a contig are separated by a blank line. Singlets can be present in the file and are represented as a contig composed of a single sequence.

Other than the two above-mentioned separators, each line has an attribute name, followed by a tab and then an attribute value.

The tasm format is used by more TIGR applications than just TIGR Assembler. Some of the attributes are not used by TIGR Assembler or have constant values. They are indicated by an asterisk *
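
The layout described above can be read with a few lines of core Perl. The following is only a minimal sketch, not the code used by this module: the function name parse_tasm and the returned data structure are hypothetical, and it assumes that the first block of each contig holds the contig attributes and that names and values are separated by a single tab.

```perl
use strict;
use warnings;

# Hypothetical helper: parse tasm text into a list of contig hashes.
# Each contig hash holds the contig attributes plus a 'reads' array
# containing one attribute hash per read.
sub parse_tasm {
    my ($text) = @_;
    my @contigs;
    my @blocks = ({});                       # attribute blocks of the current contig

    my $flush = sub {
        my @filled = grep { scalar keys %$_ } @blocks;   # ignore empty blocks
        if (@filled) {
            my $contig = shift @filled;      # first block: contig attributes
            $contig->{reads} = [@filled];    # remaining blocks: its reads
            push @contigs, $contig;
        }
        @blocks = ({});
    };

    for my $line (split /\n/, $text) {
        if ($line =~ /^\s*\|\s*$/) {         # single pipe: end of a contig
            $flush->();
        }
        elsif ($line =~ /^\s*$/) {           # blank line: start of the next block
            push @blocks, {};
        }
        else {
            $line =~ s/^\s+//;
            # Assumption: attribute name, one tab, attribute value
            my ($name, $value) = split /\t/, $line, 2;
            $blocks[-1]{$name} = defined $value ? $value : '';
        }
    }
    $flush->();                              # tolerate a missing final pipe
    return @contigs;
}

my $tasm = join "\n",
    "asmbl_id\t93", "seq#\t1", "",
    "seq_name\tread1", "asm_lend\t1", "|", "";
my @contigs = parse_tasm($tasm);
print "$contigs[0]{asmbl_id} has ", scalar @{ $contigs[0]{reads} }, " read(s)\n";
```

Contig and read attributes are kept in separate hashes because some names, such as comment and lsequence, occur at both levels.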

Contigs have the following attributes:

     asmbl_id   -> contig ID
     sequence   -> contig ungapped consensus sequence (ambiguities are lowercase)
     lsequence  -> gapped consensus sequence (lowercase ambiguities)
     quality    -> gapped consensus quality score (in hexadecimal)
     seq_id     -> *
     com_name   -> *
     type       -> *
     method     -> always 'asmg' *
     ed_status  -> *
     redundancy -> fold coverage of the contig consensus
     perc_N     -> percent of ambiguities in the contig consensus
     seq#       -> number of sequences in the contig
     full_cds   -> *
     cds_start  -> start of coding sequence *
     cds_end    -> end of coding sequence *
     ed_pn      -> name of editor (always 'GRA') *
     ed_date    -> date and time of edition
     comment    -> some comments *
     frameshift -> *
 
 

Each read has the following attributes:

     seq_name  -> read name
     asm_lend  -> position of first base on contig ungapped consensus sequence
     asm_rend  -> position of last base on contig ungapped consensus sequence
     seq_lend  -> start of quality-trimmed sequence (aligned read coordinates)
     seq_rend  -> end of quality-trimmed sequence (aligned read coordinates)
     best      -> always '0' *
     comment   -> some comments *
     db        -> database name associated with the sequence (e.g. >my_db|seq1234)
     offset    -> offset of the sequence (gapped consensus coordinates)
     lsequence -> aligned read sequence (ambiguities are uppercase)
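
The asm_lend attribute is expressed in ungapped consensus coordinates while offset is expressed in gapped consensus coordinates, so gaps in the consensus shift one relative to the other. The sketch below illustrates that relationship under the assumption that offset is 0-based, which is what the example values (asm_lend 1 with offset 0) suggest. The function name gapped_offset is hypothetical and this is not necessarily how the module computes it.

```perl
use strict;
use warnings;

# Hypothetical helper: 0-based position, in the gapped consensus, of the
# base that has 1-based position $asm_lend in the ungapped consensus.
sub gapped_offset {
    my ($gapped_consensus, $asm_lend) = @_;
    my $seen = 0;                            # non-gap bases seen so far
    for my $i (0 .. length($gapped_consensus) - 1) {
        next if substr($gapped_consensus, $i, 1) eq '-';
        $seen++;
        return $i if $seen == $asm_lend;
    }
    return;                                  # asm_lend lies beyond the consensus
}

print gapped_offset('AC-GT', 3), "\n";       # prints 3
```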
 
 

When seq_rend < seq_lend, the sequence was on the complementary DNA strand but its reverse complement is shown in the aligned sequence of the assembly file, not the original read.
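
As an illustration, the orientation of a read can be derived from its seq_lend and seq_rend values. The third read of the example further down has seq_lend 641 and seq_rend 1, so it was reverse-complemented. The helper names read_strand and trimmed_length are hypothetical and only serve this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: -1 if the read was reverse-complemented for the
# alignment (seq_rend < seq_lend), +1 otherwise.
sub read_strand {
    my ($seq_lend, $seq_rend) = @_;
    return $seq_rend < $seq_lend ? -1 : 1;
}

# Hypothetical helper: length of the quality-trimmed read, whichever
# strand it came from.
sub trimmed_length {
    my ($seq_lend, $seq_rend) = @_;
    return abs($seq_rend - $seq_lend) + 1;
}

printf "strand %d, %d bases\n", read_strand(641, 1), trimmed_length(641, 1);
# prints "strand -1, 641 bases"
```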

Ambiguities are reflected in the contig consensus sequence as lowercase IUPAC characters: a c g t u m r w s y k x n . In the read sequences, however, ambiguities are uppercase: M R W S Y K X N

Example

Example of a contig containing three sequences:
     sequence    CGATGCTGTACGGCTGTTGCGACAGATTGCGCTGGGTCGATACCGCGTTGGTGATCGGCTTGTTCAGCGGGCTCTGGTTCGGCGACAGCGCGGCGATCTTGGCGGCTGCGAAGGTTGCCGGCGCAATCATGCGCTGCTGACCGTTGACCTGGTCCTGCCAGTACACCCAGTCGCCCACCATGACCTTCAGCGCGTAGCTGTCACAGCCGGCTGTGGTCAGCGCAGTGGCGACGGTGGTGTAGGAGGCGCCAGCAACACCTTGGGTGATCATGTAGCAGCCTTCTGACAGGCCGTAGGTCAGCATGGTCGGCCACTGGGTACCAGTCAGTCGGGTCAACCGAGATTCGCAsCTGAGCGCCACTGCCGCGCAGAGCGTACATGCCCTTGCGGGTCGCGCCGGTAACACCATCCACGCCGATCAGAACTGCGTCGGTGATGGTGGTGTTACCCGAGGTGCCAGTGGTGAAGGCGACGGTCTGGGTGCTGGCCACAGGCGCCAGAGTGGTCGCGCCAACGGTGGCGATGACCAGTTGCGATGGGCCACGGATACCTGACTGCCCGTTGTTCACGGCGCTGACGATGTTCTGCCACAGCGCCAGGCCAGAGCCGGTGATGTTGTCGAACACTTCGGGCGCAACGCCAGGGAGCGAGACGGTCAGCTTCCAGCTCGAAGCAGCGGAGCCAGTAGCCAGGGCGGCGCTGAGCGAGTTGCCGAGCGTGCCGGTGTAGAACGCGGTCAGCGTGGCGCCGGTGGCGGCGGCAGTGTCCTTCAGCGCACTGGTCGCGGCGGTGTCGGTGCCGTCAGTGACGCGCACGGCGCGGATGTTCGAGGCGCCGCCCTGGATTGATACCGCCAGCGCGGTGCACAGGTCGTACTTGCGCACGGTCyGAGTGCCGAACTTCTGCGATGCGTCACCTGGCGAGCCGATAaGCGTGGCGCTGTTCACCGGCCCCCAGTCAGCAATGCCGACGATGCCGAGAATGTCAGTCGGGACGCCATTGATGTAGCGGGTCTTGGGCGCCACTATTTGTATGTACAAATCTGGCGCAGATAAAGCCGCCGTATTCAAATAACCAGCAGGATAGATAGGCATCACGCCTCCAGAATGAAAAAGGCCACCGATTAGGTGGCCTTTGTTGTGTTCGGCTGGCTGTTAGAGCAGCAGCCCGTTTTCCCGCGCAAACGCGAATGGGTCCTTGTCATGCTTCCTGCAATTGCAGGTAGGACAAAGAATTTGCAGGTTGGATTTGTCGTTCGATCCGCCCTTTGCAAGCGGGAACACGTGGTCAACGTGATACCCATCCCTTATGGATATAGTGCACATGGCGCATTTCCAGCGCTGAGCAGCCAGCAAAAATTTTATGTCGTCGCCGGTGTGTGAGCCGACAGCATTTTTCTTGCGAGCCTTGTATGTCCGCGAGAGTGAACGAACTTGCTCCTTGTTGGCTGTCTTCCAGAGCTTTTGAGTAAGCGCACAGAGATCCTTGTTTCTTGATCTCCACTCTCTGGTTGCGGAAAT
     lsequence   CGATGCTGTACGGCTGTTGCGACAGATTGCGCTGGGTCGATACCGCGTTGGTGATCGGCTTGTTCAGCGGGCTCTGGTTCGGCGACAGCGCGGCGATCTTGGCGGCTGCGAAGGTTGCCGGCGCAATCATGCGCTGCTGACCGTTGACCTGGTCCTGCCAGTACACCCAGTCGCCCACCATGACCTTCAGCGCGTAGCTGTCACAGCCGGCTGTGGTCAGCGCAGTGGCGACGGTGGTGTAGGAGGCGCCAGCAACACCTTGGGTGATCATGTAGCAGCCTTCTGACAGGCCGTAGGTCAGCATGGTCGGCCACTGGGTACCAGTCAGTCGGGTCAACCGAGATTCG-CAsCTGAGCGCCACTGCCGCGCAGAGCGTACATGCCCTTGCGGGTCGCGCCGGTAACACCATCCACGCCGATCAGAACTGCGTCGGTGATGGTGGTGTTACCCGAGGTGCCAGTGGTGAAGGCGACGGTCTGGGTGCTGGCCACAGGCGCCAGAGTGGTCGCGCCAACGGTGGCGATGACCAGTTGCGATGGGCCACGGATACCTGACTGCCCGTTGTTCACGGCGCTGACGATGTTCTGCCACAGCGCCAGGCCAGAGCCGGTGATGTTGTCGAACACTTCGGGCGCAACGCCAGGGAGCGAGACGGTCAGCTTCCAGCTCGAAGCAGCGGAGCCAGTAGCCAGGGCGGCGCTGAGCGAGTTGCCGAGCGTGCCGGTGTAGAACGCGGTCAGCGTGGCGCCGGTGGCGGCGGCAGTGTCCTTCAGCGCACTGGTCGCGGCGGTGTCGGTGCCGTCAGTGACGCGCACGGCGCGGATGTTCGAGGCGCCGCCCTGGATTGATACCGCCAGCGCGGTGCACAGGTCGTACTTGCGCACGGTCyGAGTGCCGAACTTCTGCGATGCGTCACCTGGCGAGCCGATAaGCGTGGCGCTGTTCACCGGCCCCCAGTCAGCAATGCCGACGATGCCGAGAATGTCAGTCGGGACGCCATTGATGTAGCGGGTCTTGGGCGCCACTATTTGTATGTACAAATCTGGCGCAGATAAAGCCGCCGTATTCAAATAACCAGCAGGATAGATAGGCATCACGCCTCCAGAATGAAAAAGGCCACCGATTAGGTGGCCTTTGTTGTGTTCGGCTGGCTGTTAGAGCAGCAGCCCGTTTTCCCGCGCAAACGCGAATGGGTCCTTGTCATGCTTCCTGCAATTGCAGGTAGGACAAAGAATTTGCAGGTTGGATTTGTCGTTCGATCCGCCCTTTGCAAGCGGGAACACGTGGTCAACGTGATACCCATCCCTTATGGATATAGTGCACATGGCGCATTTCCAGCGCTGAGCAGCCAGCAAAAATTTTATGTCGTCGCCGGTGTGTGAGCCGACAGCATTTTTCTTGCGAGCCTTGTATGTCCGCGAGAGTGAACGAACTTGCTCCTTGTTGGCTGTCTTCCAGAGCTTTTGAGTAAGCGCACAGAGATCCTTGTTTCTTGATCTCCACTCTCTGGTTGCGGAAAT
     quality     0x
     asmbl_id    93
     seq_id      
     com_name    
     type        
     method      asmg
     ed_status   
     redundancy  1.11
     perc_N      0.20
     seq#        3
     full_cds    
     cds_start   
     cds_end     
     ed_pn       GRA
     ed_date     08/16/07 17:10:12
     comment     
     frameshift  
 
     seq_name    SDSU_RFPERU_010_C09.x01.phd.1
     asm_lend    1
     asm_rend    442
     seq_lend    1
     seq_rend    442
     best        0
     comment     
     db  
     offset      0
     lsequence   CGATGCTGTACGGCTGTTGCGACAGATTGCGCTGGGTCGATACCGCGTTGGTGATCGGCTTGTTCAGCGGGCTCTGGTTCGGCGACAGCGCGGCGATCTTGGCGGCTGCGAAGGTTGCCGGCGCAATCATGCGCTGCTGACCGTTGACCTGGTCCTGCCAGTACACCCAGTCGCCCACCATGACCTTCAGCGCGTAGCTGTCACAGCCGGCTGTGGTCAGCGCAGTGGCGACGGTGGTGTAGGAGGCGCCAGCAACACCTTGGGTGATCATGTAGCAGCCTTCTGACAGGCCGTAGGTCAGCATGGTCGGCCACTGGGTACCAGTCAGTCGGGTCAACCGAGATTCG-CAGCTGAGCGCCACTGCCGCGCAGAGCGTACATGCCCTTGCGGGTCGCGCCGGTAACACCATCCACGCCGATCAGAACTGCGTCGGTGATGGTGG
 
     seq_name    SDSU_RFPERU_002_H12.x01.phd.1
     asm_lend    339
     asm_rend    940
     seq_lend    1
     seq_rend    602
     best        0
     comment     
     db  
     offset      338
     lsequence   CGAGATTCGCCACCTGAGCGCCACTGCCGCGCAGAGCGTACATGCCCTTGCGGGTCGCGCCGGTAACACCATCCACGCCGATCAGAACTGCGTCGGTGATGGTGGTGTTACCCGAGGTGCCAGTGGTGAAGGCGACGGTCTGGGTGCTGGCCACAGGCGCCAGAGTGGTCGCGCCAACGGTGGCGATGACCAGTTGCGATGGGCCACGGATACCTGACTGCCCGTTGTTCACGGCGCTGACGATGTTCTGCCACAGCGCCAGGCCAGAGCCGGTGATGTTGTCGAACACTTCGGGCGCAACGCCAGGGAGCGAGACGGTCAGCTTCCAGCTCGAAGCAGCGGAGCCAGTAGCCAGGGCGGCGCTGAGCGAGTTGCCGAGCGTGCCGGTGTAGAACGCGGTCAGCGTGGCGCCGGTGGCGGCGGCAGTGTCCTTCAGCGCACTGGTCGCGGCGGTGTCGGTGCCGTCAGTGACGCGCACGGCGCGGATGTTCGAGGCGCCGCCCTGGATTGATACCGCCAGCGCGGTGCACAGGTCGTACTTGCGCACGGTCCGAGTGCCGAACTTCTGCGATGCGTCACCTGGCGAGCCGATA-GCGTGGCGC
 
     seq_name    SDSU_RFPERU_009_E07.x01.phd.1
     asm_lend    880
     asm_rend    1520
     seq_lend    641
     seq_rend    1
     best        0
     comment     
     db  
     offset      880
     lsequence   CGCACGGTCTGAGTGCCGAACTTCTGCGATGCGTCACCTGGCGAGCCGATAAGCGTGGCGCTGTTCACCGGCCCCCAGTCAGCAATGCCGACGATGCCGAGAATGTCAGTCGGGACGCCATTGATGTAGCGGGTCTTGGGCGCCACTATTTGTATGTACAAATCTGGCGCAGATAAAGCCGCCGTATTCAAATAACCAGCAGGATAGATAGGCATCACGCCTCCAGAATGAAAAAGGCCACCGATTAGGTGGCCTTTGTTGTGTTCGGCTGGCTGTTAGAGCAGCAGCCCGTTTTCCCGCGCAAACGCGAATGGGTCCTTGTCATGCTTCCTGCAATTGCAGGTAGGACAAAGAATTTGCAGGTTGGATTTGTCGTTCGATCCGCCCTTTGCAAGCGGGAACACGTGGTCAACGTGATACCCATCCCTTATGGATATAGTGCACATGGCGCATTTCCAGCGCTGAGCAGCCAGCAAAAATTTTATGTCGTCGCCGGTGTGTGAGCCGACAGCATTTTTCTTGCGAGCCTTGTATGTCCGCGAGAGTGAACGAACTTGCTCCTTGTTGGCTGTCTTCCAGAGCTTTTGAGTAAGCGCACAGAGATCCTTGTTTCTTGATCTCCACTCTCTGGTTGCGGAAAT
     |
 
 

...

FEEDBACK

Mailing Lists

User feedback is an integral part of the evolution of this and other Bioperl modules. Send your comments and suggestions preferably to the Bioperl mailing lists. Your participation is much appreciated.
   bioperl-l@bioperl.org                  - General discussion
   http://bioperl.org/wiki/Mailing_lists  - About the mailing lists
 
 

Support

Please direct usage questions or support issues to the mailing list:

bioperl-l@bioperl.org

rather than to the module maintainer directly. Many experienced and responsive experts will be able to look at the problem and quickly address it. Please include a thorough description of the problem with code and data examples if at all possible.

Reporting Bugs

Report bugs to the BioPerl bug tracking system to help us keep track of the bugs and their resolution. Bug reports can be submitted via email or the web:
   bioperl-bugs@bio.perl.org
   http://bugzilla.bioperl.org/
 
 

AUTHOR - Florent E Angly

Email florent dot angly at gmail dot com

APPENDIX

The rest of the documentation details each of the object methods. Internal methods are usually preceded with a "_".

next_assembly

  Title   : next_assembly
  Usage   : my $scaffold = $asmio->next_assembly()
  Function: return the next assembly in the tasm-formatted stream
  Returns : Bio::Assembly::Scaffold object
  Args    : none
 
 

_qual_hex2dec

     Title   : _qual_hex2dec
     Usage   : my $dec_quality = $self->_qual_hex2dec($hex_quality);
     Function: convert a hexadecimal quality score into a decimal quality score
     Returns : string
     Args    : string
 
 

_qual_dec2hex

     Title   : _qual_dec2hex
     Usage   : my $hex_quality = $self->_qual_dec2hex($dec_quality);
     Function: convert a decimal quality score into a hexadecimal quality score
     Returns : string
     Args    : string
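
The two conversions can be pictured with the following sketch. It rests on assumptions that this documentation does not spell out: the hexadecimal string starts with "0x" (as in the example contig), each base contributes exactly two hexadecimal digits, decimal scores are separated by single spaces, and uppercase digits are written. The function names are hypothetical stand-ins, not the module's internal methods.

```perl
use strict;
use warnings;

# Hypothetical sketch: "0x0A1F" -> "10 31"
sub qual_hex2dec {
    my ($hex) = @_;
    $hex =~ s/^0x//i;                        # drop the assumed "0x" prefix
    # Assumption: two hexadecimal digits per base
    return join ' ', map { hex($_) } $hex =~ /([0-9A-Fa-f]{2})/g;
}

# Hypothetical sketch: "10 31" -> "0x0A1F"
sub qual_dec2hex {
    my ($dec) = @_;
    return '0x' . join '', map { sprintf('%02X', $_) } split ' ', $dec;
}

print qual_hex2dec('0x0A1F28'), "\n";        # prints "10 31 40"
print qual_dec2hex('10 31 40'), "\n";        # prints "0x0A1F28"
```

With these assumptions an empty quality string, written as a bare "0x", converts to an empty decimal string and back.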
 
 

_store_contig

     Title   : _store_contig
     Usage   : my $contigobj; $contigobj = $self->_store_contig(
               \%contiginfo, $contigobj, $scaffoldobj);
     Function: store information of a contig belonging to a scaffold in the
               appropriate object
     Returns : Bio::Assembly::Contig object
     Args    : hash, Bio::Assembly::Contig, Bio::Assembly::Scaffold
 
 

_store_read

     Title   : _store_read
     Usage   : my $readobj = $self->_store_read(\%readinfo, $contigobj);
     Function: store information of a read belonging to a contig in the appropriate object
     Returns : Bio::LocatableSeq
     Args    : hash, Bio::Assembly::Contig
 
 

_store_singlet

     Title   : _store_singlet
     Usage   : my $singletobj = $self->_store_singlet(\%readinfo, \%contiginfo,
                   $scaffoldobj);
     Function: store information of a singlet belonging to a scaffold in the appropriate object
     Returns : Bio::Assembly::Singlet
     Args    : hash, hash, Bio::Assembly::Scaffold
 
 

write_assembly

     Title   : write_assembly
     Usage   : $ass_io->write_assembly($assembly)
     Function: Write the assembly object in TIGR Assembler compatible tasm lassie  
               format
     Returns : 1 on success, 0 for error
     Args    : A Bio::Assembly::Scaffold object
 
 

_perc_N

     Title   : _perc_N
     Usage   : my $perc_N = $ass_io->_perc_N($sequence_string)
     Function: Calculate the percent of ambiguities in a sequence.
               M R W S Y K X N are regarded as ambiguities in an aligned read
               sequence by TIGR Assembler. In the case of a gapped contig
               consensus sequence, all lowercase symbols are ambiguities, i.e.:
               a c g t u m r w s y k x n.
     Returns : decimal number
     Args    : string
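
A minimal sketch of this calculation is shown below. The function name perc_ambiguous and its second argument are hypothetical, and excluding gap characters from the total is an assumption rather than documented behaviour.

```perl
use strict;
use warnings;

# Hypothetical sketch: percentage of ambiguous symbols in a sequence.
# $is_consensus selects the consensus rule (lowercase symbols) instead of
# the read rule (uppercase M R W S Y K X N).
sub perc_ambiguous {
    my ($seq, $is_consensus) = @_;
    (my $bases = $seq) =~ tr/-//d;           # assumption: gaps are not counted
    return 0 unless length($bases);
    my $ambiguous = $is_consensus
        ? ($bases =~ tr/acgtumrwsykxn//)     # count lowercase consensus symbols
        : ($bases =~ tr/MRWSYKXN//);         # count uppercase read symbols
    return 100 * $ambiguous / length($bases);
}

printf "%.2f\n", perc_ambiguous('ACGTsACGTA', 1);   # prints 10.00
printf "%.2f\n", perc_ambiguous('ACGNACGT',   0);   # prints 12.50
```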
 
 

_redundancy

     Title   : _redundancy
     Usage   : my $ref = $ass_io->_redundancy($contigobj)
     Function: Calculate the fold coverage (redundancy) of a contig consensus
               (average number of read base pairs covering the consensus)
     Returns : decimal number
     Args    : Bio::Assembly::Contig
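
The redundancy value of the example contig can be reproduced by hand. Its three reads span 442, 602 and 641 bases according to their seq_lend and seq_rend values, and the last read ends at asm_rend 1520, which is taken here as the length of the ungapped consensus. The sketch below works on plain numbers rather than on a Bio::Assembly::Contig object. The function name fold_coverage is hypothetical, and whether the module counts gapped or ungapped bases is not stated in this documentation.

```perl
use strict;
use warnings;

# Hypothetical sketch: average number of read bases covering each
# consensus position.
sub fold_coverage {
    my ($consensus_length, @read_lengths) = @_;
    my $total = 0;
    $total += $_ for @read_lengths;
    return $total / $consensus_length;
}

printf "%.2f\n", fold_coverage(1520, 442, 602, 641);   # prints 1.11
```

The result, 1685 / 1520, rounds to the 1.11 shown in the redundancy attribute of the example.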
 
 

_ungap

     Title   : _ungap
     Usage   : my $ungapped = $ass_io->_ungap($gapped)
     Function: Remove the gaps from a sequence. Gaps are - in TIGR Assembler
     Returns : string
     Args    : string
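
Removing gaps amounts to deleting every "-" character, for example with the tr operator. The stand-alone function ungap below is a sketch of that idea and not the module's internal method.

```perl
use strict;
use warnings;

# Hypothetical sketch: delete the gap characters from a copy of the sequence.
sub ungap {
    my ($seq) = @_;
    $seq =~ tr/-//d;
    return $seq;
}

print ungap('GATTCG-CAsCTG'), "\n";          # prints "GATTCGCAsCTG"
```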
 
 

_date_time

     Title   : _date_time
     Usage   : my $timepoint = $ass_io->_date_time
     Function: Get date and time (MM/DD/YY HH:MM:SS)
     Returns : string
     Args    : none
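
A timestamp in this layout, such as the "08/16/07 17:10:12" of the example contig, can be produced with strftime from the core POSIX module. The sketch below assumes local time, which this documentation does not specify. The function name date_time and its optional epoch argument are hypothetical.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical sketch: format a time as MM/DD/YY HH:MM:SS.
# Assumption: local time; defaults to the current time.
sub date_time {
    my ($epoch) = @_;
    $epoch = time unless defined $epoch;
    return strftime('%m/%d/%y %H:%M:%S', localtime($epoch));
}

print date_time(), "\n";
```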
 
 

_split_seq_name_and_db

     Title   : _split_seq_name_and_db
     Usage   : my ($seqname, $db) = $ass_io->_split_seq_name_and_db($id)
     Function: Extract seq_name and db from sequence id
     Returns : seq_name, db
     Args    : id
 
 

_merge_seq_name_and_db

     Title   : _merge_seq_name_and_db
     Usage   : my $id = $ass_io->_merge_seq_name_and_db($seq_name, $db)
     Function: Construct id from seq_name and db
     Returns : id
     Args    : seq_name, db
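
The db attribute description gives ">my_db|seq1234" as an example, which suggests an identifier of the form db|seq_name. Under that assumption, splitting and merging could look like the following sketch. The function names are hypothetical and the module's actual rules may differ, for instance for identifiers containing several pipe characters.

```perl
use strict;
use warnings;

# Hypothetical sketch: "my_db|seq1234" -> ("seq1234", "my_db").
# An identifier without a pipe is treated as a bare sequence name.
sub split_seq_name_and_db {
    my ($id) = @_;
    my ($db, $seq_name) = $id =~ /^([^|]+)\|(.+)$/ ? ($1, $2) : ('', $id);
    return ($seq_name, $db);
}

# Hypothetical sketch: ("seq1234", "my_db") -> "my_db|seq1234"
sub merge_seq_name_and_db {
    my ($seq_name, $db) = @_;
    return (defined $db && length $db) ? "$db|$seq_name" : $seq_name;
}

my ($seq_name, $db) = split_seq_name_and_db('my_db|seq1234');
print "$seq_name from $db\n";                        # prints "seq1234 from my_db"
print merge_seq_name_and_db($seq_name, $db), "\n";   # prints "my_db|seq1234"
```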
 
 

_coord

     Title   : _coord
     Usage   : my @coords = $ass_io->_coord($readobj, $contigobj)
     Function: Get different coordinates for the read
     Returns : number, number, number, number, number
     Args    : Bio::Assembly::Seq, Bio::Assembly::Contig