readAligned            package:ShortRead            R Documentation

_R_e_a_d _a_l_i_g_n_e_d _r_e_a_d_s _a_n_d _t_h_e_i_r _q_u_a_l_i_t_y _s_c_o_r_e_s _i_n_t_o _R _r_e_p_r_e_s_e_n_t_a_t_i_o_n_s

_D_e_s_c_r_i_p_t_i_o_n:

     'readAligned' reads all aligned read files in a directory
     'dirPath' whose file name matches 'pattern', returning a compact
     internal representation of the alignments, sequences, and quality
     scores in the files. Methods read all files into a single R
     object; a typical use is to restrict input to a single aligned
     read file.

_U_s_a_g_e:

     readAligned(dirPath, pattern=character(0), ...)

_A_r_g_u_m_e_n_t_s:

 dirPath: A character vector (or other object; see methods defined on
          this generic) giving the directory path (relative or
          absolute) of aligned read files to be input.

 pattern: The ('grep'-style) pattern describing file names to be read.
          The default ('character(0)') results in (attempted) input of
          all files in the directory.

     ...: Additional arguments, used by methods. When 'dirPath' is a
          character vector, the argument 'type' must be provided.
          Possible values for 'type' and their meaning are described
          below. Most methods implement 'filter=srFilter()', allowing
          objects of class 'SRFilter' to selectively return aligned
          reads.
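
     As a rough illustration of how a 'grep'-style 'pattern' narrows
     the set of files, the sketch below uses base R's 'list.files' on
     a temporary directory. It assumes that 'pattern' behaves like the
     'pattern' argument of 'list.files'; the directory and file names
     are hypothetical, and this is not ShortRead's internal code.

```r
## Illustration only: base R file matching on made-up file names.
dirPath <- file.path(tempdir(), "aligned_demo")
dir.create(dirPath, showWarnings = FALSE)
fileNames <- c("s_1_export.txt", "s_2_export.txt", "s_2_0001_realign.txt")
invisible(file.create(file.path(dirPath, fileNames)))

## An anchored regular expression selects exactly one file ...
oneFile <- list.files(dirPath, pattern = "^s_2_export\\.txt$")
## ... whereas omitting the pattern lists every file in the directory.
allFiles <- list.files(dirPath)

print(oneFile)
print(allFiles)
```

     Anchoring the expression with '^' and '$' and escaping the '.'
     avoids accidentally matching additional files.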

_D_e_t_a_i_l_s:

     There is no standard aligned read file format; methods parse
     particular file types.

     The 'readAligned,character-method' interprets file types based on
     an additional 'type' argument. Supported types are:


   '_t_y_p_e="_S_o_l_e_x_a_E_x_p_o_r_t"' This type parses '.*_export.txt' files
        following the documentation in the Solexa Genome Alignment
        software manual, version 0.3.0. These files consist of the
        following columns; consult Solexa documentation for precise
        descriptions. If parsed, values can be retrieved from
        'AlignedRead' as follows:

        _M_a_c_h_i_n_e Ignored

        _R_u_n _n_u_m_b_e_r stored in 'alignData'

        _L_a_n_e stored in 'alignData'

        _T_i_l_e stored in 'alignData'

        _X stored in 'alignData'

        _Y stored in 'alignData'

        _I_n_d_e_x _s_t_r_i_n_g Ignored

        _R_e_a_d _n_u_m_b_e_r Ignored

        _R_e_a_d 'sread'

        _Q_u_a_l_i_t_y 'quality'

        _M_a_t_c_h _c_h_r_o_m_o_s_o_m_e 'chromosome'

        _M_a_t_c_h _c_o_n_t_i_g Ignored

        _M_a_t_c_h _p_o_s_i_t_i_o_n 'position'

        _M_a_t_c_h _s_t_r_a_n_d 'strand'

        _M_a_t_c_h _d_e_s_c_r_i_p_t_i_o_n Ignored

        _S_i_n_g_l_e-_r_e_a_d _a_l_i_g_n_m_e_n_t _s_c_o_r_e 'alignQuality'

        _P_a_i_r_e_d-_r_e_a_d _a_l_i_g_n_m_e_n_t _s_c_o_r_e Ignored

        _P_a_r_t_n_e_r _c_h_r_o_m_o_s_o_m_e Ignored

        _P_a_r_t_n_e_r _c_o_n_t_i_g Ignored

        _P_a_r_t_n_e_r _o_f_f_s_e_t Ignored

        _P_a_r_t_n_e_r _s_t_r_a_n_d Ignored

        _F_i_l_t_e_r_i_n_g 'alignData'

        Paired read columns are not interpreted.  The resulting
        'AlignedRead' object does _not_ contain a meaningful 'id';
        instead, use information from 'alignData' to identify reads.

        Different interfaces to reading alignment files are described
        in 'SolexaPath' and 'SolexaSet'.


   '_t_y_p_e="_S_o_l_e_x_a_P_r_e_a_l_i_g_n"' See SolexaRealign

   '_t_y_p_e="_S_o_l_e_x_a_A_l_i_g_n"' See SolexaRealign

   '_t_y_p_e="_S_o_l_e_x_a_R_e_a_l_i_g_n"' These types parse 's_L_TTTT_prealign.txt',
        's_L_TTTT_align.txt' or 's_L_TTTT_realign.txt' files produced
        by default and eland analyses. From the Solexa documentation,
        'align' corresponds to unfiltered first-pass alignments,
        'prealign' adjusts alignments for error rates (when available),
        'realign' filters alignments to exclude clusters failing to
        pass quality criteria.

        Because base quality scores are not stored with alignments, the
        object returned by 'readAligned' scores all base qualities as
        '-32'.

        If parsed, values can be retrieved from 'AlignedRead' as
        follows:

        _S_e_q_u_e_n_c_e stored in 'sread'

        _B_e_s_t _s_c_o_r_e stored in 'alignQuality'

        _N_u_m_b_e_r _o_f _h_i_t_s stored in 'alignData'

        _T_a_r_g_e_t _p_o_s_i_t_i_o_n stored in 'position'

        _S_t_r_a_n_d stored in 'strand'

        _T_a_r_g_e_t _s_e_q_u_e_n_c_e Ignored; parse using 'readXStringColumns'

        _N_e_x_t _b_e_s_t _s_c_o_r_e stored in 'alignData'


   '_t_y_p_e="_S_o_l_e_x_a_R_e_s_u_l_t"' This parses 's_L_eland_results.txt' files, an
        intermediate format that does not contain read or alignment
        quality scores.

        Because base quality scores are not stored with alignments, the
        object returned by 'readAligned' scores all base qualities as
        '-32'.

        Columns of this file type can be retrieved from 'AlignedRead'
        as follows (description of columns is from Table 19, Genome
        Analyzer Pipeline Software User Guide, Revision A, January
        2008):

        _I_d Not parsed

        _S_e_q_u_e_n_c_e stored in 'sread'

        _T_y_p_e _o_f _m_a_t_c_h _c_o_d_e Stored in 'alignData' as 'matchCode'. Codes
             are (from the Eland manual): NM (no match); QC (no match
             due to quality control failure); RM (no match due to
             repeat masking); U0 (best match was unique and exact); U1
             (best match was unique, with 1 mismatch); U2 (best match
             was unique, with 2 mismatches); R0 (multiple exact matches
             found); R1 (multiple 1 mismatch matches found, no exact
             matches); R2 (multiple 2 mismatch matches found, no exact
             or 1-mismatch matches).

        _N_u_m_b_e_r _o_f _e_x_a_c_t _m_a_t_c_h_e_s stored in 'alignData' as 'nExactMatch'

        _N_u_m_b_e_r _o_f _1-_e_r_r_o_r _m_i_s_m_a_t_c_h_e_s stored in 'alignData' as
             'nOneMismatch'

        _N_u_m_b_e_r _o_f _2-_e_r_r_o_r _m_i_s_m_a_t_c_h_e_s stored in 'alignData' as
             'nTwoMismatch'

        _G_e_n_o_m_e _f_i_l_e _o_f _m_a_t_c_h stored in 'chromosome'

        _P_o_s_i_t_i_o_n stored in 'position'

        _S_t_r_a_n_d (direction of match) stored in 'strand'

        _N _t_r_e_a_t_m_e_n_t stored in 'alignData', as 'NCharacterTreatment'.
             . indicates treatment of N was not applicable; D
             indicates treatment as deletion; | indicates treatment
             as insertion

        _S_u_b_s_t_i_t_u_t_i_o_n _e_r_r_o_r stored in 'alignData' as 'mismatchDetailOne'
             and 'mismatchDetailTwo'. Present only for unique inexact
             matches at one or two positions. Position and type of
             first substitution error, e.g., 11A represents 11 matches
             with 12th base an A in reference but not read. The
             reference manual cited below lists only one field
             ('mismatchDetailOne'), but two are present in files seen
             in the wild.
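
        The substitution descriptor described above can be decoded
        with a few lines of base R. The sketch below assumes each
        descriptor is a run of digits followed by a single base
        letter, as in the '11A' example; 'decodeMismatch' is a
        hypothetical helper and not part of ShortRead.

```r
## Hypothetical helper: decode descriptors such as "11A" into the
## number of matching bases, the 1-based position of the substitution,
## and the reference base at that position.
decodeMismatch <- function(x) {
    pattern <- "^([0-9]+)([ACGTN])$"
    nMatch <- as.integer(sub(pattern, "\\1", x))
    refBase <- sub(pattern, "\\2", x)
    data.frame(matchesBefore = nMatch,
               position = nMatch + 1L,
               referenceBase = refBase,
               stringsAsFactors = FALSE)
}

res <- decodeMismatch(c("11A", "0T"))
print(res)
```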


   '_t_y_p_e="_M_A_Q_M_a_p", _r_e_c_o_r_d_s=-_1_L' Parse binary 'map' files produced by
        MAQ. See details in the next section. The 'records' option
        determines how many records are read; '-1L' (the default)
        means that all records are input.

   '_t_y_p_e="_M_A_Q_M_a_p_S_h_o_r_t", _r_e_c_o_r_d_s=-_1_L' The same as 'type="MAQMap"' but
        for map files made with Maq prior to version 0.7.0. (These
        files use a different maximum read length [64 instead of 128],
        and are hence incompatible with newer Maq map files.)

   '_t_y_p_e="_M_A_Q_M_a_p_v_i_e_w"' Parse alignment files created by MAQ's mapiew
        command. Interpretation of columns is based on the description
        in the MAQ manual, specifically



                ...each line consists of read name, chromosome,
                position, strand, insert size from the outer
                coordinates of a pair, paired flag, mapping quality,
                single-end mapping quality, alternative mapping
                quality, number of mismatches of the best hit, sum of
                qualities of mismatched bases of the best hit, number
                of 0-mismatch hits of the first 24bp, number of
                1-mismatch hits of the first 24bp on the reference,
                length of the read, read sequence and its quality.

        The read name, read sequence, and quality are read as
        'XStringSet' objects. Chromosome and strand are read as
        'factor's. Position and mapping quality are both 'numeric'.
        These fields are mapped to their corresponding representation
        in 'AlignedRead' objects.

        Number of mismatches of the best hit, sum of qualities of
        mismatched bases of the best hit, number of 0-mismatch hits of
        the first 24bp, number of 1-mismatch hits of the first 24bp are
        represented in the 'AlignedRead' object as components of
        'alignData'.

        Remaining fields are currently ignored.
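
        To make the column layout concrete, the sketch below parses a
        single mapview-style record with base R's 'read.table'. It
        assumes the fields are tab-separated; the record is made up
        for illustration, and the column names are labels chosen for
        this sketch rather than names used by ShortRead.

```r
## Illustrative labels for the 16 fields quoted from the MAQ manual.
mapviewCols <- c("readName", "chromosome", "position", "strand",
                 "insertSize", "pairedFlag", "mapQ", "singleEndMapQ",
                 "altMapQ", "nMismatchBestHit", "mismatchQualSum",
                 "nExactHit24", "nOneMismatchHit24", "readLength",
                 "sequence", "quality")

## Hypothetical record; real files contain one such line per read.
fields <- c("read_1", "chr1", "1042", "+", "0", "0", "30", "30", "0",
            "1", "20", "1", "0", "8", "ACGTACGT", "IIII#III")
line <- paste(fields, collapse = "\t")

## quote = "" and comment.char = "" keep quality strings intact, since
## they may contain characters such as '#' or quotes.
aln <- read.table(text = line, sep = "\t", quote = "", comment.char = "",
                  col.names = mapviewCols, stringsAsFactors = FALSE,
                  colClasses = c("character", "character", "numeric",
                                 "character", rep("numeric", 10),
                                 "character", "character"))

## Chromosome and strand as factors, mirroring the description above.
aln$chromosome <- factor(aln$chromosome)
aln$strand <- factor(aln$strand, levels = c("+", "-"))
str(aln)
```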


   '_t_y_p_e="_B_o_w_t_i_e"' Parse alignment files created with the Bowtie
        alignment algorithm. Parsed columns can be retrieved from
        'AlignedRead' as follows:

        _I_d_e_n_t_i_f_i_e_r 'id'

        _S_t_r_a_n_d 'strand'

        _C_h_r_o_m_o_s_o_m_e 'chromosome'

        _P_o_s_i_t_i_o_n 'position'; see comment below

        _R_e_a_d 'sread'; see comment below

        _R_e_a_d _q_u_a_l_i_t_y 'quality'; see comments below

        _B_o_w_t_i_e _r_e_s_e_r_v_e_d ignored

        _A_l_i_g_n_m_e_n_t _m_i_s_m_a_t_c_h _l_o_c_a_t_i_o_n_s 'alignData'

        This method includes the argument 'qualityType' to specify how
        quality scores are encoded.  Bowtie quality scores are
        Solexa-like by default, with 'qualityType='SFastqQuality'',
        but can be specified as Phred-like, with
        'qualityType='FastqQuality''.

        Bowtie outputs positions that are 0-offset from the left-most
        end of the '+' strand. 'ShortRead' parses position information
        to be 1-offset from the left-most end of the '+' strand.

        Bowtie outputs reads aligned to the '-' strand as their reverse
        complement, and reverses the quality score string of these
        reads. 'ShortRead' parses these to their original sequence and
        orientation.
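
        The two adjustments described above (shifting 0-offset
        positions to 1-offset, and restoring '-' strand reads and
        qualities to their original orientation) can be sketched in
        base R. The records and helper functions below are
        hypothetical and illustrate the idea only; they are not
        ShortRead's implementation.

```r
## Hypothetical Bowtie-style records (values made up for illustration).
bowtie <- data.frame(strand = c("+", "-"),
                     position0 = c(0L, 99L),      # 0-offset positions
                     read = c("ACGTT", "AACGG"),
                     quality = c("IIIHG", "GHIII"),
                     stringsAsFactors = FALSE)

## Reverse each string in a character vector.
reverseString <- function(x) {
    vapply(strsplit(x, ""),
           function(ch) paste(rev(ch), collapse = ""),
           character(1))
}
## Reverse, then complement A<->T and C<->G.
reverseComplement <- function(x) chartr("ACGT", "TGCA", reverseString(x))

minus <- bowtie$strand == "-"

## 0-offset to 1-offset.
position1 <- bowtie$position0 + 1L

## Undo the reverse complement and the quality reversal on '-' reads.
read <- bowtie$read
read[minus] <- reverseComplement(read[minus])
quality <- bowtie$quality
quality[minus] <- reverseString(quality[minus])

print(data.frame(strand = bowtie$strand, position1, read, quality))
```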


   '_t_y_p_e="_S_O_A_P"' Parse alignment files created with the SOAP alignment
        algorithm. Parsed columns can be retrieved from 'AlignedRead'
        as follows:

        _i_d 'id'

        _s_e_q 'sread'; see comment below

        _q_u_a_l 'quality'; see comment below

        _n_u_m_b_e_r _o_f _h_i_t_s 'alignData'

        _a/_b 'alignData' ('pairedEnd')

        _l_e_n_g_t_h 'alignData' ('alignedLength')

        +/- 'strand'

        _c_h_r 'chromosome'

        _l_o_c_a_t_i_o_n 'position'; see comment below

        _t_y_p_e_s 'alignData' ('typeOfHit': integer portion; 'hitDetail':
             text portion)

        This method includes the argument 'qualityType' to specify how
        quality scores are encoded.  It is unclear from SOAP
        documentation what the quality score is; the default is
        Solexa-like, with 'qualityType='SFastqQuality'', but can be
        specified as Phred-like, with 'qualityType='FastqQuality''.

        SOAP outputs positions that are 1-offset from the left-most end
        of the '+' strand. 'ShortRead' preserves this representation.

        SOAP reads aligned to the '-' strand are reported by SOAP as
        their reverse complement, with the quality string of these
        reads reversed. 'ShortRead' parses these to their original
        sequence and orientation.


_V_a_l_u_e:

     A single R object (e.g., 'AlignedRead') containing alignments,
     sequences and qualities of all files in 'dirPath' matching
     'pattern'. There is no guarantee of order in which files are read.

_A_u_t_h_o_r(_s):

     Martin Morgan <mtmorgan@fhcrc.org>, Simon Anders
     <anders@ebi.ac.uk> (MAQ map)

_S_e_e _A_l_s_o:

     An 'AlignedRead' object.

     Genome Analyzer Pipeline Software User Guide, Revision A, January
     2008.

     The MAQ reference manual, <URL:
     http://maq.sourceforge.net/maq-manpage.shtml#5>, 3 May, 2008.

     The Bowtie reference manual, <URL:
     http://bowtie-bio.sourceforge.net>, 28 October, 2008.

     The SOAP reference manual, <URL:
     http://soap.genomics.org.cn/soap1>, 16 December, 2008.

_E_x_a_m_p_l_e_s:

     sp <- SolexaPath(system.file("extdata", package="ShortRead"))
     ap <- analysisPath(sp)
     ## ELAND_EXTENDED
     readAligned(ap, "s_2_export.txt", "SolexaExport")
     ## PhageAlign
     readAligned(ap, "s_5_.*_realign.txt", "SolexaRealign")

     ## MAQ
     dirPath <- system.file('extdata', 'maq', package='ShortRead')
     list.files(dirPath)
     ## First line
     readLines(list.files(dirPath, full.names=TRUE)[[1]], 1)
     countLines(dirPath)
     ## two files collapse into one
     readAligned(dirPath, type="MAQMapview")

     ## select only chr1-5.fa, '+' strand
     filt <- compose(chromosomeFilter("chr[1-5].fa"),
                     strandFilter("+"))
     readAligned(sp, "s_2_export.txt", filter=filt)

