PDict-class            package:Biostrings            R Documentation

_P_D_i_c_t _o_b_j_e_c_t_s

_D_e_s_c_r_i_p_t_i_o_n:

     The PDict class is a container for storing a preprocessed
     dictionary of DNA patterns that can later be passed to the
     'matchPDict' function for fast matching against a reference
     sequence (the subject).

     'PDict' is the constructor function for creating new PDict
     objects.

_U_s_a_g_e:

       PDict(x, max.mismatch=NA, tb.start=NA, tb.end=NA, tb.width=NA,
                algorithm="ACtree2", skip.invalid.patterns=FALSE)

_A_r_g_u_m_e_n_t_s:

       x: A character vector, a DNAStringSet object or an XStringViews
          object with a DNAString subject. 

max.mismatch: A single non-negative integer or 'NA'. See the "Allowing
          a small number of mismatching letters" section below. 

tb.start,tb.end,tb.width: A single integer or 'NA'. See the "Trusted
          Band" section below. 

algorithm: '"ACtree2"' (the default), '"ACtree"' or '"Twobit"'. 

skip.invalid.patterns: This argument is not supported yet (and might in
          fact be replaced by the 'filter' argument very soon). 

_D_e_t_a_i_l_s:

     THIS IS STILL WORK IN PROGRESS!

     If the original dictionary 'x' is a character vector or an
     XStringViews object with a DNAString subject, then the 'PDict'
     constructor will first try to turn it into a DNAStringSet object.

     By default (i.e. if 'PDict' is called with 'max.mismatch=NA',
     'tb.start=NA', 'tb.end=NA' and 'tb.width=NA') the following
     limitations apply: (1) the original dictionary can only contain
     base letters (i.e. only As, Cs, Gs and Ts), therefore IUPAC
     extended letters are not allowed; (2) all the patterns in the
     dictionary must have the same length ("constant width"
     dictionary); and (3) later 'matchPDict' can only be used with
     'max.mismatch=0'.
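
     Limitations (1) and (2) can be checked on a plain character
     vector before calling 'PDict'. The sketch below uses base R only;
     'check_default_dict' is a hypothetical helper written for this
     illustration and is not part of Biostrings.

```r
# Hypothetical helper (not part of Biostrings): reports whether a character
# vector satisfies the two default requirements on the dictionary itself.
check_default_dict <- function(dict) {
  # (1) only the base letters A, C, G and T are present
  only_base_letters <- all(grepl("^[ACGT]+$", dict))
  # (2) all patterns have the same length ("constant width")
  constant_width <- length(unique(nchar(dict))) == 1L
  c(only_base_letters = only_base_letters, constant_width = constant_width)
}

ok  <- check_default_dict(c("ACGT", "GGCA", "TTTT"))
bad <- check_default_dict(c("ACNG", "GT", "CGT", "AC"))
print(ok)
print(bad)
```

     A dictionary that fails either check would need a Trusted Band
     that excludes the offending letters and has a constant width.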

     A Trusted Band can be used in order to relax these limitations
     (see the "Trusted Band" section below).

     If you are planning to use the resulting 'PDict' object in order
     to do inexact matching where valid hits are allowed to have a
     small number of mismatching letters, then see the "Allowing a
     small number of mismatching letters" section below.

     Three preprocessing algorithms are currently supported:
     'algorithm="ACtree2"' (the default), 'algorithm="ACtree"' and
     'algorithm="Twobit"'. With the '"ACtree2"' and '"ACtree"'
     algorithms, all the oligonucleotides in the Trusted Band are
     stored in a 4-ary Aho-Corasick tree. With the '"Twobit"'
     algorithm, the 2-bit-per-letter signatures of all the
     oligonucleotides in the Trusted Band are computed, and the mapping
     from these signatures to the 1-based position of the corresponding
     oligonucleotide in the Trusted Band is stored in a way that allows
     very fast lookup. Only PDict objects obtained with the '"ACtree2"'
     or '"ACtree"' algorithm allow 'matchPDict' to be called with
     'fixed="pattern"' (instead of 'fixed=TRUE', the default) so that
     IUPAC extended letters in the subject are treated as ambiguities.
     PDict objects obtained with the '"Twobit"' algorithm don't allow
     this.
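
     To illustrate the idea behind 2-bit-per-letter signatures, here is
     a minimal base R sketch. It is not the Biostrings implementation:
     the letter codes (A=0, C=1, G=2, T=3), the helper name
     'twobit_signature' and the direct-address lookup table are all
     assumptions made for this example, and duplicated
     oligonucleotides are not handled.

```r
# Hypothetical helper: encode an oligonucleotide made only of A, C, G and T
# as a number, using 2 bits per letter (assumed codes: A=0, C=1, G=2, T=3).
twobit_signature <- function(oligo) {
  codes <- c(A = 0, C = 1, G = 2, T = 3)
  digits <- codes[strsplit(oligo, "", fixed = TRUE)[[1]]]
  if (anyNA(digits)) stop("only A, C, G and T can be encoded")
  # The first letter is the most significant base-4 digit.
  sum(digits * 4^(rev(seq_along(digits)) - 1))
}

band <- c("ACG", "TTA", "GGC")   # a toy Trusted Band of width 3
sigs <- vapply(band, twobit_signature, numeric(1))

# Direct-address table: slot (signature + 1) holds the 1-based position of
# the oligonucleotide in the band, or NA if no pattern has that signature.
lookup <- rep(NA_integer_, 4^3)
lookup[sigs + 1] <- seq_along(band)

hit  <- lookup[twobit_signature("TTA") + 1]   # position of "TTA" in band
miss <- lookup[twobit_signature("AAA") + 1]   # "AAA" is not in the band
print(hit)
print(miss)
```

     Because a signature is only defined for the four base letters, a
     scheme like this has no natural way to represent IUPAC extended
     letters.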

_T_r_u_s_t_e_d _B_a_n_d:

     What's a Trusted Band?

     A Trusted Band is a region defined in the original dictionary
     where the limitations described above will apply.

     Why use a Trusted Band?

     Because the limitations described above then apply to the Trusted
     Band only. For example, the Trusted Band cannot contain IUPAC
     extended letters but the "head" and the "tail" can (see below for
     what those are). Also, with a Trusted Band, if 'matchPDict' is
     called with a nonzero 'max.mismatch' value then mismatching
     letters will be allowed in the head and the tail. Or, if
     'matchPDict' is called with 'fixed="subject"', then IUPAC extended
     letters in the head and the tail will be treated as ambiguities.

     How to specify a Trusted Band?

     Use the 'tb.start', 'tb.end' and 'tb.width' arguments of the
     'PDict' constructor in order to specify a Trusted Band. This will
     divide each pattern in the original dictionary into three parts: a
     left part, a middle part and a right part. The middle part is
     defined by its starting and ending nucleotide positions, given
     relative to each pattern through the 'tb.start', 'tb.end' and
     'tb.width' arguments. It must have the same length for all
     patterns (this common length is called the width of the Trusted
     Band). The left and right parts are defined implicitly: they are
     the parts that remain before (prefix) and after (suffix) the
     middle part, respectively. Therefore three DNAStringSet objects
     result from this division: the first one is made of all the left
     parts and forms the head of the PDict object, the second one is
     made of all the middle parts and forms the Trusted Band of the
     PDict object, and the third one is made of all the right parts and
     forms the tail of the PDict object.

     In other words you can think of the process of specifying a
     Trusted Band as drawing 2 vertical lines on the original
     dictionary (note that these 2 lines are not necessarily straight
     lines but the horizontal space between them must be constant).
     When doing this, you are dividing the dictionary into three
     regions (from left to right): the head, the Trusted Band and the
     tail. Each of them is a DNAStringSet object with the same number
     of elements as the original dictionary, and the original
     dictionary could easily be reconstructed from those three regions.
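
     The division into three regions can be mimicked on a plain
     character vector. The base R sketch below is only an illustration
     of the idea: 'split_dict' is a hypothetical helper, and it
     returns character vectors rather than the DNAStringSet objects
     stored in a PDict object.

```r
# Hypothetical helper: cut every pattern into head, Trusted Band and tail,
# given the 1-based start and end positions of the band in each pattern.
split_dict <- function(dict, tb_start, tb_end) {
  list(
    head = substring(dict, 1L, tb_start - 1L),        # prefix (may be "")
    tb   = substring(dict, tb_start, tb_end),         # constant-width band
    tail = substring(dict, tb_end + 1L, nchar(dict))  # suffix (may be "")
  )
}

dict  <- c("ACNG", "GT", "CGT", "AC")
parts <- split_dict(dict, tb_start = 1L, tb_end = 2L)
print(parts$tb)
print(parts$tail)

# Pasting the three regions back together recovers the original dictionary.
rebuilt <- paste0(parts$head, parts$tb, parts$tail)
print(identical(rebuilt, dict))
```

     Here the band covers the first two letters of every pattern, so
     the head is empty and the only IUPAC extended letter (the N in
     "ACNG") ends up in the tail.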

     The width of the Trusted Band must be >= 1 because Trusted Bands
     of width 0 are not supported.

     Finally note that calling 'PDict' with 'tb.start=NA', 'tb.end=NA'
     and 'tb.width=NA' (the default) is equivalent to calling it with
     'tb.start=1', 'tb.end=-1' and 'tb.width=NA', which results in a
     full-width Trusted Band i.e. a Trusted Band that covers the entire
     dictionary (no head and no tail).
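
     The way the three arguments could be resolved into explicit
     positions is sketched below in base R. This is a simplified
     illustration, not the code used by 'PDict': 'resolve_band' is a
     hypothetical helper, and it assumes that negative positions count
     back from the end of each pattern (so that -1 is the last letter),
     that a missing start defaults to 1 and that a missing end defaults
     to -1 when no width is given. It does not check that the three
     arguments are consistent with each other.

```r
# Hypothetical helper: turn (tb.start, tb.end, tb.width) into explicit
# 1-based start/end positions for patterns of the given widths.
resolve_band <- function(widths, tb.start = NA, tb.end = NA, tb.width = NA) {
  # Assumption: a negative position is relative to the end of each pattern.
  to_abs <- function(pos) {
    if (pos < 0) widths + pos + 1L else rep(pos, length(widths))
  }
  if (is.na(tb.width)) {
    # No width given: a missing start means 1, a missing end means -1.
    start <- to_abs(if (is.na(tb.start)) 1L else tb.start)
    end   <- to_abs(if (is.na(tb.end)) -1L else tb.end)
  } else if (is.na(tb.end)) {
    # Width given without an end: derive the end from the start.
    start <- to_abs(if (is.na(tb.start)) 1L else tb.start)
    end   <- start + tb.width - 1L
  } else {
    # Width and end given: derive the start from the end.
    end   <- to_abs(tb.end)
    start <- end - tb.width + 1L
  }
  data.frame(start = start, end = end)
}

full    <- resolve_band(c(25L, 25L, 25L))               # all NA: full width
partial <- resolve_band(c(4L, 2L, 3L, 2L), tb.end = 2)  # no head, maybe a tail
print(full)
print(partial)
```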

_A_l_l_o_w_i_n_g _a _s_m_a_l_l _n_u_m_b_e_r _o_f _m_i_s_m_a_t_c_h_i_n_g _l_e_t_t_e_r_s:

     [TODO]

_A_c_c_e_s_s_o_r _m_e_t_h_o_d_s:

     In the code snippets below, 'x' is a PDict object.


      'length(x)': The number of patterns in 'x'.

      'width(x)': A vector of non-negative integers containing the
          number of letters for each pattern in 'x'.

      'names(x)': The names of the patterns in 'x'.

      'head(x)': The head of 'x' or 'NULL' if 'x' has no head.

      'tb(x)': The Trusted Band defined on 'x'.

      'tb.width(x)': The width of the Trusted Band defined on 'x'. Note
          that, unlike 'width(tb(x))', this is a single integer. And
          because the Trusted Band has a constant width, 'tb.width(x)'
          is in fact equivalent to 'unique(width(tb(x)))', or to
          'width(tb(x))[1]'.

      'tail(x)': The tail of 'x' or 'NULL' if 'x' has no tail.


_S_u_b_s_e_t_t_i_n_g _m_e_t_h_o_d_s:

     In the code snippets below, 'x' is a PDict object.


      'x[[i]]': Extract the i-th pattern from 'x' as a DNAString
          object.


_O_t_h_e_r _m_e_t_h_o_d_s:

     In the code snippets below, 'x' is a PDict object.


      'duplicated(x)': [TODO]

      'patternFrequency(x)': [TODO]


_A_u_t_h_o_r(_s):

     H. Pages

_R_e_f_e_r_e_n_c_e_s:

     Aho, Alfred V.; Margaret J. Corasick (June 1975). "Efficient
     string matching: An aid to bibliographic search". Communications
     of the ACM 18 (6): 333-340.

_S_e_e _A_l_s_o:

     'matchPDict', 'DNA_ALPHABET', DNAStringSet-class,
     XStringViews-class

_E_x_a_m_p_l_e_s:

       ## ---------------------------------------------------------------------
       ## A. NO HEAD AND NO TAIL (THE DEFAULT)
       ## ---------------------------------------------------------------------
       library(Biostrings)                  # Provides PDict, DNAStringSet.
       library(drosophila2probe)
       dict0 <- DNAStringSet(drosophila2probe$sequence)
       dict0                                # The original dictionary.
       length(dict0)                        # Hundreds of thousands of patterns.
       unique(nchar(dict0))                 # Patterns are 25-mers.

       pdict0 <- PDict(dict0)               # Store the original dictionary in
                                            # a PDict object (preprocessing).
       pdict0
       class(pdict0)
       length(pdict0)                       # Same as length(dict0).
       tb.width(pdict0)                     # The width of the (implicit)
                                            # Trusted Band.
       sum(duplicated(pdict0))
       table(patternFrequency(pdict0))      # 9 patterns are repeated 3 times.
       pdict0[[1]]
       pdict0[[5]]

       ## ---------------------------------------------------------------------
       ## B. NO HEAD AND A TAIL
       ## ---------------------------------------------------------------------
       dict1 <- c("ACNG", "GT", "CGT", "AC")
       pdict1 <- PDict(dict1, tb.end=2)
       pdict1
       class(pdict1)
       length(pdict1)
       width(pdict1)
       head(pdict1)
       tb(pdict1)
       tb.width(pdict1)
       width(tb(pdict1))
       tail(pdict1)
       pdict1[[3]]

