PDict-class            package:Biostrings            R Documentation

_P_D_i_c_t _o_b_j_e_c_t_s

_D_e_s_c_r_i_p_t_i_o_n:

     The PDict class is a container for storing a preprocessed set of
     patterns (aka the dictionary) that can be used with the
     'matchPDict' function to efficiently find the occurrences of all
     the input patterns (i.e. the patterns stored in the PDict object)
     in a text.

     Converting a set of input sequences into a PDict object is called
     preprocessing. This operation is done by using the 'PDict'
     constructor.

_U_s_a_g_e:

       PDict(dict, tb.start=1, tb.end=NA, drop.head=FALSE, drop.tail=FALSE, skip.invalid.patterns=FALSE)

_A_r_g_u_m_e_n_t_s:

    dict: A character vector, a DNAStringSet object or an XStringViews
          object containing the input sequences. 

tb.start: [DOCUMENT ME] 

  tb.end: [DOCUMENT ME] 

drop.head: [DOCUMENT ME] 

drop.tail: [DOCUMENT ME] 

skip.invalid.patterns: [DOCUMENT ME] 

_D_e_t_a_i_l_s:

     This is a work in progress and only 2 types of dictionaries are
     supported at the moment: constant width DNA dictionaries and
     Trusted Band DNA dictionaries.

     A constant width DNA dictionary is a dictionary where all the
     patterns are DNA sequences of the same length (i.e. all the
     patterns have the same number of nucleotides). For now the
     patterns can only contain As, Cs, Gs and Ts (no IUPAC extended
     letters). The container for this particular type of dictionary is
     the CWdna_PDict class (a subclass of the PDict class).

     A Trusted Band DNA dictionary is a dictionary where the patterns
     are DNA sequences with a trusted region i.e. a region that will
     have to match exactly when the dictionary is used with the
     'matchPDict' function. This trusted region must have the same
     length for all patterns. It can be a prefix (Trusted Prefix), a
     suffix (Trusted Suffix) or more generally any substring (Trusted
     Band) of the input patterns. The container for this particular
     type of dictionary is the TBdna_PDict class (a subclass of the
     PDict class). The dictionary stored in a TBdna_PDict object is
     split into 3 parts: the Trusted Band, the head and the tail of
     the dictionary. Each of them contains the same number of sequences
     (possibly empty), which is also the number of input patterns. The
     Trusted Band is the set of sequences obtained by extracting the
     trusted region from each input pattern. The head is the set of
     sequences obtained by extracting the region located before (i.e.
     to the left of) the trusted region from each input pattern. The
     tail is the set of sequences obtained by extracting the region
     located after (i.e. to the right of) the trusted region from each
     input pattern.

     As with a constant width DNA dictionary, the Trusted Band can
     only contain As, Cs, Gs and Ts (no IUPAC extended letters) for
     now. However, the head and the tail of a Trusted Band DNA
     dictionary can contain any valid DNA letter including IUPAC
     extended letters (see 'DNA_ALPHABET' for the set of valid DNA
     letters).

     Note that a Trusted Band DNA dictionary with no head and no tail
     (i.e. with a head and a tail where all the sequences are empty) is
     in fact a constant width DNA dictionary. A Trusted Band DNA
     dictionary with no head is called a Trusted Prefix DNA dictionary.
     A Trusted Band DNA dictionary with no tail is called a Trusted
     Suffix DNA dictionary. Only Trusted Prefix and Trusted Suffix DNA
     dictionaries are currently supported.

_A_c_c_e_s_o_r _m_e_t_h_o_d_s:

     In the code snippets below, 'x' is a PDict object.


      'length(x)': The number of patterns in the PDict object.

      'width(x)': The number of nucleotides per pattern for a constant
          width DNA dictionary. The width of the Trusted Band (i.e. the
          number of nucleotides in the trusted region of each pattern)
          for a Trusted Band DNA dictionary.

      'names(x)': The names of the patterns in the PDict object.

      'head(x)': The head of the PDict object (a 'DNAStringSet' object
          of length 'length(x)' or 'NULL').

      'tail(x)': The tail of the PDict object (a 'DNAStringSet' object
          of length 'length(x)' or 'NULL').


_O_t_h_e_r _f_u_n_c_t_i_o_n_s _a_n_d _g_e_n_e_r_i_c_s:

     In the code snippets below, 'x' is a PDict object.


      'duplicated(x)': [DOCUMENT ME]

      'patternFrequency(x)': [DOCUMENT ME]


_A_u_t_h_o_r(_s):

     H. Pages

_S_e_e _A_l_s_o:

     'matchPDict', 'DNA_ALPHABET', DNAStringSet-class,
     XStringViews-class

_E_x_a_m_p_l_e_s:

       ## Preprocessing a constant width DNA dictionary
       library(drosophila2probe)
       dict0 <- drosophila2probe$sequence   # The input sequences.
       length(dict0)                        # Hundreds of thousands of patterns.
       unique(nchar(dict0))                 # Patterns are 25-mers.
       dict0[1:5]
       pdict0 <- PDict(dict0)               # Store the input dictionary into a
                                            # PDict object (preprocessing).
       pdict0
       class(pdict0)
       length(pdict0)                       # Same as length(dict0).
       width(pdict0)                        # The number of chars per pattern.
       sum(duplicated(pdict0))
       table(patternFrequency(pdict0))      # 9 patterns are repeated 3 times.

       ## Creating a constant width DNA dictionary by truncating the input
       ## sequences
       dict1 <- c("ACNG", "GT", "CGT", "AC")
       pdict1a <- PDict(dict1, tb.end=2, drop.tail=TRUE)
       pdict1a
       class(pdict1a)
       length(pdict1a)
       width(pdict1a)
       duplicated(pdict1a)
       patternFrequency(pdict1a)
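
       ## For comparison only: a minimal base R sketch of the truncation
       ## above. It uses plain character vectors, not the PDict internals,
       ## and 'trunc1' and 'freq1' are hypothetical names introduced here.
       ## It assumes that tb.end=2 with drop.tail=TRUE keeps the first 2
       ## letters of each pattern; whether 'duplicated' and
       ## 'patternFrequency' report exactly these values is not stated on
       ## this page.
       dict1 <- c("ACNG", "GT", "CGT", "AC")     # Same input as above.
       trunc1 <- substr(dict1, 1, 2)             # "AC" "GT" "CG" "AC"
       unique(nchar(trunc1))                     # All truncated patterns have width 2.
       duplicated(trunc1)                        # Only the 4th string repeats an earlier one.
       freq1 <- as.vector(table(trunc1)[trunc1]) # Occurrences of each truncated string.
       freq1                                     # 2 1 1 2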

       ## Preprocessing a Trusted Prefix DNA dictionary
       pdict1b <- PDict(dict1, tb.end=2)
       pdict1b
       class(pdict1b)
       length(pdict1b)
       width(pdict1b)
       head(pdict1b)
       tail(pdict1b)
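
       ## For illustration only: a minimal base R sketch of the head /
       ## Trusted Band / tail split described in Details, done on plain
       ## character vectors rather than with the PDict internals. It
       ## assumes that tb.start and tb.end are the 1-based, inclusive
       ## bounds of the trusted region; 'tb_start', 'tb_end', 'head_part',
       ## 'band' and 'tail_part' are hypothetical names introduced here.
       dict2 <- c("ACNG", "GT", "CGT", "AC")     # Same sequences as dict1.
       tb_start <- 1L
       tb_end <- 2L
       head_part <- substr(dict2, 1L, tb_start - 1L)           # Empty: no head.
       band <- substr(dict2, tb_start, tb_end)                 # "AC" "GT" "CG" "AC"
       tail_part <- substr(dict2, tb_end + 1L, nchar(dict2))   # "NG" "" "T" ""
       ## The 3 parts have one (possibly empty) sequence per input pattern
       ## and paste back together into the input patterns.
       all(paste0(head_part, band, tail_part) == dict2)
       ## The trusted region has the same length for all patterns and
       ## contains only As, Cs, Gs and Ts; the tail may hold other letters
       ## (here the N in "NG").
       unique(nchar(band))
       all(grepl("^[ACGT]*$", band))
       grepl("^[ACGT]*$", tail_part)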

