matchPDict-inexact        package:Biostrings        R Documentation

_I_n_e_x_a_c_t _m_a_t_c_h_i_n_g _w_i_t_h _m_a_t_c_h_P_D_i_c_t()

_D_e_s_c_r_i_p_t_i_o_n:

     The 'matchPDict' function can be used for inexact matching.

     For the moment, only one kind of inexact matching is provided
     where the user must specify the length of the Trusted Prefix for
     each pattern stored in the dictionary and the maximum number of
     mismatching letters allowed in each pattern suffix (or "tail").
     Please refer to the 'PDict' documentation for how to create a
     Trusted Prefix DNA dictionary (preprocessing).

     The inexact matches are searched by using a 2-stage approach: (1)
     use the Aho-Corasick algorithm for fast lookup of all the prefixes
     in the subject, followed by (2) comparison of each tail with the
     short substrings in the subject located next to the positions
     found in (1). Because performance will mostly depend on the number
     of positions found in (1) it's important to minimize this number
     by keeping the length of the Trusted Prefix as large as possible.
     A simple empiric rule is: if TP is the length of the Trusted
     Prefix and L the length of the subject, then L / (4**TP) should be
     as small as possible, typically < 10 or 20.

     Note that, when 'matchPDict' is used for inexact matching,
     variable width dictionaries (i.e. containing patterns of variable
     length) are accepted. In addition, when 'fixed=FALSE', then IUPAC
     extended letters placed in the tail of the dictionary are
     interpreted as ambiguities.

     However, like in exact matching mode, it only works with a
     dictionary of DNA patterns.

_A_u_t_h_o_r(_s):

     H. Pages

_S_e_e _A_l_s_o:

     PDict, matchPDict-exact

_E_x_a_m_p_l_e_s:

       ## WARNING! THIS IS STILL WORK IN PROGRESS!!!
       ## A known issue is that the "startIndex", "[[" and "unlist" methods are
       ## BROKEN on the MIndex object returned by matchPDict() called on a
       ## TBdna_PDict object!

       ## ---------------------------------------------------------------------
       ## A. WITH UNNAMED PATTERNS
       ## ---------------------------------------------------------------------

       library(drosophila2probe)
       dict0 <- drosophila2probe$sequence  # The input dictionary.

       ## Specify the length of the Trusted Prefix via the 'tb.end' argument
       pdict9 <- PDict(dict0, tb.end=9)
       pdict9
       tail(pdict9)
       #sum(duplicated(pdict9)) # not ready yet!
       #table(patternFrequency(pdict9)) # not ready yet!

       library(BSgenome.Dmelanogaster.UCSC.dm3)
       chr3R <- Dmelanogaster$chr3R
       chr3R
       table(countPDict(pdict9, chr3R, max.mismatch=1))
       table(countPDict(pdict9, chr3R, max.mismatch=3))
       table(countPDict(pdict9, chr3R, max.mismatch=5))

       ## ---------------------------------------------------------------------
       ## B. COMPARISON WITH EXACT MATCHING
       ## ---------------------------------------------------------------------

       ## When the input dictionary is of constant width, using 'max.mismatch=0'
       ## on the Trusted Prefix dictionary is equivalent to (but much slower than)
       ## using the full dictionary.
       pdict0 <- PDict(dict0)
       count0 <- countPDict(pdict0, chr3R)
       count_tp9_mm0 <- countPDict(pdict9, chr3R, max.mismatch=0)
       identical(count_tp9_mm0, count0)    # TRUE
       
       ## ---------------------------------------------------------------------
       ## C. VARIABLE WIDTH DICTIONARY
       ## ---------------------------------------------------------------------

       pdict <- PDict(c("ACNG", "GT", "CGT", "AC"), tb.end=2)
       pdict
       tail(pdict) 
       endIndex(matchPDict(pdict, DNAString("ACGGACCG"), max.mismatch=0))
       endIndex(matchPDict(pdict, DNAString("ACGGACCG"), max.mismatch=1))
       mindex <- matchPDict(pdict, DNAString("ACGGACCG"), max.mismatch=2)
       endIndex(mindex)

       startIndex(mindex)                  # CURRENTLY BROKEN!
       mindex[[1]]                         # CURRENTLY BROKEN!
       unlist(mindex)                      # CURRENTLY BROKEN!

       ## -------------------------------------------------------------------------
       ## D. WITH AMBIGUITIES IN THE TAIL
       ## -------------------------------------------------------------------------

       ## Ambiguities (IUPAC extended letters) are supported in the tail only:
       endIndex(matchPDict(pdict, DNAString("ACGGACCG"), fixed=FALSE))

