matchPDict            package:Biostrings            R Documentation

_S_e_a_r_c_h_i_n_g _a _s_e_q_u_e_n_c_e _f_o_r _p_a_t_t_e_r_n_s _s_t_o_r_e_d _i_n _a _p_r_e_p_r_o_c_e_s_s_e_d _d_i_c_t_i_o_n_a_r_y

_D_e_s_c_r_i_p_t_i_o_n:

     A set of functions for finding all the occurences (aka "matches"
     or "hits") of a set of patterns (aka the dictionary) in a
     reference sequence or set of reference sequences (aka the subject)

     The following functions differ in what they return: 'matchPDict'
     returns the "where" information i.e. the positions in the subject
     of all the occurrences of every pattern; 'countPDict' returns the
     "how many times" information i.e. the number of occurrences for
     each pattern; and 'whichPDict' returns the "who" information i.e.
     which patterns in the preprocessed dictionary have at least one
     match. 'vcountPDict' is similar to 'countPDict' but it works on a
     set of reference sequences in a vectorized fashion.

     This man page shows how to use these functions for exact matching
     of a constant width dictionary i.e. a dictionary where all the
     patterns have the same length (same number of nucleotides).

     See '?`matchPDict-inexact`' for how to use these functions for
     inexact matching or when the original dictionary has a variable
     width.

_U_s_a_g_e:

       matchPDict(pdict, subject, algorithm="auto",
                  max.mismatch=0, fixed=TRUE, verbose=FALSE)
       countPDict(pdict, subject, algorithm="auto",
                  max.mismatch=0, fixed=TRUE, verbose=FALSE)
       whichPDict(pdict, subject, algorithm="auto",
                  max.mismatch=0, fixed=TRUE, verbose=FALSE)

       vcountPDict(pdict, subject, algorithm="auto",
                   max.mismatch=0, fixed=TRUE,
                   collapse=FALSE, weight=1L, verbose=FALSE)

_A_r_g_u_m_e_n_t_s:

   pdict: A PDict object containing the preprocessed dictionary. 

 subject: An XString or MaskedXString object containing the subject
          sequence for 'matchPDict', 'countPDict' and 'whichPDict'.

          An XStringSet object containing the subject sequences for
          'vcountPDict'.

          For now, only subjects of base class DNAString are supported. 

algorithm: Not supported yet. 

max.mismatch: The maximum number of mismatching letters allowed (see
          '?isMatching' for the details). This man page focuses on
          exact matching of a constant width dictionary so
          'max.mismatch=0' in the examples below. See
          '?`matchPDict-inexact`' for inexact matching. 

   fixed: If 'FALSE' then IUPAC extended letters are interpreted as
          ambiguities (see '?isMatching' for the details). This man
          page focuses on exact matching of a constant width dictionary
          so 'fixed=TRUE' in the examples below. See
          '?`matchPDict-inexact`' for inexact matching. 

 verbose: 'TRUE' or 'FALSE'. 

collapse,weight: 'collapse' must be 'FALSE', '1', or '2'.

          If 'collapse=FALSE' (the default), then 'weight' is ignored
          and 'vcountPDict' returns the full matrix of counts ('M0').
          If 'collapse=1', then 'M0' is collapsed "horizontally" i.e.
          it is turned into a vector with 'length' equal to
          'length(pdict)'. If 'weight=1L' (the default), then this
          vector is defined by 'rowSums(M0)'. If 'collapse=2', then
          'M0' is collapsed "vertically" i.e. it is turned into a
          vector with 'length' equal to 'length(subject)'. If
          'weight=1L' (the default), then this vector is defined by
          'colSums(M0)'.

          If 'collapse=1' or 'collapse=2', then the elements in
          'subject' ('collapse=1') or in 'pdict' ('collapse=2') can be
          weighted thru the 'weight' argument. In that case, the
          returned vector is defined by 'M0 %*% rep(weight,
          length.out=length(subject))' and 'rep(weight,
          length.out=length(pdict)) %*% M0', respectively. 

_D_e_t_a_i_l_s:

     In this man page, we assume that you know how to preprocess a
     dictionary of DNA patterns that can then be used with
     'matchPDict', 'countPDict', 'whichPDict' or 'vcountPDict'. Please
     see '?PDict' if you don't.

     When using 'matchPDict', 'countPDict', 'whichPDict' or
     'vcountPDict' for exact matching of a constant width dictionary,
     the standard way to preprocess the original dictionary is by
     calling the 'PDict' constructor on it with no extra arguments.
     This returns the preprocessed dictionary in a PDict object that
     can be used with any of the functions described here.

_V_a_l_u_e:

     If 'M' denotes the number of patterns in the 'pdict' argument ('M
     <- length(pdict)'), then 'matchPDict' returns an MIndex object of
     length 'M', and 'countPDict' an integer vector of length 'M'.

     'whichPDict' returns an integer vector made of the indices of the
     patterns in the 'pdict' argument that have at least one match.

     If 'N' denotes the number of sequences in the 'subject' argument
     ('N <- length(subject)'), then 'vcountPDict' returns an integer
     matrix with 'M' rows and 'N' columns, unless the 'collapse'
     argument is used. In that case, depending on the type of 'weight',
     an integer or numeric vector is returned (see above for the
     details).

_A_u_t_h_o_r(_s):

     H. Pages

_R_e_f_e_r_e_n_c_e_s:

     Aho, Alfred V.; Margaret J. Corasick (June 1975). "Efficient
     string matching: An aid to bibliographic search". Communications
     of the ACM 18 (6): 333-340.

_S_e_e _A_l_s_o:

     PDict-class, MIndex-class, matchPDict-inexact, 'isMatching',
     'coverage,MIndex-method', 'matchPattern', 'alphabetFrequency',
     DNAString-class, DNAStringSet-class, XStringViews-class,
     MaskedDNAString-class

_E_x_a_m_p_l_e_s:

       ## ---------------------------------------------------------------------
       ## A. A SIMPLE EXAMPLE OF EXACT MATCHING
       ## ---------------------------------------------------------------------

       ## Creating the pattern dictionary:
       library(drosophila2probe)
       dict0 <- DNAStringSet(drosophila2probe$sequence)
       dict0                                # The original dictionary.
       length(dict0)                        # Hundreds of thousands of patterns.
       pdict0 <- PDict(dict0)               # Store the original dictionary in
                                            # a PDict object (preprocessing).

       ## Using the pattern dictionary on chromosome 3R:
       library(BSgenome.Dmelanogaster.UCSC.dm3)
       chr3R <- Dmelanogaster$chr3R         # Load chromosome 3R
       chr3R
       mi0 <- matchPDict(pdict0, chr3R)     # Search...

       ## Looking at the matches:
       start_index <- startIndex(mi0)       # Get the start index.
       length(start_index)                  # Same as the original dictionary.
       start_index[[8220]]                  # Starts of the 8220th pattern.
       end_index <- endIndex(mi0)           # Get the end index.
       end_index[[8220]]                    # Ends of the 8220th pattern.
       count_index <- countIndex(mi0)       # Get the number of matches per pattern.
       count_index[[8220]]
       mi0[[8220]]                          # Get the matches for the 8220th pattern.
       start(mi0[[8220]])                   # Equivalent to startIndex(mi0)[[8220]].
       sum(count_index)                     # Total number of matches.
       table(count_index)
       i0 <- which(count_index == max(count_index))
       pdict0[[i0]]                         # The pattern with most occurrences.
       mi0[[i0]]                            # Its matches as an IRanges object.
       Views(chr3R, mi0[[i0]])              # And as an XStringViews object.

       ## Get the coverage of the original subject:
       cov3R <- as.integer(coverage(mi0, width=length(chr3R)))
       max(cov3R)
       mean(cov3R)
       sum(cov3R != 0) / length(cov3R)      # Only 2.44% of chr3R is covered.
       if (interactive()) {
         plotCoverage <- function(cx, start, end)
         {
           plot.new()
           plot.window(c(start, end), c(0, 20))
           axis(1)
           axis(2)
           axis(4)
           lines(start:end, cx[start:end], type="l")
         }
         plotCoverage(cov3R, 27600000, 27900000)
       }

       ## ---------------------------------------------------------------------
       ## B. NAMING THE PATTERNS
       ## ---------------------------------------------------------------------

       ## The names of the original patterns, if any, are propagated to the
       ## PDict and MIndex objects:
       names(dict0) <- mkAllStrings(letters, 4)[seq_len(length(dict0))]
       dict0
       dict0[["abcd"]]
       pdict0n <- PDict(dict0)
       names(pdict0n)[1:30]
       pdict0n[["abcd"]]
       mi0n <- matchPDict(pdict0n, chr3R)
       names(mi0n)[1:30]
       mi0n[["abcd"]]

       ## This is particularly useful when unlisting an MIndex object:
       unlist(mi0)[1:10]
       unlist(mi0n)[1:10]  # keep track of where the matches are coming from

       ## ---------------------------------------------------------------------
       ## C. PERFORMANCE
       ## ---------------------------------------------------------------------

       ## If getting the number of matches is what matters only (without
       ## regarding their positions), then countPDict() will be faster,
       ## especially when there is a high number of matches:

       count_index0 <- countPDict(pdict0, chr3R)
       identical(count_index0, count_index)  # TRUE

       if (interactive()) {
         ## What's the impact of the dictionary width on performance?
         ## Below is some code that can be used to figure out (will take a long
         ## time to run). For different widths of the original dictionary, we
         ## look at:
         ##   o pptime: preprocessing time (in sec.) i.e. time needed for
         ##             building the PDict object from the truncated input
         ##             sequences;
         ##   o nnodes: nb of nodes in the resulting Aho-Corasick tree;
         ##   o nupatt: nb of unique truncated input sequences;
         ##   o matchtime: time (in sec.) needed to find all the matches;
         ##   o totalcount: total number of matches.
         getPDictStats <- function(dict, subject)
         {
           ans_width <- width(dict[1])
           ans_pptime <- system.time(pdict <- PDict(dict))[["elapsed"]]
           pptb <- pdict@threeparts@pptb
           ans_nnodes <- length(pptb@nodes) %/%
                         Biostrings:::.ACtree.ints_per_acnode(pptb)
           ans_nupatt <- sum(!duplicated(pdict))
           ans_matchtime <- system.time(
                              mi0 <- matchPDict(pdict, subject)
                            )[["elapsed"]]
           ans_totalcount <- sum(countIndex(mi0))
           list(
             width=ans_width,
             pptime=ans_pptime,
             nnodes=ans_nnodes,
             nupatt=ans_nupatt,
             matchtime=ans_matchtime,
             totalcount=ans_totalcount
           )
         }
         stats <- lapply(6:25,
                      function(width)
                          getPDictStats(DNAStringSet(dict0, end=width), chr3R))
         stats <- data.frame(do.call(rbind, stats))
         stats
       }

       ## ---------------------------------------------------------------------
       ## D. vcountPDict()
       ## ---------------------------------------------------------------------
       subject <- Dmelanogaster$upstream1000[1:200]
       subject
       mat1 <- vcountPDict(pdict0, subject)
       dim(mat1)  # length(pdict0) x length(subject)
       nhit_per_probe <- rowSums(mat1)
       table(nhit_per_probe)

       ## Without vcountPDict(), 'mat1' could have been computed with:
       mat2 <- sapply(unname(subject), function(x) countPDict(pdict0, x))
       identical(mat1, mat2)  # TRUE
       ## but using vcountPDict() is faster (10x or more, depending of the
       ## average length of the sequences in 'subject').

       if (interactive()) {
         ## This will fail (with message "allocMatrix: too many elements
         ## specified") because, on most platforms, vectors and matrices in R
         ## are limited to 2^31 elements:
         subject <- Dmelanogaster$upstream1000
         vcountPDict(pdict0, subject)
         length(pdict0) * length(Dmelanogaster$upstream1000)
         1 * length(pdict0) * length(Dmelanogaster$upstream1000)  # > 2^31
         ## But this will work:
         nhit_per_seq <- vcountPDict(pdict0, subject, collapse=2)
         sum(nhit_per_seq >= 1)  # nb of subject sequences with at least 1 hit
         table(nhit_per_seq)
         which(nhit_per_seq == 37)  # 603
         sum(countPDict(pdict0, subject[[603]]))  # 37
       }

       ## ---------------------------------------------------------------------
       ## E. RELATIONSHIP BETWEEN vcountPDict(), countPDict() AND
       ## vcountPattern()
       ## ---------------------------------------------------------------------
       dict3 <- DNAStringSet(mkAllStrings(DNA_BASES, 3))  # all trinucleotides
       dict3
       pdict3 <- PDict(dict3)
       subject <- Dmelanogaster$upstream1000
       subject

       ## The 3 following calls are equivalent (from faster to slower):
       mat3a <- vcountPDict(pdict3, subject)
       mat3b <- sapply(dict3, function(pattern) vcountPattern(pattern, subject))
       mat3c <- sapply(unname(subject), function(x) countPDict(pdict3, x))
       stopifnot(identical(mat3a, t(mat3b)))
       stopifnot(identical(mat3a, mat3c))

       ## The 2 following calls are equivalent (from faster to slower):
       nhitpp3a <- vcountPDict(pdict3, subject, collapse=1)  # rowSums(mat3a)
       nhitpp3b <- sapply(dict3, function(pattern) sum(vcountPattern(pattern, subject)))
       stopifnot(identical(nhitpp3a, nhitpp3b))

       ## The 2 following calls are equivalent (from faster to slower):
       nhitps3a <- vcountPDict(pdict3, subject, collapse=2)  # colSums(mat3a)
       nhitps3b <- sapply(unname(subject), function(x) sum(countPDict(pdict3, x)))
       stopifnot(identical(nhitps3a, nhitps3b))

