matchPattern           package:Biostrings           R Documentation

_S_t_r_i_n_g _s_e_a_r_c_h_i_n_g _f_u_n_c_t_i_o_n_s

_D_e_s_c_r_i_p_t_i_o_n:

     A set of functions for finding all the occurences (aka "matches"
     or "hits") of a given pattern (typically short) in a (typically
     long) reference sequence or set of reference sequences (aka the
     subject)

_U_s_a_g_e:

       matchPattern(pattern, subject, algorithm="auto",
                    max.mismatch=0, with.indels=FALSE, fixed=TRUE)
       countPattern(pattern, subject, algorithm="auto",
                    max.mismatch=0, with.indels=FALSE, fixed=TRUE)
       vmatchPattern(pattern, subject, algorithm="auto",
                     max.mismatch=0, with.indels=FALSE, fixed=TRUE)
       vcountPattern(pattern, subject, algorithm="auto",
                     max.mismatch=0, with.indels=FALSE, fixed=TRUE)

_A_r_g_u_m_e_n_t_s:

 pattern: The pattern string. 

 subject: An XString, XStringViews or MaskedXString object for
          'matchPattern' and 'countPattern'.

          An XStringSet or XStringViews object for 'vmatchPattern' and
          'vcountPattern'. 

algorithm: One of the following: '"auto"', '"naive-exact"',
          '"naive-inexact"', '"boyer-moore"', '"shift-or"' or
          '"indels"'. 

max.mismatch: The maximum number of mismatching letters allowed (see
          'isMatchingAt' for the details). If non-zero, an inexact
          matching algorithm is used. 

with.indels: If 'TRUE' then indels are allowed. In that case
          'max.mismatch' is interpreted as the maximum "edit distance"
          allowed between the pattern and a match. Note that in order
          to avoid pollution by redundant matches, only the "best local
          matches" are returned. Roughly speaking, a "best local match"
          is a match that is locally both the closest (to the pattern
          P) and the shortest. More precisely, a substring S' of the
          subject S is a "best local match" iff:


                 (a) nedit(P, S') <= max.mismatch
                 (b) for every substring S1 of S':
                         nedit(P, S1) > nedit(P, S')
                 (c) for every substring S2 of S that contains S':
                         nedit(P, S2) <= nedit(P, S')

          One nice property of "best local matches" is that their first
          and last letters are guaranteed to be aligned with letters in
          P (i.e. they match letters in P). 

   fixed: If 'FALSE' then IUPAC extended letters are interpreted as
          ambiguities (see 'isMatchingAt' for the details). 

_D_e_t_a_i_l_s:

     Available algorithms are: ``naive exact'', ``naive inexact'',
     ``Boyer-Moore-like'', ``shift-or'' and ``indels''. Not all of them
     can be used in all situations: restrictions depend on the length
     of the pattern, the class of the subject, and the values of
     'max.mismatch', 'with.indels' and 'fixed'. All those parameters
     form the search criteria.

     Note that the choice of an algorithm is not part of the search
     criteria. This is because algorithms are interchangeable, that is,
     if 2 different algorithms are compatible with a given search
     criteria, then choosing one over the other will not affect the
     result (but will most likely affect the performance). So there is
     no "wrong choice" of algorithm (strictly speaking).

     Using 'algorithm="auto"' is recommended because then the fastest
     algorithm will automatically be picked up among the set of
     compatible algorithms (if there is more than one).

_V_a_l_u_e:

     An XStringViews object for 'matchPattern'.

     A single integer for 'countPattern'.

     An MIndex object for 'vmatchPattern'.

     An integer vector for 'vcountPattern', with each element in the
     vector corresponding to the number of matches in the corresponding
     element of 'subject'.

_N_o_t_e:

     Use 'matchPDict' if you need to match a (big) set of patterns
     against a reference sequence.

     Use 'pairwiseAlignment' if you need to solve a (Needleman-Wunsch)
     global alignment, a (Smith-Waterman) local alignment, or an
     (ends-free) overlap alignment problem.

_S_e_e _A_l_s_o:

     'matchPDict', 'pairwiseAlignment', 'isMatchingAt', 'mismatch',
     'matchLRPatterns', 'matchProbePair', 'maskMotif',
     'alphabetFrequency', XStringViews-class, MIndex-class

_E_x_a_m_p_l_e_s:

       ## ---------------------------------------------------------------------
       ## A. matchPattern()/countPattern()
       ## ---------------------------------------------------------------------

       ## A simple inexact matching example with a short subject:
       x <- DNAString("AAGCGCGATATG")
       m1 <- matchPattern("GCNNNAT", x)
       m1
       m2 <- matchPattern("GCNNNAT", x, fixed=FALSE)
       m2
       as.matrix(m2)

       ## With DNA sequence of yeast chromosome number 1:
       data(yeastSEQCHR1)
       yeast1 <- DNAString(yeastSEQCHR1)
       PpiI <- "GAACNNNNNCTC" # a restriction enzyme pattern
       match1.PpiI <- matchPattern(PpiI, yeast1, fixed=FALSE)
       match2.PpiI <- matchPattern(PpiI, yeast1, max.mismatch=1, fixed=FALSE)

       ## With a genome containing isolated Ns:
       library(BSgenome.Celegans.UCSC.ce2)
       chrII <- Celegans[["chrII"]]
       alphabetFrequency(chrII)
       matchPattern("N", chrII)
       matchPattern("TGGGTGTCTTT", chrII) # no match
       matchPattern("TGGGTGTCTTT", chrII, fixed=FALSE) # 1 match

       ## Using wildcards ("N") in the pattern on a genome containing N-blocks:
       library(BSgenome.Dmelanogaster.UCSC.dm3)
       chrX <- maskMotif(Dmelanogaster$chrX, "N")
       as(chrX, "XStringViews") # 4 non masked regions
       matchPattern("TTTATGNTTGGTA", chrX, fixed=FALSE)
       ## Can also be achieved with no mask:
       masks(chrX) <- NULL
       matchPattern("TTTATGNTTGGTA", chrX, fixed="subject")

       ## ---------------------------------------------------------------------
       ## B. vmatchPattern()/vcountPattern()
       ## ---------------------------------------------------------------------

       Ebox <- DNAString("CANNTG")
       subject <- Celegans$upstream5000
       mindex <- vmatchPattern(Ebox, subject, fixed=FALSE)
       count_index <- countIndex(mindex)  # Get the number of matches per
                                          # subject element.
       sum(count_index)  # Total number of matches.
       table(count_index)
       i0 <- which(count_index == max(count_index))
       subject[i0]  # The subject element with most matches.

       ## The matches in 'subject[i0]' as an IRanges object:
       mindex[[i0]]
       ## The matches in 'subject[i0]' as an XStringViews object:
       Views(subject[[i0]], mindex[[i0]])

       ## ---------------------------------------------------------------------
       ## C. WITH INDELS
       ## ---------------------------------------------------------------------
       library(BSgenome.Celegans.UCSC.ce2)
       pattern <- DNAString("ACGGACCTAATGTTATC")
       subject <- Celegans$chrI

       ## Allowing up to 2 mismatching letters doesn't give any match:
       matchPattern(pattern, subject, max.mismatch=2)

       ## But allowing up to 2 edit operations gives 3 matches:
       system.time(m <- matchPattern(pattern, subject, max.mismatch=2, with.indels=TRUE))
       m

       ## pairwiseAlignment() returns the (first) best match only:
       if (interactive()) {
         mat <- nucleotideSubstitutionMatrix(match=1, mismatch=0, baseOnly=TRUE)
         ## Note that this call to pairwiseAlignment() will need to
         ## allocate 733.5 Mb of memory (i.e. length(pattern) * length(subject)
         ## * 3 bytes).
         system.time(pwa <- pairwiseAlignment(pattern, subject, type="local",
                                              substitutionMatrix=mat,
                                              gapOpening=0, gapExtension=1))
         pwa
       }

       ## Only "best local matches" are reported:
         ## - with deletions in the subject
       subject <- BString("ACDEFxxxCDEFxxxABCE")
       matchPattern("ABCDEF", subject, max.mismatch=2, with.indels=TRUE)
       matchPattern("ABCDEF", subject, max.mismatch=2)
         ## - with insertions in the subject
       subject <- BString("AiBCDiEFxxxABCDiiFxxxAiBCDEFxxxABCiDEF")
       matchPattern("ABCDEF", subject, max.mismatch=2, with.indels=TRUE)
       matchPattern("ABCDEF", subject, max.mismatch=2)
         ## - with substitutions (note that the "best local matches" can introduce
         ##   indels and therefore be shorter than 6)
       subject <- BString("AsCDEFxxxABDCEFxxxBACDEFxxxABCEDF")
       matchPattern("ABCDEF", subject, max.mismatch=2, with.indels=TRUE)
       matchPattern("ABCDEF", subject, max.mismatch=2)

