match-utils            package:Biostrings            R Documentation

_U_t_i_l_i_t_y _f_u_n_c_t_i_o_n_s _r_e_l_a_t_e_d _t_o _p_a_t_t_e_r_n _m_a_t_c_h_i_n_g

_D_e_s_c_r_i_p_t_i_o_n:

     This man page defines precisely, and illustrates, what a "match"
     of a pattern P in a subject S is in the context of the Biostrings
     package. This definition is central to most pattern matching
     functions available in this package: unless specified otherwise,
     they adhere to the definition provided here.

     'hasLetterAt' checks whether a sequence or set of sequences has
     the specified letters at the specified positions.

     'neditStartingAt', 'neditEndingAt', 'isMatchingStartingAt' and
     'isMatchingEndingAt' are low-level matching functions that only
     check for matches at the specified positions.

     Other utility functions related to pattern matching are described
     here: the 'mismatch' function for getting the positions of the
     mismatching letters of a given pattern relative to its matches in
     a given subject, the 'nmatch' and 'nmismatch' functions for
     getting the number of matching and mismatching letters produced by
     the 'mismatch' function, and the 'coverage' function that can be
     used to get the "coverage" of a subject by a given pattern or set
     of patterns.

_U_s_a_g_e:

       hasLetterAt(x, letter, at, fixed=TRUE)

       neditStartingAt(pattern, subject, starting.at=1, with.indels=FALSE, fixed=TRUE)
       neditEndingAt(pattern, subject, ending.at=1, with.indels=FALSE, fixed=TRUE)
       neditAt(pattern, subject, at=1, with.indels=FALSE, fixed=TRUE)

       isMatchingStartingAt(pattern, subject, starting.at=1,
                       max.mismatch=0, with.indels=FALSE, fixed=TRUE)
       isMatchingEndingAt(pattern, subject, ending.at=1,
                       max.mismatch=0, with.indels=FALSE, fixed=TRUE)
       isMatchingAt(pattern, subject, at=1,
                       max.mismatch=0, with.indels=FALSE, fixed=TRUE)

       mismatch(pattern, x, fixed=TRUE)
       nmatch(pattern, x, fixed=TRUE)
       nmismatch(pattern, x, fixed=TRUE)
       ## S4 method for signature 'MIndex':
       coverage(x, start=NA, end=NA, shift=0L, width=NULL, weight=1L)
       ## S4 method for signature 'MaskedXString':
       coverage(x, start=NA, end=NA, shift=0L, width=NULL, weight=1L)

_A_r_g_u_m_e_n_t_s:

       x: A character vector, or an XString or XStringSet object for
          'hasLetterAt'.

          An XStringViews object for 'mismatch' (typically, one
          returned by 'matchPattern(pattern, subject)').

          An MIndex object for 'coverage', or any object for which a
          'coverage' method is defined. See '?coverage'. 

  letter: A character string or an XString object containing the
          letters to check. 

at, starting.at, ending.at: An integer vector specifying the starting
          (for 'starting.at' and 'at') or ending (for 'ending.at')
          positions of the pattern relative to the subject.

          For the 'hasLetterAt' function, 'letter' and 'at' must have
          the same length. 

 pattern: The pattern string. 

 subject: An XString, XStringSet object, or character vector containing
          the subject sequence(s). 

max.mismatch: See details below. 

with.indels: See details below. 

   fixed: A 'fixed' value other than the default ('TRUE') can only be
          used with a DNAString or RNAString-based subject.

          With 'fixed=FALSE', ambiguities (i.e. letters from the IUPAC
          Extended Genetic Alphabet (see 'IUPAC_CODE_MAP') that are not
          from the base alphabet) in the pattern _and_ in the subject
          are interpreted as wildcards i.e. they match any letter that
          they stand for.

          'fixed' can also be a character vector, a subset of
          'c("pattern", "subject")'. 'fixed=c("pattern", "subject")' is
          equivalent to 'fixed=TRUE' (the default). An empty vector is
          equivalent to 'fixed=FALSE'. With 'fixed="subject"',
          ambiguities in the pattern only are interpreted as wildcards.
          With 'fixed="pattern"', ambiguities in the subject only are
          interpreted as wildcards. 

start, end, shift, width: See '?coverage'. 

  weight: An integer vector specifying how much each element in 'x'
          counts. 
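
     The wildcard interpretation described for 'fixed' above can be
     illustrated with a small base R sketch. It does not use
     Biostrings: 'iupac_subset' is a hand-written subset of the IUPAC
     codes and 'matches_with_wildcards' is a hypothetical helper, both
     introduced only for this example. It covers the case where only
     the pattern contains ambiguities and the subject contains base
     letters only.

```r
# Hypothetical sketch: a few IUPAC ambiguity codes and the base letters
# they stand for (a small subset written out by hand for illustration).
iupac_subset <- c(A = "A", C = "C", G = "G", T = "T",
                  R = "AG", Y = "CT", N = "ACGT")

# For each position, does the pattern letter stand for the subject
# letter? Assumes the subject contains base letters only.
matches_with_wildcards <- function(pattern, subject) {
    p <- strsplit(pattern, "")[[1]]
    s <- strsplit(subject, "")[[1]]
    stopifnot(length(p) == length(s))
    mapply(function(pl, sl) sl %in% strsplit(iupac_subset[[pl]], "")[[1]],
           p, s, USE.NAMES = FALSE)
}

# Literal comparison: N is just another letter, so it differs from G.
literal  <- strsplit("NT", "")[[1]] == strsplit("GT", "")[[1]]

# Wildcard comparison: N stands for A, C, G or T, so it matches G.
wildcard <- matches_with_wildcards("NT", "GT")

print(rbind(literal, wildcard))
```

     In the sketch, "NT" differs from "GT" at the first letter under
     the literal comparison but matches at both positions once N is
     read as a wildcard.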

_D_e_t_a_i_l_s:

     A "match" of pattern P in subject S is a substring S' of S that is
     considered similar enough to P according to some distance (or
     metric) specified by the user. Two distances are supported by most
     pattern matching functions in the Biostrings package. The first
     (and simplest) one is the "number of mismatching letters". It is
     defined only when the two strings to compare have the same length,
     so when this distance is used, only matches that have the same
     number of letters as P are considered. The second one is the "edit
     distance" (aka Levenshtein distance): it is the minimum number of
     operations needed to transform P into S', where an operation is an
     insertion, deletion, or substitution of a single letter. When this
     metric is used, matches can have a different number of letters
     than P.
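
     As a rough illustration of the two distances, the sketch below
     uses base R only rather than Biostrings itself.
     'count_mismatches' is a hypothetical helper written for this
     example; 'adist' is the generalized Levenshtein distance function
     from the 'utils' package.

```r
# Hypothetical helper: number of mismatching letters between two
# strings of the same length (the first distance described above).
count_mismatches <- function(p, s) {
    stopifnot(nchar(p) == nchar(s))
    sum(strsplit(p, "")[[1]] != strsplit(s, "")[[1]])
}

# Same length, one substitution (G -> C).
n_mis <- count_mismatches("ACGT", "ACCT")

# Edit (Levenshtein) distance via utils::adist(); the two strings may
# have different lengths. Here one insertion of C is enough.
n_edit <- drop(utils::adist("ACGT", "ACCGT"))

print(c(mismatches = n_mis, edits = n_edit))
```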

     The 'neditStartingAt' and 'neditEndingAt' functions implement
     these two distances. If 'with.indels' is 'FALSE' (the default),
     then the first distance is used, i.e. 'neditStartingAt' returns
     the "number of mismatching letters" between the pattern P and the
     substring S' of S starting at each position specified in
     'starting.at' (note that 'neditStartingAt' and 'neditEndingAt' are
     vectorized, so long vectors of integers can be passed through the
     'starting.at' or 'ending.at' arguments). If 'with.indels' is
     'TRUE', then the "edit distance" is used: for each position
     specified in 'starting.at', P is compared to all the substrings S'
     of S starting at this position and the smallest distance is
     returned. Note that this distance is guaranteed to be reached for
     a substring of length < 2*length(P), so in practice P only needs
     to be compared to a small number of substrings for every starting
     position.
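
     The difference between the two settings of 'with.indels' can be
     sketched in base R for a single starting position. This is only
     an illustration of the definition above, not the Biostrings
     implementation: 'min_edit_starting_at' is a hypothetical helper,
     plain character strings stand in for XString objects, and the
     starting position is assumed to lie inside the subject.

```r
# Hypothetical sketch of the 'with.indels=TRUE' computation for one
# starting position: smallest edit distance between the pattern and
# the substrings of the subject that start at 'start'.
min_edit_starting_at <- function(pattern, subject, start) {
    # Only substrings shorter than 2 * nchar(pattern) need to be tried.
    max_end <- min(nchar(subject), start + 2L * nchar(pattern) - 2L)
    candidates <- substring(subject, start, start:max_end)
    min(utils::adist(pattern, candidates))
}

pattern <- "ACGT"
subject <- "ACCGT"

# Number of mismatching letters against the same-length substring "ACCG".
same_len <- substring(subject, 1, nchar(pattern))
n_mis <- sum(strsplit(pattern, "")[[1]] != strsplit(same_len, "")[[1]])

# Smallest edit distance over all substrings starting at position 1
# (reached for "ACCGT", which is the pattern plus one inserted C).
n_edit <- min_edit_starting_at(pattern, subject, 1L)

print(c(with.indels.FALSE = n_mis, with.indels.TRUE = n_edit))
```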

_V_a_l_u_e:

     'hasLetterAt': A logical matrix with one row per element in 'x'
     and one column per letter/position to check. When a specified
     position is invalid with respect to an element in 'x' then the
     corresponding matrix element is set to NA.
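
     The layout of this matrix can be illustrated with a base R
     emulation. 'has_letter_at_sketch' is a hypothetical helper
     written for this example and a plain character vector stands in
     for an XStringSet object, so this is a sketch of the described
     return value rather than of 'hasLetterAt' itself.

```r
# Hypothetical base R emulation of the matrix layout described above:
# one row per element of 'x', one column per letter/position to check.
has_letter_at_sketch <- function(x, letter, at) {
    letters_to_check <- strsplit(letter, "")[[1]]
    stopifnot(length(letters_to_check) == length(at))
    res <- sapply(seq_along(at), function(j) {
        hit <- substr(x, at[j], at[j]) == letters_to_check[j]
        # Positions outside an element of 'x' are reported as NA.
        hit[at[j] < 1L | at[j] > nchar(x)] <- NA
        hit
    })
    matrix(res, nrow = length(x))
}

x <- c("AAACGT", "AACGT", "ACGT", "TAGGA")

# Which elements have an A at position 2 and a G at position 4?
q <- has_letter_at_sketch(x, "AG", c(2, 4))
print(q)
print(which(rowSums(q) == 2))

# Position 6 only exists in the first element; the others yield NA.
q6 <- has_letter_at_sketch(x, "A", 6)
print(q6)
```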

     'neditStartingAt' and 'neditEndingAt': If 'subject' is an XString
     object, then return an integer vector of the same length as
     'starting.at' (or 'ending.at'). If 'subject' is an XStringSet
     object, then return the integer matrix with 'length(starting.at)'
     (or 'length(ending.at)') rows and 'length(subject)' columns
     defined by (in the case of 'neditStartingAt'):


         sapply(unname(subject),
                function(x) neditStartingAt(pattern, x, ...))

     'isMatchingStartingAt(...)' and 'isMatchingEndingAt(...)': If
     'subject' is an XString object, then return the logical vector
     defined by 'neditStartingAt(...) <= max.mismatch' or
     'neditEndingAt(...) <= max.mismatch', respectively. If 'subject'
     is an XStringSet object, then return the logical matrix with
     'length(starting.at)' (or 'length(ending.at)') rows and
     'length(subject)' columns defined by (in the case of
     'isMatchingStartingAt'):


         sapply(unname(subject),
                function(x) isMatchingStartingAt(pattern, x, ...))
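
     The relation between the two families of functions (a match is
     reported wherever the distance does not exceed 'max.mismatch')
     can be sketched in base R. 'nmismatch_starting_at' is a
     hypothetical helper for this example; it only handles the
     'with.indels=FALSE' case and assumes that every starting position
     leaves the whole pattern inside the subject.

```r
# Hypothetical base R sketch: count mismatching letters at several
# starting positions, then threshold the counts with 'max.mismatch'.
nmismatch_starting_at <- function(pattern, subject, starting.at) {
    p <- strsplit(pattern, "")[[1]]
    vapply(starting.at, function(start) {
        s <- substring(subject, start, start + nchar(pattern) - 1L)
        sum(p != strsplit(s, "")[[1]])
    }, integer(1))
}

starts <- 1:4
nedit <- nmismatch_starting_at("AT", "GTATA", starts)
print(nedit)

# A position is a match when its distance is at most 'max.mismatch'.
max.mismatch <- 1L
is_matching <- nedit <= max.mismatch
print(starts[is_matching])
```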

     'neditAt' and 'isMatchingAt' are convenience wrappers for
     'neditStartingAt' and 'isMatchingStartingAt', respectively.

     'mismatch':  a list of integer vectors.

     'nmismatch':  an integer vector containing the lengths of the
     vectors produced by 'mismatch'.

     'coverage':  an Rle object indicating the coverage of 'x'. See
     '?coverage' for the details. If 'x' is an MIndex object, the
     coverage of a given position in the underlying sequence (typically
     the subject used during the search that returned 'x') is the
     number of matches (or hits) it belongs to.
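
     The notion of coverage used here (for each position, the number
     of matches it belongs to) can be sketched in base R. The match
     starts and widths below are made-up values for illustration, and
     the result is a plain integer vector rather than the Rle object
     returned by 'coverage'.

```r
# Hypothetical sketch: coverage of a subject of length 7 by two matches,
# each described by its start and width (values chosen for illustration).
subject_length <- 7L
match_start <- c(1L, 2L)
match_width <- c(3L, 3L)

cvg <- integer(subject_length)
for (i in seq_along(match_start)) {
    # Positions spanned by the i-th match.
    covered <- seq(match_start[i], length.out = match_width[i])
    cvg[covered] <- cvg[covered] + 1L
}

# Positions 2 and 3 belong to both matches, positions 5 to 7 to none.
print(cvg)
```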

_S_e_e _A_l_s_o:

     'nucleotideFrequencyAt', 'matchPattern', 'matchPDict',
     'matchLRPatterns', 'trimLRPatterns', 'IUPAC_CODE_MAP',
     XString-class, XStringViews-class, MIndex-class, coverage,
     IRanges-class, MaskCollection-class, MaskedXString-class,
     align-utils

_E_x_a_m_p_l_e_s:

       ## ---------------------------------------------------------------------
       ## hasLetterAt()
       ## ---------------------------------------------------------------------
       x <- DNAStringSet(c("AAACGT", "AACGT", "ACGT", "TAGGA"))
       hasLetterAt(x, "AAAAAA", 1:6)

       ## hasLetterAt() can be used to answer questions like: "which elements
       ## in 'x' have an A at position 2 and a G at position 4?"
       q1 <- hasLetterAt(x, "AG", c(2, 4))
       which(rowSums(q1) == 2)

       ## or "how many probes in the drosophila2 chip have T, G, T, A at
       ## positions 2, 4, 13 and 20, respectively?"
       library(drosophila2probe)
       probes <- DNAStringSet(drosophila2probe$sequence)
       q2 <- hasLetterAt(probes, "TGTA", c(2, 4, 13, 20))
       sum(rowSums(q2) == 4)
       ## or "what's the probability to have an A at position 25 if there is
       ## one at position 13?"
       q3 <- hasLetterAt(probes, "AACGT", c(13, 25, 25, 25, 25))
       sum(q3[ , 1] & q3[ , 2]) / sum(q3[ , 1])
       ## Probabilities to have other bases at position 25 if there is an A
       ## at position 13:
       sum(q3[ , 1] & q3[ , 3]) / sum(q3[ , 1])  # C
       sum(q3[ , 1] & q3[ , 4]) / sum(q3[ , 1])  # G
       sum(q3[ , 1] & q3[ , 5]) / sum(q3[ , 1])  # T

       ## See ?nucleotideFrequencyAt for another way to get those results.

       ## ---------------------------------------------------------------------
       ## neditAt() / isMatchingAt()
       ## ---------------------------------------------------------------------
       subject <- DNAString("GTATA")

       ## Pattern "AT" matches subject "GTATA" at position 3 (exact match)
       neditAt("AT", subject, at=3)
       isMatchingAt("AT", subject, at=3)

       ## ... but not at position 1
       neditAt("AT", subject)
       isMatchingAt("AT", subject)

       ## ... unless we allow 1 mismatching letter (inexact match)
       isMatchingAt("AT", subject, max.mismatch=1)

       ## Here we look at 6 different starting positions and find 3 matches if
       ## we allow 1 mismatching letter
       isMatchingAt("AT", subject, at=0:5, max.mismatch=1)

       ## No match
       neditAt("NT", subject, at=1:4)
       isMatchingAt("NT", subject, at=1:4)

       ## 2 matches if N is interpreted as an ambiguity (fixed=FALSE)
       neditAt("NT", subject, at=1:4, fixed=FALSE)
       isMatchingAt("NT", subject, at=1:4, fixed=FALSE)

       ## max.mismatch != 0 and fixed=FALSE can be used together
       neditAt("NCA", subject, at=0:5, fixed=FALSE)
       isMatchingAt("NCA", subject, at=0:5, max.mismatch=1, fixed=FALSE)

       some_starts <- c(10:-10, NA, 6)
       subject <- DNAString("ACGTGCA")
       is_matching <- isMatchingAt("CAT", subject, at=some_starts, max.mismatch=1)
       some_starts[is_matching]

       ## ---------------------------------------------------------------------
       ## mismatch() / nmismatch()
       ## ---------------------------------------------------------------------
       m <- matchPattern("NCA", subject, max.mismatch=1, fixed=FALSE)
       mismatch("NCA", m)
       nmismatch("NCA", m)

       ## ---------------------------------------------------------------------
       ## coverage()
       ## ---------------------------------------------------------------------
       coverage(m)

       ## See ?matchPDict for examples of using coverage() on an MIndex object...

