pairwiseAlignment         package:Biostrings         R Documentation

_O_p_t_i_m_a_l _P_a_i_r_w_i_s_e _A_l_i_g_n_m_e_n_t

_D_e_s_c_r_i_p_t_i_o_n:

     Solves (Needleman-Wunsch) global alignment, (Smith-Waterman) local
     alignment, and (ends-free) overlap alignment problems.

_U_s_a_g_e:

     pairwiseAlignment(pattern, subject, ...)
     ## S4 method for signature 'XStringSet, XStringSet':
     pairwiseAlignment(pattern, subject,
                       patternQuality = PhredQuality(22L), subjectQuality = PhredQuality(22L),
                       type = "global", substitutionMatrix = NULL, fuzzyMatrix = NULL,
                       gapOpening = -10, gapExtension = -4, scoreOnly = FALSE)
     ## S4 method for signature 'QualityScaledXStringSet,
     ##   QualityScaledXStringSet':
     pairwiseAlignment(pattern, subject,
                       type = "global", substitutionMatrix = NULL, fuzzyMatrix = NULL, 
                       gapOpening = -10, gapExtension = -4, scoreOnly = FALSE)

_A_r_g_u_m_e_n_t_s:

 pattern: a character vector of any length, an 'XString', or an
          'XStringSet' object.

 subject: a character vector of length 1 or an 'XString' object.

patternQuality, subjectQuality: objects of class 'XStringQuality'
          representing the respective quality scores for 'pattern' and
          'subject' that are used in a quality-based method for
          generating a substitution matrix. These two arguments are
          ignored if '!is.null(substitutionMatrix)' or if its
          respective string set ('pattern', 'subject') is of class
          'QualityScaledXStringSet'.

    type: type of alignment. One of '"global"', '"local"', '"overlap"',
          '"global-local"', and '"local-global"' where '"global"' =
          align whole strings with end gap penalties, '"local"' = align
          string fragments, '"overlap"' = align whole strings without
          end gap penalties, '"global-local"' = align whole strings
          with end gap penalties on 'pattern' and without end gap
          penalties on 'subject' '"local-global"' = align whole strings
          without end gap penalties on 'pattern' and with end gap
          penalties on 'subject'.

substitutionMatrix: substitution matrix representing the fixed
          substitution scores for an alignment. It cannot be used in
          conjunction with 'patternQuality' and 'subjectQuality'
          arguments.

fuzzyMatrix: fuzzy match matrix for quality-based alignments. It takes
          values between 0 and 1; where 0 is an unambiguous mismatch, 1
          is an unambiguous match, and values in between represent a
          fraction of "matchiness". (See details section below.)

gapOpening: the cost for opening a gap in the alignment.

gapExtension: the incremental cost incurred along the length of the gap
          in the alignment.

scoreOnly: logical to denote whether or not to return just the scores
          of the optimal pairwise alignment.

     ...: optional arguments to generic function to support additional
          methods.

_D_e_t_a_i_l_s:

     Quality-based alignments are based on the paper the Bioinformatics
     article by Ketil Malde listed in the Reference section below. Let
     epsilon_i be the probability of an error in the base read. For
     '"Phred"' quality measures Q in [0, 99], these error probabilities
     are given by epsilon_i = 10^{-Q/10}. For '"Solexa"' quality
     measures Q in [-5, 99], they are given by epsilon_i = 1 - 1/(1 +
     10^{-Q/10}). Assuming independence within and between base reads,
     the combined error probability of a mismatch when the underlying
     bases do match is epsilon_c = epsilon_1 + epsilon_2 - (n/(n-1)) *
     epsilon_1 * epsilon_2, where n is the number of letters in the
     underlying alphabet. Using epsilon_c, the substitution score is
     given by when two bases match is given by b * log_2(gamma_{x,y} *
     (1 - epsilon_c) * n + (1 - gamma_{x,y}) * epsilon_c * (n/(n-1))),
     where b is the bit-scaling for the scoring and gamma_{x,y} is the
     probability that characters x and y represents the same underlying
     information (e.g. using IUPAC, gamma_{A,A} = 1 and gamma_{A,N} =
     1/4. In the arguments listed above 'fuzzyMatch' represents
     gamma_{x,y} and 'patternQuality' and 'subjectQuality' represents
     epsilon_1 and epsilon_2 respectively.

     If 'scoreOnly == FALSE', the pairwise alignment with the maximum
     alignment score is returned. If more than one pairwise alignment
     has the maximum alignment score exists, the first alignment along
     the subject is returned. If there are multiple pairwise alignments
     with the maximum alignment score at the chosen subject location,
     then at each location along the alignment mismatches are given
     preference to insertions/deletions. For example, 'pattern: [1]
     ATTA; subject: [1] AT-A' is chosen above 'pattern: [1] ATTA;
     subject: [1] A-TA' if they both have the maximum alignment score.

_V_a_l_u_e:

     If 'scoreOnly == FALSE', an instance of class
     'PairwiseAlignedXStringSet' or 'PairwiseAlignedFixedSubject' is
     returned. If 'scoreOnly == TRUE', a numeric vector containing the
     scores for the optimal pairwise alignments is returned.

_N_o_t_e:

     Use 'matchPattern' or 'vmatchPattern' if you need to find all the
     occurrences (eventually with indels) of a given pattern in a
     reference sequence or set of sequences.

     Use 'matchPDict' if you need to match a (big) set of patterns
     against a reference sequence.

_A_u_t_h_o_r(_s):

     P. Aboyoun and H. Pages

_R_e_f_e_r_e_n_c_e_s:

     R. Durbin, S. Eddy, A. Krogh, G. Mitchison, Biological Sequence
     Analysis, Cambridge UP 1998, sec 2.3.

     B. Haubold, T. Wiehe, Introduction to Computational Biology,
     Birkhauser Verlag 2006, Chapter 2.

     K. Malde, The effect of sequence quality on sequence alignment,
     Bioinformatics 2008 24(7):897-900.

_S_e_e _A_l_s_o:

     'stringDist', PairwiseAlignedXStringSet-class,
     XStringQuality-class, substitution.matrices, 'matchPattern'

_E_x_a_m_p_l_e_s:

       ## Nucleotide global, local, and overlap alignments
       s1 <- 
         DNAString("ACTTCACCAGCTCCCTGGCGGTAAGTTGATCAAAGGAAACGCAAAGTTTTCAAG")
       s2 <-
         DNAString("GTTTCACTACTTCCTTTCGGGTAAGTAAATATATAAATATATAAAAATATAATTTTCATC")

       # First use a fixed substitution matrix
       mat <- nucleotideSubstitutionMatrix(match = 1, mismatch = -3, baseOnly = TRUE)
       globalAlign <-
         pairwiseAlignment(s1, s2, substitutionMatrix = mat, gapOpening = -5, gapExtension = -2)
       localAlign <-
         pairwiseAlignment(s1, s2, type = "local", substitutionMatrix = mat, gapOpening = -5, gapExtension = -2)
       overlapAlign <-
         pairwiseAlignment(s1, s2, type = "overlap", substitutionMatrix = mat, gapOpening = -5, gapExtension = -2)

       # Then use quality-based method for generating a substitution matrix
       pairwiseAlignment(s1, s2,
                         patternQuality = SolexaQuality(rep(c(22L, 12L), times = c(36, 18))),
                         subjectQuality = SolexaQuality(rep(c(22L, 12L), times = c(40, 20))),
                         scoreOnly = TRUE)

       # Now assume can't distinguish between C/T and G/A
       pairwiseAlignment(s1, s2,
                         patternQuality = SolexaQuality(rep(c(22L, 12L), times = c(36, 18))),
                         subjectQuality = SolexaQuality(rep(c(22L, 12L), times = c(40, 20))),
                         type = "local")
       mapping <- diag(4)
       dimnames(mapping) <- list(DNA_BASES, DNA_BASES)
       mapping["C", "T"] <- mapping["T", "C"] <- 1
       mapping["G", "A"] <- mapping["A", "G"] <- 1
       pairwiseAlignment(s1, s2,
                         patternQuality = SolexaQuality(rep(c(22L, 12L), times = c(36, 18))),
                         subjectQuality = SolexaQuality(rep(c(22L, 12L), times = c(40, 20))),
                         fuzzyMatrix = mapping,
                         type = "local")

       ## Amino acid global alignment
       pairwiseAlignment(AAString("PAWHEAE"), AAString("HEAGAWGHEE"), substitutionMatrix = "BLOSUM50",
                         gapOpening = 0, gapExtension = -8)

