letterFrequency          package:Biostrings          R Documentation

_C_a_l_c_u_l_a_t_e _t_h_e _f_r_e_q_u_e_n_c_y _o_f _l_e_t_t_e_r_s _i_n _a _b_i_o_l_o_g_i_c_a_l
_s_e_q_u_e_n_c_e, _o_r _t_h_e _c_o_n_s_e_n_s_u_s _m_a_t_r_i_x _o_f _a _s_e_t _o_f _s_e_q_u_e_n_c_e_s

_D_e_s_c_r_i_p_t_i_o_n:

     Given a biological sequence (or a set of biological sequences),
     the 'alphabetFrequency' function computes the frequency of each
     letter in the (base) alphabet.

     The 'consensusMatrix' function computes the consensus matrix of a
     set of sequences, and the 'consensusString' function creates the
     consensus sequence based on a 50% + 1 vote from the consensus
     matrix (using the '"?"' letter to represent the lack of
     consensus).

     In this man page we call "DNA input" (or "RNA input") an XString,
     XStringSet, XStringViews or MaskedXString object of base type DNA
     (or RNA).

_U_s_a_g_e:

       alphabetFrequency(x, baseOnly=FALSE, freq=FALSE, ...)
       hasOnlyBaseLetters(x)
       uniqueLetters(x)

       ## S4 method for signature 'character':
       consensusMatrix(x, freq=FALSE)
       ## S4 method for signature 'XStringSet':
       consensusMatrix(x,
           baseOnly=FALSE, freq=FALSE, shift=0L, width=NULL)

       ## S4 method for signature 'matrix':
       consensusString(x)
       ## S4 method for signature 'XStringSet':
       consensusString(x, shift=0L, width=NULL)
       ## S4 method for signature 'ANY':
       consensusString(x)

_A_r_g_u_m_e_n_t_s:

       x: An XString, XStringSet, XStringViews or MaskedXString object
          for 'alphabetFrequency' and 'uniqueLetters'.

          DNA or RNA input for 'hasOnlyBaseLetters'.

          A character vector, or an XStringSet or XStringViews object
          for 'consensusMatrix'.

          A consensus matrix (as returned by 'consensusMatrix'), or an
          XStringSet or XStringViews object for 'consensusString'. 

baseOnly: 'TRUE' or 'FALSE'. If 'TRUE', the returned vector (or matrix)
          only contains the frequencies of the letters that belong to
          the "base" alphabet of 'x' i.e. to the alphabet returned by
          'alphabet(x, baseOnly=TRUE)'. Note that when 'x' is not a DNA
          or RNA input, then specifying 'baseOnly' has no effect. 

    freq: If 'TRUE' then relative frequencies are reported, otherwise
          counts (the default). 

     ...: Further arguments to be passed to or from other methods. For
          the XStringViews and XStringSet methods, the 'collapse'
          argument is accepted. 

   shift: An integer vector (recycled to the length of 'x') specifying
          how each sequence in 'x' should be (horizontally) shifted
          with respect to the first column of the consensus matrix to
          be returned. By default ('shift=0'), each sequence in 'x' has
          its first letter aligned with the first column of the matrix.
          A positive 'shift' value means that the corresponding
          sequence must be shifted to the right, and a negative 'shift'
          value that it must be shifted to the left. For example, a
          shift of 5 means that it must be shifted 5 positions to the
          right (i.e. the first letter in the sequence must be aligned
          with the 6th column of the matrix), and a shift of -3 means
          that it must be shifted 3 positions to the left (i.e. the 4th
          letter in the sequence must be aligned with the first column
          of the matrix). 

   width: The number of columns of the returned matrix for the
          'consensusMatrix' method for XStringSet objects. When
          'width=NULL' (the default), then this method returns a matrix
          that has just enough columns to have its last column aligned
          with the rightmost letter of all the sequences in 'x' after
          those sequences have been shifted (see the 'shift' argument
          above). This ensures that any wider consensus matrix would be
          a "padded with zeros" version of the matrix returned when
          'width=NULL'.

          The length of the returned sequence for the 'consensusString'
          method for XStringSet objects. 

_D_e_t_a_i_l_s:

     'alphabetFrequency' is a generic function defined in the
     Biostrings package.

_V_a_l_u_e:

     'alphabetFrequency' returns a numeric vector when 'x' is an
     XString or MaskedXString object. When 'x' is an XStringSet or
     XStringViews object, then it returns a numeric matrix with
     'length(x)' rows where the 'i'-th row contains the frequencies for
     'x[[i]]'. If 'x' is a DNA or RNA input, then the returned vector
     is named with the letters in the alphabet. If the 'baseOnly'
     argument is 'TRUE', then the returned vector has only 5 elements:
     4 elements corresponding to the 4 nucleotides + the 'other'
     element.

     'hasOnlyBaseLetters' returns 'TRUE' or 'FALSE' indicating whether
     or not 'x' contains only base letters (i.e. As, Cs, Gs and Ts for
     DNA input and As, Cs, Gs and Us for RNA input).

     'uniqueLetters' returns a vector of 1-letter or empty strings. The
     empty string is used to represent the nul character if 'x' happens
     to contain any. Note that this can only happen if the base class
     of 'x' is BString.

     An integer matrix with letters as row names for 'consensusMatrix'.

     A standard character string for 'consensusString'.

_A_u_t_h_o_r(_s):

     H. Pages and P. Aboyoun

_S_e_e _A_l_s_o:

     'alphabet', 'coverage', 'oligonucleotideFrequency', 'countPDict',
     XString-class, XStringSet-class, XStringViews-class,
     MaskedXString-class, 'strsplit'

_E_x_a_m_p_l_e_s:

       ## ---------------------------------------------------------------------
       ## A. BASIC alphabetFrequency() EXAMPLES
       ## ---------------------------------------------------------------------
       data(yeastSEQCHR1)
       yeast1 <- DNAString(yeastSEQCHR1)

       alphabetFrequency(yeast1)
       alphabetFrequency(yeast1, baseOnly=TRUE)
       hasOnlyBaseLetters(yeast1)
       uniqueLetters(yeast1)

       ## With input made of multiple sequences:
       library(drosophila2probe)
       probes <- DNAStringSet(drosophila2probe$sequence)
       alphabetFrequency(probes[1:50], baseOnly=TRUE)
       alphabetFrequency(probes, baseOnly=TRUE, collapse=TRUE)

       ## ---------------------------------------------------------------------
       ## B. consensus*() EXAMPLES
       ## ---------------------------------------------------------------------
       ## Read in ORF data:
       file <- system.file("extdata", "someORF.fa", package="Biostrings")
       orf <- read.DNAStringSet(file, "fasta")

       ## To illustrate, the following example assumes the ORF data
       ## to be aligned for the first 10 positions (patently false):
       orf10 <- DNAStringSet(orf, end=10)
       consensusMatrix(orf10, baseOnly=TRUE)

       ## The following example assumes the first 10 positions to be aligned
       ## after some incremental shifting to the right (patently false):
       consensusMatrix(orf10, baseOnly=TRUE, shift=0:6)
       consensusMatrix(orf10, baseOnly=TRUE, shift=0:6, width=10)

       ## For the character matrix containing the "exploded" representation
       ## of the strings, do:
       as.matrix(orf10, use.names=FALSE)

       ## consensusMatrix() can be used to just compute the alphabet frequency
       ## for each position in the input sequences:
       consensusMatrix(probes, baseOnly=TRUE)

       ## After sorting, the first 5 probes might look similar (at least on
       ## their first bases):
       consensusString(sort(probes)[1:5])

       ## ---------------------------------------------------------------------
       ## C. RELATIONSHIP BETWEEN consensusMatrix() AND coverage()
       ## ---------------------------------------------------------------------
       ## Applying colSums() on a consensus matrix gives the coverage that
       ## would be obtained by piling up (after shifting) the input sequences
       ## on top of an (imaginary) reference sequence:
       cm <- consensusMatrix(orf10, shift=0:6, width=10)
       colSums(cm)

       ## Note that this coverage can also be obtained with:
       as.integer(coverage(IRanges(rep(1, length(orf)), width(orf)), shift=0:6, width=10))

