nucleotideFrequency        package:Biostrings        R Documentation

_C_a_l_c_u_l_a_t_e _t_h_e _f_r_e_q_u_e_n_c_y _o_f _o_l_i_g_o_n_u_c_l_e_o_t_i_d_e_s _i_n _a _D_N_A
_o_r _R_N_A _s_e_q_u_e_n_c_e, _p_l_u_s _s_o_m_e _r_e_l_a_t_e_d _f_u_n_c_t_i_o_n_s

_D_e_s_c_r_i_p_t_i_o_n:

     Given a DNA or RNA sequence (or a set of DNA or RNA sequences),
     the 'oligonucleotideFrequency' function computes the frequency of
     all possible oligonucleotides of a given length (called the
     "width" in this particular context).

     The 'dinucleotideFrequency' and 'trinucleotideFrequency' functions
     are convenient wrappers for calling 'oligonucleotideFrequency'
     with 'width=2' and 'width=3', respectively.

     The 'nucleotideFrequencyAt' function computes the frequency of the
     short sequences formed by extracting the nucleotides found at some
     fixed positions from each sequence of a set of DNA or RNA
     sequences.

     In this man page we call "DNA input" (or "RNA input") an XString,
     XStringSet, XStringViews or MaskedXString object of base type DNA
     (or RNA).

_U_s_a_g_e:

       oligonucleotideFrequency(x, width, freq=FALSE, as.array=FALSE,
                                fast.moving.side="right", with.labels=TRUE, ...)

       ## S4 method for signature 'XStringSet':
       oligonucleotideFrequency(x,
           width, freq=FALSE, as.array=FALSE,
           fast.moving.side="right", with.labels=TRUE, simplify.as="matrix")

       dinucleotideFrequency(x, freq=FALSE, as.matrix=FALSE,
                             fast.moving.side="right", with.labels=TRUE, ...)
       trinucleotideFrequency(x, freq=FALSE, as.array=FALSE,
                              fast.moving.side="right", with.labels=TRUE, ...)

       nucleotideFrequencyAt(x, at, freq=FALSE, as.array=TRUE,
                             fast.moving.side="right", with.labels=TRUE, ...)

       ## Some related functions:
       oligonucleotideTransitions(x, left=1, right=1, freq=FALSE)
       mkAllStrings(alphabet, width, fast.moving.side="right")

_A_r_g_u_m_e_n_t_s:

       x: Any DNA or RNA input for the '*Frequency' and
          'oligonucleotideTransitions' functions.

          An XStringSet or XStringViews object of base type DNA or RNA
          for 'nucleotideFrequencyAt'. 

   width: The number of nucleotides per oligonucleotide for
          'oligonucleotideFrequency'.

          The number of letters per string for 'mkAllStrings'. 

      at: An integer vector containing the positions to look at in each
          element of 'x'. 

    freq: If 'TRUE' then relative frequencies are reported, otherwise
          counts (the default). 

as.array,as.matrix: Controls the "shape" of the returned object. If
          'TRUE' (the default for 'nucleotideFrequencyAt') then it's a
          numeric matrix (or array), otherwise it's just a "flat"
          numeric vector i.e. a vector with no dim attribute (the
          default for the '*Frequency' functions). 

fast.moving.side: Which side of the strings should move fastest? Note
          that, when 'as.array' is TRUE, then the supplied value is
          ignored and the effective value is '"left"'. 

with.labels: If 'TRUE' then the returned object is named. 

     ...: Further arguments to be passed to or from other methods. 

simplify.as: Together with the 'as.array' and 'as.matrix' arguments,
          controls the "shape" of the returned object when the input
          'x' is an XStringSet or XStringViews object. Supported
          'simplify.as' values are '"matrix"' (the default), '"list"'
          and '"collapsed"'. If 'simplify.as' is '"matrix"', the
          returned object is a matrix with 'length(x)' rows where the
          'i'-th row contains the frequencies for 'x[[i]]'. If
          'simplify.as' is '"list"', the returned object is a list of
          the same length as 'length(x)' where the 'i'-th element
          contains the frequencies for 'x[[i]]'. If 'simplify.as' is
          '"collapsed"', then the the frequencies are computed for the
          entire object 'x' as a whole (i.e. frequencies cumulated
          across all sequences in 'x'). 

left, right: The number of nucleotides per oligonucleotide for the rows
          and columns respectively in the transition matrix created by
          'oligonucleotideTransitions'. 

alphabet: The alphabet to use to make the strings. 

_V_a_l_u_e:

     If 'x' is an XString or MaskedXString object, the '*Frequency'
     functions return a numeric vector of length '4^width'. If
     'as.array' (or 'as.matrix') is 'TRUE', then this vector is
     formatted as an array (or matrix). If 'x' is an XStringSet or
     XStringViews object, the returned object has the shape specified
     by the 'simplify.as' argument.

_A_u_t_h_o_r(_s):

     H. Pages and P. Aboyoun

_S_e_e _A_l_s_o:

     'alphabetFrequency', 'alphabet', 'hasLetterAt', XString-class,
     XStringSet-class, XStringViews-class, MaskedXString-class,
     'GENETIC_CODE', 'AMINO_ACID_CODE', 'reverse,XString-method', 'rev'

_E_x_a_m_p_l_e_s:

       ## ---------------------------------------------------------------------
       ## A. BASIC *Frequency() EXAMPLES
       ## ---------------------------------------------------------------------
       data(yeastSEQCHR1)
       yeast1 <- DNAString(yeastSEQCHR1)

       dinucleotideFrequency(yeast1)
       trinucleotideFrequency(yeast1)
       oligonucleotideFrequency(yeast1, 4)

       ## Get the less and most represented 6-mers:
       f6 <- oligonucleotideFrequency(yeast1, 6)
       f6[f6 == min(f6)]
       f6[f6 == max(f6)]

       ## Get the result as an array:
       tri <- trinucleotideFrequency(yeast1, as.array=TRUE)
       tri["A", "A", "C"] # == trinucleotideFrequency(yeast1)["AAC"]
       tri["T", , ] # frequencies of trinucleotides starting with a "T"

       ## With input made of multiple sequences:
       library(drosophila2probe)
       probes <- DNAStringSet(drosophila2probe$sequence)
       dfmat <- dinucleotideFrequency(probes)  # a big matrix
       dinucleotideFrequency(probes, simplify.as="collapsed")
       dinucleotideFrequency(probes, simplify.as="collapsed", as.matrix=TRUE)

       ## ---------------------------------------------------------------------
       ## B. nucleotideFrequencyAt()
       ## ---------------------------------------------------------------------
       nucleotideFrequencyAt(probes, 13)
       nucleotideFrequencyAt(probes, c(13, 20))
       nucleotideFrequencyAt(probes, c(13, 20), as.array=FALSE)

       ## nucleotideFrequencyAt() can be used to answer questions like: "how
       ## many probes in the drosophila2 chip have T, G, T, A at position
       ## 2, 4, 13 and 20, respectively?"
       nucleotideFrequencyAt(probes, c(2, 4, 13, 20))["T", "G", "T", "A"]
       ## or "what's the probability to have an A at position 25 if there is
       ## one at position 13?"
       nf <- nucleotideFrequencyAt(probes, c(13, 25))
       sum(nf["A", "A"]) / sum(nf["A", ])
       ## Probabilities to have other bases at position 25 if there is an A
       ## at position 13:
       sum(nf["A", "C"]) / sum(nf["A", ])  # C
       sum(nf["A", "G"]) / sum(nf["A", ])  # G
       sum(nf["A", "T"]) / sum(nf["A", ])  # T

       ## See ?hasLetterAt for another way to get those results.

       ## ---------------------------------------------------------------------
       ## C. oligonucleotideTransitions()
       ## ---------------------------------------------------------------------
       ## Get nucleotide transition matrices for yeast1
       oligonucleotideTransitions(yeast1)
       oligonucleotideTransitions(yeast1, 2, freq=TRUE)

       ## ---------------------------------------------------------------------
       ## D. ADVANCED *Frequency() EXAMPLES
       ## ---------------------------------------------------------------------
       ## Note that when dropping the dimensions of the 'tri' array, elements
       ## in the resulting vector are ordered as if they were obtained with
       ## 'fast.moving.side="left"':
       triL <- trinucleotideFrequency(yeast1, fast.moving.side="left")
       all(as.vector(tri) == triL) # TRUE

       ## Convert the trinucleotide frequency into the amino acid frequency
       ## based on translation:
       tri1 <- trinucleotideFrequency(yeast1)
       names(tri1) <- GENETIC_CODE[names(tri1)]
       sapply(split(tri1, names(tri1)), sum) # 12512 occurrences of the stop codon

       ## When the returned vector is very long (e.g. width >= 10), using
       ## 'with.labels=FALSE' can improve performance significantly.
       ## Here for example, the observed speed up is between 25x and 500x:
       f12 <- oligonucleotideFrequency(yeast1, 12, with.labels=FALSE) # very fast!

       ## Spome related functions:
       dict1 <- mkAllStrings(LETTERS[1:3], 4)
       dict2 <- mkAllStrings(LETTERS[1:3], 4, fast.moving.side="left")
       identical(reverse(dict1), dict2) # TRUE 

