alphabetFrequency         package:Biostrings         R Documentation

_F_u_n_c_t_i_o_n_s _t_o _c_a_l_c_u_l_a_t_e _t_h_e _f_r_e_q_u_e_n_c_y _o_f _l_e_t_t_e_r_s _i_n _a _b_i_o_l_o_g_i_c_a_l _s_e_q_u_e_n_c_e

_D_e_s_c_r_i_p_t_i_o_n:

     Given a biological sequence, the 'alphabetFrequency' function
     calculates the frequency of each letter in the (base) alphabet,
     the 'dinucleotideFrequency' function the frequency of all
     possible dinucleotides, and the 'trinucleotideFrequency' function
     the frequency of all possible trinucleotides.

     More generally, the 'oligonucleotideFrequency' function
     calculates the frequency of all possible oligonucleotides of a
     given length (called the "width" in this particular context).

     In this man page we call "DNA input" a DNAString object, or a
     DNAStringSet object, or an XStringViews object with a DNAString
     subject, or a MaskedDNAString object. Similarly we call "RNA
     input" an RNAString object, or an RNAStringSet object, or an
     XStringViews object with an RNAString subject, or a
     MaskedRNAString object.

_U_s_a_g_e:

       alphabetFrequency(x, baseOnly=FALSE, freq=FALSE, ...)
       dinucleotideFrequency(x, freq=FALSE, fast.moving.side="right", as.matrix=FALSE, with.labels=TRUE, ...)
       trinucleotideFrequency(x, freq=FALSE, fast.moving.side="right", as.array=FALSE, with.labels=TRUE, ...)
       oligonucleotideFrequency(x, width, freq=FALSE, fast.moving.side="right", as.array=FALSE, with.labels=TRUE, ...)

       ## Some related utility functions
       strrev(x)
       mkAllStrings(alphabet, width, fast.moving.side="right")

_A_r_g_u_m_e_n_t_s:

       x: An XString, XStringSet, XStringViews or MaskedXString object
          for all the '*Frequency' functions. A character vector for
          'strrev'. 

baseOnly: 'TRUE' or 'FALSE'. If 'TRUE', the returned vector only
          contains frequencies for the letters in the "base" alphabet
          (plus an 'other' element for all the remaining letters),
          i.e. "A", "C", "G", "T" if 'x' is a "DNA input", and "A",
          "C", "G", "U" if 'x' is an "RNA input". When 'x' is a
          BString object (or an XStringViews object with a BString
          subject, or a BStringSet object), the 'baseOnly' argument
          is ignored.

    freq: If 'TRUE', relative frequencies (proportions) are reported,
          otherwise counts.

     ...: Further arguments to be passed to or from other methods. For
          the XStringViews and XStringSet methods, the 'collapse'
          argument is accepted. 

fast.moving.side: Which side of the strings should move fastest,
          "right" (the default) or "left"? This determines the order
          of the elements in the returned vector.

as.matrix: If 'TRUE' then return a numeric matrix, otherwise a numeric
          vector with no dim attribute. 

as.array: If 'TRUE' then return a numeric array, otherwise a numeric
          vector with no dim attribute. 

with.labels: If 'TRUE' then return a named vector (or array). 

   width: The number of nucleotides per oligonucleotide for
          'oligonucleotideFrequency'. The number of letters per string
          for 'mkAllStrings'. 

alphabet: The alphabet to use to make the strings. 

_D_e_t_a_i_l_s:

     'alphabetFrequency' and 'oligonucleotideFrequency' are generic
     functions with methods in the Biostrings package defined for
     BString, DNAString, RNAString, XStringViews and XStringSet
     objects.

_V_a_l_u_e:

     All the '*Frequency' functions return an integer vector if 'freq'
     is 'FALSE' (default), otherwise a double vector. If 'as.matrix' or
     'as.array' is 'TRUE', this vector is formatted as a matrix or an
     array.

     For 'alphabetFrequency': if 'x' is a "DNA or RNA input", then the
     returned vector is named with the letters in the alphabet (unless
     'with.labels' is 'FALSE'). If the 'baseOnly' argument is 'TRUE',
     then the returned vector has only 5 elements: 4 elements
     corresponding to the 4 nucleotides + the 'other' element.

     'dinucleotideFrequency' (resp. 'trinucleotideFrequency' and
     'oligonucleotideFrequency') works only on "DNA or RNA input" and
     returns a vector named with all the possible dinucleotides (resp.
     trinucleotides or oligonucleotides).

     If 'x' is a multiple sequence input (i.e. an XStringViews or
     XStringSet object), then the returned object is a matrix (or a
     list) with the same number of rows (or elements) as 'x' unless
     'collapse=TRUE' is specified. In that case the returned vector (or
     array) contains the frequencies cumulated across all sequences in
     'x'.

_A_u_t_h_o_r(_s):

     H. Pagès

_S_e_e _A_l_s_o:

     'countPDict', XString-class, XStringSet-class, XStringViews-class,
     MaskedXString-class, 'reverse', 'rev', 'strsplit', 'GENETIC_CODE',
     'AMINO_ACID_CODE'

_E_x_a_m_p_l_e_s:

       data(yeastSEQCHR1)
       yeast1 <- DNAString(yeastSEQCHR1)

       alphabetFrequency(yeast1)
       alphabetFrequency(yeast1, baseOnly=TRUE)

       dinucleotideFrequency(yeast1)
       trinucleotideFrequency(yeast1)
       oligonucleotideFrequency(yeast1, 4)
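
       ## Counts versus relative frequencies. This is only a sketch that
       ## assumes the 'freq' argument behaves as documented above; 'counts'
       ## and 'props' are illustrative names, not part of the package.
       counts <- alphabetFrequency(yeast1, baseOnly=TRUE)
       props <- alphabetFrequency(yeast1, baseOnly=TRUE, freq=TRUE)
       ## 'props' is expected to match the counts divided by their total:
       all.equal(as.vector(props), as.vector(counts / sum(counts)))
       ## Fraction of all letters that are "C" or "G", derived from the counts:
       sum(counts[c("C", "G")]) / sum(counts)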

       ## With a multiple sequence input
       library(drosophila2probe)
       x <- DNAStringSet(drosophila2probe$sequence)
       alphabetFrequency(x[1:50], baseOnly=TRUE)
       alphabetFrequency(x, baseOnly=TRUE, collapse=TRUE)

       ## Get the least and most represented 6-mers
       f6 <- oligonucleotideFrequency(yeast1, 6)
       f6[f6 == min(f6)]
       f6[f6 == max(f6)]

       ## Get the result as an array
       tri <- trinucleotideFrequency(yeast1, as.array=TRUE)
       tri["A", "A", "C"] # == trinucleotideFrequency(yeast1)["AAC"]
       tri["T", , ] # frequencies of trinucleotides starting with a "T"

       ## Note that when dropping the dimensions of the 'tri' array, elements
       ## in the resulting vector are ordered as if they were obtained with
       ## 'fast.moving.side="left"':
       triL <- trinucleotideFrequency(yeast1, fast.moving.side="left")
       all(as.vector(tri) == triL) # TRUE

       ## Convert the trinucleotide frequency into the amino acid frequency based on
       ## translation
       tri1 <- trinucleotideFrequency(yeast1)
       names(tri1) <- GENETIC_CODE[names(tri1)]
       sapply(split(tri1, names(tri1)), sum) # 12512 occurrences of the stop codon

       ## When the returned vector is very long (e.g. width >= 10), using
       ## 'with.labels=FALSE' will improve the performance considerably (100x, 1000x
       ## or more):
       f12 <- oligonucleotideFrequency(yeast1, 12, with.labels=FALSE) # very fast!

       ## Some related utility functions
       dict1 <- mkAllStrings(LETTERS[1:3], 4)
       dict2 <- mkAllStrings(LETTERS[1:3], 4, fast.moving.side="left")
       identical(strrev(dict1), dict2) # TRUE 
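
       ## A base-R sketch of what 'fast.moving.side' means. This is an
       ## illustration only, NOT how 'mkAllStrings' is implemented, and
       ## 'allStringsSketch' is a hypothetical helper name. Its output can
       ## be compared by eye with 'dict1' and 'dict2' above.
       allStringsSketch <- function(alphabet, width, fast.moving.side="right")
       {
           ## expand.grid() varies its first column fastest
           grid <- expand.grid(rep(list(alphabet), width),
                               stringsAsFactors=FALSE)
           cols <- unname(as.list(grid))
           ## Put the fastest-varying column on the requested side
           if (fast.moving.side == "right")
               cols <- rev(cols)
           do.call(paste, c(cols, list(sep="")))
       }
       right2 <- allStringsSketch(LETTERS[1:3], 2)
       right2[1:4] # "AA" "AB" "AC" "BA"
       left2 <- allStringsSketch(LETTERS[1:3], 2, fast.moving.side="left")
       left2[1:4] # "AA" "BA" "CA" "AB"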

