getSeq               package:BSgenome               R Documentation

_g_e_t_S_e_q

_D_e_s_c_r_i_p_t_i_o_n:

     A convenience function for extracting a set of sequences (or
     subsequences) from a BSgenome object.

_U_s_a_g_e:

       getSeq(bsgenome, names, start=NA, end=NA, width=NA, as.character=TRUE)

_A_r_g_u_m_e_n_t_s:

bsgenome: A BSgenome object. See the 'available.genomes' function for
          how to install a genome. 

   names: The names of the sequences to extract from 'bsgenome'. If
          missing, then 'seqnames(bsgenome)' is used.

          See '?seqnames' and '?mseqnames' to get the list of single
          sequences and multiple sequences (respectively) contained in
          'bsgenome'.

          The names passed to the 'names' argument are matched against
          the sequences in 'bsgenome' as follows. For each 'name' in
          'names': (1) if 'bsgenome' contains a single sequence with
          that name, that sequence is returned; (2) otherwise, the
          names of all the elements in all the multiple sequences are
          searched, with 'name' treated as a regular expression and
          'grep' used for the search. If exactly one sequence is
          found, it is returned; otherwise an error is raised.
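
          The two-step lookup described above can be sketched in base R
          as follows. This is only an illustration on toy data, not the
          code 'getSeq' actually runs: 'lookup_seq', 'single_seqs' and
          'multi_seqs' are hypothetical names, and the sequence names
          and letters are made up.

```r
# Toy stand-ins for the contents of a genome (hypothetical, not real data).
single_seqs <- c(chrI = "ACGTACGT", chrII = "TTGACA")
multi_seqs <- list(
  upstream1000 = c(NM_0001_up_1000 = "AAAC", NM_0002_up_1000 = "GGGT"),
  upstream5000 = c(NM_0001_up_5000 = "CCAAAC", NM_0002_up_5000 = "TTGGGT")
)

# lookup_seq() is a hypothetical helper that mirrors the documented rule.
lookup_seq <- function(name, single_seqs, multi_seqs) {
  # Step 1: an exact match among the single sequences is returned directly.
  if (name %in% names(single_seqs))
    return(single_seqs[[name]])
  # Step 2: pool the elements of all multiple sequences and search their
  # names, treating 'name' as a regular expression.
  all_elts <- unlist(unname(multi_seqs))
  hits <- grep(name, names(all_elts))
  # Anything other than exactly one match is an error.
  if (length(hits) != 1L)
    stop("'", name, "' matched ", length(hits), " sequences, expected 1")
  all_elts[[hits]]
}

lookup_seq("chrI", single_seqs, multi_seqs)             # exact match
lookup_seq("NM_0001_up_1000", single_seqs, multi_seqs)  # found by grep
```

          Because step (2) uses a regular expression, a short pattern
          such as "NM_0001" matches two elements in this toy data and
          triggers the error.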

start, end, width: Specify these arguments only if you want to extract
          part of each sequence rather than the entire sequence. The
          subsequences specified by 'start', 'end' and 'width' (single
          integers or NAs) are then extracted by a call to 'subseq'
          before 'getSeq' returns them.
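
          As a rough illustration of how 'start', 'end' and 'width'
          combine, here is a base R sketch on a plain character string.
          'resolve_range' and 'sub_seq' are hypothetical helpers, not
          part of BSgenome. The sketch assumes that negative positions
          count back from the end of the sequence (consistent with
          'start=-20' selecting the last 20 bases) and that 'width'
          must come with exactly one of 'start' or 'end'; see '?subseq'
          for the authoritative rules.

```r
# resolve_range() is a hypothetical helper, not the real 'subseq' logic.
resolve_range <- function(len, start = NA, end = NA, width = NA) {
  # Assumption: negative positions are relative to the end (-1 = last letter).
  if (!is.na(start) && start < 0) start <- len + start + 1
  if (!is.na(end) && end < 0) end <- len + end + 1
  if (is.na(width)) {
    # No width: a missing start means the first letter, a missing end the last.
    if (is.na(start)) start <- 1
    if (is.na(end)) end <- len
  } else if (is.na(start) && !is.na(end)) {
    start <- end - width + 1
  } else if (!is.na(start) && is.na(end)) {
    end <- start + width - 1
  } else {
    # Assumption of this sketch: width needs exactly one of start or end.
    stop("supply 'width' together with exactly one of 'start' or 'end'")
  }
  c(start = as.integer(start), end = as.integer(end))
}

# sub_seq() applies the resolved range to an ordinary character string.
sub_seq <- function(x, ...) {
  r <- resolve_range(nchar(x), ...)
  substr(x, r[["start"]], r[["end"]])
}

x <- "ACGTACGTAC"
sub_seq(x, end = 4)                # first 4 letters
sub_seq(x, start = -3)             # last 3 letters
sub_seq(x, start = 3, width = 4)   # 4 letters starting at position 3
```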

as.character: 'TRUE' or 'FALSE'. Should the extracted sequences be
          returned as a standard character vector? 

_V_a_l_u_e:

     A standard character vector when 'as.character=TRUE'. In that
     case, any masks defined on top of the extracted sequences are
     ignored (see '?`MaskedXString-class`' for more information about
     masked sequences).

     A DNAString or MaskedDNAString object when 'as.character=FALSE'.
     Note that 'as.character=FALSE' is not supported when more than one
     sequence name is supplied.

_N_o_t_e:

     Be aware that using 'as.character=TRUE' can be very inefficient
     when the returned character vector contains very long strings (> 1
     million letters) or is itself a long vector (> 10000 strings).

     'getSeq' is much more efficient when used with
     'as.character=FALSE', but for now this works only when extracting
     one sequence at a time.

_A_u_t_h_o_r(_s):

     H. Pages; improvements suggested by Matt Settles

_S_e_e _A_l_s_o:

     'available.genomes', 'BSgenome-class', 'seqnames', 'mseqnames',
     'grep', 'subseq', 'DNAString', 'MaskedDNAString',
     '[[,BSgenome-method'

_E_x_a_m_p_l_e_s:

       # Load the Caenorhabditis elegans genome (UCSC Release ce2):
       library(BSgenome.Celegans.UCSC.ce2)

       # Look at the index of sequences:
       Celegans

       # Get chromosome V as a DNAString object:
       getSeq(Celegans, "chrV", as.character=FALSE)
       # which is in fact the same as doing:
       Celegans$chrV

       # Avoid this (it returns the whole chromosome as one very long
       # character string):
       #getSeq(Celegans, "chrV")
       # or this (even worse: one such string per chromosome):
       #getSeq(Celegans)

       # Get the first 20 bases of each chromosome:
       getSeq(Celegans, end=20)

       # Get the last 20 bases of each chromosome:
       getSeq(Celegans, start=-20)

       # Get the "NM_058280_up_1000" sequence (belongs to the upstream1000
       # multiple sequence) as a character string:
       s1 <- getSeq(Celegans, "NM_058280_up_1000")
       # or a DNAString object (more efficient):
       s2 <- getSeq(Celegans, "NM_058280_up_1000", as.character=FALSE)

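       # The last 1000 bases of the "NM_058280_up_5000" sequence are the
       # same as the "NM_058280_up_1000" sequence:
       getSeq(Celegans, "NM_058280_up_5000", start=-1000) == s1  # TRUE

       getSeq(Celegans, "NM_058280_up_5000",
              start=-1000, as.character=FALSE) == s2  # TRUE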

