getSeq               package:BSgenome               R Documentation

_g_e_t_S_e_q

_D_e_s_c_r_i_p_t_i_o_n:

     A convenience function for extracting a set of sequences (or
     subsequences) from a BSgenome or other object. This man page
     specifically documents the BSgenome method.

_U_s_a_g_e:

       getSeq(x, ...)

       ## S4 method for signature 'BSgenome':
       getSeq(x, names, start=NA, end=NA, width=NA, strand="+", as.character=TRUE)

_A_r_g_u_m_e_n_t_s:

       x: A BSgenome object. See the 'available.genomes' function for
          how to install a genome. 

   names: The names of the sequences to extract from 'x'. If missing,
          then 'seqnames(x)' is used.

          See '?seqnames' and '?mseqnames' to get the list of single
          sequences and multiple sequences (respectively) contained in
          'x'.

          Here is how the lookup between the names passed to the
          'names' argument and the sequences in 'x' is performed. For
          each 'name' in 'names': (1) if 'x' contains a single sequence
          with that name then this sequence is returned; (2) otherwise
          the names of all the elements in all the multiple sequences
          are searched: 'name' is treated as a regular expression and
          'grep' is used for this search. If exactly one sequence is
          found, it is returned; otherwise an error is raised.

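start, end, width: Vectors of integers (possibly containing NAs). As
          the examples below show, a negative value is counted from the
          end of the sequence (e.g. 'start=-20' selects the last 20
          bases).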

  strand: A character vector containing '+' and/or '-' values.

as.character: 'TRUE' or 'FALSE'. Should the extracted sequences be
          returned in a standard character vector? 
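
     The two-step lookup described for the 'names' argument can be
     sketched in base R as follows. This is only an illustration of the
     rule as stated above, not the BSgenome implementation:
     'lookup_name', 'single_names' and 'multi_names' are hypothetical,
     and the "NM_058260" entries are made-up placeholders added so that
     an ambiguous pattern can be shown.

```r
# Hypothetical stand-in for a genome: the names of its single sequences,
# plus its multiple sequences given as a named list of element names.
single_names <- c("chrI", "chrII", "chrM")
multi_names <- list(
  upstream1000 = c("NM_058260_up_1000", "NM_058280_up_1000"),
  upstream5000 = c("NM_058260_up_5000", "NM_058280_up_5000")
)

# Resolve one name following the two-step rule described above.
lookup_name <- function(name, single_names, multi_names) {
  # (1) an exact match against the single sequences wins
  if (name %in% single_names)
    return(list(type = "single", name = name))
  # (2) otherwise treat 'name' as a regular expression and search the
  #     element names of every multiple sequence with grep()
  hits <- lapply(multi_names, function(elts) grep(name, elts, value = TRUE))
  nhits <- sum(lengths(hits))
  if (nhits != 1L)
    stop("'", name, "' matches ", nhits, " sequences; exactly one is required")
  mseqname <- names(hits)[lengths(hits) == 1L]
  list(type = "multiple", mseqname = mseqname, name = hits[[mseqname]])
}

lookup_name("chrM", single_names, multi_names)$type                   # "single"
lookup_name("NM_058280_up_1000", single_names, multi_names)$mseqname  # "upstream1000"
# A pattern such as "NM_058280" matches two elements here, so it raises an error.
```

     Because step (2) is a regular-expression search, a short or
     unanchored pattern can match several elements and fail; passing the
     full element name avoids that in this sketch.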

_D_e_t_a_i_l_s:

     The 'names', 'start', 'end', 'width' and 'strand' arguments are
     recycled to the length of the longest of them, provided that none
     of them has zero length.
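
     As a rough illustration, the cyclic expansion can be mimicked in
     base R with 'rep'. The sketch below also shows one plausible way
     to turn 'start'/'end'/'width' values into concrete coordinates,
     assuming that NA means "unspecified" and that a negative value is
     counted from the end of the sequence (the convention suggested by
     the Examples section). 'recycle_args' and 'resolve_range' are
     hypothetical helpers, not BSgenome functions, and the real method
     may treat corner cases differently.

```r
# Recycle a set of argument vectors to the length of the longest one,
# refusing zero-length input (hypothetical helper).
recycle_args <- function(...) {
  args <- list(...)
  lens <- lengths(args)
  if (any(lens == 0L))
    stop("arguments of zero length are not allowed")
  lapply(args, rep, length.out = max(lens))
}

# Resolve a single start/end/width triple against a sequence of length
# 'seqlen'. Assumed convention: NA means "unspecified" and a negative
# position counts back from the end (-1 is the last base).
resolve_range <- function(start, end, width, seqlen) {
  from_end <- function(pos) if (!is.na(pos) && pos < 0) seqlen + pos + 1 else pos
  start <- from_end(start)
  end <- from_end(end)
  if (is.na(start))
    start <- if (is.na(end) || is.na(width)) 1 else end - width + 1
  if (is.na(end))
    end <- if (is.na(width)) seqlen else start + width - 1
  c(start = start, end = end)
}

# Three names, one 'end' and one 'strand': everything is expanded to length 3.
a <- recycle_args(names = c("chrI", "chrX", "chrM"), end = 20, strand = "+")
a$end     # 20 20 20
a$strand  # "+" "+" "+"

resolve_range(NA, 20, NA, seqlen = 100)   # start = 1,  end = 20  (first 20 bases)
resolve_range(-20, NA, NA, seqlen = 100)  # start = 81, end = 100 (last 20 bases)
```

     Note that 'rep(..., length.out=)' silently truncates when the
     longest length is not a multiple of a shorter one; a stricter
     helper could raise an error in that case.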

_V_a_l_u_e:

     A standard character vector when 'as.character=TRUE'. Note that
     in this case any masks defined on top of the sequences to
     extract are dropped (see '?`MaskedXString-class`' for more
     information about masked sequences).

     A DNAString or MaskedDNAString object when 'as.character=FALSE'.
     Note that 'as.character=FALSE' is not supported yet when
     extracting more than one sequence.

_N_o_t_e:

     Be aware that using 'as.character=TRUE' can be very inefficient
     when the returned character vector contains very long strings (> 1
     million letters) or is itself a long vector (> 10000 strings).

     'getSeq' is much more efficient when used with
     'as.character=FALSE' but this works only for extracting one
     sequence at a time for now.

_A_u_t_h_o_r(_s):

     H. Pages; improvements suggested by Matt Settles

_S_e_e _A_l_s_o:

     'available.genomes', BSgenome-class, 'seqnames', 'mseqnames',
     'grep', 'subseq', 'DNAString', 'MaskedDNAString',
     '[[,BSgenome-method'

_E_x_a_m_p_l_e_s:

       # Load the Caenorhabditis elegans genome (UCSC Release ce2):
       library(BSgenome.Celegans.UCSC.ce2)

       # Look at the index of sequences:
       Celegans

       # Get chromosome V as a DNAString object:
       getSeq(Celegans, "chrV", as.character=FALSE)
       # which is in fact the same as doing:
       Celegans$chrV

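       # Avoid this: with the default as.character=TRUE the whole
       # chromosome is returned as one very long character string:
       #getSeq(Celegans, "chrV")
       # or this (even worse, it does so for every sequence):
       #getSeq(Celegans)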

       # Get the first 20 bases of each chromosome:
       getSeq(Celegans, end=20)

       # Get the last 20 bases of each chromosome:
       getSeq(Celegans, start=-20)

       # Extracting small sequences from different chromosomes:
       myseqs <- data.frame(
         chr=c("chrI", "chrX", "chrM", "chrM", "chrX", "chrI", "chrM", "chrI"),
         start=c(NA, -40, 8510, 301, 30001, 9220500, -2804, -30),
         end=c(50, NA, 8522, 324, 30011, 9220555, -2801, -11),
         strand=c("+", "-", "+", "+", "-", "-", "+", "-"),
         stringsAsFactors=FALSE  # keep chr and strand as character vectors
       )
       getSeq(Celegans, myseqs$chr,
              start=myseqs$start, end=myseqs$end)
       getSeq(Celegans, myseqs$chr,
              start=myseqs$start, end=myseqs$end, strand=myseqs$strand)

       # Get the "NM_058280_up_1000" sequence (belongs to the upstream1000
       # multiple sequence) as a character string:
       s1 <- getSeq(Celegans, "NM_058280_up_1000")
       # or a DNAString object (more efficient):
       s2 <- getSeq(Celegans, "NM_058280_up_1000", as.character=FALSE)

       getSeq(Celegans, "NM_058280_up_5000", start=-1000) == s1  # TRUE

       getSeq(Celegans, "NM_058280_up_5000",
              start=-1000, as.character=FALSE) == s2  # TRUE

