BSgenome-class           package:BSgenome           R Documentation

_B_S_g_e_n_o_m_e _o_b_j_e_c_t_s

_D_e_s_c_r_i_p_t_i_o_n:

     The BSgenome class is a container for the complete genome sequence
     of a given organism.

_A_c_c_e_s_s_o_r _m_e_t_h_o_d_s:

     In the code snippets below, 'x' is a BSgenome object and 'name' is
     the name of a sequence (character string). Note that, because the
     BSgenome class contains the GenomeDescription class, all the
     accessor methods for GenomeDescription objects can also be used on
     'x'.


      'sourceUrl(x)': Return the source URL, i.e. the permanent URL of
          the location where the FASTA files used to produce the
          sequences contained in 'x' can be found (and downloaded).

      'seqnames(x)': Return the index of the single sequences contained
          in 'x'. Each single sequence is stored in an XString or
          MaskedXString object and typically comes from a source file
          (FASTA) with a single record. The names returned by
          'seqnames(x)' usually reflect the names of those source files,
          but a common prefix or suffix may have been removed in order
          to keep them as short as possible.

      'seqlengths(x)': Return the lengths of the single sequences
          contained in 'x'.

          See '?`length,XString-method`' and
          '?`length,MaskedXString-method`' for the definition of the
          length of an XString or MaskedXString object. Note that the
          length of a masked sequence (MaskedXString object) is not
          affected by the current set of active masks but the 'nchar'
          method for MaskedXString is.

          'names(seqlengths(x))' is guaranteed to be identical to
          'seqnames(x)'.

      'mseqnames(x)': Return the index of the multiple sequences
          contained in 'x'. Each multiple sequence is stored in an
          XStringSet object and typically comes from a source file
          (FASTA) with multiple records. The names returned by
          'mseqnames(x)' usually reflect the names of those source
          files, but a common prefix or suffix may have been removed in
          order to keep them as short as possible.

      'names(x)': Return the index of all sequences contained in 'x'.
          This is the same as 'c(seqnames(x), mseqnames(x))'.

      'length(x)': Return the length of 'x', i.e., the number of all
          sequences that it contains. This is the same as
          'length(names(x))'.

      'x[[name]]': Return the sequence (single or multiple) named
          'name'. No sequence is actually loaded into memory until this
          is explicitly requested with a call to 'x[[name]]' or
          'x$name'. When loaded, a sequence is kept in a cache. It is
          automatically removed from the cache at garbage collection if
          it is no longer in use, i.e. if there are no references to it
          (other than the reference stored in the cache). With
          'options(verbose=TRUE)', a message is printed each time a
          sequence is removed from the cache.

      'x$name': Same as 'x[[name]]' but 'name' is not evaluated and
          therefore must be a literal character string or a name
          (possibly backtick quoted).

      'masknames(x)': The names of the built-in masks that are defined
          for all the single sequences. There can be up to 4 built-in
          masks per sequence. These will always be (in this order): (1)
          the mask of assembly gaps, aka "the AGAPS mask"; (2) the mask
          of intra-contig ambiguities, aka "the AMB mask"; (3) the mask
          of repeat regions that were determined by the RepeatMasker
          software, aka "the RM mask"; (4) the mask of repeat regions
          that were determined by the Tandem Repeats Finder software
          (where only repeats with period less than or equal to 12 were
          kept), aka "the TRF mask". All the single sequences in a
          given package are guaranteed to have the same collection of
          built-in masks (same number of masks and in the same order).

          'masknames(x)' gives the names of the masks in this
          collection. Therefore the value returned by 'masknames(x)' is
          a character vector made of the first N elements of
          'c("AGAPS", "AMB", "RM", "TRF")', where N depends only on the
          BSgenome data package being looked at (0 <= N <= 4). The man
          page for most BSgenome data packages should provide the exact
          list and permanent URLs of the source data files that were
          used to extract the built-in masks. For example, if you've
          installed the BSgenome.Hsapiens.UCSC.hg18 package, load it
          and see the Note section in '?`BSgenome.Hsapiens.UCSC.hg18`'.


_A_u_t_h_o_r(_s):

     H. Pages

_S_e_e _A_l_s_o:

     'available.genomes', GenomeDescription-class, XString-class,
     MaskedXString-class, XStringSet-class, 'injectSNPs', 'subseq',
     'getSeq', 'matchPattern', 'rm', 'gc'

_E_x_a_m_p_l_e_s:

       ## Loading a BSgenome data package doesn't load its sequences
       ## into memory:
       library(BSgenome.Celegans.UCSC.ce2)

       ## Number of sequences in this genome:
       length(Celegans) 

       ## Display a summary of the sequences:
       Celegans
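
       ## Where the source FASTA files came from. The GenomeDescription
       ## accessors are inherited too; organism() and provider() are
       ## shown here on the assumption that they are among those
       ## accessors in the installed version (see
       ## ?`GenomeDescription-class`):
       sourceUrl(Celegans)
       organism(Celegans)
       provider(Celegans)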

       ## Index of single sequences:
       seqnames(Celegans)

       ## Lengths (i.e. number of nucleotides) of the sequences:
       seqlengths(Celegans)
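
       ## Names of the built-in masks defined for the single sequences.
       ## Per the description of 'masknames(x)' above, this should be the
       ## first N elements of c("AGAPS", "AMB", "RM", "TRF"); N may be 0
       ## for some packages, so the check below is only a sketch:
       mn <- masknames(Celegans)
       mn
       identical(as.character(mn),
                 head(c("AGAPS", "AMB", "RM", "TRF"), length(mn)))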

       ## Load chromosome I from disk to memory (hence takes some time)
       ## and keep a reference to it:
       chrI <- Celegans[["chrI"]]  # equivalent to Celegans$chrI

       chrI

       class(chrI)   # a DNAString instance
       length(chrI)  # with 15080483 nucleotides
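
       ## With '$' the name is not evaluated, so it can also be given as a
       ## backtick-quoted name (handy when a sequence name is not a
       ## syntactic R name). No extra reference to the sequence is kept
       ## here, so the caching examples further down are unaffected:
       length(Celegans$`chrI`) == length(chrI)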

       ## Multiple sequences:
       mseqnames(Celegans) 
       upstream1000 <- Celegans$upstream1000
       upstream1000
       class(upstream1000)  # a DNAStringSet instance
       ## Character vector containing the description lines of the first
       ## 4 sequences in the original FASTA file:
       names(upstream1000)[1:4]
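
       ## The relationships between the accessors stated in the "Accessor
       ## methods" section can be checked directly (a sketch; each
       ## expression is expected to evaluate to TRUE):
       identical(names(Celegans), c(seqnames(Celegans), mseqnames(Celegans)))
       length(Celegans) == length(names(Celegans))
       identical(names(seqlengths(Celegans)), seqnames(Celegans))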

       ## ---------------------------------------------------------------------
       ## PASS-BY-ADDRESS SEMANTICS, CACHING AND MEMORY USAGE
       ## ---------------------------------------------------------------------

       ## We want a message to be printed each time a sequence is removed
       ## from the cache:
       options(verbose=TRUE)

       gc()  # nothing seems to be removed from the cache
       rm(chrI, upstream1000)
       gc()  # chrI and upstream1000 are removed from the cache (they are
             # not in use anymore)

       options(verbose=FALSE)

       ## Get the current amount of data in memory (in Mb):
       mem0 <- gc()["Vcells", "(Mb)"]

       system.time(chrV <- Celegans[["chrV"]])  # read from disk
       
       gc()["Vcells", "(Mb)"] - mem0  # chrV occupies 20Mb in memory

       system.time(tmp <- Celegans[["chrV"]])  # much faster! (sequence
                                               # is in the cache)

       gc()["Vcells", "(Mb)"] - mem0  # we're still using 20Mb (sequences
                                      # have pass-by-address semantics,
                                      # i.e. the sequence data are not
                                      # duplicated)
       
       ## subseq() doesn't copy the sequence data either, hence it is very
       ## fast and memory efficient (but the returned object will hold a
       ## reference to chrV):
       y <- subseq(chrV, 10, 8000000) 
       gc()["Vcells", "(Mb)"] - mem0
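
       ## 'y' spans positions 10 to 8000000 of chrV, both ends included,
       ## so its length is end - start + 1:
       length(y) == 8000000 - 10 + 1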

       ## We must remove all references to chrV before it can be removed from
       ## the cache (so the 20Mb of memory used by this sequence are freed).
       options(verbose=TRUE)
       rm(chrV, tmp)
       gc()

       ## Remember that 'y' holds a reference to chrV too:
       rm(y)
       gc()

       options(verbose=FALSE)
       gc()["Vcells", "(Mb)"] - mem0

