BSgenome-class           package:BSgenome           R Documentation

_T_h_e _B_S_g_e_n_o_m_e _c_l_a_s_s

_D_e_s_c_r_i_p_t_i_o_n:

     A container for the complete genome sequence of a given species.

_A_c_c_e_s_o_r _m_e_t_h_o_d_s:

     In the code snippets below, 'x' is a BSgenome object and 'name' is
     the name of a sequence (character-string).


      'organism(x)': Return the target organism for this genome e.g.
          '"Homo sapiens"', '"Mus musculus"', '"Caenorhabditis
          elegans"', etc...

      'species(x)': Return the target species for this genome e.g.
          '"Human"', '"Mouse"', '"C. elegans"', etc...

      'provider(x)': Return the provider of this genome e.g. '"UCSC"',
          '"BDGP"', '"FlyBase"', etc...

      'providerVersion(x)': Return the provider-side version of this
          genome. For example, UCSC uses versions '"hg18"', '"hg17"',
          etc... for the different Builds of the Human genome.

      'releaseDate(x)': Return the release date of this genome e.g.
          '"Mar. 2006"'.

      'releaseName(x)': Return the release name of this genome, which
          is generally made of the name of the organization that
          assembled it plus its Build version. For example, the Human
          genome that UCSC calls '"hg18"' corresponds to Build 36.1
          from NCBI, hence the release name for this genome is
          '"NCBI Build 36.1"'.

      'sourceUrl(x)': Return the source URL i.e. the permanent URL to
          the place where the FASTA files used to produce the sequences
          contained in 'x' can be found (and downloaded).

      'SNPlocs_pkgname(x)': Return the name of the package from which
          the SNPs are injected, if any.

      'seqnames(x)': Return the index of the single sequences contained
          in 'x'. Each single sequence is stored in an XString or
          MaskedXString object and typically comes from a source file
          (FASTA) with a single record. The names returned by
          'seqnames(x)' usually reflect the names of those source files
          but a common prefix or suffix may have been removed in order
          to keep them as short as possible.

      'mseqnames(x)': Return the index of the multiple sequences
          contained in 'x'. Each multiple sequence is stored in an
          XStringSet object and typically comes from a source file
          (FASTA) with multiple records. The names returned by
          'mseqnames(x)' usually reflect the names of those source
          files but a common prefix or suffix may have been removed in
          order to keep them as short as possible.

      'names(x)': Return the index of all sequences contained in 'x'.
          This is the same as 'c(seqnames(x), mseqnames(x))'.

      'length(x)': Return the length of 'x', i.e., the number of all
          sequences that it contains. This is the same as
          'length(names(x))'.

      'x[[name]]': Return the sequence (single or multiple) named
          'name'. No sequence is actually loaded into memory until this
          is explicitly requested with a call to 'x[[name]]' or
          'x$name'.

      'x$name': Same as 'x[[name]]' but 'name' is not evaluated and
          therefore must be a literal character string or a name
          (possibly backtick quoted).


_O_t_h_e_r _f_u_n_c_t_i_o_n_s _a_n_d _g_e_n_e_r_i_c_s:

     In the code snippets below, 'x' is a BSgenome object and 'name' is
     the name of a sequence (character-string).


      'unload(x, name)': Try to free the memory occupied by a loaded
          sequence by removing the 1st reference to this sequence. This
          1st reference is a hidden reference that is created behind
          the scenes by 'x[[name]]' or 'x$name'. See below for an
          example of how to make proper use of 'unload()'.


_A_u_t_h_o_r(_s):

     H. Pages

_S_e_e _A_l_s_o:

     'available.genomes', XString-class, MaskedXString-class,
     XStringSet-class, 'getSeq', 'matchPattern', 'rm', 'gc'

_E_x_a_m_p_l_e_s:

       library(BSgenome.Celegans.UCSC.ce2)   # This doesn't load the chromosome 
                                             # sequences into memory.
       length(Celegans)                      # Number of sequences in this genome.
       Celegans                              # Displays a summary of the sequences
                                             # provided in this genome.
       seqnames(Celegans)                    # Index of single sequences.
       class(Celegans$chrI)                  # A DNAString instance.
       mseqnames(Celegans)                   # Index of multiple sequences.
       class(Celegans$upstream1000)          # A DNAStringSet instance.
       desc(Celegans$upstream1000)[1:4]      # Character vector containing the
                                             # description line found in the FASTA
                                             # file for the first 4 FASTA records.
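
       ## A hedged sketch (an addition, not part of the original examples):
       ## querying the genome metadata with the accessor methods documented
       ## above. The returned values depend on the installed data package, so
       ## none are shown here.
       organism(Celegans)                    # Target organism.
       species(Celegans)                     # Target species.
       provider(Celegans)                    # Provider of this genome.
       providerVersion(Celegans)             # Provider-side version.
       releaseDate(Celegans)                 # Release date.
       releaseName(Celegans)                 # Release name.
       sourceUrl(Celegans)                   # Where the source FASTA files live.
       ## According to the accessor descriptions, both comparisons below are
       ## expected to be TRUE:
       identical(names(Celegans), c(seqnames(Celegans), mseqnames(Celegans)))
       length(Celegans) == length(names(Celegans))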

       ## Some important considerations about memory usage:
       mem0 <- gc()["Vcells", "(Mb)"]        # Current amount of data in memory (in
                                             # Mb).
       Celegans[["chrV"]]                    # Loads chromosome V into memory (hence
                                             # takes a long time).
       gc()["Vcells", "(Mb)"] - mem0         # Chromosome V occupies 20Mb of memory.
       Celegans[["chrV"]]                    # Much faster (sequence is already in
                                             # memory, hence it's not loaded again).
       Celegans$chrV                         # Equivalent to Celegans[["chrV"]].
       class(Celegans$chrV)                  # Chromosome V (like any other
                                             # chromosome sequence) is a DNAString
                                             # object.
       nchar(Celegans$chrV)                  # It has 20922231 letters (nucleotides).
       x <- Celegans$chrV                    # Very fast because a DNAString object
                                             # doesn't contain the sequence, only a
                                             # pointer to the sequence, hence chrV
                                             # seq is not duplicated in memory. But
                                             # we now have 2 objects pointing to the
                                             # same place in memory.
       y <- substr(x, 10, 100)               # A 3rd object pointing to chrV seq.
       
       ## We must remove all references to chrV seq if we want the 20Mb of memory
       ## used by it to be freed (note that it can be hard to keep track of all the
       ## references to a given sequence).
       ## IMPORTANT: The 1st reference to this seq (Celegans$chrV) should be removed
       ## last. This is achieved with unload(). All other references are removed by
       ## just removing the referencing object.
       rm(x)
       rm(y)
       unload(Celegans, "chrV")
       gc()["Vcells", "(Mb)"]
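
       ## A hedged sketch (an addition, not part of the original examples):
       ## visiting every single sequence in turn while trying to keep at most
       ## one of them loaded. It uses only functions shown above; 'seqlens' and
       ## 'name' are illustrative variable names.
       seqlens <- integer(0)
       for (name in seqnames(Celegans)) {
           seqlens[name] <- nchar(Celegans[[name]])  # Loads the sequence on demand.
           unload(Celegans, name)                    # Removes the hidden 1st reference.
       }
       seqlens                               # Named vector of sequence lengths.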

