crossBuilder_DB         package:PAnnBuilder         R Documentation

_B_u_i_l_d _D_a_t_a _P_a_c_k_a_g_e_s _f_o_r _P_r_o_t_e_i_n _I_D _M_a_p_p_i_n_g

_D_e_s_c_r_i_p_t_i_o_n:

     This function creates a data package with the protein ID mapping
     stored as R environment objects in the data directory.

_U_s_a_g_e:

     crossBuilder_DB(src = c("sp","ipi","gi"), organism, 
                  blast, match, 
                  prefix, pkgPath, version, author       
                  ) 
     fasta2list(type, srcUrl,organism="")
     idBlast(query, subject, blast, match)

_A_r_g_u_m_e_n_t_s:

     src: a character vector that can be "sp", "trembl", "ipi" or "gi" 
          to indicate which protein sequence databases will be used.

organism: a character string for the name of the organism of concern
          (e.g. "Homo sapiens").

   blast: a named character vector defining the parameters of blastall.

   match: a named character vector defining the parameters for matching
          two sequences.

  prefix: the prefix of the name of the data package to be built (e.g.
          "hsaSP"). The name of the built package is the prefix
          followed by ".db".

 pkgPath: a character string for the full path of an existing directory
          where the built package will be stored.

 version: a character string for the version number.

  author: a list with named elements "authors" containing a character
          vector of author names and "maintainer" containing the
          complete character string for the maintainer field, for
          example, "Jane Doe <jdoe@doe.com>".

    type: a character string for the type of sequence data file; one of
          "sp", "trembl", "ipi" or "gi".

  srcUrl: a character string for the URL from which the sequence data
          file in FASTA format will be retrieved.

   query: a named vector of query sequences

 subject: a named vector of subject sequences

_D_e_t_a_i_l_s:

     Build annotation data packages for protein ID mapping. formatdb
     and blastall must be installed.

     Parameter "blast" is a named character vector defining the
     parameters of blastall. Possible names and their meanings are
     as follows:

     p: Program name [String].
     e: Expectation value (E) [Real].
     M: Matrix [String].
     W: Word size, default if zero (blastn 11, megablast 28, all
        others 3) [Integer]; default = 0.
     G: Cost to open a gap (-1 invokes default behavior) [Integer].
     E: Cost to extend a gap (-1 invokes default behavior) [Integer].
     U: Use lower case filtering of FASTA sequence [T/F]; optional.
     F: Filter query sequence (DUST with blastn, SEG with others)
        [String].
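
     The named "blast" vector maps naturally onto command-line flags.
     The following is a minimal sketch of one way such a vector could
     be turned into an argument string; it is an illustration only, and
     the way crossBuilder_DB actually assembles the blastall call is an
     assumption not confirmed by this page.

```r
# Named character vector as described above (values from the Examples).
blast <- c(p = "blastp", e = "10.0", M = "BLOSUM62", W = "0",
           G = "-1", E = "-1", U = "T", F = "F")

# Assumption: each name becomes a single-dash flag followed by its value.
blast_args <- paste(paste0("-", names(blast)), blast, collapse = " ")

print(blast_args)
```

     Keeping the parameters in a named vector means a single option can
     be changed by name, for example blast["e"] <- "1e-5".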

     Parameter "match" is a named character vector defining the
     parameters for matching two sequences. Possible names and their
     meanings are as follows:

     e: Expectation value of the match between two sequences [Real].
     c: Coverage of the longest High-scoring Segment Pair (HSP)
        relative to the whole protein sequence (range: 0 to 1).
     i: Identity of the longest High-scoring Segment Pair (HSP)
        (range: 0 to 1).
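
     To make the three "match" thresholds concrete, here is a small
     sketch of how one HSP might be tested against them. The helper
     hsp_passes and its arguments are hypothetical and not part of
     PAnnBuilder; it assumes that coverage is the HSP length divided by
     the protein length, that "e" is an upper bound, and that "c" and
     "i" are lower bounds.

```r
# Hypothetical helper (not a PAnnBuilder function): check one HSP
# against the thresholds in a "match" vector.
hsp_passes <- function(evalue, hsp_length, protein_length, identity, match) {
  # as.numeric() lets the vector be either character or numeric.
  thr <- as.numeric(match[c("e", "c", "i")])
  coverage <- hsp_length / protein_length  # assumed definition of coverage
  evalue <= thr[1] && coverage >= thr[2] && identity >= thr[3]
}

match <- c(e = "0.00001", c = "0.95", i = "0.95")

# An HSP covering 290 of 300 residues at 98% identity passes.
ok <- hsp_passes(1e-50, 290, 300, 0.98, match)
# An HSP covering only 200 of 300 residues fails on coverage.
too_short <- hsp_passes(1e-50, 200, 300, 0.98, match)
```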

     Data files from the database are downloaded automatically to the
     tmp directory, so enough space is needed for them. After
     downloading, the files are parsed with Perl, so Perl must be
     installed. It may take a long time to parse the database and build
     the R package. Alternatively, we have produced a variety of R
     packages with PAnnBuilder, and you can download the appropriate
     package via <URL: http://www.biosino.org/PAnnBuilder>.
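
     Because the build depends on external programs, it can help to
     check for them before starting a long run. The sketch below uses
     base R's Sys.which; the executable names are assumed to match the
     tools named above and may differ on a given system.

```r
# External tools this page says are required (names are assumptions
# about how the executables are called on the search path).
tools <- c("perl", "formatdb", "blastall")

# Sys.which() returns the full path of each tool, or "" if not found.
paths <- Sys.which(tools)
missing_tools <- names(paths)[paths == ""]

if (length(missing_tools) > 0) {
  message("Not found on PATH: ", paste(missing_tools, collapse = ", "))
} else {
  message("All required tools were found.")
}
```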

_A_u_t_h_o_r(_s):

     Hong Li

_E_x_a_m_p_l_e_s:

     # Set path, version and author for the package.
     pkgPath <- tempdir()
     version <- "1.0.0"
     author <- list()
     author[["authors"]] <- "Hong Li"
     author[["maintainer"]] <- "Hong Li <sysptm@gmail.com>"

     # Set parameters for sequence similarity.
     blast <- c("blastp", "10.0", "BLOSUM62", "0", "-1", "-1", "T", "F")
     names(blast) <- c("p","e","M","W","G","E","U","F")
     match <- c(0.00001, 0.95, 0.95)
     names(match) <- c("e","c","i")

     ## It may take a long time to parse the database and build the R package.
     # Build the annotation data package "org.Hs.cross.db" for ID mapping of
     # three major protein sequence databases.
     if(interactive()){
         crossBuilder_DB(src=c("sp","ipi","gi"), organism="Homo sapiens", 
                         blast, match, 
                         prefix="org.Hs.cross", pkgPath, version, author)
     }

