This folder contains the tools used to make the .rda files in this package
from the dbSNP dump files.

dbSNP Home Page:

  http://www.ncbi.nlm.nih.gov/snp/

Here is how these .rda files were made:

  1. Download all the ds_flat_ch*.flat.gz files from

       ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/ASN1_flat

  2. Uncompress the downloaded files.
     These uncompressed files are the "source files".
     NB: The ASN.1 flatfile format (and many other formats used on
     the snp section of the FTP site) is described here:

       ftp://ftp.ncbi.nih.gov/snp/00readme.html

  3. Check each source file with, for example:

       ./prechecking.sh path/to/ds_flat_ch16.flat

  4. Adjust settings in make_rdas.sh and run it (the .rda files will be dumped
     in the current folder).
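
  Steps 1 to 3 above can be sketched as a small shell script. This is only
  an illustration, not the procedure actually used: the function names and
  the SRC_DIR variable are hypothetical, and it assumes that wget and gunzip
  are installed and that it is run from the folder containing
  prechecking.sh.

```shell
#!/bin/sh
# Hypothetical working directory for the downloaded files (an assumption).
SRC_DIR="${SRC_DIR:-./dbsnp_flat}"
FTP_URL="ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/ASN1_flat"

# Step 1: fetch every ds_flat_ch*.flat.gz file into SRC_DIR.
download_sources() {
    mkdir -p "$SRC_DIR"
    wget -r -nd -np -A 'ds_flat_ch*.flat.gz' -P "$SRC_DIR" "$FTP_URL/"
}

# Step 2: uncompress the downloaded files to obtain the "source files".
uncompress_sources() {
    for gz in "$SRC_DIR"/ds_flat_ch*.flat.gz; do
        [ -e "$gz" ] || continue      # glob matched nothing
        gunzip -f "$gz"
    done
}

# Step 3: run the pre-check on each source file.
precheck_sources() {
    for flat in "$SRC_DIR"/ds_flat_ch*.flat; do
        [ -e "$flat" ] || continue
        echo "Checking $flat"
        ./prechecking.sh "$flat"
    done
}
```

  Calling download_sources, uncompress_sources and precheck_sources in that
  order reproduces steps 1 to 3; step 4 (editing and running make_rdas.sh)
  is still done by hand.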


Notes:
  Not all SNPs are consistent with the hg19 genome, i.e. the ambiguity letter
  for the SNP is not necessarily compatible with the nucleotide found at the
  SNP position. For example, in 'ch1_snplocs.rda', 3084/2509872 SNPs (0.12%)
  are inconsistent with hg19 chr1.
  To get the number of inconsistent SNPs:

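    library(SNPlocs.Hsapiens.dbSNP.20110815)
    ## SNP locations and IUPAC ambiguity letters for chromosome 1
    ch1snps <- getSNPlocs("ch1")
    ## Concatenate the ambiguity letters into a single string
    all_alleles <- paste(ch1snps$alleles_as_ambig, collapse="")
    library(BSgenome.Hsapiens.UCSC.hg19)
    ## Count the SNPs whose ambiguity letter does not match the hg19 base
    neditAt(all_alleles, unmasked(Hsapiens$chr1)[ch1snps$loc], fixed=FALSE)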
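
  The 0.12% figure quoted for chr1 follows directly from the two counts. A
  quick way to redo that arithmetic for another chromosome, assuming awk is
  available (the pct helper name is hypothetical):

```shell
# Print n/d as a percentage with two decimals.
pct() {
    awk -v n="$1" -v d="$2" 'BEGIN { printf "%.2f%%\n", 100 * n / d }'
}

pct 3084 2509872    # prints 0.12%
```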

  The case of chrM is of course hopeless because GRCh37.p2 and hg19
  use different sequences for this chromosome, and those sequences don't
  have compatible chromosome coordinates.

