This folder (SNPlocs.Hsapiens.dbSNP.20111119/inst/tools) contains the
tools used for making the .rda files contained in this package from the
dbSNP dump files.

dbSNP Home Page:

  http://www.ncbi.nlm.nih.gov/snp/

Here is how these .rda files were made:

  1. Download the ds_flat_ch*.flat.gz files for chromosomes 1-22, X, Y,
     MT from:

       ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/ASN1_flat

     You can use the download_ds_flat.sh script located in this folder
     for this.

  2. Uncompress the downloaded files.
     These uncompressed files are the "source files".
     NB: The ASN.1 flatfile format (and many other formats used on
     the snp section of the FTP site) is described here:

       ftp://ftp.ncbi.nih.gov/snp/00readme.txt

  3. Check the source files and pay attention to the output, for example:

       ./prechecking.sh path/to/ds_flat_ch16.flat

     Number of records tagged with "snp" (note that the final number of SNPs
     per chromosome will be lower than this because of additional filtering
     during step 4):
       ch1   3398949
       ch2   3645931
       ch3   3027638
       ch4   2999867
       ch5   2748676
       ch6   2721135
       ch7   2475750
       ch8   2376661
       ch9   1875332
       ch10  2103392
       ch11  2121285
       ch12  2031257
       ch13  1494167
       ch14  1374067
       ch15  1263913
       ch16  1406145
       ch17  1206136
       ch18  1189938
       ch19   941693
       ch20  1016652
       ch21   609483
       ch22   583627
       chX   1647703
       chY     87707
       chMT      785

  4. Adjust settings in make_rdas.sh and run it. This script will extract and
     curate the SNPs from the flat files (see man/package.Rd for how the SNPs
     are filtered), and dump them into .rda files (those files will be created
     in the current folder).
     This step took about 14.5 hours on rhino1 (64-bit openSUSE 11.3 with 12
     cpus and 128MB of RAM) and resulted in the extraction of 43938303 SNPs in
     total.

  5. Install SNPlocs.Hsapiens.dbSNP.20111119 (this will install the .rda
     files generated in step 4), start R, and run the update_SNPlocs_data.R
     script to update the datasets to the latest format.
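
  Steps 1 to 3 can be sketched roughly as follows. This is a hypothetical
  illustration only, not the contents of download_ds_flat.sh or
  prechecking.sh. The function names list_urls and count_snp_records are
  made up for this sketch, and the record layout assumed by the counting
  function (a header line starting with an rs id whose fourth "|"-separated
  field is the record type) is an assumption about the ASN.1 flatfile
  format that should be checked against 00readme.txt.

```shell
#!/bin/sh
# Hypothetical sketch of steps 1-3 (not the package's actual scripts).

# FTP location given in step 1.
BASE_URL="ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/ASN1_flat"

# Print one download URL per chromosome (1-22, X, Y, MT).
list_urls() {
    for chr in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y MT; do
        echo "$BASE_URL/ds_flat_ch$chr.flat.gz"
    done
}

# Count the records tagged "snp" in an uncompressed flat file.
# Assumption: each record starts with a header line such as
#   rs123 | human | 9606 | snp | ...
count_snp_records() {
    awk -F'|' '
        /^rs[0-9]+ / {
            gsub(/ /, "", $4)        # strip padding around the type field
            if ($4 == "snp") n++
        }
        END { print n + 0 }          # print 0 when nothing matched
    ' "$1"
}

# Example usage (commented out so that nothing is downloaded by accident):
#   list_urls | while read -r url; do wget "$url"; done
#   gunzip ds_flat_ch*.flat.gz
#   count_snp_records path/to/ds_flat_ch16.flat
```

  Listing the URLs separately from fetching them makes it easy to review
  what will be downloaded before running wget (or any other FTP client).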

Notes:
  Not all SNPs are consistent with the hg19 genome, i.e. the ambiguity letter
  for the SNP is not necessarily compatible with the nucleotide found at the
  SNP position. For example, in 'ch1_snplocs.rda', 3473 of 3363233 SNPs
  (0.103%) are inconsistent with hg19 chr1.
  To get the number of inconsistent SNPs:

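    library(SNPlocs.Hsapiens.dbSNP.20111119)
    # SNPs extracted for chromosome 1
    ch1snps <- getSNPlocs("ch1")
    # One IUPAC ambiguity letter per SNP, concatenated into a single string
    all_alleles <- paste(ch1snps$alleles_as_ambig, collapse="")
    library(BSgenome.Hsapiens.UCSC.hg19)
    # Count the positions where the hg19 base is not compatible with the
    # ambiguity letter (fixed=FALSE turns on IUPAC-aware matching)
    neditAt(all_alleles, unmasked(Hsapiens$chr1)[ch1snps$loc], fixed=FALSE)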
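
  As a quick check of the percentage quoted above, the count of
  inconsistent SNPs can be converted to a fraction of the total from the
  command line. This is a minimal sketch; the helper name pct_inconsistent
  is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: percentage of inconsistent SNPs, 3 decimal places.
# $1 = number of inconsistent SNPs, $2 = total number of SNPs
pct_inconsistent() {
    awk -v n="$1" -v total="$2" 'BEGIN { printf "%.3f%%\n", 100 * n / total }'
}

pct_inconsistent 3473 3363233   # prints 0.103%
```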

  The case of chrM is of course hopeless because GRCh37.p5 and hg19
  use different sequences for this chromosome, and those sequences don't
  have compatible chromosome coordinates.

