This folder (SNPlocs.Hsapiens.dbSNP.20120608/inst/tools) contains the
tools used for making the .rda files contained in this package from the
dbSNP dump files.

dbSNP Home Page:

  http://www.ncbi.nlm.nih.gov/snp/

Here is how these .rda files were made:

  1. Download the ds_flat_ch*.flat.gz files for chromosomes 1-22, X, Y,
     and MT from:

       ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/ASN1_flat

     You can use the download_ds_flat.sh script located in this folder
     for this.

  2. Uncompress the downloaded files.
     These uncompressed files are the "source files".
     NB: The ASN.1 flatfile format (and many other formats used on
     the snp section of the FTP site) is described here:

       ftp://ftp.ncbi.nih.gov/snp/00readme.txt

  3. Check the source files with, for example:

       ./prechecking.sh path/to/ds_flat_ch16.flat

     and pay attention to the output.

     Number of records tagged with "snp" (note that the final number of SNPs
     per chromosome will be lower than this because of additional filtering
     during step 5):
       ch1   3544400
       ch2   3776502
       ch3   3133038
       ch4   3098784
       ch5   2835591
       ch6   2757043
       ch7   2574095
       ch8   2466112
       ch9   1977075
       ch10  2178294
       ch11  2203668
       ch12  2107703
       ch13  1536669
       ch14  1424437
       ch15  1314853
       ch16  1468091
       ch17  1257091
       ch18  1225798
       ch19  1000922
       ch20  1054152
       ch21   632410
       ch22   613676
       chX   1653650
       chY     93913
       chMT     1065

  4. Compile filter2_ds_flat.c with:

       gcc -Wall filter2_ds_flat.c -o filter2_ds_flat

  5. Adjust settings in make_rdas.sh and run it. This script will extract and
     curate the SNPs from the flat files (see man/package.Rd for how the SNPs
     are filtered), and dump them into .rda files (those files will be created
     in the current folder).
     This step took about 11 hours on rhino3 (64-bit Ubuntu 12.04 with 12
     CPUs and 128 GB of RAM) and resulted in the extraction of 45416711 SNPs
     in total.

  6. Install SNPlocs.Hsapiens.dbSNP.20120608 (this will install the .rda
     files generated in step 5), start R, and run the update_SNPlocs_data.R
     script to update the datasets to the latest format.
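
As a rough cross-check of the numbers above, the per-chromosome counts from
step 3 can be summed and compared with the total reported in step 5. This is
only a sketch: the variable names are made up for illustration, the counts are
copied by hand from the list in step 3, and the difference is attributed to
the step 5 filtering solely on the basis of what this README says.

```shell
# Records tagged "snp" per chromosome, copied from the list in step 3.
counts='ch1 3544400
ch2 3776502
ch3 3133038
ch4 3098784
ch5 2835591
ch6 2757043
ch7 2574095
ch8 2466112
ch9 1977075
ch10 2178294
ch11 2203668
ch12 2107703
ch13 1536669
ch14 1424437
ch15 1314853
ch16 1468091
ch17 1257091
ch18 1225798
ch19 1000922
ch20 1054152
ch21 632410
ch22 613676
chX 1653650
chY 93913
chMT 1065'

# Sum the second column (the record counts).
tagged_total=$(printf '%s\n' "$counts" | awk '{ sum += $2 } END { printf "%d", sum }')

# Total number of SNPs extracted in step 5, as reported in this README.
extracted_total=45416711

# Records that did not make it into the .rda files.
dropped=$((tagged_total - extracted_total))

echo "tagged: $tagged_total  extracted: $extracted_total  dropped: $dropped"
```

With the figures above this reports 45929032 tagged records, of which 512321
(about 1.1%) do not end up in the .rda files.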

Notes:
  Not all SNPs are consistent with the hg19 genome, i.e. the ambiguity letter
  associated with a SNP (and representing the alleles with respect to the
  plus strand) is not necessarily compatible with the nucleotide found at the
  SNP position. For example, in 'ch1_snplocs.rda', 3039/3517088 SNPs (0.086%)
  are inconsistent with hg19 chr1.
  To get the number of inconsistent SNPs:

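    library(SNPlocs.Hsapiens.dbSNP.20120608)
    ## Load the SNP locations for chromosome 1.
    ch1snps <- getSNPlocs("ch1")
    ## Concatenate the IUPAC ambiguity letters of all SNPs into one string.
    all_alleles <- paste(ch1snps$alleles_as_ambig, collapse="")
    library(BSgenome.Hsapiens.UCSC.hg19)
    genome <- BSgenome.Hsapiens.UCSC.hg19
    ## Count the SNP positions where the hg19 nucleotide is not compatible
    ## with the ambiguity letter (fixed=FALSE interprets IUPAC codes).
    neditAt(all_alleles, unmasked(genome$chr1)[ch1snps$loc], fixed=FALSE)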
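
  The 0.086% figure quoted above follows directly from the two counts. A
  minimal shell check (the variable names are illustrative only; awk is used
  just for the floating-point division):

```shell
# Counts quoted in the note above for ch1_snplocs.rda.
inconsistent=3039
total=3517088

# Percentage of SNPs inconsistent with hg19 chr1, to 3 decimal places.
pct=$(awk -v n="$inconsistent" -v d="$total" 'BEGIN { printf "%.3f", 100 * n / d }')

echo "${pct}% of the ch1 SNPs are inconsistent with hg19 chr1"
```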

  The case of chrM is of course hopeless because GRCh37.p5 and hg19
  use different sequences for this chromosome, and those sequences don't
  have compatible chromosome coordinates.

