The VariantAnnotation package can read all or part of a Variant Call Format (VCF) file. It can determine the structural location of variants as well as the amino acid coding changes for non-synonymous variants. Consequences of the coding changes can be investigated with the SIFT and PolyPhen database packages.
Performed with Bioconductor 2.11 and R >= 2.15; VariantAnnotation 1.3.9.
This workflow annotates variants found in the Transient Receptor Potential Vanilloid (TRPV) gene family on chromosome 17. The VCF file is available in the cgdv17 data package and contains Complete Genomics data for population type CEU.
library(VariantAnnotation)
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
##
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
##
## The following object is masked from 'package:stats':
##
## xtabs
##
## The following objects are masked from 'package:base':
##
## Filter, Find, Map, Position, Reduce, anyDuplicated, append,
## as.data.frame, as.vector, cbind, colnames, do.call,
## duplicated, eval, evalq, get, intersect, is.unsorted, lapply,
## mapply, match, mget, order, paste, pmax, pmax.int, pmin,
## pmin.int, rank, rbind, rep.int, rownames, sapply, setdiff,
## sort, table, tapply, union, unique, unlist
##
## Loading required package: GenomicRanges
## Warning: package 'GenomicRanges' was built under R version 3.1.1
## Loading required package: IRanges
## Warning: package 'IRanges' was built under R version 3.1.1
## Loading required package: GenomeInfoDb
## Loading required package: Rsamtools
## Loading required package: XVector
## Loading required package: Biostrings
##
## Attaching package: 'VariantAnnotation'
##
## The following object is masked from 'package:base':
##
## tabulate
library(cgdv17)
## Loading required package: org.Hs.eg.db
## Loading required package: AnnotationDbi
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
##
## Loading required package: DBI
## Warning: package 'DBI' was built under R version 3.1.1
##
## Loading required package: GGtools
## Loading required package: GGBase
## Warning: package 'GGBase' was built under R version 3.1.1
## Loading required package: snpStats
## Loading required package: survival
## Loading required package: splines
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
##
## The following object is masked from 'package:VariantAnnotation':
##
## expand
##
## The following object is masked from 'package:IRanges':
##
## expand
##
## Loading required package: data.table
##
## Attaching package: 'GGtools'
##
## The following object is masked from 'package:stats':
##
## getCall
##
## Loading required package: TxDb.Hsapiens.UCSC.hg19.knownGene
## Loading required package: GenomicFeatures
file <- system.file("vcf", "NA06985_17.vcf.gz", package = "cgdv17")
## Explore the file header with scanVcfHeader
hdr <- scanVcfHeader(file)
info(hdr)
## DataFrame with 3 rows and 3 columns
## Number Type Description
## <character> <character> <character>
## NS 1 Integer Number of Samples With Data
## DP 1 Integer Total Depth
## DB 0 Flag dbSNP membership, build 131
geno(hdr)
## DataFrame with 12 rows and 3 columns
## Number Type Description
## <character> <character> <character>
## GT 1 String Genotype
## GQ 1 Integer Genotype Quality
## DP 1 Integer Read Depth
## HDP 2 Integer Haplotype Read Depth
## HQ 2 Integer Haplotype Quality
## ... ... ... ...
## mRNA . String Overlaping mRNA
## rmsk . String Overlaping Repeats
## segDup . String Overlaping segmentation duplication
## rCov 1 Float relative Coverage
## cPd 1 String called Ploidy(level)
Convert the gene symbols to gene ids compatible with the TxDb.Hsapiens.UCSC.hg19.knownGene annotations. The annotations are used to define the TRPV ranges that will be extracted from the VCF file.
## get entrez ids from gene symbols
library(org.Hs.eg.db)
genesym <- c("TRPV1", "TRPV2", "TRPV3")
geneid <- select(org.Hs.eg.db, keys=genesym, keytype="SYMBOL",
columns="ENTREZID")
geneid
## SYMBOL ENTREZID
## 1 TRPV1 7442
## 2 TRPV2 51393
## 3 TRPV3 162514
Load the annotation package and create a list of transcripts by gene.
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
txbygene <- transcriptsBy(txdb, "gene")
Subset the annotations on chromosome 17 and adjust the seqlevels to match those in the VCF file.
tx_chr17 <- keepSeqlevels(txbygene, "chr17")
tx_17 <- renameSeqlevels(tx_chr17, c(chr17="17"))
## Create the gene ranges for the TRPV genes
rngs <- lapply(geneid$ENTREZID,
function(id)
range(tx_17[names(tx_17) %in% id]))
gnrng <- unlist(do.call(c, rngs), use.names=FALSE)
names(gnrng) <- geneid$SYMBOL
To retrieve a subset of data from a VCF file, create a ScanVcfParam object. This object can specify genomic coordinates (ranges) or individual VCF elements to be extracted. When ranges are extracted, a tabix index file must exist for the VCF. See ?indexTabix for details.
param <- ScanVcfParam(which = gnrng, info = "DP", geno = c("GT", "cPd"))
param
## class: ScanVcfParam
## vcfWhich: 1 elements
## vcfFixed: character() [All]
## vcfInfo: DP
## vcfGeno: GT cPd
## vcfSamples:
## Extract the TRPV ranges from the VCF file
vcf <- readVcf(file, "hg19", param)
## Inspect the VCF object with the 'fixed', 'info' and 'geno' accessors
vcf
## class: CollapsedVCF
## dim: 405 1
## rowData(vcf):
## GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
## info(vcf):
## DataFrame with 1 column: DP
## info(header(vcf)):
## Number Type Description
## DP 1 Integer Total Depth
## geno(vcf):
## SimpleList of length 2: GT, cPd
## geno(header(vcf)):
## Number Type Description
## GT 1 String Genotype
## cPd 1 String called Ploidy(level)
head(fixed(vcf))
## DataFrame with 6 rows and 4 columns
## REF ALT QUAL FILTER
## <DNAStringSet> <DNAStringSetList> <numeric> <character>
## 1 A G 120 PASS
## 2 A 0 PASS
## 3 AAAAA 0 PASS
## 4 AA 0 PASS
## 5 C T 59 PASS
## 6 T C 157 PASS
geno(vcf)
## List of length 2
## names(2): GT cPd
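Range-based extraction, as in the readVcf call above, relies on a tabix index sitting next to the compressed VCF. The following is a minimal sketch of how such an index might be built with bgzip and indexTabix from Rsamtools. The VCF content, the temporary file names, and the variable names are made up for illustration and are not part of the cgdv17 data.

```r
library(Rsamtools)

## Write a tiny, made-up VCF so the example is self-contained.
## Records must be sorted by position for tabix indexing.
vcf_lines <- c(
  "##fileformat=VCFv4.1",
  "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
  "17\t100\t.\tA\tG\t50\tPASS\t.",
  "17\t200\t.\tC\tT\t60\tPASS\t."
)
plain <- tempfile(fileext = ".vcf")
writeLines(vcf_lines, plain)

## bgzip-compress the file, then build the tabix index next to it
zipped <- bgzip(plain, paste0(plain, ".gz"), overwrite = TRUE)
index  <- indexTabix(zipped, format = "vcf")

## The index file name is returned; it should now exist on disk
file.exists(index)
```

Once the index exists, the compressed file can be passed to readVcf together with a ScanVcfParam that specifies ranges.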
To find the structural location of the variants, use the locateVariants function with the TxDb.Hsapiens.UCSC.hg19.knownGene package that was loaded earlier. The variants in the VCF object have chromosome name “17” while the annotation has “chr17”. Adjust the seqlevels (chromosome names) of the VCF object to match those of the annotation.
seqlevels(vcf)
## [1] "17" "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13"
## [15] "14" "15" "16" "18" "19" "20" "21" "22" "X" "Y" "M"
head(seqlevels(txdb))
## [1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6"
## seqlevels do not match
intersect(seqlevels(vcf), seqlevels(txdb))
## character(0)
vcf_mod <- renameSeqlevels(vcf, c("17"="chr17"))
## seqlevels now match
intersect(seqlevels(vcf_mod), seqlevels(txdb))
## [1] "chr17"
## Use the 'region' argument to define the region
## of interest. See ?locateVariants for details.
cds <- locateVariants(vcf_mod, txdb, CodingVariants())
## Warning: Each of the 2 combined objects has sequence levels not in the other:
## - in 'x': 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 18, 19, 20, 21, 22, X, Y, M
## - in 'y': chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chrM, chr1_gl000191_random, chr1_gl000192_random, chr4_ctg9_hap1, chr4_gl000193_random, chr4_gl000194_random, chr6_apd_hap1, chr6_cox_hap2, chr6_dbb_hap3, chr6_mann_hap4, chr6_mcf_hap5, chr6_qbl_hap6, chr6_ssto_hap7, chr7_gl000195_random, chr8_gl000196_random, chr8_gl000197_random, chr9_gl000198_random, chr9_gl000199_random, chr9_gl000200_random, chr9_gl000201_random, chr11_gl000202_random, chr17_ctg5_hap1, chr17_gl000203_random, chr17_gl000204_random, chr17_gl000205_random, chr17_gl000206_random, chr18_gl000207_random, chr19_gl000208_random, chr19_gl000209_random, chr21_gl000210_random, chrUn_gl000211, chrUn_gl000212, chrUn_gl000213, chrUn_gl000214, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217, chrUn_gl000218, chrUn_gl000219, chrUn_gl000220, chrUn_gl000221, chrUn_gl000222, chrUn_gl000223, chrUn_gl000224, chrUn_gl000225, chrUn_gl000226, chrUn_gl000227, chrUn_gl000228, chrUn_gl000229, chrUn_gl000230, chrUn_gl000231, chrUn_gl000232, chrUn_gl000233, chrUn_gl000234, chrUn_gl000235, chrUn_gl000236, chrUn_gl000237, chrUn_gl000238, chrUn_gl000239, chrUn_gl000240, chrUn_gl000241, chrUn_gl000242, chrUn_gl000243, chrUn_gl000244, chrUn_gl000245, chrUn_gl000246, chrUn_gl000247, chrUn_gl000248, chrUn_gl000249
## Make sure to always combine/compare objects based on the same reference
## genome (use suppressWarnings() to suppress this warning).
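The warning above lists seqlevels that exist in only one of the two objects: only "17" was renamed, so the VCF object still carries "1", "2", and so on. One possible way to avoid this, sketched here in base R with plain character vectors, is to build a rename map for every seqlevel rather than a single one. The vectors below are stand-ins for seqlevels(vcf) and seqlevels(txdb), and the "chr" prefix rule is an assumption that holds for the names shown in this workflow.

```r
## Stand-ins for seqlevels(vcf) and (part of) seqlevels(txdb)
vcf_levels  <- c("17", 1:16, 18:22, "X", "Y", "M")
txdb_levels <- c(paste0("chr", c(1:22, "X", "Y", "M")), "chr17_ctg5_hap1")

## No names in common before renaming
intersect(vcf_levels, txdb_levels)

## Named vector: names are the old seqlevels, values are the new ones
rename_map <- setNames(paste0("chr", vcf_levels), vcf_levels)

## After renaming, every VCF seqlevel would have a match in the annotation
setdiff(rename_map, txdb_levels)

## With the real objects the map could be applied as (not run here):
## vcf_mod <- renameSeqlevels(vcf, rename_map)
```

When all seqlevels of the VCF object are a subset of those in the annotation, the "sequence levels not in the other" warning should no longer be triggered.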
five <- locateVariants(vcf_mod, txdb, FiveUTRVariants())
splice <- locateVariants(vcf_mod, txdb, SpliceSiteVariants())
intron <- locateVariants(vcf_mod, txdb, IntronVariants())
all <- locateVariants(vcf_mod, txdb, AllVariants())
## Warning: trimmed start values to be positive
## Warning: trimmed end values to be <= seqlengths
Each row in the locateVariants output (cds, all, and so on) represents a variant-transcript match, so multiple rows per variant are possible. For gene-centric questions, the data can be summarized by gene regardless of transcript.
## Did any variants match more than one gene?
table(sapply(split(values(all)[["GENEID"]], values(all)[["QUERYID"]]),
function(x)
length(unique(x)) > 1))
##
## FALSE TRUE
## 367 38
## Summarize the number of variants by gene
idx <- sapply(split(values(all)[["QUERYID"]], values(all)[["GENEID"]]),
unique)
sapply(idx, length)
## 125144 162514 23729 51393 7442 84690
## 1 196 2 63 146 35
## Summarize variant location by gene
sapply(names(idx),
function(nm) {
d <- all[values(all)[["GENEID"]] %in% nm, c("QUERYID", "LOCATION")]
table(values(d)[["LOCATION"]][duplicated(d) == FALSE])
})
## 125144 162514 23729 51393 7442 84690
## spliceSite 0 2 0 0 1 0
## intron 0 153 0 58 117 19
## fiveUTR 0 2 0 1 3 5
## threeUTR 0 24 2 1 2 0
## coding 0 5 0 3 8 0
## intergenic 0 0 0 0 0 0
## promoter 1 10 0 0 15 11
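The split/sapply idiom used in the summaries above can be tried on a small table without loading any annotation. This is a toy sketch: the data frame stands in for values(all), and every QUERYID and GENEID pairing in it is invented for illustration.

```r
## Toy stand-in for values(all): one row per variant-transcript match.
## All pairings below are made up for illustration.
hits <- data.frame(
  QUERYID = c(1, 1, 2, 3, 3, 4),
  GENEID  = c("7442", "7442", "51393", "162514", "84690", "7442"),
  stringsAsFactors = FALSE
)

## For each variant (QUERYID), do its matches span more than one gene?
multi_gene <- sapply(split(hits$GENEID, hits$QUERYID),
                     function(x) length(unique(x)) > 1)
table(multi_gene)

## Number of distinct variants per gene
variants_per_gene <- sapply(split(hits$QUERYID, hits$GENEID),
                            function(x) length(unique(x)))
variants_per_gene
```

Variant 1 hits two transcripts of the same gene and is counted once for that gene, while variant 3 overlaps two different genes and is the only one flagged TRUE.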
Amino acid coding for non-synonymous variants can be computed with the function predictCoding. The BSgenome.Hsapiens.UCSC.hg19 package is used as the source of the reference alleles. Variant alleles are provided by the user.
library(BSgenome.Hsapiens.UCSC.hg19)
## Loading required package: BSgenome
##
## Attaching package: 'BSgenome'
##
## The following object is masked from 'package:AnnotationDbi':
##
## species
aa <- predictCoding(vcf_mod, txdb, Hsapiens)
## Warning: records with missing 'varAllele' were ignored
## Warning: varAllele values containing 'N' were not translated
predictCoding returns results for coding variants only. As with locateVariants, the output has one row per variant-transcript match so multiple rows per variant are possible.
## Did any variants match more than one gene?
table(sapply(split(values(aa)[["GENEID"]], values(aa)[["QUERYID"]]),
function(x)
length(unique(x)) > 1))
##
## FALSE
## 17
## Summarize the number of variants by gene
idx <- sapply(split(values(aa)[["QUERYID"]], values(aa)[["GENEID"]],
drop=TRUE), unique)
sapply(idx, length)
## 162514 51393 7442
## 6 3 8
## Summarize variant consequence by gene
sapply(names(idx),
function(nm) {
d <- aa[values(aa)[["GENEID"]] %in% nm, c("QUERYID","CONSEQUENCE")]
table(values(d)[["CONSEQUENCE"]][duplicated(d) == FALSE])
})
##                162514 51393 7442
## nonsynonymous       2     0    2
## not translated      1     0    5
## synonymous          3     3    1
The 'not translated' variants are explained by the warnings issued when predictCoding was called: variants with a missing varAllele, or with an 'N' in the varAllele, are not translated. If the varAllele substitution had resulted in a frameshift, the consequence would be 'frameshift'. See ?predictCoding for details.
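As a hedged alternative to the per-gene summary, the sketch below builds a gene-by-consequence table in one step with base R. The data frame `hits` is an invented stand-in for the QUERYID, GENEID and CONSEQUENCE columns of `aa`. It assumes each variant should be counted once per gene and consequence, which may not match the de-duplication used in the workflow exactly.

```r
## Toy stand-in for the QUERYID, GENEID and CONSEQUENCE columns of aa;
## all values are invented for illustration.
hits <- data.frame(
    QUERYID     = c(1, 1, 2, 3, 4, 4),
    GENEID      = c("7442", "7442", "7442", "51393", "162514", "162514"),
    CONSEQUENCE = c("nonsynonymous", "nonsynonymous", "not translated",
                    "synonymous", "synonymous", "synonymous"),
    stringsAsFactors = FALSE
)

## Collapse repeated variant-transcript rows so that each variant is
## counted once per gene and consequence
per_variant <- unique(hits)

## Cross-tabulate consequence (rows) by gene (columns)
consequence_by_gene <- table(per_variant$CONSEQUENCE, per_variant$GENEID)
consequence_by_gene
```

A 'frameshift' consequence would simply appear as an additional row of the table.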
The SIFT.Hsapiens.dbSNP132 and PolyPhen.Hsapiens.dbSNP131 packages provide predictions of how damaging amino acid coding changes may be to protein structure and function. Both packages search on rsid.
The pre-computed predictions in the SIFT and PolyPhen packages are based on specific gene models. SIFT is based on Ensembl and PolyPhen on UCSC Known Gene. The TranscriptDb we used to identify coding variants was from UCSC Known Gene so we will use PolyPhen for predictions.
## Load the PolyPhen package and explore the available keys and columns
library(PolyPhen.Hsapiens.dbSNP131)
keys <- keys(PolyPhen.Hsapiens.dbSNP131)
cols <- columns(PolyPhen.Hsapiens.dbSNP131)
## column descriptions are found at ?PolyPhenDbColumns
columns(PolyPhen.Hsapiens.dbSNP131)
## [1] "RSID" "TRAININGSET" "OSNPID" "OACC" "OPOS"
## [6] "OAA1" "OAA2" "SNPID" "ACC" "POS"
## [11] "AA1" "AA2" "NT1" "NT2" "PREDICTION"
## [16] "BASEDON" "EFFECT" "PPH2CLASS" "PPH2PROB" "PPH2FPR"
## [21] "PPH2TPR" "PPH2FDR" "SITE" "REGION" "PHAT"
## [26] "DSCORE" "SCORE1" "SCORE2" "NOBS" "NSTRUCT"
## [31] "NFILT" "PDBID" "PDBPOS" "PDBCH" "IDENT"
## [36] "LENGTH" "NORMACC" "SECSTR" "MAPREG" "DVOL"
## [41] "DPROP" "BFACT" "HBONDS" "AVENHET" "MINDHET"
## [46] "AVENINT" "MINDINT" "AVENSIT" "MINDSIT" "TRANSV"
## [51] "CODPOS" "CPG" "MINDJNC" "PFAMHIT" "IDPMAX"
## [56] "IDPSNP" "IDQMIN" "COMMENTS"
## Get the rsids for the non-synonymous variants from the
## predictCoding results
rsid <- unique(names(aa)[values(aa)[["CONSEQUENCE"]] == "nonsynonymous"])
## Retrieve predictions for non-synonymous variants. The output below has
## five predictions covering four rsids; rs322937 has two entries.
select(PolyPhen.Hsapiens.dbSNP131, keys=rsid,
columns=c("AA1", "AA2", "PREDICTION"))
## RSID AA1 AA2 PREDICTION
## 1 rs224534 T I benign
## 2 rs222747 M I benign
## 3 rs322937 R G possibly damaging
## 4 rs322937 R G benign
## 5 rs322965 I V benign
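Because an rsid can carry more than one prediction, it may help to flag rsids whose predictions disagree. The sketch below does this in base R on a data frame copied by hand from the output above. The names `pred`, `by_rsid` and `conflicting` are illustrative. In a live session the result of select() could presumably be used in place of `pred`.

```r
## Predictions copied from the select() output shown above
pred <- data.frame(
    RSID = c("rs224534", "rs222747", "rs322937", "rs322937", "rs322965"),
    AA1  = c("T", "M", "R", "R", "I"),
    AA2  = c("I", "I", "G", "G", "V"),
    PREDICTION = c("benign", "benign", "possibly damaging", "benign",
                   "benign"),
    stringsAsFactors = FALSE
)

## Distinct predictions reported for each rsid
by_rsid <- lapply(split(pred$PREDICTION, pred$RSID), unique)

## rsids with more than one distinct prediction
conflicting <- names(by_rsid)[sapply(by_rsid, length) > 1]
conflicting
```

Only rs322937 is flagged. Requesting further columns such as "ACC" or "POS" in the select() call may show why the two rows differ; the output above does not show the reason.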
Follow installation instructions to start using these packages. To install VariantAnnotation use
library(BiocInstaller)
biocLite("VariantAnnotation")
Package installation is required only once per R installation. View a full list of available software and annotation packages.
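Since installation is needed only once, a small check for missing packages can avoid reinstalling. The following is a minimal sketch in base R; `missing_packages` is a hypothetical helper, not part of any Bioconductor package. The installer call is left commented out and follows the biocLite() usage shown above.

```r
## Return the subset of `pkgs` that is not installed in this R library.
## system.file() returns "" when a package cannot be found.
missing_packages <- function(pkgs) {
    paths <- vapply(pkgs, function(p) system.file(package = p),
                    character(1))
    pkgs[!nzchar(paths)]
}

needed <- c("VariantAnnotation", "cgdv17", "PolyPhen.Hsapiens.dbSNP131")
to_install <- missing_packages(needed)
to_install

## Not run: install whatever is missing
## library(BiocInstaller)
## biocLite(to_install)
```

On recent Bioconductor releases, BiocInstaller and biocLite() have been replaced by BiocManager::install(), so the commented lines would need adjusting there.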
To use the VariantAnnotation package, evaluate the command
library(VariantAnnotation)
This command is required once in each R session.
Packages have extensive help pages, and include vignettes highlighting common use cases. The help pages and vignettes are available from within R. After loading a package, use syntax like
help(package="VariantAnnotation")
?predictCoding
to obtain an overview of help on the VariantAnnotation package and the predictCoding function. View the package vignette with
browseVignettes(package="VariantAnnotation")
To view vignettes providing a more comprehensive introduction to package functionality use
help.start()
sessionInfo()
## R version 3.1.0 (2014-04-10)
## Platform: i386-w64-mingw32/i386 (32-bit)
##
## locale:
## [1] LC_COLLATE=C
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] splines parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] PolyPhen.Hsapiens.dbSNP131_1.0.2
## [2] BSgenome.Hsapiens.UCSC.hg19_1.3.1000
## [3] BSgenome_1.32.0
## [4] cgdv17_0.2.0
## [5] TxDb.Hsapiens.UCSC.hg19.knownGene_2.14.0
## [6] GenomicFeatures_1.16.2
## [7] GGtools_5.0.0
## [8] data.table_1.9.2
## [9] GGBase_3.26.1
## [10] snpStats_1.14.0
## [11] Matrix_1.1-4
## [12] survival_2.37-7
## [13] org.Hs.eg.db_2.14.0
## [14] RSQLite_0.11.4
## [15] DBI_0.3.0
## [16] AnnotationDbi_1.26.0
## [17] Biobase_2.24.0
## [18] VariantAnnotation_1.10.5
## [19] Rsamtools_1.16.1
## [20] Biostrings_2.32.1
## [21] XVector_0.4.0
## [22] GenomicRanges_1.16.4
## [23] GenomeInfoDb_1.0.2
## [24] IRanges_1.22.10
## [25] BiocGenerics_0.10.0
##
## loaded via a namespace (and not attached):
## [1] BBmisc_1.7 BatchJobs_1.3
## [3] BiocParallel_0.6.1 Formula_1.1-2
## [5] GenomicAlignments_1.0.6 Gviz_1.8.4
## [7] Hmisc_3.14-5 KernSmooth_2.23-13
## [9] R.methodsS3_1.6.1 RColorBrewer_1.0-5
## [11] RCurl_1.95-4.3 ROCR_1.0-5
## [13] Rcpp_0.11.2 XML_3.98-1.1
## [15] acepack_1.3-3.3 annotate_1.42.1
## [17] base64enc_0.1-2 biglm_0.9-1
## [19] biomaRt_2.20.0 biovizBase_1.12.3
## [21] bit_1.1-12 bitops_1.0-6
## [23] brew_1.0-6 caTools_1.17.1
## [25] checkmate_1.4 cluster_1.15.3
## [27] codetools_0.2-9 colorspace_1.2-4
## [29] dichromat_2.0-0 digest_0.6.4
## [31] evaluate_0.5.5 fail_1.2
## [33] ff_2.2-13 foreach_1.4.2
## [35] foreign_0.8-61 formatR_1.0
## [37] gdata_2.13.3 genefilter_1.46.1
## [39] gplots_2.14.2 grid_3.1.0
## [41] gtools_3.4.1 hexbin_1.27.0
## [43] iterators_1.0.7 knitr_1.6
## [45] lattice_0.20-29 latticeExtra_0.6-26
## [47] matrixStats_0.10.0 munsell_0.4.2
## [49] nnet_7.3-8 plyr_1.8.1
## [51] reshape2_1.4 rpart_4.1-8
## [53] rtracklayer_1.24.2 scales_0.2.4
## [55] sendmailR_1.2-1 stats4_3.1.0
## [57] stringr_0.6.2 tools_3.1.0
## [59] xtable_1.7-4 zlibbioc_1.10.0