| get_genome_fasta {ORFik} | R Documentation |
This function automatically downloads (if the files do not already exist)
genomes and contaminants specified for genome alignment.
It will create an R transcript database (TxDb object) from the annotation.
It will also index the genome for you.
If you misspelled something or the run crashed, delete the wrong files and
run again.
Set remake = TRUE to do it all over again.
get_genome_fasta(genome, output.dir, organism, assembly_type, db, gunzip)
genome |
logical, default: TRUE, download genome of organism
specified in "organism" argument. If FALSE, check if the downloaded
file already exists. If you want to use a custom gtf from your hard drive,
set GTF = FALSE,
and assign: |
output.dir |
directory to save downloaded data |
organism |
scientific name of organism, e.g. Homo sapiens,
Danio rerio, Mus musculus, etc. See |
assembly_type |
a character string specifying the assembly type
from which the genome shall be retrieved (ensembl only, else this argument is ignored):
Default is
|
db |
database to use for genome and GTF, default (advised): "ensembl" (remember to set assembly_type to "primary_assembly", otherwise the download will contain haplotypes and be a very large file). Alternatives: "refseq" (primary assembly) and "genbank" (mix) |
gunzip |
logical, default TRUE, uncompress downloaded files that are zipped when downloaded, should be TRUE! |
If you want a custom genome or gtf from your hard drive, assign it
after you run this function, like this:
annotation <- getGenomeAndAnnotation(GTF = FALSE, genome = FALSE)
annotation["genome"] = "path/to/genome.fasta"
annotation["gtf"] = "path/to/gtf.gtf"
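Before passing custom paths on, it can help to confirm that they point at real files. The following is a minimal base R sketch, not part of ORFik: the temporary files are stand-ins for your own genome and gtf, and the check itself is only an illustration.

```r
# Temporary files stand in for a real genome and gtf on disk
# (assumption: replace these with your own paths).
genome_path <- tempfile(fileext = ".fasta")
gtf_path <- tempfile(fileext = ".gtf")
writeLines(c(">chr1", "ACGT"), genome_path)  # only the genome file is created

# Same shape as the vector assigned above: a named character vector of paths
annotation <- c(gtf = gtf_path, genome = genome_path)

# Hypothetical check (not an ORFik function): which entries exist on disk?
found <- file.exists(annotation)
names(found) <- names(annotation)
if (!all(found)) {
  message("Missing file(s): ", paste(names(found)[!found], collapse = ", "))
}
```

Here the gtf entry is reported as missing because only the genome stand-in was written to disk.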
a named character vector of paths to the downloaded genome and gtf, and additional contaminants if used. If merge_contaminants is TRUE, individual fasta files for the contaminants are not returned, only the merged one.
Other STAR:
STAR.align.folder(),
STAR.align.single(),
STAR.allsteps.multiQC(),
STAR.index(),
STAR.install(),
STAR.multiQC(),
STAR.remove.crashed.genome(),
install.fastp()
## Get Saccharomyces cerevisiae genome and gtf (create txdb for R)
#getGenomeAndAnnotation("Saccharomyces cerevisiae", tempdir(), assembly_type = "toplevel")
## Get Danio rerio genome and gtf (create txdb for R)
#getGenomeAndAnnotation("Danio rerio", tempdir())
output.dir <- "/Bio_data/references/zebrafish"
## Get Danio rerio and Phix contaminants to deplete during alignment
#getGenomeAndAnnotation("Danio rerio", output.dir, phix = TRUE)
## Optimize for ORFik (speed up for large annotations like human or zebrafish)
#getGenomeAndAnnotation("Danio rerio", tempdir(), optimize = TRUE)
## How to rescue malformed refseq gffs:
## First run function and let it crash:
#annotation <- getGenomeAndAnnotation(organism = "Arabidopsis thaliana", output.dir = "~/Desktop/test_plant/",
# assembly_type = "primary_assembly", db = "refseq")
## Then apply a fix (example for linux, too long rows):
# system("cat ~/Desktop/test_plant/Arabidopsis_thaliana_genomic_refseq.gff | awk '{ if (length($0) < 32768) print }' > ~/Desktop/test_plant/Arabidopsis_thaliana_genomic_refseq_trimmed.gff")
## Then update the arguments:
annotation <- c("~/Desktop/test_plant/Arabidopsis_thaliana_genomic_refseq_trimmed.gff",
"~/Desktop/test_plant/Arabidopsis_thaliana_genomic_refseq.fna")
names(annotation) <- c("gtf", "genome")
# Make the txdb (for faster R use)
# makeTxdbFromGenome(annotation["gtf"], annotation["genome"], organism = "Arabidopsis thaliana")
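The awk fix above is Linux specific. A rough base R equivalent, which should work on any platform, is sketched below. The helper name trim_long_lines is hypothetical (it is not an ORFik function), and the sketch reads the whole file into memory, which is an assumption that may not hold for very large annotations.

```r
# Hypothetical helper (not part of ORFik): copy infile to outfile,
# keeping only rows shorter than max_len bytes (same cutoff as the awk call).
trim_long_lines <- function(infile, outfile, max_len = 32768L) {
  lines <- readLines(infile, warn = FALSE)
  keep <- nchar(lines, type = "bytes") < max_len
  writeLines(lines[keep], outfile)
  invisible(sum(!keep))  # number of rows dropped
}

# Small self-contained demo on temporary files
infile <- tempfile(fileext = ".gff")
outfile <- tempfile(fileext = ".gff")
writeLines(c("short row", strrep("A", 40000), "another short row"), infile)
dropped <- trim_long_lines(infile, outfile)
```

For the Arabidopsis example, infile would be the downloaded refseq gff and outfile the trimmed gff that is then assigned to annotation["gtf"].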