| IdTaxa {DECIPHER} | R Documentation |
Classifies sequences according to a training set by assigning a confidence to taxonomic labels for each taxonomic level.
IdTaxa(test,
trainingSet,
type = "extended",
strand = "both",
threshold = 60,
bootstraps = 100,
samples = L^0.47,
minDescend = 0.98,
fullLength = 0,
processors = 1,
verbose = TRUE)
test |
An |
trainingSet |
An object of class |
type |
Character string indicating the type of output desired. This should be (an abbreviation of) one of |
strand |
Character string indicating the orientation of the |
threshold |
Numeric specifying the confidence at which to truncate the output taxonomic classifications. Lower values of |
bootstraps |
Integer giving the maximum number of bootstrap replicates to perform for each sequence. The number of bootstrap replicates is set automatically such that (on average) 99% of k-mers are sampled in each |
samples |
A function or call written as a function of ‘L’, which will evaluate to a numeric vector the same length as ‘L’. Typically of the form “ |
minDescend |
Numeric giving the minimum fraction of |
fullLength |
Numeric specifying the fold-difference in sequence lengths between sequences in |
processors |
The number of processors to use, or |
verbose |
Logical indicating whether to display progress. |
Sequences in test are each assigned a taxonomic classification based on the trainingSet created with LearnTaxa. Each taxonomic level is given a confidence between 0% and 100%, and the taxonomy is truncated where confidence drops below threshold. If the taxonomic classification was truncated, the last group is labeled with “unclassified_” followed by the final taxon's name. Note that the reported confidence is not a p-value but does directly relate to a given classification's probability of being wrong. The default threshold of 60% is intended to minimize the rate of incorrect classifications. Lower values of threshold (e.g., 50%) may be preferred to increase the taxonomic depth of classifications. Values of 60% or 50% are recommended for nucleotide sequences and 50% or 40% for amino acid sequences.
If type is "extended" (the default) then an object of class Taxa and subclass Train is returned. This is stored as a list with elements corresponding to their respective sequence in test. Each list element contains components:
taxon |
A character vector containing the taxa to which the sequence was assigned. |
confidence |
A numeric vector giving the corresponding percent confidence for each taxon. |
rank |
If the classifier was trained with a set of |
If type is "collapsed" then a character vector is returned with the taxonomic assignment for each sequence. This takes the repeating form “Taxon name [rank, confidence%]; ...” if ranks were supplied during training, or “Taxon name [confidence%]; ...” otherwise.
Erik Wright eswright@pitt.edu
Murali, A., et al. (2018). IDTAXA: a novel approach for accurate taxonomic classification of microbiome sequences. Microbiome, 6, 140. https://doi.org/10.1186/s40168-018-0521-5
data("TrainingSet_16S")
# import test sequences
fas <- system.file("extdata", "Bacteria_175seqs.fas", package="DECIPHER")
dna <- readDNAStringSet(fas)
# remove any gaps in the sequences
dna <- RemoveGaps(dna)
# classify the test sequences
ids <- IdTaxa(dna, TrainingSet_16S, strand="top")
ids
# view the results
plot(ids, TrainingSet_16S)