| cellCounts {Rsubread} | R Documentation |
Process raw 10X scRNA-seq data and generate UMI counts for each gene in each cell.
cellCounts(
    # input data
    index,
    sample,
    input.mode = "BCL",
    cell.barcode = NULL,

    # specify the aligner used for read mapping
    aligner = "align",

    # parameters used by featureCounts for assigning and counting UMIs
    annot.inbuilt = "mm10",
    annot.ext = NULL,
    isGTFAnnotationFile = FALSE,
    GTF.featureType = "exon",
    GTF.attrType = "gene_id",
    useMetaFeatures = TRUE,

    # user provided UMI cutoff for cell calling
    umi.cutoff = NULL,

    # number of threads
    nthreads = 10,

    # dealing with multi-mapping reads in the alignment step
    nBestLocations = 1,
    unique.mapping = FALSE,

    # other parameters passed to align, subjunc and featureCounts functions
    ...)
index |
A character string providing the base name of index files created for a reference genome by the |
sample |
A data frame or a character string providing sample-related information, including location of the data, sample names and index set names. See the Details section below for more details. |
input.mode |
A character string specifying the input mode. The supported input modes include |
cell.barcode |
A character string giving the name of a text file (can be gzipped) that contains the set of cell barcodes used in sample preparation. If |
aligner |
Specify the name of the aligner used for read mapping. Currently only the |
annot.inbuilt |
Specify an inbuilt annotation for UMI counting. See |
annot.ext |
Specify an external annotation for UMI counting. See |
isGTFAnnotationFile |
See |
GTF.featureType |
See |
GTF.attrType |
See |
useMetaFeatures |
Specify whether UMI counting should be carried out at the meta-feature level (e.g. gene level). See |
umi.cutoff |
Specify a UMI count cutoff for cell calling. All the cells with a total UMI count greater than this cutoff will be called. If |
nthreads |
A numeric value giving the number of threads used for read mapping and counting. |
nBestLocations |
A numeric value giving the maximum number of reported alignments for each multi-mapping read. |
unique.mapping |
A logical value specifying whether multi-mapping reads should not be reported as mapped (i.e. only uniquely mapped reads are reported). |
... |
other parameters passed to |
This function takes as input scRNA-seq reads generated by the 10X platform, maps them to the reference genome and then produces UMI (Unique Molecular Identifier) counts for each gene in each cell.
It uses the align read mapping function and the featureCounts quantification function, both included in this package.
Sample demultiplexing, cell barcode demultiplexing and read deduplication are carried out before generating the UMI counts.
cellCounts can process multiple datasets at the same time.
The sample information should be provided to cellCounts via the sample parameter.
If the input format is BCL (i.e. input.mode="BCL"), the provided sample information should include the location where the read data are stored, the flowcell lanes used for sequencing, the sample names and the names of the index sets used for indexing samples.
This information should be saved to a data.frame object and then provided to the sample parameter.
An example of this data frame is shown below:
InputDirectory     Lane  SampleName  IndexSetName
/path/to/dataset1  1     Sample1     SI-GA-E1
/path/to/dataset1  1     Sample2     SI-GA-E2
/path/to/dataset1  2     Sample1     SI-GA-E1
/path/to/dataset1  2     Sample2     SI-GA-E2
/path/to/dataset2  1     Sample3     SI-GA-E3
/path/to/dataset2  1     Sample4     SI-GA-E4
/path/to/dataset2  2     Sample3     SI-GA-E3
/path/to/dataset2  2     Sample4     SI-GA-E4
...
It is compulsory to have the four column headers shown in the example above when generating this data frame for a 10X dataset.
If more than one dataset is provided for analysis, the InputDirectory column should include more than one distinct directory.
Note that this data frame is different from the Sample Sheet generated by the Illumina sequencer.
The cellCounts function uses the index set names included in this data frame to generate an Illumina Sample Sheet and then uses it to demultiplex all the samples.
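The BCL-mode data frame described above could be built in R roughly as follows. This is a minimal sketch that reuses the placeholder paths, sample names and index set names from the example table; the object name sample.sheet and the index base name "my_index" are hypothetical. The cellCounts call is left commented out because it needs real BCL data and an index built beforehand.

```r
# Build the sample data frame for input.mode = "BCL".
# All values are the placeholders from the example table, not real data.
sample.sheet <- data.frame(
  InputDirectory = rep(c("/path/to/dataset1", "/path/to/dataset2"), each = 4),
  Lane           = rep(c(1, 1, 2, 2), times = 2),
  SampleName     = c("Sample1", "Sample2", "Sample1", "Sample2",
                     "Sample3", "Sample4", "Sample3", "Sample4"),
  IndexSetName   = c("SI-GA-E1", "SI-GA-E2", "SI-GA-E1", "SI-GA-E2",
                     "SI-GA-E3", "SI-GA-E4", "SI-GA-E3", "SI-GA-E4"),
  stringsAsFactors = FALSE
)

# Check that the four compulsory column headers are present.
required <- c("InputDirectory", "Lane", "SampleName", "IndexSetName")
stopifnot(all(required %in% colnames(sample.sheet)))

# Two distinct directories means two datasets are analysed together.
n.datasets <- length(unique(sample.sheet$InputDirectory))

# Not run: requires real data ("my_index" is a hypothetical index base name).
# res <- cellCounts(index = "my_index", sample = sample.sheet, input.mode = "BCL")
```

Checking the column headers before the call is a design choice of this sketch, intended to catch a misspelled header before a long mapping run starts.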
If the input format is FASTQ, a data.frame object containing the following three columns, BarcodeUMIFile, ReadFile and SampleName, should be provided to the sample parameter.
Each row in the data frame represents a sample.
The ReadFile column includes names of FASTQ files that contain read data for the samples.
Each FASTQ file corresponds to a sample.
The read data included in these FASTQ files only contain genomic sequences of the reads.
The cell barcode and UMI sequences of these reads can be found in the corresponding FASTQ files included in the BarcodeUMIFile column.
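For FASTQ input, the three-column data frame might be put together as sketched below. The file names and the object name fastq.samples are hypothetical placeholders; only the column names BarcodeUMIFile, ReadFile and SampleName come from the description above.

```r
# Sample data frame for input.mode = "FASTQ": one row per sample.
# All file names below are hypothetical placeholders.
fastq.samples <- data.frame(
  BarcodeUMIFile = c("sample1_R1.fastq.gz", "sample2_R1.fastq.gz"),
  ReadFile       = c("sample1_R2.fastq.gz", "sample2_R2.fastq.gz"),
  SampleName     = c("Sample1", "Sample2"),
  stringsAsFactors = FALSE
)

# Each row represents one sample, so sample names should not repeat.
stopifnot(anyDuplicated(fastq.samples$SampleName) == 0)

# List any files that cannot be found before starting a long run.
all.files     <- c(fastq.samples$BarcodeUMIFile, fastq.samples$ReadFile)
missing.files <- all.files[!file.exists(all.files)]

# Not run: requires real data ("my_index" is a hypothetical index base name).
# res <- cellCounts(index = "my_index", sample = fastq.samples, input.mode = "FASTQ")
```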
Finally, if the input format is FASTQ-dir, a character string giving the path to the directory where the FASTQ-format read data are stored should be provided to the sample parameter.
The data in this directory are expected to be generated by the bcl2fastq program or the bamtofastq program (a program developed by 10X).
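A FASTQ-dir run could be wrapped as in the sketch below. The wrapper name run.fastq.dir, the directory path and the index base name are hypothetical; the wrapper only passes arguments that appear in the usage of cellCounts, and it skips the call when the package or the directory is unavailable.

```r
# Hypothetical wrapper: run cellCounts on a directory of FASTQ files,
# but only if Rsubread is installed and the directory exists.
run.fastq.dir <- function(index.base, fastq.dir, nthreads = 4) {
  stopifnot(is.character(fastq.dir), length(fastq.dir) == 1)
  if (!requireNamespace("Rsubread", quietly = TRUE) || !dir.exists(fastq.dir)) {
    message("Skipping: Rsubread or the FASTQ directory is not available.")
    return(invisible(NULL))
  }
  Rsubread::cellCounts(index = index.base, sample = fastq.dir,
                       input.mode = "FASTQ-dir", nthreads = nthreads)
}

# Placeholder values; with a non-existent directory the call is skipped.
res <- run.fastq.dir("my_index", "/path/to/fastq_dir")
```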
The cellCounts function returns a List object to R.
It also outputs three gzipped FASTQ files and one BAM file for each sample.
The three gzipped FASTQ files include cell barcode and UMI sequences (R1), sample index sequences (I1) and the actual genomic sequences of the reads (R2), respectively.
The BAM file includes location-sorted read mapping results.
The returned List object contains the following components:
counts |
a |
annotation |
a |
sample.info |
a |
cell.confidence |
a |
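A quick check that a returned object carries the four components listed above might look like the following sketch. The helper missing.components and the mock object are hypothetical; the mock only mimics component names and says nothing about the actual contents returned by cellCounts.

```r
# Component names taken from the list above.
expected <- c("counts", "annotation", "sample.info", "cell.confidence")

# Hypothetical helper: report which expected components are absent.
missing.components <- function(res) setdiff(expected, names(res))

# Mock object for illustration only; a real result comes from cellCounts().
mock <- list(counts = NULL, annotation = NULL)
absent <- missing.components(mock)
print(absent)
```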
Yang Liao and Wei Shi
buildindex, align, featureCounts