This pipeline will import data from single-cell preprocessing algorithms (e.g. CellRanger, HCA Optimus, Alevin), generate various quality control metrics (e.g. general metrics, doublet scores, contamination estimates) using multiple tools, and output results in standard data containers (e.g. SingleCellExperiment, Seurat object, AnnData).
For data generated with microfluidic devices, the first major step after UMI counting is to detect cell barcodes that represent droplets containing a true cell and exclude empty droplets that only contain ambient RNA.
The Droplet and Cell matrices have also been called “raw” and “filtered” matrices, respectively, by tools such as CellRanger. However, the term “filtered” can be ambiguous because other forms of cell filtering can be applied beyond removing empty droplets (e.g. excluding poor-quality cells based on a low number of UMIs). Both the original droplet matrix and the filtered cell matrix can be QC’ed in this pipeline. However, QC of the droplet matrix is specific to single-cell data generated with microfluidic devices (e.g. 10X).
To run the pipeline, users can install the singleCellTK (SCTK) package along with the Python dependencies it may use (mainly Scrublet and AnnData). Alternatively, users can run the Docker version of the pipeline, which is described in detail further down on this page.
A simple example of running this pipeline on the ‘Cell’ matrix generated by CellRanger V3 is shown below:
Rscript SCTK_runQC.R \
  -b /base/path \
  -P CellRangerV3 \
  -s SampleName \
  -o Output_Directory \
  -S TRUE \
  -F SCE,AnnData,FlatFile,Seurat \
  -g /Path_to_gmt/name_of_gmt_file.gmt \
  -d Cell \
  -n 2 \
  -T MulticoreParam
This pipeline supports several ways of importing CellRangerV2/CellRangerV3 data for flexibility. The pipeline is also compatible with datasets generated by other algorithms. Please refer to the section Importing data from different preprocessing tools for more details.
Users can quantify the expression of mitochondrial genes by passing a GMT file containing mitochondrial genes (with the -g/--gmt argument). Users can also quantify the expression of mitochondrial genes for a human or mouse dataset by setting the -M/--detectMitoLevel argument. Please refer to the section Gene sets for more details.
The pipeline also provides various parameters to control the quality control process. Please refer to the section Parameters for more details.
The Docker image can be obtained by running:
docker pull campbio/sctk_qc:2.8.1
The usage of each argument is the same as in the command line analysis. Here is an example that performs QC on CellRangerV3 data with the SCTK Docker image:
docker run --rm -v /path/to/data:/SCTK_docker \
  -it campbio/sctk_qc:2.8.1 \
  -b /SCTK_docker/cellranger_folder \
  -P CellRangerV3 \
  -s SampleName \
  -o /SCTK_docker/Output_Directory \
  -g /SCTK_docker/name_of_gmt_file.gmt \
  -S TRUE \
  -F SCE,AnnData,FlatFile,Seurat
The Docker container cannot access files on your host file system by default. To give it access to files on your machine, set up a mounted volume. Note that the transcriptome data and GMT file need to be accessible to the container through a mounted volume. In the above example, a mounted volume is enabled for accessing the input and output directories using the argument -v. The transcriptome and GMT files stored under /path/to/data on your machine are then available in the /SCTK_docker folder inside the container. To learn more about mounted volumes, please check out this post.
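The mapping between the two sides of -v can be illustrated with a small shell sketch. The function name to_container_path is hypothetical (it is not part of SCTK or Docker). It only rewrites a path string the way the mount in the example above would expose it, so it can help when filling in the -b, -o and -g arguments.

```shell
# Hypothetical helper (not part of SCTK or Docker): translate a host path that
# lives under the mounted host directory into the path seen inside the container.
to_container_path() {
  local host_root="${1%/}"       # host side of -v, e.g. /path/to/data
  local container_root="${2%/}"  # container side of -v, e.g. /SCTK_docker
  local host_path="$3"
  case "$host_path" in
    "$host_root")   printf '%s\n' "$container_root" ;;
    "$host_root"/*) printf '%s\n' "$container_root${host_path#"$host_root"}" ;;
    *) echo "error: $host_path is not under $host_root" >&2; return 1 ;;
  esac
}

# Prints /SCTK_docker/cellranger_folder
to_container_path /path/to/data /SCTK_docker /path/to/data/cellranger_folder
```

A path outside the mounted directory makes the function return a non-zero status, which mirrors the fact that the container cannot see such files.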
Please refer to the section Parameters for more details about parameters.
singularity pull docker://campbio/sctk_qc:2.8.1
The usage of the singleCellTK Singularity image is very similar to that of the Docker image. In Singularity 3.0+, the mount volume is automatically overlaid.
It’s recommended to reset the home directory when you run Singularity. Singularity mounts the $HOME path of your file system by default, which might contain your personal R/Python library folder. If the home directory is not reset, Singularity will try to use R/Python libraries that were not built within the Singularity image, which can cause conflicts. You can point to a “sanitized home”, different from the $HOME path on your machine, using the argument --home (see more information). Alternatively, you can disable the $HOME binding by setting the argument --no-home. In addition, you can use the argument -B to specify your own mount volume, which is the path that contains the dataset and will be used to store the output of the QC pipeline. An example is shown below:
singularity run --home=/PathToSanitizedHome \
  --bind /PathToData:/SCTK_docker sctk_qc_2.8.1.sif \
  -b /SCTK_docker/cellranger_folder \
  -P CellRangerV3 \
  -s SampleName \
  -o /SCTK_docker/Output_Directory \
  -g /SCTK_docker/name_of_gmt_file.gmt \
  -S TRUE \
  -F SCE,AnnData,FlatFile,Seurat
One important note about this Docker image: please run it on a machine or node whose CPU has one of the following architectures: broadwell, haswell, skylake, cascadelake or the latest architecture. This avoids the “illegal operation” error from the Scrublet package, because this Python package is compiled with SIMD instructions that are compatible with these CPU architectures. Please specify the CPU architecture in the script header, after #$ -l cpu_arch=, as one of the architectures listed above. An example is shown below:
#!/bin/bash
#$ -cwd
#$ -j y
#$ -pe omp 16
#$ -l cpu_arch=broadwell|haswell|skylake|cascadelake

### this also works for 'docker run'
singularity run --home=/PathToSanitizedHome \
  --bind /PathToData:/SCTK_docker sctk_qc_2.8.1.sif \
  -b /SCTK_docker/cellranger_folder \
  -P CellRangerV3 \
  -s SampleName \
  -o /SCTK_docker/Output_Directory \
  -g /SCTK_docker/name_of_gmt_file.gmt \
  -S TRUE \
  -F SCE,AnnData,FlatFile,Seurat
The pipeline contains various parameters to control the process of quality control. The function of each parameter is shown below:
The required arguments are as follows:
||Base path for the output from the preprocessing algorithm.|
||Algorithm used for preprocessing. One of
||Name of the sample. This will be prepended to the cell barcodes.|
||Output directory. A new subdirectory will be created with the name “sample”. R, Python, and FlatFile directories will be created under the “sample” directory containing the data containers with QC metrics. Default
||The output format of this QC pipeline. Currently, it supports
The optional arguments are as follows. Their usage depends on the type of data and user-defined behaviour.
||GMT file containing gene sets for quality control.|
||Delimiter used in GMT file. Default
||The name of genome reference. This is only required for CellRangerV2 data.|
||YAML file used to specify parameters of QC functions called by SCTK-QC pipeline. Please check Specify parameters using yaml file section for details.|
||The full path of the RDS file or Matrix file of the cell matrix. This is used only when
||The full path of the RDS file or Matrix file of the droplet matrix. This is used only when
||The directory containing
||The directory containing
||Type of data as input. Default is
||Detect cells from droplet matrix. Default is
||Methods to detect cells from droplet matrix. Default is
||Number of cores used to run the pipeline. Default is
||Type of backend used for parallel computing. Parallel computing in this pipeline depends on the BiocParallel package. Default is
||The TXT file containing the description of the study design. Default is
||The subtitle used in the cell and droplet QC HTML report. Default is
||Detect mitochondrial gene expression level. If
||Type of mitochondrial gene-set to be loaded when
Users can specify parameters for the QC algorithms in this pipeline with a YAML file (supplied with the --yamlFile argument). The currently supported QC algorithms include doublet detection (scrublet), decontamination (decontX), emptyDrop detection (emptyDrops) and barcodeRankDrops (barcodeRanks). A summary of each function is shown below:
An example QC parameter YAML file is shown below:
---
Params:  ### should not be omitted
  bcds:
    ntop: 600
  cxds:
    ntop: 600
  cxds_bcds_hybrid:
    nTop: 600
  decontX:
    maxIter: 600
  emptyDrops:
    lower: 50
    niters: 5000
    testAmbient: True
  barcodeRanks:
    lower: 50
The format of the YAML file can be found here. The parameters should be consistent with the parameters of each QC function in SCTK. Parameters that are not defined in this YAML file will use their default values. Please refer to the reference for detailed information about the arguments of each QC function.
The SCTK-QC pipeline supports parallel computing to speed up the analysis. Parallel computing is enabled by setting --numCores, the number of cores used by the pipeline, to a value greater than 1.
Parallel computing is backed by the BiocParallel package, so users can select the type of parallel evaluation by setting the --parallelType argument. Default is
SnowParam are supported for the -T argument. However, MulticoreParam is not supported on Windows. Windows users can choose SnowParam as the parallel computing backend.
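The platform rule above can be sketched in shell for wrapper scripts that run on several machines. The function name pick_parallel_type is hypothetical, and treating MINGW, MSYS and CYGWIN as the Windows values of uname -s is an assumption about common Windows shell environments rather than something SCTK itself checks.

```shell
# Hypothetical helper (not part of SCTK): pick a value for --parallelType
# from the operating system name reported by `uname -s`.
pick_parallel_type() {
  case "$1" in
    MINGW*|MSYS*|CYGWIN*) echo "SnowParam" ;;      # assumed Windows shells
    *)                    echo "MulticoreParam" ;; # Linux, macOS, other Unix
  esac
}

parallel_type="$(pick_parallel_type "$(uname -s)")"
# Print the arguments that would be appended to the SCTK_runQC.R call.
echo "--numCores 2 --parallelType ${parallel_type}"
```

The sketch only prints the two arguments, so the result can be inspected before it is added to an actual command.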
The -d argument specifies which type of count matrix is used in the pipeline. Default is Both, which means the pipeline will run quality control on both the droplet and cell count data.
Users can also choose to run SCTK-QC pipeline on only the droplet count or cell count matrix, instead of running on both. In this case, the pipeline will only take the single input and perform QC on it. An example is shown below:
Rscript SCTK_runQC.R \
  -b /base/path \
  -P Preprocessing_Algorithm \
  -s SampleName \
  -o Output_Directory \
  -S TRUE \
  -F SCE,AnnData,FlatFile,Seurat \
  -d Droplet \
  -D TRUE \
  -m EmptyDrops
When the -d argument is set to Droplet, the QC pipeline takes only the droplet count matrix as input and performs quality control on it. You can choose whether to detect cells from the droplet matrix by setting
TRUE. If so, the cell count matrix will be detected, and the pipeline will also perform quality control on this matrix and output the result. You can further define the method used to detect cells from the droplet matrix by setting
-m could be one of
EmptyDrops will keep cells that pass the runEmptyDrops() function test.
Inflection will keep cells that pass the knee or inflection point returned from
When the -d argument is set to Cell, the QC pipeline takes only the cell count matrix as input and performs quality control on it. A figure showing the analysis steps and outputs for the different inputs is shown below:
This pipeline supports several ways of importing CellRangerV2/CellRangerV3 data for flexibility.
Rscript SCTK_runQC.R \
  -b /base/path \
  -P CellRangerV3 \
  -s SampleName \
  -o Output_Directory \
  -S TRUE \
  -F SCE,AnnData,FlatFile,Seurat
For CellRangerV2, the reference used by cellranger needs to be specified by
Rscript SCTK_runQC.R \
  -b /base/path \
  -P CellRangerV2 \
  -s SampleName \
  -o Output_Directory \
  -S TRUE \
  -G GenomeName \
  -F SCE,AnnData,FlatFile,Seurat
-b specifies the base path and usually it’s the output folder of 10x
-s specifies the sample name, which has to be the same as the name of the sample folder under the base folder. The folder layout would look like the following:
├── BasePath
    └── SampleName
        ├── outs
        |   ├── filtered_feature_bc_matrix
        |   |   ├── barcodes.tsv.gz
        |   |   ├── features.tsv.gz
        |   |   └── matrix.mtx.gz
        |   ├── raw_feature_bc_matrix
        |   |   ├── barcodes.tsv.gz
        |   |   ├── features.tsv.gz
        |   |   └── matrix.mtx.gz
        ...
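Before launching the pipeline, the layout above can be checked with a short shell sketch. The function check_cellranger_v3_layout is hypothetical and not part of SCTK. It only tests that the six files listed in the tree exist under BasePath/SampleName/outs and says nothing about their contents.

```shell
# Hypothetical pre-flight check (not part of SCTK): confirm that
# <base>/<sample>/outs contains both matrix folders with their three files.
check_cellranger_v3_layout() {
  local base="$1" sample="$2" dir file missing=0
  for dir in filtered_feature_bc_matrix raw_feature_bc_matrix; do
    for file in barcodes.tsv.gz features.tsv.gz matrix.mtx.gz; do
      if [ ! -f "$base/$sample/outs/$dir/$file" ]; then
        echo "missing: $base/$sample/outs/$dir/$file" >&2
        missing=1
      fi
    done
  done
  return "$missing"
}

# Placeholder arguments taken from the examples on this page.
if check_cellranger_v3_layout /base/path SampleName; then
  echo "layout looks complete"
else
  echo "layout incomplete, see messages above"
fi
```

The values passed to the function correspond to the -b and -s arguments, so a successful check suggests that those two arguments point at the right folder.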
If the cellranger-count output has been moved out of the default cellranger output directory, you can specify the paths to the droplet and cell count matrices using arguments
Rscript SCTK_runQC.R \
  -P CellRangerV2 \
  -C /path/to/cell/matrix \
  -R /path/to/droplet/matrix \
  -s SampleName \
  -o Output_Directory \
  -S TRUE \
  -F SCE,AnnData,FlatFile,Seurat
In this case, you must skip the -b argument, and you can also skip the -G argument for CellRangerV2 data.
If your data is stored as a SingleCellExperiment object in an RDS file, singleCellTK also supports this type of input. To run quality control with an RDS file as input, run the following code:
Rscript SCTK_runQC.R \
  -P SceRDS \
  -s Samplename \
  -o Output_Directory \
  -S TRUE \
  -F SCE,AnnData,FlatFile,Seurat \
  -r /path/to/rds/file/droplet.RDS \
  -c /path/to/rds/file/cell.RDS
Rscript SCTK_runQC.R \
  -P CountMatrix \
  -s Samplename \
  -o Output_Directory \
  -S TRUE \
  -F SCE,AnnData,FlatFile,Seurat \
  -r /path/to/matrix/file/droplet.txt \
  -c /path/to/matrix/file/cell.txt
SCTK-QC pipeline allows importing data from the following pre-processing tools or objects:
If your data was preprocessed by other algorithms, make sure that the ‘-b’ argument matches the path storing the data and that the ‘-P’ argument matches the preprocessing tool. The general template is shown below:
Rscript SCTK_runQC.R \
  -b /base/path \
  -P Preprocessing_Algorithm \
  -s SampleName \
  -o Output_Directory \
  -S TRUE \
  -F SCE,AnnData,FlatFile,Seurat
The following table describes how SCTK expects the inputs to be structured and passed for each import function. In all cases, SCTK retains the standard output directory structure from upstream tools. All the import functions return the imported counts matrix as an assay in a SingleCellExperiment object, with associated information in respective
The table above also shows the R console functions for the QC algorithms. Detailed information about function parameters and defaults is available in the Reference section.
Quantifying the level of gene sets can be a useful quality control metric. For example, the percentage of counts from mitochondrial genes can be an indicator of cell stress or death.
Users can quantify the expression of mitochondrial genes for a human or mouse dataset by setting
TRUE. Users need to specify the correct --mitoType argument for the dataset. The SCTK-QC pipeline has built-in mitochondrial gene sets for human and mouse genes. It supports four different types of gene ID:
ensembl ID and ensembl transcript ID. Therefore, there are eight options for
To quantify the expression of other gene sets, users can pass a GMT file (with the --gmt argument) to the pipeline, with one row for each gene set. The first column should be the name of the gene set (e.g.
The second column for each gene set in the GMT file (i.e. the description) should contain the location in which to look for the matching IDs in the data. If set to rownames, the gene set IDs will be matched with the row IDs of the data matrix. If a character string or an integer index is supplied, the gene set IDs will be matched to the IDs in that column of the feature table. Gene sets with mitochondrial genes can be found here.
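As an illustration of the row format described above, the following shell sketch writes a one-row GMT file and summarizes it. The function name write_mito_gmt, the gene set name mito and the output file name are placeholders, and the tab delimiter is an assumption based on the usual GMT convention. The five human mitochondrial gene symbols are only a sample of a full set, and they match only if the row names of the data matrix are gene symbols.

```shell
# Hypothetical helper (not part of SCTK): write a one-row, tab-delimited GMT
# file. Column 1 is the gene set name, column 2 says to match IDs against the
# row names of the data matrix, and the remaining columns are the gene IDs.
write_mito_gmt() {
  local gmt_file="$1"
  printf '%s\t' "mito" "rownames" "MT-ND1" "MT-ND2" "MT-CO1" "MT-CO2" > "$gmt_file"
  printf '%s\n' "MT-CYB" >> "$gmt_file"
}

gmt_file="${TMPDIR:-/tmp}/mito_example.gmt"
write_mito_gmt "$gmt_file"

# Summarize each row as: set name, ID location, number of genes.
# Prints: mito rownames 5
awk -F '\t' '{ print $1, $2, NF - 2 }' "$gmt_file"
```

The resulting file could then be passed to the pipeline with the --gmt argument.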
The output directory is created under the path specified by the --directory argument. Each sample is stored in a subdirectory (named by the --sample argument) within this output directory. Within each sample directory, each output format is separated into its own subdirectory. The output file hierarchy is shown below:
(root; output directory)
├── level3Meta.csv
├── level4Meta.csv
├── sample1_dropletQC.html
├── sample1_cellQC.html
└── sample1
    ├── sample1_cellQC_summary.csv
    ├── R
    |   ├── sample1_Droplets.rds
    |   └── sample1_Cells.rds
    ├── Python
    |   ├── Droplets
    |   |   └── sample1.h5ad
    |   └── Cells
    |       └── sample1.h5ad
    ├── FlatFile
    |   ├── Droplets
    |   |   ├── assays
    |   |   |   └── sample1_counts.mtx.gz
    |   |   ├── metadata
    |   |   |   └── sample1_metadata.rds
    |   |   ├── sample1_colData.txt.gz
    |   |   └── sample1_rowData.txt.gz
    |   └── Cells
    |       ├── assays
    |       |   └── sample1_counts.mtx.gz
    |       ├── metadata
    |       |   └── sample1_metadata.rds
    |       ├── reducedDims
    |       |   ├── sample1_decontX_UMAP.txt.gz
    |       |   ├── sample1_scrublet_TSNE.txt.gz
    |       |   └── sample1_scrublet_UMAP.txt.gz
    |       ├── sample1_colData.txt.gz
    |       └── sample1_rowData.txt.gz
    └── sample1_QCparameters.yaml
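After a run, the hierarchy above can be used to check which data containers were written. The sketch below is a hypothetical helper, not part of SCTK. It lists six of the per-sample files from the tree and reports whether each exists. Which files are actually produced is assumed to depend on the formats and data type requested for the run, so a missing entry is not necessarily an error.

```shell
# Hypothetical helper (not part of SCTK): list some of the per-sample files
# from the output hierarchy for a given output directory and sample name.
list_expected_outputs() {
  local out="$1" s="$2"
  printf '%s\n' \
    "$out/$s/R/${s}_Droplets.rds" \
    "$out/$s/R/${s}_Cells.rds" \
    "$out/$s/Python/Droplets/$s.h5ad" \
    "$out/$s/Python/Cells/$s.h5ad" \
    "$out/$s/FlatFile/Droplets/assays/${s}_counts.mtx.gz" \
    "$out/$s/FlatFile/Cells/assays/${s}_counts.mtx.gz"
}

# Report each listed file as found or missing.
report_outputs() {
  local path
  list_expected_outputs "$1" "$2" | while IFS= read -r path; do
    if [ -e "$path" ]; then echo "found    $path"; else echo "missing  $path"; fi
  done
}

# Placeholder arguments matching the hierarchy shown above.
report_outputs Output_Directory sample1
```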