vignettes/articles/01_import_and_qc_tutorial.Rmd
01_import_and_qc_tutorial.Rmd
Single Cell Toolkit (singleCellTK, SCTK) is a package that works on single-cell RNA-seq (scRNAseq) dataset. SCTK allows users to import multiple datasets, perform quality control and a series of preprocessing, get clustering on cells and markers of clusters, and run various downstream analysis. Meanwhile, SCTK also wraps curated workflows for celda and Seurat.
The SCTK tutorial series start from importing, QC and filtering. After this, users can go through either the A La Carte Workflow or the Curated Workflows (Seurat or Celda). These tutorials will take the real-world scRNAseq dataset as an example, which consists of 2,700 Peripheral Blood Mononuclear Cells (PBMCs) collected from a healthy donor, namingly PBMC3K. This dataset is available from 10X Genomics and can be found on the 10X website.
Before starting the analysis from either UI or console, users should
have SCTK properly installed first. Please refer to the Installation documentation for detail of
installing the singleCellTK
package and relevant
dependencies. Alternatively, users can also access the functionality by
visiting our online web
server deployment https://sctk.bu.edu/, or using a stand-alone docker image.
SCTK allows importing dataset from various type of sources, such as preprocessing tools like cellranger or flat text files. SCTK also allows importing multiple batches (samples) of dataset at the same time. Please refer to the detailed documentation of Importing for different ways of importing. This tutorial only shows how we can import the example PBMC3K data.
After starting the UI, the landing page will be for importing the scRNAseq data. To import the PBMC3K dataset previously mentioned, users should follow these steps:
After successfully importing, the third collapse box will pop up and
show users the basic summary stats of the imported dataset. Meanwhile,
users can set feature display options here. Most of the time, the
dataset has default feature ID (usually seen as row names of a matrix)
together with other types of ID (e.g. symbol) in the feature metadata.
The first option “Set feature ID” sets the type of
default feature ID, which should be unique and has no NA
value. The second option “Set feature names to be displayed in
downstream visualization” does what it says. When we need to
show features on a plot, such as a heatmap or a volcano plot, usually it
is better to show gene symbols rather than ensembl IDs.
Here we use importExampleData()
to load PBMC3k data from
the Bioconductor package TENxPBMCData.
library(singleCellTK)
sce <- importExampleData("pbmc3k")
An SingleCellExperiment
(SCE) object is returned as the data container. An SCE object is
designed for storing and manipulating expression matrices, gene/cell
metadata, low-dimension representation and unstructured information. All
the methods wrapped by SCTK will be performed on the SCE object.
Importing CellRanger Output Data
Here, we briefly introduce the approach to importing the output of
the widely used preprocessing tool, cellranger
. SCTK has a
generic function importCellRanger()
for this purpose, and,
explicitly, importCellRangerV2()
and
importCellRangerV3()
for different versions of
cellranger
. For the detail of these functions, please click
on the function names to be redirected to the reference page.
The input arguments basically asks users what the exact paths of the
input data files are (i.e. "matrix.mtx"
,
"features.tsv"
, and "barcodes.tsv"
). They are
cellRangerDirs
, sampleDirs
,
cellRangerOuts
, matrixFileNames
,
featuresFileNames
and barcodesFileNames
. And
the function will identify the specified path, for example, of the
barcode file, as a combination of:
{cellRangerDirs}/{sampleDirs}/{cellRangerOuts}/{barcodesFileNames}
.
Theses functions automatically try to recognize a preset substucture of
cellRangerDirs
, so that in most of the cases, users only
need to specify cellRangerDirs
to tell where the top level
directory of the output is. However, sometimes the three essential files
may be placed or named in a different way and the default detection
method won’t find them. In this case, users will need to check the exact
paths and manually specify the correct input according to the
combination rule above and the error messages.
An example folder structure:
./datasets/
sample1/
outs/filtered_feature_bc_matrix/
barcodes.tsv.gz
features.tsv.gz
matrix.mtx.gz
sample2/
outs/filtered_feature_bc_matrix/
barcodes.tsv.gz
features.tsv.gz
matrix.mtx.gz
./otherCellRangerData/
barcodes.tsv
genes.tsv
matrix.mtx
# Default use case
sce <- importCellRanger(cellRangerDirs = "dataset")
# In case the three files are placed in a different way
sce <- importCellRanger(sampleDirs = "otherCellRangerData",
cellRangerOuts = "",
barcodesFileNames = "barcodes.tsv",
featuresFileNames = "genes.tsv",
matrixFileNames = "matrix.mtx")
Quality control of cells is often needed before downstream analyses such as dimension reduction and clustering. Typical filtering procedures include exclusion of poor quality cells with low numbers of counts/UMIs, estimation and removal of ambient RNA, and identification of potential doublet/multiplets. Many tools and packages are available to perform these operations and users are free to apply their tool(s) of choice with SCTK.
Below is a quick example of how to perform standard QC before heading to the downstream analyses. If your data is already QC’ed or you decide to skip this step, you can directly move to the workflows (A La Carte, Seurat or Celda). For this tutorial, we will only run one doublet detection algorithm (scDblFinder) and one decontamination algorithm (decontX). We will also quantify the percentage of mitochondrial genes in each cell as this is often used as a measure of cell viability.
Running QC methods
Right after running, the visualization of QC results will automatically show up in the right main panel, with tabs for different methods.
Please find the detailed documentation of the QC UI for more explanation of each parameter.
Running QC methods
To perform QC, we suggest using the runCellQC()
function. This is a wrapper for several methods for calculation of QC
metrics, doublet detection, and estimation of ambient RNA. For a full
list of algorithms that this function runs by default, see
?runCellQC
.
# Run QC
sce <- runCellQC(sce, sample = NULL,
algorithms = c("QCMetrics", "scDblFinder", "decontX"),
mitoRef = "human", mitoIDType = "symbol",
mitoGeneLocation = "rownames", seed = 12345)
sce <- getUMAP(inSCE = sce, useAssay = "counts", logNorm = TRUE,
reducedDimName = "QC_UMAP", seed = 12345)
Note: If you have cells from multiple samples stored in the SCE object, make sure to supply the
sample
parameter as the QC tools need to be applied to cells from each sample individually.
Visualization
Individual sets of QC metrics can be plotted with specific functions.
For example to plot distributions of total numbers of UMIs derived from
runPerCellQC()
, doublet scores from
runScDblFinder()
, and contamination scores from
runDecontX()
(all of which were run by the
runCellQC()
function), the following plotting functions can
be used:
plotScDblFinderResults(sce, reducedDimName = "QC_UMAP")
plotDecontXResults(sce, reducedDimName = "QC_UMAP")
A comprehensive HTML report can be generated to visualize and explore the QC metrics in greater detail:
reportCellQC(sce)
Please find the detailed documentation for running QC functions for detail.
Besides the approaches described above, we have also published the SCTK-QC pipeline [1]. Please refer to the SCTK-QC Documentation for detail.
After examining the distributions of various QC metrics, poor quality cells will need to be removed. Typically, thresholds for QC metrics should exclude cells that are outliers of the distribution (i.e. long tails in the violin or density plots). Here is how we limit the data to cells with 1. at least 600 counts, 2. at least 300 genes detected, and 3. at most 5% of detected counts are mitochondrial. For more detail about filtering, please refer to the Filtering documentation
total
,
detected
and mito_percent
for each
criteria.total
greater than 600, you would check
“Greater than”, and enter the number 600
.
Similarly, if you want to keep cells with value
mito_percent
less than 5, you would check “Less
than” and enter the number 5
.After successfully filtering the data, you can see a comparison table shows up at the bottom. You can apply new filters by repeating the steps above, but it will only be applied on the filtered data instead of the original data.
Cells can be removed using the subsetSCECols()
function.
Metrics stored in the colData
slot of the SCE object can be
filtered using the colData
parameter.
# See number of cells before filtering
ncol(sce)
## [1] 2700
sce <- subsetSCECols(sce, colData = c("total > 600",
"detected > 300",
"mito_percent < 5"))
# See number of cells after filtering
ncol(sce)
## [1] 2627