A wrapper function that calls scrub_doublets
from python
module scrublet
. Simulates doublets from the observed data and uses
a k-nearest-neighbor classifier to calculate a continuous
scrublet_score
(between 0 and 1) for each transcriptome. The score
is automatically thresholded to generate scrublet_call
, a boolean
array that is TRUE
for predicted doublets and FALSE
otherwise.
runScrublet(
inSCE,
sample = NULL,
useAssay = "counts",
simDoubletRatio = 2,
nNeighbors = NULL,
minDist = NULL,
expectedDoubletRate = 0.1,
stdevDoubletRate = 0.02,
syntheticDoubletUmiSubsampling = 1,
useApproxNeighbors = TRUE,
distanceMetric = "euclidean",
getDoubletNeighborParents = FALSE,
minCounts = 3,
minCells = 3L,
minGeneVariabilityPctl = 85,
logTransform = FALSE,
meanCenter = TRUE,
normalizeVariance = TRUE,
nPrinComps = 30L,
tsneAngle = NULL,
tsnePerplexity = NULL,
verbose = TRUE,
seed = 12345
)
A SingleCellExperiment object.
Character vector or colData variable name. Indicates which
sample each cell belongs to. Default NULL
.
A string specifying which assay in the SCE to use. Default
"counts"
.
Numeric. Number of doublets to simulate relative to
the number of observed transcriptomes. Default 2.0
.
Integer. Number of neighbors used to construct the KNN
graph of observed transcriptomes and simulated doublets. If NULL
,
this is set to round(0.5 * sqrt(n_cells))
. Default NULL
.
Float Determines how tightly UMAP packs points together. If
NULL
, this is set to 0.1
. Default NULL
.
The estimated doublet rate for the experiment.
Default 0.1
.
Uncertainty in the expected doublet rate. Default
0.02
.
Numeric. Rate for sampling UMIs when
creating synthetic doublets. If 1.0
, each doublet is created by simply
adding the UMIs from two randomly sampled observed transcriptomes. For
values less than 1, the UMI counts are added and then randomly sampled at
the specified rate. Defuault 1.0
.
Boolean. Use approximate nearest neighbor method
(annoy) for the KNN classifier. Default TRUE
.
Character. Distance metric used when finding nearest
neighbors. See detail. Default "euclidean"
.
Boolean. If TRUE
, return the
parent transcriptomes that generated the doublet neighbors of each
observed transcriptome. This information can be used to infer the cell states
that generated a given doublet state. Default FALSE
.
Numeric. Used for gene filtering prior to PCA. Genes
expressed at fewer than minCounts
in fewer than minCells
are
excluded. Default 3
.
Integer. Used for gene filtering prior to PCA. Genes
expressed at fewer than minCounts
in fewer than minCells
are
excluded. Default 3
.
Numeric. Used for gene filtering prior to
PCA. Keep the most highly variable genes (in the top
minGeneVariabilityPctl
percentile), as measured by the v-statistic
(Klein et al., Cell 2015). Default 85
.
Boolean. If TRUE
, log-transform the counts matrix
(log1p(TPM)). sklearn.decomposition.TruncatedSVD
will be used for
dimensionality reduction, unless meanCenter
is TRUE
. Default
FALSE
.
If TRUE
, center the data such that each gene has a
mean of 0
. sklearn.decomposition.PCA
will be used for
dimensionality reduction. Default TRUE
.
Boolean. If TRUE
, normalize the data such
that each gene has a variance of 1.
sklearn.decomposition.TruncatedSVD
will be used for dimensionality
reduction, unless meanCenter
is TRUE
. Default TRUE
.
Integer. Number of principal components used to embed
the transcriptomes prior to k-nearest-neighbor graph construction.
Default 30
.
Float. Determines angular size of a distant node as measured
from a point in the t-SNE plot. If NULL
, it is set to 0.5
.
Default NULL
.
Integer. The number of nearest neighbors that is used
in other manifold learning algorithms. If NULL
, it is set to 30.
Default NULL
.
Boolean. If TRUE
, print progress updates. Default
TRUE
.
Seed for the random number generator, can be set to NULL
.
Default 12345
.
A SingleCellExperiment object with
scrub_doublets
output appended to the colData slot. The columns
include scrublet_score
and scrublet_call
.
For the list of valid values for distanceMetric
, see the
documentation for
annoy (if
useApproxNeighbors
is TRUE
) or
sklearn.neighbors.NearestNeighbors
(if useApproxNeighbors
is FALSE
).
data(scExample, package = "singleCellTK")
if (FALSE) {
sce <- subsetSCECols(sce, colData = "type != 'EmptyDroplet'")
sce <- runScrublet(sce)
}