A wrapper function that calls scrub_doublets from python module scrublet. Simulates doublets from the observed data and uses a k-nearest-neighbor classifier to calculate a continuous scrublet_score (between 0 and 1) for each transcriptome. The score is automatically thresholded to generate scrublet_call, a boolean array that is TRUE for predicted doublets and FALSE otherwise.

runScrublet(
  inSCE,
  sample = NULL,
  useAssay = "counts",
  simDoubletRatio = 2,
  nNeighbors = NULL,
  minDist = NULL,
  expectedDoubletRate = 0.1,
  stdevDoubletRate = 0.02,
  syntheticDoubletUmiSubsampling = 1,
  useApproxNeighbors = TRUE,
  distanceMetric = "euclidean",
  getDoubletNeighborParents = FALSE,
  minCounts = 3,
  minCells = 3L,
  minGeneVariabilityPctl = 85,
  logTransform = FALSE,
  meanCenter = TRUE,
  normalizeVariance = TRUE,
  nPrinComps = 30L,
  tsneAngle = NULL,
  tsnePerplexity = NULL,
  verbose = TRUE,
  seed = 12345
)

Arguments

inSCE

A SingleCellExperiment object.

sample

Character vector or colData variable name. Indicates which sample each cell belongs to. Default NULL.

useAssay

A string specifying which assay in the SCE to use. Default "counts".

simDoubletRatio

Numeric. Number of doublets to simulate relative to the number of observed transcriptomes. Default 2.0.

nNeighbors

Integer. Number of neighbors used to construct the KNN graph of observed transcriptomes and simulated doublets. If NULL, this is set to round(0.5 * sqrt(n_cells)). Default NULL.

minDist

Float Determines how tightly UMAP packs points together. If NULL, this is set to 0.1. Default NULL.

expectedDoubletRate

The estimated doublet rate for the experiment. Default 0.1.

stdevDoubletRate

Uncertainty in the expected doublet rate. Default 0.02.

syntheticDoubletUmiSubsampling

Numeric. Rate for sampling UMIs when creating synthetic doublets. If 1.0, each doublet is created by simply adding the UMIs from two randomly sampled observed transcriptomes. For values less than 1, the UMI counts are added and then randomly sampled at the specified rate. Defuault 1.0.

useApproxNeighbors

Boolean. Use approximate nearest neighbor method (annoy) for the KNN classifier. Default TRUE.

distanceMetric

Character. Distance metric used when finding nearest neighbors. See detail. Default "euclidean".

getDoubletNeighborParents

Boolean. If TRUE, return the parent transcriptomes that generated the doublet neighbors of each observed transcriptome. This information can be used to infer the cell states that generated a given doublet state. Default FALSE.

minCounts

Numeric. Used for gene filtering prior to PCA. Genes expressed at fewer than minCounts in fewer than minCells are excluded. Default 3.

minCells

Integer. Used for gene filtering prior to PCA. Genes expressed at fewer than minCounts in fewer than minCells are excluded. Default 3.

minGeneVariabilityPctl

Numeric. Used for gene filtering prior to PCA. Keep the most highly variable genes (in the top minGeneVariabilityPctl percentile), as measured by the v-statistic (Klein et al., Cell 2015). Default 85.

logTransform

Boolean. If TRUE, log-transform the counts matrix (log1p(TPM)). sklearn.decomposition.TruncatedSVD will be used for dimensionality reduction, unless meanCenter is TRUE. Default FALSE.

meanCenter

If TRUE, center the data such that each gene has a mean of 0. sklearn.decomposition.PCA will be used for dimensionality reduction. Default TRUE.

normalizeVariance

Boolean. If TRUE, normalize the data such that each gene has a variance of 1. sklearn.decomposition.TruncatedSVD will be used for dimensionality reduction, unless meanCenter is TRUE. Default TRUE.

nPrinComps

Integer. Number of principal components used to embed the transcriptomes prior to k-nearest-neighbor graph construction. Default 30.

tsneAngle

Float. Determines angular size of a distant node as measured from a point in the t-SNE plot. If NULL, it is set to 0.5. Default NULL.

tsnePerplexity

Integer. The number of nearest neighbors that is used in other manifold learning algorithms. If NULL, it is set to 30. Default NULL.

verbose

Boolean. If TRUE, print progress updates. Default TRUE.

seed

Seed for the random number generator, can be set to NULL. Default 12345.

Value

A SingleCellExperiment object with scrub_doublets output appended to the colData slot. The columns include scrublet_score and scrublet_call.

Details

For the list of valid values for distanceMetric, see the documentation for annoy (if useApproxNeighbors is TRUE) or sklearn.neighbors.NearestNeighbors (if useApproxNeighbors is FALSE).

Examples

data(scExample, package = "singleCellTK")
if (FALSE) {
sce <- subsetSCECols(sce, colData = "type != 'EmptyDroplet'")
sce <- runScrublet(sce)
}