引言

本系列讲解 空间转录组学 (Spatial Transcriptomics) 相关基础知识与数据分析教程[1]，持续更新，欢迎关注，转发，文末有交流群！

全局异常值检测

许多目前应用于 ST 的常用 QC 方法是从 snRNA-seq 工作流程改编而来，比如全局异常值检测。识别潜在低质量 spot 的最简单方法，是在整个样本范围内对每个 QC 指标应用固定阈值，并移除任何在一个或多个指标上未达到阈值的 spot。

例如，我们可能会将 library size 低于 600 UMI 或 mitochondrial proportion 高于 30% 的 spot 视为低质量。这些 cut-off 有些武断，通常需要了解所研究的 tissue type 和 dataset。特别是，exploratory visualizations 可以用来帮助选择合适的阈值，而这些阈值可能因 dataset 而异。

在此，我们使用 visualizations 为我们的人类 DLPFC dataset 中的几个 QC 指标选择阈值：

library size
number of expressed genes
proportion of mitochondrial reads
number of cells per spot

par(mfrow = c(1, 4))
hist(colData(spe)$sum, xlab = "sum", main = "UMIs per spot")
hist(spe$detected, xlab = "detected", main = "Genes per spot")
hist(spe$subsets_mito_percent, xlab = "pct mito", main = "Percent mito UMIs")
hist(spe$cell_count, xlab = "no. cells", main = "No. cells per spot")

这些分布相对平滑，没有出现诸如在极低 library size 处出现尖峰之类的明显问题。

我们还可以将多个 QC 指标一起绘制，例如将 library size 或 mitochondrial proportion 与每个 spot 的细胞数对比。这有助于我们在确定最佳阈值的同时，检查是否无意中去除了具有生物学意义的 spot 组。水平线（参数 threshold）显示我们对 library size 和 mitochondrial proportion 的初步过滤阈值猜测。

# plot library size vs. number of cells per spot
p1 <- 
  plotObsQC(spe, plot_type = "scatter", 
            x_metric = "cell_count", y_metric = "sum", 
            y_threshold = 600) + 
  ggtitle("Library size vs. cells per spot")

# plot mito proportion vs. number of cells per spot
p2 <- 
  plotObsQC(spe, plot_type = "scatter", 
            x_metric = "cell_count", y_metric = "subsets_mito_percent", 
            y_threshold = 30) + 
  ggtitle("Mito proportion vs. cells per spot")

p1 | p2

图中显示，这些过滤阈值似乎并未挑选出任何明显具有生物学一致性的 spot 组，也不会导致剔除过多的观测值。我们可以通过查看被剔除的 spot 确切数量来进一步验证这一点。

# select QC threshold for library size, add to colData
spe$qc_lib_size <- spe$sum < 600
table(spe$qc_lib_size)

##  FALSE  TRUE 
##   3631     8

# select QC threshold for detected features, add to colData
spe$qc_detected <- spe$detected < 400
table(spe$qc_detected)
##  FALSE  TRUE 
##   3632     7

# select QC threshold for mito proportion, add to colData
spe$qc_mito_prop <- spe$subsets_mito_percent > 30
table(spe$qc_mito_prop)

##  FALSE  TRUE 
##   3636     3

最后，我们还要检查被丢弃的 spot 是否存在任何与已知生物学特征明显相关的空间模式。否则，移除这些 spot 可能意味着我们设定的阈值过高，从而把具有生物学信息的 spot 也一并删除了。

# check spatial pattern of discarded spots
p1 <- 
  plotObsQC(spe, plot_type = "spot", 
            annotate = "qc_lib_size") + 
  ggtitle("Library size (< 600 UMI)")

p2 <- 
  plotObsQC(spe, plot_type = "spot", 
            annotate = "qc_detected") + 
  ggtitle("Detected genes (< 400 genes)")

p3 <- 
  plotObsQC(spe, plot_type = "spot", 
            annotate = "qc_mito_prop") + 
  ggtitle("Mito proportion (> 30%)")

p1 | p2 | p3

顺带一提，这里我们也能展示如果把阈值设得过高会发生什么。例如，如果我们将阈值设为每个 spot 2000 UMI counts —— 根据直方图和散点图，这看起来也可能是一个合理的数值 —— 那么我们会在被丢弃的 spot 中观察到一种可能的空间模式，该模式与已知的 cortical layers 相符。这说明了在选择这些阈值时，交互式地检查探索性可视化是多么重要。为了说明这一点，我们可以把人工注释的参考（“ground truth”）DLPFC layers 绘制在这里作为参照。

# check spatial pattern of discarded spots if threshold is too high
spe$qc_lib_size_2000 <- spe$sum < 2000

# plot the spots flagged with the high threshold
p1 <- 
  plotObsQC(spe, plot_type = "spot", 
            annotate = "qc_lib_size_2000") + 
  ggtitle("Library size (< 2000 UMI)")

# plot manually annotated reference layers
p2 <- 
  plotCoords(spe, annotate = "ground_truth", 
             pal = "libd_layer_colors") + 
  ggtitle("Manually annotated layers")

# plot library size by manual annotation
p3 <- 
  plotColData(spe, x = "ground_truth", 
              y = "sum", 
              colour_by = "ground_truth") + 
  ggtitle("Library size by layer") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  xlab("")

p1 | p2 | p3

将标记出的 spot 与人工注释的层别对照后可以发现，把 library size 阈值设在 2000 UMI 时，被标记的 spot 在 layer 1 和 layer 6 里的比例明显高于其他层或空间区域。换言之，library size 与生物学特征存在混淆，这一点在先前的研究中已有证实（Bhuva et al. 2024；Totty, Hicks, and Guo 2025）。

我们还可以借助 violin plot 来展示 QC 指标的分布，并把异常值标注出来，以此判断所选阈值是否合适，或是否排除了过多的 spot。

# library size and outliers
p1 <- 
  plotObsQC(spe, plot_type = "violin", x_metric = "sum", 
            annotate = "qc_lib_size", point_size = 0.5) + 
  xlab("Library size")

# detected genes and outliers
p2 <- 
  plotObsQC(spe, plot_type = "violin", x_metric = "detected", 
            annotate = "qc_detected", point_size = 0.5) + 
  xlab("Detected genes") 

# mito proportion and outliers
p3 <- plotObsQC(spe, plot_type = "violin", x_metric = "subsets_mito_percent", 
                annotate = "qc_mito_prop", point_size = 0.5) + 
  xlab("Mito proportion")

p1 | p2 | p3

[1]Ref: https://lmweber.org/OSTA/

醋醋百科网

Good Luck To You!

空间转录组学: 全局异常值检测_空间转录组测序技术

引言

全局异常值检测