Example #3

Overview

A stripped down version of Example #1, modified to allow key parameters to be easily changed in order to look at the stability of the final results.

Tests and results:

  • Like Example #2, nFeature_RNA is cut at 4000 (not cut actually because it includes all data point from the created object). Now we modulate FindClusters resolution.
  • Changing FindClusters resolution to 0.2 (from 0.5) causes a merge of CD14+ Mono Naive and Memory cells. Likewise a merge of CD8 T and NK cells. Other cells populations remain distinct. Overall, a change from 9 to 7 communities.
  • Changing FindClusters resolution to 1.0 (from 0.5) causes a split in the CD4 T/CD8 T region and a split in the CD14+ Mono / FCGR3A+ Mono region to give 11 communities (from the original 9). The high nFeature_RNA extension of the NK cells remains.

ChangeLog

  • Updated to Seurat v5

Includes

Environment Load and Check

  • this code section is packaged as an include for reuse across all examples
  • it uses the HTML details tag directly to wrap code blocks and output as drop-down sections
Show Environment
library(patchwork)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(Seurat)
Loading required package: SeuratObject
Loading required package: sp
'SeuratObject' was built under R 4.3.0 but the current version is
4.3.2; it is recomended that you reinstall 'SeuratObject' as the ABI
for R may have changed

Attaching package: 'SeuratObject'
The following object is masked from 'package:base':

    intersect
  • Is there a typo in the message above? Application Programming Interface, API != ABI

  • Rolling back to R 4.3.0 was not possible with the current version of Seurat

    • the indication was that Seurat requires a version of base Matrix that is not present in R 4.3.0
# which Seurat?
packageVersion("Seurat")
[1] '5.0.0'
# which R?
version[['version.string']]
[1] "R version 4.3.2 (2023-10-31)"
# presto was installed 
# For a (much!) faster implementation of the 
# Wilcoxon Rank Sum Test
packageVersion('presto')
[1] '1.0.0'
# check python is available via reticulate
import sys
print(sys.version.split(" ")[0])
3.12.0
# shell check
python3 -V
Python 3.12.0
# shell check
quarto -v
1.4.489

Functions

Show Functions
# Useful for code development.
# Save the object at a point and reload it into the R console 
# i.e. for developing alternative reports 
# without having to run the pipeline right from the start
# which can be slow
#
# NB: Files produced by saveRDS (or serialized to a file connection) 
# are not suitable as an interchange format between machines
# 
# For that use hdf5 or transfer data and code to reproduce the result 

saveRDS_overwrite <- function(file_path) {
  if (file.exists(file_path)) {
    file.remove(file_path)
  } 
  saveRDS(pbmc, file = file_path)
}

Process

# Verbose comments in Example #1 
EGN <- '_Eg3'
# read & create object
data_dir <- "./filtered_gene_bc_matrices/hg19/"
pbmc.data <- Read10X(data.dir = data_dir)
pbmc <- CreateSeuratObject(counts = pbmc.data, 
                           project = "pbmc3k", 
                           min.cells = 3, 
                           min.features = 200)
Warning: Feature names cannot have underscores ('_'), replacing with dashes
('-')
# calculate %mitochondrial
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
# Visualize QC metrics as a violin plot
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"),
        ncol = 3)
Warning: Default search for "data" layer in "RNA" assay yielded no results;
utilizing "counts" layer instead.

plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1 + plot2

# apply filters and overwrite the object
# 
# nFeature_RNA <- 2500
# nFeature_RNA_Max <- 3000
# nFeature_RNA_Max <- 2100
# i.e. not cut
nFeature_RNA_Max <- 4000

pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & 
                 nFeature_RNA < nFeature_RNA_Max & percent.mt < 5)
# And we can see clearly the effect of filtering
plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1 + plot2

# normalise 
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", 
                      scale.factor = 10000)
Normalizing layer: counts
# variable freatures (genes)
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
Finding variable features for layer counts
n_top <- 15
top_v_genes <- head(VariableFeatures(pbmc), n_top)
top_v_genes
 [1] "PPBP"    "S100A9"  "LYZ"     "IGLL5"   "GNLY"    "FTL"     "PF4"    
 [8] "FTH1"    "GNG11"   "FCER1A"  "S100A8"  "HLA-DRA" "CD74"    "CLU"    
[15] "GZMB"   
# plot variable features with and without labels
plot1 <- VariableFeaturePlot(pbmc)
plot2 <- LabelPoints(plot = plot1, points = top_v_genes, repel = TRUE)
When using repel, set xnudge and ynudge to 0 for optimal results
plot2 

# scale all 
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)
Centering and scaling data matrix
# linear dim reduction
pbmc <- RunPCA(pbmc, verbose = FALSE)
n_features <- 8
print(pbmc[["pca"]], dims = 1:4, nfeatures = n_features)
PC_ 1 
Positive:  CST3, TYROBP, LST1, AIF1, FTL, LYZ, FTH1, FCN1 
Negative:  MALAT1, LTB, IL32, IL7R, CD2, ACAP1, STK17A, CTSW 
PC_ 2 
Positive:  CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DRA, HLA-DQB1, LINC00926, CD79B 
Negative:  NKG7, PRF1, CST7, GZMA, GZMB, FGFBP2, CTSW, GNLY 
PC_ 3 
Positive:  HLA-DQA1, CD79A, HLA-DQB1, CD79B, HLA-DPB1, HLA-DPA1, CD74, MS4A1 
Negative:  PPBP, PF4, SDPR, SPARC, GNG11, NRGN, GP9, RGS18 
PC_ 4 
Positive:  VIM, IL7R, S100A6, S100A8, S100A4, IL32, S100A9, GIMAP7 
Negative:  HLA-DQA1, CD79A, CD79B, HLA-DQB1, MS4A1, CD74, HLA-DPB1, HLA-DRB1 
# V
n_mulitplier = 3
VizDimLoadings(pbmc, dims = 1:2, 
               reduction = "pca", 
               nfeatures = n_features * n_mulitplier)

VizDimLoadings(pbmc, dims = 3:4, 
               reduction = "pca", 
               nfeatures = n_features * n_mulitplier)

# U 
# Dimplot has lots of options
DimPlot(pbmc, reduction = "pca",  dims = c(1, 2))

DimPlot(pbmc, reduction = "pca",  dims = c(2, 3))

# as in quote above, 
# ordererd cells and genes according to PCA scores 

# heatmaps 
n_cells <- 300
n_genes <- 24
# 1
DimHeatmap(pbmc, dims = 1, cells = n_cells, nfeatures = n_genes, balanced = TRUE)

# many
n_PC <- 6
plots <- DimHeatmap(pbmc, dims = 1:n_PC, cells = n_cells, 
             nfeatures = n_genes, balanced = TRUE)

# elbow plot
ElbowPlot(pbmc)

n_pcs_chosen <- 10

# clusters
# create the K-nearest neighbor (KNN) graph of cells
pbmc <- FindNeighbors(pbmc, dims = 1:n_pcs_chosen)
Computing nearest neighbor graph
Computing SNN
# define clusters according to resolution
#pbmc <- FindClusters(pbmc, resolution = 0.5)
pbmc <- FindClusters(pbmc, resolution = 1.0)
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 2643
Number of edges: 96640

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.8123
Number of communities: 11
Elapsed time: 0 seconds
#pbmc <- FindClusters(pbmc, resolution = 0.2)

# Look at cluster IDs of the first 5 cells
# In this case we have 9 levels (0 - 8)
# The structure is the relation between cell barcode and the cluster (community) 
head(Idents(pbmc), 5)
AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 AAACCGTGCTTCCG-1 
               5                2                1                7 
AAACCGTGTATGCG-1 
               8 
Levels: 0 1 2 3 4 5 6 7 8 9 10
# Each cell that survived filtering above is represented 
length(Idents(pbmc))
[1] 2643
pbmc
An object of class Seurat 
13714 features across 2643 samples within 1 assay 
Active assay: RNA (13714 features, 2000 variable features)
 3 layers present: counts, data, scale.data
 1 dimensional reduction calculated: pca
# UMAP
pbmc <- RunUMAP(pbmc, dims = 1:n_pcs_chosen)
Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
This message will be shown once per session
16:27:53 UMAP embedding parameters a = 0.9922 b = 1.112
16:27:53 Read 2643 rows and found 10 numeric columns
16:27:53 Using Annoy for neighbor search, n_neighbors = 30
16:27:53 Building Annoy index with metric = cosine, n_trees = 50
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
16:27:54 Writing NN index file to temp file /var/folders/zp/wn7hdby52ds73hv9g_dxlk380000gn/T//RtmpbfaDre/file186d7c8821e
16:27:54 Searching Annoy index using 1 thread, search_k = 3000
16:27:54 Annoy recall = 100%
16:27:55 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
16:27:55 Initializing from normalized Laplacian + noise (using RSpectra)
16:27:55 Commencing optimization for 500 epochs, with 105742 positive edges
16:28:01 Optimization finished
DimPlot(pbmc, reduction = "umap")

# tSNE
pbmc <- RunTSNE(pbmc, dims = 1:n_pcs_chosen)
DimPlot(pbmc, reduction = "tsne")

# save the object
file_path <- paste0("./seurat_object_checkpoints/pbmc_sw1",EGN,".rds")
saveRDS_overwrite(file_path)
# to restore
# pbmc <- readRDS(file_path)

# find markers for every cluster compared to all remaining cells, 
# report only the positive ones
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, 
                                 min.pct = 0.25, logfc.threshold = 0.25)
Calculating cluster 0
Calculating cluster 1
Calculating cluster 2
Calculating cluster 3
Calculating cluster 4
Calculating cluster 5
Calculating cluster 6
Calculating cluster 7
Calculating cluster 8
Calculating cluster 9
Calculating cluster 10
# heatmaps 
# note that 'wt' specifies the variable to use for ordering 
# we get the best markers in terms of size effect
pbmc.markers %>%
    group_by(cluster) %>%
    top_n(n = 5, wt = avg_log2FC) -> top_n

head(top_n,10)
# A tibble: 10 × 7
# Groups:   cluster [2]
      p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene     
      <dbl>      <dbl> <dbl> <dbl>     <dbl> <fct>   <chr>    
 1 2.34e-87       2.25 0.502 0.122  3.22e-83 0       CCR7     
 2 2.78e-54       2.01 0.388 0.111  3.81e-50 0       LEF1     
 3 7.72e-53       1.89 0.39  0.114  1.06e-48 0       PRKCQ-AS1
 4 6.43e-46       2.10 0.322 0.088  8.81e-42 0       MAL      
 5 5.53e-26       1.69 0.251 0.085  7.59e-22 0       OXNAD1   
 6 3.84e-53       2.12 0.434 0.12   5.26e-49 1       AQP3     
 7 1.04e-51       1.57 0.658 0.257  1.43e-47 1       CD2      
 8 4.94e-42       2.13 0.298 0.07   6.77e-38 1       CD40LG   
 9 6.85e-39       1.51 0.409 0.13   9.39e-35 1       TRAT1    
10 1.27e-31       1.55 0.385 0.142  1.74e-27 1       TRADD    
# heat map shows that cluster 1 and 2 are not easily distiguished
# by just a few genes others are 
DoHeatmap(pbmc, features = top_n$gene)

# + NoLegend()

# consider the canonical cluster 0 markers
# "IL7R" is not particularly good marker for cluster 0 
VlnPlot(pbmc, features = c("IL7R", "CCR7"))

FeaturePlot(pbmc, features = c("IL7R", "CCR7"))

# Try the top 5
# Seems like there are some better markers than IL7R e.g.s LEF1 and PRKCQ-AS1
VlnPlot(pbmc, features = c("LDHB", "CCR7","LEF1","PRKCQ-AS1","LDLRAP1"))

FeaturePlot(pbmc, features = c("LDHB", "CCR7","LEF1","PRKCQ-AS1","LDLRAP1"))

Compare results

  • Changing FindClusters resolution to 0.2 (from 0.5) causes a merge of CD14+ Mono Naive and Memory cells. Likewise a merge of CD8 T and NK cells. Other cells populations remain distinct. Overall, a change from 9 to 7 communities.

Cell communities with a FindClusters resolution of 0.2
# the unlabelled clusters
DimPlot(pbmc, reduction = "umap")

  • Changing FindClusters resolution to 1.0 (from 0.5) causes a split in the CD4 T/CD8 T region and a split in the CD14+ Mono / FCGR3A+ Mono region to give 11 communities (from the original 9). The high nFeature_RNA extension of the NK cells remains.

The default labelled UMAP clusters for comparison from Example #1

The default unlabelled UMAP clusters for comparison from Example #1

  • TODO

    • Look at and compare the tSNE results - tSNEs seem to give nice separation
# save final 
file_path <- paste0("./seurat_object_checkpoints/pbmc_sw1",EGN,"_final.rds")
saveRDS_overwrite(file_path)
# Done. See yah :)