Example #2

Overview

A stripped down version of Example #1, modified to allow key parameters to be easily changed in order to look at the stability of the final results.

Tests:

nFeature_RNA is cut at 2100 rather than the default of 2500 which seems to include one outlier at ca. 2400
- Quite similar, 9 communities, UMAP just a little different but the same messages from data
nFeature_RNA is cut at 3000
- Quite similar, 9 communities, UMAP just a little different
nFeature_RNA is cut at 4000 (not cut)
- Now interesting
- 9 communities are generated but the cluster integer order is different and so labels need to be readjusted
- Cells NK cluster is extended
- Seems like there are some better gene expression markers than IL7R e.g.s LEF1 and PRKCQ-AS1
Seurat v3 and v5 UMAP have different default labels - v3 uppercase and v5 are lowercase. Useful for identification of versions but also the UMAP cluster rotations in 2D are different, together with some cluster separation and cluster shape aspects. Suspect Seurat is simply using a refined version of UMAP in v5.0.0.

ChangeLog

Updated to Seurat v5

Includes

Environment Load and Check

this code section is packaged as an include for reuse across all examples
it uses the HTML details tag directly to wrap code blocks and output as drop-down sections

Show Environment

library(patchwork)
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(Seurat)

Loading required package: SeuratObject

Loading required package: sp

'SeuratObject' was built under R 4.3.0 but the current version is
4.3.2; it is recomended that you reinstall 'SeuratObject' as the ABI
for R may have changed


Attaching package: 'SeuratObject'

The following object is masked from 'package:base':

    intersect

Is there a typo in the message above? Application Programming Interface, API != ABI
Rolling back to R 4.3.0 was not possible with the current version of Seurat
- the indication was that Seurat requires a version of base Matrix that is not present in R 4.3.0

# which Seurat?
packageVersion("Seurat")

[1] '5.0.0'

# which R?
version[['version.string']]

[1] "R version 4.3.2 (2023-10-31)"

# presto was installed 
# For a (much!) faster implementation of the 
# Wilcoxon Rank Sum Test
packageVersion('presto')

[1] '1.0.0'

# check python is available via reticulate
import sys
print(sys.version.split(" ")[0])

3.12.0

# shell check
python3 -V

Python 3.12.0

# shell check
quarto -v

1.4.489

Functions

Show Functions

# Useful for code development.
# Save the object at a point and reload it into the R console 
# i.e. for developing alternative reports 
# without having to run the pipeline right from the start
# which can be slow
#
# NB: Files produced by saveRDS (or serialized to a file connection) 
# are not suitable as an interchange format between machines
# 
# For that use hdf5 or transfer data and code to reproduce the result 

saveRDS_overwrite <- function(file_path) {
  if (file.exists(file_path)) {
    file.remove(file_path)
  } 
  saveRDS(pbmc, file = file_path)
}

Process

# Verbose comments in Example #1 
EGN <- '_Eg2'

# read & create object
data_dir <- "./filtered_gene_bc_matrices/hg19/"
pbmc.data <- Read10X(data.dir = data_dir)
pbmc <- CreateSeuratObject(counts = pbmc.data, 
                           project = "pbmc3k", 
                           min.cells = 3, 
                           min.features = 200)

Warning: Feature names cannot have underscores ('_'), replacing with dashes
('-')

# calculate %mitochondrial
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
# Visualize QC metrics as a violin plot
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"),
        ncol = 3)

Warning: Default search for "data" layer in "RNA" assay yielded no results;
utilizing "counts" layer instead.

plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1 + plot2

# apply filters and overwrite the object
# 
# nFeature_RNA <- 2500
# nFeature_RNA_Max <- 3000
# nFeature_RNA_Max <- 2100
# i.e. not cut
nFeature_RNA_Max <- 4000

pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & 
                 nFeature_RNA < nFeature_RNA_Max & percent.mt < 5)
# And we can see clearly the effect of filtering
plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1 + plot2

# normalise 
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", 
                      scale.factor = 10000)

Normalizing layer: counts

# variable freatures (genes)
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)

Finding variable features for layer counts

n_top <- 15
top_v_genes <- head(VariableFeatures(pbmc), n_top)
top_v_genes

 [1] "PPBP"    "S100A9"  "LYZ"     "IGLL5"   "GNLY"    "FTL"     "PF4"    
 [8] "FTH1"    "GNG11"   "FCER1A"  "S100A8"  "HLA-DRA" "CD74"    "CLU"    
[15] "GZMB"

# plot variable features with and without labels
plot1 <- VariableFeaturePlot(pbmc)
plot2 <- LabelPoints(plot = plot1, points = top_v_genes, repel = TRUE)

When using repel, set xnudge and ynudge to 0 for optimal results

plot2

# scale all 
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)

Centering and scaling data matrix

# linear dim reduction
pbmc <- RunPCA(pbmc, verbose = FALSE)
n_features <- 8
print(pbmc[["pca"]], dims = 1:4, nfeatures = n_features)

PC_ 1 
Positive:  CST3, TYROBP, LST1, AIF1, FTL, LYZ, FTH1, FCN1 
Negative:  MALAT1, LTB, IL32, IL7R, CD2, ACAP1, STK17A, CTSW 
PC_ 2 
Positive:  CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DRA, HLA-DQB1, LINC00926, CD79B 
Negative:  NKG7, PRF1, CST7, GZMA, GZMB, FGFBP2, CTSW, GNLY 
PC_ 3 
Positive:  HLA-DQA1, CD79A, HLA-DQB1, CD79B, HLA-DPB1, HLA-DPA1, CD74, MS4A1 
Negative:  PPBP, PF4, SDPR, SPARC, GNG11, NRGN, GP9, RGS18 
PC_ 4 
Positive:  VIM, IL7R, S100A6, S100A8, S100A4, IL32, S100A9, GIMAP7 
Negative:  HLA-DQA1, CD79A, CD79B, HLA-DQB1, MS4A1, CD74, HLA-DPB1, HLA-DRB1

# V
n_mulitplier = 3
VizDimLoadings(pbmc, dims = 1:2, 
               reduction = "pca", 
               nfeatures = n_features * n_mulitplier)

VizDimLoadings(pbmc, dims = 3:4, 
               reduction = "pca", 
               nfeatures = n_features * n_mulitplier)

# U 
# Dimplot has lots of options
DimPlot(pbmc, reduction = "pca",  dims = c(1, 2))

DimPlot(pbmc, reduction = "pca",  dims = c(2, 3))

# as in quote above, 
# ordererd cells and genes according to PCA scores 

# heatmaps 
n_cells <- 300
n_genes <- 24
# 1
DimHeatmap(pbmc, dims = 1, cells = n_cells, nfeatures = n_genes, balanced = TRUE)

# many
n_PC <- 6
plots <- DimHeatmap(pbmc, dims = 1:n_PC, cells = n_cells, 
             nfeatures = n_genes, balanced = TRUE)

# elbow plot
ElbowPlot(pbmc)

n_pcs_chosen <- 10

# clusters
# create the K-nearest neighbor (KNN) graph of cells
pbmc <- FindNeighbors(pbmc, dims = 1:n_pcs_chosen)

Computing nearest neighbor graph

Computing SNN

# define clusters according to resolution
pbmc <- FindClusters(pbmc, resolution = 0.5)

Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 2643
Number of edges: 96640

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.8753
Number of communities: 9
Elapsed time: 0 seconds

# Look at cluster IDs of the first 5 cells
# In this case we have 9 levels (0 - 8)
# The structure is the relation between cell barcode and the cluster (community) 
head(Idents(pbmc), 5)

AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 AAACCGTGCTTCCG-1 
               1                3                1                5 
AAACCGTGTATGCG-1 
               6 
Levels: 0 1 2 3 4 5 6 7 8

# Each cell that survived filtering above is represented 
length(Idents(pbmc))

[1] 2643

pbmc

An object of class Seurat 
13714 features across 2643 samples within 1 assay 
Active assay: RNA (13714 features, 2000 variable features)
 3 layers present: counts, data, scale.data
 1 dimensional reduction calculated: pca

# UMAP
pbmc <- RunUMAP(pbmc, dims = 1:n_pcs_chosen)

Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
This message will be shown once per session

16:26:11 UMAP embedding parameters a = 0.9922 b = 1.112

16:26:11 Read 2643 rows and found 10 numeric columns

16:26:11 Using Annoy for neighbor search, n_neighbors = 30

16:26:11 Building Annoy index with metric = cosine, n_trees = 50

0%   10   20   30   40   50   60   70   80   90   100%

[----|----|----|----|----|----|----|----|----|----|

**************************************************|
16:26:11 Writing NN index file to temp file /var/folders/zp/wn7hdby52ds73hv9g_dxlk380000gn/T//RtmpSBvP2r/file17fa1463b1e8
16:26:11 Searching Annoy index using 1 thread, search_k = 3000
16:26:12 Annoy recall = 100%
16:26:12 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
16:26:13 Initializing from normalized Laplacian + noise (using RSpectra)
16:26:13 Commencing optimization for 500 epochs, with 105742 positive edges
16:26:18 Optimization finished

DimPlot(pbmc, reduction = "umap")

# tSNE
pbmc <- RunTSNE(pbmc, dims = 1:n_pcs_chosen)
DimPlot(pbmc, reduction = "tsne")

# save the object
file_path <- paste0("./seurat_object_checkpoints/pbmc_sw1",EGN,".rds")
saveRDS_overwrite(file_path)
# to restore
# pbmc <- readRDS(file_path)

# find markers for every cluster compared to all remaining cells, 
# report only the positive ones
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, 
                                 min.pct = 0.25, logfc.threshold = 0.25)

Calculating cluster 0
Calculating cluster 1
Calculating cluster 2
Calculating cluster 3
Calculating cluster 4
Calculating cluster 5
Calculating cluster 6
Calculating cluster 7
Calculating cluster 8

# heatmaps 
# note that 'wt' specifies the variable to use for ordering 
# we get the best markers in terms of size effect
pbmc.markers %>%
    group_by(cluster) %>%
    top_n(n = 5, wt = avg_log2FC) -> top_n

head(top_n,10)

# A tibble: 10 × 7
# Groups:   cluster [2]
      p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene     
      <dbl>      <dbl> <dbl> <dbl>     <dbl> <fct>   <chr>    
 1 2.99e-94       2.48 0.472 0.11   4.09e-90 0       CCR7     
 2 2.10e-57       2.18 0.364 0.103  2.89e-53 0       LEF1     
 3 2.03e-52       2.10 0.359 0.108  2.78e-48 0       PRKCQ-AS1
 4 1.69e-40       2.04 0.288 0.085  2.32e-36 0       MAL      
 5 2.94e-35       2.49 0.266 0.084  4.03e-31 0       LDLRAP1  
 6 8.30e-62       2.14 0.421 0.109  1.14e-57 1       AQP3     
 7 4.86e-57       1.64 0.63  0.246  6.66e-53 1       CD2      
 8 1.57e-41       2.06 0.274 0.065  2.16e-37 1       CD40LG   
 9 8.92e-39       1.62 0.378 0.132  1.22e-34 1       TRADD    
10 1.61e-19       1.61 0.296 0.136  2.20e-15 1       CORO1B

# heat map shows that cluster 1 and 2 are not easily distiguished
# by just a few genes others are 
DoHeatmap(pbmc, features = top_n$gene)

# + NoLegend()

# consider the canonical cluster 0 markers
# "IL7R" is not particularly good marker for cluster 0 
VlnPlot(pbmc, features = c("IL7R", "CCR7"))

FeaturePlot(pbmc, features = c("IL7R", "CCR7"))

# Try the top 5
# Seems like there are some better markers than IL7R e.g.s LEF1 and PRKCQ-AS1
VlnPlot(pbmc, features = c("LDHB", "CCR7","LEF1","PRKCQ-AS1","LDLRAP1"))

FeaturePlot(pbmc, features = c("LDHB", "CCR7","LEF1","PRKCQ-AS1","LDLRAP1"))

Compare results

# cell type to cluster
#new.cluster.ids <- c("Naive CD4 T", "CD14+ Mono", "Memory CD4 T", 
#                     "B", "CD8 T", "FCGR3A+ Mono",
#                    "NK", "DC", "Platelet")

# the unlabelled clusters
DimPlot(pbmc, reduction = "umap")

# label adjust 
new.cluster.ids <- c("Naive CD4 T", "Memory CD4 T", "CD14+ Mono",
                     "B", "CD8 T", "FCGR3A+ Mono",
                    "NK", "DC", "Platelet")

names(new.cluster.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new.cluster.ids)
DimPlot(pbmc, reduction = "umap", label = TRUE, 
        pt.size = 0.5) + NoLegend()

The default labelled UMAP clusters for comparison from Example #1 (Seurat v5)

The default labelled UMAP clusters for comparison from Example #1 (Seurat v3)

The default unlabelled UMAP clusters for comparison from Example #1 (Seurat v3)

# save final 
file_path <- paste0("./seurat_object_checkpoints/pbmc_sw1",EGN,"_final.rds")
saveRDS_overwrite(file_path)
# Done. See yah :)