Single-cell pipeline

Last updated: 2025-04-14

Checks: 6 1

Knit directory: KODAMA-Analysis/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: uncommitted changes

The R Markdown is untracked by Git. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20240618)

The command set.seed(20240618) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 5f5ac63

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 5f5ac63. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .RData
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/

Untracked files:
    Untracked:  KODAMA.svg
    Untracked:  analysis/singlecell_datamatrix.Rmd
    Untracked:  analysis/singlecell_seurat.Rmd
    Untracked:  code/Acinar_Cell_Carcinoma.ipynb
    Untracked:  code/Adenocarcinoma.ipynb
    Untracked:  code/Adjacent_normal_section.ipynb
    Untracked:  code/DLFPC_preprocessing.R
    Untracked:  code/DLPFC - BANKSY.R
    Untracked:  code/DLPFC - BASS.R
    Untracked:  code/DLPFC - BAYESPACE.R
    Untracked:  code/DLPFC - Nonspatial.R
    Untracked:  code/DLPFC - PRECAST.R
    Untracked:  code/DLPFC_comparison.R
    Untracked:  code/DLPFC_results_analysis.R
    Untracked:  code/MERFISH - BANKSY.R
    Untracked:  code/MERFISH - BASS.R
    Untracked:  code/MERFISH - BAYESPACE.R
    Untracked:  code/MERFISH - Nonspatial.R
    Untracked:  code/MERFISH - PRECAST.R
    Untracked:  code/MERFISH_comparison.R
    Untracked:  code/MERFISH_results_analysis.R
    Untracked:  code/VisiumHD-CRC.ipynb
    Untracked:  code/VisiumHDassignment.py
    Untracked:  code/deep learning code DLPFC.R
    Untracked:  code/save tiles.py
    Untracked:  data/Annotations/
    Untracked:  data/DLFPC-Br5292-input.RData
    Untracked:  data/DLFPC-Br5595-input.RData
    Untracked:  data/DLFPC-Br8100-input.RData
    Untracked:  data/DLPFC-general.RData
    Untracked:  data/MERFISH-input.RData
    Untracked:  data/trajectories.RData
    Untracked:  data/trajectories_VISIUMHD.RData
    Untracked:  output/BANSKY-results.RData
    Untracked:  output/BASS-results.RData
    Untracked:  output/BayesSpace-results.RData
    Untracked:  output/CRC-image.RData
    Untracked:  output/CRC-image2.RData
    Untracked:  output/CRC.png
    Untracked:  output/CRC2.png
    Untracked:  output/CRC7.png
    Untracked:  output/CRC8.png
    Untracked:  output/CRC_boxplot.png
    Untracked:  output/CRC_boxplot.svg
    Untracked:  output/CRC_boxplot2.svg
    Untracked:  output/CRC_linee.svg
    Untracked:  output/DL.RData
    Untracked:  output/DLFPC-All-2.RData
    Untracked:  output/DLFPC-All.RData
    Untracked:  output/DLFPC-Br5292.RData
    Untracked:  output/DLFPC-Br5595.RData
    Untracked:  output/DLFPC-Br8100.RData
    Untracked:  output/DLFPC-variablesXdeeplearning.RData
    Untracked:  output/DLPFC-BANSKY-results.RData
    Untracked:  output/DLPFC-BASS-results.RData
    Untracked:  output/DLPFC-BayesSpace-results.RData
    Untracked:  output/DLPFC-Nonspatial-results.RData
    Untracked:  output/DLPFC-PRECAST-results.RData
    Untracked:  output/DLPFC_all_cluster.svg
    Untracked:  output/DLPFCpathway.RData
    Untracked:  output/Figure 1 - boxplot.pdf
    Untracked:  output/Figure 2 - DLPFC 10.pdf
    Untracked:  output/Figures/
    Untracked:  output/KODAMA-results.RData
    Untracked:  output/KODAMA_DLPFC_All_original.svg
    Untracked:  output/KODAMA_DLPFC_Br5595.svg
    Untracked:  output/KODAMA_DLPFC_Br5595_slide.svg
    Untracked:  output/Loupe.csv
    Untracked:  output/MERFISH-BANSKY-results.RData
    Untracked:  output/MERFISH-BASS-results.RData
    Untracked:  output/MERFISH-BayesSpace-results.RData
    Untracked:  output/MERFISH-KODAMA-results.RData
    Untracked:  output/MERFISH-Nonspatial-results.RData
    Untracked:  output/MERFISH-PRECAST-results.RData
    Untracked:  output/MERFISH.RData
    Untracked:  output/Nonspatial-results.RData
    Untracked:  output/Prostate-GSEA.csv
    Untracked:  output/Prostate-KODAMA.RData
    Untracked:  output/Prostate-trajectory.csv
    Untracked:  output/Prostate.RData
    Untracked:  output/VisiumHD-RNA.RData
    Untracked:  output/VisiumHD-genes.pdf
    Untracked:  output/VisiumHD.RData
    Untracked:  output/boh.svg
    Untracked:  output/desmoplastic_distance_carcinoma.csv
    Untracked:  output/image.RData
    Untracked:  output/pp.RData
    Untracked:  output/pp2.RData
    Untracked:  output/pp3.RData
    Untracked:  output/pp4.RData
    Untracked:  output/pp5.RData
    Untracked:  output/prostate1.svg
    Untracked:  output/prostate2.svg
    Untracked:  output/prostate3.svg
    Untracked:  output/subclusters1.csv
    Untracked:  output/subclusters2.csv
    Untracked:  output/subclusters3.csv
    Untracked:  output/tight_boundary.geojson
    Untracked:  output/trajectory.csv

Unstaged changes:
    Deleted:    analysis/D1.Rmd
    Deleted:    analysis/DLPFC-12.Rmd
    Deleted:    analysis/DLPFC-4.Rmd
    Modified:   analysis/DLPFC.Rmd
    Deleted:    analysis/DLPFC1.Rmd
    Deleted:    analysis/DLPFC10.Rmd
    Deleted:    analysis/DLPFC2.Rmd
    Deleted:    analysis/DLPFC3.Rmd
    Deleted:    analysis/DLPFC4.Rmd
    Deleted:    analysis/DLPFC5.Rmd
    Deleted:    analysis/DLPFC6.Rmd
    Deleted:    analysis/DLPFC7.Rmd
    Deleted:    analysis/DLPFC8.Rmd
    Deleted:    analysis/DLPFC9.Rmd
    Deleted:    analysis/Du1.Rmd
    Deleted:    analysis/Du10.Rmd
    Deleted:    analysis/Du11.Rmd
    Deleted:    analysis/Du12.Rmd
    Deleted:    analysis/Du13.Rmd
    Deleted:    analysis/Du14.Rmd
    Deleted:    analysis/Du15.Rmd
    Deleted:    analysis/Du16.Rmd
    Deleted:    analysis/Du17.Rmd
    Deleted:    analysis/Du18.Rmd
    Deleted:    analysis/Du19.Rmd
    Deleted:    analysis/Du2.Rmd
    Deleted:    analysis/Du20.Rmd
    Deleted:    analysis/Du3.Rmd
    Deleted:    analysis/Du4.Rmd
    Deleted:    analysis/Du5.Rmd
    Deleted:    analysis/Du6.Rmd
    Deleted:    analysis/Du7.Rmd
    Deleted:    analysis/Du8.Rmd
    Deleted:    analysis/Du9.Rmd
    Modified:   analysis/Giotto.Rmd
    Modified:   analysis/MERFISH.Rmd
    Deleted:    analysis/MERFISH1a (copy).Rmd
    Deleted:    analysis/MERFISH1a.Rmd
    Deleted:    analysis/MERFISH1b (copy).Rmd
    Deleted:    analysis/MERFISH1b.Rmd
    Deleted:    analysis/MERFISH2a (copy).Rmd
    Deleted:    analysis/MERFISH2a.Rmd
    Deleted:    analysis/MERFISH2b (copy).Rmd
    Deleted:    analysis/MERFISH2b.Rmd
    Deleted:    analysis/MERFISH3a (copy).Rmd
    Deleted:    analysis/MERFISH3a.Rmd
    Deleted:    analysis/MERFISH3b (copy).Rmd
    Deleted:    analysis/MERFISH3b.Rmd
    Deleted:    analysis/MERFISH4a (copy).Rmd
    Deleted:    analysis/MERFISH4a.Rmd
    Deleted:    analysis/MERFISH4b (copy).Rmd
    Deleted:    analysis/MERFISH4b.Rmd
    Modified:   analysis/Prostate.Rmd
    Deleted:    analysis/STARmap.Rmd
    Modified:   analysis/Seurat.Rmd
    Deleted:    analysis/Simulation.Rmd
    Deleted:    analysis/Single-cell.Rmd
    Modified:   analysis/SpatialExperiment.Rmd
    Modified:   analysis/VisiumHD.Rmd
    Modified:   code/VisiumHD_CRC_download.sh
    Deleted:    data/Pathology.csv
    Deleted:    data/merfish.Rmd
    Deleted:    data/vis.R

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

There are no past versions. Publish this analysis with wflow_publish() to start tracking its development.

Single-cell data

For this tutorial, we will be analyzing the a dataset of Peripheral Blood Mononuclear Cells (PBMC) freely available from 10X Genomics. There are 2,700 single cells that were sequenced on the Illumina NextSeq 500. The raw data can be found here.

We start by reading in the data. The Read10X() function reads in the output of the cellranger pipeline from 10X, returning a unique molecular identified (UMI) count matrix. The values in this matrix represent the number of molecules for each feature (i.e. gene; row) that are detected in each cell (column). Note that more recent versions of cellranger now also output using the h5 file format, which can be read in using the Read10X_h5() function in Seurat.

We next use the count matrix to create a Seurat object. The object serves as a container that contains both data (like the count matrix) and analysis (like PCA, or clustering results) for a single-cell dataset. For more information, check out our [Seurat object interaction vignette], or our GitHub Wiki. For example, in Seurat v5, the count matrix is stored in pbmc[[“RNA”]]$counts.

library(dplyr)
library(Seurat)
library(patchwork)
library(KODAMA)
library(KODAMAextra)


# Load the PBMC dataset
pbmc.data <- Read10X(data.dir = "../singlecell_tutorial/filtered_gene_bc_matrices/hg19/")
# Initialize the Seurat object with the raw (non-normalized data).
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
pbmc

An object of class Seurat 
13714 features across 2700 samples within 1 assay 
Active assay: RNA (13714 features, 0 variable features)
 1 layer present: counts

QC and selecting cells for further analysis Seurat allows you to easily explore QC metrics and filter cells based on any user-defined criteria. A few QC metrics commonly used by the community include

The number of unique genes detected in each cell. Low-quality cells or empty droplets will often have very few genes Cell doublets or multiplets may exhibit an aberrantly high gene count Similarly, the total number of molecules detected within a cell (correlates strongly with unique genes) The percentage of reads that map to the mitochondrial genome Low-quality / dying cells often exhibit extensive mitochondrial contamination We calculate mitochondrial QC metrics with the PercentageFeatureSet() function, which calculates the percentage of counts originating from a set of features We use the set of all genes starting with MT- as a set of mitochondrial genes # The [[ operator can add columns to object metadata. This is a great place to stash QC stats

pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")

Where are QC metrics stored in Seurat?

In the example below, we visualize QC metrics, and use these to filter cells.

We filter cells that have unique feature counts over 2,500 or less than 200 We filter cells that have >5% mitochondrial counts

pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)

Normalizing the data After removing unwanted cells from the dataset, the next step is to normalize the data. By default, we employ a global-scaling normalization method “LogNormalize” that normalizes the feature expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and log-transforms the result. In Seurat v5, Normalized values are stored in pbmc[[“RNA”]]$data.

pbmc <- NormalizeData(pbmc)

While this method of normalization is standard and widely used in scRNA-seq analysis, global-scaling relies on an assumption that each cell originally contains the same number of RNA molecules. We and others have developed alternative workflows for the single cell preprocessing that do not make these assumptions. For users who are interested, please check out our SCTransform() normalization workflow. The method is described in ourpaper, with a separate vignette using Seurat here. The use of SCTransform replaces the need to run NormalizeData, FindVariableFeatures, or ScaleData (described below.)

Identification of highly variable features (feature selection) We next calculate a subset of features that exhibit high cell-to-cell variation in the dataset (i.e, they are highly expressed in some cells, and lowly expressed in others). We and others have found that focusing on these genes in downstream analysis helps to highlight biological signal in single-cell datasets.

Our procedure in Seurat is described in detail here, and improves on previous versions by directly modeling the mean-variance relationship inherent in single-cell data, and is implemented in the FindVariableFeatures() function. By default, we return 2,000 features per dataset. These will be used in downstream analysis, like PCA.

pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)

Scaling the data Next, we apply a linear transformation (‘scaling’) that is a standard pre-processing step prior to dimensional reduction techniques like PCA. The ScaleData() function:

Shifts the expression of each gene, so that the mean expression across cells is 0 Scales the expression of each gene, so that the variance across cells is 1 This step gives equal weight in downstream analyses, so that highly-expressed genes do not dominate The results of this are stored in pbmc[[“RNA”]]$scale.data By default, only variable features are scaled. You can specify the features argument to scale additional features

all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)

Perform linear dimensional reduction Next we perform PCA on the scaled data. By default, only the previously determined variable features are used as input, but can be defined using features argument if you wish to choose a different subset (if you do want to use a custom subset of features, make sure you pass these to ScaleData first).

For the first principal components, Seurat outputs a list of genes with the most positive and negative loadings, representing modules of genes that exhibit either correlation (or anti-correlation) across single-cells in the dataset.

pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))

pbmc=RunKODAMAmatrix(pbmc,reduction="pca",dims=20,n.cores=4)

Calculating Network

socket cluster with 4 nodes on host 'localhost'
================================================================================
Finished parallel computation

[1] "Calculation of dissimilarity matrix..."
================================================================================

pbmc=RunKODAMAvisualization(pbmc)

pbmc <- FindNeighbors(pbmc, dims=1:2, reduction="KODAMA")
pbmc <- FindClusters(pbmc, resolution = 0.05)

Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 2638
Number of edges: 55505

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9914
Number of communities: 7
Elapsed time: 0 seconds

DimPlot(pbmc, reduction = "KODAMA")

sessionInfo()

R version 4.4.3 (2025-02-28)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] KODAMAextra_1.2    e1071_1.7-16       doParallel_1.0.17  iterators_1.0.14  
 [5] foreach_1.5.2      KODAMA_3.0         Matrix_1.7-3       umap_0.2.10.0     
 [9] Rtsne_0.17         minerva_1.5.10     patchwork_1.3.0    Seurat_5.2.1      
[13] SeuratObject_5.0.2 sp_2.2-0           dplyr_1.1.4        workflowr_1.7.1   

loaded via a namespace (and not attached):
  [1] RColorBrewer_1.1-3     rstudioapi_0.17.1      jsonlite_2.0.0        
  [4] magrittr_2.0.3         spatstat.utils_3.1-3   farver_2.1.2          
  [7] rmarkdown_2.29         fs_1.6.5               vctrs_0.6.5           
 [10] ROCR_1.0-11            spatstat.explore_3.4-2 askpass_1.2.1         
 [13] htmltools_0.5.8.1      sass_0.4.9             sctransform_0.4.1     
 [16] parallelly_1.43.0      KernSmooth_2.23-26     bslib_0.9.0           
 [19] htmlwidgets_1.6.4      ica_1.0-3              plyr_1.8.9            
 [22] plotly_4.10.4          zoo_1.8-13             cachem_1.1.0          
 [25] misc3d_0.9-1           whisker_0.4.1          igraph_2.1.4          
 [28] mime_0.13              lifecycle_1.0.4        pkgconfig_2.0.3       
 [31] R6_2.6.1               fastmap_1.2.0          fitdistrplus_1.2-2    
 [34] future_1.34.0          shiny_1.10.0           digest_0.6.37         
 [37] colorspace_2.1-1       ps_1.9.0               rprojroot_2.0.4       
 [40] tensor_1.5             RSpectra_0.16-2        irlba_2.3.5.1         
 [43] labeling_0.4.3         progressr_0.15.1       spatstat.sparse_3.1-0 
 [46] httr_1.4.7             polyclip_1.10-7        abind_1.4-8           
 [49] compiler_4.4.3         proxy_0.4-27           withr_3.0.2           
 [52] fastDummies_1.7.5      R.utils_2.13.0         MASS_7.3-65           
 [55] openssl_2.3.2          tools_4.4.3            lmtest_0.9-40         
 [58] httpuv_1.6.15          future.apply_1.11.3    goftest_1.2-3         
 [61] R.oo_1.27.0            glue_1.8.0             callr_3.7.6           
 [64] nlme_3.1-168           promises_1.3.2         grid_4.4.3            
 [67] getPass_0.2-4          cluster_2.1.8.1        reshape2_1.4.4        
 [70] snow_0.4-4             generics_0.1.3         gtable_0.3.6          
 [73] spatstat.data_3.1-6    R.methodsS3_1.8.2      class_7.3-23          
 [76] tidyr_1.3.1            data.table_1.17.0      spatstat.geom_3.3-6   
 [79] RcppAnnoy_0.0.22       ggrepel_0.9.6          RANN_2.6.2            
 [82] pillar_1.10.1          stringr_1.5.1          spam_2.11-1           
 [85] RcppHNSW_0.6.0         later_1.4.1            splines_4.4.3         
 [88] lattice_0.22-7         survival_3.8-3         deldir_2.0-4          
 [91] tidyselect_1.2.1       Rnanoflann_0.0.3       miniUI_0.1.1.1        
 [94] pbapply_1.7-2          knitr_1.50             git2r_0.33.0          
 [97] gridExtra_2.3          scattermore_1.2        xfun_0.51             
[100] matrixStats_1.5.0      stringi_1.8.7          lazyeval_0.2.2        
[103] yaml_2.3.10            evaluate_1.0.3         codetools_0.2-20      
[106] tcltk_4.4.3            tibble_3.2.1           cli_3.6.4             
[109] uwot_0.2.3             xtable_1.8-4           reticulate_1.42.0     
[112] munsell_0.5.1          processx_3.8.6         jquerylib_0.1.4       
[115] Rcpp_1.0.14            doSNOW_1.0.20          globals_0.16.3        
[118] spatstat.random_3.3-3  png_0.1-8              spatstat.univar_3.1-2 
[121] ggplot2_3.5.1          dotCall64_1.2          listenv_0.9.1         
[124] viridisLite_0.4.2      scales_1.3.0           ggridges_0.5.6        
[127] purrr_1.0.4            rlang_1.1.5            cowplot_1.1.3