Integrating STARmap spatial transcriptomic and scRNA datasets using UINMF

Here we demonstrate integrating a scRNA (Dropviz) dataset of the mouse frontal cortex and a spatial transcriptomics (STARmap) dataset of the same region. For this tutorial, we prepared the two datasets at a Dropbox folder

The scRNA mouse dataset Dropviz_starmap_vig.RDS is 28,366 genes by 71,639 cells.
The spatial transcriptomic STARmap dataset STARmap_vig.RDS is 28 genes by 31,294 cells.
The annotation data of the Dropviz mouse data Dropviz_general_annotations.RDS.
The 3D coordinate data of the STARmap dataset STARmap_3D_Annotations.RDS

Step 1: Load the data

The datasets provided are presented in “dgCMatrix” class, which is a common form of sparse matrix in R. We can then create a liger object with a named list of the matrices. The unshared features should not be subsetted out, or submitted separately. Rather, they should be included in the matrix submitted for that dataset. This helps ensure proper normalization. Different from other UINMF tutorials, we set the modality of each dataset with argument modal, in this way, a slot for spatial coordinates is preserved for the STARmap dataset.

library(rliger)

dropviz <- readRDS("starmap_data/Dropviz_starmap_vig.RDS")
starmap <- readRDS("starmap_data/STARmap_vig.RDS")

lig <- createLiger(list(starmap = starmap, dropviz = dropviz),
                   modal = c("spatial", "rna"))

Step 2: Preprocessing

2.1 Normalization

Liger simply normalizes the matrices by library size, without multiplying a scale factor or applying “log1p” transformation.

lig <- normalize(lig)

2.2 Select variable genes

Given that the STARmap dataset has only 28 genes available and these are also presented in the Dropviz dataset, we directly set all of them as shared variable genes. Please use varFeatures<-() method for accessing the shared variable genes. For the Dropviz dataset, we further select variable genes from the unshared part of it. To enable selection of unshared genes, users need to specify the name(s) of dataset(s) where unshared genes should be chosen from to useUnsharedDatasets.

lig <- selectGenes(lig, thresh = 0.3, useUnsharedDatasets = "dropviz", unsharedThresh = 0.3)

## ℹ Selecting variable features for dataset "starmap"

## ✔ ... 0 features selected out of 28 shared features.

## ℹ Selecting variable features for dataset "dropviz"

## ✔ ... 19 features selected out of 28 shared features.

## ✔ ... 2775 features selected out of 28338 unshared features.

## ✔ Finally 19 shared variable features are selected.

varFeatures(lig) <- rownames(dataset(lig, "starmap"))

2.3 Scale not center

Then we scale the dataset. Three new matrices will be created under the hook, two containing the 28 shared variable genes for both datasets, and one for the unshared variable genes of the Dropviz data. Note that LIGER does not center the data before scaling because iNMF/UINMF methods require non-negative input.

lig <- scaleNotCenter(lig)

Step 3: Joint Matrix Factorization

Unshared Integrative Non-negative Matrix Factorization (UINMF) can be applied with runIntegration(..., method = "UINMF"). A standalone function runUINMF() is also provided with more detailed documentation and initialization setup. This step produces factor gene loading matrices of the shared genes: \(W\) for shared information across datasets, and \(V\) for dataset specific information. Specific to UINMF method, additional factor gene loading matrix of unshared features, \(U\) is also produced. \(H\) matrices, the cell factor loading matrices are produced for each dataset and can be interpreted as low-rank representation of the cells.

In this tutorial, we set dataset specific lambda (regularization parameter) values to penalize the dataset specific effect differently.

Another noteworthy advantage of UINMF is that we are able to use a larger number of factors than there are shared features. We captilize on this by changing the default value of k to 40.

lig <- runUINMF(lig, k = 40, lambda = c(10, 1))

Step 4: Quantile Normalization and Joint Clustering

4.1 Quantile normalization

The default reference dataset for quantile normalization is the larger dataset, but users should select the higher quality dataset as the reference dataset, even if it is the smaller dataset. In this case, the Dropviz dataset is considered higher quality than the STARmap dataset, so we set the Dropviz dataset to be the reference dataset.

After this step, the low-rank cell factor loading matrices, \(H\), are aligned and ready for cluster definition.

lig <- quantileNorm(lig, reference = "dropviz")

4.2 Leiden clustering

With the aligned cell factor loading information, we next apply Leiden community detection algorithm to identify cell clusters.

lig <- runCluster(lig)

Step 5: Visualizations and Downstream processing

5.1 Dimensionality reduction

We create a UMAP with the quantile normalized cell factor loading.

lig <- runUMAP(lig)

5.2 Plot UMAP

Next, we can visualize our factorized object by dataset to check the integration between datasets, and also the cluster labeling on the global population.

plotDatasetDimRed(lig)

plotClusterDimRed(lig, legendNCol = 2)

We can also use the Dropviz labels to help us annotate the clusters. Link to the downloading page of the labeling data is provided at top of this page.

mouse_annies <- readRDS("starmap_data/Dropviz_general_annotations.RDS")
dropviz_id <- paste0("dropviz_", names(mouse_annies))
cellMeta(lig, "dropviz_ann", cellIdx = dropviz_id) <- factor(mouse_annies)
plotClusterDimRed(lig, "dropviz_ann", legendNCol = 1)

cellMeta()<- method has the feature for partial insertion implemented and is great for adding dataset specific metadata to a new variable. If you are unsure about the order between given variable and the cells in the object, argument cellIdx can be used for precise locating. Users might use different strategies when creating the index depending on different situation.

Step 6: Visualization with spatial information

Here we demonstrate how to use the annotation labels derived from the above analysis within the context of 3D space. We provide the annotation labels for the sake of simplicity. The link to the download page is provided at the top of this page. These labels were generated using the high quality Dropviz annotations to re-annotate the STARmap cells after completing the above analysis. We have provided the exact annotations and colors used in the publication such that the interested user may captilize on the 3D dimensional sample space.

starmap_annies <- readRDS("starmap_data/STARmap_3D_Annotations.RDS")
dplyr::glimpse(starmap_annies)

## Rows: 29,163
## Columns: 9
## $ Cell_Barcode       <chr> "starmap_11", "starmap_30", "starmap_51", "starmap_…
## $ Cluster            <fct> 15, 15, 15, 13, 15, 15, 13, 13, 15, 15, 1, 1, 15, 1…
## $ Cluster_Annotation <chr> "Layer 5A", "Layer 5A", "Layer 5A", "Layer 5A", "La…
## $ Coord1             <int> 12, 46, 81, 87, 92, 105, 117, 123, 128, 133, 170, 1…
## $ Coord2             <int> 937, 152, 76, 967, 948, 951, 284, 1004, 217, 319, 1…
## $ Coord3             <int> 8, 12, 7, 8, 8, 6, 7, 6, 6, 9, 6, 6, 7, 9, 6, 6, 6,…
## $ Cell_Type_Color    <chr> "plum", "plum", "plum", "plum", "plum", "plum", "pl…
## $ General_Class      <chr> "Neuron", "Neuron", "Neuron", "Neuron", "Neuron", "…
## $ Sub_Color          <chr> "#0066FF", "#0066FF", "#0066FF", "#0066FF", "#0066F…

6.1 Manage ligerSpatialDataset class (Optional)

Recall that we set the “modal” of the STARmap dataset to “spatial” when we created the liger object, here we can insert the coordinate information into the object.

coords <- as.matrix(starmap_annies[, c("Coord1", "Coord2", "Coord3")])
rownames(coords) <- starmap_annies$Cell_Barcode
coordinate(lig, "starmap") <- coords

LIGER does not yet have any analysis method that incorporate the spatial coordinate information. Given that most of the spatial transcriptomics technologies only support 2D spatial information, LIGER only has 2D plotting (See plotSpatial2D()) method implemented for the spatial coordinate at this point.

6.2 3D visualization

The STARmap technology we work with in this tutorial is special for 3D information. LIGER does not provide any function for the visualization of 3D coordinates, but here we provide a simple guide on it using package rgl.

The code chunk below shows how to create the 3D view of all cells.

library(rgl)

# Set the perspective parameters
rgl.viewpoint(theta = 25, phi = 5, zoom = 0.7)
# Generate scatter plot in 3D space
plot3d(x = starmap_annies$Coord1, 
       y = starmap_annies$Coord2, 
       z = starmap_annies$Coord3, 
       xlab = "", ylab = "", zlab = "",
       col = starmap_annies$Cell_Type_Color, 
       size = 2, axes= FALSE, labels = FALSE)
aspect3d(1.7, 1.4, 0.1)
first = c(0)
second = c(0)
# Set background color to black
bg3d("black", labels = FALSE)
axes3d(edges = "bbox",col = 'white', labels = FALSE, tick = FALSE)

As we can see above, cells of different types can be located at different depth (z-axis) and can interfere the observation of each other. We next show how to visualize with only neuronal cells.

# Subset annotation data to only Neurons
neuronal_cells <- starmap_annies[starmap_annies$General_Class == "Neuron",]
# Create 3D scatter plot with the subset
plot3d(x = neuronal_cells$Coord1, 
       y = neuronal_cells$Coord2, 
       z = neuronal_cells$Coord3, 
       xlab = "", ylab = "", zlab = "", 
       col = neuronal_cells$Sub_Color, 
       size = 3, axes = FALSE, labels = FALSE)
aspect3d(1.7, 1.4, 0.1)
first = c(0)
second = c(0)
bg3d("black", labels = FALSE)
axes3d(edges = "bbox", col = 'white', labels = FALSE, tick = FALSE)

April Kriebel and Joshua Welch

12/03/2021