UINMF integration of Dual-omics data

Here we integrate the scATAC and scRNA reads from the dual-omics dataset SNARE-seq as an illustration of how UINMF can be used to integrate cross-modality data. For this tutorial, we will use three matrices, which can all be downloaded from our Dropbox folder

The transcriptomic measures SNAREseq_RNA.RDS is the SNARE-seq scRNA dataset (31,367 genes by 10,309 cells).
For the shared epigenomic features SNARE_seq_shared_chromatin_features.RDS, we create a gene-centric matrix, such that we sum of the number of accessibiltiy peaks that occur over the gene body and promoter regions for each gene. For a detailed walkthough of how to generate such a matrix, please see our Integrating scRNA and scATAC data vignette. The resulting matrix of gene-centric chromatin accessibility is 22,379 genes by 10,309 cells
For the unshared epigenomic features, we binned the genome into bins of 100,000 bp, and summed the number of peaks occuring in each bin. We then filtered this matrix for all bins that overlapped with ENCODE Blacklist regions, genes, and promoters. Our filtered matrix SNARE_seq_unshared_chromatin_features.RDS is 10,309 cells by 7,437 bins.

Step 1: Load data

library(rliger)

rna <- readRDS("SNAREseq_data/SNAREseq_RNA.RDS")
atac_shared <- readRDS("SNAREseq_data/SNAREseq_chromatin_accessibility_shared.RDS")
atac_unshared <- readRDS("SNAREseq_data/SNARE_seq_unshared_chromatin_features.RDS")

Step 2: Preprocessing

Integrative non-negative matrix factorization with unshared features (UINMF) performs factorization using shared feature matrix of all datasets and includes unshared features from dataset(s) with such information. In this tutorial, we plan to use the variable feature selected from the scRNA dataset. Since we prepared the gene-centric matrix from the scATAC dataset, these genes can be accounted as the shared features. For the unshared features, normally we select variable genes that are out of the intersection of gene sets from all datasets. In this tutorial, we directly use the peaks that are identified variable. Therefore, special processing will be needed compared to the other UINMF tutorials.

2.1: Create liger object for shared genes

lig <- createLiger(list(rna = rna, atac = atac_shared))

2.2: Normalize and select shared variable features

lig <- normalize(lig) %>%
    selectGenes(useDatasets = "rna", thresh = 0.1) %>%
    scaleNotCenter()

Step 3: Selecting the unshared features

When selecting unshared features for the UINMF integration, it is critical to consider the type of data you are working with. For unshared features that gene-centric, the user should follow the feature selection process outlined in the ‘Integrating unshared features with UINMF’ tutorial.

However, when dealing with features that are not gene-centric (as is the case with our binned intergenic peak counts), we must select the most variable features using an appropriate method. Here, we opt to use Seurat’s FindVariableFeatures() function.

First, we normalize the unshared features:

unshareNormed <- normalize(atac_unshared)

Then we select the top 2,000 variable features with Seurat VST method:

library(Seurat)

se <- CreateSeuratObject(unshareNormed)
se <- FindVariableFeatures(se, selection.method = "vst", nfeatures = 2000)
top2000 <- VariableFeatures(se)

Then we scale, but do not center the unshared features

unshareScaled <- scaleNotCenter(unshareNormed[top2000,])

Add the unshared features that have been properly selected, such that they are added as a genes by cells matrix.

varUnsharedFeatures(lig, "atac") <- top2000
scaleUnsharedData(lig, "atac") <- unshareScaled

Step 4: Joint Matrix Factorization

To factorize the datasets and include the unshared datasets, set useUnshared = TRUE.

lig <- runUINMF(lig, k = 30, nIteration = 30)

Step 5: Quantile Normalization and Joint Clustering

After factorization, the resulting Liger object can be used in all downstream LIGER functions without adjustment. The default reference dataset for quantile normalization is the larger dataset, but the user should select the higher quality dataset as the reference dataset, even if it is the smaller dataset.

lig <- quantileNorm(lig)
lig <- runCluster(lig, resolution = 0.8, nNeighbors = 30)

Step 6: Visualizations and Downstream processing

lig <- runUMAP(lig, nNeighbors = 30)

Next, we can visualize our returned factorized object by dataset to check the alignment between datasets, as well as by cluster determined in the factorization.

options(ligerDotSize = 0.5)
plotDatasetDimRed(lig)

plotClusterDimRed(lig, legendNCol = 2)

April Kriebel and Joshua Welch

04/25/2023