Skip to contents

This guide will demonstrate the usage of the Liger package in the style of the R Console, which can be accessed through an R development environment (e.g., RStudio) or directly from the R command line.

Stage I: Preprocessing and Normalization

1. Loading data

For the first portion of this protocol, we will be integrating data from control and interferon-stimulated PBMCs from Kang et al, 2017. The data can be found in the Gene Expression Omnibus, Series GSE96583. This dataset was originally in the form of output from the 10X Cellranger pipeline, though we will directly load downsampled versions of the control and stimulated DGEs here.

For convenience, we have prepared the pre-processed data which are ready to use. There are two datasets: PBMC control and PBMC interferon-stimulated. We provided ready-to-use liger object, which can be easily loaded with importPBMC().

library(rliger)
pbmcLiger <- importPBMC()

For creating a liger object from raw counts data or any other types of source, please refer to the detailed tutorial for importing data.

2. Preprocess

Before we can run iNMF on our datasets, we must run several preprocessing steps to normalize expression data to account for differences in sequencing depth and efficiency between cells, identify variably expressed genes, and scale the data so that each gene has the same variance. Note that because nonnegative matrix factorization requires positive values, we do not center the data by subtracting the mean. We also do not log transform the data.

pbmcLiger <- pbmcLiger %>%
    normalize() %>%
    selectGenes() %>%
    scaleNotCenter()

Stage II: Integration with Joint Matrix Factorization

3. Determine parameters and perform iNMF integration

We are now able to run integrative non-negative matrix factorization (iNMF) on the normalized and scaled datasets. The key parameter for this analysis is k, the number of matrix factors (analogous to the number of principal components in PCA). In general, we find that a value of k between 20 and 40 is suitable for most analyses and that results are robust for choice of k. Because LIGER is an unsupervised, exploratory approach, there is no single “right” value for k. In practice, users choose k from a combination of biological prior knowledge and other information. For this tutorial, we set k = 20.

pbmcLiger <- runIntegration(pbmcLiger, k = 20)

Starting from rliger 2.0.0, we use an optimized implementation of iNMF. Here we deprecated the parameter thresh which stands for a convergence detecter in order to speed up each algorithm iteration by omitting the calculation of objective error.

The factorization yields several lower dimension matrices, including the \(H\) matrices of metagene loadings for each cell, the \(W\) matrix of shared factor loadings and the \(V\) matrices of dataset-specific factor loadings. Please refer to liger object documentation for how to access them.

The time consumption of this step is dependent of the size of the datasets, in terms of number of cells, number of variable genes selected, and the value of k. The implementation supports OpenMP multi-threading, and there for using a machine with a number of cores allocated helps speeding it up.

Stage III: Quantile Normalization and Joint Clustering

4. Align the factors

We can now use the resulting factors to jointly cluster cells and perform quantile normalization by dataset, factor, and cluster to fully integrate the datasets. All of this functionality is encapsulated within the quantileNorm() function, which uses max factor assignment followed by refinement using a k-nearest neighbors graph.

pbmcLiger <- quantileNorm(pbmcLiger)

5. Clustering

The quantileNorm() procedure produces joint clustering assignments and a low-dimensional representation that integrates the datasets together. These joint clusters directly from iNMF can be used for downstream analyses (see below). Alternatively, you can also run Louvain community detection, an algorithm commonly used for single-cell data, on the normalized cell factors. The Louvain algorithm excels at merging small clusters into broad cell classes and thus may be more desirable in some cases than the maximum factor assignments produced directly by iNMF.

pbmcLiger <- runCluster(pbmcLiger, resolution = 0.25, nNeighbors = 30)

Starting from rliger 2.0.0, cluster labeling will be stored in cell metadata, which can be accessed with cellMeta(pbmcLiger). Use argument clusterName to specify unique variable names for the result can enable storing multiple cluster labeling variables at the same time.

Stage IV: Visualization and Downstream Analysis

6. Generate dimensionality reduced embedding

To visualize the clustering of cells graphically, we can project the normalized cell factors to two or three dimensions. LIGER supports both UMAP and t-SNE for this purpose.

pbmcLiger <- runUMAP(pbmcLiger, n_neighbors = 30, min_dist = 0.3)

Starting from rliger 2.0.0, the slot for storing dimensionality reduction matrices will be renamed to “dimReds”. It will be a list that can hold multiple low dimensional matrices that match to the dataset by cell identifiers. Users can access individual matrix with dimRed(object, "name"). Use argument dimredName to specify unique names for the UMAP result so that it allows storing multiple low-dimensional representation matrices at the same time.

7. Create plots

We provide a variety of utilities for visualization and analysis of clustering, gene expression across datasets, and comparisons of cluster assignments. Here we demonstrate several commonly used examples.

plotByDatasetAndCluster() returns two graphs, generated by t-SNE or UMAP in the previous step. The first colors cells by dataset of origin, and the second by cluster as determined by previous clustering step. The plots provide visual confirmation that the datasets are well aligned and the clusters are consistent with the shape of the data as revealed by UMAP.

The two subplots can individually be generated with plotDatasetDimRed() and plotClusterDimRed(), respectively.

To directly study the impact of factors on the clustering and determine what genes load most highly on each factor, we use the plotGeneLoadings() function, which returns plots of factor loading on the dimensionality reduction and highly loaded genes by dataset for each factor.

factorMarkers <- getFactorMarkers(pbmcLiger, dataset1 = "ctrl", dataset2 = "stim")
plotGeneLoadings(pbmcLiger, markerTable = factorMarkers, useFactor = 11)

8. Differential expression

Using the runMarkerDEG() function, we can next identify gene markers for all clusters. We can also compare expression within each cluster across datasets, which in this case reveals markers of interferon-beta stimulation. The function returns a table of data that allows us to determine the significance of each gene’s differential expression, including log fold change, area under the curve (auc) and p-value.

The default parameters performs Wilcoxon rank-sum test at a cluster level:

cluster.results <- runMarkerDEG(pbmcLiger)
head(cluster.results)
##     feature group     avgExpr        logFC statistic       auc          pval
## 1 LINC00115     0  0.06557615 -0.006707732  19046596 0.4998280  7.902339e-01
## 2     NOC2L     0  0.42228382 -0.916821744  17926394 0.4704313  1.876299e-33
## 3    KLHL17     0  0.01232972 -0.009023670  19042684 0.4997253  4.105767e-01
## 4   PLEKHN1     0  0.18811948  0.141802251  19232439 0.5047050  3.185813e-11
## 5      HES4     0  4.78053287  3.500041958  23275316 0.6107997 4.655312e-241
## 6     ISG15     0 13.53789841  2.106235323  25048607 0.6573351 7.640631e-185
##            padj      pct_in    pct_out
## 1  8.228727e-01  0.43243243  0.4660647
## 2  1.603211e-32  2.81081081  8.5542286
## 3  4.708915e-01  0.08108108  0.1359355
## 4  1.232679e-10  1.24324324  0.3010001
## 5 1.816141e-239 30.48648649  8.1270026
## 6 2.512153e-183 71.37837838 64.4819885

Alternatively, it is also helpful to identify dataset specific markers within each cluster. For example in this tutorial, we can split data by cluster labeling, and within a cluster, find the markers from interferon-stimulated cells.

datasets.results <- runMarkerDEG(pbmcLiger, conditionBy = "dataset", splitBy = "leiden_cluster")
head(datasets.results$`0`) # Note that the first cluster is "0"
##     feature group    avgExpr        logFC statistic          auc          pval
## 1 LINC00115  ctrl 0.07183401   0.01158862 1701494.0 0.5003517031  7.453285e-01
## 2     NOC2L  ctrl 0.66393901   0.44750960 1750817.0 0.5148559253  5.085806e-08
## 3    KLHL17  ctrl 0.02680373   0.02680373 1703295.0 0.5008813161  6.054223e-02
## 4   PLEKHN1  ctrl 0.09765942  -0.16751863 1681486.5 0.4944681756  2.474522e-03
## 5      HES4  ctrl 0.18904920  -8.50274754  775284.5 0.2279848885 3.942972e-269
## 6     ISG15  ctrl 6.05199165 -13.86279031    2251.0 0.0006619428  0.000000e+00
##            padj     pct_in     pct_out
## 1  8.278741e-01  0.4700353   0.4004004
## 2  3.231322e-07  4.4065805   1.4514515
## 3  1.220948e-01  0.1762632   0.0000000
## 4  7.671268e-03  0.6462985   1.7517518
## 5 7.147217e-267  1.2338425  55.4054054
## 6  0.000000e+00 37.7790834 100.0000000

The number of significant genes identified by runMarkerDEG() varies and depends on the datasets used. And the raw output of the function contains the statistics of all tested genes in all groups (clusters). In order to pick out the top markers for each cluster, we strongly suggest using package “dplyr”, which provides a user-friendly interface for data table manipulation. The following code chunk first filters the markers which are statistically and biologically significant. For example, we filter the output by taking markers which have padj (Benjamini-Hochberg adjusted p-value) less than 0.05 and logFC (log fold change between observations in group versus out) larger than 3. Then for each cluster, we sort the markers primarily by its padj value in ascending order. Given that mathematically, the lowest padj values are rounded to 0 as they are too small, for genes tying on this metric, we then sort the markers by logFC in descending order. Finally, we select the top 20 markers for each cluster.

library(dplyr)
cluster.results.sort <- cluster.results %>%
    filter(padj < 0.05, logFC > 3) %>%
    group_by(group) %>%
    arrange(padj, -logFC, .by_group = TRUE) %>%
    top_n(20)# rank by logFC from high to low

# Show the markers for cluster 3
cluster.results.sort %>% filter(group == 3)
## # A tibble: 20 × 10
## # Groups:   group [1]
##    feature  group avgExpr logFC statistic   auc      pval      padj pct_in
##    <chr>    <fct>   <dbl> <dbl>     <dbl> <dbl>     <dbl>     <dbl>  <dbl>
##  1 HLA-DRA  3        17.3  8.42 14081152  0.800 0         0           97.9
##  2 CD74     3        18.0  7.01 16134037  0.917 0         0           99.1
##  3 HLA-DPA1 3        14.5  7.76 13686101  0.778 3.53e-302 2.50e-299   87.2
##  4 RPS5     3        17.0  3.20 13721636  0.780 7.87e-260 4.19e-257   98.2
##  5 HLA-DPB1 3        13.6  6.73 13249646. 0.753 2.68e-251 1.27e-248   81.3
##  6 HLA-DRB1 3        15.3  7.22 13192870. 0.750 3.92e-231 1.47e-228   90.1
##  7 CCR7     3        14.0  7.14 12975000. 0.737 5.12e-223 1.72e-220   83.0
##  8 RPL10A   3        16.2  3.85 12668400  0.720 6.70e-163 1.40e-160   95.0
##  9 EEF1B2   3        13.1  4.34 12392988. 0.704 1.99e-150 3.57e-148   80.0
## 10 RPLP0    3        15.5  3.55 12437076. 0.707 1.40e-144 2.23e-142   92.4
## 11 HSP90AB1 3        12.4  3.63 12280914. 0.698 5.75e-142 8.63e-140   74.6
## 12 RPL5     3        14.5  4.01 12272426. 0.698 8.83e-135 1.24e-132   87.2
## 13 RPSA     3        14.8  3.96 12242276. 0.696 1.30e-131 1.71e-129   89.0
## 14 CXCR4    3        12.9  3.83 12106296. 0.688 1.21e-127 1.47e-125   76.2
## 15 RPL7A    3        14.9  3.03 12167762. 0.692 2.55e-124 2.93e-122   89.7
## 16 RAN      3        10.9  3.41 11805844. 0.671 1.96e-112 1.94e-110   66.5
## 17 RPL4     3        13.8  3.24 11883184. 0.675 1.05e-106 8.96e-105   83.2
## 18 NPM1     3        12.2  3.74 11778232. 0.669 7.79e-106 6.58e-104   73.6
## 19 CREM     3        10.1  3.38 11163252. 0.634 5.25e- 74 3.00e- 72   61.7
## 20 GLTSCR2  3        10.2  3.26 11106918. 0.631 3.24e- 69 1.71e- 67   63.4
## # ℹ 1 more variable: pct_out <dbl>

We can then visualize the expression profiles of individual genes, such as the differentially expressed genes that we just identified. This allows us to visually confirm the cluster- or dataset-specific expression patterns of marker genes. plotGeneDimRed() returns graphs of gene loading on the dimensionality reduced graph for each dataset.

plotGeneDimRed(pbmcLiger, "PRF1")

We can also plot the gene expression by dataset.

prf1List <- plotGeneDimRed(pbmcLiger, "PRF1", splitBy = "dataset")
cowplot::plot_grid(plotlist = prf1List, labels = names(prf1List))

We can also use plotGeneDimRed() to compare the loading of cluster markers within and between datasets.

IFIT3 <- plotGeneDimRed(pbmcLiger, "IFIT3", splitBy = "dataset")
IFITM3 <- plotGeneDimRed(pbmcLiger, "IFITM3", splitBy = "dataset")
cowplot::plot_grid(IFIT3[[1]], IFIT3[[2]], IFITM3[[1]], IFITM3[[2]], ncol = 2, labels = c("ctrl", "stim", "ctrl", "stim"))

R Session Info

## R version 4.3.2 RC (2023-10-30 r85440)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Sonoma 14.4.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Detroit
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] dplyr_1.1.4  rliger_2.0.1
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.9          utf8_1.2.4          generics_0.1.3     
##  [4] lattice_0.22-6      digest_0.6.35       magrittr_2.0.3     
##  [7] evaluate_0.23       grid_4.3.2          fastmap_1.1.1      
## [10] jsonlite_1.8.8      Matrix_1.6-5        ggrepel_0.9.5      
## [13] scattermore_1.2     purrr_1.0.2         fansi_1.0.6        
## [16] viridisLite_0.4.2   scales_1.3.0        codetools_0.2-20   
## [19] textshaping_0.3.7   jquerylib_0.1.4     cli_3.6.2          
## [22] rlang_1.1.3         RcppAnnoy_0.0.22    uwot_0.1.16        
## [25] cowplot_1.1.3       munsell_0.5.1       withr_3.0.0        
## [28] RANN_2.6.1          cachem_1.0.8        yaml_2.3.8         
## [31] tools_4.3.2         parallel_4.3.2      memoise_2.0.1      
## [34] colorspace_2.1-0    ggplot2_3.5.0       sccore_1.0.5       
## [37] BiocGenerics_0.48.1 vctrs_0.6.5         R6_2.5.1           
## [40] stats4_4.3.2        lifecycle_1.0.4     S4Vectors_0.40.2   
## [43] fs_1.6.3            irlba_2.3.5.1       ragg_1.3.0         
## [46] pkgconfig_2.0.3     leidenAlg_1.1.3     desc_1.4.3         
## [49] pkgdown_2.0.7       pillar_1.9.0        bslib_0.7.0        
## [52] gtable_0.3.4        glue_1.7.0          Rcpp_1.0.12        
## [55] systemfonts_1.0.6   highr_0.10          xfun_0.43          
## [58] tibble_3.2.1        tidyselect_1.2.1    rstudioapi_0.16.0  
## [61] knitr_1.45          farver_2.1.1        htmltools_0.5.8    
## [64] igraph_2.0.3        labeling_0.4.3      rmarkdown_2.26     
## [67] RcppPlanc_1.0.0     compiler_4.3.2