Perform online iNMF on scaled datasets — runOnlineINMF • rliger

Perform online integrative non-negative matrix factorization to represent multiple single-cell datasets in terms of \(H\), \(W\), and \(V\) matrices. It optimizes the iNMF objective function (see runINMF) using online learning (non-negative least squares for \(H\) matrices, and hierarchical alternating least squares (HALS) for \(V\) matrices and \(W\)), where the number of factors is set by k. The function allows online learning in 3 scenarios:

Fully observed datasets;
Iterative refinement using continually arriving datasets;
Projection of new datasets without updating the existing factorization.

All three scenarios require fixed memory independent of the number of cells.

For each dataset, this factorization produces an \(H\) matrix (k by cell), a \(V\) matrix (genes by k), and a shared \(W\) matrix (genes by k). The \(H\) matrices represent the cell factor loadings. \(W\) is identical among all datasets, as it represents the shared components of the metagenes across datasets. The \(V\) matrices represent the dataset-specific components of the metagenes.
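
To make the roles of the three matrices concrete, the sketch below evaluates the iNMF objective in its commonly written form, \(\sum_i \|E_i - (W + V_i)H_i\|_F^2 + \lambda \sum_i \|V_i H_i\|_F^2\), on small random matrices using base R only. The helper `inmfObjective` and the toy inputs are hypothetical names introduced for illustration; they are not part of rliger, and runINMF remains the authoritative description of the objective.

```r
# Illustrative only: evaluate the iNMF objective on toy data with base R.
# `inmfObjective` is a hypothetical helper, not an rliger function.
inmfObjective <- function(E, W, V, H, lambda = 5) {
    stopifnot(length(E) == length(V), length(E) == length(H))
    total <- 0
    for (i in seq_along(E)) {
        # Reconstruction error: shared (W) plus dataset-specific (V_i) metagenes
        recon <- (W + V[[i]]) %*% H[[i]]
        total <- total + sum((E[[i]] - recon)^2)
        # Penalty on the dataset-specific component, weighted by lambda
        total <- total + lambda * sum((V[[i]] %*% H[[i]])^2)
    }
    total
}

set.seed(1)
m <- 30; k <- 4
nCells <- c(ctrl = 50, stim = 40)
W <- matrix(runif(m * k), m, k)                              # genes by k, shared
V <- lapply(nCells, function(n) matrix(runif(m * k), m, k))  # genes by k, per dataset
H <- lapply(nCells, function(n) matrix(runif(k * n), k, n))  # k by cell, per dataset
# Toy data that the factors reconstruct exactly, so only the penalty remains
E <- lapply(names(nCells), function(d) (W + V[[d]]) %*% H[[d]])
names(E) <- names(nCells)

objLow <- inmfObjective(E, W, V, H, lambda = 1)
objHigh <- inmfObjective(E, W, V, H, lambda = 5)
```

Because the toy data is reconstructed exactly, the objective here reduces to the penalty term alone, which shows how a larger lambda weighs the dataset-specific component more heavily.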

Usage

runOnlineINMF(object, k = 20, lambda = 5, ...)

# S3 method for liger
runOnlineINMF(
  object,
  k = 20,
  lambda = 5,
  newDatasets = NULL,
  projection = FALSE,
  maxEpochs = 5,
  HALSiter = 1,
  minibatchSize = 5000,
  WInit = NULL,
  VInit = NULL,
  AInit = NULL,
  BInit = NULL,
  seed = 1,
  nCores = 2L,
  verbose = getOption("ligerVerbose", TRUE),
  ...
)

# S3 method for Seurat
runOnlineINMF(
  object,
  k = 20,
  lambda = 5,
  datasetVar = "orig.ident",
  layer = "ligerScaleData",
  assay = NULL,
  reduction = "onlineINMF",
  maxEpochs = 5,
  HALSiter = 1,
  minibatchSize = 5000,
  seed = 1,
  nCores = 2L,
  verbose = getOption("ligerVerbose", TRUE),
  ...
)

Arguments

object: A liger object or Seurat object. Scaled data required.
k: Inner dimension of factorization--number of metagenes. A value in the range 20-50 works well for most analyses. Default 20.
lambda: Regularization parameter. Larger values penalize dataset-specific effects more strongly (i.e. alignment should increase as lambda increases). We recommend always using the default value except possibly for analyses with relatively small differences (biological replicates, male/female comparisons, etc.) in which case a lower value such as 1.0 may improve reconstruction quality. Default 5.0.
...: Arguments passed to other S3 methods of this function.
newDatasets: Named list of dgCMatrix. New datasets for scenario 2 or scenario 3. Default NULL triggers scenario 1.
projection: Whether to perform data integration with scenario 3 when newDatasets is specified. See description. Default FALSE.
maxEpochs: The number of epochs to iterate through. See Details. Default 5.
HALSiter: Maximum number of block coordinate descent (HALS algorithm) iterations to perform for each update of \(W\) and \(V\). Default 1. Changing this parameter is not recommended.
minibatchSize: Total number of cells in each minibatch. See Details. Default 5000.
WInit, VInit, AInit, BInit: Optional initialization for \(W\), \(V\), \(A\), and \(B\) matrices, respectively. Must be supplied all together. See Details. Default NULL.
seed: Random seed to allow reproducible results. Default 1.
nCores: The number of parallel tasks to speed up the computation. Default 2L. Only supported on platforms with OpenMP support.
verbose: Logical. Whether to show progress information. Default getOption("ligerVerbose"), or TRUE if users have not set it.
datasetVar: Metadata variable name that stores the dataset source annotation. Default "orig.ident".
layer: For Seurat >= 4.9.9, the name of the layer from which to retrieve the input non-negative scaled data. Default "ligerScaleData". For older Seurat versions, the data is always retrieved from the scale.data slot.
assay: Name of assay to use. Default NULL uses current active assay.
reduction: Name of the reduction to store result. Also used as the feature key. Default "onlineINMF".

Value

liger method - Returns updated input liger object.
- A list of all \(H\) matrices can be accessed with getMatrix(object, "H")
- A list of all \(V\) matrices can be accessed with getMatrix(object, "V")
- The \(W\) matrix can be accessed with getMatrix(object, "W")
- The intermediate matrices \(A\) and \(B\) produced in the HALS update can be accessed in the same way.
Seurat method - Returns updated input Seurat object.
- \(H\) matrices for all datasets are concatenated and transposed (all cells by k) to form a DimReduc object in the reductions slot, named by the argument reduction.
- The \(W\) matrix is stored as feature.loadings in the same DimReduc object.
- The \(V\) matrices, \(A\) matrices, \(B\) matrices, an objective error value, and the dataset variable used for the factorization are currently stored in the misc slot of the same DimReduc object.
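
As a small illustration of the "concatenated and transposed" layout described for the Seurat method, the base R sketch below builds two toy \(H\) matrices (k by cell) and combines them into a single all-cells-by-k matrix. The toy matrices and the name `cellEmbeddings` are made up for illustration; this shows only the resulting shape, not how the package builds the DimReduc object internally.

```r
# Illustrative only: toy H matrices (k by cell) for two datasets.
set.seed(1)
k <- 3
H <- list(
    ctrl = matrix(runif(k * 4), k, 4,
                  dimnames = list(NULL, paste0("ctrl_cell", 1:4))),
    stim = matrix(runif(k * 2), k, 2,
                  dimnames = list(NULL, paste0("stim_cell", 1:2)))
)
# Concatenate along the cell dimension, then transpose so that rows are cells
cellEmbeddings <- t(do.call(cbind, unname(H)))
dim(cellEmbeddings)
```

Rows of the result follow the dataset order of the list, so cells from "ctrl" come before cells from "stim".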

Details

To perform scenario 2 or 3, a complete factorization result from a run of scenario 1 is required. Given the structure of a liger object, all of the required information can be retrieved automatically. For cases where users need to supply customized information for an existing factorization, the arguments WInit, VInit, AInit and BInit are exposed. The requirements for these arguments are as follows:

WInit - A matrix object of size \(m \times k\). (see runINMF for notation)
VInit - A list object of matrices each of size \(m \times k\). Number of matrices should match with newDatasets.
AInit - A list object of matrices each of size \(k \times k\). Number of matrices should match with newDatasets.
BInit - A list object of matrices each of size \(m \times k\). Number of matrices should match with newDatasets.
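
The shape requirements above can be checked before calling the function. The following is a minimal sketch in base R, assuming \(m\) is the number of genes used in the factorization; `checkInitDims` is a hypothetical helper written for this illustration and is not part of rliger.

```r
# Hypothetical helper (not part of rliger): check the documented shapes of
# user-supplied initial matrices before passing them to runOnlineINMF().
# Assumes m = number of genes, k = number of factors, nNew = number of
# new datasets.
checkInitDims <- function(WInit, VInit, AInit, BInit, m, k, nNew) {
    isDim <- function(x, nr, nc) is.matrix(x) && nrow(x) == nr && ncol(x) == nc
    c(
        WInit = isDim(WInit, m, k),
        VInit = is.list(VInit) && length(VInit) == nNew &&
            all(vapply(VInit, isDim, logical(1), nr = m, nc = k)),
        AInit = is.list(AInit) && length(AInit) == nNew &&
            all(vapply(AInit, isDim, logical(1), nr = k, nc = k)),
        BInit = is.list(BInit) && length(BInit) == nNew &&
            all(vapply(BInit, isDim, logical(1), nr = m, nc = k))
    )
}

# Placeholder matrices of zeros with the expected shapes
m <- 173; k <- 20; nNew <- 1
ok <- checkInitDims(
    WInit = matrix(0, m, k),
    VInit = list(matrix(0, m, k)),
    AInit = list(matrix(0, k, k)),
    BInit = list(matrix(0, m, k)),
    m = m, k = k, nNew = nNew
)
ok
```

The helper returns a named logical vector, so a failing argument can be identified by name before any computation starts.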

Minibatch iterations are performed on small subsets of cells. The exact minibatch size applied to each dataset is minibatchSize multiplied by the proportion of cells in that dataset out of all cells. In general, minibatchSize should be no larger than the number of cells in the smallest dataset (considering both object and newDatasets). Therefore, a smaller value may be necessary when analyzing very small datasets.
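
The proportional split described above can be worked through with simple arithmetic. The sketch below uses made-up cell counts, and the rounding step is an assumption; the package may round differently.

```r
# Illustrative arithmetic only; the cell counts are made-up numbers.
nCells <- c(ctrl = 3000, stim = 9000)
minibatchSize <- 5000
# Share of each minibatch drawn from each dataset, proportional to its size.
# The use of round() is an assumption, not the package's documented behavior.
perDataset <- round(minibatchSize * nCells / sum(nCells))
perDataset
# Rule of thumb from the text: minibatchSize should not exceed the number of
# cells in the smallest dataset
sizeIsSafe <- minibatchSize <= min(nCells)
sizeIsSafe
```

In this toy case the check fails, since 5000 exceeds the 3000 cells in "ctrl", which is the situation where a smaller minibatchSize would be needed.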

An epoch is one complete pass over all cells, which takes a number of minibatch iterations. Therefore, the total number of iterations is determined by the setting of maxEpochs, the total number of cells, and minibatchSize.
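
As a rough worked example of how these three quantities interact, the sketch below estimates the iteration count for made-up numbers. The formula is one plausible way to count iterations and is an assumption; the exact rounding used by the package may differ.

```r
# Illustrative arithmetic only; the cell count is a made-up number.
totalCells <- 12000
minibatchSize <- 5000
maxEpochs <- 5
# Assumption: each iteration consumes minibatchSize cells, and maxEpochs
# passes over all cells are made in total.
totalIter <- floor(maxEpochs * totalCells / minibatchSize)
totalIter
```

Halving minibatchSize or doubling maxEpochs would each roughly double this estimate.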

Currently, the Seurat S3 method does not support scenarios 2 and 3, because there is no simple solution for organizing a number of miscellaneous matrices within a single Seurat object. We strongly recommend that users create a liger object, which has the required structure.

References

Chao Gao et al., Iterative single-cell multi-omic integration using online learning, Nat Biotechnol., 2021

Examples

pbmc <- normalize(pbmc)
#> ℹ Normalizing datasets "ctrl"
#> ✔ Normalizing datasets "ctrl" ... done
#> 
#> ℹ Normalizing datasets "stim"
#> ✔ Normalizing datasets "stim" ... done
#> 
pbmc <- selectGenes(pbmc)
#> ℹ Selecting variable features for dataset "ctrl"
#> ✔ ... 168 features selected out of 249 shared features.
#> ℹ Selecting variable features for dataset "stim"
#> ✔ ... 166 features selected out of 249 shared features.
#> ✔ Finally 173 shared variable features are selected.
pbmc <- scaleNotCenter(pbmc)
#> ℹ Scaling dataset "ctrl"
#> ✔ Scaling dataset "ctrl" ... done
#> 
#> ℹ Scaling dataset "stim"
#> ✔ Scaling dataset "stim" ... done
#> 
if (requireNamespace("RcppPlanc", quietly = TRUE)) {
    # Scenario 1
    pbmc <- runOnlineINMF(pbmc, minibatchSize = 200)
    # Scenario 2
    # Simulate a new dataset by adding 1 to every non-zero value in "ctrl"
    ctrl2 <- rawData(dataset(pbmc, "ctrl"))
    ctrl2@x <- ctrl2@x + 1
    colnames(ctrl2) <- paste0(colnames(ctrl2), 2)
    pbmc2 <- runOnlineINMF(pbmc, k = 20, newDatasets = list(ctrl2 = ctrl2),
                           minibatchSize = 100)
    # Scenario 3
    pbmc3 <- runOnlineINMF(pbmc, k = 20, newDatasets = list(ctrl2 = ctrl2),
                           projection = TRUE)
}
#> ! Colnames of `value` do not all start with "ctrl2_".
#> Prefix added.
#> ℹ Normalizing datasets "ctrl2"
#> ✔ Normalizing datasets "ctrl2" ... done
#> 
#> ℹ Scaling dataset "ctrl2"
#> ✔ Scaling dataset "ctrl2" ... done
#> 
#> ! Colnames of `value` do not all start with "ctrl2_".
#> Prefix added.
#> ℹ Normalizing datasets "ctrl2"
#> ✔ Normalizing datasets "ctrl2" ... done
#> 
#> ℹ Scaling dataset "ctrl2"
#> ✔ Scaling dataset "ctrl2" ... done
#>