Perform online iNMF on scaled datasets — online

Please use runOnlineINMF or runIntegration instead.

Perform online integrative non-negative matrix factorization (iNMF) to represent multiple single-cell datasets in terms of H, W, and V matrices. The function optimizes the iNMF objective using online learning (non-negative least squares for the H matrices, hierarchical alternating least squares for the W and V matrices), with the number of factors set by k. Online learning is supported in three scenarios: (1) fully observed datasets; (2) iterative refinement using continually arriving datasets; and (3) projection of new datasets without updating the existing factorization. All three scenarios require a fixed amount of memory, independent of the number of cells.

For each dataset, this factorization produces an H matrix (cells by k), a V matrix (k by genes), and a shared W matrix (k by genes). The H matrices represent the cell factor loadings. W is identical among all datasets, as it represents the shared components of the metagenes across datasets. The V matrices represent the dataset-specific components of the metagenes.
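As a reference for the matrix shapes described above, the iNMF objective can be written as follows. This is a sketch in assumed notation that does not appear in this page: X_i denotes the scaled expression matrix of dataset i (n_i cells by m genes), N is the number of datasets, and lambda is the regularization parameter documented under Arguments.

```latex
% Assumed notation: X_i is n_i x m, H_i is n_i x k, W and V_i are k x m.
\min_{H_i \ge 0,\; W \ge 0,\; V_i \ge 0}
  \sum_{i=1}^{N} \left\lVert X_i - H_i \,(W + V_i) \right\rVert_F^2
  \;+\; \lambda \sum_{i=1}^{N} \left\lVert H_i V_i \right\rVert_F^2
```

The first term measures how well each dataset is reconstructed from its cell loadings H_i and the combined metagenes W + V_i; the second term penalizes the dataset-specific component, which is why larger values of lambda push more of the signal into the shared W.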

Arguments

object: liger object with data stored in HDF5 files. The data should be normalized, have genes selected, and be scaled before calling this function.
X_new: List of new datasets for scenario 2 or scenario 3. Each list element should be the name of an HDF5 file.
projection: Perform data integration by shared metagene (W) projection (scenario 3). (default FALSE)
W.init: Optional initialization for W. (default NULL)
V.init: Optional initialization for V. (default NULL)
H.init: Optional initialization for H. (default NULL)
A.init: Optional initialization for A. (default NULL)
B.init: Optional initialization for B. (default NULL)
k: Inner dimension of the factorization, i.e., the number of metagenes (default 20). A value in the range 20-50 works well for most analyses.
lambda: Regularization parameter. Larger values penalize dataset-specific effects more strongly (i.e., alignment should increase as lambda increases). We recommend using the default value, except possibly for analyses with relatively small differences between datasets (biological replicates, male/female comparisons, etc.), in which case a lower value such as 1.0 may improve reconstruction quality (default 5.0).
max.epochs: Maximum number of epochs (complete passes through the data). (default 5)
miniBatch_max_iters: Maximum number of block coordinate descent (HALS algorithm) iterations to perform for each update of W and V (default 1). Changing this parameter is not recommended.
miniBatch_size: Total number of cells in each minibatch (default 5000). This is a reasonable default, but a smaller value such as 1000 may be necessary for analyzing very small datasets. In general, minibatch size should be no larger than the number of cells in the smallest dataset.
h5_chunk_size: Chunk size of input HDF5 files (default 1000). The chunk size should be no larger than the minibatch size.
seed: Random seed to allow reproducible results (default 123).
verbose: Print progress bar/messages (default TRUE).
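The size constraints stated for miniBatch_size and h5_chunk_size can be checked before running the factorization. The following is a minimal base R sketch, not part of the package: the dataset names and cell counts are hypothetical, and the variable names are chosen only for illustration.

```r
# Hypothetical cell counts per dataset (illustrative values only).
n_cells <- c(stim = 3000, ctrl = 12000)

# Defaults as documented in the argument list.
default_batch <- 5000
default_chunk <- 1000

# The minibatch size should be no larger than the smallest dataset.
mini_batch_size <- min(default_batch, min(n_cells))

# The HDF5 chunk size should be no larger than the minibatch size.
h5_chunk_size <- min(default_chunk, mini_batch_size)

print(c(miniBatch_size = mini_batch_size, h5_chunk_size = h5_chunk_size))
```

With these assumed counts, the smallest dataset has fewer cells than the default minibatch size, so the minibatch size is reduced to match it, while the default chunk size already satisfies its constraint.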

Value

liger object with H, W, V, A and B slots set.