Import Data from Various Source

This article only focuses on reading in data presented in various forms and provides detailed examples and explanations. Currently, we support importing the following forms of input data into a liger object:

Flat MTX files containing sparse matrices, together with files containing barcodes and feature names. These are mostly the 10X cellranger output.
H5 files containing the sparse data, barcodes, and feature names. These can be found also in 10X cellranger output, as well as H5AD files written from a Python AnnData object.
Loaded R objects of the following classes:

Whichever type of data you want to bring to a LIGER analysis, we generally recommend that it represents the raw information, such as RNA raw counts, peak counts. LIGER performs optimally with its own normalization (library-size/L1 normalization only) and scaling (no centering) procedures, and other types of transformation may not be compatible with LIGER’s assumptions.

The data imported from flat MTX files will be read into memory, internally converted into dgCMatrix objects. Data from other R object classes will also be converted into dgCMatrix objects. Data stored in H5 files will not be read into memory, as they are employed for helping with large datasets that cannot be loaded into memory. The main liger object constructor function, createLiger(), requires a named list as its first argument, where each element is a dataset to be integrated. Each element of the named list can be of different types from described above and below. However, it is limited that we don’t currently provide integration method that integrates H5 data with other types of data. That is, either you can integrate H5 data with H5 data, or you can integrate in-memory data with in-memory data. If integration under this situation is needed, we still provide some solution.

If you need to import data presented in other forms but encounter problems coercing them to supported formats, please feel free to open an Issue on our GitHub repository to request for support.

Importing 10X Cellranger Output

Given the prevalence of 10X Genomics’ single-cell sequencing technology, the cellranger output has been the most commonly seen data source for real-world analyses. We provide the following examples showing how to read multi-sample datasets from cellranger output into a liger object.

Data presented with MTX files

MTX files are proposed for efficiently storing sparse matrix data into flat text files. Cellranger output MEX format data that contains a MTX file together with a text file with barcodes and a text file with feature names. For example, the following is the structure of the cellranger output directory of one certain sample.

# shell commands in unix terminal
$ cd /home/jdoe/runs/sample345/outs
$ tree filtered_feature_bc_matrix
filtered_feature_bc_matrix
    ├── barcodes.tsv.gz
    ├── features.tsv.gz
    └── matrix.mtx.gz
0 directories, 3 files

We provide a function read10X() for identifying proper files to use and loading them into memory. The function works for either just one sample or the root directory that conatins multiple samples.

Reading a single sample

To load a single sample:

sample1 <- read10X("path/to/filtered_feature_bc_matrix")

Depending on whether the cellranger version is before 3.0 or after, the returned object sample1 varies in structure. For cellranger version 3.0 and later, the returned object is a named list where each element is for a sample, even if we opt to only read one sample. And this element would be another named list for available feature types. For example when doing scRNAseq, it will be "Gene Expression".

> sample1
$sampleName
$sampleName$`Gene Expression`
  [... dgCMatrix]

For older cellranger, sample1 will a named list, named by sample names, where each element is directly a dgCMatrix object.

> sample1
$sampleName
  [... dgCMatrix]

Reading multiple samples from root directory

The root directory that read10X() expects is the directory where you can see subdirecories named by sample names.

# shell commands in unix terminal
$ cd /home/jdoe/
$ tree runs
runs
    ├── sample123
    │   └── outs
    │       └── filtered_feature_bc_matrix
    │           ├── barcodes.tsv.gz
    │           ├── features.tsv.gz
    │           └── matrix.mtx.gz
    └── sample456
        └── outs
            └── filtered_feature_bc_matrix
                ├── barcodes.tsv.gz
                ├── features.tsv.gz
                └── matrix.mtx.gz

multiSample <- read10X("path/to/runs")

The returned object multiSample is a named list for each sample available in the runs/ folder. Similar to the case of reading a single sample, the element for each sample will be a list for available feature types for cellranger>=3.0 or a dgCMatrix object for older cellranger.

Creating a liger object from loaded data

The required structure for the input of createLiger() is a named list of matrices. For cellranger>=3.0 case where the loaded data contains substructure for potentially multiple feature types, we need to extract the desired data and organize them a bit. The following example shows how to take only the gene expression out of the loaded data and organize them into the desired format to finally create a liger object.

multiSample <- read10X("path/to/runs")
multiSample
## $sample123
## $sample123$`Gene Expression`
##   [... dgCMatrix]
## $sample123$`Antibody Capture`
##   [... dgCMatrix]
## $sample456
## $sample456$`Gene Expression`
##   [... dgCMatrix]
## $sample456$`Antibody Capture`
##   [... dgCMatrix]
multiSample <- lapply(multiSample, `[[`, i = "Gene Expression")
multiSample
## $sample123
##   [... dgCMatrix]
## $sample456
##   [... dgCMatrix]
ligerObj <- createLiger(multiSample)

For convenience, if you only want to work on RNA data, you can directly load with read10XRNA().

multiSample <- read10XRNA("path/to/runs")
multiSample
## $sample123
##   [... dgCMatrix]
## $sample456
##   [... dgCMatrix]
ligerObj <- createLiger(multiSample)

We also provide read10XATAC() for loading only the ATAC data from cellranger output. Note here that 10X provides different platforms that contain ATAC modality, namingly “cellranger atac” and “cellranger arc”. Users need to specify argument pipeline for the function to correctly infer the file structure.

Data presented with H5 files

Cellranger outputs identical information into H5 files, which contain the same sparse matrix, barcodes and feature names. H5 files are based on the HDF5 library, which is designed for storing and managing large and complex data. They can be found at the same folder where you see the filtered_feature_bc_matrix/ folder, and the file would be named by filtered_feature_bc_matrix.h5.

# shell commands in unix terminal
$ cd /home/jdoe/runs/sample345/
$ tree outs
outs
    ├── filtered_feature_bc_matrix
    │   ├── barcodes.tsv.gz
    │   ├── features.tsv.gz
    │   └── matrix.mtx.gz
    └── filtered_feature_bc_matrix.h5
    ...

To create a liger object with H5 files, you simply need to provide the path to each H5 files in a named list.

H5PathList <- list(
    sample123 = "/home/jdoe/runs/sample123/outs/filtered_feature_bc_matrix.h5",
    sample456 = "/home/jdoe/runs/sample456/outs/filtered_feature_bc_matrix.h5"
)
ligerObj <- createLiger(H5PathList)

Importing R object - dgCMatrix

“dgCMatrix”, supported by package Matrix, is the core class we adopt for holding the cell feature matrix in rliger. It represents CSC (Compressed Sparse Column) matrix, which is the most common form of sparse matrix in R given its efficiency in both memory usage and computation.

We expect that users have each dgCMatrix object representing a single dataset to be integrated. The main object constructor function, createLiger(), expects datasets are organized in a named list and being passed to the first argument.

# NOT RUN, for demonstration only
datasetList <- list(
    "name1" = dgCMatrix1, 
    "name2" = dgCMatrix2,
    "name3" = dgCMatrix3
)
ligerObj <- createLiger(datasetList)

If your matrix is a combination of multiple datasets to be integrated, we suggest using as.liger() with argument datasetVar supplied with a factor object as the labeling to indicate the proper splitting behavior.

# NOT RUN, for demonstration only
datasetVar
##  [1] sample1 sample1 sample1 sample1 ...
## Levels: sample1 sample2 sample3
ligerObj <- as.liger(dgCMatrix, datasetVar = datasetVar)

Importing R object - Seurat

Seurat has been the most popular R package for single-cell analysis. We directly support running rliger functions on a Seurat object without having to convert a Seurat object into a liger object. Please click on the link to see a detailed example, together with many use cases of interoperation between Seurat and rliger.

Importing R object - SingleCellExperiment

SingleCellExperiment (SCE) has also been a popular class for single-cell analysis. Here’s how we bring raw data from SCEs.

Multiple SCEs, each a dataset

This is the most preferable case. Most of the times, people do QC and filterings to remove unexpressing genes and low quality cells. If a SCE is presented as a combined representation of QC’ed datasets, it’s likely that unwanted zero entries would be introduced to keep the dimensionality consistent, and this would unnecessarily increase the memory usage and computation time. If you have multiple SCEs, each representing a dataset, we recommend to pass them as a named list to createLiger().

# NOT RUN, for demonstration only
sceList <- list(
    "name1" = sce1, 
    "name2" = sce2,
    "name3" = sce3
)
ligerObj <- createLiger(sceList)

A single SCE, including many datasets

If it happens that your single SCE object includes multiple datasets that need to be integrated. We suggest using as.liger() to convert it into a liger object. In this case, the second argument datasetVar must be provided for inferring how to split the data. Assume that the example sce object has a column sample in its colData:

# NOT RUN, for demonstration only
ligerObj <- as.liger(sce, datasetVar = "sample")

Importing from H5AD file, Python AnnData object

WARNING: We strongly suggest that you make a COPY of the H5AD file for rliger analyis. rliger load the HDF5 file with read-and-write mode and occasionally writes to the file, which might cause the file unable to be loaded back to Python AnnData object. This will be improved in the future.

AnnData is a popular Python class for single-cell analysis. When storing the object, it is written to an H5AD file, which is also based on HDF5 library, and is totally supported to be loaded into a liger object. Note that the H5AD specification has been changed since anndata 0.8.0, and we only support the new format. If you have H5AD file written by an older version of anndata, you need to convert it to the new format using an upgraded version of Python anndata package.

# NOT RUN, for demonstration only
ligerObj <- createLiger(
  list(
    sample1 = "path/to/sample1.h5ad",
    sample2 = "path/to/sample2.h5ad"
  ),
  formatType = "anndata"
)

A few limitations to note here are that:

We by default read the data located at adata.X. However, due to the fact that people might be storing data at adata.layers['key'] or adata.raw.X, and LIGER requires the raw data as the start point, we provide argument anndataX to allow setting the path to the data. See createH5LigerDataset() for more details.
LIGER expects that each H5AD file contains only one dataset to be integrated, because we are not loading the H5 based data into memory and hence don’t yet have method to split the data by a given variable. This will be updated in the future.

Working with mixed types of data

The createLiger() constructor function expects a named list as its first input, while the element in this list can be of various forms.

Mixing cellranger output H5 files with anndata H5AD files

When importing a number of H5-based datasets that are of the same specification, one can tweak the additional arguments of the function to apply the same settings to all these datasets. However, if you want to integrate H5 datasets of different types (e.g. cellranger output and H5AD), you will need to load each dataset into ligerDataset object individually, and then call the createLiger() function with a list of these ligerDataset objects.

# NOT RUN, for demonstration only
ligerDataset1 <- createH5LigerDataset("path/to/runs/sample/outs/filtered_feature_bc_matrix.h5", formatType = "10x")
ligerDataset2 <- createH5LigerDataset("path/to/sample2.h5ad", formatType = "anndata", anndataX = "raw/X")
ligerObj <- createLiger(list(ligerDataset1, ligerDataset2))

Mixing all kinds of R in-memory objects

If each of your R objects contains a single dataset to be integrated, you can simply pass them to the named list.

# NOT RUN, for demonstration only
ligerObj <- createLiger(
  list(
    sample1 = dgCMatrix1, 
    seurat2 = seuratObj,
    sce3 = sce, 
  )
)

Function as.ligerDataset is called under the hook to convert each R object to a ligerDataset object, and it directly fetches the 'counts' layer/slot of the default assay from a Seurat object, or the 'counts' assay from an SCE object. If this is not the desired behavior, you can manually extract the correct matrix.

# NOT RUN, for demonstration only
ligerObj <- createLiger(
  list(
    sample1 = dgCMatrix1, 
    seurat2 = SeuratObject::LayerData(seuratObj, layer = "somelayer", assay = "someassay"),
    sce3 = SummarizedExperiment::assay(sce, "someassay"), 
  )
)

If the given Seurat or SCE objects contain multiple datasets, you can use as.liger() to convert them into a liger object, with datasetVar properly specified to have the datasets split. Then c() method can be applied to merge multiple liger objects into one, analogous to how one would concatenate R list objects.

# NOT RUN, for demonstration only
ligerObj1 <- as.liger(seuratObj, datasetVar = "sample")
ligerObj1
## An object of class liger with XXXXX cells
## datasets(2): patient001 (XXXX cells), patient002 (XXXX cells)
## cellMeta(10): dataset, barcode, group, nUMI, ..., hemo
## varFeatures(0): 
## dimReds(0):
ligerObj2 <- as.liger(sce, datasetVar = "batch")
ligerObj2
## An object of class liger with XXXXX cells
## datasets(2): someid101 (XXXX cells), someid103 (XXXX cells)
## cellMeta(10): dataset, barcode, group, nUMI, ..., hemo
## varFeatures(0):
## dimReds(0):
ligerObj <- c(ligerObj1, ligerObj2)
# You can even insert more datasets with `dataset<-()`
dataset(ligerObj, "pbmc") <- dgCMatrix1
ligerObj
## An object of class liger with XXXXX cells
## datasets(5): patient001 (XXXX cells), patient002 (XXXX cells), someid101 (XXXX cells), someid103 (XXXX cells), pbmc (XXXX cells)
## cellMeta(10): dataset, barcode, group, nUMI, ..., hemo
## varFeatures(0):
## dimReds(0):

Mixing H5 files with in-memory objects

Technically, with the hints provided above, you can create a liger object with both in-memory and H5-based datasets presented at the same time. However, the integration methods will not work in this situation. Generally, users will have to either write the in-memory part to H5 files and replace the datasets with the H5 copies accordingly, or read subsampled data from H5 files to memory and replace the datasets with the in-memory copies. Note that we assume that H5 files are employed for holding large datasets that cannot be fit into the memory and this is why we suggest subsampling them. If they are indeed okay to be loaded into memory, you can totally load them with subsetLigerDataset(..., newH5 = FALSE) and replace the datasets with the full in-memory copies.

It is becoming more common that users have large datasets stored in H5 files and are expensive to be all loaded into memory. runOnlineINMF() runs highly efficient methods to integrate datasets presented in this form. If you want to bring a rather smaller in-memory dataset to the pool and have it integrated together, you can write the dataset to an H5 file and then create a liger object with it. In this way, the integration will have the maximum information captured from all datasets.

# NOT RUN, for demonstration only
class(smallData)
## [1] "dgCMatrix"
write(smallData, "path/to/smallData.h5")

largeLiger
## An object of class liger with XXXXXX cells
## datasets(n): sample1 (XXXXX cells), sample2 (XXXXX cells), ...
## cellMeta(10): dataset, barcode, group, nUMI, ..., hemo 
## varFeatures(6000): AAA1, AAA2, BBB1, BBB2, ...
## dimReds(0):  
smallDataset <- createH5LigerDataset("path/to/smallData.h5")
dataset(largeLiger, "sampleSmall") <- smallDataset

largeLiger
## An object of class liger with XXXXXX cells
## datasets(n): sample1 (XXXXX cells), sample2 (XXXXX cells), ..., sampleSmall (XXXX cells)
## cellMeta(10): dataset, barcode, group, nUMI, ..., hemo 
## varFeatures(6000): AAA1, AAA2, BBB1, BBB2, ...
## dimReds(0):

However, after integration, downstream analyses that rely on gene/feature expression values (e.g. differential expression analysis) would still need in-memory form of the data. We provide downsample() to sub-sample data from H5 files and read the subset to memory, with fraction of dataset source and/or cell type balanced. This might be improved in the future.

The following example would randomly sample 4,000 cells from each dataset in largeLiger and load only the normalized data into memory. The newH5 argument is set to FALSE to load the data into memory, and the useSlot argument is set to "normData" to load only the normalized data. The maxCells argument is set to 4000 to sample 4,000 cells from each dataset. The resulting downsampledData is a liger object with the same structure as largeLiger, but with only the downsampled normalized in-memory data.

# NOT RUN, for demonstration only
downsampledData <- downsample(largeLiger, newH5 = FALSE, maxCells = 4000, useSlot = "normData")
downsampledData
## An object of class liger with XXXXXX cells
## datasets(n): sample1 (4000 cells), sample2 (4000 cells), ..., sampleSmall (4000 cells)
## cellMeta(10): dataset, barcode, group, nUMI, ..., leiden_cluster
## varFeatures(6000): AAA1, AAA2, BBB1, BBB2, ...
## dimReds(1): UMAP

Yichen Wang

2024-03-16