Import Data from Various Source
Yichen Wang
2024-03-16
Source:vignettes/articles/import_data.Rmd
import_data.Rmd
This article only focuses on reading in data presented in various forms and provides detailed examples and explanations. Currently, we support importing the following forms of input data into a liger object:
- Flat MTX files containing sparse matrices, together with files containing barcodes and feature names. These are mostly the 10X cellranger output.
- H5 files containing the sparse data, barcodes, and feature names. These can be found also in 10X cellranger output, as well as H5AD files written from a Python AnnData object.
- Loaded R objects of the following classes:
Whichever type of data you want to bring to a LIGER analysis, we generally recommend that it represents the raw information, such as RNA raw counts, peak counts. LIGER performs optimally with its own normalization (library-size/L1 normalization only) and scaling (no centering) procedures, and other types of transformation may not be compatible with LIGER’s assumptions.
The data imported from flat MTX files will be read into memory,
internally converted into dgCMatrix objects. Data from other R object
classes will also be converted into dgCMatrix objects. Data stored in H5
files will not be read into memory, as they are employed for helping
with large datasets that cannot be loaded into memory. The main liger
object constructor function, createLiger()
, requires a
named list as its first argument, where each element is
a dataset to be integrated. Each element of the named list can be of
different types from described above and below. However, it is limited
that we don’t currently provide integration method that integrates H5
data with other types of data. That is, either you can integrate H5 data
with H5 data, or you can integrate in-memory data with in-memory data.
If integration under this situation is needed, we still provide some
solution.
If you need to import data presented in other forms but encounter problems coercing them to supported formats, please feel free to open an Issue on our GitHub repository to request for support.
Importing 10X Cellranger Output
Given the prevalence of 10X Genomics’ single-cell sequencing technology, the cellranger output has been the most commonly seen data source for real-world analyses. We provide the following examples showing how to read multi-sample datasets from cellranger output into a liger object.
Data presented with MTX files
MTX files are proposed for efficiently storing sparse matrix data into flat text files. Cellranger output MEX format data that contains a MTX file together with a text file with barcodes and a text file with feature names. For example, the following is the structure of the cellranger output directory of one certain sample.
# shell commands in unix terminal
$ cd /home/jdoe/runs/sample345/outs
$ tree filtered_feature_bc_matrix
filtered_feature_bc_matrix
├── barcodes.tsv.gz
├── features.tsv.gz
└── matrix.mtx.gz
0 directories, 3 files
We provide a function read10X()
for identifying proper
files to use and loading them into memory. The function works for either
just one sample or the root directory that conatins multiple
samples.
Reading a single sample
To load a single sample:
sample1 <- read10X("path/to/filtered_feature_bc_matrix")
Depending on whether the cellranger version is before 3.0 or after,
the returned object sample1
varies in structure. For
cellranger version 3.0 and later, the returned object is a named list
where each element is for a sample, even if we opt to only read one
sample. And this element would be another named list for available
feature types. For example when doing scRNAseq, it will be
"Gene Expression"
.
> sample1
$sampleName
$sampleName$`Gene Expression`
[... dgCMatrix]
For older cellranger, sample1
will a named list, named
by sample names, where each element is directly a dgCMatrix object.
> sample1
$sampleName
[... dgCMatrix]
Reading multiple samples from root directory
The root directory that read10X()
expects is the
directory where you can see subdirecories named by sample names.
# shell commands in unix terminal
$ cd /home/jdoe/
$ tree runs
runs
├── sample123
│ └── outs
│ └── filtered_feature_bc_matrix
│ ├── barcodes.tsv.gz
│ ├── features.tsv.gz
│ └── matrix.mtx.gz
└── sample456
└── outs
└── filtered_feature_bc_matrix
├── barcodes.tsv.gz
├── features.tsv.gz
└── matrix.mtx.gz
multiSample <- read10X("path/to/runs")
The returned object multiSample
is a named list for each
sample available in the runs/
folder. Similar to the case
of reading a single sample, the element for each sample will be a list
for available feature types for cellranger>=3.0 or a dgCMatrix object
for older cellranger.
Creating a liger object from loaded data
The required structure for the input of createLiger()
is
a named list of matrices. For cellranger>=3.0 case where the loaded
data contains substructure for potentially multiple feature types, we
need to extract the desired data and organize them a bit. The following
example shows how to take only the gene expression out of the loaded
data and organize them into the desired format to finally create a liger
object.
multiSample <- read10X("path/to/runs")
multiSample
## $sample123
## $sample123$`Gene Expression`
## [... dgCMatrix]
## $sample123$`Antibody Capture`
## [... dgCMatrix]
## $sample456
## $sample456$`Gene Expression`
## [... dgCMatrix]
## $sample456$`Antibody Capture`
## [... dgCMatrix]
multiSample <- lapply(multiSample, `[[`, i = "Gene Expression")
multiSample
## $sample123
## [... dgCMatrix]
## $sample456
## [... dgCMatrix]
ligerObj <- createLiger(multiSample)
For convenience, if you only want to work on RNA data, you can
directly load with read10XRNA()
.
multiSample <- read10XRNA("path/to/runs")
multiSample
## $sample123
## [... dgCMatrix]
## $sample456
## [... dgCMatrix]
ligerObj <- createLiger(multiSample)
We also provide read10XATAC()
for loading only the ATAC
data from cellranger output. Note here that 10X provides different
platforms that contain ATAC modality, namingly “cellranger atac” and
“cellranger arc”. Users need to specify argument pipeline
for the function to correctly infer the file structure.
Data presented with H5 files
Cellranger outputs identical information into H5 files, which contain
the same sparse matrix, barcodes and feature names. H5 files are based
on the HDF5 library, which is designed for storing and managing large
and complex data. They can be found at the same folder where you see the
filtered_feature_bc_matrix/
folder, and the file would be
named by filtered_feature_bc_matrix.h5
.
# shell commands in unix terminal
$ cd /home/jdoe/runs/sample345/
$ tree outs
outs
├── filtered_feature_bc_matrix
│ ├── barcodes.tsv.gz
│ ├── features.tsv.gz
│ └── matrix.mtx.gz
└── filtered_feature_bc_matrix.h5
...
To create a liger object with H5 files, you simply need to provide the path to each H5 files in a named list.
H5PathList <- list(
sample123 = "/home/jdoe/runs/sample123/outs/filtered_feature_bc_matrix.h5",
sample456 = "/home/jdoe/runs/sample456/outs/filtered_feature_bc_matrix.h5"
)
ligerObj <- createLiger(H5PathList)
Importing R object - dgCMatrix
“dgCMatrix”, supported by package Matrix, is the core class we adopt for holding the cell feature matrix in rliger. It represents CSC (Compressed Sparse Column) matrix, which is the most common form of sparse matrix in R given its efficiency in both memory usage and computation.
We expect that users have each dgCMatrix object representing a single
dataset to be integrated. The main object constructor function,
createLiger()
, expects datasets are organized in a
named list and being passed to the first argument.
# NOT RUN, for demonstration only
datasetList <- list(
"name1" = dgCMatrix1,
"name2" = dgCMatrix2,
"name3" = dgCMatrix3
)
ligerObj <- createLiger(datasetList)
If your matrix is a combination of multiple datasets to be
integrated, we suggest using as.liger()
with argument
datasetVar
supplied with a factor object as the labeling to
indicate the proper splitting behavior.
# NOT RUN, for demonstration only
datasetVar
## [1] sample1 sample1 sample1 sample1 ...
## Levels: sample1 sample2 sample3
ligerObj <- as.liger(dgCMatrix, datasetVar = datasetVar)
Importing R object - Seurat
Seurat has been the most popular R package for single-cell analysis. We directly support running rliger functions on a Seurat object without having to convert a Seurat object into a liger object. Please click on the link to see a detailed example, together with many use cases of interoperation between Seurat and rliger.
Importing R object - SingleCellExperiment
SingleCellExperiment (SCE) has also been a popular class for single-cell analysis. Here’s how we bring raw data from SCEs.
Multiple SCEs, each a dataset
This is the most preferable case. Most of the times, people do QC and
filterings to remove unexpressing genes and low quality cells. If a SCE
is presented as a combined representation of QC’ed datasets, it’s likely
that unwanted zero entries would be introduced to keep the
dimensionality consistent, and this would unnecessarily increase the
memory usage and computation time. If you have multiple SCEs, each
representing a dataset, we recommend to pass them as a named list to
createLiger()
.
# NOT RUN, for demonstration only
sceList <- list(
"name1" = sce1,
"name2" = sce2,
"name3" = sce3
)
ligerObj <- createLiger(sceList)
A single SCE, including many datasets
If it happens that your single SCE object includes multiple datasets
that need to be integrated. We suggest using as.liger()
to
convert it into a liger object. In this case, the second argument
datasetVar
must be provided for inferring how to split the
data. Assume that the example sce
object has a column
sample
in its colData
:
# NOT RUN, for demonstration only
ligerObj <- as.liger(sce, datasetVar = "sample")
Importing from H5AD file, Python AnnData object
WARNING: We strongly suggest that you make a COPY of the H5AD file for rliger analyis. rliger load the HDF5 file with read-and-write mode and occasionally writes to the file, which might cause the file unable to be loaded back to Python AnnData object. This will be improved in the future.
AnnData is a popular Python class for single-cell analysis. When storing the object, it is written to an H5AD file, which is also based on HDF5 library, and is totally supported to be loaded into a liger object. Note that the H5AD specification has been changed since anndata 0.8.0, and we only support the new format. If you have H5AD file written by an older version of anndata, you need to convert it to the new format using an upgraded version of Python anndata package.
# NOT RUN, for demonstration only
ligerObj <- createLiger(
list(
sample1 = "path/to/sample1.h5ad",
sample2 = "path/to/sample2.h5ad"
),
formatType = "anndata"
)
A few limitations to note here are that:
- We by default read the data located at
adata.X
. However, due to the fact that people might be storing data atadata.layers['key']
oradata.raw.X
, and LIGER requires the raw data as the start point, we provide argumentanndataX
to allow setting the path to the data. SeecreateH5LigerDataset()
for more details. - LIGER expects that each H5AD file contains only one dataset to be integrated, because we are not loading the H5 based data into memory and hence don’t yet have method to split the data by a given variable. This will be updated in the future.
Working with mixed types of data
The createLiger()
constructor function expects a named
list as its first input, while the element in this list can be of
various forms.
Mixing cellranger output H5 files with anndata H5AD files
When importing a number of H5-based datasets that are of the same
specification, one can tweak the additional arguments of the function to
apply the same settings to all these datasets. However, if you want to
integrate H5 datasets of different types (e.g. cellranger output and
H5AD), you will need to load each dataset into ligerDataset object
individually, and then call the createLiger()
function with
a list of these ligerDataset objects.
# NOT RUN, for demonstration only
ligerDataset1 <- createH5LigerDataset("path/to/runs/sample/outs/filtered_feature_bc_matrix.h5", formatType = "10x")
ligerDataset2 <- createH5LigerDataset("path/to/sample2.h5ad", formatType = "anndata", anndataX = "raw/X")
ligerObj <- createLiger(list(ligerDataset1, ligerDataset2))
Mixing all kinds of R in-memory objects
If each of your R objects contains a single dataset to be integrated, you can simply pass them to the named list.
# NOT RUN, for demonstration only
ligerObj <- createLiger(
list(
sample1 = dgCMatrix1,
seurat2 = seuratObj,
sce3 = sce,
)
)
Function as.ligerDataset
is called under the hook to
convert each R object to a ligerDataset object, and it directly fetches
the 'counts'
layer/slot of the default assay from a Seurat
object, or the 'counts'
assay from an SCE object. If this
is not the desired behavior, you can manually extract the correct
matrix.
# NOT RUN, for demonstration only
ligerObj <- createLiger(
list(
sample1 = dgCMatrix1,
seurat2 = SeuratObject::LayerData(seuratObj, layer = "somelayer", assay = "someassay"),
sce3 = SummarizedExperiment::assay(sce, "someassay"),
)
)
If the given Seurat or SCE objects contain multiple datasets, you can
use as.liger()
to convert them into a liger object, with
datasetVar
properly specified to have the datasets split.
Then c()
method can be applied to merge multiple liger
objects into one, analogous to how one would concatenate R list
objects.
# NOT RUN, for demonstration only
ligerObj1 <- as.liger(seuratObj, datasetVar = "sample")
ligerObj1
## An object of class liger with XXXXX cells
## datasets(2): patient001 (XXXX cells), patient002 (XXXX cells)
## cellMeta(10): dataset, barcode, group, nUMI, ..., hemo
## varFeatures(0):
## dimReds(0):
ligerObj2 <- as.liger(sce, datasetVar = "batch")
ligerObj2
## An object of class liger with XXXXX cells
## datasets(2): someid101 (XXXX cells), someid103 (XXXX cells)
## cellMeta(10): dataset, barcode, group, nUMI, ..., hemo
## varFeatures(0):
## dimReds(0):
ligerObj <- c(ligerObj1, ligerObj2)
# You can even insert more datasets with `dataset<-()`
dataset(ligerObj, "pbmc") <- dgCMatrix1
ligerObj
## An object of class liger with XXXXX cells
## datasets(5): patient001 (XXXX cells), patient002 (XXXX cells), someid101 (XXXX cells), someid103 (XXXX cells), pbmc (XXXX cells)
## cellMeta(10): dataset, barcode, group, nUMI, ..., hemo
## varFeatures(0):
## dimReds(0):
Mixing H5 files with in-memory objects
Technically, with the hints provided above, you can create a liger
object with both in-memory and H5-based datasets presented at the same
time. However, the integration methods will not work in this situation.
Generally, users will have to either write the in-memory part to H5
files and replace the datasets with the H5 copies accordingly, or read
subsampled data from H5 files to memory and replace the datasets with
the in-memory copies. Note that we assume that H5 files are employed for
holding large datasets that cannot be fit into the memory and this is
why we suggest subsampling them. If they are indeed okay to be loaded
into memory, you can totally load them with
subsetLigerDataset(..., newH5 = FALSE)
and replace the
datasets with the full in-memory copies.
It is becoming more common that users have large datasets stored in
H5 files and are expensive to be all loaded into memory.
runOnlineINMF()
runs highly efficient methods to integrate
datasets presented in this form. If you want to bring a rather smaller
in-memory dataset to the pool and have it integrated together, you can
write the dataset to an H5 file and then create a liger object with it.
In this way, the integration will have the maximum information captured
from all datasets.
# NOT RUN, for demonstration only
class(smallData)
## [1] "dgCMatrix"
write(smallData, "path/to/smallData.h5")
largeLiger
## An object of class liger with XXXXXX cells
## datasets(n): sample1 (XXXXX cells), sample2 (XXXXX cells), ...
## cellMeta(10): dataset, barcode, group, nUMI, ..., hemo
## varFeatures(6000): AAA1, AAA2, BBB1, BBB2, ...
## dimReds(0):
smallDataset <- createH5LigerDataset("path/to/smallData.h5")
dataset(largeLiger, "sampleSmall") <- smallDataset
largeLiger
## An object of class liger with XXXXXX cells
## datasets(n): sample1 (XXXXX cells), sample2 (XXXXX cells), ..., sampleSmall (XXXX cells)
## cellMeta(10): dataset, barcode, group, nUMI, ..., hemo
## varFeatures(6000): AAA1, AAA2, BBB1, BBB2, ...
## dimReds(0):
However, after integration, downstream analyses that rely on
gene/feature expression values (e.g. differential expression analysis)
would still need in-memory form of the data. We provide
downsample()
to sub-sample data from H5 files and read the
subset to memory, with fraction of dataset source and/or cell type
balanced. This might be improved in the future.
The following example would randomly sample 4,000 cells from each
dataset in largeLiger
and load only the normalized data
into memory. The newH5
argument is set to
FALSE
to load the data into memory, and the
useSlot
argument is set to "normData"
to load
only the normalized data. The maxCells
argument is set to
4000
to sample 4,000 cells from each dataset. The resulting
downsampledData
is a liger object with the same structure
as largeLiger
, but with only the downsampled normalized
in-memory data.
# NOT RUN, for demonstration only
downsampledData <- downsample(largeLiger, newH5 = FALSE, maxCells = 4000, useSlot = "normData")
downsampledData
## An object of class liger with XXXXXX cells
## datasets(n): sample1 (4000 cells), sample2 (4000 cells), ..., sampleSmall (4000 cells)
## cellMeta(10): dataset, barcode, group, nUMI, ..., leiden_cluster
## varFeatures(6000): AAA1, AAA2, BBB1, BBB2, ...
## dimReds(1): UMAP