This function identifies highly variable genes from each dataset and combines these gene sets (either by union or intersection) for use in downstream analysis. Assuming that gene expression approximately follows a Poisson distribution, this function identifies genes whose expression variance exceeds a given variance threshold (relative to mean gene expression). Alternatively, a desired number of genes can be selected for each dataset by ranking the relative variance, after which the per-dataset selections are combined.
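The idea of thresholding variance relative to the mean can be illustrated with a minimal base R sketch. This is not the rliger implementation: the simulated data, the variable names and the simple rule `log10(variance) - log10(mean) > thresh` are assumptions made for illustration, and the sketch works on raw simulated counts whereas `selectGenes` expects normalized data.

```r
# Illustrative sketch only; NOT the rliger implementation.
set.seed(1)
nGene <- 200
nCell <- 2000
geneNames <- paste0("gene", seq_len(nGene))

# Hypothetical per-gene mean expression
lambda <- runif(nGene, min = 1, max = 5)

# Poisson counts: variance is approximately equal to the mean
counts <- matrix(
  rpois(nGene * nCell, lambda = rep(lambda, times = nCell)),
  nrow = nGene,
  dimnames = list(geneNames, NULL)
)

# Make the first 20 genes overdispersed (negative binomial, variance > mean)
overdispersed <- geneNames[1:20]
counts[1:20, ] <- rnbinom(20 * nCell, mu = rep(lambda[1:20], times = nCell),
                          size = 0.5)

# Relative variance on the log10 scale; near 0 for Poisson-like genes
geneMean <- rowMeans(counts)
geneVar <- apply(counts, 1, var)
relVar <- log10(geneVar) - log10(geneMean)

# Keep genes whose relative variance exceeds the threshold
thresh <- 0.1
selected <- names(relVar)[relVar > thresh]
```

Under these assumptions, raising `thresh` can only shrink `selected`, which matches the documented behavior that a higher threshold results in fewer selected genes.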
Usage
selectGenes(object, thresh = 0.1, nGenes = NULL, alpha = 0.99, ...)
# S3 method for liger
selectGenes(
object,
thresh = 0.1,
nGenes = NULL,
alpha = 0.99,
useDatasets = NULL,
useUnsharedDatasets = NULL,
unsharedThresh = 0.1,
combine = c("union", "intersection"),
chunk = 1000,
verbose = getOption("ligerVerbose", TRUE),
var.thresh = thresh,
alpha.thresh = alpha,
num.genes = nGenes,
datasets.use = useDatasets,
unshared.datasets = useUnsharedDatasets,
unshared.thresh = unsharedThresh,
tol = NULL,
do.plot = NULL,
cex.use = NULL,
unshared = NULL,
...
)
# S3 method for Seurat
selectGenes(
object,
thresh = 0.1,
nGenes = NULL,
alpha = 0.99,
useDatasets = NULL,
layer = "ligerNormData",
assay = NULL,
datasetVar = "orig.ident",
combine = c("union", "intersection"),
verbose = getOption("ligerVerbose", TRUE),
...
)
Arguments
- object
A liger, ligerDataset or Seurat object, with normalized data available (no scale factor multiplied nor log transformed).
- thresh
Variance threshold used to identify variable genes. A higher threshold results in fewer selected genes. The liger and Seurat S3 methods accept a single value or a vector with a specific threshold for each dataset in useDatasets. Default 0.1.
- nGenes
Number of genes to find for each dataset. When set, the threshold used for each dataset is optimized so that nGenes features are selected for that dataset. Accepts a single value or a vector for dataset-specific settings matching useDatasets. Default NULL does not optimize.
- alpha
Alpha threshold. Controls the upper bound for expected mean gene expression. A lower threshold means a higher upper bound. Default 0.99.
- ...
Arguments passed to other methods.
- useDatasets
A character vector of dataset names, or a numeric or logical vector of dataset indices, specifying the datasets to use for shared variable feature selection. Default NULL uses all datasets.
- useUnsharedDatasets
A character vector of dataset names, or a numeric or logical vector of dataset indices, specifying the datasets to use for finding unshared variable features. Default NULL does not attempt to find unshared features.
- unsharedThresh
The same as thresh, but applied to test unshared features. A single value for all datasets in useUnsharedDatasets or a vector for dataset-specific settings. Default 0.1.
- combine
How to combine variable genes selected from all datasets. Choose from "union" or "intersection". Default "union".
- chunk
Integer. Maximum number of cells in each chunk when gene selection is applied to any HDF5-based dataset. Default 1000.
- verbose
Logical. Whether to show progress information. Default getOption("ligerVerbose"), or TRUE if users have not set it.
- var.thresh, alpha.thresh, num.genes, datasets.use, unshared.datasets, unshared.thresh
Deprecated. These arguments are renamed and will be removed in the future. Please see function usage for replacement.
- tol, do.plot, cex.use, unshared
Deprecated. The gene variability metric is now visualized with a separate function, plotVarFeatures. Users can now set a non-NULL useUnsharedDatasets to select unshared genes, instead of having to switch unshared on.
- layer
Where the input normalized counts should be from. Default "ligerNormData". For older Seurat, always retrieve from the data slot.
- assay
Name of assay to use. Default NULL uses current active assay.
- datasetVar
Metadata variable name that stores the dataset source annotation. Default "orig.ident".
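The interplay of nGenes and combine can be sketched in base R. The relative-variance scores below are random placeholders, and the helpers `selectTopN` and `combineGenes` are hypothetical names introduced for illustration; they are not part of rliger. The sketch picks the top-ranked genes per dataset and then combines the per-dataset sets.

```r
# Illustrative sketch only; helper names and scores are hypothetical.
set.seed(42)
genes <- paste0("gene", 1:50)

# Placeholder relative-variance scores for two datasets
relVarList <- list(
  ctrl = setNames(rnorm(50), genes),
  stim = setNames(rnorm(50), genes)
)

# Take the n genes with the highest relative variance
selectTopN <- function(relVar, n) {
  names(sort(relVar, decreasing = TRUE))[seq_len(n)]
}

# Dataset-specific number of genes, in the same order as the datasets
nGenes <- c(ctrl = 10, stim = 15)
perDataset <- Map(selectTopN, relVarList, nGenes)

# Combine the per-dataset selections by union or intersection
combineGenes <- function(geneSets, combine = c("union", "intersection")) {
  combine <- match.arg(combine)
  Reduce(if (combine == "union") union else intersect, geneSets)
}

unionSet <- combineGenes(perDataset, "union")
interSet <- combineGenes(perDataset, "intersection")
```

The union is never smaller than the largest per-dataset set, and the intersection is never larger than the smallest one, so the choice of combine directly controls how permissive the final gene set is.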
Value
Updated object
liger method - Each involved dataset stored in ligerDataset is updated with its featureMeta slot and varUnsharedFeatures slot (if requested with useUnsharedDatasets), while varFeatures(object) will be updated with the final combined gene set.
Seurat method - Final selection will be updated at Seurat::VariableFeatures(object). Per-dataset information is stored in the meta.features slot of the chosen Assay.
Examples
pbmc <- normalize(pbmc)
#> ℹ Normalizing datasets "ctrl"
#> ℹ Normalizing datasets "stim"
#> ✔ Normalizing datasets "stim" ... done
#>
#> ℹ Normalizing datasets "ctrl"
#> ✔ Normalizing datasets "ctrl" ... done
#>
# Select based on thresholding the relative variance
pbmc <- selectGenes(pbmc, thresh = .1)
#> ℹ Selecting variable features for dataset "ctrl"
#> ✔ ... 168 features selected out of 249 shared features.
#> ℹ Selecting variable features for dataset "stim"
#> ✔ ... 166 features selected out of 249 shared features.
#> ✔ Finally 173 shared variable features are selected.
# Select specified number for each dataset
pbmc <- selectGenes(pbmc, nGenes = c(60, 60))
#> ℹ Selecting variable features for dataset "ctrl"
#> ✔ ... 60 features selected out of 249 shared features.
#> ℹ Selecting variable features for dataset "stim"
#> ✔ ... 60 features selected out of 249 shared features.
#> ✔ Finally 80 shared variable features are selected.
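The counts reported above also imply how many genes the two datasets share. Assuming the final count is the union of the per-dataset selections (the default combine), the overlap follows from inclusion-exclusion. The helper `overlapSize` below is a hypothetical name used only for this arithmetic.

```r
# |A and B| = |A| + |B| - |A or B| (inclusion-exclusion)
overlapSize <- function(nA, nB, nUnion) nA + nB - nUnion

overlapThresh <- overlapSize(168, 166, 173)  # thresh = 0.1 run
overlapNGenes <- overlapSize(60, 60, 80)     # nGenes = c(60, 60) run

overlapThresh
#> [1] 161
overlapNGenes
#> [1] 40
```

Under the same assumption, these are the sizes one would expect from combine = "intersection" with otherwise identical settings.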