Calculate Normalized Mutual Information (NMI) by comparing two cluster labeling variables
Source: R/clustering.R
This function calculates the Normalized Mutual Information (NMI) between the clustering result obtained with LIGER and an external clustering (an existing "true" annotation). NMI ranges from 0 to 1, where 0 indicates no agreement between the two clusterings and 1 indicates perfect agreement. NMI is defined as follows:
$$ H(X) = -\sum_{x \in X}P(X=x)\log_2 P(X=x) $$
$$ H(X|Y) = -\sum_{y \in Y}P(Y=y)\sum_{x \in X}P(X=x|Y=y)\log_2 P(X=x|Y=y) $$
$$ I(X;Y) = H(X) - H(X|Y) $$
$$ NMI(X;Y) = \frac{I(X;Y)}{\sqrt{H(X)H(Y)}} $$
Where \(X\) is the cluster variable to be evaluated and \(Y\) is the true cluster variable. \(x\) and \(y\) are the cluster labels in \(X\) and \(Y\) respectively. \(H\) is the entropy and \(I\) is the mutual information.
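The formulas above can be sketched in base R as follows. This is only an illustration of the definition, not the code used inside calcNMI: the helper names entropy and nmiSketch are hypothetical, and the conditional entropy is obtained through the standard identity H(X|Y) = H(X,Y) - H(Y) rather than by the double sum.

```r
# Hypothetical helpers for illustration; not part of rliger.

# Shannon entropy (base 2) of a probability vector.
entropy <- function(p) {
  p <- p[p > 0]          # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}

# NMI of two label vectors of equal length, following the definition above.
nmiSketch <- function(x, y) {
  stopifnot(length(x) == length(y))
  tab <- table(x, y)
  joint <- tab / sum(tab)            # P(X = x, Y = y)
  hx <- entropy(rowSums(joint))      # H(X)
  hy <- entropy(colSums(joint))      # H(Y)
  # H(X|Y) = H(X, Y) - H(Y), equivalent to the double-sum definition
  hxGivenY <- entropy(as.vector(joint)) - hy
  mi <- hx - hxGivenY                # I(X;Y)
  # Undefined (NaN) if either labeling has only one cluster, since H = 0
  mi / sqrt(hx * hy)
}

# Same partition under different label names: perfect agreement
same <- nmiSketch(c("a", "a", "b", "b", "c", "c"), c(1, 1, 2, 2, 3, 3))

# Labelings that are statistically independent: no agreement
indep <- nmiSketch(c(1, 1, 2, 2), c(1, 2, 1, 2))
```

Here same should come out at 1 (up to floating-point error) and indep at 0, matching the two ends of the range described above.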
The true clustering annotation must be specified as the baseline. We suggest storing it in the cellMeta of the object so that it can be easily used by many other visualization and evaluation functions.
The NMI can be calculated for only the specified datasets, since a true annotation might not be available for all datasets. Evaluation for only one or a few datasets can be done by specifying useDatasets. If useDatasets is specified, trueCluster and useCluster are checked against only the cells in the specified datasets.
Usage
calcNMI(
  object,
  trueCluster,
  useCluster = NULL,
  useDatasets = NULL,
  verbose = getOption("ligerVerbose", TRUE)
)
Arguments
- object
A liger object, with the clustering result present in cellMeta.
- trueCluster
Either the name of one variable in cellMeta(object) or a factor object with annotation that matches all cells being considered.
- useCluster
The name of one variable in cellMeta(object). Default NULL uses default clusters.
- useDatasets
A character vector of dataset names, or a numeric or logical vector of dataset indices, specifying the datasets to be considered for the NMI calculation. Default NULL uses all datasets.
- verbose
Logical. Whether to show progress information. Default getOption("ligerVerbose"), or TRUE if users have not set it.
Examples
# Assume the true cluster in `pbmcPlot` is "leiden_cluster"
# generate fake new labeling
fake <- sample(1:7, ncol(pbmcPlot), replace = TRUE)
# Insert into cellMeta
pbmcPlot$new <- factor(fake)
calcNMI(pbmcPlot, trueCluster = "leiden_cluster", useCluster = "new")
#> [1] 0.01796525
# Now assume we have an existing baseline annotation only for the "stim" dataset
nStim <- ncol(dataset(pbmcPlot, "stim"))
stimTrueLabel <- factor(fake[1:nStim])
# Insert into cellMeta
cellMeta(pbmcPlot, "stim_true_label", useDatasets = "stim") <- stimTrueLabel
# Assume "leiden_cluster" is the clustering result we obtained and that
# needs to be evaluated
calcNMI(pbmcPlot, trueCluster = "stim_true_label",
        useCluster = "leiden_cluster", useDatasets = "stim")
#> [1] 0.02723257
# Comparison of the same labeling should always yield 1.
calcNMI(pbmcPlot, trueCluster = "leiden_cluster", useCluster = "leiden_cluster")
#> [1] 1