6 Tables

Table 1

A. Deep Learning algorithms reviewed in the paper

Table 6.1: Deep Learning algorithms reviewed in the paper
App Algorithm Models Evaluation Environment Codes Refs
Imputation
DCA AE DREMI Keras, Tensorflow, scanpy https://github.com/theislab/dca (Arisdakessian et al. 2019)
SAVER-X AE+TL t-SNE, ARI R/sctransfer https://github.com/jingshuw/SAVERX (Borgwardt et al. 2006)
DeepImpute DNN MSE, Pearson’s correlation Keras/Tensorflow https://github.com/lanagarmire/DeepImpute (Petegrosso, Li, and Kuang 2020)
LATE AE MSE Tensorflow https://github.com/audreyqyfu/LATE (Buttner et al. 2019)
scGMAI AE NMI, ARI, HS, and CS Tensorflow https://github.com/QUST-AIBBDRC/scGMAI/ (Cover 1999)
scIGANs GAN ARI, ACC, AUC, and F-score PyTorch https://github.com/xuyungang/scIGANs (Tran et al. 2020)
Batch correction
BERMUDA AE+TL kBET, entropy of Mixing, SI PyTorch https://github.com/txWang/BERMUDA (Badsha et al. 2020)
DESC AE ARI, KL Tensorflow https://github.com/eleozzr/desc (T. Wang et al. 2019)
iMAP AE+GAN kBET, LISI PyTorch https://github.com/Svvord/iMAP (X. Li et al. 2020)
Clustering, latent representation, dimension reduction, and data augmentation
Dhaka VAE ARI, Spearman Correlation Keras/Tensorflow https://github.com/MicrosoftGenomics/Dhaka (Hie, Bryson, and Berger 2019)
scvis VAE KNN preservation, log-likelihood Tensorflow https://bitbucket.org/jerry00/scvis-dev/src/master/ (Fowlkes and Mallows 1983)
scVAE VAE ARI Tensorflow https://github.com/scvae/scvae (Rashid et al. 2019)
VASC VAE NMI, ARI, HS, and CS H5py, Keras https://github.com/wang-research/VASC (Tirosh, Izar, et al. 2016)
scDeepCluster AE ARI, NMI, clustering accuracy Keras, Scanpy https://github.com/ttgump/scDeepCluster (Ding, Condon, and Shah 2018)
cscGAN GAN t-SNE, marker genes, MMD, AUC Scipy, Tensorflow https://github.com/imsb-uke/scGAN (D. Wang and Gu 2018)
Multi-functional models (IM: imputation, BC: batch correction, CL: clustering)
scVI VAE IM: L1 distance; CL: ARI, NMI, SI; BC: Entropy of Mixing PyTorch, Anndata https://github.com/YosefLab/scvi-tools (Y. Xu et al. 2020)
LDVAE VAE Reconstruction errors Part of scVI https://github.com/YosefLab/scvi-tools (Xie, Girshick, and Farhadi, n.d.)
SAUCIE AE IM: R2 statistics; CL: SI; BC: modified kBET; Visualization: Precision/Recall Tensorflow https://github.com/KrishnaswamyLab/SAUCIE/ (Amodio et al. 2019)
scScope AE IM:Reconstruction errors; BC: Entropy of mixing; CL: ARI Tensorflow, Scikit-learn https://github.com/AltschulerWu-Lab/scScope (Lindenbaum and Krishnaswamy 2018)
Cell type Identification
DigitalDLSorter DNN Pearson correlation R/Python/Keras https://github.com/cartof/digitalDLSorter (Svensson et al. 2020)
scCapsNet CapsNet Cell-type Prediction accuracy Keras, Tensorflow https://github.com/wanglf19/scCaps (Wolock, Lopez, and Klein 2019)
netAE VAE Cell-type Prediction accuracy, t-SNE for visualization PyTorch https://github.com/LeoZDong/netAE (H. Li et al. 2017)
scDGN DANN Prediction accuracy PyTorch https://github.com/SongweiGe/scDGN (Racle et al. 2017)
Function analysis
CNNC CNN AUROC, AUPRC, and accuracy Keras, Tensorflow https://github.com/xiaoyeye/CNNC (N. D. Patel, Nguang, and Coghill 2007)
scGen VAE Correlation, visualization Tensorflow https://github.com/theislab/scgen (Yuan and Bar-Joseph 2019)
DL Model keywords: AE: autoencoder, AE+TL: autoencoder with transfer learning, VAE: variational autoencoder, GAN: generative adversarial network, CNN: convolutional neural network, DNN: deep neural network, DANN: domain adversarial neural network, CapsNet: capsule neural network
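Several of the imputation tools above (DCA, LATE, scGMAI) are built around an autoencoder that compresses each cell's expression profile through a low-dimensional bottleneck and reconstructs a denoised profile. The sketch below is a minimal Keras illustration of that idea; the layer sizes, MSE loss, and toy data are assumptions for illustration only and not the configuration of any specific published tool (DCA, for example, replaces the MSE loss with a ZINB likelihood).

```python
# Minimal autoencoder sketch for scRNA-seq denoising/imputation (illustrative only;
# layer sizes, MSE loss, and toy data are assumptions, not any published tool's settings).
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_cells, n_genes = 500, 2000
# Toy log-transformed count matrix standing in for a real dataset
x = np.log1p(np.random.poisson(1.0, size=(n_cells, n_genes))).astype("float32")

inputs = keras.Input(shape=(n_genes,))
h = layers.Dense(128, activation="relu")(inputs)           # encoder
z = layers.Dense(32, activation="relu")(h)                 # low-dimensional bottleneck
h_dec = layers.Dense(128, activation="relu")(z)            # decoder
outputs = layers.Dense(n_genes, activation="relu")(h_dec)  # reconstructed expression

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")          # DCA instead optimizes a (ZI)NB likelihood
autoencoder.fit(x, x, epochs=5, batch_size=64, verbose=0)

imputed = autoencoder.predict(x)                           # denoised/imputed expression matrix
```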

B. Comparison of Deep Learning algorithms reviewed in the paper

Table 6.2: Comparison of Deep Learning algorithms reviewed in the paper
App Algorithm Feature Application Notes
Imputation
DCA Strongest recovery of the top 500 genes AE integrated into the Scanpy framework
Choices of noise models, including NB and ZINB High scalability of AE, up to millions of cells
Outperforms other existing methods in capturing cell population structure This method was compared to SAVER, scImpute, and MAGIC
SAVER-X Pretraining from existing data sets (transfer learning) SAVER-X pretraining on PBMCs outperformed other denoising methods, including DCA, scVI, scImpute, and MAGIC
Decomposes the variation into three components SAVER-X was also applied for cross-species analysis
DeepImpute Divide-and-conquer approach, using a bank of DNN models DeepImpute had the highest overall accuracy and offered shorter computation time than other methods like MAGIC, DrImpute, scImpute, SAVER, VIPER, and DCA
Reduced complexity by learning smaller sub-network DeepImpute showed benefits in improving clustering results and identifying significantly differentially expressed genes
Minimized overfitting by removing target genes from input Scalable and faster training time
LATE Takes the log-transformed expression as input LATE outperforms other existing methods like MAGIC, SAVER, DCA, scVI, particularly when the ground truth contains only a few or no zeros
No explicit distribution assumption on input data Better scalability than DCA and scVI up to 1.3 million cells with 10K genes
scGMAI A model designed for clustering that includes an AE Significantly improved the clustering performance in eight of seventeen selected scRNA-seq datasets
Uses the fast independent component analysis algorithm FastICA scGMAI's scalability needs to be improved
scIGANs Trains a GAN model to generate samples with imputed expressions This framework forces the model to reconstruct the real samples and discriminate between real and generated samples.
Best reported performance in clustering compared to DCA, DeepImpute, SAVER, scImpute, MAGIC
Scalable over 100K cells, also robust to small datasets
Batch correction
BERMUDA Preserves batch-specific biological signals through transfer learning Preserves batch-specific cell populations Outperforms other methods like mnnCorrect, BBKNN, Seurat, and scVI
Removes batch effects even when the cell population compositions across different batches are vastly different
Scalable by using mini-batch gradient descent algorithm during training
DESC Removes batch effect through clustering with the hypothesis that batch differences in expressions are smaller than true biological variations DESC is effective in removing the batch effect, whereas CCA, MNN, Seurat 3.0, scVI, BERMUDA, and scanorama were sensitive to batch definitions
Does not require explicit batch information for batch removal DESC is biologically interpretable and can reveal both discrete and pseudo-temporal structures of cells
Small memory footprint and GPU enabled
iMAP iMAP combines AE and GAN for batch effect removal iMAP was shown to separate batch-specific cell types while mixing batch-shared cell types, and outperformed other existing batch correction methods including Harmony, scVI, fastMNN, and Seurat
It consists of two processing stages, each including a separate DL model Capable of handling datasets from Smart-seq2 and 10X Genomics platforms
Demonstrated the stability over hyperparameters, and scalability for thousands of cells.
Clustering, latent representation, dimension reduction, and data augmentation
Dhaka It was proposed to reduce the dimension of scRNA-seq data for efficient stratification of tumor subpopulations Dhaka was shown to have an ARI higher than most other compared methods including t-SNE, PCA, SIMLR, NMF, an autoencoder, MAGIC, and scVI
Dhaka can represent an evolutionary trajectory
scvis VAE network that learns low-dimensional representations scvis was tested on the simulated data and outperformed t-SNE
Captures both local and global neighboring structures scvis is much more scalable than BH t-SNE (t-SNE takes O(M log(M)) time and O(M log(M)) space) for very large datasets (>1 million cells)
scVAE scVAE includes multiple VAE models for denoising gene expression levels and learning the low-dimensional latent representation GMVAE was also compared with Seurat and shown to perform better; however, scVAE performed no better than scVI or scvis
Gaussian Mixture VAE (GMVAE) with negative binomial distribution achieved the highest lower bound and ARI Algorithm applicable to large datasets with millions of cells
VASC Another VAE for dimension reduction and latent representation VASC was compared with PCA, t-SNE, ZIFA, and SIMLR on 20 datasets
Explicitly models dropout with a zero-inflated layer In the study of embryonic development from zygote to blast cells, VASC showed better performance in modeling embryo developmental progression
VASC is reported to handle a large number of cells or cell types
Demonstrated model stability
scDeepCluster AE network that simultaneously learns feature representation and performs clustering via explicit modeling of cell clusters It was tested on simulated data with different dropout rates and compared with DCA, MPSSC, SIMLR, CIDR, PCA + k-means, scvis, and DEC, significantly outperforming them
Running time of scDeepCluster scales linearly with the number of cells, suitable for large scRNA-seq datasets
cscGAN It was designed to augment the existing scRNA-seq samples by generating expression profiles of specific cell types or subpopulations cscGAN was shown to generate high-quality scRNA-seq data for specific cell types.
The cscGAN learns the expression patterns of a specific subpopulation (few cells), and simultaneously learns the cells from all populations (a large number of cells). The augmentation in this method improved the identification of rare cell types and the ability to capture transitional cell states from trajectory analysis
Better scalability than SUGAR (Synthesis Using Geometrically Aligned Random-walks)
Capable of re-establishing developmental trajectories through pseudo-time analysis via cscGAN data augmentation
Multi-functional models (IM: imputation, BC: batch correction, CL: clustering)
scVI Designed to address a range of fundamental analysis tasks, including batch correction, visualization, clustering, and differential expression scVI was shown to be faster than most non-DL algorithms and scalable to handle twice as many cells as non-DL algorithms with a fixed memory
Integrated a normalization procedure and batch correction For imputation, scVI, together with other ZINB-based models, performed better than methods using alternative distributions
Similar scalability as DCA
LDVAE Adaptation of scVI to improve model interpretability For LDVAE, the variations along the different axes of the latent variable establish direct linear relationships with input genes.
SAUCIE It is applied to normalized data instead of count data Results showed that SAUCIE had better or comparable performance to other approaches
SAUCIE has better scalability and faster runtimes than any of the other models
Applications with single-cell CyTOF datasets
scScope AE with recurrent steps designed for imputation and batch correction It was compared with PCA, MAGIC, ZINB-WaVE, SIMLR, AE, scVI, and DCA
Efficiently identifies cell subpopulations from complex datasets with high dropout rates, large numbers of subpopulations, and rare cell types
scScope was shown to scale to >100K cells with high efficiency (faster training than most of the compared approaches)
Cell type Identification
DigitalDLSorter A deconvolution model with a 4-layer DNN DigitalDLSorter achieved excellent agreement in predicting cell-type proportions (linear correlation of 0.99 for colorectal cancer, and a good quadratic relationship for breast cancer).
Designed to identify and quantify the immune cells infiltrating tumors captured in bulk RNA-seq, utilizing single-cell RNA-seq data
scCapsNet It takes log-transformed, normalized expressions as input and follows the general CapsNet model Interpretable capsule network designed for cell type prediction
scCapsNet makes the deep-learning black box transparent through the direct interpretation of internal parameters
netAE VAE-based semi-supervised cell type prediction model Deals with scenarios of having a small number of labeled cells.
Aims to learn a low-dimensional space from which the original space can be accurately reconstructed netAE outperformed most of the baseline methods, including scVI, ZIFA, PCA, and AE, as well as the semi-supervised method scANVI
scDGN scDGN takes the log-transformed, normalized expression as the input scDGN was tested for classifying cell types and aligning batches
Supervised domain adversarial network scDGN outperformed many deep learning and traditional machine learning methods in classification accuracy, including DNN, CaSTLe, MNN, scVI, and Seurat
Function analysis
CNNC CNNC takes expression levels of two genes from many cells and transforms them into a 32 x 32 image-like normalized empirical probability function CNNC outperforms prior methods for inferring TF–gene and protein–protein interactions, causality inference, and functional assignments
Inferring causal interactions between genes from scRNA-seq Was shown to have more than 20% higher AUPRC than other methods and reported almost no false negatives for the top 5% of predictions
scGen scGen follows the general VAE for scRNA-seq data but uses “latent space arithmetics” to learn perturbation responses Compared with other methods including CVAE, style-transfer GAN, linear approaches based on vector arithmetic (VA), and PCA+VA, scGen predicted the full distribution of the ISG15 gene (the gene most strongly regulated by IFN-β) response to IFN-β
Designed to learn cell responses to certain perturbations (drug treatment, gene knockout, etc.) scGen can be used to translate the effect of a stimulation trained in study A to how stimulated cells would look in study B, given a control sample set
Abbreviations: NB: negative binomial distribution; ZINB: zero-inflated negative binomial distribution; TF: transcription factor.
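As a concrete illustration of how one of these multi-functional models is typically driven in practice, the sketch below runs scVI through the scvi-tools package listed in Table 6.1 on a small public PBMC dataset. It is a minimal sketch assuming a recent scanpy/scvi-tools installation; the dataset choice, filtering threshold, and training epochs are illustrative and not the settings used in the reviewed benchmarks.

```python
# Minimal scVI workflow sketch (assumes recent scanpy + scvi-tools; parameters illustrative).
import scanpy as sc
import scvi

adata = sc.datasets.pbmc3k()                     # small public 10X PBMC dataset
sc.pp.filter_genes(adata, min_counts=3)          # drop genes with almost no counts
adata.layers["counts"] = adata.X.copy()          # scVI models raw counts

# For batch correction (BC), a batch_key argument would also be passed here
scvi.model.SCVI.setup_anndata(adata, layer="counts")
model = scvi.model.SCVI(adata)                   # VAE with an NB/ZINB likelihood
model.train(max_epochs=20)

latent = model.get_latent_representation()       # low-dimensional embedding (CL, visualization)
denoised = model.get_normalized_expression()     # denoised expression values (IM)
```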

Table 2

A. Simulated single-cell data/algorithms

Table 6.3: Simulated single-cell data/algorithms
Title Algorithm Number of Cells Simulation Methods Refs
Splatter DCA, DeepImpute, BERMUDA, scDeepCluster, scVI, scScope, solo ~2000 Splatter/R (Tian 2019)
CIDR scIGANs 50 CIDR simulation [54, Reference Not Found]
NB+dropout Dhaka 500 Hierarchical model of NB/Gamma + random dropout
Bulk RNA-seq SAUCIE 1076 1076 CCLE bulk RNA-seq profiles + dropout conditional on expression level
SIMLR scScope 1 million SIMLR, high-dimensional data generated from latent vector (Miao et al. 2018)

B. Human single-cell data sources used by different DL algorithms

Table 6.4: Human single-cell data sources used by different DL algorithms
Title Algorithm Cell Origin Number of Cells Data Sources Refs
68k PBMCs DCA, SAVER-X, LATE, scVAE, scDeepCluster, scCapsNet, scDGN Blood 68,579 10X Single Cell Gene Expression Datasets
Human pluripotent DCA hESCs 1,876 GSE102176 (Lotfollahi, Wolf, and Theis 2019)
CITE-seq SAVER-X Cord blood mononuclear cells 8,005 GSE100866 (Duvenaud 2015)
Midbrain and Dopaminergic Neuron Development SAVER-X Brain/ embryo ventral midbrain cells 1,977 GSE76381 [124, Ref Not Found]
HCA SAVER-X Immune cell, Human Cell Atlas 500,000 HCA data portal
Breast tumor SAVER-X Immune cell in tumor micro-environment 45,000 GSE114725 (Kang et al. 2018)
293T cells DeepImpute, iMAP Embryonic kidney 13,480 10X Single Cell Gene Expression Datasets
Jurkat DeepImpute, iMAP Blood/ lymphocyte 3,200 10X Single Cell Gene Expression Datasets
ESC, Time-course scGAN ESC 350,758 GSE75748 (Haber et al. 2017)
Baron-Hum-1 scGMAI, VASC Pancreatic islets 1,937 GSM2230757 (Hagai et al. 2018)
Baron-Hum-2 scGMAI, VASC Pancreatic islets 1,724 GSM2230758 (Hagai et al. 2018)
Camp scGMAI, VASC Liver cells 303 GSE96981 (Y. Peng et al. 2018)
CEL-seq2 BERMUDA, DESC Pancreas/Islets of Langerhans GSE85241 (Stoeckius et al. 2017)
Darmanis scGMAI, scIGANs, VASC Brain/cortex 466 GSE67835 (Azizi et al. 2018)
Tirosh-brain Dhaka, scvis Oligodendroglioma >4,800 GSE70630 (Chu et al. 2016)
Patel Dhaka Primary glioblastoma cells 875 GSE57872 (210?)
Li scGMAI, VASC Blood 561 GSE146974 (T. Wang et al. 2019)
Tirosh-skin scvis melanoma 4,645 GSE72056 (D. Wang et al. 2021)
xenograft 3, and 4 Dhaka Breast tumor ~250 EGAS00001002170 (Camp et al. 2017)
Petropoulos VASC/netAE Human embryos 1,529 E-MTAB-3929
Pollen scGMAI, VASC 348 SRP041736 (Muraro et al. 2016)
Xin scGMAI, VASC Pancreatic cells (a-, ß-, d-) 1,600 GSE81608 (Darmanis et al. 2015)
Yan scGMAI, VASC embryonic stem cells 124 GSE36552 (Tirosh, Venteicher, et al. 2016)
PBMC3k VASC, scVI Blood 2,700 SRP073767 (Torroja and Sanchez-Cabo 2019)
CyTOF, Dengue SAUCIE Dengue infection 11 M, ~42 antibodies Cytobank: 82023 (Amodio et al. 2019)
CyTOF, ccRCC SAUCIE Immune profile of 73 ccRCC patients 3.5 M, ~40 antibodies Cytobank: 875 (A. P. Patel et al. 2014)
CyTOF, breast SAUCIE 3 patients Flow Repository: FR-FCM-ZYJP (Kang et al. 2018)
Chung, BC DigitalDLSorter Breast tumor 515 GSE75688 (Levine et al. 2015)
Li, CRC DigitalDLSorter Colorectal cancer 2,591 GSE81861 (Qiu et al. 2017)
Pancreatic datasets scDGN Pancreas 14,693 SeuratData
Kang, PBMC scGen PBMC stimulated by IFN-β ~15,000 GSE96583 (Y. X. Wang, Waterman, and Huang 2014)

C. Mouse single-cell data sources used by different DL algorithms

Table 6.5: Mouse single-cell data sources used by different DL algorithms
Title Algorithm Cell Origin Number of Cells Data Sources Refs
Brain cells from E18 mice DCA, SAUCIE Brain Cortex 1,306,127 10x: Single Cell Gene Expression Datasets
Midbrain and Dopaminergic Neuron Development SAVER-X Ventral Midbrain 1,907 GSE76381 (La Manno et al. 2016)
Mouse cell atlas SAVER-X NA 405,796 GSE108097 (Han et al. 2018)
neuron9k DeepImpute Cortex 9,128 10x: Single Cell Gene Expression Datasets
Mouse Visual Cortex DeepImpute Brain cortex 114,601 GSE102827 (Hrvatin et al. 2018)
murine epidermis DeepImpute Epidermis 1,422 GSE67602 (Joost et al. 2016)
myeloid progenitors LATE, DESC, SAUCIE Bone marrow 2,730 GSE72857 (Paul et al. 2015)
Cell-cycle scIGANs mESC 288 E-MTAB-2805 (Buettner et al. 2015)
A single-cell survey NA Intestine 7,721 GSE92332 (Haber et al. 2017)
Tabula Muris iMAP Mouse cells >100K NA
Baron-Mou-1 VASC Pancreas 822 GSM2230761 (Baron et al. 2016)
Biase scGMAI, VASC Embryos/SMARTer 56 GSE57249 (Biase, Cao, and Zhong 2014)
Biase scGMAI, VASC Embryos/Fluidigm 90 GSE59892 (Biase, Cao, and Zhong 2014)
Deng scGMAI, VASC Liver 317 GSE45719 (Chu et al. 2016)
Klein VASC, scDeepCluster, scIGANs Stem Cells 2,717 GSE65525 (Klein et al. 2015)
Goolam VASC Mouse Embryo 124 E-MTAB-3321 (Goolam et al. 2016)
Kolodziejczyk VASC mESC 704 E-MTAB-2600 (Kim et al. 2015)
Usoskin VASC Lumbar 864 GSE59739 (Usoskin et al. 2015)
Zeisel VASC, scVI, SAUCIE, netAE Cortex, hippocampus 3,005 GSE60361 (Zeisel et al. 2015)
Bladder cells scDeepCluster Bladder 12,884 GSE129845 (Baron et al. 2016)
HEMATO scVI Blood cell >10,000 GSE89754 (Tusi et al. 2018)
retinal bipolar cells scVI, scCapsNet, SAUCIE retinal ~25,000 GSE81905 (Shekhar et al. 2016)
Embryo at 9 time points LDVAE embryos from E6.5 to E8.5 116,312 GSE87038 (Pijuan-Sala et al. 2019)
Embryo at 9 time points LDVAE embryos from E9.5 to E13.5 ~2 million GSE119945 (Cao et al. 2019)
CyTOF SAUCIE Mouse thymus 200K, ~38 antibodies Cytobank: 52942 (Setty et al. 2016)
Solo Solo Mouse kidneys ~8,000 GSE140262 (Bernstein et al. 2020)
Nestorowa netAE hematopoietic stem and progenitor cells 1,920 GSE81682 (Nestorowa et al. 2016)
small intestinal epithelium scGen Infected with Salmonella and worm H. polygyrus 1,957 GSE92332 (Haber et al. 2017)

D. Single-cell data derived from other species

Table 6.6: Single-cell data derived from other species
Title Algorithm Species Tissue Number of Cells Data Sources Refs
Worm neuron cells\(^{1}\) scDeepCluster C. elegans Neuron 4,186 GSE98561 (Joost et al. 2016)
Cross species, stimulation with LPS and dsRNA scGen Mouse, rat, rabbit, and pig bone marrow-derived phagocyte 5,000 to 10,000 /species 13 accessions in ArrayExpress (Kanehisa et al. 2017)
1 Processed data is available at https://github.com/ttgump/scDeepCluster/tree/master/scRNA-seq%20data

E. Large single-cell data source used by various algorithms

Table 6.7: Large single-cell data source used by various algorithms
Title Sources Notes
10X Single-cell gene expression dataset https://support.10xgenomics.com/single-cell-gene-expression/datasets Contains a large collection of scRNA-seq datasets generated using the 10X system
Tabula Muris https://tabula-muris.ds.czbiohub.org/ Compendium of scRNA-seq data from mouse
HCA https://data.humancellatlas.org/ Human single-cell atlas
MCA https://figshare.com/s/865e694ad06d5857db4b, or GSE108097 Mouse single-cell atlas
scQuery https://scquery.cs.cmu.edu/ A web server for cell-type matching and key gene visualization. It is also a source of collected scRNA-seq data (processed with a common pipeline)
SeuratData https://github.com/satijalab/seurat-data List of datasets, including PBMC and human pancreatic islet cells
Cytobank https://cytobank.org/ Community platform for big-data cytometry
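For readers who want to pull data from these portals into Python, the hedged sketch below shows the standard scanpy readers for the formats these sources typically distribute; every file and folder name is a placeholder for data you would download yourself from the listed URLs.

```python
# Reading downloaded single-cell data with scanpy (file/folder names are placeholders).
import scanpy as sc

# A matrix folder downloaded from the 10X "Single Cell Gene Expression" datasets page
adata_10x = sc.read_10x_mtx("filtered_gene_bc_matrices/hg19/", var_names="gene_symbols")

# An .h5ad file exported from a portal such as the Human Cell Atlas
adata_h5ad = sc.read_h5ad("hca_subset.h5ad")

# GEO supplementary files (e.g., for MCA, GSE108097) are often plain count tables
adata_csv = sc.read_csv("GSE108097_counts.csv").T   # transpose if genes are rows

print(adata_10x, adata_h5ad, adata_csv, sep="\n")
```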

Table 3

Evaluation metrics used in surveyed DL algorithms

Table 6.8: Evaluation metrics used in surveyed DL algorithms
Evaluation Method Equations Explanation
Pseudobulk RNA-seq The average of normalized (log2-transformed) scRNA-seq counts across cells is calculated, and the correlation coefficient between this pseudobulk profile and the actual bulk RNA-seq profile of the same cell type is then evaluated.
Mean squared error (MSE) \(MSE=\frac{1}{n} \sum_{i=1}^{n}(x_{i}- \hat{x}_{i})^{2}\) MSE assesses the quality of a predictor, or an estimator, from a collection of observed data \(x\), with \(\hat{x}\) being the predicted values.
Pearson correlation \(\rho_{X,Y}=\frac{cov(X,Y)}{\sigma_{X}\sigma_{Y}}\) where cov() is the covariance, and \(\sigma_{X}\) and \(\sigma_{Y}\) are the standard deviations of \(X\) and \(Y\), respectively.
Spearman correlation \(\rho_{s}=\rho_{r_{X},r_{Y}}=\frac{cov(r_X,r_Y)}{\sigma_{r_X}\sigma_{r_Y}}\) The Spearman correlation coefficient is defined as the Pearson correlation coefficient between the rank variables, where \(r_{X}\) is the rank of \(X\).
Entropy of accuracy, Hacc (Tran et al. 2020) \(H_{acc}=-\frac{1}{M} \sum_{i=1}^{M} \sum_{j=1}^{N_i} p_i(x_j)logp_{i}(x_{j})\) Measures the diversity of the ground-truth labels within each predicted cluster group. \(p_{i}(x_{j})\) (or \(q_{i}(x_{j})\)) are the proportions of cells in the \(j\)th ground-truth cluster (or predicted cluster) relative to the total number of cells in the \(i\)th predicted cluster (or ground-truth clusters), respectively.
Entropy of purity, Hpur (Tran et al. 2020) \(H_{pur}=-\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M_i}q_i(x_j)logq_{i}(x_{j})\) Measures the diversity of the predicted cluster labels within each ground-truth group
Entropy of mixing (Haghverdi et al. 2018) \(E=\sum_{i=1}^{C}p_{i}\log(p_{i})\) This metric evaluates the mixing of cells from different batches in the neighborhood of each cell. \(C\) is the number of batches, and \(p_{i}\) is the proportion of cells from batch \(i\) among \(N\) nearest cells.
Mutual Information (MI) (Strehl and Ghosh 2002) \(MI(U,V)=\sum_{i=1}^{|U|}\sum_{j=1}^{|V|}P_{UV}(i,j)log(\frac{P_{UV}(i,j)}{P_{U}(i)P_{V}(j)})\) where \(P_{U}(i)=\frac{|U_{i}|}{N}\) and \(P_{V}(j)=\frac{|V_{j}|}{N}\), and the joint distribution is defined as \(P_{UV}(i,j)=\frac{|U_{i} \cap V_{j}|}{N}\). The \(MI\) is a measure of mutual dependency between two cluster assignments \(U\) and \(V\).
Normalized Mutual Information (NMI) [165, BIB not found] \(NMI(U,V)=\frac{2 \times MI(U,V)}{[H(U)+H(V)]}\) where \(H(U)=-\sum_{i} P_{U}(i)log(P_{U}(i))\) and \(H(V)=-\sum_{j} P_{V}(j)log(P_V(j))\). The \(NMI\) is a normalization of the \(MI\) score between 0 and 1.
Kullback–Leibler (KL) divergence [166, BIB not found] \(D_{KL}(P||Q)=\sum_{x \in \chi}P(x)log(\frac{P(x)}{Q(x)})\) where the discrete probability distributions \(P\) and \(Q\) are defined on the same probability space \(\chi\). This relative entropy is a measure of the directed divergence between two distributions.
Jaccard Index \(J(U,V)=\frac{|U \cap V|}{|U \cup V|}\) \(0 \le J(U,V) \le 1\). \(J = 1\) if clusters \(U\) and \(V\) are the same. If \(U\) and \(V\) are both empty, \(J\) is defined as 1.
Fowlkes-Mallows Index for two clustering algorithms (FM) \(FM=\sqrt{ \frac{TP}{TP + FP} \times \frac{TP}{TP+FN} }\) TP as the number of pairs of points that are present in the same cluster in both \(U\) and \(V\); \(FP\) as the number of pairs of points that are present in the same cluster in \(U\) but not in \(V\); \(FN\) as the number of pairs of points that are present in the same cluster in \(V\) but not in \(U\); and TN as the number of pairs of points that are in different clusters in both U and V.
Rand index (RI) \(RI=\frac{(a+b)}{\binom{n}{2}}\) Measure of constancy between two clustering outcomes, where \(a\) (or \(b\)) is the count of pairs of cells in one cluster (or different clusters) from one clustering algorithm but also fall in the same cluster (or different clusters) from the other clustering algorithm.
Adjusted Rand index (ARI) (Hubert and Arabie 1985) \(ARI=\frac{RI-E[RI]}{max(RI)-E[RI]}\) ARI is a corrected-for-chance version of RI, where \(E[RI]\) is the expected Rand Index.
Silhouette index \(s(i)=\frac{b(i)-a(i)}{max(a(i),b(i))}\) where \(a(i)\) is the average dissimilarity of the \(i\)th cell to all other cells in the same cluster, and \(b(i)\) is the average dissimilarity of the \(i\)th cell to all cells in the closest cluster. The range of \(s(i)\) is [-1,1], with 1 indicating a well-clustered cell and -1 a completely misclassified cell.
Maximum Mean Discrepancy (MMD) (Borgwardt et al. 2006) \(MMD(F,p,q)=sup_{f \in F}||\mu_{p}-\mu_{q}||_{f}\) \(MMD\) is a non-parametric distance between distributions based on the reproducing kernel Hilbert space, or, a distance-based measure between two distribution \(p\) and \(q\) based on the mean embeddings \(\mu_{p}\) and \(\mu_{q}\) in a reproducing kernel Hilbert space \(F\).
k-Nearest neighbor batch-effect test (kBET) (Buttner et al. 2019) \(a_{n}^{k}=\sum_{l=1}^{L}\frac{(N_{nl}^{k} - k \bullet f_{l})^{2}}{k \bullet f_{l}} \sim \chi_{L-1}^{2}\) Given a dataset of \(N\) cells from \(L\) batches with \(N_l\) denoting the number of cells in batch \(l\), \(N_{nl}^{k}\) is the number of cells from batch \(l\) in the \(k\)-nearest neighbors of cell \(n\), \(f_{l}\) is the global fraction of cells in batch \(l\), or \(f_{l}=\frac{N_l}{N}\), and \(\chi_{L-1}^{2}\) denotes the \(\chi^{2}\) distribution with \(L-1\) degrees of freedom. It uses a \(\chi^{2}\)-based test for random neighborhoods of fixed size to determine the significance (“well-mixed”).
Local Inverse Simpson’s Index (LISI) (Korsunsky et al. 2019) \(\frac{1}{ \lambda(n)}=\frac{1}{\sum_{l=1}^{L}(p(l))^{2}}\) This is the inverse Simpson’s Index in the \(k\)-nearest neighbors of cell \(n\) for all batches, where \(p(l)\) denotes the proportion of batch \(l\) in the \(k\)-nearest neighbors. The score reports the effective number of batches in the \(k\)-nearest neighbors of cell \(n\).
Homogeneity \(HS=1-\frac{H(P(U|V))}{H(P(U))}\) where \(H()\) is the entropy, \(U\) is the ground-truth assignment, and \(V\) is the predicted assignment. \(HS\) ranges from 0 to 1, where 1 indicates perfectly homogeneous labeling.
Completeness \(CS=1-\frac{H(P(V|U))}{H(P(V))}\) Its values range from 0 to 1, where 1 indicates all members from a ground-truth label are assigned to a single cluster.
V-Measure [169, BIB not found] \(V_{\beta}=\frac{(1+\beta)HS \times CS}{\beta HS + CS}\) where \(\beta\) indicates the weight of \(HS\). \(V\)-Measure is symmetric, i.e. switching the true and predicted cluster labels does not change \(V\)-Measure.
Precision, recall \(Precision = \frac{TP}{TP+FP}, recall=\frac{TP}{TP+FN}\) TP: true positive, FP: false positive, FN: false negative.
Accuracy \(Accuracy = \frac{TP+TN}{N}\) N: all samples tested, TN: true negative
F1-score \(F_{1}=\frac{2Precision \bullet Recall}{Precision+Recall}\) A harmonic mean of precision and recall. It can be extended to \(F_\beta\) where \(\beta\) is a weight between precision and recall (similar to \(V\)-measure).
AUC, AUROC Area under the curve (AUC) of the receiver operating characteristic (ROC) curve. A similar measure can be performed on the Precision-Recall curve (PRC), or AUPRC. Precision-Recall curves summarize the trade-off between the true positive rate and the positive predictive value for a predictive model (mostly for an imbalanced dataset).
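Most of the clustering and classification metrics in Table 6.8 have off-the-shelf implementations. The sketch below is a minimal example on toy labels rather than real benchmark output; it shows how ARI, NMI, HS/CS/V-Measure, the Fowlkes-Mallows and Silhouette indices, the correlation coefficients, and a per-cell entropy of mixing can be computed with scikit-learn and scipy. Variable names and data are illustrative assumptions.

```python
# Computing common evaluation metrics with scikit-learn/scipy (toy data; illustrative only).
import numpy as np
from scipy.stats import pearsonr, spearmanr, entropy
from sklearn import metrics

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))                    # embedding used for clustering
labels_true = rng.integers(0, 3, size=300)        # ground-truth cell types
labels_pred = rng.integers(0, 3, size=300)        # predicted cluster assignments

ari = metrics.adjusted_rand_score(labels_true, labels_pred)            # ARI
nmi = metrics.normalized_mutual_info_score(labels_true, labels_pred)   # NMI
hs  = metrics.homogeneity_score(labels_true, labels_pred)              # Homogeneity (HS)
cs  = metrics.completeness_score(labels_true, labels_pred)             # Completeness (CS)
v   = metrics.v_measure_score(labels_true, labels_pred)                # V-Measure
fm  = metrics.fowlkes_mallows_score(labels_true, labels_pred)          # Fowlkes-Mallows index
sil = metrics.silhouette_score(X, labels_pred)                         # Silhouette index

r, _   = pearsonr(X[:, 0], X[:, 1])                                    # Pearson correlation
rho, _ = spearmanr(X[:, 0], X[:, 1])                                   # Spearman correlation

# Entropy of mixing for one cell: batch proportions among its k nearest neighbors
# (scipy's entropy returns -sum(p * log p))
p_batches = np.array([0.5, 0.3, 0.2])
mixing = entropy(p_batches)

print(ari, nmi, hs, cs, v, fm, sil, r, rho, mixing)
```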