4 Overview of datasets and evaluation metrics

A variety of datasets and performance evaluation metrics have been used to demonstrate the performance of the surveyed DL models for different tasks. We summarize these datasets and evaluation metrics in Tables 2 and 3, and detail the mathematical definitions of the evaluation metrics in the following.

4.1 Evaluation methods

An extensive list of evaluation methods has been proposed for different scRNA-seq analysis tasks (Tran et al. 2020; Hou et al. 2020; Sun and Zhou 2019). We provide here an overview of the methods adopted in the surveyed papers. We discuss them according to the key categories by which the surveyed papers are organized, namely, imputation, batch effect correction, dimension reduction and clustering, cell type identification, and functional analysis.

4.1.1 Imputation

The evaluation of imputation methods considers their ability to recover biological signals and to improve downstream analyses. Two main approaches have been used for this purpose: first, evaluating the similarity between bulk and imputed scRNA-seq data; second, evaluating the effect of imputation on unsupervised clustering.

The first approach assesses the similarity between bulk and imputed scRNA-seq data. For a given scRNA-seq dataset, the “pseudobulk,” or the average of normalized (log2-transformed) scRNA-seq counts across cells, is calculated first, and the Spearman’s rank correlation coefficient (SCC) between the pseudobulk and the bulk RNA-seq profile of the same cell type is evaluated. Statistical significance is then assessed by testing whether the SCCs between the bulk profile and the pseudobulks produced by two imputation methods are equal.
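As a minimal sketch, this comparison can be computed with NumPy and SciPy; the function name and the pseudocount choice are illustrative, and `scipy.stats.spearmanr` supplies the SCC:

```python
import numpy as np
from scipy.stats import spearmanr

def pseudobulk_scc(sc_counts, bulk_profile):
    """SCC between a pseudobulk of scRNA-seq data and a matched bulk profile.

    sc_counts:    (cells x genes) matrix of normalized scRNA-seq counts
    bulk_profile: (genes,) bulk RNA-seq expression for the same cell type
    """
    # log2-transform with a pseudocount of 1, then average across cells
    pseudobulk = np.log2(np.asarray(sc_counts) + 1).mean(axis=0)
    scc, _ = spearmanr(pseudobulk, bulk_profile)
    return scc
```

A higher SCC for one imputation method's pseudobulk than another's, on the same bulk reference, is the evidence compared in the significance test described above.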

The second approach measures the accuracy of the clustering assignments obtained after imputation using four metrics:

  • Entropy of accuracy (\(H_{acc}\)) and entropy of purity (\(H_{pur}\)). \(H_{acc}\) (\(H_{pur}\)) measures the diversity of the ground-truth labels (predicted cluster labels) within each predicted cluster (ground-truth group), respectively.

\[H_{acc}=-\frac{1}{M}\sum_{i=1}^{M}\sum_{j=1}^{N_{i}}p_{i}(x_{j})\log{p_{i}(x_{j})}\]

\[H_{pur}=-\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M_{i}}q_{i}(x_{j})\log{q_{i}(x_{j})}\]

where \(M\) is the total number of predicted clusters from the clustering algorithm, \(N\) is the number of ground-truth clusters, and \(M_{i}\) (or \(N_{i}\)) is the number of predicted clusters (or ground-truth clusters) represented in the \(i\)th ground-truth cluster (or predicted cluster), respectively. \(p_{i}(x_{j})\) (or \(q_{i}(x_{j})\)) is the proportion of cells carrying the \(j\)th ground-truth label (or predicted cluster label) relative to the total number of cells in the \(i\)th predicted cluster (or ground-truth cluster), respectively. A smaller \(H_{acc}\) means the cells in each predicted cluster are consistently labeled with the same ground-truth group, while a smaller \(H_{pur}\) means the cells within each ground-truth group receive homogeneous predicted cluster labels (Tran et al. 2020). However, minimizing \(H_{acc}\) (or \(H_{pur}\)) alone can lead to over-clustering (or under-clustering): \(H_{acc} = 0\) when each predicted cluster contains a single cell, and \(H_{pur} = 0\) when all cells are placed in one predicted cluster.

  • Adjusted Rand index (ARI). The Rand index (RI) is another measure of consistency between two clustering outcomes. If \(a\) (or \(b\)) is the number of pairs of cells that fall in the same cluster (or in different clusters) under one clustering algorithm and also fall in the same cluster (or in different clusters) under the other, then \(RI=(a+b)/{\binom{n}{2}}\), where \(\binom{n}{2}\) is the total number of pairs among \(n\) cells. The RI takes values between 0 and 1, with 0 indicating that the two clusterings do not agree on any pair of cells and 1 indicating that they are identical. The ARI is a corrected-for-chance version of the \(RI\), or

\[ARI = \frac{RI-E[RI]}{\max(RI)-E[RI]}\] where \(E[RI]\) is the expected Rand Index (Hubert and Arabie 1985).

  • Median Silhouette index. The Silhouette index is defined as

\[ s(i)=\frac{b(i)-a(i)}{\max(a(i),b(i))}\]

where \(a(i)\) is the average dissimilarity of the \(i\)th cell to all other cells in the same cluster, and \(b(i)\) is its average dissimilarity to all cells in the closest other cluster. The range of \(s(i)\) is [−1, 1]: a value near 1 indicates the cell is well clustered with an appropriate label, while a value near −1 indicates it is completely misclassified. \(s(i) = 0\) indicates the cell lies on the border between two nearest (overlapping) clusters.
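Given ground-truth and predicted label vectors, these metrics can be sketched with NumPy and scikit-learn; the two entropy helpers below are illustrative implementations of the formulas above, while ARI and the silhouette index come directly from `sklearn.metrics`:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, silhouette_score

def entropy_of_accuracy(true_labels, pred_labels):
    """H_acc: mean entropy of ground-truth labels inside each predicted cluster."""
    true_labels, pred_labels = np.asarray(true_labels), np.asarray(pred_labels)
    entropies = []
    for c in np.unique(pred_labels):
        # distribution of ground-truth labels within predicted cluster c
        _, counts = np.unique(true_labels[pred_labels == c], return_counts=True)
        p = counts / counts.sum()
        entropies.append(-(p * np.log(p)).sum())
    return float(np.mean(entropies))

def entropy_of_purity(true_labels, pred_labels):
    """H_pur: the same computation with the two labelings swapped."""
    return entropy_of_accuracy(pred_labels, true_labels)
```

For the remaining two metrics, `adjusted_rand_score(true_labels, pred_labels)` gives the ARI, and `silhouette_score(X, pred_labels)` gives the mean silhouette over cells for an expression (or embedding) matrix `X`.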

A good imputation method should support downstream (clustering) analyses without introducing artifacts or false signals.

4.1.2 Batch effect correction

When evaluating the performance of a batch correction method, we need to consider how well it mixes the shared cell types between different batches while still identifying batch-specific cells. The existing metrics can be classified into cluster-level and cell-level metrics. Cluster-level metrics are those used for evaluating clustering performance and include the adjusted Rand index (ARI), normalized mutual information (NMI), and silhouette coefficients. They are easy to compute but do not measure the local mixing of cells from different batches. This drawback is addressed by the cell-level metrics, which include the k-nearest neighbor batch-effect test (kBET), the local inverse Simpson’s index (LISI), and classifier-based metrics. Because the cluster-level metrics are discussed in detail in Section 4.1.3, we focus on the cell-level metrics in this section.

Entropy of mixing. This metric evaluates the mixing of cells from different batches in the neighborhood of each cell (Haghverdi et al. 2018). It first randomly samples a set of cells and then, for each sampled cell, calculates the regional entropy of mixing as

\[E = -\sum_{i=1}^{C}p_{i}\log{(p_{i})}\]

where \(C\) is the number of batches and \(p_{i}\) is the proportion of cells from batch \(i\) among the \(N\) nearest cells (e.g. \(N = 100\)). The total entropy is the sum of the regional entropies. The computation is repeated \(K\) times to obtain an empirical distribution of the entropy of mixing.
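A sketch of this procedure, with neighborhoods found via `sklearn.neighbors.NearestNeighbors` (function and parameter names are illustrative, and the entropy carries the standard minus sign):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def entropy_of_mixing(embedding, batch_labels, n_neighbors=100, n_samples=100, seed=0):
    """Total regional entropy of batch mixing around randomly sampled cells."""
    embedding = np.asarray(embedding)
    batch_labels = np.asarray(batch_labels)
    rng = np.random.default_rng(seed)

    nn = NearestNeighbors(n_neighbors=min(n_neighbors, len(embedding))).fit(embedding)
    sampled = rng.choice(len(embedding), size=min(n_samples, len(embedding)), replace=False)
    _, neighbors = nn.kneighbors(embedding[sampled])

    total = 0.0
    for row in neighbors:
        # batch proportions within this cell's neighborhood
        _, counts = np.unique(batch_labels[row], return_counts=True)
        p = counts / counts.sum()
        total += -(p * np.log(p)).sum()
    return total
```

Repeating the call with different seeds yields the empirical distribution of the entropy described above; higher values indicate better batch mixing.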

Maximum mean discrepancy (MMD) is a non-parametric distance between distributions based on a reproducing kernel Hilbert space (RKHS) (Borgwardt et al. 2006). Specifically, MMD measures the distance between two distributions \(p\) and \(q\) through their mean embeddings \(\mu_{p}\) and \(\mu_{q}\) in an RKHS \(F\),

\[MMD(F,p,q)=\|\mu_{p}-\mu_{q}\|_{F}\]

For a characteristic kernel, the MMD vanishes only if the two distributions are the same, so a small MMD estimated from finite samples \(x_{k}\) and \(y_{k}\) indicates that the corrected batches are well mixed.
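As an illustration, a biased squared-MMD estimate with a Gaussian (RBF) kernel, one common characteristic kernel choice (the kernel bandwidth here is an arbitrary assumption), can be written as:

```python
import numpy as np

def gaussian_mmd2(X, Y, sigma=1.0):
    """Biased squared-MMD estimate between samples X and Y with an RBF kernel."""
    def k(A, B):
        # pairwise squared Euclidean distances, then RBF kernel values
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()
```

Applied to batch correction, `X` and `Y` would be the embeddings of cells of the same type from two batches; values near zero indicate the batch distributions coincide.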

k-Nearest neighbor batch-effect test (kBET). kBET assesses batch mixing by comparing the batch-specific distribution within the \(k\)-nearest neighbors (kNNs) of a cell with the global distribution of batches (Buttner et al. 2019). It uses a \(\chi^2\)-based test on random neighborhoods of fixed size to determine whether they are well mixed. Consider a dataset of \(N\) cells from \(L\) batches, with \(N_{l}\) denoting the number of cells in batch \(l\). Under the null hypothesis that cells are ‘well mixed,’ that is, in the absence of a batch effect, the test statistic is

\[a_{n}^{k} = \sum_{l=1}^{L}\frac{(N_{nl}^{k}-k f_{l})^{2}}{k f_{l}} \sim \chi^{2}_{L-1}\]

where \(N_{nl}^{k}\) is the number of cells from batch \(l\) among the \(k\)-nearest neighbors of cell \(n\), \(f_{l}=\frac{N_{l}}{N}\) is the global fraction of cells in batch \(l\), and \(\chi_{L-1}^2\) denotes the \(\chi^2\) distribution with \(L-1\) degrees of freedom. The average rejection rate of the \(\chi^2\) test over all cells is used to quantify the performance of a batch correction method.
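Given precomputed neighbor batch labels, the kBET rejection rate can be sketched as follows (the input layout and function name are assumptions for illustration; `scipy.stats.chi2` supplies the null distribution):

```python
import numpy as np
from scipy.stats import chi2

def kbet_rejection_rate(neighbor_batches, global_fractions, alpha=0.05):
    """Fraction of cells whose neighborhood deviates from the global batch mix.

    neighbor_batches: (cells x k) batch label of each cell's k nearest neighbors
    global_fractions: dict mapping batch label l -> global fraction f_l
    """
    batches = sorted(global_fractions)
    k = neighbor_batches.shape[1]
    dof = len(batches) - 1
    rejected = 0
    for row in neighbor_batches:
        # chi-squared statistic a_n^k against the expected counts k * f_l
        stat = sum((np.sum(row == l) - k * global_fractions[l]) ** 2
                   / (k * global_fractions[l]) for l in batches)
        if chi2.sf(stat, dof) < alpha:
            rejected += 1
    return rejected / len(neighbor_batches)
```

A rejection rate near zero indicates neighborhoods that mirror the global batch composition, i.e. successful batch correction.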

Local inverse Simpson’s index (LISI). Like kBET, LISI compares the local batch mixing with the global batch distribution. However, unlike kBET, which is agnostic to cell types, LISI requires good mixing of cells from the same cell type but not of those from different types (Korsunsky et al. 2019). LISI evaluates cell-type-specific mixing using an inverse Simpson’s index in a local neighborhood of each cell, built from Gaussian kernel-based distributions of local neighborhoods that are sensitive to local diversity. For cell \(n\), the inverse Simpson’s index over all batches in its \(k\)-nearest neighbors is \(\frac{1}{\lambda(n)}=\frac{1}{\sum_{l=1}^{L}(p(l))^{2}}\), where \(p(l)\) denotes the proportion of batch \(l\) in the \(k\)-nearest neighbors. The score reports the effective number of batches in the \(k\)-nearest neighbors of cell \(n\). The same index computed over cell types instead of batches evaluates the diversity of cell types in the neighborhood; in the ideal case this cell-type LISI score is 1, reflecting a clean separation of unique cell types.
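The core of LISI, omitting the Gaussian kernel weighting of neighbors that Korsunsky et al. apply (a simplification here, using uniform neighbor weights), is the inverse Simpson's index over a neighborhood's labels:

```python
import numpy as np

def inverse_simpson(neighbor_labels):
    """Effective number of distinct labels (batches or cell types) in a neighborhood."""
    _, counts = np.unique(neighbor_labels, return_counts=True)
    p = counts / counts.sum()  # label proportions p(l)
    return 1.0 / (p ** 2).sum()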

Classifier-based. Although LISI accounts for the cell-type composition of different batches, it is hard to summarize all single-cell-level LISI scores into a simple statistic for comparison across methods (Eraslan et al. 2019). The classifier-based approach addresses this issue by applying two local classifiers to each single cell. The first classifier labels every cell as positive or negative: a cell \(n\) is positive if at least 50% of its \(k\)-nearest neighbors (KNN) share its cell-type label, and negative otherwise. The positive cells are further classified into true and false positives, where true positives are those surrounded by appropriate proportions of cells from the \(L\) batches. In other words, if we sample \(k\) cells from this cell-type cluster, the expected number of cells from batch \(l\) is \(k f_{l}\), where \(f_{l}\) is the global fraction of cells in batch \(l\). A positive cell is a true positive when, for each batch, the observed number of cells among its \(k\) neighbors is within 3 standard deviations of the expected number. The proportions of positive cells and of true positive cells are used as summary metrics for the overall performance of batch effect removal; the higher the proportions, the better the algorithm.
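A sketch of the two classifiers follows; the binomial standard deviation used for the 3-SD rule is our assumption, since the description above does not specify how the SD is computed, and the input layout is illustrative:

```python
import numpy as np

def classify_cells(neighbor_types, neighbor_batches, cell_types, global_fractions):
    """Proportions of positive and true-positive cells under the two local classifiers.

    neighbor_types:   (cells x k) cell-type labels of each cell's k neighbors
    neighbor_batches: (cells x k) batch labels of the same neighbors
    cell_types:       (cells,) cell-type label of each cell
    global_fractions: dict mapping batch label l -> global fraction f_l
    """
    k = neighbor_types.shape[1]
    positive, true_positive = 0, 0
    for types, batches, own in zip(neighbor_types, neighbor_batches, cell_types):
        if np.mean(types == own) < 0.5:
            continue  # negative cell: neighborhood dominated by other cell types
        positive += 1
        ok = True
        for l, f in global_fractions.items():
            expected = k * f
            sd = np.sqrt(k * f * (1 - f))  # binomial SD (our assumption)
            if abs(np.sum(batches == l) - expected) > 3 * sd:
                ok = False
        true_positive += ok
    n = len(cell_types)
    return positive / n, true_positive / n
```

Both returned proportions lie in [0, 1], and higher values indicate better batch effect removal.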

4.1.3 Clustering

Evaluating the performance of clustering algorithms is not as straightforward as counting errors in supervised learning. In general, a clustering evaluation metric should not only account for the absolute number of correctly labeled cells but also consider whether the clustering defines good similarity within, and separation between, groups in the dataset compared to the ground truth. When ground truth is not available, evaluation must rely on the model itself, e.g. on clustering distance or dispersion. Metrics such as the adjusted Rand index (ARI) and the silhouette index discussed in Section 4.1.1 can also be employed here to measure the agreement between the predicted and ground-truth assignments.

Normalized mutual information (NMI). Mutual information (MI) (Strehl and Ghosh 2002) is a measure of the mutual dependency between two cluster assignments \(U\) and \(V\). It quantifies the amount of information about one assignment gained by observing the other. For \(N\) samples, the entropies of the cluster assignments \(U\) and \(V\) are

\[H(U)=-\sum_{i=1}^{\vert U \vert}P_{U}(i)\log{(P_{U}(i))}, \quad H(V)=-\sum_{j=1}^{\vert V \vert}P_{V}(j)\log{(P_{V}(j))}\]

where \(P_{U}(i)=\frac{\vert U_{i} \vert}{N}\) and \(P_{V}(j)=\frac{\vert V_{j} \vert}{N}\). The joint probability is defined as \(P_{UV}(i,j)=\frac{\vert U_{i}\cap V_{j} \vert}{N}\). Then, the mutual information of \(U\) and \(V\) is defined as

\[MI(U,V)=\sum_{i=1}^{\vert U \vert}\sum_{j=1}^{\vert V \vert}P_{UV}(i,j)\log{\frac{P_{UV}(i,j)}{P_{U}(i)P_{V}(j)}}\]

The NMI normalizes the MI score to the range [0, 1]. For example, the NMI with arithmetic-mean normalization is defined as (Cover 1999)

\[NMI(U,V)=\frac{2 \times MI(U,V)}{[H(U) + H(V)]}\]
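The formulas above can be sketched directly and checked against scikit-learn, whose `normalized_mutual_info_score` with the default `average_method='arithmetic'` matches this normalization:

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def nmi(u, v):
    """NMI(U, V) = 2 * MI / (H(U) + H(V)), following the definitions above."""
    u, v = np.asarray(u), np.asarray(v)
    n = len(u)

    def entropy(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / n
        return -(p * np.log(p)).sum()

    mi = 0.0
    for i in np.unique(u):
        for j in np.unique(v):
            p_ij = np.sum((u == i) & (v == j)) / n  # joint probability P_UV(i, j)
            if p_ij > 0:
                p_i, p_j = np.sum(u == i) / n, np.sum(v == j) / n
                mi += p_ij * np.log(p_ij / (p_i * p_j))
    return 2 * mi / (entropy(u) + entropy(v))
```

Identical partitions, even under a relabeling of clusters, give an NMI of 1.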

Homogeneity, completeness, and V-measure. The homogeneity score (HS) measures the extent to which each cluster contains only samples that belong to a single cell type, or \(HS=1-H(P(U\vert V))/H(P(U))\), where \(H(\cdot)\) is the entropy, \(U\) is the ground-truth assignment, and \(V\) is the predicted assignment. HS ranges from 0 to 1, where 1 indicates perfectly homogeneous labeling. Similarly, the completeness score (CS) is defined as \(CS=1-H(P(V \vert U))/H(P(V))\); it ranges from 0 to 1, where 1 indicates that all members of a ground-truth label are assigned to a single cluster.

The V-measure (Rosenberg and Hirschberg 2007) is the weighted harmonic mean of \(HS\) and \(CS\), defined as \(V_{\beta}=\frac{(1+\beta)HS \times CS}{\beta HS+CS}\), where \(\beta\) indicates the weight of \(HS\). The V-measure is symmetric, i.e. switching the true and predicted cluster labels does not change its value.
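All three scores are available in scikit-learn; a small hypothetical example illustrates the trade-off between homogeneity and completeness:

```python
from sklearn.metrics import homogeneity_completeness_v_measure

# hypothetical ground-truth cell types vs. predicted clusters:
# one true class is split across two predicted clusters
true_labels = [0, 0, 1, 1]
pred_labels = [0, 0, 1, 2]

hs, cs, v = homogeneity_completeness_v_measure(true_labels, pred_labels)
# each predicted cluster is pure, so HS is perfect,
# but class 1 is fragmented, so CS (and hence the V-measure) drops below 1
```

Over-clustering therefore inflates HS at the expense of CS, which is exactly the imbalance the harmonic mean in the V-measure penalizes.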