Two Cell Types • SplitClusterTest

library(SplitClusterTest)

The simulated data simdata_2ct with two cell types is generated based on the real single-cell data DuoClustering2018::sce_full_Zhengmix4eq() with the help of scDesign3. For more details about generating synthetic data, please check our paper.

The structure of simdata_2ct is as follows:

str(simdata_2ct)
#> List of 4
#>  $ simu_sce:Formal class 'SingleCellExperiment' [package "SingleCellExperiment"] with 9 slots
#>   .. ..@ int_elementMetadata:Formal class 'DFrame' [package "S4Vectors"] with 6 slots
#>   .. .. .. ..@ rownames       : NULL
#>   .. .. .. ..@ nrows          : int 198
#>   .. .. .. ..@ listData       :List of 1
#>   .. .. .. .. ..$ rowPairs:Formal class 'DFrame' [package "S4Vectors"] with 6 slots
#>   .. .. .. .. .. .. ..@ rownames       : NULL
#>   .. .. .. .. .. .. ..@ nrows          : int 198
#>   .. .. .. .. .. .. ..@ listData       : Named list()
#>   .. .. .. .. .. .. ..@ elementType    : chr "ANY"
#>   .. .. .. .. .. .. ..@ elementMetadata: NULL
#>   .. .. .. .. .. .. ..@ metadata       : list()
#>   .. .. .. ..@ elementType    : chr "ANY"
#>   .. .. .. ..@ elementMetadata: NULL
#>   .. .. .. ..@ metadata       : list()
#>   .. ..@ int_colData        :Formal class 'DFrame' [package "S4Vectors"] with 6 slots
#>   .. .. .. ..@ rownames       : NULL
#>   .. .. .. ..@ nrows          : int 998
#>   .. .. .. ..@ listData       :List of 3
#>   .. .. .. .. ..$ reducedDims:Formal class 'DFrame' [package "S4Vectors"] with 6 slots
#>   .. .. .. .. .. .. ..@ rownames       : NULL
#>   .. .. .. .. .. .. ..@ nrows          : int 998
#>   .. .. .. .. .. .. ..@ listData       : Named list()
#>   .. .. .. .. .. .. ..@ elementType    : chr "ANY"
#>   .. .. .. .. .. .. ..@ elementMetadata: NULL
#>   .. .. .. .. .. .. ..@ metadata       : list()
#>   .. .. .. .. ..$ altExps    :Formal class 'DFrame' [package "S4Vectors"] with 6 slots
#>   .. .. .. .. .. .. ..@ rownames       : NULL
#>   .. .. .. .. .. .. ..@ nrows          : int 998
#>   .. .. .. .. .. .. ..@ listData       : Named list()
#>   .. .. .. .. .. .. ..@ elementType    : chr "ANY"
#>   .. .. .. .. .. .. ..@ elementMetadata: NULL
#>   .. .. .. .. .. .. ..@ metadata       : list()
#>   .. .. .. .. ..$ colPairs   :Formal class 'DFrame' [package "S4Vectors"] with 6 slots
#>   .. .. .. .. .. .. ..@ rownames       : NULL
#>   .. .. .. .. .. .. ..@ nrows          : int 998
#>   .. .. .. .. .. .. ..@ listData       : Named list()
#>   .. .. .. .. .. .. ..@ elementType    : chr "ANY"
#>   .. .. .. .. .. .. ..@ elementMetadata: NULL
#>   .. .. .. .. .. .. ..@ metadata       : list()
#>   .. .. .. ..@ elementType    : chr "ANY"
#>   .. .. .. ..@ elementMetadata: NULL
#>   .. .. .. ..@ metadata       : list()
#>   .. ..@ int_metadata       :List of 1
#>   .. .. ..$ version:Classes 'package_version', 'numeric_version'  hidden list of 1
#>   .. .. .. ..$ : int [1:3] 1 24 0
#>   .. ..@ rowRanges          :Formal class 'CompressedGRangesList' [package "GenomicRanges"] with 5 slots
#>   .. .. .. ..@ unlistData     :Formal class 'GRanges' [package "GenomicRanges"] with 7 slots
#>   .. .. .. .. .. ..@ seqnames       :Formal class 'Rle' [package "S4Vectors"] with 4 slots
#>   .. .. .. .. .. .. .. ..@ values         : Factor w/ 0 levels: 
#>   .. .. .. .. .. .. .. ..@ lengths        : int(0) 
#>   .. .. .. .. .. .. .. ..@ elementMetadata: NULL
#>   .. .. .. .. .. .. .. ..@ metadata       : list()
#>   .. .. .. .. .. ..@ ranges         :Formal class 'IRanges' [package "IRanges"] with 6 slots
#>   .. .. .. .. .. .. .. ..@ start          : int(0) 
#>   .. .. .. .. .. .. .. ..@ width          : int(0) 
#>   .. .. .. .. .. .. .. ..@ NAMES          : NULL
#>   .. .. .. .. .. .. .. ..@ elementType    : chr "ANY"
#>   .. .. .. .. .. .. .. ..@ elementMetadata: NULL
#>   .. .. .. .. .. .. .. ..@ metadata       : list()
#>   .. .. .. .. .. ..@ strand         :Formal class 'Rle' [package "S4Vectors"] with 4 slots
#>   .. .. .. .. .. .. .. ..@ values         : Factor w/ 3 levels "+","-","*": 
#>   .. .. .. .. .. .. .. ..@ lengths        : int(0) 
#>   .. .. .. .. .. .. .. ..@ elementMetadata: NULL
#>   .. .. .. .. .. .. .. ..@ metadata       : list()
#>   .. .. .. .. .. ..@ seqinfo        :Formal class 'Seqinfo' [package "GenomeInfoDb"] with 4 slots
#>   .. .. .. .. .. .. .. ..@ seqnames   : chr(0) 
#>   .. .. .. .. .. .. .. ..@ seqlengths : int(0) 
#>   .. .. .. .. .. .. .. ..@ is_circular: logi(0) 
#>   .. .. .. .. .. .. .. ..@ genome     : chr(0) 
#>   .. .. .. .. .. ..@ elementMetadata:Formal class 'DFrame' [package "S4Vectors"] with 6 slots
#>   .. .. .. .. .. .. .. ..@ rownames       : NULL
#>   .. .. .. .. .. .. .. ..@ nrows          : int 0
#>   .. .. .. .. .. .. .. ..@ listData       : Named list()
#>   .. .. .. .. .. .. .. ..@ elementType    : chr "ANY"
#>   .. .. .. .. .. .. .. ..@ elementMetadata: NULL
#>   .. .. .. .. .. .. .. ..@ metadata       : list()
#>   .. .. .. .. .. ..@ elementType    : chr "ANY"
#>   .. .. .. .. .. ..@ metadata       : list()
#>   .. .. .. ..@ elementMetadata:Formal class 'DFrame' [package "S4Vectors"] with 6 slots
#>   .. .. .. .. .. ..@ rownames       : NULL
#>   .. .. .. .. .. ..@ nrows          : int 198
#>   .. .. .. .. .. ..@ listData       : Named list()
#>   .. .. .. .. .. ..@ elementType    : chr "ANY"
#>   .. .. .. .. .. ..@ elementMetadata: NULL
#>   .. .. .. .. .. ..@ metadata       : list()
#>   .. .. .. ..@ elementType    : chr "GRanges"
#>   .. .. .. ..@ metadata       : list()
#>   .. .. .. ..@ partitioning   :Formal class 'PartitioningByEnd' [package "IRanges"] with 5 slots
#>   .. .. .. .. .. ..@ end            : int [1:198] 0 0 0 0 0 0 0 0 0 0 ...
#>   .. .. .. .. .. ..@ NAMES          : chr [1:198] "ENSG00000116251" "ENSG00000142676" "ENSG00000142669" "ENSG00000169442" ...
#>   .. .. .. .. .. ..@ elementType    : chr "ANY"
#>   .. .. .. .. .. ..@ elementMetadata: NULL
#>   .. .. .. .. .. ..@ metadata       : list()
#>   .. ..@ colData            :Formal class 'DFrame' [package "S4Vectors"] with 6 slots
#>   .. .. .. ..@ rownames       : chr [1:998] "naive.cytotoxic10013" "naive.cytotoxic5827" "naive.cytotoxic1319" "naive.cytotoxic4199" ...
#>   .. .. .. ..@ nrows          : int 998
#>   .. .. .. ..@ listData       :List of 1
#>   .. .. .. .. ..$ cell_type: chr [1:998] "celltype0" "celltype0" "celltype0" "celltype0" ...
#>   .. .. .. ..@ elementType    : chr "ANY"
#>   .. .. .. ..@ elementMetadata: NULL
#>   .. .. .. ..@ metadata       : list()
#>   .. ..@ assays             :Formal class 'SimpleAssays' [package "SummarizedExperiment"] with 1 slot
#>   .. .. .. ..@ data:Formal class 'SimpleList' [package "S4Vectors"] with 4 slots
#>   .. .. .. .. .. ..@ listData       :List of 2
#>   .. .. .. .. .. .. ..$ counts   : num [1:198, 1:998] 3 70 2 3 1 7 0 6 0 1 ...
#>   .. .. .. .. .. .. .. ..- attr(*, "dimnames")=List of 2
#>   .. .. .. .. .. .. .. .. ..$ : chr [1:198] "ENSG00000116251" "ENSG00000142676" "ENSG00000142669" "ENSG00000169442" ...
#>   .. .. .. .. .. .. .. .. ..$ : chr [1:998] "naive.cytotoxic10013" "naive.cytotoxic5827" "naive.cytotoxic1319" "naive.cytotoxic4199" ...
#>   .. .. .. .. .. .. ..$ logcounts: num [1:198, 1:998] 1.386 4.263 1.099 1.386 0.693 ...
#>   .. .. .. .. .. .. .. ..- attr(*, "dimnames")=List of 2
#>   .. .. .. .. .. .. .. .. ..$ : chr [1:198] "ENSG00000116251" "ENSG00000142676" "ENSG00000142669" "ENSG00000169442" ...
#>   .. .. .. .. .. .. .. .. ..$ : chr [1:998] "naive.cytotoxic10013" "naive.cytotoxic5827" "naive.cytotoxic1319" "naive.cytotoxic4199" ...
#>   .. .. .. .. .. ..@ elementType    : chr "ANY"
#>   .. .. .. .. .. ..@ elementMetadata: NULL
#>   .. .. .. .. .. ..@ metadata       : list()
#>   .. ..@ NAMES              : NULL
#>   .. ..@ elementMetadata    :Formal class 'DFrame' [package "S4Vectors"] with 6 slots
#>   .. .. .. ..@ rownames       : NULL
#>   .. .. .. ..@ nrows          : int 198
#>   .. .. .. ..@ listData       : Named list()
#>   .. .. .. ..@ elementType    : chr "ANY"
#>   .. .. .. ..@ elementMetadata: NULL
#>   .. .. .. ..@ metadata       : list()
#>   .. ..@ metadata           : list()
#>  $ de_idx  : Named int [1:20] 2 15 16 24 34 35 55 64 71 86 ...
#>   ..- attr(*, "names")= chr [1:20] "ENSG00000142676" "ENSG00000143947" "ENSG00000034510" "ENSG00000144713" ...
#>  $ ct0_idx : int [1:499] 1 2 3 4 5 6 7 8 9 10 ...
#>  $ ct1_idx : int [1:499] 500 501 502 503 504 505 506 507 508 509 ...

The true differentially expressed (DE) genes are

truth = names(simdata_2ct$de_idx)
truth
#>  [1] "ENSG00000142676" "ENSG00000143947" "ENSG00000034510" "ENSG00000144713"
#>  [5] "ENSG00000109475" "ENSG00000145425" "ENSG00000112306" "ENSG00000205542"
#>  [9] "ENSG00000147604" "ENSG00000177600" "ENSG00000166441" "ENSG00000251562"
#> [13] "ENSG00000122026" "ENSG00000167526" "ENSG00000198242" "ENSG00000108298"
#> [17] "ENSG00000115268" "ENSG00000105640" "ENSG00000105372" "ENSG00000083845"

Step 1: Perform Testing-after-Clustering on Each Half via (Multiple) Data Splitting

mss = mds1(simdata_2ct$simu_sce,
     M = 1, # times of data splitting
     params1 = list(normalized_method = "none"), # parameters for the first half
     params2 = list(normalized_method = "none")) # parameters for the second half
#> 
#> ===== Multiple Data Splitting:  1 / 1  =====
#> 
#> ----- data splitting (1st half) -----
#> 
#> ----- data splitting (2nd half) ----

It returns the mirror statistics,

hist(mss[[1]], breaks = 50)
cutoff = calc_tau(mss[[1]], q = 0.05)
abline(v = cutoff, lwd = 2, col = "red")

Histogram of Mirror Statistics (Two Cell Types)

In this step, cutoff is chosen to control the nominal False Discovery Rate (FDR) at q = 0.05.

Note that the above histogram is for the results from the first data splitting mss[[1]]. If with multiple data splitting, we can aggregate the results by calculating the inclusion rate for each gene, and also calculate a cutoff. Then, either for DS or MDS, we can obtain the estimation of the significant DE gene set. We wrap this as the second step.

For multiple data splitting (MDS), we can aggregate results by calculating the inclusion rate for each gene and determining a new cutoff. This allows us to estimate the set of significant DE genes, whether using DS or MDS. We wrap this process as the second step:

Step 2. Calculate Cutoff and Estimate the Selection Set

sel = mds2(mss, q = 0.05)
#> use the tie.method =  fair
sel
#>  [1] "ENSG00000144713" "ENSG00000167526" "ENSG00000115268" "ENSG00000105640"
#>  [5] "ENSG00000108298" "ENSG00000177600" "ENSG00000198242" "ENSG00000112306"
#>  [9] "ENSG00000142676" "ENSG00000145425" "ENSG00000105372" "ENSG00000083845"
#> [13] "ENSG00000166441" "ENSG00000122026" "ENSG00000143947" "ENSG00000205542"
#> [17] "ENSG00000034510" "ENSG00000109475" "ENSG00000251562" "ENSG00000147604"
#> [21] "ENSG00000138326" "ENSG00000102879" "ENSG00000116251" "ENSG00000185201"
#> [25] "ENSG00000140319" "ENSG00000148908"

we can also evaluate the accuracy

calc_acc(sel, truth)
#>       fdr     power        f1 
#> 0.2307692 1.0000000 0.8695652

A Whole Step

As a whole, the above two steps mds1 and mds2 have been wrapped into a single function mds, and you can simply run the following command to obtain the selection set.

sel = mds(simdata_2ct$simu_sce,
     q = 0.05, 
     M = 1, 
     params1 = list(normalized_method = "none"), 
     params2 = list(normalized_method = "none"))
#> 
#> ===== Multiple Data Splitting:  1 / 1  =====
#> 
#> ----- data splitting (1st half) -----
#> 
#> ----- data splitting (2nd half) ----
#> use the tie.method =  fair

As a comparison, we also check the naive double-dipping method.

sel.dd = dd(simdata_2ct$simu_sce, params = list(normalized_method = "none"), q = 0.05, test.use = "t")

and the accuracy is

calc_acc(sel.dd, truth)
#>   fdr power    f1 
#>     0     1     1

Normalization

In practice, to account for the library size, a normalization method is needed. One suitable approach is the SCT transform.

sel.sct = mds(simdata_2ct$simu_sce,
     q = 0.05, 
     M = 1, 
     params1 = list(normalized_method = "sct"), 
     params2 = list(normalized_method = "sct"))
#> 
#> ===== Multiple Data Splitting:  1 / 1  =====
#> 
#> ----- data splitting (1st half) -----
#> 
#> ----- data splitting (2nd half) ----
#> use the tie.method =  fair
sel.dd.sct = dd(simdata_2ct$simu_sce, params = list(normalized_method = "sct"))
#> Warning: The following arguments are not used: norm.method
sapply(list(ds = sel.sct, dd = sel.dd.sct), function(x) calc_acc(x, names(simdata_2ct$de_idx)))
#>               ds        dd
#> fdr   0.09090909 0.8245614
#> power 1.00000000 1.0000000
#> f1    0.95238095 0.2985075

Whitening

We recommend the whitening strategy for weak signal and (highly) correlated data.

sel = mds(simdata_2ct$simu_sce,
     q = 0.05, 
     M = 1, 
     params1 = list(normalized_method = "sct", pca.whiten = TRUE), 
     params2 = list(normalized_method = "sct", pca.whiten = TRUE))
#> 
#> ===== Multiple Data Splitting:  1 / 1  =====
#> 
#> ----- data splitting (1st half) -----
#> 
#> ----- data splitting (2nd half) ----
#> use the tie.method =  fair
sel.dd = dd(simdata_2ct$simu_sce, params = list(normalized_method = "sct", pca.whiten = TRUE))
#> Warning: The following arguments are not used: norm.method
sapply(list(ds = sel, dd = sel.dd), function(x) calc_acc(x, names(simdata_2ct$de_idx)))
#>               ds        dd
#> fdr   0.04761905 0.8290598
#> power 1.00000000 1.0000000
#> f1    0.97560976 0.2919708

Session Info

sessionInfo()
#> R version 4.1.3 (2022-03-10)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.5 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so
#> 
#> locale:
#>  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
#>  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] SplitClusterTest_0.1.0
#> 
#> loaded via a namespace (and not attached):
#>   [1] spatstat.univar_3.1-1       spam_2.11-0                
#>   [3] systemfonts_1.1.0           plyr_1.8.9                 
#>   [5] igraph_2.1.1                lazyeval_0.2.2             
#>   [7] sp_2.1-4                    splines_4.1.3              
#>   [9] listenv_0.9.1               scattermore_1.2            
#>  [11] GenomeInfoDb_1.30.1         ggplot2_3.5.1              
#>  [13] digest_0.6.37               htmltools_0.5.8.1          
#>  [15] fansi_1.0.6                 magrittr_2.0.3             
#>  [17] tensor_1.5                  cluster_2.1.2              
#>  [19] ROCR_1.0-11                 globals_0.16.3             
#>  [21] matrixStats_1.4.1           pkgdown_2.1.1              
#>  [23] spatstat.sparse_3.1-0       colorspace_2.1-1           
#>  [25] ggrepel_0.9.6               textshaping_0.4.0          
#>  [27] xfun_0.49                   dplyr_1.1.4                
#>  [29] RCurl_1.98-1.16             jsonlite_1.8.9             
#>  [31] progressr_0.15.0            spatstat.data_3.1-2        
#>  [33] survival_3.2-13             zoo_1.8-12                 
#>  [35] glue_1.8.0                  polyclip_1.10-7            
#>  [37] gtable_0.3.6                zlibbioc_1.40.0            
#>  [39] XVector_0.34.0              leiden_0.4.3.1             
#>  [41] DelayedArray_0.20.0         future.apply_1.11.3        
#>  [43] SingleCellExperiment_1.16.0 BiocGenerics_0.40.0        
#>  [45] abind_1.4-8                 scales_1.3.0               
#>  [47] spatstat.random_3.3-2       miniUI_0.1.1.1             
#>  [49] Rcpp_1.0.13-1               viridisLite_0.4.2          
#>  [51] xtable_1.8-4                reticulate_1.39.0          
#>  [53] dotCall64_1.2               stats4_4.1.3               
#>  [55] htmlwidgets_1.6.4           httr_1.4.7                 
#>  [57] RColorBrewer_1.1-3          Seurat_4.4.0               
#>  [59] ica_1.0-3                   pkgconfig_2.0.3            
#>  [61] farver_2.1.2                sass_0.4.9                 
#>  [63] uwot_0.2.2                  deldir_2.0-4               
#>  [65] utf8_1.2.4                  tidyselect_1.2.1           
#>  [67] rlang_1.1.4                 reshape2_1.4.4             
#>  [69] later_1.3.2                 munsell_0.5.1              
#>  [71] tools_4.1.3                 cachem_1.1.0               
#>  [73] cli_3.6.3                   generics_0.1.3             
#>  [75] ggridges_0.5.6              evaluate_1.0.1             
#>  [77] stringr_1.5.1               fastmap_1.2.0              
#>  [79] yaml_2.3.10                 ragg_1.3.3                 
#>  [81] goftest_1.2-3               knitr_1.48                 
#>  [83] fs_1.6.5                    fitdistrplus_1.2-1         
#>  [85] purrr_1.0.2                 RANN_2.6.2                 
#>  [87] pbapply_1.7-2               future_1.34.0              
#>  [89] nlme_3.1-155                mime_0.12                  
#>  [91] compiler_4.1.3              plotly_4.10.4              
#>  [93] png_0.1-8                   spatstat.utils_3.1-1       
#>  [95] tibble_3.2.1                bslib_0.8.0                
#>  [97] stringi_1.8.4               highr_0.11                 
#>  [99] desc_1.4.3                  lattice_0.20-45            
#> [101] Matrix_1.6-5                vctrs_0.6.5                
#> [103] pillar_1.9.0                lifecycle_1.0.4            
#> [105] spatstat.geom_3.3-3         lmtest_0.9-40              
#> [107] jquerylib_0.1.4             RcppAnnoy_0.0.22           
#> [109] data.table_1.16.2           cowplot_1.1.3              
#> [111] bitops_1.0-9                irlba_2.3.5.1              
#> [113] httpuv_1.6.15               patchwork_1.3.0            
#> [115] GenomicRanges_1.46.1        R6_2.5.1                   
#> [117] promises_1.3.0              KernSmooth_2.23-20         
#> [119] gridExtra_2.3               IRanges_2.28.0             
#> [121] parallelly_1.38.0           codetools_0.2-18           
#> [123] MASS_7.3-55                 SummarizedExperiment_1.24.0
#> [125] SeuratObject_5.0.2          sctransform_0.4.1          
#> [127] S4Vectors_0.32.4            GenomeInfoDbData_1.2.7     
#> [129] parallel_4.1.3              grid_4.1.3                 
#> [131] tidyr_1.3.1                 rmarkdown_2.29             
#> [133] MatrixGenerics_1.6.0        Rtsne_0.17                 
#> [135] spatstat.explore_3.3-3      Biobase_2.54.0             
#> [137] shiny_1.9.1