The simulated data simdata_2ct
with two cell types is
generated based on the real single-cell data
DuoClustering2018::sce_full_Zhengmix4eq()
with the help of
scDesign3
. For more details about generating synthetic
data, please check our paper.
The structure of simdata_2ct
is as follows:
str(simdata_2ct)
#> List of 4
#> $ simu_sce:Formal class 'SingleCellExperiment' [package "SingleCellExperiment"] with 9 slots
#> .. ..@ int_elementMetadata:Formal class 'DFrame' [package "S4Vectors"] with 6 slots
#> .. .. .. ..@ rownames : NULL
#> .. .. .. ..@ nrows : int 198
#> .. .. .. ..@ listData :List of 1
#> .. .. .. .. ..$ rowPairs:Formal class 'DFrame' [package "S4Vectors"] with 6 slots
#> .. .. .. .. .. .. ..@ rownames : NULL
#> .. .. .. .. .. .. ..@ nrows : int 198
#> .. .. .. .. .. .. ..@ listData : Named list()
#> .. .. .. .. .. .. ..@ elementType : chr "ANY"
#> .. .. .. .. .. .. ..@ elementMetadata: NULL
#> .. .. .. .. .. .. ..@ metadata : list()
#> .. .. .. ..@ elementType : chr "ANY"
#> .. .. .. ..@ elementMetadata: NULL
#> .. .. .. ..@ metadata : list()
#> .. ..@ int_colData :Formal class 'DFrame' [package "S4Vectors"] with 6 slots
#> .. .. .. ..@ rownames : NULL
#> .. .. .. ..@ nrows : int 998
#> .. .. .. ..@ listData :List of 3
#> .. .. .. .. ..$ reducedDims:Formal class 'DFrame' [package "S4Vectors"] with 6 slots
#> .. .. .. .. .. .. ..@ rownames : NULL
#> .. .. .. .. .. .. ..@ nrows : int 998
#> .. .. .. .. .. .. ..@ listData : Named list()
#> .. .. .. .. .. .. ..@ elementType : chr "ANY"
#> .. .. .. .. .. .. ..@ elementMetadata: NULL
#> .. .. .. .. .. .. ..@ metadata : list()
#> .. .. .. .. ..$ altExps :Formal class 'DFrame' [package "S4Vectors"] with 6 slots
#> .. .. .. .. .. .. ..@ rownames : NULL
#> .. .. .. .. .. .. ..@ nrows : int 998
#> .. .. .. .. .. .. ..@ listData : Named list()
#> .. .. .. .. .. .. ..@ elementType : chr "ANY"
#> .. .. .. .. .. .. ..@ elementMetadata: NULL
#> .. .. .. .. .. .. ..@ metadata : list()
#> .. .. .. .. ..$ colPairs :Formal class 'DFrame' [package "S4Vectors"] with 6 slots
#> .. .. .. .. .. .. ..@ rownames : NULL
#> .. .. .. .. .. .. ..@ nrows : int 998
#> .. .. .. .. .. .. ..@ listData : Named list()
#> .. .. .. .. .. .. ..@ elementType : chr "ANY"
#> .. .. .. .. .. .. ..@ elementMetadata: NULL
#> .. .. .. .. .. .. ..@ metadata : list()
#> .. .. .. ..@ elementType : chr "ANY"
#> .. .. .. ..@ elementMetadata: NULL
#> .. .. .. ..@ metadata : list()
#> .. ..@ int_metadata :List of 1
#> .. .. ..$ version:Classes 'package_version', 'numeric_version' hidden list of 1
#> .. .. .. ..$ : int [1:3] 1 24 0
#> .. ..@ rowRanges :Formal class 'CompressedGRangesList' [package "GenomicRanges"] with 5 slots
#> .. .. .. ..@ unlistData :Formal class 'GRanges' [package "GenomicRanges"] with 7 slots
#> .. .. .. .. .. ..@ seqnames :Formal class 'Rle' [package "S4Vectors"] with 4 slots
#> .. .. .. .. .. .. .. ..@ values : Factor w/ 0 levels:
#> .. .. .. .. .. .. .. ..@ lengths : int(0)
#> .. .. .. .. .. .. .. ..@ elementMetadata: NULL
#> .. .. .. .. .. .. .. ..@ metadata : list()
#> .. .. .. .. .. ..@ ranges :Formal class 'IRanges' [package "IRanges"] with 6 slots
#> .. .. .. .. .. .. .. ..@ start : int(0)
#> .. .. .. .. .. .. .. ..@ width : int(0)
#> .. .. .. .. .. .. .. ..@ NAMES : NULL
#> .. .. .. .. .. .. .. ..@ elementType : chr "ANY"
#> .. .. .. .. .. .. .. ..@ elementMetadata: NULL
#> .. .. .. .. .. .. .. ..@ metadata : list()
#> .. .. .. .. .. ..@ strand :Formal class 'Rle' [package "S4Vectors"] with 4 slots
#> .. .. .. .. .. .. .. ..@ values : Factor w/ 3 levels "+","-","*":
#> .. .. .. .. .. .. .. ..@ lengths : int(0)
#> .. .. .. .. .. .. .. ..@ elementMetadata: NULL
#> .. .. .. .. .. .. .. ..@ metadata : list()
#> .. .. .. .. .. ..@ seqinfo :Formal class 'Seqinfo' [package "GenomeInfoDb"] with 4 slots
#> .. .. .. .. .. .. .. ..@ seqnames : chr(0)
#> .. .. .. .. .. .. .. ..@ seqlengths : int(0)
#> .. .. .. .. .. .. .. ..@ is_circular: logi(0)
#> .. .. .. .. .. .. .. ..@ genome : chr(0)
#> .. .. .. .. .. ..@ elementMetadata:Formal class 'DFrame' [package "S4Vectors"] with 6 slots
#> .. .. .. .. .. .. .. ..@ rownames : NULL
#> .. .. .. .. .. .. .. ..@ nrows : int 0
#> .. .. .. .. .. .. .. ..@ listData : Named list()
#> .. .. .. .. .. .. .. ..@ elementType : chr "ANY"
#> .. .. .. .. .. .. .. ..@ elementMetadata: NULL
#> .. .. .. .. .. .. .. ..@ metadata : list()
#> .. .. .. .. .. ..@ elementType : chr "ANY"
#> .. .. .. .. .. ..@ metadata : list()
#> .. .. .. ..@ elementMetadata:Formal class 'DFrame' [package "S4Vectors"] with 6 slots
#> .. .. .. .. .. ..@ rownames : NULL
#> .. .. .. .. .. ..@ nrows : int 198
#> .. .. .. .. .. ..@ listData : Named list()
#> .. .. .. .. .. ..@ elementType : chr "ANY"
#> .. .. .. .. .. ..@ elementMetadata: NULL
#> .. .. .. .. .. ..@ metadata : list()
#> .. .. .. ..@ elementType : chr "GRanges"
#> .. .. .. ..@ metadata : list()
#> .. .. .. ..@ partitioning :Formal class 'PartitioningByEnd' [package "IRanges"] with 5 slots
#> .. .. .. .. .. ..@ end : int [1:198] 0 0 0 0 0 0 0 0 0 0 ...
#> .. .. .. .. .. ..@ NAMES : chr [1:198] "ENSG00000116251" "ENSG00000142676" "ENSG00000142669" "ENSG00000169442" ...
#> .. .. .. .. .. ..@ elementType : chr "ANY"
#> .. .. .. .. .. ..@ elementMetadata: NULL
#> .. .. .. .. .. ..@ metadata : list()
#> .. ..@ colData :Formal class 'DFrame' [package "S4Vectors"] with 6 slots
#> .. .. .. ..@ rownames : chr [1:998] "naive.cytotoxic10013" "naive.cytotoxic5827" "naive.cytotoxic1319" "naive.cytotoxic4199" ...
#> .. .. .. ..@ nrows : int 998
#> .. .. .. ..@ listData :List of 1
#> .. .. .. .. ..$ cell_type: chr [1:998] "celltype0" "celltype0" "celltype0" "celltype0" ...
#> .. .. .. ..@ elementType : chr "ANY"
#> .. .. .. ..@ elementMetadata: NULL
#> .. .. .. ..@ metadata : list()
#> .. ..@ assays :Formal class 'SimpleAssays' [package "SummarizedExperiment"] with 1 slot
#> .. .. .. ..@ data:Formal class 'SimpleList' [package "S4Vectors"] with 4 slots
#> .. .. .. .. .. ..@ listData :List of 2
#> .. .. .. .. .. .. ..$ counts : num [1:198, 1:998] 3 70 2 3 1 7 0 6 0 1 ...
#> .. .. .. .. .. .. .. ..- attr(*, "dimnames")=List of 2
#> .. .. .. .. .. .. .. .. ..$ : chr [1:198] "ENSG00000116251" "ENSG00000142676" "ENSG00000142669" "ENSG00000169442" ...
#> .. .. .. .. .. .. .. .. ..$ : chr [1:998] "naive.cytotoxic10013" "naive.cytotoxic5827" "naive.cytotoxic1319" "naive.cytotoxic4199" ...
#> .. .. .. .. .. .. ..$ logcounts: num [1:198, 1:998] 1.386 4.263 1.099 1.386 0.693 ...
#> .. .. .. .. .. .. .. ..- attr(*, "dimnames")=List of 2
#> .. .. .. .. .. .. .. .. ..$ : chr [1:198] "ENSG00000116251" "ENSG00000142676" "ENSG00000142669" "ENSG00000169442" ...
#> .. .. .. .. .. .. .. .. ..$ : chr [1:998] "naive.cytotoxic10013" "naive.cytotoxic5827" "naive.cytotoxic1319" "naive.cytotoxic4199" ...
#> .. .. .. .. .. ..@ elementType : chr "ANY"
#> .. .. .. .. .. ..@ elementMetadata: NULL
#> .. .. .. .. .. ..@ metadata : list()
#> .. ..@ NAMES : NULL
#> .. ..@ elementMetadata :Formal class 'DFrame' [package "S4Vectors"] with 6 slots
#> .. .. .. ..@ rownames : NULL
#> .. .. .. ..@ nrows : int 198
#> .. .. .. ..@ listData : Named list()
#> .. .. .. ..@ elementType : chr "ANY"
#> .. .. .. ..@ elementMetadata: NULL
#> .. .. .. ..@ metadata : list()
#> .. ..@ metadata : list()
#> $ de_idx : Named int [1:20] 2 15 16 24 34 35 55 64 71 86 ...
#> ..- attr(*, "names")= chr [1:20] "ENSG00000142676" "ENSG00000143947" "ENSG00000034510" "ENSG00000144713" ...
#> $ ct0_idx : int [1:499] 1 2 3 4 5 6 7 8 9 10 ...
#> $ ct1_idx : int [1:499] 500 501 502 503 504 505 506 507 508 509 ...
The true differentially expressed (DE) genes are
truth = names(simdata_2ct$de_idx)
truth
#> [1] "ENSG00000142676" "ENSG00000143947" "ENSG00000034510" "ENSG00000144713"
#> [5] "ENSG00000109475" "ENSG00000145425" "ENSG00000112306" "ENSG00000205542"
#> [9] "ENSG00000147604" "ENSG00000177600" "ENSG00000166441" "ENSG00000251562"
#> [13] "ENSG00000122026" "ENSG00000167526" "ENSG00000198242" "ENSG00000108298"
#> [17] "ENSG00000115268" "ENSG00000105640" "ENSG00000105372" "ENSG00000083845"
Step 1: Perform Testing-after-Clustering on Each Half via (Multiple) Data Splitting
mss = mds1(simdata_2ct$simu_sce,
M = 1, # times of data splitting
params1 = list(normalized_method = "none"), # parameters for the first half
params2 = list(normalized_method = "none")) # parameters for the second half
#>
#> ===== Multiple Data Splitting: 1 / 1 =====
#>
#> ----- data splitting (1st half) -----
#>
#> ----- data splitting (2nd half) ----
It returns the mirror statistics,
hist(mss[[1]], breaks = 50)
cutoff = calc_tau(mss[[1]], q = 0.05)
abline(v = cutoff, lwd = 2, col = "red")
In this step, cutoff
is chosen to control the nominal
False Discovery Rate (FDR) at q = 0.05
.
Note that the above histogram is for the results from the first data
splitting mss[[1]]
. If with multiple data splitting, we can
aggregate the results by calculating the inclusion rate for each gene,
and also calculate a cutoff. Then, either for DS or MDS, we can obtain
the estimation of the significant DE gene set. We wrap this as the
second step.
For multiple data splitting (MDS), we can aggregate results by calculating the inclusion rate for each gene and determining a new cutoff. This allows us to estimate the set of significant DE genes, whether using DS or MDS. We wrap this process as the second step:
Step 2. Calculate Cutoff and Estimate the Selection Set
sel = mds2(mss, q = 0.05)
#> use the tie.method = fair
sel
#> [1] "ENSG00000144713" "ENSG00000167526" "ENSG00000115268" "ENSG00000105640"
#> [5] "ENSG00000108298" "ENSG00000177600" "ENSG00000198242" "ENSG00000112306"
#> [9] "ENSG00000142676" "ENSG00000145425" "ENSG00000105372" "ENSG00000083845"
#> [13] "ENSG00000166441" "ENSG00000122026" "ENSG00000143947" "ENSG00000205542"
#> [17] "ENSG00000034510" "ENSG00000109475" "ENSG00000251562" "ENSG00000147604"
#> [21] "ENSG00000138326" "ENSG00000102879" "ENSG00000116251" "ENSG00000185201"
#> [25] "ENSG00000140319" "ENSG00000148908"
we can also evaluate the accuracy
calc_acc(sel, truth)
#> fdr power f1
#> 0.2307692 1.0000000 0.8695652
A Whole Step
As a whole, the above two steps mds1
and
mds2
have been wrapped into a single function
mds
, and you can simply run the following command to obtain
the selection set.
sel = mds(simdata_2ct$simu_sce,
q = 0.05,
M = 1,
params1 = list(normalized_method = "none"),
params2 = list(normalized_method = "none"))
#>
#> ===== Multiple Data Splitting: 1 / 1 =====
#>
#> ----- data splitting (1st half) -----
#>
#> ----- data splitting (2nd half) ----
#> use the tie.method = fair
As a comparison, we also check the naive double-dipping method.
sel.dd = dd(simdata_2ct$simu_sce, params = list(normalized_method = "none"), q = 0.05, test.use = "t")
and the accuracy is
calc_acc(sel.dd, truth)
#> fdr power f1
#> 0 1 1
Normalization
In practice, to account for the library size, a normalization method is needed. One suitable approach is the SCT transform.
sel.sct = mds(simdata_2ct$simu_sce,
q = 0.05,
M = 1,
params1 = list(normalized_method = "sct"),
params2 = list(normalized_method = "sct"))
#>
#> ===== Multiple Data Splitting: 1 / 1 =====
#>
#> ----- data splitting (1st half) -----
#>
#> ----- data splitting (2nd half) ----
#> use the tie.method = fair
sel.dd.sct = dd(simdata_2ct$simu_sce, params = list(normalized_method = "sct"))
#> Warning: The following arguments are not used: norm.method
sapply(list(ds = sel.sct, dd = sel.dd.sct), function(x) calc_acc(x, names(simdata_2ct$de_idx)))
#> ds dd
#> fdr 0.09090909 0.8245614
#> power 1.00000000 1.0000000
#> f1 0.95238095 0.2985075
Whitening
We recommend the whitening strategy for weak signal and (highly) correlated data.
sel = mds(simdata_2ct$simu_sce,
q = 0.05,
M = 1,
params1 = list(normalized_method = "sct", pca.whiten = TRUE),
params2 = list(normalized_method = "sct", pca.whiten = TRUE))
#>
#> ===== Multiple Data Splitting: 1 / 1 =====
#>
#> ----- data splitting (1st half) -----
#>
#> ----- data splitting (2nd half) ----
#> use the tie.method = fair
sel.dd = dd(simdata_2ct$simu_sce, params = list(normalized_method = "sct", pca.whiten = TRUE))
#> Warning: The following arguments are not used: norm.method
sapply(list(ds = sel, dd = sel.dd), function(x) calc_acc(x, names(simdata_2ct$de_idx)))
#> ds dd
#> fdr 0.04761905 0.8290598
#> power 1.00000000 1.0000000
#> f1 0.97560976 0.2919708
Session Info
sessionInfo()
#> R version 4.1.3 (2022-03-10)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.5 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so
#>
#> locale:
#> [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
#> [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
#> [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
#> [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] SplitClusterTest_0.1.0
#>
#> loaded via a namespace (and not attached):
#> [1] spatstat.univar_3.1-1 spam_2.11-0
#> [3] systemfonts_1.1.0 plyr_1.8.9
#> [5] igraph_2.1.1 lazyeval_0.2.2
#> [7] sp_2.1-4 splines_4.1.3
#> [9] listenv_0.9.1 scattermore_1.2
#> [11] GenomeInfoDb_1.30.1 ggplot2_3.5.1
#> [13] digest_0.6.37 htmltools_0.5.8.1
#> [15] fansi_1.0.6 magrittr_2.0.3
#> [17] tensor_1.5 cluster_2.1.2
#> [19] ROCR_1.0-11 globals_0.16.3
#> [21] matrixStats_1.4.1 pkgdown_2.1.1
#> [23] spatstat.sparse_3.1-0 colorspace_2.1-1
#> [25] ggrepel_0.9.6 textshaping_0.4.0
#> [27] xfun_0.49 dplyr_1.1.4
#> [29] RCurl_1.98-1.16 jsonlite_1.8.9
#> [31] progressr_0.15.0 spatstat.data_3.1-2
#> [33] survival_3.2-13 zoo_1.8-12
#> [35] glue_1.8.0 polyclip_1.10-7
#> [37] gtable_0.3.6 zlibbioc_1.40.0
#> [39] XVector_0.34.0 leiden_0.4.3.1
#> [41] DelayedArray_0.20.0 future.apply_1.11.3
#> [43] SingleCellExperiment_1.16.0 BiocGenerics_0.40.0
#> [45] abind_1.4-8 scales_1.3.0
#> [47] spatstat.random_3.3-2 miniUI_0.1.1.1
#> [49] Rcpp_1.0.13-1 viridisLite_0.4.2
#> [51] xtable_1.8-4 reticulate_1.39.0
#> [53] dotCall64_1.2 stats4_4.1.3
#> [55] htmlwidgets_1.6.4 httr_1.4.7
#> [57] RColorBrewer_1.1-3 Seurat_4.4.0
#> [59] ica_1.0-3 pkgconfig_2.0.3
#> [61] farver_2.1.2 sass_0.4.9
#> [63] uwot_0.2.2 deldir_2.0-4
#> [65] utf8_1.2.4 tidyselect_1.2.1
#> [67] rlang_1.1.4 reshape2_1.4.4
#> [69] later_1.3.2 munsell_0.5.1
#> [71] tools_4.1.3 cachem_1.1.0
#> [73] cli_3.6.3 generics_0.1.3
#> [75] ggridges_0.5.6 evaluate_1.0.1
#> [77] stringr_1.5.1 fastmap_1.2.0
#> [79] yaml_2.3.10 ragg_1.3.3
#> [81] goftest_1.2-3 knitr_1.48
#> [83] fs_1.6.5 fitdistrplus_1.2-1
#> [85] purrr_1.0.2 RANN_2.6.2
#> [87] pbapply_1.7-2 future_1.34.0
#> [89] nlme_3.1-155 mime_0.12
#> [91] compiler_4.1.3 plotly_4.10.4
#> [93] png_0.1-8 spatstat.utils_3.1-1
#> [95] tibble_3.2.1 bslib_0.8.0
#> [97] stringi_1.8.4 highr_0.11
#> [99] desc_1.4.3 lattice_0.20-45
#> [101] Matrix_1.6-5 vctrs_0.6.5
#> [103] pillar_1.9.0 lifecycle_1.0.4
#> [105] spatstat.geom_3.3-3 lmtest_0.9-40
#> [107] jquerylib_0.1.4 RcppAnnoy_0.0.22
#> [109] data.table_1.16.2 cowplot_1.1.3
#> [111] bitops_1.0-9 irlba_2.3.5.1
#> [113] httpuv_1.6.15 patchwork_1.3.0
#> [115] GenomicRanges_1.46.1 R6_2.5.1
#> [117] promises_1.3.0 KernSmooth_2.23-20
#> [119] gridExtra_2.3 IRanges_2.28.0
#> [121] parallelly_1.38.0 codetools_0.2-18
#> [123] MASS_7.3-55 SummarizedExperiment_1.24.0
#> [125] SeuratObject_5.0.2 sctransform_0.4.1
#> [127] S4Vectors_0.32.4 GenomeInfoDbData_1.2.7
#> [129] parallel_4.1.3 grid_4.1.3
#> [131] tidyr_1.3.1 rmarkdown_2.29
#> [133] MatrixGenerics_1.6.0 Rtsne_0.17
#> [135] spatstat.explore_3.3-3 Biobase_2.54.0
#> [137] shiny_1.9.1