This section demonstrates the data splitting procedure for selecting relevant features when there exists cluster structure under the Poisson setting.

using SplitClusterTest
using Plots


x, cl = gen_data_pois(1000, 2000, 0.5, prop_imp=0.1, type = "discrete")
([2.0 1.0 … 5.0 2.0; 6.0 5.0 … 3.0 5.0; … ; 9.0 8.0 … 4.0 2.0; 2.0 6.0 … 3.0 3.0], [0, 1, 0, 0, 1, 1, 0, 1, 0, 0  …  1, 1, 0, 0, 0, 0, 0, 1, 1, 0])

Plot the first two PCs of X

pc1, pc2 = first_two_PCs(x)
scatter(pc1[cl .== 0], pc2[cl .== 0])
scatter!(pc1[cl .== 1], pc2[cl .== 1])
Example block output

Adopt the data splitting procedure to select the relevant features.

ms = ds(x, ret_ms = true);
τ = calc_τ(ms)
3.9197051310774795

the mirror statistics of relevant features tend to be larger and away from null features, where the null features still exhibit a symmetric distribution about zero. Then we can properly take the cutoff to control the FDR, as shown by the red vertical line.

histogram(ms, label = "")
Plots.vline!([τ], label = "", lw = 3)
Example block output