| Title: | Various Methods for the Two Sample Problem in D>1 Dimensions |
|---|---|
| Description: | The routine twosample_test() in this package runs the two-sample test using various test statistic for multivariate data. The user can also run several tests and then find a p value adjusted for simultaneous inference. The p values are found via permutation or via the parametric bootstrap. The routine twosample_power() allows the estimation of the power of the tests. The routine run.studies() allows a user to quickly study the power of a new method and how it compares to those included in the package. For details of the methods and references see the included vignettes. |
| Authors: | Wolfgang Rolke [aut, cre] (ORCID: <https://orcid.org/0000-0002-3514-726X>) |
| Maintainer: | Wolfgang Rolke <[email protected]> |
| License: | GPL (>= 2) |
| Version: | 1.2.1 |
| Built: | 2026-06-11 06:41:16 UTC |
| Source: | https://github.com/cran/MD2sample |
This function creates the functions needed to run the various case studies.
case.studies( which, n = 200, nx = n, ny = n, nbins = -1, Ranges = matrix(c(-Inf, Inf, -Inf, Inf), 2, 2), ReturnCaseNames = FALSE )case.studies( which, n = 200, nx = n, ny = n, nbins = -1, Ranges = matrix(c(-Inf, Inf, -Inf, Inf), 2, 2), ReturnCaseNames = FALSE )
which |
name of the case study. |
n |
=200, sample size for both data sets |
nx |
=n, sample size for first data sets |
ny |
=n, sample size for second data sets |
nbins |
=-1, number of bins for chi-square test, 2D only |
Ranges |
= matrix(c(-Inf, Inf, -Inf, Inf), 2, 2), ranges of variables |
ReturnCaseNames |
=FALSE, should list of case studies be returned? |
a list of functions and vectors
This function does the chi square two sample test for continuous data in two dimensions
chisq2D_test_cont( dta_x, dta_y, Ranges = matrix(c(-Inf, Inf, -Inf, Inf), 2, 2), nbins = c(5, 5), minexpcount = 5, SuppressMessages = FALSE )chisq2D_test_cont( dta_x, dta_y, Ranges = matrix(c(-Inf, Inf, -Inf, Inf), 2, 2), nbins = c(5, 5), minexpcount = 5, SuppressMessages = FALSE )
dta_x |
a matrix of numbers |
dta_y |
a matrix of numbers |
Ranges |
=matrix(c(-Inf, Inf, -Inf, Inf),2,2) a 2x2 matrix with lower and upper bounds |
nbins |
=c(5,5) number of bins in x and y direction |
minexpcount |
=5 minimum counts required per bin |
SuppressMessages |
=FALSE, print informative messages? |
a list with statistics and p values
This function does the chi square two sample test for continuous data in two dimensions
chisq2D_test_disc(dta, minexpcount = 5)chisq2D_test_disc(dta, minexpcount = 5)
dta |
a list with values and counts |
minexpcount |
=5 minimum counts required per bin |
a list with statistics and p values
This function does the chi square two sample test for continuous data in two dimensions
chiTS.cont(x, y, TSextra)chiTS.cont(x, y, TSextra)
x |
a matrix of numbers |
y |
a matrix of numbers |
TSextra |
list with some info |
either a test statistic or a p value
This function does the chi square two sample test for discrete data in two dimensions
chiTS.disc(x, y, vals_x, vals_y, TSextra)chiTS.disc(x, y, vals_x, vals_y, TSextra)
x |
a vector with counts |
y |
a vector with counts |
vals_x |
a vector with points |
vals_y |
a vector with points |
TSextra |
list with some info |
either a test statistic or a p value
This function does the tests by Chen and Friedman
edge.tests(x, y)edge.tests(x, y)
x |
a matrix of numbers |
y |
a matrix of numbers |
a list with statistics and p values
FR test is a multivariate generalization of nonparametric two-sample test. This function is an implementation with customized options, including a visualization of the minimum spanning tree (MST).
FR.test( samp1, samp2, use.cosine = FALSE, binary = FALSE, binary.cutoff = 2, plot.MST = FALSE, col = c("#F0E442", "#56B4E9"), label.names = c("Sample 1", "Sample 2"), vertex.size = 5, edge.width = 1, ... )FR.test( samp1, samp2, use.cosine = FALSE, binary = FALSE, binary.cutoff = 2, plot.MST = FALSE, col = c("#F0E442", "#56B4E9"), label.names = c("Sample 1", "Sample 2"), vertex.size = 5, edge.width = 1, ... )
samp1 |
Numeric matrix or data frame for Sample 1. Rows are multivariate dimensions, and columns are samples. E.g. gene by cell. |
samp2 |
Numeric matrix or data frame for Sample 2. |
use.cosine |
An option if to use cosine distance. Logical variable. By default ( |
binary |
An option if to use binary values. Logical variable. Default: |
binary.cutoff |
Numeric value for binary cutoff. Binary value = 1 if greater than |
plot.MST |
Boolean variable indicating if to plot the minimum spanning tree (MST). Default: |
col |
Character vector of length two for customized colors of the nodes in MST. Default: |
label.names |
Character vector of length two for customized names of the two samples. Default: |
vertex.size, edge.width, ...
|
Additional plotting parammeters passed to |
Test statistics and p-values.
runs |
Total number of subtrees. |
runs.samp1 |
Number of subtrees of Sample 1. |
runs.samp2 |
Number of subtrees of Sample 2. |
stat |
The standardized FR statistic. |
p.value |
P-value of the FR test. |
the results of the included power studies
power_studies_resultspower_studies_results
A list of matrices with powers
This function runs the case studies included in the package.
run.studies( Continuous = TRUE, study, TS, TSextra, With.p.value = FALSE, nsample = 200, alpha = 0.05, param_alt, SuppressMessages = FALSE, B = 1000, maxProcessor )run.studies( Continuous = TRUE, study, TS, TSextra, With.p.value = FALSE, nsample = 200, alpha = 0.05, param_alt, SuppressMessages = FALSE, B = 1000, maxProcessor )
Continuous |
=TRUE, run cases for continuous data. |
study |
either the name of the study, or its number in the list. If missing all the studies are run. |
TS |
routine to calculate new test statistics. |
TSextra |
list passed to TS (optional). |
With.p.value |
=FALSE, does user supplied routine return p values? |
nsample |
= 200, desired sample size. 200 is used in included case studies. |
alpha |
=0.05, type I error probability of tests. 0.05 is used in included case studies. |
param_alt |
(list of) values of parameter under the alternative hypothesis. If missing included values are used. |
SuppressMessages |
=FALSE, should informative messages be printed? |
B |
= 1000, number of simulation runs. |
maxProcessor |
number of cores to use. If missing the number of physical cores-1 is used. If set to 1 no parallel processing is done. |
For details consult vignette(package="MD2sample")
A (list of ) matrices of p.values.
#The new test is a (included) chi square test: TSextra=list(which="pval", nbins=rbind(c(3,3), c(4,4))) run.studies(Continuous=TRUE, study=c("NormalD2", "tD2"), TS=MD2sample::chiTS.cont, TSextra=TSextra, With.p.value=TRUE, B=100)#The new test is a (included) chi square test: TSextra=list(which="pval", nbins=rbind(c(3,3), c(4,4))) run.studies(Continuous=TRUE, study=c("NormalD2", "tD2"), TS=MD2sample::chiTS.cont, TSextra=TSextra, With.p.value=TRUE, B=100)
This function does some rounding to nice numbers
## S3 method for class 'digits' signif(x, d = 4)## S3 method for class 'digits' signif(x, d = 4)
x |
a list of two vectors |
d |
=4 number of digits to round to |
A list with rounded vectors
test function
timecheck(dta, TS, typeTS, TSextra)timecheck(dta, TS, typeTS, TSextra)
dta |
data set |
TS |
test statistics |
typeTS |
format of TS |
TSextra |
additional info TS |
Mean computation time
This function runs a number of two sample tests which find their own p value.
TS_cont_pval(x, y)TS_cont_pval(x, y)
x |
a matrix of numbers if data is continuous or of counts if data is discrete. |
y |
a matrix of numbers if data is continuous or of counts if data is discrete. |
A list of two numeric vectors, the test statistics and the p values.
Estimate the power of various two sample tests using Rcpp and parallel computing.
twosample_power( f, ..., TS, TSextra, alpha = 0.05, B = 1000, nbins = c(5, 5), minexpcount = 5, Ranges = matrix(c(-Inf, Inf, -Inf, Inf), 2, 2), samplingmethod = "Binomial", rnull, With.p.value = FALSE, DoTransform = TRUE, SuppressMessages = FALSE, LargeSampleOnly = FALSE, maxProcessor, doMethods = "all" )twosample_power( f, ..., TS, TSextra, alpha = 0.05, B = 1000, nbins = c(5, 5), minexpcount = 5, Ranges = matrix(c(-Inf, Inf, -Inf, Inf), 2, 2), samplingmethod = "Binomial", rnull, With.p.value = FALSE, DoTransform = TRUE, SuppressMessages = FALSE, LargeSampleOnly = FALSE, maxProcessor, doMethods = "all" )
f |
function to generate a list with data sets x and y for continuous data or a matrix with columns vals_x, vals_y, x and y for discrete data. |
... |
additional arguments passed to f, up to 2. |
TS |
routine to calculate test statistics for new tests. |
TSextra |
additional info passed to TS, if necessary. |
alpha |
=0.05, the type I error probability of the hypothesis test. |
B |
=1000, number of simulation runs. |
nbins |
=c(5, 5), number of bins for chi square test if Dim=2. |
minexpcount |
=5, lowest required count for chi-square test. |
Ranges |
=matrix(c(-Inf, Inf, -Inf, Inf),2,2), a 2x2 matrix with lower and upper bounds. |
samplingmethod |
="Binomial" for Binomial sampling or "independence" for independence sampling in the discrete data case. |
rnull |
function to generate new data sets for parametric bootstrap. |
With.p.value |
=FALSE, does user supplied routine return p values? |
DoTransform |
=TRUE, should data be transformed to to unit hypercube? |
SuppressMessages |
=FALSE, should messages be printed? |
LargeSampleOnly |
=FALSE, should only methods with large sample theories be run? |
maxProcessor |
number of cores to use. If missing the number of physical cores-1 is used. If set to 1 no parallel processing is done. |
doMethods |
="all", which methods should be included? |
For details consult vignette("MD2sample","MD2sample")
A numeric matrix or vector of power values.
#Note that the resulting power estimates are meaningless because #of the extremely low number of simulation runs B, required because of CRAN timing rule # #Power of tests when one data set comes from a standard normal multivariate distribution function #and the other data set from a multivariate normal with correlation #number of simulation runs is ridiculously small because of CRAN submission rules f=function(a=0) { S=diag(2) x=mvtnorm::rmvnorm(100, sigma = S) S[1,2]=a S[2,1]=a y=mvtnorm::rmvnorm(120, sigma = S) list(x=x, y=y) } twosample_power(f, c(0, 0.5), B=10, maxProcessor=1) #Power of use supplied test. Example is a (included) chi-square test: TSextra=list(which="statistics", nbins=rbind(c(3,3), c(4,4))) twosample_power(f, c(0, 0.5), TS=chiTS.cont, TSextra=TSextra, B=10, maxProcessor=1) #Same example, but this time the user supplied routine calculates p values: TSextra=list(which="pvalues", nbins=c(4,4)) twosample_power(f, c(0, 0.5), TS=chiTS.cont, TSextra=TSextra, B=10, With.p.value=TRUE, maxProcessor=1) #Example for discrete data g=function(p1, p2) { x = table(sample(1:4, size=1000, replace = TRUE)) y = table(sample(1:4, size=500, replace = TRUE, prob=c(p1,p2,1,1))) cbind(vals_x=rep(1:2,2), vals_y=rep(1:2, each=2), x=x, y=y) } twosample_power(g, 1.5, 1.6, B=10, maxProcessor=1)#Note that the resulting power estimates are meaningless because #of the extremely low number of simulation runs B, required because of CRAN timing rule # #Power of tests when one data set comes from a standard normal multivariate distribution function #and the other data set from a multivariate normal with correlation #number of simulation runs is ridiculously small because of CRAN submission rules f=function(a=0) { S=diag(2) x=mvtnorm::rmvnorm(100, sigma = S) S[1,2]=a S[2,1]=a y=mvtnorm::rmvnorm(120, sigma = S) list(x=x, y=y) } twosample_power(f, c(0, 0.5), B=10, maxProcessor=1) #Power of use supplied test. Example is a (included) chi-square test: TSextra=list(which="statistics", nbins=rbind(c(3,3), c(4,4))) twosample_power(f, c(0, 0.5), TS=chiTS.cont, TSextra=TSextra, B=10, maxProcessor=1) #Same example, but this time the user supplied routine calculates p values: TSextra=list(which="pvalues", nbins=c(4,4)) twosample_power(f, c(0, 0.5), TS=chiTS.cont, TSextra=TSextra, B=10, With.p.value=TRUE, maxProcessor=1) #Example for discrete data g=function(p1, p2) { x = table(sample(1:4, size=1000, replace = TRUE)) y = table(sample(1:4, size=500, replace = TRUE, prob=c(p1,p2,1,1))) cbind(vals_x=rep(1:2,2), vals_y=rep(1:2, each=2), x=x, y=y) } twosample_power(g, 1.5, 1.6, B=10, maxProcessor=1)
This function runs a number of two sample tests using Rcpp and parallel computing.
twosample_test( x, y, vals_x = NA, vals_y = NA, TS, TSextra, B = 5000, nbins = c(5, 5), minexpcount = 5, Ranges = matrix(c(-Inf, Inf, -Inf, Inf), 2, 2), DoTransform = TRUE, samplingmethod = "Binomial", rnull, SuppressMessages = FALSE, LargeSampleOnly = FALSE, maxProcessor, doMethods = "all" )twosample_test( x, y, vals_x = NA, vals_y = NA, TS, TSextra, B = 5000, nbins = c(5, 5), minexpcount = 5, Ranges = matrix(c(-Inf, Inf, -Inf, Inf), 2, 2), DoTransform = TRUE, samplingmethod = "Binomial", rnull, SuppressMessages = FALSE, LargeSampleOnly = FALSE, maxProcessor, doMethods = "all" )
x |
Continuous data: either a matrix of numbers, or a list with two matrices called x and y. if it is a matrix Observations are in different rows. Discrete data: a vector of counts or a matrix with columns named vals_x, vals_y, x and y. |
y |
a matrix of numbers if data is continuous or a vector of counts if data is discrete. |
vals_x |
=NA, a vector of values for discrete random variables, or NA if data is continuous. |
vals_y |
=NA, a vector of values for discrete random variables, or NA if data is continuous. |
TS |
user supplied routine to calculate test statistics for new tests. |
TSextra |
(optional) additional info passed to TS, if necessary. |
B |
=5000, number of simulation runs for permutation test. |
nbins |
=c(5,5), for chi square tests (2D only). |
minexpcount |
=5, lowest required count for chi-square test (2D only). |
Ranges |
=matrix(c(-Inf, Inf, -Inf, Inf),2,2), a 2x2 matrix with lower and upper bounds (2D only). |
DoTransform |
=TRUE, should data be transformed to unit hypercube? |
samplingmethod |
="Binomial" for Binomial sampling or "independence" for independence sampling. |
rnull |
function to generate new data sets for simulation as an alternative to the permutation method. |
SuppressMessages |
=FALSE, should informative messages be printed? |
LargeSampleOnly |
=FALSE, should only methods with large sample theories be run? |
maxProcessor |
number of cores to use. If missing the number of physical cores-1 is used. If set to 1 no parallel processing is done. |
doMethods |
="all", Which methods should be included? |
For details consult vignette("MD2sample","MD2sample")
A list of two numeric vectors, the test statistics and the p values.
#Two continuous data sets from a multivariate normal: x = mvtnorm::rmvnorm(100, c(0,0)) y = mvtnorm::rmvnorm(120, c(0,0)) twosample_test(x, y, B=100, maxProcessor=1) #Using a new test, this one is an (included) chi square test. #Also enter data as a list: TSextra=list(which="statistics", nbins=rbind(c(3,3), c(4,4))) dta=list(x=x, y=y) twosample_test(dta, TS=chiTS.cont, TSextra=TSextra, B=100, maxProcessor=1) #Two discrete data sets from some distribution: x = table(sample(1:4, size=1000, replace = TRUE)) y = table(sample(1:4, size=1000, replace = TRUE, prob=c(1,2,1,1))) vals_x=rep(1:2,2) vals_y=rep(1:2, each=2) twosample_test(x, y, vals_x, vals_y, B=100, maxProcessor=1) #Run a discrete chi square test and enter the data as a matrix: TSextra=list(which="statistics") dta=cbind(x=x, y=y, vals_x=vals_x, vals_y=vals_y) twosample_test(dta, TS=chiTS.disc, TSextra=TSextra, B=100, maxProcessor=1)#Two continuous data sets from a multivariate normal: x = mvtnorm::rmvnorm(100, c(0,0)) y = mvtnorm::rmvnorm(120, c(0,0)) twosample_test(x, y, B=100, maxProcessor=1) #Using a new test, this one is an (included) chi square test. #Also enter data as a list: TSextra=list(which="statistics", nbins=rbind(c(3,3), c(4,4))) dta=list(x=x, y=y) twosample_test(dta, TS=chiTS.cont, TSextra=TSextra, B=100, maxProcessor=1) #Two discrete data sets from some distribution: x = table(sample(1:4, size=1000, replace = TRUE)) y = table(sample(1:4, size=1000, replace = TRUE, prob=c(1,2,1,1))) vals_x=rep(1:2,2) vals_y=rep(1:2, each=2) twosample_test(x, y, vals_x, vals_y, B=100, maxProcessor=1) #Run a discrete chi square test and enter the data as a matrix: TSextra=list(which="statistics") dta=cbind(x=x, y=y, vals_x=vals_x, vals_y=vals_y) twosample_test(dta, TS=chiTS.disc, TSextra=TSextra, B=100, maxProcessor=1)
This function runs a number of two sample tests using Rcpp and parallel computing and then finds the correct p value for the combined tests.
twosample_test_adjusted_pvalue( x, y, vals_x = NA, vals_y = NA, B = c(5000, 1000), nbins = c(5, 5), minexpcount = 5, samplingmethod = "Binomial", Ranges = matrix(c(-Inf, Inf, -Inf, Inf), 2, 2), DoTransform = TRUE, rnull, SuppressMessages = FALSE, maxProcessor, doMethods )twosample_test_adjusted_pvalue( x, y, vals_x = NA, vals_y = NA, B = c(5000, 1000), nbins = c(5, 5), minexpcount = 5, samplingmethod = "Binomial", Ranges = matrix(c(-Inf, Inf, -Inf, Inf), 2, 2), DoTransform = TRUE, rnull, SuppressMessages = FALSE, maxProcessor, doMethods )
x |
Continuous data: either a matrix of numbers, or a list with two matrices called x and y. if it is a matrix Observations are in different rows. Discrete data: a vector of counts or a matrix with columns named vals_x, vals_y, x and y. |
y |
a matrix of numbers if data if data is continuous or a vector of counts if data is discrete. |
vals_x |
=NA, a vector of values for discrete random variable, or NA if data is continuous. |
vals_y |
=NA, a vector of values for discrete random variable, or NA if data is continuous. |
B |
=c(5000, 1000), number of simulation runs for permutation test and for estimation of the empirical distribution function. |
nbins |
=c(5, 5), number of bins for chi square tests (2D only). |
minexpcount |
= 5, minimum required expected counts for chi-square tests. |
samplingmethod |
="Binomial" or "independence" for discrete data. |
Ranges |
=matrix(c(-Inf, Inf, -Inf, Inf),2,2) a 2x2 matrix with lower and upper bounds. |
DoTransform |
=TRUE, should data be transformed to interval (0,1)? |
rnull |
routine for parametric bootstrap. |
SuppressMessages |
= FALSE, print informative messages? |
maxProcessor |
number of cores for parallel processing. |
doMethods |
Which methods should be included? If missing a small number of methods that generally have good power are used. |
For details consult the vignette("MD2sample","MD2sample")
NULL, results are printed out.
#Note that the number of simulation runs B is very small to #satisfy CRAN's run time constraints. #Two continuous data sets from a multivariate normal: x = mvtnorm::rmvnorm(100, c(0,0)) y = mvtnorm::rmvnorm(120, c(0,0)) twosample_test_adjusted_pvalue(x, y, maxProcessor=1, B=20) #Two discrete data sets from some distribution: x = table(sample(1:4, size=1000, replace = TRUE)) y = table(sample(1:4, size=500, replace = TRUE, prob=c(1, 1.5, 1, 1))) twosample_test_adjusted_pvalue(x, y, rep(1:2,2), rep(1:2, each=2), maxProcessor=1, B=20)#Note that the number of simulation runs B is very small to #satisfy CRAN's run time constraints. #Two continuous data sets from a multivariate normal: x = mvtnorm::rmvnorm(100, c(0,0)) y = mvtnorm::rmvnorm(120, c(0,0)) twosample_test_adjusted_pvalue(x, y, maxProcessor=1, B=20) #Two discrete data sets from some distribution: x = table(sample(1:4, size=1000, replace = TRUE)) y = table(sample(1:4, size=500, replace = TRUE, prob=c(1, 1.5, 1, 1))) twosample_test_adjusted_pvalue(x, y, rep(1:2,2), rep(1:2, each=2), maxProcessor=1, B=20)