Title: | Testing Two-Sample Mean in High Dimension |
---|---|
Description: | Implements the high-dimensional two-sample test proposed by Zhang (2019) <http://hdl.handle.net/2097/40235>. It also implements the test proposed by Srivastava, Katayama, and Kano (2013) <doi:10.1016/j.jmva.2012.08.014>. These tests are particularly suitable for high-dimensional data from two populations for which the classical multivariate Hotelling's T-square test fails because the sample sizes are smaller than the dimensionality. In this case, the ZWL and ZWLm tests proposed by Zhang (2019) <http://hdl.handle.net/2097/40235>, referred to as zwl_test() in this package, provide a reliable and powerful test. |
Authors: | Huaiyu Zhang, Haiyan Wang |
Maintainer: | Huaiyu Zhang <[email protected]> |
License: | GPL-2 |
Version: | 0.1.0 |
Built: | 2024-11-12 03:24:09 UTC |
Source: | https://github.com/cran/highDmean |
This function generates simulated high-dimensional two-sample data from user-specified populations with given mean vectors, covariance structure, sample sizes, and dimension of each observation. It can generate the long-range dependent process proposed by Hall et al. (1998) in addition to some processes provided in arima.sim().
buildData( n, m, p, muX, muY, dep, commoncov = TRUE, VarScaleY = 1, S = 1, innov = function(n, ...) stats::rnorm(n, 0, 1), heteroscedastic = FALSE, het.diag )
n |
number of observations in the 1st sample. |
m |
number of observations in the 2nd sample. |
p |
the dimensionality of each observation. The samples from both populations should have the same dimension. |
muX |
the mean vector (of length p) of the 1st population. |
muY |
the mean vector (of length p) of the 2nd population. |
dep |
dependence structure among the components of each observation: 'IND' for independence; 'SD' for strong dependency, AR(1) with parameter 0.9; 'WD' for weak dependency, ARMA(2, 2) with AR parameters 0.4 and -0.1, and MA parameters 0.2 and 0.3; 'LR' for long-range dependency with parameter 0.7. For more details about the configurations, please refer to Zhang and Wang (2020). |
commoncov |
a logical indicating whether the two populations have equal covariance matrices. If FALSE, the innovations used in generating data for the 2nd population will be scaled by the square root of the value specified in VarScaleY. |
VarScaleY |
constant by which innovations are scaled in generating observations for the 2nd sample when commoncov=FALSE. |
S |
the number of data sets to simulate. |
innov |
a function used to generate the innovations. The default draws standard normal innovations with stats::rnorm(n, 0, 1). |
heteroscedastic |
a logical indicating whether the components will be scaled by the entries in the diagonal matrix specified by het.diag. |
het.diag |
a diagonal matrix whose entries scale the components when heteroscedastic = TRUE. |
A list of S lists, each consisting of an n by p matrix X, an m by p matrix Y, the sample sizes, n and m, for each population, and the dimensionality p.
Hall, P., Jing, B.-Y., and Lahiri, S. N. (1998). On the sampling window method for long-range dependent data. Statistica Sinica, 8(4):1189-1204.
# Generate 3 two-sample datasets of dimensionality 300
# with sample sizes 45 for one sample & 60 for the other.
buildData(n = 45, m = 60, p = 300,
          muX = rep(0, 300), muY = rep(0, 300),
          dep = 'IND', S = 3, innov = rnorm)
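The 'SD' setting described above (AR(1) dependence with parameter 0.9 among the components) and the scaling of innovations by the square root of VarScaleY can be illustrated with base R alone. The following is only a minimal sketch under stated assumptions: the helper sim_obs() and its innov_sd argument are hypothetical names introduced here, and the package's own implementation may differ in details such as burn-in handling.

```r
set.seed(123)
n <- 4; m <- 6; p <- 50
VarScaleY <- 2  # variance scale for the 2nd population (assumed value)

# Hypothetical helper: one observation whose p components follow an AR(1)
# process with coefficient 0.9, shifted by the mean vector mu.
# innov_sd rescales the standard normal innovations.
sim_obs <- function(p, mu, innov_sd = 1) {
  innov <- function(n, ...) stats::rnorm(n, 0, 1) * innov_sd
  as.numeric(stats::arima.sim(model = list(ar = 0.9), n = p, rand.gen = innov)) + mu
}

# Stack observations as rows: X is n by p, Y is m by p.
X <- t(replicate(n, sim_obs(p, mu = rep(0, p))))
Y <- t(replicate(m, sim_obs(p, mu = rep(0, p), innov_sd = sqrt(VarScaleY))))

dim(X)
dim(Y)
```

Each row is one observation, which matches the n by p and m by p layout of the X and Y matrices returned by buildData().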
A dataset containing the gene expressions for a Gene Ontology (GO) term on two phenotype groups: BCR/ABL and NEG. The id of the GO term is GO:0000003. The raw dataset is taken from the ALL package. The data were preprocessed; the details are elaborated in Zhang and Wang (2020).
GO_example
A list with two subsets of gene expression data.
A matrix containing gene expressions for the BCR/ABL group. The row id is for patient and the column id is for gene.
A matrix containing gene expressions for the NEG group. The row id is for patient and the column id is for gene.
Zhang, H. and Wang, H. (2020). Result consistency of high dimensional two-sample tests applied to gene ontology terms with gene sets. Manuscript in review.
This package is an implementation of the high-dimensional two-sample test proposed by Zhang and Wang (2020) "Result consistency of high dimensional two-sample tests applied to gene ontology terms with gene sets". It also implements the SKK test proposed by Srivastava, Katayama, and Kano (2013) "A two sample test in high dimensional data." These tests are particularly suitable for high dimensional data from two populations for which the classical multivariate Hotelling's T-square test fails due to sample sizes smaller than dimensionality. In this case, the ZWL and ZWLm tests proposed by Zhang and Wang (2020), referred to as zwl_test() in this package, provide a reliable and powerful test.
The function zwl_test() conducts the ZWL and ZWLm tests of equal means for two-sample high-dimensional data provided in matrices of dimension n by p and m by p, which are random samples from two populations. It returns the value of the test statistic and the p-value under the null hypothesis of equal means.
The SKK_test() function performs the SKK test and returns the value of the test statistic and the p-value.
The buildData() function generates simulated high-dimensional data in the two-population setting with specified sample sizes, numbers of components, covariance structure, etc., and the functions zwl_sim() and SKK_sim() return test statistic values and p-values for lists of simulated data sets generated by buildData().
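To see why Hotelling's T-square test breaks down when the sample sizes are smaller than the dimensionality, note that the statistic requires inverting the pooled sample covariance matrix, whose rank cannot exceed n + m - 2. The base R sketch below uses arbitrary illustrative sizes (not taken from the package) and checks that the pooled covariance matrix is rank deficient when p > n + m - 2.

```r
set.seed(42)
n <- 10; m <- 12; p <- 30  # illustrative sizes with p > n + m - 2

X <- matrix(rnorm(n * p), nrow = n, ncol = p)
Y <- matrix(rnorm(m * p), nrow = m, ncol = p)

# Pooled sample covariance matrix (p by p) used by Hotelling's T-square
S_pooled <- ((n - 1) * cov(X) + (m - 1) * cov(Y)) / (n + m - 2)

# Numerical rank: bounded by n + m - 2, hence smaller than p here,
# so S_pooled is singular and cannot be inverted.
rank_S <- qr(S_pooled)$rank
rank_S
```

Because the rank is below p, the inverse needed by the classical statistic does not exist, which is the situation the ZWL, ZWLm, and SKK tests are designed for.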
This function generates random samples from a shifted gamma distribution. That is, random samples are first generated from the gamma distribution with shape parameter shape and scale parameter scale, and then the mean of the gamma distribution, shape * scale, is subtracted from the sample.
rgammashift(n, shape, scale)
n |
number of observations. |
shape |
the shape parameter of the gamma distribution. |
scale |
the scale parameter of the gamma distribution. |
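A vector of n values. It is equivalent to rgamma(n, shape = shape, scale = scale) - shape * scale. Note that scale must be passed by name, because the third positional argument of rgamma() is rate.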
# Generate a sample of shifted gamma observations with shape parameter 4 and scale parameter 2.
set.seed(10)
rgammashift(n = 5, shape = 4, scale = 2)
# It is equivalent to
set.seed(10)
rgamma(n = 5, shape = 4, scale = 2) - 4 * 2
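Subtracting shape * scale centers the innovations at zero while leaving the variance, shape * scale^2, unchanged. The sketch below checks this numerically with base R only; rgammashift_sketch() is a hypothetical stand-in written for illustration from the description above, not the package's source.

```r
# Hypothetical stand-in mirroring the documented behaviour of rgammashift()
rgammashift_sketch <- function(n, shape, scale) {
  stats::rgamma(n, shape = shape, scale = scale) - shape * scale
}

set.seed(10)
x <- rgammashift_sketch(1e5, shape = 4, scale = 2)

sample_mean <- mean(x)  # theoretical mean: 0
sample_var  <- var(x)   # theoretical variance: shape * scale^2 = 16
c(mean = sample_mean, var = sample_var)
```

Zero-mean skewed innovations of this kind can be passed to buildData() through its innov argument to study the tests under non-normal data.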
This function performs the SKK test of Srivastava, Katayama, and Kano (2013) on multiple high-dimensional two-sample datasets. It is useful for Monte Carlo experiments.
SKK_sim(DATA)
DATA |
The list of dataset lists generated by buildData(). |
A data frame, each row of which reports the value of the SKK test statistic and the p-value for one dataset.
Srivastava, M. S., Katayama, S., and Kano, Y. (2013). A two sample test in high dimensional data. Journal of Multivariate Analysis, 114:349-358.
# Generate 3 simulated datasets and apply the SKK test
data <- buildData(n = 45, m = 60, p = 300,
                  muX = rep(0, 300), muY = rep(0, 300),
                  dep = 'IND', S = 3, innov = rnorm)
SKK_sim(data)
This function implements the two-sample high-dimensional test proposed by Srivastava, Katayama, and Kano (2013).
SKK_test(X, Y)
X |
The data matrix (n by p) from the first population. |
Y |
The data matrix (m by p) from the second population. |
A list consisting of the values of the test statistic and p-value.
Srivastava, M. S., Katayama, S., and Kano, Y. (2013). A two sample test in high dimensional data. Journal of Multivariate Analysis, 114:349-358.
# Generate a simulated dataset and apply the SKK test
data <- buildData(n = 45, m = 60, p = 300,
                  muX = rep(0, 300), muY = rep(0, 300),
                  dep = 'IND', S = 1, innov = rnorm)
SKK_test(data[[1]]$X, data[[1]]$Y)
# Apply the SKK test to the data for a GO term stored in GO_example
SKK_test(GO_example$X, GO_example$Y)
This function applies the two-sample high-dimensional test by Zhang and Wang (2020) to multiple simulated two-sample high-dimensional datasets. It is useful for Monte Carlo experiments.
zwl_sim(DATA, order = 0)
DATA |
The list of dataset lists generated by buildData(). |
order |
The order of the center correction. Possible choices are 0 and 2.
To use the ZWLm test, set order = 0; to use the ZWL test, set order = 2. |
A data frame in which each row consists of the value of the test statistic, the p-value, Tn, and the estimate of Var(Tn) for one dataset.
Zhang, H. and Wang, H. (2020). Result consistency of high dimensional two-sample tests applied to gene ontology terms with gene sets. Manuscript in review.
# Generate 3 simulated two-sample datasets and apply the ZWL test
data <- buildData(n = 45, m = 60, p = 300,
                  muX = rep(0, 300), muY = rep(0, 300),
                  dep = 'IND', S = 3, innov = rnorm)
zwl_sim(data, order = 2)
This function implements the test of equal means for two-sample high-dimensional data using the ZWL and ZWLm tests proposed by Zhang and Wang (2020).
zwl_test(X, Y, order = 0)
X |
The data matrix (n by p) from the first population. |
Y |
The data matrix (m by p) from the second population. |
order |
The order of the center correction. Possible choices are 0 and 2.
To use the ZWLm test, set order = 0; to use the ZWL test, set order = 2. |
The value of the test statistic.
The p-value of the test statistic, based on the asymptotic normality established by Zhang and Wang (2020).
Tn, the average of the squared univariate t-statistics.
The estimated variance of Tn.
Zhang, H. and Wang, H. (2020). Result consistency of high dimensional two-sample tests applied to gene ontology terms with gene sets. Manuscript in review.
# Generate a simulated two-sample dataset and apply the ZWL test
data <- buildData(n = 45, m = 60, p = 300,
                  muX = rep(0, 300), muY = rep(0, 300),
                  dep = 'IND', S = 1, innov = rnorm)
zwl_test(data[[1]]$X, data[[1]]$Y, order = 2)
# Apply the ZWLm test to a GO term to see if the two groups are differentially expressed.
# The data for the GO term are stored in GO_example.
zwl_test(GO_example$X, GO_example$Y, order = 0)
# Apply the ZWL test to the GO term
zwl_test(GO_example$X, GO_example$Y, order = 2)
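The quantity Tn is described above as the average of the squared univariate t-statistics. The base R sketch below computes such an average under the assumption that each component uses a Welch-type (unequal variance) t-statistic; that choice, and the helper name tn_sketch(), are assumptions made here for illustration. The sketch does not reproduce the center correction, the variance estimate, or the p-value returned by zwl_test().

```r
# Hypothetical helper: average of squared componentwise two-sample
# t-statistics, assuming the unequal-variance (Welch) form.
tn_sketch <- function(X, Y) {
  n <- nrow(X); m <- nrow(Y)
  mean_diff <- colMeans(X) - colMeans(Y)
  se2 <- apply(X, 2, stats::var) / n + apply(Y, 2, stats::var) / m
  mean(mean_diff^2 / se2)
}

set.seed(2020)
n <- 45; m <- 60; p <- 300
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
Y <- matrix(rnorm(m * p), nrow = m, ncol = p)

tn <- tn_sketch(X, Y)
tn

# Cross-check the first component against stats::t.test(),
# which uses the Welch statistic by default.
t1 <- unname(stats::t.test(X[, 1], Y[, 1])$statistic)
first_term <- (mean(X[, 1]) - mean(Y[, 1]))^2 / (var(X[, 1]) / n + var(Y[, 1]) / m)
all.equal(first_term, t1^2)
```

Averaging over all p components pools weak signals across coordinates without inverting a p by p covariance matrix.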