tMixture               package:OCplus               R Documentation

_F_i_t _a _m_i_x_t_u_r_e _o_f _t-_d_i_s_t_r_i_b_u_t_i_o_n_s

_D_e_s_c_r_i_p_t_i_o_n:

     For a vector of individual genewise t-statistics, this functions
     fits a distribution of central and non-central t-distributions,
     with the primary goal of estimating the proportion 'p0' of
     non-differentially expressed genes.

_U_s_a_g_e:

     tMixture(tstat, n1 = 10, n2 = n1, nq, p0, p1, D, delta, paired = FALSE, 
              tbreak, ext = TRUE, threshold.delta=0.75, ...)

_A_r_g_u_m_e_n_t_s:

   tstat: the vector of genewise t-statistics

      n1: number of samples in the first group

      n2: number of samples in the second group

      nq: the number of components in the mixture that is fitted

      p0: a starting value for the proportion of non-differentially
          expressed genes.

      p1: a vector with starting values for the proportions of genes
          that are differentially expressed with effect size 'D'.

       D: a vector of starting values for the effect sizes of the
          differentially expressed genes, corresponding to the
          proportions 'p1'.

   delta: a vector of starting values for the effect sizes of the
          differentially expressed genes, expressed as non-centrality
          parameters; this is just a different way of specifying 'D',
          though if both are given, 'delta' will get priority.

  paired: a logical value indicating whether the t-statistics are
          two-sample or paired.

  tbreak: either the number of equally spaced bins for tabulating
          'tstat', or the explicit break points for the bins, very much
          like the argument 'breaks' to function 'cut'; the default
          value is the square root of the number of genes.

     ext: a logical value indicating whether to extend the bins, i.e.
          to set the lowest bin limit to -infinity and the largest bin
          limit to inifinity.

threshold.delta: mixture components with an estimated absolute
          non-centrality parameter 'delta' below this value are
          considered to be too small for independent estimation; these
          components and their corresponding 'p1' are pooled with the
          null-component and 'p0', see Details.

     ...: additional arguments that are passed to 'optim' to control
          the optimization.

_D_e_t_a_i_l_s:

     The minimum parameter that needs to be specified is 'nq' - if
     nothing else is given, the proportions are equally distributed
     between 'p0' and the 'p1', and the noncentrality parameters are
     set up symmetrically around zero, e.g. 'nq=5' leads to equal
     proportions of 0.2 and noncentrality parameters -2, -1, 1, and 2.
     If any of 'p1', 'D', or 'delta' is specified, 'nq' is redundant
     and will be ignored (with a warning). 'tMixture' will in general
     make a valiant effort to deduce valid starting values from any
     combination of 'nq', 'p0', 'p1', 'D', and 'delta' specified by the
     user, and will complain if that is not possible. 

     The fitting problem that this function tries to solve is badly
     conditioned, and will in general depend on the precise set of
     starting values. Multiple runs from different starting values are
     usually a good idea. We have found however, that the model seems
     fairly robust towards misspecification of the number of
     components, at least when estimating 'p0'. What happens when too
     many components are specified is that some of the nominally
     noncentral t-distributions describing the behaviour of
     differentially expressed genes are fitted with noncentrality
     parameters very close to zero, and the true 'p0' gets spread out
     between the nominal 'p0' and the almost-central components. Adding
     up these different contributions usually gives a similar solution
     to re-fitting the model with fewer components. The cutoff for the
     size of  non-centrality parameters that can be estimated
     realistically is specified via 'threshold.delta', whose default
     value is based on a small simulation study reported in Pawitan et
     al. (2005); see Examples. (Note that the AIC can also be helpful
     in determining the number of components.)

_V_a_l_u_e:

     A list with the following components: 

  p0.est: the estimated proportion of non-differentially expressed
          genes, after collapsing components with estimated
          non-centrality sizes below 'threshold.delta'.

  p0.raw: the estimated proportion before collapsing the components.

      p1: the estimated proportions of differentially expressed genes
          corresponding to the effect sizes, relating to 'p0.raw'.

       D: effect sizes of the differentially expressed genes in
          multiples of the gene-by-gene standard deviation.

   delta: effect sizes of the differentially expressed genes expressed
          as the noncentrality parameter of the corresponding
          noncentral t-distribution.

     AIC: the AIC value for the maximum likelihood fit.

     opt: The output from 'optim', giving details about the
          optimization process.

_A_u_t_h_o_r(_s):

     Y. Pawitan and A. Ploner

_R_e_f_e_r_e_n_c_e_s:

     Pawitan Y, Krishna Murthy KR, Michiels S, Ploner A (2005) Bias in
     the estimation of false discovery rate in microarray studies,
     Bioinformatics.

_S_e_e _A_l_s_o:

     'tstatistics', 'EOC', 'optim'

_E_x_a_m_p_l_e_s:

     # We simulate a small example with 5 percent regulated genes and
     # a rather large effect size
     set.seed(2011)
     xdat = matrix(rnorm(50000), nrow=1000)
     xdat[1:25, 1:25] = xdat[1:25, 1:25] - 2
     xdat[26:50, 1:25] = xdat[26:50, 1:25] + 2
     grp = rep(c("Sample A","Sample B"), c(25,25))
     # Use a helper function for the test statistics
     myt = tstatistics(xdat, grp)$tstat
     r1 = tMixture(myt, n1=25, nq=3)
     r1

     # Equivalently, we could have specified the same set of starting values 
     # as follows:
     # r1 = tMixture(myt, n1=25, p0=1/3, p1=c(1/3, 1/3), delta=c(-1,1))

     # Alternative starting value for p0, other starting values are filled in
     r2 = tMixture(myt, n1=25, nq=3, p0=0.80)
     r2

     # Specification of too many components usually leads to spurious
     # noncentral components like here - note the difference between
     # p0.est and p0.raw!
     r3 = tMixture(myt, n1=25, nq=5)
     r3            

     # We simulate a data in a paired setting
     # Note the arrangement of the columns
     set.seed(2012)
     xdat = matrix(rnorm(50000), nrow=1000)
     ndx1 = seq(1,50, by=2)
     xdat[1:25, ndx1] = xdat[1:25, ndx1] - 2
     xdat[26:50, ndx1] = xdat[26:50, ndx1] + 2
     grp = rep(c("Sample A","Sample B"), 25)
     # Use a helper function for the test statistics
     myt = tstatistics(xdat, grp, paired=TRUE)$tstat
     r1p = tMixture(myt, n1=25, nq=3)
     r1p

