samplesize              package:OCplus              R Documentation

_F_D_R _a_s _a _f_u_n_c_t_i_o_n _o_f _s_a_m_p_l_e _s_i_z_e

_D_e_s_c_r_i_p_t_i_o_n:

     This function tabulates the false discovery rate (FDR) for
     selecting differentially expressed genes as a function of sample
     size and cutoff level. The same information can also be displayed
     as a plot.

_U_s_a_g_e:

     samplesize(n = seq(5, 50, by = 5), p0 = 0.99, sigma = 1, D, F0, F1,
                paired = FALSE, crit, crit.style = c("top percentage", "cutoff"),
                plot = TRUE, local.show = FALSE, nplot = 100, ylim = c(0, 1), main,
                legend.show = FALSE, grid.show = FALSE, ...)

_A_r_g_u_m_e_n_t_s:

       n: sample size (as subjects per group)

      p0: the proportion of non-differentially expressed genes

   sigma: the standard deviation for the log expression values

       D: assumed average log fold change (in units of 'sigma'), by
          default 1; this is a shortcut for specifying a simple
          symmetrical alternative hypothesis through 'F1'.

      F0: the distribution of the log2 expression values under the null
          hypothesis; by default, this is normal with mean zero and
          standard deviation 'sigma', but mixtures of normals can be
          specified, see Details and Examples.

      F1: the distribution of the log2 expression values under the
          alternative hypothesis; by default, this is an equal mixture
          of two normals with means 'D' and -'D' and standard
          deviation 'sigma'; mixtures of normals are again possible, see
          Details and Examples.

  paired: logical value indicating whether this is the independent
          sample case (default) or the paired sample case.

    crit: a vector of cutoff values for selecting differentially
          expressed genes; the interpretation depends on 'crit.style'.

crit.style: indicates how differentially expressed genes are selected:
          either by a fixed cutoff level for the absolute value of the
          t-statistic ('cutoff') or as a fixed top percentage of genes
          ranked by absolute t-statistic ('top percentage', the default).

    plot: logical value indicating whether to draw the plot

local.show: logical value indicating whether to show the local rather
          than the global false discovery rate (default: global).

   nplot: number of points that are evaluated for the curves

    ylim: the usual limits on the vertical axis

    main: the main title of the plot

legend.show: logical value indicating whether to show a legend for the 
          types of gene selection in the plot

grid.show: logical value indicating whether to draw grid lines showing
          the sample sizes 'n' to be tabulated in the plot

     ...: the usual graphical parameters, passed to 'plot'

_D_e_t_a_i_l_s:

     This function plots the FDR as a function of the sample size when
     comparing the expression of multiple genes between two groups of
     subjects. This is based on a model assuming that a proportion 'p0'
     of genes is not differentially expressed (regulated) between
     groups, and that the remaining proportion 1-'p0' is. The logarithmized
     gene expression values of regulated and non-regulated genes are assumed
     to be generated by mixtures of normal distributions; these
     mixtures can be specified through the parameters 'F0', 'F1' or
     'D', and 'sigma'; please see 'TOC' for details on the model and
     the specification of the mixtures. By default, the null
     distribution of the log expression values is a normal centered on
     zero, and the alternative an equal mixture of normals centered at
     '+D' and '-D'. 

     The list of nominally differentially expressed genes can be
     selected in two ways:

     *  all genes with absolute t-statistic larger than the specified
        critical cutoff values ('cutoff'),

     *  the specified top percentage of genes, ranked by absolute
        t-statistic ('top percentage').

     Multiple critical values correspond to multiple curves, each
     labeled by the critical value, but only one value can be
     specified for the proportion of non-regulated genes 'p0' and
     the standard deviation 'sigma'.
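
     The model and the two selection rules described here can be
     illustrated with a small Monte Carlo sketch in base R. This is
     not how 'samplesize' computes its values; 'sim_fdr' and its
     arguments 'm' (number of simulated genes) and 'seed' are
     hypothetical names introduced for illustration, and a single
     'crit' value with the default symmetric alternative is assumed.

```r
# Monte Carlo sketch of the model; NOT the OCplus implementation.
# sim_fdr(), m and seed are hypothetical names used for illustration only.
sim_fdr <- function(n, p0 = 0.99, sigma = 1, D = 1, crit = 0.01,
                    crit.style = c("top percentage", "cutoff"),
                    m = 20000, seed = 1) {
  crit.style <- match.arg(crit.style)
  set.seed(seed)
  # TRUE marks non-regulated genes (expected proportion p0)
  is_null <- runif(m) < p0
  # True group difference: 0 under the null, +/- D * sigma otherwise
  delta <- ifelse(is_null, 0, sample(c(-1, 1), m, replace = TRUE) * D * sigma)
  # One row per gene, n subjects per group; delta is recycled down the rows
  x <- matrix(rnorm(m * n, sd = sigma), nrow = m)
  y <- matrix(rnorm(m * n, sd = sigma), nrow = m) + delta
  # Pooled-variance two-sample t-statistic per gene (equal group sizes)
  vx <- rowSums((x - rowMeans(x))^2) / (n - 1)
  vy <- rowSums((y - rowMeans(y))^2) / (n - 1)
  tstat <- (rowMeans(y) - rowMeans(x)) / sqrt((vx + vy) / n)
  selected <- if (crit.style == "cutoff") {
    abs(tstat) > crit                             # fixed cutoff on |t|
  } else {
    abs(tstat) >= quantile(abs(tstat), 1 - crit)  # top fraction of |t|
  }
  # Proportion of non-regulated genes among the selected genes
  sum(selected & is_null) / max(sum(selected), 1)
}
```

     For instance, 'sapply(c(5, 10, 20, 40), sim_fdr)' returns one
     simulated FDR per sample size, which can be set against the
     tabulated values, keeping Monte Carlo error in mind.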

_V_a_l_u_e:

     A matrix with rows corresponding to elements of 'n' and columns
     corresponding to the specified critical values is returned. The
     matrix has the attribute 'param' that contains the specified
     arguments, see Examples.

_N_o_t_e:

     Both the curve labels and the legend may be squashed if the
     plotting device is too small. Increasing the size of the device
     and re-plotting should improve readability.

_A_u_t_h_o_r(_s):

     Y. Pawitan and A. Ploner

_R_e_f_e_r_e_n_c_e_s:

     Pawitan Y, Michiels S, Koscielny S, Gusnanto A, Ploner A (2005)
     False Discovery Rate, Sensitivity and Sample Size for Microarray
     Studies. _Bioinformatics_, 21, 3017-3024.

     Jung SH (2005) Sample size for FDR-control in microarray data 
     analysis. _Bioinformatics_, 21, 3097-3104.

_S_e_e _A_l_s_o:

     'FDR', 'TOC', 'EOC'

_E_x_a_m_p_l_e_s:

     # Default assumes a proportion of 0.01 regulated genes equally split
     # between two-fold up- and down-regulated
     # We select the top 3, 2 and 1 percent of genes by absolute t-statistic
     samplesize(crit=c(0.03,0.02, 0.01))

     # Same model, but using a hard cutoff for the t-statistics
     samplesize(crit=2:4, crit.style="cutoff")

     # Paired test of the same size has slightly better FDR (as expected)
     samplesize(paired=TRUE)

     # Compare the effect of p0 and effect size
     par(mfrow=c(2,2))
     samplesize(crit=c(0.03,0.02, 0.01), p0=0.95, D=1)
     samplesize(crit=c(0.03,0.02, 0.01), p0=0.99, D=1)
     samplesize(crit=c(0.03,0.02, 0.01), p0=0.95, D=2)
     samplesize(crit=c(0.03,0.02, 0.01), p0=0.99, D=2)

     # An asymmetric alternative distribution: 20 percent of the regulated genes 
     # are expected to be (at least) four-fold up regulated
     # NB, no graphical output
     ret <- samplesize(F1=list(D=c(-1,1,2), p=c(2,2,1)), p0=0.95, crit=0.05, plot=FALSE)
     ret
     # Look at the parameters
     attr(ret, "param")
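
     # A sketch of reading a sample size off the returned matrix; it relies
     # only on the documented layout (rows follow 'n', columns follow 'crit').
     # The names 'n.grid' and 'fdr' and the 10 percent FDR target are
     # illustrative assumptions, not part of the package.
     n.grid <- seq(5, 50, by = 5)
     fdr <- samplesize(n=n.grid, crit=0.01, plot=FALSE)
     # Sample sizes on the grid whose tabulated FDR is at most 0.1
     n.grid[fdr[, 1] <= 0.1]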

     # A wide null distribution that allows disregarding genes with small effects
     # Here: |log2 fold change| < 0.25, i.e. fold change of less than 19 percent
     samplesize(F0=list(D=c(-0.25,0,0.25)), grid.show=TRUE)

     # This is close to Example 3 in Jung's paper (see References):
     # p0=0.99 and sensitivity=0.6, so we want a rejection rate of
     # around 0.006 from the top list.
     # Here we require around 40 arrays/group, compared to
     # around 37 in Jung's paper, most likely because we use
     # the t-distribution instead of the normal. Jung's alternative
     # is only one-sided, so the exact correspondence is:
     samplesize(p0=0.99, crit.style="top percentage", crit=0.006, F1=list(D=1, p=1), grid.show=TRUE)
     abline(h=0.01)

     # The result is very close to that for the symmetric alternative:
     samplesize(p0=0.99, crit=0.006, D=1, grid.show=TRUE, ylim=c(0,0.9))

