sam                 package:siggenes                 R Documentation

_S_i_g_n_i_f_i_c_a_n_c_e _A_n_a_l_y_s_i_s _o_f _M_i_c_r_o_a_r_r_a_y

_D_e_s_c_r_i_p_t_i_o_n:

     Performs a Significance Analysis of Microarrays (SAM). It is
     possible to perform one and two class analyses using either a
     modified t-statistic or a (standardized) Wilcoxon rank statistic,
     and a multiclass analysis using a modified F-statistic.  Moreover,
     this function provides a SAM procedure for categorical data such
     as SNP data and the possibility to employ a user-written score
     function.

_U_s_a_g_e:

       sam(data, cl, method = d.stat, delta = NULL, n.delta = 10, p0 = NA,
           lambda = seq(0, 0.95, 0.05), ncs.value = "max", ncs.weights = NULL,
           gene.names = dimnames(data)[[1]], q.version = 1, ...)

_A_r_g_u_m_e_n_t_s:

    data: a matrix, a data frame, or an ExpressionSet object. Each row
          of 'data' (or 'exprs(data)', respectively) must correspond to
          a variable (e.g., a gene), and each column to a sample (i.e.
          an observation).

          Can also be a list (if 'method = chisq.stat' or  'method =
          trend.stat'). For details on how to specify data in this
          case,  see 'chisq.stat'.

      cl: a vector of length 'ncol(data)' containing the class labels
          of the samples. In the two class paired case, 'cl' can also 
          be a matrix with 'ncol(data)' rows and 2 columns. If 'data'
          is an ExpressionSet object, 'cl' can also be a character
          string naming the column of 'pData(data)' that contains the
          class labels of the samples. If 'data' is a list, 'cl' need
          not be specified.

          In the one-class case, 'cl' should be a vector of 1's. 

          In the two class unpaired case, 'cl' should be a vector
          containing 0's (specifying the samples of, e.g., the control
          group) and 1's (specifying, e.g., the case group). 

          In the two class paired case, 'cl' can be either a numeric
          vector or a numeric matrix.  If it is a vector, then 'cl' has
          to consist of the integers between -1 and -n/2 (e.g., before
          treatment group) and between 1 and n/2 (e.g., after treatment
          group), where n is the length of 'cl' and k is paired with
          -k, k=1,...,n/2. If 'cl' is a matrix, one column should
          contain -1's and 1's specifying, e.g., the before and the
          after treatment samples, respectively, and the other column
          should contain integers between 1 and n/2 specifying the n/2
          pairs of observations.

          In the multiclass case and if 'method = chisq.stat', 'cl'
          should be a vector containing integers between 1 and g, where
          g is the number of groups. (In the case of 'chisq.stat', 'cl'
          need not be specified if 'data' is a list of groupwise
          matrices.)

          For examples of how 'cl' can be specified, see the manual of
          'siggenes'.
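
          As an illustration only, the label codings described above
          can be built in base R as follows. The sample size 'n', the
          number of groups and all object names are hypothetical and
          were chosen for this sketch.

```r
n <- 10  # hypothetical number of samples (columns of 'data')

# One-class case: a vector of 1's.
cl.one <- rep(1, n)

# Two class unpaired case: 0 = e.g. control group, 1 = e.g. case group.
cl.unpaired <- rep(c(0, 1), each = n / 2)

# Two class paired case, vector form: sample -k is paired with sample k.
cl.paired <- c(-(1:(n / 2)), 1:(n / 2))

# Two class paired case, matrix form: one column with -1/1 (before/after),
# the other column with the pair index 1, ..., n/2.
cl.paired.mat <- cbind(rep(c(-1, 1), each = n / 2), rep(1:(n / 2), 2))

# Multiclass case with g = 3 groups: integers between 1 and g.
cl.multi <- rep(1:3, length.out = n)
```

          Each vector has length 'n', and the matrix has 'n' rows and
          2 columns, matching the requirements stated for 'cl'.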

  method: a character string or a name specifying the method/function
          that should be used in the computation of the expression
          scores d. 

          If 'method = d.stat', a modified t-statistic or F-statistic,
          respectively, will be computed as proposed by Tusher et al.
          (2001). 

          If 'method = wilc.stat', a Wilcoxon rank sum statistic or
          Wilcoxon signed rank statistic will be used as expression
          score. 

          For an analysis of categorical data such as SNP data, 
          'method' can be set to 'chisq.stat'. In this case Pearson's
          ChiSquare statistic is computed for each row. 

          If the variables are ordinal and a trend test should be
          applied  (e.g., in the two-class case, the Cochran-Armitage
          trend test), 'method = trend.stat' can be employed.

          It is also possible to use a user-written function to
          compute the expression scores. For details, see 'Details'.

   delta: a numeric vector specifying a set of values for the threshold
          Delta that should be used. If 'NULL', 'n.delta' Delta values
          will be computed automatically.

 n.delta: a numeric value specifying the number of Delta values that
          will be computed over the range of all possible values for
          Delta if 'delta' is not specified.

      p0: a numeric value specifying the prior probability pi0  that a
          gene is not differentially expressed. If 'NA', 'p0' will be
          computed by the function 'pi0.est'.

  lambda: a numeric vector or value specifying the lambda values used
          in the estimation of the prior probability. For details, see
          '?pi0.est'.

ncs.value: a character string. Only used if 'lambda' is a vector.
          Either '"max"' or '"paper"'. For details, see '?pi0.est'.

ncs.weights: a numerical vector of the same length as 'lambda'
          containing the weights used in the estimation of pi0. By
          default no weights are used. For details, see '?pi0.est'.
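
          To give an intuition for the role of 'lambda', the sketch
          below computes the familiar fixed-lambda estimate of pi0,
          i.e. the fraction of p-values above lambda divided by
          (1 - lambda). This is an assumption-laden illustration in
          base R and not the code of 'pi0.est'; the helper name
          'pi0.fixed' and the simulated p-values are hypothetical. For
          what 'pi0.est' actually does with a vector of lambda values,
          see '?pi0.est'.

```r
set.seed(1)
# Simulated p-values: 900 from true nulls (uniform), 100 from alternatives.
p <- c(runif(900), rbeta(100, 0.5, 20))

# Hypothetical helper: fixed-lambda estimate of pi0, truncated at 1.
pi0.fixed <- function(p, lambda) {
  min(1, mean(p > lambda) / (1 - lambda))
}

# Estimates over a small grid of lambda values.
lambdas <- c(0, 0.25, 0.5, 0.75)
pi0.grid <- sapply(lambdas, function(l) pi0.fixed(p, l))
pi0.grid
```

          With 'lambda = 0' the estimate is always 1; larger values of
          lambda reduce the bias but increase the variance of the
          estimate.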

gene.names: a character vector of length 'nrow(data)' containing the
          names of the genes. By default the row names of 'data' are
          used.

q.version: a numeric value indicating which version of the q-value
          should be computed. If 'q.version = 2', the original version
          of the q-value, i.e. min{pFDR}, will be computed. If
          'q.version = 1', min{FDR} will be used in the calculation of
          the q-value. Otherwise, the q-value is not computed. For
          details, see '?qvalue.cal'.

     ...: further arguments of the specific SAM methods. If 'method =
          d.stat', see the help of 'd.stat'. If 'method = wilc.stat',
          see the help of 'wilc.stat'. If 'method = chisq.stat', see
          the help of 'chisq.stat'.

_D_e_t_a_i_l_s:

     'sam' provides SAM procedures for several types of analysis (one
     and two class analyses with either a modified t-statistic or a
     Wilcoxon rank statistic, a multiclass analysis with a modified F
     statistic, and an analysis of categorical data). It is, however,
     also  possible to write your own function for another type of
     analysis. The required arguments of this function must be 'data'
     and 'cl'. This function can also have other arguments. The output
     of this function must be a list containing the following objects:

     '_d': a numeric vector consisting of the expression scores of the
          genes.

     '_d._b_a_r': a numeric vector of the same length as 'na.exclude(d)'
          specifying the expected expression scores under the null
          hypothesis.

     '_p._v_a_l_u_e': a numeric vector of the same length as 'd' containing
          the raw, unadjusted p-values of the genes.

     '_v_e_c._f_a_l_s_e': a numeric vector of the same length as 'd' consisting
          of the one-sided numbers of falsely called genes, i.e. if d >
          0 the numbers of genes expected to be larger than d under the
          null hypothesis, and if d<0, the number of genes expected to
          be smaller than d under the null hypothesis.

     '_s': a numeric vector of the same length as 'd' containing the
          standard deviations  of the genes. If no standard deviation
          can be calculated, set 's = numeric(0)'.

     '_s_0': a numeric value specifying the fudge factor. If no fudge
          factor is calculated, set 's0 = numeric(0)'.

     '_m_a_t._s_a_m_p': a matrix with B rows and 'ncol(data)' columns, where B
          is the number of permutations, containing the permutations
          used in the computation of the permuted d-values. If such a
          matrix is not computed, set 'mat.samp = matrix(numeric(0))'.

     '_m_s_g': a character string or vector containing information about,
          e.g., which type of analysis has been performed. 'msg' is
          printed when the function 'print' or  'summary',
          respectively, is called. If no such message should be
          printed, set 'msg = ""'.

     '_f_o_l_d': a numeric vector of the same length as 'd' consisting of
          the fold  changes of the genes. If no fold change has been
          computed, set 'fold = numeric(0)'.

     If this function is, e.g., called 'foo', it can be used by setting
     'method = foo' in 'sam'. More detailed information and an example
     will be contained in the siggenes manual.
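
     A minimal sketch of such a user-written function is given below.
     It is not taken from the siggenes manual: the function name
     'welch.stat', its extra arguments 'B' and 'seed', the choice of an
     ordinary (unmodified) Welch t-statistic with 0/1 class labels, and
     the way the permutation null is summarised are all assumptions
     made for illustration. It only shows how the list components
     described above could be filled in, and it assumes that no score
     is missing.

```r
# Hypothetical score function: ordinary Welch t-statistic for 0/1 labels
# with a permutation-based null distribution.
welch.stat <- function(data, cl, B = 100, seed = 123) {
  data <- as.matrix(data)
  # Row-wise Welch t-statistic for a given labelling.
  t.row <- function(x, labels) {
    x0 <- x[, labels == 0, drop = FALSE]
    x1 <- x[, labels == 1, drop = FALSE]
    se <- sqrt(apply(x0, 1, var) / ncol(x0) + apply(x1, 1, var) / ncol(x1))
    (rowMeans(x1) - rowMeans(x0)) / se
  }
  d <- t.row(data, cl)
  set.seed(seed)
  # B x ncol(data) matrix; each row is one permutation of the labels.
  mat.samp <- t(replicate(B, sample(cl)))
  # nrow(data) x B matrix of permuted scores.
  d.perm <- apply(mat.samp, 1, function(labels) t.row(data, labels))
  # Expected ordered scores: average the sorted scores over permutations.
  d.bar <- rowMeans(apply(d.perm, 2, sort))
  # Two-sided raw p-values from the pooled permuted scores.
  p.value <- sapply(abs(d), function(x) mean(abs(d.perm) >= x))
  # One-sided expected numbers of falsely called genes.
  vec.false <- sapply(d, function(x) {
    if (x > 0) sum(d.perm > x) / B else sum(d.perm < x) / B
  })
  list(d = d, d.bar = d.bar, p.value = p.value, vec.false = vec.false,
       s = numeric(0), s0 = numeric(0), mat.samp = mat.samp,
       msg = "Welch t-statistic (user-written sketch)\n",
       fold = numeric(0))
}

# Small simulated data set: 20 variables (rows), 10 samples (columns).
set.seed(42)
x <- matrix(rnorm(200), nrow = 20, ncol = 10)
out <- welch.stat(x, rep(c(0, 1), each = 5), B = 20)
names(out)
```

     Following the convention described above, such a function would
     then be used by setting 'method = welch.stat' in 'sam'.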

_V_a_l_u_e:

     An object of class SAM.

_N_o_t_e:

     SAM was developed by Tusher et al. (2001).

     !!! There is a patent pending for the SAM technology at Stanford
     University. !!!

_A_u_t_h_o_r(_s):

     Holger Schwender, holger.schw@gmx.de

_R_e_f_e_r_e_n_c_e_s:

     Schwender, H., Krause, A., and Ickstadt, K. (2006). Identifying
     Interesting Genes with siggenes. _RNews_, 6(5), 45-50.

     Schwender, H. (2004). Modifying Microarray Analysis Methods for
     Categorical Data - SAM and PAM for SNPs. To appear in:
     _Proceedings of the 28th Annual Conference of the GfKl_.

     Tusher, V.G., Tibshirani, R., and Chu, G. (2001). Significance
     analysis of microarrays applied to the ionizing radiation
     response. _PNAS_, 98, 5116-5121.

_S_e_e _A_l_s_o:

     'SAM-class', 'd.stat', 'wilc.stat', 'chisq.stat'

_E_x_a_m_p_l_e_s:

     ## Not run: 
       # Load the package multtest and the data of Golub et al. (1999)
       # contained in multtest.
       library(multtest)
       data(golub)
       
       # golub.cl contains the class labels.
       golub.cl

       # Perform a SAM analysis for the two class unpaired case assuming
       # unequal variances.
       sam.out <- sam(golub, golub.cl, B=100, rand=123)
       sam.out
       
       # Obtain the Delta plots for the default set of Deltas
       plot(sam.out)
       
       # Generate the Delta plots for Delta = 0.2, 0.4, 0.6, ..., 2
       plot(sam.out, seq(0.2, 2, 0.2))
       
       # Obtain the SAM plot for Delta = 2
       plot(sam.out, 2)
       
       # Get information about the genes called significant using
       # Delta = 3 (since neither the gene names nor the chip type
       # has been specified, ll is set to FALSE to avoid a warning)
       sam.sum3 <- summary(sam.out, 3, ll=FALSE)
       
       # Obtain the rows of golub containing the genes called
       # differentially expressed
       sam.sum3@row.sig.genes
       
       # and their names
       golub.gnames[sam.sum3@row.sig.genes, 3] 

       # The matrix containing the d-values, q-values etc. of the
       # differentially expressed genes can be obtained by
       sam.sum3@mat.sig
       
       # Perform a SAM analysis using Wilcoxon rank sums
       sam(golub, golub.cl, method="wilc.stat", rand=123)
         

       # Now consider only the first ten columns of the Golub et al. (1999)
       # data set. For now, let's assume the first five columns were
       # before treatment measurements and the next five columns were
       # after treatment measurements, where columns 1 and 6, columns 2
       # and 7, ..., each form a pair. In this case, the class labels
       # would be
       new.cl <- c(-(1:5), 1:5)
       new.cl
       
       # and the corresponding SAM analysis for the two-class paired
       # case would be
       sam(golub[,1:10], new.cl, B=100, rand=123)
       
       # Another way of specifying the class labels for the above paired
       # analysis is
       mat.cl <- matrix(c(rep(c(-1, 1), each=5), rep(1:5, 2)), 10)
       mat.cl
       
       # and the above SAM analysis can also be done by
       sam(golub[,1:10], mat.cl, B=100, rand=123)
     ## End(Not run)

