GSNormalize              package:GSEAlm              R Documentation

_A_g_g_r_e_g_a_t_i_n_g _a_n_d _c_a_l_c_u_l_a_t_i_n_g _e_x_p_r_e_s_s_i_o_n _s_t_a_t_i_s_t_i_c_s _b_y _G_e_n_e _S_e_t

_D_e_s_c_r_i_p_t_i_o_n:

     Provides an interface for producing aggregate gene-set statistics
     for gene-set-enrichment analysis (GSEA).  The function is best
     suited for mean or rescaled-mean GSEA approaches, but is intended
     to be generic enough to enable other approaches as well.

_U_s_a_g_e:

     GSNormalize(dataset, incidence, gseaFun = crossprod, fun1 = "/", fun2 = sqrt, removeShift=FALSE, removeStat=mean, ...)
     identity(x)
     one(x)

_A_r_g_u_m_e_n_t_s:

 dataset: a numeric matrix, typically of some gene-level statistics 

incidence: 0/1 incidence matrix indicating genes' membership in
          gene-sets, with gene-sets as rows and genes as columns

 gseaFun: function name for the type of aggregation to take place,
          defaults to 'crossprod'. See 'Details' 

    fun1: function name for normalization, defaults to "/". See
          'Details' 

    fun2: function name for scaling, defaults to 'sqrt'. See 'Details'

removeShift: logical: should normalization begin with a column-wise
          removal of the mean shift?

removeStat: (if above is TRUE) the column-wise statistic to be swept
          out of 'dataset'.

     ...: Additional arguments optionally passed on to 'gseaFun'.

       x: any numerical value

_D_e_t_a_i_l_s:

     In gene-set-enrichment analysis (GSEA), the core step is
     aggregating (or calculating) gene-set-level statistics from
     gene-level statistics. This utility performs that step. It is
     tailored specifically for rescaled sums of the type suggested by
     Jiang and Gentleman (2007), but is designed as a generic template
     that should accommodate other GSEA approaches.  In such cases,
     users currently need to provide their own version of 'gseaFun'.

     The default will generate sums of gene-level values divided by the
     square-root of the gene-set size (in other words, gene-set means
     multiplied by the square-root of gene-set size). The arithmetic
     works like this:

     gene-set stat = gseaFun(t(incidence), dataset, ...) 'fun1'
     fun2(gene-set size).
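
     The default arithmetic can be traced by hand on a toy example.
     The following is a minimal sketch in base R that assumes the
     computation is exactly as described above; it does not call the
     package, and the objects 'stats' and 'inc' are made up for
     illustration.

```r
# Toy data: 4 genes, 2 gene-sets, one column of gene-level statistics.
stats <- matrix(c(1, 2, 3, 4), ncol = 1,
                dimnames = list(paste0("g", 1:4), "tstat"))
inc <- matrix(c(1, 1, 0, 0,
                0, 1, 1, 1), nrow = 2, byrow = TRUE,
              dimnames = list(c("setA", "setB"), paste0("g", 1:4)))

set_sums  <- crossprod(t(inc), stats)    # gseaFun = crossprod; equals inc %*% stats
set_sizes <- rowSums(inc)                # number of genes in each set
set_stats <- set_sums / sqrt(set_sizes)  # fun1 = "/", fun2 = sqrt

# Equivalent view: gene-set mean multiplied by sqrt(gene-set size).
alt_stats <- (set_sums / set_sizes) * sqrt(set_sizes)
```

     Here 'setA' sums to 3 over 2 genes and 'setB' sums to 9 over 3
     genes, so the rescaled statistics are 3/sqrt(2) and 9/sqrt(3).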

     If there is a known (or suspected) overall baseline shift (i.e.,
     the mass of gene-level stats is not centered around zero), it may
     be scientifically more meaningful to look for gene-sets deviating
     from this baseline rather than from zero. In this case, you can
     set 'removeShift=TRUE'.
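
     The effect of shift removal on the gene-level input can be
     sketched in base R. This assumes that "sweeping out" means
     subtracting the column-wise statistic from every entry of that
     column; the matrix 'stats' is made up for illustration and the
     package itself is not called.

```r
# Toy gene-level statistics: 4 genes, 2 columns, neither centered on zero.
stats <- matrix(c(2, 3, 4, 7,
                  -1, 0, 1, 4), ncol = 2,
                dimnames = list(paste0("g", 1:4), c("s1", "s2")))

col_shift <- apply(stats, 2, mean)       # plays the role of removeStat = mean
centered  <- sweep(stats, 2, col_shift)  # subtract each column's statistic

colMeans(centered)                       # each column is now centered on zero
```

     Passing a different statistic (for instance 'median') in place of
     'mean' would center each column on that statistic instead.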

     Also provided are the 'identity' function (identity = function(x)
     x), so that leaving 'gseaFun' and 'fun1' at their defaults and
     setting 'fun2 = identity' will generate gene-set means, and the
     'one' function, which neutralizes the effect of both 'fun1' and
     'fun2' (see Note below).
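
     How the choice of 'fun2' changes the result can be illustrated
     with a small stand-in. The helper 'gs_sketch' below is
     hypothetical and only mirrors the arithmetic described in this
     section; 'one' is redefined locally to match its documented
     behaviour (it returns the number 1).

```r
# Hypothetical stand-in for the documented arithmetic; not the package code.
gs_sketch <- function(stats, inc, fun1 = "/", fun2 = sqrt) {
  agg   <- crossprod(t(inc), stats)  # gene-set sums (sets x columns)
  sizes <- rowSums(inc)              # gene-set sizes
  match.fun(fun1)(agg, fun2(sizes))
}

one <- function(x) 1                 # mirrors the documented 'one'

stats <- matrix(c(1, 2, 3, 4), ncol = 1,
                dimnames = list(paste0("g", 1:4), "tstat"))
inc <- matrix(c(1, 1, 0, 0,
                0, 1, 1, 1), nrow = 2, byrow = TRUE,
              dimnames = list(c("setA", "setB"), paste0("g", 1:4)))

set_means <- gs_sketch(stats, inc, fun2 = identity)  # sums / sizes: 1.5, 3
set_sums  <- gs_sketch(stats, inc, fun2 = one)       # sums / 1:     3, 9
```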

_V_a_l_u_e:

     'GSNormalize' returns a matrix with the same number of rows as
     'incidence' and the same number of columns as 'dataset' (if
     'dataset' is a vector, the output will be a vector as well). The
     respective row and column names will carry through from 'dataset'
     and 'incidence' to the output.

     'identity' simply returns x. 'one' returns the number 1.

_N_o_t_e:

     If you want to create your own GSEA function for 'gseaFun', note
     that it should receive the transposed incidence matrix as its
     first argument, and the gene-level stats as its second argument.
     In other words, both should have genes as rows. Also, you can
     easily neutralize the effect of 'fun1' and 'fun2' by setting
     'fun2 = one'.
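
     A user-supplied aggregator following this calling convention
     might look like the sketch below. The function 'gsMedian' is
     hypothetical (it is not part of the package) and computes per-set
     medians; the toy objects 'stats' and 'inc' are made up for
     illustration.

```r
# Hypothetical custom aggregator: median of gene-level stats within each set.
# tinc:  transposed incidence matrix (genes x gene-sets)
# stats: gene-level statistics (genes x columns)
gsMedian <- function(tinc, stats) {
  stats <- as.matrix(stats)
  out <- vapply(seq_len(ncol(tinc)), function(j) {
    apply(stats[tinc[, j] == 1, , drop = FALSE], 2, median)
  }, numeric(ncol(stats)))
  # vapply returns a plain vector when stats has one column; restore the shape
  out <- t(matrix(out, nrow = ncol(stats)))
  dimnames(out) <- list(colnames(tinc), colnames(stats))
  out                                   # gene-sets x columns
}

stats <- matrix(c(1, 2, 3, 4,
                  10, 0, 5, 20), ncol = 2,
                dimnames = list(paste0("g", 1:4), c("s1", "s2")))
inc <- matrix(c(1, 1, 0, 0,
                0, 1, 1, 1), nrow = 2, byrow = TRUE,
              dimnames = list(c("setA", "setB"), paste0("g", 1:4)))

set_medians <- gsMedian(t(inc), stats)
```

     With the package loaded, such a function would be supplied as
     'gseaFun = gsMedian', together with 'fun2 = one' so that the
     medians are not rescaled by gene-set size.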

_A_u_t_h_o_r(_s):

     Assaf Oron

_R_e_f_e_r_e_n_c_e_s:

     Z. Jiang and R. Gentleman, "Extensions to Gene Set Enrichment
     Analysis", Bioinformatics 23, 306-313, 2007.

_S_e_e _A_l_s_o:

     'gsealmPerm', which relies heavily on this function. The function 
     'applyByCategory' from the 'Category' package has similar
     functionality and is preferable when the applied function is
     complicated. 'GSNormalize' is better optimized for matrix
     operations.

_E_x_a_m_p_l_e_s:

     data(sample.ExpressionSet)
     lm1 = lmPerGene(sample.ExpressionSet,~sex+type)

     ### Generating random pseudo-gene-sets
     fauxGS=matrix(sample(c(0,1),size=50000,replace=TRUE,prob=c(.9,.1)),nrow=100)

     ### "tau-stats" for gene-SET-level type effect, adjusting for sex
     fauxEffects=GSNormalize(lm1$coefficients[3,]/sqrt(lm1$coef.var[3,]),incidence=fauxGS)

     qqnorm(fauxEffects)
     ### diagonal line represents zero-shift null; note that it doesn't fit
     abline(0,1,col=2)
     ### a better option may be to run a diagonal through the middle of the
     ### data (nonzero-shift null, i.e. type may have an effect but it is the
     ### same for all gene-sets); note that any outlier that shows up is
     ### purely random, since the gene-sets were generated at random

     abline(median(fauxEffects),1,col=4)

     #### Now try with baseline-shift removal

     fauxEffects=GSNormalize(lm1$coefficients[3,]/sqrt(lm1$coef.var[3,]),incidence=fauxGS,removeShift=TRUE)

     qqnorm(fauxEffects)
     abline(0,1,col=2)

