compareGOProfiles         package:goProfiles         R Documentation

_C_o_m_p_a_r_i_s_o_n _o_f _l_i_s_t_s _o_f _g_e_n_e_s _t_h_r_o_u_g_h _t_h_e_i_r _f_u_n_c_t_i_o_n_a_l _p_r_o_f_i_l_e_s

_D_e_s_c_r_i_p_t_i_o_n:

     Compare two samples of genes in terms of their GO profiles 'pn'
     and 'qm'. Both samples may share a common subsample of genes, with
     GO profile 'pqn0'. 'compareGOProfiles' implements some inferential
     procedures based on asymptotic properties of the squared euclidean
     distance between  the contracted versions of pn and qm

_U_s_a_g_e:

                                     
     compareGOProfiles(pn, qm = NULL, pqn0 = NULL, n = ngenes(pn), m = ngenes(qm), n0 = ngenes(pqn0), method = "lcombChisq", 
     ab.approx = "asymptotic", confidence = 0.95, nsims = 10000, simplify = T, ...)

_A_r_g_u_m_e_n_t_s:

      pn: an object of class ExpandedGOProfile representing one or more
          "sample" expanded GO profiles for a fixed ontology (see the
          'Details' section)

      qm: an object of class ExpandedGOProfile representing one or more
          "sample" expanded GO profiles for a fixed ontology (see the
          'Details' section)

    pqn0: an object of class ExpandedGOProfile representing one or more
          "sample" expanded GO profiles for a fixed ontology (see the
          'Details' section)

       n: a numeric vector with the number of genes profiled in each
          column of pn. This parameter is included to allow the
          possibility of exploring the consequences of varying sample
          sizes, other than the true sample size in pn.

       m: a numeric vector with the number of genes profiled in each
          column of qm.

      n0: a numeric vector with the number of genes profiled in each
          column of pqn0.

  method: the approximation method to the sampling distribution under
          the null hypothesis specifying that the samples pn and qm
          come from the same population. See the 'Details' section
          below

confidence: the confidence level of the confidence interval in the
          result

ab.approx: the approximation used for computing 'a' and 'b'
          coefficients (see details)

   nsims: some inferential methods require a simulation step; the
          number of simulation replicates is specified with this
          parameter

simplify: should the result be simplified, if possible? See the
          'Details' section

     ...: Other arguments needed

_D_e_t_a_i_l_s:

     An object of S3 class 'ExpandedGOProfile' is, essentially, a
     'data.frame' object with each column representing the relative
     frequencies in all observed node combinations, resulting from
     profiling a set of genes, for a given and fixed ontology. The
     row.names attribute codifies the node combinations and each
     data.frame column (say, each profile) has an attribute, 'ngenes',
     indicating the number of profiled genes. The arguments 'pn', 'qm'
     and 'pqn0' are compared in a column by column wise,  recycling
     columns, if necessary, in order to perform
     max(ncol(pn),ncol(qm),ncol(pqn0)) comparisons (each comparison
     resulting in an object of class 'GOProfileHtest', an
     specialization of 'htest').   In order to be properly compared,
     these arguments are expanded by row, according to their row names.
     That is, the data arguments can have unequal row numbers. Then,
     they are expanded adding rows with zero frequencies, in order to
     make them comparable.

     In the i-th comparison (i from 1 to
     max(ncol(pn),ncol(qm),ncol(pqn0))), the parameters n, m and n0 are
     included to allow the possibility of exploring the consequences of
     varying sample sizes, other than the true sample sizes included as
     an attribute in pn, qm and pqn0.

     When qm = NULL, the genes profiled in pn are compared with a
     subsample of them, those profiled in pqn0 (compare a set of genes
     with a restricted subset, e.g. those overexpressed under a
     disease). In this case we take qm=pqn0. When pqn0 = NULL, two
     profiles with no genes in common are compared.

     Let Pn and Qm correspond to the contracted functional profiles
     (the total counts or relative frequencies of hits in each one of
     the s GO categories being compared) obtained from pn and qm. If P
     stands for the "population" profile originating the sample profile
     Pn[,j], Q for the profile originating Qm[,j] and d(,) for the
     squared euclidean distance, if P != Q, the distribution of
     sqrt(nm/(n+m))(d(Pn[,j],Qm[,j]) - d(P,Q))/se(d) is approximately
     standard normal, N(0,1). This provides the basis for the
     confidence interval in the result field icDistance. When P=Q, the
     asymptotic distribution of (nm/(n+m)) d(Pn[,j],Qm[,j]) corresponds
     to the distribution of a mixture of independent chi-square random
     variables, each one with one degree of freedom. The sampling
     distribution under H0 P=Q may be directly computed from this
     distribution (approximating it by simulation)
     (method="lcombChisq") or by a chi-square approximation to it,
     based on two correcting constants a and b (method="chi-square").
     These constants are chosen to equate the first two moments of both
     distributions (the linear combination of chi-square random
     variables distribution and the approximating chi-square
     distribution). When method="chi-square", the returned test
     statistic value is the chi-square approximation (n
     d(pn[,j],qm[,j]) - b) / a. Then, the result field 'parameter' is a
     vector containing the 'a' and 'b' values and the number of degrees
     of freedom, 'df'. Otherwise, the returned test statistic value is
     (nm/(n+m)) d(Pn[,j],Qm[,j]) and 'parameter' contains the
     coefficients of the linear combination of chi-squares.

_V_a_l_u_e:

     A list containing max(ncol(pn),ncol(qm),ncol(pqn0)) objects of
     class 'GOProfileHtest', directly inheriting from 'htest' or a
     single 'GOProfileHtest' object if
     max(ncol(pn),ncol(qm),ncol(pqn0))==1 and simplify == T. Each
     object of class 'GOProfileHtest' has the following fields: 

profilePn: the first contracted profile to compute the squared
          Euclidean distance

profileQm: the second contracted profile to compute the squared
          Euclidean distance

statistic: test statistic; its meaning depends on the value of
          "method", see the 'Details' section.

parameter: parameters of the sample distribution of the test statistic,
          see the 'Details' section.

 p.value: associated p-value to test the null hypothesis of profiles
          equality.

conf.int: asymptotic confidence interval for the squared euclidean
          distance. Its attribute "conf.level" contains its nominal
          confidence level.

estimate: squared euclidean distance between the contracted profiles.
          Its attribute "se" contains its standard error estimate.

  method: a character string indicating the method used to perform the
          test.

data.name: a character string giving the names of the data.

alternative: a character string describing the alternative hypothesis
          (always 'true squared Euclidean distance between the
          contracted profiles is greater than zero'

_A_u_t_h_o_r(_s):

     Jordi Ocana

_R_e_f_e_r_e_n_c_e_s:

     Sanchez-Pla, A., Salicru M. and Ocana, J. Statistical methods for
     the analysis of highthroughput data based on functional profiles
     derived from the gene ontology. Journal of Statistical Planning
     and Inference, 2007.

_S_e_e _A_l_s_o:

     fitGOProfile, equivalentGOProfiles

_E_x_a_m_p_l_e_s:

     data(sampleProfiles)
     comparedMF <-compareGOProfiles (pn=expandedWelsh01[['MF']], 
                               qm  = expandedSingh01[['MF']], 
                               pqn0= commonExpandedWelsh01Singh01[['MF']])
     print(comparedMF)
     print(compSummary(comparedMF))

