iterateBMAglm.train       package:iterativeBMA       R Documentation

_I_t_e_r_a_t_i_v_e _B_a_y_e_s_i_a_n _M_o_d_e_l _A_v_e_r_a_g_i_n_g: _t_r_a_i_n_i_n_g _s_t_e_p

_D_e_s_c_r_i_p_t_i_o_n:

     Classification and variable selection on microarray data. This is
     a multivariate technique to select a small number of relevant
     variables (typically genes) to classify microarray samples.  This
     function performs the training phase. The data is assumed to
     consist of two classes. Logistic regression is used for
     classification.

_U_s_a_g_e:

     iterateBMAglm.train (train.expr.set, train.class, p=100, nbest=10, maxNvar=30, maxIter=20000, thresProbne0=1)

_A_r_g_u_m_e_n_t_s:

train.expr.set: an 'ExpressionSet' object. The rows of the expression
          data are assumed to represent variables (genes), and the
          columns to represent samples or experiments. This training
          data is used to select relevant genes (variables) for
          classification.

train.class: class vector for the observations (samples or
          experiments) in the training data.  Class numbers are assumed
          to start from 0, and the length of this class vector should
          equal the number of samples (columns) in 'train.expr.set'.
          Since two-class data is assumed, the class vector should
          consist of zeros and ones.

       p: a number indicating the maximum number of top univariate
          genes used in the iterative BMA algorithm.  This number is
          assumed to be less than the total number of genes in the
          training data. A larger p usually requires longer
          computational time as more iterations of the BMA algorithm
          are potentially applied. The default is 100.

   nbest: a number specifying the number of models of each size 
          returned to 'bic.glm' in the 'BMA' package.  The default is
          10.

 maxNvar: a number indicating the maximum number of variables used in
          each iteration of 'bic.glm' from the 'BMA' package. The
          default is 30.

 maxIter: a number indicating the maximum number of iterations of
          'bic.glm'.  The default is 20000.

thresProbne0: a number specifying the threshold (in percent) for the
          posterior probability that the coefficient of each variable
          (gene) is non-zero.  Variables (genes) whose posterior
          probability is below this threshold are dropped in the
          iterative application of 'bic.glm'.  The default is 1
          percent.
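
     As a minimal illustration of the 'thresProbne0' filtering, the
     sketch below applies the threshold to a made-up vector of
     posterior probabilities (in percent).  The gene names and values
     are hypothetical and are not output of 'bic.glm'.

```r
# Hypothetical posterior probabilities (in percent), shaped like the
# 'probne0' component of a 'bic.glm' result; the values are invented.
probne0 <- c(geneA = 100.0, geneB = 0.4, geneC = 37.2, geneD = 0.0)
thresProbne0 <- 1  # threshold in percent (the documented default)

# Genes whose posterior probability falls below the threshold are dropped.
keep <- probne0 >= thresProbne0
kept.genes <- names(probne0)[keep]
dropped.genes <- names(probne0)[!keep]
print(kept.genes)
```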

_D_e_t_a_i_l_s:

     The training phase first orders all the variables (genes) by a
     univariate measure, the ratio of between-group to within-group
     sums of squares (BSS/WSS), and then iteratively applies the
     'bic.glm' algorithm from the 'BMA' package.  The first
     application of 'bic.glm' uses the top 'maxNvar' genes in the
     univariate ranking.  After each application of 'bic.glm', the
     genes with 'probne0' < 'thresProbne0' are dropped, and the next
     genes in the univariate ranking are added to the BMA window.
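
     The ranking and window update can be sketched in base R as
     follows.  This is an illustration under stated assumptions, not
     the package's own code: it uses the usual definition of the
     BSS/WSS ratio, a tiny invented expression matrix, and a
     hypothetical outcome of one 'bic.glm' pass (gene 'g2' dropped).
     The helper 'bss.wss' is introduced here for illustration only.

```r
# Toy expression matrix: rows are genes, columns are samples (invented data).
expr <- rbind(
  g1 = c(1, 2, 3, 4, 1, 2, 3, 4),
  g2 = c(1, 2, 1, 2, 3, 4, 3, 4),
  g3 = c(1, 1, 2, 2, 9, 9, 10, 10),
  g4 = c(5, 1, 4, 2, 3, 6, 2, 5)
)
cls <- c(0, 0, 0, 0, 1, 1, 1, 1)  # two classes, coded 0 and 1

# BSS/WSS ratio for one gene: between-class over within-class sum of squares.
bss.wss <- function(x, cls) {
  overall <- mean(x)
  class.means <- tapply(x, cls, mean)
  class.sizes <- tapply(x, cls, length)
  bss <- sum(class.sizes * (class.means - overall)^2)
  wss <- sum((x - class.means[as.character(cls)])^2)
  bss / wss
}

# Rank genes by decreasing BSS/WSS ratio.
ratios <- apply(expr, 1, bss.wss, cls = cls)
ranked <- names(sort(ratios, decreasing = TRUE))

# First BMA window: the top maxNvar genes in the univariate ranking.
maxNvar <- 2
window <- ranked[seq_len(maxNvar)]

# Hypothetical result of one pass: 'g2' had probne0 below the threshold.
dropped <- "g2"
survivors <- setdiff(window, dropped)

# Refill the window with the next genes in the ranking.
n.refill <- maxNvar - length(survivors)
window <- c(survivors, ranked[maxNvar + seq_len(n.refill)])
print(window)
```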

_V_a_l_u_e:

     An object of class 'bic.glm' returned by the last iteration of
     'bic.glm'.  The object is a list consisting of the following
     components: 

  namesx: the names of the variables in the last iteration of 
          'bic.glm'.

postprob: the posterior probabilities of the models selected.

deviance: the estimated model deviances.

   label: labels identifying the models selected.

     bic: values of BIC for the models.

    size: the number of independent variables in each of the models.

   which: a logical matrix with one row per model and one column per 
          variable indicating whether that variable is in the model.

 probne0: the posterior probability that the coefficient of each
          variable is non-zero (in percent).

postmean: the posterior mean of each coefficient (from model
          averaging).

  postsd: the posterior standard deviation of each coefficient  (from
          model averaging).

condpostmean: the posterior mean of each coefficient conditional on 
          the variable being included in the model.

condpostsd: the posterior standard deviation of each coefficient 
          conditional on the variable being included in the model.

     mle: matrix with one row per model and one column per variable
          giving  the maximum likelihood estimate of each coefficient
          for each model.

      se: matrix with one row per model and one column per variable
          giving  the standard error of each coefficient for each
          model.

 reduced: a logical indicating whether any variables were dropped 
          before model averaging.

 dropped: a vector containing the names of those variables dropped 
          before model averaging.

    call: the matched call that created the 'bic.glm' object.
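
     To show how the 'postprob' and 'mle' components might be combined
     for prediction, the sketch below averages per-model logistic
     predictions weighted by posterior model probability.  All numbers
     are invented, and the layout of 'mle' (intercept in the first
     column) is an assumption; this is not a transcription of
     'bma.predict'.

```r
# Invented values standing in for components of a 'bic.glm' result.
postprob <- c(0.7, 0.3)                # posterior model probabilities
mle <- rbind(c(-1.0, 2.0, 0.0),        # model 1: intercept, geneA, geneB
             c(-0.5, 1.5, 0.8))        # model 2: intercept, geneA, geneB
x.new <- c(geneA = 1.2, geneB = -0.4)  # expression values of a new sample

# Per-model linear predictor (assumes the intercept is the first column).
linear.pred <- as.vector(mle %*% c(1, x.new))

# Per-model class-1 probability, then the posterior-weighted average.
p.model <- plogis(linear.pred)
p.bma <- sum(postprob * p.model)
print(p.bma)
```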

_N_o_t_e:

     The 'BMA' and 'Biobase' packages are required.

_R_e_f_e_r_e_n_c_e_s:

     Raftery, A.E. (1995).  Bayesian model selection in social research
     (with Discussion). Sociological Methodology 1995 (Peter V.
     Marsden, ed.), pp. 111-196, Cambridge, Mass.: Blackwells.

     Yeung, K.Y., Bumgarner, R.E. and Raftery, A.E. (2005)  Bayesian
     Model Averaging: Development of an improved multi-class, gene
     selection and classification tool for microarray data. 
     Bioinformatics 21: 2394-2402.

_S_e_e _A_l_s_o:

     'iterateBMAglm.train.predict',  
     'iterateBMAglm.train.predict.test', 'bma.predict', 'brier.score'

_E_x_a_m_p_l_e_s:

     library (Biobase)
     library (BMA)
     library (iterativeBMA)
     data(trainData)
     data(trainClass)

     ## training phase: select relevant genes
     ret.bic.glm <- iterateBMAglm.train (train.expr.set=trainData, train.class=trainClass, p=100)

     ## get the selected genes with probne0 > 0
     ret.gene.names <- ret.bic.glm$namesx[ret.bic.glm$probne0 > 0]

     ## show the posterior probabilities of selected models
     ret.bic.glm$postprob

     data (testData)

     ## get the subset of test data with the genes from the last iteration of bic.glm
     curr.test.dat <- t(exprs(testData)[ret.gene.names,])

     ## to compute the predicted probabilities for the test samples
     y.pred.test <- apply (curr.test.dat, 1, bma.predict,
                           postprobArr=ret.bic.glm$postprob,
                           mleArr=ret.bic.glm$mle)

     ## compute the Brier Score if the class labels of the test samples are known
     data (testClass)
     brier.score (y.pred.test, testClass)

