bagging                package:ipred                R Documentation

_B_a_g_g_i_n_g _C_l_a_s_s_i_f_i_c_a_t_i_o_n, _R_e_g_r_e_s_s_i_o_n _a_n_d _S_u_r_v_i_v_a_l _T_r_e_e_s

_D_e_s_c_r_i_p_t_i_o_n:

     Bagging for classification, regression and survival trees.

_U_s_a_g_e:

     ipredbagg.factor(y, X=NULL, nbagg=25, control=
                      rpart.control(minsplit=2, cp=0, xval=0), 
                      comb=NULL, coob=FALSE, ns=length(y), keepX = TRUE, ...)
     ipredbagg.numeric(y, X=NULL, nbagg=25, control=rpart.control(xval=0), 
                       comb=NULL, coob=FALSE, ns=length(y), keepX = TRUE, ...)
     ipredbagg.Surv(y, X=NULL, nbagg=25, control=rpart.control(xval=0), 
                    comb=NULL, coob=FALSE, ns=dim(y)[1], keepX = TRUE, ...)
     ## S3 method for class 'data.frame':
     bagging(formula, data, subset, na.action=na.rpart, ...)

_A_r_g_u_m_e_n_t_s:

       y: the response variable: either a factor vector of class labels
          (bagging classification trees), a vector of numerical values 
          (bagging regression trees) or an object of class  'Surv'
          (bagging survival trees).

       X: a data frame of predictor variables.

   nbagg: an integer giving the number of bootstrap replications. 

    coob: a logical indicating whether an out-of-bag estimate of the
          error rate (misclassification error, root mean squared error
          or Brier score) should be computed.  See 'predict.classbagg'
          for details.

 control: options that control details of the 'rpart' algorithm, see
          'rpart.control'. It is wise to set 'xval = 0' in order to
          save computing  time. Note that the  default values depend on
          the class of 'y'.

    comb: a list of additional models for model combination, see below
          for some examples. Note that the argument 'method' for
          double-bagging is no longer available; 'comb' is much more
          flexible.

      ns: number of samples to draw from the learning sample. By
          default, the usual bootstrap of n out of n observations with
          replacement is performed. If 'ns' is smaller than
          'length(y)', subagging (Buehlmann and Yu, 2002), i.e.,
          sampling 'ns' out of 'length(y)' observations without
          replacement, is performed.

   keepX: a logical indicating whether the data frame of predictors
          should be returned. Note that the computation of the 
          out-of-bag estimator requires  'keepX=TRUE'.

 formula: a formula of the form 'lhs ~ rhs' where 'lhs'  is the
          response variable and 'rhs' a set of predictors.

    data: optional data frame containing the variables in the model
          formula.

  subset: optional vector specifying a subset of observations to be
          used.

na.action: function which indicates what should happen when the data
          contain 'NA's.  Defaults to 'na.rpart'.

     ...: additional parameters passed to 'ipredbagg' or  'rpart',
          respectively.

_D_e_t_a_i_l_s:

     Bagging for classification and regression trees was suggested by
     Breiman (1996a, 1998) in order to stabilise trees.

     The trees in this function are computed using the implementation
     in the  'rpart' package. The generic function 'ipredbagg'
     implements methods for different responses. If 'y' is a factor,
     classification trees are constructed. For numerical vectors 'y',
     regression trees are aggregated, and if 'y' is a survival object,
     bagging survival trees (Hothorn et al., 2004) is performed. The
     function 'bagging' offers a formula-based interface to
     'ipredbagg'.
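
     A minimal sketch of the two interfaces (using the 'iris' data
     here purely for illustration): both calls grow the same kind of
     bagged classification trees.

     library("ipred")
     data(iris)
     ## generic interface: response and predictors given separately
     m1 <- ipredbagg(iris$Species, iris[, -5], nbagg = 10)
     ## formula interface: dispatches to 'ipredbagg' internally
     m2 <- bagging(Species ~ ., data = iris, nbagg = 10)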

     'nbagg' bootstrap samples are drawn and a tree is constructed for
     each of them. There is no general rule for when to stop growing
     the trees. Their size can be controlled by the 'control' argument
     or by 'prune.classbagg'. By default, classification trees are
     grown as large as possible, whereas regression trees and survival
     trees are built with the standard options of 'rpart.control'. If
     'nbagg=1', a single tree is computed for the whole learning
     sample without bootstrapping.
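
     For illustration, a sketch of restricting tree size via 'control'
     and of fitting a single unbagged tree (the parameter values below
     are arbitrary choices, not recommendations):

     library("ipred")
     library("rpart")
     data(iris)
     ## smaller trees: require at least 20 observations before a split
     mod <- bagging(Species ~ ., data = iris, nbagg = 10,
                    control = rpart.control(minsplit = 20, cp = 0.01,
                                            xval = 0))
     ## nbagg = 1: one tree on the complete learning sample
     single <- bagging(Species ~ ., data = iris, nbagg = 1)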

     If 'coob' is TRUE, the out-of-bag sample (Breiman, 1996b) is used
     to estimate the prediction error corresponding to 'class(y)'.
     Alternatively, the out-of-bag sample can be used for model
     combination; an out-of-bag error rate estimator is not available
     in this case. Double-bagging (Hothorn and Lausen, 2003) computes
     an LDA on the out-of-bag sample and uses the discriminant
     variables as additional predictors for the classification trees.
     'comb' is an optional list of lists with two elements 'model' and
     'predict': 'model' is a function with arguments 'formula' and
     'data'; 'predict' is a function with arguments 'object' and
     'newdata' only. If the estimation of the covariance matrix in
     'lda' fails due to a limited out-of-bag sample size, one can use
     'slda' instead. See the Examples section for an example of
     double-bagging. The methodology is not limited to a combination
     with LDA: bundling (Hothorn and Lausen, 2002b) can be used with
     arbitrary classifiers.
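
     For illustration, the 'comb' structure with 'slda' in place of
     'lda' could look as follows (assuming, as for 'lda', that the
     corresponding predict method returns the discriminant scores in
     its '$x' component):

     ## sketch only: swap 'slda' for 'lda' in the combination list
     comb.slda <- list(list(model = slda,
                            predict = function(object, newdata)
                                predict(object, newdata)$x))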

_V_a_l_u_e:

     The class of the returned object depends on 'class(y)':
     'classbagg', 'regbagg' or 'survbagg'. Each is a list with elements

       y: the vector of responses.

       X: the data frame of predictors.

  mtrees: multiple trees: a list of length 'nbagg' containing the trees
          (and possibly additional objects) for each bootstrap sample.

     OOB: logical whether the out-of-bag estimate should be computed.

     err: if 'OOB=TRUE', the out-of-bag estimate of the
          misclassification error, the root mean squared error or the
          Brier score for censored data.

    comb: logical whether a combination of models was requested.


     For each class, methods for the generics 'prune', 'print',
     'summary' and 'predict' are available for inspecting the results
     and for prediction, for example: 'print.classbagg',
     'summary.classbagg', 'predict.classbagg' and 'prune.classbagg'
     for classification problems.
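
     For illustration, a sketch of these generics applied to a fitted
     'classbagg' object (the value of 'cp' is an arbitrary choice):

     library("ipred")
     library("rpart")   # provides the 'prune' generic
     data(iris)
     mod <- bagging(Species ~ ., data = iris, nbagg = 10, coob = TRUE)
     print(mod)                           # print.classbagg
     summary(mod)                         # summary.classbagg
     pruned <- prune(mod, cp = 0.01)      # prune.classbagg
     predict(mod, newdata = iris[1:5, ])  # predict.classbagg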

_A_u_t_h_o_r(_s):

     Torsten Hothorn <Torsten.Hothorn@rzmail.uni-erlangen.de>

_R_e_f_e_r_e_n_c_e_s:

     Leo Breiman (1996a), Bagging Predictors. _Machine Learning_
     *24*(2), 123-140.

     Leo Breiman (1996b), Out-Of-Bag Estimation. _Technical Report_
     <URL:
     ftp://ftp.stat.berkeley.edu/pub/users/breiman/OOBestimation.ps.Z>.

     Leo Breiman (1998), Arcing Classifiers. _The Annals of Statistics_
     *26*(3), 801-824.

     Peter Buehlmann and Bin Yu (2002), Analyzing Bagging. _The Annals
     of Statistics_ *30*(4), 927-961.

     Torsten Hothorn and Berthold Lausen (2003), Double-Bagging:
     Combining classifiers by bootstrap aggregation. _Pattern
     Recognition_, *36*(6), 1303-1309. 

     Torsten Hothorn and Berthold Lausen (2002b), Bundling Classifiers
     by Bagging Trees. _submitted_. Preprint available from  <URL:
     http://www.mathpreprints.com/math/Preprint/blausen/20021016/1>.

     Torsten Hothorn, Berthold Lausen, Axel Benner and Martin
     Radespiel-Troeger (2004), Bagging Survival Trees. _Statistics in
     Medicine_, *23*(1), 77-91.

_E_x_a_m_p_l_e_s:

     library("ipred")
     library("mlbench")    # provides the data sets used below
     library("MASS")       # provides lda() for double-bagging
     library("survival")   # provides Surv() for the survival example

     # Classification: Breast Cancer data

     data(BreastCancer)

     # Test set error bagging (nbagg = 50): 3.7% (Breiman, 1998, Table 5)

     mod <- bagging(Class ~ Cl.thickness + Cell.size
                     + Cell.shape + Marg.adhesion   
                     + Epith.c.size + Bare.nuclei   
                     + Bl.cromatin + Normal.nucleoli
                     + Mitoses, data=BreastCancer, coob=TRUE)
     print(mod)

     # Test set error bagging (nbagg=50): 7.9% (Breiman, 1996a, Table 2)
     data(Ionosphere)
     Ionosphere$V2 <- NULL # constant within groups

     bagging(Class ~ ., data=Ionosphere, coob=TRUE)

     # Double-Bagging: combine LDA and classification trees

     # predict returns the linear discriminant values, i.e. linear combinations
     # of the original predictors

     comb.lda <- list(list(model=lda, predict=function(obj, newdata)
                                      predict(obj, newdata)$x))

     # Note: out-of-bag estimator is not available in this situation, use
     # errorest

     mod <- bagging(Class ~ ., data=Ionosphere, comb=comb.lda) 

     predict(mod, Ionosphere[1:10,])
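
     # Sketch: cross-validated misclassification error of the combined
     # model via 'errorest' (10-fold CV by default; 'nbagg' is kept
     # small here only to save computing time)

     errorest(Class ~ ., data = Ionosphere, model = bagging,
              nbagg = 10, comb = comb.lda)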

     # Regression:

     data(BostonHousing)

     # Test set error (nbagg=25, trees pruned): 3.41 (Breiman, 1996a, Table 8)

     mod <- bagging(medv ~ ., data=BostonHousing, coob=TRUE)
     print(mod)

     learn <- as.data.frame(mlbench.friedman1(200))

     # Test set error (nbagg=25, trees pruned): 2.47 (Breiman, 1996a, Table 8)

     mod <- bagging(y ~ ., data=learn, coob=TRUE)
     print(mod)

     # Survival data

     # Brier score for censored data estimated by
     # 10 times 10-fold cross-validation: 0.2 (Hothorn et al, 2004)

     data(DLBCL)
     mod <- bagging(Surv(time,cens) ~ MGEc.1 + MGEc.2 + MGEc.3 + MGEc.4 + MGEc.5 +
                                      MGEc.6 + MGEc.7 + MGEc.8 + MGEc.9 +
                                      MGEc.10 + IPI, data=DLBCL, coob=TRUE)

     print(mod)
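
     # Subagging (Buehlmann and Yu, 2002): sketch drawing 'ns'
     # observations without replacement, here half of the learning
     # sample ('ns' is passed through to 'ipredbagg')

     mod <- bagging(Species ~ ., data = iris, nbagg = 25,
                    ns = floor(nrow(iris) / 2), coob = TRUE)
     print(mod)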

