gbm                   package:gbm                   R Documentation

_G_e_n_e_r_a_l_i_z_e_d _B_o_o_s_t_e_d _R_e_g_r_e_s_s_i_o_n _M_o_d_e_l_i_n_g

_D_e_s_c_r_i_p_t_i_o_n:

     Fits generalized boosted regression models.

_U_s_a_g_e:

     gbm(formula = formula(data),
         distribution = "bernoulli",
         data = list(),
         weights,
         var.monotone = NULL,
         n.trees = 100,
         interaction.depth = 1,
         n.minobsinnode = 10,
         shrinkage = 0.001,
         bag.fraction = 0.5,
         train.fraction = 1.0,
         cv.folds=0,
         keep.data = TRUE,
         verbose = TRUE)

     gbm.fit(x,y,
             offset = NULL,
             misc = NULL,
             distribution = "bernoulli",
             w = NULL,
             var.monotone = NULL,
             n.trees = 100,
             interaction.depth = 1,
             n.minobsinnode = 10,
             shrinkage = 0.001,
             bag.fraction = 0.5,
             train.fraction = 1.0,
             keep.data = TRUE,
             verbose = TRUE,
             var.names = NULL,
             response.name = NULL)

     gbm.more(object,
              n.new.trees = 100,
              data = NULL,
              weights = NULL,
              offset = NULL,
              verbose = NULL)

_A_r_g_u_m_e_n_t_s:

 formula: a symbolic description of the model to be fit. The formula
          may include an offset term (e.g. y~offset(n)+x). If
          'keep.data=FALSE' in the initial call to 'gbm' then it is the
          user's responsibility to resupply the offset to 'gbm.more'.
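
          For instance, a count model with an exposure offset might be
          specified as follows (a minimal sketch with hypothetical
          variables 'claims', 'exposure', 'age', 'region' in a data
          frame 'policies'):

          gbm(claims ~ offset(log(exposure)) + age + region,
              distribution = "poisson",
              data = policies,
              n.trees = 1000)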

distribution: a description of the error distribution to be used in the
          model. Currently available options are "gaussian" (squared
          error), "laplace" (absolute loss), "bernoulli" (logistic
          regression for 0-1 outcomes), "adaboost" (the AdaBoost
          exponential loss for 0-1 outcomes), "poisson" (count
          outcomes), and "coxph" (censored observations). The current
          implementation of the Laplace distribution does not handle
          non-constant weights and will stop with an error if they are
          supplied.
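
          Note that "bernoulli" and "adaboost" expect the outcome coded
          as 0/1. A minimal sketch, assuming a hypothetical data frame
          'loans' with a logical column 'defaulted' and numeric columns
          'income' and 'balance':

          loans$y01 <- as.numeric(loans$defaulted)  # TRUE/FALSE -> 1/0
          gbm(y01 ~ income + balance,
              distribution = "bernoulli",
              data = loans,
              n.trees = 500)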

    data: an optional data frame containing the variables in the model.
          By default the variables are taken from
          'environment(formula)', typically the environment from which
          'gbm' is called. If 'keep.data=TRUE' in the initial call to
          'gbm' then 'gbm' stores a copy with the object. If
          'keep.data=FALSE' then it becomes the user's responsibility
          to resupply the same dataset to any subsequent call to
          'gbm.more'.

 weights: an optional vector of weights to be used in the fitting
          process. Must be positive but do not need to be normalized.
          If 'keep.data=FALSE' in the initial call to 'gbm' then it is
          the user's responsibility to resupply the weights to
          'gbm.more'.

var.monotone: an optional vector, the same length as the number of
          predictors, indicating which variables have a monotone
          increasing (+1), decreasing (-1), or arbitrary (0)
          relationship with the outcome.
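
          For example, to constrain the fit to be non-decreasing in the
          first predictor and non-increasing in the third, leaving the
          second unconstrained (a sketch with hypothetical predictors
          'x1', 'x2', 'x3' and outcome 'y' in a data frame 'mydata'):

          gbm(y ~ x1 + x2 + x3,
              distribution = "gaussian",
              data = mydata,
              var.monotone = c(1, 0, -1),
              n.trees = 1000)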

 n.trees: the total number of trees to fit. This is equivalent to the
          number of iterations and the number of basis functions in the
          additive expansion.

cv.folds: Number of cross-validation folds to perform. If 'cv.folds'>1
          then 'gbm', in addition to the usual fit, will perform
          cross-validation and calculate an estimate of generalization
          error, returned in 'cv.error'.

interaction.depth: The maximum depth of variable interactions. 1
          implies an additive model, 2 implies a model with up to 2-way
          interactions, etc.

n.minobsinnode: minimum number of observations in the trees' terminal
          nodes. Note that this is the actual number of observations,
          not the total weight.

shrinkage: a shrinkage parameter applied to each tree in the expansion.
          Also known as the learning rate or step-size reduction.

bag.fraction: the fraction of the training set observations randomly
          selected to propose the next tree in the expansion. This
          introduces randomness into the model fit. If
          'bag.fraction'<1 then running the same model twice will
          result in similar but different fits. 'gbm' uses the R random
          number generator, so calling 'set.seed' beforehand (see
          'Random') ensures that the model can be reconstructed.
          Better still, the user can save the returned 'gbm.object'
          using 'save'.
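
          A minimal sketch of reproducing a stochastic fit (assuming a
          hypothetical data frame 'mydata' with outcome 'y'):

          set.seed(123)
          fit1 <- gbm(y ~ ., data = mydata, distribution = "gaussian",
                      bag.fraction = 0.5, n.trees = 100)
          set.seed(123)
          fit2 <- gbm(y ~ ., data = mydata, distribution = "gaussian",
                      bag.fraction = 0.5, n.trees = 100)
          # fit1 and fit2 see identical bagged subsamples, so they
          # contain identical trees; alternatively, save the first fit
          # with save(fit1, file = "fit1.RData")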

train.fraction: The first 'train.fraction * nrow(data)' observations
          are used to fit the 'gbm' and the remainder are used for
          computing out-of-sample estimates of the loss function.

keep.data: a logical variable indicating whether to keep the data and
          an index of the data stored with the object. Keeping the data
          and index makes subsequent calls to 'gbm.more' faster at the
          cost of storing an extra copy of the dataset.

  object: a 'gbm' object created from an initial call to 'gbm'.

n.new.trees: the number of additional trees to add to 'object'.

 verbose: If TRUE, 'gbm' will print out progress and performance
          indicators. If this option is left unspecified for 'gbm.more'
          then it uses 'verbose' from 'object'.

    x, y: For 'gbm.fit': 'x' is a data frame or data matrix containing
          the predictor variables and 'y' is the vector of outcomes.
          The number of rows in 'x' must be the same as the length of
          'y'.

  offset: a vector of values for the offset.

    misc: For 'gbm.fit': 'misc' is an R object that is simply passed on
          to the gbm engine. It can be used for additional data for the
          specific distribution. Currently it is only used for passing
          the censoring indicator for the Cox proportional hazards
          model.
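
          For example, a Cox model fit through 'gbm.fit' might pass the
          censoring indicator as follows (a sketch assuming a predictor
          matrix 'x', a vector of survival times 'time', and a 0/1
          event indicator 'status'):

          fit <- gbm.fit(x, y = time,
                         misc = status,        # 1 = event, 0 = censored
                         distribution = "coxph",
                         n.trees = 500)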

       w: For 'gbm.fit': 'w' is a vector of weights of the same length
          as 'y'.

var.names: For 'gbm.fit': A vector of strings of length equal to the
          number of columns of 'x' containing the names of the
          predictor variables.

response.name: For 'gbm.fit': A character string label for the response
          variable.

_D_e_t_a_i_l_s:

     This package implements the generalized boosted modeling
     framework. Boosting is the process of iteratively adding basis
     functions in a greedy fashion so that each additional basis
     function further reduces the selected loss function. This
     implementation closely follows Friedman's Gradient Boosting
     Machine (Friedman, 2001).
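
     The basic idea can be sketched in a few lines of R. The toy
     illustration below repeatedly fits a shrunken depth-1 'rpart'
     stump to the residuals (the negative gradient of squared-error
     loss); it is not the gbm engine, only an illustration of the
     additive expansion:

     library(rpart)
     set.seed(1)
     d <- data.frame(x1 = runif(200))
     d$y <- sin(2*pi*d$x1) + rnorm(200, 0, 0.3)
     f <- rep(mean(d$y), 200)               # initial constant fit
     for (t in 1:100) {
        d$r <- d$y - f                      # residual = negative gradient
        stump <- rpart(r ~ x1, data = d, maxdepth = 1)
        f <- f + 0.1*predict(stump, d)      # shrunken additive update
     }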

     In addition to many of the features documented in the Gradient
     Boosting Machine, 'gbm' offers additional features including the
     out-of-bag estimator for the optimal number of iterations, the
     ability to store and manipulate the resulting 'gbm' object, and a
     variety of other loss functions that had not previously had
     associated boosting algorithms, including the Cox partial
     likelihood for censored data, the Poisson likelihood for count
     outcomes, and a gradient boosting implementation to minimize the
     AdaBoost exponential loss function.

     'gbm.fit' provides the link between R and the C++ gbm engine.
     'gbm' is a front-end to 'gbm.fit' that uses the familiar R
     modeling formulas. However, 'model.frame' is very slow if there
     are many predictor variables. Power users with many variables may
     prefer to call 'gbm.fit' directly; for general practice 'gbm' is
     preferable.
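
     A minimal 'gbm.fit' sketch, assuming a numeric predictor matrix or
     data frame 'x' with named columns and an outcome vector 'y':

     fit <- gbm.fit(x, y,
                    distribution = "gaussian",
                    n.trees = 1000,
                    shrinkage = 0.005,
                    interaction.depth = 3,
                    verbose = FALSE)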

_V_a_l_u_e:

     'gbm', 'gbm.fit', and 'gbm.more' return a 'gbm.object'.

_A_u_t_h_o_r(_s):

     Greg Ridgeway <gregr@rand.org>

_R_e_f_e_r_e_n_c_e_s:

     Y. Freund and R.E. Schapire (1997) "A decision-theoretic
     generalization of on-line learning and an application to
     boosting," Journal of Computer and System Sciences, 55(1):119-139.

     G. Ridgeway (1999). "The state of boosting," Computing Science and
     Statistics 31:172-181.

     J.H. Friedman, T. Hastie, R. Tibshirani (2000). "Additive Logistic
     Regression: a Statistical View of Boosting," Annals of Statistics
     28(2):337-374.

     J.H. Friedman (2001). "Greedy Function Approximation: A Gradient
     Boosting Machine," Annals of Statistics 29(5):1189-1232.

     J.H. Friedman (2002). "Stochastic Gradient Boosting,"
     Computational Statistics and Data Analysis 38(4):367-378.

     G. Ridgeway (2003). "An out-of-bag estimator for the optimal
     number of boosting iterations," technical report due out soon.

     <URL: http://www.i-pensieri.com/gregr/gbm.shtml>

     <URL: http://www-stat.stanford.edu/~jhf/R-MART.html>

_S_e_e _A_l_s_o:

     'gbm.object', 'gbm.perf', 'plot.gbm', 'predict.gbm',
     'summary.gbm', 'pretty.gbm.tree'.

_E_x_a_m_p_l_e_s:

     # A least squares regression example
     # create some data

     N <- 1000
     X1 <- runif(N)
     X2 <- 2*runif(N)
     X3 <- ordered(sample(letters[1:4],N,replace=TRUE),levels=letters[4:1])
     X4 <- factor(sample(letters[1:6],N,replace=TRUE))
     X5 <- factor(sample(letters[1:3],N,replace=TRUE))
     X6 <- 3*runif(N)
     mu <- c(-1,0,1,2)[as.numeric(X3)]

     SNR <- 10 # signal-to-noise ratio
     Y <- X1**1.5 + 2 * (X2**.5) + mu
     sigma <- sqrt(var(Y)/SNR)
     Y <- Y + rnorm(N,0,sigma)

     # introduce some missing values
     X1[sample(1:N,size=500)] <- NA
     X4[sample(1:N,size=300)] <- NA

     data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)

     # fit initial model
     gbm1 <- gbm(Y~X1+X2+X3+X4+X5+X6,         # formula
         data=data,                   # dataset
         var.monotone=c(0,0,0,0,0,0), # -1: monotone decrease,
                                      # +1: monotone increase,
                                      #  0: no monotone restrictions
         distribution="gaussian",     # bernoulli, adaboost, gaussian,
                                      # laplace, poisson, and coxph available
         n.trees=3000,                # number of trees
         shrinkage=0.005,             # shrinkage or learning rate,
                                      # 0.001 to 0.1 usually work
         interaction.depth=3,         # 1: additive model, 2: two-way interactions, etc.
         bag.fraction = 0.5,          # subsampling fraction, 0.5 is probably best
         train.fraction = 0.5,        # fraction of data for training,
                                      # first train.fraction*N used for training
         n.minobsinnode = 10,         # minimum number of obs needed in each node
         cv.folds = 5,                # do 5-fold cross-validation
         keep.data=TRUE,              # keep a copy of the dataset with the object
         verbose=TRUE)                # print out progress

     # check performance using an out-of-bag estimator
     # OOB tends to underestimate the optimal number of iterations,
     # although its predictive performance is still competitive
     best.iter <- gbm.perf(gbm1,method="OOB")
     print(best.iter)

     # check performance using the 50% heldout test set
     best.iter <- gbm.perf(gbm1,method="test")
     print(best.iter)

     # check performance using 5-fold cross-validation
     best.iter <- gbm.perf(gbm1,method="cv")
     print(best.iter)

     # plot variable influence
     summary(gbm1,n.trees=1)         # based on the first tree
     summary(gbm1,n.trees=best.iter) # based on the estimated best number of trees

     # compactly print the first and last trees for curiosity
     print(pretty.gbm.tree(gbm1,1))
     print(pretty.gbm.tree(gbm1,gbm1$n.trees))

     # make some new data
     N <- 1000
     X1 <- runif(N)
     X2 <- 2*runif(N)
     X3 <- ordered(sample(letters[1:4],N,replace=TRUE),levels=letters[4:1])
     X4 <- factor(sample(letters[1:6],N,replace=TRUE))
     X5 <- factor(sample(letters[1:3],N,replace=TRUE))
     X6 <- 3*runif(N)
     mu <- c(-1,0,1,2)[as.numeric(X3)]

     Y <- X1**1.5 + 2 * (X2**.5) + mu + rnorm(N,0,sigma)

     data2 <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)

     # predict on the new data using "best" number of trees
     # f.predict generally will be on the canonical scale (logit,log,etc.)
     f.predict <- predict.gbm(gbm1,data2,best.iter)

     # least squares error
     print(sum((data2$Y-f.predict)^2))

     # create marginal plots
     # plot variable X1,X2,X3 after "best" iterations
     par(mfrow=c(1,3))
     plot.gbm(gbm1,1,best.iter)
     plot.gbm(gbm1,2,best.iter)
     plot.gbm(gbm1,3,best.iter)
     par(mfrow=c(1,1))
     # contour plot of variables 1 and 2 after "best" iterations
     plot.gbm(gbm1,1:2,best.iter)
     # lattice plot of variables 2 and 3
     plot.gbm(gbm1,2:3,best.iter)
     # lattice plot of variables 3 and 4
     plot.gbm(gbm1,3:4,best.iter)

     # 3-way plots
     plot.gbm(gbm1,c(1,2,6),best.iter,cont=20)
     plot.gbm(gbm1,1:3,best.iter)
     plot.gbm(gbm1,2:4,best.iter)
     plot.gbm(gbm1,3:5,best.iter)

     # do another 100 iterations
     gbm2 <- gbm.more(gbm1,100,
                      verbose=FALSE) # stop printing detailed progress
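
     # re-estimate the optimal number of iterations for the extended fit
     # and predict with it (a sketch using the same functions as above)
     best.iter2 <- gbm.perf(gbm2,method="OOB")
     print(best.iter2)
     f.predict2 <- predict.gbm(gbm2,data2,best.iter2)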

