errorest                package:ipred                R Documentation

_E_s_t_i_m_a_t_o_r_s _o_f _P_r_e_d_i_c_t_i_o_n _E_r_r_o_r

_D_e_s_c_r_i_p_t_i_o_n:

     Resampling based estimates of prediction error: misclassification
     error,  root mean squared error or Brier score for survival data.

_U_s_a_g_e:

     ## S3 method for class 'data.frame':
     errorest(formula, data, subset, na.action=na.omit, 
              model=NULL, predict=NULL,
              estimator=c("cv", "boot", "632plus"), 
              est.para=control.errorest(), ...)

_A_r_g_u_m_e_n_t_s:

 formula: a formula of the form 'lhs ~ rhs'.  Either describing the
          model of explanatory and  response variables in the usual way
          (see 'lm')  or the model between explanatory and intermediate
          variables in the framework of indirect classification,  see
          'inclass'.

    data: a data frame containing the variables in the model formula 
          and additionally the class membership variable  if 'model =
          inclass'. 'data' is required for indirect classification,
          otherwise 'formula' is evaluated in the calling environment.

  subset: optional vector, specifying a subset of observations to  be
          used.

na.action: function which indicates what should happen when the data
          contains 'NA''s, defaults to 'na.omit'.

   model: function. Modelling technique whose error rate is to be 
          estimated. The function 'model' can either return an  object
          representing a fitted model or a function with argument
          'newdata' which returns predicted values. In this case, the
          'predict' argument to 'errorest' is ignored.

 predict: function. Prediction method to be used. The vector of 
          predicted values must have the same length as the the  number
          of to-be-predicted observations. Predictions  corresponding
          to missing data must be replaced by 'NA'. Additionally,
          'predict' has to return predicted values  comparable to the
          responses (that is: factors for  classification problems).
          See the example on how to make this sure for any predictor.

estimator: estimator of the misclassification error:  'cv'
          cross-validation, 'boot' bootstrap or  '632plus' bias
          corrected bootstrap (classification only). 

est.para: a list of additional parameters that control the calculation
          of the estimator, see  'control.errorest' for details.

     ...: additional parameters to 'model'.

_D_e_t_a_i_l_s:

     The prediction error for classification and regression models as
     well as predictive models for censored data using cross-validation
     or the  bootstrap can be computed by 'errorest'. For
     classification problems, the estimated misclassification error is
     returned. The root mean squared error is computed for regression
     problems and the Brier score for censored data (Graf et al., 1999)
     is reported if the response is censored. 

     Any model can be specified as long as it is a function with
     arguments 'model(formula, data, subset, na.action, ...)'. If  a
     method 'predict.model(object, newdata, ...)' is available,
     'predict' does not need to be specified. However, 'predict'  has
     to return predicted values in the same order and of the same
     length corresponding to the response. See the examples below. 

     $k$-fold cross-validation and the usual bootstrap estimator with
     'est.para$nboot' bootstrap replications can be computed for all
     kind of problems. The bias corrected .632+ bootstrap by Efron and
     Tibshirani (1997) is available for classification problems only.
     Use 'control.errorest' to specify additional arguments.

     'errorest' is a formula based interface to the generic functions 
     'cv' or 'bootest' which implement methods for classification,
     regression and survival problems.

_V_a_l_u_e:

     The class of the object returned depends on the class of the
     response variable and the estimator used. In each case, it is a
     list with an element 'error' and additional information. 'print'
     methods are available for the inspection of the results.

_A_u_t_h_o_r(_s):

     Andrea Peters <Peters.Andrea@imbe.imed.uni-erlangen.de> and 
      Torsten Hothorn <Torsten.Hothorn@rzmail.uni-erlangen.de>

_R_e_f_e_r_e_n_c_e_s:

     Brian D. Ripley (1996), _Pattern Recognition and Neural Networks_.
     Cambridge: Cambridge University Press.

     Bradley Efron and Robert Tibshirani (1997), Improvements on
     Cross-Validation: The .632+ Bootstrap Estimator. _Journal of the
     American Statistical Association_ *92*(438), 548-560.

     Erika Graf, Claudia Schmoor, Willi Sauerbrei and Martin Schumacher
     (1999),  Assessment and comparison of prognostic classification
     schemes for  survival data. _Statistics in Medicine_ *18*(17-18),
     2529-2545.

     Rosa A. Schiavo and David J. Hand (2000), Ten More Years of Error
     Rate Research. _International Statistical Review_ *68*(3),
     296-310. 

     David J. Hand, Hua Gui Li, Niall M. Adams (2001), Supervised
     Classification with Structured Class Definitions. _Computational
     Statistics & Data Analysis_ *36*, 209-225.

_E_x_a_m_p_l_e_s:

     # Classification

     data(iris)

     # force predict to return class labels only
     mypredict.lda <- function(object, newdata)
       predict(object, newdata = newdata)$class

     # 10-fold cv of LDA for Iris data
     errorest(Species ~ ., data=iris, model=lda, 
              estimator = "cv", predict= mypredict.lda)

     data(PimaIndiansDiabetes)
     ## Not run: 
     # 632+ bootstrap of LDA for Diabetes data
     errorest(diabetes ~ ., data=PimaIndiansDiabetes, model=lda,
              estimator = "632plus", predict= mypredict.lda)
     ## End(Not run)

     #cv of a fixed partition of the data
     list.tindx <- list(1:100, 101:200, 201:300, 301:400, 401:500,
             501:600, 601:700, 701:768)

     errorest(diabetes ~ ., data=PimaIndiansDiabetes, model=lda,
               estimator = "cv", predict = mypredict.lda,
               est.para = control.errorest(list.tindx = list.tindx))

     ## Not run: 
     #both bootstrap estimations based on fixed partitions

     list.tindx <- vector(mode = "list", length = 25)
     for(i in 1:25) {
       list.tindx[[i]] <- sample(1:768, 768, TRUE)
     }

     errorest(diabetes ~ ., data=PimaIndiansDiabetes, model=lda,
               estimator = c("boot", "632plus"), predict= mypredict.lda,
               est.para = control.errorest(list.tindx = list.tindx))

     ## End(Not run)
     data(Glass)

     # LDA has cross-validated misclassification error of
     # 38% (Ripley, 1996, page 98)

     # Pruned trees about 32% (Ripley, 1996, page 230)

     # use stratified sampling here, i.e. preserve the class proportions
     errorest(Type ~ ., data=Glass, model=lda, 
              predict=mypredict.lda, est.para=control.errorest(strat=TRUE))

     # force predict to return class labels
     mypredict.rpart <- function(object, newdata)
       predict(object, newdata = newdata,type="class")

     pruneit <- function(formula, ...)
       prune(rpart(formula, ...), cp =0.01)

     errorest(Type ~ ., data=Glass, model=pruneit,
              predict=mypredict.rpart, est.para=control.errorest(strat=TRUE))

     # compute sensitivity and specifity for stabilised LDA

     data(GlaucomaM)

     error <- errorest(Class ~ ., data=GlaucomaM, model=slda,
       predict=mypredict.lda, est.para=control.errorest(predictions=TRUE))

     # sensitivity 

     mean(error$predictions[GlaucomaM$Class == "glaucoma"] == "glaucoma")

     # specifity

     mean(error$predictions[GlaucomaM$Class == "normal"] == "normal")

     # Indirect Classification: Smoking data

     data(Smoking)
     # Set three groups of variables:
     # 1) explanatory variables are: TarY, NicY, COY, Sex, Age
     # 2) intermediate variables are: TVPS, BPNL, COHB
     # 3) response (resp) is defined by:

     resp <- function(data){
       data <- data[, c("TVPS", "BPNL", "COHB")]
       res <- t(t(data) > c(4438, 232.5, 58))
       res <- as.factor(ifelse(apply(res, 1, sum) > 2, 1, 0))
       res
     }

     response <- resp(Smoking[ ,c("TVPS", "BPNL", "COHB")])
     smoking <- cbind(Smoking, response)

     formula <- response~TVPS+BPNL+COHB~TarY+NicY+COY+Sex+Age

     # Estimation per leave-one-out estimate for the misclassification is 
     # 36.36% (Hand et al., 2001), using indirect classification with 
     # linear models
     ## Not run: 
     errorest(formula, data = smoking, model = inclass,estimator = "cv", 
              pFUN = list(list(model=lm, predict = mypredict.lm)), cFUN = resp,  
              est.para=control.errorest(k=nrow(smoking)))
     ## End(Not run)

     # Regression

     data(BostonHousing)

     # 10-fold cv of lm for Boston Housing data
     errorest(medv ~ ., data=BostonHousing, model=lm,
              est.para=control.errorest(random=FALSE))

     # the same, with "model" returning a function for prediction
     # instead of an object of class "lm"

     mylm <- function(formula, data) {
       mod <- lm(formula, data)
       function(newdata) predict(mod, newdata)
     }

     errorest(medv ~ ., data=BostonHousing, model=mylm,
     est.para=control.errorest(random=FALSE))

     # Survival data

     data(GBSG2)

     # prediction is fitted Kaplan-Meier
     predict.survfit <- function(object, newdata) object

     # 5-fold cv of Kaplan-Meier for GBSG2 study
     errorest(Surv(time, cens) ~ 1, data=GBSG2, model=survfit,
              predict=predict.survfit, est.para=control.errorest(k=5))

