MLearn             package:MLInterfaces             R Documentation

_r_e_v_i_s_e_d _M_L_e_a_r_n _i_n_t_e_r_f_a_c_e _f_o_r _m_a_c_h_i_n_e _l_e_a_r_n_i_n_g

_D_e_s_c_r_i_p_t_i_o_n:

     revised MLearn interface for machine learning, emphasizing a
     schematic description of external learning functions such as knn,
     lda, and nnet.

_U_s_a_g_e:

     MLearn( formula, data, .method, trainInd, ... )
     makeLearnerSchema(packname, mlfunname, converter)

_A_r_g_u_m_e_n_t_s:

 formula: standard model formula 

    data: data.frame or ExpressionSet instance

 .method: instance of learnerSchema 

trainInd: obligatory: either a numeric vector of indices of data to be
          used for training (all other data are used for testing), or
          an instance of the xvalSpec class

     ...: additional named arguments passed to external learning
          function 

packname: character - name of package harboring a learner function

mlfunname: character - name of function to use

converter: function with parameters (obj, data, trainInd) that converts
          the material in obj (produced by packname::mlfunname) into a
          classifierOutput instance.
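
     For illustration, a schema for a learner that follows the standard
     (formula, data) calling sequence and has a conventional predict
     method can be assembled with the package's standardMLIConverter
     bridge.  A minimal sketch (the name qdaI2 is hypothetical; qda is
     already covered by the stock qdaI schema):

          # bridge MASS::qda into MLearn via the standard converter
          qdaI2 = makeLearnerSchema("MASS", "qda", standardMLIConverter)
          # usage: MLearn(sp~CW+RW, data=crabs, qdaI2, kp)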

_D_e_t_a_i_l_s:

     The purpose of the MLearn methods is to provide a uniform calling
     sequence to diverse machine learning algorithms.  In R packages,
     machine learning functions can have parameters '(x, y, ...)' or
     '(formula, data, ...)' or some other sequence, and these functions
     can return lists, vectors, or other structures.  With MLearn, the
     calling sequence is always 'MLearn(formula, data, .method,
     trainInd, ...)', and 'data' can be a 'data.frame' or
     'ExpressionSet'.  'MLearn' will always return an S4 instance of
     'classifierOutput' or 'clusteringOutput'.
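
     For example, lda and rpart have quite different native interfaces,
     but under MLearn only the .method token changes (a sketch,
     assuming a training index vector kp as in the Examples below):

          fit1 = MLearn(sp~CW+RW, data=crabs, ldaI, kp)
          fit2 = MLearn(sp~CW+RW, data=crabs, rpartI, kp)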

     At this time (1.13.x), NA values in predictors trigger an error.

     To obtain documentation on the older (pre-Bioc 2.1) version of the
     MLearn method, please use help(MLearn-OLD).

     The following learner schemas are available:

     randomForestI randomForest.  Note that to obtain the default
          performance of the older randomForestB interface, you need to
          set the mtry and sampsize parameters to sqrt(number of
          features) and table([training set response factor])
          respectively, as these were not taken to be the function's
          defaults.  Note that you can use xvalSpec("NOTEST") as
          trainInd to use all the samples; the RObject() result will
          print the misclassification matrix estimate along with the
          OOB error rate estimate.

     _k_n_n_I(_k=_1,_l=_0) knn; special support bridge required, defined in
          MLint

     _k_n_n._c_v_I(_k=_1,_l=_0) knn.cv; special support bridge required, defined
          in MLint.  This option uses the embedded leave-one-out
          cross-validation of 'knn.cv', and thereby achieves high
          performance.  You can have more general cross-validation
          using 'knnI' with an 'xvalSpec', but it will be slower. When
          using this learner schema, you should use the numerical
          'trainInd' setting with '1:N' where 'N' is the number of
          samples.

     _d_l_d_a_I stat.diag.da; special support bridge required, defined in
          MLint

     _n_n_e_t_I nnet

     _r_p_a_r_t_I rpart

     _l_d_a_I lda

     _s_v_m_I svm

     _q_d_a_I qda

     _l_o_g_i_s_t_i_c_I(_t_h_r_e_s_h_o_l_d) glm - with binomial family, expecting a
          dichotomous factor as response variable, not bulletproofed
          against other responses yet.  If response probability
          estimate exceeds threshold, predict 1, else 0

     _a_d_a ada

     _l_v_q_I lvqtest after building codebook with lvqinit and updating
          with olvq1.  You will need to write your own detailed schema
          if you want to tweak tuning parameters.

     _n_a_i_v_e_B_a_y_e_s_I naiveBayes

     _b_a_g_g_i_n_g_I bagging

     _s_l_d_a_I slda

     _r_d_a_I rda - you must supply the alpha and delta parameters to use
          this.  Typically cross-validation is used to select these. 
          See 'rdacvI' below.

     _r_d_a_c_v_I rda.cv.  This interface is complicated.  The typical use
          includes cross-validation internal to the rda.cv function. 
          That process searches a tuning parameter space and delivers
          an ordering on parameters. The interface selects the
          parameters by looking at all parameter configurations
          achieving the smallest min+1SE cv.error estimate, and taking
          the one among them that employed the -most- features
          (agnosticism). A final run of rda is then conducted with the
          tuning parameters set at that 'optimal' choice.  The bridge
          code can be modified to facilitate alternative choices of the
          parameters in use.  'plotXvalRDA' is an interface to the plot
          method for objects of class rdacv defined in package rda. 
          You can use xvalSpec("NOTEST") with this procedure to use all
          the samples to build the discriminator.

     _k_s_v_m_I ksvm

     _h_c_l_u_s_t_I(_d_i_s_t_M_e_t_h_o_d, _a_g_g_l_o_m_M_e_t_h_o_d) hclust - you must explicitly
          specify distance and agglomeration procedure.

     _k_m_e_a_n_s_I(_c_e_n_t_e_r_s, _a_l_g_o_r_i_t_h_m) kmeans - you must explicitly specify
          centers and algorithm name.
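
     As referenced in the knn.cvI and glmI.logistic items above, a
     minimal sketch of their use, assuming the crabs data from MASS and
     a training index vector kp as in the Examples below:

          # embedded leave-one-out CV: trainInd must be 1:N
          kcv = MLearn(sp~CW+RW, data=crabs, knn.cvI(k=3, l=0), 1:nrow(crabs))
          confuMat(kcv)
          # logistic schema: predict 1 when the fitted probability
          # exceeds the threshold; the binomial family must be supplied
          lg = MLearn(sp~CW+RW, data=crabs, glmI.logistic(threshold=0.5),
              kp, family=binomial)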

_V_a_l_u_e:

     Instances of classifierOutput or clusteringOutput

_A_u_t_h_o_r(_s):

     Vince Carey <stvjc@channing.harvard.edu>

_E_x_a_m_p_l_e_s:

     library(MASS)   # provides the crabs data
     data(crabs)
     set.seed(1234)
     kp = sample(1:200, size=120)
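     # kp holds 120 randomly chosen training indices; the other 80 crabs form the test set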
     rf1 = MLearn(sp~CW+RW, data=crabs, randomForestI, kp, ntree=600 )
     rf1
     nn1 = MLearn(sp~CW+RW, data=crabs, nnetI, kp, size=3, decay=.01 )
     nn1
     RObject(nn1)
     knn1 = MLearn(sp~CW+RW, data=crabs, knnI(k=3,l=2), kp)
     knn1
     names(RObject(knn1))
     dlda1 = MLearn(sp~CW+RW, data=crabs, dldaI, kp )
     dlda1
     names(RObject(dlda1))
     lda1 = MLearn(sp~CW+RW, data=crabs, ldaI, kp )
     lda1
     names(RObject(lda1))
     slda1 = MLearn(sp~CW+RW, data=crabs, sldaI, kp )
     slda1
     names(RObject(slda1))
     svm1 = MLearn(sp~CW+RW, data=crabs, svmI, kp )
     svm1
     names(RObject(svm1))
     ldapp1 = MLearn(sp~CW+RW, data=crabs, ldaI.predParms(method="debiased"), kp )
     ldapp1
     names(RObject(ldapp1))
     qda1 = MLearn(sp~CW+RW, data=crabs, qdaI, kp )
     qda1
     names(RObject(qda1))
     logi = MLearn(sp~CW+RW, data=crabs, glmI.logistic(threshold=0.5), kp, family=binomial ) # need family
     logi
     names(RObject(logi))
     rp2 = MLearn(sp~CW+RW, data=crabs, rpartI, kp)
     rp2
     ## recode data for RAB
     #nsp = ifelse(crabs$sp=="O", -1, 1)
     #nsp = factor(nsp)
     #ncrabs = cbind(nsp,crabs)
     #rab1 = MLearn(nsp~CW+RW, data=ncrabs, RABI, kp, maxiter=10)
     #rab1
     #
     # new approach to adaboost
     #
     ada1 = MLearn(sp ~ CW+RW, data = crabs, .method = adaI, 
         trainInd = kp, type = "discrete", iter = 200)
     ada1
     confuMat(ada1)
     #
     lvq.1 = MLearn(sp~CW+RW, data=crabs, lvqI, kp )
     lvq.1
     nb.1 = MLearn(sp~CW+RW, data=crabs, naiveBayesI, kp )
     confuMat(nb.1)
     bb.1 = MLearn(sp~CW+RW, data=crabs, baggingI, kp )
     confuMat(bb.1)
     #
     # ExpressionSet illustration
     # 
     data(sample.ExpressionSet)
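     # train on the first 16 of the 26 samples; the remainder are the test set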
     X = MLearn(type~., sample.ExpressionSet[100:250,], randomForestI, 1:16, importance=TRUE )
     library(randomForest)
     library(hgu95av2.db)
     opar = par(no.readonly=TRUE)
     par(las=2)
     plot(getVarImp(X), n=10, plat="hgu95av2", toktype="SYMBOL")
     par(opar)
     #
     # demonstrate cross validation
     #
     nn1cv = MLearn(sp~CW+RW, data=crabs[c(1:20,101:120),], nnetI, xvalSpec("LOO"), size=3, decay=.01 )
     confuMat(nn1cv)
     nn2cv = MLearn(sp~CW+RW, data=crabs[c(1:20,101:120),], nnetI, 
        xvalSpec("LOG",5, balKfold.xvspec(5)), size=3, decay=.01 )
     confuMat(nn2cv)
     nn3cv = MLearn(sp~CW+RW+CL+BD+FL, data=crabs[c(1:20,101:120),], nnetI, 
        xvalSpec("LOG",5, balKfold.xvspec(5), fsFun=fs.absT(2)), size=3, decay=.01 )
     confuMat(nn3cv)
     nn4cv = MLearn(sp~.-index-sex, data=crabs[c(1:20,101:120),], nnetI, 
        xvalSpec("LOG",5, balKfold.xvspec(5), fsFun=fs.absT(2)), size=3, decay=.01 )
     confuMat(nn4cv)
     #
     # try with expression data
     #
     library(golubEsets)
     data(Golub_Train)
     litg = Golub_Train[ 100:150, ]
     g1 = MLearn(ALL.AML~. , litg, nnetI, xvalSpec("LOG",5, balKfold.xvspec(5), fsFun=fs.probT(.75)), size=3, decay=.01 )
     confuMat(g1)
     #
     # illustrate rda.cv interface from package rda (requiring local bridge)
     #
     library(ALL)
     data(ALL)
     #
     # restrict to BCR/ABL or NEG
     #
     bio <- which( ALL$mol.biol %in% c("BCR/ABL", "NEG"))
     #
     # restrict to B-cell
     #
     isb <- grep("^B", as.character(ALL$BT))
     kp <- intersect(bio,isb)
     all2 <- ALL[,kp]
     mads = apply(exprs(all2),1,mad)
     kp = which(mads>1)  # get around 250 genes
     vall2 = all2[kp, ]
     vall2$mol.biol = factor(vall2$mol.biol) # drop unused levels

     r1 = MLearn(mol.biol~., vall2, rdacvI, 1:40)
     confuMat(r1)
     RObject(r1)
     plotXvalRDA(r1)  # special interface to plots of parameter space

     # illustrate clustering support

     cl1 = MLearn(~CW+RW+CL+FL+BD, data=crabs, hclustI(distFun=dist, cutParm=list(k=4)))
     plot(cl1)

     cl1a = MLearn(~CW+RW+CL+FL+BD, data=crabs, hclustI(distFun=dist, cutParm=list(k=4)), 
        method="complete")
     plot(cl1a)

     cl2 = MLearn(~CW+RW+CL+FL+BD, data=crabs, kmeansI, centers=5, algorithm="Hartigan-Wong")
     plot(cl2, crabs[,-c(1:3)])

     c3 = MLearn(~CL+CW+RW, crabs, pamI(dist), k=5)
     c3
     plot(c3, data=crabs[,c("CL", "CW", "RW")])

