labelstomss              package:hopach              R Documentation

_F_u_n_c_t_i_o_n_s _t_o _c_o_m_p_u_t_e _s_i_l_h_o_u_e_t_t_e_s _a_n_d _s_p_l_i_t _s_i_l_h_o_u_e_t_t_e_s

_D_e_s_c_r_i_p_t_i_o_n:

     Silhouettes measure how well an element belongs to its cluster,
     and the average silhouette measures the strength of cluster
     membership overall.  The Median (or Mean) Split Silhouette (MSS)
     is a measure of cluster heterogeneity. Given a partitioning of
     elements into groups, the MSS algorithm considers each group
     separately and computes the split silhouette for that group, which
     evaluates evidence in favor of further splitting the group. If the
     median (or mean) split silhouette over all groups in the partition
     is low, the groups are homogeneous.

_U_s_a_g_e:

     labelstomss(labels, dist, khigh = 9, within = "med", between = "med", 
     hierarchical = TRUE)

     labelstosil(labels, dist)

     medstosil(medoids, dist)

     msscheck(dist, kmax = 9, khigh = 9, within = "med", between = "med", 
         force = FALSE, echo = FALSE, graph = FALSE)

     silcheck(data, kmax = 9, diss = FALSE, echo = FALSE, graph = FALSE)

_A_r_g_u_m_e_n_t_s:

  labels: vector of cluster labels for each element in the set.

    dist: numeric distance matrix containing the pairwise distances
          between all elements. All values must be numeric and missing
          values are not allowed.

 medoids: a vector indicating the rows/cols of 'dist' that are the
          cluster medoids, i.e. profiles (or centroids) for each
          cluster.

    data: a data matrix. Each column corresponds to an observation, and
          each row corresponds to a variable. In the gene expression
          context, observations are arrays and variables are genes. All
          values must be numeric. Missing values are ignored. In
          'silcheck', 'data' may also be a distance matrix or
          dissimilarity object if the argument 'diss=TRUE'.

   khigh: integer between 1 and 9 specifying the maximum number of 
          children for each cluster when computing MSS.

    kmax: integer between 1 and 9 specifying the maximum number of
          clusters to consider. Can be different from khigh, though
          typically these are the same value.

  within: character string indicating how to compute the split
          silhouette for each cluster. The available options are "med"
          (median over all elements in the cluster) or "mean" (mean
          over all elements in the cluster).

 between: character string indicating how to compute the MSS over all
          clusters. The available options are "med" (median over all
          clusters) or "mean" (mean over all clusters). Recommended to
          use the same value as 'within'.

hierarchical: logical indicating if 'labels' should be treated as
          encoding a hierarchical tree, e.g. from HOPACH.

   force: indicator of whether to require at least 2 clusters. If FALSE
          (default), a single cluster is also considered.

    echo: indicator of whether to print the selected number of clusters
          and corresponding MSS.

   graph: indicator of whether to generate a plot of MSS (or average
          silhouette in 'silcheck') versus number of clusters.

    diss: indicator of whether 'data' is a dissimilarity matrix (or
          dissimilarity object), as in the 'pam' function of the
          'cluster' package. If TRUE then 'data' will be considered as
          a dissimilarity matrix. If FALSE, then 'data' will be
          considered as a data matrix (observations by variables).

_D_e_t_a_i_l_s:

     The Median (or Mean) Split Silhouette (MSS) criterion is defined
     in paper107, listed in the references below. This criterion is
     based on the 'silhouette' criterion function proposed by Kaufman
     and Rousseeuw (1990). While average silhouette is a good global
     measure of cluster strength, MSS was developed to be more
     "aggressive" for finding small, homogeneous clusters in large data
     sets. MSS is a measure of average cluster homogeneity. The Median
     version is more robust than the Mean.
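
     As a rough illustration of the silhouette on which MSS is built,
     the base R sketch below computes, for each element, the quantity
     (b - a)/max(a, b) of Kaufman and Rousseeuw (1990), where a is the
     average distance to the other members of the element's own cluster
     and b is the smallest average distance to the members of any other
     cluster. The helper 'simple_sil' and the toy data are hypothetical
     and are not part of this package; the sketch does not compute MSS
     itself and is not meant to replace 'labelstosil'.

     #'simple_sil' is a hypothetical helper, not a hopach function
     simple_sil<-function(labels,d){
         d<-as.matrix(d)
         groups<-unique(labels)
         stopifnot(length(groups)>=2) #silhouettes need at least 2 clusters
         sil<-numeric(length(labels))
         for(i in seq_along(labels)){
             own<-labels==labels[i]
             if(sum(own)==1) next #singleton clusters keep silhouette 0
             #a: mean distance to the other members of the same cluster
             #(d[i,i] is 0, so it does not contribute to the sum)
             a<-sum(d[i,own])/(sum(own)-1)
             #b: smallest mean distance to the members of another cluster
             b<-min(sapply(setdiff(groups,labels[i]),
                           function(g) mean(d[i,labels==g])))
             sil[i]<-(b-a)/max(a,b)
         }
         sil
     }

     #toy data: two well separated groups on a line
     x<-c(0,1,2,10,11,12)
     s<-simple_sil(c(1,1,1,2,2,2),dist(x))
     mean(s) #average silhouette; values near 1 indicate strong membership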

_V_a_l_u_e:

     For 'labelstomss', the median (or mean or combination) split
     silhouette, depending on the values of 'within' and 'between'. 

     For 'medstosil' and 'labelstosil', a list with first component the
     cluster label for each element and second component the silhouette
     for that element. The average silhouette is simply the mean of the
     second component.

     For 'msscheck', a vector with first component the chosen number of
     clusters (minimizing MSS) and second component the corresponding
     MSS.

     For 'silcheck', a vector with first component the chosen number of
     clusters (maximizing average silhouette) and second component the
     corresponding average silhouette.

_A_u_t_h_o_r(_s):

     Katherine S. Pollard <kpollard@soe.ucsc.edu> and Mark J. van der
     Laan <laan@stat.berkeley.edu>

_R_e_f_e_r_e_n_c_e_s:

     <URL: http://www.bepress.com/ucbbiostat/paper107/>

     <URL: http://www.stat.berkeley.edu/~laan/Research/Research_subpages/Papers/jsmpaper.pdf>

     Kaufman, L. and Rousseeuw, P.J. (1990). Finding Groups in Data: An
     Introduction to Cluster Analysis. Wiley, New York.

_S_e_e _A_l_s_o:

     'pam', 'hopach', 'distancematrix'

_E_x_a_m_p_l_e_s:

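     library(hopach)
     library(cluster) #provides pam

     #two groups of elements (10 and 15 rows) in 3 dimensions
     mydata<-rbind(cbind(rnorm(10,0,0.5),rnorm(10,0,0.5),rnorm(10,0,0.5)),
                   cbind(rnorm(15,5,0.5),rnorm(15,5,0.5),rnorm(15,5,0.5)))
     mydist<-distancematrix(mydata,d="cosangle") #compute the distance matrix.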

     #pam
     result1<-pam(mydata,k=2)
     result2<-pam(mydata,k=5)
     #pam labels are flat, not a hierarchical tree, hence hierarchical=FALSE
     labelstomss(result1$clustering,mydist,hierarchical=FALSE)
     labelstomss(result2$clustering,mydist,hierarchical=FALSE)

     #hopach
     result3<-hopach(mydata,dmat=mydist)
     labelstomss(result3$clustering$labels,mydist)
     labelstomss(result3$clustering$labels,mydist,within="mean",between="mean")
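
     #choosing the number of clusters
     #A hedged sketch that reuses 'mydata' and 'mydist' from above and relies
     #only on the defaults shown under Usage. The names 'best.mss' and
     #'best.sil' are illustrative. Per the Value section, each result is
     #expected to hold the chosen number of clusters first and the
     #corresponding criterion value second.
     best.mss<-msscheck(mydist) #number of clusters minimizing MSS
     best.sil<-silcheck(mydata) #number of clusters maximizing average silhouette
     best.mss
     best.sil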

