mdqc                  package:mdqc                  R Documentation

_M_D_Q_C: _M_a_h_a_l_a_n_o_b_i_s _D_i_s_t_a_n_c_e _Q_u_a_l_i_t_y _C_o_n_t_r_o_l

_D_e_s_c_r_i_p_t_i_o_n:

     MDQC is a multivariate quality assessment method for microarrays
     based on quality control (QC) reports.

_U_s_a_g_e:

     mdqc(x, method=c("nogroups", "apriori", "global", "cluster", "loading"),
          groups=NULL, k=NULL, pc=NULL,
          robust=c("S-estimator","MCD", "MVE"), nsamp=10*nrow(x))

_A_r_g_u_m_e_n_t_s:

       x: a numeric matrix or data frame containing the quality
          measures (columns) for each array (rows). The number of rows
          must exceed the number of columns.

  method: The Mahalanobis Distances (MDs) can be computed on all the
          quality measures in the QC report ('"nogroups"', the
          default), on the first 'pc' principal components resulting
          from a principal component analysis (PCA) of the QC report
          ('"global"'), or on subsets of quality measures in the QC
          report ('"apriori"': groups defined by the user,
          '"cluster"': groups resulting from a cluster analysis, or
          '"loading"': groups resulting from a cluster analysis in
          the space of the loadings of a PCA). The '"nogroups"' and
          '"global"' methods compute a single MD for each array; the
          other three compute one MD within each group of quality
          measures.

  groups: A list specifying the groups of quality measures when the
          apriori method is chosen.  E.g. 'groups = list(c(1,2),
          c(4,6))' puts columns 1 and 2 in one group and columns 4 and
          6 in a second.

       k: An integer specifying the number of clusters (or groups) to
          be used in the cluster analysis when the cluster or loading
          method is chosen.

      pc: An integer specifying the number of principal components
          retained from the PCA when the global or loading method is
          chosen.

  robust: A robust multivariate location/spread estimator (choice of
          S-estimator, MCD or MVE). The default method uses
          S-estimators with a 25% breakdown point.

   nsamp: The number of subsamples that the robust estimator should
          use. This defaults to 10 times the number of rows in the
          matrix.
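
     As an informal illustration of the argument requirements above,
     the following sketch builds a small synthetic QC report and checks
     it before it would be passed to 'mdqc'. The matrix 'qc', its
     dimension names and the particular groups are hypothetical and
     not part of the package.

```r
## Hypothetical QC report: 20 arrays (rows) by 6 quality measures (columns).
set.seed(1)
qc <- matrix(rnorm(20 * 6), nrow = 20, ncol = 6,
             dimnames = list(paste0("array", 1:20), paste0("measure", 1:6)))

## x must be numeric, with more rows than columns.
stopifnot(is.numeric(qc), nrow(qc) > ncol(qc))

## Groups for method = "apriori": a list of column-index vectors.
groups <- list(c(1, 2), c(4, 6))
stopifnot(all(unlist(groups) %in% seq_len(ncol(qc))))

## Default number of subsamples, as given in Usage: 10 * nrow(x).
nsamp <- 10 * nrow(qc)
nsamp  # 200

## With the mdqc package loaded, the call would then look like:
## mdqc(qc, method = "apriori", groups = groups, nsamp = nsamp)
```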

_D_e_t_a_i_l_s:

     MDQC flags potentially low quality arrays based on the idea of
     outlier detection, that is, it flags those arrays whose quality
     attributes jointly depart from those of the bulk of the data.

     This function computes a distance measure, the Mahalanobis
     Distance, to summarize the quality of each array.  The use of this
     distance allows us to perform a multivariate analysis of the
     information in QC reports taking the correlation structure of the
     quality measures into account. In addition, by using robust
     estimators to identify the typical quality measures of
     good-quality arrays, the evaluation is not affected by the
     measures of outlying arrays.
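
     The effect of robust estimation can be sketched outside the
     package as follows. This is only an illustration under stated
     assumptions: it uses 'cov.rob' from the MASS package with the MCD
     estimator (not the package's default S-estimator), synthetic data,
     and a chi-squared cutoff chosen for the example rather than taken
     from 'mdqc'.

```r
library(MASS)  # cov.rob() provides MCD and MVE estimators

set.seed(42)
## Synthetic QC report: 50 arrays, 3 correlated quality measures.
n <- 50; p <- 3
z <- matrix(rnorm(n * p), n, p)
x <- z %*% chol(matrix(c(1.0, 0.6, 0.3,
                         0.6, 1.0, 0.5,
                         0.3, 0.5, 1.0), p, p))

## Make the last 3 arrays outlying in all measures.
bad <- 48:50
x[bad, ] <- x[bad, ] + 10

## Classical estimates are pulled towards the outlying arrays.
## mahalanobis() returns squared distances, hence the sqrt().
md_classical <- sqrt(mahalanobis(x, colMeans(x), cov(x)))

## Robust (here MCD) estimates describe the bulk of the data instead.
rob <- cov.rob(x, method = "mcd")
md_robust <- sqrt(mahalanobis(x, rob$center, rob$cov))

## Illustrative cutoff (an assumption, not necessarily the package's rule):
## square root of the 0.99 quantile of a chi-squared with p degrees of freedom.
cutoff <- sqrt(qchisq(0.99, df = p))
which(md_robust > cutoff)

## Compare both distances for the outlying arrays.
cbind(classical = md_classical[bad], robust = md_robust[bad])
```

     In this sketch the outlying arrays inflate the classical
     covariance estimate, so their classical distances are much
     smaller than their robust ones.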

     MDQC can be based on all the quality measures simultaneously
     (using 'method="nogroups"'), on subsets of them (using
     'method="apriori"', '"cluster"', or '"loading"'), or on a
     transformed space with a lower dimension (using
     'method="global"').

     In the apriori approach the user forms groups of quality
     measures on the basis of an a priori interpretation of them and
     according to the quality aspect they represent.  The cluster and
     the loading methods are two data-driven methods to form the
     groups. The former groups the quality measures using clustering
     analysis, and the latter uses the loadings of a principal
     component analysis to identify the quality measures that contain
     similar information and group them. It is important to note that
     the apriori, the cluster, and the loading methods create
     groups of the original quality measures of the report and compute
     one MD within each group. Finally, the global method computes a
     single MD based on the reduced space of the first 'pc' principal
     components from a robust PCA. The number of principal components
     can be chosen using a scree plot.
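
     The difference between the group-wise and the global approaches
     can be sketched as below. This is a simplified illustration, not
     the package's implementation: the helper 'robust_md' is a
     hypothetical name, the data are synthetic, the MCD estimator from
     MASS stands in for the S-estimator, and classical 'prcomp' stands
     in for the robust PCA.

```r
library(MASS)  # cov.rob() for robust location and scatter

set.seed(7)
## Hypothetical QC report: 40 arrays, 6 quality measures.
x <- matrix(rnorm(40 * 6), nrow = 40, ncol = 6)
x[40, 1:2] <- x[40, 1:2] + 8   # array 40 is unusual in measures 1 and 2 only

## Hypothetical helper: robust MD of each row of a numeric matrix.
robust_md <- function(m) {
  est <- cov.rob(m, method = "mcd")
  sqrt(mahalanobis(m, est$center, est$cov))
}

## Group-wise idea: one MD per array within each group of columns.
groups <- list(1:2, 3:4, 5:6)
md_by_group <- sapply(groups, function(g) robust_md(x[, g, drop = FALSE]))
dim(md_by_group)             # 40 arrays by 3 groups
which.max(md_by_group[, 1])  # the array that stands out in the first group

## Global idea: a single MD on the scores of the first 'pc' components.
## prcomp() is classical PCA, used here only as a stand-in for robust PCA.
pc <- 2
pca <- prcomp(x, center = TRUE, scale. = TRUE)
pca$sdev^2                   # the variances a scree plot would display
md_global <- robust_md(pca$x[, seq_len(pc), drop = FALSE])
length(md_global)            # one MD per array
```

     Because the shift in array 40 is confined to the first group, the
     group-wise distances show which quality aspect is affected, while
     the global distance gives a single summary per array.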

     More details on each method are given in Cohen Freue et al. (2007).

_V_a_l_u_e:

     An object of class 'mdqc' (with associated 'plot', 'print' and
     'summary' methods) with the following components:

 ngroups: Number of groups in which the MDs have been computed

  groups: column numbers corresponding to the quality measures in each
          group

mdqcValues: Mahalanobis Distance(s) for each array

       x: dataset containing the numeric quality measures in the report

  method: method used to group or transform the quality measures before
          computing the MD for each array

      pc: number of principal components used in the robust PCA.

       k: number of clusters used in the cluster analysis.

_N_o_t_e:

     We thank Christopher Croux for providing us with MATLAB code that
     we translated into R to compute the multivariate S-estimator.

_A_u_t_h_o_r(_s):

     Justin Harrington harringt@stat.ubc.ca and Gabriela V. Cohen Freue
     gcohen@stat.ubc.ca.

_R_e_f_e_r_e_n_c_e_s:

     Cohen Freue, G. V. and Hollander, Z. and Shen, E. and Zamar, R. H.
     and Balshaw, R. and Scherer, A. and McManus, B. and Keown, P. and
     McMaster, W. R. and Ng, R. T. (2007) MDQC: A New Quality
     Assessment Method for Microarrays Based on Quality Control
     Reports. _Bioinformatics_ *23*, 3162-3169.

     Bolstad, B. M. and Collin, F. and Brettschneider, J. and Simpson,
     K. and Cope, L. and Irizarry, R. A. and Speed, T. P. (2005)
     Quality assessment of Affymetrix GeneChip data. In Gentleman, R.
     and Carey, V. J. and Huber, W. and Irizarry, R. A. and Dudoit, S.
     (eds.) _Bioinformatics and Computational Biology Solutions Using
     R and Bioconductor_. New York: Springer.

     Brettschneider, J. and Collin, F. and Bolstad, B. M. and Speed, T.
     P. (2007) Quality assessment for short oligonucleotide arrays.
     Forthcoming in _Technometrics (with Discussion)_.

     Ross, M. E. and Zhou, X. and Song, G. and Shurtleff, S. A. and
     Girtman, K. and Williams, W. K. and Liu, H. and Mahfouz, R. and
     Raimondi, S. C. and Lenny, N. and Patel, A. and Downing, J. R.
     (2003) Classification of pediatric acute lymphoblastic leukemia
     by gene expression profiling. _Blood_ *102*, 2951-9.

_S_e_e _A_l_s_o:

     'prcomp.robust', 'pam', 'mahalanobis', 'allQC'

_E_x_a_m_p_l_e_s:

     data(allQC)

     ## Contains the QC report obtained using Bioconductor's simpleaffy package
     ## for a subset of arrays from a large acute lymphoblastic leukemia (ALL)
     ## study (Ross et al., 2003).
     ## This dataset has also been studied by Bolstad et al. (2005) and
     ## Brettschneider et al. (2007).
     ## For further information see allQC.

     #### No Groups method
     # Figure 2 in Cohen Freue et al. (2007):
     # Results of MDQC based on all measures of the QC report.

     mdout <- mdqc(allQC, method="nogroups")
     plot(mdout)
     print(mdout)
     summary(mdout)

     #### A-Priori grouping method
     # Figure 3 in Cohen Freue et al. (2007):
     # Results of MDQC using the apriori grouping method.

     mdout <- mdqc(allQC, method="apriori", groups=list(1:5, 6:9, 10:11))
     plot(mdout)


     #### Global PCA method
     # Figure 4 in Cohen Freue et al. (2007):
     # Results of MDQC using the global PCA method.

     mdout <- mdqc(allQC, method="global", pc=4)
     plot(mdout)


     #### Clustering grouping method
     # Figure 4 in Supplementary Material of Cohen Freue et al. (2007):
     # Results of MDQC using a cluster analysis to form
     # 3 groups of quality measures.

     mdout <- mdqc(allQC, method="cluster", k=3)
     plot(mdout)


     #### Loading grouping method
     # Figure 4 in Supplementary Material of Cohen Freue et al. (2007):
     # Results of MDQC using a cluster analysis on the first
     # pc=4 loading vectors from a robust PCA to form k=3 groups of
     # quality measures.

     mdout <- mdqc(allQC, method="loading", k=3, pc=4)
     plot(mdout)

     ### To get the raw Mahalanobis Distances
     mdout$mdqcValues

