1 Introduction

MTBLS79: Direct infusion mass spectrometry metabolomics dataset: a benchmark for data processing and quality control.

Direct-infusion mass spectrometry (DIMS) metabolomics is an important approach for characterising molecular responses of organisms to disease, drugs and the environment. Increasingly large-scale metabolomics studies are being conducted, necessitating improvements in both bioanalytical and computational workflows to maintain data quality. This dataset represents a systematic evaluation of the reproducibility of a multi-batch DIMS metabolomics study of cardiac tissue extracts. It comprises twenty biological samples (cow vs. sheep) that were analysed repeatedly, in 8 batches across 7 days, together with a concurrent set of quality control (QC) samples. Data are presented from each step of the workflow and are available in MetaboLights (https://www.ebi.ac.uk/metabolights/MTBLS79).

The MTBLS79_DatasetExperiment() object provided by the structToolbox package is a partially processed version of the raw data available in the pmp package. This vignette describes how the structToolbox version was created from the pmp version.

The filters applied here implement those described by the authors of the data in Kirwan et al. (https://europepmc.org/article/MED/25977770). The pmp package uses ‘Dataset 7:SFPM’ from the article.

2 Data Format

The struct package extends SummarizedExperiment objects to provide some additional functionality and, unlike SummarizedExperiment, stores the data with samples in rows and features in columns.
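The difference in orientation can be illustrated with a plain matrix. The following is a minimal base R sketch using a made-up matrix of 3 features and 2 samples; the object names are illustrative only and are not part of struct or SummarizedExperiment.

```r
# Hypothetical toy data in a SummarizedExperiment-style layout:
# features in rows, samples in columns
features_by_samples = matrix(
  1:6,
  nrow = 3,
  dimnames = list(c('f1', 'f2', 'f3'), c('s1', 's2'))
)

# DatasetExperiment-style layout: samples in rows, features in columns
samples_by_features = t(features_by_samples)

dim(features_by_samples)  # features x samples
dim(samples_by_features)  # samples x features
```

The values are unchanged by the transpose; only the roles of rows and columns are swapped.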

To work with structToolbox objects, the data must be converted from SummarizedExperiment to DatasetExperiment. Some additional meta-data columns are also needed for some of the processing steps.

# load the pmp package
library(pmp)
# load structToolbox package
library(structToolbox)
# some graphics packages
library(ggplot2)

# the pmp SE object
SE = MTBLS79

# convert to DE
DE = as.DatasetExperiment(SE)
DE$name = 'MTBLS79'
DE$description = 'Converted from SE provided by the pmp package'

# add a column indicating the order the samples were measured in
DE$sample_meta$run_order = 1:nrow(DE)

# add a column indicating if the sample is biological or a QC
Type=as.character(DE$sample_meta$Class)
Type[Type != 'QC'] = 'Sample'
DE$sample_meta$Type = factor(Type)

# convert to factors
DE$sample_meta$Batch = factor(DE$sample_meta$Batch)
DE$sample_meta$Class = factor(DE$sample_meta$Class)


# print summary
DE
## A "DatasetExperiment" object
## ----------------------------
## name:          MTBLS79
## description:   Converted from SE provided by the pmp package
## data:          172 rows x 2488 columns
## sample_meta:   172 rows x 6 columns
## variable_meta: 2488 rows x 0 columns

Full processing of the data set requires a number of steps. These will be applied using a single struct model sequence (model.seq).

3 Batch Correction

A batch correction algorithm is applied to reduce intra- and inter-batch variation in the dataset. Quality Control-Robust Spline Correction (QC-RSC) is provided in the pmp package, and it has been wrapped into a structToolbox object called sb_corr.

# batch correction
M = sb_corr(
      order_col='run_order',
      batch_col='Batch',
      qc_col='Type',
      qc_label='QC'
    )

M = model_apply(M,DE)
## The number of NA and <= 0 values in peaksData before QC-RSC: 18222

The figure below shows a plot of a feature vs run order, before and after the correction. It can be seen that the correction has removed instrument drift within and between batches.

C = feature_profile(
      run_order='run_order',
      qc_label='QC',
      qc_column='Type',
      colour_by='Batch',
      feature_to_plot='200.03196'
  )

# plot and modify using ggplot2 
chart_plot(C,DE)+ylab('Peak area')+ggtitle('Before')

chart_plot(C,predicted(M))+ylab('Peak area')+ggtitle('After')
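The idea behind a QC-based drift correction can be sketched in base R. The example below is a simplified illustration and not the QC-RSC algorithm implemented in pmp: it simulates one feature in one batch, fits `stats::smooth.spline` to the QC injections only, and divides the estimated trend out of every injection. The simulated data, the QC spacing and the rescaling to the median QC intensity are all assumptions made for the example.

```r
set.seed(1)

# Hypothetical single feature measured over 40 injections in one batch;
# for simplicity every injection shares the same true intensity
run_order = 1:40
is_qc = run_order %% 5 == 1           # every fifth injection is a QC
drift = 1 + 0.02 * run_order          # simulated instrument drift
intensity = 1000 * drift * exp(rnorm(40, sd = 0.02))

# Fit a smoothing spline to the QC intensities only
fit = smooth.spline(run_order[is_qc], intensity[is_qc])

# Estimate the drift at every injection and divide it out,
# rescaling to the median QC intensity
trend = predict(fit, run_order)$y
corrected = intensity / trend * median(intensity[is_qc])

# Relative standard deviation (%) of the QCs before and after correction
rsd = function(x) 100 * sd(x) / mean(x)
rsd(intensity[is_qc])
rsd(corrected[is_qc])
```

Because the spline is fitted to the QC samples alone, the correction depends on having enough QC measurements in each batch.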

An additional step is added to the published workflow to remove any feature that QC-RSC could not correct. This can occur when there are not enough measured QC values within a batch; in that case QC-RSC in the pmp package currently returns NA for every sample of the affected feature. Such features are excluded.

M2 = filter_na_count(
      threshold=3,
      factor_name='Batch'
    )
M2 = model_apply(M2,predicted(M))

# calculate number of features removed
nc = ncol(DE) - ncol(predicted(M2))

cat(paste0('Number of features removed: ', nc))
## Number of features removed: 425
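A per-batch missing value filter of this kind can be sketched in base R. The sketch assumes that the threshold is the minimum number of non-missing values a feature must have in every batch; this reading of `threshold` is an assumption, so consult the filter_na_count documentation for the exact rule. The toy matrix and the threshold of 2 are made up for the example.

```r
# Hypothetical toy matrix: 6 samples (rows) x 3 features (columns)
X = matrix(
  c(1,  2,  3,  4,  5,  6,    # f1: fully measured
    NA, NA, NA, 4,  5,  6,    # f2: missing for all of batch 1
    1,  NA, 3,  4,  5,  NA),  # f3: one value missing per batch
  ncol = 3,
  dimnames = list(NULL, c('f1', 'f2', 'f3'))
)
batch = factor(c(1, 1, 1, 2, 2, 2))
threshold = 2  # assumed meaning: minimum non-missing values per batch

# Count the non-missing values of each feature within each batch
n_present = apply(X, 2, function(x) tapply(!is.na(x), batch, sum))

# Keep features that reach the threshold in every batch
keep = apply(n_present >= threshold, 2, all)
X_filtered = X[, keep, drop = FALSE]
colnames(X_filtered)
```

Here f2 is dropped because it has no measured values in batch 1, which mirrors the case where the correction returns NA for a feature.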

The output of this step is the output of MTBLS79_DatasetExperiment(filtered=FALSE).

4 Filtering

In the journal article, three spectral cleaning algorithms are applied. The first filter uses a Kruskal-Wallis test to identify features that are not reliably detected in the QC samples of all batches (p < 0.0001).

M3 = kw_rank_sum(
      alpha=0.0001,
      mtc='none',
      factor_names='Batch',
      predicted='significant'
    ) +
    filter_by_name(
      mode='exclude',
      dimension = 'variable',
      seq_in = 'names', 
      names=character(0),
      seq_fcn=function(x){return(x[,1])}
    )
M3 = model_apply(M3, predicted(M2))

nc = ncol(predicted(M2)) - ncol(predicted(M3))
cat(paste0('Number of features removed: ', nc))
## Number of features removed: 262

To use a univariate test such as kw_rank_sum as a filter, some advanced features of struct are needed. The slots predicted and seq_in ensure that the correct output of the univariate test is connected to the correct input of the feature filter filter_by_name. A further slot, seq_fcn, extracts the relevant column of the predicted output so that it is compatible with the seq_in input.
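The same "test, then exclude by name" pattern can be sketched without struct. The example below uses base R only (`stats::kruskal.test`) on simulated data; the feature names, group sizes and effect size are made up, and the code is an illustration of the pattern rather than of the kw_rank_sum internals.

```r
set.seed(42)

# Hypothetical toy data: 60 samples (rows) x 2 features (columns), 3 batches
batch = factor(rep(c('B1', 'B2', 'B3'), each = 20))
X = cbind(
  stable  = rnorm(60, mean = 100, sd = 5),
  shifted = rnorm(60, mean = 100, sd = 5) + 50 * (batch == 'B3')
)

# Kruskal-Wallis p-value for each feature across batches
p_values = apply(X, 2, function(x) kruskal.test(x, batch)$p.value)

# Names of the significant features, which are then excluded by name
alpha = 0.0001
to_remove = names(p_values)[p_values < alpha]
X_filtered = X[, !(colnames(X) %in% to_remove), drop = FALSE]
colnames(X_filtered)
```

The vector `to_remove` plays the role of the names passed from the test output to the name-based filter.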

The second filter is a Wilcoxon Signed-Rank test. It is used to identify features that are not representative of the average of the biological samples (p < 1e-14).

M4 = wilcox_test(
      alpha=1e-14,
      factor_names='Type', 
      mtc='none', 
      predicted = 'significant'
    ) +
    filter_by_name(
      mode='exclude',
      dimension='variable',
      seq_in='names', 
      names=character(0)
    )
M4 = model_apply(M4, predicted(M3))

nc = ncol(predicted(M3)) - ncol(predicted(M4))
cat(paste0('Number of features removed: ', nc))
## Number of features removed: 162

Finally, an RSD filter is used to remove features with high analytical variation (features with a QC RSD > 20% are removed).

M5 = rsd_filter(
     rsd_threshold=20,
     factor_name='Type'
)
M5 = model_apply(M5,predicted(M4))

nc = ncol(predicted(M4)) - ncol(predicted(M5))
cat(paste0('Number of features removed: ', nc))
## Number of features removed: 22
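The relative standard deviation used by this kind of filter is the standard deviation expressed as a percentage of the mean, calculated per feature across the QC samples. A minimal base R sketch with made-up QC values follows; the feature names are illustrative.

```r
# Hypothetical toy data: 4 QC samples (rows) x 2 features (columns)
qc = cbind(
  precise = c(100, 102, 98, 100),
  noisy   = c(100, 150, 60, 90)
)

# Relative standard deviation (%) of each feature across the QC samples
rsd = apply(qc, 2, function(x) 100 * sd(x) / mean(x))

# Keep features whose QC RSD does not exceed the threshold
rsd_threshold = 20
keep = rsd <= rsd_threshold
round(rsd, 1)
names(rsd)[keep]
```

Only the QC samples are used to estimate the RSD, because they are replicates of the same material and so reflect analytical rather than biological variation.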

The output of this filter is the output of MTBLS79_DatasetExperiment(filtered=TRUE).

5 Peak Matrix Processing

For completeness, the filtered matrix is processed and analysed in a similar way to the published workflow, so that the results can be compared with the published outputs.

The filtering steps are followed by some peak matrix processing steps that are frequently applied in metabolomics:

  • Probabilistic Quotient Normalisation (PQN)
  • k-nearest neighbours imputation (k = 5)
  • Generalised log transform (glog)

These steps prepare the data for multivariate analysis by accounting for sample concentration differences, imputing missing values and scaling the data.

# peak matrix processing
M6 = pqn_norm(qc_label='QC',factor_name='Type') + 
     knn_impute(neighbours=5) +
     glog_transform(qc_label='QC',factor_name='Type')
M6 = model_apply(M6,predicted(M5))
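To show what the normalisation step does, PQN can be sketched in base R. This is a simplified illustration and not the pmp implementation: it assumes the reference spectrum is the feature-wise median of the QC samples, and the toy matrix is made up so that the biological samples differ from the QCs only by dilution.

```r
# Hypothetical toy data: 4 samples (rows) x 3 features (columns);
# the first two rows play the role of QC samples
X = rbind(
  qc1 = c(10, 20, 30),
  qc2 = c(10, 20, 30),
  s1  = c(20, 40, 60),   # same profile, twice as concentrated
  s2  = c( 5, 10, 15)    # same profile, half as concentrated
)
is_qc = c(TRUE, TRUE, FALSE, FALSE)

# Reference spectrum: feature-wise median of the QC samples (assumption)
reference = apply(X[is_qc, , drop = FALSE], 2, median, na.rm = TRUE)

# Dilution coefficient per sample: median of the feature-wise quotients
quotients = sweep(X, 2, reference, '/')
coefficient = apply(quotients, 1, median, na.rm = TRUE)

# Divide each sample by its coefficient
X_pqn = sweep(X, 1, coefficient, '/')
X_pqn
```

After normalisation every row of the toy matrix has the same profile, because the only difference between the samples was their overall concentration.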

6 Exploratory Analysis

Principal Component Analysis (PCA) can be used to visualise high-dimensional data. It is an unsupervised method that maximises variance in a reduced number of latent variables, or principal components.

# PCA
M7  = mean_centre() + PCA(number_components = 2)

# apply model sequence to data
M7 = model_apply(M7,predicted(M6))

# plot pca scores
C = pca_scores_plot(factor_name=c('Sample_Rep','Class'),ellipse='none')
# hide the colour legend ('none' replaces the deprecated FALSE)
chart_plot(C,M7[2]) + coord_fixed() + guides(colour='none')

This plot is similar to Figure 3b of the original publication. Sample replicates are represented by colours and sample groups by different shapes.
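The mean-centring and PCA steps can also be sketched with base R alone, which may help when checking the scores outside struct. The example uses `stats::prcomp` on simulated data; the group labels and effect size are made up, and the code is not the structToolbox PCA implementation.

```r
set.seed(7)

# Hypothetical toy data: 20 samples (rows) x 5 features (columns), two groups
group = factor(rep(c('A', 'B'), each = 10))
X = matrix(rnorm(100), nrow = 20)
X[group == 'B', 1] = X[group == 'B', 1] + 5  # group difference in feature 1

# Mean-centre each feature, then compute a two-component PCA
X_centred = scale(X, center = TRUE, scale = FALSE)
pca = prcomp(X_centred, center = FALSE, scale. = FALSE, rank. = 2)

# Scores (samples x components) and proportion of variance explained
scores = pca$x
explained = pca$sdev^2 / sum(pca$sdev^2)
dim(scores)
```

The columns of `scores` correspond to the values plotted on the axes of a PCA scores plot.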

7 Session Information

sessionInfo()
## R version 4.0.0 alpha (2020-03-31 r78116)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows Server 2012 R2 x64 (build 9600)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=C                          
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_3.3.0        structToolbox_0.99.8 struct_0.99.9       
## [4] pmp_0.99.3           BiocStyle_2.15.6    
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.4                  lattice_0.20-41            
##  [3] assertthat_0.2.1            digest_0.6.25              
##  [5] foreach_1.5.0               R6_2.4.1                   
##  [7] GenomeInfoDb_1.23.16        plyr_1.8.6                 
##  [9] stats4_4.0.0                evaluate_0.14              
## [11] pillar_1.4.3                itertools_0.1-3            
## [13] zlibbioc_1.33.1             rlang_0.4.5                
## [15] magick_2.3                  S4Vectors_0.25.15          
## [17] Matrix_1.2-18               missForest_1.4             
## [19] rmarkdown_2.1               labeling_0.3               
## [21] stringr_1.4.0               RCurl_1.98-1.1             
## [23] munsell_0.5.0               DelayedArray_0.13.10       
## [25] compiler_4.0.0              xfun_0.12                  
## [27] pkgconfig_2.0.3             BiocGenerics_0.33.3        
## [29] pcaMethods_1.79.1           htmltools_0.4.0            
## [31] tidyselect_1.0.0            SummarizedExperiment_1.17.5
## [33] gridExtra_2.3               tibble_3.0.0               
## [35] GenomeInfoDbData_1.2.2      bookdown_0.18              
## [37] IRanges_2.21.8              codetools_0.2-16           
## [39] matrixStats_0.56.0          randomForest_4.6-14        
## [41] fansi_0.4.1                 withr_2.1.2                
## [43] crayon_1.3.4                dplyr_0.8.5                
## [45] bitops_1.0-6                grid_4.0.0                 
## [47] ontologyIndex_2.5           gtable_0.3.0               
## [49] lifecycle_0.2.0             magrittr_1.5               
## [51] scales_1.1.0                cli_2.0.2                  
## [53] stringi_1.4.6               impute_1.61.0              
## [55] farver_2.0.3                XVector_0.27.2             
## [57] reshape2_1.4.3              ggthemes_4.2.0             
## [59] sp_1.4-1                    ellipsis_0.3.0             
## [61] vctrs_0.2.4                 iterators_1.0.12           
## [63] tools_4.0.0                 Biobase_2.47.3             
## [65] glue_1.4.0                  purrr_0.3.3                
## [67] parallel_4.0.0              yaml_2.2.1                 
## [69] colorspace_1.4-1            BiocManager_1.30.10        
## [71] GenomicRanges_1.39.3        knitr_1.28