1 Introduction

The aim of this vignette is to reproduce some of the outputs from the first tutorial of “Toward collaborative open data science in metabolomics using Jupyter Notebooks and cloud computing” by Mendez et al. (https://link.springer.com/article/10.1007/s11306-019-1588-0). Instead of Python, the Bioconductor package structToolbox is used in R.

1.1 Data preparation

The data for the original tutorial can be found on GitHub (https://github.com/CIMCB/MetabWorkflowTutorial). It is provided as an Excel sheet and needs to be reorganised into a DatasetExperiment object to be compatible with structToolbox. Using the openxlsx package, the file can be read directly into an R data.frame and then manipulated as required.

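# structToolbox provides DatasetExperiment and the model objects used below
library(structToolbox)
# ggplot2 (xlim, xlab) and cowplot (plot_grid) are used for the charts
library(ggplot2)
library(cowplot)
# openxlsx library
library(openxlsx)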
# read in file directly from github
X=read.xlsx('https://github.com/CIMCB/MetabWorkflowTutorial/raw/master/GastricCancer_NMR.xlsx')

# sample meta data
SM=X[,1:4]
rownames(SM)=SM$SampleID
# convert to factors
SM$SampleType=factor(SM$SampleType)
SM$Class=factor(SM$Class)
# keep a numeric version of class for regression
SM$Class_num = as.numeric(SM$Class)

## data matrix
# remove meta data
X[,1:4]=NULL
rownames(X)=SM$SampleID

# feature meta data
VM=data.frame(idx=1:ncol(X))
rownames(VM)=colnames(X)

# prepare DatasetExperiment
DE = DatasetExperiment(
    data=X,
    sample_meta=SM,
    variable_meta=VM,
    description='1H-NMR urinary metabolomic profiling for diagnosis of gastric cancer',
    name='Gastric cancer (NMR)')

DE
## A "DatasetExperiment" object
## ----------------------------
## name:          Gastric cancer (NMR)
## description:   1H-NMR urinary metabolomic profiling for diagnosis of gastric
##                cancer
## data:          140 rows x 149 columns
## sample_meta:   140 rows x 5 columns
## variable_meta: 149 rows x 1 columns

1.2 Data cleaning

It is good practice to assess the overall quality of the data and to remove any low-quality features. In the Tutorial only features with a QC-RSD below 20% and with fewer than 10% of values missing are retained.

# prepare model sequence
M = rsd_filter(rsd_threshold=20,qc_label='QC',factor_name='Class') +
    mv_feature_filter(threshold = 10,method='across',factor_name='Class')

# apply model
M = model_apply(M,DE)

# get the model output
filtered = predicted(M)

# summary of filtered data
filtered
## A "DatasetExperiment" object
## ----------------------------
## name:          Gastric cancer (NMR)
## description:   1H-NMR urinary metabolomic profiling for diagnosis of gastric
##                cancer
## data:          140 rows x 53 columns
## sample_meta:   140 rows x 5 columns
## variable_meta: 53 rows x 1 columns

Note that one more feature is retained here than in the Tutorial, because the filters here use >= or <= while the Tutorial uses > and <.
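
The effect of the comparison operator can be seen in a minimal base R sketch. The toy data, the variable names and the use of sd/mean for the RSD are assumptions made for illustration; this is not the internal code of rsd_filter or mv_feature_filter.

```r
# toy data (not the Tutorial data): 6 samples (rows) x 3 features (columns)
X <- data.frame(
  f1 = c(10, 11, 10, 12, 10, 10),
  f2 = c(10, 30,  5, 12,  8, 12),
  f3 = c(10, NA, NA, 12, 10, 10)
)
# sample class labels; 'QC' marks the quality control samples
class <- c('QC', 'GC', 'HE', 'GC', 'QC', 'QC')

# relative standard deviation (%) of each feature within the QC samples
qc <- X[class == 'QC', , drop = FALSE]
rsd <- sapply(qc, function(v) 100 * sd(v, na.rm = TRUE) / mean(v, na.rm = TRUE))

# percentage of missing values per feature across all samples
pct_missing <- 100 * colMeans(is.na(X))

# strict thresholds: a feature sitting exactly on a threshold is removed
keep_strict <- rsd < 20 & pct_missing < 10

# inclusive thresholds: a feature sitting exactly on a threshold is kept
keep_inclusive <- rsd <= 20 & pct_missing <= 10

data.frame(rsd, pct_missing, keep_strict, keep_inclusive)
```

Feature f2 has a QC-RSD of exactly 20%, so it is kept only by the inclusive comparison, while f3 is removed in both cases because a third of its values are missing.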

1.3 PCA quality assessment

After suitable scaling and transformation, PCA can be used to assess data quality. It is expected that the biological variance (samples) will be larger than the technical variance (QCs). In the Tutorial the filtered data matrix is log10 transformed and autoscaled (scaled to unit variance), and k-nearest neighbours (kNN) imputation with 3 neighbours is used to fill in any missing values. This transformed and scaled matrix is then used as input to PCA.

In struct we can chain all of these steps into a single model sequence.

# prepare the model sequence
M = log_transform(base = 10) +
    autoscale() + 
    knn_impute(neighbours = 3) +
    PCA(number_components = 10)

# apply model sequence to data
M = model_apply(M,filtered)

# get the transformed, scaled and imputed matrix
TSI = predicted(M[3])

# scores plot
C = pca_scores_plot(factor_name = 'SampleType')
g1 = chart_plot(C,M[4])

# loadings plot
C = pca_loadings_plot()
g2 = chart_plot(C,M[4])

plot_grid(g1,g2,align='hv',nrow=1,axis='tblr')
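
For readers who want to see what the transformation and scaling steps do numerically, the following is a rough base R sketch of the same idea using prcomp. The simulated data and all object names are assumptions for illustration. kNN imputation is left out because the simulated matrix has no missing values, and this is not a description of how the structToolbox objects are implemented.

```r
set.seed(1)

# simulated positive-valued data: 20 samples (rows) x 5 features (columns)
X <- matrix(rlnorm(20 * 5, meanlog = 3), nrow = 20,
            dimnames = list(NULL, paste0('M', 1:5)))

# log10 transform
Xl <- log10(X)

# autoscale: mean centre each column and scale it to unit variance
Xs <- scale(Xl, center = TRUE, scale = TRUE)

# PCA on the already centred and scaled matrix
pca <- prcomp(Xs, center = FALSE, scale. = FALSE)

# scores (samples) and loadings (features) for the first two components
scores <- pca$x[, 1:2]
loadings <- pca$rotation[, 1:2]

# proportion of variance explained by each component
explained <- pca$sdev^2 / sum(pca$sdev^2)
round(explained, 3)
```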

1.4 Univariate statistics

The Tutorial uses a helper function to calculate a number of different univariate statistics. structToolbox provides objects for the t-test, counting numbers of features, etc. For brevity only the t-test is calculated. The QC samples need to be excluded, and the data reduced to only the GC and HE groups.

# prepare model
TT = filter_smeta(mode='include',factor_name='Class',levels=c('GC','HE')) +  
     ttest(alpha=0.05,mtc='fdr',factor_names='Class')

# apply model
TT = model_apply(TT,filtered)

# keep the data filtered by group for later
filtered = predicted(TT[1])

# convert to data frame
out=as_data_frame(TT[2])

# show first few features
head(out)
##     t_statistic   t_p_value t_significant estimate.mean.GC estimate.mean.HE
## M4   -3.5392652 0.008421042          TRUE         26.47778         51.73947
## M5    1.4296604 0.410396437         FALSE        265.11860        169.91500
## M7    2.7456506 0.051494976         FALSE        118.52558         53.98718
## M8   -2.1294198 0.178392032         FALSE         54.39535         79.26750
## M11   0.5106536 0.776939682         FALSE        201.34390        171.27949
## M14  -1.4786810 0.403091881         FALSE         61.53171         83.90250
##         lower      upper
## M4  -39.56162 -10.961769
## M5  -38.04747 228.454679
## M7   17.60818 111.468619
## M8  -48.20069  -1.543611
## M11 -87.30604 147.434869
## M14 -52.57754   7.835950
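
The same kind of per-feature test with multiple test correction can be sketched in base R using t.test and p.adjust. The simulated data and names below are assumptions for illustration, and whether the ttest object uses the same defaults as t.test (for example unequal variances) is not stated in this vignette.

```r
set.seed(42)

# simulated data: 10 GC and 10 HE samples, 3 features
group <- factor(rep(c('GC', 'HE'), each = 10))
X <- data.frame(
  M1 = c(rnorm(10, mean = 5), rnorm(10, mean = 8)),  # group means differ
  M2 = rnorm(20, mean = 5),                          # no group difference
  M3 = rnorm(20, mean = 5)                           # no group difference
)

# two-sample t-test for each feature (t.test defaults)
p_raw <- sapply(X, function(v) t.test(v ~ group)$p.value)

# false discovery rate correction across features
# ('fdr' is an alias for the Benjamini-Hochberg method in p.adjust)
p_fdr <- p.adjust(p_raw, method = 'fdr')

# flag features that are significant at alpha = 0.05 after correction
res <- data.frame(p_raw, p_fdr, significant = p_fdr < 0.05)
res
```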

2 Machine learning

2.1 Training and Test sets

Splitting data into training and test sets is an important aspect of machine learning. In structToolbox this is implemented by split_data, for random subsampling across the whole dataset, and by stratified_split, for splitting based on group sizes. The latter is the approach used in the Python Tutorial.

# prepare model
M = stratified_split(p_train=0.75,factor_name='Class')
# apply to filtered data
M = model_apply(M,filtered)
# get data from object
train = M$training
train
## A "DatasetExperiment" object
## ----------------------------
## name:          Gastric cancer (NMR)(Training set)
## description:   1H-NMR urinary metabolomic profiling for diagnosis of gastric
##                cancer
## A subset of the data has been selected as a training set
## data:          62 rows x 53 columns
## sample_meta:   62 rows x 5 columns
## variable_meta: 53 rows x 1 columns
cat('\n')
test = M$testing
test
## A "DatasetExperiment" object
## ----------------------------
## name:          Gastric cancer (NMR)(Testing set)
## description:   1H-NMR urinary metabolomic profiling for diagnosis of gastric
##                cancer
## A subset of the data has been selected as a test set
## data:          21 rows x 53 columns
## sample_meta:   21 rows x 5 columns
## variable_meta: 53 rows x 1 columns
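
The idea behind a stratified split can be sketched in base R by sampling within each group separately. The group sizes (43 and 40, summing to the 83 samples above) and the use of floor() to round the training size are assumptions for illustration, not a statement of how stratified_split is implemented.

```r
set.seed(123)

# assumed class labels for 83 samples (group sizes are illustrative)
class <- factor(c(rep('GC', 43), rep('HE', 40)))
p_train <- 0.75

# sample 75% of the indices within each class separately
train_idx <- unlist(
  lapply(split(seq_along(class), class), function(idx) {
    sample(idx, size = floor(p_train * length(idx)))
  }),
  use.names = FALSE
)

# everything not selected for training forms the test set
test_idx <- setdiff(seq_along(class), train_idx)

# class proportions are approximately preserved in both sets
table(class[train_idx])
table(class[test_idx])
```

Under these assumptions the split gives 62 training and 21 test samples, with both classes represented in each set in roughly their original proportions.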

2.2 Optimal number of PLS components

In the Python Tutorial k-fold cross-validation is used to determine the optimal number of PLS components, and 100 bootstrap iterations are used to generate confidence intervals. In structToolbox these are implemented using “iterator” objects, which can be combined with model objects. R2 is the metric used for optimisation in the Python Tutorial, so the PLSR model in structToolbox will be used. For speed only 10 iterations (permutations of the sample order) are used here.

# scale/transform training data
M = log_transform(base = 10) +
    autoscale() + 
    knn_impute(neighbours = 3,by='samples')

# apply model
M = model_apply(M,train)

# get scaled/transformed training data
train_st = predicted(M)

# prepare model sequence
MS = grid_search_1d(
        param_to_optimise = 'number_components',
        search_values = as.numeric(c(1:6)),
        model_index = 2,
        factor_name = 'Class_num',
        max_min = 'max') *
     permute_sample_order(
        number_of_permutations = 10) *
     kfold_xval(
        folds = 5,
        factor_name = 'Class_num') * 
     (mean_centre(mode='sample_meta')+
      PLSR(factor_name='Class_num'))

# run the validation
MS = struct::run(MS,train_st,r_squared())

# plot the results of the grid search
C = gs_line()
chart_plot(C,MS)

The chart shows Q2, which is comparable with Fig. 13 of the Python Tutorial. Two components were selected in the Python Tutorial, so the same number is used here.
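
The general pattern of a one-dimensional grid search wrapped around k-fold cross-validation can be sketched in base R. To keep the example self-contained, a polynomial regression fitted with lm stands in for PLSR, and the polynomial degree stands in for the number of components; the data, names and model are all assumptions for illustration only.

```r
set.seed(1)

# simulated data with a quadratic relationship between x and y
n <- 60
x <- runif(n, -1, 1)
y <- 1 + 2 * x - 3 * x^2 + rnorm(n, sd = 0.3)

# randomly assign each sample to one of 5 folds
folds <- sample(rep(1:5, length.out = n))

# values of the parameter to search over (here: polynomial degree)
degrees <- 1:4

# cross-validated R2 (Q2) for each parameter value
q2 <- sapply(degrees, function(d) {
  pred <- numeric(n)
  for (k in 1:5) {
    test <- folds == k
    # fit on the training folds only
    fit <- lm(y ~ poly(x, d, raw = TRUE),
              data = data.frame(x = x[!test], y = y[!test]))
    # predict the held-out fold
    pred[test] <- predict(fit, newdata = data.frame(x = x[test]))
  }
  1 - sum((y - pred)^2) / sum((y - mean(y))^2)
})

# parameter value that maximises Q2
best <- degrees[which.max(q2)]
data.frame(degrees, q2)
```

Because every prediction comes from a model that did not see that sample during fitting, Q2 is always below 1 and is a more conservative guide to model complexity than R2 computed on the training data.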

2.3 Evaluate the PLS model

To evaluate the model for discriminant analysis in structToolbox the PLSDA model is appropriate.

# prepare the discriminant model
P = PLSDA(number_components = 2, factor_name='Class')

# apply the model
P = model_apply(P,train_st)

# charts
C = plsda_predicted_plot(factor_name='Class',style='boxplot')
g1 = chart_plot(C,P)

C = plsda_predicted_plot(factor_name='Class',style='density')
g2 = chart_plot(C,P)+xlim(c(-2,2))

C = plsda_roc_plot(factor_name='Class')
g3 = chart_plot(C,P)

plot_grid(g1,g2,g3,align='vh',axis='tblr',nrow=1)

# AUC for comparison with python tutorial
MET = calculate(AUC(),P$y$Class,P$yhat[,1])
MET
## A "AUC" object
## --------------
## name:          Area under ROC
## description:   
## value:         0.9739583

Note that the default cutoff for the PLS models in structToolbox is 0, because groups are encoded as +/-1.
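
As a rough illustration of what the AUC summarises, it can be computed in base R from the ranks of the predicted values, which is equivalent to the proportion of (positive, negative) pairs that are ranked correctly. The toy scores, the choice of GC as the positive (+1) group and the helper name auc_rank are assumptions; this is not the code used by the AUC object.

```r
# rank-based AUC (Mann-Whitney formulation); hypothetical helper
auc_rank <- function(truth, score, positive) {
  pos <- score[truth == positive]
  neg <- score[truth != positive]
  r <- rank(c(pos, neg))
  n_pos <- length(pos)
  (sum(r[seq_len(n_pos)]) - n_pos * (n_pos + 1) / 2) / (n_pos * length(neg))
}

# toy example: true groups and predicted values on a +/-1 scale
truth <- c('GC', 'GC', 'GC', 'HE', 'HE', 'HE')
score <- c(0.9, 0.4, 0.2, 0.3, -0.5, -0.8)

# 8 of the 9 GC/HE pairs are ranked correctly, so the AUC is 8/9
auc_value <- auc_rank(truth, score, positive = 'GC')
auc_value

# class assignment using a cutoff of 0 on the predicted values
predicted_class <- ifelse(score > 0, 'GC', 'HE')
table(truth, predicted_class)
```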

2.4 Permutation test

A permutation test can be used to assess how likely the observed result is to have occurred by chance. In structToolbox permutation_test is an iterator object that can be combined with other iterators and models.

# model sequence
MS = permutation_test(number_of_permutations = 20,factor_name = 'Class_num') * 
     kfold_xval(folds = 5,factor_name = 'Class_num') *
     (mean_centre(mode='sample_meta') + PLSR(factor_name='Class_num', number_components = 2))

# run iterator
MS = struct::run(MS,train_st,r_squared())

# chart
C = permutation_test_plot(style = 'density') 
chart_plot(C,MS) + xlim(c(-1,1)) + xlab('R Squared')

This plot is comparable to the bottom half of Figure 17 in the Python Tutorial. The unpermuted (true) Q2 values are consistently better than those of the permuted (null) models, i.e. the model performs better than would be expected by chance.
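
The logic of a permutation test can be sketched in a few lines of base R. For simplicity the statistic here is the squared correlation between a single simulated variable and the response, without cross-validation; the data, the statistic and the names are assumptions for illustration rather than a reproduction of the analysis above.

```r
set.seed(7)

# simulated response (two groups coded -1/+1) and a related predictor
n <- 40
y <- rep(c(-1, 1), each = n / 2)
x <- y + rnorm(n)

# statistic of interest: squared correlation between predictor and response
r2 <- function(x, y) cor(x, y)^2

# value for the true (unpermuted) labels
observed <- r2(x, y)

# null distribution: shuffle the response and recompute the statistic
null <- replicate(200, r2(x, sample(y)))

# empirical p-value: how often a permuted value is at least as large
p_value <- (sum(null >= observed) + 1) / (length(null) + 1)
p_value
```

Adding 1 to the numerator and denominator treats the observed value as one of the permutations, so the estimated p-value can never be exactly zero.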

2.5 PLS projection plots

PLS scores can be plotted to visualise the model and interpret the latent variables.

# prepare the discriminant model
P = PLSDA(number_components = 2, factor_name='Class')

# apply the model
P = model_apply(P,train_st)

C = plsda_scores_plot(components=c(1,2),factor_name = 'Class')
chart_plot(C,P)

2.6 PLS feature importance

Regression coefficients and VIP scores can be used to estimate the importance of individual features to the PLS model. In the Python Tutorial bootstrapping is used to estimate confidence intervals, but for brevity this step is skipped here.

# prepare chart
C = plsda_vip_plot(level='HE')
g1 = chart_plot(C,P)

C = plsda_regcoeff_plot(level='HE')
g2 = chart_plot(C,P)

plot_grid(g1,g2,align='hv',axis='tblr',nrow=2)