1 Introduction

In this dataset exploratry analysis of the Saccurine dataset from the ropls package will used to demonstrate use of structToolbox.

2 Data preparation

For struct the data needs to be comverted to a DatasetExperiment object. The ropls package provides the data as a list containing a dataMatrix, sampleMetadata and variableMetadata.

data('sacurine',package = 'ropls')
# the 'sacurine' list should now be available

# move the annotations to a new column and rename the features by index to avoid issues
# later when data.frames if get transposed and names get checked/changed

sacurine$variableMetadata$annotation=rownames(sacurine$variableMetadata)
rownames(sacurine$variableMetadata)=1:nrow(sacurine$variableMetadata)
colnames(sacurine$dataMatrix)=1:ncol(sacurine$dataMatrix)


DE = DatasetExperiment(data = data.frame(sacurine$dataMatrix),
                       sample_meta = sacurine$sampleMetadata,
                       variable_meta = sacurine$variableMetadata,
                       name = 'Sacurine data',
                       description = 'See ropls package documentation for details')
DE
## A "DatasetExperiment" object
## ----------------------------
## name:          Sacurine data
## description:   See ropls package documentation for details
## data:          183 rows x 109 columns
## sample_meta:   183 rows x 3 columns
## variable_meta: 109 rows x 4 columns

3 Exploratory analysis

The Sacurine dataset has already been processed:

After signal drift and batch effect correction of intensities, each urine profile was normalized to the osmolality of the sample. Finally, the data were log10 transformed.

The data can therefore be visualised using PCA without further processing. The ropls package automatically applies unit variance scaling (autoscaling) by default, so the same approach is employed here.

# prepare model sequence
M = autoscale() + PCA(number_components = 5)
# apply model sequence to dataset
M = model_apply(M,DE)

# pca scores plots
g=list()
for (k in colnames(DE$sample_meta)) {
    C = pca_scores_plot(factor_name = k)
    g[[k]] = chart_plot(C,M[2])
}
# plot using cowplot
plot_grid(plotlist=g, nrow=1, align='vh')

The final plot coloured by gender is identical to fig.2 of the ropls package vignette. structToolbox also provides some other PCA diagnostic plots, such as D-statistic, scree, and loadings plots.

C = pca_scree_plot()
g1 = chart_plot(C,M[2])

C = pca_loadings_plot()
g2 = chart_plot(C,M[2])

C = pca_dstat_plot(alpha=0.95)
g3 = chart_plot(C,M[2])

p1=plot_grid(plotlist = list(g1,g2),align='h',nrow=1,axis='b')
p2=plot_grid(plotlist = list(g3),nrow=1)
plot_grid(p1,p2,nrow=2)

4 Supervised analysis

The ropls packgage uses its own implementation of the (o)pls algorithms. structToolbox uses the pls package, so it is interesting to compare the outputs from both approaches. For simplicity only the scores plots are compared.

# prepare model sequence
M = autoscale() + PLSDA(factor_name='gender')
M = model_apply(M,DE)

C = plsda_scores_plot(factor_name = 'gender')
chart_plot(C,M[2])

The plot is similar to fig.3 of the ropls vignette. Differences are due to inverted LV axes, a common occurrence with the NIPALS algorithm (used by both structToolbox and ropls) which depends on how the algorithm is initialised.

To compare the R2 values for the model in structToolbox we have to use a regression model, instead of a discriminant model. For this we convert the gender factor to a numeric variable before applying the model.

# convert gender to numeric
DE$sample_meta$gender=as.numeric(DE$sample_meta$gender)

# models sequence
M = autoscale(mode='both') + PLSR(factor_name='gender',number_components=3)
M = model_apply(M,DE)

# some diagnostic charts
C = plsr_cook_dist()
g1 = chart_plot(C,M[2])

C = plsr_prediction_plot()
g2 = chart_plot(C,M[2])

C = plsr_qq_plot()
g3 = chart_plot(C,M[2])

C = plsr_residual_hist()
g4 = chart_plot(C,M[2])

plot_grid(plotlist = list(g1,g2,g3,g4), nrow=2,align='vh')

The ropls package automatically applies cross-validation to asses the performance of the PLSDA model. In structToolbox this is applied separately to give more control over the approach used. The default cross-validation used by the ropls package is 7-fold cross-validation.

# model sequence
M = kfold_xval(folds=7, factor_name='gender') * 
    (autoscale(mode='both') + PLSR(factor_name='gender'))
M = run(M,DE,r_squared())

Training set R2: 0.6975706 0.6798415 0.646671 0.6532914 0.7109769 0.670777 0.6935344

Test set Q2: 0.5460723

The validity of the model can further be assessed using permutation testing. For this we will return to a discriminant model.

# reset gender to original factor
DE$sample_meta$gender=sacurine$sampleMetadata$gender
# model sequence
M = permutation_test(number_of_permutations = 10, factor_name='gender') *
    kfold_xval(folds=7,factor_name='gender') *
    (autoscale() + PLSDA(factor_name='gender',number_components = 3))
M = run(M,DE,balanced_accuracy())

C = permutation_test_plot(style='boxplot')
chart_plot(C,M)+ylab('1 - balanced accuracy')

The permuted models have a balanced accuracy of around 50%, which is to be expected for a dataset with two groups. The unpermuted models have a balanced accuracy of around 90% and is therefore much better than might be expected to occur by chance.