1 Using struct model objects

PCA (Principal Component Analysis) is a commonly applied method for exploring multivariate datasets. We will use the iris DatasetExperiment as an example, which is included in the package and already prepared as a DatasetExperiment object.

D = iris_DatasetExperiment()
head(D$data)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1          5.1         3.5          1.4         0.2
## 2          4.9         3.0          1.4         0.2
## 3          4.7         3.2          1.3         0.2
## 4          4.6         3.1          1.5         0.2
## 5          5.0         3.6          1.4         0.2
## 6          5.4         3.9          1.7         0.4



1.1 PCA model

Before we apply PCA we first need to create a PCA object. This object contains all the inputs, outputs and methods needed to apply PCA. We can set parameters such as the number of components when the PCA model is created, but we can also use dollar notation to view or change them later.

P = PCA(number_components=15)
P$number_components=5
P$number_components
## [1] 5

The inputs for a model can be listed using param_ids(object):

param_ids(P)
## [1] "number_components"



1.2 Model sequences

Unless you have a very good reason not to, it is usually sensible to mean centre the columns of the data before PCA. Using the STRUCT framework we can create a model sequence that will mean centre and then apply PCA to the mean centred data.

M = mean_centre() + PCA(number_components = 4)

In STRUCT mean centring and PCA are both model objects, and therefore joining them creates a model_sequence object. The objects in the sequence can be accessed by indexing, and we can combine this with dollar notation. For example, the PCA object is the second object in our sequence and we can access the number of components like this:

M[2]$number_components
## [1] 4
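
To make concrete what the two steps in this sequence compute, the following is a minimal base R sketch that does not use the struct classes. It assumes the built-in iris data frame holds the same measurements as D$data, and it uses svd as a stand-in for whatever algorithm the PCA object uses internally.

```r
# Base R sketch of "mean centre, then PCA"; not the struct implementation.
X <- as.matrix(iris[, 1:4])                 # assumed to match D$data

# Step 1: mean centre each column
col_means <- colMeans(X)
Xc <- sweep(X, 2, col_means, "-")

# Step 2: PCA of the centred data via singular value decomposition
s <- svd(Xc)
n_comp <- 4
loadings <- s$v[, 1:n_comp, drop = FALSE]   # variables x components
scores <- Xc %*% loadings                   # samples x components
explained <- s$d[1:n_comp]^2 / sum(s$d^2)   # fraction of variance per component

round(colMeans(Xc), 10)                     # all zero after centring
dim(scores)                                 # 150 samples by 4 components
```

Because the second step only ever sees the centred matrix, the order of the two objects in the sequence matters.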



1.3 Training/testing models

Model and model_sequence objects need to be trained using a training DatasetExperiment.

M = model_train(M,D)

Model objects can be used to generate predictions for test datasets. For this example we will just use the training data (sometimes called autoprediction).

M = model_predict(M,D)
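
The distinction between training and prediction can be sketched in base R as follows. This is an illustration under stated assumptions, not the struct implementation: the odd/even row split is hypothetical, the built-in iris data frame stands in for D, and the helper name project is made up for the example.

```r
# Base R sketch of the train/predict split; not the struct implementation.
X <- as.matrix(iris[, 1:4])
train_idx <- seq(1, nrow(X), by = 2)        # hypothetical split: odd rows for training
X_train <- X[train_idx, ]
X_test  <- X[-train_idx, ]

# "Training": learn the column means and loadings from the training data only
centre <- colMeans(X_train)
loadings <- svd(sweep(X_train, 2, centre, "-"))$v

# "Prediction": apply the stored centre and loadings to any dataset
project <- function(X_new) sweep(X_new, 2, centre, "-") %*% loadings

scores_auto <- project(X_train)             # autoprediction: same data as training
scores_test <- project(X_test)              # held-out samples, centred with the training means
```

The key point is that the test samples are centred with the means learned from the training data, not with their own means.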

The available outputs for an object can be listed using output_ids(object) and accessed using dollar notation:

output_ids(M[2])
## [1] "scores"      "loadings"    "eigenvalues" "ssx"         "correlation"
## [6] "that"
M[2]$scores
## A "DatasetExperiment" object
## ----------------------------
## name:          
## description:   
## data:          150 rows x 4 columns
## sample_meta:   150 rows x 1 columns
## variable_meta: 4 rows x 1 columns



1.4 Model charts

The struct framework includes charts. Charts associated with a model object can be listed using chart_names(object):

chart_names(M[2])
## [1] "pca_biplot_plot"      "pca_correlation_plot" "pca_dstat_plot"      
## [4] "pca_loadings_plot"    "pca_scores_plot"      "pca_scree_plot"

Like model objects, chart objects need to be created before they can be used. Here we will plot the PCA scores plot for our mean centred PCA model:

C = pca_scores_plot(factor_name='Species') # colour by Species
chart_plot(C,M[2])

If we make changes to our chart object, we must call chart_plot again.

# add petal width to meta data of pca scores
M[2]$scores$sample_meta$Petal.Width=D$data$Petal.Width
# update plot
C$factor_name='Petal.Width'
chart_plot(C,M[2])

The chart_plot method can return a ggplot object, for example, so that you can easily combine it with other plots, for instance using grid.arrange from the gridExtra package.

C1 = pca_scores_plot(factor_name='Species') # colour by Species
g1 = chart_plot(C1,M[2])
C2 = pca_scree_plot()
g2 = chart_plot(C2,M[2])
grid.arrange(grobs=list(g1,g2),nrow=1)



1.5 STATO Integration

Some model objects are also STATO objects. STATO is a general-purpose statistics ontology (http://stato-ontology.org/). In the STRUCT framework we use it to provide standardised definitions for objects. The PCA model object is also a STATO object.

is(PCA(),'stato')
## [1] TRUE

We can access the STATO ontology using some methods specific to stato objects.

# this is the stato id for PCA
stato_id(P)
## [1] "OBI:0200051"
# this is the stato name
stato_name(P)
## [1] "principal components analysis dimensionality reduction"
# this is the stato definition
stato_definition(P)
## [1] "A principal components analysis dimensionality reduction is a dimensionality reduction achieved by applying principal components analysis and by keeping low-order principal components and excluding higher-order ones. "

This information is more succinctly displayed using stato_summary. This method also scans over all inputs and outputs for those with STATO definitions and displays those as well. For PCA the number of components input has a STATO definition, but none of the outputs are STATO objects, so no definitions are listed for them.

stato_summary(P)
## OBI:0200051 
## principal components analysis dimensionality reduction 
## A principal components analysis dimensionality reduction is a dimensionality reduction achieved by applying principal components analysis and by keeping low-order principal components and excluding higher-order ones.  
## 
## Inputs:
## STATO:0000555 
## number of predictive components 
## number of predictive components is a count used as input to the principle component analysis (PCA)  
## 
## 
## Outputs:

2 Validating models

Validation is an important aspect of chemometric modelling. The STRUCT framework enables this kind of iterative model testing through iterator objects. In order to demonstrate this we will first load the iris data set, which has been pre-prepared as a DatasetExperiment object as part of the STRUCT package.

D = iris_DatasetExperiment()
D
## A "DatasetExperiment" object
## ----------------------------
## name:          Fisher's Iris dataset
## description:   This famous (Fisher's or Anderson's) iris data set gives the
##                measurements in centimeters of the variables
##                sepal length and width and petal length and
##                width, respectively, for 50 flowers from each of
##                3 species of iris. The species are Iris setosa,
##                versicolor, and virginica.
## data:          150 rows x 4 columns
## sample_meta:   150 rows x 1 columns
## variable_meta: 4 rows x 1 columns



2.1 Cross-validation

Cross-validation is a common technique for assessing the performance of classification models. For this example we will use a PLSDA model. Data should be mean centred prior to PLS, so we will build a model sequence first.

M = mean_centre() + PLSDA(number_components=2,factor_name='Species')
M
## A model_seq object containing:
## 
## [1]
## A "mean_centre" object
## ----------------------
## name:          Mean centre
## description:   
## input params:  mode 
## outputs:       centred, mean_data, mean_sample_meta 
## predicted:     centred
## seq_in:        data
## 
## [2]
## A "PLSDA" object
## ----------------
## name:          Partial least squares discriminant analysis
## description:   
## input params:  number_components, factor_name 
## outputs:       scores, loadings, yhat, design_matrix, y, reg_coeff, probability, vip, pls_model, pred, threshold 
## predicted:     pred
## seq_in:        data

Iterator objects, like the k-fold cross-validation object, can be created just like any other struct object. Parameters can be set at creation, and accessed/changed later using dollar notation.

XCV = kfold_xval(folds=5,factor_name='Species')
# change the number of folds
XCV$folds=10
XCV$folds
## [1] 10

The model to be cross-validated can be set/accessed using the models method.

models(XCV)=M
models(XCV)
## A model_seq object containing:
## 
## [1]
## A "mean_centre" object
## ----------------------
## name:          Mean centre
## description:   
## input params:  mode 
## outputs:       centred, mean_data, mean_sample_meta 
## predicted:     centred
## seq_in:        data
## 
## [2]
## A "PLSDA" object
## ----------------
## name:          Partial least squares discriminant analysis
## description:   
## input params:  number_components, factor_name 
## outputs:       scores, loadings, yhat, design_matrix, y, reg_coeff, probability, vip, pls_model, pred, threshold 
## predicted:     pred
## seq_in:        data

Alternatively, iterators can be combined with models using the multiplication symbol:

XCV = kfold_xval(folds=5,method='venetian',factor_name='Species') * 
      (mean_centre()+PLSDA(number_components = 2,factor_name='Species'))
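
One common reading of a "venetian" (venetian blinds) split is that consecutive samples are dealt into the folds in turn. The base R sketch below assumes that interpretation and applies it within each class; it is an illustration only and does not reproduce the package's own fold assignment. The built-in iris data frame stands in for D.

```r
# Sketch of venetian-blinds fold assignment within each class (assumed interpretation).
species <- iris$Species
k <- 5

fold <- integer(length(species))
for (cls in levels(species)) {
  idx <- which(species == cls)
  fold[idx] <- rep_len(seq_len(k), length(idx))   # 1, 2, ..., k, 1, 2, ...
}

table(fold, species)   # every fold holds 10 samples of each species
```

Each fold is used once as the test set while the remaining folds form the training set.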

The run method can be used with any iterator object. The iterator will then run the model sequence multiple times. In our case, 5-fold cross-validation runs the model sequence 5 times, splitting the data into a different training and test set each time. The run method also needs a metric to be specified. This metric may be calculated once after all iterations, or after each iteration, depending on the iterator type (resampling, permutation etc). For cross-validation we will calculate balanced accuracy after all iterations.

XCV = run(XCV,D,balanced_accuracy())
XCV$metric
##              metric mean sd
## 1 balanced_accuracy 0.23 NA
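
For reference, the textbook definition of balanced accuracy is the mean of the per-class recalls. The sketch below implements that definition in base R with a hypothetical helper name and made-up labels; the package's metric object may follow a different convention, so this sketch should not be used to interpret the value reported above.

```r
# Textbook balanced accuracy: the mean of the per-class recalls.
# The function name is hypothetical and is not part of the package.
balanced_acc <- function(truth, predicted) {
  truth <- factor(truth)
  recall <- vapply(levels(truth), function(cls) {
    mean(predicted[truth == cls] == cls)    # fraction of class cls predicted correctly
  }, numeric(1))
  mean(recall)
}

# Made-up labels for illustration
truth     <- c("a", "a", "a", "a", "b", "b")
predicted <- c("a", "a", "a", "b", "b", "b")
balanced_acc(truth, predicted)   # recall is 0.75 for "a" and 1 for "b", giving 0.875
```

Unlike plain accuracy (5/6 in this example), every class contributes equally regardless of how many samples it has.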



Like other STRUCT objects, iterators can have chart objects associated with them. The chart_names function will list them for an object.

chart_names(XCV)
## [1] "kfoldxcv_grid"   "kfoldxcv_metric"

Charts for iterator objects can be plotted in the same way as charts for any other object.

C = kfoldxcv_grid(factor_name='Species')
chart_plot(C,XCV)[[2]] # produces multiple figures. only plot second one.

It is possible to combine multiple iterators by multiplying them together. This is equivalent to nesting one iterator inside the other. For example, we can repeat our cross-validation multiple times by permuting the sample order.

P = permute_sample_order(number_of_permutations = 10) * 
    kfold_xval(folds=5,factor_name='Species')*
    (mean_centre() + PLSDA(factor_name='Species',number_components=2))
P = run(P,D,balanced_accuracy())
P$metric
##              metric  mean         sd
## 1 balanced_accuracy 0.222 0.01974842
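
The nesting of the two iterators can be pictured as two loops. The base R sketch below is an illustration under several assumptions and is not the struct implementation: a nearest-centroid classifier (a hypothetical helper) stands in for mean centring plus PLSDA, plain accuracy stands in for the balanced accuracy metric, and the built-in iris data frame stands in for D. Its numbers are therefore not comparable with the output above.

```r
# Sketch of "permute sample order, then k-fold cross-validate".
set.seed(1)
X <- as.matrix(iris[, 1:4])
y <- iris$Species
k <- 5
n_perm <- 10

# Stand-in classifier: assign each test sample to the closest class centroid
nearest_centroid <- function(X_train, y_train, X_test) {
  classes <- levels(y_train)
  centroids <- sapply(classes, function(cls) {
    colMeans(X_train[y_train == cls, , drop = FALSE])
  })
  d <- sapply(classes, function(cls) {
    rowSums(sweep(X_test, 2, centroids[, cls], "-")^2)   # squared distance to centroid
  })
  factor(classes[apply(d, 1, which.min)], levels = classes)
}

accuracy <- numeric(n_perm)
for (p in seq_len(n_perm)) {
  perm <- sample(nrow(X))                    # outer iterator: permute the sample order
  Xp <- X[perm, ]
  yp <- y[perm]
  fold <- rep_len(seq_len(k), nrow(Xp))      # inner iterator: k interleaved folds
  pred <- factor(rep(NA, length(yp)), levels = levels(yp))
  for (f in seq_len(k)) {
    test <- fold == f
    pred[test] <- nearest_centroid(Xp[!test, ], yp[!test], Xp[test, ])
  }
  accuracy[p] <- mean(pred == yp)            # one metric value per permutation
}

c(mean = mean(accuracy), sd = sd(accuracy))  # summary over the permutations
```

The inner loop yields one metric value per permutation, and the outer loop is what makes a standard deviation across repeats available.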

3 Session Info

sessionInfo()
## R version 4.0.0 alpha (2020-03-31 r78116)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows Server 2012 R2 x64 (build 9600)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=C                          
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] gridExtra_2.3        openxlsx_4.1.4       ropls_1.19.16       
##  [4] Biobase_2.47.3       BiocGenerics_0.33.3  cowplot_1.0.0       
##  [7] ggplot2_3.3.0        structToolbox_0.99.8 struct_0.99.9       
## [10] pmp_0.99.3           BiocStyle_2.15.6    
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.4                  lattice_0.20-41            
##  [3] assertthat_0.2.1            digest_0.6.25              
##  [5] foreach_1.5.0               R6_2.4.1                   
##  [7] GenomeInfoDb_1.23.16        plyr_1.8.6                 
##  [9] stats4_4.0.0                evaluate_0.14              
## [11] pillar_1.4.3                itertools_0.1-3            
## [13] zlibbioc_1.33.1             rlang_0.4.5                
## [15] magick_2.3                  S4Vectors_0.25.15          
## [17] Matrix_1.2-18               missForest_1.4             
## [19] rmarkdown_2.1               labeling_0.3               
## [21] stringr_1.4.0               RCurl_1.98-1.1             
## [23] munsell_0.5.0               DelayedArray_0.13.10       
## [25] compiler_4.0.0              xfun_0.12                  
## [27] pkgconfig_2.0.3             pcaMethods_1.79.1          
## [29] htmltools_0.4.0             tidyselect_1.0.0           
## [31] SummarizedExperiment_1.17.5 tibble_3.0.0               
## [33] GenomeInfoDbData_1.2.2      bookdown_0.18              
## [35] IRanges_2.21.8              codetools_0.2-16           
## [37] matrixStats_0.56.0          randomForest_4.6-14        
## [39] viridisLite_0.3.0           fansi_0.4.1                
## [41] withr_2.1.2                 crayon_1.3.4               
## [43] dplyr_0.8.5                 bitops_1.0-6               
## [45] grid_4.0.0                  ontologyIndex_2.5          
## [47] gtable_0.3.0                lifecycle_0.2.0            
## [49] magrittr_1.5                scales_1.1.0               
## [51] zip_2.0.4                   cli_2.0.2                  
## [53] stringi_1.4.6               impute_1.61.0              
## [55] farver_2.0.3                XVector_0.27.2             
## [57] reshape2_1.4.3              ggthemes_4.2.0             
## [59] sp_1.4-1                    pls_2.7-2                  
## [61] ellipsis_0.3.0              vctrs_0.2.4                
## [63] iterators_1.0.12            tools_4.0.0                
## [65] glue_1.4.0                  purrr_0.3.3                
## [67] yaml_2.2.1                  colorspace_1.4-1           
## [69] BiocManager_1.30.10         GenomicRanges_1.39.3       
## [71] knitr_1.28