This study demonstrates how a researcher can use miXGENE to reproduce the experiments published in our paper (Andel et al., 2014). The corresponding miXGENE workflow is here. The workflow is a read-only version; the user can duplicate it and use it as a template. In the original paper, we presented a novel method for knowledge-guided predictive analysis of omics data, namely gene expression (GE) and microRNA (miRNA) expression data. In Section 1, we briefly describe the published method. Section 2 presents the data domain that we used to validate the method. Section 3 describes the experimental protocol implemented in miXGENE for the method validation, together with its parametrization. Section 4 presents and discusses the observed results. Finally, Section 5 concludes this case study with the main contribution of our method.
This is an implementation of a novel predictive method for the classification and mining of omics data, namely gene expression data, called Network-constrained forest (NCF) (Andel et al., 2014). The method is based on the Random Forest (RF) framework (Breiman, 2001), which means that it profits from the stochastic nature of ensemble classifiers. The method integrates prior knowledge, in the form of validated or predicted interactions between omics features, directly into the predictive models by means of regularization. Unlike set-level methods (Holec et al., 2012), regularization-based methods do not impose strictly defined gene sets or aggregated metafeatures. Instead, they merely prefer certain hypotheses, namely those based on features known a priori to interact, over less expected ones. A hypothesis is still deemed valid only if it has strong support in the data.
The NCF method learns its base decision trees on features that lie close to the candidate genes in the feature interaction network. This differs from RF, which uses randomly selected predictors in each decision node. Instead, NCF first samples one feature as a seed, a potential candidate for causing the phenomenon under study, and then samples the remaining features from a probability distribution over the omics network. The distribution is parametrized so that it prefers features lying closer to the seed gene. As the distribution is defined by a Markovian random walk, its natural parameter is the walk length k. A longer walk from the seed gene means a weaker preference for genes lying close to the seed.
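The seed-then-walk sampling described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the published NCF implementation: the network is assumed to be an adjacency dictionary, each feature is drawn as the endpoint of a walk of exactly k uniform steps from the seed, and all names (`random_walk_endpoint`, `sample_features`, `max_tries`) are hypothetical.

```python
import random


def random_walk_endpoint(graph, seed_gene, k, rng):
    """Return the node reached after k uniform random steps from seed_gene.

    graph is an adjacency dict {node: [neighbour, ...]} (assumed format).
    """
    node = seed_gene
    for _ in range(k):
        neighbours = graph.get(node)
        if not neighbours:  # isolated node: the walk cannot move on
            break
        node = rng.choice(neighbours)
    return node


def sample_features(graph, candidates, k, n_features, rng, max_tries=10_000):
    """Pick a seed among the candidate genes, then collect features by
    repeated k-step walks started from that seed.

    Short walks end near the seed, so small k concentrates the sample
    around it; larger k spreads the sample over the network.
    """
    seed_gene = rng.choice(candidates)
    selected = {seed_gene}
    tries = 0
    # max_tries guards against neighbourhoods smaller than n_features.
    while len(selected) < n_features and tries < max_tries:
        selected.add(random_walk_endpoint(graph, seed_gene, k, rng))
        tries += 1
    return seed_gene, sorted(selected)


if __name__ == "__main__":
    # Toy path network A - B - C - D (hypothetical gene names).
    toy_graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
    rng = random.Random(0)
    print(sample_features(toy_graph, ["A"], k=2, n_features=2, rng=rng))
```

In this sketch a tree would then be grown on the returned feature subset; how the real method combines walks of different lengths is not specified here and is left out.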
The data, provided by our collaborating lab at the Institute of Hematology and Blood Transfusion in Prague, are related to myelodysplastic syndrome (MDS). Illumina miRNA (Human v2 MicroRNA Expression Profiling Kit, Illumina, San Diego, USA) and mRNA (HumanRef-8 v3 and HumanHT-12 v4 Expression BeadChips, Illumina) expression profiling were used to investigate the effect of lenalidomide treatment on miRNA and mRNA expression in bone marrow (BM) CD34+ progenitor cells and peripheral blood (PB) CD14+ monocytes. Quantile normalization was performed independently for each of the two expression sets; the datasets were then scaled to have an identical median of 1. The mRNA dataset has 16,666 attributes representing the GE level through the amount of corresponding mRNA measured, while the miRNA dataset has 1,146 attributes representing the expression levels of particular miRNAs. The measurements were conducted on 75 samples labelled as follows:
Based on these categories, we defined 7 binary classification tasks of clear clinical or biological interest. The tasks were to differentiate:
As domain knowledge about omics interactions, we downloaded protein-protein interactions and interactions between genes and miRNAs from the following publicly available databases:
In the end, we worked with 9,077 genes and 463 miRNAs, involved in 79,288 protein-protein interactions and 92,886 miRNA-target interactions, respectively. The candidate causal genes, a total of 145 genes associated with MDS and 220 genes associated with OC, were obtained from (Yu et al., 2010). We used a separate candidate-gene set for each disease domain. We did not define task-specific subsets, because some tasks are so specific that no prior candidate genes were available for them.
We constructed a workflow (Link) that replicates the results in (Andel et al., 2014). It compares NCF with three classical machine learning algorithms. The comparison with the canonical RF classifier shows the improvement gained by integrating prior knowledge. A single decision tree serves as a lower bound and illustrates the benefit of ensemble learning in general. A support vector machine (SVM) serves as an upper bound on predictive performance. However, SVM is a black-box model, which means that it is hard to interpret, and in omics data analysis the model itself is often valued as much as its predictions. NCF also offers improved interpretability of the resulting models, mainly thanks to the prior knowledge it employs.
The workflow starts with a bulk upload of the mRNA and miRNA expression matrices, which the user provides as zipped comma-separated values (CSV) files. These are real-valued matrices whose widths correspond to the number of genes and miRNAs, respectively. The biological samples in the two matrices match. Then the corresponding feature interactions are provided, namely PPI and miRNA-target interactions. These are tab-separated files with two columns holding the names of the interacting genes or miRNAs, i.e., each row refers to one interaction. Next, the workflow iterates through the 4 examined learners within a cross-validation metablock. Each learner is independently trained and tested within each of the 5 folds of the cross-validation iterator. Finally, the results are collected and aggregated in the table and boxplot containers. Predictive power is reported in terms of the Matthews correlation coefficient (MCC).
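For readers unfamiliar with the metric, MCC for a binary task is computed from the confusion matrix as (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The following minimal Python sketch illustrates the formula; it is not the code used inside miXGENE. It assumes labels encoded as 0 and 1, and it returns 0.0 when the denominator is zero, which is a common convention rather than something stated in this study.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary labels encoded as 0/1."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:  # truth == 1 and pred == 0 under the 0/1 assumption
            fn += 1
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Convention (assumption): undefined MCC is reported as 0.
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # One of two positives is missed; both negatives are correct.
    print(matthews_corrcoef([1, 1, 0, 0], [1, 0, 0, 0]))
```

MCC ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect prediction), and unlike plain accuracy it stays informative when the two classes are unbalanced.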
For the walk length k, the key parameter of the network-based distribution (see Section 1), we proceeded as follows in our original paper. First, we ran NCF for several values of k. Then we looked for common patterns relating predictive performance, the incidence of underfitted trees within the forest, and the walk length k. From these patterns, we defined a heuristic that sets the optimal walk length k based solely on the incidence of underfitted trees, without looking at the estimated predictive performance (i.e., before cross-validation). Finally, we selected the results consistent with this heuristic and presented them.
Unlike in the original paper, here we implemented the heuristic directly in the workflow, which makes it possible to validate even the k-optimization itself within the cross-validation framework. In other words, for each training fold, the block runs NCF for several values of k, picks the one consistent with the heuristic, and tests it against the testing fold. Thus we obtain a truly unbiased estimate of NCF accuracy.
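The per-fold selection of k can be sketched as follows. This is a schematic illustration, not the miXGENE block itself: `fit`, `choose_k` and `score` are hypothetical callables supplied by the caller, with `choose_k` standing in for the underfitted-tree heuristic, whose exact rule is not reproduced here. The folds are contiguous and unshuffled for brevity, which is also an assumption.

```python
def k_fold_indices(n_samples, n_folds):
    """Split indices 0..n_samples-1 into n_folds (train, test) pairs.

    Folds are contiguous and unshuffled (simplification for the sketch).
    """
    base, extra = divmod(n_samples, n_folds)
    folds, start = [], 0
    for i in range(n_folds):
        size = base + (1 if i < extra else 0)
        test = list(range(start, start + size))
        train = [j for j in range(n_samples) if j < start or j >= start + size]
        folds.append((train, test))
        start += size
    return folds


def cross_validate_with_k_selection(n_samples, k_values, fit, choose_k, score,
                                    n_folds=5):
    """Select the walk length inside every training fold, then test once.

    fit(train_idx, k)      -> model trained on the training fold only
    choose_k(models_by_k)  -> selected k; it sees training-fold models only
    score(model, test_idx) -> performance on the held-out fold (e.g. MCC)
    """
    results = []
    for train, test in k_fold_indices(n_samples, n_folds):
        # One model per candidate k, all built without the test fold.
        models = {k: fit(train, k) for k in k_values}
        best_k = choose_k(models)
        # The test fold is touched exactly once, after k is fixed.
        results.append((best_k, score(models[best_k], test)))
    return results
```

Because the test fold plays no part in choosing k, the scores collected in `results` are not inflated by the parameter search, which is the property the paragraph above relies on.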
Table 1 presents an empirical comparison of NCF and the benchmark learners on 10 classification tasks. The predictive performance of the evaluated methods is reported as Matthews correlation coefficients. The NCF results were obtained under the optimal parametrization of the random walk length k, as described in Section 3.
The results suggest that our network-enriched ensemble provides a good trade-off between the two extremes, the interpretable single decision tree and the black-box SVM. NCF shows good classification accuracy while being more comprehensible than black-box models (see Andel et al., 2014). In most cases, the predictive power of NCF is better than or equal to that of the state-of-the-art RF, and overall, in terms of average accuracy, NCF is even competitive with the black-box SVM (see Figure 1).