This study demonstrates how a researcher could use miXGENE to reproduce the experiments published in our paper (Andel et al., 2014). The corresponding miXGENE workflow is here. The workflow is a read-only version. The user can duplicate the workflow an use it as a template. In the original paper we had presented a novel method for knowledge-guided predictive analysis of omics, namely gene expression (GE) and microRNA (miRNA) expression data. In Section 1 we briefly described our published method. Then, in Section 2, the data domain which we used for validation of our method, is presented. In Section 3 we depict the experimental protocol implemetned in miXGENE for the method validation together with its proper parametrization. In Section 4 the observed results are presented and discussed. Finally, Section 5 cuncludes this case study with main contribution of our method.

1 Method Description

This is an implementation of a novel predictive method for omics, namely gene expression, data classification and mining called Network-constrained forest (NCF) (Andel et al., 2014). The method is based on Random Forest (RF) (Breiman, 2001) framework, which means it profits from stochastic nature of ensemble classifiers. The method integrates prior knowledge in terms of validated or predicted omics feature interactions directly to the predictive models by means of regularization. Unlike set-level methods (Holec et al., 2012) regularization based methods does not impose strictly defined gene sets or aggregated metafeatures. Instead, it merely prefers certain hypothesis, namely the one based on prior interacting features, over the less expected. The hypothesis is deemed valid if and only if it has strong support in the data.

The NCF method learns base decision trees on those features that lie close to the candidate genes in the feature interaction network. This selection is unlike RF, which uses randomly selected predictors in each decision node. Instead, the NCF firstly samples a feature as a seed, potentially the candidate for causing the phenomenon under study, then it samples the rest from a probabilistic distribution over the omics network. The distribution is parametrized to certainly prefer selecting the features lying closer the seed gene. As the distribution is defined as Markovian random walk, its parameter is naturally the walk length k. The longer walk from seed gene means less preference to genes lying closer the seed.

2 Data and Domain Knowledge

The data, provided by our collaborative lab at the Institute of Hematology and Blood Transfusion in Prague, are related to myelodysplastic syndrome (MDS). Illumina miRNA (Human v2 MicroRNA Expression Profiling Kit, Illumina, San Diego, USA) and mRNA (HumanRef-8 v3 and HumanHT-12 v4 Expression BeadChips, Illumina) expression profiling were used to investigate the effect of lenalidomide treatment on miRNA and mRNA expression in bone marrow (BM) CD34+ progenitor cells and peripheral blood (PB) CD14+ monocytes. Quantile normalization was performed independently for both the expression sets, then the datasets were scaled to have the identical median of 1. The mRNA dataset has 16,666 attributes representing the GE level through the amount of corresponding mRNA measured, while the miRNA dataset has 1,146 attributes representing the expression level of particular miRNAs. The measurements were conducted on 75 samples labelled as follows:

healthy (H), or afflicted (A),
non-del(5q), or del(5q), which is presence of partial deletion of the chromosome 5 ,
before (BT), or during lenadelomide treatment (DT).

On these categories we defined 7 binary classification tasks with a clear clinical or biological interest: These tasks were to differentiate:

BMBT_DT5q: del(5q) samples before vs. during treatment in BM,
PBBT_DT5q: del(5q) samples BT vs. DT in PB.
BMH_ABT5q: healthy vs. afflicted BT with del(5q) in BM,
PBH_ABT5q: healthy vs. afflicted BT with del(5q) in PB,
BMH_ABTnon-5q: healthy vs. afflicted BT without del(5q) in BM,
PBH_ABTnon-5q: healthy vs. afflicted BT without del(5q) in PB,
BMH_ADT5q: healthy vs. afflicted DT with del(5q) in BM,
PBH_ADT5q: healthy vs. afflicted DT with del(5q) in PB,
BMnon-5q_5qBT: samples BT with del(5q) vs. without del(5q) in BM,
BMnon-5q_5qBT: samples BT with del(5q) vs. without del(5q) in PB,

Considering available domain knowledge in terms of omics interactions, we downloaded the interactions between proteins, and genes and miRNAs, from the following publicly available databases:

in vitro validated miRNA-mRNA interactions from TarBase 6.0 (Vergoulis et al., 2012),
algorithmically predicted interactions were downloaded from miRWalk database (Dweep et al., 2011),
protein-protein interactions (PPI) from Human Protein Reference Database (Prasad et al., 2009).

Eventually, we handled with 9,077 genes 463 miRNAs, involved in 79,288 protein-protein interactions and 92,886 miR-tar interactions, respectively. The candidate causal genes, a total of 145 and 220 genes associated with MDS and OC respectively, were obtained from (Yu et al., 2010). We used different candidate-gene sets for both diseases as domains, while for specific subtasks we did not consider specific subset as certain tasks are so specific that there were no prior genes defined as candidates.

3 Experimental Protocol

We constructed a workflow (Link) that replicates the results in (Andel et al., 2014). It compares NCF with three other classical machine learning algorithms. It compares NCF with canonical RF classifier to show the improvement thanks to prior knowledge integration. As a lower bound, performance of a single decision tree is shown to illustrate profit of ensemble learning in general. Support vector machine (SVM) is shown as an upper bound of predictive performance. Nevertheless, SVM is a black-box model, which means it has poor interpretation. In omics data analysis, the model itself is often as well appreciated as its predictive output, though. NCF offers also improved interpretability of resulting models, especially due to employed prior knowledge.

The workflow starts by mass uploading mRNA and miRNA expression matrices provided by the user in the zipped comma-separated-values files. These are real-valued matrices of width corresponding to the number of genes and miRNAs, respectively. The biological samples in both the matrices match. Then the corresponding feature interactions are provided, namely PPI and miRNA-target interactions. They are tab separated files of two columns representing the names of interacting genes or miRNAs, respectively, i.e, each row refers to one interaction. Next, it iterates through the 4 examined learners within a cross-validation metablock. Each learner is independently trained and tested within each of the 5 folds of cross-validation iterator. Finally the results are collected and aggregated in the table and boxplot containers. The predictive power is depicted in terms of Matthews correlation coefficient (MCC).

Considering walk-length k, the key parameter of network-based distribution (see Section 2), in our original paper we did as follows. Firstly we run NCF for several values of k. Then we looked at common patterns of predictive performance behaviour, trend of underfitted tree incidence within the forest and walk length k. Consequently we defined a heuristic to set the optimal walk length k based merely on the incidence of underfitted trees, without looking at predictive performance estimation (i.e. before cross-validation). Eventually, we picked those results under consistent with this heuristic, and presented them.

Unlike the original paper, here we implemented the heuristic directly into the workflow, which enables to validate even the k-optimization itself within cross-validation framework. In the other words, for each training fold, the block runs NCF for several k-values, then picks up the one consistent with the heuristic and tests it against the testing fold. Thus we get truly unbiased estimate of NCF accuracy.

4 Results

Table 1 illustrates an empirical comparison of NCF and benchmark learners within 10 classification tasks. Predictive performance of evaluated methods is depicted in Mathews correlation coefficients. The results of NCF are reached under optimal parametrization, random walk-length k, as described in Section 3.

**Table 1:** Predictive performance of examined learners in terms of MCC

The results suggest that our network-enriched ensemble provides a good trade off between these two extremes. NCF shows good classification accuracy, while being more comprehensible than black-box models (see Andel et al., 2014). In most of the cases, NCF has a better or equal predictive power than the state-of-the-art RF and as a whole, in terms of average accuracy, is even competitive with the black-box SVM (see Figure 1).

**Figure 2:**Comparison of the examined learners aggregated over folds

5 Contribution of NCF

We propose a novel parameter-free method for learning from high-dimensional and low-sample size data complemented by feature interaction network.
Unlike standard RF, the decorrelation of the individual weak classifiers is not reached through bootstrap sampling and random subsetting of the features, but pseudorandom subsetting driven by the feature interaction network.
The individual trees deal with feature sets sampled from different areas of a feature network, the curated feature interactions thus tend to be promoted. Still, they are not strictly imposed; the method remains stochastic in its nature and an arbitrary feature relationship may appear in a tree.
Unlike set-level techniques, NCF does not rely on feature extraction based on prior modules such as pathways or simple interaction subgraphs.
The obtained results suggest that introducing domain knowledge improves the accuracy of the forest and increases its compliance with the current knowledge.

Bibliography

Leo Breiman: Random forests
Machine Learning, 45 (1): 5-32, 2001

Michael Andel, Jiri Klema, and Zdenek Krejcik: Network-constrained forest for regularized omics data classification
Bioinformatics and Biomedicine (BIBM), 2014 IEEE International Conference on: 410-417, 2014
URL http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6999193& amp; amp;tag=1

Thanasis Vergoulis, Ioannis S. Vlachos, Panagiotis Alexiou, et al.: Tarbase 6.0: capturing the exponential growth of miRNA targets with experimental support
Nucleic Acids Research, 40 (Database-Issue): 222-229, 2012
URL http://dblp.uni-trier.de/db/journals/nar/nar40. html#VergoulisVAGMRGKDH12

Matej Holec, Jiri Klema, Filip Zelezny, and Jakub Tolar: Comparative evaluation of set-level techniques in predictive classification of gene expression samples
BMC Bioinformatics, 13: S10-S15, 2012
URL http://www.biomedcentral.com/1471-2105/13/S10/S15

Harsh Dweep, Carsten Sticht, Priyanka Pandey, and Norbert Gretz: miRWalk Database: Prediction of possible miRNA binding sites by "walking" the genes of three genomes
Journal of Biomedical Informatics, 44 (5): 839 - 847, 2011
URL http://www.sciencedirect.com/science/article/pii/ S1532046411000785

Keshava Prasad and Renu Goel, Kumaran Kandasamy, et al.: Human Protein Reference Database - 2009 update
Nucleic Acids Research - Database Issues, 44 (5): 767 - 772, 2009
URL http://www.ncbi.nlm.nih.gov/pubmed/18988627

Wei Yu, Melinda Clyne, et al.: Phenopedia and Genopedia: disease-centered and gene-centered views of the evolving knowledge of human genetic associations
Bioinformatics, 26 (1): 145 - 146, 2010
URL http://www.ncbi.nlm.nih.gov/pubmed/19864262