
2. Data and Methods

2.1 Data

The high-dimensional RNA-Seq dataset, published in the Gene Expression Omnibus database, contains 147 deeply sequenced RNA-Seq samples with long (125 bp) paired-end reads [1,2]. The dataset covers five conditions: AD lesional (27 patients), AD non-lesional (27 patients), PSO lesional (28 patients), PSO non-lesional (27 patients) and a Control group (38 individuals). Given the similar number of individuals in each group, the dataset can be considered balanced. Each sample provides values for 31,262 unique gene expression signatures. The dataset contains no missing values or duplicates, and it required no additional cleaning for the purposes of this research.

Condition           Number of samples
AD lesional         27
AD non-lesional     27
PSO lesional        28
PSO non-lesional    27
Control group       38

Table 1. Summary of the dataset by condition.

2.2 Methods

2.2.1 Supervised Machine Learning Algorithms

Supervised machine learning (ML) algorithms were deployed in pursuit of the research aim of this project. Multiclass classification algorithms were used to train the ML models to distinguish between the five conditions. Five models were built for this research, one for each of the following ML algorithms: Logistic Regression (LOR), Linear Discriminant Analysis (LDA), Stochastic Gradient Descent (SGD), Support Vector Machine (SVM) and Extreme Gradient Boosting (XGBoost, or XGB).

The classifiers were trained following the "one-versus-all" principle: for each condition, a binary classifier was trained to separate that condition from all remaining data, which was treated as a single class. Each model therefore comprised five classifiers, one each for the AD lesional, AD non-lesional, PSO lesional, PSO non-lesional and Control groups, as sketched below.
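As an illustration, the following minimal sketch expresses this scheme with scikit-learn's OneVsRestClassifier. The base estimator shown here is a placeholder; the five algorithms actually used are described in sections 2.2.1.1 to 2.2.1.5.

```python
# Minimal sketch of the one-versus-all scheme, assuming scikit-learn.
# The base estimator is a placeholder, not the study's exact configuration.
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

ovr = OneVsRestClassifier(LogisticRegression(max_iter=2000))
# ovr.fit(X_train, y_train) fits five binary classifiers, one per condition;
# ovr.predict(X_test) labels each sample with the most confident classifier.
```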

The dataset was split 4:1 into training and testing sets. After partitioning the data, MinMaxScaler was applied with its default range of 0 to 1 to normalise the input variables (genes). GridSearchCV was then used to find the optimal parameters for each algorithm, methodically training and testing every combination of the parameters specified in the "grid". Model parameters were tuned by 10-fold cross-validation (CV).
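The sketch below illustrates this split-scale-tune workflow in scikit-learn. The synthetic matrix from make_classification is a stand-in for the real 147-sample expression data, and the single-parameter grid is purely illustrative; placing the scaler inside a Pipeline is a design choice that ensures each CV fold is normalised using only its own training portion.

```python
# Illustrative sketch of the split/scale/tune pipeline, assuming scikit-learn.
# Synthetic data stands in for the real matrix (147 samples, 31,262 genes).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=147, n_features=200, n_informative=50,
                           n_classes=5, random_state=0)

# 4:1 split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# MinMaxScaler (default range 0-1) inside a Pipeline, so each CV fold is
# scaled using only its own training portion
pipe = Pipeline([("scale", MinMaxScaler()),
                 ("clf", LogisticRegression(max_iter=2000))])

# GridSearchCV trains and tests every parameter combination with 10-fold CV
grid = GridSearchCV(pipe, {"clf__C": [0.01, 1, 100]}, cv=10)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))
```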

2.2.1.1 Logistic Regression

The LOR algorithm is a probabilistic classification model and one of the most widely used algorithms for classification [3]. The following LOR hyperparameters [4] were tuned: C (from -15 to 5), penalty ('l1', 'l2', 'elasticnet', 'none') and the maximum number of iterations, max_iter (set to 2000, as the default failed to converge). 'l1' regularisation is better known as Lasso regression and aids feature selection, while 'l2' regularisation is known as Ridge regression and helps prevent overfitting. The 'elasticnet' penalty, which combines the 'l1' and 'l2' penalties, was also considered.
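A hedged sketch of the corresponding grid follows. Reading "from -15 to 5" as exponents of 10 is an assumption, and the 'saga' solver is chosen here only because it supports all four penalties; the exact values searched in the study are not fully specified.

```python
# Hypothetical LOR grid consistent with the ranges reported above.
# 'saga' supports all four penalties; l1_ratio is only consulted when
# penalty='elasticnet'.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

lor_grid = {
    "C": [10.0 ** k for k in range(-15, 6)],   # assumes exponents of 10
    "penalty": ["l1", "l2", "elasticnet", "none"],
    "l1_ratio": [0.5],
}
lor_search = GridSearchCV(LogisticRegression(solver="saga", max_iter=2000),
                          lor_grid, cv=10)
# lor_search.fit(X_train, y_train) would follow the split described in 2.2.1
```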

2.2.1.2 Linear Discriminant Analysis

The LDA algorithm [5] operates by fitting a Gaussian density to each class. The solver hyperparameter [6] was tuned over three values: 'svd' (the default), 'lsqr' (least-squares solution) and 'eigen' (eigenvalue decomposition).
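A minimal sketch of this grid, varying only the solver as described:

```python
# Sketch of the LDA solver grid; only the solver is varied, matching the text.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import GridSearchCV

lda_search = GridSearchCV(LinearDiscriminantAnalysis(),
                          {"solver": ["svd", "lsqr", "eigen"]}, cv=10)
```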

2.2.1.3 Stochastic Gradient Descent

SGD is an iterative optimisation algorithm that starts from a random point on a function and travels down its slope until a minimum is reached [7]. Four hyperparameters were tuned: loss ('hinge' and 'log'), penalty ('l1', 'l2', 'elasticnet'), the maximum number of iterations, max_iter, and alpha (from 0.0001 to 100).
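A sketch under hedged assumptions (alpha stepped in powers of 10, max_iter values illustrative):

```python
# Sketch of the SGD grid. Note: the 'log' loss name matches scikit-learn 0.23
# (the version cited above); newer releases rename it 'log_loss'.
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

sgd_grid = {
    "loss": ["hinge", "log"],
    "penalty": ["l1", "l2", "elasticnet"],
    "max_iter": [1000, 2000],                    # illustrative values
    "alpha": [10.0 ** k for k in range(-4, 3)],  # 0.0001 ... 100
}
sgd_search = GridSearchCV(SGDClassifier(), sgd_grid, cv=10)
```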

2.2.1.4 Support Vector Machine

SVM is another popular classification algorithm; it defines decision boundaries ("hyperplanes") between classes using the data points closest to those boundaries ("support vectors") [8]. Two of SVM's hyperparameters were tuned: kernel ('linear', 'poly' and 'rbf', the latter being the default) and C, the regularisation parameter.
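A sketch of this grid; the C values shown are illustrative, as the searched range is not reported:

```python
# Sketch of the SVM grid; C values are illustrative.
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

svm_grid = {
    "kernel": ["linear", "poly", "rbf"],  # 'rbf' is the scikit-learn default
    "C": [0.1, 1, 10, 100],
}
svm_search = GridSearchCV(SVC(), svm_grid, cv=10)
```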

2.2.1.5 Extreme Gradient Boosting

XGB is a relatively novel ML algorithm [9] that was also implemented in this investigation. It is an optimised gradient boosting algorithm [10] that supports multiclass classification. Several hyperparameters were tuned, including learning_rate (0.1 and 0.3), n_estimators, max_depth, min_child_weight, gamma, subsample, colsample_bytree, nthread, scale_pos_weight and silent; objective was set to 'multi:softprob' and num_class to 5.
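A hedged sketch using the xgboost scikit-learn wrapper follows. Only a subset of the listed hyperparameters is shown, with illustrative values rather than the exact ones searched in the study.

```python
# Sketch of the XGBoost setup; grid values are illustrative. num_class is
# normally inferred from the labels by the scikit-learn wrapper at fit time.
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV

xgb_grid = {
    "learning_rate": [0.1, 0.3],
    "n_estimators": [100, 300],   # illustrative
    "max_depth": [3, 6],          # illustrative
}
xgb_search = GridSearchCV(XGBClassifier(objective="multi:softprob"),
                          xgb_grid, cv=10)
```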

2.2.2 Coefficients and feature importance

After training the classifiers, the linear models (LOR, LDA, SGD and SVM) were used to output coefficients: weightings assigned to each gene that indicate its importance to each condition. In contrast to the four coefficient-generating models, XGB, which originates from decision-tree algorithms, measures the significance of genes differently, through feature importance. This means XGB can provide insight into the most important genes overall, across all conditions, as sketched below.
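A sketch of how these quantities could be read off fitted models, assuming the search objects from the sketches above have been fitted:

```python
# Assumes lor_search and xgb_search (from the sketches above) have been fitted.
import numpy as np

coefs = lor_search.best_estimator_.coef_   # shape (5 conditions, n_genes)
top_per_condition = np.argsort(np.abs(coefs), axis=1)[:, -10:]  # 10 heaviest genes per condition

importances = xgb_search.best_estimator_.feature_importances_   # one score per gene, across all conditions
top_overall = np.argsort(importances)[-10:]
```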



  1. Weidinger S, Rodriguez E, Tsoi L, Gudjonsson J. Atopic Dermatitis, Psoriasis and healthy control RNA-seq cohort [Internet]. 2019 [cited 2020 Jun 17].
    Available from: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE121212 

  2. Tsoi LC, Rodriguez E, Degenhardt F, Baurecht H, Wehkamp U, Volks N, et al. Atopic Dermatitis Is an IL-13–Dominant Disease with Greater Molecular Heterogeneity Compared to Psoriasis. J Invest Dermatol [Internet]. 2019 Jul 1 [cited 2020 Apr 21];139(7):1480–9. Available from: http://www.ncbi.nlm.nih.gov/pubmed/30641038 

  3. Hackeling G. Mastering Machine Learning With scikit-learn [Internet]. 2014 [cited 2020 Jun 17]. Available from: https://www.semanticscholar.org/paper/Mastering-Machine-Learning-With-scikit-learn-Hackeling/d82f27f4a8dcee6cbab41ff954cc6c2b7709a693 

  4. scikit-learn 0.23.1 documentation [Internet]. [cited 2020 Jun 17]. Available from: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html 

  5. Ye J, Janardan R, Li Q. Two-dimensional linear discriminant analysis. Neural information processing systems foundation; 2005. 

  6. scikit-learn 0.23.1 documentation [Internet]. [cited 2020 Jun 17]. Available from: https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html 

  7. scikit-learn 0.23.1 documentation [Internet]. [cited 2020 Jun 17]. Available from: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html 

  8. scikit-learn 0.23.1 documentation [Internet]. [cited 2020 Jun 17]. Available from: https://scikit-learn.org/stable/modules/svm.html 

  9. Chen T, Guestrin C. XGBoost: A scalable tree boosting system. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining [Internet]. New York, NY, USA: Association for Computing Machinery; 2016 [cited 2020 Jun 17]. p. 785–94. Available from: https://dl.acm.org/doi/10.1145/2939672.2939785 

  10. XGBoost Documentation [Internet]. [cited 2020 Jun 17]. Available from: https://xgboost.readthedocs.io/en/latest/