2. Data and Methods
2.1 Data
The high-dimensional RNA-Seq dataset, published in the Gene Expression Omnibus (GEO) database, contains 147 deeply sequenced RNA-Seq samples using long (125-base) paired-end reads^{1}^{,}^{2}. The dataset covers five conditions: AD lesional (27 patients), AD non-lesional (27 patients), PSO lesional (28 patients), PSO non-lesional (27 patients) and a Control group (38 individuals). The dataset is considered balanced, given the similar number of individuals in each group. Each sample provides values for 31,262 unique gene expression signatures. The dataset contains accurate and relevant data without missing values or duplicates, and it required no additional cleaning for the purposes of this research.
| Condition | Number of samples |
| --- | --- |
| AD lesional | 27 |
| AD non-lesional | 27 |
| PSO lesional | 28 |
| PSO non-lesional | 27 |
| Control group | 38 |

Table 1. Summary of the dataset.
2.2 Methods
2.2.1 Supervised Machine Learning Algorithms
Supervised Machine Learning (ML) algorithms were deployed in pursuit of the Research Aim of this project. To train the ML models to distinguish between the five conditions, multiclass classification algorithms were used. Five models were built for this research, each based on one of the following ML algorithms: Logistic Regression (LOR), Linear Discriminant Analysis (LDA), Stochastic Gradient Descent (SGD), Support Vector Machine (SVM) and Extreme Gradient Boosting (XGBoost or XGB).
The classifiers were trained following the "one-versus-all" principle: a single binary classifier was trained for each condition, with all data outside the condition of interest treated as one combined class. Five classifiers were therefore trained for each model, each tasked with distinguishing AD lesional, AD non-lesional, PSO lesional, PSO non-lesional or Control group patients from the rest.
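The one-versus-all scheme described above can be sketched with scikit-learn's `OneVsRestClassifier` wrapper; the synthetic data here is only a stand-in for the real expression matrix:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic stand-in for the expression matrix: 150 samples, 20 "genes",
# 5 classes mirroring the five skin conditions
X, y = make_classification(n_samples=150, n_features=20, n_informative=8,
                           n_classes=5, random_state=0)

# One binary classifier per condition: each learns "this condition vs. the rest"
ovr = OneVsRestClassifier(LogisticRegression(max_iter=2000)).fit(X, y)
```

The wrapper holds five fitted binary estimators, one per condition, accessible through `ovr.estimators_`.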
The dataset was split in a 4:1 ratio into training and testing sets respectively. After partitioning the data, the `MinMaxScaler` was applied with its default range of 0 to 1 to normalise the input variables (genes). `GridSearchCV` was then used to find the optimal parameters for each algorithm, methodically training and testing every combination of the parameters specified in the "grid". Parameters for the models were tuned by 10-fold Cross-Validation (CV).
2.2.1.1 Logistic Regression
The LOR algorithm falls under the category of probabilistic classification models and is one of the most widely used algorithms for classification^{3}. The following LOR hyperparameters^{4} were tuned: `C` (from 15 to 5), `penalty` (`'l1'`, `'l2'`, `'elasticnet'`, `'none'`) and the maximum number of iterations, `max_iter` (set to `2000`; the default failed to converge). `'l1'` regularisation is more commonly known as Lasso regression, which drives some coefficients to zero and can therefore aid feature selection, while `'l2'` regularisation is known as Ridge regression, which helps prevent overfitting. The third penalty considered, `'elasticnet'`, combines both the `'l1'` and `'l2'` penalties.
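A hedged sketch of tuning the LOR penalty and `C`; the `'saga'` solver is chosen here because it supports both the `'l1'` and `'l2'` penalties, and the `C` values are illustrative placeholders, not the range used in the study:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Small synthetic stand-in for the expression data
X, y = make_classification(n_samples=120, n_features=20, n_informative=8,
                           n_classes=5, random_state=0)

# 'saga' handles both Lasso ('l1') and Ridge ('l2') regularisation;
# 'elasticnet' would additionally require an l1_ratio value
param_grid = {"C": [0.01, 1, 100], "penalty": ["l1", "l2"]}
search = GridSearchCV(LogisticRegression(solver="saga", max_iter=2000),
                      param_grid, cv=3)
search.fit(X, y)
```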
2.2.1.2 Linear Discriminant Analysis
The LDA^{5} algorithm operates by fitting a Gaussian density to each class. The hyperparameter^{6} `solver` was tuned, with three values tested: `svd` (singular value decomposition, the default), `lsqr` (least-squares solution) and `eigen` (eigenvalue decomposition).
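Toggling the LDA solver in a grid search can be sketched as follows, again on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=120, n_features=20, n_informative=8,
                           n_classes=5, random_state=0)

# Try each of the three solvers described above and keep the best by CV score
search = GridSearchCV(LinearDiscriminantAnalysis(),
                      {"solver": ["svd", "lsqr", "eigen"]}, cv=3)
search.fit(X, y)
```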
2.2.1.3 Stochastic Gradient Descent
SGD is an iterative optimisation algorithm that starts from a random point on a function and moves down its slope until a minimum is reached^{7}. Four hyperparameters were tuned: `loss` (`hinge` and `log`), `penalty` (`'l1'`, `'l2'`, `'elasticnet'`), the maximum number of iterations, `max_iter`, and `alpha` (from `0.0001` to `100`).
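A sketch of the SGD tuning step; the hinge loss is shown here, and the logistic loss (named `'log'` in scikit-learn 0.23, renamed `'log_loss'` in later releases) was the other value tuned. The `alpha` values are sampled from the range given above:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=120, n_features=20, n_informative=8,
                           n_classes=5, random_state=0)

# Grid over the penalty and the regularisation strength alpha
param_grid = {"penalty": ["l1", "l2", "elasticnet"],
              "alpha": [0.0001, 0.01, 1.0]}
search = GridSearchCV(SGDClassifier(loss="hinge", max_iter=2000, random_state=0),
                      param_grid, cv=3)
search.fit(X, y)
```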
2.2.1.4 Support Vector Machine
SVM is another popular classification algorithm; it works by finding the data points of different classes that lie closest to each other ("support vectors") and using them to define the boundaries ("hyperplanes") between the classes^{8}. Two of the SVM's hyperparameters were tuned: the `kernel` (`linear`, `poly` and `rbf`, the latter being the default) and `C`, the regularisation parameter.
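The SVM tuning step can be sketched in the same way; the `C` values shown are illustrative placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=120, n_features=20, n_informative=8,
                           n_classes=5, random_state=0)

# Tune the kernel and the regularisation parameter C together
param_grid = {"kernel": ["linear", "poly", "rbf"], "C": [0.1, 1, 10]}
search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X, y)
```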
2.2.1.5 Extreme Gradient Boosting
XGB is a relatively novel ML algorithm^{9} that was implemented in this investigation. It is an optimised gradient boosting algorithm^{10} that can be used for multiclass classification. Several hyperparameters were tuned, including `learning_rate` (`0.1` and `0.3`), `n_estimators`, `max_depth`, `min_child_weight`, `gamma`, `subsample`, `colsample_bytree`, `nthread`, `scale_pos_weight` and `silent`; `objective` was set to `'multi:softprob'` and `num_class` to `5`.
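An illustrative XGBoost parameter set mirroring the settings named above; the numeric values are placeholders, not the tuned values from the study, except where the text fixes them explicitly:

```python
# Illustrative parameter set; only objective, num_class and the learning_rate
# candidates come from the text, the rest are placeholder values
xgb_params = {
    "learning_rate": 0.1,           # tuned over {0.1, 0.3}
    "objective": "multi:softprob",  # per-class probabilities, as in the text
    "num_class": 5,                 # one class per condition
    "n_estimators": 100,            # placeholder; tuned in the study
    "max_depth": 6,                 # placeholder; tuned in the study
    "subsample": 0.8,               # placeholder; tuned in the study
}
# With the xgboost package installed, these would be passed as
# XGBClassifier(**xgb_params).fit(X_train, y_train)
```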
2.2.2 Coefficients and feature importance
After training the classifiers, the linear models (LOR, LDA, SGD and SVM) were used to output coefficients: weightings assigned to each gene that indicate its importance for each condition. In contrast to these four coefficient-generating models, XGB measures the significance of genes through feature importance, a concept originating from decision tree algorithms. This means that XGB can provide insight into the most important genes overall, across all conditions.
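Extracting per-condition gene weightings from a fitted linear model can be sketched as follows; `coef_` holds one weight per (condition, gene) pair, and ranking by absolute weight gives the most influential genes for a condition:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: 120 samples, 20 "genes", 5 conditions
X, y = make_classification(n_samples=120, n_features=20, n_informative=8,
                           n_classes=5, random_state=0)
clf = LogisticRegression(max_iter=2000).fit(X, y)

# One weight per (condition, gene) pair: coef_ has shape (n_classes, n_genes)
weights = clf.coef_

# Rank genes for the first condition by absolute weight, largest first
top_genes_class0 = np.argsort(-np.abs(weights[0]))[:5]
```

For XGB, the analogous quantity would be the single `feature_importances_` vector of the fitted classifier, which scores each gene's contribution across all conditions at once.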

Weidinger S, Rodriguez E, Tsoi L, Gudjonsson J. Atopic Dermatitis, Psoriasis and healthy control RNA-seq cohort [Internet]. 2019 [cited 2020 Jun 17]. Available from: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE121212 ↩
Tsoi LC, Rodriguez E, Degenhardt F, Baurecht H, Wehkamp U, Volks N, et al. Atopic Dermatitis Is an IL-13–Dominant Disease with Greater Molecular Heterogeneity Compared to Psoriasis. J Invest Dermatol [Internet]. 2019 Jul 1 [cited 2020 Apr 21];139(7):1480–9. Available from: http://www.ncbi.nlm.nih.gov/pubmed/30641038 ↩

Hackeling G. Mastering Machine Learning With scikit-learn [Internet]. 2014 [cited 2020 Jun 17]. Available from: https://www.semanticscholar.org/paper/MasteringMachineLearningWithscikitlearnHackeling/d82f27f4a8dcee6cbab41ff954cc6c2b7709a693 ↩

scikit-learn 0.23.1 documentation [Internet]. [cited 2020 Jun 17]. Available from: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html ↩

Ye J, Janardan R, Li Q. Two-dimensional linear discriminant analysis. Neural Information Processing Systems Foundation; 2005. ↩

scikit-learn 0.23.1 documentation [Internet]. [cited 2020 Jun 17]. Available from: https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html ↩

scikit-learn 0.23.1 documentation [Internet]. [cited 2020 Jun 17]. Available from: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html ↩

scikit-learn 0.23.1 documentation [Internet]. [cited 2020 Jun 17]. Available from: https://scikit-learn.org/stable/modules/svm.html ↩

Chen T, Guestrin C. XGBoost: A scalable tree boosting system. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining [Internet]. New York, NY, USA: Association for Computing Machinery; 2016 [cited 2020 Jun 17]. p. 785–94. Available from: https://dl.acm.org/doi/10.1145/2939672.2939785 ↩

XGBoost Documentation [Internet]. [cited 2020 Jun 17]. Available from: https://xgboost.readthedocs.io/en/latest/ ↩