LipidSig 2.0: Integrating Lipid Characteristic Insights into Advanced Lipidomics Data Analysis


Machine learning analysis


In this section, users can combine lipid species and lipid characteristics data to predict a binary outcome with various machine learning methods and select the best feature combination for further exploration. Monte-Carlo cross-validation (CV) is performed to evaluate model performance and obtain statistically robust estimates. We provide eight feature ranking methods (p-value, p-value*FC, ROC, Random Forest, Linear SVM, Lasso, Ridge, ElasticNet) and six classification methods (Random Forest, Linear SVM, Lasso, Ridge, ElasticNet, XGBoost) for users to train and select the best model. Feature ranking methods fall into two categories: univariate and multivariate analyses.
A series of follow-up analyses helps users evaluate the methods and visualise the machine learning results, including the ROC/PR curve, model performance, predicted probability, feature importance, and network.
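As an illustration of how these pieces fit together, the sketch below shows a minimal Monte-Carlo CV loop in R. It is not the LipidSigR implementation; the 70/30 split, the t-test (p-value) ranking, and the Lasso classifier are assumptions picked from the options listed above, and X/y stand for a hypothetical feature table and binary outcome.

    ## Minimal sketch of Monte-Carlo CV with a univariate ranking step and a
    ## Lasso classifier (illustration only; not the LipidSigR code).
    ## X: data frame of lipid features (samples x features); y: two-level factor.
    library(glmnet)

    monte_carlo_cv <- function(X, y, n_runs = 10, train_frac = 0.7, n_features = 10) {
      aucs <- numeric(n_runs)
      for (i in seq_len(n_runs)) {
        idx <- sample(seq_len(nrow(X)), size = floor(train_frac * nrow(X)))
        Xtr <- X[idx, , drop = FALSE];  ytr <- y[idx]
        Xte <- X[-idx, , drop = FALSE]; yte <- y[-idx]

        ## Univariate feature ranking: two-sample t-test p-value per feature
        pvals <- apply(Xtr, 2, function(f) t.test(f ~ ytr)$p.value)
        keep  <- names(sort(pvals))[seq_len(n_features)]

        ## Multivariate classifier: Lasso logistic regression (alpha = 1)
        fit  <- cv.glmnet(as.matrix(Xtr[, keep]), ytr, family = "binomial", alpha = 1)
        prob <- predict(fit, as.matrix(Xte[, keep]), s = "lambda.min", type = "response")

        ## Evaluate this CV run by ROC-AUC on the held-out samples
        aucs[i] <- as.numeric(pROC::auc(pROC::roc(yte, as.numeric(prob), quiet = TRUE)))
      }
      aucs
    }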

Demo dataset source: The landscape of cancer cell line metabolism (Nat Med. 2018)

Data Source

Download example
Upload your data table in .csv/.tsv/.xlsx

How to prepare your dataset?
How to use this function?

Uploaded data

  • Lipid abundance data
  • Condition table

Status icons indicate whether the upload succeeded, an error occurred (please check your dataset), or a warning was raised.


Data processing

Data Normalization

Data Transformation



Processed data

  • Processed abundance data: User-uploaded abundance data after data processing.
  • Condition table: User-uploaded condition table.
  • Lipid characteristics: Lipid characteristics converted according to the uploaded lipids in the abundance data. Detailed information about the converted characteristics can be found in the FAQ.
  • Lipid id: Links to the LION ID, LIPID MAPS ID, and other resource IDs for the uploaded lipids.
  • Data quality: Box and density plots of the abundance data before and after data processing (see the sketch after this list).
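For the data-quality view, plots of this kind could be produced with ggplot2 roughly as follows; raw and processed are hypothetical long-format data frames with sample and abundance columns, and this is only a sketch, not the plots generated by LipidSig 2.0.

    ## Sketch: box and density plots of abundances before/after processing
    ## (`raw`, `processed`: hypothetical long-format data frames with columns
    ## `sample` and `abundance`).
    library(ggplot2)

    abundance_box <- function(df, title)
      ggplot(df, aes(x = sample, y = log10(abundance))) + geom_boxplot() + ggtitle(title)

    abundance_density <- function(df, title)
      ggplot(df, aes(x = log10(abundance), colour = sample)) + geom_density() + ggtitle(title)

    abundance_box(raw, "Before processing");     abundance_box(processed, "After processing")
    abundance_density(raw, "Before processing"); abundance_density(processed, "After processing")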


  • Processed abundance data
  • Condition table
  • Lipid characteristics
  • Lipid id
  • Data quality

Download



Result

  • ROC/PR curve
  • Model performance
  • Predicted probability
  • Feature importance
  • Network

ROC/PR curve

The ROC and precision-recall (PR) curves are common methods for evaluating the diagnostic ability of a binary classifier. For each feature number, the mean AUC and its 95% confidence interval are calculated from all CV runs for both the ROC and the PR curve. In general, the higher the AUC, the better the model performs; an AUC of 1 represents perfect performance for either curve. The PR curve is more sensitive to highly skewed datasets and offers a more informative view of an algorithm's performance. We provide an overall ROC/PR curve summarising the CV runs across different feature numbers, as well as an average ROC/PR curve of the CV runs for a user-selected feature number.
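As a rough illustration (not the code used by LipidSig 2.0), the mean AUC and its 95% confidence interval over CV runs could be computed with the pROC and PRROC packages as below; cv_results is a hypothetical list holding each run's test labels (coded 0/1) and predicted probabilities.

    ## Sketch: mean AUC and 95% CI of the ROC and PR curves across CV runs.
    library(pROC)
    library(PRROC)

    roc_aucs <- sapply(cv_results, function(run)
      as.numeric(auc(roc(run$label, run$prob, quiet = TRUE))))

    pr_aucs <- sapply(cv_results, function(run)
      pr.curve(scores.class0 = run$prob[run$label == 1],
               scores.class1 = run$prob[run$label == 0])$auc.integral)

    ## Mean and 95% confidence interval (t-based) over the CV runs
    ci95 <- function(x) mean(x) + c(-1, 1) * qt(0.975, length(x) - 1) * sd(x) / sqrt(length(x))
    c(mean = mean(roc_aucs), lower = ci95(roc_aucs)[1], upper = ci95(roc_aucs)[2])
    c(mean = mean(pr_aucs),  lower = ci95(pr_aucs)[1],  upper = ci95(pr_aucs)[2])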

ROC curve plot


PR curve plot



Average ROC curve plot of N features



Average PR curve plot of N features


Model performance

In this part, several useful indicators are provided for evaluating model performance. For each feature number, we calculate and plot the average value and 95% confidence interval, over all CV runs, of the accuracy, sensitivity (recall), specificity, positive predictive value (precision), negative predictive value, F1 score, prevalence, detection rate, detection prevalence, and balanced accuracy, using the confusionMatrix function of the caret package. All of these indicators can be expressed in terms of true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN).
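As a hedged illustration of where these numbers come from, the snippet below applies caret::confusionMatrix to a single CV run; pred and truth are hypothetical factors of predicted and reference labels.

    ## Sketch: performance indicators for one CV run via caret::confusionMatrix.
    library(caret)

    cm <- confusionMatrix(data = pred, reference = truth, positive = levels(truth)[2])

    cm$overall["Accuracy"]
    cm$byClass[c("Sensitivity", "Specificity", "Pos Pred Value", "Neg Pred Value",
                 "F1", "Prevalence", "Detection Rate", "Detection Prevalence",
                 "Balanced Accuracy")]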


Model performance plot


Table of model evaluation information


Predicted probability

This page shows each sample's average predicted probability over the test sets of all CV runs and lets users explore incorrectly or uncertainly labelled samples. The distribution of predicted probabilities for the two reference labels is shown on the left panel, while a confusion matrix of sample numbers and proportions is laid out on the right. Results for different feature numbers can be selected manually.
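A minimal base-R sketch of this aggregation (illustrative only; run_preds is a hypothetical data frame of per-run test-set predictions with columns sample_id, truth, and prob):

    ## Average each sample's predicted probability over all CV runs in which it
    ## fell into the test set, then derive a confusion matrix at a 0.5 cut-off.
    sample_prob <- aggregate(prob ~ sample_id + truth, data = run_preds, FUN = mean)
    sample_prob$pred <- ifelse(sample_prob$prob >= 0.5, 1, 0)

    ## Confusion matrix as counts and as row-wise proportions
    counts <- table(truth = sample_prob$truth, predicted = sample_prob$pred)
    counts
    prop.table(counts, margin = 1)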



Predicted probability distribution



Confusion matrix of sample number and proportion



Table of predicted probability and labels


Feature importance


  • Algorithm-based
  • SHAP analysis

Algorithm-based

After building a high-accuracy model, users are encouraged to explore the contribution of each feature on this page. Two methods, ‘Algorithm-based’ and ‘SHAP analysis’, are provided to rank and visualise feature importance.

In the ‘Algorithm-based’ part, when users choose a certain feature number, the selection frequency and the average feature importance of the top 10 features across all CV runs are displayed. For a Linear SVM, Lasso, Ridge, or ElasticNet model, the importance of each feature is the absolute value of its coefficient, while Random Forest and XGBoost use their built-in feature importance measures.
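A minimal sketch of the two importance flavours described above, using glmnet and randomForest as stand-ins (x and y are a hypothetical numeric feature matrix and two-level outcome; this is not the exact LipidSigR code):

    ## Linear model (Lasso here): importance = |coefficient| at the chosen lambda
    library(glmnet)
    library(randomForest)

    lasso_fit  <- cv.glmnet(x, y, family = "binomial", alpha = 1)
    coefs      <- as.matrix(coef(lasso_fit, s = "lambda.min"))[-1, 1]   # drop intercept
    lasso_rank <- sort(abs(coefs), decreasing = TRUE)

    ## Tree ensemble (Random Forest): built-in importance (mean decrease in Gini)
    rf_fit  <- randomForest(x = x, y = as.factor(y))
    rf_rank <- sort(importance(rf_fit)[, "MeanDecreaseGini"], decreasing = TRUE)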


SHAP analysis

In the ‘SHAP analysis’ part, feature importance is ranked and visualised with the SHapley Additive exPlanations (SHAP) approach. SHAP, which is based on Shapley values from game theory, has recently been introduced to explain the individual predictions of any machine learning model.
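As an illustrative sketch only (not necessarily the backend used here), per-sample SHAP values can be obtained from an XGBoost model via predict(..., predcontrib = TRUE); x and y are again a hypothetical numeric feature matrix (with column names) and a 0/1 outcome vector.

    ## Sketch: per-sample SHAP values from an XGBoost model
    library(xgboost)

    bst <- xgb.train(params  = list(objective = "binary:logistic"),
                     data    = xgb.DMatrix(data = x, label = y),
                     nrounds = 50)

    ## predcontrib = TRUE returns one SHAP value per sample and feature
    ## (plus a BIAS column); mean |SHAP| gives a global importance ranking.
    shap_vals <- predict(bst, x, predcontrib = TRUE)
    shap_rank <- sort(colMeans(abs(shap_vals[, colnames(x)])), decreasing = TRUE)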




Selected frequency plot


Table of the selected frequency



Feature importance plot


Table of feature importance


SHAP feature importance plot


SHAP summary plot


Table of SHAP analysis


‘Show top N features’ selects how many features are shown in the SHAP force plot. Samples are clustered into groups (‘Number of groups’) based on their Shapley values using the ward.D hierarchical clustering method.
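A minimal sketch of that grouping step, assuming shap_vals is the per-sample SHAP matrix (e.g. from the sketch above) and n_groups is the user-chosen number of groups:

    ## Cluster samples by their Shapley-value profiles with ward.D
    hc     <- hclust(dist(shap_vals[, colnames(x)]), method = "ward.D")
    groups <- cutree(hc, k = n_groups)
    table(groups)   # number of samples per force-plot group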

SHAP force plot of top N features


Table of force plot information



The x-axis represents the value of a selected feature, and the y-axis its corresponding Shapley value. The colour parameter can be set to a second feature to check whether it has an interaction effect with the feature being plotted.
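For illustration, a dependence plot of this kind can be drawn with ggplot2 as below; feat and feat2 are hypothetical feature names, and x and shap_vals come from the earlier SHAP sketch.

    ## Sketch: SHAP dependence plot, coloured by a second (interacting) feature
    library(ggplot2)

    dep <- data.frame(value  = x[, feat],
                      shap   = shap_vals[, feat],
                      second = x[, feat2])

    ggplot(dep, aes(x = value, y = shap, colour = second)) +
      geom_point(alpha = 0.7) +
      labs(x = feat, y = paste("SHAP value for", feat), colour = feat2)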

SHAP dependence plot


Network

The correlation network helps users interrogate the interactions among features in a machine learning model. In this section, users choose an appropriate feature number according to the previous cross-validation results; the features of the best model (selected by ROC-AUC + PR-AUC) are then used to compute pairwise correlation coefficients.
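A hedged sketch of such a network with igraph; selected (the chosen feature names) and abund (the processed abundance matrix, lipids as rows) are hypothetical, and the 0.5 cut-off is only an example, not the threshold used by LipidSig 2.0.

    ## Pairwise correlations among selected features, drawn as a weighted network
    library(igraph)

    cors <- cor(t(abund[selected, ]), method = "pearson")  # feature-by-feature matrix
    cors[abs(cors) < 0.5] <- 0                             # keep only stronger edges
    diag(cors) <- 0

    net <- graph_from_adjacency_matrix(cors, mode = "undirected",
                                       weighted = TRUE, diag = FALSE)
    plot(net, edge.width = abs(E(net)$weight) * 3)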



Network of feature importance
