LipidSig 2.0: Integrating Lipid Characteristic Insights into Advanced Lipidomics Data Analysis


Machine learning analysis


In this section, users can combine lipid species and lipid characteristics data to predict a binary outcome with various machine learning methods and select the best feature combination for further exploration. Monte-Carlo cross-validation (CV) is performed to evaluate model performance and obtain statistically robust estimates. We provide eight feature ranking methods (p-value, p-value*FC, ROC, Random Forest, Linear SVM, Lasso, Ridge, ElasticNet) and six classification methods (Random Forest, Linear SVM, Lasso, Ridge, ElasticNet, XGBoost) for users to train and select the best model. Feature ranking methods fall into two categories: univariate and multivariate analyses.
A series of follow-up analyses helps users evaluate the methods and visualise the machine learning results, including the ROC/PR curve, model performance, predicted probability, feature importance, and network.
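As an illustration of how these pieces fit together, the sketch below shows a minimal Monte-Carlo CV loop in R. It is not the LipidSigR implementation; the 70/30 split, the t-test (p-value) ranking, and the Lasso classifier are assumptions picked from the options listed above, and X/y stand for a hypothetical feature table and binary outcome.

    ## Minimal sketch of Monte-Carlo CV with a univariate ranking step and a
    ## Lasso classifier (illustration only; not the LipidSigR code).
    ## X: data frame of lipid features (samples x features); y: two-level factor.
    library(glmnet)

    monte_carlo_cv <- function(X, y, n_runs = 10, train_frac = 0.7, n_features = 10) {
      aucs <- numeric(n_runs)
      for (i in seq_len(n_runs)) {
        idx <- sample(seq_len(nrow(X)), size = floor(train_frac * nrow(X)))
        Xtr <- X[idx, , drop = FALSE];  ytr <- y[idx]
        Xte <- X[-idx, , drop = FALSE]; yte <- y[-idx]

        ## Univariate feature ranking: two-sample t-test p-value per feature
        pvals <- apply(Xtr, 2, function(f) t.test(f ~ ytr)$p.value)
        keep  <- names(sort(pvals))[seq_len(n_features)]

        ## Multivariate classifier: Lasso logistic regression (alpha = 1)
        fit  <- cv.glmnet(as.matrix(Xtr[, keep]), ytr, family = "binomial", alpha = 1)
        prob <- predict(fit, as.matrix(Xte[, keep]), s = "lambda.min", type = "response")

        ## Evaluate this CV run by ROC-AUC on the held-out samples
        aucs[i] <- as.numeric(pROC::auc(pROC::roc(yte, as.numeric(prob), quiet = TRUE)))
      }
      aucs
    }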

Demo dataset source: The landscape of cancer cell line metabolism (Nat Med. 2018)

Data Source

Download example
Upload your data table in .csv/.tsv/.xlsx

How to prepare your dataset?
How to use this function?

Uploaded data

  • Lipid abundance data
  • Condition table

Status icons indicate whether the upload succeeded, an error occurred (please check your dataset), or a warning was raised.


Data processing

Data Normalization

Data Transformation



Processed data

  • Processed abundance data: User-uploaded abundance data after data processing.
  • Condition table: User-uploaded condition table.
  • Lipid characteristics: Lipid characteristics converted according to the uploaded lipids in the abundance data. Detailed information about the converted characteristics can be found in the FAQ.
  • Lipid id: Links to the LION ID, LIPID MAPS ID, and other resource IDs for the uploaded lipids.
  • Data quality: Box and density plots of the abundance data before and after data processing (see the sketch after this list).
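For the data-quality view, plots of this kind could be produced with ggplot2 roughly as follows; raw and processed are hypothetical long-format data frames with sample and abundance columns, and this is only a sketch, not the plots generated by LipidSig 2.0.

    ## Sketch: box and density plots of abundances before/after processing
    ## (`raw`, `processed`: hypothetical long-format data frames with columns
    ## `sample` and `abundance`).
    library(ggplot2)

    abundance_box <- function(df, title)
      ggplot(df, aes(x = sample, y = log10(abundance))) + geom_boxplot() + ggtitle(title)

    abundance_density <- function(df, title)
      ggplot(df, aes(x = log10(abundance), colour = sample)) + geom_density() + ggtitle(title)

    abundance_box(raw, "Before processing");     abundance_box(processed, "After processing")
    abundance_density(raw, "Before processing"); abundance_density(processed, "After processing")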


  • Processed abundance data
  • Condition table
  • Lipid characteristics
  • Lipid id
  • Data quality

Download



Result

  • ROC/PR curve
  • Model performance
  • Predicted probability
  • Feature importance
  • Network

ROC/PR curve

The ROC and precision-recall (PR) curves are common methods for evaluating the diagnostic ability of a binary classifier. For each feature number, the mean AUC and its 95% confidence interval are calculated from all CV runs for both the ROC and the PR curve. In general, the higher the AUC, the better the model performs; an AUC of 1 represents perfect performance for either curve. The PR curve is more sensitive to highly skewed datasets and offers a more informative view of an algorithm's performance. We provide an overall ROC/PR curve summarising the CV runs across different feature numbers, as well as an average ROC/PR curve of the CV runs for a user-selected feature number.
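As a rough illustration (not the code used by LipidSig 2.0), the mean AUC and its 95% confidence interval over CV runs could be computed with the pROC and PRROC packages as below; cv_results is a hypothetical list holding each run's test labels (coded 0/1) and predicted probabilities.

    ## Sketch: mean AUC and 95% CI of the ROC and PR curves across CV runs.
    library(pROC)
    library(PRROC)

    roc_aucs <- sapply(cv_results, function(run)
      as.numeric(auc(roc(run$label, run$prob, quiet = TRUE))))

    pr_aucs <- sapply(cv_results, function(run)
      pr.curve(scores.class0 = run$prob[run$label == 1],
               scores.class1 = run$prob[run$label == 0])$auc.integral)

    ## Mean and 95% confidence interval (t-based) over the CV runs
    ci95 <- function(x) mean(x) + c(-1, 1) * qt(0.975, length(x) - 1) * sd(x) / sqrt(length(x))
    c(mean = mean(roc_aucs), lower = ci95(roc_aucs)[1], upper = ci95(roc_aucs)[2])
    c(mean = mean(pr_aucs),  lower = ci95(pr_aucs)[1],  upper = ci95(pr_aucs)[2])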

ROC curve plot


PR curve plot



Average ROC curve plot of N features



Average PR curve plot of N features


Model performance

In this part, several useful indicators are provided for evaluating model performance. For each feature number, we calculate and plot the average value and 95% confidence interval, over all CV runs, of the accuracy, sensitivity (recall), specificity, positive predictive value (precision), negative predictive value, F1 score, prevalence, detection rate, detection prevalence, and balanced accuracy, using the confusionMatrix function of the caret package. All of these indicators can be expressed in terms of true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN).
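As a hedged illustration of where these numbers come from, the snippet below applies caret::confusionMatrix to a single CV run; pred and truth are hypothetical factors of predicted and reference labels.

    ## Sketch: performance indicators for one CV run via caret::confusionMatrix.
    library(caret)

    cm <- confusionMatrix(data = pred, reference = truth, positive = levels(truth)[2])

    cm$overall["Accuracy"]
    cm$byClass[c("Sensitivity", "Specificity", "Pos Pred Value", "Neg Pred Value",
                 "F1", "Prevalence", "Detection Rate", "Detection Prevalence",
                 "Balanced Accuracy")]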


Model performance plot


Table of model evaluation information


Predicted probability

This page shows each sample's average predicted probability over the test sets of all CV runs and lets users explore incorrectly or uncertainly labelled samples. The distribution of predicted probabilities for the two reference labels is shown on the left panel, while a confusion matrix of sample numbers and proportions is laid out on the right. Results for different feature numbers can be selected manually.
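A minimal base-R sketch of this aggregation (illustrative only; run_preds is a hypothetical data frame of per-run test-set predictions with columns sample_id, truth, and prob):

    ## Average each sample's predicted probability over all CV runs in which it
    ## fell into the test set, then derive a confusion matrix at a 0.5 cut-off.
    sample_prob <- aggregate(prob ~ sample_id + truth, data = run_preds, FUN = mean)
    sample_prob$pred <- ifelse(sample_prob$prob >= 0.5, 1, 0)

    ## Confusion matrix as counts and as row-wise proportions
    counts <- table(truth = sample_prob$truth, predicted = sample_prob$pred)
    counts
    prop.table(counts, margin = 1)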



Predicted probability distribution



Confusion matrix of sample number and proportion



Table of predicted probability and labels


Feature importance


  • Algorithm-based
  • SHAP analysis

Algorithm-based

After building a high-accuracy model, users are encouraged to explore the contribution of each feature on this page. Two methods, ‘Algorithm-based’ and ‘SHAP analysis’, are provided to rank and visualise feature importance.

In the ‘Algorithm-based’ part, when users choose a certain feature number, the selection frequency and the average feature importance of the top 10 features across all CV runs are displayed. For a Linear SVM, Lasso, Ridge, or ElasticNet model, the importance of each feature is the absolute value of its coefficient, while Random Forest and XGBoost use their built-in feature importance measures.
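A minimal sketch of the two importance flavours described above, using glmnet and randomForest as stand-ins (x and y are a hypothetical numeric feature matrix and two-level outcome; this is not the exact LipidSigR code):

    ## Linear model (Lasso here): importance = |coefficient| at the chosen lambda
    library(glmnet)
    library(randomForest)

    lasso_fit  <- cv.glmnet(x, y, family = "binomial", alpha = 1)
    coefs      <- as.matrix(coef(lasso_fit, s = "lambda.min"))[-1, 1]   # drop intercept
    lasso_rank <- sort(abs(coefs), decreasing = TRUE)

    ## Tree ensemble (Random Forest): built-in importance (mean decrease in Gini)
    rf_fit  <- randomForest(x = x, y = as.factor(y))
    rf_rank <- sort(importance(rf_fit)[, "MeanDecreaseGini"], decreasing = TRUE)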


SHAP analysis

In the ‘SHAP analysis’ part, feature importance is ranked and visualised with the SHapley Additive exPlanations (SHAP) approach. SHAP, which is based on Shapley values from game theory, has recently been introduced to explain the individual predictions of any machine learning model.
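As an illustrative sketch only (not necessarily the backend used here), per-sample SHAP values can be obtained from an XGBoost model via predict(..., predcontrib = TRUE); x and y are again a hypothetical numeric feature matrix (with column names) and a 0/1 outcome vector.

    ## Sketch: per-sample SHAP values from an XGBoost model
    library(xgboost)

    bst <- xgb.train(params  = list(objective = "binary:logistic"),
                     data    = xgb.DMatrix(data = x, label = y),
                     nrounds = 50)

    ## predcontrib = TRUE returns one SHAP value per sample and feature
    ## (plus a BIAS column); mean |SHAP| gives a global importance ranking.
    shap_vals <- predict(bst, x, predcontrib = TRUE)
    shap_rank <- sort(colMeans(abs(shap_vals[, colnames(x)])), decreasing = TRUE)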




Selected frequency plot


Table of the selected frequency



Feature importance plot


Table of feature importance


SHAP feature importance plot


SHAP summary plot


Table of SHAP analysis


‘Show top N features’ selects how many features are shown in the SHAP force plot. Samples are clustered into groups (‘Number of groups’) based on their Shapley values using the ward.D hierarchical clustering method.
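A minimal sketch of that grouping step, assuming shap_vals is the per-sample SHAP matrix (e.g. from the sketch above) and n_groups is the user-chosen number of groups:

    ## Cluster samples by their Shapley-value profiles with ward.D
    hc     <- hclust(dist(shap_vals[, colnames(x)]), method = "ward.D")
    groups <- cutree(hc, k = n_groups)
    table(groups)   # number of samples per force-plot group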

SHAP force plot of top N features


Table of force plot information



The x-axis represents the value of a selected feature, and the y-axis its corresponding Shapley value. The colour parameter can be set to a second feature to check whether it has an interaction effect with the feature being plotted.
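For illustration, a dependence plot of this kind can be drawn with ggplot2 as below; feat and feat2 are hypothetical feature names, and x and shap_vals come from the earlier SHAP sketch.

    ## Sketch: SHAP dependence plot, coloured by a second (interacting) feature
    library(ggplot2)

    dep <- data.frame(value  = x[, feat],
                      shap   = shap_vals[, feat],
                      second = x[, feat2])

    ggplot(dep, aes(x = value, y = shap, colour = second)) +
      geom_point(alpha = 0.7) +
      labs(x = feat, y = paste("SHAP value for", feat), colour = feat2)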

SHAP dependence plot


Network

The correlation network helps users interrogate the interactions among features in a machine learning model. In this section, users choose an appropriate feature number according to the previous cross-validation results; the features of the best model (selected by ROC-AUC + PR-AUC) are then used to compute pairwise correlation coefficients.
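A hedged sketch of such a network with igraph; selected (the chosen feature names) and abund (the processed abundance matrix, lipids as rows) are hypothetical, and the 0.5 cut-off is only an example, not the threshold used by LipidSig 2.0.

    ## Pairwise correlations among selected features, drawn as a weighted network
    library(igraph)

    cors <- cor(t(abund[selected, ]), method = "pearson")  # feature-by-feature matrix
    cors[abs(cors) < 0.5] <- 0                             # keep only stronger edges
    diag(cors) <- 0

    net <- graph_from_adjacency_matrix(cors, mode = "undirected",
                                       weighted = TRUE, diag = FALSE)
    plot(net, edge.width = abs(E(net)$weight) * 3)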



Network of feature importance
