LipidSig 2.0: Integrating Lipid Characteristic Insights into Advanced Lipidomics Data Analysis


0 Overview

The tutorial offers guidance on navigating each webpage and the expected outcomes on LipidSig.
We have included 2 helper functions and 6 analytical functions.
  • Helper functions:
    1. Data check
    2. ID conversion
  • Analytical functions:
    1. Profiling
    2. Differential expression
    3. Enrichment
    4. Machine learning
    5. Correlation
    6. Network

The following video gives an overview of all available functions and the analyses they contain.








1 DataCheck

This page is specifically designed to pre-check user-uploaded data against the requirements of various analyses. It will return checking results according to the data needs of specific analysis types. This page supports checks for 'Profiling,' 'Differential Expression,' 'Machine Learning,' and 'Correlation' analyses. Please note that 'Enrichment' and 'Network' analyses require output data from differential expression analysis.

Function introduction: the concept and expected results



Function demonstration: step-by-step tutorial



The following description uses the Profiling analysis as an example to demonstrate how to utilize the Data Check function, following the steps highlighted in the accompanying screenshot.



  1. Select an analysis type that interests you and upload all the necessary data.
  2. Once you have uploaded the data, please wait while it is processed. Your uploaded data will then be displayed in the window on the right-hand side.




  3. Scroll down the window to see a summary of the results for all checking items. A "V" symbolizes a pass, an "X" indicates an error to be fixed, and an "!" represents a warning that does not prevent the analysis from running.
  4. Review the processed data. Please note that each analysis type involves assigning lipid characteristics based on your input features, so make sure to check any non-recognized features. Once all the input data successfully passes the checks, you are ready to start the analysis. Please proceed to the relevant analysis webpage.
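
As an illustration of what such a pre-check involves, here is a minimal Python sketch for Profiling data. It is not LipidSig's implementation. The function name `check_abundance_csv`, the strings treated as missing, and the decision to report missing abundance values as a warning ("!") are assumptions made for this example. Only three rules come from this tutorial: lipid names sit in the first column, they are unique and contain no NAs, and the data has at least two samples.

```python
import csv
import io

MISSING = {"", "NA"}  # assumption: how missing cells are spelled in the CSV


def check_abundance_csv(text):
    """Return a list of (status, message) pairs; status is "V", "X" or "!"."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return [("X", "The file is empty.")]
    header, body = rows[0], rows[1:]
    names = [row[0].strip() if row else "" for row in body]
    results = []

    # Rule from the tutorial: at least two sample columns after the name column.
    n_samples = len(header) - 1
    results.append(("V" if n_samples >= 2 else "X",
                    f"{n_samples} sample column(s) found; at least 2 required."))

    # Rule from the tutorial: lipid names must not be missing.
    n_missing = sum(name in MISSING for name in names)
    results.append(("V" if n_missing == 0 else "X",
                    f"{n_missing} missing lipid name(s) in the first column."))

    # Rule from the tutorial: lipid names must be unique.
    n_dup = len(names) - len(set(names))
    results.append(("V" if n_dup == 0 else "X",
                    f"{n_dup} duplicated lipid name(s)."))

    # Assumption: missing abundance values only raise a warning here.
    n_na = sum(cell.strip() in MISSING for row in body for cell in row[1:])
    results.append(("V" if n_na == 0 else "!",
                    f"{n_na} missing abundance value(s)."))
    return results
```

Calling `check_abundance_csv(open("abundance.csv").read())` on a hypothetical file returns one status per checking item, mirroring the V / X / ! summary shown on the page.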







2 Profiling

Lipidomics technology provides fast, high-throughput screening that identifies thousands of lipid species in cells, tissues, or other biological samples, and it has been broadly used in several areas of study. On this page, we present a set of comprehensive analyses that allow researchers to explore the quality and clustering of samples, the correlation between lipids and samples, and the abundance and composition of lipids.

Function introduction: the concept and expected results



Function demonstration: step-by-step tutorial

2.1 Data source

Users can upload their own lipid abundance data or use one of the provided example datasets.

2.1.1 Try our example

The demo dataset uses data from ‘Adipose tissue ATGL modifies the cardiac lipidome in pressure-overload-induced left ventricular failure’ by Salatzki, J. et al. (1). The human plasma lipidome from 10 healthy controls and 13 patients with systolic heart failure (HFrEF) was analyzed by MS-based shotgun lipidomics. The data revealed dysregulation of individual lipid classes and lipid species in the presence of HFrEF.
You can test our example dataset by selecting it from the control panel shown below.

2.1.2 Upload your data

The data upload process is divided into three steps:
  1. Data Format Check
  2. Data Processing
  3. Data Summary and Data Quality
You will only progress to the next step after successfully completing the preceding one.
Each step is elaborated in detail, corresponding to the order marked in the screenshot.


  1. Data Format Check
    For the profiling analysis, you are required to upload only the Lipid Abundance Data in CSV format. Ensure that the lipid names are listed in the first column and are unique, without any NAs. Additionally, your data should have at least two samples. Once the data is uploaded, it will be displayed on the right-hand side of the screen. The system will then verify whether the data meets the necessary criteria and show the results of this check below.


    A "V" symbolizes a pass, an "X" indicates an error to be fixed, and an "!" represents a warning that does not prevent the analysis from running.
    If the data format is found to be incorrect, you can refer to the Data Check webpage for guidance on how to correct it.

  2. Data Processing
    In this step, you will see a range of data processing options, including 'Remove Features with Many Missing Values,' 'Missing Values Imputation,' 'Data Normalization,' and 'Data Transformation.' When you choose the 'Remove Features with Many Missing Values' option, you can set a threshold for the allowable percentage of missing values in your dataset; features exceeding this threshold will be removed. In the 'Missing Values Imputation' section, a dropdown list allows you to select a method for replacing missing values in your dataset. If you choose the 'minimum' option, the minimum value will be adjusted by a multiplier you input. Additionally, for data normalization and transformation, multiple methods are available for selection from a dropdown list, offering flexibility in processing your data.


    After setting all the options, click the "Processing" button to start it.

  3. Data Summary and Data Quality
    Upon selecting 'Processing,' users will receive a data summary with the processing details of the uploaded data, the processed abundance data, and a converted lipid characteristic table based on the features in the uploaded abundance data. Additionally, data quality results are provided, offering insights into the differences between the pre- and post-processing states of the data. Users can explore these differences through box and density plots available for both states. Detailed information can be viewed by hovering the mouse over these plots. A 'Download PDF' button above the plots allows users to download static plots.


    Users can initiate their analysis after reviewing the processed data by clicking the 'Start' button.
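
The two missing-value options in the Data Processing step can be sketched in Python as follows. This is a simplified illustration, not the code LipidSig runs. The function names are hypothetical, missing values are represented by `None`, and the 'minimum' imputation is assumed to use the smallest observed value of each lipid (the tutorial does not say whether the minimum is taken per lipid or over the whole table).

```python
def remove_sparse_features(table, max_missing_pct):
    """Drop lipids whose percentage of missing values exceeds the threshold.

    table maps a lipid name to its list of values (one per sample);
    None marks a missing value. max_missing_pct is on a 0-100 scale.
    """
    kept = {}
    for lipid, values in table.items():
        missing_pct = 100 * sum(v is None for v in values) / len(values)
        if missing_pct <= max_missing_pct:
            kept[lipid] = values
    return kept


def impute_minimum(table, multiplier):
    """Fill missing values with (minimum observed value) * multiplier.

    Assumption: the minimum is computed separately for each lipid.
    """
    imputed = {}
    for lipid, values in table.items():
        observed = [v for v in values if v is not None]
        if not observed:  # nothing to derive a minimum from; leave untouched
            imputed[lipid] = list(values)
            continue
        fill = min(observed) * multiplier
        imputed[lipid] = [fill if v is None else v for v in values]
    return imputed
```

For example, with a 70% threshold a lipid missing in three of four samples is removed, and with a multiplier of 0.5 a remaining gap is filled with half of that lipid's smallest observed value.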

2.2 Cross-Sample Variability

On this page, three types of distribution plots provide a simple view of sample variability. The first histogram depicts the number of lipids expressed in each sample. The second histogram illustrates the total amount of lipid in each sample. The density plot visualizes the underlying probability distribution of lipid abundance in each sample (one line per sample). Through these plots, users can easily compare the amount and abundance of lipids between samples (e.g., patients vs. controls).

2.2.1 Number of expressed lipids

This histogram shows how many lipid species are present in each sample, giving an overview of the total number of lipid species across samples.

2.2.2 Lipid Amount

The histogram here can be used to check the variability of the total lipid amount between samples.

2.2.3 Lipid abundance distribution

The density plot shows the distribution of lipid abundance in each sample (one line per sample). All abundance values are log10-transformed. Hovering over a line opens a box showing the abundance level and the density at that point, giving users a closer view of how the distributions differ between samples.
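
The quantities behind these three plots can be computed per sample as in the following Python sketch. It is an illustration under stated assumptions rather than the tool's own code: the function name `sample_summaries` is hypothetical, and a lipid is counted as "expressed" in a sample when its value is present and greater than zero.

```python
import math


def sample_summaries(samples, table):
    """Summarise each sample column of an abundance table.

    samples: list of sample names, in column order.
    table: maps a lipid name to its list of values (None = missing).
    """
    summary = {}
    for i, sample in enumerate(samples):
        column = [values[i] for values in table.values()]
        # Assumption: "expressed" means present and strictly positive.
        present = [v for v in column if v is not None and v > 0]
        summary[sample] = {
            "n_lipids": len(present),        # number of expressed lipids
            "total_amount": sum(present),    # total lipid amount
            # log10 values feed the abundance density plot
            "log10_abundance": [math.log10(v) for v in present],
        }
    return summary
```

The density curve itself would be estimated from `log10_abundance` with a kernel density estimator, which is omitted here.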

2.3 Dimensionality reduction

Dimensionality reduction is common when dealing with large numbers of observations and/or large numbers of variables in lipid analysis. It transforms data from a high-dimensional space into a low-dimensional space that retains the vital properties of the original data, ideally close to its intrinsic dimension.
Three dimensionality reduction methods are provided: PCA, t-SNE, and UMAP. Additionally, four clustering methods (K-means, partitioning around medoids (PAM), hierarchical clustering, and DBSCAN) can be selected from the pull-down menu. Users can also enter the number of groups shown on the PCA, t-SNE, or UMAP plot (the default is two groups).

2.3.1 Principal component analysis (PCA)

PCA is an unsupervised linear dimensionality reduction and data visualization technique for high-dimensional data that tries to preserve the global structure of the data. Scaling (on by default) means that the variables are scaled to unit variance before the analysis takes place, which removes the bias towards variables with high variance. In general, scaling (standardization) is advisable when the variables in the original dataset have been measured on significantly different scales. Centring (on by default) is mean-centring: the mean of each variable is subtracted from its values, making the mean of each variable equal to zero. This helps users avoid misleading information introduced by the overall mean. After the user submits, the PCA plot is shown. In this section, the R package "factoextra" is used to visualize the results.
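
To make the centring and scaling options concrete, the following Python sketch applies them to the values of a single variable. It is a simplified illustration, not the code used by the website: the function name `center_and_scale` is hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption.

```python
import statistics


def center_and_scale(values, center=True, scale=True):
    """Mean-centre and/or scale one variable to unit variance.

    Assumption: scaling divides by the sample standard deviation,
    which is always computed around the mean in this sketch.
    """
    mean = statistics.fmean(values) if center else 0.0
    sd = statistics.stdev(values) if scale else 1.0
    if sd == 0:  # a constant variable cannot be scaled; leave it unscaled
        sd = 1.0
    return [(v - mean) / sd for v in values]
```

With both options on, a variable measured as 1, 2, 3 becomes -1, 0, 1, so a variable recorded on a much larger scale no longer dominates the principal components.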


Alongside the PCA plot, we offer a scree plot, which is a common method for determining the number of PCs to retain. The "elbow" of the graph indicates that the components to its left explain most of the variability of the samples.


Next, two tables related to PCA show the contribution of each sample to each principal component and the contribution of each feature (lipid species). Using this information, users can decide on the top N contributing features by adjusting the slider. The correlation circle plot on the left then shows the correlation between each feature (lipid species) and a principal component (PC), which is used as the coordinate of the variable on that PC (2). It shows the relationships between all variables. Positively correlated variables fall in the same quadrant, while negatively correlated variables lie on opposite sides of the plot origin. The closer a variable is to the edge of the circle, the better it is represented on the factor map.


Lastly, by adjusting the top N features slider on the bottom left, users can take a closer look at the contribution of features to a user-defined principal component (e.g., PC1, PC2, or PC1+PC2). The histogram on the right-hand side then shows which features (lipid species) contribute most to the user-defined principal component.

2.3.2 t-distributed stochastic neighbour embedding (t-SNE)

t-distributed stochastic neighbour embedding (t-SNE) is an unsupervised non-linear dimensionality reduction technique that tries to retain the local structure (clusters) of the data when visualising high-dimensional datasets. The package ‘Rtsne’ is used for the calculation, and PCA is applied as a pre-processing step. In t-SNE, perplexity and max_iter are adjustable. Perplexity may be considered a knob that sets the number of effective nearest neighbours, while max_iter is the maximum number of iterations to perform. Typical perplexity values range between 5 and 50, but if the t-SNE plot shows a ‘ball’ of uniformly distributed points, you may need to lower the perplexity (3).

2.3.3 Uniform Manifold Approximation and Projection (UMAP)

UMAP is a non-linear dimensionality reduction method based on manifold learning that effectively visualizes clusters or groups of data points and their relative proximities. Both t-SNE and UMAP are intended predominantly to preserve local structure, that is, to group neighbouring data points, which delivers a very informative visualization of heterogeneity in the data. The significant difference from t-SNE is scalability, which allows UMAP to do without a pre-processing step (such as PCA). In addition, UMAP uses a graph Laplacian for its initialization, whereas t-SNE uses random initialization by default. Some people also suggest that the key problem of t-SNE is its Kullback-Leibler (KL) divergence, which makes UMAP superior to t-SNE. Nevertheless, UMAP's clusters may not be good enough for multi-class pattern classification (4).
Users can also choose the distance metric used to find nearest neighbours: "euclidean" (the default), "cosine", "manhattan", or "hamming". Scaling can also be applied to the input data. ‘The size of the local neighbourhood’ is the number of neighbouring sample points used for manifold approximation. Larger values lead to more global views of the manifold, whilst smaller values preserve more local structure. In general, values should be in the range of 2 to 100 (15 by default).
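
For readers unfamiliar with the four distance metrics, the Python sketch below gives one common definition of each. These are textbook formulas written for illustration, not the code of the UMAP implementation behind the website. In particular, the Hamming distance is defined here as the fraction of positions that differ, which is an assumption, since some implementations report the raw count instead.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine(a, b):
    """One minus the cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1 - dot / (norm_a * norm_b)


def hamming(a, b):
    """Fraction of positions at which the two sequences differ (assumed form)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)
```

Euclidean and Manhattan distances are sensitive to the overall magnitude of the values, whereas cosine distance only compares the direction of two abundance profiles.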

2.4 Correlation Heatmap

Correlation heatmaps illustrate the correlation between samples or between lipid characteristics and also depict the patterns in each group. The correlation can be calculated by the Pearson or Spearman method. The correlation coefficients are then clustered according to the user-defined clustering method (median, average, single, complete, Ward.D, Ward.D2, WPGMA, or UPGMC) and distance (Pearson, Spearman, Kendall, Euclidean, Maximum, Manhattan, Canberra, Binary, or Minkowski). Furthermore, to display the heatmap of lipid characteristics, users have to select one characteristic among class, structural category, functional category, total length, total db, and total oh. Please note that if the number of samples or characteristics is over 50, the names of samples/characteristics will not be shown on the heatmap.

2.4.1 Heatmap of sample-sample correlations

The first heatmap reveals sample-sample correlations. The correlation between samples is calculated by the selected correlation method. Then, the correlation coefficients are hierarchically clustered by the user-defined clustering method and distance. Correlations between samples are colored from strong positive correlation (red) to no correlation (white).

2.4.2 Heatmap of lipid characteristics correlations

The second heatmap reveals correlations between lipid characteristics. The correlation is calculated for the user-selected lipid characteristic. Correlations are colored from strong positive correlation (red), through no correlation (white), to negative correlation (blue).
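
The two correlation methods offered for these heatmaps can be written out as in the Python sketch below. It uses the standard definitions (Spearman as the Pearson correlation of average ranks) and is meant as an illustration only. The function names are hypothetical and do not correspond to LipidSig's internal code.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1 upwards; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

For a sample-sample heatmap, `x` and `y` would be the abundance columns of two samples. Spearman is the more robust choice when the relationship is monotonic but not linear.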

2.5 Lipid Characteristics

On this page, users can explore lipid abundance over a specific lipid characteristic chosen from the dropdown menu. Lipids are first classified by the selected characteristic from the ‘Lipid characteristics’ table. Next, the lipid abundance is shown in a bar plot, which depicts the abundance level of each sample within each group (e.g., PE, PC) of the selected characteristic (e.g., class). Additionally, a stacked horizontal bar chart shows the percentage of each group in each sample. For instance, if users select class as the lipid characteristic, the stacked bar chart shows the percentage of TAG, ST, SM, etc. in each sample. The variability of these percentages between samples can also be seen in this plot.
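
The percentages behind the stacked bar chart can be derived as in this short Python sketch for a single sample. It is an illustrative calculation with a hypothetical function name, `class_composition`, and it assumes that the abundances of all lipids sharing a characteristic value are summed before being converted to percentages.

```python
from collections import defaultdict


def class_composition(lipid_class, abundance):
    """Percentage of each lipid class in one sample.

    lipid_class: maps a lipid name to its class (e.g. "PC").
    abundance: maps a lipid name to its abundance in the sample.
    Assumption: class totals are sums of the member lipids.
    """
    totals = defaultdict(float)
    for lipid, value in abundance.items():
        totals[lipid_class[lipid]] += value
    grand_total = sum(totals.values())
    return {cls: 100 * amount / grand_total for cls, amount in totals.items()}
```

Running the function once per sample gives the rows of the stacked chart, and comparing the resulting percentages across samples shows the variability in composition.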







3 Differential expression

On the Differential Expression page, significant lipid species or lipid characteristics can be explored through two main customised analyses, by ‘Lipid Species’ or by ‘Lipid Characteristics’, using user-uploaded data. Subsequently, further analysis and visualisation methods, including dimensionality reduction, hierarchical clustering, characteristics analysis, and two-characteristics analysis, can be applied to the results of the differential expression analysis using user-defined methods and characteristics.

Function introduction: the concept and expected results



Function demonstration: step-by-step tutorial



3.1 Data source

Users can upload their own lipid dataset or use the example dataset (Salatzki, J. PLoS genetics 2018). Two tables, ‘Lipid abundance data’ and ‘Group information’, need to be uploaded in CSV format. ‘Lipid abundance data’ contains a feature column, which identifies each lipid (such as Cer 38:1;2 or LPC 16:0), followed by one column per sample holding its values (percentage).

3.1.1 Try our example

The demo dataset from ‘Adipose tissue ATGL modifies the cardiac lipidome in pressure-overload-induced left ventricular failure’ (1) encompasses the human plasma lipidome from 10 healthy controls and 13 patients with systolic heart failure (HFrEF), which was analysed by MS-based shotgun lipidomics. The data revealed dysregulation of individual lipid classes and lipid species in the presence of HFrEF.
You can test our example dataset by selecting it from the control panel shown below.

3.1.2 Upload your data

The data upload process is divided into three steps:
  1. Data Format Check
  2. Data Processing
  3. Data Summary and Data Quality
You will only progress to the next step after successfully completing the preceding one.
Each step is elaborated in detail, corresponding to the order marked in the screenshot.


  1. Data Format Check
    For the Differential Expression Analysis, please upload your Lipid Abundance Data and Group Information, both in CSV format. Ensure that lipid names in your abundance data are unique, listed in the first column, and do not contain any NAs. Your dataset should also include at least two samples.
    'Group Information' should consist of sample names that correspond precisely to those in the abundance data. For data involving two groups, the columns must be named in the following order: "sample_name," "label_name," "group," and "pair." Each pair must be numbered sequentially from 1 to N without any omissions, blanks, or skipped numbers; otherwise, mark the value as NA. For data with multiple groups, the columns should be arranged as "sample_name," "label_name," and "group."
    After uploading the data, a drop-down menu for selecting the reference group will appear, as shown in the screenshot. Choose a group from this menu and click the "Submit" button to proceed with the upload process.


    Once you submit the data, it will appear on the right-hand side of the screen. The system will automatically check whether the data adheres to the required criteria and display the verification results below.


    A "V" symbolizes a pass, an "X" indicates an error to be fixed, and an "!" represents a warning that does not prevent the analysis from running.
    If the data format is found to be incorrect, you can refer to the Data Check webpage for guidance on how to correct it.

  2. Data Processing
    In this step, you will see a range of data processing options, including 'Remove Features with Many Missing Values,' 'Missing Values Imputation,' 'Data Normalization,' and 'Data Transformation.' When you choose the 'Remove Features with Many Missing Values' option, you can set a threshold for the allowable percentage of missing values in your dataset; features exceeding this threshold will be removed. In the 'Missing Values Imputation' section, a dropdown list allows you to select a method for replacing missing values in your dataset. If you choose the 'minimum' option, the minimum value will be adjusted by a multiplier you input. Additionally, for data normalization and transformation, multiple methods are available for selection from a dropdown list, offering flexibility in processing your data.


    After setting all the options, click the "Processing" button to start it.

  3. Data Summary and Data Quality
    Upon selecting 'Processing,' users will receive a data summary with the processing details of the uploaded data, the processed abundance data, the group information table, and a converted lipid characteristic table based on the features in the uploaded abundance data. Additionally, data quality results are provided, offering insights into the differences between the pre- and post-processing states of the data. Users can explore these differences through box and density plots available for both states. Detailed information can be viewed by hovering the mouse over these plots. A 'Download PDF' button above the plots allows users to download static plots.


    Users can initiate their analysis after reviewing the processed data by clicking the 'Start' button.
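
The Group Information rules from the Data Format Check step can be expressed as a small validator. The Python sketch below is only an illustration of those rules, not the check performed by the website. The function name `check_group_info` is hypothetical, rows are assumed to have as many cells as the header, and the reading that an unpaired design has "NA" in every cell of the "pair" column is an assumption. The sketch also does not verify how many samples share each pair number.

```python
import csv
import io

TWO_GROUP_COLUMNS = ["sample_name", "label_name", "group", "pair"]
MULTI_GROUP_COLUMNS = ["sample_name", "label_name", "group"]


def check_group_info(text, abundance_samples):
    """Return a list of error messages; an empty list means the table passed."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    header, body = rows[0], rows[1:]
    if header not in (TWO_GROUP_COLUMNS, MULTI_GROUP_COLUMNS):
        return ["Unexpected column names or column order."]
    errors = []

    # Sample names must correspond exactly to those in the abundance data.
    if sorted(row[0] for row in body) != sorted(abundance_samples):
        errors.append("Sample names do not match the abundance data.")

    if header == TWO_GROUP_COLUMNS:
        if len({row[2] for row in body}) != 2:
            errors.append("Exactly two groups are expected.")
        pairs = [row[3] for row in body]
        # Assumption: an unpaired design has NA in every pair cell.
        if not all(p == "NA" for p in pairs):
            try:
                numbers = [int(p) for p in pairs]
            except ValueError:
                errors.append("Pair values must be all NA or all integers.")
            else:
                if set(numbers) != set(range(1, max(numbers) + 1)):
                    errors.append("Pair numbers must run from 1 to N without gaps.")
    return errors
```

An empty result corresponds to all items showing "V", and each returned message corresponds to an "X" that must be fixed before the upload can proceed.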

3.2 Lipid species analysis

3.2.1 Step1: Differentially expressed analysis

In the lipid species analysis section, differential expression analysis is performed to find significant lipid species. In short, samples are divided into independent groups based on the Group Information input. Statistical methods are provided for each data type: the t-test and the Wilcoxon test (Wilcoxon rank-sum test) for two-group data, and one-way ANOVA and the Kruskal-Wallis test for multi-group data. Additionally, the p-value is adjusted using the Benjamini-Hochberg procedure. The conditions and cut-offs for significant lipid species are also selected by the user. Please note that, with the fold change cut-off defined here, only lipid species that are up-regulated or down-regulated by at least that many fold relative to the control are selected and shown in the plots and tables.
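
The p-value adjustment and the cut-off logic can be illustrated with the Python sketch below. The Benjamini-Hochberg function follows the standard definition of the procedure. The remaining parts are assumptions made for the example: the function names, the definition of fold change as the ratio of group means, and the default cut-offs of 2-fold and 0.05 are hypothetical and are not taken from LipidSig.

```python
import math


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def log2_fold_change(group_values, reference_values):
    """log2 of mean(group) / mean(reference); the ratio of means is assumed."""
    mean_group = sum(group_values) / len(group_values)
    mean_reference = sum(reference_values) / len(reference_values)
    return math.log2(mean_group / mean_reference)


def is_significant(log2_fc, adj_p, fc_cutoff=2.0, p_cutoff=0.05):
    """Apply hypothetical fold-change and adjusted p-value cut-offs."""
    return abs(log2_fc) >= math.log2(fc_cutoff) and adj_p < p_cutoff
```

Because the fold-change test uses the absolute value of log2(Fold Change), a 2-fold cut-off keeps both lipids that double and lipids that halve relative to the reference group.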


After you submit your chosen approach, the analysis will generate one table and several plots illustrating the differentially expressed results. The table includes statistical information, with significantly expressed lipid species highlighted in pink based on the user-defined threshold.
For two-group data:
  1. The lollipop chart displays lipid species meeting the selected cut-offs. The x-axis represents log2 fold change and the y-axis lists lipid species. The color of each point reflects the -log10(adj_value/p-value). Hover over a point to view its log2(Fold Change) and -log10(adj_value/p-value).
  2. The MA plot (B-1) categorizes lipid species into three groups: up-regulated (red), down-regulated (blue), and non-significant (grey). Moving the mouse over a point provides more details about each lipid species. The lipid species of interest will appear in the boxplot on the right-hand side with their abundance level by groups. Similarly, the volcano plot (B-2) follows the same color scheme to highlight the most biologically significant lipid species visually. Hovering over a point will display its abundance boxplot.
  3. Abundance boxplot and table of a lipid species. Use the drop-down menu to select a significantly expressed lipid species. Once a specific lipid species is chosen, its abundance boxplot and corresponding table will be displayed.



For multiple-group data:
  1. The lollipop chart displays lipid species meeting the selected cut-offs. The x-axis represents the -log10 p-value, and the y-axis lists lipid species. The color of each point reflects the -log10(adj_value/p-value). Hover over a point to view its -log10(adj_value/p-value).
  2. The scatter plot shows the significantly expressed lipid species in each class. Moving the mouse over a point provides more details about each lipid species. Hovering over a point will display its abundance boxplot.
  3. Abundance boxplot and table of a lipid species. Use the drop-down menu to select a significantly expressed lipid species. Once a specific lipid species is chosen, its abundance boxplot and corresponding table will be displayed.



3.2.2 Step2: Visualization of DE lipids

The following analyses use the data filtered by the preceding differential expression analysis. They include dimensionality reduction, clustering, lipid characteristics analysis, and two-characteristics analysis.

3.2.2.1 Dimensionality reduction

Dimensionality reduction is common when dealing with large numbers of observations and/or large numbers of variables in lipid analysis. It transforms data from a high-dimensional space into a low-dimensional space that retains the vital properties of the original data, ideally close to its intrinsic dimension.
Four dimensionality reduction methods are offered: PCA, PLS-DA, t-SNE, and UMAP. Additionally, clustering methods such as K-means, partitioning around medoids (PAM), hierarchical clustering, and DBSCAN can be selected from the pull-down menu. Scaling (on by default) means that the variables are scaled to unit variance before the analysis takes place, which removes the bias towards variables with high variance. In general, scaling (standardization) is advisable when the variables in the original dataset have been measured on significantly different scales. Centring (on by default) is mean-centring: the mean of each variable is subtracted from its values, making the mean of each variable equal to zero. This helps users avoid misleading information introduced by the overall mean. After submission, the plot for the selected method is shown. For more details on the dimensionality reduction methods, please see section 2.3.

This section offers an additional dimensionality reduction method, PLS-DA. In the PLS-DA loading plot, the distance of a variable from the centre indicates the contribution of that variable. The x-axis value reveals the contribution of the variable to PLS-DA-1, whereas the y-axis value shows its contribution to PLS-DA-2.

3.2.2.2 Hierarchical clustering

Lipid species from two or multiple groups are clustered and visualised on a heatmap using hierarchical clustering. Through the heatmap, users may discover differences between the groups by observing the distribution of lipid species. This analysis provides an overview of lipid species differences between groups. Please note that if the number of lipids or samples is over 50, the names of lipids/samples will not be shown on the heatmap.
All analysis methods can be selected by the user. The input data for clustering can be either the significant lipid species or all lipid species. The distance can be calculated by Pearson, Spearman, or Kendall correlation. The top of the heatmap is grouped by sample group (top annotation), while the side of the heatmap (row annotation) can be chosen by the user and depends on the lipid characteristics, such as class, category, total length, total double bond (Total.DB), and hydroxyl group number (Total.OH) in the demo dataset. The clustering methods comprise median, average, single, complete, Ward.D, Ward.D2, WPGMA, and UPGMC. The heatmap matrix can be downloaded by clicking “Download matrix” at the bottom of the heatmap.

When hovering over a cell of the heatmap, users see the name of the sample (column), the name of the lipid species (row), and the value, which is the z-score of lipid abundance computed along the row. When hovering over the row annotation (right-hand side), the value is the option selected in the pull-down menu ‘Side colour by lipid characteristics’ (e.g., Total.DB). The top annotation indicates the group of each sample (e.g., control/experimental).

3.2.2.3 Characteristics analysis

In this part, we categorize significant lipid species based on different lipid characteristics and visualise the difference between groups by log2 Fold Change for two-group data and -log10(p-value) for multi-group data.
The 29 converted lipid characteristics can be accessed from the pull-down menu. The results are displayed in several plots.
  1. Lollipop chart
    The lollipop chart compares multiple values simultaneously: it aligns the log2(fold change) or -log10(p-value) of all significant groups (values) within the selected characteristic.
  2. Word cloud
    The word cloud shows the count of each group (value) of the selected characteristic. Note that the cut-offs and data of these plots follow the input from Step 1.
  3. Bar chart
    The bar chart shows the significant groups (values) with a mean fold change over 2 in the selected characteristic, colored red for significant and black for insignificant. Please note that the bar chart is not generated for multi-group data.



3.3 Lipid Characteristics Analysis

3.3.1 Step1: Differentially expressed analysis

The massive structural diversity of lipids contributes to their functional variety. The characteristics can range from subtle variance (e.g., the number of double bonds in the fatty acid) to major changes (e.g., diverse backbones).
In the Lipid Characteristics Analysis section, lipid species are categorised and summarised into a new lipid abundance table according to the two selected lipid characteristics, and differential expression analysis is then conducted on that table. Samples are divided into groups based on the Group Information of the input data. Two-way ANOVA and appropriate post hoc tests for each data type are used: for two-group data, the t-test and the Wilcoxon test (Wilcoxon rank-sum test) are applied, while for multi-group data, one-way ANOVA and the Kruskal-Wallis test are used. This differential expression analysis is split into two sections: the first analyses the data based on the first selected 'Characteristics', and the second adds 'Subgroup of characteristics' as a subgroup analysis of the first. In short, in the second section lipid species are split by the characteristic chosen in the second pull-down menu and then undergo the first-section analysis. Note that the two selected characteristics should be either both continuous, or one categorical and one continuous. The cut-offs for differentially expressed lipids are entered by the user.
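
The step of summarising lipid species into a new abundance table can be pictured with the Python sketch below. It is a minimal illustration under stated assumptions, not the website's implementation: the function name `summarise_by_characteristics` is hypothetical, the abundance table is assumed to be non-empty, and species sharing the same characteristic value are assumed to be combined by summing their abundances in each sample.

```python
from collections import defaultdict


def summarise_by_characteristics(abundance, characteristics, first, second=None):
    """Collapse lipid species into groups defined by one or two characteristics.

    abundance: maps a lipid name to its list of values (one per sample).
    characteristics: maps a lipid name to a dict such as
        {"class": "PC", "Total.DB": 1}.
    first: the main characteristic; second: optional subgroup characteristic.
    Assumption: abundances of lipids in the same group are summed.
    """
    n_samples = len(next(iter(abundance.values())))
    summary = defaultdict(lambda: [0.0] * n_samples)
    for lipid, values in abundance.items():
        traits = characteristics[lipid]
        key = traits[first] if second is None else (traits[second], traits[first])
        for i, value in enumerate(values):
            summary[key][i] += value
    return dict(summary)
```

Calling the function with only `first="Total.DB"` corresponds to the first section of the analysis, while adding `second="class"` produces one row per class and Total.DB combination, as in the subgroup section.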

3.3.1.1 Usage of differential expressed analysis

In the Lipid Characteristics Analysis section, two-way ANOVA is used to evaluate the interaction between lipid characteristics (Class, Total.DB, etc.) and group information on the lipid abundance values or percentages (according to the selected characteristics). For example, when you pick Total.DB in the 'Characteristics' pull-down menu, the result of the two-way ANOVA tells you whether an interaction effect between Total.DB and the groups is present. The post hoc test then calculates the significance for Total.DB and produces a p-value. In the 'Subgroup of characteristics' section, users can choose another characteristic (e.g., class). Then, using the pull-down menu, the result from the first section (e.g., Total.DB) is categorised by one of the subgroups (e.g., Cer) of the selected characteristic (e.g., class). A star above a bar marks a significant difference between groups for that subgroup of the selected characteristic.


3.3.1.2 Results of differential expressed analysis

To interpret the diagrams: in the first section, the bar chart and line chart both depict the differences between groups in each category of the selected characteristic. The box plot reveals the distribution of the characteristic within each group.


In the second section, these three plots are drawn separately for each value of the second selected characteristic, which gives users a detailed look into each subgroup of the lipid characteristic. Significant groups in the bar plots and line chart are highlighted with an asterisk. Additionally, the boxplots use an asterisk rating system to present p-values: p-values less than 0.001 are given three asterisks, less than 0.01 two asterisks, and less than 0.05 one asterisk. If no line or asterisk is shown, there is no significant difference between groups.
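
The asterisk rating system described above maps directly onto a few threshold comparisons. The Python sketch below shows that mapping. The thresholds are the ones given in this tutorial, while the function name `significance_stars` is hypothetical.

```python
def significance_stars(p_value):
    """Translate a p-value into the asterisk rating used on the boxplots."""
    if p_value < 0.001:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    return ""  # no asterisk: no significant difference between groups
```

The comparisons are ordered from the strictest threshold to the loosest, so each p-value receives the highest rating it qualifies for.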


3.3.2 Step2: Visualisation of DE lipids

3.3.2.1 Dimensionality reduction

Dimensionality reduction in this section helps users deal with large numbers of variables in lipid analysis. The high-dimensional space is transformed into a low-dimensional space that reveals the crucial properties of the lipid data while staying close to its intrinsic characteristics.
We offer four dimensionality reduction approaches (PCA, PLS-DA, t-SNE, and UMAP) and four clustering methods (K-means, partitioning around medoids (PAM), hierarchical clustering, and DBSCAN). Principal component analysis is calculated with the classical prcomp function. The scaling and centring options (both on by default) can also be adjusted by users. For more information, please see Sections 2.3 and 3.2.2.1.

3.3.2.1.1 Results of dimensionality reduction

After submission, the PCA plot is shown. In this section, the R package "factoextra" is used to visualize the results. Alongside the PCA plot, we provide a scree plot, a common criterion for determining the number of PCs to retain.
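A scree plot shows the share of total variance carried by each principal component. The sketch below assumes you already have the standard deviations of the components (as prcomp reports them) and computes the plotted proportions. The input numbers and the name `explained_variance` are hypothetical.

```python
def explained_variance(sdev):
    """Return (proportion, cumulative proportion) of variance per component."""
    variances = [s * s for s in sdev]
    total = sum(variances)
    proportions = [v / total for v in variances]
    cumulative = []
    running = 0.0
    for proportion in proportions:
        running += proportion
        cumulative.append(running)
    return proportions, cumulative


# Hypothetical standard deviations of the first four components.
proportions, cumulative = explained_variance([3.0, 2.0, 1.0, 0.5])
for i, (p, c) in enumerate(zip(proportions, cumulative), start=1):
    print(f"PC{i}: {p:.1%} of variance, {c:.1%} cumulative")
```

A common reading is to keep the components before the curve flattens, or enough components to reach a chosen cumulative proportion.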


Next, two tables related to PCA are provided so that users can see the contribution of each sample to each principal component and the contribution of each feature (lipid species). Using this information, users can decide on the top N contributing features and adjust the slider accordingly.


The correlation circle on the left then shows the correlation between a feature (lipid species) and a principal component (PC), which is used as the coordinates of the variable on the PC (2). It shows the relationships between all variables. The directions (vectors) of the features (lipid species) indicate their contribution to the principal components. Positively correlated variables fall in the same quadrants, while negatively correlated variables lie on opposite sides of the plot origin. The closer a variable is to the edge of the circle, the better it is represented on the factor map.


Lastly, by adjusting the top N feature slider on the bottom left, users can take a closer look at the contribution of features to a user-defined principal component (e.g., PC1, PC2 or PC1+PC2). The histogram on the right-hand side then shows which features (lipid species) contribute most to the user-defined principal component.

3.3.2.2 Hierarchical clustering

A new lipid abundance table, summed up from the species, is clustered and shown on a heatmap using hierarchical clustering. Through the heatmap, users may discover differences between two or multiple groups by observing the distribution of lipid characteristic abundance, which gives an overall picture of the abundance differences between groups.
The distance measure (Pearson, Spearman, or Kendall) and one of eight clustering methods (median, average, single, complete, Ward.D, Ward.D2, WPGMA, and UPGMC) can be selected from the pull-down menus. Columns are the samples, and rows are the significant characteristic groups (values) selected in the first `Characteristics` section from Step1, such as Total.DB. Please note that if the number of lipids or samples exceeds 50, the names of lipids/samples will not be shown on the heatmap. More detailed information and interpretation can be found in Section 3.2.2.2.

3.3.2.3 Two characteristics analysis

The heatmaps show the correlation between the double bond and chain length of lipid species. The color gradient in the heatmaps depends on the data type (log2FC for two-group data and p-value for multi-group data): red represents positive values and blue represents negative values. The "*" sign on the plot indicates significance.
The results are displayed in three sections, as indicated in the order marked in the screenshot below.

  1. Heatmap and table of Total FA and Each FA
    You can switch between the heatmaps of Total FA and Each FA using the tabs at the top of the heatmap.
  2. Heatmap and table of a specific characteristic
    You can view the heatmap of each characteristic by the drop-down menu.
  3. Abundance boxplot and table of a lipid species
    In the drop-down menu, you can select from lipid species differentiated by pairs of double bond and chain length, as outlined in the table from the previous section (B). Once you choose a specific lipid species, the abundance boxplot and the corresponding table will be displayed.









4 Enrichment

Enrichment analysis on our platform provides two main approaches: Over Representation Analysis (ORA) and Lipid Set Enrichment Analysis (LSEA). ORA illustrates the significant lipid species enriched in the categories of lipid class. LSEA is a computational method that determines whether an a priori-defined set of lipids shows statistically significant, concordant differences between two biological states (e.g., phenotypes).

Function introduction: the concept and expected results



Function demonstration: step-by-step tutorial



4.1 Data Source

The first step of viewing enrichment analysis results is to upload your data. This analysis necessitates a table of lipid species differential expression. You can obtain the required table by performing a 'Differential Expression Analysis' on our platform's Differential Expression webpage. A screenshot of the download page is provided below for your reference. We highly recommend using the results directly from our platform to minimize the risk of format errors.

If you opt to use your own data, please ensure that it meets all the listed requirements and passes all the checkpoints indicated on the page. The detailed requirements are listed below.
  1. Lipid species differential expression
    1. The column names are identical to those in the result of the differential expression analysis.
    2. The 'feature', 'log2FC', 'pval', and 'padj' columns do not contain NA values.
    3. All lipid names (features) are unique.
You can use or download the provided example data if you need clarification on the data format.
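Before uploading your own table, the listed requirements can be pre-checked locally. The following is a minimal sketch, not the platform's own checker: it only tests the four columns named above (the full column list of the differential expression result is not reproduced here), and the example CSV content is hypothetical.

```python
import csv
import io

# Only the four columns named in the requirements are checked here.
REQUIRED_COLUMNS = ("feature", "log2FC", "pval", "padj")
NA_TOKENS = {"", "NA", "NaN", "nan"}


def is_na(value):
    return value is None or str(value).strip() in NA_TOKENS


def check_de_table(rows):
    """Return a list of problems; an empty list means every check passed."""
    if not rows:
        return ["table is empty"]
    missing = [c for c in REQUIRED_COLUMNS if c not in rows[0]]
    if missing:
        return ["missing columns: " + ", ".join(missing)]
    problems = []
    for column in REQUIRED_COLUMNS:
        if any(is_na(row[column]) for row in rows):
            problems.append(f"column '{column}' contains NA values")
    names = [row["feature"] for row in rows]
    if len(set(names)) != len(names):
        problems.append("lipid names in 'feature' are not unique")
    return problems


# Hypothetical CSV content with one NA and one duplicated lipid name.
example_csv = """feature,log2FC,pval,padj
Cer 38:1;2,1.2,0.001,0.01
PC 34:1,-0.8,0.02,0.04
PC 34:1,0.3,NA,0.5
"""
rows = list(csv.DictReader(io.StringIO(example_csv)))
for problem in check_de_table(rows):
    print("X", problem)
```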

4.2 Over Representation Analysis

The Over-Representation Analysis results are displayed through two bar plots and a table, providing insights into the enrichment of significant lipid species within various lipid classes. Each result is described in marked order in the screenshot.
  1. The bar plot on the left is a summary. It shows the top 10 significant up-regulated and down-regulated terms for datasets involving two groups and the top 20 for multiple groups. We offer two primary ways for further manipulation, as indicated by the numbered markers in the screenshot.
    1. You can view the detailed information of a specific bar by hovering the mouse on it.
    2. You can hide or display bars belonging to a certain group by clicking on the respective legend.
  2. The bar plot on the right categorizes these significant lipid species into 'up' or 'down' groups based on their log2 fold change (red represents up-regulated, blue represents down-regulated, and grey represents non-significant), according to a user-selected characteristic.
    We offer two primary ways for further manipulation, as indicated by the numbered markers in the screenshot.
    1. You can view the detailed information of a specific bar by hovering the mouse on it.
    2. You can hide or display bars belonging to a certain group by clicking on the respective legend.
  3. Detailed information about the outcomes of the enrichment analysis is listed in the table, including characteristics, p-value, lipids, etc. To better view the table, you can click on the column name to reorder it. For example, if you click on the column "p-value," the value will display in decreasing order.
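To make the idea of over-representation concrete, the sketch below computes a one-sided hypergeometric p-value, which is a common way to test whether a lipid class contains more significant species than expected by chance. Whether LipidSig uses exactly this test is an assumption. The function name and the counts are hypothetical.

```python
from math import comb


def ora_p_value(n_total, n_in_class, n_significant, n_overlap):
    """One-sided hypergeometric p-value P(X >= n_overlap).

    n_total:        all lipid species in the table
    n_in_class:     species belonging to the tested class
    n_significant:  significant species overall
    n_overlap:      significant species that belong to the class
    """
    upper = min(n_in_class, n_significant)
    favourable = sum(
        comb(n_in_class, k) * comb(n_total - n_in_class, n_significant - k)
        for k in range(n_overlap, upper + 1)
    )
    return favourable / comb(n_total, n_significant)


# Hypothetical counts: 100 lipids measured, 10 of them Cer,
# 20 significant lipids overall, 6 of which are Cer.
p = ora_p_value(n_total=100, n_in_class=10, n_significant=20, n_overlap=6)
print(f"enrichment p-value for Cer: {p:.4g}")
```

With 2 Cer species expected by chance among the 20 significant lipids, observing 6 gives a small p-value, so the class would appear as an enriched term.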

4.3 Lipid set enrichment analysis

Lipid Set Enrichment Analysis (LSEA) is a computational method determining whether an a priori-defined set of lipids shows statistically significant, concordant differences between two biological states (e.g., phenotypes). Each result is described in marked order in the screenshot.
  1. The bar plot on the left is a summary. It shows the top 10 significant up-regulated and down-regulated terms for datasets involving two groups and the top 20 for multiple groups. We offer three primary ways for further manipulation, as indicated by the numbered markers in the screenshot.
    1. You can click on a specific bar to display the corresponding enrichment plot.
    2. You can view the detailed information of a specific bar by hovering the mouse on it.
    3. You can hide or display bars belonging to a certain group by clicking on the respective legend.
  2. The bar plot on the right categorizes these significant lipid species into 'up' or 'down' groups based on their NES (Normalized Enrichment Score) (red represents up-regulated, blue represents down-regulated, and grey represents non-significant), according to a user-selected characteristic.
    We offer three primary ways for further manipulation, as indicated by the numbered markers in the screenshot.
    1. You can click on a specific bar to display the corresponding enrichment plot.
    2. You can view the detailed information of a specific bar by hovering the mouse on it.
    3. You can hide or display bars belonging to a certain group by clicking on the respective legend.
  3. Detailed information about the outcomes of the enrichment analysis is listed in the table, including characteristics, p-value, lipids, etc. To better view the table, you can click on the column name to reorder it. For example, if you click on the column "p-value," the value will display in decreasing order.
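The enrichment plot is built from a running score over the ranked lipid list. The sketch below uses the classic unweighted running-sum statistic from gene set enrichment analysis as an illustration. The exact weighting and the permutation step that turns the score into an NES are not shown, and their use by LipidSig is an assumption. All names and data are hypothetical.

```python
def enrichment_score(ranked_lipids, lipid_set):
    """Unweighted running-sum enrichment score.

    ranked_lipids: lipid names sorted from most up- to most down-regulated.
    lipid_set:     the a priori-defined set (e.g. all species of one class).
    """
    hits = [name in lipid_set for name in ranked_lipids]
    n_hit = sum(hits)
    n_miss = len(hits) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("the set must contain some, but not all, ranked lipids")
    running = 0.0
    best = 0.0
    for is_hit in hits:
        # Step up at set members, step down everywhere else.
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


ranked = ["Cer A", "Cer B", "PC A", "Cer C", "PC B", "PC C"]
print(enrichment_score(ranked, {"Cer A", "Cer B", "Cer C"}))  # positive: set sits at the top
print(enrichment_score(ranked, {"PC A", "PC B", "PC C"}))     # negative: set sits at the bottom
```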







5 Machine learning

In this section, users can combine lipid species and lipid characteristics data to predict a binary outcome using various machine learning methods and select the best feature combination to explore further relationships. Monte-Carlo cross-validation (CV) is executed to evaluate the model performance and to reach statistical significance. Additionally, we provide eight feature ranking methods (p-value, p-value*FC, ROC, Random Forest, Linear SVM, Lasso, Ridge, ElasticNet) and six classification methods (Random Forest, Linear SVM, Lasso, Ridge, ElasticNet, XGBoost) for users to train and select the best model. Feature ranking methods fall into two categories: univariate and multivariate analysis.
A series of subsequent analyses helps users evaluate the methods and visualise the results of machine learning, including ROC/PR curve, Model predictivity, Sample probability, Feature importance, and Network.

Function introduction: the concept and expected results



Function demonstration: step-by-step tutorial



5.1 Usage and Data Source

5.1.1 Data Source

For the data sources, users can either upload their datasets or use our Demo datasets.

5.1.1.1 Try our example

Our demo dataset is modified from Li, Haoxin, et al. ‘The landscape of cancer cell line metabolism.’ Nature Medicine (2019)(5). We extracted the lipidomics data from the metabolite data and divided the cancer cell lines evenly into sensitive or resistant to SCD gene knockout based on gene dependency scores (CERES). In detail, cancer cell lines with low polyunsaturated fatty acid (PUFA) concentrations were more sensitive to Stearoyl-CoA Desaturase (SCD) gene knockout. The cell lines in the upper and lower quartiles of sensitivity were picked and separated into two groups: ‘Resistant to SCD knockout (1)’ and ‘Sensitive to SCD knockout (0)’. This example allows users to see whether lipid composition and characteristics have a combined effect on cellular function or phenotype.
You can test our example dataset by selecting it from the control panel shown below.

5.1.1.2 Upload your data

The data upload process is divided into three steps:
  1. Data Format Check
  2. Data Processing
  3. Data Summary and Data Quality
You will only progress to the next step after successfully completing the preceding one.
Each step is elaborated in detail, corresponding to the order marked in the screenshot.


  1. Data Format Check
    For the Machine Learning Analysis, you are required to upload your Lipid Abundance Data and Condition Table, both in CSV format. The lipid names in your abundance data must be unique, listed in the first column, and free of any NAs. Additionally, ensure your dataset includes at least 60 samples. The 'Condition Table' must contain sample names that match your abundance data. The columns in this table should be organized as "sample_name" and "group." The "group" column values must be numeric, limited to 0 or 1, and each group must contain more than 30 samples. After uploading your data, it will be displayed on the right-hand side of the screen. The system will then verify if your data meets the necessary criteria, and the results of this check will be shown below.


    A "V" symbolizes a pass, an "X" indicates an error to be fixed, and an "!" represents a noted warning yet does not impact running the analysis.
    If the data format is found to be incorrect, you can refer to the Data Check webpage for guidance on how to correct it.

  2. Data Processing
    In this step, you will see a range of data processing options, including 'Remove Features with Many Missing Values,' 'Missing Values Imputation,' 'Data Normalization,' and 'Data Transformation.' When you choose the 'Remove Features with Many Missing Values' option, you can set a threshold for the allowable percentage of missing values in your dataset; features exceeding this threshold will be removed. In the 'Missing Values Imputation' section, a dropdown list allows you to select a method for replacing missing values in your dataset. If you choose the 'minimum' option, the minimum value will be adjusted by a multiplier you input. Additionally, for data normalization and transformation, multiple methods are available for selection from a dropdown list, offering flexibility in processing your data.


    After setting all the options, click the "Processing" button to start it.

  3. Data Summary and Data Quality
    Upon selecting 'Processing,' users will receive a data summary covering the processing details of the uploaded data, the processed abundance data, the condition table, and a converted lipid characteristic table derived from the features in the uploaded abundance data. Additionally, data quality results are provided, offering insights into the differences between the pre- and post-processing states of the data. Users can explore these differences through box and density plots available for both states. Detailed information can be viewed by hovering the mouse over these plots. A 'Download PDF' button above the plots allows users to download static plots.


    Users can initiate their analysis after reviewing the processed data by clicking the 'Start' button.
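The format rules from step 1 can be pre-checked before uploading. This is a minimal sketch under the rules stated above (at least 60 samples, group values 0 or 1, more than 30 samples per group, matching sample names). The function name and in-memory data layout are hypothetical, and the platform's own checker may test more.

```python
def check_ml_inputs(abundance_samples, condition):
    """abundance_samples: sample names from the abundance table header.
    condition: dict mapping sample_name -> group (expected 0 or 1).
    Returns a list of problems; an empty list means the checks passed."""
    problems = []
    if len(abundance_samples) < 60:
        problems.append("fewer than 60 samples")
    if set(abundance_samples) != set(condition):
        problems.append("sample names do not match between the two tables")
    groups = list(condition.values())
    if any(g not in (0, 1) for g in groups):
        problems.append("group values must be 0 or 1")
    for label in (0, 1):
        if groups.count(label) <= 30:
            problems.append(f"group {label} needs more than 30 samples")
    return problems


# Hypothetical data: 62 samples, 31 per group.
samples = [f"sample_{i}" for i in range(62)]
condition = {name: (1 if i < 31 else 0) for i, name in enumerate(samples)}
print(check_ml_inputs(samples, condition) or "V all checks passed")
```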

5.1.2 Usage

Monte-Carlo cross-validation (MCCV) is a model validation technique that creates multiple random splits of the dataset into training and validation data, which prevents an unnecessarily large model and thus over-fitting of the calibration model (6). With MCCV, users can conduct split-sample CV multiple times and aggregate the results to quantify the predictive performance of a candidate model. For each CV run, the data is randomly split into training and testing data; the training data is used to select the top 2, 3, 5, 10, 20, 50 and 100 important features and train a model, which is then validated on the testing data. If there are fewer than 100 features in the user’s data, the total feature number is set as the maximum. The number of cross-validation runs and the proportion of data used for testing can be defined with the parameters ‘Cross-validation times’ and ‘Test data proportion’. Note that the higher the ‘Cross-validation times’, the longer it takes to calculate the results. Additionally, we give eight feature ranking methods (p-value, p-value*FC, ROC, Random Forest, Linear SVM, Lasso, Ridge, ElasticNet) and six classification methods (Random Forest, Linear SVM, Lasso, Ridge, ElasticNet, XGBoost) for users to train and select the best model.
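The splitting scheme can be sketched as follows. The two parameters mirror ‘Test data proportion’ and ‘Cross-validation times’. Everything else is an assumption made for illustration: the function names, the fixed seed, and the fact that splits here are not stratified by group.

```python
import random


def monte_carlo_splits(samples, test_proportion, cv_times, seed=0):
    """Yield (train, test) lists; each run is an independent random split."""
    rng = random.Random(seed)
    n_test = max(1, round(len(samples) * test_proportion))
    for _ in range(cv_times):
        shuffled = list(samples)
        rng.shuffle(shuffled)
        yield shuffled[n_test:], shuffled[:n_test]


def feature_numbers(n_features, candidates=(2, 3, 5, 10, 20, 50, 100)):
    """Candidate feature counts, capped at the total number of features."""
    kept = [k for k in candidates if k <= n_features]
    if n_features < candidates[-1] and n_features not in kept:
        kept.append(n_features)
    return kept


samples = [f"sample_{i}" for i in range(62)]
for train, test in monte_carlo_splits(samples, test_proportion=0.3, cv_times=3):
    print(len(train), "training /", len(test), "testing samples")
print(feature_numbers(42))  # a dataset with 42 features stops at 42
```

Unlike k-fold cross-validation, a sample may land in the testing data of several runs or of none, which is why results are aggregated over many runs.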


Feature selection methods aim to rank the variables that matter most to a model for predicting the target variable. Our platform provides two categories of feature selection methods: univariate and multivariate analysis. Univariate analysis, including p-value, p-value*Fold Change and ROC, compares each feature between the two groups and picks the top N features based on -log10(p-value), -log10(p-value)*Fold change or Area Under Curve (AUC), respectively, according to the user-selected ranking method. For multivariate analysis, we offer Random Forest (ranger), Linear SVM (e1071), Lasso (glmnet), Ridge (glmnet), and ElasticNet (glmnet). Random Forest uses its built-in feature importance results, while the others rank features by the absolute value of their coefficients in the algorithm. Note that the names in brackets are the packages we adopt.
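As an illustration of univariate ranking, the sketch below scores each lipid by its single-feature ROC AUC (the probability that a sample from group 1 has a higher value than a sample from group 0) and keeps the top N. It is a simplified stand-in, not the platform's implementation. The names and data are hypothetical, and ranking by raw AUC rather than by distance from 0.5 is an assumption.

```python
def feature_auc(values, labels):
    """AUC of one feature used alone as a score for group 1 versus group 0."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    # Count pairs where the group-1 sample ranks higher; ties count as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def rank_by_auc(table, labels, top_n):
    """table: dict lipid -> abundance values in the same order as labels."""
    scores = {lipid: feature_auc(values, labels) for lipid, values in table.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


labels = [1, 1, 1, 0, 0, 0]
table = {
    "PC 34:1": [5, 6, 7, 1, 2, 3],   # separates the groups perfectly
    "Cer 38:1": [1, 5, 3, 2, 4, 6],
    "TG 52:2": [4, 5, 1, 2, 3, 6],
}
print(rank_by_auc(table, labels, top_n=2))
```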
Classification, in brief, assigns one of the class labels to every sample uploaded by users. The classification methods overlap with the feature ranking methods and comprise Random Forest, Linear SVM, Lasso, Ridge, ElasticNet, and XGBoost. For Lasso, Ridge and ElasticNet, the ranking and classification processes are the same: 5-fold cross-validation is performed in the background to find the best lambda for fitting the model, and the number of iterations is set to 1000 to save running time. The 'nrounds' parameter in XGBoost is likewise decided by 5-fold cross-validation with max_depth=3 and early_stopping_rounds=10. If ElasticNet is selected, users should define the ‘alpha’ value, which lies between 0 (=Ridge) and 1 (=Lasso) (figure 1) (7). Apart from the above, default values are passed to the other parameters in all models. Users can also introduce one or more lipid characteristics as ‘Additional variables’ to train and test a model from the optional user-uploaded table.

5.2 ROC/PR curve

The ROC and Precision-Recall (PR) curves are common methods for evaluating the diagnostic ability of a binary classifier. The mean AUC and 95% confidence interval for the ROC and PR curves are calculated from all CV runs for each feature number. In general, the higher the AUC, the better the model performs. The PR curve is more sensitive to highly skewed datasets (i.e., rare positive samples) and offers a more informative view of an algorithm's performance in that setting (8). A random classifier yields a ROC-AUC of about 0.5 and a PR-AUC close to the proportion of positive samples. In contrast, an AUC equal to 1 represents perfect performance for both curves.
For interpreting the plots: the ROC curve is created with ‘sensitivity’ (proportion of positive samples that are correctly classified) on the y-axis and ‘1-specificity’ (proportion of negative samples that are incorrectly classified as positive) on the x-axis across different thresholds, whereas the PR curve is a similar graph with ‘precision’ (proportion of true positive samples among those predicted positive) on the y-axis and ‘recall’ (=sensitivity) on the x-axis. Generally, a better model shows a ROC curve approaching the upper left corner and a PR curve approaching the upper right corner.
To combine the testing results from all CV runs, 300 thresholds evenly distributed from 0 to 1 are used. For each threshold, the corresponding sensitivity, specificity, precision and recall are calculated from the predicted probabilities and true labels of the testing samples in each CV run. These values are then averaged to plot the final ROC and PR curves.
The upper section displays the average ROC and PR curves for different feature numbers, with their mean AUC and 95% confidence interval in the legend. For example, ‘2 AUC-0.962 (0.94-0.985)’ means that when the feature number is two, the mean AUC is 0.962 with a 95% confidence interval of 0.94 to 0.985. In the lower section, users can select a certain feature number to see the average of the cross-validations (CVs) (red line) for both the ROC and PR curves. Individual CV runs are shown in grey. The AUC, 95% CI, and p-value are shown in the box.
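The threshold-averaging step for the ROC curve can be sketched as below. Only the ROC half is shown. The function names and toy predictions are hypothetical, and the sketch assumes every testing set contains both classes and that a sample is called positive when its probability is at or above the threshold.

```python
def sensitivity_specificity(probs, labels, threshold):
    """Sensitivity and specificity of one CV run at one threshold."""
    tp = fn = fp = tn = 0
    for prob, label in zip(probs, labels):
        predicted_positive = prob >= threshold
        if label == 1:
            tp += predicted_positive
            fn += not predicted_positive
        else:
            fp += predicted_positive
            tn += not predicted_positive
    return tp / (tp + fn), tn / (tn + fp)


def average_roc(cv_runs, n_thresholds=300):
    """cv_runs: list of (predicted probabilities, true labels), one per CV run.
    Returns (1 - specificity, sensitivity) points averaged over the runs."""
    thresholds = [i / (n_thresholds - 1) for i in range(n_thresholds)]
    curve = []
    for t in thresholds:
        pairs = [sensitivity_specificity(p, y, t) for p, y in cv_runs]
        mean_sens = sum(s for s, _ in pairs) / len(pairs)
        mean_spec = sum(c for _, c in pairs) / len(pairs)
        curve.append((1 - mean_spec, mean_sens))
    return curve


# Two hypothetical CV runs: one separates the classes, one does not.
cv_runs = [
    ([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]),
    ([0.6, 0.4, 0.7, 0.3], [1, 1, 0, 0]),
]
curve = average_roc(cv_runs)
print(len(curve), "points; first:", curve[0], "last:", curve[-1])
```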


5.3 Model Performance

In this part, several useful indicators are provided for evaluating model performance. For each feature number, we calculate and plot the average value and 95% confidence interval of accuracy, sensitivity (recall), specificity, positive predictive value (precision), negative predictive value, F1 score, prevalence, detection rate, detection prevalence, and balanced accuracy across all CV runs with the confusionMatrix function in the caret package. All these indicators can be described in terms of true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN) and are summarized as follows.
Sensitivity = Recall = TP / (TP + FN)
Specificity = TN / (FP + TN)
Prevalence = (TP + FN) / (TP + FP + FN + TN)
Positive predictive value (PPV) = Precision = TP / (TP + FP)
Negative predictive value (NPV) = TN / (FN + TN)
Detection rate = TP / (TP + FP + FN + TN)
Detection prevalence = (TP + FP) / (TP + FP + FN + TN)
Balanced accuracy = (Sensitivity + Specificity) / 2
F1 score = 2 ∗ Precision ∗ Recall / (Precision + Recall)
By using the pull-down menu, users can view the selected evaluation value (e.g., Accuracy) for each number of features (e.g., 2-42 features). The highest value is coloured in red, helping users decide the number of features they want to use for classification. A detailed table incorporating all the evaluation values mentioned above is also available for download.
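The formulas above translate directly into code. This sketch computes every listed indicator from the four confusion-matrix counts. Accuracy is included with its standard definition, (TP + TN) / total, which the list above does not spell out. The counts are hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the indicators listed above from confusion-matrix counts."""
    total = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)
    specificity = tn / (fp + tn)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / total,  # standard definition
        "sensitivity": sensitivity,
        "specificity": specificity,
        "prevalence": (tp + fn) / total,
        "ppv": precision,
        "npv": tn / (fn + tn),
        "detection_rate": tp / total,
        "detection_prevalence": (tp + fp) / total,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }


# Hypothetical counts from one testing set of 100 samples.
for name, value in classification_metrics(tp=40, fp=10, fn=5, tn=45).items():
    print(f"{name}: {value:.3f}")
```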

5.4 Predicted Probability

This page shows the average predicted probability of each sample in the testing data across all CV runs and allows users to explore incorrect or uncertain labels. The distribution of predicted probabilities for the two reference labels is shown on the left panel, while a confusion matrix composed of sample numbers and proportions is laid out on the right. In the ‘Average sample probability in all CVs’ distribution plot, each point represents a sample, and its value is the mean prediction from all models across all cross-validations. The y-axis is the predicted probability of the sample, that is, the probability that the ML model predicts a value of one. In detail, the blue group shows the predicted probabilities of samples whose true value is one, and the red group shows the predicted probabilities of samples whose true value is zero. Hence, probabilities in the true-zero group should be as close to zero as possible, whereas those in the true-one group should be as close to one as possible. As for the ‘Confusion matrix’ on the right-hand side, the y-axis indicates the predicted class and the x-axis the actual class. Therefore, the upper left is a true positive; the upper right is a false positive; the lower left is a false negative; the lower right is a true negative. The numbers are counts, while the numbers in brackets are percentages. The probability plot can also be downloaded by clicking the button. Note that results for different feature numbers can be selected manually by users.
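The averaging behind the distribution plot can be sketched as follows. Each sample is averaged over the CV runs in which it appeared in the testing data, then compared with its true label. The 0.5 cut-off used to turn an average probability into a predicted class is an assumption, as are the function names and data.

```python
from collections import defaultdict


def average_probabilities(cv_predictions):
    """cv_predictions: one dict per CV run, sample -> predicted P(class 1).
    A sample is averaged over the runs in which it was in the testing data."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for run in cv_predictions:
        for sample, prob in run.items():
            sums[sample] += prob
            counts[sample] += 1
    return {sample: sums[sample] / counts[sample] for sample in sums}


def confusion_counts(avg_probs, true_labels, cutoff=0.5):
    """Count TP/FP/FN/TN; the 0.5 cut-off is an assumption of this sketch."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for sample, prob in avg_probs.items():
        predicted = 1 if prob >= cutoff else 0
        actual = true_labels[sample]
        if predicted == 1:
            counts["TP" if actual == 1 else "FP"] += 1
        else:
            counts["FN" if actual == 1 else "TN"] += 1
    return counts


runs = [{"s1": 0.9, "s2": 0.2}, {"s1": 0.7, "s3": 0.6}, {"s2": 0.4, "s3": 0.8}]
labels = {"s1": 1, "s2": 0, "s3": 0}
avg = average_probabilities(runs)
print(avg)
print(confusion_counts(avg, labels))
```

In this toy example, s3 has a true value of zero but a high average probability, which is the kind of incorrect or uncertain label the page is meant to surface.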

5.5 Feature importance

After building a high-accuracy model, users are encouraged to explore the contribution of each feature on this page. Two methods, ‘Algorithm-based’ and ‘SHAP analysis’, are available to rank and visualize feature importance.

5.5.1 Algorithm-based

In the 'Algorithm-based' part, when users choose a certain feature number, the selection frequency and the average feature importance of the top 10 features from all CV runs are displayed. For a Linear SVM, Lasso, Ridge or ElasticNet model, the importance of each feature depends on the absolute value of its coefficient in the algorithm, while Random Forest and XGBoost use built-in feature importance results.

5.5.2 SHAP analysis

In contrast, the SHapley Additive exPlanations (SHAP) approach, based on Shapley values from game theory, was introduced to explain individual predictions of any machine learning model. More detailed information can be found in Lundberg, Scott M., and Su-In Lee, ‘A Unified Approach to Interpreting Model Predictions’ (2017)(9). For our analysis, users can decide on the best model and the corresponding feature number based on the ROC-AUC and PR-AUC results. Once users enter the ‘Feature number’, the corresponding best model across all CVs is used to compute approximate Shapley values of each feature for all samples with the fastshap package in R. For all models except XGBoost, Monte Carlo repetitions are conducted to estimate each Shapley value, and the number of repetitions is defined by the parameter ‘Simulation times’. Although it takes more time, more repetitions are suggested to achieve accurate results.


In the SHAP Feature Importance plot on the upper left panel, the top 10 features are ranked according to the average absolute Shapley value across all samples. The SHAP Summary Plot on the upper right shows the distribution of all Shapley values for each feature. It uses a sina plot, in which more important features tend to show a more binary pattern. The colour encodes the value of the feature from low (yellow) to high (purple), indicating whether the variable is low or high for that observation. The x-axis shows whether the impact on the prediction (target variable) is positive or negative. The summary plot thus shows the relationship between the value of a feature and its influence on the prediction. For the exact form of the relationship, SHAP dependence plots present more details.
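The ranking in the SHAP Feature Importance plot reduces to a mean of absolute values per feature. The sketch below assumes the Shapley values have already been computed (for example with fastshap) and only shows the aggregation. The function name and numbers are hypothetical.

```python
def shap_feature_importance(shap_values, top_n=10):
    """shap_values: dict feature -> Shapley values, one per sample.
    Returns (feature, mean absolute Shapley value) pairs, largest first."""
    scores = {
        feature: sum(abs(v) for v in values) / len(values)
        for feature, values in shap_values.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_n]


# Hypothetical Shapley values for three lipids across four samples.
shap_values = {
    "PC 34:1": [0.30, -0.25, 0.35, -0.30],
    "Cer 38:1": [0.05, -0.02, 0.03, -0.06],
    "TG 52:2": [-0.10, 0.12, -0.08, 0.10],
}
for feature, score in shap_feature_importance(shap_values):
    print(f"{feature}: {score:.3f}")
```

Taking the absolute value first means a feature that pushes predictions strongly in both directions still ranks as important.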

Users can also build the SHAP force plot and dependence plot with different parameter sets in the lower section. The SHAP force plot stacks the Shapley values and shows how the selected features affect the final output for each sample. The bar colours of the SHAP force plot correspond to the features chosen with 'Show top N feature', and the samples are clustered into multiple groups (‘Number of group’) based on Shapley values using the ward.D method. In the lower-left corner, the SHAP dependence plot allows users to explore how the model output varies with a feature value. It reveals whether the link between the target and the variable is linear, monotonic, or more complex. The x-axis, y-axis and colour of the plot are all adjustable. Generally, the x-axis represents the value of a certain feature, while the y-axis is the corresponding Shapley value. The colour parameter can be assigned to check whether a second feature has an interaction effect with the feature being plotted.




5.6 Network

A correlation network helps users interrogate the interactions among features in a machine learning model. In this section, users can choose an appropriate feature number according to the previous cross-validation results, and the features in the best model (based on ROC-AUC+PR-AUC) are picked to compute their pairwise correlation coefficients. In the network, nodes (features) are filled based on feature importance, whereas line width represents the value of the correlation coefficient. Two methods, ‘Algorithm-based’ and ‘SHAP analysis’, can be selected to evaluate the feature importance; details can be found in the Feature importance part. Please note that we assign a plus or minus sign to the feature importance in SHAP analysis based on the direction of the feature values and Shapley values of the samples. Three correlation types are provided: Pearson, Spearman and Kendall. Finally, a coefficient cut-off ranging from 0 to 1 can be adjusted to hide the lines whose absolute correlation coefficients are lower than the cut-off.
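The edge list of such a network can be sketched with pairwise Pearson correlations and the cut-off rule described above. This is an illustration only: the function names and abundance values are hypothetical, and features with constant values (zero variance) are not handled.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long value lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def network_edges(table, cutoff):
    """Keep feature pairs whose |correlation| is not lower than the cut-off."""
    edges = []
    for a, b in combinations(table, 2):
        r = pearson(table[a], table[b])
        if abs(r) >= cutoff:
            edges.append((a, b, r))
    return edges


# Hypothetical abundances of three selected features across four samples.
table = {
    "PC 34:1": [1, 2, 3, 4],
    "PC 36:2": [4, 3, 2, 1],
    "Cer 38:1": [1, 3, 2, 2],
}
for a, b, r in network_edges(table, cutoff=0.5):
    print(f"{a} -- {b}: r = {r:.2f}")
```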








6 Correlation

In this section, we provide a comprehensive correlation analysis to help researchers interrogate the clinical features that connect to lipid species and other mechanistically relevant lipid characteristics. Correlation analysis between lipids and clinical features is broadly used in many fields of study; for example, Bowler RP et al. discovered that sphingomyelins are strongly associated with emphysema and glycosphingolipids are associated with COPD exacerbations (10). Hence, continuous clinical data can be uploaded here, and diverse correlation analyses are offered. For instance, the Correlation Coefficient and Linear Regression are supported for continuous clinical data. Moreover, lipids can be classified either by lipid species or by lipid categories when conducting these correlation analyses.

Function introduction: the concept and expected results



Function demonstration: step-by-step tutorial



6.1 Data source

For the data sources, users can either upload their datasets or use our Demo datasets. The datatype of the dataset must be continuous data.

6.1.1 Try our example

Here is the source of the demo dataset (continuous clinical data) that is used as an example in this section: Bowler RP et al. 'Plasma Sphingolipids Associated with Chronic Obstructive Pulmonary Disease Phenotypes' (10) (Am J Respir Crit Care Med. 2015). 69 distinct plasma sphingolipid species were detected in 129 current and former smokers by targeted mass spectrometry. This cohort was used to interrogate the associations of plasma sphingolipids with subphenotypes of COPD including airflow obstruction, emphysema, and frequent exacerbations.
You can test our example dataset by selecting it from the control panel shown below.

6.1.2 Upload your data

The data upload process is divided into three steps:
  1. Data Format Check
  2. Data Processing
  3. Data Summary and Data Quality
You will only progress to the next step after successfully completing the preceding one.
Each step is elaborated in detail, corresponding to the order marked in the screenshot.


  1. Data Format Check
    For the correlation analysis, you are required to upload your Lipid Abundance Data and Condition Table, with an optional Adjusted Table, all in CSV format. Ensure that the lipid names in your abundance data are uniquely listed in the first column and do not contain any NAs. Your dataset should comprise at least ten samples. The 'Condition Table' should include sample names and clinical conditions (such as disease status, gene dependence score, etc.), categorizing each sample into a specific condition for association analysis. The sample names in this table must precisely match those in your abundance data, and all values, except for the "sample_name" column, must be numeric. The optional 'Adjusted Table' is for user-defined variables that will be accounted for in linear or logistic regression analyses. These variables might include cancer types or clinical information like gender, age, or BMI. The sample names here also need to match those in your abundance data exactly, and the number of conditions must be two or more. Once you upload the data, it will be displayed on the right-hand side of the screen. The system will automatically check if the data conforms to the specified criteria and show the verification results below.


    A "V" symbolizes a pass, an "X" indicates an error to be fixed, and an "!" represents a noted warning yet does not impact running the analysis.
    If the data format is found to be incorrect, you can refer to the Data Check webpage for guidance on how to correct it.

  2. Data Processing
    In this step, you will see a range of data processing options, including 'Remove Features with Many Missing Values,' 'Missing Values Imputation,' 'Data Normalization,' and 'Data Transformation.' When you choose the 'Remove Features with Many Missing Values' option, you can set a threshold for the allowable percentage of missing values in your dataset; features exceeding this threshold will be removed. In the 'Missing Values Imputation' section, a dropdown list allows you to select a method for replacing missing values in your dataset. If you choose the 'minimum' option, the minimum value will be adjusted by a multiplier you input. Additionally, for data normalization and transformation, multiple methods are available for selection from a dropdown list, offering flexibility in processing your data.


    After setting all the options, click the "Processing" button to start processing.

  3. Data Summary and Data Quality
    Upon selecting 'Processing,' users will receive a data summary of the processing details for the uploaded data, the processed abundance data, the condition table, the adjusted table (if uploaded), and a converted lipid characteristic table based on the features in the uploaded abundance data. Additionally, data quality results are provided, offering insights into the differences between the pre- and post-processing states of the data. Users can explore these differences through box and density plots available for both states. Detailed information can be viewed by hovering the mouse over these plots. A 'Download PDF' button above the plots allows users to download static plots.


    Users can initiate their analysis after reviewing the processed data by clicking the 'Start' button.
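
The data processing options in step 2 can be illustrated with a minimal sketch. It is not LipidSig's implementation: the function names, the toy data, and the choice to take the minimum per feature (rather than over the whole table) are assumptions made for illustration only.

```python
def remove_sparse_features(abundance, max_missing_pct):
    """Drop features whose percentage of missing values (None) exceeds the threshold."""
    kept = {}
    for feature, values in abundance.items():
        missing_pct = 100.0 * sum(v is None for v in values) / len(values)
        if missing_pct <= max_missing_pct:
            kept[feature] = values
    return kept


def impute_minimum(abundance, multiplier):
    """Replace missing values with (feature minimum * multiplier).

    Assumption: the minimum is taken per feature, not over the whole table.
    """
    imputed = {}
    for feature, values in abundance.items():
        observed = [v for v in values if v is not None]
        if not observed:  # nothing to base the imputation on; leave untouched
            imputed[feature] = list(values)
            continue
        fill = min(observed) * multiplier
        imputed[feature] = [fill if v is None else v for v in values]
    return imputed


# Hypothetical toy data: feature name -> abundance per sample (None = missing).
raw = {
    "PC 34:1": [10.0, None, 8.0, 12.0],
    "PE 36:2": [None, None, None, 4.0],
    "TG 52:2": [5.0, 6.0, 7.0, 8.0],
}
filtered = remove_sparse_features(raw, max_missing_pct=70)
processed = impute_minimum(filtered, multiplier=0.5)
print(processed)
```

In this toy example, "PE 36:2" has 75% missing values and is removed, while the single missing value of "PC 34:1" is replaced by half of that feature's minimum.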

6.2 Usage of Correlation

The correlation analysis can be conducted by lipid species or by lipid category. This section is designed for continuous clinical data. Two correlation analyses are available: Correlation Coefficient and Linear Regression. A heatmap is shown once the correlation analysis is completed. Clustering options are also available: the distance measure can be Pearson, Spearman, Kendall, Euclidean, Maximum, Manhattan, Canberra, Binary, or Minkowski, and the clustering method can be median, average, single, complete, Ward.D, Ward.D2, WPGMA, or UPGMC. After submission, a heatmap depicting the pattern between lipid species/lipid characteristics and clinical features is displayed, together with a matrix that is ready for download. Please note that if the number of lipids or samples exceeds 50, the names of lipids/samples will not be shown on the heatmap.

6.2.1 Lipid Species Analysis

The following correlation analysis is conducted after the lipids are classified by lipid species.

6.2.1.1 Correlation

The Correlation Coefficient gives a summary view that tells researchers whether a relationship exists between clinical features and lipid species, how strong that relationship is, and whether it is positive or negative. Two types of correlation are provided, Pearson and Spearman, and the resulting p-values are adjusted by the Benjamini & Hochberg method. The cut-offs for the correlation coefficient and the p-value can be set by users. The rule of thumb recommended by Mukaka for interpreting the size of a correlation coefficient in medical research is provided below (11).
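
As an illustration of the Benjamini & Hochberg adjustment mentioned above, here is a minimal sketch in plain Python. The function name and the example p-values are hypothetical, and this is not necessarily how LipidSig computes the adjustment internally.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four lipid species against one clinical feature.
pvals = [0.01, 0.04, 0.03, 0.20]
padj = benjamini_hochberg(pvals)
print(padj)
```

Each raw p-value is multiplied by the number of tests divided by its rank, and a running minimum from the largest p-value downward keeps the adjusted values in the same order as the raw ones.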




A heatmap is shown after users enter the cut-offs and choose a value and a method for clustering. Users can cluster on either the correlation coefficient between clinical features (e.g., genes) and lipid species or the corresponding statistic. Note that only the variables that pass the user-defined cut-offs for p-value and correlation coefficient will be shown on the heatmap. The rows of the heatmap show clinical features, whereas the columns show the lipid species. Please note that if the number of lipids or samples exceeds 50, the names of lipids/samples will not be shown on the heatmap.

6.2.1.2 Linear Regression

Linear regression is a statistical technique that uses one or more explanatory variables to predict the outcome of a continuous response variable, allowing researchers to estimate the associations between lipid levels and clinical features. For multiple linear regression analysis, the additional variables in the 'adjusted table' are added to the model to adjust for confounding effects. Once the calculation completes, each lipid species is assigned a beta coefficient and a t statistic (p-value), either of which can be chosen for clustering.
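
To make the beta coefficient and t statistic concrete, the following sketch fits a univariate least-squares regression in plain Python. The function name and the toy numbers are hypothetical, and treating the lipid abundance as the explanatory variable and the clinical feature as the response is an assumption; adjustment for covariates from the 'adjusted table' is not shown.

```python
import math


def simple_linear_regression(x, y):
    """Fit y = alpha + beta * x by least squares; return (beta, t statistic)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    # Residual sum of squares and standard error of the slope (n - 2 df).
    rss = sum((yi - (alpha + beta * xi)) ** 2 for xi, yi in zip(x, y))
    se_beta = math.sqrt(rss / (n - 2) / sxx)
    return beta, beta / se_beta


# Hypothetical data: one lipid species (x) and one clinical feature (y).
lipid_abundance = [1.0, 2.0, 3.0, 4.0, 5.0]
clinical_feature = [2.1, 3.9, 6.2, 7.8, 10.1]
beta, t_stat = simple_linear_regression(lipid_abundance, clinical_feature)
print(beta, t_stat)
```

The p-value reported alongside the t statistic would come from a t distribution with n - 2 degrees of freedom, which is omitted here to keep the sketch free of external libraries.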


A heatmap is shown after submission. Note that only the variables that pass the user-defined cut-offs for p-value and correlation coefficient will be shown on the heatmap. The rows of the heatmap show clinical features, whereas the columns show the lipid species. Please note that if the number of lipids or samples exceeds 50, the names of lipids/samples will not be shown on the heatmap.

6.2.2 Lipid Characteristics Analysis

The following correlation analysis is conducted after the lipids are classified by the lipid characteristics from the user-uploaded 'Lipid characteristics' table.

6.2.2.1 Correlation

The Correlation Coefficient gives a summary view that tells researchers whether a relationship exists between clinical features and user-defined lipid characteristics, how strong that relationship is, and whether it is positive or negative. Users first choose one of the lipid characteristics and then one of two types of correlation, Pearson or Spearman. The resulting p-values are adjusted by the Benjamini & Hochberg method. The cut-offs for the correlation coefficient and the p-value can be set by users. The rule of thumb recommended by Mukaka for interpreting the size of a correlation coefficient in medical research is provided below (11).


For the heatmap settings, users enter the cut-offs and choose a value and a method for clustering. Users can cluster on either the correlation coefficient between clinical features and lipid characteristics or the corresponding statistic. Note that only the variables that pass the user-defined cut-offs for p-value and correlation coefficient will be shown on the heatmap. The rows of the heatmap show clinical features, whereas the columns show the lipid characteristics. Please note that if the number of lipids or samples exceeds 50, the names of lipids/samples will not be shown on the heatmap.

6.2.2.2 Linear Regression

Linear regression is a statistical technique that uses one or more explanatory variables to predict the outcome of a continuous response variable, allowing researchers to estimate the associations between lipid levels and clinical features. On this page, the lipids are classified and summed by the user-selected lipid characteristic (e.g., class), and then univariate or multivariate linear regression is performed, depending on whether an 'adjusted table' is provided. Each component of the selected characteristic is assigned a beta coefficient and a t statistic (p-value), either of which can be chosen for clustering.
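
The "classified and summed" step described above can be sketched as follows. The function name, lipid names, and values are hypothetical; the sketch only shows how per-sample abundances of species sharing a characteristic value (here, the class) could be added together before regression.

```python
def sum_by_characteristic(abundance, characteristic):
    """Sum per-sample abundances of all species sharing a characteristic value."""
    totals = {}
    for feature, values in abundance.items():
        group = characteristic[feature]
        if group not in totals:
            totals[group] = [0.0] * len(values)
        totals[group] = [t + v for t, v in zip(totals[group], values)]
    return totals


# Hypothetical abundance data (two samples) and class assignment.
abundance = {
    "PC 34:1": [1.0, 2.0],
    "PC 36:2": [3.0, 4.0],
    "PE 36:2": [5.0, 6.0],
}
lipid_class = {"PC 34:1": "PC", "PC 36:2": "PC", "PE 36:2": "PE"}
class_totals = sum_by_characteristic(abundance, lipid_class)
print(class_totals)
```

Each resulting row (one per class) then plays the role that a single lipid species plays in the species-level regression.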








7 Network

In the network analysis section, five distinct types of networks are provided.
Results from 'Differential Expression' are further analyzed using the 'GATOM network', which isolates significant subnetworks within a constructed atom-based network. The 'Pathway activity network' computes flux changes in the lipid reaction network, facilitating the identification of active or suppressed pathways. The 'Lipid reaction network' graphically represents significant lipid classes/species within lipid biosynthesis pathways.
For the lipid classes under investigation, the 'Reactome network' aggregates reactions pertinent to these classes as documented in Reactome. The 'Lipid-related gene enrichment' analysis identifies genes associated with the lipid classes of interest, followed by an enrichment analysis to discern significant pathways. The 'Reactome network' contains the interactions of multiple lipids and genes, summarized from the Reactome database, a curated database encompassing pathways and reactions in human biology (12). For the enrichment network, selected lipids are mapped to the corresponding genes, and the KEGG (13) and Reactome (14) databases are used for further enrichment analysis.
The following sections describe the operation and required input of each network.

Function introduction: the concept and expected results



Function demonstration: step-by-step tutorial



7.1 Pathway Activity Network

7.1.1 Upload your data

The first step of viewing this network is to upload your data. This network necessitates two specific input tables: one for lipid class abundance and another for group information. You can obtain both required tables by performing a 'Differential Expression Analysis' on our platform's Differential Expression webpage. A screenshot of the download page is provided below for your reference. We highly recommend using the results directly from our platform to minimize the risk of format errors.

If you opt to use your own data, please ensure that it meets all the listed requirements and successfully passes all the checkpoints indicated on the page. The detailed requirements for each table are listed below.
  1. Lipid class abundance table
    1. The required columns include 'feature', 'networkId' and the columns containing sample abundance values.
    2. All lipid names (features) are unique.
    3. The sample names in the 'sample_name' column are the same as the sample names in the lipid abundance data.
    4. The 'feature' (lipid class) column does not contain NA values.
    5. Except for the first column, 'feature' (lipid class), all the columns are numeric.

  2. Group information table
    1. The required columns include 'sample_name', 'label_name', 'group' and 'pair'.
      (The 'pair' column is only included for two-group analyses)
    2. All sample names are unique.
    3. The 'sample_name', 'label_name' and 'group' columns do not contain NA values.
    You can check the 'Group information table' format on the Data Check webpage.
You can use or download the provided example data if you need clarification on the data format.
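
For users preparing their own files, the group information table requirements listed above can be pre-checked locally with a sketch like the one below. It is not the platform's checker: the function name, the messages, and the convention that an empty cell or the string "NA" counts as missing are assumptions, and the optional 'pair' column is not checked.

```python
import csv
import io

# Columns that must be present and must not contain missing values.
REQUIRED = ("sample_name", "label_name", "group")


def check_group_table(csv_text):
    """Return a list of problems found in a group information table."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    header = reader.fieldnames or []
    problems = [f"missing column: {c}" for c in REQUIRED if c not in header]
    if problems:
        return problems
    names = [r["sample_name"] for r in rows]
    if len(set(names)) != len(names):
        problems.append("sample names are not unique")
    for col in REQUIRED:
        # Assumption: an empty cell or the literal "NA" counts as missing.
        if any((r[col] or "").strip() in ("", "NA") for r in rows):
            problems.append(f"NA values in column: {col}")
    return problems


# Hypothetical tables: one valid, one with a duplicate sample and an NA group.
good = "sample_name,label_name,group\ns1,ctrl_1,ctrl\ns2,exp_1,exp\n"
bad = "sample_name,label_name,group\ns1,ctrl_1,ctrl\ns1,exp_1,NA\n"
print(check_group_table(good))
print(check_group_table(bad))
```

An empty list means the table passed these three checks; the Data Check webpage remains the reference for the full set of requirements.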

7.1.2 Result network and tables

The network shows the active pathways between lipid classes. Nodes within the network represent distinct lipid classes, while lines represent pathways. A pathway value is calculated for each pathway and is reflected in the line's color (red for positive values and blue for negative values).
The ways to interact with the network are listed below, in the order shown in the screenshot.

  1. Manipulate the network by clicking on the toolbar at the bottom right and left.
  2. Hover on a specific node or line to view the corresponding information of a lipid or reaction.

Below the network, the table shows the information of the network, including the reaction chain, pathway score, and genes involved in the pathways. To better view the table, you can click on the column name to reorder it. For example, if you click on the column "pathway score," the values will be displayed in ascending order.

7.2 Lipid Reaction Network

7.2.1 Upload your data

The first step of viewing this network is to upload your data. This network necessitates two specific input tables: one for lipid species differential expression and another for lipid class differential expression. You can obtain both required tables by performing a 'Differential Expression Analysis' on our platform's Differential Expression webpage. A screenshot of the download page is provided below for your reference. We highly recommend using the results directly from our platform to minimize the risk of format errors.

If you opt to use your own data, please ensure that it meets all the listed requirements and successfully passes all the checkpoints indicated on the page. The detailed requirements for each table are listed below.
  1. Lipid species differential expression
    1. The required columns include 'feature', 'log2FC', 'pval', 'padj', 'class', 'Species.Name', 'Molecular.Species.Name', 'Structural.Species.Name', 'Level' and 'networkId'.
    2. The 'feature', 'log2FC', 'pval', 'padj', 'class', 'Species.Name' and 'Level' columns do not contain NA values.
    3. All lipid names (features) are unique.

  2. Lipid class differential expression
    1. The required columns include 'feature', 'networkId', 'log2FC', 'pval' and 'padj'.
    2. The 'feature', 'log2FC', 'pval' and 'padj' columns do not contain NA values.
    3. All lipid names (features) are unique.
You can use or download the provided example data if you need clarification on the data format.

7.2.2 Result network and tables

This network illustrates the important reactions of differentially expressed lipid classes and species, where square nodes represent lipid classes and round nodes denote individual lipid species linked to those classes. The color of each node is keyed to the log2 fold change value: for round nodes, red signifies positive values, and blue indicates negative ones; for square nodes, yellow marks positive values, purple denotes negative values, and grey signifies non-significant changes or lipid classes not included in the input data.
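
The color legend described above can be summarized as a small lookup function. This is only a reading aid: the function name and arguments are hypothetical, and how the platform treats a log2 fold change of exactly zero is not stated in this tutorial (here it falls into the negative branch).

```python
def node_color(shape, log2fc, significant=True):
    """Map a node to a color name following the legend described above.

    shape: "round" for lipid species, "square" for lipid classes.
    log2fc: log2 fold change, or None if the class is not in the input data.
    """
    if shape == "round":
        return "red" if log2fc > 0 else "blue"
    if shape == "square":
        if log2fc is None or not significant:
            return "grey"
        return "yellow" if log2fc > 0 else "purple"
    raise ValueError(f"unknown node shape: {shape}")


print(node_color("round", 1.2))
print(node_color("square", -0.8))
print(node_color("square", 0.5, significant=False))
```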
The ways to interact with the network are listed below, in the order shown in the screenshot.


  1. Manipulate the network by clicking on the toolbar at the bottom right and left.
  2. Hover on a specific node or line to view the corresponding information of a lipid or reaction.
Below the network, the two tables show the information of the network, including the p-value and fold change of lipid classes and species, the path, and the reaction. To better view a table, you can click on a column name to reorder it. For example, if you click on the column "Gene," the gene names will be arranged alphabetically from A to Z. If you click on the column "Log2FC," the values will be displayed in ascending order.

7.3 GATOM Network

7.3.1 Upload your data

The first step of viewing this network is to upload your data. This network necessitates a table of differential lipid species abundance. You can obtain the required table by performing a 'Differential Expression Analysis' on our platform's Differential Expression webpage. A screenshot of the download page is provided below for your reference. We highly recommend using the results directly from our platform to minimize the risk of format errors.

If you opt to use your own data, please ensure that it meets all the listed requirements and successfully passes all the checkpoints indicated on the page. The detailed requirements of the table are listed below.
  1. Lipid species differential expression
    1. The required columns include 'GATOM.abbr', 'log2FC', 'pval' and 'feature'.
    2. The 'GATOM.abbr', 'log2FC', 'pval' and 'feature' columns do not contain NA values.
    3. All lipid names (features) are unique.
You can use or download the provided example data if you need clarification on the data format.

7.3.2 Result network and tables

The network shows the important reactions of differentially expressed lipid species. These lipids, highlighted for their distinct abundance patterns, are integrated into a comprehensive metabolic network to form an experiment-specific framework. Within this network, each node represents a differentially expressed lipid, with color variations reflecting fold change values (red indicating positive values, blue indicating negative values), and the node size corresponds to the significance level represented by the p-value. The ways to interact with the network are listed below, in the order shown in the screenshot.


  1. Manipulate the network by clicking on the toolbar at the bottom right and left.
  2. Hover on a specific node or line to view the corresponding information of a lipid or reaction.
Below the network, the two tables show the information of the network, including the p-value and fold change of lipids, the path, and the reaction. To better view a table, you can click on a column name to reorder it. For example, if you click on the column "Gene," the gene names will be arranged alphabetically from A to Z. If you click on the column "p-value," the values will be displayed in descending order.

7.4 Reactome Network

This function helps users discover the metabolic networks between small molecules within Reactome network topologies and their interactions with proteins. Reactome currently covers comprehensive aspects of lipid metabolism, including lipid digestion, mobilization, and transport; fatty acid, triacylglycerol, and ketone body metabolism; peroxisomal lipid metabolism; phospholipid and sphingolipid metabolism; cholesterol biosynthesis; bile acid and bile salt metabolism; and steroid hormone biosynthesis. We integrate the reactions in these pathways and explore the metabolic flow and regulatory proteins of the input lipids. This information is used to build an interactive network containing different node types, such as Biochemical Reaction, Small Molecule, Protein, or Complex.
To obtain the Reactome network, users define the lipid classes by typing them in or selecting them from the pull-down menu. 'Number of the path' means that the N shortest paths will be searched between each pair of selected lipid classes. 'Maximum path length' sets the maximum length of these paths; paths exceeding this threshold are removed. Note that the more lipid classes are selected, or the larger the 'Number of the path', the longer it takes to search all paths and plot the network.
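
To illustrate how 'Number of the path' and 'Maximum path length' interact, the sketch below searches a small directed graph for the N shortest simple paths between two lipid classes. It is a breadth-first enumeration written for illustration, not the algorithm used by the platform; the function name, the reaction labels (R1 to R3), and the toy graph are hypothetical, and path length is assumed to be counted in edges.

```python
from collections import deque


def n_shortest_paths(graph, source, target, n_paths, max_length):
    """Return up to n_paths simple paths from source to target, shortest first.

    Path length is counted in edges; paths longer than max_length are dropped.
    """
    found = []
    queue = deque([[source]])
    # Breadth-first expansion yields paths in order of increasing length.
    while queue and len(found) < n_paths:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            found.append(path)
            continue
        if len(path) - 1 >= max_length:
            continue  # extending this path would exceed the length limit
        for neighbor in graph.get(node, ()):
            if neighbor not in path:  # keep paths simple (no repeated nodes)
                queue.append(path + [neighbor])
    return found


# Hypothetical network: lipid classes connected through reaction nodes.
graph = {
    "PE": ["R1", "R2"],
    "R1": ["PC"],
    "R2": ["PS"],
    "PS": ["R3"],
    "R3": ["PC"],
}
all_paths = n_shortest_paths(graph, "PE", "PC", n_paths=3, max_length=5)
short_only = n_shortest_paths(graph, "PE", "PC", n_paths=3, max_length=3)
print(all_paths)
print(short_only)
```

With a maximum length of 5 both routes from PE to PC are returned, while a maximum length of 3 keeps only the direct route through R1. The number of candidate paths grows quickly with the number of selected classes and with N, which is consistent with the longer run times noted above.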


After the input is submitted, the interactive Reactome network is shown on the left-hand side. Users can emphasise relevant molecules/pathways by clicking on the desired node and can adjust node positions. Alternatively, selecting an entry from the pull-down menu at the upper left of the plot highlights the chosen lipid class or a whole group (Biochemical Reaction, Complex, Protein, or Small Molecule).


In the table below, 'path' refers to the pathway between the selected lipid classes. 'Path rank' is the rank of the pathway linking the selected lipid classes (e.g. PE-PC), ranked by pathway length: the shorter the pathway, the higher the rank. The 'Node order' and 'Node name' are also shown in the columns.

7.5 Lipid-related gene enrichment

On this page, a multi-omics analysis is provided to associate the lipids with gene function; a similar method has been used to identify relevant pathways of altered metabolites (12). More specifically, the R packages graphite (15) and metaGraphite (16) are used to query the lipid-related genes for lipid classes in the KEGG and Reactome databases. These genes are directly connected to the lipid candidates through metabolic reactions, binding, etc., and are used for further enrichment analysis.
By summarising the pathways from the KEGG, Reactome, and GO databases, we present the function networks of the user-defined lipid-related gene sets. The mapped genes of the user-selected lipid classes are enriched and illustrated in a circular network diagram, which allows users to explore the significant functions and the relationship between lipid-related pathways and genes based on their lipid classes.
To discover the enriched pathways, users choose the lipid classes either by typing them in or by selecting them. KEGG, Reactome, and GO:BP/CC/MF are provided in this section as the databases for enrichment, which is tested with Fisher's exact test. The p-value cut-off can be changed by users. The 'Top N significant terms' option is based on the -log10(p-value) and presents the top N terms that are most relevant to the selected lipid classes. 'Genes similarity' is calculated as the number of genes shared by term1 and term2, divided by the number of genes in term1 or term2.
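
The enrichment test and the 'Genes similarity' measure can be sketched in plain Python as follows. Several points are assumptions rather than documented behaviour: the test is written as a one-sided (over-representation) Fisher's exact test, the background size and gene counts are invented, and, because the text above does not specify which term supplies the denominator, the similarity function returns both ratios.

```python
from math import comb


def enrichment_pvalue(overlap, lipid_genes, pathway_genes, background):
    """One-sided Fisher's exact test (upper tail of the hypergeometric distribution).

    overlap: lipid-related genes found in the pathway
    lipid_genes: number of lipid-related genes
    pathway_genes: number of genes in the pathway
    background: number of genes in the background set
    """
    total = comb(background, lipid_genes)
    upper = min(lipid_genes, pathway_genes)
    tail = sum(
        comb(pathway_genes, k) * comb(background - pathway_genes, lipid_genes - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


def gene_similarity(term1, term2):
    """Return shared-gene fractions relative to term1 and relative to term2."""
    set1, set2 = set(term1), set(term2)
    shared = len(set1 & set2)
    return shared / len(set1), shared / len(set2)


# Hypothetical numbers: 3 of 4 lipid-related genes fall in a 5-gene pathway,
# with a background of 20 genes.
p = enrichment_pvalue(overlap=3, lipid_genes=4, pathway_genes=5, background=20)
sim = gene_similarity({"geneA", "geneB", "geneC"}, {"geneB", "geneC", "geneD", "geneE"})
print(p, sim)
```

The p-value is the probability of seeing at least the observed overlap by chance; its -log10 is the quantity used to rank the 'Top N significant terms'.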


After submission, the network is shown on the left, with the node colour indicating the -log10(p-value) and the node size representing the number of genes in the term.


In the table below, the first column, 'pathway', is the pathway enriched by the genes that are related to the user-defined lipid classes. The 'Lipid gene number' and 'Pathway gene number' columns record the number of matched lipid-related genes and the total number of genes in the pathway, respectively. The 'Lipid gene symbol' column shows the matched lipid-related genes within that pathway, as provided by the databases above. '-Log10(p-value)' is the mean of the -log10(p-value) of those genes in 'Pathway gene number'.







8 ID conversion

The ID conversion function is designed to assign 29 lipid characteristics to the uploaded features. For comprehensive details on these 29 lipid characteristics, including their aspects, descriptions, and sources, please refer to the FAQ.

Function introduction: the concept and expected results



Function demonstration: step-by-step tutorial



The following steps describe how to use this function, as indicated in the screenshot below.


  1. Click the option of "Upload your data!".
  2. Choose your preferred method for uploading data. You can either type the features or upload a CSV file.
    Please upload your data following the guidelines below:
    • For typing in lipid species names: Input each lipid species on a separate line, pressing Enter to move to the following line.
    • For CSV file uploads: Ensure the file contains just one 'feature' column and list all lipid species names beneath it.
    For the lipid species naming format, please refer to the FAQ.
  3. Click the upload button.
  4. Once the input features are converted, the right-side window will show the uploaded, recognized, and unrecognized features.







9 Reference

(1). Salatzki J, Foryst-Ludwig A, Bentele K, Blumrich A, Smeir E, Ban Z, et al. Adipose tissue ATGL modifies the cardiac lipidome in pressure-overload-induced left ventricular failure. PLoS genetics. 2018;14(1):e1007171.
(2). Abdi H, Williams LJ. Principal component analysis. Wiley interdisciplinary reviews: computational statistics. 2010;2(4):433-59.
(3). Maaten Lvd, Hinton G. Visualizing data using t-SNE. Journal of machine learning research. 2008;9(Nov):2579-605.
(4). McInnes L, Healy J, Melville J. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426. 2018.
(5). Li H, Ning S, Ghandi M, Kryukov GV, Gopal S, Deik A, et al. The landscape of cancer cell line metabolism. Nature medicine. 2019;25(5):850.
(6). Xu Q-S, Liang Y-Z. Monte Carlo cross validation. Chemometrics and Intelligent Laboratory Systems. 2001;56(1):1-11.
(7). Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the royal statistical society: series B (statistical methodology). 2005;67(2):301-20.
(8). Davis J, Goadrich M, editors. The relationship between Precision-Recall and ROC curves. Proceedings of the 23rd international conference on Machine learning; 2006.
(9). Lundberg SM, Lee S-I, editors. A unified approach to interpreting model predictions. Advances in neural information processing systems; 2017.
(10). Bowler RP, Jacobson S, Cruickshank C, Hughes GJ, Siska C, Ory DS, et al. Plasma sphingolipids associated with chronic obstructive pulmonary disease phenotypes. American journal of respiratory and critical care medicine. 2015;191(3):275-84.
(11). Mukaka MM. A guide to appropriate use of correlation coefficient in medical research. Malawi medical journal. 2012;24(3):69-71.
(12). Vantaku V, Dong J, Ambati CR, Perera D, Donepudi SR, Amara CS, et al. Multi-omics Integration Analysis Robustly Predicts High-Grade Patient Survival and Identifies CPT1B Effect on Fatty Acid Metabolism in Bladder Cancer. Clinical cancer research : an official journal of the American Association for Cancer Research. 2019;25(12):3689-701.
(13). Kanehisa M, Furumichi M, Tanabe M, Sato Y, Morishima K. KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic acids research. 2017;45(D1):D353-D61.
(14). The Gene Ontology Consortium. The Gene Ontology resource: 20 years and still GOing strong. Nucleic acids research. 2019;47(D1):D330-D8.
(15). Sales G, Calura E, Cavalieri D, Romualdi C. graphite-a Bioconductor package to convert pathway topology to gene network. BMC bioinformatics. 2012;13(1):20.
(16). Sales G, Calura E, Romualdi C. metaGraphite-a new layer of pathway annotation to get metabolite networks. Bioinformatics. 2019;35(7):1258-60.