“Profiling” provides an overview of comprehensive analyses to efficiently examine data quality, the clustering of samples, the correlation between lipid species, and the composition of lipid characteristics.

All functions take a SummarizedExperiment object constructed by LipidSigR::as_summarized_experiment as input. For detailed instructions on constructing the SummarizedExperiment object, please read vignette("1_tool_function").

Cross-sample variability

Three types of distribution plots provide a simple view of sample variability, making it possible to compare the amount/abundance of lipids between samples (e.g., patients vs. controls).

# conduct profiling
result <- cross_sample_variability(se)

# result summary
After running the above code, you will obtain a list called result, containing interactive plots, static plots, and tables for three types of distribution plots. (Note: Only static plots are displayed here.)

# view result: histogram of lipid numbers

Histogram of lipid numbers

The histogram gives an overview of the number of lipid species present in each sample.

# view result: histogram of the total amount of lipid in each sample.

Histogram of lipid amount

The histogram describes the variability of the total lipid amount between samples.

# view result: density plot of the underlying probability distribution

Density plot of abundance distribution

The density plot shows the distribution of lipid abundance in each sample (one line per sample). This plot gives a deeper view of how the distributions differ between samples.
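The three summaries above can be sketched in base R on a toy abundance matrix. This is only an illustration of what each plot summarises, not the internals of cross_sample_variability; the matrix, the variable names, and the log10 scale used for the density are assumptions made for the example.

```r
# Toy abundance matrix (hypothetical data): rows are lipid species,
# columns are samples. NA marks a lipid not detected in that sample.
abund <- matrix(
    c(10, 20, 25,
       5, NA, 15,
       8, 12, 30,
      NA,  4,  6),
    nrow = 4, byrow = TRUE,
    dimnames = list(
        c("PC 34:1", "PE 36:2", "TAG 52:2", "Cer 42:1"),
        c("ctrl_1", "ctrl_2", "patient_1")))

# 1. Number of lipid species detected in each sample.
lipid_number <- colSums(!is.na(abund))

# 2. Total lipid amount in each sample.
lipid_amount <- colSums(abund, na.rm = TRUE)

# 3. Density estimate of the abundance values of each sample
#    (log10 scale is an assumption made for this sketch).
densities <- lapply(
    as.data.frame(abund),
    function(x) density(log10(x[!is.na(x)])))

print(lipid_number)
print(lipid_amount)
```

Each element of densities holds the x/y coordinates of one curve, so drawing them on shared axes gives one line per sample.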

Dimensionality reduction

Dimensionality reduction is commonly used in lipid analysis when dealing with large numbers of observations and/or variables. It transforms data from a high-dimensional space into a low-dimensional space that retains the vital properties of the original data and is close to its intrinsic dimension.

Here we provide three dimensionality reduction methods: PCA, t-SNE, and UMAP. The number of groups shown on the PCA, t-SNE, and UMAP plots can be defined by users (default: 2 groups).
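As a rough illustration of how samples can be split into a user-defined number of groups, the sketch below runs k-means from base R on toy two-dimensional coordinates. It assumes the groups come from a clustering step like the one the clustering and cluster_num arguments in this section suggest; the data and variable names are hypothetical.

```r
set.seed(1)
# Toy 2-D coordinates for six samples forming two well-separated groups.
coords <- rbind(
    matrix(rnorm(6, mean = 0, sd = 0.1), ncol = 2),
    matrix(rnorm(6, mean = 5, sd = 0.1), ncol = 2))
rownames(coords) <- paste0("sample_", 1:6)

# Number of groups to show (2 mirrors the default mentioned above).
cluster_num <- 2
km <- kmeans(coords, centers = cluster_num, nstart = 10)

# Cluster label assigned to each sample.
print(km$cluster)
```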


PCA (Principal component analysis) is an unsupervised linear dimensionality reduction and data visualization technique for high-dimensional data that tries to preserve the global structure of the data. Scaling (enabled by default) means that the variables are scaled to unit variance before the analysis takes place, which removes the bias towards variables with high variance. In general, scaling (standardization) is advisable when the variables in the original dataset have been measured on significantly different scales. Centering (enabled by default) subtracts the mean of each variable from its values, so that each variable has a mean of zero. This helps users avoid misleading results driven by the overall mean.

# conduct PCA
result_pca <- dr_pca(
    processed_se, scaling=TRUE, centering=TRUE, clustering='kmeans', 
    cluster_num=2, kmedoids_metric=NULL, distfun=NULL, hclustfun=NULL, 
    eps=NULL, minPts=NULL, feature_contrib_pc=c(1,2), plot_topN=10)

# result summary
After running the above code, you will obtain a list containing interactive plots, static plots, and tables for the PCA results. (Note: Only static plots are displayed here.)

# view result: PCA plot

PCA plot

# view result: scree plot of top 10 principal components

Scree plot

The scree plot is a common method for determining the number of PCs to be retained. The ‘elbow’ of the graph indicates that all components to the left of this point can explain most of the variability of the samples.

# view result: correlation circle plot of PCA variables

Correlation circle plot

The correlation circle plot shows the correlation between a feature (lipid species) and a principal component (PC), which is used as the coordinate of the variable on that PC (Abdi and Williams 2010). Positively correlated variables are in the same quadrants, while negatively correlated variables are on opposite sides of the plot origin. The closer a variable is to the edge of the circle, the better it is represented on the factor map.

# view result: bar plot of contribution of top 10 features

Bar plot of contribution of top 10 features

The plot displays the features (lipid species) that contribute most to the user-defined principal components.
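The quantities behind the scree plot, the correlation circle, and the contribution bar plot can be reproduced with prcomp from base R. This is a minimal sketch on random toy data, not the implementation of dr_pca; the contribution formula (squared coordinate divided by the column total) is a commonly used definition and is assumed here.

```r
set.seed(42)
# Toy data: 8 samples (rows) x 5 lipid features (columns).
x <- matrix(rnorm(40), nrow = 8,
            dimnames = list(paste0("s", 1:8), paste0("lipid_", 1:5)))
x[, 1] <- x[, 1] * 100   # one feature measured on a much larger scale

# Centering and scaling put all features on a comparable footing.
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Scree plot values: proportion of variance explained by each PC.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Correlation circle coordinates: correlation of each feature with each PC
# (loading multiplied by the PC standard deviation, valid for scaled data).
var_coord <- sweep(pca$rotation, 2, pca$sdev, `*`)

# Contribution (%) of each feature to each PC (assumed definition).
contrib <- sweep(var_coord^2, 2, colSums(var_coord^2), `/`) * 100

# Features contributing most to PC1.
print(head(sort(contrib[, "PC1"], decreasing = TRUE), 3))
```

Because the data are scaled, lipid_1 does not dominate PC1 despite its much larger raw variance; rerunning with scale. = FALSE shows the bias that scaling removes.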


t-SNE (t-Distributed Stochastic Neighbour Embedding) is an unsupervised non-linear dimensionality reduction technique that tries to retain the local structure (clusters) of the data when visualising high-dimensional datasets. The Rtsne package is used for the calculation, and PCA is applied as a pre-processing step. In t-SNE, perplexity and max_iter are adjustable by users. The perplexity may be considered a knob that sets the number of effective nearest neighbours, while max_iter is the maximum number of iterations to perform. The perplexity typically ranges between 5 and 50, but if the t-SNE plot shows a ‘ball’ with uniformly distributed points, you may need to lower your perplexity (Van der Maaten and Hinton 2008).

# conduct t-SNE
result_tsne <- dr_tsne(
    processed_se, pca=TRUE, perplexity=5, max_iter=500, clustering='kmeans',
    cluster_num=2, kmedoids_metric=NULL, distfun=NULL, hclustfun=NULL, 
    eps=NULL, minPts=NULL)
# result summary
# view result: t-SNE plot

t-SNE plot
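To make the idea of perplexity as an "effective number of nearest neighbours" concrete, the sketch below computes the perplexity of the Gaussian neighbour distribution around a single point, following the definition in Van der Maaten and Hinton (2008): 2 raised to the Shannon entropy of the distribution. The function name and toy distances are hypothetical, and this is not how Rtsne is called.

```r
# Perplexity of the neighbour distribution around one point for a given
# Gaussian bandwidth sigma. d2 holds squared distances to the other points.
perplexity_of <- function(d2, sigma) {
    p <- exp(-d2 / (2 * sigma^2))
    p <- p / sum(p)               # conditional neighbour probabilities
    p <- p[p > 0]                 # avoid 0 * log(0)
    2^(-sum(p * log2(p)))         # 2 ^ Shannon entropy
}

# Squared distances from a reference point to 10 others:
# 4 close neighbours and 6 distant ones.
d2 <- c(rep(1, 4), rep(100, 6))

narrow <- perplexity_of(d2, sigma = 1)     # only the 4 close points matter
wide   <- perplexity_of(d2, sigma = 1000)  # all 10 points weigh almost equally
print(c(narrow = narrow, wide = wide))
```

t-SNE works in the opposite direction: the user fixes the perplexity, and a bandwidth is searched for each point so that its neighbour distribution reaches that value.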


UMAP (Uniform Manifold Approximation and Projection) is a non-linear dimensionality reduction method based on manifold learning that effectively visualizes clusters or groups of data points and their relative proximities. Both t-SNE and UMAP are intended to predominantly preserve the local structure, that is, to group neighbouring data points, which delivers a very informative visualization of heterogeneity in the data. A significant difference from t-SNE is scalability, which allows UMAP to be applied without a pre-processing step (such as PCA). In addition, UMAP uses the graph Laplacian for its initialization, whereas t-SNE uses random initialization by default. Some people suggest that the key problem of t-SNE is its Kullback-Leibler (KL) divergence cost function, which in their view makes UMAP superior to t-SNE. Nevertheless, UMAP’s clusters may not be good enough for multi-class pattern classification (McInnes, Healy, and Melville 2018).

The distance metric used to find nearest neighbors and the size of the local neighborhood (the number of neighboring sample points) are set by the parameters umap_metric and n_neighbors, respectively. Larger values of n_neighbors lead to more global views of the manifold, while smaller values preserve more local structure. Values are generally set in the range of 2 to 100 (default: 15).

# conduct UMAP
result_umap <- dr_umap(
    processed_se, n_neighbors=15, scaling=TRUE, umap_metric='euclidean',
    clustering='kmeans', cluster_num=2, kmedoids_metric=NULL,
    distfun=NULL, hclustfun=NULL, eps=NULL, minPts=NULL)

# result summary
# view result: UMAP plot

UMAP plot
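The roles of the distance metric and the neighbourhood size can be illustrated with a plain k-nearest-neighbour search in base R. This sketch covers only the neighbour-finding idea that UMAP starts from, not the embedding itself; the helper nearest_neighbors and the toy data are hypothetical.

```r
set.seed(7)
# Toy data: 20 samples (rows) x 3 features (columns).
x <- matrix(rnorm(20 * 3), nrow = 20,
            dimnames = list(paste0("s", 1:20), NULL))

# Indices of the k nearest neighbours of every sample under a given metric.
nearest_neighbors <- function(x, k, metric = "euclidean") {
    d <- as.matrix(dist(x, method = metric))
    diag(d) <- Inf                       # a sample is not its own neighbour
    t(apply(d, 1, function(row) order(row)[seq_len(k)]))
}

# A small neighbourhood keeps only the closest samples (local view) ...
local_view <- nearest_neighbors(x, k = 2)
# ... while a large one reaches across most of the dataset (global view).
global_view <- nearest_neighbors(x, k = 15)

# Changing the metric can change which samples count as neighbours.
manhattan_view <- nearest_neighbors(x, k = 2, metric = "manhattan")
```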

Correlation heatmap

The correlation heatmap illustrates the correlation between samples or between lipid species and also depicts the patterns in each group. The correlation is calculated by the method defined by the parameter correlation, and the correlation coefficients are then clustered using the distance measure defined by the parameter distfun and the clustering method defined by the parameter hclustfun. Users can choose to output the sample correlation or lipid correlation results with the parameter type.

Please note that if the number of lipids or samples is over 50, the names of lipids/samples will not be shown on the heatmap.

Here, we use type='sample' as an example.

# correlation calculation
result_heatmap <- heatmap_correlation(
    processed_se, char=NULL, transform='log10', correlation='pearson', 
    distfun='maximum', hclustfun='average', type='sample')

# result summary          
# view result: sample-sample heatmap

Heatmap of sample to sample correlations

Correlations between samples are colored from strong positive correlations (red) to no correlation (white).
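The steps described for the correlation heatmap (transform, correlate, compute distances, cluster) can be sketched with base R functions. The toy data are hypothetical, and mapping the argument values shown above onto cor, dist, and hclust is an assumption about what they mean, not a description of how heatmap_correlation is implemented.

```r
set.seed(3)
# Toy abundance matrix: 6 lipid species (rows) x 5 samples (columns).
abund <- matrix(rlnorm(30, meanlog = 3), nrow = 6,
                dimnames = list(paste0("lipid_", 1:6), paste0("s", 1:5)))

log_abund <- log10(abund)                     # log10 transformation
corr <- cor(log_abund, method = "pearson")    # sample-to-sample correlation
d    <- dist(corr, method = "maximum")        # distance between samples
hc   <- hclust(d, method = "average")         # hierarchical clustering

# Reorder the correlation matrix the way a clustered heatmap would.
corr_ordered <- corr[hc$order, hc$order]
print(round(corr_ordered, 2))

# Optional base-R plot of the clustered matrix:
# heatmap(corr, Rowv = as.dendrogram(hc), symm = TRUE)
```

For lipid-to-lipid correlations, the same steps apply to the transposed matrix, i.e. cor(t(log_abund)).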

Lipid characteristics

You can view lipid expression over specific lipid characteristics. Lipids are first classified by a characteristic selected from the ‘Lipid characteristics’ table, and the results are shown in two plots.

Here, we use class as the char input for an example.

# calculate lipid expression of selected characteristic
result_lipid <- lipid_profiling(processed_se, char="class")
#> There are 4 ratio characteristics that can be converted in your dataset.

# result summary
# view result: bar plot

Bar plot classified by selected characteristic

The bar plot depicts the abundance level of each sample within each group (e.g., PE, PC) of the selected characteristic (e.g., class).

# view result: stacked horizontal bar chart

Lipid class composition

The stacked horizontal bar chart illustrates the percentage of characteristics in each sample. The variability of percentage between samples can also be obtained from this plot.
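The two views above, abundance per class and percentage composition per sample, can be sketched in base R. The toy data and variable names are hypothetical, and summing species within a class is an assumption made for illustration; lipid_profiling may aggregate differently.

```r
# Toy abundance matrix (hypothetical data): rows are lipid species,
# columns are samples.
abund <- matrix(
    c(10, 20,
      30, 30,
      40, 30,
      20, 20),
    nrow = 4, byrow = TRUE,
    dimnames = list(c("PC 34:1", "PC 36:2", "PE 36:2", "PE 38:4"),
                    c("ctrl_1", "patient_1")))

# Selected characteristic ("class") of each lipid species.
lipid_class <- c("PC", "PC", "PE", "PE")

# Abundance of each class in each sample (basis for the bar plot).
class_abund <- rowsum(abund, group = lipid_class)

# Percentage composition of each sample (basis for the stacked bar chart).
class_pct <- sweep(class_abund, 2, colSums(class_abund), `/`) * 100

print(class_abund)
print(class_pct)
```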

