Dimensionality reduction based on distance-based features

Here, we demonstrate how to apply several dimensionality reduction methods to distance-based features and visualize the results in a low-dimensional space. As an example, we focus on three PED ensembles of the N-terminal SH3 domain of the Drk protein.

PED00156: This ensemble consists of conformations generated randomly and optimized through an iterative process.

PED00157: This ensemble includes conformations generated using the ENSEMBLE method, which creates a variety of realistic conformations of an unfolded protein.

PED00158: This ensemble combines conformations from the RANDOM and ENSEMBLE pools.

1- The first step is to extract the specific feature we wish to analyze.

analysis.extract_features(featurization='ca_dist', min_sep=2, max_sep=None)

The extract_features function from “ensamble_analysis.py” extracts the chosen feature from the protein ensembles under analysis.

The parameters we can set are:

“featurization”: Choose between “ca_dist” and “rmsd” as distance-based features.

“normalize”: Whether to normalize the data. Only applicable to the “ca_dist” method. Default is False.

“min_sep”: Minimum separation between the two residues of a pair for the “ca_dist” method. Default is 2.

“max_sep”: Maximum separation between the two residues of a pair for the “ca_dist” method. Default is None.
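To make the role of these parameters concrete, here is a minimal NumPy sketch of a Cα distance featurization. It is not the code behind extract_features: the function name ca_distance_features is hypothetical, and it assumes that min_sep and max_sep are sequence separations (j - i) and that max_sep is inclusive.

```python
import numpy as np

def ca_distance_features(coords, min_sep=2, max_sep=None):
    """Flatten pairwise distances into one feature vector per conformation.

    coords: array of shape (n_frames, n_residues, 3), e.g. C-alpha positions.
    Assumption of this sketch: a pair (i, j) is kept when
    min_sep <= j - i, and j - i <= max_sep if max_sep is given.
    """
    coords = np.asarray(coords, dtype=float)
    n_res = coords.shape[1]
    i, j = np.triu_indices(n_res, k=min_sep)  # pairs with j - i >= min_sep
    if max_sep is not None:
        keep = (j - i) <= max_sep
        i, j = i[keep], j[keep]
    dist = np.linalg.norm(coords[:, i, :] - coords[:, j, :], axis=-1)
    return dist, list(zip(i.tolist(), j.tolist()))

rng = np.random.default_rng(0)
toy_coords = rng.normal(size=(5, 6, 3))  # 5 conformations, 6 residues (toy data)
features, pairs = ca_distance_features(toy_coords, min_sep=2)
print(features.shape)  # (5, 10)
```

With min_sep=2, adjacent residues are skipped, since their Cα distance is nearly constant and carries little information about the conformation.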

After selecting the feature to extract, we can apply various dimensionality reduction methods using the reduce_features function. This function accepts many parameters, and which ones apply depends on the chosen reduction method.

A complete description of the hyperparameters of each dimensionality reduction method is provided in the methods’ overview section. Here we continue with the analysis of the SH3 PED ensembles based on distance-based features and visualize the results using t-SNE and PCA.

2- The second step is to choose the dimensionality reduction method and its hyperparameters.

analysis.reduce_features(method='tsne', perplexity_vals=[10, 20, 50, 100, 150, 200, 250], circular=False, range_n_clusters=range(2, 10, 1));
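One plausible way to read perplexity_vals and range_n_clusters is as a grid scan: embed the features with t-SNE at each perplexity, cluster each embedding with several cluster counts, and keep the combination that scores best. The sketch below illustrates that idea with scikit-learn, using KMeans and the silhouette score. This is an assumption made for illustration, not a description of what reduce_features does internally, and scan_tsne is a hypothetical helper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score

def scan_tsne(features, perplexity_vals, range_n_clusters, seed=0):
    """Return (score, perplexity, n_clusters) with the best silhouette score."""
    best = None
    for perplexity in perplexity_vals:
        # One 2D embedding per perplexity value.
        embedding = TSNE(n_components=2, perplexity=perplexity,
                         random_state=seed).fit_transform(features)
        for k in range_n_clusters:
            labels = KMeans(n_clusters=k, n_init=10,
                            random_state=seed).fit_predict(embedding)
            score = silhouette_score(embedding, labels)
            if best is None or score > best[0]:
                best = (float(score), perplexity, k)
    return best

rng = np.random.default_rng(0)
# Toy data: two separated groups of 30 "conformations" with 8 features each.
toy = np.vstack([rng.normal(0.0, 1.0, (30, 8)),
                 rng.normal(8.0, 1.0, (30, 8))])
score, perplexity, n_clusters = scan_tsne(toy, [5, 10, 15], range(2, 5))
print(perplexity, n_clusters, round(score, 3))
```

Note that scikit-learn requires the perplexity to be smaller than the number of samples, so large values such as 250 only make sense for ensembles with many conformations.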

Important point

Because we are analyzing distance features rather than angular ones, the circular parameter is set to False.
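The reason a circular setting matters for angles but not for distances can be shown with a small example. The two helper functions below are hypothetical and only illustrate the general idea of a periodic difference; they are not taken from the package.

```python
import numpy as np

def plain_diff(a, b):
    """Ordinary absolute difference, appropriate for non-periodic values."""
    return np.abs(a - b)

def circular_diff(a, b):
    """Smallest absolute difference between two angles given in radians."""
    d = np.abs(a - b) % (2 * np.pi)
    return np.minimum(d, 2 * np.pi - d)

# Two angles that are almost identical on the circle: 179 and -179 degrees.
a, b = np.deg2rad(179.0), np.deg2rad(-179.0)
plain_deg = np.rad2deg(plain_diff(a, b))        # about 358 degrees
circular_deg = np.rad2deg(circular_diff(a, b))  # about 2 degrees
print(round(float(plain_deg), 6), round(float(circular_deg), 6))
```

Distances do not wrap around, so the ordinary difference is already the right measure for them.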

3- The third step is the visualization of the results.

In this section, we first plot the results obtained from the t-SNE dimensionality reduction. We then apply PCA (Principal Component Analysis) and visualize its results.

a- t-SNE visualization

vis.dimensionality_reduction_scatter(color_by='rg', kde_by_ensemble=True, size=20, plotly=True);

b- PCA visualization

First, reduce the features using PCA:

analysis.reduce_features(method='pca', num_dim=10);

Several visualization options are then available:

vis.pca_cumulative_explained_variance();

Plot the cumulative explained variance. Only applicable when the dimensionality reduction method is “pca”.
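For reference, the quantity shown in such a plot can be computed by hand. The following is a minimal NumPy sketch on synthetic data, using the singular values of the centered feature matrix. The function name is hypothetical and the code is independent of the package.

```python
import numpy as np

def cumulative_explained_variance(features, num_dim):
    """Cumulative fraction of total variance captured by the first num_dim PCs."""
    centered = features - features.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variance = singular_values ** 2          # proportional to PC variances
    ratio = variance / variance.sum()
    return np.cumsum(ratio[:num_dim])

rng = np.random.default_rng(0)
# Toy features whose columns have very different spreads.
toy = rng.normal(size=(50, 6)) * np.array([5.0, 3.0, 1.0, 0.5, 0.2, 0.1])
cum_var = cumulative_explained_variance(toy, num_dim=3)
print(np.round(cum_var, 3))
```

The curve is non-decreasing and reaches 1 once all components are included, which helps when choosing a value for num_dim.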

vis.pca_2d_landscapes()

Plot 2D landscapes when the dimensionality reduction method is “pca” or “kpca”.

vis.pca_1d_histograms()

Plot 1D histograms when the dimensionality reduction method is “pca” or “kpca”.

vis.pca_rg_correlation()

Examine and plot the correlation between the first principal component and the radius of gyration (Rg). For IDPs/IDRs, a high correlation is typically observed.
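The following sketch shows why such a correlation can arise. It builds a deliberately idealized toy ensemble in which every conformation is the same chain scaled by a different expansion factor, computes Rg and the first principal component of the pairwise distances with NumPy, and correlates the two. All names and data here are illustrative assumptions; real ensembles will give a weaker correlation.

```python
import numpy as np

def radius_of_gyration(coords):
    """Rg per conformation for equally weighted points, shape (n_frames, n_atoms, 3)."""
    centered = coords - coords.mean(axis=1, keepdims=True)
    return np.sqrt((centered ** 2).sum(axis=2).mean(axis=1))

def first_pc_projection(features):
    """Project centered features onto the first principal component."""
    centered = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[0]

rng = np.random.default_rng(0)
base = np.cumsum(rng.normal(size=(20, 3)), axis=0)   # one random-walk chain
scale = rng.uniform(0.5, 2.0, size=100)              # per-conformation expansion
coords = scale[:, None, None] * base[None, :, :]

i, j = np.triu_indices(20, k=2)                      # residue pairs with j - i >= 2
dists = np.linalg.norm(coords[:, i] - coords[:, j], axis=-1)

rg = radius_of_gyration(coords)
pc1 = first_pc_projection(dists)
r = np.corrcoef(pc1, rg)[0, 1]
print(round(abs(float(r)), 6))
```

The sign of a principal component is arbitrary, so only the magnitude of the correlation is meaningful.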

sel_dims = [0, 1, 2]  # Principal component dimensions we want to analyze
vis.pca_residue_correlation(sel_dims=sel_dims)

Plot the correlation between residues based on PCA weights.
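One way to picture this kind of plot is to map the weights of a principal component, which has one weight per residue pair, back onto a symmetric residue-by-residue matrix. The sketch below does this with NumPy. It assumes the weights follow the pair order of np.triu_indices with the same min_sep used for featurization, which is an assumption of this example and not a statement about the package's internals. The function name is hypothetical.

```python
import numpy as np

def weights_to_residue_matrix(weights, n_residues, min_sep=2):
    """Place one weight per residue pair into a symmetric (n_residues, n_residues) matrix."""
    i, j = np.triu_indices(n_residues, k=min_sep)
    weights = np.asarray(weights, dtype=float)
    if weights.shape[0] != i.shape[0]:
        raise ValueError("expected one weight per residue pair")
    matrix = np.zeros((n_residues, n_residues))
    matrix[i, j] = weights
    matrix[j, i] = weights   # mirror to the lower triangle
    return matrix

# Toy weights for 5 residues and min_sep=2, which gives 6 residue pairs.
toy_weights = np.arange(1, 7)
residue_matrix = weights_to_residue_matrix(toy_weights, n_residues=5)
print(residue_matrix)
```

Pairs closer than min_sep stay at zero, so the diagonal band of the matrix is empty.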