Dimensionality Reduction Analysis
This package also includes an analysis of dimensionality reduction, employing various methods: t-SNE, PCA, KernelPCA, and UMAP. Each of these methods offers a different perspective on reducing data dimensions and is used to visualize and better understand the complex structures of proteins, as well as to compare the effectiveness of these methods.
t-SNE (t-Distributed Stochastic Neighbor Embedding): t-SNE is a tool to visualize high-dimensional data. It converts similarities between data points to joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data. t-SNE has a cost function that is not convex, i.e. with different initializations we can get different results.
PCA (Principal Component Analysis): Linear dimensionality reduction using Singular Value Decomposition of the data to project it to a lower dimensional space.
KernelPCA (Kernel Principal component analysis): Non-linear dimensionality reduction through the use of kernels and specifically useful for studying angularity (Periodicity) in data
UMAP (Uniform Manifold Approximation and Projection): (UMAP) is a dimension reduction technique that can be used for visualisation similarly to t-SNE, but also for general non-linear dimension reduction.
Initially, feature extraction is crucial because it captures essential aspects of protein structure that would otherwise be difficult to analyze directly in high-dimensional data. In IDPET, two groups of features can be analyzed: distance-based features and angular features.
Distance-based features in IDPET include the pairwise RMSD matrix between conformations within an ensemble and the Cα-Cα distance matrices.
For angular features, IDPET offers to analyze Phi (Φ) and Psi (Ψ) angles, t-Rosetta-style angles (omega and phi), and alpha angles.
In the next two parts of this demo we will see how we can use the dimensionality reduction modules of the IDPET:


