ensemble_analysis
EnsembleAnalysis class
- class dpet.ensemble_analysis.EnsembleAnalysis(ensembles: List[Ensemble], output_dir: str)
Bases: object
Data analysis pipeline for ensemble data.
Initializes with a list of ensemble objects and a directory path for storing data.
- Parameters:
ensembles (List[Ensemble]) – List of ensembles.
output_dir (str) – Directory path for storing data.
- comparison_scores(score: str, featurization_params: dict = {}, bootstrap_iters: int = None, bootstrap_frac: float = 1.0, bootstrap_replace: bool = True, bins: Union[int, str] = 50, random_seed: int = None, verbose: bool = False) → Tuple[ndarray, List[str]]
Compare all pairs of ensembles using divergence/distance scores. See dpet.comparison.all_vs_all_comparison for more information.
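The bootstrap-related arguments can be read as in the following minimal sketch. It is one plausible interpretation, not the dpet implementation: the helper bootstrap_indices is hypothetical, and it assumes that each iteration draws int(bootstrap_frac * n) conformations, with or without replacement depending on bootstrap_replace.

```python
import random
from typing import List, Optional


def bootstrap_indices(
    n_conformations: int,
    bootstrap_frac: float = 1.0,
    bootstrap_replace: bool = True,
    random_seed: Optional[int] = None,
) -> List[int]:
    """Hypothetical helper: draw one bootstrap resample of conformation indices."""
    rng = random.Random(random_seed)
    # Assumption: the resample size is a fraction of the ensemble size.
    size = int(bootstrap_frac * n_conformations)
    population = range(n_conformations)
    if bootstrap_replace:
        # With replacement: the same conformation may be drawn more than once.
        return rng.choices(population, k=size)
    # Without replacement: every drawn conformation is distinct.
    return rng.sample(population, k=size)
```

Passing a fixed random_seed makes the resample reproducible, which is the usual reason such a parameter is exposed.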
- property ens_codes: List[str]
Get the ensemble codes.
- Returns:
A list of ensemble codes.
- Return type:
List[str]
- execute_pipeline(featurization_params: Dict, reduce_dim_params: Dict, subsample_size: int = None)
- Execute the data analysis pipeline end-to-end. The pipeline includes:
Download from database (optional)
Generate trajectories
Randomly sample a number of conformations from trajectories (optional)
Perform feature extraction
Perform dimensionality reduction
- Parameters:
featurization_params (Dict) – Parameters for feature extraction. The only required parameter is “featurization”, which can be “phi_psi”, “ca_dist”, “a_angle”, “tr_omega” or “tr_phi”. Other method-specific parameters are optional.
reduce_dim_params (Dict) – Parameters for dimensionality reduction. The only required parameter is “method”, which can be “pca”, “tsne” or “kpca”.
subsample_size (int, optional) – Optional parameter that specifies the trajectory subsample size. Default is None.
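The pipeline steps correspond to methods documented on this page, so they can also be called one at a time. The sketch below is a hedged illustration: run_pipeline_manually is a hypothetical helper, and it assumes that the two parameter dictionaries are unpacked as keyword arguments to extract_features and reduce_features. It accepts any object exposing the documented methods.

```python
from typing import Any, Dict, Optional


def run_pipeline_manually(
    analysis: Any,
    featurization_params: Dict[str, Any],
    reduce_dim_params: Dict[str, Any],
    subsample_size: Optional[int] = None,
) -> Any:
    """Call the documented steps one by one on an EnsembleAnalysis-like object."""
    # Download data files if needed and load the trajectories.
    analysis.load_trajectories()
    # Optionally subsample conformations at random.
    if subsample_size is not None:
        analysis.random_sample_trajectories(sample_size=subsample_size)
    # Feature extraction. Assumption: the dict is unpacked as keyword arguments.
    analysis.extract_features(**featurization_params)
    # Dimensionality reduction, under the same assumption.
    return analysis.reduce_features(**reduce_dim_params)


# Parameter dictionaries that contain only the required keys.
featurization_params = {"featurization": "ca_dist"}
reduce_dim_params = {"method": "pca"}
```

Calling the steps separately can be useful when the extracted features need to be inspected before choosing a reduction method.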
- exists_coarse_grained() → bool
Check if at least one of the loaded ensembles is coarse-grained after loading trajectories.
- Returns:
True if at least one ensemble is coarse-grained, False otherwise.
- Return type:
bool
- extract_features(featurization: str, normalize: bool = False, *args, **kwargs) → Dict[str, ndarray]
Extract the selected feature.
- Parameters:
featurization (str) – Choose between “phi_psi”, “ca_dist”, “a_angle”, “tr_omega”, “tr_phi”, “rmsd”.
normalize (bool, optional) – Whether to normalize the data. Only applicable to the “ca_dist” method. Default is False.
min_sep (int, optional) – Minimum sequence separation distance for “ca_dist”, “tr_omega”, and “tr_phi” methods. Default is 2.
max_sep (int or None, optional) – Maximum sequence separation distance for “ca_dist”, “tr_omega”, and “tr_phi” methods. Default is None.
- Returns:
A dictionary where keys are ensemble IDs and values are the corresponding feature arrays.
- Return type:
Dict[str, np.ndarray]
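To make the effect of min_sep and max_sep concrete, the sketch below enumerates the residue pairs that a sequence-separation filter would keep. The helper residue_pairs is hypothetical, and treating both bounds as inclusive is an assumption; the library may apply the limits differently.

```python
from typing import List, Optional, Tuple


def residue_pairs(
    n_residues: int, min_sep: int = 2, max_sep: Optional[int] = None
) -> List[Tuple[int, int]]:
    """Hypothetical helper: residue index pairs (i, j) kept by a separation filter."""
    pairs = []
    for i in range(n_residues):
        # Start at the first residue that is at least min_sep positions away.
        for j in range(i + min_sep, n_residues):
            # Assumption: max_sep is inclusive; None means no upper limit.
            if max_sep is not None and j - i > max_sep:
                break
            pairs.append((i, j))
    return pairs


print(residue_pairs(5))                # [(0, 2), (0, 3), (0, 4), (1, 3), (1, 4), (2, 4)]
print(residue_pairs(5, 2, max_sep=2))  # [(0, 2), (1, 3), (2, 4)]
```

With the default min_sep of 2, pairs of sequence neighbours are skipped, so the number of pairwise features for N residues is smaller than N(N-1)/2.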
- property features: Dict[str, ndarray]
Get the features associated with each ensemble.
- Returns:
A dictionary where keys are ensemble IDs and values are the corresponding feature arrays.
- Return type:
Dict[str, np.ndarray]
- get_features(featurization: str, normalize: bool = False, *args, **kwargs) → Dict[str, ndarray]
Extract features for each ensemble without modifying any fields in the EnsembleAnalysis class.
- Parameters:
featurization (str) – The type of featurization to be applied. Supported options are “phi_psi”, “tr_omega”, “tr_phi”, “ca_dist”, “a_angle”, “rg”, “prolateness”, “asphericity”, “sasa”, “end_to_end” and “flory_exponent”.
min_sep (int, optional) – Minimum sequence separation distance for “ca_dist”, “tr_omega”, and “tr_phi” methods. Default is 2.
max_sep (int or None, optional) – Maximum sequence separation distance for “ca_dist”, “tr_omega”, and “tr_phi” methods. Default is None.
normalize (bool, optional) – Whether to normalize the extracted features. Normalization is only supported when featurization is “ca_dist”. Default is False.
- Returns:
A dictionary containing the extracted features for each ensemble, where the keys are ensemble IDs and the values are NumPy arrays containing the features.
- Return type:
Dict[str, np.ndarray]
- Raises:
ValueError – If featurization is not supported; if normalization is requested for a featurization method other than “ca_dist”; if normalization is requested and features from different ensembles have different sizes; or if coarse-grained models are used with featurization methods that require atomistic detail.
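The first two error conditions can be checked before calling the method. The following is a minimal sketch of such a pre-check, not the library's own validation: check_featurization_request is a hypothetical helper, and the option list is copied from the description above. The size-mismatch and coarse-grained conditions are omitted because they depend on the loaded data.

```python
SUPPORTED_FEATURIZATIONS = (
    "phi_psi", "tr_omega", "tr_phi", "ca_dist", "a_angle", "rg",
    "prolateness", "asphericity", "sasa", "end_to_end", "flory_exponent",
)


def check_featurization_request(featurization: str, normalize: bool = False) -> None:
    """Hypothetical pre-check mirroring two of the documented ValueError cases."""
    if featurization not in SUPPORTED_FEATURIZATIONS:
        raise ValueError(f"Unsupported featurization: {featurization!r}")
    # Normalization is documented as supported only for "ca_dist".
    if normalize and featurization != "ca_dist":
        raise ValueError("normalize=True is only supported for 'ca_dist'")


check_featurization_request("ca_dist", normalize=True)  # passes silently
```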
- get_features_summary_dataframe(selected_features: List[str] = ['rg', 'asphericity', 'prolateness', 'sasa', 'end_to_end', 'flory_exponent'], show_variability: bool = True) → DataFrame
Create a summary DataFrame for each ensemble.
The DataFrame includes the ensemble code and the average for each feature.
- Parameters:
selected_features (List[str], optional) – List of feature extraction methods to be used for summarizing the ensembles. Default is [“rg”, “asphericity”, “prolateness”, “sasa”, “end_to_end”, “flory_exponent”].
show_variability (bool, optional) – If True, include a column with a measurement of variability for each feature (e.g., standard deviation or error). Default is True.
- Returns:
DataFrame containing the summary statistics (average and std) for each feature in each ensemble.
- Return type:
pd.DataFrame
- Raises:
ValueError – If any feature in the selected_features is not a supported feature extraction method.
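The kind of per-ensemble summary described here can be sketched with the standard library alone. This is an illustration of the idea only: the helper summarize, the column names, and the toy input values are all hypothetical, and using the sample standard deviation is an assumption about how variability is measured.

```python
from statistics import mean, stdev
from typing import Dict, List


def summarize(
    per_ensemble: Dict[str, Dict[str, List[float]]], show_variability: bool = True
) -> List[Dict[str, object]]:
    """Hypothetical helper: one summary row per ensemble, one average per feature."""
    rows = []
    for code, features in per_ensemble.items():
        row: Dict[str, object] = {"ensemble_code": code}
        for name, values in features.items():
            row[f"{name}_mean"] = mean(values)
            if show_variability:
                # Assumption: variability is reported as the sample standard deviation.
                row[f"{name}_std"] = stdev(values)
        rows.append(row)
    return rows


# Toy per-conformation values, made up purely for illustration.
toy = {"ens_a": {"rg": [1.0, 2.0, 3.0]}, "ens_b": {"rg": [2.0, 2.0, 2.0]}}
rows = summarize(toy)
```

A list of row dictionaries like this can be passed directly to pd.DataFrame to obtain a table with one row per ensemble.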
- load_trajectories() → Dict[str, Trajectory]
Load trajectories for all ensembles.
This method iterates over each ensemble in the ensembles list and downloads data files if they are not already available. Trajectories are then loaded for each ensemble.
- Returns:
A dictionary where keys are ensemble IDs and values are the corresponding MDTraj trajectories.
- Return type:
Dict[str, mdtraj.Trajectory]
Note
This method assumes that the output_dir attribute of the class specifies the directory where trajectory files will be saved or extracted.
- random_sample_trajectories(sample_size: int)
Randomly sample a defined number of conformations from the ensemble trajectory.
- Parameters:
sample_size (int) – Number of conformations sampled from the ensemble.
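Conceptually, subsampling amounts to choosing a set of frame indices at random. The sketch below shows that idea with the standard library; sample_frame_indices is a hypothetical helper, and both sampling without replacement and keeping the selected frames in their original order are assumptions rather than documented behavior.

```python
import random
from typing import List, Optional


def sample_frame_indices(
    n_frames: int, sample_size: int, seed: Optional[int] = None
) -> List[int]:
    """Hypothetical helper: pick sample_size distinct frame indices at random."""
    rng = random.Random(seed)
    # random.sample draws without replacement and raises ValueError
    # if sample_size is larger than n_frames.
    chosen = rng.sample(range(n_frames), k=sample_size)
    # Assumption: selected frames keep their original trajectory order.
    return sorted(chosen)
```

The resulting index list could then be used to slice a trajectory, since MDTraj trajectories support indexing by frame.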
- property reduce_dim_data: Dict[str, ndarray]
Get the transformed data associated with each ensemble.
- Returns:
A dictionary where keys are ensemble IDs and values are the corresponding transformed (dimensionality-reduced) arrays.
- Return type:
Dict[str, np.ndarray]
- reduce_features(method: str, fit_on: List[str] = None, *args, **kwargs) → ndarray
Perform dimensionality reduction on the extracted features.
- Parameters:
method (str) – Choose between “pca”, “tsne”, “kpca” and “umap”. The optional method-specific parameters listed below apply depending on the selected method.
fit_on (List[str], optional) – If method is “pca” or “kpca”, specifies the ensembles on which the model is fit. The fitted model is then used to transform all ensembles.
- Additional parameters for “pca”:
num_dim (int, optional) – Number of components to keep. Default is 10.
- Additional parameters for “tsne”:
perplexity_vals (List[float], optional) – List of perplexity values. Default is range(2, 10, 2).
metric (str, optional) – Metric to use. Default is “euclidean”.
circular (bool, optional) – Whether to use circular metrics. Default is False.
n_components (int, optional) – Number of dimensions of the embedded space. Default is 2.
learning_rate (float, optional) – Learning rate. Default is 100.0.
range_n_clusters (List[int], optional) – Range of cluster values. Default is range(2, 10, 1).
- Additional parameters for “kpca”:
circular (bool, optional) – Whether to use circular metrics. Default is False.
num_dim (int, optional) – Number of components to keep. Default is 10.
gamma (float, optional) – Kernel coefficient. Default is None.
- Returns:
The transformed data.
- Return type:
np.ndarray
For more information on each method, see the corresponding documentation.
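The method-specific defaults listed for this method can be collected in one place when preparing keyword arguments. The sketch below only restates the defaults documented here; the helper resolve_reduce_kwargs is hypothetical, and no options are listed for “umap” because none are documented in this section.

```python
from typing import Any, Dict

# Defaults copied from the parameter descriptions of reduce_features.
DOCUMENTED_DEFAULTS: Dict[str, Dict[str, Any]] = {
    "pca": {"num_dim": 10},
    "tsne": {
        "perplexity_vals": range(2, 10, 2),
        "metric": "euclidean",
        "circular": False,
        "n_components": 2,
        "learning_rate": 100.0,
        "range_n_clusters": range(2, 10, 1),
    },
    "kpca": {"circular": False, "num_dim": 10, "gamma": None},
    "umap": {},  # no method-specific options are documented in this section
}


def resolve_reduce_kwargs(method: str, **overrides: Any) -> Dict[str, Any]:
    """Hypothetical helper: merge user overrides onto the documented defaults."""
    if method not in DOCUMENTED_DEFAULTS:
        raise ValueError(f"Unknown method: {method!r}")
    return {**DOCUMENTED_DEFAULTS[method], **overrides}


kpca_kwargs = resolve_reduce_kwargs("kpca", num_dim=3)
# kpca_kwargs == {"circular": False, "num_dim": 3, "gamma": None}
```

The merged dictionary could then be passed along with the method name, for example as reduce_features("kpca", **kpca_kwargs), assuming the keyword names above are accepted as documented.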
- property trajectories: Dict[str, Trajectory]
Get the trajectories associated with each ensemble.
- Returns:
A dictionary where keys are ensemble IDs and values are the corresponding MDTraj trajectories.
- Return type:
Dict[str, mdtraj.Trajectory]