ensemble_analysis

EnsembleAnalysis class

class dpet.ensemble_analysis.EnsembleAnalysis(ensembles: List[Ensemble], output_dir: str)

Bases: object

Data analysis pipeline for ensemble data.

Initializes with a list of ensemble objects and a directory path for storing data.

Parameters:
  • ensembles (List[Ensemble]) – List of ensembles.

  • output_dir (str) – Directory path for storing data.

comparison_scores(score: str, featurization_params: dict = {}, bootstrap_iters: int = None, bootstrap_frac: float = 1.0, bootstrap_replace: bool = True, bins: Union[int, str] = 50, random_seed: int = None, verbose: bool = False) → Tuple[ndarray, List[str]]

Compare all pairs of ensembles using divergence/distance scores. See dpet.comparison.all_vs_all_comparison for more information.
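
The bootstrap arguments in the signature above are not described in this section. As a rough illustration only, the sketch below shows one plausible reading of bootstrap_iters, bootstrap_frac, bootstrap_replace and random_seed: each iteration draws a fraction of the conformations, with or without replacement. The helper bootstrap_indices is hypothetical and is not part of dpet; the actual behavior is documented under dpet.comparison.all_vs_all_comparison.

```python
import random
from typing import List, Optional


def bootstrap_indices(
    n_frames: int,
    bootstrap_iters: int,
    bootstrap_frac: float = 1.0,
    bootstrap_replace: bool = True,
    random_seed: Optional[int] = None,
) -> List[List[int]]:
    """Hypothetical helper: draw one list of frame indices per bootstrap iteration."""
    rng = random.Random(random_seed)
    # Assumption: the fraction is applied to the number of conformations.
    sample_size = max(1, int(n_frames * bootstrap_frac))
    population = range(n_frames)
    draws = []
    for _ in range(bootstrap_iters):
        if bootstrap_replace:
            # With replacement: the same frame may be drawn more than once.
            draws.append(rng.choices(population, k=sample_size))
        else:
            # Without replacement: every frame is drawn at most once.
            draws.append(rng.sample(population, k=sample_size))
    return draws


draws = bootstrap_indices(
    n_frames=100,
    bootstrap_iters=3,
    bootstrap_frac=0.5,
    bootstrap_replace=False,
    random_seed=0,
)
print([len(d) for d in draws])
```

Fixing random_seed makes the draws reproducible, which is presumably why the method exposes it.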

property ens_codes: List[str]

Get the ensemble codes.

Returns:

A list of ensemble codes.

Return type:

List[str]

execute_pipeline(featurization_params: Dict, reduce_dim_params: Dict, subsample_size: int = None)

Execute the data analysis pipeline end-to-end. The pipeline includes:
  1. Download from database (optional)

  2. Generate trajectories

  3. Randomly sample a number of conformations from trajectories (optional)

  4. Perform feature extraction

  5. Perform dimensionality reduction

Parameters:
  • featurization_params (Dict) – Parameters for feature extraction. The only required parameter is “featurization”, which can be “phi_psi”, “ca_dist”, “a_angle”, “tr_omega” or “tr_phi”. Other method-specific parameters are optional.

  • reduce_dim_params (Dict) – Parameters for dimensionality reduction. The only required parameter is “method”, which can be “pca”, “tsne” or “kpca”.

  • subsample_size (int, optional) – Optional parameter that specifies the trajectory subsample size. Default is None.
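
As an illustration of the two parameter dictionaries described above, the following sketch builds them and checks the required keys against the values listed in this section. The helper check_pipeline_params is hypothetical and not part of dpet. Passing "min_sep", "max_sep" and "num_dim" through these dictionaries is an assumption based on the method-specific options documented for extract_features and reduce_features.

```python
from typing import Dict

# Allowed values as listed in the parameter descriptions of execute_pipeline.
FEATURIZATIONS = {"phi_psi", "ca_dist", "a_angle", "tr_omega", "tr_phi"}
REDUCTION_METHODS = {"pca", "tsne", "kpca"}


def check_pipeline_params(featurization_params: Dict, reduce_dim_params: Dict) -> None:
    """Hypothetical pre-flight check; not part of dpet."""
    featurization = featurization_params.get("featurization")
    if featurization not in FEATURIZATIONS:
        raise ValueError(f"Unsupported featurization: {featurization!r}")
    method = reduce_dim_params.get("method")
    if method not in REDUCTION_METHODS:
        raise ValueError(f"Unsupported reduction method: {method!r}")


# Required key plus optional method-specific settings (assumed to be passed through).
featurization_params = {"featurization": "ca_dist", "min_sep": 2, "max_sep": None}
reduce_dim_params = {"method": "pca", "num_dim": 10}
check_pipeline_params(featurization_params, reduce_dim_params)

# With an existing EnsembleAnalysis instance named `analysis` (assumed here):
# analysis.execute_pipeline(featurization_params, reduce_dim_params, subsample_size=200)
```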

exists_coarse_grained() → bool

Check if at least one of the loaded ensembles is coarse-grained after loading trajectories.

Returns:

True if at least one ensemble is coarse-grained, False otherwise.

Return type:

bool

extract_features(featurization: str, normalize: bool = False, *args, **kwargs) → Dict[str, ndarray]

Extract the selected feature.

Parameters:
  • featurization (str) – Choose between “phi_psi”, “ca_dist”, “a_angle”, “tr_omega”, “tr_phi”, “rmsd”.

  • normalize (bool, optional) – Whether to normalize the data. Only applicable to the “ca_dist” method. Default is False.

  • min_sep (int, optional) – Minimum sequence separation for “ca_dist”, “tr_omega”, and “tr_phi” methods. Default is 2.

  • max_sep (int or None, optional) – Maximum sequence separation for “ca_dist”, “tr_omega”, and “tr_phi” methods. Default is None.

Returns:

A dictionary where keys are ensemble IDs and values are the corresponding feature arrays.

Return type:

Dict[str, np.ndarray]
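
To make the effect of min_sep and max_sep concrete, the sketch below enumerates the residue pairs that a sequence-separation window would keep. The helper residue_pairs is hypothetical, and treating both bounds as inclusive is an assumption; the library's own pair selection may differ in detail.

```python
from typing import List, Optional, Tuple


def residue_pairs(
    n_residues: int, min_sep: int = 2, max_sep: Optional[int] = None
) -> List[Tuple[int, int]]:
    """Hypothetical helper: residue index pairs (i, j) whose separation j - i is in the window."""
    pairs = []
    for i in range(n_residues):
        for j in range(i + min_sep, n_residues):
            # Assumption: max_sep is an inclusive upper bound; None means no upper bound.
            if max_sep is not None and j - i > max_sep:
                break
            pairs.append((i, j))
    return pairs


print(residue_pairs(5))             # pairs at least 2 residues apart
print(residue_pairs(5, max_sep=3))  # same window, capped at 3 residues apart
```

With the default min_sep of 2, directly bonded neighbors (separation 1) are skipped, since their distances carry little conformational information.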

property features: Dict[str, ndarray]

Get the features associated with each ensemble.

Returns:

A dictionary where keys are ensemble IDs and values are the corresponding feature arrays.

Return type:

Dict[str, np.ndarray]

get_features(featurization: str, normalize: bool = False, *args, **kwargs) → Dict[str, ndarray]

Extract features for each ensemble without modifying any fields in the EnsembleAnalysis class.

Parameters:
  • featurization (str) – The type of featurization to be applied. Supported options are “phi_psi”, “tr_omega”, “tr_phi”, “ca_dist”, “a_angle”, “rg”, “prolateness”, “asphericity”, “sasa”, “end_to_end” and “flory_exponent”.

  • min_sep (int, optional) – Minimum sequence separation distance for “ca_dist”, “tr_omega”, and “tr_phi” methods. Default is 2.

  • max_sep (int or None, optional) – Maximum sequence separation distance for “ca_dist”, “tr_omega”, and “tr_phi” methods. Default is None.

  • normalize (bool, optional) – Whether to normalize the extracted features. Normalization is only supported when featurization is “ca_dist”. Default is False.

Returns:

A dictionary containing the extracted features for each ensemble, where the keys are ensemble IDs and the values are NumPy arrays containing the features.

Return type:

Dict[str, np.ndarray]

Raises:

ValueError – If featurization is not supported, or if normalization is requested for a featurization method other than “ca_dist”; if normalization is requested and features from ensembles have different sizes; or if coarse-grained models are used with featurization methods that require atomistic detail.
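
The size requirement for normalization can be illustrated with a small sketch. It assumes that normalization is a z-score computed from statistics pooled over all ensembles, which this section does not confirm; the helper normalize_pooled is hypothetical and uses plain lists instead of NumPy arrays to stay self-contained. Under that assumption, ensembles with different feature lengths cannot share one set of per-column means and standard deviations, hence the ValueError.

```python
from statistics import fmean, pstdev
from typing import Dict, List

Features = List[List[float]]  # one row per conformation, one column per feature


def normalize_pooled(features: Dict[str, Features]) -> Dict[str, Features]:
    """Hypothetical z-score normalization using statistics pooled over all ensembles."""
    widths = {len(row) for rows in features.values() for row in rows}
    if len(widths) != 1:
        raise ValueError("Ensembles have different feature sizes; cannot normalize jointly.")
    n_cols = widths.pop()
    pooled = [row for rows in features.values() for row in rows]
    means = [fmean(row[c] for row in pooled) for c in range(n_cols)]
    # Constant columns have zero spread; fall back to 1.0 to avoid dividing by zero.
    stds = [pstdev([row[c] for row in pooled]) or 1.0 for c in range(n_cols)]
    return {
        code: [[(row[c] - means[c]) / stds[c] for c in range(n_cols)] for row in rows]
        for code, rows in features.items()
    }


features = {
    "ens_a": [[1.0, 10.0], [3.0, 10.0]],
    "ens_b": [[5.0, 10.0]],
}
normalized = normalize_pooled(features)
print(normalized["ens_a"])
```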

get_features_summary_dataframe(selected_features: List[str] = ['rg', 'asphericity', 'prolateness', 'sasa', 'end_to_end', 'flory_exponent'], show_variability: bool = True) → DataFrame

Create a summary DataFrame for each ensemble.

The DataFrame includes the ensemble code and the average for each feature.

Parameters:
  • selected_features (List[str], optional) – List of feature extraction methods to be used for summarizing the ensembles. Default is [“rg”, “asphericity”, “prolateness”, “sasa”, “end_to_end”, “flory_exponent”].

  • show_variability (bool, optional) – If True, include a column with a measurement of variability for each feature (e.g., standard deviation or error).

Returns:

DataFrame containing the summary statistics (average and std) for each feature in each ensemble.

Return type:

pd.DataFrame

Raises:

ValueError – If any feature in the selected_features is not a supported feature extraction method.
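
The shape of such a summary can be sketched without pandas. The function summarize below is a hypothetical stand-in that returns one dictionary per ensemble instead of a DataFrame; the column names ("ensemble_code", "<feature>_mean", "<feature>_std") are assumptions, and the real method may label its columns differently or report an error estimate rather than a standard deviation for some features.

```python
from statistics import fmean, stdev
from typing import Dict, List

SUPPORTED = ["rg", "asphericity", "prolateness", "sasa", "end_to_end", "flory_exponent"]


def summarize(
    per_ensemble: Dict[str, Dict[str, List[float]]],
    selected_features: List[str],
    show_variability: bool = True,
) -> List[Dict[str, object]]:
    """Hypothetical stand-in: one summary row (as a dict) per ensemble."""
    unsupported = [f for f in selected_features if f not in SUPPORTED]
    if unsupported:
        raise ValueError(f"Unsupported features: {unsupported}")
    rows = []
    for code, values in per_ensemble.items():
        row: Dict[str, object] = {"ensemble_code": code}
        for feature in selected_features:
            row[f"{feature}_mean"] = fmean(values[feature])
            if show_variability:
                # Sample standard deviation over the conformations of this ensemble.
                row[f"{feature}_std"] = stdev(values[feature])
        rows.append(row)
    return rows


# Made-up per-conformation values, for illustration only.
toy_values = {
    "ens_a": {"rg": [1.0, 2.0, 3.0], "end_to_end": [4.0, 6.0]},
    "ens_b": {"rg": [2.0, 2.0, 2.0], "end_to_end": [5.0, 5.0]},
}
summary = summarize(toy_values, ["rg", "end_to_end"])
print(summary)
```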

load_trajectories() → Dict[str, Trajectory]

Load trajectories for all ensembles.

This method iterates over each ensemble in the ensembles list and downloads data files if they are not already available. Trajectories are then loaded for each ensemble.

Returns:

A dictionary where keys are ensemble IDs and values are the corresponding MDTraj trajectories.

Return type:

Dict[str, mdtraj.Trajectory]

Note

This method assumes that the output_dir attribute of the class specifies the directory where trajectory files will be saved or extracted.

random_sample_trajectories(sample_size: int)

Randomly sample a defined number of conformations from each ensemble trajectory.

Parameters:

sample_size (int) – Number of conformations sampled from the ensemble.
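
A minimal sketch of this kind of subsampling follows, assuming conformations are drawn without replacement. This section does not state how the sampling is done, so the helper sample_frame_indices and its seed argument are hypothetical.

```python
import random
from typing import List, Optional


def sample_frame_indices(
    n_frames: int, sample_size: int, seed: Optional[int] = None
) -> List[int]:
    """Hypothetical helper: pick `sample_size` distinct frame indices, sorted ascending."""
    if sample_size > n_frames:
        raise ValueError("sample_size exceeds the number of available conformations")
    rng = random.Random(seed)
    # random.Random.sample draws without replacement, so no frame is repeated.
    return sorted(rng.sample(range(n_frames), sample_size))


indices = sample_frame_indices(n_frames=1000, sample_size=5, seed=42)
print(indices)
```

The resulting indices could then be used to slice a trajectory, keeping only the selected conformations.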

property reduce_dim_data: Dict[str, ndarray]

Get the transformed data associated with each ensemble.

Returns:

A dictionary where keys are ensemble IDs and values are the corresponding dimensionality-reduced feature arrays.

Return type:

Dict[str, np.ndarray]

reduce_features(method: str, fit_on: List[str] = None, *args, **kwargs) → ndarray

Perform dimensionality reduction on the extracted features.

Parameters:
  • method (str) – Choose between “pca”, “tsne”, “kpca” and “umap”.

  • fit_on (List[str], optional) – If method is “pca” or “kpca”, specifies on which ensembles the models should be fit. The model will then be used to transform all ensembles.

Additional parameters:

The following optional parameters apply based on the selected reduction method.

  • pca –

    • num_dim (int, optional) – Number of components to keep. Default is 10.

  • tsne –

    • perplexity_vals (List[float], optional) – List of perplexity values. Default is range(2, 10, 2).

    • metric (str, optional) – Metric to use. Default is “euclidean”.

    • circular (bool, optional) – Whether to use circular metrics. Default is False.

    • n_components (int, optional) – Number of dimensions of the embedded space. Default is 2.

    • learning_rate (float, optional) – Learning rate. Default is 100.0.

    • range_n_clusters (List[int], optional) – Range of cluster values. Default is range(2, 10, 1).

  • kpca –

    • circular (bool, optional) – Whether to use circular metrics. Default is False.

    • num_dim (int, optional) – Number of components to keep. Default is 10.

    • gamma (float, optional) – Kernel coefficient. Default is None.
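
The method-specific options for reduce_features can be collected into keyword arguments as in the sketch below. The table of defaults is copied from the parameter list in this section; the helper build_reduce_kwargs is hypothetical, and the commented call assumes that fit_on takes ensemble codes.

```python
from typing import Any, Dict

# Defaults copied from the documented options; "umap" has no options listed here.
DOCUMENTED_DEFAULTS: Dict[str, Dict[str, Any]] = {
    "pca": {"num_dim": 10},
    "tsne": {
        "perplexity_vals": range(2, 10, 2),
        "metric": "euclidean",
        "circular": False,
        "n_components": 2,
        "learning_rate": 100.0,
        "range_n_clusters": range(2, 10, 1),
    },
    "kpca": {"circular": False, "num_dim": 10, "gamma": None},
}


def build_reduce_kwargs(method: str, **overrides: Any) -> Dict[str, Any]:
    """Hypothetical helper: merge overrides into the documented defaults for one method."""
    if method not in DOCUMENTED_DEFAULTS:
        raise ValueError(f"No documented options for method {method!r}")
    unknown = set(overrides) - set(DOCUMENTED_DEFAULTS[method])
    if unknown:
        raise ValueError(f"Options not documented for {method!r}: {sorted(unknown)}")
    return {**DOCUMENTED_DEFAULTS[method], **overrides}


kpca_kwargs = build_reduce_kwargs("kpca", circular=True, num_dim=5)
print(kpca_kwargs)

# With an existing EnsembleAnalysis instance named `analysis` (assumed here):
# analysis.reduce_features(method="kpca", fit_on=analysis.ens_codes[:1], **kpca_kwargs)
```

Rejecting undocumented keys early catches mistakes such as passing a t-SNE option like perplexity_vals to “pca”.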

Returns:

property trajectories: Dict[str, Trajectory]

Get the trajectories associated with each ensemble.

Returns:

A dictionary where keys are ensemble IDs and values are the corresponding MDTraj trajectories.

Return type:

Dict[str, mdtraj.Trajectory]