ensemble_comparison
- dpet.comparison.all_vs_all_comparison(ensembles: List[Ensemble], score: str, featurization_params: dict = {}, bootstrap_iters: int = None, bootstrap_frac: float = 1.0, bootstrap_replace: bool = True, bins: Union[int, str] = 50, random_seed: int = None, verbose: bool = False) dict
Compare all pairs of ensembles using divergence scores. The implemented scores are approximate average Jensen–Shannon divergences (JSD) over several kinds of molecular features. The lower the score, the more similar the probability distributions of the features of the two ensembles. JSD scores here range from a minimum of 0 to a maximum of log(2) ~= 0.6931.
- Parameters:
ensembles (List[Ensemble]) – Ensemble objects to analyze.
score (str) – Type of score used to compare ensembles. Choices: adaJSD (Carbon Alpha Distance Average JSD), ramaJSD (Ramachandran average JSD) and ataJSD (Alpha Torsion Average JSD). adaJSD scores the average JSD over all Ca-Ca distance distributions of residue pairs with sequence separation > 1. ramaJSD scores the average JSD over the phi-psi angle distributions of all residues. ataJSD scores the average JSD over all alpha torsion angles, which are the angles formed by four consecutive Ca atoms in a protein.
featurization_params (dict, optional) – Optional dictionary to customize the featurization process for the above features.
bootstrap_iters (int, optional) – Number of bootstrap iterations. By default its value is None: in this case, IDPET will directly compare each pair of ensembles $i$ and $j$ using all of their conformers and perform the comparison only once. If an integer value is provided, each pair of ensembles $i$ and $j$ will instead be compared bootstrap_iters times by randomly selecting (bootstrapping) conformations from them. Additionally, each ensemble will be compared with itself by subsampling conformers via bootstrapping. IDPET will then perform a statistical test to establish whether the inter-ensemble ($i != j$) scores are significantly different from the intra-ensemble ($i == j$) scores. The test works as follows: for each ensemble pair $i != j$, IDPET collects the inter-ensemble comparison scores obtained in bootstrapping. It then collects the bootstrapping scores from the self-comparisons of ensembles $i$ and $j$, and the scores with the higher mean are selected as the reference intra-ensemble scores. Finally, the inter-ensemble and intra-ensemble scores are compared via a one-sided Mann-Whitney U test with the alternative hypothesis that the inter-ensemble scores are stochastically greater than the intra-ensemble scores. The p-values obtained in these tests are also returned. For small protein structural ensembles (fewer than 500 conformations), most comparison scores in IDPET are not robust estimators of divergence/distance. By performing bootstrapping, you can get an idea of how the size of your ensembles impacts the comparison. Use values >= 50 when comparing ensembles with very few conformations (fewer than 100). When comparing large ensembles (more than 1,000-5,000 conformations), you can safely avoid bootstrapping.
bootstrap_frac (float, optional) – Fraction of the total conformations to sample when bootstrapping. Default value is 1.0, which results in bootstrap samples with the same number of conformations of the original ensemble.
bootstrap_replace (bool, optional) – If True, bootstrap will sample with replacement. Default is True.
bins (Union[int, str], optional) – Number of bins or bin assignment rule for JSD comparisons. See the documentation of dpet.comparison.get_num_comparison_bins for more information.
random_seed (int, optional) – Random seed used when performing bootstrapping.
verbose (bool, optional) – If True, some information about the comparisons will be printed to stdout.
- Returns:
results –
- A dictionary containing the following key-value pairs:
- scores: a (M, M, B) NumPy array storing the comparison scores, where M is the number of ensembles being compared and B is the number of bootstrap iterations (B will be 1 if bootstrapping was not performed).
- p_values: a (M, M) NumPy array storing the p-values obtained in the statistical test performed when using a bootstrapping strategy (see the bootstrap_iters argument). Returned only when performing a bootstrapping strategy.
- Return type:
dict
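The bootstrap significance test described under bootstrap_iters can be sketched with a self-contained toy example. Everything below (the mann_whitney_greater helper, the normal-approximation p-value, and the synthetic score arrays) is a hypothetical illustration of the test's logic, not IDPET code; IDPET's own implementation may handle ties and small samples differently.

```python
import numpy as np
from math import erf, sqrt

def mann_whitney_greater(inter, intra):
    """One-sided Mann-Whitney U test (normal approximation, no tie
    correction). H1: `inter` scores are stochastically greater than
    `intra` scores. Illustrative sketch only."""
    inter, intra = np.asarray(inter), np.asarray(intra)
    n1, n2 = len(inter), len(intra)
    # Ranks of the inter-ensemble scores within the pooled sample.
    pooled = np.concatenate([inter, intra])
    ranks = pooled.argsort().argsort() + 1.0  # valid when there are no ties
    u1 = ranks[:n1].sum() - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mu - 0.5) / sigma  # continuity correction
    p = 0.5 * (1 - erf(z / sqrt(2)))  # survival function of N(0, 1)
    return u1, p

rng = np.random.default_rng(0)
intra = rng.normal(0.05, 0.01, 50)  # toy intra-ensemble (self-comparison) scores
inter = rng.normal(0.20, 0.01, 50)  # toy inter-ensemble scores, clearly larger
_, p_value = mann_whitney_greater(inter, intra)
print(p_value < 0.05)  # True: inter scores are significantly greater
```

With well-separated inter- and intra-ensemble score distributions, as here, the p-value falls far below any common significance threshold.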
- dpet.comparison.calc_freqs(x, bins)
- dpet.comparison.calc_jsd(p_h, q_h)
Calculates JSD between distribution p and q. p_h: histogram frequencies for sample p. q_h: histogram frequencies for sample q.
- dpet.comparison.calc_kld_for_jsd(x_h, m_h)
Calculates KLD between distribution x and m. x_h: histogram frequencies for sample p or q. m_h: histogram frequencies for m = 0.5*(p+q).
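The JSD/KLD relationship used by calc_jsd and calc_kld_for_jsd can be sketched as follows. This is a minimal re-derivation of the math only (the jsd_from_freqs helper is hypothetical); the actual IDPET functions have the signatures documented above.

```python
import numpy as np

def jsd_from_freqs(p_h, q_h):
    """JSD from two normalized histogram frequency arrays, computed as
    the average of the two KLD terms against the mixture
    m = 0.5 * (p + q). Illustrative sketch only."""
    p_h = np.asarray(p_h, dtype=float)
    q_h = np.asarray(q_h, dtype=float)
    m_h = 0.5 * (p_h + q_h)
    def kld(x_h):
        mask = x_h > 0  # 0 * log(0) terms contribute nothing
        return np.sum(x_h[mask] * np.log(x_h[mask] / m_h[mask]))
    return 0.5 * kld(p_h) + 0.5 * kld(q_h)

same = jsd_from_freqs([0.5, 0.5], [0.5, 0.5])
disjoint = jsd_from_freqs([1.0, 0.0], [0.0, 1.0])
print(same)      # 0.0: identical distributions
print(disjoint)  # log(2) ~= 0.6931: no common support
```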
- dpet.comparison.check_feature_matrices(func)
- dpet.comparison.confidence_interval(theta_boot, theta_hat=None, confidence_level=0.95, method='percentile')
Returns bootstrap confidence intervals. Adapted from: https://github.com/scipy/scipy/blob/v1.14.0/scipy/stats/_resampling.py
- dpet.comparison.get_adaJSD_matrix(ens_1: Union[Ensemble, Trajectory], ens_2: Union[Ensemble, Trajectory], bins: Union[str, int] = 'auto', return_bins: bool = False, featurization_params: dict = {}, *args, **kwargs)
Utility function to calculate the adaJSD score between two ensembles and return a matrix with the JSD scores of the individual Ca-Ca distances.
- Parameters:
ens_1 (Union[Ensemble, mdtraj.Trajectory]) – First of the two ensembles to compare.
ens_2 (Union[Ensemble, mdtraj.Trajectory]) – Second of the two ensembles to compare.
remaining arguments – See dpet.comparison.score_adaJSD for more information.
- Returns:
If return_bins is False, returns a tuple containing the adaJSD score and an (N, N) NumPy array (where N is the number of residues of the protein in the ensembles being compared) containing the JSD scores of individual Ca-Ca distances. If return_bins is True, the number of bins used in the comparisons is also returned.
- dpet.comparison.get_ataJSD_profile(ens_1: Union[Ensemble, Trajectory], ens_2: Union[Ensemble, Trajectory], bins: Union[str, int], return_bins: bool = False, *args, **kwargs)
Utility function to calculate the ataJSD score between two ensembles and return a profile with JSD scores for each alpha angle in the proteins.
- Parameters:
ens_1 (Union[Ensemble, mdtraj.Trajectory]) – First of the two ensembles to compare.
ens_2 (Union[Ensemble, mdtraj.Trajectory]) – Second of the two ensembles to compare.
remaining arguments – See dpet.comparison.score_ataJSD for more information.
- Returns:
If return_bins is False, returns a tuple containing the ataJSD score and an (N-3, ) NumPy array (where N is the number of residues of the protein in the ensembles being compared) containing the JSD scores of individual alpha angles. If return_bins is True, the number of bins used in the comparisons is also returned.
- dpet.comparison.get_num_comparison_bins(bins: Union[str, int], x: List[ndarray] = None)
Get the number of bins to be used in a comparison between two ensembles using a histogram-based score (such as a JSD approximation).
- Parameters:
bins (Union[str, int]) –
Determines the number of bins to be used. When providing an int, the same value will simply be returned. When providing a string, the following rules will be applied to determine the bin value:
- auto: applies the square root rule if the size of the smallest ensemble is < dpet.comparison.min_samples_auto_hist; if it is >= this value, returns dpet.comparison.num_default_bins.
- sqrt: applies the square root rule for determining the bin number using the size of the smallest ensemble (https://en.wikipedia.org/wiki/Histogram#Square-root_choice).
- sturges: applies Sturges' formula for determining the bin number using the size of the smallest ensemble (https://en.wikipedia.org/wiki/Histogram#Sturges’s_formula).
x (List[np.ndarray], optional) – List of M feature matrices (one for each ensemble) of shape (N_i, *). The N_i values are the numbers of structures in each ensemble. The minimum N_i will be used to apply the bin assignment rule when the bins argument is a string.
- Returns:
num_bins – Number of bins.
- Return type:
int
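The bin assignment rules above can be sketched in a few lines. This is a hypothetical re-implementation for illustration: the MIN_SAMPLES_AUTO_HIST and NUM_DEFAULT_BINS values stand in for the module-level constants dpet.comparison.min_samples_auto_hist and dpet.comparison.num_default_bins, whose actual values are assumptions here, and the real sqrt_rule/sturges_rule may differ in rounding details.

```python
import math

MIN_SAMPLES_AUTO_HIST = 500  # assumed threshold for the "auto" rule
NUM_DEFAULT_BINS = 50        # assumed default bin count

def sqrt_rule(n):
    # Square-root choice: bins = ceil(sqrt(n)).
    return int(math.ceil(math.sqrt(n)))

def sturges_rule(n):
    # Sturges' formula: bins = ceil(log2(n)) + 1.
    return int(math.ceil(math.log2(n))) + 1

def num_comparison_bins(bins, n_min):
    """Mimics the documented behavior: ints pass through, strings
    select a rule applied to the size of the smallest ensemble."""
    if isinstance(bins, int):
        return bins
    if bins == "sqrt":
        return sqrt_rule(n_min)
    if bins == "sturges":
        return sturges_rule(n_min)
    if bins == "auto":
        return sqrt_rule(n_min) if n_min < MIN_SAMPLES_AUTO_HIST else NUM_DEFAULT_BINS
    raise ValueError(f"unknown bin rule: {bins}")

print(num_comparison_bins("sqrt", 100))     # 10
print(num_comparison_bins("sturges", 128))  # 8
print(num_comparison_bins(33, 5))           # 33: ints pass through unchanged
```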
- dpet.comparison.get_ramaJSD_profile(ens_1: Union[Ensemble, Trajectory], ens_2: Union[Ensemble, Trajectory], bins: Union[str, int], return_bins: bool = False, *args, **kwargs)
Utility function to calculate the ramaJSD score between two ensembles and return a profile with the JSD scores for the Ramachandran distributions of each pair of corresponding residues in the proteins.
- Parameters:
ens_1 (Union[Ensemble, mdtraj.Trajectory]) – First of the two ensembles to compare.
ens_2 (Union[Ensemble, mdtraj.Trajectory]) – Second of the two ensembles to compare.
remaining arguments – See dpet.comparison.score_ramaJSD for more information.
- Returns:
If return_bins is False, returns a tuple containing the ramaJSD score and an (N-2, ) NumPy array (where N is the number of residues of the protein in the ensembles being compared) containing the JSD scores of individual residues. If return_bins is True, the number of bins used in the comparisons is also returned.
- dpet.comparison.percentile_func(a, q)
- dpet.comparison.process_all_vs_all_output(comparison_out: dict, confidence_level: float = 0.95)
Takes as input a dictionary produced as output of the all_vs_all_comparison function. If a bootstrap analysis was performed in all_vs_all_comparison, this function will assign bootstrap confidence intervals.
- dpet.comparison.score_adaJSD(ens_1: Union[Ensemble, Trajectory], ens_2: Union[Ensemble, Trajectory], bins: Union[str, int] = 'auto', return_bins: bool = False, return_scores: bool = False, featurization_params: dict = {}, *args, **kwargs)
Utility function to calculate the adaJSD (Carbon Alpha Distance Average JSD) score between two ensembles. The score evaluates the divergence between the distributions of Ca-Ca distances of the ensembles.
- Parameters:
ens_1 (Union[Ensemble, mdtraj.Trajectory]) – First of the two Ensemble or mdtraj.Trajectory objects storing the ensemble data to compare.
ens_2 (Union[Ensemble, mdtraj.Trajectory]) – Second of the two Ensemble or mdtraj.Trajectory objects storing the ensemble data to compare.
bins (Union[str, int], optional) – Determines the number of bins to be used when constructing histograms. See dpet.comparison.get_num_comparison_bins for more information.
return_bins (bool, optional) – If True, returns the number of bins used in the calculation.
return_scores (bool, optional) – If True, returns a tuple (avg_score, all_scores), where all_scores is an array with all the F scores (one for each feature) used to compute the average score.
featurization_params (dict, optional) – Optional dictionary to customize the featurization process to calculate Ca-Ca distances. See the Ensemble.get_features function for more information.
remaining arguments and output – See dpet.comparison.score_avg_jsd for more information.
- dpet.comparison.score_ataJSD(ens_1: Union[Ensemble, Trajectory], ens_2: Union[Ensemble, Trajectory], bins: Union[str, int], return_bins: bool = False, return_scores: bool = False, *args, **kwargs)
Utility function to calculate the ataJSD (Alpha Torsion Average JSD) score between two ensembles. The score evaluates the divergence between distributions of alpha torsion angles (the angles formed by four consecutive Ca atoms in a protein) of the ensembles.
- Parameters:
ens_1 (Union[Ensemble, mdtraj.Trajectory]) – First of the two ensembles to compare.
ens_2 (Union[Ensemble, mdtraj.Trajectory]) – Second of the two ensembles to compare.
remaining arguments and output – See dpet.comparison.score_avg_jsd for more information.
- dpet.comparison.score_avg_2d_angle_jsd(array_1: ndarray, array_2: ndarray, bins: int, return_scores: bool = False, return_bins: bool = False, *args, **kwargs)
Takes as input two (*, F, 2) bi-dimensional feature matrices and computes an average JSD score over all F bi-dimensional features by discretizing them in 2d histograms. The features in this function are assumed to be angles whose values range from -math.pi to math.pi. For example, in the score_ramaJSD function the F features represent the phi-psi values of F residues in a protein of length L=F+2 (the first and last residues don't have both phi and psi values).
- Parameters:
array_1 (np.ndarray) – NumPy array of shape (*, F, 2) containing samples from the first of the two sets of F bi-dimensional distributions to be compared.
array_2 (np.ndarray) – NumPy array of shape (*, F, 2) containing samples from the second of the two sets of F bi-dimensional distributions to be compared.
bins (Union[int, str], optional) – Determines the number of bins to be used when constructing histograms. See dpet.comparison.get_num_comparison_bins for more information. The range spanned by the bins will be -math.pi to math.pi. Note that the effective number of bins used in the function will be the square of the number returned by dpet.comparison.get_num_comparison_bins, since a 2d histogram is built.
return_bins (bool, optional) – If True, returns the square root of the effective number of bins used in the calculation.
- Returns:
results – If return_bins is False, only returns a float value for the JSD score. The score will range from 0 (no common support) to log(2) (same distribution). If return_bins is True, returns a tuple with the JSD score and the number of bins. If return_scores is True it will also return the F scores used to compute the average JSD score.
- Return type:
Union[float, Tuple[float, np.ndarray]]
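The 2d-histogram discretization behind this score can be sketched as below. The avg_2d_angle_jsd helper and the toy angle data are hypothetical illustrations, not IDPET code; note how a bins value of 16 yields an effective 16 x 16 = 256 bins per feature, matching the squaring described above.

```python
import numpy as np

def avg_2d_angle_jsd(array_1, array_2, bins=16):
    """Average JSD over F bi-dimensional angular features, each
    discretized in a (bins x bins) 2d histogram spanning
    [-pi, pi] x [-pi, pi]. Illustrative sketch only."""
    rng = [[-np.pi, np.pi], [-np.pi, np.pi]]
    scores = []
    for f in range(array_1.shape[1]):
        p_h, _, _ = np.histogram2d(array_1[:, f, 0], array_1[:, f, 1],
                                   bins=bins, range=rng)
        q_h, _, _ = np.histogram2d(array_2[:, f, 0], array_2[:, f, 1],
                                   bins=bins, range=rng)
        p_h = p_h.ravel() / p_h.sum()  # normalize to frequencies
        q_h = q_h.ravel() / q_h.sum()
        m_h = 0.5 * (p_h + q_h)
        kld = lambda x: np.sum(x[x > 0] * np.log(x[x > 0] / m_h[x > 0]))
        scores.append(0.5 * kld(p_h) + 0.5 * kld(q_h))
    return float(np.mean(scores))

rng_ = np.random.default_rng(1)
angles = rng_.uniform(-np.pi, np.pi, size=(1000, 5, 2))  # toy phi/psi-like data
print(avg_2d_angle_jsd(angles, angles))  # 0.0 for identical samples
```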
- dpet.comparison.score_avg_jsd(m1, m2, *args, **kwargs)
- dpet.comparison.score_histogram_jsd(p_data: ndarray, q_data: ndarray, limits: Union[str, Tuple[int]], bins: Union[int, str] = 'auto', return_bins: bool = False) Union[float, Tuple[float, ndarray]]
Scores an approximation of the Jensen-Shannon divergence by discretizing into a histogram the values of the two 1d samples provided as input.
- Parameters:
p_data (np.ndarray) – NumPy array of shape (*, ) containing samples from the first of the two mono-dimensional distributions to be compared.
q_data (np.ndarray) – NumPy array of shape (*, ) containing samples from the second of the two mono-dimensional distributions to be compared.
limits (Union[str, Tuple[int]]) –
Defines the method used to calculate the minimum and maximum values of the range spanned by the bins. Accepted values are:
- "m": will use the minimum and maximum values observed by concatenating the samples in p_data and q_data.
- "p": will use the minimum and maximum values observed in p_data. If q_data contains values outside that range, new bins of the same size will be added to cover all values of q. Currently, this is not used in any IDPET functionality. Note that the bins argument will determine only the bins originally spanned by p_data.
- "a": limits for scoring angular features. Will use a (-math.pi, math.pi) range for scoring such features.
- (float, float): provide a custom range. Currently, not used in any IDPET functionality.
bins (Union[int, str], optional) – Determines the number of bins to be used when constructing histograms. See dpet.comparison.get_num_comparison_bins for more information. The range spanned by the bins will be defined by the limits argument.
return_bins (bool, optional) – If True, returns the bins used in the calculation.
- Returns:
results – If return_bins is False, only returns a float value for the JSD score. The score will range from 0 (no common support) to log(2) (same distribution). If return_bins is True, returns a tuple with the JSD score and the number of bins.
- Return type:
Union[float, Tuple[float, np.ndarray]]
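A minimal sketch of this 1d scoring scheme with the "m" limits strategy (bins spanning the min/max of the concatenated samples) is shown below. The histogram_jsd_1d helper is a hypothetical illustration of the approach, not IDPET code.

```python
import numpy as np

def histogram_jsd_1d(p_data, q_data, num_bins=30):
    """Approximate JSD between two 1d samples, with bin edges spanning
    the min/max of the pooled samples (the "m" limits strategy).
    Illustrative sketch only."""
    pooled = np.concatenate([p_data, q_data])
    edges = np.linspace(pooled.min(), pooled.max(), num_bins + 1)
    p_h = np.histogram(p_data, bins=edges)[0] / len(p_data)
    q_h = np.histogram(q_data, bins=edges)[0] / len(q_data)
    m_h = 0.5 * (p_h + q_h)
    kld = lambda x: np.sum(x[x > 0] * np.log(x[x > 0] / m_h[x > 0]))
    return 0.5 * kld(p_h) + 0.5 * kld(q_h)

x = np.random.default_rng(2).normal(0.0, 1.0, 2000)
print(histogram_jsd_1d(x, x))  # 0.0: identical samples
# Fully disjoint samples reach the upper bound log(2):
print(histogram_jsd_1d(x, x + 100.0))  # ~0.6931
```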
- dpet.comparison.score_ramaJSD(ens_1: Union[Ensemble, Trajectory], ens_2: Union[Ensemble, Trajectory], bins: int, return_scores: bool = False, return_bins: bool = False)
Utility function to calculate the ramaJSD (Ramachandran plot average JSD) score between two ensembles. The score evaluates the divergence between the distributions of phi-psi torsion angles of every residue in the ensembles.
- Parameters:
ens_1 (Union[Ensemble, mdtraj.Trajectory]) – First of the two ensembles to compare.
ens_2 (Union[Ensemble, mdtraj.Trajectory]) – Second of the two ensembles to compare.
remaining arguments and output – See dpet.comparison.score_avg_jsd for more information.
- dpet.comparison.sqrt_rule(n)
- dpet.comparison.sturges_rule(n)