Documentation for AnalysisBox

This class inherits from the FeaturesSet class and provides preliminary statistical analysis of the numerical features.

calculate_basic_stats(self, volume_feature='')

Calculate basic statistics for each feature and save them to a .csv file: number of missing values, mean, std, min, max, Shapiro-Wilk test p-value, Mann-Whitney test p-value and univariate ROC AUC (both for binary classes), and Spearman's correlation with volume (only if a volumetric feature name is passed to the function).

Parameters:
  • volume_feature (str) – Name of the feature to be treated as volume.
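
As an illustration of the simplest scores listed above, the following is a minimal sketch using pandas, not the AnalysisBox implementation. The helper name basic_stats_sketch and the demo data are hypothetical.

```python
import numpy as np
import pandas as pd


def basic_stats_sketch(features: pd.DataFrame) -> pd.DataFrame:
    """Per-feature summary: missing count, mean, std, min, max (hypothetical helper)."""
    numeric = features.select_dtypes(include="number")
    return pd.DataFrame({
        "n_missing": numeric.isna().sum(),  # missing values per feature
        "mean": numeric.mean(),             # NaNs are skipped by default
        "std": numeric.std(),
        "min": numeric.min(),
        "max": numeric.max(),
    })


# Made-up data: one row per patient, one column per feature
demo = pd.DataFrame({
    "feat_a": [1.0, 2.0, np.nan, 4.0],
    "feat_b": [10.0, 10.0, 12.0, 14.0],
})
stats_table = basic_stats_sketch(demo)
print(stats_table)
```

The resulting table has one row per feature, so it could be written out with stats_table.to_csv(...) in the same spirit as the method described above.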

handle_constant(self)

Drop features whose values are constant.
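
The idea of dropping constant features can be sketched with pandas as follows. This is an assumption about the general technique, not the library's code, and drop_constant_features is a hypothetical name.

```python
import pandas as pd


def drop_constant_features(features: pd.DataFrame) -> pd.DataFrame:
    """Keep only columns with more than one distinct non-missing value (hypothetical helper)."""
    keep = features.nunique() > 1  # nunique ignores NaN by default
    return features.loc[:, keep]


# Made-up data: feat_b carries no information because it never changes
demo = pd.DataFrame({"feat_a": [1, 2, 3], "feat_b": [5, 5, 5]})
reduced = drop_constant_features(demo)
print(list(reduced.columns))
```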

handle_nan(self, axis=1, how='any', mode='delete')

Handle the missing values.

Parameters:
  • axis (int) – Determines whether patients (0) or variables (1) with missing values are handled.

  • how (str) – Determines if handling is needed when there is at least one missing value ('any') or all of them are missing ('all').

  • mode (str) – Determines the strategy: 'delete' removes the variable/patient, 'fill' replaces each missing value using the imputation method.
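
The semantics of the three parameters can be sketched with pandas as below. This is a hedged illustration, not the AnalysisBox code: it assumes patients are stored in rows and variables in columns, and it uses per-column mean imputation for 'fill' purely as a stand-in, since the imputation method is not specified here. The name handle_nan_sketch is hypothetical.

```python
import numpy as np
import pandas as pd


def handle_nan_sketch(features, axis=1, how="any", mode="delete"):
    """Mimic the documented parameters on a plain DataFrame (hypothetical helper)."""
    if mode == "delete":
        # axis=0 drops rows (patients), axis=1 drops columns (variables)
        return features.dropna(axis=axis, how=how)
    if mode == "fill":
        # Assumption: mean imputation per column; axis and how are ignored here
        return features.fillna(features.mean(numeric_only=True))
    raise ValueError("mode must be 'delete' or 'fill'")


# Made-up data with one missing value
demo = pd.DataFrame({"feat_a": [1.0, np.nan, 3.0], "feat_b": [4.0, 5.0, 6.0]})

dropped_variables = handle_nan_sketch(demo, axis=1, how="any", mode="delete")
dropped_patients = handle_nan_sketch(demo, axis=0, how="any", mode="delete")
filled = handle_nan_sketch(demo, mode="fill")
print(dropped_variables.columns.tolist(), dropped_patients.index.tolist())
```

With how='all', neither the row nor the column in this demo would be removed, because none of them is entirely missing.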

normality_check(self, features_to_plot=[], p_thresh=0.05)

Perform the Shapiro-Wilk normality check for all the features.

Parameters:
  • features_to_plot (list) – List of specific features to be selected (otherwise selects all the numerical features).

  • p_thresh (float) – Threshold for the Shapiro-Wilk test p-value.

Returns:
  • list – List of the features with normal distribution.
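
A minimal sketch of such a check with scipy.stats.shapiro is shown below. It assumes that a feature is reported as normal when its p-value exceeds p_thresh, which is the usual reading of the test but is not confirmed by this documentation. The helper name and the demo data are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import stats


def normal_features_sketch(features: pd.DataFrame, p_thresh: float = 0.05) -> list:
    """Return names of features for which normality is not rejected (hypothetical helper)."""
    normal = []
    for name in features.select_dtypes(include="number").columns:
        values = features[name].dropna()
        _, p_value = stats.shapiro(values)
        # Assumption: p-value above the threshold means "treated as normal"
        if p_value > p_thresh:
            normal.append(name)
    return normal


# Deterministic made-up data: normal quantiles vs. strongly skewed exponential quantiles
quantiles = np.linspace(0.01, 0.99, 99)
demo = pd.DataFrame({
    "gaussian_like": stats.norm.ppf(quantiles),
    "skewed": -np.log(1.0 - quantiles),
})
normal_list = normal_features_sketch(demo)
print(normal_list)
```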

plot_MW_p(self, features_to_plot=[], binary_classes_to_plot=[], p_threshold=0.05)

Plot two-sided Mann-Whitney U test p-values (with correction for multiple testing) comparing the distributions of feature values between 2 classes into an interactive .html report.

Parameters:
  • features_to_plot (list) – List of specific features to be selected (otherwise selects all the numerical features).

  • binary_classes_to_plot (list) – List containing the 2 classes of interest if the dataset is multi-class.

  • p_threshold (float) – Significance level.
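
The computation behind such a plot can be sketched with scipy.stats.mannwhitneyu. The correction method is not named in this documentation, so the Bonferroni correction used below is an assumption, as are the helper name mann_whitney_sketch and the demo data.

```python
import pandas as pd
from scipy import stats


def mann_whitney_sketch(features, labels, class_pair, p_threshold=0.05):
    """Two-sided Mann-Whitney U p-values per feature (hypothetical helper)."""
    first, second = class_pair
    raw = {}
    for name in features.columns:
        group_a = features.loc[labels == first, name].dropna()
        group_b = features.loc[labels == second, name].dropna()
        _, p_value = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
        raw[name] = p_value
    raw = pd.Series(raw)
    # Assumption: Bonferroni correction (multiply by the number of tests, cap at 1)
    corrected = (raw * len(raw)).clip(upper=1.0)
    return pd.DataFrame({
        "p_raw": raw,
        "p_corrected": corrected,
        "significant": corrected < p_threshold,
    })


# Made-up data: 10 patients per class
labels = pd.Series([0] * 10 + [1] * 10)
demo = pd.DataFrame({
    # class 0 holds values 1..10, class 1 holds 11..20: fully separated
    "separated": list(range(1, 11)) + list(range(11, 21)),
    # values interleaved between classes: no shift in location
    "mixed": [1, 4, 5, 8, 9, 12, 13, 16, 17, 20, 2, 3, 6, 7, 10, 11, 14, 15, 18, 19],
})
mw_table = mann_whitney_sketch(demo, labels, class_pair=(0, 1))
print(mw_table)
```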

plot_correlation_matrix(self, features_to_plot=[], corr_method='spearman', save_to_csv=False)

Plot correlation (Spearman's by default) matrix for the features into interactive .html report.

Parameters:
  • features_to_plot (list) – List of specific features to be selected (otherwise selects all the numerical features).

  • corr_method (str) – Method of calculation: {'pearson', 'kendall', 'spearman'}.

  • save_to_csv (bool) – Enable/disable saving correlation dataframe to .csv file.
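
The three corr_method values match the options of pandas DataFrame.corr, so the underlying matrix can be sketched as below. Whether AnalysisBox calls pandas in this way is an assumption, and the demo data is made up.

```python
import pandas as pd

# Made-up data: feat_b grows monotonically with feat_a, feat_c decreases with it
demo = pd.DataFrame({
    "feat_a": [1, 2, 3, 4, 5],
    "feat_b": [1, 4, 9, 16, 25],
    "feat_c": [5, 4, 3, 2, 1],
})

# Spearman's correlation works on ranks, so any monotonic relation gives +1 or -1
corr = demo.corr(method="spearman")
print(corr)

# Comparable in spirit to save_to_csv=True: serialise the matrix as CSV text
csv_text = corr.to_csv()
```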

plot_distribution(self, features_to_plot=[], binary_classes_to_plot=[])

Plot distribution of the feature values in classes into interactive .html report.

Parameters:
  • features_to_plot (list) – List of specific features to be selected (otherwise selects all the numerical features).

  • binary_classes_to_plot (list) – List containing the 2 classes of interest if the dataset is multi-class.

plot_univariate_roc(self, features_to_plot=[], binary_classes_to_plot=[], auc_threshold=0.75)

Plot univariate ROC curves (with AUC calculation) for a threshold binary classifier based on each feature separately into an interactive .html report.

Parameters:
  • features_to_plot (list) – List of specific features to be selected (otherwise selects all the numerical features).

  • binary_classes_to_plot (list) – List containing the 2 classes of interest if the dataset is multi-class.

  • auc_threshold (float) – Threshold value for ROC AUC to be highlighted.
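
A threshold classifier on a single feature predicts the positive class whenever the feature value reaches a cut-off. Sweeping that cut-off gives the ROC curve. The following NumPy sketch shows the idea under that assumption. The helper name univariate_roc_sketch is hypothetical and the code is not taken from AnalysisBox.

```python
import numpy as np


def univariate_roc_sketch(values, labels):
    """ROC points and AUC for the rule 'positive if value >= threshold' (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    n_pos, n_neg = labels.sum(), (~labels).sum()
    fpr, tpr = [0.0], [0.0]
    # Sweep thresholds from the largest observed value down to the smallest
    for threshold in np.unique(values)[::-1]:
        predicted = values >= threshold
        tpr.append((predicted & labels).sum() / n_pos)
        fpr.append((predicted & ~labels).sum() / n_neg)
    fpr, tpr = np.array(fpr), np.array(tpr)
    # Area under the curve by the trapezoidal rule
    auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
    return fpr, tpr, auc


# Made-up feature values and binary labels
fpr, tpr, auc = univariate_roc_sketch([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
_, _, auc_perfect = univariate_roc_sketch([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1])
print(auc, auc_perfect)
```

In this sketch a feature that is lower in the positive class would yield an AUC below 0.5. How AnalysisBox treats such features is not stated here.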

volume_analysis(self, volume_feature='', auc_threshold=0.75, features_to_plot=[], corr_threshold=0.75)

Calculate Spearman's correlation of each feature with volume and plot the volume-based precision-recall curve.

Parameters:
  • volume_feature (str) – Name of the feature to be treated as volume.

  • auc_threshold (float) – Threshold value for area under precision-recall curve to be highlighted.

  • features_to_plot (list) – List of specific features to be selected (otherwise selects all the numerical features).

  • corr_threshold (float) – Threshold on the absolute value of Spearman's correlation coefficient above which the correlation is considered 'strong'.
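
The correlation part of this analysis can be sketched with pandas as below. The precision-recall curve is left out. The helper name, the column names, and the strict '>' comparison against corr_threshold are assumptions made for illustration.

```python
import pandas as pd


def volume_correlation_sketch(features, volume_feature, corr_threshold=0.75):
    """Spearman's correlation of every feature with the volume feature (hypothetical helper)."""
    corr = features.corr(method="spearman")[volume_feature].drop(volume_feature)
    # Assumption: 'strong' means the absolute coefficient exceeds the threshold
    strong = corr[corr.abs() > corr_threshold].index.tolist()
    return corr, strong


# Made-up data: 'surface' grows monotonically with volume, 'texture' does not
demo = pd.DataFrame({
    "volume": [1, 2, 3, 4, 5, 6],
    "surface": [1, 4, 9, 16, 25, 36],
    "texture": [3, 1, 6, 2, 5, 4],
})
volume_corr, strong_features = volume_correlation_sketch(demo, "volume")
print(volume_corr)
print(strong_features)
```

Features flagged in this way mostly restate the volume information, which is why it helps to review them separately.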