Feature Importance & Evaluation
Predomics provides multiple methods for assessing feature importance and evaluating model performance, from individual feature contributions to population-level prevalence analysis.
Individual Feature Importance
Mean Decrease in Accuracy (MDA)
MDA is a permutation-based importance metric that measures how much each feature contributes to the model’s predictive power:
- Compute the baseline AUC of the model on the test data
- For each feature j in the model:
  a. Randomly shuffle the values of feature j across samples (N permutations, default N=100)
  b. Recompute the AUC on the shuffled data
  c. MDA(j) = mean(baseline_AUC - shuffled_AUC)
- Features with high MDA are critical; removing them destroys predictive accuracy
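The procedure above can be sketched as follows. The function names and defaults are illustrative, not the Predomics internals; the AUC helper is a standard rank-based (Mann-Whitney) formulation:

```python
import numpy as np

def mean_decrease_auc(score_fn, X, y, n_perm=100, rng=None):
    """Permutation importance: mean drop in AUC when one feature is shuffled.

    score_fn(X) must return a decision score per sample (higher = class 1).
    """
    rng = np.random.default_rng(rng)

    def auc(scores):
        # Rank-based AUC (Mann-Whitney U), equal to the area under the ROC.
        order = scores.argsort()
        ranks = np.empty_like(order, dtype=float)
        ranks[order] = np.arange(1, len(scores) + 1)
        n1 = y.sum()
        n0 = len(y) - n1
        return (ranks[y == 1].sum() - n1 * (n1 + 1) / 2) / (n0 * n1)

    baseline = auc(score_fn(X))
    mda = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_perm):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])  # break feature j's link to y
            drops.append(baseline - auc(score_fn(Xp)))
        mda[j] = np.mean(drops)
    return mda
```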
Interpretation:
- MDA > 0: Feature contributes positively to the model
- MDA ~ 0: Feature is irrelevant or redundant (its information is captured by other features)
- MDA < 0: Feature may be adding noise (rare, but possible)
Limitations:
- MDA evaluates each feature independently – it cannot capture feature interactions
- If two features are highly correlated, shuffling one may not decrease AUC because the other compensates
- Despite these limitations, MDA is used internally for feature pruning: features with zero or negative importance can be automatically removed to simplify the model
SHAP-like Per-Sample Explanations
For each sample, the contribution of each feature to the predicted score is decomposed:
contribution(feature_j, sample_s) = coefficient_j * value(feature_j, sample_s)
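Because the model is additive, the decomposition is literally coefficient times value. A minimal sketch with made-up ternary coefficients and abundances:

```python
import numpy as np

# Ternary coefficients of a hypothetical 3-feature model (+1 / -1).
coefficients = np.array([1, -1, 1])

# Feature abundances for two samples (rows = samples, columns = features).
X = np.array([[0.4, 0.1, 0.0],
              [0.0, 0.3, 0.2]])

# Per-sample, per-feature contribution to the decision score.
contributions = coefficients * X      # broadcasting over rows
scores = contributions.sum(axis=1)    # the model's additive score
```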
This decomposition provides a per-sample, per-feature breakdown of the prediction, enabling:
Beeswarm Plot
- Each dot represents one sample’s SHAP value for one feature
- X-axis: SHAP value (contribution to score)
- Y-axis: Features (sorted by overall importance)
- Color: Feature value (low = blue, high = red)
Reveals the direction of each feature’s effect: if high values (red dots) lie on the positive side, the feature pushes toward disease; if low values (blue dots) lie on the positive side, high values of that feature push toward health.
Force Plot (Waterfall)
- For a selected sample, shows the sequential contribution of each feature
- Starting from a base value, each feature pushes the score up or down
- The final score determines the classification
Useful for explaining individual predictions: “This patient was classified as disease because feature A was high (+0.3), feature B was present (+0.2), despite feature C being low (-0.1).”
Dependence Plot
- X-axis: Feature value
- Y-axis: SHAP value for that feature
- One dot per sample
Reveals non-linear relationships between feature values and their contribution to the model. A flat line means the feature’s effect is constant; a curve means the effect changes with abundance.
Coefficient Direction
A simpler importance view that shows the model’s coefficients directly:
- Positive coefficients (+1, or higher in Pow2): Feature enriched in class 1 (disease)
- Negative coefficients (-1, or lower in Pow2): Feature enriched in class 0 (controls)
The coefficient direction chart provides an immediate visual summary of which features push toward which class.
Population-Level Feature Analysis
Feature Prevalence in the FBM
For each feature, its prevalence across the Family of Best Models measures its consistency as a biomarker:
prevalence(feature_j) = count(FBM models containing feature_j) / |FBM|
Features are ranked by prevalence and displayed as:
- Prevalence bar chart: Horizontal bars showing prevalence per feature
- Population heatmap: Models (rows) x features (columns), colored by coefficient
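The prevalence computation itself is a simple count over model feature sets; a sketch with a made-up FBM of four models (feature names are illustrative):

```python
from collections import Counter

# Feature sets of a hypothetical Family of Best Models (FBM).
fbm = [
    {"Prevotella", "Bacteroides", "Roseburia"},
    {"Prevotella", "Roseburia"},
    {"Prevotella", "Faecalibacterium"},
    {"Prevotella", "Bacteroides"},
]

# prevalence(feature_j) = count(models containing feature_j) / |FBM|
counts = Counter(f for model in fbm for f in model)
prevalence = {f: n / len(fbm) for f, n in counts.items()}
ranked = sorted(prevalence.items(), key=lambda kv: -kv[1])
```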
FBM Feature Categories
| Prevalence | Category | Interpretation |
|---|---|---|
| > 80% | Core biomarker | Almost always selected. Very robust signal. |
| 50-80% | Frequent biomarker | Selected in most models. Reliable but may have alternatives. |
| 20-50% | Accessory feature | Selected in some models. May be context-dependent or interchangeable. |
| < 20% | Rare feature | Only in a few models. Possibly noise or specific to certain k values. |
Feature x Sparsity Heatmap
Shows how feature prevalence varies across model sizes k:
- A feature with high prevalence at k=3 and k=5 but low at k=10 is a core feature that gets diluted at higher sparsity
- A feature with increasing prevalence as k grows is secondary – only included when the model has room for extra features
Evaluation Metrics
Classification Metrics
| Metric | Formula | Range | Interpretation |
|---|---|---|---|
| AUC | Area under ROC | [0.5, 1] | Discrimination ability across all thresholds |
| Accuracy | (TP+TN) / N | [0, 1] | Overall correctness (affected by class imbalance) |
| Sensitivity | TP / (TP+FN) | [0, 1] | Ability to detect positives (recall) |
| Specificity | TN / (TN+FP) | [0, 1] | Ability to detect negatives |
| MCC | (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)) | [-1, 1] | Balanced metric, robust to class imbalance |
| F1 Score | 2*Prec*Sens / (Prec+Sens) | [0, 1] | Harmonic mean of precision and sensitivity |
| G-mean | sqrt(Sens * Spec) | [0, 1] | Geometric mean, penalizes imbalanced performance |
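All of these metrics derive from the four confusion-matrix counts; a small sketch (the function name is illustrative):

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Compute the table's metrics from confusion-matrix counts."""
    n = tp + tn + fp + fn
    sens = tp / (tp + fn)          # sensitivity / recall
    spec = tn / (tn + fp)          # specificity
    prec = tp / (tp + fp)          # precision
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / n,
        "sensitivity": sens,
        "specificity": spec,
        "mcc": (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0,
        "f1": 2 * prec * sens / (prec + sens),
        "g_mean": math.sqrt(sens * spec),
    }
```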
Confusion Matrix
The standard 2x2 confusion matrix is extended with a rejection class when threshold confidence intervals are enabled:
| | Predicted 0 | Predicted 1 | Rejected |
|---|---|---|---|
| Actual 0 | TN | FP | Rejected-0 |
| Actual 1 | FN | TP | Rejected-1 |
Rejected samples have scores falling within the threshold confidence interval.
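The rejection rule can be sketched as follows, assuming a symmetric confidence band of some half-width around the threshold (the half-width stands in for the interval Predomics estimates; names are illustrative):

```python
import numpy as np

def predict_with_rejection(scores, threshold, ci_half_width):
    """Classify by score vs. threshold; reject scores inside the CI band.

    Returns 1 / 0 / -1, where -1 marks a rejected sample.
    """
    scores = np.asarray(scores, dtype=float)
    pred = (scores > threshold).astype(int)
    pred[np.abs(scores - threshold) <= ci_half_width] = -1
    return pred
```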
Cross-Validation Reporting
Performance is reported at multiple levels:
- Per-fold: Train and validation metrics for each outer CV fold
- Aggregated: Mean and standard deviation across folds
- Final: Performance of the combined FBM on full training data
- External test: Performance on held-out test data (if provided)
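The aggregated level is simply a mean and standard deviation over fold metrics; e.g. with made-up validation AUCs from a 5-fold outer CV:

```python
import statistics

# Validation AUCs of a hypothetical 5-fold outer CV.
fold_aucs = [0.81, 0.78, 0.84, 0.80, 0.77]

mean_auc = statistics.mean(fold_aucs)
sd_auc = statistics.stdev(fold_aucs)  # sample standard deviation
```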
Generation Tracking
During GA optimization, train and test AUC are tracked at each generation to detect:
- Convergence: AUC stabilizes (good stopping criterion)
- Overfitting: Train AUC increases but test AUC decreases or stagnates
- Underfitting: Both train and test AUC are low (may need more generations or different parameters)
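A simple heuristic for flagging the overfitting pattern above (not the package's actual stopping rule; the function name and window are illustrative):

```python
def overfitting_onset(train_auc, test_auc, window=5):
    """Return the first generation where test AUC has not improved for
    `window` generations while train AUC kept rising; None otherwise."""
    best_test = test_auc[0]
    stale = 0
    for g in range(1, len(test_auc)):
        if test_auc[g] > best_test:
            best_test = test_auc[g]
            stale = 0
        else:
            stale += 1
        if stale >= window and train_auc[g] > train_auc[g - window]:
            return g
    return None
```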
External Validation
Models trained on one dataset can be validated on an independent cohort:
- Upload external Xtest and Ytest files
- Apply the best model (or jury ensemble) to the new data
- Compare AUC, sensitivity, specificity on the external cohort
This is the gold standard for assessing generalization and is essential before claiming clinical utility.
Multi-Cohort Meta-Analysis
The comparative/meta-analysis view allows side-by-side comparison of multiple jobs:
- Overlaid metrics: AUC, accuracy, sensitivity across different runs
- Feature overlap: Venn diagram of features selected by different runs
- Cross-cohort biomarkers: Features consistently selected across datasets are the strongest candidates for generalization
References
- Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5-32. (Original MDA formulation)
- Lundberg, S. M. & Lee, S.-I. (2017). A unified approach to interpreting model predictions. NeurIPS.
- Matthews, B. W. (1975). Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta, 405(2), 442-451.