Features
Model Languages
Predomics discovers classification signatures using three interpretable mathematical languages:
- Binary (bin) – Features contribute as 0 or 1 (presence/absence). The score is a simple sum of selected feature values above their individual thresholds.
- Ternary (ter) – Features contribute as -1, 0, or +1. Features enriched in one class contribute positively; features enriched in the other contribute negatively. The most commonly used language.
- Ratio – The score is computed as the ratio of two feature groups (numerator sum / denominator sum). Naturally bounded and robust to normalization differences.
All languages produce compact, human-readable models with typically 3–15 features (parameter k), making them ideal for biomarker discovery.
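
As a rough illustration of the three languages, the sketch below scores one sample under each of them. It is not the Predomics API: the function names, the `epsilon` guard against an empty denominator, and the feature names are assumptions made for this example only.

```python
def linear_score(sample, coefficients):
    """Sum of coefficient * feature value.

    Covers bin (coefficients in {0, 1}) and ter (coefficients in {-1, 0, +1}).
    """
    return sum(coef * sample[name] for name, coef in coefficients.items())


def ratio_score(sample, numerator, denominator, epsilon=1e-9):
    """Numerator sum divided by denominator sum (epsilon is an assumed guard)."""
    num = sum(sample[name] for name in numerator)
    den = sum(sample[name] for name in denominator)
    return num / (den + epsilon)


# Hypothetical relative abundances for one sample.
sample = {"sp_A": 0.5, "sp_B": 0.25, "sp_C": 0.125}

bin_model = {"sp_A": 1, "sp_B": 1}               # k = 2
ter_model = {"sp_A": 1, "sp_B": 1, "sp_C": -1}   # k = 3

bin_score = linear_score(sample, bin_model)
ter_score = linear_score(sample, ter_model)
ratio = ratio_score(sample, ["sp_A", "sp_B"], ["sp_C"])

print(bin_score, ter_score, round(ratio, 3))
```

In each case the score would then be compared with a decision threshold to assign a class.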
Search Algorithms
Three complementary optimization strategies are available:
- Genetic Algorithm (GA) – Population-based evolutionary search. Maintains a diverse population of candidate models, iteratively applying crossover, mutation, and selection to converge toward high-fitness signatures. Best for broad exploration of the feature space.
- Beam Search – Deterministic, greedy heuristic that incrementally builds signatures by adding the best feature at each step. Fast and reproducible, well-suited for small feature sets and quick exploration.
- MCMC (Bayesian) – Markov Chain Monte Carlo sampler that explores the model space probabilistically. Useful for estimating posterior distributions of feature inclusion probabilities.
Each algorithm can be combined with any model language and run independently or in batch mode to sweep across parameter combinations.
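
The incremental construction used by beam search can be sketched as follows. This is a generic beam search over feature subsets with a toy fitness function, not the Predomics implementation: `beam_search`, `beam_width`, the weights, and the size penalty are all assumptions chosen to keep the example small and deterministic.

```python
def beam_search(features, fitness, max_k, beam_width=2):
    """Grow feature subsets one feature at a time, keeping the top `beam_width`."""
    beam = [frozenset()]
    best_subset, best_fitness = None, float("-inf")
    for _ in range(max_k):
        # Extend every subset in the beam by one unused feature.
        candidates = {
            subset | {f} for subset in beam for f in features if f not in subset
        }
        if not candidates:
            break
        # Rank by fitness; break ties by feature names so runs are reproducible.
        ranked = sorted(candidates, key=lambda s: (-fitness(s), sorted(s)))
        beam = ranked[:beam_width]
        top_fitness = fitness(beam[0])
        if top_fitness > best_fitness:
            best_subset, best_fitness = beam[0], top_fitness
    return best_subset, best_fitness


# Toy fitness (assumption): per-feature gain minus a penalty of 10 per feature.
weights = {"sp_A": 30, "sp_B": 25, "sp_C": 5, "sp_D": 20}


def toy_fitness(subset):
    return sum(weights[f] for f in subset) - 10 * len(subset)


best_subset, best_fitness = beam_search(list(weights), toy_fitness, max_k=4)
print(sorted(best_subset), best_fitness)
```

With this toy fitness the search settles on a three-feature signature, because adding the weakest feature costs more in penalty than it gains.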
Family of Best Models (FBM)
Rather than selecting a single “best” model, Predomics tracks the Family of Best Models – all individuals that achieve near-optimal performance. This population provides:
- Feature prevalence analysis: identify which features appear most consistently across top models
- Robustness assessment: stable features (high prevalence) are more likely to generalize
- Functional annotations: link features to known biological functions (butyrate production, inflammation, transit time, oral origin)
- Co-presence analysis: detect feature pairs that co-occur or exclude each other significantly
- Stability analysis: Kuncheva, Tanimoto, and weighted consistency indices per sparsity level to identify the “sweet spot” where models are both performant and stable
- Model clustering: hierarchical clustering (Tanimoto distance) reveals distinct model families and prototype representatives
- Feature × sparsity heatmap: feature prevalence across model sizes reveals the “core signature” that persists regardless of k
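
Two of the analyses above, feature prevalence and Tanimoto similarity between models, can be sketched on a toy family of models. The code assumes each model is represented only by its set of feature names and uses the mean pairwise Tanimoto similarity as a simple stability summary; the Kuncheva and weighted consistency indices are not shown, and none of the names below come from Predomics.

```python
from collections import Counter
from itertools import combinations


def feature_prevalence(models):
    """Fraction of models in which each feature appears."""
    counts = Counter(feature for model in models for feature in model)
    return {feature: count / len(models) for feature, count in counts.items()}


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two feature sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_pairwise_tanimoto(models):
    """Average similarity over all model pairs (assumed stability summary)."""
    pairs = list(combinations(models, 2))
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


# Hypothetical family of best models, all with k = 3.
fbm = [
    {"sp_A", "sp_B", "sp_C"},
    {"sp_A", "sp_B", "sp_D"},
    {"sp_A", "sp_C", "sp_D"},
    {"sp_A", "sp_B", "sp_C"},
]

prevalence = feature_prevalence(fbm)
stability = mean_pairwise_tanimoto(fbm)
print(prevalence, round(stability, 3))
```

Here `sp_A` appears in every model, which is the kind of feature the heatmap would show as part of the core signature.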
Jury / Ensemble Voting
Build an ensemble of expert models from the best individuals across generations:
- Majority voting – Weighted expert consensus with configurable threshold
- Consensus voting – Requires minimum agreement level to predict; uncertain samples are rejected
- Rejection class – Samples below confidence threshold are assigned to class 2 (abstention), reducing false positives
- Per-sample vote matrix visualization and concordance analysis
- Comparison of jury performance vs. best individual model
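
The voting rules above can be sketched as one function with a minimum agreement level. This is a hypothetical illustration: `jury_predict`, its parameters, and the choice to reject exact ties are assumptions, not the Predomics interface.

```python
def jury_predict(votes, weights=None, min_agreement=0.5, rejection_class=2):
    """Weighted vote over expert predictions (0 or 1) with an abstention class."""
    if weights is None:
        weights = [1.0] * len(votes)
    weight_1 = sum(w for v, w in zip(votes, weights) if v == 1)
    weight_0 = sum(w for v, w in zip(votes, weights) if v == 0)
    total = weight_1 + weight_0
    # No usable votes or an exact tie: abstain (assumed behavior).
    if total == 0 or weight_1 == weight_0:
        return rejection_class
    winner = 1 if weight_1 > weight_0 else 0
    agreement = max(weight_1, weight_0) / total
    return winner if agreement >= min_agreement else rejection_class


majority = jury_predict([1, 1, 1, 0, 0])                     # 60% agreement
strict = jury_predict([1, 1, 1, 0, 0], min_agreement=0.75)   # below threshold
weighted = jury_predict([1, 0, 0], weights=[3, 1, 1])        # one strong expert
tie = jury_predict([1, 0])

print(majority, strict, weighted, tie)
```

Raising `min_agreement` trades coverage for confidence: more samples fall into the rejection class, and the remaining predictions rest on broader agreement.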
Evaluation & Metrics
- K-fold cross-validation with nested inner/outer folds and overfit penalty control
- Holdout validation with configurable train/test split ratio
- External validation on independent cohorts uploaded post-training
- Metrics: AUC, accuracy, sensitivity, specificity, MCC, F1 score
- Confusion matrices with rejection class support (3×2 layout)
- Generation-level tracking of train vs. test AUC evolution
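
For reference, the listed metrics can be computed from confusion counts as in the sketch below. It assumes that rejected samples (predicted class 2) are excluded from the metrics and reported separately as a rejection rate; whether Predomics handles rejections this way is not stated here, and the function name is hypothetical.

```python
import math


def binary_metrics(y_true, y_pred, rejection_class=2):
    """Accuracy, sensitivity, specificity, F1, and MCC on non-rejected samples."""
    kept = [(t, p) for t, p in zip(y_true, y_pred) if p != rejection_class]
    tp = sum(1 for t, p in kept if t == 1 and p == 1)
    tn = sum(1 for t, p in kept if t == 0 and p == 0)
    fp = sum(1 for t, p in kept if t == 0 and p == 1)
    fn = sum(1 for t, p in kept if t == 1 and p == 0)

    def safe_div(num, den):
        return num / den if den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": safe_div(tp + tn, len(kept)),
        "sensitivity": safe_div(tp, tp + fn),
        "specificity": safe_div(tn, tn + fp),
        "f1": safe_div(2 * tp, 2 * tp + fp + fn),
        "mcc": safe_div(tp * tn - fp * fn, mcc_den),
        "rejection_rate": safe_div(len(y_true) - len(kept), len(y_true)),
    }


# Hypothetical labels; the last two predictions are abstentions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1, 2, 2]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```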
Feature Importance
- MDA (Mean Decrease in Accuracy) – Permutation-based importance via random shuffling of each feature
- SHAP-like explanations – Per-sample feature contribution breakdown (feature value × coefficient)
- Beeswarm plot: distribution of SHAP values across all samples
- Force plot: per-sample waterfall of feature contributions
- Dependence plot: SHAP value vs. feature value for a selected feature
- Coefficient direction – Bar chart of model coefficients showing which features push toward which class
- Waterfall chart – Feature contribution decomposition for the best model
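
The two importance measures can be sketched for a small ternary model. The contribution of a feature is taken as feature value × coefficient, as described above; the permutation loop, the zero decision threshold, the seed, and all names are assumptions for illustration rather than the Predomics implementation.

```python
import random


def contributions(sample, coefficients):
    """Per-feature contribution to the score: feature value * coefficient."""
    return {name: sample[name] * coef for name, coef in coefficients.items()}


def predict(sample, coefficients, threshold=0.0):
    score = sum(contributions(sample, coefficients).values())
    return 1 if score > threshold else 0


def accuracy(samples, labels, coefficients):
    hits = sum(predict(s, coefficients) == y for s, y in zip(samples, labels))
    return hits / len(samples)


def mean_decrease_accuracy(samples, labels, coefficients, feature,
                           n_repeats=20, seed=0):
    """Average accuracy drop after shuffling one feature across samples."""
    rng = random.Random(seed)
    baseline = accuracy(samples, labels, coefficients)
    drops = []
    for _ in range(n_repeats):
        column = [s[feature] for s in samples]
        rng.shuffle(column)
        permuted = [{**s, feature: v} for s, v in zip(samples, column)]
        drops.append(baseline - accuracy(permuted, labels, coefficients))
    return sum(drops) / n_repeats


# Hypothetical data: sp_C is measured but not used by the model.
samples = [
    {"sp_A": 0.5, "sp_B": 0.125, "sp_C": 0.25},
    {"sp_A": 0.75, "sp_B": 0.25, "sp_C": 0.5},
    {"sp_A": 0.125, "sp_B": 0.5, "sp_C": 0.25},
    {"sp_A": 0.25, "sp_B": 0.75, "sp_C": 0.5},
]
labels = [1, 1, 0, 0]
ter_model = {"sp_A": 1, "sp_B": -1}

first_sample = contributions(samples[0], ter_model)
mda_a = mean_decrease_accuracy(samples, labels, ter_model, "sp_A")
mda_c = mean_decrease_accuracy(samples, labels, ter_model, "sp_C")
print(first_sample, mda_a, mda_c)
```

Shuffling a feature the model does not use leaves accuracy unchanged, so its MDA is zero.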
Ecosystem Analysis
Visualize and explore microbial ecosystems as co-abundance networks, inspired by the Interpred approach (Cousin-Thorez, 2019) and the SCAPIS ecosystem work (Prifti, 2024).
- Co-abundance network – Species correlation network built from the abundance matrix using pairwise Spearman correlations. Edges connect species with rho above a configurable threshold.
- Community detection – Louvain algorithm partitions the network into ecological modules (niches). Modularity score quantifies partition quality.
- Taxonomic coloring – Hierarchical color scheme: phylum base colors (SCAPIS palette) with family-level shading via lighten/darken gradients. Produces visually distinct colors for every family within each phylum.
- Multiple layout algorithms – Organic (Fruchterman-Reingold with simulated annealing), Force-directed, Circle, and Radial layouts.
- Three color modes – Taxonomy (phylum/family), Module (Louvain community), or Enrichment (which class each species is enriched in).
- FBM overlay – Annotate network nodes with data from the Family of Best Models: prevalence of each species across models and dominant coefficient direction (+1/-1). Bridges the ecological view with the predictive view.
- Interactive controls – Adjustable prevalence threshold, correlation threshold, class filtering (all/class 0/class 1), and module highlight on click.
- Node metrics – Degree, betweenness centrality, per-class prevalence, mean abundance.
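
The network construction step can be sketched in plain Python: rank-transform each species, correlate the ranks (Spearman), and keep the pairs whose rho exceeds the threshold. The abundance values, the default threshold, and the function names are assumptions; community detection and layout are left out.

```python
import math
from collections import Counter
from itertools import combinations


def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def spearman(x, y):
    """Spearman rho: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def co_abundance_edges(abundance, threshold=0.6):
    """Edges (species_a, species_b, rho) for pairs with rho above the threshold."""
    edges = []
    for a, b in combinations(sorted(abundance), 2):
        rho = spearman(abundance[a], abundance[b])
        if rho > threshold:
            edges.append((a, b, rho))
    return edges


# Hypothetical abundances of four species across five samples.
abundance = {
    "sp_A": [1, 2, 3, 4, 5],
    "sp_B": [2, 4, 5, 9, 10],
    "sp_C": [5, 4, 3, 2, 1],
    "sp_D": [3, 1, 4, 2, 5],
}

edges = co_abundance_edges(abundance)
degree = Counter(node for a, b, _ in edges for node in (a, b))
print(edges, dict(degree))
```

Only `sp_A` and `sp_B` are connected in this example; `sp_C` is anti-correlated with both, and `sp_D` falls below the threshold.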
Web Application (PredomicsApp)
The web application provides a complete analysis workflow:
- Project management – Create, archive, share, and organize analysis projects
- Dataset library – Centralized dataset management with versioning, tagging, and metadata scanning
- Data exploration – Feature statistics, prevalence distribution, volcano plots, barcode visualization
- Parameter configuration – Template system, admin defaults, batch mode for sweeping parameters
- Real-time monitoring – Live console output with progress sparkline during job execution
- Interactive results – Plotly-based charts for all result views (summary, population, jury, comparative, co-presence, ecosystem)
- Export options – PDF biomarker reports, HTML reports, CSV tables, Python notebooks (.ipynb), R notebooks (.Rmd)
- Prediction API – Deploy trained models as REST endpoints for programmatic scoring
- User management – JWT authentication, API keys, role-based access (admin, viewer, editor)
- Public sharing – Generate read-only links with optional expiry dates
- Browser notifications – Desktop alerts when jobs complete or fail
- Multi-cohort meta-analysis – Compare models across datasets to identify shared biomarkers