Search Algorithms
Predomics provides three complementary optimization algorithms for searching the model space. Each explores different trade-offs between exhaustiveness, speed, and stochasticity.
Genetic Algorithm (GA)
The primary search strategy. A population-based evolutionary algorithm that maintains and evolves a diverse set of candidate models over multiple generations (epochs).
Algorithm Outline
1. Initialization: Generate a random population of individuals, each with a random subset of k features and random coefficients (drawn from the language alphabet). Feature `prior_weight` annotations can bias initial feature selection.
2. Evaluation: Score each individual on the training data (compute AUC, optimize threshold, apply penalties).
3. Selection: Choose parents for the next generation:
   - Elite selection: Top-performing individuals survive unchanged
   - Niche selection (`select_niche_pct`): Within each language × data-type niche, the top N% are guaranteed as parents. This prevents any single niche from dominating.
   - Tournament/random selection: Fill remaining parent slots
4. Crossover: Combine pairs of parent models to produce offspring. Feature sets are recombined (union, intersection, or mixed strategies). Coefficients are inherited or re-assigned.
5. Mutation: Randomly modify offspring:
   - Add or remove a feature
   - Flip a coefficient sign
   - Change the language or data type (rare, large mutations)
6. Diversity filtration (`forced_diversity_pct`): At periodic intervals (`forced_diversity_epochs`), remove individuals that are too similar to others (measured by signed Jaccard distance on feature sets). Binary/Ternary/Pow2 models are filtered as one group; Ratio models are filtered separately. This prevents premature convergence.
7. Repeat steps 2-6 for the configured number of epochs.
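The diversity filtration step can be sketched as follows. The model representation (a dict mapping each selected feature to its coefficient sign) and the similarity threshold are illustrative assumptions, not Predomics internals:

```python
def signed_jaccard_similarity(a, b):
    """Signed Jaccard similarity of two sparse models, each given as a
    dict {feature: coefficient sign}. A feature counts as shared only
    when present in both models with the same sign; the denominator is
    the union of all selected features."""
    union = set(a) | set(b)
    if not union:
        return 1.0
    shared = sum(1 for f in set(a) & set(b) if a[f] == b[f])
    return shared / len(union)


def diversity_filter(population, max_similarity=0.9):
    """Greedily keep an individual only if it is not too similar
    (by signed Jaccard) to any individual already kept."""
    kept = []
    for ind in population:
        if all(signed_jaccard_similarity(ind, k) <= max_similarity for k in kept):
            kept.append(ind)
    return kept
```

In the full algorithm this filter would run every `forced_diversity_epochs` epochs, once over the Binary/Ternary/Pow2 group and once over the Ratio models.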
Niche System
The population is partitioned into niches based on (language, data_type) combinations. For example, with languages {bin, ter, ratio} and data types {raw, prev, log}, there are up to 9 niches. The `select_niche_pct` parameter ensures each niche contributes parents to the next generation, maintaining structural diversity.
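Niche-based parent selection can be sketched as below. The individual representation (dicts with `language`, `data_type`, and `fitness` keys) is an assumption made for illustration:

```python
from collections import defaultdict


def select_niche_parents(population, select_niche_pct=20):
    """Guarantee each (language, data_type) niche its top performers as
    parents, so no single niche can crowd out the others."""
    niches = defaultdict(list)
    for ind in population:
        niches[(ind["language"], ind["data_type"])].append(ind)
    parents = []
    for members in niches.values():
        members.sort(key=lambda ind: ind["fitness"], reverse=True)
        # Each niche sends at least one parent, even if it is tiny.
        n = max(1, round(len(members) * select_niche_pct / 100))
        parents.extend(members[:n])
    return parents
```

The remaining parent slots would then be filled by elite and tournament/random selection as described above.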
When to Use
- Default choice for most analyses
- Best for broad exploration of the feature space
- Handles all languages and data types simultaneously
- Most robust to local optima due to population diversity
Beam Search
A deterministic, greedy heuristic that builds models incrementally.
Algorithm Outline
- Start with an empty model (k=0)
- For each candidate feature, tentatively add it to the model
- Score all (current model + 1 feature) combinations
- Keep the top B candidates (beam width)
- Repeat until reaching the target model size k
- Optionally, continue with backward elimination: try removing each feature and keep the best
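The forward pass above can be sketched in a few lines. The scoring function is a stand-in for training-set AUC, and the additive toy score in the usage example is an assumption for illustration:

```python
def beam_search(features, score, k, beam_width):
    """Forward beam search: grow feature subsets one feature at a time,
    keeping only the best `beam_width` partial models at each size.
    `score` maps a frozenset of features to a fitness (higher is better)."""
    beam = [frozenset()]
    for _ in range(k):
        # Expand every model in the beam by one unused feature; the set
        # comprehension deduplicates subsets reached via different paths.
        candidates = {model | {f} for model in beam for f in features if f not in model}
        beam = sorted(candidates, key=score, reverse=True)[:beam_width]
    return beam[0]


utility = {"f1": 0.3, "f2": 0.5, "f3": 0.1}
best = beam_search(utility, lambda m: sum(utility[f] for f in m), k=2, beam_width=2)
```

Note that a purely additive score like this toy example never traps the greedy search; it is precisely interacting features that beam search can miss.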
Variants
- Limited Exhaustive: Evaluate all subsets up to size k (exponential, only practical for small k and feature sets)
- Parallel Forward Selection: Multiple starting points explored in parallel
When to Use
- Fast prototyping: Deterministic and reproducible
- Small feature sets: When the filtered feature space is small enough for near-exhaustive search
- Benchmark: Compare GA results against a greedy baseline
- Reproducibility: Same input always produces the same output (no stochasticity)
Limitations
- Can get trapped in local optima (greedy decisions are irreversible)
- Does not naturally explore multiple languages/data types simultaneously
- Less effective when feature interactions are complex
MCMC (Markov Chain Monte Carlo)
A probabilistic sampler that explores the model space through random walks.
Algorithm Outline
- Start with a random model
- Propose a modification:
  - Add a random feature
  - Remove a random feature
  - Swap a feature for another
  - Change a coefficient
- Accept or reject the proposal based on the Metropolis-Hastings criterion:
  - If the new model is better: always accept
  - If worse: accept with probability exp(-delta_fitness / temperature), where delta_fitness is the fitness decrease
- Record the current model
- Repeat for many iterations
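A minimal version of this walk is sketched below, with a single proposal type (toggle one feature) for brevity; the scoring function, temperature value, and model representation are illustrative assumptions, not the package's API:

```python
import math
import random


def mcmc_chain(features, score, n_iter=1000, temperature=0.05, seed=0):
    """Minimal Metropolis-Hastings walk over feature subsets. The
    proposal toggles one random feature in or out; `score` stands in
    for model fitness (e.g. penalized AUC)."""
    rng = random.Random(seed)
    current = frozenset()
    samples = []
    for _ in range(n_iter):
        f = rng.choice(features)
        proposal = current - {f} if f in current else current | {f}
        delta = score(proposal) - score(current)
        # Always accept improvements; accept degradations (delta < 0)
        # with probability exp(delta / temperature).
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current = proposal
        samples.append(current)
    return samples


# Feature inclusion frequencies over the chain approximate posterior
# inclusion probabilities -- the basis of the posterior-estimation use case.
utility = {"f1": 0.4, "f2": 0.1, "f3": -0.2}
samples = mcmc_chain(list(utility), lambda m: sum(utility[f] for f in m))
freq = {f: sum(f in s for s in samples) / len(samples) for f in utility}
```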
Sequential Backward Selection Variant
A deterministic variant that starts with a large feature set and iteratively removes the least important feature (by MDA permutation), producing models at each sparsity level.
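The backward pass can be sketched as follows. Here the score drop on removal stands in for MDA permutation importance, a simplifying assumption for illustration:

```python
def backward_elimination(features, score):
    """Sequential backward selection: start from the full feature set
    and repeatedly drop the feature whose removal costs the least
    score, recording the model at every sparsity level."""
    current = set(features)
    path = []
    while current:
        path.append((len(current), set(current), score(current)))
        # The least important feature is the one whose removal
        # decreases the score the least.
        least_important = min(current, key=lambda f: score(current) - score(current - {f}))
        current = current - {least_important}
    return path
```

The returned path gives one model per sparsity level, from k = len(features) down to k = 1.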
When to Use
- Posterior estimation: Estimate the probability that each feature belongs to the optimal model
- Uncertainty quantification: The chain samples proportional to model quality
- Complementary exploration: Can find models missed by GA and Beam
Algorithm Comparison
| Property | GA | Beam | MCMC |
|---|---|---|---|
| Stochastic | Yes | No | Yes |
| Multi-language | Yes | Per-run | Per-run |
| Global search | Strong | Weak | Moderate |
| Speed | Moderate | Fast | Slow |
| Reproducibility | Seed-dependent | Deterministic | Seed-dependent |
| Feature interactions | Captured | Limited | Moderate |
| Recommended for | General use | Quick exploration | Posterior estimation |
Cross-Validation
All algorithms support two levels of cross-validation:
Outer Cross-Validation (K-Fold)
The training data is split into K folds (default K=5). The algorithm is run K times in parallel, each time using K-1 folds for training and 1 fold for validation. The final population gathers the Family of Best Models (FBM) from all folds.
Fold 1: Train on folds {2,3,4,5}, validate on fold 1
Fold 2: Train on folds {1,3,4,5}, validate on fold 2
...
Final: Combine FBMs, evaluate on full training data + external test set
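The fold construction above can be sketched as below; for brevity this round-robin split omits the class stratification described later, which the real splitter applies:

```python
def kfold_splits(n_samples, k=5):
    """Round-robin assignment of sample indices to k folds, yielding a
    (train, validation) index pair per fold -- one pair per parallel run."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for held_out in range(k):
        train = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        yield train, folds[held_out]
```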
Inner Cross-Validation (Overfit Control)
Within each training run, the data is further split into inner folds. At each epoch, model fitness is penalized for the train-validation gap:
fitness = mean(train_fit - |train_fit - valid_fit| * overfit_penalty)
Inner folds can be re-drawn periodically (`resampling_inner_folds_epochs`) to prevent indirect learning of fold structure.
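The penalized fitness formula above translates directly to code; the list-of-per-fold-fitness inputs are an assumed representation:

```python
def penalized_fitness(train_fits, valid_fits, overfit_penalty=1.0):
    """Mean over inner folds of train fitness minus the absolute
    train/validation gap scaled by `overfit_penalty`. A model that
    generalizes (small gap) keeps almost all of its train fitness."""
    gaps = (t - abs(t - v) * overfit_penalty for t, v in zip(train_fits, valid_fits))
    return sum(gaps) / len(train_fits)
```

With `overfit_penalty = 1.0`, train fitnesses (0.9, 0.85) and validation fitnesses (0.8, 0.8) yield a penalized fitness of 0.8.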
Stratification
Folds are stratified by class label to preserve class proportions. Since v0.7.4, double stratification is supported: folds are stratified first by class, then by a metadata variable (e.g., hospital, batch, sequencing center) to ensure that confounding factors are balanced across folds.
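One simple way to realize double stratification is to assign samples to folds round-robin within each (class, metadata) stratum; this is an illustrative sketch, not necessarily the package's exact procedure:

```python
from collections import defaultdict
from itertools import cycle


def doubly_stratified_folds(labels, metadata, k=5):
    """Assign each sample to a fold, round-robin within each
    (class label, metadata value) stratum, so that both class
    proportions and the confounder stay balanced across folds."""
    strata = defaultdict(list)
    for i, stratum in enumerate(zip(labels, metadata)):
        strata[stratum].append(i)
    assignment = [None] * len(labels)
    for members in strata.values():
        for i, fold in zip(members, cycle(range(k))):
            assignment[i] = fold
    return assignment
```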
Feature Pre-selection
Before running any algorithm, features are statistically pre-filtered to reduce the search space:
- Prevalence filter (`feature_minimal_prevalence_pct`): Remove features present in fewer than N% of samples in both classes
- Mean filter (`feature_minimal_feature_value`): Remove features with negligible mean abundance in both classes
- Statistical test: Three options:
  - Wilcoxon rank-sum (default): Non-parametric, robust to skewness and outliers
  - Student’s t-test: Parametric, assumes normality
  - Bayesian Fisher’s exact: For binary/presence-absence data, uses Bayes factors
- Multiple testing correction: Benjamini-Hochberg FDR for Wilcoxon and t-test
- Feature ranking: Features ranked by significance, optionally capped at `max_features_per_class` per class
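The prevalence filter, for example, can be sketched as below; the data layout (a dict of per-feature value lists plus a 0/1 label list) is an assumed representation, not the package's own:

```python
def prevalence_filter(X, y, min_prevalence_pct=10.0):
    """Keep a feature only if it is present (non-zero) in at least
    `min_prevalence_pct` percent of samples in at least one class;
    features rare in both classes are removed."""
    kept = []
    for feature, values in X.items():
        for cls in (0, 1):
            vals = [v for v, label in zip(values, y) if label == cls]
            prevalence = 100.0 * sum(v != 0 for v in vals) / len(vals)
            if prevalence >= min_prevalence_pct:
                kept.append(feature)
                break
    return kept
```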
References
- Holland, J. H. (1992). Adaptation in Natural and Artificial Systems. MIT Press.
- Metropolis, N. et al. (1953). Equation of state calculations by fast computing machines. J. Chem. Phys., 21(6), 1087-1092.