Model Languages & Scoring
Predomics discovers classification signatures using compact mathematical languages. Each model is defined by a subset of features, their coefficients, and a decision threshold. This design produces models that are both interpretable and clinically actionable.
Model Languages (BTR + Pow2)
Predomics defines four model languages, each encoding a different type of ecological relationship between features. The original three (Binary, Ternary, Ratio) are collectively called BTR (Prifti et al., 2020).
Binary – Cooperation / Cumulative Effect
Coefficients: C_i in {0, 1}
Score = C_1 * F_1 + C_2 * F_2 + ... + C_k * F_k >= threshold
Selected features contribute their raw values equally. This language captures cumulative presence – the more selected features are present or abundant, the higher the score. Analogous to ecological cooperation, where multiple species jointly contribute to a phenotype.
Use case: When disease is associated with the combined presence of multiple taxa (e.g., oral bacteria accumulating in the gut).
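A minimal sketch of the Binary score (the function name, feature values, and threshold are invented for illustration and are not part of the Predomics API):

```python
def binary_score(features, coeffs):
    """Sum of the raw values of selected features (coefficient 1)."""
    return sum(f for f, c in zip(features, coeffs) if c == 1)

# Four taxa abundances; the model selects taxa 0 and 2.
abundances = [3.0, 1.0, 5.0, 2.0]
coeffs = [1, 0, 1, 0]
score = binary_score(abundances, coeffs)
print(score, score >= 7.0)  # 8.0 True  (classified positive at threshold 7.0)
```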
Ternary – Competition / Predation
Coefficients: C_i in {-1, 0, +1}
Score = Sum(C_i * F_i) >= threshold, i.e. the sum of positively-weighted features minus the sum of negatively-weighted features
Features enriched in one class contribute positively (+1); features enriched in the other contribute negatively (-1). This language captures antagonistic relationships – features that increase vs. decrease with disease.
Use case: The most commonly used language. Captures biomarker signatures where some species increase while others decrease (e.g., butyrate producers depleted, pathobionts enriched in disease).
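A sketch of the Ternary score under the same hypothetical setup as above; the example abundances are made up:

```python
def ternary_score(features, coeffs):
    """Sum of C_i * F_i with C_i in {-1, 0, +1}: positively-weighted
    features minus negatively-weighted ones."""
    return sum(c * f for f, c in zip(features, coeffs))

# A depleted butyrate producer (-1), an unselected taxon (0),
# and an enriched pathobiont (+1).
abundances = [4.0, 1.0, 6.0]
coeffs = [-1, 0, +1]
print(ternary_score(abundances, coeffs))  # 2.0
```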
Ratio – Imbalance
Coefficients: C_i in {-1, 0, +1}
Score = Sum(F_i : C_i = +1) / (Sum(F_j : C_j = -1) + epsilon) >= threshold
The score is the ratio of positively-weighted features over negatively-weighted features. Naturally bounded (0 to infinity) and robust to normalization differences between cohorts.
Use case: When the relative balance between two groups of species matters more than absolute abundances. Particularly useful for cross-cohort generalization.
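A sketch of the Ratio score (hypothetical helper; the epsilon default mirrors the datatype_epsilon constant described below):

```python
EPSILON = 1e-5  # guards against a zero denominator

def ratio_score(features, coeffs, epsilon=EPSILON):
    """Positively-weighted sum divided by the negatively-weighted sum plus epsilon."""
    num = sum(f for f, c in zip(features, coeffs) if c == +1)
    den = sum(f for f, c in zip(features, coeffs) if c == -1)
    return num / (den + epsilon)

# Enriched group sums to 6.0, depleted group to 2.0 -> ratio just under 3.
print(round(ratio_score([6.0, 2.0], [+1, -1]), 3))  # 3.0
# Even when the depleted group is absent, epsilon keeps the score finite:
print(ratio_score([6.0, 0.0], [+1, -1]) < float("inf"))  # True
```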
Pow2 – Weighted Ternary
Coefficients: C_i in {-64, -32, -16, -8, -4, -2, -1, 0, 1, 2, 4, 8, 16, 32, 64}
Score = Sum(C_i * F_i) >= threshold, e.g. 1*F_1 + 2*F_2 + 4*F_3 - (64*F_4 + 4*F_5 + 2*F_6 + 1*F_7) >= threshold
An extension of Ternary that allows features to have different contribution weights, using powers of two for computational efficiency. Introduced in gpredomics to capture features with stronger or weaker effects.
Use case: When some features are known to have much larger effect sizes than others (e.g., a keystone species vs. a minor contributor).
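A sketch of the Pow2 score with a check that coefficients stay in the allowed signed power-of-two set (the validation helper is hypothetical, not part of gpredomics):

```python
# Allowed coefficients: 0 and +/- powers of two up to 64.
ALLOWED = {0} | {s * 2**p for s in (-1, 1) for p in range(7)}

def pow2_score(features, coeffs):
    """Weighted ternary score with signed power-of-two coefficients."""
    assert all(c in ALLOWED for c in coeffs), "coefficients must be in the Pow2 set"
    return sum(c * f for f, c in zip(features, coeffs))

# A keystone species weighted 4x against a minor contributor weighted -1x.
print(pow2_score([1.5, 2.0], [4, -1]))  # 4.0
```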
Data Type Transformations
Each model language can be combined with three data type transformations, applied to feature values before scoring:
| Data Type | Transformation | Equation (Binary example) | When to use |
|---|---|---|---|
| raw | No transformation | Sum(C_i * F_i) >= threshold | Default. Continuous abundance data. |
| prev | Binarize to presence/absence | Sum(C_i * I(F_i > epsilon)) >= threshold | When presence matters more than abundance. Sparse data. |
| log | Natural logarithm | Sum(C_i * (ln F_i - ln epsilon)) >= threshold | Right-skewed distributions (typical in metagenomics). Compresses dynamic range. |
The epsilon constant (default 1e-5, configurable via datatype_epsilon) prevents division by zero and logarithm of zero.
The combination of 4 languages x 3 data types yields up to 12 model niches, each exploring different ecological hypotheses.
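The three transformations can be sketched as follows; clamping zero values at epsilon in the log branch (so that an absent feature contributes 0) is an assumption made for illustration:

```python
import math

EPSILON = 1e-5  # default datatype_epsilon

def transform(value, data_type):
    """Apply the raw / prev / log transformation to a single feature value.
    Clamping at EPSILON in the log branch is an illustrative assumption."""
    if data_type == "raw":
        return value
    if data_type == "prev":  # presence/absence indicator I(F_i > epsilon)
        return 1.0 if value > EPSILON else 0.0
    if data_type == "log":   # ln F_i - ln epsilon, zero for absent features
        return math.log(max(value, EPSILON)) - math.log(EPSILON)
    raise ValueError(f"unknown data type: {data_type}")

print(transform(0.01, "prev"))           # 1.0
print(round(transform(0.01, "log"), 3))  # 6.908
```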
Threshold Optimization
After computing the score for all samples, the decision threshold is optimized to maximize classification performance:
- For AUC: the threshold maximizing Youden's index (Sensitivity + Specificity - 1) is selected. This is the point on the ROC curve farthest from the diagonal.
- For asymmetric targets (e.g., prioritizing sensitivity): the threshold maximizes (target_metric + antagonist_metric * fr_penalty) / (1 + fr_penalty), where fr_penalty controls the trade-off between the target metric and its antagonist.
- For other metrics: the threshold directly maximizing the chosen metric (MCC, F1, G-mean) is used.
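The Youden criterion can be sketched as a scan over candidate cut-offs (a brute-force illustration, not the actual search used by Predomics):

```python
def best_threshold_youden(scores, labels):
    """Pick the cut-off maximizing Youden's index (Sensitivity + Specificity - 1).
    Candidates are the observed scores; a sample is positive when score >= t."""
    pos = sum(labels)
    neg = len(labels) - pos
    best_t, best_j = None, float("-inf")
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        j = tp / pos + tn / neg - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j

print(best_threshold_youden([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # (0.35, 0.5)
```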
Threshold Confidence Interval
For clinical applications requiring uncertainty quantification, gpredomics (v0.7.3+) can compute a bootstrap confidence interval around the decision threshold:
- Stratified bootstrap: Resample the training data B times (default B >= 1000), preserving class proportions
- Threshold distribution: Compute the optimal threshold on each bootstrap replicate
- Percentile CI: Extract lower and upper bounds at quantiles alpha/2 and 1-alpha/2
- Rejection zone: Samples with scores between the CI bounds are classified as “undecided” (rejection class 2)
When subsampling is used (threshold_ci_frac_bootstrap < 1.0), Geyer rescaling corrects for the underestimation of variability inherent in subsampling.
The rejection rate (fraction of undecided samples) is used as a penalty during optimization (threshold_ci_penalty), favoring models that are more decisive.
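The four steps above can be sketched as follows. The threshold rule here is a deliberately simple stand-in (midpoint of the class means), not the optimization gpredomics actually performs, and the function names are hypothetical:

```python
import random
import statistics

def simple_threshold(scores, labels):
    """Stand-in threshold rule for illustration: midpoint of the class means."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    return (statistics.mean(pos) + statistics.mean(neg)) / 2

def threshold_ci(scores, labels, n_boot=1000, alpha=0.05, seed=42):
    """Stratified percentile bootstrap CI around the decision threshold."""
    rng = random.Random(seed)
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    thresholds = []
    for _ in range(n_boot):
        # Resample within each class to preserve class proportions.
        bs = [rng.choice(pos) for _ in pos] + [rng.choice(neg) for _ in neg]
        by = [1] * len(pos) + [0] * len(neg)
        thresholds.append(simple_threshold(bs, by))
    thresholds.sort()
    lo = thresholds[int(n_boot * alpha / 2)]
    hi = thresholds[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

lo, hi = threshold_ci([0.2, 0.3, 0.4, 0.6, 0.7, 0.9], [0, 0, 0, 1, 1, 1])
# Samples whose score falls inside [lo, hi] would go to the rejection class.
```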
Fitness Function
Model fitness combines a performance metric with optional penalties:
fitness = performance_metric - sum(penalties)
Performance Metrics
| Metric | Formula | Best for |
|---|---|---|
| AUC | Area under ROC curve | General purpose, class-imbalance robust |
| Sensitivity | TP / (TP + FN) | Minimizing false negatives |
| Specificity | TN / (TN + FP) | Minimizing false positives |
| G-mean | sqrt(Sensitivity * Specificity) | Balanced performance on imbalanced data |
| MCC | (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)) | Imbalanced datasets |
| F1 | 2 * Precision * Sensitivity / (Precision + Sensitivity) | When precision and recall both matter |
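The table's formulas can be computed directly from confusion-matrix counts (hypothetical helper with made-up counts):

```python
import math

def confusion_metrics(tp, fp, tn, fn):
    """Compute the metrics in the table above from confusion-matrix counts."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    prec = tp / (tp + fp)
    return {
        "sensitivity": sens,
        "specificity": spec,
        "g_mean": math.sqrt(sens * spec),
        "mcc": (tp * tn - fp * fn)
               / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
        "f1": 2 * prec * sens / (prec + sens),
    }

m = confusion_metrics(tp=40, fp=10, tn=45, fn=5)
print(round(m["mcc"], 3))  # 0.704
```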
Fit Penalties
| Penalty | Effect | Formula |
|---|---|---|
| k_penalty | Favors simpler models | fitness -= k * k_penalty |
| bias_penalty | Penalizes imbalanced predictions | fitness -= (1 - bad_metric) * bias_penalty |
| overfit_penalty | Penalizes overfitting in CV | fitness -= mean(train-valid gap) * overfit_penalty |
| threshold_ci_penalty | Penalizes high rejection rates | fitness -= rejection_rate * threshold_ci_penalty |
| feature_penalty | User-defined per-feature cost | fitness -= weighted_mean(feature_penalties) |
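Putting the fitness formula together for a subset of the penalties (an illustrative sketch; parameter names follow the table, but the function itself is hypothetical):

```python
def fitness(performance, k=0, k_penalty=0.0,
            rejection_rate=0.0, threshold_ci_penalty=0.0):
    """performance_metric minus the sparsity and rejection penalties
    (a subset of the penalties listed in the table above)."""
    return performance - k * k_penalty - rejection_rate * threshold_ci_penalty

# A 5-feature model with AUC 0.90 and k_penalty 0.01 loses 0.05:
print(round(fitness(0.90, k=5, k_penalty=0.01), 10))  # 0.85
```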
References
- Prifti, E. et al. (2020). Interpretable and accurate prediction scores for metagenomics data with Predomics. GigaScience, 9(3). doi:10.1093/gigascience/giaa010
- Efron, B. & Tibshirani, R. J. (1993). An Introduction to the Bootstrap. Chapman & Hall/CRC.